Novel discoveries and enhanced genomic prediction from modelling genetic risk of cancer age-at-onset

Ekaterina S. Maksimova; Sven E. Ojavee; Kristi Läll; Marie C. Sadler; Reedik Mägi; Zoltan Kutalik; Matthew R. Robinson

doi:10.7554/eLife.89882.1

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

Reviewing Editor
Jérémie Nsengimana
Newcastle University, Newcastle upon Tyne, United Kingdom
Senior Editor
Eduardo Franco
McGill University, Montreal, Canada

Reviewer #1 (Public Review):

Summary:
In this paper, the authors present genome-wide association analyses of 11 different cancers, including time-to-event analyses. The authors use two recently published Bayesian methods, one of which is constructed to handle time-to-event data. They demonstrate that polygenic risk scores trained on these models give nominally better predictions than standard polygenic risk scores. Further, they show that, by performing 11 GWASs in UKB while adjusting for the polygenic effects estimated by their improved predictor, seven novel loci are implicated by one or both of these methods, three of which the authors find replicate in the Estonian Biobank.

Strengths:
A clear strength is that the authors evaluate the performance of the model in a completely different dataset (Estonian Biobank) from the one it was trained in.

Weaknesses:
The 11 phenotypes that the authors chose have the challenge that they are rare, particularly in healthy biobank participants, which means that (i) the benefit of modeling them in a time-to-event analysis is expected to be smaller and (ii) the models have to be stable under imbalanced case/control fractions. In the GWAS analyses, the authors handle this second problem by using a recently published association test that is robust to imbalanced data. This likely means that they avoid inflated test statistics, but also that they do not leverage the actual time-to-event information to its full potential.

The authors chose not to use the recently published methods BayesRR-RC and BayesW directly. Instead, they run these models and then add an extra step in which they run a logistic regression with an offset term set to the LOCO genomic values as estimated by GMRM-BayesW and GMRM-BayesRR-RC, respectively. They write that this was because of the imbalanced case/control proportion, but do not say how the problem was detected. If the authors have insight about when the standard GMRM-BayesW and GMRM-BayesRR-RC become unreliable, I think it would be helpful to share it in this paper. Further, if the associations implicated by standard GMRM-BayesW and GMRM-BayesRR-RC are not reliable, I think we need some justification that the variance components reported in Figure 1 are still reliable.
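To make the extra step concrete, here is a minimal sketch of a single-variant logistic regression in which a fixed per-individual offset enters the linear predictor with its coefficient held at 1. It is an illustration only, not the authors' implementation: the function name, the toy data, the simulated effect sizes, and the use of a plain Wald statistic are all assumptions made for this example, and the offset values are random stand-ins for LOCO genomic values.

```python
import math
import random

def fit_logistic_with_offset(y, g, offset, n_iter=25):
    """Newton-Raphson fit of logit P(y=1) = offset + a + b*g.

    The offset is not estimated: its coefficient is fixed at 1.
    Returns (a, b, se_b); se_b comes from the information matrix
    computed in the final iteration.
    """
    a = b = 0.0
    for _ in range(n_iter):
        s_a = s_b = h_aa = h_ab = h_bb = 0.0
        for yi, gi, oi in zip(y, g, offset):
            p = 1.0 / (1.0 + math.exp(-(oi + a + b * gi)))
            w = p * (1.0 - p)
            s_a += yi - p            # score for the intercept
            s_b += (yi - p) * gi     # score for the variant effect
            h_aa += w
            h_ab += w * gi
            h_bb += w * gi * gi
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system (information matrix) * delta = score
        a += (h_bb * s_a - h_ab * s_b) / det
        b += (h_aa * s_b - h_ab * s_a) / det
    se_b = math.sqrt(h_aa / det)
    return a, b, se_b

# Toy data (hypothetical): genotype dosages, a stand-in offset, simulated outcomes.
rng = random.Random(0)
n = 2000
g = [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n)]
offset = [rng.gauss(0.0, 0.5) for _ in range(n)]  # stand-in for LOCO genomic values
y = [int(rng.random() < 1.0 / (1.0 + math.exp(-(o - 1.0 + 0.5 * gi))))
     for gi, o in zip(g, offset)]

a_hat, b_hat, se_b = fit_logistic_with_offset(y, g, offset)
wald_z = b_hat / se_b
print(b_hat, se_b, wald_z)
```

Because the offset absorbs the polygenic background, the variant effect is estimated conditional on it. The sketch uses a Wald statistic for simplicity and has no safeguards for extreme case/control imbalance.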

The authors chose to compare the two new GWAS methods, GMRM-BayesW-adjusted and GMRM-BayesRR-RC-adjusted, to REGENIE, so an obvious first question, in my opinion, is whether GMRM-BayesW-adjusted and GMRM-BayesRR-RC-adjusted find more signal than REGENIE.
a. We see that 7 loci were found by GMRM-BayesW but not by REGENIE, but how many were found by REGENIE but not by GMRM-BayesW?
b. Figure S5, as I understand it, shows that the mean -log(p-value) is lower in GMRM-BayesW than REGENIE for variants that have a p-value in GMRM-BayesW that is lower than 5e-8. I don't think this is a valid way to check whether GMRM-BayesW has more power. I have a feeling that there could be a winner's curse-like phenomenon here. I think a more principled comparison could be provided.
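The selection concern in point b can be illustrated with a small simulation. In this sketch two hypothetical methods, A and B, have identical power by construction: each test statistic is the same true signal plus independent standard-normal noise. The uniform range of the true signal, the number of variants, and all names are assumptions chosen purely for illustration and have no connection to the data in the paper.

```python
import math
import random

GWS = 5e-8  # genome-wide significance threshold

def neg_log10_p(z: float) -> float:
    """Two-sided -log10(p) for a standard-normal test statistic."""
    return -math.log10(math.erfc(abs(z) / math.sqrt(2)))

def simulate(n_variants: int = 20000, seed: int = 1):
    """Return (-log10 p for A, -log10 p for B) per variant; equal power by design."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_variants):
        mu = rng.uniform(0.0, 7.0)        # shared true signal (assumed range)
        z_a = mu + rng.gauss(0.0, 1.0)    # method A statistic
        z_b = mu + rng.gauss(0.0, 1.0)    # method B statistic
        stats.append((neg_log10_p(z_a), neg_log10_p(z_b)))
    return stats

def mean_among_selected(stats, select_on: int):
    """Mean -log10 p of both methods among variants significant in one method."""
    threshold = -math.log10(GWS)
    chosen = [s for s in stats if s[select_on] > threshold]
    n = len(chosen)
    return sum(s[0] for s in chosen) / n, sum(s[1] for s in chosen) / n

stats = simulate()
mean_a_selA, mean_b_selA = mean_among_selected(stats, select_on=0)
mean_a_selB, mean_b_selB = mean_among_selected(stats, select_on=1)
print("selected on A:", mean_a_selA, mean_b_selA)
print("selected on B:", mean_a_selB, mean_b_selB)
```

Whichever method is used to select the variants appears stronger among the selected set, even though neither has more power. A comparison on a variant set defined independently of both methods would not be subject to this bias.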

The title of the paper ("Novel discoveries and enhanced genomic prediction from modelling genetic risk of cancer age-at-onset") seems to imply that the age-of-onset-informed model (GMRM-BayesW) does better, but I think the foundation for that statement could be strengthened.
Figure S6 shows that 261 previously reported loci were replicated by GMRM-BayesW-adjusted, whereas 256 were replicated by GMRM-BayesRR-RC. How were previously reported loci defined? Did they include UKB data? And how many were there in total?
In the PRS analyses presented in Figure 3a, GMRM-BayesW does better than GMRM-BayesRR-RC in 8/11 phenotypes, which does not in itself appear significant to me. And with overlapping confidence intervals, the significance of the improvement is hard to see.
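As a rough check on the 8/11 count, one can apply an exact two-sided sign test against the null that either method is equally likely to win for each phenotype. This sketch assumes the 11 phenotypes are independent comparisons and ignores the size of each difference; it is an illustration of the point, not an analysis taken from the paper.

```python
from math import comb

def sign_test_p_value(wins: int, n: int) -> float:
    """Exact two-sided binomial (sign) test against a 50/50 null."""
    k = max(wins, n - wins)
    # Probability of a result at least this extreme in one direction.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

p = sign_test_p_value(8, 11)
print(round(p, 4))  # 0.2266
```

Under these assumptions, 8 wins out of 11 gives a p-value of about 0.23, so the count alone does not distinguish the two methods.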

Table 1 says that rs35763415, rs117972357 and rs7902587 replicated in the Estonian Biobank, but Figure 3b says that rs35763415, rs117972357 and rs1015362 replicated in the Estonian Biobank. What is the difference between these two analyses? The methods say that the findings were checked for replication in FinnGen, but I don't see any results from FinnGen anywhere.

https://doi.org/10.7554/eLife.89882.1.sa1

Reviewer #2 (Public Review):

Summary: Maksimova, Ojavee, and colleagues extend two of their methods, BayesW and BayesRR-RC, to be used as mixed-model association methods by combining them with an approach similar to step 2 of REGENIE. BayesW handles time-to-event data, whereas BayesRR-RC works for case-control phenotypes. They provide UKBB results for 11 cancers, replicate findings, and assess predictions in the Estonian Biobank.

Strengths: Age-of-onset is becoming more and more available, and developing methods that make the best use of this additional information is valuable.

Weaknesses: In this work, there is (for now) limited validation of results and comparison with other existing methods.

https://doi.org/10.7554/eLife.89882.1.sa0
