Genetic architecture and polygenicity of 11 cancers.

(a) Mean proportion of genetic variance explained by each mixture component, using either the case-control or the age-at-onset phenotype. We find evidence that age-at-onset is highly polygenic, with most of the genetic variance attributable to markers in the 10^-4 mixture group, while the majority of the case-control genetic variance is explained by markers in the 10^-3 mixture group. (b) Number of LD-independent regions (see Methods) needed to explain the total genetic variance. Regions were sorted in ascending order of their contribution, so that the smallest contributing regions were added first. (c) Median proportion of genetic variance explained by each combination of mixture class and MAF quartile, with 95% CI. For both case-control and age-at-onset models, most of the genetic variance is attributable to small-effect common variants (MAF quartile 4); however, rare variants from the first MAF quartile contribute significantly to the variance for bladder, endometrial, ovarian and testicular cancers and non-Hodgkin’s lymphoma for the BayesR model. BCC indicates basal cell carcinoma.
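The accumulation described for panel (b) can be illustrated with a minimal sketch. It assumes each LD-independent region has a single non-negative variance contribution; the function name `regions_needed` and the toy values are hypothetical and are not taken from the study.

```python
from itertools import accumulate


def regions_needed(region_variances, target_fraction):
    """Count how many regions are needed to reach target_fraction of the
    total genetic variance when the smallest contributors are added first."""
    total = sum(region_variances)
    ordered = sorted(region_variances)  # ascending: smallest regions first
    for count, running in enumerate(accumulate(ordered), start=1):
        if running >= target_fraction * total:
            return count
    return len(ordered)


# Hypothetical per-region contributions (arbitrary units), not study data.
toy_contributions = [1, 2, 3, 4, 40]
print(regions_needed(toy_contributions, 0.5))
```

Under this ordering, a trait dominated by a few large regions needs almost all regions to reach a given fraction, whereas a highly polygenic trait reaches it more gradually.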

Predictive validation of different polygenic risk scores (PRS) in the Estonian Biobank data.

(a) Odds ratio for diagnosis of a tumour given a one standard deviation increase in PRS, with 95% confidence intervals. (b) Percentage of individuals diagnosed with cancer before age 50 whose PRS is in the top 10% or top 5%. (c) Cumulative incidence curves, adjusted for competing risk, for individuals with a PRS in the top 5%. The number of Estonian Biobank individuals used in the validation was N = 195,432. BayesRR-RC and BayesW estimates were obtained by running the corresponding models on UK Biobank using either case-control or age-at-onset data. For LDpred-funct [25], summary statistics were calculated with the fastGWA method using the same UK Biobank individuals and variants as for BayesRR-RC or BayesW, and were then used as input to LDpred-funct (see Methods). BayesRR-RC and BayesW tend to give more accurate predictions than the summary statistic approach. For all cancers except breast, cervical, endometrial and ovarian cancer, the BayesW predictor gives a nominally higher odds ratio than the BayesRR-RC predictor.

Properties of discoveries and changes in p-values.

(a) Pearson correlation between key gene scores. Correlations were calculated using all key genes (including non-significant ones). (b) Properties of novel replicated genetic regions. Repl - the discovery was replicated in the Estonian Biobank, CADD - maximum CADD score of the region is greater than or equal to 12.37, DIS - maximum DeepSEA disease impact score (DIS) of the genetic region is greater than or equal to 2, MLE - maximum DeepSEA mean log e-value (MLE) of the region is greater than or equal to 2, eQTL - a SNP from the genetic region is an eQTL with p-value < 5 × 10^-8, OChS - open/active chromatin state (minimum 15-core chromatin score of the lead SNP is less than or equal to 7), RDB - minimum RegulomeDB category of the genetic region is 1 or 2, ENH - SNP is in an enhancer region. (c) Differences between p-values from standard REGENIE analysis and BayesW- or BayesRR-RC-adjusted analyses. With the exception of two classes in non-Hodgkin’s lymphoma and ovarian cancer (p < 5 × 10^-8), the Bayesian adjustments yield similar or slightly improved results compared to standard REGENIE, with notable improvements seen in bladder cancer, cervical cancer (5 × 10^-8 < p < 5 × 10^-4), melanoma, non-Hodgkin’s lymphoma (5 × 10^-8 < p < 5 × 10^-6) and testicular cancer.

Functional description of the novel and replicated discoveries from case-control (GMRM-BayesRR-RC) and age-at-onset (GMRM-BayesW) marginal analyses.

We performed the marginal analysis by adjusting the mixed-linear association model REGENIE with GMRM-BayesW or GMRM-BayesRR-RC genetic LOCO predictors and identified 7 novel genetic loci, 3 of which were replicated in the Estonian Biobank (see Methods for the pipeline for filtering and replication of the novel loci). Importantly, two out of the three replicated loci were only discovered using the GMRM-BayesW adjusted model and one was discovered by using both GMRM-BayesW and GMRM-BayesRR-RC LOCO predictors. We calculated various parameters related to the potential functionality of novel genetic regions for each significant independent novel SNP and minimum/maximum/common values within the genetic regions (index SNPs and those in LD r^2 > 0.6). The majority of SNPs from the 7 novel genomic regions could be linked to regulatory variation. Here, NHL - non-Hodgkin’s lymphoma, Heterochrom. - heterochromatin.

SNP-heritability estimates.

Estimates with 95% CI from LD Score regression, using mixed linear association model estimates from REGENIE’s step 2 adjusted with age-at-onset LOCO predictor (GMRM-BayesW) or with case-control LOCO predictor (GMRM-BayesRR-RC), as compared with previous array or family based estimates. a - estimate from Rashkin et al. [7]; b - estimate from Mucci et al. [59]; c - estimate from Kilgour et al. [60]; d - estimates from Czene et al. [61].

UK Biobank data composition for the cancer cases and their timings used within the study.

Cancer-specific ICD10 and ICD9 codes used to select cases from the UK and Estonian biobank studies.

For each of the tumour types, the corresponding ICD10 and ICD9 codes are presented that were used to define cancer occurrence.

Alternative liability scale heritability estimates with 95% CI.

We use the observed scale from LDSC estimates (REGENIE’s summary statistics from GMRM-BayesW-adjusted analysis) and the heritability estimates from the full Bayesian model (GMRM-BayesRR-RC). Transformation of the observed scale heritabilities is done with a more conservative approach (Ojavee et al. [24]) better suited for rare diseases.

Statistically significant cross-trait genetic correlations from LD score regression analysis.

We calculated the genetic correlations between cancers with cross-trait LD score regression [49], applying it to the results from REGENIE’s GMRM-BayesW or GMRM-BayesRR-RC adjusted analyses and to GWAS results for multiple phenotypes released by the Neale group [50] and the Global Biobank Meta-analysis Initiative consortium [51]. The significant genetic correlations based on GMRM-BayesRR-RC and on GMRM-BayesW agree on the magnitude of the estimates.

Previously unreported discoveries from GMRM-BayesRR-RC or GMRM-BayesW analyses in comparison with results from an unadjusted marginal association analysis.

We observe that for the 7 previously unreported variants, the p-value in the unadjusted association analysis with REGENIE is borderline significant (5 × 10^-8 < p < 10^-6). However, by using the GMRM-BayesW or GMRM-BayesRR-RC adjustments in step 1 of REGENIE, we arrive at statistically significant test statistics.

Cancer risk from birth to age 85, SEER estimate 2016-2018

To ensure that the lifetime risk estimates were similar to those of the study population (European ancestry, UK Biobank, oldest individual aged 86), we used the SEER estimates of the risk of being diagnosed between ages 0 and 85 for non-Hispanic white individuals. The SEER explorer is accessible from https://seer.cancer.gov/explorer/. The explorer provided a joint estimate for colorectal cancer, which we transformed to the risk of colon cancer using the proportion of colon cancer cases among colorectal cases (70.3%, https://www.cancer.org/cancer/colon-rectal-cancer/about/key-statistics.html, accessed 24.01.2022). For basal cell carcinoma, we used a lifetime risk estimate from Miller et al. [62].
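The colorectal-to-colon conversion is a single multiplication, sketched below. Only the 70.3% share comes from the text above; the function name `colon_lifetime_risk` and the 4% input are hypothetical illustration values, not SEER estimates.

```python
COLON_SHARE = 0.703  # proportion of colon cases among colorectal cases (from the text)


def colon_lifetime_risk(colorectal_risk, colon_share=COLON_SHARE):
    """Scale a joint colorectal lifetime risk down to a colon-only risk."""
    if not 0.0 <= colorectal_risk <= 1.0:
        raise ValueError("risk must be a proportion between 0 and 1")
    return colorectal_risk * colon_share


# Hypothetical joint colorectal risk of 4% (illustration only).
print(round(colon_lifetime_risk(0.04), 5))
```

This assumes that the share of colon cases among colorectal cases is constant across ages and applies equally to the SEER population.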

Results from case-control association analysis of 11 tumours, adjusted for BayesW predictors in other chromosomes.

The significance of each SNP was obtained using a logistic regression score test from step 2 of REGENIE on the binary (case-control) phenotype, adjusted for covariates and the BayesW genetic LOCO predictor. The number of markers analysed was M = 8,430,446; the number of individuals and cases for each specific cancer are shown in the Supplementary information. We present the -log10(p-value); the dotted line indicates a significance threshold of p = 5 × 10^-8.
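As a small illustration of the plotted scale, the sketch below converts p-values to -log10(p) and compares them with the genome-wide threshold stated in the caption. The helper names are hypothetical and the example p-value is not a study result.

```python
import math

GENOME_WIDE_P = 5e-8  # significance threshold stated in the caption


def neg_log10(p):
    """Return -log10(p), the quantity shown on the vertical axis."""
    return -math.log10(p)


def is_genome_wide_significant(p, threshold=GENOME_WIDE_P):
    """A SNP lies above the dotted line when its p-value is below the threshold."""
    return p < threshold


# The dotted line sits at -log10(5e-8), roughly 7.3 on the plotted scale.
print(round(neg_log10(GENOME_WIDE_P), 3))
print(is_genome_wide_significant(1e-9))  # hypothetical p-value
```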

Results from case-control association analysis of 11 tumours, adjusted for BayesRR-RC predictors in other chromosomes.

The significance of each SNP was obtained using a logistic regression score test from step 2 of REGENIE on the binary (case-control) phenotype, adjusted for covariates and the BayesRR-RC genetic LOCO predictor. The number of markers analysed was M = 8,430,446; the number of individuals and cases for each specific cancer are shown in the Supplementary information. We present the -log10(p-value); the dotted line indicates a significance threshold of p = 5 × 10^-8.

Predictive validation of different PRS on Estonian Biobank data using Harrell’s C-statistic, hazard ratio or odds ratio with 95% CI.

The statistics quantify the impact on the likelihood of having cancer of a one standard deviation increase in the PRS (Scaled), of belonging to the top 5% quantile of the PRS, or of belonging to the top 10% quantile of the PRS. Harrell’s C-statistic was calculated from a Cox proportional hazards model without covariates, the odds ratio was calculated from a logistic model using sex and age-at-entry as covariates, and the hazard ratio was calculated from a Cox proportional hazards model using sex and age-at-entry as covariates.
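For readers unfamiliar with the concordance measure, the following is a minimal pure-Python sketch of Harrell's C for right-censored data. It is not the study's implementation: it assumes the risk score is the PRS itself (for a single-predictor Cox model with a positive coefficient the ranking is the same), ignores pairs with tied times, and counts tied scores as half-concordant. The function name and toy data are hypothetical.

```python
def harrell_c(times, events, scores):
    """Concordance index: among comparable pairs, the fraction in which the
    individual with the earlier observed event has the higher risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:  # i was diagnosed before j's last follow-up
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5  # tied scores count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): follow-up times, event indicators, risk scores.
print(harrell_c([1, 2, 3, 4], [1, 1, 1, 0], [4, 3, 2, 1]))
```

A value of 0.5 corresponds to a score with no discriminative ability, and 1.0 to a score that ranks every comparable pair correctly.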

Prediction in Estonian Biobank using either medical record or self-reported phenotypic data in BayesW or BayesRR-RC models.

The polygenic risk scores based on medical record data tend to be more predictive across all cancers than those based on self-reported data. The odds ratios were calculated by finding the impact of a one standard deviation increase in PRS in a logistic model using sex and age-at-entry as covariates.

Mean -log10 p-value from the marginal association analysis adjusted with either BayesRR-RC, BayesW or without adjustment.

The significance of each SNP was obtained using a logistic regression score test from step 2 of REGENIE on binary (case-control) phenotypes. We observe that BayesW or BayesRR-RC LOCO adjustments result in similar or smaller p-values, suggesting increased statistical power.

Classification of previously reported discoveries by each cancer type.

The case-control and time-to-event approaches overlap substantially in recovering previous findings (255/261). However, the time-to-event adjustment enables replicating 6 additional loci: 1 for basal cell carcinoma, 2 for breast cancer and 3 for prostate cancer. All discoveries from the case-control approach were also made by the time-to-event approach.

Tissue-specific enrichment for GTEx v8 tissues.

For basal cell carcinoma, breast and prostate cancer, Downstreamer analysis highlighted significant enrichment in several tissues, including tissue-specific associations: e.g. in both sun-exposed and not sun-exposed skin for basal cell carcinoma, in mammary tissue for breast cancer, and prostate for prostate cancer. Enrichment Z-scores that were both Bonferroni and 5% FDR significant are marked with asterisks. Only tissues with significant enrichment Z-scores are shown.