Risk factors affecting polygenic score performance across diverse cohorts

Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
Division of Nephrology, Department of Medicine, Columbia University, New York, NY
Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
Department of Pediatrics, University of Alabama at Birmingham, Birmingham, AL
Departments of Pediatrics and Medicine, Columbia University Irving Medical Center, Columbia University, New York, NY
Department of Neurology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL
The Center for Autoimmune Genomics and Etiology, Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH
Division of Clinical Pharmacology, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
Center for Genetic Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL
Department of Biomedical Informatics, Vagelos College of Physicians & Surgeons, Columbia University, New York, NY
Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA
Regeneron Pharmaceuticals Inc., Tarrytown, NY
Departments of Medicine (Medical Genetics) and Genome Sciences, University of Washington Medical Center, Seattle, WA

https://doi.org/10.7554/eLife.88149.2

Open access

Figures and data

A flowchart of the project.

PGS R² stratified by quintiles for quantitative variables and by binary variables. a) Continuous covariates with significant (p < 8.1×10^-4) R² differences across quintiles in UKBB EUR. Pork and processed meat consumption per week were excluded from this plot in favor of pork and processed meat intake. b) Covariates with significant differences that were available in multiple cohorts. When traits had the same or directly comparable units between cohorts, the actual trait values are plotted on the x-axis; percentiles are shown for physical activity, alcohol intake frequency, and socioeconomic status, which had slightly differing phenotype definitions across cohorts. Townsend index and income were used as variables for socioeconomic status in UKBB and GERA, respectively. Note that the sign of the Townsend index was reversed, since a higher Townsend index indicates lower socioeconomic status, whereas higher income indicates higher socioeconomic status. Abbreviations: physical activity (PA), International Physical Activity Questionnaire (IPAQ).
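The stratification described in this caption can be illustrated with a minimal sketch. It assumes that the per-stratum R² is the squared Pearson correlation between PGS and phenotype, which may differ from the exact model used in the paper. The simulated data, the function names (`pearson_r2`, `r2_by_quantile`), and the effect sizes are all hypothetical.

```python
import random

def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)

def r2_by_quantile(pgs, pheno, covariate, n_bins=5):
    """Sort individuals by covariate, split into n_bins groups, return R² per group."""
    order = sorted(range(len(covariate)), key=lambda i: covariate[i])
    size = len(order) // n_bins
    out = []
    for b in range(n_bins):
        # The last bin absorbs any remainder so that every individual is used.
        idx = order[b * size:(b + 1) * size] if b < n_bins - 1 else order[b * size:]
        out.append(pearson_r2([pgs[i] for i in idx], [pheno[i] for i in idx]))
    return out

# Hypothetical simulated data: the PGS effect grows with the covariate.
rng = random.Random(1)
n = 5000
pgs = [rng.gauss(0, 1) for _ in range(n)]
cov = [rng.gauss(0, 1) for _ in range(n)]
pheno = [(0.5 + 0.2 * c) * g + rng.gauss(0, 1) for g, c in zip(pgs, cov)]

quintile_r2 = r2_by_quantile(pgs, pheno, cov)
for q, r2 in enumerate(quintile_r2, start=1):
    print(f"quintile {q}: R^2 = {r2:.3f}")
```

In this toy setup the R² rises from the bottom to the top covariate quintile because the simulated PGS effect was made to depend on the covariate.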

Model descriptive statistics for the 28 of 62 covariates that have significant (p < 0.05/62) PGS-covariate interaction terms in UKBB EUR.
The third column is the percentage change in PGS effect per unit change in the covariate (units are standard deviations for continuous variables; binary variables are encoded as 0 or 1). The fifth column is the increase in model R² with a PGS-covariate interaction term versus a main-effects-only model. Abbreviations: blood pressure (BP), physical activity (PA), forced vital capacity (FVC), forced expiratory volume in 1 second (FEV1), International Physical Activity Questionnaire (IPAQ).
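One plausible reading of the two derived columns can be sketched as follows. The sketch assumes that the percentage change in PGS effect is 100 × (interaction coefficient) / (PGS main-effect coefficient), and that the R² increase is the difference between nested ordinary least squares fits. Both are assumptions about the computation, not a confirmed description of it. The data, the coefficients, and the helper names (`ols`, `r_squared`) are hypothetical.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations, via Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r_squared(X, y, coef):
    """Coefficient of determination for fitted coefficients."""
    my = sum(y) / len(y)
    sse = sum((yi - sum(b * v for b, v in zip(coef, row))) ** 2
              for row, yi in zip(X, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

# Hypothetical simulated data with a true interaction of 20% of the PGS effect.
rng = random.Random(0)
n = 10000
pgs = [rng.gauss(0, 1) for _ in range(n)]
cov = [rng.gauss(0, 1) for _ in range(n)]
y = [0.3 * g + 0.2 * c + 0.06 * g * c + rng.gauss(0, 1) for g, c in zip(pgs, cov)]

X_main = [[1.0, g, c] for g, c in zip(pgs, cov)]        # main effects only
X_int = [[1.0, g, c, g * c] for g, c in zip(pgs, cov)]  # plus PGS x covariate
b_main = ols(X_main, y)
b_int = ols(X_int, y)

pct_change = 100 * b_int[3] / b_int[1]  # % change in PGS effect per covariate SD
r2_gain = r_squared(X_int, y, b_int) - r_squared(X_main, y, b_main)
print(f"percentage change in PGS effect: {pct_change:.1f}%")
print(f"R^2 increase from interaction term: {r2_gain:.5f}")
```

Because the two models are nested, the R² increase cannot be negative in sample. An out-of-sample or cross-validated comparison would be needed to guard against overfitting.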

Relative percentage changes in PGS effect per unit change in covariate, for covariates that significantly changed the PGS effect (i.e., significant interaction beta at Bonferroni p < 8.1×10^-4, denoted by asterisks) and were present in multiple cohorts and ancestries. The same covariate groupings and transformations were applied as in Figure 1. Similarly, actual values were used when variables had comparable units across cohorts, and standard deviations (SD) were used otherwise.

Relationships (Pearson correlations weighted by sample size) between maximum R² differences across strata, main effects of covariate on log(BMI), and PGS-covariate interaction effects on log(BMI). Main effect units are in standard deviations, interaction effect units are in PGS standard deviations multiplied by covariate standard deviations. Only continuous variables are plotted and modeled. GERA was excluded due to slightly different phenotype definitions.
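A sample-size-weighted Pearson correlation of the kind named in this caption can be computed as in the sketch below. It assumes that each point (for example, one covariate in one cohort and ancestry) is weighted by its sample size in the means, variances, and covariance. The input values and the function name `weighted_pearson` are hypothetical.

```python
import math

def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with non-negative observation weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sw
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / sw
    return cov / math.sqrt(vx * vy)

# Hypothetical values: interaction effects, maximum R² differences, sample sizes.
interaction_effect = [0.010, 0.025, 0.004, 0.031, 0.018]
max_r2_difference = [0.012, 0.030, 0.009, 0.041, 0.015]
sample_size = [380000, 9000, 7500, 5200, 55000]

r_weighted = weighted_pearson(interaction_effect, max_r2_difference, sample_size)
print(f"weighted Pearson r = {r_weighted:.3f}")
```

With equal weights the function reduces to the ordinary Pearson correlation, so the weighting only changes the result when sample sizes differ between points.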

Quantile regression effects of PGS_BMI (in units of log(BMI)) on log(BMI) at each decile of BMI in each cohort and ancestry. The effect of PGS_BMI increases as BMI itself increases, suggesting that no individual covariate-PGS interaction is responsible for the nonlinear effect of PGS_BMI.

Three sets of simulated data with varying regression line slopes, showing how model R² changes when the regression line slope changes, all else being equal. Residuals were sampled from a normal distribution (mean = 0, sigma = sqrt(π/2)), which gives a mean absolute error of 1 (the mean squared error is π/2). For each line, 5,000 x-values were sampled uniformly from 0 to 10. Despite having the same mean squared error, model R² increases as beta increases.
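The simulation described here is straightforward to reproduce in outline. The sketch below follows the stated design (5,000 x-values uniform on 0 to 10, normal residuals with sigma = sqrt(π/2)), but the three slope values, the seed, and the function name `simulate_r2` are assumptions, since the caption does not give them. A normal residual with this sigma has variance π/2 and expected absolute value 1, so the sketch reports both error measures.

```python
import math
import random

def simulate_r2(beta, n=5000, sigma=math.sqrt(math.pi / 2), seed=0):
    """Simulate y = beta * x + e, fit a line, return (R², residual MSE, residual MAE)."""
    rng = random.Random(seed)  # same seed: identical x and noise for every slope
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [beta * xi + rng.gauss(0, sigma) for xi in x]
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(r * r for r in resid)
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst, sse / n, sum(abs(r) for r in resid) / n

# Hypothetical slopes; the caption does not state the values used.
results = {beta: simulate_r2(beta) for beta in (0.25, 0.5, 1.0)}
for beta, (r2, mse, mae) in results.items():
    print(f"beta={beta}: R^2={r2:.3f}, MSE={mse:.3f}, MAE={mae:.3f}")
```

Because the same noise draws are reused for every slope, the residual error is essentially identical across the three fits, while R² rises with the slope: the variance explained by x grows as beta² × Var(x) and the residual variance stays fixed.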

Univariable association of PGS_BMI and log(BMI) in European UKBB, separately for the bottom and top quintiles of age. R² is higher in younger individuals, which is partially a consequence of the larger effect (as shown in S Figure 2), despite the mean squared error actually being higher.

Model R² from different machine learning models across cohorts and ancestries using age and gender as covariates (along with PGS_BMI and PCs 1-5). Across all cohorts and ancestries, LASSO with PGS-age and PGS-gender interaction terms had better average 10-fold cross-validation R² than LASSO without interaction terms, while neural networks outperformed LASSO models.

PGS R² based on three sets of GWAS setups. “Main effects” were from a typical main effect GWAS, “GxAge” effects were from a GWAS with a SNP-age interaction term, and “Age stratified” GWAS had main effects only but were conducted in four age quartiles. PGS R² was evaluated using two models: one with main effects only, and one with an additional PGS*Age interaction term.

PGS-covariate interaction term -log₁₀(p-values) in UKBB EUR, with and without the covariate PGS included in the model; the mean -log₁₀(p) is reduced from 18.09 to 14.97 when the covariate PGS are included. Note that PGS for age and sex were not calculated, so their interaction p-values are excluded from this figure.
