Constructing an atlas of associations between polygenic scores from across the human phenome and circulating metabolic biomarkers
Abstract
Background:
Polygenic scores (PGS) are becoming an increasingly popular approach to predict complex disease risk, although they also hold the potential to develop insight into the molecular profiles of patients with an elevated genetic predisposition to disease.
Methods:
We sought to construct an atlas of associations between 125 different PGS derived using results from genome-wide association studies and 249 circulating metabolites in up to 83,004 participants from the UK Biobank.
Results:
As an exemplar to demonstrate the value of this atlas, we conducted a hypothesis-free evaluation of all associations with glycoprotein acetyls (GlycA), an inflammatory biomarker. Using bidirectional Mendelian randomization, we find that the associations highlighted likely reflect the effect of risk factors, such as adiposity or liability towards smoking, on systemic inflammation as opposed to the converse direction. Moreover, we repeated all analyses in our atlas within age strata to investigate potential sources of collider bias, such as medication usage. This was exemplified by comparing associations between lipoprotein lipid profiles and the coronary artery disease PGS in the youngest and oldest age strata, which had differing proportions of individuals undergoing statin therapy. Lastly, we generated all PGS–metabolite associations stratified by sex and separately after excluding 13 established lipid-associated loci to further evaluate the robustness of findings.
Conclusions:
We envisage that the atlas of results constructed in our study will motivate future hypothesis generation and help prioritize and deprioritize circulating metabolic traits for in-depth investigations. All results can be visualized and downloaded at http://mrcieu.mrsoftware.org/metabolites_PRS_atlas.
Funding:
This work is supported by funding from the Wellcome Trust, the British Heart Foundation, and the Medical Research Council Integrative Epidemiology Unit.
Editor's evaluation
The authors describe their work on an atlas of associations between polygenic scores for 125 different traits representing a variety of quantitative phenotypes and diseases, and a large set of metabolites measured in up to 83,000 participants in the UK Biobank. These associations are all available via a public browser, and may be used to identify candidate intermediate phenotypes, as well as potential biomarkers of disease.
https://doi.org/10.7554/eLife.73951.sa0Introduction
Complex traits and disease have a polygenic architecture meaning that they are influenced by many genetic variants scattered throughout the human genome (Boyle et al., 2017). An increasingly popular approach to predict disease risk in a population is to derive weighted scores by summing the number of risk increasing variants that participants harbour. These are typically referred to as ‘polygenic scores’ (PGS) (Torkamani et al., 2018; Lewis and Vassos, 2020). In the last decade, PGS have emerged as powerful tools for predicting lifelong risk of disease, which is predominantly due to the dramatic increase in sample sizes of genome-wide association studies (GWAS) and their continued success in uncovering trait-associated genetic variants across the genome (Visscher et al., 2017). Additionally, PGS have utility in a causal inference setting to establish causal effects between risk factors and disease outcomes, as well as to help elucidate putative diagnostic and prognostic biomarkers for disease incidence (Richardson et al., 2019b, Holmes and Davey Smith, 2019; Ritchie et al., 2021a).
The human metabolome consists of over 100,000 small molecules and is a rich source of potential risk factors and biomarkers, as well as therapeutic targets (Holmes et al., 2021), for complex traits and disease (Gallois et al., 2019). Many circulating metabolic traits studied to date have a large heritable component as demonstrated by GWAS endeavours (Suhre et al., 2011; Shin et al., 2014; MacTel Consortium et al., 2021), suggesting that they have a polygenic architecture. In-depth molecular profiling has recently been undertaken in the UK Biobank (UKB) study using nuclear magnetic resonance (NMR) to capture measures of 249 circulating metabolites in approximately 120,000 participants who also have genotype data (Julkunen et al., 2021; Sudlow et al., 2015). The 249 metabolite measurements include the particle concentration, size, and composition of 14 lipoprotein subclasses, as well as the levels of phospholipids, fatty acids, amino acids, ketone bodies, and other biomarkers as discussed in a recent review (Ala-Korpela et al., 2022). This resource therefore provides an unprecedented opportunity to characterize metabolic profiles for disease risk by leveraging genome-wide variation captured by PGS. There are multiple advantages to this approach over conventional observational associations between metabolites and complex traits or endpoints. For example, as UKB is a prospective cohort study many diseases have low prevalence, such as Alzheimer’s disease which typically has a late onset. In contrast, evaluations using PGS will likely yield higher statistical power given that a continuous genetic score will be analysed for all participants in UKB based on their liability to disease.
In this study, we sought to construct an atlas of associations between 125 PGS and the 249 circulating metabolic traits in the UKB study (Figure 1). We demonstrate the usefulness of this atlas in terms of highlighting putative risk factors and biomarkers for disease risk and advocate the use of an approach known as Mendelian randomization (MR) to investigate whether a causal relationship may underlie findings (Supplementary Note 2) (Davey Smith and Ebrahim, 2003, Davey Smith and Hemani, 2014; Sanderson et al., 2022). As an exemplar, we apply MR systematically to investigate all PGS associations highlighted from a hypothesis-free scan of the inflammatory marker glycoprotein acetyls (GlycA). Furthermore, all PGS analyses were initially conducted in the full UKB sample, as well as in sex-stratified samples and age tertiles as proposed previously to evaluate the influence of medication use on findings (Bell et al., 2022). As the age of individuals in UKB is unlikely to induce sources of biases into analyses (e.g. collider bias), age-dependent stratification allows comparisons between the youngest and oldest tertiles in UKB where the level of medication use is likely to vary between groups. Stratification on medication use, by contrast, would introduce collider bias. Together our findings provide valuable insights into the effects of PGS on metabolic markers which may influence hypothesis generation and facilitate similar analyses to those presented in this paper.
Results
Constructing an atlas of polygenic score associations across the human metabolome
We obtained genome-wide summary statistics for 125 different complex traits and diseases from large-scale GWAS and constructed PGS for each of these in the UKB study. The majority of these summary statistics were obtained from the OpenGWAS platform and encompassed traits and disease outcomes from across the human phenome (Elsworth et al., 2020). GWAS were identified based on those conducted in populations of European descent given that our analysis in UKB was based on the European subset with NMR data. Furthermore, we identified studies which did not include the UKB in their study to avoid overlapping samples between PGS construction and analysis with metabolic traits. Full details for all GWAS can be found in Supplementary file 1a. Two versions of each PGS were built using different thresholds for variant-trait associations (P) and linkage disequilibrium (LD; r2). These were (1) ‘lenient’ thresholds of p < 0.05 and r2 < 0.1 and (2) a ‘stringent’ threshold of p < 5 × 10−8 and r2 < 0.001. PGS were generated using the software PLINK (Chang et al., 2015) with LD being calculated using a reference panel of 10,000 randomly selected unrelated UKB individuals of European descent (Kibinge et al., 2020). The specific weights for clumped variants used in all PGS can be found at https://tinyurl.com/PRSweights.
We investigated the association between each PGS in turn with 249 circulating metabolites measured using targeted high-throughput NMR metabolomics from Nightingale Health Ltd (biomarker quantification version 2020) (Supplementary file 1b; Julkunen et al., 2021). Our final sample size of n = 83,004 was determined based on individuals with both genotype and circulating metabolites data after removing participants with withdrawn consent, evidence of genetic relatedness or who were not of ‘white European ancestry’ based on a K-means clustering (K = 4). All PGS were standardized to have a mean of 0 and standard deviation of 1 and similarly all metabolites were subject to inverse rank normalization transformations prior to analysis allowing cross-PGS/metabolite comparisons to be made. Analyses were conducted using linear regression adjusting for age, sex, and the top 10 principal components (PCs).
To disseminate all findings from this large-scale analysis we have developed a web application (http://mrcieu.mrsoftware.org/metabolites_PRS_atlas/) to query and visualize metabolic signatures for a given PGS. In this paper, we have discussed findings using PGS that were derived using the more lenient criteria (i.e. p < 0.05 and r2 < 0.1), although all findings based on both thresholds can be found in the web atlas. PC analysis suggested that the first 19 PCs captured 95% of the variance in the NMR metabolites data (whereas the first ten PCs captured 90% and the first 41 PCs captured 99% of the variance) (Supplementary file 1c). We therefore have applied a heuristic of p < 0.05/19 in this manuscript to account for multiple testing of the associations between any single PGS and the NMR metabolic traits for downstream analyses, although users are able to download the full results to apply whatever correction they see fit. For all other analyses (e.g. associations between metabolic traits and all PGS), we apply a false discovery rate (FDR) of less than 0.05 calculated from the Benjamini–Hochberg procedure to correct for multiple testing. Based on the FDR threshold of 0.05, there were a total of 5445 associations between PGS derived (derived based on ‘lenient criteria’ with p < 0.05 variants) in the full sample and NMR-assessed circulating metabolic traits. Heatmaps depicting the Z scores of all PGS–metabolic trait associations can be found in Figure 2—figure supplements 1 and 2. The PGS with the largest number of associations robust to multiple testing corrections was body mass index (BMI) (n = 217) (Supplementary file 1d). Our atlas also includes sex-stratified estimates for PGS weighted by GWAS undertaken in female only (such as breast cancer and age at menarche) and male only (e.g. prostate cancer) populations, as well as sex-stratified estimates in both females and males separately for all other PGS–metabolite associations. We encourage users interested in these sex-stratified estimates to interpret them with caution however, given the widespread sex-differential participation bias in UKB (Pirastu et al., 2021).
In this paper, we provide several examples of how results from this atlas can be used to generate hypotheses and pave the way for in-depth and careful evaluations of associations between PGS and circulating traits. Specifically, we believe our findings can facilitate a ‘reverse gear Mendelian randomization’ approach to disentangle whether associations likely reflect metabolic traits acting as a cause or consequence of disease risk (Holmes and Davey Smith, 2019) as illustrated using triglyceride-rich very-low-density lipoprotein (VLDL) particles in the next section. Furthermore, in-depth evaluations allow careful consideration of appropriate instrumental variables for circulating metabolites which can be a challenging task as highlighted in our exemplar analysis of GlycA. Finally, we provide examples of how the plethora of sensitivity analyses within our atlas can help users further investigate the robustness of findings.
Orienting the direction of effect between putative causal relationships using Mendelian randomization
Many top associations across PGS were consistent with the known underlying biology of their corresponding diseases, as well as various proof of concepts that associations between PGS and metabolic traits may reflect both causes of disease and consequences of genetic liability towards disease. For example, we applied MR to further evaluate associations highlighted in our atlas with VLDL particles, where both VLDL particle average diameter size and concentration were associated with the PGS for BMI (Beta = 0.04, 95% CI = 0.033 to 0.046, p = 3.53 × 10−35 and Beta = 0.012, 95% CI = 0.006 to 0.019, p = 2.7 × 10–4, respectively) and also coronary heart disease (CHD) liability (Beta = 0.026, 95% CI = 0.019 to 0.032, p = 2.12 × 10–15 and Beta = 0.035, 95% CI = 0.028 to 0.042, p = 2.73 × 10–24, respectively). Conducting bidirectional MR suggested that the associations with average diameter of VLDL particles are likely attributed to a consequence of BMI and CHD liability as opposed to the size of VLDL particles having a causal influence on these outcomes (Supplementary file 1e). In contrast, MR analyses suggested that the concentration of VLDL particles increases risk of CHD (Beta = 1.28 per 1-SD change in VLDL particle concentration, 95% CI = 1.25 to 1.65, p = 2.8 × 10–7) which may explain associations between the CHD PGS and this metabolic trait within our atlas. Similar MR analyses to investigate findings from our atlas can be conducted using the full GWAS summary statistics for all 249 circulating metabolic traits available via the GWAS catalog (https://www.ebi.ac.uk/gwas/) under accession IDs GCST90092803 to GCST90093051 (Richardson et al., 2022).
Along with comparing metabolic signatures for a given PGS, our atlas facilitates hypothesis-free evaluations to inspect all PGS associations for a given metabolic trait. As an example of this, we have undertaken such an analysis based on the associations between all 125 PGS in our atlas with circulating GlycA. GlycA is a biomarker of chronic inflammation and has been found to predict various endpoints, including types of cardiovascular disease, cancer, and all-cause mortality (Lawler et al., 2016; Connelly et al., 2017). Although previous studies of genetically predicted GlycA have been conducted for hypotheses regarding single endpoints (Lord et al., 2021), whether or not circulating GlycA has a causal effect on outcomes from across the disease spectrum has yet to be comprehensively investigated. The role of GlycA is important to establish given the emerging role of inflammation as a pharmacologically modifiable pathway for the prevention and treatment of cardiovascular disease.
There were 44 PGS associations with GlycA which were robust to an FDR <5%, used as a heuristic to determine which results to investigate in further detail (Figure 2 and Supplementary file 1g). We firstly applied the inverse variance weighted (IVW) MR method to systematically assess whether genetic liability to any of these disease endpoints or complex traits provided evidence of an effect on GlycA levels. Of the 44 PGS, 36 contain one or more genetic variants that reached genome-wide significance which can be used as instrumental variables for MR. In total, eight of these exposures provided evidence of a genetically predicted effect from MR analyses based on FDR <5% (Supplementary file 1g), which included anthropometric traits such as BMI (Beta = 0.16 SD increase in GlycA levels per 1 SD increase in BMI, 95% CI = 0.11 to 0.21, FDR = 1.59 × 10–8) and genetic liability to cigarettes smoked per day (Beta = 0.27 SD change GlycA levels per 1 SD increase in cigarettes per day, 95% CI = 0.20 to 0.34, FDR = 2.84 × 10–12). Estimates based on the IVW method were typically supported by the weighted median approach, although only cigarettes smoked per day were supported by both the weighed median (Beta = 0.24, 95% CI = −0.16 to 0.33, p = 2.08 × 10–8) and MR-Egger (Beta = 0.22, 95% CI = 0.07 to 0.37, p = 0.02) methods (Supplementary file 1h).

Forest plots depicting results from a systematic evaluation of 125 polygenic scores and their associations with circulating glycoprotein acetyls (GlycA).
Associations were assessed by linear regression on up to 83,004 individuals in the UK Biobank. Error bars represent the 95% confidence intervals for the effect estimates. Results coloured in grey are associations which did not surpass a false discovery date of less than 0.05 to account for multiple testing.
Next, we investigated the converse direction of effect using MR to assess whether genetically predicted GlycA may influence any of the 44 complex traits or disease endpoints highlighted by our atlas of results. Undertaking a GWAS of GlycA in the UKB identified 59 independent genetic variants which were harnessed as instrumental variables (mean F = 100.1) (Supplementary file 1i). In contrast to the previous analysis, we identified very weak evidence using the IVW method that genetically predicted GlycA has an effect on any of the 44 traits or diseases assessed based on FDR <5% (Supplementary file 1j). We also conducted further sensitivity analyses given that the NMR signal of GlycA is a composite signal contributed by the glycan N-acetylglucosamine residues on five acute-phase proteins, including alpha1-acid glycoprotein, haptoglobin, alpha1-antitrypsin, alpha1-antichymotrypsin, and transferrin (Otvos et al., 2015). Using cis-acting plasma protein (where possible) and expression quantitative trait loci (pQTLs and eQTLs respectively) (F-stats range from 43.9 to 4468.0) as instrumental variables for these proteins (Supplementary file 1k) did not provide convincing evidence that they play a role in disease risk for associations between PGS and GlycA (Supplementary file 1l). The only effect estimate robust to multiple testing was found for higher genetically predicted alpha1-antitrypsin levels on gamma glutamyl transferase (GGT) levels (Beta = 0.05 SD change in GGT per 1 SD increase in protein levels, 95% CI = 0.03 to 0.07, FDR = 3.6 × 10–3), although this was not replicated when using estimates of genetic associations with GGT levels from a larger GWAS conducted in the UKB data (Beta = 1.6 × 10–3, 95% CI = −6.9 × 10–3 to 0.01, p = 0.71). For details of pleiotropy robust analysis and replication results see Supplementary file 1m.
Stratifying analyses by age to investigate potential sources of bias induced by medication use
A critical challenge when analysing the NMR metabolites data in UKB concerns the most appropriate manner to account for participants taking medications which may undermine inference (Bell et al., 2022). For example, UKB participants undergoing statin therapy will likely have altered levels of lipoprotein lipid metabolites compared to others. However, adjusting for statin therapy as a covariate or by stratification can induce collider bias, which may be encountered when investigating the relationship between two factors (such as genetic liability towards CHD and a lipoprotein lipid metabolite) when both influence a third factor (e.g. statin therapy) (Figure 3A). In particular due to the large sample sizes provided, collider bias in the UKB study has been shown to distort findings (Griffith et al., 2020) and in extreme cases can even result in opposite conclusions being drawn (Richardson et al., 2019a). Therefore, to investigate the influence of medication use on the results within our atlas, we repeated all analyses stratified by age tertiles as proposed previously (Bell et al., 2022), given that age is very unlikely to act as a collider between PGS and circulating metabolites (Figure 3B), and medication use is lower in the younger tertiles. Comparisons between the youngest and oldest tertiles in UKB can be systematically investigated and visualized using our web application to evaluate how medications may bias findings.

Directed acyclic graphs illustrating the potential collider bias involved in the causal relationship between the coronary artery disease polygenic score and circulating metabolites.
(A) The likelihood of participants in UK Biobank taking medication such as statins is influenced by having a higher genetic predisposition to coronary artery disease but may also be influenced by certain metabolic traits measured on the nuclear magnetic resonance (NMR) panel (e.g. having elevated low-density lipoprotein cholesterol levels). Either stratifying or adjusting for statin use in regression models may therefore induce collider bias into the association between disease liability and metabolic traits. (B) Age is commonly adjusted for in association analyses due to its role as a confounder and cannot be a collider (i.e. exposures and outcomes cannot influence the age of participants). Stratifying samples by age therefore enables the analysis of exposure–outcome associations in a group of participants with relatively consistent confounding effect from age, leading to more robust association estimates in the lower age tertile where the percentage of participants who are regularly taking medication is low. Furthermore, comparisons with participants in the highest age tertile can help highlight associations between polygenic scores and metabolic traits most likely distorted by potential colliders such as statins in the full sample.
As an example of this, in the full UKB sample there were 193 circulating metabolites associated with the PGS for CHD (constructed using genetic variants with p < 0.05 and r2 < 0.1) under a p value <0.05/19 for multiple testing correction (Supplementary file 1n). The vast majority of these were lipoprotein lipid traits, which are likely capturing causal risk factors for CHD. Amongst the top associations for this PGS was apolipoprotein B (apo B) (Beta = 0.027, 95% CI = 0.020 to 0.033, p = 7.2 × 10–15), which acts as an index of the number of circulating atherogenic lipoprotein particles and has been postulated previously to be the predominating lipoprotein lipid trait indexing CHD risk (Ference et al., 2019; Sniderman et al., 2019; Richardson et al., 2020b, Richardson et al., 2021).
Evaluating this association between age tertiles allowed us to investigate whether it may be influenced by medications in UKB, such as the impact of statin therapy on lowering low-density lipoprotein (LDL) cholesterol, which apo B particles carry. In the youngest tertile (mean age = 47.3 years, 5% statin users), the association between the CHD PGS with apo B was markedly stronger than in the total sample (Beta = 0.059, 95% CI = 0.048 to 0.070, p < 1.6 × 10–26). In stark contrast, there was weak evidence of an inverse association between apo B and the CHD PGS in the oldest tertile (mean age = 65.3 years, 29% statin users) (Beta = −0.007, 95% CI = −0.019 to 0.004, p = 0.223), which is likely attributed to the higher proportion of participants undergoing statin therapy in this sample. Similarly, concentrations of VLDL, LDL, and IDL provided evidence of a positive association with the CHD PGS in the youngest tertile (Figure 4A), whereas the corresponding associations in the oldest tertile provided weak evidence of association (and in some cases reversed direction entirely) (Figure 4). A comparison of all 249 associations with the CHD PGS derived in the youngest and oldest age tertiles can be found in Supplementary file 1o.

Circos plots illustrating the utility of age-stratified analyses in UK Biobank to investigate potential sources of bias when evaluating associations between polygenic scores (PGS) and circulating metabolites.
Associations were assessed by linear regression on up to 83,004 individuals in the UK Biobank. Error bars represent confidence intervals of the effect estimates between the coronary heart disease (CHD) PGS and traits from the six subclasses: L = total lipids, TG = triglycerides, CE = cholesteryl esters, FC = free cholesterol, C = cholesterol, PL = phospholipids. Grey bars represent associations not robust to multiple testing based on p > 0.05/19. These barcharts are oriented such that those extending to the outer rim reflect a positive association between the CHD PGS and metabolic traits whereas those extending inwards indicate inverse associations. (A) Analyses undertaken for participants in the lowest age tertile (mean age = 47.3 years, 5% statin users) and (B) the corresponding results for the oldest age tertile (mean age = 65.3 years, 29% statin users).
Elucidating polygenic associations with metabolic traits by excluding major regulators of NMR lipoprotein lipids
The polygenic nature of complex traits means that the inclusion of highly weighted pleiotropic genetic variants in PGS may introduce bias into genetic associations within our atlas. To provide insight into this issue, we constructed PGS excluding variants within the regions of the genome which encode the genes for 13 major regulators of NMR lipoprotein lipids signals which captured 75% of the gene–metabolite associations in the Finnish Metabolic Syndrome In Men (METSIM) cohort (Gallois et al., 2019). For details of these genes see Supplementary file 1p.
For PGS with these lipid loci excluded, anthropometric traits such as waist-to-hip ratio (N = 209), waist circumference (N = 206), and BMI (N = 205) still provided strong evidence of association with the majority of metabolic measurements on the NMR panel based on multiple testing corrections. Elsewhere however, the Alzheimer’s disease PGS, which was associated with 60 metabolic traits robust to p < 0.05/19 in the initial analysis including these lipid loci (Supplementary file 1q), provided no convincing evidence of association with the 249 circulating metabolites after excluding the lipid loci based on the same multiple testing threshold (Supplementary file 1r). Further inspection suggested that the likely explanation for this attenuation of evidence were due to variants located within the APOE locus (near one of the major lipid regulators APOC1) which are recognized to exert their influence on phenotypic traits via horizontally pleiotropic pathways (Ferguson et al., 2020).
Discussion
In this study, we have developed an atlas of polygenic risk score associations with circulating metabolic traits in an unprecedented sample size compared to previous studies. Our results can be used to help prioritize findings worthy of follow-up, using techniques such as MR, as a means of disentangling putative causal and non-causal relationships underlying associations between PGS and circulating biomarkers. Furthermore, conducting all analyses within age tertiles illustrates the potential of medication use within the UKB population to bias relationships within our atlas of results. These results should help highlight disease–metabolic trait relationships where researchers should exercise caution when interpreting findings from their own analyses of the recently generated NMR metabolites data in UKB, which are due to be available in all ~500,000 participants in the forthcoming years.
Amongst the thousands of PGS associations identified in this study with a p value <0.05/19, we observed an enrichment of scores derived using GWAS of anthropometric traits. This was exemplified by the waist circumference PGS which yielded the largest number of associations in our atlas (n = 212). Previous studies in the field have demonstrated the strong influence that adiposity has on circulating traits from across the metabolome (Würtz et al., 2014), and indeed across the proteome (Folkersen et al., 2020). Furthermore, as shown previously by an MR study (Bell et al., 2022), certain associations with the BMI PGS may be due to the influence of medication use in the UKB sample, for example those related to LDL (e.g. BMI PGS on total lipids in LDL: Beta = −0.020, 95% CI = 0.027 to −0.013, p < 6.17 × 10–9). In our atlas, evidence of these associations strongly attenuated in the youngest age tertile, where the influence of such factors in the UKB population may be weakest (e.g. total lipids in LDL from youngest age tertile: Beta = 0.005, 95% CI = −0.006 to 0.016, p = 0.35). In addition to the striking difference highlighted in our study between the CHD PGS and apolipoprotein B across age tertiles, findings such as this further emphasize the importance of evaluating results from the full sample analysis together with those derived in age-stratified subsamples. Moreover, we suggest that users interested in the sex-stratified estimates within our atlas should interpret them in conjunction with estimates derived across age tertiles as in this example, given that the proportions of males and females in UKB taking certain medications may differ (e.g. statins).
As an exemplar, we conducted a hypothesis-free evaluation of one of the metabolic traits on the UKB NMR panel, GlycA, as a means of demonstrating how findings from our atlas may help generate hypotheses and follow-up analyses. Whilst our MR results indicated that modifiable risk factors such as BMI and cigarette smoking may increase levels of this circulating inflammatory biomarker, they suggest that targeting GlycA itself is unlikely to yield a beneficial therapeutic effect on the complex traits and disease endpoints evaluated in this study. This highlights the value of findings from our atlas, complemented by approaches such as MR, to help both prioritize and deprioritize circulating metabolic traits for further evaluation. Similar hypothesis-free evaluations on the other 248 metabolic traits can be routinely undertaken using our web tool, in addition to evaluations using the more stringent PGS construction criteria of p < 5 × 10–8 and r2 < 0.001. We reiterate the importance of using approaches such as MR (including sensitivity analyses, which are at least partially robust to various forms of pleiotropy) to formally assess putative causal relationships which may underlie findings in our atlas however, as well as to help orient their directionality. This is particularly important given that PGS may be more prone to recapitulating sources of bias commonly encountered in observational studies in comparison to formal MR analyses (Richardson et al., 2019b, Ritchie et al., 2021a). We likewise conducted bidirectional MR to demonstrate that associations between the CHD PGS and VLDL particle size likely reflect an effect of CHD liability on this metabolic trait. In contrast, the association between the CHD PGS and VLDL concentrations are likely attributed to the causal influence of this metabolic trait on CHD risk, suggesting that it is the concentration of these triglyceride-rich particles that are important in terms of the aetiology of CHD risk as opposed to their actual size. We believe that findings from our atlas, as well as other ongoing efforts which leverage the large-scale NMR data within UKB, should facilitate further granular insight into lipoprotein lipid biology.
In terms of study limitations, we note that the NMR panel is predominantly focussed on lipoprotein lipids and as such our atlas does not facilitate analyses across the entire metabolome. Availability of metabolomics quantified by other platforms (e.g. mass spectroscopy) in large numbers with GWAS genotyping will aid in this effort (MacTel Consortium et al., 2021). Furthermore, whilst these data provide an unparalleled sample size compared to predecessors, findings are based on traits derived from whole blood and may therefore not be reflective of molecular signatures identified in other tissue types (Richardson et al., 2020a). In terms of interpretation, we emphasize that PGS can capture an estimate of an individual’s lifelong disease risk, and as such results based on the UKB NMR metabolites dataset, measured at a midlife timepoint in the lifecourse in predominantly healthy participants, may differ substantially to metabolomic profiles of patients with a disease. Conversely, findings may hold the potential to highlight biomarkers useful for disease prediction before clinical manifestation, therefore indicating a potential window of opportunity for early detection and/or intervention. Lastly, in this study we leveraged data from the European subset of the UKB study, which may therefore not be representative of individuals from other ancestries (Duncan et al., 2019). Larger sample sizes of non-European individuals with metabolomics data will facilitate analyses in other ancestries once available, in addition to findings from future large-scale GWAS which have been principally confined to individuals of European descent to date (Sirugo et al., 2019).
We envisage that findings from our atlas will motivate future study hypotheses and help prioritize (and deprioritize) circulating metabolic traits for further in-depth research. Although we highlight several key findings in this manuscript, all our findings can be queried using our web application which provides a platform to inform researchers in the field planning similar analyses. Similar evaluations to those conducted in this manuscript should help develop a deeper understanding into how circulating metabolic traits contribute towards complex trait variation and assess their putative mediatory roles along the causal pathways between modifiable lifestyle risk factors and disease endpoints.
Methods
Data sources
The UKB study
Metabolic profiling was undertaken on a random subset of individuals from the UKB study (Sudlow et al., 2015) (range between 116,353 and 121,695). Full details on genotyping quality control (QC), phasing and imputation in UKB have been described previously (Bycroft et al., 2018). In brief, samples were restricted to individuals of white British ancestry who self-report as ‘White British’ and who have very similar ancestral backgrounds according to PC analyses performed by Bycroft et al (n = 409,703). In total, 107,162 pairs of related individuals were removed based on estimated kingship coefficients derived using the KING toolset. An in-house algorithm was then applied to preferentially remove individuals related to the greatest number of other individuals until no related pairs were left (removing n = 79,448 in total). A further 2 individuals were removed as they were related to over 200 other individuals. There were n = 814 individuals with sex-mismatch (derived by comparing genetic sex and reported sex) or individuals with sex chromosome aneuploidy excluded from analyses.
In total, 249 metabolic biomarkers were generated using non-fasting plasma samples (aliquot 3) taken from UKB participants at initial or subsequent clinical visits. Targeted high-throughput NMR metabolomics from Nightingale Health Ltd (biomarker quantification version 2020) were used to generate data on each of the 249 measures. These included biomarkers on lipoprotein lipid traits, their concentrations and subclasses, fatty acids, ketone bodies, glycolysis metabolites, and amino acids. Further details are described elsewhere (Julkunen et al., 2021). For QC, data of the 249 NMR metabolomics traits were processed using the ‘ukbnmr’ R package to remove variation due to technical factors caused by differences in sample handing and measurement (Ritchie et al., 2021b). Statin users in UKB were identified based on medication codes as defined previously (Sinnott-Armstrong et al., 2021). A full list of these metabolic biomarkers and their summary characteristics can be found in Supplementary file 1b.
Ethical approval for this study was obtained from the Research Ethics Committee (REC; approval number: 11/NW/0382) and informed consent was collected from all participants enrolled in UKB. Data were accessed under UKB application #15825 and #81499.
GWAS summary statistics
Publicly available GWAS summary statistics were extracted from the OpenGWAS platform (https://gwas.mrcieu.ac.uk/) and publicly available repositories (Elsworth et al., 2020). We identified GWAS for 125 different complex traits and diseases which were selected to encompass a broad range of human phenotypes for which genome-wide data were available allowing us to construct PGS based on all variants available with p < 0.05. Furthermore, we identified GWAS based on study populations with participants of European descent, as our study was based on the unrelated European participants of UKB with NMR measures, as well as studies which had not analysed the UKB study population to avoid overlapping samples which can lead to overfitting bias in results (Fang et al., 2022). All details of these GWAS can be found in Supplementary file 1a.
PGS construction
We built two versions of each PGS in this study using the following criteria. Firstly, scores were developed with independent variants (i.e. r2 < 0.001) which were robustly associated with their traits or disease based on conventional genome-wide corrections (i.e. p < 5 × 10–8). The second versions of scores were derived using more lenient thresholds which were r2 < 0.1 and p < 0.05. LD to estimate correlation between variants was based on a previously constructed reference panel of 10,000 randomly selected unrelated UKB individuals of European descent (Kibinge et al., 2020). PGS were derived for all participants with both genotype and NMR metabolites data after firstly excluding individuals with withdrawn consent, evidence of genetic relatedness or who were not of ‘white European ancestry’ based on a K-means clustering (K = 4). These scores were built by summing trait/disease risk increasing alleles which participants harboured weighted by their effect size reported by GWAS using genotype data from hard call dosages files (plink binaries bed/bim/fam) and the software PLINK v2.0 (Chang et al., 2015).
The majority of PGS were constructed in all eligible participants, with the exception of those based on GWAS in sex-stratified populations. These were breast cancer (including ER+ and ER− PGS), endometrial cancer, ovarian cancer, endometrioid ovarian cancer, age at menarche, age at menopause, and bulimia nervosa, which were derived in females only, as well as the prostate cancer PGS derived in males only. Additionally, we also built two versions of PGS for all complex traits excluding variants at 13 lipid-associated gene loci, which were DOCK7, CELSR2, GALNT2, PCSK9, GCKR, TRIB1, LPL, APOA5, FADS2, LIPC, CETP, LDLR, and APOC1 (consisting of the encoding gene region itself as well as a 1 Mb window either side). Details of the 13 genes are presented in Supplementary file 1p.
Statistical analysis
PC analysis of the metabolites
PC analysis was performed on the post-QC metabolite data to identify the number of independent traits among the 249 highly correlated metabolites. The analysis was conducted using the princomp function from the ‘stats’ R package.
PGS analysis
To allow us to draw comparisons between PGS–metabolite associations, we standardized all PGS to have a mean of 0 and standard deviation of 1 and additionally applied inverse rank normalization transformations to all metabolic traits prior to analysis. Associations between PGS and normalized metabolites were determined by linear regression with adjustment for age, sex (where appropriate), genotyping chip, the top 10 PCs, and fasting time. Each analysis was conducted initially in the full sample, followed by analyses after stratification into age tertiles to investigate the influence of medication use on findings. Sex-stratified association analyses in the full sample were also conducted whereby metabolic traits were transformed separately among males and females before applying linear regression. All analyses were undertaken using both versions of each PGS as long as their corresponding GWAS had at least one variant with p < 5 × 10–8 necessary for the more stringent criteria. To account for multiple testing in this study, we applied a p value threshold of 0.05/19 (accounting for 19 independent variables captured 95% variances in the metabolites from PC analysis) to highlight findings worthwhile evaluating in further detail. However, all results from our analyses are available in the web application should users decide to apply a more stringent (or lenient) heuristic to prioritize findings for in-depth analyses.
Instrument selection for GlycA analysis
Genetic instruments for all PGS traits/disease points evaluated in this study using MR were obtained from the ‘TwoSampleMR’ v0.5.6 R package (Hemani et al., 2018), or by manually uploading GWAS summary statistics and using the clump_data function from this package to identify them. Instruments for GlycA and particle size of very low-density lipoprotein (VLDL) were identified by conducting a GWAS of this trait in the UKB study using the BOLT-LMM (linear mixed model) software to control for population structure (Loh et al., 2015). Analyses were undertaken after excluding individuals of non-European descent (based on K-mean clustering of K = 4) and standard exclusions, including withdrawn consent, mismatch between genetic and reported sex, and putative sex chromosome aneuploidy. Analyses were adjusted for age, sex, fasting status, and a binary variable denoting the genotyping chip used in individuals (the UKBB Axiom array or the UK BiLEVE array). Genetic instruments were defined as variants with p < 5 × 10–8 after removing those in LD using the clump_data function as above. The full GWAS summary statistics for GlycA as well as the other 248 circulating metabolic traits are available via the GWAS catalog (https://www.ebi.ac.uk/gwas/) under accession IDs GCST90092803 to GCST90093051 (Richardson et al., 2022).
The NMR signal of serum GlycA is contributed by five acute-phase proteins (alpha1-acid glycoprotein, haptoglobin, alpha1-antitrypsin, alpha1-antichymotrypsin, and transferrin) (Otvos et al., 2015). Thus, another set of genetic instruments for GlycA were selected among variants strongly associated with these five proteins for further evaluation of the genetically predicted effect of GlycA. Instrumental variables for alpha1-acid glycoprotein were identified usinge eQTLs (p < 5 × 10–8, r2 < 0.001) for ORM1 from the eQTLGen Consortium (Võsa et al., 2021) due to a lack of available protein data. Genetic instruments for haptoglobin, alpha1-antitrypsin, alpha1-antichymotrypsin, and transferrin were identified using pQTLs (p < 5 × 10–8, r2 < 0.001) for these proteins identified from 35,559 Icelanders (Ferkingstad et al., 2021). To identify cis-acting instruments for these five proteins, we restricted e/pQTL associations to variants located within 1 Mb around their encoding genes: ORM1 (encoding alpha1-acid glycoprotein; Ensembl ID: ENSG00000229314), HP (encoding haptoglobin; Ensembl ID: ENSG00000257017), SERPINA1 (encoding alpha1-antitrypsin; Ensembl ID: ENSG00000197249), SERPINA3 (encoding alpha1-antichymotrypsin; Ensembl ID: ENSG00000196136), or TF (encoding transferrin; Ensembl ID: ENSG00000196136). Independent instruments were identified using the same protocol as above (Kibinge et al., 2020).
MR analyses
MR analyses were undertaken using the ‘TwoSampleMR’ package (Hemani et al., 2018) to estimate the bidirectional effects between PGS traits/disease endpoints and metabolic traits, including GlycA and VLDL particle size. This was firstly estimated using the IVW method, which takes the SNP-outcome estimates and regresses them on those for the SNP-exposure associations (Burgess et al., 2013), followed by the weighted median and MR-Egger methods which are considered to be more robust to horizontal pleiotropy than the IVW approach (Bowden et al., 2015; Bowden et al., 2016). If only one SNP is available as genetic instrument, Wald ratio estimates were calculated by dividing the SNP-outcome estimates by the SNP-exposure estimates (Burgess et al., 2017). F-Statistics were calculated to assess weak instrument bias (Burgess and Thompson, 2013). Benjamini–Hochberg FDR threshold of less than 5% was applied as a heuristic to account for multiple testing in the results.
Forest and circos plots in this study were generated using the R package ‘ggplot’ v3.3.3 (Ginestet, 2011). Heatmaps were generated using the R package ‘pheatmap’ v1.0.12 (Kolde, 2015). The web application was developed using the R package ‘shiny’ v1.0.4.2 (Chang et al., 2020). All analyses were undertaken using R (version 3.5.1).
Data availability
All data generated in this study can be downloaded from the web application of our metabolites-PGS atlas: http://mrcieu.mrsoftware.org/metabolites_PRS_atlas/.
References
-
Apt interpretation of comprehensive lipoprotein data in large-scale epidemiology: disclosure of fundamental structural and metabolic relationshipsInternational Journal of Epidemiology 51:996–1011.https://doi.org/10.1093/ije/dyab156
-
Mendelian randomization with invalid instruments: effect estimation and bias detection through egger regressionInternational Journal of Epidemiology 44:512–525.https://doi.org/10.1093/ije/dyv080
-
Mendelian randomization analysis with multiple genetic variants using summarized dataGenetic Epidemiology 37:658–665.https://doi.org/10.1002/gepi.21758
-
Use of allele scores as instrumental variables for Mendelian randomizationInternational Journal of Epidemiology 42:1134–1144.https://doi.org/10.1093/ije/dyt093
-
A review of instrumental variable estimators for Mendelian randomizationStatistical Methods in Medical Research 26:2333–2355.https://doi.org/10.1177/0962280215597579
-
GlycA, a novel biomarker of systemic inflammation and cardiovascular disease riskJournal of Translational Medicine 15:219.https://doi.org/10.1186/s12967-017-1321-6
-
“ Mendelian randomization ”: can genetic epidemiology contribute to understanding environmental determinants of disease?International Journal of Epidemiology 32:1–22.https://doi.org/10.1093/ije/dyg070
-
Mendelian randomization: genetic anchors for causal inference in epidemiological studiesHuman Molecular Genetics 23:R89–R98.https://doi.org/10.1093/hmg/ddu328
-
Alzheimer’s disease susceptibility gene apolipoprotein E (apoe) and blood biomarkers in UK biobank (N = 395,769)Journal of Alzheimer’s Disease 76:1541–1551.https://doi.org/10.3233/JAD-200338
-
Ggplot2: elegant graphics for data analysisJournal of the Royal Statistical Society 174:245–246.https://doi.org/10.1111/j.1467-985X.2010.00676_9.x
-
Can Mendelian randomization shift into reverse GEAR?Clinical Chemistry 65:363–366.https://doi.org/10.1373/clinchem.2018.296806
-
Integrating genomics with biomarkers and therapeutic targets to invigorate cardiovascular drug developmentNature Reviews. Cardiology 18:435–453.https://doi.org/10.1038/s41569-020-00493-1
-
Characterizing the causal pathway for genetic variants associated with neurological phenotypes using human brain-derived proteome dataAmerican Journal of Human Genetics 106:885–892.https://doi.org/10.1016/j.ajhg.2020.04.007
-
Circulating N-linked glycoprotein acetyls and longitudinal mortality riskCirculation Research 118:1106–1115.https://doi.org/10.1161/CIRCRESAHA.115.308078
-
Support a health-protective effect of neuroticism in population subgroups?Psychological Science 30:629–632.
-
Mendelian randomizationNature Reviews Methods Primers 2:6.https://doi.org/10.1038/s43586-021-00092-5
-
An atlas of genetic influences on human blood metabolitesNature Genetics 46:543–550.https://doi.org/10.1038/ng.2982
-
Genetics of 35 blood and urine biomarkers in the UK BiobankNature Genetics 53:185–194.https://doi.org/10.1038/s41588-020-00757-z
-
The personal and clinical utility of polygenic risk scoresNature Reviews. Genetics 19:581–590.https://doi.org/10.1038/s41576-018-0018-x
-
10 years of GWAS discovery: biology, function, and translationAmerican Journal of Human Genetics 101:5–22.https://doi.org/10.1016/j.ajhg.2017.06.005
Decision letter
-
Jonathan K PritchardReviewing Editor; Stanford University, United States
-
Carlos IsalesSenior Editor; Medical College of Georgia at Augusta University, United States
-
Scott RitchieReviewer; University of Cambridge, United Kingdom
-
Maik PietznerReviewer
Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "Constructing an atlas of associations between polygenic risk scores from across the human phenome and circulating metabolic biomarkers" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Carlos Isales as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Scott Ritchie (Reviewer #2); Maik Pietzner (Reviewer #3).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
1) While this work provides a potentially important resource for the community, the examples provided tend to be illustrative, and do not really do justice to the resource. We strongly encourage the authors to revisit the data analysis for more-compelling examples of the types of impactful work that can be achieved using this resource. How and why should users want to use this Atlas?
2) One key question is whether these analyses bring us closer to causality compared to 'classical' observational associations, given LD confounding and strong metabolite variants included in PRS driving the associations. For example, how robust are PRS associations to the exclusion of individual regions? Second, can we determine whether the associations are causally upstream, or downstream of the identified phenotypes? At minimum the authors should demonstrate for examples how these issues can be teased apart (and optionally provide analyses atlas-wide).
3) The atlas would also be potentially strengthened by including additional results for sex-stratified models. Previous work has shown significant differences in NMR biomarker concentrations between males and females, so it would be valuable to see if these result in any differences in PRS associations which may arise despite PRSs largely being constructed from autosomal GWAS.
Furthermore, the reviewers provide numerous additional specific and insightful comments; we strongly encourage the authors to consider these points seriously in their revisions.
Reviewer #1 (Recommendations for the authors):
Specific comments:
Some of the text on figures is very small and it may be worth revisiting figure design.
P8: the positive control of creatinine and kidney disease seems a bit too trivial, given that creatinine is a diagnostic criterion. Perhaps there may be a better example?
P7 Para1: It would be helpful to the reader to say a bit more about the range of types of traits considered here, as well as typical sample sizes given that the data come from a variety of sources.
Consider using PGS instead of PRS as more accurate for quantitative and non-disease traits.
Reviewer #2 (Recommendations for the authors):
Introduction/discussion: this is not the first study that has demonstrated the utility of PRS associations for prioritising molecular traits for follow-up. We have recently published a paper doing so for cardiometabolic PRS and protein levels (Ritchie et al., 2021a) and this has been available as a preprint in various forms since 2019. This should be noted and cited appropriately at the relevant points in the introduction and discussion.
The associations in the atlas for the systolic blood pressure (SBP) and diastolic blood pressure (DBP) are invalid. As the authors note in the methods, PRSs for these two traits are constructed from GWAS that include UK Biobank samples. PRSs are invalid when used in samples that contributed to their underlying GWAS (Lambert et al., 2021; Wand et al., 2021; Wray et al., 2013) as this causes them to have dramatically inflated associations. Associations for these two PRS should be removed from the atlas and study, or if the authors think these two traits are of critical interest, an alternate source of GWAS that do not include UK Biobank participants should be used to construct these two PRSs.
We have found significant sources of technical variation exist in the NMR metabolomics data for UK Biobank (Ritchie et al., 2021b), which may confound some of the PRS associations here. Of particular concern in the context of this study are the presence of outlier shipping plates, on which samples have systematically high (or low) concentrations of non-biological origin (see Figure 5 in Ritchie et al. 2021b). This affects up to 5% of samples depending on the biomarker, but will have an outsized effect on associations as they result in samples being spuriously located at the extremities of biomarker distributions. I.e. this may result in associations being weaker or not significant, or less likely, false positives may be introduced if outlier plates happen to correlate with PRS. There are also a small number of biomarkers significantly impacted by other sources of technical variation, e.g. drift over time within spectrometer, or sample degradation due to extended time between sample preparation and sample measurement. We have made available an R package: ukbnmr (https://github.com/sritchie73/ukbnmr), for correcting for this technical variation, and removing samples on these outlier plates. Although this preprint is still under review, we would suggest using this package to remove the described technical variation, or at least checking whether doing so significantly changes associations in the atlas.
There are also major biological determinants of NMR biomarker concentrations that have not been accounted for, and thus may confound, PRS to biomarker associations. In addition to age, sex, and 10 genetic principal components that the authors already adjust for, NMR biomarker concentrations are also strongly correlated with body mass index (BMI), fasting time, and lipid lowering medication (statin) usage (see Figure 8 in Ritchie et al., 2021b). The atlas would be greatly improved by either appropriately accounting for these in the basic models, or including additional results using models that do so.
Regarding statin usage in particular, we appreciate that the authors have conducted a separate age-stratified analysis to evaluate the impact of medication usage on PRS to biomarker associations. However, the fact remains that statin usage will be confounding the PRS to biomarker associations in the main results (e.g. by artificially lowering LDL cholesterol in people at high CHD PRS). This confounding is typically removed in one of two ways: either (1) fitting associations excluding participants taking statins, or (2) applying biomarker-specific correction factors to concentrations prior to association analysis, such as those estimated by (Kofink et al., 2017; Sliz et al., 2018). The latter is probably the better approach as the former is likely to introduce further bias towards young and healthy participants.
The atlas would also be potentially strengthened by including additional results for sex-stratified models. We have observed significant differences in NMR biomarker concentrations between males and females, so it would be interesting to see if these result in any differences in PRS associations which may arise despite PRSs largely being constructed from autosomal GWAS.
Regarding the bi-directional Mendelian randomization analysis of GlycA, we would suggest using an alternative biomarker as exemplar in the study, as GlycA is heterogeneous and needs to be more carefully instrumented than has been done in the study currently. GlycA is an NMR signal that quantifies the total concentrations of five proteins in circulation (Otvos et al., 2015): α-1-acid glycoprotein (AGP), α-1-antitrypsin (AAT), α-1-antichymotrypsin (AACT), haptoglobin (HP), and transferrin (TF). These are acute-phase reactants whose concentrations each change in response to acute inflammation or in chronic inflammation (Connelly et al., 2017). Moreover these changes are differential with respect to each other, and over time (Ebersole and Cappelli, 2000; Gabay and Kushner, 1999). Further, a single GlycA measurement cannot be simply decomposed as two people with the same GlycA concentration can have different concentrations of the underlying proteins (Ritchie et al., 2019). This heterogeneity means that using genome-wide significant QTLs for GlycA is unlikely to be appropriate as these QTLs may include (1) protein-QTLs for one or more of the five proteins GlycA captures, and (2) signals involved in initiation of acute-phase response (e.g. interleukin-6 signalling pathways). To use GlycA as an exposure in Mendelian randomization analysis, the QTLs selected as instruments should be restricted to cis-pQTLs for the five proteins. These signals may need to be treated separately, as in the scenario GlycA is truly causal, it is unlikely that all five proteins are causal, or that they have similar causal effect sizes.
Data availability: the PRS should also be made publicly available, e.g. downloadable via the atlas. These should include at minimum: the set of variants in each PRS, variant identifier information (e.g. chromosome and position), genome build, effect allele, and weight (β or log odds from the underlying GWAS).
References
Bretherick AD, Canela-Xandri O, Joshi PK, Clark DW, Rawlik K, Boutin TS, Zeng Y, Amador C, Navarro P, Rudan I, Wright AF, Campbell H, Vitart V, Hayward C, Wilson JF, Tenesa A, Ponting CP, Baillie JK, Haley C. 2020. Linking protein to phenotype with Mendelian Randomization detects 38 proteins with causal roles in human diseases and traits. PLoS Genet 16:e1008785.
Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, Motyer A, Vukcevic D, Delaneau O, O'Connell J, Cortes A, Welsh S, Young A, Effingham M, McVean G, Leslie S, Allen N, Donnelly P, Marchini J. 2018. The UK Biobank resource with deep phenotyping and genomic data. Nature 562:203-209.
Connelly MA, Otvos JD, Shalaurova I, Playford MP, Mehta NN. 2017. GlycA, a novel biomarker of systemic inflammation and cardiovascular disease risk. J Transl Med 15:219.
Ebersole JL, Cappelli D. 2000. Acute-phase reactants in infections and inflammatory diseases. Periodontol 2000 23:19-49.
Gabay C, Kushner I. 1999. Acute-phase proteins and other systemic responses to inflammation. N Engl J Med 340:448-454.
Kofink D, Eppinga RN, van Gilst WH, Bakker SJL, Dullaart RPF, van der Harst P, Asselbergs FW. 2017. Statin Effects on Metabolic Profiles: Data From the PREVEND IT (Prevention of Renal and Vascular End-stage Disease Intervention Trial). Circ Cardiovasc Genet 10. doi:10.1161/CIRCGENETICS.117.001759
Lambert SA, Gil L, Jupp S, Ritchie SC, Xu Y, Buniello A, McMahon A, Abraham G, Chapman M, Parkinson H, Danesh J, MacArthur JAL, Inouye M. 2021. The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nat Genet 53:420-425.
Otvos JD, Shalaurova I, Wolak-Dinsmore J, Connelly MA, Mackey RH, Stein JH, Tracy RP. 2015. GlycA: A Composite Nuclear Magnetic Resonance Biomarker of Systemic Inflammation. Clin Chem 61:714-723.
Ritchie SC, Kettunen J, Brozynska M, Nath AP, Havulinna AS, Männistö S, Perola M, Salomaa V, Ala-Korpela M, Abraham G, Würtz P, Inouye M. 2019. Elevated serum α-1 antitrypsin is a major component of GlycA-associated risk for future morbidity and mortality. PLoS One 14:e0223692.
Ritchie SC, Lambert SA, Arnold M, Teo SM, Lim S, Scepanovic P, Marten J, Zahid S, Chaffin M, Liu Y, Abraham G, Ouwehand WH, Roberts DJ, Watkins NA, Drew BG, Calkin AC, Di Angelantonio E, Soranzo N, Burgess S, Chapman M, Kathiresan S, Khera AV, Danesh J, Butterworth AS, Inouye M. 2021a. Integrative analysis of the plasma proteome and polygenic risk of cardiometabolic diseases. Nat Metab 3:1476-1483.
Ritchie SC, Surendran P, Karthikeyan S, Lambert SA, Bolton T, Pennells L, Danesh J, Di Angelantonio E, Butterworth AS, Inouye M. 2021b. Quality control and removal of technical variation of NMR metabolic biomarker data in ∼120,000 UK Biobank participants. medRxiv. doi:10.1101/2021.09.24.21264079
Sliz E, Kettunen J, Holmes MV, Williams CO, Boachie C, Wang Q, Männikkö M, Sebert S, Walters R, Lin K, Millwood IY, Clarke R, Li L, Rankin N, Welsh P, Delles C, Jukema JW, Trompet S, Ford I, Perola M, Salomaa V, Järvelin M-R, Chen Z, Lawlor DA, Ala-Korpela M, Danesh J, Davey Smith G, Sattar N, Butterworth A, Würtz P. 2018. Metabolomic consequences of genetic inhibition of PCSK9 compared with statin treatment. Circulation 138:2499-2512.
Wand H, Lambert SA, Tamburro C, Iacocca MA, O'Sullivan JW, Sillari C, Kullo IJ, Rowley R, Dron JS, Brockman D, Venner E, McCarthy MI, Antoniou AC, Easton DF, Hegele RA, Khera AV, Chatterjee N, Kooperberg C, Edwards K, Vlessis K, Kinnear K, Danesh JN, Parkinson H, Ramos EM, Roberts MC, Ormond KE, Khoury MJ, Janssens ACJW, Goddard KAB, Kraft P, MacArthur JAL, Inouye M, Wojcik GL. 2021. Improving reporting standards for polygenic scores in risk prediction studies. Nature 591:211-219.
Wray NR, Yang J, Hayes BJ, Price AL, Goddard ME, Visscher PM. 2013. Pitfalls of predicting complex traits from SNPs. Nat Rev Genet 14:507-515.
Zheng J, Haberland V, Baird D, Walker V, Haycock PC, Hurle MR, Gutteridge A, Erola P, Liu Y, Luo S, Robinson J, Richardson TG, Staley JR, Elsworth B, Burgess S, Sun BB, Danesh J, Runz H, Maranville JC, Martin HM, Yarmolinsky J, Laurin C, Holmes MV, Liu JZ, Estrada K, Santos R, McCarthy L, Waterworth D, Nelson MR, Davey Smith G, Butterworth AS, Hemani G, Scott RA, Gaunt TR. 2020. Phenome-wide Mendelian randomization mapping the influence of the plasma proteome on complex diseases. Nat Genet 52:1122-1131.
Reviewer #3 (Recommendations for the authors):
I have the following more specific comments that could possibly improve the study by Fang et al.:
1. The NMR platform used by the authors measures only a small number of small molecules and the vast majority of the derived measures refer to characteristics of lipoprotein particles, which are not 'classical' metabolites. The paper would benefit from a paper distinction between both measures. Therefore, it is questionable how many of the 100,000 metabolites mentioned by the authors are captured by the technology used and further it is even of interest how many approximately independent features are indeed captured by the technology. A principal component analysis or similar dimension reduction techniques would provide an important correction/estimate of/to the metabolic space captured. Given the high correlation among the NMR traits, it would be important to state how many likely independent associations have been found, for instance by clumping highly correlated metabolites in clusters, similar as the authors do for SNPs.
2. I find the section about collider bias in the introduction a bit as surprise and it is unclear to me, how this relates to the overall aim of the study. The high amount of self-citation in this section and in general throughout the paper makes me wonder, how much of an issue this really is compared to more substantial questions about the suitability of PRS to identify disease biomarkers or causal metabolites.
3. Investigating different types of genetic scores is a clear strength of this work. However, the study currently stops early leaving important questions unanswered. For instance, how many more 'helpful' associations between PRS and NMR measures are really gained by going genome-wide, that is, how many of the added associations are mainly due to unspecific pleiotropy? The authors have outstanding methodological skills in MR and related causal inference techniques, and I find it somewhat wasted here to go simply for more associations instead of digging into the relevant part, which in turn defines the usefulness of the whole atlas and hence this study. It currently reads mostly like a computational exercise.
4. In my opinion, the key question is, what do all those associations deliver, do they bring us any closer to causality compared to 'classical' observational associations, given LD confounding and strong metabolite variants included in PRS driving the associations. For example, how robust are PRS associations to the exclusion of single regions, Ritchie et al., 2021 Nat Metab provide a neat framework to address such questions. The APOE example goes in that direction and other locus-specific effects are likely underlying other disease PRS – NMR measure associations.
5. One might argue that the CKD example is somewhat self-fulling, given that the selection of cases is mainly done based on serum creatinine (or the eGFR derived from creatinine), but is a good example how strong and specific examples might point to disease biomarkers. Are there more examples for a given NMR trait being strongly and possibly specifically associated with a trait or a cluster of highly related traits? I would also tone the creatinine example done, since it is obvious that this marker is used for clinical decision making.
6. I am not quite sure about the bidirectional MR approach. By only testing diseases for which the PRS showed an association with the metabolite of interest, isn't it quite likely that an effect of the metabolite on the disease would be missed, as one would assume that a true causal association between a metabolite and a disease might well be lost in the large set of SNPs associated with the endpoint of interest.
7. I disagree with the statement on page 12 that all lipid traits associated with the CHD PRS are causal risk factors, most of them are likely not and the true underlying risk factor or biological mechanism is likely hidden among all those associations. The general inability to distinguish about the causal relevance of all those highly related measures is a massive challenge working with NMR data, in particular given that many are not real measurements but are only derived as proportion from other measures, including apolipoprotein B.
8. The authors need to do better in distinguishing between all the different lipid measures that are derived from the NMR platform, this is also seen in the discussion of Apo B. Apo B is the main structural protein for many lipoprotein particles of different densities and not just for atherogenic ones.
9. Instead of stratified analysis, why not correcting for statin intake to estimate lipid levels of participants without the use of statins? What about the effect of many other medications widely prescribed, with strong effects on NMR measures as described in van Duijn et al., Nat Med 2020?
https://doi.org/10.7554/eLife.73951.sa1Author response
Essential revisions:
1) While this work provides a potentially important resource for the community, the examples provided tend to be illustrative, and do not really do justice to the resource. We strongly encourage the authors to revisit the data analysis for more-compelling examples of the types of impactful work that can be achieved using this resource. How and why should users want to use this Atlas?
Many thanks for your comments to help us refine this resource and manuscript. We have now revisited the examples in this manuscript to showcase the value of the PGS atlas with the help of the reviewer comments:
1. We have undertaken a new application of bi-directional Mendelian randomization (MR) to demonstrate how readers may use this approach to disentangle whether associations in our atlas likely reflect either causes or consequences of PGS traits/diseases. This example is described on page 9 and discussed in the discussion on page 21:
‘We likewise conducted bi-directional MR to demonstrate that associations between the CHD PGS and VLDL particle size likely reflect an effect of CHD liability on this metabolic trait. In contrast, the association between the CHD PGS and VLDL concentrations are likely attributed to the causal influence of this metabolic trait on CHD risk, suggesting that it is the concentration of these triglyceride-rich particles that are important in terms of the aetiology of CHD risk as opposed to their actual size. We envisage that findings from our atlas, as well as other ongoing efforts which leverage the large-scale NMR data within UKB, should facilitate further granular insight into lipoprotein lipid biology.’
2. Based on reviewer #2’s comments regarding our previous example using results for GlycA, we have performed further sensitivity analyses to support the conclusions from this example.
3. Based on reviewer #3’s comments we have included further explanation regarding collider bias as well as a Figure (page 14) to emphasize the importance of age-stratified analyses in our atlas for users to evaluate the robustness of findings.
4. Finally, we have also recalculated all PGS-metabolite associations in the atlas after removing established lipid loci across the genome based on the publication by Gallois et al., (2019 Nature Communications https://www.nature.com/articles/s41467-019-12703-7). As an example of why this sensitivity analysis may be helpful to the user, we compared metabolomic profiles for the anthropometric and the Alzheimer’s disease PGS before and after removing these loci (page 19):
‘For PGS with these lipid loci excluded, anthropometric traits such as waist-to-hip ratio (N=209), waist circumference (N=206) and body mass index (N=205) still provided strong evidence of association with the majority of metabolic measurements on the NMR panel based on multiple testing corrections. Elsewhere however, the Alzheimer’s disease PGS, which was associated with 60 metabolic traits robust to P<0.05/19 in the initial analysis including these lipid loci (Supplementary Table 17), provided no convincing evidence of association with the 249 circulating metabolites after excluding the lipid loci based on the same multiple testing threshold (Supplementary Table 18). Further inspection suggested that the likely explanation for this were variants located at the APOE locus which are recognised to exert their influence on phenotypic traits via horizontally pleiotropic pathways (Ferguson et al., 2020).’
2) One key question is whether these analyses bring us closer to causality compared to 'classical' observational associations, given LD confounding and strong metabolite variants included in PRS driving the associations. For example, how robust are PRS associations to the exclusion of individual regions?
Thank you for the suggestion. As recommended by reviewer #2 we have looked into the analysis conducted in the Ritchie et al., 2021 Nat Metab paper. The authors examined the association between 5 different genome-wide polygenic scores (PGSs) and systematically removed 1,703 approximately independent LD blocks with replacement in their association analyses.
We agree that it is very important to consider the contribution of individual gene regions when examining genetic associations. Whilst this was feasible for 5 different PGS in the study by Ritchie et al., this becomes a lot more computational intensive for our study whereby we construct 125 different PGS using 2 different selection criteria. Therefore, we have added a new analysis to evaluate the contribution of 14 previously discovered lipid-associated genes (LIPC, APOA5, CETP, PCSK9, LDLR, GCKR, APOC1, LPL, GALNT2, CELSR2, TRIB1, DOCK7, FADS2 and APOE) in the PGS-metabolites associations as reported on page 19:
‘The polygenic nature of complex traits means that the inclusion of highly weighted pleiotropic genetic variants in PGS may introduce bias into genetic associations within our atlas. To provide insight into this issue, we constructed PGS excluding variants within the regions of the genome which encode the genes for 14 major regulators of NMR lipoprotein lipids signals which captured 75% of the gene-metabolite associations in the Finnish Metabolic Syndrome In Men (METSIM) cohort (Gallois et al., 2019). For details of these genes (see Supplementary Table 5).’
Followed by the example described above to help demonstrate to users to importance of these results for further evaluate the robustness of findings.
Second, can we determine whether the associations are causally upstream, or downstream of the identified phenotypes? At minimum the authors should demonstrate for examples how these issues can be teased apart (and optionally provide analyses atlas-wide).
As described above we have now added an example evaluating associations with measures of triglyceride-rich VLDL captured by the NMR panel by applying bi-directional MR to disentangle whether they may reflect causes or consequences of traits/disease.
The main reason for not conducting these types of evaluations atlas-wide is because we present this resource as one to generate hypotheses which users can then conduct in-depth evaluations of in a timely and careful manner. This was highlighted by reviewer #2 who notes that challenges of appropriately instrumenting the metabolic trait GlycA which we have also address below in response to their comment. This overall point has been discussed on page 8 of the manuscript:
‘In this paper, we provide several examples of how results from this atlas can be used to generate hypotheses and pave the way for in-depth and careful evaluations of associations between PGS and circulating traits. Specifically, we believe our findings can facilitate a ‘reverse gear Mendelian randomization’ approach to disentangle whether associations likely reflect metabolic traits acting as a cause or consequence of disease risk (Holmes and Davey Smith, 2019) as illustrated using triglyceride-rich very low density lipoprotein (VLDL) particles in the next section. Furthermore, in-depth evaluations allow careful consideration of appropriate instrumental variables for circulating metabolites which can be a challenging task as highlighted in our exemplar analysis of glycoprotein acetyls. Finally, we provide examples of how the plethora of sensitivity analyses within our atlas can help users further investigate the robustness of findings.’
3) The atlas would also be potentially strengthened by including additional results for sex-stratified models. Previous work has shown significant differences in NMR biomarker concentrations between males and females, so it would be valuable to see if these result in any differences in PRS associations which may arise despite PRSs largely being constructed from autosomal GWAS.
Many thanks for this suggestion. We have undertaken extensive analyses to generate all PGS-metabolite association stratified by sex and added these to the web platform. This has been discussed on page 8:
‘Our atlas also includes sex-stratified estimates for PGS weighted by GWAS undertaken in female only (such as breast cancer and age at menarche) and male only (e.g., prostate cancer) populations, as well as sex-stratified estimates in both females and males separately for all other PGS-metabolite associations. We encourage users interested in these sex-stratified estimates to interpret them with caution however, given the widespread sex-differential participation bias in UKB (Pirastu et al., (2021)).’
And page 20:
‘Moreover, we suggest that users interested in the sex-stratified estimates within our atlas should interpret them in conjunction with estimates derived across age tertiles as in this example, given that different proportions of males and females in UKB may be taking certain medications (e.g. statins).’
Furthermore, the reviewers provide numerous additional specific and insightful comments; we strongly encourage the authors to consider these points seriously in their revisions.
Reviewer #1 (Recommendations for the authors):
Specific comments:
Some of the text on figures is very small and it may be worth revisiting figure design.
Many thanks for this suggestion. We have now regenerated our large Circos plot figure where text was particularly small (now as a forest plot) and moved it to the supplementary Figure. Furthermore, we note that this work has been submitted to eLife which is an online only journal, therefore meaning that readers will have the added benefit of always being able to zoom in of the figures generated in our manuscript.
P8: the positive control of creatinine and kidney disease seems a bit too trivial, given that creatinine is a diagnostic criterion. Perhaps there may be a better example?
The creatinine-CKD example was originally included to demonstrate the predictive ability of this PGS; however, this has now been removed as per your suggestion. Instead, we have further highlighted associations between lipoprotein lipid metabolic traits and the coronary heart disease (CHD) PGS, which we feel is powerful positive control given the overwhelming evidence supporting a causal relationship between lipids and CHD risk.
P7 Para1: It would be helpful to the reader to say a bit more about the range of types of traits considered here, as well as typical sample sizes given that the data come from a variety of sources.
We have now added a pie chart to illustrate the range of PGS traits/disease in our atlas into the Figure 1. A similar figure reflecting associations across NMR subcategories has also been generated alongside and added to Figure 1. All sample sizes are reported in Supplementary Table 1.
Consider using PGS instead of PRS as more accurate for quantitative and non-disease traits.
Thank you – we have now changed ‘PRS’ to ‘PGS’ throughout.
Reviewer #2 (Recommendations for the authors):
Introduction/discussion: this is not the first study that has demonstrated the utility of PRS associations for prioritising molecular traits for follow-up. We have recently published a paper doing so for cardiometabolic PRS and protein levels (Ritchie et al., 2021a) and this has been available as a preprint in various forms since 2019. This should be noted and cited appropriately at the relevant points in the introduction and discussion.
We have now referenced your paper on pages 4 and 12 of the manuscript.
The associations in the atlas for the systolic blood pressure (SBP) and diastolic blood pressure (DBP) are invalid. As the authors note in the methods, PRSs for these two traits are constructed from GWAS that include UK Biobank samples. PRSs are invalid when used in samples that contributed to their underlying GWAS (Lambert et al., 2021; Wand et al., 2021; Wray et al., 2013) as this causes them to have dramatically inflated associations. Associations for these two PRS should be removed from the atlas and study, or if the authors think these two traits are of critical interest, an alternate source of GWAS that do not include UK Biobank participants should be used to construct these two PRSs.
Many thanks for your suggestion. We have now removed the SBP and DBP PGS from our atlas due to overlapping samples in UKB.
We have found significant sources of technical variation exist in the NMR metabolomics data for UK Biobank (Ritchie et al., 2021b), which may confound some of the PRS associations here. Of particular concern in the context of this study are the presence of outlier shipping plates, on which samples have systematically high (or low) concentrations of non-biological origin (see Figure 5 in Ritchie et al., 2021b). This affects up to 5% of samples depending on the biomarker, but will have an outsized effect on associations as they result in samples being spuriously located at the extremities of biomarker distributions. I.e. this may result in associations being weaker or not significant, or less likely, false positives may be introduced if outlier plates happen to correlate with PRS. There are also a small number of biomarkers significantly impacted by other sources of technical variation, e.g. drift over time within spectrometer, or sample degradation due to extended time between sample preparation and sample measurement. We have made available an R package: ukbnmr (https://github.com/sritchie73/ukbnmr), for correcting for this technical variation, and removing samples on these outlier plates. Although this preprint is still under review, we would suggest using this package to remove the described technical variation, or at least checking whether doing so significantly changes associations in the atlas.
Thank you for this suggestion. As mentioned in response to the public review, we have now repeated our entire pipeline after applying your R package.
There are also major biological determinants of NMR biomarker concentrations that have not been accounted for, and thus may confound, PRS to biomarker associations. In addition to age, sex, and 10 genetic principal components that the authors already adjust for, NMR biomarker concentrations are also strongly correlated with body mass index (BMI), fasting time, and lipid lowering medication (statin) usage (see Figure 8 in Ritchie et al., 2021b). The atlas would be greatly improved by either appropriately accounting for these in the basic models, or including additional results using models that do so.
Regarding statin usage in particular, we appreciate that the authors have conducted a separate age-stratified analysis to evaluate the impact of medication usage on PRS to biomarker associations. However, the fact remains that statin usage will be confounding the PRS to biomarker associations in the main results (e.g. by artificially lowering LDL cholesterol in people at high CHD PRS). This confounding is typically removed in one of two ways: either (1) fitting associations excluding participants taking statins, or (2) applying biomarker-specific correction factors to concentrations prior to association analysis, such as those estimated by (Kofink et al., 2017; Sliz et al., 2018). The latter is probably the better approach as the former is likely to introduce further bias towards young and healthy participants.
We have responded to this suggestion in the public review. However, just to add that the previous studies referenced here are noticeably much smaller in sample size compared to our study (n=864 & n=5359 respectively). Conditioning on colliders has become a lot more problematic with the advent of biobank scale cohort studies such as UKB. As such, previously modest sources of bias induced by conditioning on colliders, which may have not led to spurious findings in smaller cohort studies, are now large enough such that precautions (e.g., those taken in our study by stratifying on age) are necessary to investigate the robustness of findings.
The atlas would also be potentially strengthened by including additional results for sex-stratified models. We have observed significant differences in NMR biomarker concentrations between males and females, so it would be interesting to see if these result in any differences in PRS associations which may arise despite PRSs largely being constructed from autosomal GWAS.
We have undertaken extensive analyses to additionally evaluate sex-stratified associations (page 8):
‘Our atlas also includes sex-stratified estimates for PGS weighted by GWAS undertaken in female only (such as breast cancer and age at menarche) and male only (e.g., prostate cancer) populations, as well as sex-stratified estimates in both females and males separately for all other PGS-metabolite associations. We encourage users interested in these sex-stratified estimates to interpret them with caution however, given the widespread sex-differential participation bias in UKB (Pirastu et al., (2021)).’
And page 20:
‘Moreover, we suggest that users interested in the sex-stratified estimates within our atlas should interpret them in conjunction with estimates derived across age tertiles as in this example, given that different proportions of males and females in UKB may be taking certain medications (e.g. statins).’
Regarding the bi-directional Mendelian randomization analysis of GlycA, we would suggest using an alternative biomarker as exemplar in the study, as GlycA is heterogeneous and needs to be more carefully instrumented than has been done in the study currently. GlycA is an NMR signal that quantifies the total concentrations of five proteins in circulation (Otvos et al., 2015): α-1-acid glycoprotein (AGP), α-1-antitrypsin (AAT), α-1-antichymotrypsin (AACT), haptoglobin (HP), and transferrin (TF). These are acute-phase reactants whose concentrations each change in response to acute inflammation or in chronic inflammation (Connelly et al., 2017). Moreover these changes are differential with respect to each other, and over time (Ebersole and Cappelli, 2000; Gabay and Kushner, 1999). Further, a single GlycA measurement cannot be simply decomposed as two people with the same GlycA concentration can have different concentrations of the underlying proteins (Ritchie et al., 2019). This heterogeneity means that using genome-wide significant QTLs for GlycA is unlikely to be appropriate as these QTLs may include (1) protein-QTLs for one or more of the five proteins GlycA captures, and (2) signals involved in initiation of acute-phase response (e.g. interleukin-6 signalling pathways). To use GlycA as an exposure in Mendelian randomization analysis, the QTLs selected as instruments should be restricted to cis-pQTLs for the five proteins. These signals may need to be treated separately, as in the scenario GlycA is truly causal, it is unlikely that all five proteins are causal, or that they have similar causal effect sizes.
Thank you for the suggestion. We have repeated the GlycA Mendelian randomization analysis using the pQTLs (for AAT, AACT, HP and TF) and eQTL (for AGP) of these proteins. Detailed methods are presented on page 26 of the manuscript:
‘The NMR signal of serum GlycA is contributed by five acute-phase proteins (alpha1-acid glycoprotein, haptoglobin, alpha1-antitrypsin, alpha1-antichymotrypsin, and transferrin) (Otvos et al., 2015). Thus, another set of genetic instruments for GlycA were selected among SNPs strongly associated with these five proteins for further evaluation of the genetically predicted effect of GlycA. Instrumental variables for alpha1-acid glycoprotein are identified from the gene expression quantitative trait loci (eQTL) (P<5x10-8, r2<0.001) of ORM1 from the eQTLGen Consortium (Võsa et al., 2021). Genetic instruments for haptoglobin, alpha1-antitrypsin, alpha1-antichymotrypsin, and transferrin are their protein quantitative trait loci (pQTL) (P<5x10-8, r2<0.001) identified from 35,559 Icelanders (Ferkingstad et al., 2021). To identify cis-acting genetic variants, genetic variants of the five proteins are limited to those located within 1Mb around their encoding genes: ORM1 (encoding alpha1-acid glycoprotein; Ensembl ID: ENSG00000229314), HP (encoding haptoglobin; Ensembl ID: ENSG00000257017), SERPINA1 (encoding alpha1-antitrypsin; Ensembl ID: ENSG00000197249), SERPINA3 (encoding alpha1-antichymotrypsin; Ensembl ID: ENSG00000196136) or TF (encoding transferrin; Ensembl ID: ENSG00000196136). GWAS summary statistics of the protein data were performed using an in-house reference panel consisting of 10,000 randomly selected European participants of the UK Biobank (Kibinge et al., 2020) in PLINK v1.90. The eQTL data from the eQTLGene Consortium was extracted from the OpenGWAS portal.’
Findings can be found on page 11 of the manuscript:
‘We also conducted further sensitivity analyses given that the NMR signal of GlycA is a composite signal contributed by the glycan N-acetylglucosamine residues on five acute-phase proteins, including alpha1-acid glycoprotein, haptoglobin, alpha1-antitrypsin, alpha1-antichymotrypsin, and transferrin (Otvos et al., 2015). Using cis-acting plasma protein (where possible) and expression quantitative trait loci (pQTLs and eQTLs) as instrumental variables for these proteins (Supplementary Table 12) did not provide convincing evidence that they play a role in disease risk for associations between PGS and GlycA (Supplementary Table 13). The only effect estimate robust to multiple testing was found for higher genetically predicted alpha1-antitrypsin levels on γ glutamyl transferase (GGT) levels (Β=0.05 SD change in GGT per 1 SD increase in protein levels, 95% CI=0.03 to 0.07, FDR=3.6x10-3), although this was not replicated when using estimates of genetic associations with GGT levels from a larger GWAS conducted in the UK Biobank data (Β=1.6x10-3, 95% CI=-6.9 x10-3 to 0.01, P = 0.71). For details of pleiotropy robust analysis and replication results see Supplementary Table 14.’
Data availability: the PRS should also be made publicly available, e.g. downloadable via the atlas. These should include at minimum: the set of variants in each PRS, variant identifier information (e.g. chromosome and position), genome build, effect allele, and weight (β or log odds from the underlying GWAS).
All weights used to generate the PGS in our atlas are now available at https://tinyurl.com/PGSweights as mentioned on page 7:
‘The specific weights for clumped variants used in all PGS can be found at https://tinyurl.com/PGSweights.’
Reviewer #3 (Recommendations for the authors):
Apart from the general comments made above, I have the following more specific comments that could possibly improve the study by Fang et al.:
1. The NMR platform used by the authors measures only a small number of small molecules and the vast majority of the derived measures refer to characteristics of lipoprotein particles, which are not 'classical' metabolites. The paper would benefit from a paper distinction between both measures. Therefore, it is questionable how many of the 100,000 metabolites mentioned by the authors are captured by the technology used and further it is even of interest how many approximately independent features are indeed captured by the technology. A principal component analysis or similar dimension reduction techniques would provide an important correction/estimate of/to the metabolic space captured. Given the high correlation among the NMR traits, it would be important to state how many likely independent associations have been found, for instance by clumping highly correlated metabolites in clusters, similar as the authors do for SNPs.
As suggested we have conducted a PCA analysis to investigate the amount of metabolic space captured. Findings are reported on page 8:
‘Principal component analysis suggested that the first 19 principal components (PCs) captured 95% of the variance in the NMR metabolites data (whereas the first ten PCs captured 90% and the first 41 PCs captured 99% of the variance) (Supplementary Table 3).’
The full results from PCA analysis are presented in Supplementary Table 3.
2. I find the section about collider bias in the introduction a bit as surprise and it is unclear to me, how this relates to the overall aim of the study. The high amount of self-citation in this section and in general throughout the paper makes me wonder, how much of an issue this really is compared to more substantial questions about the suitability of PRS to identify disease biomarkers or causal metabolites.
We have now added a more detailed section (including a new figure in the main document) discussing the challenges of addressing potential collider in a biobank-scale dataset which is discussed on page 14:
‘Directed acyclic graphs illustrating the potential collider bias involved in the causal relationship between the coronary artery disease polygenic score and circulating metabolites. (A) The likelihood of participants in UK Biobank taking medication such as statins is influenced by having a higher genetic predisposition to coronary artery disease but may also be influenced by certain metabolic traits measured on the NMR panel (e.g. having elevated low-density lipoprotein cholesterol levels). Either stratifying or adjusting for statin use in regression models may therefore induce collider bias into the association between disease liability and metabolic traits. (B) Age is commonly adjusted for in association analyses due to its role as a confounder and cannot be a collider (i.e. exposures and outcomes cannot influence the age of participants). Stratifying samples by age therefore enables the analysis of exposure-outcome associations in a group of participants with relatively consistent confounding effect from age, leading to more robust association estimates in the lower age tertile where the percentage of participants who are regularly taking medication is low. Furthermore, comparisons with participants in the highest age tertile can help highlight associations between polygenic scores and metabolic traits most likely distorted by potential colliders such as statins in the full sample.’
3. Investigating different types of genetic scores is a clear strength of this work. However, the study currently stops early leaving important questions unanswered. For instance, how many more 'helpful' associations between PRS and NMR measures are really gained by going genome-wide, that is, how many of the added associations are mainly due to unspecific pleiotropy? The authors have outstanding methodological skills in MR and related causal inference techniques, and I find it somewhat wasted here to go simply for more associations instead of digging into the relevant part, which in turn defines the usefulness of the whole atlas and hence this study. It currently reads mostly like a computational exercise.
Thank you for your complement re. methodological skills. We have emphasized in our introduction (page 8) that the main purpose of this resource is to generate hypotheses and therefore facilitate in-depth, carefully designed follow-up analyses (such as those applying the principles of MR). We advocate in this section that causal inference is challenging to robustly undertake which is why we suggest readers conduct bespoke analyses for this task motivated by findings in this atlas of results:
‘In this paper, we provide several examples of how results from this atlas can be used to generate hypotheses and pave the way for in-depth and careful evaluations of associations between PGS and circulating traits. Specifically, we believe our findings can facilitate a ‘reverse gear Mendelian randomization’ approach to disentangle whether associations likely reflect metabolic traits acting as a cause or consequence of disease risk (Holmes and Davey Smith, 2019) as illustrated using triglyceride-rich very low density lipoprotein (VLDL) particles in the next section. Furthermore, in-depth evaluations allow careful consideration of appropriate instrumental variables for circulating metabolites which can be a challenging task as highlighted in our exemplar analysis of glycoprotein acetyls. Finally, we provide examples of how the plethora of sensitivity analyses within our atlas can help users further investigate the robustness of findings.’
4. In my opinion, the key question is, what do all those associations deliver, do they bring us any closer to causality compared to 'classical' observational associations, given LD confounding and strong metabolite variants included in PRS driving the associations. For example, how robust are PRS associations to the exclusion of single regions, Ritchie et al., 2021 Nat Metab provide a neat framework to address such questions. The APOE example goes in that direction and other locus-specific effects are likely underlying other disease PRS – NMR measure associations.
We have undertaken extensive sensitivity analyses to compute all PGS association in this atlas after removing established lipid gene loci to evaluate the sensitivity of results driven by variants at these regions of the genome. This has been added to page 19 of the manuscript:
‘The polygenic nature of complex traits means that the inclusion of highly weighted pleiotropic genetic variants in PGS may introduce bias into genetic associations within our atlas. To provide insight into this issue, we constructed PGS excluding variants within the regions of the genome which encode the genes for 14 major regulators of NMR lipoprotein lipids signals which captured 75% of the gene-metabolite associations in the Finnish Metabolic Syndrome In Men (METSIM) cohort (Gallois et al., 2019). For details of these genes (see Supplementary Table 5).
For PGS with these lipid loci excluded, anthropometric traits such as waist-to-hip ratio (N=209), waist circumference (N=206) and body mass index (N=205) still provided strong evidence of association with the majority of metabolic measurements on the NMR panel based on multiple testing corrections. Elsewhere however, the Alzheimer’s disease PGS, which was associated with 60 metabolic traits robust to P<0.05/19 in the initial analysis including these lipid loci (Supplementary Table 17), provided no convincing evidence of association with the 249 circulating metabolites after excluding the lipid loci based on the same multiple testing threshold (Supplementary Table 18). Further inspection suggested that the likely explanation for this attenuation of evidence were due to variants located within the APOE locus which are recognised to exert their influence on phenotypic traits via horizontally pleiotropic pathways (Ferguson et al., 2020).’
5. One might argue that the CKD example is somewhat self-fulling, given that the selection of cases is mainly done based on serum creatinine (or the eGFR derived from creatinine), but is a good example how strong and specific examples might point to disease biomarkers. Are there more examples for a given NMR trait being strongly and possibly specifically associated with a trait or a cluster of highly related traits? I would also tone the creatinine example done, since it is obvious that this marker is used for clinical decision making.
We have now removed the creatinine example and instead included a bi-directional MR analysis of VLDL metabolic traits (page 9).
‘For example, we applied Mendelian randomization (MR) to further evaluate associations highlighted in our atlas with triglyceride-rich very-low-density lipoprotein (VLDL) particles. For instance, both VLDL particle average diameter size and concentration were associated with the PGS for body mass index (BMI) (Β=0.04, 95% CI=0.033 to 0.046, P=3.53x10-35 & Β=0.012, 95% CI=0.006 to 0.019, P=2.7x10-4 respectively) and coronary heart disease (CHD) (Β=0.026, 95% CI=0.019 to 0.032, P< 2.12x10-15 & Β=0.035, 95% CI=0.028 to 0.042, P< 2.73x10-24 respectively). Conducting bi-directional MR suggested that the associations with average diameter of VLDL particles are likely attributed to a consequence of BMI and CHD liability as opposed to the size of VLDL particles having a causal influence on these outcomes (Supplementary Table 6). In contrast, MR analyses suggested that the concentration of VLDL particles increases risk of CHD (Β=1.28 per 1-SD change in VLDL particle concentration, 95% CI=1.25 to 1.65, P=2.8x10-7) which may explain associations between the CHD PGS and this metabolic trait within our atlas. Similar MR analyses to investigate findings from our atlas can be conducted using the full GWAS summary statistics for all 249 circulating metabolic traits available via the GWAS catalog (https://www.ebi.ac.uk/gwas/) under accession IDs GCST90092803 to GCST90093051.’
6. I am not quite sure about the bidirectional MR approach. By only testing diseases for which the PRS showed an association with the metabolite of interest, isn't it quite likely that an effect of the metabolite on the disease would be missed, as one would assume that a true causal association between a metabolite and a disease might well be lost in the large set of SNPs associated with the endpoint of interest.
The value of using PGS with a lenient threshold is that associations may highlight either causes or consequences of disease risk. This has been previously referred to as ‘reverse gear MR’ as proposed in the study by Holmes and Davey Smith (https://academic.oup.com/clinchem/article-lookup/doi/10.1373/clinchem.2018.296806). This was one of the primary motivations for choosing the VLDL example mentioned above as well as providing insight into lipoprotein lipid biology (page 21):
‘We likewise conducted bi-directional MR to demonstrate that associations between the CHD PGS and VLDL particle size likely reflect an effect of CHD liability on this metabolic trait. In contrast, the association between the CHD PGS and VLDL concentrations are likely attributed to the causal influence of this metabolic trait on CHD risk, suggesting that it is the concentration of these triglyceride-rich particles that are important in terms of the aetiology of CHD risk as opposed to their actual size. We envisage that findings from our atlas, as well as other ongoing efforts which leverage the large-scale NMR data within UKB, should facilitate further granular insight into lipoprotein lipid biology.’
7. I disagree with the statement on page 12 that all lipid traits associated with the CHD PRS are causal risk factors, most of them are likely not and the true underlying risk factor or biological mechanism is likely hidden among all those associations. The general inability to distinguish about the causal relevance of all those highly related measures is a massive challenge working with NMR data, in particular given that many are not real measurements but are only derived as proportion from other measures, including apolipoprotein B.
We have now rephrased this sentence (now on page 15) to address this comment:
‘The vast majority of these were lipoprotein lipid traits, which are likely capturing causal risk factors for CHD.’
8. The authors need to do better in distinguishing between all the different lipid measures that are derived from the NMR platform, this is also seen in the discussion of Apo B. Apo B is the main structural protein for many lipoprotein particles of different densities and not just for atherogenic ones.
We agree that it is important to distinguish between the various metabolic traits measured by the NMR platform. Given that this work is a resource paper, we have pointed readers to a review of this platform to help them navigate findings within our atlas (page 4):
‘The 249 metabolite measurements include the particle concentration, size, and composition of 14 lipoprotein subclasses, as well as the levels of phospholipids, fatty acids, amino acids, ketone bodies and others biomarkers as discussed in a recent review (Ala-Korpela et al., 2022).’
9. Instead of stratified analysis, why not correcting for statin intake to estimate lipid levels of participants without the use of statins? What about the effect of many other medications widely prescribed, with strong effects on NMR measures as described in van Duijn et al., Nat Med 2020?
As discussed above, we have added a section to page 14 to clarify our justification of age-stratified analyses to evaluate the robustness of findings to potential colliders such as statin therapy. Conditioning on colliders (including the use of statins) has become a lot more problematic with the advent of biobank scale cohort studies such as UKB. As such, previously modest sources of bias induced by conditioning on colliders, which may have not led to spurious findings in smaller cohort studies such as the one referenced here by the reviewer, are now large enough such that precautions (e.g. those taken in our study by stratifying on age) are necessary to investigate the robustness of findings.
https://doi.org/10.7554/eLife.73951.sa2Article and author information
Author details
Funding
Medical Research Council (MC_UU_00011/1)
- George Davey Smith
Medical Research Council (MC_UU_00011/4)
- Tom R Gaunt
British Heart Foundation (FS/18/23/33512)
- Michael V Holmes
Wellcome Trust (108902/Z/15/Z)
- Si Fang
The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication. For the purpose of Open Access, the authors have applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.
Acknowledgements
We would like to thank all the consortia for making their summary statistics available for the benefit of this work. We also thank the participants of the UK Biobank study. Data were accessed under UKB application #15825 and #81499. All authors work at the MRC Integrative Epidemiology Unit at the University of Bristol (MC_UU_00011/1, MC_UU_00011/4). TRG and GDS conduct research at the NIHR Biomedical Research Centre at the University Hospitals Bristol NHS Foundation Trust and the University of Bristol. The views expressed in this publication are those of the author(s) and not necessarily those of the NHS, the National Institute for Health Research, or the Department of Health. SF is supported by a Wellcome Trust PhD studentship in Molecular, Genetic and Lifecourse Epidemiology [108902/Z/15/Z]. MVH works in a unit that receives funding from the UK Medical Research Council and is supported by a British Heart Foundation Intermediate Clinical Research Fellowship (FS/18/23/33512) and the National Institute for Health Research Oxford Biomedical Research Centre.
Ethics
Our study involves previously collected data (genomic sequencing data and metabolites data) of human participants in the UK Biobank (UKB) cohort study. Ethical approval for the UKB was obtained from the Research Ethics Committee (REC; approval number: 11/NW/0382) and informed consent was collected from all participants enrolled in UKB.
Senior Editor
- Carlos Isales, Medical College of Georgia at Augusta University, United States
Reviewing Editor
- Jonathan K Pritchard, Stanford University, United States
Reviewers
- Scott Ritchie, University of Cambridge, United Kingdom
- Maik Pietzner
Version history
- Received: September 16, 2021
- Preprint posted: October 25, 2021 (view preprint)
- Accepted: September 12, 2022
- Version of Record published: October 11, 2022 (version 1)
Copyright
© 2022, Fang et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 1,746
- Page views
-
- 235
- Downloads
-
- 6
- Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Epidemiology and Global Health
A large observational study has found that irregular sleep-wake patterns are associated with a higher risk of overall mortality, and also mortality from cancers and cardiovascular disease.
-
- Epidemiology and Global Health
Background:
Irregular sleep-wake timing may cause circadian disruption leading to several chronic age-related diseases. We examined the relationship between sleep regularity and risk of all-cause, cardiovascular disease (CVD), and cancer mortality in 88,975 participants from the prospective UK Biobank cohort.
Methods:
The sleep regularity index (SRI) was calculated as the probability of an individual being in the same state (asleep or awake) at any two time points 24 hr apart, averaged over 7 days of accelerometry (range 0–100, with 100 being perfectly regular). The SRI was related to the risk of mortality in time-to-event models.
Results:
The mean sample age was 62 years (standard deviation [SD], 8), 56% were women, and the median SRI was 60 (SD, 10). There were 3010 deaths during a mean follow-up of 7.1 years. Following adjustments for demographic and clinical variables, we identified a non-linear relationship between the SRI and all-cause mortality hazard (p [global test of spline term]<0.001). Hazard ratios, relative to the median SRI, were 1.53 (95% confidence interval [CI]: 1.41, 1.66) for participants with SRI at the 5th percentile (SRI = 41) and 0.90 (95% CI: 0.81, 1.00) for those with SRI at the 95th percentile (SRI = 75), respectively. Findings for CVD mortality and cancer mortality followed a similar pattern.
Conclusions:
Irregular sleep-wake patterns are associated with higher mortality risk.
Funding:
National Health and Medical Research Council of Australia (GTN2009264; GTN1158384), National Institute on Aging (AG062531), Alzheimer’s Association (2018-AARG-591358), and the Banting Fellowship Program (#454104).