Sparse dimensionality reduction approaches in Mendelian randomisation with highly correlated exposures
Abstract
Multivariable Mendelian randomisation (MVMR) is an instrumental variable technique that generalises the MR framework for multiple exposures. Framed as a regression problem, it is subject to the pitfall of multicollinearity. The bias and efficiency of MVMR estimates thus depends heavily on the correlation of exposures. Dimensionality reduction techniques such as principal component analysis (PCA) provide transformations of all the included variables that are effectively uncorrelated. We propose the use of sparse PCA (sPCA) algorithms that create principal components of subsets of the exposures with the aim of providing more interpretable and reliable MR estimates. The approach consists of three steps. We first apply a sparse dimension reduction method and transform the variantexposure summary statistics to principal components. We then choose a subset of the principal components based on datadriven cutoffs, and estimate their strength as instruments with an adjusted Fstatistic. Finally, we perform MR with these transformed exposures. This pipeline is demonstrated in a simulation study of highly correlated exposures and an applied example using summary data from a genomewide association study of 97 highly correlated lipid metabolites. As a positive control, we tested the causal associations of the transformed exposures on coronary heart disease (CHD). Compared to the conventional inversevariance weighted MVMR method and a weak instrument robust MVMR method (MR GRAPPLE), sparse component analysis achieved a superior balance of sparsity and biologically insightful grouping of the lipid traits.
Editor's evaluation
This paper investigated the identification of causal risk factors on health outcomes. It applies sparse dimension reduction methods on highly correlated traits in the Mendelian randomization framework. The implementation of this method helps to identify risk factors when given high dimensional traits data.
https://doi.org/10.7554/eLife.80063.sa0Introduction
Mendelian randomisation (MR) is an epidemiological study design that uses genetic variants as instrumental variables (IVs) to investigate the causal effect of a genetically predicted exposure on an outcome of interest (Smith and Ebrahim, 2003). In a randomised controlled trial (RCT) the act of randomly allocating patients to different treatment groups precludes the existence of systematic confounding between the treatment and outcome and therefore provides a strong basis for causal inference. Likewise, the alleles that determine a small proportion of variation of the exposure in MR are inherited randomly. We can therefore view the various genetically proxied levels of a lifelong modifiable exposure as a ‘natural’ RCT, avoiding the confounding that hinder traditional observational associations. Genetically predicted levels of an exposure are also less likely to be affected by reverse causation, as genetic variants are allocated before the onset of the outcomes of interest.
When evidence suggests that multiple correlated phenotypes may contribute to a health outcome, multivariable MR (MVMR), an extension of the basic univariable approach can disentangle more complex causal mechanisms and shed light on mediating pathways. Following the analogy with RCTs, the MVMR design is equivalent to a factorial trial, in which patients are simultaneously randomised to different combinations of treatments (Burgess and Thompson, 2015). An example of this would be investigation into the effect of various lipid traits on coronary heart disease (CHD) risk (Burgess and Harshfield, 2016). While MVMR can model correlated exposures, it performs suboptimally when there are many highly correlated exposures due to multicollinearity in their genetically proxied values. This can be equivalently understood as a problem of conditionally weak instruments (Sanderson et al., 2019) that is only avoided if the genetic instruments are strongly associated with each exposure conditionally on all the other included exposures. An assessment of the extent to which this assumption is satisfied can be made using the conditional Fstatistic, with a value of 10 for all exposures being considered sufficiently strong (Sanderson et al., 2019). In settings when multiple highly correlated exposures are analysed, a set of genetic instruments are much more likely to be conditionally weak instruments. In this event, causal estimates can be subject to extreme bias and are therefore unreliable. Estimation bias can be addressed to a degree by fitting weak instrument robust MVMR methods (Sanderson et al., 2020; Wang et al., 2021), but at the cost of a further reduction in precision. Furthermore, MVMR models investigate causal effects for each individual exposure, under the assumption that it is possible to intervene and change each one whilst holding the others fixed. In the highdimensional, highly correlated exposure setting, this is potentially an unachievable intervention in practice.
Our aim in this paper is instead to use dimensionality reduction approaches to concisely summarise a set of highly correlated genetically predicted exposures into a smaller set of independent principal components (PCs). We then perform MR directly on the PCs, thereby estimating their effect on health outcomes of interest. We additionally suggest employing sparsity methods to reduce the number of exposures that contribute to each PC, in order to improve their interpretability in the resulting factors.
Using summary genetic data for multiple highly correlated lipid fractions and CHD (Kettunen et al., 2016; Nelson et al., 2017), we first illustrate the pitfalls encountered by the standard MVMR approach. We then apply a range of sparse principal component analysis (sPCA) methods within an MVMR framework to the data. Finally, we examine the comparative performance of the sPCA approaches in a detailed simulation study, in a bid to understand which ones perform best in this setting.
Results
Workflow overview
Our proposed analysis strategy is presented in Figure 1. Using summary statistics for the singlenucleotide polymorphism (SNP)exposure ($\widehat{\gamma}$) and SNPoutcome ($\widehat{\mathrm{\Gamma}}$) association estimates, where $\widehat{\gamma}$ (dimensionality 148 SNPs× 97 exposures) exhibits strong correlation, we initially perform a PCA on $\widehat{\gamma}$. Additionally, we perform multiple sPCA modalities that aim to provide sparse loadings that are more interpretable (block 3, Figure 1). The choice of the number of PCs is guided by permutation testing or an eigenvalue threshold. Finally, the PCs are used in place of $\widehat{\gamma}$ in an IVW MVMR metaanalysis to obtain an estimate of the causal effect of the PC on the outcome. Similar to PC regression and in line with unsupervised methods, the outcome (SNPoutcome associations ($\widehat{\mathrm{\Gamma}}$) and corresponding standard error ($S{E}_{\widehat{\mathrm{\Gamma}}}$)) is not transformed by PCA and is used in the secondstep MVMR in the original scale. In the real data application and in the simulation study, the best balance of sparsity and statistical power was observed for the method of sparse component analysis (SCA) (Chen and Rohe, 2021). This favoured method and the related steps are coded in an R function and are available at GitHub (https://github.com/vaskarageorg/SCA_MR/, copy archived at Karageorgiou, 2023).
UVMR and MVMR
A total of 66 traits were associated with CHD at or below the Bonferronicorrected level ($\mathrm{p}=0.05/97$, Table 1). Two genetically predicted lipid exposures (M.HDL.C, M.HDL.CE) were negatively associated with CHD and 64 were positively associated (Table 3). In an MVMR model including only the 66 Bonferronisignificant traits, fitted with the purpose of illustrating the instability of IVWMVMR in conditions of severe collinearity, conditional Fstatistic (CFS) (Materials and methods) was lower than 2.2 for all exposures (with a mean of 0.81), highlighting the severe weak instrument problem. In Appendix 1—figure 3, the MVMR estimates are plotted against the corresponding univariable MR (UVMR) estimates. We interpret the reduction in identified effects as a result of the drop in precision in the MVMR model (variance inflation). Only the independent causal estimate for ApoB reached our predefined significance threshold and was less precise (OR_{MVMR} (95% CI): $1.031(1.012,1.37)$, $\mathrm{O}\mathrm{R}}_{UVMR$ (95% CI): $1.013(1.01,1.016)$ (Appendix 1—figure 4). We note that, for M.LDL.PL, the UVMR estimate ($1.52(1.35,1.71)$, p < 10^{10})) had an opposite sign to the MVMR estimate (${\mathrm{O}\mathrm{R}}_{MVMR}=0.905(0.818,1.001)$).
To see if the application of a weak instrument robust MVMR method could improve the analysis, we applied MR GRAPPLE (Wang et al., 2021). As the GRAPPLE pipeline suggests, the same threesample MR design described above is employed. In the external selection GWAS study (GLGC), a total of 148 SNPs surpass the genomewide significance level for the 97 exposures and were used as instruments. Although the method did not identify any of the exposures as significant at nominal or Bonferroniadjusted significance level, the strongest association among all exposures is ApoB.
PCA
Standard PCA with no sparsity constraints was used as a benchmark. PCA estimates a square loadings matrix of coefficients with dimension equal to the number of genetically proxied exposures $K$. The coefficients in the first column define the linear combination of exposures with the largest variability (PC1). Column 2 defines PC2, the linear combination of exposures with the largest variability that is also independent of PC1, and so on. This way, the resulting factors seek to reduce redundant information and project highly correlated SNPexposure associations to the same PC. In PC1, very lowdensity lipoprotein (VLDL) and lowdensity lipoprotein (LDL)related traits were the major contributors (Figure 2a). ApoB received the 8th largest loading (0.1371, maximum was 0.1403 for cholesterol content in small VLDL) and LDL.C received the 48th largest (0.1147). In PC2, highdensity lipoprotein (HDL)related traits were predominant. The first 18 largest positive loadings are HDLrelated and 12 describe either large or extralarge HDL traits. PC3 received its scores mainly from VLDL traits. Six components were deemed significant through the permutationbased approach (Figure 1, Materials and methods).
In the secondstep IVW regression (step 4 in Figure 1), MVMR results are presented. A modest yet precise (OR = $1.002(1.0015,1.0024)$, $p<{10}^{10}$) association of PC1 with CHD was observed. Conversely, PC3 was marginally significant for CHD at the 5% level (OR = 0.998 (0.998, 0.999), p=0.049). Since $\widehat{\gamma}$ has been transformed with linear coefficients (visualised in loadings matrix, Figure 2), the underlying causal effects are also transformed and interpreting the magnitude of an effect estimate is not straightforward, since it reflects the effect of changing the PC by one unit on the outcome; however, significance and orientation of effects can be interpreted. When positive loadings are applied to exposures that are positively associated with the outcome, the MR estimate is positive; conversely, if negative loadings are applied, the MR estimate is negative.
sPCA methods
We next employed multiple sPCA methods (Table 2) that each shrink a proportion of loadings to zero. The way this is achieved differs in each method. Their underlying assumptions and details on differences in optimisation are presented in Table 2 and further described in Materials and methods.
RSPCA (Croux et al., 2013)
Optimisation and the KSS criterion pick six PCs to be informative (Karlis et al., 2003). The loadings in Figure 2 show a VLDL, LDLdominant PC1, with some small and medium HDLrelated traits. LDL.C and ApoB received the 5th and 40th largest positive loadings. PCs 1 and 6 are positively associated with CHD and PCs 3 and 5 negatively so (Appendix 1—table 1).
SFPCA (Guo et al., 2010)
The KSS criterion retains six PCs. The loadings matrix (Figure 2) shows the ‘fused’ loadings with the identical colouring. In the two first PCs, all groups are represented. Both ApoB and LDL.C received the seventh and tenth largest loadings, together with other metabolites (Figure 2). PC1 (all groups represented) was positively associated with CHD and PC4 (negative loadings from large HDL traits) negatively so (Appendix 1—table 1).
sPCA (Zou et al., 2006)
The number of nonzero metabolites per PC was set at $\frac{148}{97}\sim 16$ (see Appendix 1—figure 6). Under this level of sparsity, the permutationbased approach suggested that six sPCs should be retained. Seventy exposures received a zero loading across all components. PC1 is constructed predominantly from LDL traits and is positively associated with CHD, but this does not retain statistical significance at the nominal level in MVMR analysis (Figure 3). Only PC4 that is comprised of small and medium HDL traits (Figure 2b) appears to exert a negative causal effect on CHD (OR (95% CI): $0.9975(0.9955,0.9995)$). The other PCs were not associated with CHD (all $p$ values > 0.05, Appendix 1—table 1).
SCA (Chen and Rohe, 2021)
Six components were retained after a permutation test. In the final model, five metabolites were regularised to zero in all PCs (CH2.DB.ratio, CH2.in.FA, FAw6, S.VLDL.C, S.VLDL.FC, Figure 2). Little overlap is noted among the metabolites. PC1 receives loadings from LDL and IDL, and PC2 from VLDL. The contribution of HDL to PCs is split in two, with large and extralarge HDL traits contributing to PC3 and small and medium ones to PC4. PC1 and PC2 were positively associated with CHD (Appendix 1—table 1, Figure 3). PC4 was negatively associated with CHD.
Comparison with UVMR
In principle, all PC methods derive independent components. This is strictly the case in standard PCA, where subsequent PCs are perfectly orthogonal, but is only approximately true in sparse implementations. We hypothesised that UVMR and MVMR could provide similar causal estimates of the associations of metabolite PCs with CHD. The results are presented in Figure 3 and concordance between UVMR and MVMR is quantified with the ${R}^{2}$ from a linear regression. The largest agreement of the causal estimates is observed in PCA. In the sparse methods, SCA (Chen and Rohe, 2021) and sPCA (Zou et al., 2006) provide similarly consistent estimates, whereas some disagreement is observed in the estimate of PC6 for RSPCA (Croux et al., 2013) on CHD.
A previous study implicated LDL.c and ApoB as causal for CHD (Zuber et al., 2020b). In Appendix 1—figure 7, we present the loadings for these two exposures across the PCs for the various methods. Ideally, we would like to see metabolites contributing to a small number of components for the sparse methods. Using a visualisation technique proposed by Kim and Kim, 2012, this is indeed observed (see Appendix 1—figure 7). In PCA, LDL.c and ApoB contribute to multiple PCs, whereas the sPCA methods limit this to one PC. Only in RSPCA do these exposures contribute to two PCs. In the secondstep IVW metaanalysis, it appears that the PCs comprising of predominantly VLDL/LDL and HDL traits robustly associate with CHD, with differences among methods (Table 3).
Instrument strength
Instrument strength for the chosen PCs was assessed via an $F$statistic, calculated using a bespoke formula that accounts for the PC process (see Materials and methods and Appendix). The $F$statistics for all transformed exposures cross the cutoff of 10. There was a trend for the first components being more strongly instrumented in all methods (see Appendix 1—figure 5), which is to be expected. In the MVMR analyses, the CFS for all exposures was less than three. Thus the move to PCbased analysis significantly improved instrument strength and mitigated against weak instrument bias.
Simulation studies
We consider the case of a data generating mechanism that reflects common scenarios found in realworld applications. Specifically, we consider a set of exposures $X$, which can be partitioned into blocks based on shared genetics. Certain groups of variants contribute exclusively to specific blocks of exposures, while having no effect on other blocks. This in turn leads to substantial correlation among the exposure blocks and a much reduced correlation of between exposure blocks, due only to shared confounding. This is visualised in Figure 4a. This data structure acts to reduce the instruments’ strength in jointly predicting all exposures. The dataset consists of $n$ participants, $k$ exposures, $p$ SNPs (with both $k$ and $p$ consisting of $b$ discrete, equally sized blocks) and a continuous outcome, $Y$. We split the simulation results into one illustrative example (for didactic purposes) and one highdimensional example.
Simple illustrative example
We generate data under the mechanism presented in Figure 4a. That is, with six individual exposures ${X}_{1},\mathrm{\dots},{X}_{6}$ split into two distinct blocks (${X}_{1}{X}_{3}$ and ${X}_{4}{X}_{6}$). A continuous outcome $Y$ is generated that is only causally affected by the exposures in block 1 (${X}_{1}{X}_{3}$). A range of sample sizes were used in the simulation in order to give a range of CFS values from approximately 2–80. We apply (a) MVMR with the six individual exposures separately, and (b) PCA and SCA. The aim of approach (b) is to demonstrate the impact of reducing the sixdimensional exposure into two PCs, so that the first PC has high loadings for block 1 (${X}_{1}{X}_{3}$) and the second PC has high loadings for block 2 (${X}_{4}{X}_{6}$). Although two PCs were chosen by both PCA methods using a KSS criterion in a large majority of cases, to simplify the simulation interpretation we fixed a priori the number of PCs at two across all simulations.
Our primary focus was to assess the rejection rates of MVMR versus PCA rather than estimation, as the two approaches are not comparable in this regard. To do this we treat each method as a test, which obtains true positive (TP), true negative (TN), false positive (FP), and false negative (FN) results. In MVMR, a TP is an exposure that is causal in the underlying model and whose causal estimate is deemed statistically significant. In the PCA and sPCA methods, this classification is determined with respect to (a) which exposure(s) determine each PC and (b) if the causal estimate of this PC is statistically significant. Exposures are considered to be major contributors to a PC if (and only if) their individual PC loading is larger than the average loading. If the causal effect estimate of a PC in the analysis deemed statistically significant, major contributors that are causal and noncausal are counted as TPs and FPs, respectively. TNs and FNs are defined similarly. Type I error therefore corresponds to the FP rate and power corresponds to the TP rate. All statistical tests were conducted at the $\alpha /B$ = $\alpha /2$ = 0.025 level.
SCA, PCA, and MVMR type I error and power are shown in the three panels (left to right) in Figure 4b, respectively. These results suggest an improved power in identifying true causal associations both with PCA and SCA compared with MVMR when the CFS is weak, albeit at the cost of an inflated type I error rate. As sample size and CFS increase, MVMR performs better. For the PC of the second block’s null exposures, PCA seems to have a suboptimal type I error control (red in Figure 4b). In this lowdimensional setting, the benefit of PCA therefore appears to be limited.
Complex highdimensional example
The aim of the highdimensional simulation is to estimate the comparative performance of the methods in a wider setting that more closely resembles real data applications. We simulate genetic data and individual level exposure and outcome data for between $K=3060$ exposures, arranged in $B=46$ blocks. The underlying data generating mechanism and the process of evaluating method performance is identical to the illustrative example, but the number of variants, exposures, and the blocks is increased. We amalgamate rejection rate results across all simulations, by calculating sensitivity (SNS) and specificity (SPC) as:
and then compare all methods by their area under the estimated receiveroperating characteristic (ROC) curve (AUC) using the metaanalytical approach of Reitsma et al., 2005. Briefly, the Reitsma method performs a bivariate metaanalysis of multiple studies that report both sensitivity and specificity of a diagnostic test, in order to provide a summary ROC curve. A bivariate model is required because sensitivity and specificity estimates are correlated. In our setting the ‘studies’ represent the results of different simulation settings with distinct numbers of exposures and blocks. Youden’s index $J$ ($J=SNS+SPC1$) was also calculated, with high values being indicative of good performance.
Two sPCA methods (SCA [Chen and Rohe, 2021], sPCA [Zou et al., 2006]) consistently achieve the highest AUC (Figure 5). This advantage is mainly driven by an increase in sensitivity for both these methods compared with MVMR. A closer look at the individual simulation results corroborates the discriminatory ability of these two methods, as they consistently achieve high sensitivities (Appendix 1—figure 10). Both standard and Bonferronicorrected MVMR performed poorly in terms of AUC (AUC 0.712 and 0.660, respectively), due to poor sensitivity. PCA performed poorly, with almost equal TP and FP results (AUC 0.560). PCA and RSPCA did not accurately identify negative results (PCA and RSPCA median specificity 0 and 0.192, respectively). This extreme result can be understood by looking at the individual simulation results in Appendix 1—figure 10; both PCA and RSPCA cluster to the upper right end of the plot, suggesting a consistently low performance in identifying TN exposures. Specifically, the estimates with both these methods were very precise across simulations and this resulted in many FP results and low specificity. We note a differing performance among the top ranking methods (SCA, sPCA); while both methods are on average similar, the results of SCA are more variable in both sensitivity and specificity (Table 4). The Youden’s indexes for these methods are also the highest (Figure 5a). Varying the sample sizes (mean instrument strength in $\widehat{\gamma}$ from $\overline{F}=221$ to 1109 and mean conditional Fstatistic $\overline{CFS}=0.3412.81$) (Appendix 1—figure 9) suggests a similar benefit for sparse methods.
Even with large sample sizes ($\overline{F}=1109.78$, $\overline{CFS}=12.82$), MVMR can still not discriminate between positive and negative exposures as robustly as the sPCA methods. A major determinant of the accuracy of these methods appears to be the number of truly causal exposures, as in a repeat simulation with only four of the exposures being causal, there was a drop in sensitivity and specificity across all methods. sPCA methods still outperformed other methods in this case, however (Appendix 1—table 2).
What determines PCA performance?
In the hypothetical example of Figure 4 and indeed any other example, if two PCs are constructed, PCA cannot differentiate between causal and noncausal exposures. The only information used in this stage of the workflow (Steps 2 and 3 in Figure 1) is the SNP$X$ association matrix. Thus, the determinant of projection to common PCs is genetic correlation and correlation due to confounding, rather than how these blocks affect $Y$. Then, if only a few of the exposures truly influence $Y$, it is likely that, PCA will falsely identify the entire block as truly causal. This means the proportion of noncausal exposures within blocks of exposures that truly influence $Y$ is a key determinant of specificity. To test this, we varied the proportion of noncausal exposures by varying the sparsity of the causal effect vector $\beta $ vector and repeated the simulations, keeping the other simulation parameters fixed. As fewer exposures within blocks are truly causal, the performance in identifying TN results drops for SCA (Appendix 1—figure 12). However, our simulation still provides a means of making comparisons across methods for a given family of simulated data.
Discussion
We propose the use of sPCA methods in MVMR in order to reduce highdimensional exposure data to a lower number of PCs and infer the latter’s causal contribution. As the dimensionality of available datasets for MR investigations increases (e.g. in NMR experiments [Biobank, 2018] and imaging studies), such approaches are becoming ever more useful. Our results support the notion that sPCA methods retain the information of the initial exposures. Although there is no single optimal method that correctly factorises the SNPexposure matrix, the goal is to find some grouping of the multiple, correlated exposures such that it may resemble a latent biological variable that generates the data. The SCA (Chen and Rohe, 2021) and sPCA (Zou et al., 2006) methods performed best in simulation studies and the SCA approach performed best in the positive control example of lipids and CHD. While conventional MR approaches did not identify any protective exposures for CHD, SCA identified a cluster of small and medium HDL exposures that appeared to independently reduce the risk of CHD. This particular subset of HDL particles has previously been implicated in coronary calcification (Ditah et al., 2016) and shown to be associated with coronary plaque stability (Wang et al., 2019).
By employing sPCA methods in a real dataset (Kettunen et al., 2016), we show that the resulting PCs group VLDL, LDL, and HDL traits together, whilst metabolites acting via alternative pathways receive zero loadings. This is a desirable property and indicates that the secondstep MR enacted on the PCs obtains causal estimates for intervening on biologically meaningful pathways (Chipman and Gu, 2005). This is in contrast with unconstrained PCA, in which all metabolites contribute to all PCs. Previously, Sulc et al., 2020 used PCA in MR to summarise highly correlated anthropometric variables. To our knowledge, this is the first investigation of different sPCA modalities in the context of MR. Our simulation studies revealed that sPCA methods exhibited superior performance compared to standard PCA, which had high FP rates, and MVMR, which had high FN rates. We additionally provide a number of ways to choose the number of components in a datadriven manner. Our proposed approach of an sPCA method naturally reduces overlap across components; for instance, in a paper by Sulc et al., 2020, the authors use PCA and identify four independent axes of variation of body morphology. There are PCs that are driven in common by trunk, arm, and leg lean mass, basal metabolic rate, and BMI; a hypothetical benefit with sparse methods would be reduction of this overlap. This is an important topic for further research. When using PCA without any sparsity constraints, our simulation studies revealed numerous FP results, at the opposite end of the nature of poor performance seen in MVMR; estimates were often misleadingly precise (FN). Although appropriate transformations of the exposures were achieved, we highly recommend exploring additional forms of T1E control to improve the performance of PCA. Nonetheless, sparse methods exhibited superior performance compared to both PCA and MVMR.
A previous work on sparse methods in genetics proposed their usefulness in multitissue transcriptomewide association studies (Feng et al., 2021). A finding of the study is that leveraging correlated gene expressions across tissues with sparse canonical correlation analysis improves power to detect genetrait pairs. Our approach that combines MR with sPCA also showed an improvement in power to detect causal effects of exposures on outcomes.
Our approach is conceptually different from the robust methods that have been developed for standard MVMR in the presence of weak instruments, such as MR GRAPPLE, which attempts to directly adjust point estimates for weak instrument bias, but are not a panacea, especially in the highdimensional setting discussed here (Wang et al., 2021). Furthermore, it reduces the need for a preselection of which exposures to include in an MVMR model. We present a complementary workflow through which we can include all available exposures with no prior selection, collate them in uncorrelated and interpretable components, and then investigate the causal contribution of these groups of exposures. It avoids the risk of generating spurious results in such an extreme setting of high collinearity compared with MVMR IVW and MR GRAPPLE formulations. For example, a 2019 threesample MR study that assessed 82 lipoprotein subfraction risk factors’ effects on CHD used an UVMR and a robust extension of MVMR. A positive effect of VLDL and LDLrelated subfractions on CHD was reported, consistent in magnitude across the sizes of the subfractions (Zhao et al., 2021). Results were less definitive on the effect of HDL subfractions of varying size on CHD, with both positive and negative effect estimates observed. In our study, the HDL subfractions were uniformly projected to similar subspaces, yielding a single component that was mainly HDL populated in all models, except for the SCA model 15 which projected the small/ medium and large/extralarge HDL traits in two different components. In all cases, the association of the sPCs with CHD was very low in magnitude. Nevertheless, the direction of effects was in line with the established knowledge on the relationship between lipid classes and CHD.
Within the sPCA methods, there were differences in the results. The sPCA method (Zou et al., 2006) favoured a sparser model in which less than 10 metabolites per PC were used. This observation is also made by Guo et al., 2010. The SCA method (Chen and Rohe, 2021) achieved good separation of the traits and very little overlap was observed. A separation of HDLrelated traits according to size, not captured by the other methods, was noted. Clinical relevance of a more highresolution HDL profiling, with larger HDL molecules mainly associated with worse outcomes, has been previously reported (Kontush, 2015).
Limitations
In the present study, many tuning parameters needed to be set in order to calibrate the PCA methods. We therefore caution against extending our conclusions on the best method outside the confines of our simulation and our specific real data example. Not all available sparse dimensionality reduction approaches were assessed in our investigation and other techniques could have provided better results.
The use of sparsity may raise the concern of neglecting horizontal pleiotropy if a variant influences multiple components, but its weight in a given component is shrunk to zero. This would not occur for standard PCA where no such shrinkage occurs. Currently, our approach is not robust to pleiotropy operating via exposures not included in the model. Our plan is to address this as future work by incorporating medianbased MVMR models into the second stage, as done by Grant and Burgess, 2021.
Interpretability
The sPCA approach outlined in this paper enables the user to perform an MVMR analysis with a large number of highly correlated exposures, but one potential downside is that the effect sizes are not as interpretable. Interpreting the causal effects of PCA components (sparse or otherwise) poses a significant challenge. This is because intervening and lowering a specific PC could be actioned by targeting any of the exposures that have a nonzero loading within it. This is in contrast to the causal effect of a single exposure, where the mode of intervention is clear. However, the same ambiguity is often a feature of a realworld intervention, such as a pharmaceutical drug. That is, even if a drug targets a specific lipid fraction, it may exert multiple effects on other lipid fractions that are not preplanned and are a result of interlinked physiological cascades, overlapping genetics, and interdependent relationships. Identifying such underlying biological mechanisms and pathways is a key step in deciding on the relevance of a PCA derived effect estimate compared to a standard MVMR estimate. We therefore suggest that this approach is best suited for initially testing how large groups of risk factors independently affect health outcomes, before a more focused MVMR within the PCAidentified exposures.
Another limitation of our work is that the instrument selection could have been more precise since we used an external study that investigated total lipid fractions rather than specific size and composition profile characteristics. Future more specific GWAS could improve this, leading to better separation in the genetic predictions of all lipid fractions.
Conclusion
In the present study, we underline the utility of sparse dimensionality reduction approaches for highly correlated exposures in MR. We present a comprehensive review of methods available to perform dimensionality reduction and describe their performance in a real data application and a simulation.
Materials and methods
Approximate relationship to a onesample analysis
Request a detailed protocolOur approach works directly with summary statistics gleaned from genomewide association studies that are used to furnish a twosample analysis, but it can also be motivated from the starting point of a onesample individual level data. Specifically, assume the data generating model is $\mathbf{y}=\mathbf{X}\mathit{\beta}+\mathbf{u}\mathbf{X}=\mathbf{G}\mathit{\gamma}+\mathbf{V}$ and so that the secondstage model of a twostage least squares procedure is $\mathbf{y}=\hat{\mathbf{X}}\mathit{\beta}+\stackrel{~}{\mathbf{u}}$, where $\hat{\mathbf{X}}=\mathbf{G}\hat{\mathit{\gamma}}$. PCA on $\hat{\mathbf{X}}$ is approximately equivalent to PCA on $\hat{\mathit{\gamma}}$ since $\hat{\mathbf{X}}}^{T}\hat{\mathbf{X}}={\hat{\mathit{\gamma}}}^{T}\hat{\mathit{\gamma}$ if $\mathbf{G}$ is normalised so that $\hat{\mathit{\gamma}}$ represents standardised effects. In the appendix we provide further simulation results that show that the loadings matrix derived from a PCA on $\hat{\mathbf{X}}$ and $\hat{\mathit{\gamma}}$ are asymptotically equivalent.
Data
The risk factor dataset reports the associations of 148 genetic variants (SNPs) with 118 NMRmeasured metabolites ((Table 5) , Kettunen et al., 2016). This study reports a highresolution profiling of mainly lipid subfractions. To focus on lipidrelated traits, we exclude amino acids and retain 97 exposures for further analyses. Fourteen size/density classes of lipoprotein particles (ranging from extra small [XS] HDL to extraextralarge [XXL] VLDL) were available and, for each class, the traits of total cholesterol, triglycerides, phospholipids, and cholesterol esters content, and the average diameter of the particles were additionally provided. Through the same procedure, estimates from NMR on genetic associations of amino acids, apolipoproteins, fatty and fluid acids, ketone bodies, and glycerides were also included. Instrument selection for this dataset has been previously described (Zuber et al., 2020a). Namely, 185 variants were selected, based on association with either one of: LDLcholesterol, HDLcholesterol, or triglycerides in the external sample of the Global Lipid Genetics Consortium at the genomewide level ($p<5\times {10}^{8}$) (Do et al., 2013). Then, this set was pruned to avoid inclusion of SNPs in linkage disequilibrium (LD) (threshold: ${r}^{2}<0.05$) and filtered (variants in distance less than 1 megabase pairs were excluded) resulting in the final set. This preprocessing strategy was performed with a view to study CHD risk.
Positive control outcome assessment is recommended in MR as an approach of investigating a risk factor that has an established causal effect on the outcome of interest (Burgess et al., 2020). We used CHD as a positive control outcome, given that lipid fractions are known to modulate its risk, with VLDL and LDLrelated traits being positively associated with CHD and HDLrelated traits negatively (Burgess et al., 2020). Summary estimates from the CARDIoGRAMplusC4D Consortium and UK Biobank metaanalysis (Nelson et al., 2017; Deloukas et al., 2013) were used. For both datasets, a predominantly European population was included, as otherwise spurious results could have arisen due to populationspecific, as opposed to CHDspecific, differences (Vilhjálmsson and Nordborg, 2013).
MR assumptions
Request a detailed protocolThe first assumption (IV1) states that IVs should be associated with the exposure. The second assumption (IV2) states that the IVs should be independent of all confounders of the risk factoroutcome association (IV2) and, finally, independent of the outcome conditional on the exposure and the confounders (IV3). The validity of the final two assumptions cannot be directly verified with the available data. For the inferences to be valid, it is necessary that the three IV assumptions apply (Davies et al., 2018). These assumptions are illustrated for the case of two exposures in Figure 6a. In situations when IV3 is not deemed likely, additional risk factors that are possible mediators can be introduced in an MVMR model (Burgess and Harshfield, 2016). Additional assumptions should hold for MVMR results to be valid. Particularly, (a) all included exposures have to be associated with at least one of the IVs, and (b) there should be no evidence of multicollinearity. If there is significant overlap, for example if G_{1} and G_{2} are associated with both exposures X_{1} and X_{2}, but only with X_{2} through X_{1} as in Figure 6b, this may result in conditionally weak instruments (Sanderson et al., 2020). The latter assumption and the way it limits eligibility of highdimensional, correlated exposures is a key motivation for the present work.
UVMR and MVMR
Request a detailed protocolTo examine how each of the metabolites is causally associated with CHD, a UVMR approach was first undertaken under the twosample summary data framework. This uses SNPexposure and SNPoutcome GWAS summary statistics ($\widehat{\gamma}$ and $\widehat{\mathrm{\Gamma}}$, respectively), obtained by regressing the exposure or outcome trait on each SNP individually. Usually, an additional adjustment has been made for variables such as age, gender, and genetic PCs. An UVMR analysis is the most straightforward way to obtain an estimate for the causal effect of the exposure, but is only reliable if all SNPs are valid instruments. However, in the Kettunen dataset, where the exposure set comprises 97 highly correlated lipid fractions, one variant may influence multiple exposures. This will generally invalidate an UVMR analysis and an MVMR approach is more appropriate.
Estimating Equation 2 provides a general means for representing the mechanics of an UVMR or MVMR analysis:
where
${\widehat{\gamma}}_{kj}$ represents the association of SNP $j$ with exposure $k$, with variance ${\sigma}_{kj}^{2}$;
${\widehat{\mathrm{\Gamma}}}_{j}$ represents the association of SNP $j$ with the outcome, with variance ${\sigma}_{Yj}^{2}$;
${\beta}_{k}$ represents the causal effect of exposure $k$ on the outcome to be estimated.
${\sigma}_{lkj}$ represents Cov(${\widehat{\gamma}}_{lj}$,${\widehat{\gamma}}_{kj}$). If individuallevel data is not available to obtain this quantity, it can be estimated using the intercept from LD score regression (BulikSullivan et al., 2015).
In an UVMR analysis there is only one exposure, so that $K$=1, whereas in an MVMR analysis $K\ge 2$ (or in the case of the Kettunen data, $K$=97). In an IVW UVMR or MVMR analysis, the causal effect parameters estimated are obtained by finding the values ${\widehat{\beta}}_{1},\mathrm{\dots},{\widehat{\beta}}_{K}$ that minimise ${Q}_{A}$, under the simplifying assumption that ${\sigma}_{A}^{2}$ = ${\sigma}_{Yj}^{2}$. This is justified if all causal effects are zero, or if the uncertainty in the SNPexposure associations is negligible (the no measurement error [NOME] assumption). In the general MVMR context, the NOME assumption is approximately satisfied if the CFS for each exposure are large, but if it is not, then IVW MVMR estimates will suffer from weak instrument bias. For exposure $k$, $CF{S}_{k}$ takes the form
where:
where $\widehat{\delta}$ is estimated by regressing an exposure ${X}_{k}$ on all other exposures ${X}_{k}$ by OLS. If an exposure $k$ can be accurately predicted conditionally on other exposures, then there won’t be sufficient variation or ‘good heterogeneity’ (Sanderson et al., 2019) in ${Q}_{Xk}$ and CFS will generally be small. This will be the case whenever there is a high degree of correlation between the SNPexposure association estimates, for example as in an MVMR analysis of all 118 lipid fractions in the Kettunen data. One option to address this would be to use weak instrument robust MVMR, such as MRGRAPPLE (Wang et al., 2021). This method performs estimation using the full definition of ${\sigma}_{A}^{2}$ in Equation 2 and using a robust loss function that additionally penalises larger outliers in the data. It can work well for MVMR analyses of relatively small numbers of exposures (e.g. up to 10) and relatively weak instruments (CFS as low as 3), but the dimension and high correlation of the Kettunen data is arguably too challenging. This motivates the use of dimension reduction techniques, such as PCA.
Dimension reduction via PCA
PCA is a singular value decomposition a given matrix of the $p\times K$ matrix of SNPexposure associations $\widehat{\mathit{\gamma}}$ as:
where $U$ and $V$ are orthogonal matrices and $D$ is a square matrix whose diagonal values are the variances explained by each component and all offdiagonal values are 0. V is the loadings matrix and serves as an indicator of the contribution of each metabolite to the transformed space of the PCs. The matrix $UD$ (PCs/scores matrix) is used in the secondstep IVW regression in place of $\widehat{\gamma}$. As $V$ estimation does not aim for sparsity, all exposures will contribute to some degree to all components, making the interpretation more complicated. Therefore, we assessed multiple sPCA methods that intentionally limit this.
sPCA (Zou et al.)
Request a detailed protocolsPCA by Zou et al., 2006, estimates the loadings matrix through an iterative procedure that progressively penalises exposures so that they do not contribute to certain PCs. In principle, this leads to a more clear picture for the consistency of each PC. This is performed as follows:
Setting a fixed matrix, the following elastic net problem is solved ${\xi}_{j}=argmi{n}_{\xi}{({\alpha}_{j}\xi )}^{T}{\widehat{\gamma}}^{T}\widehat{\gamma}({\alpha}_{j}\xi )+{\lambda}_{1j}\parallel \xi \parallel +\lambda {\parallel \xi \parallel}^{2}$, where $j$ is the PC.
For a fixed $\mathrm{\Xi}$, $\widehat{{\gamma}^{T}}\widehat{\gamma}\mathrm{\Xi}=UD{V}^{T}$ is estimated and update $A=U{V}^{T}$.
Repeat steps 1 and 2 until convergence.
Here, ${\lambda}_{1}$ is an $L1$ sparsity parameter that induces sparsity, ${\lambda}_{2}$ is an $L2$ parameter that offers numerical stability, and $\mathrm{\Xi}$ is a matrix with sparsity constraints for each exposure (Zou and Xue, 2018). As a result of the additional ${\lambda}_{1}\parallel \xi \parallel $ norm, there is sparsity in the loadings matrix and only some of the SNPexposure associations $\widehat{\gamma}$ contribute to each PC, specifically a particular subset of highly correlated exposures in $\widehat{\gamma}$.
RSPCA
Request a detailed protocolThis approach differs in that it employs a robust measure of dispersion that is not unduly influenced by large single values of $\widehat{\gamma}$ that contribute a large amount to the total variance (Rousseeuw and Croux, 1993; Heckert, 2003). As above, an L_{1} norm is used to induce sparsity. For optimisation, the tradeoff product optimisation is maximised. It does not impose a single $\lambda $ value on all PCs, thus allowing different degrees of sparsity.
SFPCA (Guo et al., 2010)
Request a detailed protocolA method that can in theory exploit distinct correlation structures. Its goal is to derive a loadings matrix in which highly positively correlated variables are similar in sign and highly negative ones are opposite. Similar magnitudes also tend to be obtained for those variables that are in the same blocks in the correlation matrix. Like the sPCA optimisation in Zou et al., 2006, sparse fused PCA (SFPCA) works by assigning highly correlated variables the exact same loadings as opposed to numerically similar ones (Figure 2d). This is achieved with two norms in the objective function: ${\lambda}_{1}$ which regulates the L_{1} norm that induces sparsity and ${\lambda}_{2}$ for the $L2$ regularisation (squared magnitude of $\widehat{\gamma}$) to guard against singular solutions. A grid search is used to identify appropriate parameters for ${\lambda}_{1}$ and ${\lambda}_{2}$. The following criterion is used:
such that ${A}^{T}A={I}_{K}$. The ‘fused’ penalty (last term) purposely penalises discordant loadings for variables that are highly correlated. The choice of the sparsity parameters is based on a Bayesian information criterion (BIC).
SCA
Request a detailed protocolSCA (Chen and Rohe, 2021) is motivated by the relative inadequacy of the classic approaches in promoting significant sparsity. It addresses this by rotating the eigenvectors to achieve approximate sparsity whilst keeping the proportion of variance explained the same. Simulation studies show the technique works especially well in highdimensional settings such as gene expression data, among other examples (Chen and Rohe, 2021).
Choice of components
Request a detailed protocolIn all dimensionality reduction approaches applied to correlated variables, there is no upper limit to how many transformed factors can be estimated. However, only a proportion of them are likely to be informative in the sense of collectively explaining a meaningful amount of total variance in the original dataset. To guide this choice, a permutationbased approach was implemented (Coombes and Wang, 2019) as follows: Firstly, the $\widehat{\gamma}$ matrix was randomly permuted and the sPCA method of interest was applied on the permuted set. The computed eigenvalues are assumed to come from a null distribution consistent with a noninformative component. This process is repeated multiple times (e.g. $perm=1000$) and the mean eigenvalues for all components stored. Finally, the sPCA method is performed in the original $\widehat{\gamma}$ matrix and whichever component has an eigenvalue larger than the mean of the permuted sets is considered informative and kept. Due to the computational complexity of the permutation method, particularly for SFPCA, an alternate method  the KSS criterion (Karlis et al., 2003)  was also used. This is based on a simple correction on the minimum nontrivial eigenvalue ($\mathrm{C}\mathrm{u}\mathrm{t}\mathrm{o}\mathrm{f}\mathrm{f}}_{KSS}=1+2\sqrt{\frac{K1}{p1}$). The authors show that the method is robust to nonnormal distributions (Karlis et al., 2003). Although KSS was not compared with the abovedescribed permutation approach, it performed better than simpler approaches, such as choosing those PCs whose eigenvalue is larger than 1 (Kaiser criterion), the broken stick method (Jolliffe, 2002) and the Velicer method (Velicer, 1976).
Instrument strength of PCs
Request a detailed protocolIn MVMR, the $IV1$ assumption requires a set of genetic variants that robustly associate with at least one of the exposures $X$ (MR assumptions). This is quantified by CFS in Equation 2 (Sanderson et al., 2020). With summary statistics of the SNP$X$ associations ${\widehat{\gamma}}_{p,k}$ ($p$: SNP, $k$: exposure), the mean F statistic for exposure $k$ used in a standard UVMR analysis is the far simpler expression
We provide a dedicated formula for estimating instrument strength measures for the $F$statistic for the PCs that is closely related to Equation 3 rather than Equation 2. This simplification is due to the fact that an MVMR analysis of a set of PCs is essentially equivalent to an UMVR analysis of each exposure separately. The full derivation is reported in the section ‘Instrument strength’ of the Appendix.
Appendix 1
Instrument strength
Since we transform $\widehat{\gamma}$ and obtain a matrix of lower dimensionality, formula (Equation 4) can’t be used as there is no longer a onetoone correspondence of the $S{E}_{s}$ with the PCs. Likewise, a conditional Fstatistic for the PCs also cannot be computed for this reason. We aim to arrive at a modified formula that bypasses this issue. For this purpose, we take advantage of two concepts, first an expression of the Fstatistic for an exposure $k$ (${F}_{k}$) in matrix notation and, second, the use of this expression to estimate Fstatistics for the PCs (${F}_{PC}$) from $\widehat{\gamma}$ decomposition.
We make the assumption that the uncertainty in the ${\widehat{\gamma}}_{G,{X}_{K}}$ estimates is similar in all $K$ exposures, that is ${\widehat{\gamma}}_{G,X}$ uncertainty estimates do not substantially differ among exposures. This is not implausible as the uncertainty is predominantly driven by sample size and minor allele frequency (Giambartolomei et al., 2014). Specifically, the authors of Giambartolomei et al., 2014, show that
where MAF is the minor allele frequency, n_{k} is the sample size in exposure $k$, and $Var({X}_{k})$ is the phenotypic variance. What this means is that, in experiments such as Kettunen et al., 2016, where n_{k} is the same across all exposures and $Var({X}_{k})$ can be standardised to 1, the main driver of differences in $Var({\widehat{\gamma}}_{{X}_{K}})$ is differences in MAF. As MAF is the same for each SNP across all exposures, the collation of SEs across exposures per SNP is well motivated.
We can then define a matrix $\mathrm{\Sigma}$ as follows.
The elements in the diagonal represent the mean variance of $\widehat{\gamma}$ for each SNP and all offdiagonal elements are zero. What is achieved through this is a summary of the uncertainty in the SNP$X$ associations that is not sensitive on the dimensions of the exposures. Instead of Equation 4, we can then express the vector of the mean $F$statistics for each exposure ${F}_{1K}=[{F}_{1},{F}_{2},\mathrm{\dots},{F}_{K}]$ as
where $\widehat{\gamma}$ is the matrix of the SNPexposure associations. In a simulation study, we generate data under the mechanism in Appendix 1—figure 1a. The strength of association is different in the three exposures. It is observed that the estimates with both methods (Equation 5 and Equation 4) align well (Appendix 1—figure 1b), supporting the equivalence of the two formulae.
Our second aim is to use this matrix notation formula for the Fstatistic to quantify the instrument strength of each PC with the respective Fstatistic (${F}_{PC}$). As presented above, we are not limited by the dimensions of point estimates and uncertainty matching exactly and we can use the formula in Equation 5 and substitute $\widehat{\gamma}$ with the PCs. For the PCA approach, where we decompose $\widehat{\gamma}$ as $\widehat{\gamma}=UD{V}^{T}$ and carry forward $M<<K$ nontrivial PCs, we use the matrix $UD$ in place of $\widehat{\gamma}$. Then, the mean ${F}_{PC}$ can be estimated as follows.
The vector ${F}_{P{C}_{1M}}=[{F}_{P{C}_{1}},{F}_{P{C}_{2}},\mathrm{\dots},{F}_{P{C}_{M}}]$ contains the ${F}_{PC}$ statistics for the $M$ PCs. In a similar manner, we estimate ${F}_{PC}$ for the sPCA methods but, instead of the scores matrix $UD$, we use the scores of the sparse methods. We illustrate the performance of this approach in a simulation study with an identical configuration for exposure generation as the one presented in Figure 5. In a configuration with $b=6$ blocks of $p=30$ genetically correlated exposures (Figure 4), we vary the strength of association $\gamma $ per block. This way, the first block has the highest strength of association and the last block the lowest, quantified by a lower mean Fstatistic in the exposures of this block (red, Appendix 1—figure 2). The instrument strength of the PCs and the sPCs follow closely the corresponding Fstatistics of the individual exposures; in other words, in a PC of five exposures, ${F}_{P{C}_{1}}$, ${F}_{SC{A}_{1}}$, and ${F}_{15}$ align well (Appendix 1—figure 2).
MVMR and UVMR
MVMR with PC scores
MVMR with IVW and GRAPPLE
A small negative effect for M.LDL.PL is noted as nominally significant in Appendix 1—figure 4. This is not concordant with the UVMR direction of effect. In GRAPPLE, no traits surpass the nominal significance threshold.
Simulation study design
We generate a minor allele frequency $MAF\sim U(0.01,0.5)$ and an unmeasured confounder $U\sim N(0,5)$. Conditionally on the MAF, we generate the genes $G\sim Bin(2,MAF)$. We define $Q$ blocks and divide the $P$ genes of $G$ to $q$ blocks, each containing $P/Q$ genes. This approach aims to model the biological phenomenon of multiple traits being influenced by a common set of variants. For instance, the traits of LDL cholesterol content and LDL phospholipid content appear to be associated with the variants in a largely similar manner (Kettunen et al., 2016).
For each block $q=1,2,\mathrm{\dots},Q$, we define a matrix ${\gamma}_{q}$ whose elements ${\gamma}_{1}{\gamma}_{P/Q}$ are nonzero and $\sim N(4,1)$. This matrix is what will direct only certain variants to influence certain exposures, leading to multicollinearity. We then generate the $K$ exposures sequentially per block, following the parametric equation ${X}_{q}={\gamma}_{q}{G}_{q}+U+{\u03f5}_{X}$. This way, specific genetic variants generate blocks of exposures as shown in Figure 4 and the exposures ${X}_{1}{X}_{K/Q}$ are highly correlated. We derive the SNPexposure associations from this dataset as $X\sim \widehat{\gamma}G$. We set $K$=50–100 and $P$=100–150 and the blocks 5–8. We let these values vary across simulations in order to generate more varying values in diagnostic accuracy. For a given sample size, $s=1000$ simulation studies were performed.
To retain the workflow of a twosample MR, we generate a second exposure set identically as above but on an independent sample. This step is important as it guarantees the no sample overlap assumption of twosample MR (Burgess et al., 2016). Based on this second ${X}^{\prime}$ matrix, we generate the outcome $Y=\beta {X}^{\prime}+U+{\u03f5}_{Y}$. The vector of causal estimates $\beta $ is generated based on any number of exposures being causal in the two blocks. This includes the null. We obtain the SNPoutcome associations as $Y\sim \mathrm{\Gamma}G$. The effect of the data generating mechanism in a single dataset is visualised in Appendix 1—figure 8.
The methods employed are:
MVMR IVW with $\widehat{\gamma}$ and $\widehat{\mathrm{\Gamma}}$(Yavorska and Staley, 2020).
MVMR IVW with Bonferroni correction.
PCA with the scores from $\widehat{\gamma}=UD{V}^{T}$.
sPCA as implemented in the elasticnet package (Zou et al., 2006).
RSPCA (Croux et al., 2013).
SCA (Chen and Rohe, 2021).
Due to computational complexity, sPCA in the elasticnet package (spca_en in Figure 5) was implemented with a simplification regarding the sparsity parameter. Specifically, it was assumed that the number of nonzero exposures per component was $P/Q$. For SCA, we use the crossvalidation method in PMA R package (Witten and Tibshirani, 2020).
To generate the summary ROC curves presented in Figure 5, we treated simulation results as individual studies and metaanalysed them with the bivariate method of Reitsma et al., 2005, in the R package. The logit sensitivity and specificity (which are correlated) are jointly metaanalysed by modelling them as a bivariate normal distribution and employing a randomeffects model for accomodating this correlation and framing it as heterogeneity.
The results for a set of simulations in $N=3000$ individuals are presented in Appendix 1—figure 10. We observe that MVMR and MVMR with Bonferroni correction estimates cluster to the bottom left of the plot, suggesting a low sensitivity. The estimates from SCA and sPCA_EN are spread in the upper left half of the plot and often achieve a sensitivity of 1.00. The PCA and RSPCA mainly provide highly sensitive estimates but perform relatively worse in specificity. The performance in increasing sample sizes is consistent with a benefit for sparse methods (Appendix 1—figure 9).
Comparison of PCA in 1SMR and 2SMR
We fix the number of blocks of exposures to $b=2$ and the number of exposures to $K=6$. In both 1SMR and 2SMR, the loadings matrices of $\widehat{X}$ and ${\widehat{\gamma}}_{GX}$ are of dimensionality $p\times p$ (p being the number of SNPs) and can be therefore compared for similarity. We use two similarity measures: a the sum of squared differences between the onesample and twosample loadings matrices $S={\sum}_{pj=1}^{P}{\sum}_{pi=1}^{P}{({V}_{ij,1SMR}{V}_{ij,2SMR})}^{2}$, and the ${R}^{2}$ from a linear regression of the onesample PC loadings on the twosample PC loadings. ${R}^{2}$ increases and $S$ decreases with increasing sample size, as expected.
Proportion of noncausal exposures as a predictor of performance
For a given sample size $N=1000$, we vary the proportion of noncausal SNPs (${P}_{NC}$) to examine the performance. The total number of total exposures is picked from the $K\in (100,160)$ range, the number of SNPs from the $p\in (100,200)$, and the number of blocks of exposures from $B\in (5,20)$. Since the applied pipeline does not discriminate between strength of association of $X$ with $Y$ in the dimensionality reduction stage, an increase in ${P}_{NC}$ was expected to induce FP. Then, in the increasingly sparser subvector ${\beta}_{b}$ of the causal effects of ${X}_{b}$ on $Y$, all ${X}_{b}$ exposures would be projected to the same PC. This PC is likely to still be associated with $Y$; however, specificity may drop. This is what was observed in the simulation study, where specificity gradually drops as ${P}_{NC}$ increases from 20% to 80%.
Data availability
The GWAS summary statistics for the metabolites (http://www.computationalmedicine.fi/data/NMR_GWAS/) and CHD (http://www.cardiogramplusc4d.org/) are publicly available. We provide code for the SCA function, the simulation study and related documentation on GitHub (https://github.com/vaskarageorg/SCA_MR/, copy archived at Karageorgiou, 2023).
References

Multivariable Mendelian randomization: The use of pleiotropic genetic variants to estimate causal effectsAmerican Journal of Epidemiology 181:251–260.https://doi.org/10.1093/aje/kwu283

Bias due to participant overlap in twosample Mendelian randomizationGenetic Epidemiology 40:597–608.https://doi.org/10.1002/gepi.21998

Mendelian randomization to assess causal effects of blood lipids on coronary heart diseaseCurrent Opinion in Endocrinology, Diabetes & Obesity 23:124–130.https://doi.org/10.1097/MED.0000000000000230

Guidelines for performing Mendelian randomization investigationsWellcome Open Research 4:186.https://doi.org/10.12688/wellcomeopenres.15555.2

Interpretable dimension reductionJournal of Applied Statistics 32:969–987.https://doi.org/10.1080/02664760500168648

Robust sparse principal component analysisTechnometrics 55:202–214.https://doi.org/10.1080/00401706.2012.727746

Pleiotropy robust methods for multivariable Mendelian randomizationStatistics in Medicine 40:5813–5830.https://doi.org/10.1002/sim.9156

Principal component analysis with sparse fused loadingsJournal of Computational and Graphical Statistics 19:930–946.https://doi.org/10.1198/jcgs.2010.08127

A simple rule for the selection of principal componentsCommunications in Statistics  Theory and Methods 32:643–666.https://doi.org/10.1081/STA120018556

HDL particle number and size as predictors of cardiovascular diseaseFrontiers in Pharmacology 6:218.https://doi.org/10.3389/fphar.2015.00218

Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviewsJournal of Clinical Epidemiology 58:982–990.https://doi.org/10.1016/j.jclinepi.2005.02.022

Alternatives to the median absolute deviationJournal of the American Statistical Association 88:1273–1283.https://doi.org/10.1080/01621459.1993.10476408

An examination of multivariable Mendelian randomization in the singlesample and twosample summary data settingsInternational Journal of Epidemiology 48:713–727.https://doi.org/10.1093/ije/dyy262

Mendelian randomization: Can genetic epidemiology contribute to understanding environmental determinants of diseaseInternational Journal of Epidemiology 32:1–22.https://doi.org/10.1093/ije/dyg070

The nature of confounding in genomewide Association studiesNature Reviews. Genetics 14:1–2.https://doi.org/10.1038/nrg3382

SoftwarePMA: Penalized multivariate analysis, version 1.2.1R Package.

SoftwareMendelianrandomization: Mendelian randomization package, version 0.5.0R Package.

Sparse principal component analysisJournal of Computational and Graphical Statistics 15:265–286.https://doi.org/10.1198/106186006X113430

A selective overview of sparse principal component analysisProceedings of the IEEE 106:1311–1320.https://doi.org/10.1109/JPROC.2018.2846588
Decision letter

Siming ZhaoReviewing Editor; Dartmouth College, United States

Martin R PollakSenior Editor; Harvard Medical School, United States

Stephen BurgessReviewer; University of Cambridge, United Kingdom
Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "Sparse Dimensionality Reduction Approaches in Mendelian Randomization with highly correlated exposures" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Martin Pollak as the Senior Editor. The following individual involved in the review of your submission has agreed to reveal their identity: Stephen Burgess (Reviewer #1).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
1) Clarification of the method details. All three reviewers have clarification comments.
2) Interpretation of results. See comments from reviewer 3.
3) Demonstration of benefits in real data analysis. See comments from reviewers 1 and 3.
Reviewer #1 (Recommendations for the authors):
I'm not the best person to judge, but I'm not sure how accessible this work would be to someone who doesn't understand multivariable Mendelian randomization. Figure 1 is a great overview, but I'd appreciate the viewpoint of someone who is less familiar with these methods.
The description of the different sparse PCA methods was a helpful summary.
I'm not sure that the foray into instrument strength is valuable – it is over 2 pages and 2 figures, and I'm not sure how much this adds – maintaining high conditional F statistics is important, but their calculation is not a major part of the analysis report. The whole section is a bit distracting and disproportionate.
All MR estimates seem to be on quite odd scales – the estimates are all close to 1 – 1.001 or 0.999. Could the PCs be standardized, so that the estimates are per 1 SD increase in the PC? This would improve communication of the results. Currently, there is no sense of scale – what is the corresponding change in lipids on a meaningful scale?
While I understand that the main thrust of the manuscript is methodological, I didn't get a strong sense of what the conclusion from the applied analysis is. If the authors could communicate a strong finding, this would help to justify the use of the method. Eg what are the estimates for the HDL clusters in the SCA method? This isn't clear.
Figure 6 would be key to this – currently, it's not clear what the various PCs represent. Maybe a radar chart would help? Also, the caption for Figure 6 is incorrect – the caption from Figure 8 has inadvertently been duplicated here.
While I wouldn't want to insist that the authors change their paper, the obvious application of this work appears to be imaging studies – this is currently underexplored.
Figure 5 is a really helpful summary of the performance of the different methods in this investigation.
Reviewer #2 (Recommendations for the authors):
This is a nice paper, with convincing and very useful results. I have mainly some clarification questions and suggestions.
1. The main idea is to treat the matrix of SNPexposures associations $\hat{\gamma}$ as explanatory variables in the regression $\hat{\mathrm{\Gamma}}=\hat{\gamma}\beta +e$, although this is clearly not a standard regression model, just that a WLS or GLS estimator for β is equivalent to the 2sls estimator on singlesample data. I think it would therefore be instructive to link to the construction of PCs in a singlesample individuallevel data sets. There the model is $y=X\beta +uX=G\gamma +V$ and one could do a PCA on X. Alternatively, as the 2sls estimator is leastsquares in $y=\hat{\mathbf{X}}\beta +\stackrel{~}{u}$ where $\hat{\mathbf{X}}=\mathbf{G}\stackrel{~}{\gamma}$, one could perhaps consider PCA on $\hat{\mathbf{X}}$. As $\hat{\mathbf{X}}}^{T}\hat{\mathbf{X}}={\hat{\gamma}}^{T}{\mathbf{G}}^{T}\mathbf{G}\hat{\gamma$ this is getting close to the PCA on $\hat{\gamma}$ if $\mathbf{G}}^{\mathbf{T}}\mathbf{G$ is normalized, for example ${G}^{T}\mathbf{G}/n$ = I_{p}. If it is not normalized, then it is hard to make a connection. It may therefore be worthwhile considering PCA on the weighted $\hat{\gamma}$ as per the (extended) IVW estimator in (1) on page 4. With such a link in place, it seems easier to motivate the proposed method.2. Are the covariances $\sigma}_{\text{lkj}$ reported in the GWAS? I assume the estimators are simply IVW throughout, as implied at the top of page 5?3. It is unclear to me why ∆ is estimated by OLS in (2).
4. Notation. It would be easier to follow the notation if matrices are upper case and vectors lower case. That would change for example the current vector γ to a lower case letter and the current matrix γ to an upper case letter.
5. It is not straightforward to follow the steps for Sparse PCA on page 56, as not all symbols are clear. It is not clear in 1. what "Setting a fixed matrix," means and what the α_{j} are. Also, define Xi. In 2 you simply do an svd? Should there be "hats" on γ? UDV^{T} in step 2. are not the same values as in the svd for $\hat{\gamma}$.
Reviewer #3 (Recommendations for the authors):
Here are some detailed comments.
1. Usually a good idea to number figures in the order they appear in the text (Figure 12 out of order).
2. UVMR is referred to as UMVR three times and as UMVMR once
3. At the beginning of the Methods section, it would be good to specify the dimension of \γ. It is also important to specify how the variants used are selected.
4. Figure 1 is a bit chaotic and inconsistent with the text. Figure 1 top, the graph indicates that there are 118 exposures. I assume this was chosen because it is the number of lipid exposures in the data application but it doesn't look like the lipid data is used in this figure. Figure 1 top right, the dimension of the matrix is given as L x 118. The variable L does not appear elsewhere in the paper. What is printed appears to be a 60x60 square matrix. Is this matrix meant to represent ${\gamma}^{\mathrm{\top}}\gamma$? The statement "MR estimates affected by multicolinearity" printed next to this matrix doesn't appear to relate to the figure because the figure doesn't indicate anything about the MR estimates.5. Figure 5. It looks like none of the matrices shown have all 118 rows. The sparse PCA matrix has many fewer exposures than the other methods. It would be good to clarify this in the caption.
6. Figure 6 is labeled as "Extrapolated ROX curves" but appears to be a comparison of UVMR and MVMR results. For the top right plot, I would have expected UVMR and MVMR results to be identical since the PCs are exactly orthogonal. Why is this not the case?
7. Page 12 describes Figure 5 as "The loadings for all methods (Figures 5ae) are presented for LDL.c, a previously prioritised trait." I am confused about what "for LDL.c" means here. It appears that LDL.c is one of the traits in Figure 5 but these appear to be loadings across some large subset of traits?
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Sparse Dimensionality Reduction Approaches in Mendelian Randomization with highly correlated exposures" for further consideration by eLife. Your revised article has been evaluated by Martin Pollak (Senior Editor) and a Reviewing Editor.
The manuscript has been improved and all reviewers agree this study will be useful for the field. However, there are remaining issues and we would like to invite you to further revise this manuscript.
Revision outline:
1. Adding a deeper more nuanced discussion of what is meant by causality and how to interpret the components. The major advancement of this study comes from method development and it is important to not overstate the significance of results. See comment 1 from reviewer 3 and comment 4 from reviewer 1.
2. The choice of method. Discussing scenarios that will work or fail for the method will be helpful for the readers to better utilize this tool. See comment 4 from reviewer 1 and comment 2 from reviewer 3.
3. Adding details of simulation. Other clarification issues raised by reviewers.
Reviewer #1 (Recommendations for the authors):
1. While I appreciate the simulation study, and believe this is worthwhile, it strikes me that the primary reason for choosing one or other of these sparse methods is not performance in a simulation study, but rather how the methods define the PCs, and how the analyst could interpret the results. As there's no point running an analysis if you can't interpret the results. This will depend on the specific characteristics of the investigation, so I am not expecting a single method recommendation. But I'd appreciate a bit more acknowledgement of this in the choice of methods – what characteristics to these methods tend to result in? Interesting (and reassuring) to see that clusters from the SCA and sPCA methods look similar in Figure 2 (although then odd that in Table 3, the significance differs for the VLDL and LDL components…).
Overall, I appreciate the authors' manuscript in that it proposes sparse dimension reduction as a method and explores the properties of MVMR estimates from this approach. My sense is that it is difficult to say anything more authoritative/general than "hey, look – we tried this and it sort of worked!", but the authors do this in a way that is mostly clear (see comments above) and should be helpful to future researchers.
Reviewer #2 (Recommendations for the authors):
I am happy with this revision. The authors have responded well to my previous comments and suggestions, or clarified their preferred choices.
Reviewer #3 (Recommendations for the authors):
The resubmission has corrected some of the issues I raised previously. However, I have some remaining concerns, primarily about the interpretation of the proposed causal effects and the simulation results.
1. My largest issue is still in the causal interpretation of genetically determined PCs are sparse components. To me, I don't think there is a way to turn the components into welldefined exposures that admit a counterfactual or interventional definition of a causal effect. This approach seems closer to "causality adjacent" methods like TWAS. There was a paper a few years ago using sparse CCA in combination with TWAS (https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1008973) and this seems similar to that to me. I think one option for this paper would be to add a deeper discussion of what it means to test for a causal effect of a component that may or may not represent a latent biological variable. On one end of interpretation you have "association with genetically predicted exposure component" and then I think you could add as a bonus that if you have successfully identified a biological component, this could be interpretable as a causal effect.
2. Using the components introduces an issue of horizontal pleiotropy if a single variant is allowed to affect multiple components. This suggests that the componentbased MVMR results would be more appropriate than componentbased UVMR results, though for PCA this is not an issue due to orthogonality and approximate equivalence of UVMR and MVMR.
Some questions about the simulations:
3. In Figure 4, it seems like the MVMR plot should have 6 lines and not two. How were the 6 traits combined into two lines here? It seems unfair to average the rejection rate for the X_13 into the black line since, if MVMR performed perfectly, it would never reject X_3 and always reject X_1 and X_2.
4. Why is there an interval around only the black line in the MVMR plot and not the other plots?
5. The sentence in the caption presumably describing the color scheme is cutoff. "The results are split according to group n black…"
6. Is the appropriate number of components known apriori in the simulations?
7. The description of Figure 4 suggests that PCA and SCA are outperforming MVMR. However, PCA has exactly the same rejection rate for the causal and noncausal block of traits which seems like bad performance to me. SCA has a Type 1 error rate between 20 and 30% at a nominal level of 0.05 which also seems quite bad to me. I would not view improving power at the cost of uncontrolled type 1 error as a good outcome.
8. The description of the Figure 5 at the top of page 9 does not align with the image in Figure 5. "Both standard and Bonferronicorrected MVMR performed poorly in terms of AUC (AUC 0.660 and 0.885 respectively), due to poor sensitivity. While there was a modest improvement with PCA (AUC 0.755), it appears that PCA and RSPCA do not accurately identify negative results (median specificity 0 and 0.28 respectively)." Reading these numbers, it sounds like MVMR is the worst followed by PCA, followed by MVMR_B. However, in the figure, PCA and RSPCA both have lower AUC than MVMR and MVMR_B. These numbers are also inconsistent with Table 4.
9. I think I may be a little confused about how the ROC curves are constructed.
9a. In my experience, an ROC curve is usually constructed by varying a threshold (e.g. α). However, it seems here like α was fixed and we are averaging over the single point obtained for each simulation setting according to the Reitsma et al. method. This method makes sense when the data are real observations, however, in the case of simulations, why not use the traditional method of varying to threshold and then averaging the full curves across simulations?
9b. I think this method of displaying results also obscures the issue of calibration. If there is a nominal significance threshold, we would like to see that type 1 error is controlled below that threshold.
9c. Visually, Figure 15 looks inconsistent with Figure 5 to me. There are no blue points in the upper right of Figure 15  does it make sense to extrapolate this part of the curve? Figure 5 suggests that the red methods at least have the potential to do very well in differentiating true and false positives but this doesn't seem born out by Figure 15 where most of the red points cluster in the upper right.
10. On the selection of instruments  is it correct that you are assuming that all variants affecting any of the component metabolites also affect LDL/HDL or Triglycerdies as measured in the GLGC study? Does this assumption limit your ability to identify components that separate LDL or HDL subcomponents? For example, variants that affect only a small number of components will be less likely to be selected using the GLGC reference.
https://doi.org/10.7554/eLife.80063.sa1Author response
Reviewer #1 (Recommendations for the authors):
I'm not the best person to judge, but I'm not sure how accessible this work would be to someone who doesn't understand multivariable Mendelian randomization. Figure 1 is a great overview, but I'd appreciate the viewpoint of someone who is less familiar with these methods.
We have expanded the Introduction (first and second paragraphs), stressing the link of MR with RCTs that are more widely known. The goal of confounding adjustment is very similar in the two approaches and we believe this can motivate the presented methods for a wider audience.
The description of the different sparse PCA methods was a helpful summary.
I'm not sure that the foray into instrument strength is valuable – it is over 2 pages and 2 figures, and I'm not sure how much this adds – maintaining high conditional F statistics is important, but their calculation is not a major part of the analysis report. The whole section is a bit distracting and disproportionate.
All MR estimates seem to be on quite odd scales – the estimates are all close to 1 – 1.001 or 0.999. Could the PCs be standardized, so that the estimates are per 1 SD increase in the PC? This would improve communication of the results. Currently, there is no sense of scale – what is the corresponding change in lipids on a meaningful scale?
Thank you for your suggestion, indeed the OR estimates are on a narrow scale close to 1. We relate this to the lack of definition of the estimated causal effects. For this reason, we limit the scope of the proposed method to hypothesis testing as opposed to estimation. It is encouraging that, in the positive control example of how lipids associate with CHD, we observe that, when positive loadings are used to transform the exposures, the sign is biologically meaningful (LDL and VLDL increase risk, HDL appears to be protective). We acknowledge this and further stress its implications in the Discussion/Limitations section.
While I understand that the main thrust of the manuscript is methodological, I didn't get a strong sense of what the conclusion from the applied analysis is. If the authors could communicate a strong finding, this would help to justify the use of the method. Eg what are the estimates for the HDL clusters in the SCA method? This isn't clear.
Thank you for your suggestion, indeed there is an interesting point not originally highlighted regarding the HDL cluster. In the MVMR with those exposures that were significant in UVMR, only ApoB is positively associated with CHD. No exposures were protective. However, in SCA, there is a grouping of HDL traits in two independent PCs, corresponding to small/medium and larger HDL traits. The former PC appears to be protective for CHD. This finding is supported from other observational studies (smaller HDL traits are associated with favourable cardiovascular risk profiles) that we now report in the Discussion. Thus, if the analysis was restricted to MVMR, these findings would be masked due to imprecision (additions in Discussion/First Paragraph, updated Table 4).
Figure 6 would be key to this – currently, it's not clear what the various PCs represent. Maybe a radar chart would help? Also, the caption for Figure 6 is incorrect – the caption from Figure 8 has inadvertently been duplicated here.
Thank you for your observation, we have fixed the caption. The figure has also slightly changed following the comments of Reviewer 3 on an update in the exposures that are eligible with the SNPs that we use for the instrument. What Figure 6 reports is the comparison of the univariable and multivariable estimates of the causal effect of each PC on CHD. We would expect PCA to provide orthogonal components and therefore to have identical estimates in UVMR and MVMR. In the sparse methods, where strict orthogonality is traded with interpretability of the PCs, we expected some deviation and we wanted to visually communicate this. We also included the UVMR and MVMR estimates for SFPCA which were not in the original draft.
While I wouldn't want to insist that the authors change their paper, the obvious application of this work appears to be imaging studies – this is currently underexplored.
Thank you for your observation, we agree that imaging data would be a very good candidate for this approach. We highlight this in the Discussion.
Reviewer #2 (Recommendations for the authors):
This is a nice paper, with convincing and very useful results. I have mainly some clarification questions and suggestions.
1. The main idea is to treat the matrix of SNPexposures associations $\hat{\gamma}$ as explanatory variables in the regression $\hat{\mathrm{\Gamma}}=\hat{\gamma}\beta +e$, although this is clearly not a standard regression model, just that a WLS or GLS estimator for β is equivalent to the 2sls estimator on singlesample data. I think it would therefore be instructive to link to the construction of PCs in a singlesample individuallevel data sets. There the model is $y=X\beta +uX=G\gamma +V$ and one could do a PCA on X. Alternatively, as the 2sls estimator is leastsquares in $y=\hat{\mathbf{X}}\beta +\stackrel{~}{u}$ where $\hat{\mathbf{X}}=\mathbf{G}\stackrel{~}{\gamma}$, one could perhaps consider PCA on $\hat{\mathbf{X}}$. As $\hat{\mathbf{X}}}^{T}\hat{\mathbf{X}}={\hat{\gamma}}^{T}{\mathbf{G}}^{T}\mathbf{G}\hat{\gamma$ this is getting close to the PCA on $\hat{\gamma}$ if $\mathbf{G}}^{\mathbf{T}}\mathbf{G$ is normalized, for example ${G}^{T}\mathbf{G}/n$ = I_{p}. If it is not normalized, then it is hard to make a connection. It may therefore be worthwhile considering PCA on the weighted $\hat{\gamma}$ as per the (extended) IVW estimator in (1) on page 4. With such a link in place, it seems easier to motivate the proposed method.
Thank you for your suggestion. We now include a simulation study on onesample data where we perform PCA (PCA_{1SMR}) and one sparse implementation (SCA) on $\hat{X}$. To accomodate the larger computational time (matrix of n × K instead of p × n), we reduced sample size and accordingly increased the γ parameter, keeping the F−statistics and conditional F−statistics in the same range. We compare a twostage least squares MVMR estimator for K exposures with the following procedure. We obtain PC scores from Xˆ = UDV ^{T} and then regress UD on Y. Figure XYZ shows a similar performance with individuallevel data as the one reported in the original draft. Both MVMR and MVMR with a multiplicity correction suffer from imprecise results (low sensitivity in Figure). SCA performs better in identifying true causal exposures, whereas PCA provides a lot of false positive results. Therefore, we conclude that such an approach with twostage least squares is feasible and performs better than PCA in the simulation study.
We then examine how similar PCA_{1SMR} and PCA_{2SMR} are. As suggested, the aim is to establish how much information is lost by opting for a PCA/SCA on summary data only. If the exposures are grouped together in PCA_{1SMR} and PCA_{2SMR} in a similar fashion, then this would indicate a similar performance in individual level data and summary statistics. To quantify this, we keep the simulation setup described in the Simulation Study Section and we apply the two methods.
As you suggested, the loadings matrices for both is of dimensionality K × K. We then use two measures to quantify the similarity for increasing sample sizes: a similarity measure to compare the loadings matrices
where V_{1SMR} are the loadings assigned in onesample MR and V_{2SMR} those of twosample MR, and (b) the R^{2} of the regression of ${V}_{{k}_{1}{k}_{2},}1SMR$ on ${V}_{{k}_{1}{k}_{2}}2SMR$ ($R}^{2}=1\frac{{\u03f5}_{\text{VV}}^{T}\u03f5VV}{{V}_{{1SMR}^{T}{V}_{1SMR}}$, where ϵ_{V V} are the residuals of the regression of V_{1SMR} on V_{2SMR}). We vary the sample size and track the performance of PCA and SCA.
The results are presented in Figure 1. For SCA, PCA_{1SMR} and PCA_{1SMR} load the matrix in a more similar manner as sample size increases, however not converging to zero. A similar picture is drawn by the R^{2}, with all PCs showing a closer alignment of 1SMR and 2SMR as sample size increases. A closer examination of the results in individual simulations suggests that the blocks of exposures are correctly projected to PCs (exposures 1 to 3 in one PC and 4 to 6 in another, Figure), more so in larger sample sizes We note that the loadings for each block are not identical in 1SMR and 2SMR.
As suggested, in the applied analysis we have replaced the PCA on γ approach with the one proposed. We thus standardise γ as.$\frac{\text{\gamma GX}}{{\text{SE}}_{\text{\gamma GX}}}$ There are no differences in precision of the causal estimates.
2. Are the covariances $\sigma}_{\text{lkj}$ reported in the GWAS? I assume the estimators are simply IVW throughout, as implied at the top of page 5?
We have used the intercept of the LD score regression as a proxy for the phenotypic correlations σ_{lkj}.
We clarify that, if individuallevel data is not available, this is a sensible approach. Yes, IVW is used throughout the analyses.
3. It is unclear to me why ∆ is estimated by OLS in (2).
Thank you for your comment, $\hat{\delta}$ is estimated by regressing an exposure X_{k} on the other exposures X_{−k} by OLS. If an exposure k can be accurately predicted conditionally on other exposures, then there won’t be sufficient variation in Q_{Xk} (low conditional instrument strength). We have clarified this in the manuscript.
4. Notation. It would be easier to follow the notation if matrices are upper case and vectors lower case. That would change for example the current vector γ to a lower case letter and the current matrix γ to an upper case letter.
Thank you for your comment, we prefer to refer to the SNPX associations as γ and to the SNPY associations as Γ as this is common practice.
5. It is not straightforward to follow the steps for Sparse PCA on page 56, as not all symbols are clear. It is not clear in 1. what "Setting a fixed matrix," means and what the α_{j} are. Also, define Xi. In 2 you simply do an svd? Should there be "hats" on γ? UDV^{T} in step 2. are not the same values as in the svd for $\hat{\gamma}$.
Thank you for your comment, we have updated the explanation on sparse PCA in page 5. We are rephrasing the page 272 of the Zou et al., 2006 paper that introduces the ideas of framing PCA as a ridge regression problem (L2 norm is similar to variance maximisation) and then additionally using an L1 norm to induce sparsity.
Reviewer #3 (Recommendations for the authors):
Here are some detailed comments.
1. Usually a good idea to number figures in the order they appear in the text (Figure 12 out of order).
Thank you for spotting this, we have reordered the Figures.
2. UVMR is referred to as UMVR three times and as UMVMR once
Thank you for spotting this, we have corrected this.
3. At the beginning of the Methods section, it would be good to specify the dimension of γ. It is also important to specify how the variants used are selected.
Thank you for your suggestion, we have added the dimensions of γ (148x118). The instrument selection procedure is described in the first paragraph of the Methods/Data section.
4. Figure 1 is a bit chaotic and inconsistent with the text. Figure 1 top, the graph indicates that there are 118 exposures. I assume this was chosen because it is the number of lipid exposures in the data application but it doesn't look like the lipid data is used in this figure. Figure 1 top right, the dimension of the matrix is given as L x 118. The variable L does not appear elsewhere in the paper. What is printed appears to be a 60x60 square matrix. Is this matrix meant to represent \γ^\top \γ? The statement "MR estimates affected by multicolinearity" printed next to this matrix doesn't appear to relate to the figure because the figure doesn't indicate anything about the MR estimates.
Thank you for looking into this Figure, we indeed do not use the lipid data in this. Instead, we use simulated data. To improve the Figure, we have switched to the updated exposures γ^{T}γ matrix (147 × 99). We changed the notation to K, which is how we refer to the number of exposures throughout the text. We have moved the ’MR Estimates affected by multicollinearity’ sentence to the legend.
5. Figure 5. It looks like none of the matrices shown have all 118 rows. The sparse PCA matrix has many fewer exposures than the other methods. It would be good to clarify this in the caption.
Thank you for your suggestion, we now clarify this in the caption.
6. Figure 6 is labeled as "Extrapolated ROX curves" but appears to be a comparison of UVMR and MVMR results. For the top right plot, I would have expected UVMR and MVMR results to be identical since the PCs are exactly orthogonal. Why is this not the case?
Thank you, we have corrected the caption. This is correct, we are comparing UVMR and MVMR estimates for the PCA methods. We agree and also hypothesised that the estimates would be identical for PCA where the PCs are strictly orthogonal and similar in the sparse implementations. In the updated analysis where aminoacids are excluded and six (instead of seven) PCs are constructed, this is also observed. The agreement of the causal estimates from PCA is the highest (R^{2} = 0.93). The sparse methods provide estimates that are not as concordant as in PCA but are still highly correlated. Regarding the poor performance of robust sparse PCA on this assessment, we have increased the number of sparsity parameters per component and this appeared to improve (R^{2} = 0.79).
7. Page 12 describes Figure 5 as "The loadings for all methods (Figures 5ae) are presented for LDL.c, a previously prioritised trait." I am confused about what "for LDL.c" means here. It appears that LDL.c is one of the traits in Figure 5 but these appear to be loadings across some large subset of traits?
Thank you, we were presenting separately the loadings that corresponded to LDL.c and urea across all methods in Figure 12. Figure 5 was a reference to the methods but this referencing was confusing so we dropped it. In the updated Figure, we present visually how the different methods project LDL.c and ApoB to different components; PCA loadings suggest that LDL.c contributes to some degree to all PCs whereas the sparse methods reduce this to a single PC.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
The manuscript has been improved and all reviewers agree this study will be useful for the field. However, there are remaining issues and we would like to invite you to further revise this manuscript.
Thank you for your positive assessment of our manuscript. We have updated the manuscript according to the reviewers’ comments and have included our responses pointbypoint in the present document.
We also like to point out a change in the senior authorship, with Prof. Jack Bowden and Dr. Verena Zuber now both being joint senior authors, as Prof. Jack Bowden contributed to the manuscript to a degree that qualifies him for this. All authors agreed to this author order.
Revision outline:
1. Adding a deeper more nuanced discussion of what is meant by causality and how to interpret the components. The major advancement of this study comes from method development and it is important to not overstate the significance of results. See comment 1 from reviewer 3 and comment 4 from reviewer 1.
Thank you for your suggestion. We have updated the interpretation of the findings in line with these two comments. We stress more on how the methodological improvements are the major focus.
2. The choice of method. Discussing scenarios that will work or fail for the method will be helpful for the readers to better utilize this tool. See comment 4 from reviewer 1 and comment 2 from reviewer 3.
Thank you for your point. We have expanded the Discussion section to describe the types of datasets that the proposed methods are applicable for. We believe this makes things clearer for practitioners.
2. Adding details of simulation. Other clarification issues raised by reviewers.
Thank you for your suggestion. We have substantially rephrased and reorganised the simulation section, with changes highlighted in blue. We have added the required details and split the simulation results in subsections to improve clarity.
Reviewer #1 (Recommendations for the authors):
1. While I appreciate the simulation study, and believe this is worthwhile, it strikes me that the primary reason for choosing one or other of these sparse methods is not performance in a simulation study, but rather how the methods define the PCs, and how the analyst could interpret the results. As there's no point running an analysis if you can't interpret the results. This will depend on the specific characteristics of the investigation, so I am not expecting a single method recommendation. But I'd appreciate a bit more acknowledgement of this in the choice of methods – what characteristics to these methods tend to result in? Interesting (and reassuring) to see that clusters from the SCA and sPCA methods look similar in Figure 2 (although then odd that in Table 3, the significance differs for the VLDL and LDL components…).
Thank you for your comment. We agree that, given the study design, final recommendations on using only one method could be missing out on subtle differences; for instance, datasets with outliers might benefit more from robust approaches but this approach didn’t seem to outperform others in our simulation study nor in the applied example. We show a substantial improvement in performance for the particular realistic simulation design that we describe, and the performance measure is based on hypothesis testing rather than how the methods define the PCs. The simulation study is now reorganised with three distinct sections and we believe it is clearer how we do measure the results of how the PCs affect the outcome (power and Type I Error).
Regarding the Table 3 results, we agree that the sPCA results are not perfectly accordant with the expectation of a LDL/VLDL positive association with CHD. We used this example as a positive control outcome and we therefore interpret this as a suboptimal performance of sPCA in this particular context. It is of course possible to imagine different scenarios regarding this difference with the simulations where sPCA performed well, the main of whom would be some form of pleiotropy. We discuss this possibility in the discussion.
Overall, I appreciate the authors' manuscript in that it proposes sparse dimension reduction as a method and explores the properties of MVMR estimates from this approach. My sense is that it is difficult to say anything more authoritative/general than "hey, look – we tried this and it sort of worked!", but the authors do this in a way that is mostly clear (see comments above) and should be helpful to future researchers.
Thank you for your positive assessment of our manuscript.
Reviewer #3 (Recommendations for the authors):
The resubmission has corrected some of the issues I raised previously. However, I have some remaining concerns, primarily about the interpretation of the proposed causal effects and the simulation results.
Thank you for your positive assessment of our manuscript.
1. My largest issue is still in the causal interpretation of genetically determined PCs are sparse components. To me, I don't think there is a way to turn the components into welldefined exposures that admit a counterfactual or interventional definition of a causal effect. This approach seems closer to "causality adjacent" methods like TWAS. There was a paper a few years ago using sparse CCA in combination with TWAS (https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1008973) and this seems similar to that to me. I think one option for this paper would be to add a deeper discussion of what it means to test for a causal effect of a component that may or may not represent a latent biological variable. On one end of interpretation you have "association with genetically predicted exposure component" and then I think you could add as a bonus that if you have successfully identified a biological component, this could be interpretable as a causal effect.
Thank you for the article suggestion, we have read the particular application of sparse methods in TWAS and cite it in the discussion. This also helps contextualising the benefits of sparse methods in general, namely improving power.
We have supplemented our previous discussion of this difficulty in interpreting the components as actionable exposures with counterfactuals in the draft, in the Discussion/ Limitations section, bringing it forward in the new updated paragraph. We acknowledge that a strict causal interpretation is not readily available with our suggested approach; at the same time, we believe it’s highly unlikely that a pragmatic causal effect will act in a way such that only one particular exposure of the many, highly correlated ones will be the sole driver of an effective medical intervention.
2. Using the components introduces an issue of horizontal pleiotropy if a single variant is allowed to affect multiple components. This suggests that the componentbased MVMR results would be more appropriate than componentbased UVMR results, though for PCA this is not an issue due to orthogonality and approximate equivalence of UVMR and MVMR.
Thank you for your comment. We agree that MVMR results are better suited for this purpose and this is why we present MVMR and not UVMR in the positive control example with real data. We highlight this in the applied Results section and mention the pleiotropy concern in the limitations.
Some questions about the simulations:
3. In Figure 4, it seems like the MVMR plot should have 6 lines and not two. How were the 6 traits combined into two lines here? It seems unfair to average the rejection rate for the X_13 into the black line since, if MVMR performed perfectly, it would never reject X_3 and always reject X_1 and X_2.
Thank you for your comment. We have substantially revised the simulation study results, splitting it in three sections to improve clarity (Illustrative example, highdimensional example, determinant of performance). The changes are highlighted in blue. We have also corrected the data generating mechanism in Figure 4a, to visualise the exact mechanism that was used in the simulations in Figure 4b (six exposures forming two blocks, one block / three exposures are causal for Y).
In the first scenario (Figure 4), we have compared the rejection rates of MVMR with the two informative PCs and MVMR with the six individual exposures. Therefore, the first two panels in Figure 4b include two lines (with corresponding Monte Carlo SEs), one for each PC, and the rightmost panel includes six lines, one for each exposure. We have updated the colour coding and the DAG to better present this.
In the proposed workflow, we have used only the informative components in order to efficiently reduce the dimensions. We achieved this with permutations in the real data and with the KarlisSaportaSpinakis criterion, a simpler method for a different eigenvalue cutoff. In the simulations, we have used the two first PCs, each of which reflected three exposures.
4. Why is there an interval around only the black line in the MVMR plot and not the other plots?
We have now modified the presentation of the Figure. What we are presenting in this plot is the rejection rates for the two components (three exposures each) of PCA and SCA. In MVMR, there are six exposures and six rejection rates. The colourcoding is green for the true causal exposures (and components) and red for those that do not affect the outcome. To make this clearer, we now show different shades of red and green for the individual exposures and red and green for the transformed components. For PCA, the causal and the noncausal groups are presented as two separate lines. It was observed that PCA and SCA performed well in mapping the correlated exposures to two components, hence the equivalence visualised by the common colour coding (see also comments 3, 6). Apologies for the confusing presentation, the lines that were interpreted as intervals are in reality lines that show the rejection rates of MVMR for each exposure. We have added a table in the figure to clarify this, and updated the figure legend.
Therefore, an ideal performance would be a nominal type I error rate for the component that reflects the negative exposures (colourcoded red) and a high value of power approaching 1 (colour coded black). To improve presentation, we have also added Monte Carlo SEs following Tim Morris et al. (Table 6 in DOI: 10.1002/sim.8086). In this updated simulation, there seems to be an appropriate performance for SCA, with T1E converging at 5%. The problematic performance of MVMR is highlighted in the lower than nominal Type 1 Error throughout and, importantly, in the low power in the low instrument strength simulation (CFS ~ 3, first result/ leftmost part of the three panels).
5. The sentence in the caption presumably describing the color scheme is cutoff. "The results are split according to group n black,…."
Thank you for your comment, we have now rephrased the Figure 4 caption.
6. Is the appropriate number of components known apriori in the simulations?
This is correct. The two ways of determining the number of components (KSS criterion, permutation) performed well in the simulated data and we chose to present results with an a priori known number of components. We have added this in the simulation results/ Simple Illustrative Example section for transparency.
7. The description of Figure 4 suggests that PCA and SCA are outperforming MVMR. However, PCA has exactly the same rejection rate for the causal and noncausal block of traits which seems like bad performance to me. SCA has a Type 1 error rate between 20 and 30% at a nominal level of 0.05 which also seems quite bad to me. I would not view improving power at the cost of uncontrolled type 1 error as a good outcome.
Thank you for your suggestions, we have given the simulation results a more careful consideration. We agree that PCA performs suboptimally, albeit in a different manner than MVMR. MVMR seems to lack in power and PCA provides many false positive results. What seems to perform better is SCA. We also agree that a T1E of 20% is concerning. However, observing that the T1E curve didn’t seem to have fully converged for the instrument strength that we were previously reporting, we repeated the same simulation study with one modification. We further increase the sample sizes to achieve higher instrument strength. In the updated Figure 4, SCA indeed outperforms PCA and MVMR, with a wellcontrolled T1E for larger samples. It may be unlikely that such a high instrument strength will be observed, especially in the context of correlated exposures, however we believe this convergence to 5% is quite reassuring.
Regarding your observations on the performance of PCA, we agree and we are more critical of PCA in the Results section as we agree that false positives are a major issue. It seems that MVMR and PCA in the formulations we investigated perform poorly at the two extremes of poor performance (highly imprecise, misleadingly precise); meanwhile, sensible transformations of the exposures are achieved and we have added in the discussion the possibility of some form of T1E control to salvage PCA performance to guide further research. To clarify this, we added a segment in the discussion (Paragraph 2 ‘When using PCA without any sparsity constraints, our simulation studies revealed numerous false positive results, at the opposite end of the nature of poor performance seen in MVMR; estimates were often misleadingly precise (false negative). Although appropriate transformations of the exposures were achieved, we highly recommend exploring additional forms of T1E control to improve the performance of PCA. Nonetheless, sparse methods exhibited superior performance compared to both PCA and MVMR.6’)
8. The description of the Figure 5 at the top of page 9 does not align with the image in Figure 5. "Both standard and Bonferronicorrected MVMR performed poorly in terms of AUC (AUC 0.660 and 0.885 respectively), due to poor sensitivity. While there was a modest improvement with PCA (AUC 0.755), it appears that PCA and RSPCA do not accurately identify negative results (median specificity 0 and 0.28 respectively)." Reading these numbers, it sounds like MVMR is the worst followed by PCA, followed by MVMR_B. However, in the figure, PCA and RSPCA both have lower AUC than MVMR and MVMR_B. These numbers are also inconsistent with Table 4.
Thank you for your comments, the purpose of direct AUC comparisons is to visualise in one figure the performances of the methods. We agree that, seen in isolation, it may be misleading to miss the differences with respect to sensitivity and specificity across methods: MVMR consistently provides imprecise results and fails to identify positive associations, whereas PCA provides very precise results that yield many false positives. We were acknowledging this previously and have further highlighted in the discussion. The comparison of PCA and MVMR solely on AUC is therefore not a reliable way to measure their efficacy. Thank you for spotting the inconsistency in Table 4, the columns are now correctly labelled.
At the same time we present the results, we’re discussing what is not directly identified. Results are driven by very different behaviour on different ends of AUC curve. We acknowledge that the results are inherently dependent on the data generating model and the specific simulation settings used in our study. If the data were generated differently, the results would differ as well.
9. I think I may be a little confused about how the ROC curves are constructed.
9a. In my experience, an ROC curve is usually constructed by varying a threshold (e.g. α). However, it seems here like α was fixed and we are averaging over the single point obtained for each simulation setting according to the Reitsma et al. method. This method makes sense when the data are real observations, however, in the case of simulations, why not use the traditional method of varying to threshold and then averaging the full curves across simulations?
Thank you for bringing up this question. We have substantially updated the simulation section and we hope the workflow is now clearer. We first present the data generating mechanism, and we provide more details on the ROC curve construction in the paragraph Simulation Studies/ Complex HighDimensional Example.
We fix the α threshold to present the results across many individual simulations in a more concise manner. Varying the threshold and averaging the full curves across simulations would have been a valid approach for single simulation studies as well, but any summary across simulation studies would have made the presentation of results more complex and difficult to interpret.
Therefore, we use the Reitsma et al. method, a wellestablished approach for summarizing ROC curves in simulation studies, as it provides a comprehensive and concise summary of the results. This method jointly metaanalyses sensitivity and specificity with a bivariate mode. We further explain our rationale in the Simulation Studies/ Complex HighDimensional Example paragraph.
9b. I think this method of displaying results also obscures the issue of calibration. If there is a nominal significance threshold, we would like to see that type 1 error is controlled below that threshold.
Thank you for your comment. We have shown that SCA achieves appropriate T1E for larger samples in the updated analysis (Simulation Results/Simple illustrative example/Figure 4).
9c. Visually, Figure 15 looks inconsistent with Figure 5 to me. There are no blue points in the upper right of Figure 15  does it make sense to extrapolate this part of the curve? Figure 5 suggests that the red methods at least have the potential to do very well in differentiating true and false positives but this doesn't seem born out by Figure 15 where most of the red points cluster in the upper right.
Thank you for your observation. The curve is directly derived from the bivariate metaanalysis method and the unobserved values are extrapolated from the form of the observed ones. The true performance, as observed directly from individual simulation studies, is reported in Figure 15, where many of the results for the two superior sparse PCA methods cluster in the top right corner. Additionally, there is a clear benefit in the red methods, as evidenced by a substantial scoring in the top of the yaxis (in the sensitivity = 1 area of the plot). This benefit is not observed for PCA or MVMR. When the results are metaanalysed, these latter simulation studies in the middle region are extrapolated and indicative of good performance. This is why we choose to metaanalyse, as all these factors are taken into account, but at the same time present the individual results for visual comparison.
We have to disagree with your interpretation regarding how these two results are in disagreement, as we believe Figure 15 is a transparent and complementary way of presenting the total performance in Figure 5. In Figure 15, even though there is some apparent clustering in the top right corner, there is still much more dispersion compared to PCA and especially in the middle regions (say sensitivity and specificity ~ 0.2 0.8). These individual results give rise to the curves in the metaanalysed plot.
10. On the selection of instruments  is it correct that you are assuming that all variants affecting any of the component metabolites also affect LDL/HDL or Triglycerdies as measured in the GLGC study? Does this assumption limit your ability to identify components that separate LDL or HDL subcomponents? For example, variants that affect only a small number of components will be less likely to be selected using the GLGC reference.
This is true, ideally we would have an external study that investigates subcomponents as separate phenotypes rather than serum measurements of total fractions. We are acknowledging this as another limitation in the discussion.
https://doi.org/10.7554/eLife.80063.sa2Article and author information
Author details
Funding
State Scholarships Foundation
 Vasileios Karageorgiou
Expanding Excellence in England
 Vasileios Karageorgiou
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We would like to thank Dr. Stephen Burgess and two anonymous reviewers that provided constructive comments on the extension of the approach to individuallevel data, on improving and clarifying the applied analyses, and on updating and communicating the simulation results.
Senior Editor
 Martin R Pollak, Harvard Medical School, United States
Reviewing Editor
 Siming Zhao, Dartmouth College, United States
Reviewer
 Stephen Burgess, University of Cambridge, United Kingdom
Version history
 Received: May 6, 2022
 Preprint posted: June 16, 2022 (view preprint)
 Accepted: April 18, 2023
 Accepted Manuscript published: April 19, 2023 (version 1)
 Version of Record published: May 30, 2023 (version 2)
Copyright
© 2023, Karageorgiou et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 619
 Page views

 87
 Downloads

 2
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Evolutionary Biology
 Genetics and Genomics
Maintaining germline genome integrity is essential and enormously complex. Although many proteins are involved in DNA replication, proofreading, and repair, mutator alleles have largely eluded detection in mammals. DNA replication and repair proteins often recognize sequence motifs or excise lesions at specific nucleotides. Thus, we might expect that the spectrum of de novo mutations – the frequencies of C>T, A>G, etc. – will differ between genomes that harbor either a mutator or wildtype allele. Previously, we used quantitative trait locus mapping to discover candidate mutator alleles in the DNA repair gene Mutyh that increased the C>A germline mutation rate in a family of inbred mice known as the BXDs (Sasani et al., 2022, Ashbrook et al., 2021). In this study we developed a new method to detect alleles associated with mutation spectrum variation and applied it to mutation data from the BXDs. We discovered an additional C>A mutator locus on chromosome 6 that overlaps Ogg1, a DNA glycosylase involved in the same baseexcision repair network as Mutyh (David et al., 2007). Its effect depends on the presence of a mutator allele near Mutyh, and BXDs with mutator alleles at both loci have greater numbers of C>A mutations than those with mutator alleles at either locus alone. Our new methods for analyzing mutation spectra reveal evidence of epistasis between germline mutator alleles and may be applicable to mutation data from humans and other model organisms.

 Evolutionary Biology
 Genetics and Genomics
Although gene expression divergence has long been postulated to be the primary driver of human evolution, identifying the genes and genetic variants underlying uniquely human traits has proven to be quite challenging. Theory suggests that celltypespecific cisregulatory variants may fuel evolutionary adaptation due to the specificity of their effects. These variants can precisely tune the expression of a single gene in a single celltype, avoiding the potentially deleterious consequences of transacting changes and noncell typespecific changes that can impact many genes and cell types, respectively. It has recently become possible to quantify humanspecific cisacting regulatory divergence by measuring allelespecific expression in humanchimpanzee hybrid cells—the product of fusing induced pluripotent stem (iPS) cells of each species in vitro. However, these cisregulatory changes have only been explored in a limited number of cell types. Here, we quantify humanchimpanzee cisregulatory divergence in gene expression and chromatin accessibility across six cell types, enabling the identification of highly celltypespecific cisregulatory changes. We find that celltypespecific genes and regulatory elements evolve faster than those shared across cell types, suggesting an important role for genes with celltypespecific expression in human evolution. Furthermore, we identify several instances of lineagespecific natural selection that may have played key roles in specific cell types, such as coordinated changes in the cisregulation of dozens of genes involved in neuronal firing in motor neurons. Finally, using novel metrics and a machine learning model, we identify genetic variants that likely alter chromatin accessibility and transcription factor binding, leading to neuronspecific changes in the expression of the neurodevelopmentally important genes FABP7 and GAD1. Overall, our results demonstrate that integrative analysis of cisregulatory divergence in chromatin accessibility and gene expression across cell types is a promising approach to identify the specific genes and genetic variants that make us human.