Sparse dimensionality reduction approaches in Mendelian randomisation with highly correlated exposures

  1. Vasileios Karageorgiou (corresponding author)
  2. Dipender Gill
  3. Jack Bowden
  4. Verena Zuber
  1. Department of Epidemiology and Biostatistics, School of Public Health, Faculty of Medicine, Imperial College London, United Kingdom
  2. University of Exeter, United Kingdom
  3. Department of Clinical Pharmacology and Therapeutics, Institute for Infection and Immunity, St George’s, University of London, United Kingdom
  4. Genetics Department, Novo Nordisk Research Centre Oxford, United Kingdom
19 figures, 7 tables and 1 additional file

Figures

Proposed workflow.

Step 1: MVMR on a set of highly correlated exposures. Each genetic variant contributes to each exposure. The high correlation is visualised in the similarity of the single-nucleotide polymorphism (SNP)-exposure associations in the correlation heatmap (top right). Steps 2 and 3: PCA and sparse PCA on γ^. Step 4: MVMR analysis on a low-dimensional set of principal components (PCs). X: exposures; Y: outcome; k: number of exposures; PCA: principal component analysis; MVMR: multivariable Mendelian randomisation.
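The workflow in this caption can be sketched numerically. The following is a minimal illustration on synthetic data, assuming plain SVD-based PCA of the SNP-exposure association matrix (without centring) and a fixed-effect inverse-variance weighted (IVW) regression of the SNP-outcome associations on the component scores. The function names, dimensions, and simulated values are hypothetical and not taken from the paper, and the sparse PCA step is omitted.

```python
import numpy as np

def pca_scores(gamma_hat, n_components):
    """PCA of the SNP-by-exposure association matrix via SVD (no centring; an assumption)."""
    _, _, vt = np.linalg.svd(gamma_hat, full_matrices=False)
    loadings = vt[:n_components].T        # exposures x components
    scores = gamma_hat @ loadings         # SNPs x components
    return scores, loadings

def ivw_mvmr(scores, beta_y, se_y):
    """Fixed-effect weighted least squares of SNP-outcome associations on the scores, no intercept."""
    w = 1.0 / se_y**2
    xtwx = scores.T @ (scores * w[:, None])
    xtwy = scores.T @ (w * beta_y)
    est = np.linalg.solve(xtwx, xtwy)
    se = np.sqrt(np.diag(np.linalg.inv(xtwx)))
    return est, se

# Synthetic example: six exposures in two correlated blocks of three (hypothetical values).
rng = np.random.default_rng(0)
n_snps, n_exposures = 150, 6
shared = rng.normal(size=(n_snps, 2))
block = np.repeat(np.eye(2), 3, axis=0).T     # 2 x 6: each latent factor drives three exposures
gamma_hat = shared @ block + 0.05 * rng.normal(size=(n_snps, n_exposures))
se_y = np.full(n_snps, 0.02)
beta_y = 0.3 * gamma_hat[:, :3].sum(axis=1) + rng.normal(scale=se_y)

scores, loadings = pca_scores(gamma_hat, n_components=2)   # Step 2
est, se = ivw_mvmr(scores, beta_y, se_y)                   # Step 4
print(est, se)
```

In this sketch each estimate refers to a component (a weighted sum of exposures) rather than to an individual exposure, which is why the loadings are needed to interpret it.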

Heatmaps for the loadings matrices in the Kettunen dataset for all methods (one with no sparsity constraints [a], four with sparsity constraints under different assumptions [b–e]).

The number of the exposures plotted on the vertical axis is smaller than K=97 as the exposures that do not contribute to any of the sparse principal components (PCs) have been left out. Blue: positive loading; red: negative loading; yellow: zero.

Comparison of univariable Mendelian randomisation (UVMR) and multivariable MR (MVMR) estimates and presentation of the major group represented in each principal component (PC) per method.
Simulation Study Outline.

(a) Data generating mechanism for the simulation study, illustrative scenario with six exposures and two blocks. In red boxes, the exposures that are correlated due to a shared genetic component are highlighted. (b) Simulation results for six exposures and three methods (sparse component analysis [SCA] [Chen and Rohe, 2021], principal component analysis [PCA], multivariable Mendelian randomisation [MVMR]). The exposures that contribute to Y (X1-3) are presented in shades of green colour and those that do not in shades of red (X4-6). In the third panel, each exposure is a line. In the first and second panels, the PCs that correspond to these exposures are presented as single lines in green and red. Monte Carlo SEs are visualised as error bars. Rejection rate: proportion of simulations where the null is rejected.
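The rejection rate and its Monte Carlo SE mentioned in this caption can be computed as in the sketch below. It assumes a significance threshold of 0.05 and the usual binomial standard error for a proportion; neither is stated in the caption, and the p-values are made up for illustration.

```python
import math

def rejection_rate(p_values, alpha=0.05):
    """Proportion of simulations with p < alpha, and its binomial Monte Carlo SE."""
    n = len(p_values)
    rate = sum(p < alpha for p in p_values) / n
    mc_se = math.sqrt(rate * (1.0 - rate) / n)
    return rate, mc_se

# Hypothetical p-values from eight simulation replicates.
rate, mc_se = rejection_rate([0.01, 0.20, 0.03, 0.60, 0.04, 0.50, 0.001, 0.30])
print(rate, mc_se)
```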

Extrapolated receiver-operating characteristic (ROC) curves for all methods.

SCA: sparse component analysis (Chen and Rohe, 2021); sPCA: sparse PCA (Zou et al., 2006); RSPCA: robust sparse PCA (Croux et al., 2013); PCA: principal component analysis; MVMR: multivariable Mendelian randomisation; MVMR_B: MVMR with Bonferroni correction.

Directed acyclic graph (DAG) for the multivariable Mendelian randomisation (MVMR) assumptions.

IV2, IV3: instrumental variable assumptions 2 and 3.

Appendix 1—figure 1
Simplified F-statistic Estimation.

(a) Data generating mechanism. Three exposures with different strengths of association with G are generated (γ1 = 1, γ2 = 0.5, γ3 = 0.1). (b) F-statistic for the three exposures X1, X2, X3 as estimated by the formulae in Equation 5 (horizontal axis) and Equation 4 (vertical axis).

Appendix 1—figure 2
Distributions of the F-statistics in principal component analysis (PCA) methods and individual (not transformed) exposures.

Exposure data in different blocks are simulated with a decreasing strength of association and the correlated blocks map to principal components (PCs). Each distribution represents the F-statistics for each PC. In the case of the individual exposures (red), the distributions represent the F-statistics for the corresponding exposures. Individual: individual exposures without any transformation; PCA: F-statistics for PCA; SCA: sparse component analysis (Chen and Rohe, 2021); sPCA: sparse PCA as described by Zou et al., 2006.

Appendix 1—figure 3
Multivariable Mendelian randomisation (MVMR) and univariable MR (UVMR) estimates.

Only ApoB is strongly associated with coronary heart disease (CHD). All SEs are larger in the MVMR model (range of the ratio SE_MVMR/SE_UVMR: 2.7–225.96).

Appendix 1—figure 4
Multivariable Mendelian randomisation (MVMR) with IVW (left) and MVMR with GRAPPLE (Zhao et al., 2021) (right).

Only the 66 exposures that are significant in univariable MR (UVMR) are put forward in these models. In IVW (left), ApoB shows nominal significance. In MR GRAPPLE (right), apolipoprotein B has the lowest p-value but no trait reaches nominal significance.

Appendix 1—figure 5
F-statistics for principal components (PCs) and sparse PCs.

The formula derived in Equation 5 is used. Black: principal component analysis (PCA) (no sparsity constraints); yellow: sparse component analysis (SCA); red: sparse PCA (sPCA; Zou et al., 2006); blue: robust sparse PCA (RSPCA); green: sparse fused PCA (SFPCA). The dashed line represents the cutoff of 10 that is considered the minimum desired F-statistic for an exposure to be considered well instrumented. The green line diverges from the pattern of decreasing instrument strength but, when referring to the loadings heatmap (Figure 2), it can be observed that the 4th sparse PC in SFPCA receives negative loadings from multiple very low-density lipoprotein (VLDL)- and low-density lipoprotein (LDL)-related traits. This may in turn cause the large F-statistic.

Appendix 1—figure 6
Bayesian information criterion (BIC) for different numbers of metabolites regularized to 0.

The lowest value is achieved for one non-zero exposure per component. However, six non-zero exposures per component achieved a similarly low BIC, and this setting was selected.

Appendix 1—figure 7
Trajectories for the loadings of total cholesterol in low-density lipoprotein (LDL) and ApoB in all methods.

Principal component analysis (PCA) loadings imply a contribution of LDL.c and ApoB to all principal components (PCs). In the sparse methods, this is limited to one PC (two for RSPCA).

Appendix 1—figure 8
Example for the block correlation in γ^ (n=5000, K=77) induced by the data generating mechanism in Figure 4.

In this example, the mean F-statistic is 231.2 and the mean CFS is 3.21.

Appendix 1—figure 9
AUC performance of multivariable Mendelian randomisation (MVMR) and dimensionality reduction methods for increasing sample sizes.

Two sparse methods (sparse component analysis [SCA], sparse principal component analysis [sPCA]) perform better compared with PCA and MVMR, with improving performance as the sample size increases. CFS: conditional F-statistic.

Appendix 1—figure 10
Individual results from s=1000 simulations.
Appendix 1—figure 11
Top panel: R²; bottom panel: similarity of loadings (S) between one-sample Mendelian randomisation (MR) and two-sample MR (Nsim = 10,000).
Appendix 1—figure 12
Specificity (ability to accurately identify true negative exposures) of sparse component analysis (SCA) when different proportions of the exposures in each block are causal for Y.
Author response image 1
Top panel: R²; bottom panel: similarity of loadings (Sload) between one-sample MR and two-sample MR (Nsim = 10,000).

Tables

Table 1
Univariable Mendelian randomisation (MR) results for the Kettunen dataset with coronary heart disease (CHD) as the outcome.

Positive: positive causal effect on CHD risk; Negative: negative causal effect on CHD risk.

| | Positive | Negative |
| --- | --- | --- |
| VLDL | AM.VLDL.C, M.VLDL.CE, M.VLDL.FC, M.VLDL.L, M.VLDL.P, M.VLDL.PL, M.VLDL.TG, XL.VLDL.L, XL.VLDL.PL, XL.VLDL.TG, XS.VLDL.L, XS.VLDL.P, XS.VLDL.PL, XS.VLDL.TG, XXL.VLDL.L, XXL.VLDL.PL, L.VLDL.C, L.VLDL.CE, L.VLDL.FC, L.VLDL.L, L.VLDL.P, L.VLDL.PL, L.VLDL.TG, SVLDL.C, S.VLDL.FC, S.VLDL.L, S.VLDL.P, S.VLDL.PL, S.VLDL.TG | None |
| LDL | ALDL.C, L.LDL.C, L.LDL.CE, L.LDL.FC, L.LDL.L, L.LDL.P, L.LDL.PL, M.LDL.C, M.LDL.CE, M.LDL.L, M.LDL.P, M.LDL.PL, S.LDL.C, S.LDL.L, S.LDL.P | None |
| HDL | S.HDL.TG, XL.HDL.TG | M.HDL.C, M.HDL.CE |

Table 2
Overview of sparse principal component analysis (sPCA) methods used.

KSS: Karlis-Saporta-Spinaki criterion. Package: R package implementation; Features: short description of the method; Choice: method of selection of the number of informative components in real data; PCs: number of informative PCs.

| Method | Package | Authors | Features | Choice | PCs |
| --- | --- | --- | --- | --- | --- |
| RSPCA | pcaPP | Croux et al., 2013 | Robust sPCA (RSPCA), different measure of dispersion (Qn) | Permutation KSS | 6 |
| SFPCA | Code in publication, Supplementary Material | Guo et al., 2010 | Fused penalties for block correlation | KSS | 6 |
| sPCA | elasticnet | Zou et al., 2006 | Formulation of sPCA as a regression problem | KSS | 6 |
| SCA | SCA | Chen and Rohe, 2021 | Rotation of eigenvectors for approximate sparsity | Permutation KSS | 6 |

Table 3
Results for principal component analysis (PCA) approaches.

Overlap: Percentage of metabolites receiving non-zero loadings in ≥1 component. Overlap in PC1, PC2: overlap as above but exclusively for the first two components, which by definition explain the largest proportion of variance. Very low-density lipoprotein (VLDL), low-density lipoprotein (LDL), and high-density lipoprotein (HDL) significance: results of the IVW regression model with CHD as the outcome for the respective sPCs (the sPCs that mostly received loadings from these groups). The terms VLDL and LDL refer to the respective transformed blocks of correlated exposures; for instance, VLDL refers to the weighted sum of the correlated VLDL-related γ^ associations, such as VLDL phospholipid content and VLDL triglyceride content. †: RSPCA projected VLDL- and LDL-related traits to the same PC (sPC1). ‡: SCA separated HDL molecules into two sPCs, one for traits of small- and medium-sized molecules and one for large- and extra-large-sized molecules.

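| | PCA | RSPCA | SFPCA | sPCA | SCA |
| --- | --- | --- | --- | --- | --- |
| Overlap | 1 | 0.938 | 1 | 0.187 | 0.196 |
| Overlap in PC1, PC2 | 1 | 0.433 | 1 | 0.010 | 0 |
| Sparse % | 0 | 0.474 | 0.082 | 0.835 | 0.796 |
| VLDL significance in MR† | Yes | No | Yes | No | Yes |
| LDL significance in MR | No | Yes | No | No | Yes |
| HDL significance in MR‡ | Yes | Yes | Yes | No | No |
| Small, medium HDL significance in MR | Yes | No | Yes | Yes | Yes |
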
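As a rough illustration of how summaries like the overlap and sparsity rows could be derived from a loadings matrix, the sketch below follows the wording of the caption (overlap as the share of exposures with a non-zero loading in at least `min_components` components) and assumes that "Sparse %" is the share of loadings equal to zero, which the caption does not define. The function name, the `min_components` and `tol` parameters, and the toy matrix are hypothetical.

```python
import numpy as np

def sparsity_summary(loadings, min_components=1, tol=0.0):
    """Return (overlap, sparse_fraction) for an exposures-by-components loadings matrix."""
    nonzero = np.abs(loadings) > tol
    per_exposure = nonzero.sum(axis=1)                      # components each exposure loads on
    overlap = float(np.mean(per_exposure >= min_components))
    sparse_fraction = float(1.0 - nonzero.mean())           # assumed meaning of "Sparse %"
    return overlap, sparse_fraction

# Toy loadings: four exposures, two components.
toy_loadings = np.array([
    [0.7, 0.0],
    [0.7, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
])
overlap, sparse_fraction = sparsity_summary(toy_loadings)
print(overlap, sparse_fraction)
```

Setting `min_components=2` would instead count exposures shared between components, should that be the intended reading.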
Table 4
Sensitivity and specificity presented as median and interquartile range across all simulations.

Sensitivity and specificity cells give the median followed by the interquartile range; AUC: area under the receiver-operating characteristic (ROC) curve.

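| | PCA | SCA | sPCA | RSPCA | MVMR_B | MVMR |
| --- | --- | --- | --- | --- | --- | --- |
| AUC | 0.56 | 0.919 | 0.941 | 0.644 | 0.660 | 0.712 |
| Sensitivity | 1, 0.1 | 1, 0.21 | 1, 0.047 | 0.667, 0.251 | 0.222, 0.2 | 0, 0.076 |
| Specificity | 0, 0.02 | 0.925, 0.772 | 0.936, 0.097 | 0.192, 0.104 | 0.960, 0.048 | 1, 0 |
| Youden’s J | 0 | 0.584 | 0.778 | –0.061 | 0.192 | 0.044 |
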
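For readers unfamiliar with the metrics in this table, the sketch below shows how sensitivity, specificity, and Youden's J (sensitivity + specificity − 1) are obtained for a single simulated dataset. The truth and detection vectors are hypothetical, and how the paper aggregates these quantities across simulations is not reproduced here.

```python
def sens_spec(truth, flagged):
    """Sensitivity and specificity from boolean truth (causal?) and flagged (significant?) lists."""
    tp = sum(t and f for t, f in zip(truth, flagged))
    fn = sum(t and not f for t, f in zip(truth, flagged))
    tn = sum((not t) and (not f) for t, f in zip(truth, flagged))
    fp = sum((not t) and f for t, f in zip(truth, flagged))
    return tp / (tp + fn), tn / (tn + fp)

def youden_j(sensitivity, specificity):
    """Youden's J statistic."""
    return sensitivity + specificity - 1.0

# Hypothetical single replicate: three causal and three non-causal exposures.
truth = [True, True, True, False, False, False]
flagged = [True, True, False, False, False, True]
sens, spec = sens_spec(truth, flagged)
j = youden_j(sens, spec)
print(sens, spec, j)
```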
Table 5
Two-sample Mendelian randomisation (MR).

Study characteristics.

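| | First author | Year | PMID | N | Cases | Controls | Study name (population) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Metabolites | Kettunen | 2016 | 27005778 | 24,925 | | | NMR GWAS (European) |
| CHD | Nelson | 2017 | 28714975 | 453,595 | 113,937 | 339,658 | CARDIoGRAMplusC4D (European) |
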
Appendix 1—table 1
Estimated causal effects of principal components (PCs) on coronary heart disease (CHD) risk.

PCA: principal component analysis; SCA: sparse component analysis; sPCA: sparse PCA (Zou et al., 2006); RSPCA: robust sparse PCA.

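| PC | Method | OR | LCI | UCI |
| --- | --- | --- | --- | --- |
| PC1 | PCA | 1.002 | 1.0015 | 1.0024 |
| PC2 | PCA | 1.0002 | 0.9995 | 1.001 |
| PC3 | PCA | 1.0013 | 1.0001 | 1.0024 |
| PC4 | PCA | 0.9985 | 0.997 | 0.9999 |
| PC5 | PCA | 0.9999 | 0.9978 | 1.002 |
| PC6 | PCA | 0.9993 | 0.9976 | 1.0009 |
| PC1 | SCA | 1.0027 | 1.0005 | 1.0049 |
| PC2 | SCA | 1.0027 | 1.0004 | 1.005 |
| PC3 | SCA | 0.9997 | 0.9976 | 1.0019 |
| PC4 | SCA | 0.9965 | 0.9941 | 0.9989 |
| PC5 | SCA | 1.0002 | 0.998 | 1.0024 |
| PC6 | SCA | 1.0034 | 0.9989 | 1.0078 |
| PC1 | sPCA | 1.0019 | 0.9999 | 1.0039 |
| PC2 | sPCA | 1.0003 | 0.9986 | 1.002 |
| PC3 | sPCA | 0.9988 | 0.997 | 1.0005 |
| PC4 | sPCA | 0.9975 | 0.9955 | 0.9995 |
| PC5 | sPCA | 0.998 | 0.9954 | 1.0006 |
| PC6 | sPCA | 0.9998 | 0.9982 | 1.0014 |
| PC1 | RSPCA | 1.0017 | 1.0006 | 1.0027 |
| PC2 | RSPCA | 0.9998 | 0.9983 | 1.0013 |
| PC3 | RSPCA | 0.9954 | 0.9918 | 0.999 |
| PC4 | RSPCA | 0.9989 | 0.9969 | 1.0008 |
| PC5 | RSPCA | 0.9944 | 0.9903 | 0.9986 |
| PC6 | RSPCA | 1.01 | 1.0013 | 1.0188 |
| PC1 | SFPCA | 1.002 | 1.0015 | 1.0025 |
| PC2 | SFPCA | 0.9991 | 0.9979 | 1.0004 |
| PC3 | SFPCA | 0.9998 | 0.9991 | 1.0006 |
| PC4 | SFPCA | 0.9982 | 0.9967 | 0.9997 |
| PC5 | SFPCA | 1.0001 | 0.9977 | 1.0025 |
| PC6 | SFPCA | 1.0009 | 0.9985 | 1.0033 |
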
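Odds ratios and interval limits of this kind are commonly obtained by exponentiating a log-odds estimate and its Wald interval. The sketch below assumes a 95% interval (z = 1.96) computed on the log-odds scale; the table does not state the confidence level or the method used, and the input values are hypothetical rather than taken from the paper.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio with lower and upper Wald limits from a log-odds estimate and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical log-odds estimate and standard error for one component.
or_est, lci, uci = odds_ratio_ci(beta=0.002, se=0.0005)
print(or_est, lci, uci)
```

An interval that excludes 1 corresponds to a nominally significant component under these assumptions.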
Appendix 1—table 2
Simulation study in which only four of the K = 50 exposures contribute to the outcome Y.

A drop in sensitivity and specificity is observed for sparse component analysis (SCA) and sparse principal component analysis (sPCA) compared with the simulation configuration in Table 4.

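| | PCA | SCA | sPCA | RSPCA | MVMR | MVMR_B |
| --- | --- | --- | --- | --- | --- | --- |
| AUC | 0.799 | 0.714 | 0.859 | 0.492 | 0.511 | 0.675 |
| SNS (sensitivity) | 1, 0.03 | 0.75, 0.25 | 1, 0.17 | 0.5, 0.25 | 0.25, 0.25 | 0, 0 |
| SPC (specificity) | 0, 0.2 | 0.76, 0.46 | 0.66, 0.18 | 0.37, 0.15 | 0.94, 0.07 | 1, 0 |
| Youden’s J | 0 | 0.428 | 0.625 | –0.029 | 0.105 | 0.032 |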

Cite this article

Vasileios Karageorgiou, Dipender Gill, Jack Bowden, Verena Zuber (2023) Sparse dimensionality reduction approaches in Mendelian randomisation with highly correlated exposures. eLife 12:e80063. https://doi.org/10.7554/eLife.80063