Statistical and computational methods for integrating microbiome, host genomics, and metabolomics data

  1. Rebecca A Deek (corresponding author)
  2. Siyuan Ma
  3. James Lewis
  4. Hongzhe Li (corresponding author)

  1. Department of Biostatistics, University of Pittsburgh, United States
  2. Department of Biostatistics, Vanderbilt School of Medicine, United States
  3. Division of Gastroenterology and Hepatology, Perelman School of Medicine, University of Pennsylvania, United States
  4. Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, United States
6 figures, 4 tables and 3 additional files

Figures

Figure 1
Flowchart of popular currently available methods and their best use cases.

Selection criteria are shown in yellow boxes and methods in green.

Figure 2
Diet to INducE remission in Crohn’s Disease (DINE-CD) study design diagram.

© 2024, BioRender Inc. Figure 2 was created using BioRender, and is published under a CC BY-NC-ND license. Further reproductions must adhere to the terms of this license.

Figure 3
Comparison of Mantel test p-values using original and log-transformed metabolite concentrations at baseline and 6 weeks.

Gray dashed lines denote nominal significance at the 0.05 level and the red dashed line is the y=x line. All tests are significant using the log-transformed data. There are large differences in p-values between the two scales, particularly at baseline.

Figure 4
Multidimensional scaling (MDS) plots from Procrustes analysis.

Color indicates omics modality. Shape indicates diet: Mediterranean (Med) or specific carbohydrate (SCD). Lines connect samples from the same subject in the metabolomics and microbial sequencing data sets. The sum of the squared lengths of these lines is smaller than expected by chance at both time points, indicating concordance between the two data sets.
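
The concordance check described in this caption can be sketched as a symmetric Procrustes fit followed by a permutation test. The block below is a minimal illustration in Python with NumPy, not the authors' code. The function names (`classical_mds`, `procrustes_ss`, `protest_pvalue`), the two-dimensional embedding, and the default of 999 permutations are assumptions made for the example, and the data are synthetic.

```python
import numpy as np


def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed an n x n distance matrix in k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:k]              # indices of the k largest eigenvalues
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))


def procrustes_ss(X, Y):
    """Residual sum of squares after translating, scaling, and rotating Y onto X."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    X = X / np.linalg.norm(X)                     # unit Frobenius norm
    Y = Y / np.linalg.norm(Y)
    s = np.linalg.svd(X.T @ Y, compute_uv=False)
    return 1.0 - s.sum() ** 2                     # symmetric Procrustes statistic


def protest_pvalue(X, Y, n_perm=999, seed=0):
    """Permutation p-value: how often shuffled sample labels fit at least as well."""
    rng = np.random.default_rng(seed)
    observed = procrustes_ss(X, Y)
    hits = 0
    for _ in range(n_perm):
        shuffled = Y[rng.permutation(len(Y))]     # break the subject pairing
        hits += procrustes_ss(X, shuffled) <= observed
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic stand-ins for the two ordinations (rows are subjects).
rng = np.random.default_rng(1)
microbe_coords = rng.normal(size=(20, 2))
metabolite_coords = microbe_coords + rng.normal(scale=0.3, size=(20, 2))
ss, p_value = protest_pvalue(microbe_coords, metabolite_coords)
```

A small residual sum of squares relative to the permuted fits corresponds to the short connecting lines in the plots.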

Figure 5
Heatmap of Spearman’s correlation between top metabolites and microbes contributing to the first latent factor identified by multiomics factor analysis (MOFA).

Correlations are calculated using log/clr-transformed data and clusters are identified from hierarchical clustering. There are four distinct microbe-metabolite clusters, two with positive correlations and two with negative correlations. clr, centered log-ratio.

Figure 6
Scatterplot of Spearman’s correlation estimated on the original (y-axis) and log/clr-transformed (x-axis) data at 6 weeks.

Color denotes significance in the log-transformed data only (green), the original data only (blue), or both (red). Identified significant pairs vary based on which data scale is used. clr, centered log-ratio.
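
One reason the data scale can matter even for a rank-based measure is sketched below. A plain log transform is monotone, so it leaves the ranks of a single feature across samples unchanged, whereas the clr subtracts a per-sample mean log abundance and can therefore reorder samples. This is a hedged illustration in Python with NumPy on synthetic data. The helper names and the pseudocount of 0.5 are assumptions for the example and are not taken from the article.

```python
import numpy as np


def clr(abundances, pseudocount=0.5):
    """Centered log-ratio: log value minus the mean log value within each sample (row)."""
    logged = np.log(np.asarray(abundances, dtype=float) + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)


def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(x, y):
    """Spearman's correlation as the Pearson correlation of the ranks."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]


# Synthetic stand-ins: one metabolite and a samples x taxa table of counts.
rng = np.random.default_rng(0)
metabolite = rng.lognormal(size=30)
counts = rng.poisson(lam=20, size=(30, 5))

rho_original = spearman(metabolite, counts[:, 0])
# Logging the metabolite alone cannot change Spearman's correlation.
rho_log_only = spearman(np.log(metabolite), counts[:, 0])
# clr-transforming the microbes can, because each sample is re-centered.
rho_log_clr = spearman(np.log(metabolite), clr(counts)[:, 0])
```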

Tables

Table 1
Examples of available multiomics integration methods, along with main advantages and disadvantages, split by analysis type: global, feature-wise, network, longitudinal, mediation.
| Analysis type | Methods | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Global | Mantel test, multivariate MiRKAT, (sparse) CCA, (sparse) PLS, Procrustes analysis | Attributes variation in single analytes into other molecular modalities; extracts strongest signals for covariation between molecular profiles | Covariation signals can lack interpretability; microbiome-specific properties must be properly adjusted for with advanced methods |
| Feature-wise | Pairwise correlation (Pearson, Spearman’s, and Kendall’s tau), MiRKAT, HAllA, log-linear contrast regression, Dirichlet-multinomial regression | Assumes ‘guilt-by-association’; individual tests easily implemented; appropriate for initial hypothesis generation | Potential correlation structure in molecular profiles must be adjusted for to control false discoveries |
| Network | SPIEC-EASI (transkingdom), MIMOSA2, AMON, DIABLO, MiMeNet | Decipher complex interaction patterns between microbes and other molecular features; identify stable ‘hubs’ in community structures | Complex networks require regularization; differential network analysis to contrast between host conditions difficult to perform |
| Longitudinal | Linear mixed effects models, GLMMlasso, dynamic Bayesian networks | Detects patterns in microbes and other analytes over time or space; facilitates understanding of causal relationship | Requires knowledge of directionality; regularization or FDR control is needed; large sample sizes needed |
| Mediation | Linear structural equation models, compositional mediation analysis | Quantifies direct and indirect effects; incorporates demographic and/or clinical information | Prior knowledge of confounding and causal relationships are needed |
Table 2
Correlations from Mantel tests at baseline and 6 weeks (W6) using three different distances for the original metabolite concentrations.

Microbial distance was measured using the Bray-Curtis distance. Both Pearson and Spearman’s correlations were assessed. Correlation estimates and their significance (denoted by *) vary with the choice of distance and correlation type.

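| Metabolite distance | Correlation | Baseline r | W6 r |
| --- | --- | --- | --- |
| Euclidean | Pearson | 0.022 | 0.222* |
| Manhattan | Pearson | 0.083 | 0.239* |
| Canberra | Pearson | 0.148 | 0.141 |
| Euclidean | Spearman | 0.061 | 0.244* |
| Manhattan | Spearman | 0.146 | 0.266* |
| Canberra | Spearman | 0.153 | 0.171* |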
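
A Mantel test of the kind summarized in Table 2 can be sketched as follows: compute one distance matrix per modality, correlate their upper triangles, and compare against correlations obtained after permuting the sample labels of one matrix. This is a minimal Python/NumPy illustration under stated assumptions, not the authors' implementation. The function names, the one-sided alternative, the default of 999 permutations, and the synthetic data are all choices made for the example.

```python
import numpy as np


def pairwise_distance(X, metric):
    """n x n distance matrix between the rows of X (assumed non-negative for Bray-Curtis)."""
    X = np.asarray(X, dtype=float)
    diff = np.abs(X[:, None, :] - X[None, :, :])
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return diff.sum(axis=2)
    if metric == "canberra":
        denom = np.abs(X)[:, None, :] + np.abs(X)[None, :, :]
        # Terms with a zero denominator contribute zero.
        return np.divide(diff, denom, out=np.zeros_like(diff), where=denom > 0).sum(axis=2)
    if metric == "braycurtis":
        total = (X[:, None, :] + X[None, :, :]).sum(axis=2)
        return np.divide(diff.sum(axis=2), total, out=np.zeros_like(total), where=total > 0)
    raise ValueError(f"unknown metric: {metric}")


def _upper(D):
    return D[np.triu_indices(D.shape[0], k=1)]


def _rank_matrix(D):
    """Replace off-diagonal distances by their average ranks (for Spearman)."""
    values = _upper(D)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    R = np.zeros_like(D, dtype=float)
    R[np.triu_indices(D.shape[0], k=1)] = ranks
    return R + R.T


def mantel(D1, D2, method="pearson", n_perm=999, seed=0):
    """Mantel correlation and one-sided permutation p-value."""
    if method == "spearman":
        D1, D2 = _rank_matrix(D1), _rank_matrix(D2)
    rng = np.random.default_rng(seed)
    x = _upper(D1)
    r_obs = np.corrcoef(x, _upper(D2))[0, 1]
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(D2.shape[0])          # relabel samples in D2 only
        hits += np.corrcoef(x, _upper(D2[np.ix_(p, p)]))[0, 1] >= r_obs
    return r_obs, (hits + 1) / (n_perm + 1)


# Synthetic stand-ins: microbial abundances and metabolite concentrations.
rng = np.random.default_rng(2)
microbes = rng.random((15, 6))
metabolites = rng.lognormal(size=(15, 8))
r, p_value = mantel(pairwise_distance(microbes, "braycurtis"),
                    pairwise_distance(metabolites, "canberra"),
                    method="spearman")
```

Swapping the `metric` and `method` arguments reproduces the kind of comparison shown in the table, where both choices change the estimated correlation.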
Table 3
MMiRKAT p-values across several different coefficient of variation quantile cutoffs to select the most variable metabolites using the original and log-transformed metabolite concentrations.

Bray-Curtis distance is used for microbiome data.

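| Quantile (q) | Baseline, original | Baseline, log | Week 6, original | Week 6, log |
| --- | --- | --- | --- | --- |
| q = 0.5 | 0.163 | 0.170 | 0.064 | 0.250 |
| q = 0.6 | 0.158 | 0.092 | 0.016 | 0.174 |
| q = 0.7 | 0.109 | 0.166 | 0.003 | 0.101 |
| q = 0.8 | 0.501 | 0.182 | 0.001 | 0.011 |
| q = 0.9 | 0.834 | 0.030 | <0.001 | 0.004 |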
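
The filtering step behind Table 3 can be sketched as keeping the metabolites whose coefficient of variation (CV) is at or above a chosen quantile of all CVs. The Python/NumPy block below is an illustration only and does not implement MMiRKAT itself. The function name `filter_by_cv`, the use of the sample standard deviation, and the choice to keep features at or above (rather than strictly above) the cutoff are assumptions.

```python
import numpy as np


def filter_by_cv(X, q):
    """Keep columns of X whose coefficient of variation is >= the q-th quantile.

    X is samples x metabolites with positive column means.
    Returns the filtered matrix and the boolean mask of kept columns.
    """
    X = np.asarray(X, dtype=float)
    cv = X.std(axis=0, ddof=1) / X.mean(axis=0)   # per-metabolite CV
    cutoff = np.quantile(cv, q)
    keep = cv >= cutoff
    return X[:, keep], keep


# Synthetic stand-in: 25 samples, 40 metabolites with varying spread.
rng = np.random.default_rng(0)
concentrations = rng.lognormal(sigma=rng.uniform(0.1, 1.5, size=40), size=(25, 40))

for q in (0.5, 0.6, 0.7, 0.8, 0.9):
    filtered, mask = filter_by_cv(concentrations, q)
    # `filtered` would then be passed to the association test of choice.
```

Raising q keeps fewer, more variable metabolites, which is one way the tested feature set (and hence the p-value) changes across rows of the table.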
Table 4
MMiRKAT p-values across several different pseudocounts for the log-transformation of metabolite concentrations.

Coefficient of variation quantile cutoff is held constant at 0.9. Bray-Curtis distance is used for microbiome data. Choice of pseudocount influences p-value and significance.

| Pseudocount | Baseline | Week 6 |
| --- | --- | --- |
| 1×10^-6 | 0.220 | 0.018 |
| 1×10^-5 | 0.005 | 0.003 |
| 1×10^-4 | 0.020 | 0.008 |
| 1×10^-3 | 0.030 | 0.004 |
| 1×10^-2 | 0.156 | 0.010 |
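
A small numerical sketch of why the pseudocount can matter: when a concentration is zero, log(0 + c) is determined entirely by c, so the gap between a zero and a nonzero sample on the log scale grows as c shrinks, while gaps between nonzero samples barely move. The Python/NumPy block below uses hypothetical concentrations and a hypothetical helper, `log_gap`. It illustrates the mechanism only and is not the analysis behind Table 4.

```python
import numpy as np

PSEUDOCOUNTS = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]   # the grid listed in Table 4


def log_transform(concentrations, pseudocount):
    """Natural log after adding a pseudocount to every value."""
    return np.log(np.asarray(concentrations, dtype=float) + pseudocount)


def log_gap(a, b, pseudocount):
    """Absolute difference between two concentrations on the log scale."""
    la, lb = log_transform([a, b], pseudocount)
    return abs(lb - la)


# Hypothetical concentrations of one metabolite: one zero, two nonzero values.
zero, low, high = 0.0, 0.2, 0.5

for c in PSEUDOCOUNTS:
    gap_with_zero = log_gap(zero, low, c)   # dominated by the pseudocount
    gap_nonzero = log_gap(low, high, c)     # nearly unaffected by the pseudocount
    print(f"c={c:g}  zero-vs-nonzero gap={gap_with_zero:.2f}  nonzero gap={gap_nonzero:.2f}")
```

Because distances between samples feed into the test, a feature with zeros can dominate or fade depending on c.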

Additional files

Supplementary file 1

Pairwise correlations.

Table of estimated Pearson and Spearman’s correlations for all microbe and metabolite pairs. Correlations are calculated at baseline and 6 weeks using the original and log-transformed data.

https://cdn.elifesciences.org/articles/88956/elife-88956-supp1-v1.xlsx
Supplementary file 2

Hierarchical all-against-all (HAllA) clusters.

Table of significant HAllA clusters and their associated adjusted p-value.

https://cdn.elifesciences.org/articles/88956/elife-88956-supp2-v1.xlsx
Supplementary file 3

Linear regression coefficients.

Table of regression coefficients, standard errors, z-scores, and p-values from the model regressing each metabolite on each microbe and prior surgery status. Adjusted p-values, after false discovery rate (FDR) correction at the 0.05 level, for the microbe coefficient are reported as well. Models are fit at both baseline and week 6.

https://cdn.elifesciences.org/articles/88956/elife-88956-supp3-v1.xlsx
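
The per-pair models summarized in Supplementary file 3 can be sketched as an ordinary least squares fit of one metabolite on one microbe plus prior surgery status, followed by a Benjamini-Hochberg adjustment across all pairs. The block below is a hedged Python/NumPy illustration on synthetic data. The function names, the normal approximation used to turn z-scores into two-sided p-values, and the choice of Benjamini-Hochberg as the FDR procedure are assumptions, since the file description does not specify them.

```python
import math

import numpy as np


def fit_pair(y, microbe, surgery):
    """OLS of y on intercept + microbe + surgery; returns stats for the microbe term."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(y), microbe, surgery])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])       # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    se = math.sqrt(cov[1, 1])
    z = beta[1] / se
    p = math.erfc(abs(z) / math.sqrt(2.0))               # two-sided normal p-value
    return beta[1], se, z, p


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]   # enforce monotonicity
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


# Synthetic stand-ins: 30 subjects, 4 microbes, 3 metabolites, binary surgery status.
rng = np.random.default_rng(0)
microbes = rng.normal(size=(30, 4))
metabolites = rng.normal(size=(30, 3))
surgery = rng.integers(0, 2, size=30)

results = [(i, j) + fit_pair(metabolites[:, i], microbes[:, j], surgery)
           for i in range(3) for j in range(4)]
adjusted = bh_adjust([row[-1] for row in results])
significant = adjusted < 0.05        # pairs passing FDR control at the 0.05 level
```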

Cite this article

Rebecca A Deek, Siyuan Ma, James Lewis, Hongzhe Li (2024) Statistical and computational methods for integrating microbiome, host genomics, and metabolomics data. eLife 13:e88956. https://doi.org/10.7554/eLife.88956