Optimal transport for automatic alignment of untargeted metabolomic data

eLife assessment

The authors describe an important tool, GromovMatcher, that can be used to compare metabolomic data from various experimental approaches. The underlying method is innovative, the algorithm is clearly described, and the validation that is presented is convincing.

https://doi.org/10.7554/eLife.91597.3.sa0

Significance of the findings:

Important: Findings that have theoretical or practical implications beyond a single subfield

Strength of evidence:

Convincing: Appropriate and validated methodology in line with current state-of-the-art

During the peer-review process the editor and reviewers write an eLife Assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate).

Abstract

Untargeted metabolomic profiling through liquid chromatography-mass spectrometry (LC-MS) measures a vast array of metabolites within biospecimens, advancing drug development, disease diagnosis, and risk prediction. However, the low throughput of LC-MS poses a major challenge for biomarker discovery, annotation, and experimental comparison, necessitating the merging of multiple datasets. Current data pooling methods encounter practical limitations due to their vulnerability to data variations and hyperparameter dependence. Here, we introduce GromovMatcher, a flexible and user-friendly algorithm that automatically combines LC-MS datasets using optimal transport. By capitalizing on feature intensity correlation structures, GromovMatcher delivers superior alignment accuracy and robustness compared to existing approaches. This algorithm scales to thousands of features requiring minimal hyperparameter tuning. Manually curated datasets for validating alignment algorithms are limited in the field of untargeted metabolomics, and hence we develop a dataset split procedure to generate pairs of validation datasets to test the alignments produced by GromovMatcher and other methods. Applying our method to experimental patient studies of liver and pancreatic cancer, we discover shared metabolic features related to patient alcohol intake, demonstrating how GromovMatcher facilitates the search for biomarkers associated with lifestyle risk factors linked to several cancer types.

Introduction

Untargeted metabolomics is a powerful analytical technique used to identify and measure a large number of metabolites in a biological sample without preselecting targets (Patti, 2011). This approach allows for a comprehensive overview of an individual’s metabolic profile, provides insights into the biochemical processes involved in cellular and organismal physiology (Wishart, 2019; Pirhaji et al., 2016), and allows for the exploration of how environmental factors impact metabolism (Rappaport et al., 2014; Bedia, 2022). It creates new opportunities to investigate health-related conditions, including diabetes (Wang et al., 2011), inflammatory bowel diseases (Franzosa et al., 2019), and various cancer types (Loftfield et al., 2021; Li et al., 2020). However, a major challenge in biomarker discovery, metabolic signature identification, and other untargeted metabolomic analyses lies in the low throughput of experimental data, necessitating the development of efficient pooling algorithms capable of merging datasets from multiple sources (Loftfield et al., 2021).

A common experimental technique in untargeted metabolomics is liquid chromatography-mass spectrometry (LC-MS) which assembles a list of thousands of unlabeled metabolic features characterized by their mass-to-charge ratio ( $m / z$ ), retention time (RT; Zhou et al., 2012), and intensity across all biological samples. Combining LC-MS datasets from multiple experimental studies remains challenging due to variation in the $m / z$ and RT of a feature from one study to another (Zhou et al., 2012; Ivanisevic and Want, 2019). This problem is further compounded by differing instruments and analytical protocols across laboratories, resulting in seemingly incompatible metabolomic datasets.

Manual matching of metabolic features can be a laborious and error-prone task (Loftfield et al., 2021). To address this challenge, several automated methods have been developed for metabolic feature alignment. One such method is MetaXCMS, which matches LC-MS features based on user-defined $m / z$ and RT thresholds (Tautenhahn et al., 2011). More advanced tools use information on feature intensities measured in samples. For instance, PAIRUP-MS uses known shared metabolic features to impute the intensities of all features from one dataset to another (Hsu et al., 2019). MetabCombiner (Habra et al., 2021) and M2S (Climaco Pinto et al., 2022) compare average feature intensities, along with their $m / z$ and RT values, to align datasets without requiring extensive knowledge of shared features. These automated alignment methods have accelerated our ability to pool and annotate datasets as well as extract biologically meaningful biomarkers. However, they demand substantial fine-tuning of user-defined parameters and ignore correlations among metabolic features, which provide a wealth of additional information on shared features.

Here, we introduce GromovMatcher, a user-friendly, flexible algorithm which automates the matching of metabolic features across experiments. The main technical innovation of GromovMatcher lies in its ability to incorporate the correlation information between metabolic feature intensities, building upon the powerful mathematical framework of computational optimal transport (OT; Peyré and Cuturi, 2019; Villani, 2021). OT has proven effective in solving various matching problems and has found applications in multiomics analysis (Demetci et al., 2022), cell development (Schiebinger et al., 2019; Yang et al., 2020), and chromatogram alignment (Skoraczyński et al., 2022). Here, we leverage the Gromov-Wasserstein (GW) method (Mémoli, 2011; Solomon et al., 2016), which matches datasets based on their distance structure and has been seminally applied to spatial reconstruction problems in genomics (Nitzan et al., 2019). GromovMatcher builds upon the GW algorithm to automatically uncover the shared correlation structure among metabolic feature intensities while also incorporating $m / z$ and RT information in the final matching process.

To assess the performance of GromovMatcher, we systematically benchmark it on synthetic data with varying levels of noise, feature overlap, and data normalizations, outperforming prior state-of-the-art methods of metabCombiner (Habra et al., 2021) and M2S (Climaco Pinto et al., 2022). Next, we apply GromovMatcher to align experimental patient studies of liver and pancreatic cancer to a reference dataset and associate the shared metabolic features to each patient’s alcohol intake. Through these efforts, we demonstrate how GromovMatcher data pooling improves our ability to discover biomarkers of lifestyle risk factors associated with several types of cancer.

Results

GromovMatcher algorithm

GromovMatcher uses the mathematical framework of OT to find all matching metabolic features between two untargeted metabolomic datasets (Figure 1). It accepts two LC-MS datasets with possibly different numbers of metabolic features and samples. Each feature, ${fx}_{i}$ in Dataset 1 and ${fy}_{j}$ in Dataset 2, is identified by its $m / z$ , RT, and vector of feature intensities across samples (Figure 1a). The primary tenet of GromovMatcher is that shared metabolic features have similar correlation patterns in both datasets and can be matched based on the distance/correlations between their feature intensity vectors. Specifically, GromovMatcher computes the pairwise distances between the feature intensity vectors of each metabolic feature in a dataset and saves them into a distance matrix, one per dataset (Figure 1b). In practice, we use either the Euclidean distance or the cosine distance (negative of correlation) to perform this step (Materials and methods). The resulting distance matrices contain information about the feature intensity similarity within each study. Using optimal transport, we can deduce shared subsets of metabolic features in both datasets which have corresponding feature intensity distance structures.
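The distance-matrix step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `feature_distance_matrix` is hypothetical, and the cosine option assumes the common convention of one minus the cosine similarity (for centered intensity vectors this equals one minus the Pearson correlation, which differs from the negative correlation only by a constant shift).

```python
import numpy as np

def feature_distance_matrix(intensities, metric="euclidean"):
    """Pairwise distances between feature intensity vectors.

    intensities: array of shape (n_features, n_samples), one row per feature.
    Returns a symmetric (n_features, n_features) matrix.
    """
    X = np.asarray(intensities, dtype=float)
    if metric == "euclidean":
        sq_norms = np.sum(X**2, axis=1)
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
        D = np.sqrt(np.maximum(sq_dists, 0.0))  # clip tiny negatives from round-off
    elif metric == "cosine":
        # Assumed convention: 1 - cosine similarity.
        unit = X / np.linalg.norm(X, axis=1, keepdims=True)
        D = 1.0 - unit @ unit.T
    else:
        raise ValueError(f"unknown metric: {metric}")
    np.fill_diagonal(D, 0.0)  # a feature is at distance zero from itself
    return D

# Two toy datasets with different numbers of features and samples.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(5, 30))  # 5 features measured in 30 samples
X2 = rng.normal(size=(4, 20))  # 4 features measured in 20 samples
D1 = feature_distance_matrix(X1)
D2 = feature_distance_matrix(X2, metric="cosine")
```

Each dataset yields its own square matrix whose size depends only on its number of features, so the two matrices can be compared even though the studies have different sample sizes.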

Figure 1

An optimal transport approach for combining untargeted metabolomics datasets (GromovMatcher).

(a) Inputs are two LC-MS datasets of unlabeled metabolic features (rows) identified by their $m / z$ , RT, and feature intensities across biospecimen samples. Both studies can have differing numbers of metabolic features and samples. (b) In both datasets, the intensities across samples of each metabolic feature are formed into a vector and Euclidean distances between these feature vectors are computed and stored in a distance matrix. (c) Based on the technique of optimal transport, the unbalanced GW algorithm learns a coupling matrix $\tilde{Π}$ that places large weights ${\tilde{Π}}_{i j} \geq 0$ when ${fx}_{i}$ and ${fy}_{j}$ likely correspond to the same metabolic feature. It optimizes $\tilde{Π}$ to match features with similar pairwise distances (red outlined boxes) whose $m / z$ ratios are close. (d) The final step of GromovMatcher plots the retention times of features from both datasets against each other and fits a spline interpolation $\hat{f}$ weighted by the estimated coupling weights $\tilde{Π}$ . This retention time drift function is then used to set all entries ${\tilde{Π}}_{i j}$ to zero for those outlier pairs $({fx}_{i}, {fy}_{j})$ which exceed twice the median absolute deviation (MAD) around $\hat{f}$ (green highlighted region). Finally, the coupling matrix $\tilde{Π}$ is filtered and/or thresholded to obtain a refined coupling $\hat{Π}$ which is then binarized to obtain a one-to-one matching $M$ between a subset of metabolite pairs in both datasets.

OT was originally developed to optimize the transportation of soil for the construction of forts (Monge, 1781) and was later generalized through the language of probability theory and linear programming (Kantorovich, 2006), leading to efficient numerical algorithms and direct applications to planning problems in economics. The ability of OT to efficiently match source to target locations found applications in data science for the alignment of distributions (Courty et al., 2017; Alvarez-Melis et al., 2019) and was generalized by the Gromov-Wasserstein (GW) method (Peyré et al., 2016; Alvarez-Melis and Jaakkola, 2018) to align datasets with features of differing dimensions.

In practice, a sizeable fraction of the metabolic features measured in one study may not be present in the other. Hence, in most cases only a subset of features in both datasets can be matched. Recent GW formulations for unbalanced matching problems (Sejourne et al., 2021) allow for matching only subsets of metabolic features with similar intensity structures (Figure 1c). To incorporate additional feature information, we modify the optimization objective of unbalanced GW to penalize feature matches whose $m / z$ differences exceed a fixed threshold (Materials and methods, Appendix 1). The optimization of this objective computes a coupling matrix $\tilde{Π}$ where each entry ${\tilde{Π}}_{i j} \geq 0$ indicates the level of confidence in matching metabolic feature ${fx}_{i}$ in Dataset 1 to ${fy}_{j}$ in Dataset 2.
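As a rough illustration of the $m / z$ constraint only, the sketch below builds a mask of feature pairs whose $m / z$ difference stays within a fixed threshold. The function name, the threshold value `mz_gap=0.01`, and the 0/1 penalty are all hypothetical; how the penalty actually enters the unbalanced GW objective is specified in Materials and methods and Appendix 1 and is not reproduced here.

```python
import numpy as np

def mz_compatibility_mask(mz1, mz2, mz_gap=0.01):
    """Boolean (p1, p2) mask that is True where |m/z difference| <= mz_gap.

    mz_gap is an illustrative value, not the threshold used in the paper.
    """
    mz1 = np.asarray(mz1, dtype=float)
    mz2 = np.asarray(mz2, dtype=float)
    return np.abs(mz1[:, None] - mz2[None, :]) <= mz_gap

mz1 = np.array([100.000, 200.005, 350.120])           # Dataset 1 features
mz2 = np.array([100.004, 350.118, 500.300, 200.050])  # Dataset 2 features
mask = mz_compatibility_mask(mz1, mz2)

# Hypothetical penalty term: zero for compatible pairs, one otherwise.
penalty = np.where(mask, 0.0, 1.0)
```

In this toy example only two pairs remain compatible, and the second feature of Dataset 1 has no admissible partner, which mirrors the situation where a feature is absent from the other study.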

Differences in experimental conditions can induce variations in RT between datasets that can be nonlinear and large in magnitude (Zhou et al., 2012; Climaco Pinto et al., 2022; Habra et al., 2021). In the spirit of previous methods for LC-MS batch or dataset alignment (Smith et al., 2006; Brunius et al., 2016; Liu et al., 2020; Vaughan et al., 2012; Habra et al., 2021; Climaco Pinto et al., 2022; Skoraczyński et al., 2022), the learned coupling $\tilde{Π}$ is used to estimate a nonlinear map (drift function) between RTs of both datasets by weighted spline regression, which allows us to filter unlikely matches from the coupling matrix to obtain a refined coupling matrix $\hat{Π}$ (Figure 1d, Materials and methods). An optional thresholding step removes matches with small weights from the coupling matrix. The final output of GromovMatcher is a binary matching matrix $M$ where $M_{i j}$ is equal to 1 if features ${fx}_{i}$ and ${fy}_{j}$ are matched and 0 otherwise. Throughout the paper, we refer to the two variants of GromovMatcher, with and without the optional thresholding step as GMT and GM respectively.
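The RT filtering step can be sketched as follows, under several stated assumptions. A weighted cubic polynomial fit (`np.polyfit`) stands in for the weighted spline regression used by GromovMatcher, and the outlier rule is one literal reading of the description: a pair is discarded when its absolute deviation from the fitted drift curve exceeds twice the median absolute deviation. The function name and parameters are hypothetical.

```python
import numpy as np

def filter_by_rt_drift(coupling, rt1, rt2, degree=3, n_mad=2.0):
    """Zero out coupling entries whose RT pair lies far from a fitted drift curve.

    coupling: non-negative (p1, p2) matrix; rt1, rt2: retention times.
    A polynomial replaces the spline here purely for illustration.
    """
    coupling = np.asarray(coupling, dtype=float)
    rt1 = np.asarray(rt1, dtype=float)
    rt2 = np.asarray(rt2, dtype=float)
    rows, cols = np.nonzero(coupling)
    x, y, w = rt1[rows], rt2[cols], coupling[rows, cols]
    # np.polyfit multiplies residuals by its weights before squaring,
    # so sqrt(w) weights each squared residual by the coupling weight w.
    coeffs = np.polyfit(x, y, deg=degree, w=np.sqrt(w))
    deviations = np.abs(y - np.polyval(coeffs, x))
    mad = np.median(deviations)  # median absolute deviation around the curve
    keep = deviations <= n_mad * mad
    refined = np.zeros_like(coupling)
    refined[rows[keep], cols[keep]] = w[keep]
    return refined, coeffs

# Toy example: 20 plausible pairs along a drift curve plus one spurious pair.
rng = np.random.default_rng(0)
rt1 = np.linspace(1.0, 10.0, 20)
rt2 = 1.1 * rt1 + 0.2 + rng.normal(scale=0.02, size=20)
coupling = np.eye(20)
coupling[0, 19] = 0.01  # low-weight candidate with an implausible RT pair
refined, coeffs = filter_by_rt_drift(coupling, rt1, rt2)
```

Because the fit is weighted by the coupling, low-confidence candidates have little influence on the estimated drift, so an implausible pair ends up far from the curve and is removed.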

Validation on ground-truth data

We first evaluate the performance of GromovMatcher using a real-world untargeted metabolomics study of cord blood across 499 newborns containing 4712 metabolic features characterized by their $m / z$, RT, and feature intensities (Alfano et al., 2020). To generate ground-truth data, we randomly divide the initial dataset into two smaller datasets sharing a subset of features (Figure 2). We simulate diverse acquisition conditions by adding noise to the $m / z$ and RT of dataset 2, and to the feature intensities in both datasets. Moreover, we introduce an RT drift in dataset 2 to replicate the retention time variations observed in real LC-MS experiments (Materials and methods). For comparison, we also test M2S (Climaco Pinto et al., 2022) and metabCombiner (Habra et al., 2021), both of which use $m / z$, RT, and median or mean feature intensities to match features (Figure 3). MetabCombiner is supplied with 100 known shared metabolic features to automatically set its hyperparameters, while M2S parameters are manually fine-tuned to optimize the F1-score in each scenario (Appendix 2). We assess the performance of GM, GMT, metabCombiner, and M2S on 20 randomly generated dataset pairs, reporting the precision (fraction of true matches among the detected matches) and recall/sensitivity (fraction of true matches detected) averaged over these pairs.
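The two evaluation metrics follow directly from their definitions. The sketch below assumes that both the estimated and the ground-truth matchings are given as binary matrices of the same shape; the function name `precision_recall` is illustrative.

```python
import numpy as np

def precision_recall(M_est, M_true):
    """Precision and recall of an estimated binary matching matrix."""
    M_est = np.asarray(M_est, dtype=bool)
    M_true = np.asarray(M_true, dtype=bool)
    true_pos = np.sum(M_est & M_true)   # matches that are both detected and true
    n_detected = M_est.sum()
    n_true = M_true.sum()
    precision = true_pos / n_detected if n_detected else 0.0
    recall = true_pos / n_true if n_true else 0.0
    return float(precision), float(recall)

M_true = np.eye(4, 5, dtype=int)              # four true matches
M_est = np.zeros((4, 5), dtype=int)
M_est[0, 0] = M_est[1, 1] = M_est[2, 2] = 1   # three correct matches
M_est[3, 4] = 1                               # one incorrect match
precision, recall = precision_recall(M_est, M_true)
```

Here three of the four detected matches are correct and three of the four true matches are recovered, so both metrics equal 0.75.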

Figure 2

Simulated data for testing untargeted metabolomics alignment methods.

(a) Initial LC-MS dataset taken from the EXPOsOMICS project with $m / z$, RT, and feature intensities of $p = 4{,}712$ metabolites identified in cord blood across $n = 499$ newborns. (b) Newborns (rows) are split into two disjoint groups of sizes $n_{1} = 249$ and $n_{2} = 250$ respectively and metabolic features (columns) are split into two equal groups of size $p_{1} = p_{2}$ with overlap $λ p$ where $λ = 0.25, 0.5, 0.75$ (Materials and methods). Datasets are perturbed by additive noise of magnitude $(σ_{M}, σ_{RT}, σ_{FI})$ and a nonlinear drift $f (x)$ is applied to the RTs of dataset 2. (c) The two resulting datasets share 25%, 50%, or 75% of the original dataset’s metabolic features.
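The split in panel (b) can be sketched as below. This is an assumption-laden illustration, not the procedure from Materials and methods: the function name is hypothetical, non-shared features are divided evenly between the two datasets so that $p_{1} = p_{2}$, and the noise and RT drift perturbations are omitted.

```python
import numpy as np

def split_dataset(X, overlap, rng):
    """Split a (n_samples, n_features) table into two tables sharing features.

    overlap: fraction of the original features present in both outputs.
    Returns the two tables and the original feature indices kept in each.
    """
    n, p = X.shape
    sample_perm = rng.permutation(n)
    s1, s2 = sample_perm[: n // 2], sample_perm[n // 2 :]  # disjoint samples
    feat_perm = rng.permutation(p)
    n_shared = int(round(overlap * p))
    n_specific = (p - n_shared) // 2  # features unique to each dataset
    shared = feat_perm[:n_shared]
    only1 = feat_perm[n_shared : n_shared + n_specific]
    only2 = feat_perm[n_shared + n_specific : n_shared + 2 * n_specific]
    f1 = np.concatenate([shared, only1])
    f2 = np.concatenate([shared, only2])
    return X[np.ix_(s1, f1)], X[np.ix_(s2, f2)], f1, f2

rng = np.random.default_rng(0)
X = rng.normal(size=(11, 20))  # toy table: 11 samples, 20 features
X1, X2, f1, f2 = split_dataset(X, overlap=0.5, rng=rng)
```

With an odd number of samples the second group receives the extra sample, matching the 249/250 split of the 499 newborns, and the index arrays `f1` and `f2` provide the ground-truth matching for evaluation.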

Figure 3 with 2 supplements

Comparison of MetabCombiner, M2S, and GromovMatcher on simulated data.

(a) Ground-truth matchings, and matchings inferred by metabCombiner, M2S, GM, and GMT. Pairs of datasets are generated for three levels of overlap (low, medium and high), with a medium noise level (Materials and methods). Matches correctly recovered (true positives) are represented in green. True matches that are not recovered (false negatives) are highlighted in grey. Incorrect matches (false positives) are plotted in red. Features in rows and columns of matching matrices are reordered for visual clarity. (b) Average precision and recall on 20 randomly generated pairs of datasets, for three levels of overlap (low, medium, and high) with a medium noise level.

To investigate how the number of shared features affects dataset alignment, we generate pairs of LC-MS datasets with low, medium, and high feature overlap (25%, 50%, and 75%), while maintaining a medium noise level (Materials and methods). Here, we find that GM and GMT generally outperform existing alignment methods, with a recall above 0.95 while metabCombiner and M2S tend to be less sensitive (Figure 3b). All methods drop in precision as the feature overlap is decreased, with GM and GMT still maintaining an average precision above 0.8.

Next we evaluate all four methods at low, moderate, and high noise levels for pairs of datasets with 50% overlap in their features (Materials and methods). Our results show that GMT, GM, and M2S maintain an average recall above 0.89, while metabCombiner’s recall drops below 0.6 for high noise. At large noise levels, RT drift estimation becomes more challenging, leading to a higher rate of false matches between metabolites (lower precision) for all four methods (Figure 3—figure supplement 1). Nevertheless, GMT obtains a high average precision and recall of 0.86 and 0.92, respectively.

A notable difference between GM, metabCombiner, and M2S lies in their use of feature intensities. MetabCombiner expects that the mean feature intensity rankings are identical across studies, while M2S assumes that shared features have similar median intensities. In contrast, GM uses both the mean feature intensities and their variances and covariances. In practice, differences in experimental assays or study populations can lead to greater variation in feature intensities, making matchings based on these statistics less reliable. Centering and scaling the feature intensities to unit variance avoids potential biases arising from inconsistent feature intensity magnitudes, but preserves correlations that GM leverages.

Exploring this further, we test how sensitive all four methods are to centering and scaling of feature intensities. MetabCombiner and M2S are tuned using the same methodology as for non-centered and non-scaled data. For M2S, we match features solely based on their $m / z$ and RT. In this experiment (Figure 3—figure supplement 2), the absence of intensity magnitude information significantly affects metabCombiner’s performance and, to a lesser extent, M2S. GM and GMT still obtain accurate matchings, due to their use of correlation structures which are preserved under centering and scaling.

Application to EPIC data

Next, we apply GM, metabCombiner and M2S to align datasets from the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort, a prospective study conducted across 23 European centers. EPIC comprises more than 500,000 participants who provided blood samples at recruitment (Riboli et al., 2002). Untargeted metabolomics data were successively acquired in several studies nested within the full cohort.

In the present work, we use LC-MS data from the EPIC cross-sectional (CS) study (Slimani et al., 2003) and two matched case-control studies nested within EPIC, on hepatocellular carcinoma (HCC; Stepien et al., 2016; Stepien et al., 2021) and pancreatic cancer (PC; Gasull et al., 2019). LC-MS untargeted metabolomic data were acquired at the International Agency for Research on Cancer, making use of the same platform and methodology (Materials and methods). The number of samples and features in each study is displayed in Figure 4a.

Figure 4 with 3 supplements

Application of GromovMatcher and comparison to existing methods on EPIC dataset.

(a) Dimensions of the three EPIC studies used. For each ionization mode, the cross-sectional (CS) study is aligned successively with the hepatocellular carcinoma (HCC) study and the pancreatic cancer (PC) study. (b) Demonstration of expert manual matching and GromovMatcher (GM) matching between the CS and HCC studies in positive mode. Experts manually match 90 features (Table 1) from Loftfield et al., 2021 and the correlation matrices of these features in both datasets have similar structure (bottom two matrices). GM discovers 996 shared features between the CS and HCC datasets which have similar correlation structure (top two matrices). We validate that 88 of the 90 features from the manually expert matched subset are contained in the set of features matched by GM. (c) Performance of metabCombiner (mC), M2S and GM in positive mode. Precisions and recalls are measured on a validation subset of 163 manually examined features, and 95% confidence intervals are computed using modified Wilson score intervals. (d) Performance of mC, M2S, and GM in negative mode. Precision and recall are measured on a validation subset of 42 manually examined features, and 95% confidence intervals are computed using modified Wilson score intervals. See Table 2 and Table 3 for exact precisions, recalls, and confidence intervals in positive and negative mode, respectively.

Loftfield et al., 2021 previously matched features from the CS, HCC, and PC studies in EPIC for alcohol biomarker discovery. The authors first identified 205 features (163 in positive and 42 in negative mode) associated with alcohol intake in the CS study. These features were then manually matched by an expert to features in both the HCC and PC studies (Materials and methods, Table 1). In our analysis, we use these features as a validation set and compare each method’s matchings to the expert manual matchings on this subset. Due to the imbalance between the number of positive and negative mode features in the validation subset, our main analysis focuses on the alignment results of CS with HCC and CS with PC in positive mode (Table 2). We relegate the matching results between the negative mode studies (Table 3) to Appendix 4.

Table 1

Results from the manual matching conducted for Loftfield et al., 2021.

Features from the CS study (163 features in positive mode, 42 features in negative mode) were manually investigated for matches in the HCC and PC studies.

Study	Manual matches found in positive mode	Manual matches found in negative mode
Hepatocellular carcinoma (HCC)	90	19
Pancreatic cancer (PC)	66	28

Table 2

Precision and recall on the EPIC validation subset in positive mode.

95% confidence intervals were computed using modified Wilson score intervals (Brown et al., 2001; Agresti and Coull, 1998).

	$CS ⟷ HCC$		$CS ⟷ PC$
Method	Precision	Recall	Precision	Recall
GromovMatcher	0.989 (0.939, 0.999)	0.978 (0.923, 0.996)	0.903 (0.813, 0.952)	0.985 (0.919, 0.999)
M2S	0.967 (0.908, 0.991)	0.978 (0.923, 0.996)	0.855 (0.759, 0.917)	0.985 (0.919, 0.999)
metabCombiner	0.961 (0.868, 0.993)	0.544 (0.442, 0.643)	0.967 (0.833, 0.998)	0.439 (0.326, 0.559)

Table 3

Precision and recall on the EPIC validation subset in negative mode.

95% confidence intervals were computed using modified Wilson score intervals (Brown et al., 2001; Agresti and Coull, 1998).

	$CS ⟷ HCC$		$CS ⟷ PC$
Method	Precision	Recall	Precision	Recall
GromovMatcher	0.950 (0.764, 0.997)	1.000 (0.832, 1.000)	0.929 (0.774, 0.987)	0.929 (0.774, 0.987)
M2S	1.000 (0.824, 1.000)	0.947 (0.754, 0.997)	0.931 (0.780, 0.988)	0.964 (0.823, 0.998)
metabCombiner	0.875 (0.529, 0.993)	0.368 (0.191, 0.590)	1.000 (0.845, 1.000)	0.750 (0.566, 0.873)
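For readers who want to reproduce intervals of this kind, the sketch below computes the standard Wilson score interval for a proportion. It does not include the boundary modification described by Brown et al., 2001, so bounds for proportions close to 0 or 1 may differ slightly from those reported in Table 2 and Table 3. The function name is illustrative, and the example uses the 88 of 90 expert-matched features recovered by GM between the CS and HCC studies.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Standard (unmodified) Wilson score interval for a binomial proportion.

    z = 1.959964 corresponds to a two-sided 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Recall of GM on the CS-HCC validation subset in positive mode: 88 of 90.
low, high = wilson_interval(88, 90)
```

For this example the interval runs from roughly 0.92 to 0.99, close to the reported recall interval, with the remaining difference at the upper bound attributable to the omitted modification.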

In this section, we use the same settings for GM as in our simulation study, and do not apply an additional thresholding step. The parameters of metabCombiner and M2S are calibrated using the validation subset as prior knowledge (Appendix 2).

Preliminary analysis of the validation subset reveals inconsistencies in the mean feature intensities (Figure 4—figure supplement 1), but Figure 4b shows that on centered and scaled data, the 90 expert matched features shared between the CS and HCC studies have similar correlation structures. Hence, to avoid potential errors we center and scale the feature intensities which improves the performance of all three methods tested below (Appendix 4, Appendix 4—table 1).

Hepatocellular carcinoma

Here, we analyze the quality of the matchings obtained by GM, M2S, and metabCombiner between the CS and HCC datasets in positive mode. Both GM and M2S identify approximately 1000 shared features while metabCombiner finds a smaller number of about 700 shared features. We refer the reader to Figure 4—figure supplement 2a for the precise matched feature sizes and details on the agreement between the feature matchings of all three methods.

We evaluate the performance of metabCombiner, M2S, and GM on the validation subset in positive mode (Figure 4c, Table 2), which consists of 90 features from the CS study manually matched to features from the HCC study and 73 features specific to the CS study. MetabCombiner demonstrates precise matching but lacks sensitivity. M2S’s precision and recall are comparable to those of GM, in contrast to its performance on simulated data. This can be attributed to the RT drift shape between the CS and HCC studies (Appendix 2), which is estimated to be close to linear (Figure 4—figure supplement 3). Because the parameters of M2S are fine-tuned on the validation subset, it is able to learn this linear drift and apply tight RT thresholds to achieve accurate matchings. In contrast to metabCombiner and M2S, the GM algorithm is not given any prior knowledge of the validation subset, and nevertheless attains the highest precision of the three methods and a recall equal to that of M2S (Figure 4c, Table 2). Figure 4b shows how GM recovers the majority of the expert matched pairs by leveraging the shared correlations.

Pancreatic cancer

Matching features between the CS and PC studies in positive mode, GM and M2S identify approximately 1000 common features, while metabCombiner detects approximately 600 matches (Figure 4—figure supplement 2b). We examine the performance of all three methods on the validation subset consisting of 66 manually matched features between CS and PC along with 97 features specific to the CS study. As before, GM and M2S have high recall while the recall of metabCombiner is less than 0.5.

A decrease in precision is observed for both GM and M2S compared to the previous CS-HCC matchings. We therefore manually inspect the false positive matches, that is, the CS features matched by the method to the PC study but explicitly examined and left unmatched in the expert manual matching. Assessing the GM results, we identify seven false positive feature matches. Upon secondary inspection, three pairs are revealed as correct matches that were not initially identified in the expert matching. M2S finds 11 false positive matches, which include the 7 false positives recovered by GM. Manual examination of the four remaining pairs reveals two clear mismatches. These results highlight the advantage of using automated methods for data alignment, as both GM and M2S detect correct matches that were not identified by experts, with GM being more precise than M2S.

Illustration for alcohol biomarker discovery

Loftfield et al., 2021 identified biomarkers of habitual alcohol intake by first performing a discovery step, where they examined the relationship between alcohol intake and metabolic features in the CS study. They then manually matched the significant features in CS to features from the HCC and PC studies, and repeated the analysis with samples from the HCC and PC studies to determine whether the association with alcohol intake persisted. This led to the identification of 10 features possibly associated with alcohol intake (Figure 5a).

Figure 5

Comparison of GromovMatcher and Loftfield et al., 2021 analysis for alcohol biomarker discovery on EPIC data.

(a) The Loftfield et al. study implemented a discovery step, examining the relationship between alcohol intake and metabolic features in the CS study. The significant features in CS were manually matched to features from the HCC and PC studies and the analysis was repeated using samples from the HCC and PC studies. After this step, 10 features associated with alcohol intake were identified. (b) The GromovMatcher analysis begins by matching features from the CS study to the HCC and PC studies respectively (top blue, yellow, and red boxes). Samples corresponding to each CS feature are combined with the samples of its matched feature in the HCC study, PC study, or both. This generates a larger pooled data matrix with the same number of features as the CS study but with more samples pooled across the three original studies (center matrix). Because some features in the CS study may not have matches in HCC or PC, the corresponding entries in the pooled matrix are set to NaN/missing values (white regions in matrix). Each column/feature in this matrix is statistically tested for association with alcohol intake (ignoring missing values) and an FDR or a stricter Bonferroni correction is performed to retain only a subset of features from the pooled study that have a strong association. (c) Venn diagrams show intersection of feature sets (in positive and negative mode) found to be associated with alcohol intake by one of the four different analyses.

To extend this analysis and illustrate the benefit of GM automatic matching for biomarker discovery, we use GM to pool features from the CS, HCC, and PC studies, and examine the relationship between metabolic features and alcohol intake in the pooled study (Materials and methods and Figure 5b).
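The pooling and multiple-testing steps can be sketched as follows. Everything here is illustrative: the function names are hypothetical, the matching is assumed to be a binary one-to-one matrix between reference (CS) features and features of another study, the per-feature association test is left out (p-values are taken as given), and the Benjamini-Hochberg procedure is assumed as the FDR correction since the text above does not name one.

```python
import numpy as np

def pool_with_reference(X_ref, X_other, M):
    """Stack samples of X_other under X_ref, aligned to the reference features.

    X_ref: (n_ref, p_ref); X_other: (n_other, p_other);
    M: binary (p_ref, p_other) one-to-one matching matrix.
    Reference features without a match are filled with NaN.
    """
    aligned = np.full((X_other.shape[0], X_ref.shape[1]), np.nan)
    ref_idx, other_idx = np.nonzero(M)
    aligned[:, ref_idx] = X_other[:, other_idx]
    return np.vstack([X_ref, aligned])

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected at FDR level alpha (BH step-up)."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject

def bonferroni(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected under Bonferroni correction."""
    pvals = np.asarray(pvals, dtype=float)
    return pvals <= alpha / pvals.size

# Toy pooling: 3 reference features, 2 of which are matched in the other study.
X_cs = np.arange(9, dtype=float).reshape(3, 3)
X_hcc = np.array([[10.0, 20.0], [11.0, 21.0]])
M = np.array([[1, 0], [0, 0], [0, 1]])
pooled = pool_with_reference(X_cs, X_hcc, M)

# Toy p-values, one per feature, from some association test (not shown).
pvals = np.array([0.001, 0.02, 0.04, 0.5])
fdr_hits = benjamini_hochberg(pvals)
bonf_hits = bonferroni(pvals)
```

In this toy example the FDR correction retains two features while the stricter Bonferroni correction retains one, which illustrates why the Bonferroni-corrected feature set in the analysis is the smaller of the two.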

Applying an FDR correction on the pooled study, we identify 243 features associated with alcohol intake, including 185 features consistent with the discovery step of Loftfield et al., 2021, and 58 newly discovered features (Figure 5c). Using the more stringent Bonferroni correction on the pooled data, we identify 36 features shared by all three studies that are significantly associated with alcohol intake. These features include all 10 features identified in Loftfield et al. (Figure 5c). These findings highlight the potential benefits of using GM automatic matching for biomarker discovery in untargeted metabolomics data. Additional information regarding the methodology and findings of our GM and Loftfield et al. analyses can be found in Materials and methods and Appendix 4.

Discussion

LC-MS metabolomics has emerged as an increasingly powerful tool for biological and biomedical research, offering promising opportunities for epidemiological and clinical investigations. However, integrating data from different sources remains challenging. To address this issue, we introduce GromovMatcher, a method based on optimal transport that automatically aligns LC-MS data from pairs of studies. Our method exhibits superior performance on both simulated and real data when compared to existing approaches. Additionally, it presents a user-friendly interface with few hyperparameters.

While GromovMatcher is robust to noise and variations in data, it may face limitations when aligning LC-MS studies from populations with different characteristics, where the correlation structures between features may be inconsistent across studies. In this case, the base assumption of GromovMatcher can be relaxed by focusing on subsamples with similar characteristics, as exemplified in a recent study (Gomari et al., 2022).

A current limitation is that GromovMatcher does not account for more than two datasets simultaneously, although this can be overcome by aligning multiple studies to a chosen reference dataset, as demonstrated in our biomarker experiments. The extension of Gromov-Wasserstein to multiple distributions (Beier et al., 2022) is another promising approach for generalizing GromovMatcher to multiple dataset alignment. Further improvements can be made by incorporating existing knowledge about the studies being matched, such as known shared features, samples in common, or MS/MS data.

The results obtained from GromovMatcher are highly promising, opening the door for various analyses of metabolomic datasets acquired in different experimental laboratories. Here, we demonstrated the potential of GromovMatcher in expediting the combination and meta-analysis of data for biomarker and metabolic signature discovery. The matchings learned by GromovMatcher also allow for comparison between experimental protocols by assessing the drift in $m / z$ , RT, and feature intensities across studies. Finally, inter-institutional annotation efforts can directly benefit from incorporating this method to transfer annotations between aligned datasets. Bridging the gap between otherwise incompatible LC-MS data, GromovMatcher enables seamless comparison of untargeted metabolomics experiments.

Materials and methods

GromovMatcher method overview

GromovMatcher accepts as input two feature tables from separate LC-MS untargeted metabolomics studies. The feature tables for dataset 1 and dataset 2 consist of $n_{1}$ and $n_{2}$ biospecimen samples and $p_{1}$ and $p_{2}$ metabolic features detected in the study, respectively. Features in dataset 1 are given the label ${fx}_{i}$ for $i = 1, \dots, p_{1}$ . Every feature is characterized by a mass-to-charge ratio ( $m / z$ ) denoted by $m_{i}^{x}$ , a retention time (RT) denoted by $R T_{i}^{x}$ , and a vector of intensities across all samples written as $X_{i} \in ℝ^{n_{1}}$ . Similarly, features in dataset 2 are labeled as ${fy}_{j}$ for $j = 1, \dots, p_{2}$ and are characterized by their $m / z$ denoted by $m_{j}^{y}$ , retention time $R T_{j}^{y}$ , and a vector of intensities across all samples $Y_{j} \in ℝ^{n_{2}}$ .

Our goal is to identify pairs of indexes $(i, j)$ with $i \in \{1, \dots, p_{1}\}$ and $j \in \{1, \dots, p_{2}\}$ , such that ${fx}_{i}$ and ${fy}_{j}$ correspond to the same metabolic feature. More formally, we aim to identify a matching matrix $M \in \{0, 1\}^{p_{1} \times p_{2}}$ such that $M_{i j} = 1$ if ${fx}_{i}$ and ${fy}_{j}$ correspond to the same feature, hereafter referred to as matched features. Otherwise, we set $M_{i j} = 0$ .

Because the $m / z$ and RT values of metabolomic features are often noisy and subject to experimental bias, our matching algorithm leverages metabolite feature intensities $X_{i}, Y_{j}$ to produce accurate dataset alignments. The GromovMatcher method is based on the idea that signal intensities of the same metabolites measured in two different studies should exhibit similar correlation structures, in addition to having compatible $m / z$ and RT values. Here, we define the Pearson correlation for vectors $u, v \in R^{n}$ as

corr (u, v) = \frac{⟨ u - \bar{u}, v - \bar{v} ⟩}{‖ u - \bar{u} ‖ ‖ v - \bar{v} ‖}

where we define

\bar{u} = \frac{1}{n} \sum_{i = 1}^{n} u_{i}, ‖ u ‖ = \sqrt{\sum_{i = 1}^{n} u_{i}^{2}}, ⟨ u, v ⟩ = \sum_{i = 1}^{n} u_{i} v_{i}

as the mean value, Euclidean norm and inner product respectively. If measurements $X_{i}, Y_{j}$ correspond to the same underlying feature, and similarly, measurements $X_{k}, Y_{l}$ share the same underlying feature, we expect that

corr (X_{i}, X_{k}) \approx corr (Y_{j}, Y_{l}) .

This idea that the feature intensities of shared metabolites have the same correlation structure in both datasets also holds more generally for distances, under a suitable choice of distance. For example, the correlation coefficient $corr (u, v)$ can be turned into a dissimilarity metric by defining

d^{cos} (u, v) = \sqrt{1 - corr (u, v)}

commonly referred to as the cosine distance. Preservation of feature intensity correlations then trivially amounts to the preservation of cosine distances.

Another classical notion of distance between vectors $u, v \in R^{n}$ is the normalized Euclidean distance

d^{euc} (u, v) = \frac{1}{\sqrt{n}} ‖ u - v ‖ = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (u_{i} - v_{i})^{2}}

which is equal to the cosine distance (up to constants) when the vectors $u, v$ are centered and scaled to have zero mean and a standard deviation of one. The Euclidean distance depends on the magnitude or mean intensity of metabolic features, and hence is a useful metric for matching metabolites as long as these mean feature intensities are reliably collected.
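The relationship between these two distances can be checked numerically. The following is a minimal sketch, not taken from the GromovMatcher code; the function names are ours, and the standardization is assumed to use the population standard deviation (division by $n$), which is the convention under which the constant is exactly $\sqrt{2}$.

```python
import numpy as np

def pearson_corr(u, v):
    """Pearson correlation between two intensity vectors of equal length."""
    uc, vc = u - u.mean(), v - v.mean()
    return float(uc @ vc / (np.linalg.norm(uc) * np.linalg.norm(vc)))

def cosine_distance(u, v):
    """d_cos(u, v) = sqrt(1 - corr(u, v)); clipped at 0 to guard against round-off."""
    return float(np.sqrt(max(0.0, 1.0 - pearson_corr(u, v))))

def normalized_euclidean(u, v):
    """d_euc(u, v) = ||u - v|| / sqrt(n)."""
    return float(np.linalg.norm(u - v) / np.sqrt(len(u)))

def standardize(u):
    """Center to zero mean and scale to unit population standard deviation."""
    return (u - u.mean()) / u.std()

# Toy intensity vectors (synthetic, for illustration only).
rng = np.random.default_rng(0)
u = rng.normal(size=50)
v = 0.6 * u + rng.normal(size=50)

# On standardized vectors, d_euc equals sqrt(2) * d_cos of the original vectors.
lhs = normalized_euclidean(standardize(u), standardize(v))
rhs = np.sqrt(2.0) * cosine_distance(u, v)
```

Because the constant factor is the same in both datasets, it does not affect which pairs of distances are judged similar.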

To summarize, the main tenet of GromovMatcher is that if measurements $X_{i}, Y_{j}$ correspond to the same feature and $X_{k}, Y_{l}$ correspond to the same feature, then for suitably chosen distances $d_{x} : ℝ^{n_{1}} \times ℝ^{n_{1}} \to ℝ$ and $d_{y} : ℝ^{n_{2}} \times ℝ^{n_{2}} \to ℝ$ , these distances are preserved

d_{x} (X_{i}, X_{k}) \approx d_{y} (Y_{j}, Y_{l})

across both datasets. In this paper, the distances $d_{x}, d_{y}$ are taken to be the normalized Euclidean distances in Equation 5. We take care to specify those experiments where the metabolic features $X$ and $Y$ are centered and scaled. In these cases, the Euclidean distance between normalized feature vectors implicitly becomes, up to a constant factor, the cosine distance (Equation 4) between the original (unnormalized) feature vectors.

Unbalanced Gromov–Wasserstein

The goal of GromovMatcher is to learn a matching matrix $M \in \{0, 1\}^{p_{1} \times p_{2}}$ that gives an alignment between a subset of metabolites in both datasets. However, searching over the combinatorially large set of binary matrices would be an inefficient approach for dataset alignment. The mathematical framework of optimal transport (Peyré and Cuturi, 2019) instead enlarges this space of binary matrices to the set of coupling matrices with real nonnegative entries $Π \in ℝ_{+}^{p_{1} \times p_{2}}$ . The entries $Π_{i j}$ with large weights indicate that feature ${fx}_{i}$ in dataset 1 and feature ${fy}_{j}$ in dataset 2 are a likely match. Taking inspiration from Equation 6, we minimize the following objective function

E (Π) = \sum_{i, k = 1}^{p_{1}} \sum_{j, l = 1}^{p_{2}} Π_{i j} Π_{k l} | d_{x} (X_{i}, X_{k}) - d_{y} (Y_{j}, Y_{l}) |

to estimate the coupling matrix $Π$ .
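To make the objective concrete, the sketch below evaluates $E (Π)$ directly from its definition on a toy example. It is an illustration under our own assumptions rather than the authors' implementation: the function names are hypothetical, the dense four-index cost tensor is only practical for a handful of features, and the toy second dataset reuses the same samples so that the correct matching attains an objective of zero.

```python
import numpy as np

def pairwise_normalized_euclidean(F):
    """F has shape (p, n), one row per feature. Returns the (p, p) distance matrix."""
    diff = F[:, None, :] - F[None, :, :]
    return np.linalg.norm(diff, axis=2) / np.sqrt(F.shape[1])

def gw_objective(coupling, Dx, Dy):
    """E(Pi) = sum_{i,j,k,l} Pi_ij Pi_kl |Dx_ik - Dy_jl| (dense; small p only)."""
    # cost[i, j, k, l] = |Dx[i, k] - Dy[j, l]|
    cost = np.abs(Dx[:, None, :, None] - Dy[None, :, None, :])
    return float(np.einsum("ij,kl,ijkl->", coupling, coupling, cost))

# Toy data: dataset 2 is dataset 1 with its features reordered.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 30))
perm = np.array([2, 0, 3, 1])
Y = X[perm]                      # feature j of Y is feature perm[j] of X

Dx = pairwise_normalized_euclidean(X)
Dy = pairwise_normalized_euclidean(Y)

correct = np.zeros((4, 4))
correct[perm, np.arange(4)] = 0.25   # mass on the true pairs (perm[j], j)
wrong = np.eye(4) / 4                # mass on mismatched pairs

e_correct = gw_objective(correct, Dx, Dy)
e_wrong = gw_objective(wrong, Dx, Dy)
```

With real data the two studies contain different samples, so the distances agree only approximately and the correct coupling yields a small rather than exactly zero objective.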

A standard approach is to optimize this objective over all coupling matrices $Π$ under exact marginal constraints $Π 𝟏_{p_{2}} = \frac{1}{p_{1}} 𝟏_{p_{1}}, Π^{T} 𝟏_{p_{1}} = \frac{1}{p_{2}} 𝟏_{p_{2}}$ . Here, $𝟏_{n}$ denotes the ones vector of length $n$ , and $Π_{1} = Π 𝟏_{p_{2}}, Π_{2} = Π^{T} 𝟏_{p_{1}}$ denote the row and column sums of the coupling matrix. Objective Equation 7 under these exact marginal constraints defines a distance between the two sets of metabolic feature vectors $\{X_{i}\}_{i = 1}^{p_{1}}, \{Y_{j}\}_{j = 1}^{p_{2}}$ known as the Gromov–Wasserstein distance (Mémoli, 2011), a generalization of optimal transport to metric spaces. Note that for pairs $X_{i}, Y_{j}$ and $X_{k}, Y_{l}$ for which $d_{x} (X_{i}, X_{k}) \approx d_{y} (Y_{j}, Y_{l})$ , the entries $Π_{i j}, Π_{k l}$ are penalized less and hence matches between features ${fx}_{i}, {fy}_{j}$ and features ${fx}_{k}, {fy}_{l}$ are more favored. In our optimization, we avoid enforcing exact constraints on the marginal distributions $Π 𝟏_{p_{2}}$ and $Π^{T} 𝟏_{p_{1}}$ of our coupling matrix, as this would force all metabolites in both datasets to be matched (Appendix 1). However, without any marginal constraints on the coupling $Π$ , the objective function Equation 7 is trivially minimized by $Π = 0$ , leaving all metabolites in both datasets unmatched.

To account for this, we follow the ideas of unbalanced Gromov–Wasserstein (UGW) (Sejourne et al., 2021) and add three regularization terms to our objective

\begin{array}{ll} L_{ρ, ε} (Π) = E (Π) & + ρ D_{KL} (Π_{1} \otimes Π_{1}, a \otimes a) \\ & + ρ D_{KL} (Π_{2} \otimes Π_{2}, b \otimes b) \\ & + ε D_{KL} (Π \otimes Π, {(a \otimes b)}^{\otimes 2}) \end{array}

where $ρ, ε > 0$ and we define $a = 1_{p_{1}}, b = 1_{p_{2}}$ . Here ⊗ denotes the Kronecker product. We define $D_{KL}$ as the Kullback–Leibler (KL) divergence between two discrete distributions $μ, ν \in ℝ_{+}^{p}$ by

D_{KL} (μ, ν) = \sum_{i = 1}^{p} μ_{i} \ln (\frac{μ_{i}}{ν_{i}}) - \sum_{i = 1}^{p} μ_{i} + \sum_{i = 1}^{p} ν_{i}

which measures the closeness of probability distributions.
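As a small illustration of this divergence and of the quadratic form in which it enters the regularization terms, the sketch below computes $D_{KL} (μ, ν)$ and $D_{KL} (μ \otimes μ, ν \otimes ν)$ from the definition. The function names are ours, the convention $0 \ln 0 = 0$ is assumed, and $ν$ is assumed strictly positive; forming the Kronecker products explicitly is only sensible for short vectors.

```python
import numpy as np

def generalized_kl(mu, nu):
    """D_KL(mu, nu) = sum mu*ln(mu/nu) - sum mu + sum nu, with 0*ln(0) taken as 0."""
    mu = np.asarray(mu, dtype=float).ravel()
    nu = np.asarray(nu, dtype=float).ravel()
    pos = mu > 0
    return float(np.sum(mu[pos] * np.log(mu[pos] / nu[pos])) - mu.sum() + nu.sum())

def quadratic_kl(mu, nu):
    """D_KL(kron(mu, mu), kron(nu, nu)), forming the Kronecker products explicitly."""
    return generalized_kl(np.kron(mu, mu), np.kron(nu, nu))

# Toy marginals: mu leaves the second feature unmatched (zero mass).
mu = np.array([0.2, 0.0, 0.5])
nu = np.array([1.0, 1.0, 1.0])

kl_value = generalized_kl(mu, nu)
quad_value = quadratic_kl(mu, nu)

# Expanding the definition gives a closed form that avoids the Kronecker product:
# D_KL(mu (x) mu, nu (x) nu) = 2 m(mu) D_KL(mu, nu) + (m(mu) - m(nu))^2, m = total mass.
closed_form = 2 * mu.sum() * kl_value + (mu.sum() - nu.sum()) ** 2
```

The extra terms $- \sum_{i} μ_{i} + \sum_{i} ν_{i}$ are what allow the two arguments to have different total mass, which is the situation when only some features are matched.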

The first two regularization terms in Equation 8 enforce that the row sums and column sums of the coupling matrix $Π$ do not deviate too much from a uniform distribution, leading our optimization to match as many metabolic features as possible. The magnitude of the regularizer $ρ$ roughly enforces the fraction of metabolites in both datasets that are matched where large $ρ$ implies most metabolites are matched across datasets. The final regularization term $ε$ in Equation 8 controls the smoothness (entropy) of the coupling matrix $Π$ where larger values of $ε$ encourage $Π$ to put uniform weights on many of its entries, leading to less precision in the metabolite matches. However, increasing $ε$ also leads to better numerical stability and a significant speedup of the alternating minimization algorithm used to optimize the objective function (Appendix 1). In our implementation, we set $ρ$ and $ϵ$ to the lowest possible values under which our optimization converges, with $ρ = 0.05$ and $ϵ = 0.005$ .

Our full optimization problem can now be written as

{UGW}_{ρ, ε} = min_{Π \in R_{+}^{p_{1} \times p_{2}}} L_{ρ, ε} (Π) .

The UGW objective function is optimized through alternating minimization, based on the code of Sejourne et al., 2021, using the unbalanced Sinkhorn algorithm (Séjourné et al., 2019) from optimal transport (Appendix 1).

Constraint on $m / z$ ratios

Matched metabolic features must have compatible $m / z$ so we enforce that $Π_{i j} = 0$ when $| m_{i}^{x} - m_{j}^{y} | > m_{gap}$ where $m_{gap}$ is a user-specified threshold. Based on prior literature (Loftfield et al., 2021; Hsu et al., 2019; Climaco Pinto et al., 2022; Habra et al., 2021; Chen et al., 2021), we set $m_{gap}$ = 0.01 ppm. Note that $m_{gap}$ is not explicitly used in Equation 10 but is rather enforced in each iteration of our alternating minimization algorithm for the UGW objective (Appendix 1).
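A minimal sketch of this constraint is given below. It is not the authors' code: the function names are hypothetical, the threshold is applied to the absolute difference exactly as in the inequality above (in whatever unit the $m / z$ values are stored), and the mask is shown applied once to a given coupling, whereas the method re-applies it at each iteration of the optimization.

```python
import numpy as np

def mz_mask(mz_x, mz_y, m_gap=0.01):
    """Boolean (p1, p2) mask: True where |m_i^x - m_j^y| <= m_gap."""
    mz_x, mz_y = np.asarray(mz_x, dtype=float), np.asarray(mz_y, dtype=float)
    return np.abs(mz_x[:, None] - mz_y[None, :]) <= m_gap

def apply_mz_constraint(coupling, mz_x, mz_y, m_gap=0.01):
    """Set Pi_ij = 0 wherever the m/z values of features i and j are incompatible."""
    return np.where(mz_mask(mz_x, mz_y, m_gap), coupling, 0.0)

# Toy example with two features in dataset 1 and three in dataset 2.
mz_x = np.array([100.000, 200.005])
mz_y = np.array([100.004, 200.050, 100.020])
coupling = np.full((2, 3), 0.1)

mask = mz_mask(mz_x, mz_y)
constrained = apply_mz_constraint(coupling, mz_x, mz_y)
```

In this toy case only the pair (feature 1 of dataset 1, feature 1 of dataset 2) survives, since every other pair differs by more than the threshold.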

Unlike the $m / z$ ratios discussed above, RTs often exhibit a non-linear deviation (drift) between studies so we cannot enforce compatibility of RTs directly in our optimization. Instead, in the following step of our pipeline we ensure matched metabolite pairs have compatible RTs by estimating the drift function and subsequently using it to filter out metabolite matches whose RT values are inconsistent with the estimated drift.

Estimation of the RT drift and filtering

Estimating the drift between RTs of two studies is a crucial step in assessing the validity of metabolite matches and discarding those pairs which are incompatible with the estimated drift.

Let $\tilde{Π} \in R_{+}^{p_{1} \times p_{2}}$ be the minimizer of Equation 10 obtained after optimization. We seek to estimate the RT drift function $f : ℝ_{+} \to ℝ_{+}$ which relates the retention times of matched features between the two studies. Namely, if feature ${fx}_{i}$ and feature ${fy}_{j}$ correspond to the same metabolic feature, then we must have that $R T_{j}^{y} \approx f (R T_{i}^{x})$ .

We propose to learn the drift $f$ through the weighted spline regression

min_{f \in B_{n, k}} \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} {\tilde{Π}}_{i j} | f (R T_{i}^{x}) - R T_{j}^{y} |

where $B_{n, k}$ is the set of $n$ -order B-splines with $k$ knots. All pairs $(R T_{i}^{x}, R T_{j}^{y})$ in objective Equation 11 are weighted by the coefficients of $\tilde{Π}$ so that larger weights are given to pairs identified with high confidence in the first step of our procedure. The order of the B-splines was set to $n = 3$ by default, while the number of knots $k$ was selected by 10-fold cross-validation.

Pairs identified as incompatible with the estimated RT drift are then discarded from the coupling matrix. To do this, we first take the estimated RT drift $\hat{f}$ , and the set of pairs $S = \{(i, j) : {\tilde{Π}}_{i j} \neq 0\}$ recovered in $\tilde{Π}$ . We then define the residual associated with $(i, j) \in S$ as

r_{\hat{f}} (i, j) = | \hat{f} (R T_{i}^{x}) - R T_{j}^{y} | .

The 95% prediction interval and the median absolute deviation (MAD) of these residuals are given by

\begin{matrix} PI = 1.96 \times std (\{r_{\hat{f}} (i, j), (i, j) \in S\}) \\ MAD = median (\{| r_{\hat{f}} (i, j) - μ_{r} |, (i, j) \in S\}) \\ μ_{r} = median (\{| r_{\hat{f}} (i, j) |, (i, j) \in S\}) \end{matrix}

where $| S |$ is the size of $S$ and the functions std and median denote the standard deviation and median respectively. Similar to the approach in Climaco Pinto et al., 2022, we create a new filtered coupling matrix $\hat{Π} \in ℝ_{+}^{p_{1} \times p_{2}}$ given by

{\hat{Π}}_{i j} = \begin{cases} {\tilde{Π}}_{i j} & \text{if } r_{\hat{f}} (i, j) < μ_{r} + r_{thresh} \\ 0 & \text{otherwise} \end{cases}

where $r_{thresh}$ is a given filtering threshold. Following Habra et al., 2021, the estimation and outlier detection step can be repeated for multiple iterations, to remove pairs that deviate significantly from the estimated drift and improve the robustness of the drift estimation. In our main algorithm, we use two preliminary iterations where we estimate the RT drift and discard outliers outside of the 95% prediction interval by setting $r_{thresh} = PI$ . We then re-estimate the drift and perform a final filtering step with the more stringent MAD by setting $r_{thresh} = 2 \times MAD$ .
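One filtering pass can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: `filter_by_drift` and its `rule` argument are names we introduce, the drift estimate is passed in as an arbitrary callable rather than a fitted B-spline, and the strict inequality of the filtering rule is kept as written.

```python
import numpy as np

def filter_by_drift(coupling, rt_x, rt_y, drift, rule="PI"):
    """Zero out entries of `coupling` whose RT residual is not below mu_r + r_thresh."""
    rows, cols = np.nonzero(coupling)                  # support S of the coupling
    resid = np.abs(drift(rt_x[rows]) - rt_y[cols])     # residuals r(i, j) on S
    mu_r = np.median(resid)                            # residuals are already nonnegative
    if rule == "PI":
        r_thresh = 1.96 * np.std(resid)
    elif rule == "MAD":
        r_thresh = 2.0 * np.median(np.abs(resid - mu_r))
    else:
        raise ValueError("rule must be 'PI' or 'MAD'")
    keep = resid < mu_r + r_thresh
    filtered = np.zeros_like(coupling)
    filtered[rows[keep], cols[keep]] = coupling[rows[keep], cols[keep]]
    return filtered

# Toy example: ten well-behaved pairs on the diagonal plus one spurious pair.
rt_x = np.arange(10, dtype=float)
drift = lambda x: 1.1 * x                              # stand-in for the fitted drift
rt_y = drift(rt_x) + 0.05 * (-1.0) ** np.arange(10)
coupling = np.diag(np.full(10, 0.1))
coupling[0, 9] = 0.02                                  # far from the drift curve

filtered = filter_by_drift(coupling, rt_x, rt_y, drift, rule="PI")
```

In the full procedure the drift would be re-estimated on the filtered coupling before the next pass, ending with a pass that uses `rule="MAD"`.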

At this stage, it is possible for $\hat{Π}$ to still contain coefficients of very small magnitude. As an optional postprocessing step, we discard these coefficients by setting all entries smaller than $τ \max (\hat{Π})$ to zero, for some user-defined $τ \in [0, 1]$ . Lastly, a feature from either study could have multiple possible matches, since $\hat{Π}$ can have more than one non-zero coefficient per row or column. Although reporting multiple matches can be helpful in an exploratory context, for the sake of simplicity in our analysis, the final output of GromovMatcher returns a one-to-one matching, as we only keep those metabolite pairs $(i, j)$ where the entry ${\hat{Π}}_{i j}$ is largest in its corresponding row and column. All nonzero entries of $\hat{Π}$ which do not satisfy this criterion are set to zero. Finally, we convert $\hat{Π}$ into a binary matching matrix $M \in \{0, 1\}^{p_{1} \times p_{2}}$ with ones in place of its nonzero entries and this final output is returned to the user.
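The post-processing described above can be sketched in a few lines. The function name `to_one_to_one` is ours, and the sketch assumes there are no exact ties between entries; with ties, more than one entry per row or column could survive.

```python
import numpy as np

def to_one_to_one(coupling, tau=0.0):
    """Threshold at tau * max, keep mutual row/column maxima, and binarize."""
    P = np.where(coupling >= tau * coupling.max(), coupling, 0.0)
    row_best = P == P.max(axis=1, keepdims=True)   # largest entry in its row
    col_best = P == P.max(axis=0, keepdims=True)   # largest entry in its column
    return ((P > 0) & row_best & col_best).astype(int)

# Toy coupling: feature 2 of dataset 1 competes with feature 1 for the same column.
coupling = np.array([
    [0.5, 0.1, 0.00],
    [0.4, 0.3, 0.00],
    [0.0, 0.0, 0.02],
])

M_gm = to_one_to_one(coupling)             # no thresholding
M_gmt = to_one_to_one(coupling, tau=0.3)   # with tau-thresholding
```

Without thresholding, the weak pair in the last row and column is retained because it is the largest entry in both; with $τ = 0.3$ it falls below $τ \max (\hat{Π})$ and is dropped.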

As a naming convention, we use the abbreviation GM for our GromovMatcher method, and use the abbreviation GMT when running GromovMatcher with the optional $τ$ -thresholding step with $τ = 0.3$ .

Metrics for dataset alignment

Every alignment method studied in this paper returns a binary partial matching matrix $M \in \{0, 1\}^{p_{1} \times p_{2}}$ which has at most one nonzero entry in each row and column. Specifically, $M_{i j} = 1$ if metabolic feature $i$ in dataset 1 and feature $j$ in dataset 2 correspond to each other and $M_{i j} = 0$ otherwise. In our simulated experiments, we compare the partial matching $M$ to a known ground-truth partial matching matrix $M^{*} \in \{0, 1\}^{p_{1} \times p_{2}}$ .

To do this, we first compute the number of true positives, false positives, true negatives, and false negatives as

\begin{array}{ll} TP = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} 1_{M_{i j} = 1} 1_{M_{i j}^{*} = 1} \\ FP = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} 1_{M_{i j} = 1} 1_{M_{i j}^{*} = 0} \\ TN = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} 1_{M_{i j} = 0} 1_{M_{i j}^{*} = 0} \\ FN = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} 1_{M_{i j} = 0} 1_{M_{i j}^{*} = 1} \end{array}

where 1 denotes the indicator function. Then we use these values to compute the precision and recall as

\begin{array}{c} Precision = \frac{TP}{TP + FP} \\ Recall = \frac{TP}{TP + FN} . \end{array}

Precision measures the fraction of correctly found matches out of all discovered metabolite matches, while recall, also known as sensitivity, measures the fraction of correctly matched pairs out of all truly matched pairs. These two statistics can be summarized into one metric called the F1-score by taking their harmonic mean

F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

These three metrics, precision, recall, and the F1-score, are used throughout the paper to assess the performance of dataset alignment methods, both on simulated data where the ground-truth matching is known, and on the validation subset in EPIC, using results from the manual examination as the ground-truth benchmark.
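For reference, the three metrics can be computed from a pair of binary matching matrices as in the sketch below. The function name is ours, and returning 0 when a denominator is zero is our convention, not one stated in the text.

```python
import numpy as np

def alignment_metrics(M, M_true):
    """Precision, recall and F1-score of a binary matching M against ground truth M_true."""
    M, M_true = np.asarray(M, dtype=bool), np.asarray(M_true, dtype=bool)
    tp = int(np.sum(M & M_true))      # matched and truly matched
    fp = int(np.sum(M & ~M_true))     # matched but not truly matched
    fn = int(np.sum(~M & M_true))     # truly matched but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: three true pairs; the method finds one of them and one wrong pair.
M_true = np.eye(3, dtype=int)
M = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 0, 0]])

precision, recall, f1 = alignment_metrics(M, M_true)
```

True negatives are not needed for any of the three metrics, which is convenient because almost all of the $p_{1} \times p_{2}$ candidate pairs are true negatives.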

Validation on simulated data

To assess the performance of GromovMatcher and compare it to existing dataset alignment methods, we simulate realistic pairs of untargeted metabolomics feature tables with known ground-truth matchings. This allows us to analyze the dependence of alignment methods on the number of shared metabolites, dataset noise level, and feature intensity centering and scaling.

Dataset generation

Our pairs of synthetic feature tables are generated from one real untargeted metabolomics study of 500 newborns within the EXPOsOMICS project, which uses a reversed phase liquid chromatography-quadrupole time-of-flight mass spectrometry (UHPLC-QTOF-MS) system in positive ion mode (Alfano et al., 2020). The original dataset is first preprocessed following the procedure detailed in Alfano et al., 2020, resulting in $p = 4712$ features measured in $n = 499$ samples available for subsequent analysis. Features and samples from the original study are then divided into two feature tables of respective size $(n_{1}, p_{1})$ and $(n_{2}, p_{2})$ , with $n_{1} + n_{2} = n$ and $p_{1}, p_{2} \leq p$ . In order to do this, $n_{1} = ⌊ n / 2 ⌋$ randomly chosen samples from the original study are placed into dataset 1 and the remaining $n_{2} = ⌈ n / 2 ⌉$ samples from the original study are placed into dataset 2. Here, $⌊ \cdot ⌋$ and $⌈ \cdot ⌉$ denote integer floor and ceiling functions. The features of the original study are randomly assigned to dataset 1, dataset 2, or both, allowing the resulting studies to have both common and study-specific features (Figure 2). Specifically, for a fixed overlap parameter $λ \in [0, 1]$ , we assign a random subset of $\approx λ p$ features into both dataset 1 and dataset 2 while the remaining $\approx (1 - λ) p$ features are divided equally between the two studies such that $p_{1} = p_{2}$ . We choose $λ \in \{0.25, 0.5, 0.75\}$ corresponding to low, medium and high overlap. For more detailed information on how the dataset split is performed and for additional validation experiments with unbalanced dataset splits (e.g. $n_{1} \neq n_{2}, p_{1} \neq p_{2}$ ) we refer the reader to Appendix 3.
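The split can be sketched as follows for a feature table stored with samples in rows and features in columns. This is a simplified illustration, not the procedure of Appendix 3: the function name, the rounding of $λ p$ , the handling of an odd number of remaining features (one is dropped), and the random reordering of the columns of dataset 2 are all our assumptions.

```python
import numpy as np

def split_dataset(table, overlap, seed=0):
    """Split an (n, p) table into two tables sharing a fraction `overlap` of the features."""
    rng = np.random.default_rng(seed)
    n, p = table.shape
    samples = rng.permutation(n)
    s1, s2 = samples[: n // 2], samples[n // 2:]        # floor(n/2) and ceil(n/2) samples
    feats = rng.permutation(p)
    n_shared = int(round(overlap * p))
    shared, rest = feats[:n_shared], feats[n_shared:]
    half = len(rest) // 2
    only1, only2 = rest[:half], rest[half: 2 * half]    # equal numbers of specific features
    f1 = np.concatenate([shared, only1])
    f2 = rng.permutation(np.concatenate([shared, only2]))  # hide the column correspondence
    truth = (f1[:, None] == f2[None, :]).astype(int)    # ground-truth matching M*
    return table[np.ix_(s1, f1)], table[np.ix_(s2, f2)], truth

# Toy table with 11 samples and 8 features (values are placeholders).
table = np.arange(88.0).reshape(11, 8)
data1, data2, truth = split_dataset(table, overlap=0.5)
```

The returned `truth` matrix plays the role of the ground-truth matching $M^{*}$ against which an alignment is scored.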

After generating a pair of studies, random noise is added to the $m / z$ , RT and intensity levels of features in dataset 2 to mimic variations in data acquisition across two different experiments. The noise added to each $m / z$ value in study 2 is sampled from a uniform distribution on the interval $[- σ_{M}, σ_{M}]$ with $σ_{M} = 0.01$ (Climaco Pinto et al., 2022). The RTs of dataset 2 are first deviated by the function $f (x) = 1.1 x + 1.3 \sin (1.2 \sqrt{x})$ , corresponding to a systematic inter-dataset drift (Habra et al., 2021; Climaco Pinto et al., 2022; Brunius et al., 2016). A uniformly distributed noise on the interval $[- σ_{RT}, σ_{RT}]$ is added to the deviated RTs of dataset 2, with $σ_{RT} \in \{0.2, 0.5, 1\}$ (in minutes) corresponding to low, moderate and high variations (Climaco Pinto et al., 2022; Habra et al., 2021; Vaughan et al., 2012). Finally, we add a Gaussian noise $𝒩 (0, σ_{FI}^{2})$ to the feature intensities of both studies where $σ_{F I}$ is the standard deviation of the noise. This noise perturbs the correlation matrices of dataset 1 and dataset 2, making matching based on feature intensity correlations more challenging. We vary $σ_{F I}$ over the set of values {0.1, 0.5, 1}.
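The perturbation of dataset 2 can be sketched as below. The function names are ours, and the scale on which the intensity noise is added (raw or standardized intensities) is not specified here, so the sketch simply adds it to whatever array is supplied. The same intensity noise step would also be applied to dataset 1.

```python
import numpy as np

def rt_drift(rt):
    """Systematic drift f(x) = 1.1 x + 1.3 sin(1.2 sqrt(x)) applied to dataset 2 RTs."""
    return 1.1 * rt + 1.3 * np.sin(1.2 * np.sqrt(rt))

def perturb_dataset2(mz, rt, intensities, sigma_m=0.01, sigma_rt=0.5, sigma_fi=0.5, seed=0):
    """Add uniform m/z noise, RT drift plus uniform noise, and Gaussian intensity noise."""
    rng = np.random.default_rng(seed)
    mz_noisy = mz + rng.uniform(-sigma_m, sigma_m, size=mz.shape)
    rt_noisy = rt_drift(rt) + rng.uniform(-sigma_rt, sigma_rt, size=rt.shape)
    # rng.normal takes the standard deviation, matching N(0, sigma_fi^2).
    fi_noisy = intensities + rng.normal(0.0, sigma_fi, size=intensities.shape)
    return mz_noisy, rt_noisy, fi_noisy

# Toy features (placeholder values): three features measured in five samples.
mz = np.array([100.0, 200.0, 300.0])
rt = np.array([1.0, 4.0, 9.0])
intensities = np.zeros((3, 5))

mz2, rt2, fi2 = perturb_dataset2(mz, rt, intensities)
```

The defaults correspond to the medium noise setting ( $σ_{RT} = 0.5, σ_{FI} = 0.5$ ) used in the experiments below.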

Given this data generation process, we test the performance of the four alignment methods (M2S, metabCombiner, GM, and GMT) under the parameter settings described below.

Dependence on overlap

We first assess how the performance of the four methods is affected by the number of metabolic features shared in both datasets. For each value of $λ = 0.25, 0.5, 0.75$ (low, medium, and high overlap), we randomly generate 20 pairs of datasets with noise on the $m / z$ , RT and feature intensities set to $σ_{M} = 0.01, σ_{RT} = 0.5, σ_{FI} = 0.5$ . The precision and recall of each method at low, medium, and high overlap is recorded for each of the repetitions.

Noise robustness

Next, we test the robustness to noise of each method by fixing the metabolite overlap fraction at $λ = 0.5$ and generating 20 random pairs of datasets at low ( $σ_{RT} = 0.2, σ_{FI} = 0.1$ ), medium ( $σ_{RT} = 0.5, σ_{FI} = 0.5$ ), and high ( $σ_{RT} = 1, σ_{FI} = 1$ ) noise levels. As before, the precision and recall of each method are recorded for each noise level across the 20 repetitions.

Feature intensity centering and scaling

In order to test how all four methods are affected when the mean feature intensities and variance are not comparable across studies, we assess their performance when the feature intensities in both studies are mean centered and standardized to have unit standard deviation across all samples. We again generate 20 random pairs of datasets with medium overlap and medium noise, normalize the feature intensities in each pair of datasets, and compute the precision and recall of each method across the 20 repetitions.

EPIC data

We also evaluate our method on data collected within the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort, an ongoing multicentric prospective study with over 500,000 participants recruited between 1992 and 2000 from 23 centers in 10 European countries, and who provided blood samples at the inclusion in the study (Riboli et al., 2002). In EPIC, untargeted metabolomics data were successively acquired in several studies nested within the full cohort.

In the present work, we use untargeted metabolomics data acquired in three studies nested in EPIC, namely the EPIC cross-sectional (CS) study (Slimani et al., 2003) and two matched case-control studies nested within EPIC, on hepatocellular carcinoma (HCC; Stepien et al., 2016; Stepien et al., 2021) and pancreatic cancer (PC; Gasull et al., 2019), respectively. All data were acquired at the International Agency for Research on Cancer, making use of the same platform and methodology: UHPLC-QTOF-MS (1290 Binary Liquid chromatography system, 6550 quadrupole time-of-flight mass spectrometer, Agilent Technologies, Santa Clara, CA) using reversed phase chromatography and electrospray ionization in both positive and negative ionization mode.

In a previous analysis aiming at identifying biomarkers of habitual alcohol intake in EPIC, the 205 features associated with alcohol intake in the CS study were manually matched to features in both the HCC and PC studies (Loftfield et al., 2021). The results from this manual matching are presented in Table 1. This matching process was based on the proximity of $m / z$ and RT, using a matching tolerance of ± 15 ppm and ± 0.2 min, and on the comparison of the chromatograms of features in quality control samples from both studies.

Preprocessing

In the HCC and PC studies, samples corresponding to participants selected as cases in either study (i.e. participants selected in the study because of a diagnosis of incident HCC or PC) are excluded. Indeed, the metabolic profiles of participants selected as controls are expected to be more comparable across studies than those of cases, especially if certain features are associated with the risk of HCC or PC. Apart from this additional exclusion criterion, the untargeted metabolomics data of each study is pre-processed following the steps described in Loftfield et al., 2021, to eliminate unreliable features and samples, impute missing values and minimize technical variations in the feature intensity levels.

Alcohol biomarker discovery

Loftfield et al., 2021 used the untargeted metabolomics data of the CS, HCC and PC studies in their alcohol biomarker discovery study in EPIC, without being able to automatically match their common features and pool the three datasets. Instead, the authors first implemented a discovery step, examining the relationship between alcohol intake and metabolic features measured in the CS study and accounting for multiple testing using a false discovery rate (FDR) correction. This led to the identification of 205 features significantly associated with alcohol intake in the CS study. In order to gauge the robustness of these associations, the authors of Loftfield et al., 2021 then implemented a validation step using data from two independent test sets. The first test set was composed of data from the EPIC HCC and PC studies, while the second was derived from the Finnish Alpha-Tocopherol, Beta-Carotene Cancer Prevention (ATBC) study. The 205 features identified in the discovery step were manually investigated for matches in the EPIC test set, and 67 features were effectively matched to features in the HCC or PC study, or both. The authors then evaluated the association between alcohol intake and those 67 features, applying a more conservative Bonferroni correction to determine whether the association with alcohol intake persisted. This step led to the identification of 10 features associated with alcohol intake (Extended Data Figure 5a). The second test set was then used to determine whether those 10 features were also significant in the ATBC population, which was indeed the case.

To conduct a more in-depth investigation of the matchings produced by the GromovMatcher algorithm, we build upon the analysis previously conducted by Loftfield et al., 2021 by exploring potential alcohol biomarkers using a pooled dataset created from the CS, HCC, and PC studies. Our goal is to assess whether pooling the data leads to increased statistical power and allows for the detection of more features associated with alcohol intake. Namely, we generate the pooled dataset by aligning a chosen reference dataset (CS study) with the HCC and PC studies successively using the GM matchings computed in both positive and negative mode (Materials and methods and Extended Data Figure 5b). Features that are not detected in either the HCC or PC studies are designated as ‘missing’ in the final pooled dataset for samples belonging to the respective studies where the feature is not found.

To evaluate the potential relationship between alcohol consumption and pooled metabolic features, we use a methodology akin to that of Loftfield et al., 2021. The self-reported alcohol intake data is adjusted for various demographic and lifestyle factors (age, sex, country, body-mass-index, smoking status and intensity, coffee consumption, and study) via the residual method in linear regression models. Feature intensities are also adjusted for technical variables (plate number and position within the plate) via linear mixed effect models. The significance of the association is assessed using correlation coefficients computed from the residuals for both self-reported alcohol intake and feature intensities. p-Values are corrected using either false discovery rate (FDR) or Bonferroni correction to account for multiple testing. Corrected p-values less than 5% are considered significant.

Materials and correspondence

All correspondence and material requests should be addressed to V.V.

IARC disclaimer

Where authors are identified as personnel of the International Agency for Research on Cancer/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy, or views of the International Agency for Research on Cancer/World Health Organization.

Appendix 1

In this paper, we study how to match metabolic features across two datasets where Dataset 1 has $p_{1}$ metabolic features measured across $n_{1}$ patients and Dataset 2 has $p_{2}$ metabolic features measured across $n_{2}$ patients. Our goal is to identify pairs of indexes $(i, j)$ with $i \in \{1, \dots, p_{1}\}$ and $j \in \{1, \dots, p_{2}\}$ , such that feature $i$ in Dataset 1 and feature $j$ in Dataset 2 correspond to the same metabolic feature. More formally, we aim to identify a matching matrix $M^{*} \in \{0, 1\}^{p_{1} \times p_{2}}$ such that $M_{i, j}^{*} = 1$ if feature $i$ in Dataset 1 and feature $j$ in Dataset 2 correspond to the same feature, hereafter referred to as matched features. Otherwise, we set $M_{i, j}^{*} = 0$ . We emphasize that a matching matrix $M^{*}$ can have at most one nonzero entry in each row and column.

Both of the datasets we aim to match are obtained from liquid chromatography-mass spectrometry (LC-MS) experiments. Hence, for Dataset 1 each metabolite $i \in [p_{1}]$ is labeled with a mass-to-charge ( $m / z$ ) ratio $m_{i}^{1}$ as well as a retention time (RT) given by $R T_{i}^{1}$ . Additionally, each metabolite has a vector of intensities across patients denoted by $X_{i} \in ℝ^{n_{1}}$ . Similarly, each metabolite $j \in [p_{2}]$ in Dataset 2 is labeled by its $m / z$ ratio $m_{j}^{2}$ , its retention time $R T_{j}^{2}$ and its vector of intensities across samples $Y_{j} \in ℝ^{n_{2}}$ .

Correlations and distances between metabolomic features

Features cannot be aligned based on their $m / z$ and RT alone as they are often too inconsistent across studies. Our method is based on the idea that, in addition to their $m / z$ and RT being compatible, the signal intensities of metabolites measured in two different studies should exhibit similar correlation structures, or more generally exhibit similar distances between their intensity vectors. In other words, if feature intensity vectors $X_{i} \in ℝ^{n_{1}}, Y_{j} \in ℝ^{n_{2}}$ correspond to the same underlying feature ( $M_{i j}^{*} = 1$ ) and similarly if $X_{k} \in ℝ^{n_{1}}, Y_{l} \in ℝ^{n_{2}}$ correspond to the same feature ( $M_{k l}^{*} = 1$ ), then we expect that

corr (X_{i}, X_{k}) \approx corr (Y_{j}, Y_{l}) \quad \text{if } M_{i j}^{*} = M_{k l}^{*} = 1.

Here we define $corr (u, v)$ to be the Pearson correlation coefficient between two feature intensity vectors $u, v \in ℝ^{n}$ by

corr (u, v) = \frac{⟨ u - \bar{u}, v - \bar{v} ⟩}{‖ u - \bar{u} ‖ ‖ v - \bar{v} ‖}

where we define

\bar{u} = \frac{1}{n} \sum_{i = 1}^{n} u_{i}, ‖ u ‖ = \sqrt{\sum_{i = 1}^{n} u_{i}^{2}}, ⟨ u, v ⟩ = \sum_{i = 1}^{n} u_{i} v_{i}

as the mean value, Euclidean norm and inner product respectively. More generally, with d_x and d_y denoting two given distances on $ℝ^{n_{1}}$ and $ℝ^{n_{2}}$ respectively, we expect that

d_{x} (X_{i}, X_{k}) \approx d_{y} (Y_{j}, Y_{l}) \quad \text{if } M_{i j}^{*} = M_{k l}^{*} = 1.

Throughout this paper, we use the normalized Euclidean distance defined for any $u, v \in ℝ^{n}$ as

d^{euc} (u, v) = \frac{1}{\sqrt{n}} ‖ u - v ‖

where for d_x and d_y we take $n = n_{1}, n_{2}$ respectively. If the signal intensity vectors $u, v$ are mean centered and normalized by their standard deviation as

u \mapsto \sqrt{n} \cdot \frac{u - \frac{1}{n} \sum_{i = 1}^{n} u_{i}}{‖ u - \frac{1}{n} \sum_{i = 1}^{n} u_{i} ‖}

and likewise for $v$ , then it follows that

d^{euc} (u, v) = \sqrt{2 (1 - corr (u, v))} = \sqrt{2} d^{cos} (u, v)

where we denote $d^{cos} (u, v) = \sqrt{1 - corr (u, v)}$ as the cosine distance. For the purposes of this paper, we will always assume that d_x and d_y denote the normalized Euclidean distance from Equation 22. As shown above, this will be implicitly equal to the cosine distance from Equation 24 on centered and scaled data.
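For completeness, the identity between the two distances follows in two steps from the definitions above. In the worked derivation below, $\tilde{u}$ and $\tilde{v}$ are our notation for the centered and scaled versions of $u$ and $v$ .

```latex
% Step 1: after centering and scaling, both vectors have squared norm n,
% and their inner product is n times the Pearson correlation.
\|\tilde{u}\|^{2} = \|\tilde{v}\|^{2} = n, \qquad
\langle \tilde{u}, \tilde{v} \rangle
  = n \, \frac{\langle u - \bar{u}, v - \bar{v} \rangle}{\| u - \bar{u} \| \, \| v - \bar{v} \|}
  = n \, corr (u, v)

% Step 2: expand the squared normalized Euclidean distance.
d^{euc} (\tilde{u}, \tilde{v})^{2}
  = \frac{1}{n} \left( \|\tilde{u}\|^{2} + \|\tilde{v}\|^{2} - 2 \langle \tilde{u}, \tilde{v} \rangle \right)
  = 2 \left( 1 - corr (u, v) \right)
```

Taking square roots gives $d^{euc} (\tilde{u}, \tilde{v}) = \sqrt{2} \, d^{cos} (u, v)$ .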

The goal of metabolomic feature matching is to learn the binary matching matrix $M^{*}$ that aligns the distances between pairs of features in the most consistent way possible as shown in Equation 21. To formalize this notion into a practical algorithm, we use the mathematical theory of optimal transport (Peyré and Cuturi, 2019) which we discuss next.

Optimal transport

Optimal transport (OT) applies in the setting when the points $\{X_{i}\}_{i = 1}^{p_{1}}$ and $\{Y_{j}\}_{j = 1}^{p_{2}}$ being matched live in the same dimensional space $n_{1} = n_{2} = n$ . It aims to find a matching between each point $X_{i}$ and its corresponding point $Y_{j}$ such that the sum of distances between matches is minimized. Matches between each pair of points can be stored in a matching matrix $M \in \{0, 1\}^{p_{1} \times p_{2}}$ such that $M_{i j} = 1$ if $X_{i}$ and $Y_{j}$ are matched, and $M_{i j} = 0$ otherwise. Again we note that $M$ must have at most one nonzero entry in each row and column to be a valid matching matrix.

Instead of searching over this space of binary matching matrices, optimal transport places masses $a_{i} \geq 0$ at all points $X_{i}$ for $i = 1, \dots, p_{1}$ and masses $b_{j} \geq 0$ at all points $Y_{j}$ for $j = 1, \dots, p_{2}$ and optimizes over the space of probabilistic couplings $Π \in ℝ_{+}^{p_{1} \times p_{2}}$ which move a mass $Π_{i j}$ from $X_{i}$ to $Y_{j}$ . We assume here for simplicity that the masses in each dataset sum to one, $\sum_{i = 1}^{p_{1}} a_{i} = \sum_{j = 1}^{p_{2}} b_{j} = 1$ , and that the coupling $Π$ transports all mass from $a$ into $b$ . More formally, optimal transport optimizes over the constrained set of couplings

𝐔 (a, b) = \{Π \in ℝ_{+}^{p_{1} \times p_{2}} : Π 𝟏_{p_{2}} = a \text{ and } Π^{T} 𝟏_{p_{1}} = b\}

where $𝟏_{p}$ denotes the all ones vector of length $p$ . In practice, the points $X_{i}$ and $Y_{j}$ in each dataset are all treated the same and the masses placed on the data are chosen to be uniform $a = \frac{1}{p_{1}} 𝟏_{p_{1}}$ and $b = \frac{1}{p_{2}} 𝟏_{p_{2}}$ .

The cost function which optimal transport minimizes is the sum of the distances between coupled points, weighted by the transported mass

E (Π) = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} d (X_{i}, Y_{j})

where $d (u, v) = d^{euc} (u, v) = \frac{1}{\sqrt{n}} ∥ u - v ∥$ is the normalized Euclidean distance. The distance matrix $d (X_{i}, Y_{j})$ in the OT objective can be replaced more generally with a cost matrix $C \in ℝ^{p_{1} \times p_{2}}$ that is not necessarily a distance matrix. In this case the cost function becomes

E (Π) = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} C_{i j}

When the transport cost $C_{i j}$ is a distance, the OT optimization defines a valid distance metric known as the optimal transport distance between discrete distributions ${(a_{i}, X_{i})}_{i = 1}^{p_{1}}$ and ${(b_{j}, Y_{j})}_{j = 1}^{p_{2}}$ in $ℝ^{n}$ given by

OT (a, b) = min_{Π \in U (a, b)} E (Π) .

When $d (u, v)$ is Euclidean, this OT distance is also referred to as the $L^{1}$ optimal transport distance, the Wasserstein 1-distance, or the Earth mover’s distance. As formulated, the computation of the optimal transport objective involves an optimization over coupling matrices $Π$ which can be solved by linear programming (Peyré and Cuturi, 2019). The OT optimization problem becomes time consuming for problems with many points $p_{1}, p_{2} ≫ 1$ . We show in the next section how augmenting this distance with a regularization term leads to a more efficient algorithm for learning the optimal coupling $Π$ .
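For intuition, the balanced OT objective can be minimized by brute force on a tiny example. The sketch below assumes $p_{1} = p_{2} = p$ with uniform marginals, in which case an optimal coupling can be taken to be $1 / p$ times a permutation matrix (a consequence of the Birkhoff–von Neumann theorem), so it simply enumerates permutations instead of solving a linear program. The points and function names are hypothetical and the approach does not scale beyond a handful of points.

```python
import itertools
import math

def d_euc(u, v):
    """Normalized Euclidean distance (1 / sqrt(n)) * ||u - v||."""
    n = len(u)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)) / n)

def transport_cost(Pi, X, Y):
    """E(Pi) = sum_ij Pi_ij * d(X_i, Y_j)."""
    return sum(Pi[i][j] * d_euc(X[i], Y[j])
               for i in range(len(X)) for j in range(len(Y)))

def best_permutation_coupling(X, Y):
    """Enumerate couplings of the form (1/p) * permutation and keep the cheapest."""
    p = len(X)
    best = None
    for perm in itertools.permutations(range(p)):
        Pi = [[1.0 / p if perm[i] == j else 0.0 for j in range(p)]
              for i in range(p)]
        cost = transport_cost(Pi, X, Y)
        if best is None or cost < best[0]:
            best = (cost, perm, Pi)
    return best

# Hypothetical points: Y is a shuffled, slightly perturbed copy of X.
X = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]]
Y = [[3.1, 0.0], [0.0, 0.1], [1.0, 0.9]]

cost, perm, Pi = best_permutation_coupling(X, Y)
print(perm, cost)
```

Here `perm[i]` is the index of the point in `Y` matched to `X[i]`. Each row and column of the returned coupling sums to $1 / p$ , so it lies in $𝐔 (a, b)$ for uniform $a$ and $b$ .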

Entropic regularization

Define the Kullback–Leibler (KL) divergence between two positive vectors $μ, ν \in ℝ_{+}^{p}$ as

D_{KL} (μ, ν) = \sum_{i = 1}^{p} μ_{i} \ln (\frac{μ_{i}}{ν_{i}}) - \sum_{i = 1}^{p} μ_{i} + \sum_{i = 1}^{p} ν_{i}

Given fixed marginals $a \in ℝ^{p_{1}}$ and $b \in ℝ^{p_{2}}$ from the previous section, we can define the entropy of a coupling matrix $Π \in ℝ_{+}^{p_{1} \times p_{2}}$ with respect to these fixed marginals as

D_{K L} (Π, a \otimes b) = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} \ln (\frac{Π_{i j}}{a_{i} b_{j}}) - \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} + \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} a_{i} b_{j} .

where ${(a \otimes b)}_{i j} = a_{i} b_{j}$ denotes the outer product. This can be further simplified as

\begin{array}{ll} D_{KL} (Π, a \otimes b) & = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} \ln (\frac{Π_{i j}}{a_{i} b_{j}}) - \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} + (\sum_{i = 1}^{p_{1}} a_{i}) (\sum_{j = 1}^{p_{2}} b_{j}) \\ & = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} \ln (\frac{Π_{i j}}{a_{i} b_{j}}) \\ & = H (Π) + \ln (p_{1}) + \ln (p_{2}) \end{array}

where we define $H (Π)$ by

H (Π) = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} \ln (Π_{i j}) .

In the second line of the derivation above, we used the fact that the entries of $a, b$ , and $Π$ summed to one, and in the third line we used the fact that the marginals $a$ and $b$ were uniform. Under these assumptions, we see that the KL divergence $D_{KL} (Π, a \otimes b)$ is independent of the values of the marginals $a, b$ and is equal to $H (Π)$ up to constants.
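The reduction of the KL divergence to $H (Π)$ plus constants can be verified on a small example. This is a sketch under the stated assumptions (uniform marginals, entries of $Π$ summing to one); the coupling values are made up for illustration.

```python
import math

def kl(mu, nu):
    """Generalized KL divergence between two positive vectors."""
    return sum(m * math.log(m / n) - m + n for m, n in zip(mu, nu))

p1, p2 = 2, 3
a = [1.0 / p1] * p1   # uniform marginal on the first dataset
b = [1.0 / p2] * p2   # uniform marginal on the second dataset

# Hypothetical coupling with positive entries that sum to one.
Pi = [[0.10, 0.25, 0.15],
      [0.20, 0.05, 0.25]]

flat_pi = [Pi[i][j] for i in range(p1) for j in range(p2)]
flat_ab = [a[i] * b[j] for i in range(p1) for j in range(p2)]

H = sum(x * math.log(x) for x in flat_pi)
lhs = kl(flat_pi, flat_ab)                 # D_KL(Pi, a ⊗ b)
rhs = H + math.log(p1) + math.log(p2)      # H(Pi) + ln(p1) + ln(p2)
print(lhs, rhs)
```

Note that the identity only requires the total masses to equal one; the marginals of the example coupling need not match $a$ and $b$ .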

Although here the general definition of entropy through the KL divergence reduces to the simpler formula of $H (Π)$ , in the following sections we will need to extend our analysis to cases when $a, b$ , and $Π$ have positive values that do not sum to one (i.e. not distributions). In this context, we will no longer have that $D_{KL} (Π, a \otimes b) = H (Π) + \text{const}$ but we will still be able to use $D_{KL} (Π, a \otimes b)$ as a general notion of entropy for $Π$ .

The entropy of a coupling $D_{KL} (Π, a \otimes b)$ is an important notion because it quantifies how uniform or smooth $Π$ is with respect to the product distribution $a \otimes b$ . In particular, if $a$ and $b$ are set to uniform distributions as commonly done in practice, then $D_{KL} (Π, a \otimes b)$ is small when $Π$ has close to uniform entries and is large otherwise. This notion of smoothness allows us to use $D_{KL} (Π, a \otimes b)$ as a regularizer in our optimal transport distance as

L_{ε} (Π) = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} C_{i j} + ε D_{KL} (Π, a \otimes b)

where $ε$ is a small regularization parameter. Note that here we have denoted the transport cost matrix by $C \in ℝ^{p_{1} \times p_{2}}$ which is not necessarily a distance matrix. The introduction of the regularizer $ε D_{KL} (Π, a \otimes b)$ gives us an efficient iterative algorithm known as the Sinkhorn algorithm for optimizing $Π$ which we describe in the following sections.

Unbalanced optimal transport

Before we introduce the Sinkhorn algorithm, we introduce a final modification to our optimal transport distance that allows us to learn couplings between distributions $a, b \in ℝ_{+}^{p}$ that do not preserve mass. In other words, the coupling $Π$ is not required to perfectly satisfy the marginal constraints $Π 𝟏_{p_{2}} = a$ and $Π^{T} 𝟏_{p_{1}} = b$ . In our metabolite matching problem, this is particularly useful as not all metabolites in one dataset necessarily appear in the other dataset and hence should be left unmatched. This modification of optimal transport, known as unbalanced optimal transport (UOT; Chizat et al., 2018), optimizes the following cost function

L_{ρ, ε} (Π) = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} C_{i j} + ρ D_{KL} (Π 1_{p_{2}}, a) + ρ D_{KL} (Π^{T} 1_{p_{1}}, b) + ε D_{KL} (Π, a \otimes b)

where we have added two KL terms with regularization parameter $ρ$ to enforce that the marginals of the coupling $Π 𝟏_{p_{2}}, Π^{T} 𝟏_{p_{1}}$ are approximately close to the prescribed marginals $a, b$ respectively. We have also kept the smoothness/entropy regularizer $ε D_{KL} (Π, a \otimes b)$ from the previous section.

Unbalanced Sinkhorn algorithm

Now we are ready to present the unbalanced Sinkhorn algorithm (Peyré and Cuturi, 2019) for optimizing the unbalanced optimal transport cost defined above. First we rewrite our optimization as

\begin{array}{ll} min_{Π \in R_{+}^{p_{1} \times p_{2}}} L_{ρ, ε} (Π) & = min_{Π \in R_{+}^{p_{1} \times p_{2}}} \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} C_{i j} + ρ D_{KL} (Π 1_{p_{2}}, a) + ρ D_{KL} (Π^{T} 1_{p_{1}}, b) + ε D_{KL} (Π, a \otimes b) \\ = min_{u \in R_{+}^{p_{1}}, v \in R_{+}^{p_{2}}} min_{\begin{matrix} Π \in R_{+}^{p_{1} \times p_{2}} \\ Π 1_{p_{2}} = u, Π^{T} 1_{p_{1}} = v \end{matrix}} \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} C_{i j} + ρ D_{KL} (u, a) + ρ D_{KL} (v, b) + ε D_{KL} (Π, a \otimes b) . \end{array}

The inner minimization can be solved exactly by introducing dual variables $f \in ℝ^{p_{1}}, g \in ℝ^{p_{2}}$ and writing out the Lagrange dual problem

\begin{array}{ll} min_{\begin{matrix} Π \in R_{+}^{p_{1} \times p_{2}} \\ Π 1_{p_{2}} = u, Π^{T} 1_{p_{1}} = v \end{matrix}} \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} C_{i j} + ε D_{KL} (Π, a \otimes b) - f^{T} (Π 1_{p_{2}} - u) - g^{T} (Π^{T} 1_{p_{1}} - v) \\ = max_{f \in R^{p_{1}}, g \in R^{p_{2}}} min_{Π \in R_{+}^{p_{1} \times p_{2}}} \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} C_{i j} + ε D_{KL} (Π, a \otimes b) - f^{T} (Π 1_{p_{2}} - u) - g^{T} (Π^{T} 1_{p_{1}} - v) . \end{array}

where we have removed the terms $ρ D_{KL} (u, a)$ and $ρ D_{KL} (v, b)$ since they do not depend on $Π$ . Taking the gradient in $Π$ in the inner minimization and setting it to zero we get

C_{i j} + ε \log (\frac{Π_{i j}}{a_{i} b_{j}}) - f_{i} - g_{j} = 0

which implies that

Π_{i j} = a_{i} b_{j} \exp (\frac{f_{i} + g_{j} - C_{i j}}{ε})

Now we can substitute this expression for $Π$ back into our Lagrange dual problem. First we compute

ε D_{KL} (Π, a \otimes b) = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} (f_{i} + g_{j} - C_{i j}) - ε \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} + ε \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} a_{i} b_{j}

which implies that

\begin{array}{ll} \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} C_{i j} + ε D_{KL} (Π, a \otimes b) - f^{T} (Π 1_{p_{2}} - u) - g^{T} (Π^{T} 1_{p_{1}} - v) \\ = u^{T} f + v^{T} g - ε \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} a_{i} b_{j} \exp (\frac{f_{i} + g_{j} - C_{i j}}{ε}) + ε \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} a_{i} b_{j} . \end{array}

Hence, the outer maximization in our Lagrange dual problem for $f$ and $g$ can now be written as

max_{f \in R^{p_{1}}, g \in R^{p_{2}}} u^{T} f + v^{T} g - ε \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} a_{i} b_{j} \exp (\frac{f_{i} + g_{j} - C_{i j}}{ε})

where we have removed the last constant sum in $a_{i} b_{j}$ . Finally we can rewrite our entire minimization from the start of this section as

\begin{array}{ll} min_{Π \in R_{+}^{p_{1} \times p_{2}}} L_{ρ, ε} (Π) & = min_{u \in R_{+}^{p_{1}}, v \in R_{+}^{p_{2}}} min_{\begin{matrix} Π \in R_{+}^{p_{1} \times p_{2}} \\ Π 1_{p_{2}} = u, Π^{T} 1_{p_{1}} = v \end{matrix}} \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} C_{i j} + ρ D_{KL} (u, a) + ρ D_{KL} (v, b) + ε D_{KL} (Π, a \otimes b) \\ = min_{u \in R_{+}^{p_{1}}, v \in R_{+}^{p_{2}}} max_{f \in R^{p_{1}}, g \in R^{p_{2}}} u^{T} f + v^{T} g - ε \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} a_{i} b_{j} \exp (\frac{f_{i} + g_{j} - C_{i j}}{ε}) + ρ D_{KL} (u, a) + ρ D_{KL} (v, b) . \end{array}

By strong duality, we can interchange the minimum and maximum above to write

\begin{array}{ll} min_{Π \in R_{+}^{p_{1} \times p_{2}}} L_{ρ, ε} (Π) & = max_{f \in R^{p_{1}}, g \in R^{p_{2}}} min_{u \in R_{+}^{p_{1}}, v \in R_{+}^{p_{2}}} u^{T} f + v^{T} g - ε \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} a_{i} b_{j} \exp (\frac{f_{i} + g_{j} - C_{i j}}{ε}) + ρ D_{KL} (u, a) + ρ D_{KL} (v, b) \\ & = max_{f \in R^{p_{1}}, g \in R^{p_{2}}} U^{*} (f) + V^{*} (g) - ε \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} a_{i} b_{j} \exp (\frac{f_{i} + g_{j} - C_{i j}}{ε}) \end{array}

where we define the functions

\begin{array}{ll} U^{*} (f) = min_{u \in R_{+}^{p_{1}}} u^{T} f + ρ D_{KL} (u, a) \\ V^{*} (g) = min_{v \in R_{+}^{p_{2}}} v^{T} g + ρ D_{KL} (v, b) . \end{array}

In fact, we can solve the minimizations in $U^{*}$ and $V^{*}$ in closed form to get the minimizers $u^{*} = a ⊙ \exp (- f / ρ)$ and $v^{*} = b ⊙ \exp (- g / ρ)$ which we can substitute back in to get

\begin{array}{ll} U^{*} (f) & = \sum_{i = 1}^{p_{1}} u_{i}^{*} f_{i} + ρ \sum_{i = 1}^{p_{1}} u_{i}^{*} \ln (\frac{u_{i}^{*}}{a_{i}}) - ρ \sum_{i = 1}^{p_{1}} u_{i}^{*} + ρ \sum_{i = 1}^{p_{1}} a_{i} \\ = \sum_{i = 1}^{p_{1}} u_{i}^{*} f_{i} - \sum_{i = 1}^{p_{1}} u_{i}^{*} f_{i} - ρ \sum_{i = 1}^{p_{1}} a_{i} \exp (- f_{i} / ρ) + ρ \sum_{i = 1}^{p_{1}} a_{i} \\ = - ρ \sum_{i = 1}^{p_{1}} a_{i} \exp (- f_{i} / ρ) + ρ \sum_{i = 1}^{p_{1}} a_{i} . \end{array}

Likewise we can see that

V^{*} (g) = - ρ \sum_{j = 1}^{p_{2}} b_{j} \exp (- g_{j} / ρ) + ρ \sum_{j = 1}^{p_{2}} b_{j} .
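The closed form for $U^{*}$ can be sanity-checked numerically. In the sketch below, the vectors $a$ and $f$ and the value of $ρ$ are arbitrary illustrative choices; the code evaluates $u^{T} f + ρ D_{KL} (u, a)$ at the claimed minimizer $u^{*} = a ⊙ \exp (- f / ρ)$ and compares it with the closed-form expression.

```python
import math

def u_objective(u, f, a, rho):
    """u^T f + rho * D_KL(u, a) for positive vectors u and a."""
    kl = sum(ui * math.log(ui / ai) - ui + ai for ui, ai in zip(u, a))
    return sum(ui * fi for ui, fi in zip(u, f)) + rho * kl

# Hypothetical marginal, dual potential, and relaxation parameter.
a = [0.5, 0.3, 0.2]
f = [0.4, -0.1, 0.7]
rho = 0.8

# Claimed minimizer and closed-form value of U*(f).
u_star = [ai * math.exp(-fi / rho) for ai, fi in zip(a, f)]
closed_form = -rho * sum(u_star) + rho * sum(a)
value_at_u_star = u_objective(u_star, f, a, rho)

# Because the objective is convex in u, moving away from u_star increases it.
perturbed = [1.05 * ui for ui in u_star]
value_perturbed = u_objective(perturbed, f, a, rho)
print(value_at_u_star, closed_form, value_perturbed)
```

The same check applies to $V^{*}$ after exchanging $(a, f)$ for $(b, g)$ .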

Thus, we can rewrite our full optimization as

\begin{array}{ll} min_{Π \in R_{+}^{p_{1} \times p_{2}}} L_{ρ, ε} (Π) & = max_{f \in R^{p_{1}}, g \in R^{p_{2}}} - ρ \sum_{i = 1}^{p_{1}} a_{i} \exp (- f_{i} / ρ) - ρ \sum_{i = 1}^{p_{2}} b_{i} \exp (- g_{i} / ρ) - ε \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} a_{i} b_{j} \exp (\frac{f_{i} + g_{j} - C_{i j}}{ε}) \\ = min_{f \in R^{p_{1}}, g \in R^{p_{2}}} ρ \sum_{i = 1}^{p_{1}} a_{i} \exp (- f_{i} / ρ) + ρ \sum_{i = 1}^{p_{2}} b_{i} \exp (- g_{i} / ρ) + ε \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} a_{i} b_{j} \exp (\frac{f_{i} + g_{j} - C_{i j}}{ε}) \end{array}

where we have removed the terms independent of $f$ and $g$ .

Note that now we can optimize the cost function above by performing an alternating minimization on the dual variables $f$ and $g$ . Taking the gradient in $f$ and setting it to zero we see that

- a_{i} \exp (- f_{i} / ρ) + a_{i} \sum_{j = 1}^{p_{2}} b_{j} \exp (\frac{f_{i} + g_{j} - C_{i j}}{ε}) = 0

which implies that

f_{i} = - \frac{ε ρ}{ε + ρ} \ln (\sum_{j = 1}^{p_{2}} b_{j} \exp (\frac{g_{j} - C_{i j}}{ε})) .

Similarly, we can write out

g_{j} = - \frac{ε ρ}{ε + ρ} \ln (\sum_{i = 1}^{p_{1}} a_{i} \exp (\frac{f_{i} - C_{i j}}{ε})) .

We are now ready to write out the full unbalanced Sinkhorn algorithm which performs an alternating minimization on the dual potentials $f, g$ as outlined above. We remind the reader that the coupling matrix can be recovered from the dual potentials by the formula

Π_{i j} = a_{i} b_{j} \exp (\frac{f_{i} + g_{j} - C_{i j}}{ε}) .

The unbalanced Sinkhorn algorithm proceeds as follows.

Algorithm 1. UnbalancedSinkhorn
input: Transport cost $C$ , marginals $a, b$ , marginal relaxation $ρ$ , entropic regularization $ε$
output: Return the coupling matrix $Π$
Initialize $g = 0$
while $(f, g)$ has not converged do
    Set $f_{i} \leftarrow - \frac{ε ρ}{ε + ρ} \ln (\sum_{j = 1}^{p_{2}} b_{j} \exp (\frac{g_{j} - C_{i j}}{ε}))$ for $i \in [p_{1}]$
    Set $g_{j} \leftarrow - \frac{ε ρ}{ε + ρ} \ln (\sum_{i = 1}^{p_{1}} a_{i} \exp (\frac{f_{i} - C_{i j}}{ε}))$ for $j \in [p_{2}]$
Return the coupling matrix $Π_{i j} = a_{i} b_{j} \exp (\frac{f_{i} + g_{j} - C_{i j}}{ε})$ for $i \in [p_{1}]$ and $j \in [p_{2}]$

The final output of the Sinkhorn algorithm optimization is a real-valued coupling matrix $Π \in ℝ_{+}^{p_{1} \times p_{2}}$ . In some cases, it is desirable to transform the coupling matrix into a binary-valued matching matrix $M \in \{0, 1\}^{p_{1} \times p_{2}}$ with possibly an added restriction that there is at most one nonzero element in each row and column (to obtain a valid partial matching). This can be done by either thresholding the real matrix $Π$ or by assigning all maximal entries in each row (or column) to one and setting the remaining entries to zero. For our metabolomics matching problem, we describe our procedure for transforming our real-valued coupling into a binary matching matrix in the section on the GromovMatcher algorithm below.
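The updates of Algorithm 1 can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the GromovMatcher implementation: the log-sum-exp stabilization, the stopping rule (`n_iter`, `tol`), and the toy inputs are assumptions added here for the example.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    top = max(values)
    return top + math.log(sum(math.exp(x - top) for x in values))

def unbalanced_sinkhorn(C, a, b, rho, eps, n_iter=500, tol=1e-10):
    """Alternating updates of the dual potentials f and g; returns the coupling."""
    p1, p2 = len(a), len(b)
    scale = eps * rho / (eps + rho)
    f, g = [0.0] * p1, [0.0] * p2
    for _ in range(n_iter):
        # f_i <- -scale * ln( sum_j b_j exp((g_j - C_ij) / eps) )
        f_new = [-scale * logsumexp([math.log(b[j]) + (g[j] - C[i][j]) / eps
                                     for j in range(p2)]) for i in range(p1)]
        # g_j <- -scale * ln( sum_i a_i exp((f_i - C_ij) / eps) ), using the new f
        g_new = [-scale * logsumexp([math.log(a[i]) + (f_new[i] - C[i][j]) / eps
                                     for i in range(p1)]) for j in range(p2)]
        change = max(max(abs(x - y) for x, y in zip(f_new, f)),
                     max(abs(x - y) for x, y in zip(g_new, g)))
        f, g = f_new, g_new
        if change < tol:   # assumed stopping rule on the potentials
            break
    # Pi_ij = a_i b_j exp((f_i + g_j - C_ij) / eps)
    return [[a[i] * b[j] * math.exp((f[i] + g[j] - C[i][j]) / eps)
             for j in range(p2)] for i in range(p1)]

# Toy problem: matching point i to point i is free, mismatches cost one.
C = [[0.0, 1.0], [1.0, 0.0]]
a = [0.5, 0.5]
b = [0.5, 0.5]
Pi = unbalanced_sinkhorn(C, a, b, rho=1.0, eps=0.1)
print(Pi)
```

On this toy cost the coupling should place most of its mass on the diagonal, and because the marginal constraints are relaxed its total mass need not equal one.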

Gromov–Wasserstein

Now that we have introduced the general formulation of unbalanced optimal transport and its corresponding Sinkhorn algorithm, we can extend this formulation to matching problems between distributions of points that live in different dimensional spaces. In our metabolomics setting, we aim to match two datasets of $p_{1}$ and $p_{2}$ metabolic features respectively, where each feature in a dataset is associated with a feature intensity vector across samples, $\{X_{i}\}_{i = 1}^{p_{1}} \subset ℝ^{n_{1}}$ and $\{Y_{j}\}_{j = 1}^{p_{2}} \subset ℝ^{n_{2}}$ respectively. We assume that there exists a true matching matrix $M^{*} \in \{0, 1\}^{p_{1} \times p_{2}}$ with at most one nonzero entry in each row and column such that two metabolites $(i, j)$ are matched if $M_{i j}^{*} = 1$ .

We make the further assumption that if feature vectors $X_{i}, Y_{j}$ are matched and feature vectors $X_{k}, Y_{l}$ are matched under $M^{*}$ , then we expect that

d_{x} (X_{i}, X_{k}) \approx d_{y} (Y_{j}, Y_{l}) \quad \text{if} \quad M_{i j}^{*} = M_{k l}^{*} = 1.

where d_x is a distance metric on $ℝ^{n_{1}}$ and d_y is a distance metric on $ℝ^{n_{2}}$ . In practice, we always choose these distance metrics to be the normalized Euclidean distance defined for any $u, v \in ℝ^{n}$ as

d^{euc} (u, v) = \frac{1}{\sqrt{n}} ‖ u - v ‖

which, for centered and scaled data, is equal up to a factor of $\sqrt{2}$ to the cosine distance $d^{cos} (u, v) = \sqrt{1 - corr (u, v)}$ . Given these two distance matrices $D^{x} = {[d_{x} (X_{i}, X_{k})]}_{i, k = 1}^{p_{1}} \in ℝ^{p_{1} \times p_{1}}$ and $D^{y} = {[d_{y} (Y_{j}, Y_{l})]}_{j, l = 1}^{p_{2}} \in ℝ^{p_{2} \times p_{2}}$ we would like to infer the true matching matrix $M^{*}$ by solving an optimization problem.

Consider the following objective function

ℰ (M) = \sum_{i, k = 1}^{p_{1}} \sum_{j, l = 1}^{p_{2}} M_{i j} M_{k l} | D_{i k}^{x} - D_{j l}^{y} | .

where the matching matrices $M \in \{0, 1\}^{p_{1} \times p_{2}}$ we optimize over are constrained to satisfy the marginal constraints $M 𝟏_{p_{2}} > 0$ and $M^{T} 𝟏_{p_{1}} > 0$ . These marginal constraints simply impose that there is at least one nonzero entry in each row and column (i.e. each metabolite in both datasets has at least one corresponding match). Searching for the $M$ minimizing $ℰ (M)$ consists of placing the nonzero entries of $M$ such that the distance profiles of the matched features are similar, so that the minimizer of this criterion provides a good candidate estimate of $M^{*}$ . This is closely related to the Gromov–Hausdorff distance (Gromov, 2001), an extension of optimal transport to the case where the sets to be coupled do not lie in the same metric space.

In practice, it is often desirable to optimize over a different set of matrices in order to make the optimization problem more tractable. Here we take intuition from optimal transport, and search over the set of coupling matrices with marginal constraints

𝐔 (a, b) = {Π \in ℝ_{+}^{p_{1} \times p_{2}} : Π 𝟏_{p_{2}} = a and Π^{T} 𝟏_{p_{1}} = b} .

where as before, $a \in ℝ_{+}^{p_{1}}$ and $b \in ℝ_{+}^{p_{2}}$ are desired marginals which are typically set to be uniform distributions $a = \frac{1}{p_{1}} 𝟏_{p_{1}}$ and $b = \frac{1}{p_{2}} 𝟏_{p_{2}}$ . These marginal vectors can be interpreted as distributions of masses a_i and b_j on the feature vectors $X_{i}$ and $Y_{j}$ respectively for $i \in [p_{1}], j \in [p_{2}]$ .

Coupling matrices in $𝐔 (a, b)$ transport the distribution of masses $a$ in the first dataset to the distribution of masses $b$ in the second dataset. Now we can formulate the Gromov–Wasserstein (GW) distance, introduced by Mémoli, 2011, as

GW (a, b) = min_{Π \in 𝐔 (a, b)} ℰ (Π)

By optimizing this objective, each entry $Π_{i j}$ now reflects the strength of the matched pair $(X_{i}, Y_{j})$ . Optimizing $GW (a, b)$ then amounts to placing larger entries in $Π$ whose paired features have similar distance profiles. Before we develop an algorithm to optimize this objective, we first modify it to allow for unbalanced matchings where marginal constraints are not enforced exactly (e.g. features in both datasets can remain unmatched).

Unbalanced Gromov–Wasserstein

In an untargeted context, not all features measured in one study are necessarily observed in another, either because these features are truly not shared or because of measurement error. However, the constraint $Π \in 𝐔 (a, b)$ in the original GW optimization criterion Equation 40 ensures that all the mass is transported from one set to another, resulting in all features being matched across studies. In order to discard study-specific features during the GW computation, we use the unbalanced Gromov–Wasserstein (UGW) distance with an additional entropic regularization for computational purposes, described in Sejourne et al., 2021. The optimization problem therefore reads

\begin{array}{ll} ℒ_{ρ, ε} (Π) = ℰ (Π) & + ρ D_{KL} (Π 𝟏_{p_{2}} \otimes Π 𝟏_{p_{2}}, a \otimes a) \\ & + ρ D_{KL} (Π^{T} 𝟏_{p_{1}} \otimes Π^{T} 𝟏_{p_{1}}, b \otimes b) \\ & + ε D_{KL} (Π \otimes Π, {(a \otimes b)}^{\otimes 2}) \end{array}

{UGW}_{ρ, ε} = min_{Π \in ℝ_{+}^{p_{1} \times p_{2}}} ℒ_{ρ, ε} (Π)

with $ρ, ε > 0$ . Here $D_{KL}$ is the Kullback–Leibler divergence defined in the previous sections and we define the tensor product ${(P \otimes P)}_{i, j, k, l} = P_{i, j} P_{k, l}$ . Here we set the desired marginal constraints to $a = \frac{1}{p_{1}} 𝟏_{p_{1}}$ and $b = \frac{1}{p_{2}} 𝟏_{p_{2}}$ as before.

As in the case of unbalanced optimal transport (Chizat et al., 2018), the regularization $ρ$ times the Kullback–Leibler divergences allows for the relaxation of the marginal constraints $Π 𝟏_{p_{2}} = a$ and $Π^{T} 𝟏_{p_{1}} = b$ . The value of $ρ > 0$ controls the extent to which we allow for mass destruction. Smaller values of $ρ$ tend to lessen the constraint on the marginals of $Π$ , while balanced GW is recovered when $ρ \to + \infty$ . As proposed in the original paper (Sejourne et al., 2021), our UGW cost modifies the UOT formulation by using the quadratic Kullback-Leibler divergence in $Π 𝟏_{p_{2}} \otimes Π 𝟏_{p_{2}}$ and $Π^{T} 𝟏_{p_{1}} \otimes Π^{T} 𝟏_{p_{1}}$ instead, hence preserving the quadratic form of the GW cost function $ℰ (Π)$ .

The term $ϵ D_{K L} (Π \otimes Π, {(a \otimes b)}^{\otimes 2})$ serves as an entropic regularization, inspired again by optimal transport. Adding such a penalty is a standard way to compute an approximate solution to the optimal transport problem using the Sinkhorn algorithm as we shall show in the following section. Here again, we modify the entropic penalty in UGW to have a quadratic form in $Π \otimes Π$ to agree with the quadratic form of the GW cost $ℰ (Π)$ . The parameter $ε$ controls the smoothness (entropy) of the coupling matrix $Π$ where larger values of $ε$ encourage $Π$ to put uniform weights on many of its entries, leading to less precision in the feature matches. However, increasing $ε$ also leads to better numerical stability and a significant speedup of the alternating Sinkhorn algorithm used to optimize the objective function described below.

UGW optimization algorithm

Now we are ready to write out an algorithm to optimize the UGW objective in Equation 42. First write our objective as

\begin{array}{ll} L_{ρ, ε} (Π) = \sum_{i, k = 1}^{p_{1}} \sum_{j, l = 1}^{p_{2}} Π_{i j} Π_{k l} | D_{i k}^{x} - D_{j l}^{y} | & + ρ D_{KL} (Π 1_{p_{2}} \otimes Π 1_{p_{2}}, a \otimes a) \\ + ρ D_{KL} (Π^{T} 1_{p_{1}} \otimes Π^{T} 1_{p_{1}}, b \otimes b) \\ + ε D_{KL} (Π \otimes Π, {(a \otimes b)}^{\otimes 2}) . \end{array}

Using the quadratic nature of our cost function, we aim to perform an alternating minimization in the two copies of $Π$ . For the moment, let’s differentiate these two copies by $Π$ and $Γ$ and write the new cost

\begin{array}{ll} F_{ρ, ε} (Π, Γ) = \sum_{i, k = 1}^{p_{1}} \sum_{j, l = 1}^{p_{2}} Π_{i j} Γ_{k l} | D_{i k}^{x} - D_{j l}^{y} | & + ρ D_{KL} (Π 1_{p_{2}} \otimes Γ 1_{p_{2}}, a \otimes a) \\ + ρ D_{KL} (Π^{T} 1_{p_{1}} \otimes Γ^{T} 1_{p_{1}}, b \otimes b) \\ + ε D_{KL} (Π \otimes Γ, (a \otimes b)^{\otimes 2}) . \end{array}

Before we expand this cost, we introduce the notation $m (π)$ to denote the sum of the elements of $π$ which can be a vector, matrix or tensor. In general, for four positive distributions $π, a \in ℝ_{+}^{p}$ and $γ, b \in ℝ_{+}^{q}$ we have that the KL satisfies the tensorization property

\begin{array}{ll} D_{KL} (π \otimes γ, a \otimes b) & = \sum_{i = 1}^{p} \sum_{j = 1}^{q} π_{i} γ_{j} \ln (\frac{π_{i} γ_{j}}{a_{i} b_{j}}) - \sum_{i = 1}^{p} \sum_{j = 1}^{q} π_{i} γ_{j} + \sum_{i = 1}^{p} \sum_{j = 1}^{q} a_{i} b_{j} \\ = m (γ) \sum_{i = 1}^{p} π_{i} \ln (\frac{π_{i}}{a_{i}}) + m (π) \sum_{j = 1}^{q} γ_{j} \ln (\frac{γ_{j}}{b_{j}}) - m (π) m (γ) + m (a) m (b) \\ = m (γ) D_{KL} (π, a) + m (π) D_{KL} (γ, b) + (m (π) - m (a)) (m (γ) - m (b)) . \end{array}

Specifically, if we remove those terms that do not depend on $γ$ we are left with

D_{KL} (π \otimes γ, a \otimes b) = m (π) D_{KL} (γ, b) + m (γ) \sum_{i = 1}^{p} π_{i} \ln (\frac{π_{i}}{a_{i}}) + \text{const} .
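The tensorization property of the KL divergence stated above holds for arbitrary positive vectors, which can be confirmed numerically. The vectors below are arbitrary illustrative values (deliberately not normalized), and the helper names are assumptions for this sketch.

```python
import math

def mass(x):
    """m(x): the sum of the entries of x."""
    return sum(x)

def kl(mu, nu):
    """Generalized KL divergence between two positive vectors."""
    return sum(m * math.log(m / n) - m + n for m, n in zip(mu, nu))

def outer(x, y):
    """Flattened outer product with entries x_i * y_j."""
    return [xi * yj for xi in x for yj in y]

# Hypothetical positive vectors that do not sum to one.
pi, a = [0.2, 0.5, 0.1], [0.3, 0.3, 0.3]
gamma, b = [0.7, 0.6], [0.5, 0.4]

lhs = kl(outer(pi, gamma), outer(a, b))
rhs = (mass(gamma) * kl(pi, a)
       + mass(pi) * kl(gamma, b)
       + (mass(pi) - mass(a)) * (mass(gamma) - mass(b)))
print(lhs, rhs)
```

The cross term $(m (π) - m (a)) (m (γ) - m (b))$ vanishes when either pair of vectors has equal total mass, which recovers the familiar additivity of the KL divergence for probability distributions.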

This allows us to write for the marginal constraints $a \in ℝ_{+}^{p_{1}}, b \in ℝ_{+}^{p_{2}}$ and couplings $Π, Γ \in ℝ_{+}^{p_{1} \times p_{2}}$ that

\begin{array}{cc} D_{KL} (Π 1_{p_{2}} \otimes Γ 1_{p_{2}}, a \otimes a) = m (Π) D_{KL} (Γ 1_{p_{2}}, a) + m (Γ) \sum_{i = 1}^{p_{1}} (Π 1_{p_{2}})_{i} \ln (\frac{(Π 1_{p_{2}})_{i}}{a_{i}}) + \text{const} . \\ D_{KL} (Π^{T} 1_{p_{1}} \otimes Γ^{T} 1_{p_{1}}, b \otimes b) = m (Π) D_{KL} (Γ^{T} 1_{p_{1}}, b) + m (Γ) \sum_{j = 1}^{p_{2}} (Π^{T} 1_{p_{1}})_{j} \ln (\frac{(Π^{T} 1_{p_{1}})_{j}}{b_{j}}) + \text{const} . \\ D_{KL} (Π \otimes Γ, (a \otimes b)^{\otimes 2}) = m (Π) D_{KL} (Γ, a \otimes b) + m (Γ) \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} \ln (\frac{Π_{i j}}{a_{i} b_{j}}) + \text{const} . \end{array}

where in the expansions above we have removed all terms that are independent of $Γ$ . Finally, expanding out $ℱ_{ρ, ε} (Π, Γ)$ and keeping only those terms that depend on $Γ$ we get

F_{ρ, ε} (Π, Γ) = \sum_{k = 1}^{p_{1}} \sum_{l = 1}^{p_{2}} Γ_{k l} C_{k l}^{Π} + ρ m (Π) D_{KL} (Γ 1_{p_{2}}, a) + ρ m (Π) D_{KL} (Γ^{T} 1_{p_{1}}, b) + ε m (Π) D_{KL} (Γ, a \otimes b)

where the cost matrix $C^{Π} \in ℝ^{p_{1} \times p_{2}}$ is defined as

C_{k l}^{Π} = \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} | D_{i k}^{x} - D_{j l}^{y} | + ρ \sum_{i = 1}^{p_{1}} (Π 1_{p_{2}})_{i} \ln (\frac{(Π 1_{p_{2}})_{i}}{a_{i}}) + ρ \sum_{j = 1}^{p_{2}} (Π^{T} 1_{p_{1}})_{j} \ln (\frac{(Π^{T} 1_{p_{1}})_{j}}{b_{j}}) + ε \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} Π_{i j} \ln (\frac{Π_{i j}}{a_{i} b_{j}}) .

where we have hidden the dependence of $C^{Π}$ on the distance matrices $D^{x}, D^{y}$ , the marginals $a, b$ , and the regularization parameters $ρ, ε$ for ease of notation.

Remarkably, the cost above in $Γ$ for fixed $Π$ is in the form of an unbalanced optimal transport problem which can be solved through unbalanced Sinkhorn iterations (Algorithm 1). Note that in our derivation above, it did not matter whether we optimized $Γ$ with $Π$ fixed or vice versa because the cost $ℱ_{ρ, ε} (Π, Γ)$ is symmetric in both of its arguments.

Our iterative algorithm for solving the unbalanced GW problem will proceed at each iteration by optimizing $Γ$ to minimize the cost above using the unbalanced Sinkhorn method, setting $Π$ equal to $Γ$ and repeating. With each iteration, we expect this iterative procedure to make smaller and smaller updates to $Γ$ until convergence. By definition, at the end of each iteration we assign $Π = Γ$ so the minimizer of $ℱ_{ρ, ε} (Π, Γ)$ we converge to should also be a minimizer of the original UGW cost $ℒ_{ρ, ε} (Π)$ in the sense that the relaxation of $ℒ_{ρ, ε} (Π)$ to $ℱ_{ρ, ε} (Π, Γ)$ is tight. This is proven rigorously under strict mathematical assumptions in Sejourne et al., 2021. We state the full UGW optimization algorithm below.

Algorithm 2. UnbalancedGromovWasserstein
input: Distance matrices $D^{x}, D^{y}$ , marginals $a, b$ , marginal relaxation $ρ$ , entropic regularization $ε$
output: Return the coupling matrix $Π$
Initialize $Π = Γ = a \otimes b / \sqrt{m (a) m (b)}$
while $(Π, Γ)$ has not converged do
    Update $Π \leftarrow Γ$
    Update $Γ = UnbalancedSinkhorn (C^{Π}, a, b, ρ m (Π), ε m (Π))$
    Rescale $Γ \leftarrow \sqrt{m (Π) / m (Γ)} Γ$
Update $Π \leftarrow Γ$
Return $Π$ .

Following the implementation of the UGW algorithm in Sejourne et al., 2021, we initialize both $Π$ and $Γ$ to be the product distribution of the marginals $a \otimes b / \sqrt{m (a) m (b)}$ before we begin the optimization. Also, we note that if $(Π, Γ)$ is a minimizer of our UGW objective $ℱ_{ρ, ε} (Π, Γ)$ , then so is $(\frac{1}{s} Π, s Γ)$ for any scale factor $s > 0$ . Hence, we can set $m (\frac{1}{s} Π) = m (s Γ)$ by choosing $s = \sqrt{m (Π) / m (Γ)}$ . This motivates the final step in the while loop of the UGW algorithm where the rescaling of $Γ$ by the factor $\sqrt{m (Π) / m (Γ)}$ leads to mass equality $m (Π) = m (Γ)$ and also stabilizes the convergence of the algorithm.
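To make the structure of Algorithm 2 concrete, the following standard-library Python sketch combines the cost matrix $C^{Π}$ , the unbalanced Sinkhorn updates of Algorithm 1, and the mass rescaling step. It is a didactic sketch only: the fixed inner iteration count, the outer stopping rule, the parameter values, and the toy distance matrices are all assumptions, and the naive cost computation scales as $p_{1}^{2} p_{2}^{2}$ , so it is only suitable for very small inputs.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    top = max(values)
    return top + math.log(sum(math.exp(x - top) for x in values))

def unbalanced_sinkhorn(C, a, b, rho, eps, n_iter=200):
    """Dual updates of Algorithm 1 (fixed number of iterations assumed)."""
    p1, p2 = len(a), len(b)
    scale = eps * rho / (eps + rho)
    f, g = [0.0] * p1, [0.0] * p2
    for _ in range(n_iter):
        f = [-scale * logsumexp([math.log(b[j]) + (g[j] - C[i][j]) / eps
                                 for j in range(p2)]) for i in range(p1)]
        g = [-scale * logsumexp([math.log(a[i]) + (f[i] - C[i][j]) / eps
                                 for i in range(p1)]) for j in range(p2)]
    return [[a[i] * b[j] * math.exp((f[i] + g[j] - C[i][j]) / eps)
             for j in range(p2)] for i in range(p1)]

def mass(P):
    """m(P): the sum of all entries of the matrix P."""
    return sum(sum(row) for row in P)

def xlog(x, y):
    """x * ln(x / y), with the convention 0 * ln(0) = 0."""
    return x * math.log(x / y) if x > 0.0 else 0.0

def local_cost(P, Dx, Dy, a, b, rho, eps):
    """Cost matrix C^Pi: distance-distortion term plus terms constant in (k, l)."""
    p1, p2 = len(a), len(b)
    rows = [sum(P[i]) for i in range(p1)]
    cols = [sum(P[i][j] for i in range(p1)) for j in range(p2)]
    const = (rho * sum(xlog(rows[i], a[i]) for i in range(p1))
             + rho * sum(xlog(cols[j], b[j]) for j in range(p2))
             + eps * sum(xlog(P[i][j], a[i] * b[j])
                         for i in range(p1) for j in range(p2)))
    return [[const + sum(P[i][j] * abs(Dx[i][k] - Dy[j][l])
                         for i in range(p1) for j in range(p2))
             for l in range(p2)] for k in range(p1)]

def unbalanced_gw(Dx, Dy, a, b, rho, eps, n_outer=20, tol=1e-9):
    """Alternating scheme of Algorithm 2; returns the final coupling."""
    norm = math.sqrt(sum(a) * sum(b))
    G = [[ai * bj / norm for bj in b] for ai in a]
    for _ in range(n_outer):
        P = G                                    # Pi <- Gamma
        m = mass(P)
        C = local_cost(P, Dx, Dy, a, b, rho, eps)
        G = unbalanced_sinkhorn(C, a, b, rho * m, eps * m)
        s = math.sqrt(m / mass(G))               # rescale so that m(Gamma) = m(Pi)
        G = [[s * x for x in row] for row in G]
        change = max(abs(G[i][j] - P[i][j])
                     for i in range(len(a)) for j in range(len(b)))
        if change < tol:                         # assumed stopping rule
            break
    return G

# Toy distance matrices: three points on a line, and the same points reordered.
Dx = [[0.0, 0.25, 1.0], [0.25, 0.0, 0.75], [1.0, 0.75, 0.0]]
Dy = [[0.0, 1.0, 0.75], [1.0, 0.0, 0.25], [0.75, 0.25, 0.0]]
a = [1.0 / 3.0] * 3
b = [1.0 / 3.0] * 3
Pi = unbalanced_gw(Dx, Dy, a, b, rho=1.0, eps=0.1)
print(Pi)
```

No claim is made that this toy run recovers the underlying permutation; the example is only meant to show how the three ingredients fit together in each outer iteration.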

Returning to our metabolomics matching problem, we further guide our UGW optimization procedure by discouraging it from matching metabolic feature pairs whose mass-to-charge ratios are incompatible. Namely, we choose a value $m_{gap}$ such that for all pairs $(i, j)$ with $i \in [p_{1}], j \in [p_{2}]$ and mass-to-charge ratios $m_{i}^{x}, m_{j}^{y}$ we enforce that

| m_{i}^{x} - m_{j}^{y} | > m_{gap} ⟹ Π_{i j} = 0.

In practice, this is done by taking the optimal transport cost $C^{Π}$ in every iteration of the UGW algorithm and premultiplying it elementwise by a factor $W \in ℝ_{+}^{p_{1} \times p_{2}}$ given by

C^{Π} \to W ⊙ C^{Π}, W_{i j} = 99 \cdot 1_{{| m_{i}^{x} - m_{j}^{y} | > m_{gap}}} + 1

where $𝟏_{𝒳}$ denotes the indicator function that is one when the condition $𝒳$ is satisfied and zero otherwise. Such a prefactor makes the transport cost very large for feature pairs with incompatible mass-to-charge ratios, and hence $Π$ places small weights at these entries. Our weighted UGW algorithm is rewritten below.
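As an illustration, the weight matrix $W$ and its elementwise product with a cost matrix can be written as follows. The mass-to-charge values, the gap of 0.01, and the function names are hypothetical and chosen only for this sketch.

```python
def mz_weights(mz_x, mz_y, m_gap, penalty=99.0):
    """W_ij = penalty * 1[|mz_x[i] - mz_y[j]| > m_gap] + 1."""
    return [[penalty * (abs(mx - my) > m_gap) + 1.0 for my in mz_y]
            for mx in mz_x]

def apply_weights(W, C):
    """Elementwise product W ⊙ C."""
    return [[w * c for w, c in zip(w_row, c_row)]
            for w_row, c_row in zip(W, C)]

# Hypothetical mass-to-charge ratios of features in the two datasets.
mz_x = [100.000, 250.004]
mz_y = [100.005, 250.000, 400.000]

W = mz_weights(mz_x, mz_y, m_gap=0.01)
C = [[0.2, 0.3, 0.4], [0.5, 0.1, 0.6]]   # placeholder cost matrix
weighted_C = apply_weights(W, C)
print(W)
```

Pairs within the gap keep their original cost (weight 1), while all other pairs have their cost multiplied by 100.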

Algorithm 3. WeightedUnbalancedGromovWasserstein
input: Distance matrices $D^{x}, D^{y}$ , marginals $a, b$ , marginal relaxation $ρ$ , entropic regularization $ε$ , mass-to-charge ratios $m^{x}, m^{y}$ , mass-to-charge ratio gap $m_{gap}$
output: Return the coupling matrix $Π$
Initialize $Π = Γ = a \otimes b / \sqrt{m (a) m (b)}$
Set $W_{i j} = 99 \cdot 1_{{| m_{i}^{x} - m_{j}^{y} | > m_{gap}}} + 1$ for $i \in [p_{1}]$ and $j \in [p_{2}]$
while $(Π, Γ)$ has not converged do
    Update $Π \leftarrow Γ$
    Update $Γ = UnbalancedSinkhorn (W ⊙ C^{Π}, a, b, ρ m (Π), ε m (Π))$
    Rescale $Γ \leftarrow \sqrt{m (Π) / m (Γ)} Γ$
Update $Π \leftarrow Γ$
Return $Π$ .

As mentioned before, the coupling matrix returned by our weighted UGW algorithm is a real-valued matrix rather than a binary matching matrix. In the next section, we describe how we incorporate metabolite retention time information to filter out unlikely pairs in our coupling matrix and transform it into a valid one-to-one matching of features across two datasets.

Retention time drift estimation and filtering

To filter out unlikely matches from the coupling matrix returned by Algorithm 3 above, we use the retention times (RTs) of the metabolites in both datasets. We remind the reader that RTs were not incorporated into the weighted UGW algorithm since they often exhibit a non-linear deviation between datasets, and hence are not directly comparable. However, using the metabolite coupling $\tilde{Π} \in ℝ_{+}^{p_{1} \times p_{2}}$ obtained from Algorithm 3, it is possible to estimate this RT drift. The estimated RT drift $\hat{f} : ℝ_{+} \to ℝ_{+}$ allows us to assess the plausibility of the pairs recovered by the restricted UGW coupling $\tilde{Π}$ , and discard pairs incompatible with the estimated drift.

We propose to learn the drift $\hat{f}$ through the weighted spline regression

\min_{f \in ℬ_{n, k}} \sum_{i = 1}^{p_{1}} \sum_{j = 1}^{p_{2}} {\tilde{Π}}_{i j} | f (R T_{i}^{x}) - R T_{j}^{y} | \tag{51}

where $ℬ_{n, k}$ is the set of $n$ -order B-splines with $k$ knots. All pairs $(R T_{i}^{x}, R T_{j}^{y})$ in objective Equation 51 are weighted by the coefficients of $\tilde{Π}$ so that larger weights are given to pairs identified with high confidence in the first step of our procedure.

Pairs identified as incompatible with the estimated RT drift are then discarded from the coupling matrix. To do this, we take the estimated RT drift $\hat{f}$ and the set $𝒮 = \{(i, j) : {\tilde{Π}}_{i j} \neq 0\}$ of pairs with nonzero entries in $\tilde{Π}$. We then define the residual associated with $(i, j) \in 𝒮$ as

r_{\hat{f}} (i, j) = | \hat{f} (R T_{i}^{x}) - R T_{j}^{y} | .

The 95% prediction interval and the median absolute deviation (MAD) of these residuals are given by

\begin{aligned} PI &= 1.96 \times std (\{r_{\hat{f}} (i, j), (i, j) \in 𝒮\}) \\ MAD &= median (\{| r_{\hat{f}} (i, j) - μ_{r} |, (i, j) \in 𝒮\}) \\ μ_{r} &= median (\{| r_{\hat{f}} (i, j) |, (i, j) \in 𝒮\}) \end{aligned} \tag{53}

where the functions std and median denote the standard deviation and the median, respectively. Following Habra et al., 2021, we then create a new filtered coupling matrix $\hat{Π} \in ℝ_{+}^{p_{1} \times p_{2}}$ given by

{\hat{Π}}_{i j} = \begin{cases} {\tilde{Π}}_{i j} & \text{if } r_{\hat{f}} (i, j) < μ_{r} + r_{thresh} \\ 0 & \text{otherwise} \end{cases}

where $r_{thresh}$ is a given filtering threshold. The procedure of estimating the drift function $\hat{f}$ in Equation 51 and filtering the coupling can be repeated for multiple iterations to improve both the drift and the coupling estimates. In our main algorithm, we use two preliminary iterations in which we estimate the RT drift and discard outliers with $r_{thresh} = PI$, that is, pairs falling outside the 95% prediction interval. We then re-estimate the drift and perform a final filtering step with the more stringent MAD criterion by setting $r_{thresh} = 2 \times MAD$.
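
The residual-based filtering can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `filter_pairs` and the toy residuals are hypothetical, the population standard deviation is assumed for std, and the drift is not re-estimated between the two calls as it would be in the full procedure.

```python
import statistics


def filter_pairs(residuals, mode="PI"):
    """Keep the pairs whose residual is below median + r_thresh.

    residuals maps a pair (i, j) to |f_hat(RT_i^x) - RT_j^y|.
    mode "PI" uses r_thresh = 1.96 * std, mode "MAD" uses r_thresh = 2 * MAD.
    """
    values = list(residuals.values())
    mu_r = statistics.median(values)
    if mode == "PI":
        # Assumption: population standard deviation of the residuals.
        r_thresh = 1.96 * statistics.pstdev(values)
    else:
        mad = statistics.median([abs(r - mu_r) for r in values])
        r_thresh = 2.0 * mad
    return {pair: r for pair, r in residuals.items() if r < mu_r + r_thresh}


# Toy residuals (hypothetical), with one gross outlier at pair (3, 5).
residuals = {(0, 0): 0.10, (1, 1): 0.20, (2, 2): 0.15, (3, 5): 3.00, (4, 4): 0.12}
after_pi = filter_pairs(residuals, mode="PI")
after_mad = filter_pairs(after_pi, mode="MAD")
```

In this toy example the prediction-interval pass removes only the gross outlier, and the subsequent MAD pass additionally removes the pair with the largest remaining residual, which mirrors the coarse-then-stringent ordering described above.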

At this stage, it is possible for $\hat{Π}$ to still contain coefficients of very small magnitude. As an optional postprocessing step, we discard these coefficients by setting all entries smaller than $τ \max (\hat{Π})$ to zero for some scaling constant $τ \in [0, 1]$. Lastly, a feature from either study could have multiple possible matches, since $\hat{Π}$ can have more than one non-zero coefficient per row or column. Although reporting multiple matches can be helpful in an exploratory context, for the sake of simplicity in our analysis, the final output of GromovMatcher is a one-to-one matching. Consequently, we only keep those metabolite pairs $(i, j)$ where the entry ${\hat{Π}}_{i j}$ is the largest in its corresponding row and column. All nonzero entries of $\hat{Π}$ which do not satisfy this criterion are set to zero. Finally, we convert $\hat{Π}$ into a binary matching matrix $M \in \{0, 1\}^{p_{1} \times p_{2}}$ with ones in place of its nonzero entries, and this final output is returned to the user.
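
The optional $τ$-thresholding and the row/column maximum rule can be sketched as follows. This is a minimal illustration in plain Python on a small hypothetical coupling; the function name `one_to_one_matching` is an assumption and ties are broken by taking the first maximal index, a detail the text does not specify.

```python
def one_to_one_matching(coupling, tau=0.0):
    """Binarize a nonnegative coupling (list of rows) into a 0/1 matching."""
    p1, p2 = len(coupling), len(coupling[0])
    cutoff = tau * max(max(row) for row in coupling)
    # Optional thresholding: entries smaller than tau * max(coupling) become zero.
    kept = [[v if v >= cutoff else 0.0 for v in row] for row in coupling]
    # Index of the largest entry in each row and in each column.
    row_best = [max(range(p2), key=lambda j: kept[i][j]) for i in range(p1)]
    col_best = [max(range(p1), key=lambda i: kept[i][j]) for j in range(p2)]
    # Keep (i, j) only if it is positive and largest in both its row and column.
    return [
        [int(kept[i][j] > 0 and row_best[i] == j and col_best[j] == i)
         for j in range(p2)]
        for i in range(p1)
    ]


# Toy coupling (hypothetical values).
coupling = [[0.5, 0.1, 0.0], [0.2, 0.0, 0.05], [0.0, 0.3, 0.0]]
M_gm = one_to_one_matching(coupling, tau=0.0)
M_strict = one_to_one_matching(coupling, tau=0.7)
```

Here the second row has no entry that is maximal in both its row and its column, so that feature is left unmatched; raising $τ$ further removes the weaker of the two remaining matches.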

As a naming convention, we use the abbreviation GM for our GromovMatcher method, and use the abbreviation GMT when running GromovMatcher with the optional $τ$ -thresholding step.

GromovMatcher algorithm summary

In summary, our full GromovMatcher algorithm consists of (1) UGW optimization followed by (2) retention time drift estimation and filtering.

The tuning of $ρ$ and $ε$ was computationally driven and the two parameters were set as low as possible, with $ρ = 0.05$ and $ε = 0.005$. Based on the literature (Loftfield et al., 2021; Hsu et al., 2019; Climaco Pinto et al., 2022; Habra et al., 2021; Chen et al., 2021) and what is considered to be a plausible variation of a feature’s $m / z$, we set $m_{gap}$ = 0.01 ppm. For RT drift estimation, the order of the B-splines was set to $n = 3$ by default, while the number of knots $k$ was selected by 10-fold cross-validation. If the optional thresholding step was applied in GMT, we set $τ = 0.3$. Otherwise, we let $τ = 0$, which gives the unthresholded GM algorithm.

Algorithm 4. GromovMatcher
input : Distance matrices $D^{x}, D^{y}$, marginals $a, b$, marginal relaxation $ρ$, entropic regularization $ε$, mass-to-charge ratios $m^{x}, m^{y}$, mass-to-charge ratio gap $m_{gap}$, retention times $R T^{x}, R T^{y}$, B-spline order $n$, filtering threshold $τ$
output: Return the matching matrix $M$ and the retention time drift $\hat{f}$
# Step 1: Weighted UGW optimization
Compute $\tilde{Π} = WeightedUnbalancedGromovWasserstein (D^{x}, D^{y}, a, b, ρ, ε, m^{x}, m^{y}, m_{gap})$
# Step 2: Retention time drift estimation and filtering
for $i = 1 : 3$ do
    Perform weighted spline regression Equation 51 for RT drift $\hat{f} \in ℬ_{n, k}$ where $k$ is chosen by 10-fold cross validation
    Initialize $r_{thresh} = 0$
    if $i < 3$ then
        Set $r_{thresh} = PI$ from Equation 53
    else
        Set $r_{thresh} = 2 \times MAD$ from Equation 53
    Compute the filtered coupling $\hat{Π}$ from $\tilde{Π}$ using the threshold $r_{thresh}$
    Set $\tilde{Π} = \hat{Π}$
Compute $U = \max (\hat{Π})$
Set ${\hat{Π}}_{i j} = 0$ if ${\hat{Π}}_{i j} < τ U$ for $i \in [p_{1}]$ and $j \in [p_{2}]$
Set ${\hat{Π}}_{i j} = 0$ if $i \neq {argmax}_{k} {\hat{Π}}_{k, j}$ or $j \neq {argmax}_{k} {\hat{Π}}_{i, k}$ for $i \in [p_{1}]$ and $j \in [p_{2}]$
Define the binarized matching $M_{i j} = 1_{\{{\hat{Π}}_{i j} > 0\}}$
Return $M$ and $\hat{f}$

Appendix 2

Here we discuss existing metabolomic alignment methods and the hyperparameter experiments we perform on them. We consider two existing alignment methods for comparison, metabCombiner (Habra et al., 2021) and M2S (Climaco Pinto et al., 2022). Both take the same kind of input as GromovMatcher, i.e. feature tables in which features are identified by their $m / z$, RT, and intensities across samples.

MetabCombiner hyperparameter experiments

MetabCombiner (Habra et al., 2021) is a three-step process that begins by grouping features based on their $m / z$ within user-specified bins. This creates a search space for potential feature pairs. In the second step, MetabCombiner estimates the RT drift using the potential feature pairs identified in the first step, and eliminates outlying pairs over several iterations. This step can incorporate prior knowledge by identifying shared features and marking them as anchors, which are not discarded. In the final step, MetabCombiner scores the remaining feature pairs based on their $m / z$ , RT, and relative intensity compatibility to discriminate between multiple matches for one feature. The scoring system relies on weights assigned to $m / z$ , RT, and feature intensities, with the magnitude of those weights reflecting the reliability of the corresponding measurements across studies.

MetabCombiner (Habra et al., 2021) includes adjustable parameters throughout the pipeline. We set most of them to default values unless otherwise stated. MetabCombiner first establishes candidate pairs by binning features in the $m / z$ dimension with a width of binGap, and pairing the features sorted by relative intensities. The ‘binGap’ parameter sets the $m / z$ tolerance of metabCombiner, similar to $m_{gap}$ in GromovMatcher. We used the same value of 0.01 as in GromovMatcher.

MetabCombiner then estimates the RT drift using basis splines, and removes pairs associated with a high residual (twice the mean model error) from the candidate set.

In our main experiment, the RT drift is estimated exclusively using candidate pairs selected by the pipeline. However, it is also possible to include known ground truth pairs as ‘anchors’ to estimate the RT drift. We choose not to rely on prior knowledge for drift estimation, as Habra et al., 2021 show their drift estimation to be efficient and robust even without prior knowledge. To confirm this claim, we conduct a sensitivity analysis comparing the results obtained in our main experiment with those obtained when supplying metabCombiner with known shared metabolites to anchor the RT drift estimation. We randomly select 100 anchors from the ground truth matching and compute the metabCombiner matchings with otherwise identical settings as in our main experiment. The results from this analysis (reported in Appendix 2—figure 1) show that the unsupervised RT drift estimation (using anchors selected by the pipeline only) performs as well as the supervised RT drift estimation, indicating that the drift estimation is very consistent with or without known shared metabolites.

After establishing candidate pairs and filtering out those that contradict the estimated RT drift, metabCombiner discriminates between multiple matches using a scoring system that considers $m / z$ , RT, and rankings of the median feature intensities. Each dimension has a specific weight that can be left at default, manually adjusted, or automatically tuned using known matched pairs. Habra et al., 2021 provide qualitative guidelines for tuning the weights manually, mainly based on the experimental conditions and visual inspection of the RT drift plot. Since this approach is difficult to implement in the various settings we consider for our simulation study, we rely on the quantitative tuning function included in the metabCombiner pipeline. This function takes into account known shared features and tunes the weights to optimize the scores of those known matches. We randomly select 100 known true matches to define the objective function metabCombiner maximizes. We search over the recommended range of values, with the $m / z$ weight $A \in [50, 150]$ , the RT weight $B \in [5, 20]$ and the feature intensities weight $C \in [0, 1]$ . Appendix 2—figure 1 presents the results obtained with the weights set at default values ( $A = 100, B = 15, C = 0.5$ ), as a sensitivity analysis.

Appendix 2—figure 1

Performance of metabCombiner with the different parameter settings.

The first setting, labelled ‘Scores’, corresponds to the design of our main analysis, where 100 randomly selected true pairs are supplied to metabCombiner to set the scoring weights automatically, but are not otherwise used. In the second setting, labelled ‘Scores + RT’, metabCombiner is allowed to use the 100 true pairs not only to set the scoring weights, but also to estimate the RT drift. Finally, in the third ‘Default’ setting, we do not use any prior knowledge for the RT drift estimation and keep the scoring weights at their default values.

M2S hyperparameter experiments

Climaco Pinto et al., 2022 introduce M2S as a more versatile alternative to metabCombiner, while still adhering to most of its core principles. Like metabCombiner, M2S follows a three-step process. First it searches for matches within user-defined thresholds for $m / z$ , retention time, and mean feature intensity. Next, M2S estimates $m / z$ , RT and feature intensity drifts between datasets and removes any outlier pairs. Finally, M2S selects the best match using a scoring system that weighs each measurement, similar to metabCombiner. M2S notably stands out by providing greater flexibility in the methods and measurements used at each step of the procedure, resulting however in a larger number of parameters that require manual fine-tuning. To address this, we adopt two different approaches for the simulation study and the EPIC study alignment. In the simulation study, we set the initial thresholds to oracle values and investigate technical parameters. For the EPIC study alignment, we use the combination of technical parameters with the best average F1-score in the simulation study and select the best threshold values based on the performance on the validation subset.

More precisely, M2S first matches all pairs of metabolic features whose absolute differences in $m / z$, RT, and median of $\log_{10}$ FI are within the user-defined thresholds ‘MZ_intercept’, ‘RT_intercept’ and ‘log10FI_intercept’. On simulated data experiments, we set these thresholds to MZ_intercept = 0.01, RT_intercept = 3.5 and log10FI_intercept = 0.2, which are large enough not to exclude any true feature matches in any of the scenarios for our simulated data under low, medium, and high overlap/noise (see Methods). M2S also offers more detailed options to match features whose absolute difference stays within lower and upper bound lines with a given slope, where the intercepts of these lines are defined using the values above. In our analysis, we set the slopes of these linear boundaries to zero so as not to remove any true matches. Because the reference and target studies we are matching in the simulated analysis are on the same scale, we set the FI adjust method to ‘none’.

The second step of M2S involves calculating penalization scores for every pair of matches, which are used to determine the best set of matches between metabolic features of both datasets. This step depends on a set of hyperparameters over which we perform a grid search to optimize the performance of M2S. For estimating the $m / z$, RT, and FI drift, the hyperparameters are the percentage of neighbors ‘nrNeighbors’, the neighborhood shape ‘neighMethod’, and the LOESS span percentage ‘pctPointsLoess’ used to smooth the estimated drift functions. After the drifts are estimated, they are normalized using a method specified by ‘residPercentile’ that puts the $m / z$, RT, and FI residuals on the same scale. We always fix residPercentile = NaN, which defaults to the standard 2 × MAD normalization. Next, for every remaining metabolic feature match, the residuals/drifts of the $m / z$, RT, and FI are combined by taking the weighted square root sum of squares. For unnormalized data where feature intensity magnitudes are important, we weight all three drifts equally using $W = (1, 1, 1)$, and for data with normalized feature intensities we set the FI drift weight to zero such that $W = (1, 1, 0)$. Finally, using these weighted penalization scores, M2S selects the best matched pair within a multiple match cluster to obtain a one-to-one matching between datasets.
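
To make the scoring step concrete, the following sketch computes a weighted root-sum-of-squares score from normalized $m / z$, RT, and FI residuals and picks the lowest-scoring candidate in a cluster. It is an illustration in Python rather than M2S code: the function name, the toy residuals, and the choice to apply each weight to the squared residual are assumptions (with weights of 0 or 1, as used here, the placement of the weight does not change the result).

```python
import math


def penalization_score(residuals, weights=(1.0, 1.0, 1.0)):
    """Weighted square root of the sum of squared normalized residuals."""
    return math.sqrt(sum(w * r ** 2 for w, r in zip(weights, residuals)))


# Hypothetical cluster: one reference feature with two candidate matches,
# each described by normalized (m/z, RT, FI) residuals.
candidates = {"target_a": (0.3, 0.4, 1.2), "target_b": (0.6, 0.8, 0.1)}
weights = (1.0, 1.0, 0.0)  # FI ignored, as for data with normalized intensities
scores = {name: penalization_score(r, weights) for name, r in candidates.items()}
best = min(scores, key=scores.get)
```

With the FI weight set to zero, the large FI residual of the first candidate no longer counts against it, so it is selected over the second candidate.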

The third and final step of M2S involves removing those remaining matches which have large differences in $m / z$ , RT, or FI. This can be performed using several methods indicated by the hyperparameter ‘methodType’. Each method excludes those matched pairs whose differences in $m / z$ , RT, or FI exceed a certain number of median absolute deviations indicated by the parameter ‘nrMAD’. The remaining one-to-one metabolic feature matches are returned as the final result of the M2S algorithm.

To optimally tune M2S on our simulated experiments, we determine the optimal M2S parameter combination for each individual simulation setting (low, medium, high overlap and noise) by performing a grid search over the product of parameter lists

nrNeighbors = [0.01, 0.05, 0.1, 0.5, 1]
neighMethod = [‘cross’, ‘circle’]
pctPointsLoess = [0, 0.1, 0.5]
methodType = [‘none’, ‘scores’, ‘byBins’, ‘trend_mad’, ‘residuals_mad’]
nrMAD = [1, 3, 5]

Each parameter combination for M2S is tested across 20 randomly generated datasets at the same overlap and noise settings. For each setting, the combination of parameters above with the best average F1-score across these 20 trials is used as the optimal parameter choice.
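
The size of this search can be checked with a short enumeration. The sketch below is written in Python for illustration only (M2S itself is not called here); `best_combination` and the placeholder objective `toy_mean_f1` are hypothetical names, and in practice the objective would run M2S on the 20 simulated dataset pairs and return the average F1-score.

```python
from itertools import product

grid = {
    "nrNeighbors": [0.01, 0.05, 0.1, 0.5, 1],
    "neighMethod": ["cross", "circle"],
    "pctPointsLoess": [0, 0.1, 0.5],
    "methodType": ["none", "scores", "byBins", "trend_mad", "residuals_mad"],
    "nrMAD": [1, 3, 5],
}


def best_combination(grid, mean_f1):
    """Return the parameter dict maximizing mean_f1 and the grid size."""
    combos = [dict(zip(grid, values)) for values in product(*grid.values())]
    return max(combos, key=mean_f1), len(combos)


def toy_mean_f1(params):
    # Placeholder objective (hypothetical), standing in for the average
    # F1-score of M2S over the 20 simulated trials.
    return -abs(params["nrNeighbors"] - 0.1) - abs(params["nrMAD"] - 3)


best, n_combos = best_combination(grid, toy_mean_f1)
```

The product of the five lists contains 5 × 2 × 3 × 5 × 3 = 450 combinations, each of which is evaluated on 20 datasets per simulation setting.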

M2S applies initial RT thresholds to search for candidate pairs, which may favor settings where the RT drift follows a linear trend. Therefore, as a sensitivity analysis, we apply M2S to simulated data with a linear drift. The simulation process is identical to that of our main simulation study, except for the deviation of the RT in dataset 2. Specifically, for a given overlap value, we divide the original real-world dataset into two smaller datasets and introduce random noise to the $m / z$, RT and intensities of the features, without introducing a systematic deviation to the RT in dataset 2. M2S parameters are kept identical to the ones used in our main analysis in comparable settings. The results obtained by M2S on three pairs of datasets generated for three overlap values (0.25, 0.5 and 0.75) and a medium noise level are reported in Appendix 2—table 1. While the results obtained in a high overlap setting are close to those obtained in our main analysis, M2S performs better in a low overlap setting when the RT drift is linear than it does in our main analysis. This observation is consistent with the results obtained by M2S on EPIC data, considering the relatively low estimated overlap between the aligned EPIC studies in our main analysis.

Appendix 2—table 1

Performance of M2S in a setting where the RT drift between studies is linear.

Metric	Low overlap	Medium overlap	High overlap
Precision	0.831	0.917	0.947
Recall	0.934	0.933	0.939

For the EPIC data, we select the parameter combination that yields the highest F1-score across all simulated settings. However, due to the unavailability of oracle values for setting initial thresholds, we perform a search over several MZ intercept values (0.01, 0.05, and 0.1), RT intercept values (0.1, 0.5, 1, and 5), and logFI intercept values (1, 10, and 100).

Appendix 3

In this section, we study the sensitivity of all three alignment methods GMT, M2S, and mC to the validation dataset split when creating two validation studies for matching. As described in the section "Validation on ground-truth data" and depicted in Figure 2 of the main text, we generate two datasets to be matched by splitting an initial LC-MS dataset with $p$ features and $n$ samples into two smaller overlapping datasets. The first dataset has $p_{1}$ features and $n_{1}$ samples while the second dataset has $p_{2}$ features and $n_{2}$ samples. The sets of samples in both datasets are disjoint such that $n_{1} + n_{2} = n$. However, the dataset split is constructed such that both datasets share $\approx λ p$ of their features, where $λ \in [0, 1]$ is an overlap fraction. Specifically, this is done by defining the dataset feature sizes as

p_{1} = ⌊ (λ + λ_{f} (1 - λ)) p ⌋, p_{2} = ⌊ (λ + (1 - λ_{f}) (1 - λ)) p ⌋

and the dataset sample sizes as

n_{1} = ⌊ λ_{s} n ⌋, n_{2} = n - ⌊ λ_{s} n ⌋ .

As before, $⌊ \cdot ⌋$ and $⌈ \cdot ⌉$ denote integer floor and ceiling functions. Then, taking the original LC-MS dataset and randomly permuting its samples and features, the first $p_{1}$ features and first $n_{1}$ samples are placed into dataset 1 while the last $p_{2}$ features and last $n_{2}$ samples are placed into dataset 2. It is easy to check that with such a splitting procedure, the feature overlap between both datasets is $p_{1} + p_{2} - p \approx λ p$.

Here $λ_{f} \in [0.5, 1]$ controls the fraction of features in dataset 1 that is not shared with dataset 2 and $λ_{s} \in (0, 1)$ controls the fraction of samples in dataset 1 vs. dataset 2. In particular, if $λ_{f} = 1$ then the features in dataset 2 are entirely a subset of those in dataset 1. In the experiments described in the main text, we always set $λ_{f} = λ_{s} = 0.5$ so as to balance the number of features and samples in both resulting datasets.
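
The split sizes can be verified numerically. The sketch below, in plain Python, evaluates the formulas for $p_{1}, p_{2}, n_{1}, n_{2}$ and counts the shared features when the first $p_{1}$ and last $p_{2}$ feature indices are assigned to the two datasets; the function name `split_sizes` and the values of $p$ and $n$ are hypothetical.

```python
import math


def split_sizes(p, n, overlap, lam_f=0.5, lam_s=0.5):
    """Feature and sample counts of the two datasets produced by the split."""
    p1 = math.floor((overlap + lam_f * (1 - overlap)) * p)
    p2 = math.floor((overlap + (1 - lam_f) * (1 - overlap)) * p)
    n1 = math.floor(lam_s * n)
    n2 = n - n1
    return p1, p2, n1, n2


p, n = 1000, 200  # hypothetical dataset dimensions
p1, p2, n1, n2 = split_sizes(p, n, overlap=0.5)

# After permutation, dataset 1 takes the first p1 features and dataset 2 the last p2.
features_1 = set(range(p1))
features_2 = set(range(p - p2, p))
n_shared = len(features_1 & features_2)
```

With $λ = 0.5$ and $λ_{f} = λ_{s} = 0.5$, each dataset receives 750 features and 100 samples, and the two feature sets share $p_{1} + p_{2} - p = 500 = λ p$ features.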

Now we study how the performance of all three alignment methods changes when $λ_{f}, λ_{s}$ and the feature overlap $λ$ are varied. Here we vary the feature overlap $λ \in \{0.25, 0.5, 0.75\}$, the feature fraction $λ_{f} \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$, and the sample fraction $λ_{s} \in \{0.1, 0.2, \dots, 0.9\}$. In Appendix 3—figures 1–3 we show how the precision and recall of GMT, M2S, and mC depend on these parameters. Here we use the same unnormalized validation data and experimental setup as described in the main text section "Validation on ground-truth data" and in the Materials and methods section "Validation on simulated data". For each triple $(λ, λ_{f}, λ_{s})$ we randomly generate 20 dataset splits with these parameters and show the average precision and recall for each method over these trials. Our method GMT (thresholded GromovMatcher) is applied out-of-box with the default hyperparameter settings. The algorithm hyperparameters for mC and M2S are chosen optimally for each individual triple of dataset parameters $(λ, λ_{f}, λ_{s})$ to maximize the average F1 score in each setting. The hyperparameters searched over when optimizing mC and M2S are described in detail in Appendix 2.

Consistent with prior validation experiments, we find that GromovMatcher outperforms both mC and M2S in all dataset regimes, for low overlap and high overlap $λ$ as well as for varying balances of features $λ_{f}$ and samples $λ_{s}$. Remarkably, all three methods exhibit the same sensitivity to variations of $(λ, λ_{f}, λ_{s})$. All methods exhibit a monotonic decrease in their precision as $λ_{f}$ drops from 0.9 to 0.5. In other words, the most challenging setting for matching both datasets is when dataset 1 and dataset 2 both have an equal number of unshared features (i.e. $λ_{f} = 0.5$). Likewise, the simplest setting for matching is when the features in dataset 2 are exactly a subset of the features in dataset 1 (i.e. $λ_{f} = 1$). Sensitivity to this parameter $λ_{f}$ is most noticeable at low feature overlap $λ = 0.25$.

Appendix 3—figure 1

Sensitivity of thresholded GromovMatcher (GMT) to feature overlap fraction $λ$ , feature imbalance fraction $λ_{f}$ , and sample imbalance fraction $λ_{s}$ between two datasets being matched.

Appendix 3—figure 2

Sensitivity of M2S to feature overlap fraction $λ$ , feature imbalance fraction $λ_{f}$ , and sample imbalance fraction $λ_{s}$ between two datasets being matched.

Appendix 3—figure 3

Sensitivity of metabCombiner (mC) to feature overlap fraction $λ$ , feature imbalance fraction $λ_{f}$ , and sample imbalance fraction $λ_{s}$ between two datasets being matched.

Appendix 4

Here we describe additional preprocessing details and analyses of the EPIC data.

Centered and scaled data - Negative mode

In this section, we present the results obtained on centered and scaled EPIC data in negative mode, shown in Figure 4 of our main paper. However, due to the smaller size of the validation subset (42 features examined in negative mode compared to 163 in positive mode), the evaluation of the performance of the three methods may be less reliable than in positive mode.

First, we align the CS and HCC studies in negative mode and detect a total of 449, 492, and 180 matches with GM, M2S, and metabCombiner, respectively. Similar to the positive mode analysis, we evaluate the precision and recall of the three methods on the 42 feature validation subset, of which 19 were manually matched. GM and M2S demonstrate identical F1-scores of 0.98, while metabCombiner performs poorly in comparison. GM recovers all 19 true matches and identifies only 1 false positive, while M2S produces no false positives but misses 1 true positive.

Next, we align the CS and PC studies in negative mode and detect a total of 485, 569, and 314 matches with GM, M2S, and metabCombiner, respectively. Again, we evaluate the precision and recall of the three methods on the 42 feature validation subset, of which 26 were manually matched. MetabCombiner performs better than in the other EPIC pairings with an F1-score of 0.857, but is still outperformed by the other two methods. GM is slightly outperformed by M2S in this setting, with an almost identical precision of 0.93, but a slightly higher recall for M2S due to detecting 1 additional true positive. However, this remains a good performance for GM since M2S was optimally tuned using the validation subset itself.

Non-centered and non-scaled data

As a sensitivity analysis, we apply the three methods to EPIC data that has not been centered or scaled. The detailed results can be found in Appendix 4—table 1.

Appendix 4—table 1

Precision and recall on the EPIC validation subset for unnormalized data in (a) positive mode, and (b) negative mode.

95% confidence intervals were computed using modified Wilson score intervals (Brown et al., 2001; Agresti and Coull, 1998).

(a) Positive mode
Method	CS ⟷ HCC Precision	CS ⟷ HCC Recall	CS ⟷ PC Precision	CS ⟷ PC Recall
GromovMatcher	0.988 (0.937, 0.999)	0.944 (0.876, 0.997)	0.873 (0.776, 0.932)	0.939 (0.854, 0.976)
M2S	0.967 (0.908, 0.991)	0.978 (0.923, 0.996)	0.855 (0.759, 0.917)	0.985 (0.919, 0.999)
metabCombiner	0.979 (0.889, 0.999)	0.511 (0.410, 0.612)	0.926 (0.766, 0.987)	0.379 (0.271, 0.499)

(b) Negative mode
Method	CS ⟷ HCC Precision	CS ⟷ HCC Recall	CS ⟷ PC Precision	CS ⟷ PC Recall
GromovMatcher	0.950 (0.764, 0.997)	1.000 (0.832, 1.000)	0.964 (0.823, 0.998)	0.964 (0.823, 0.998)
M2S	1.000 (0.824, 1.000)	0.947 (0.754, 0.997)	0.931 (0.780, 0.988)	0.964 (0.823, 0.998)
metabCombiner	1.000 (0.566, 1.000)	0.263 (0.118, 0.488)	1.000 (0.785, 1.000)	0.500 (0.326, 0.674)
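
For readers unfamiliar with Wilson score intervals, the sketch below computes the standard (unmodified) interval for a binomial proportion in plain Python. It is an illustration only: the intervals in the table use the modified version, which adjusts the bounds when the observed proportion is close to 0 or 1, so this function is not expected to reproduce every reported bound. The function name and the example counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Standard Wilson score interval for a binomial proportion."""
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical counts: 19 correct matches out of 20 reported matches.
lower, upper = wilson_interval(19, 20)
```

Unlike the usual normal-approximation interval, the Wilson interval is not centered on the observed proportion and always stays within [0, 1], which matters for the small validation subsets considered here.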

M2S was tuned manually on the validation subset to ignore feature intensities in both cases. As a result, it maintains its performance compared to our main experiment. On the other hand, the performance of GM and metabCombiner is affected by the lack of consistency in feature intensities. MetabCombiner’s recall drops slightly but its precision remains comparable to that of our main experiment, with the method clearly favoring the latter. Although GM’s recall decreases slightly in positive mode, it remains more precise than the optimally tuned M2S, and it balances precision and recall better than metabCombiner. Interestingly, GM’s results in negative mode are improved compared to our main experiment, and it outperforms both mC and M2S. However, since the validation subset in negative mode is relatively small, these differences may not be significant. Nonetheless, GM maintains a good performance, similar to that of the optimally tuned M2S.

Similar to the analysis we conducted on centered and scaled data, we find a high number of false positives when aligning the CS study and the PC study in positive mode. Therefore, we manually examine the matches recovered by GM. Our examination reveals 2 false positives, 4 unclear matches, and 3 additional good matches that GM also identifies in our main analysis. This demonstrates that the lack of centering and scaling results in two additional false positives for GM that are not present in our main results.

Illustration for alcohol biomarker discovery

Loftfield et al., 2021 identified 205 features associated with alcohol intake in the CS study, using a false discovery rate (FDR) correction to account for multiple testing. By applying an FDR correction in our pooled analysis, we identify 243 features associated with alcohol intake. Out of those 243 features, 185 are consistent with the features identified in the discovery step of Loftfield et al., 2021, while 55 features are newly discovered (Figure 5c). We examine the 20 features identified as significant in Loftfield et al.’s discovery analysis but that are not significant in our pooled analysis. Both manual and GM matching yield identical results for these features, indicating that the loss of significance is not due to incorrect matching. Upon further investigation, we find that these features do not demonstrate a meaningful association with alcohol intake in the HCC and PC studies. This observation is reinforced by the fact that none of these features are among the 10 features that persisted after the validation step in Loftfield et al.

Out of the 205 features initially discovered in Loftfield et al., 2021, 10 are replicated in the EPIC HCC and PC studies using the more stringent Bonferroni correction. When using a Bonferroni correction in our pooled analysis, we find significant association between alcohol intake and 92 features, 36 of which are effectively shared by the three studies. Notably, these features include all 10 features that were retained in Loftfield et al. (Figure 5c).

This analysis illustrates how GromovMatcher can be used in the context of biomarker discovery, and its potential to allow for increased statistical power.

Appendix 5

Here we investigate how the choice of the reference dataset influences the discovery of metabolites shared across the CS, HCC and PC EPIC studies by GromovMatcher. All three methods considered in this paper, GromovMatcher, M2S, and metabCombiner, are limited to the comparison of two datasets. However, they can still be used to compare and pool multiple datasets using a multi-step procedure. Namely, this can be done by designating a ‘reference’ dataset and aligning all studies to it one by one. We take this exact approach in our analysis when aligning the CS, HCC, and PC studies of the EPIC data in positive mode. Namely, the HCC and PC studies are both aligned to the CS study (see main text Figure 5b). However, this method raises two critical questions: (i) how does the use of a reference dataset affect matching results, and (ii) how is the matching affected by the choice of reference dataset.

To address these questions, we compare the features identified as common to the three studies using two different studies as references: the CS study used as reference in the main analysis, and the HCC study. For simplicity, we denote by $M_{study 1, study 2}$ the matching matrix obtained when aligning study 1 and study 2.

Changes in matching results when reference dataset is used

Concerning question (i), we compare two matchings: HCC to CS to PC (the matrix product $M_{CS, HCC}^{T} M_{CS, PC}$), which we will refer to as the reference matching, and the direct matching of PC to HCC ($M_{HCC, PC}$). Note that these matchings are not fully comparable, as the former considers only features found in CS, potentially missing unique HCC and PC matches. We can however compare the two matchings on the subset of 706 features common to all three studies, as determined by the reference matching. We find that the direct matching supports 683 of them, indicating that the matching via a reference still yields good results compared to the direct matching (see Appendix 5—figure 1).
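
The reference matching is simply the product of two binary matching matrices. The sketch below illustrates this on a toy example in plain Python; the function name `compose_via_reference` and the small matrices are hypothetical and only meant to show how an HCC feature and a PC feature become linked through a shared CS feature.

```python
def compose_via_reference(m_ref_a, m_ref_b):
    """Return M_{ref,A}^T M_{ref,B} for binary matching matrices given as lists of rows."""
    n_ref, n_a, n_b = len(m_ref_a), len(m_ref_a[0]), len(m_ref_b[0])
    return [
        [sum(m_ref_a[r][i] * m_ref_b[r][j] for r in range(n_ref)) for j in range(n_b)]
        for i in range(n_a)
    ]


# Toy matchings (hypothetical): 3 CS features, 2 HCC features, 2 PC features.
m_cs_hcc = [[1, 0], [0, 1], [0, 0]]  # CS 0 <-> HCC 0, CS 1 <-> HCC 1
m_cs_pc = [[0, 1], [0, 0], [1, 0]]   # CS 0 <-> PC 1,  CS 2 <-> PC 0
m_hcc_pc = compose_via_reference(m_cs_hcc, m_cs_pc)
```

In this example only HCC feature 0 and PC feature 1 are linked, because they are the only pair matched to the same CS feature; HCC feature 1 and PC feature 0 are each matched to a CS feature that has no counterpart in the other study, which is exactly the kind of match the reference approach can miss.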

Appendix 5—figure 1

Overlap between the 706 features common to the HCC and PC studies found via reference matching, and the 938 features common to HCC and PC found by direct matching.

Effect of reference dataset choice on matching results

Concerning question (ii), we compare the features identified as common to the three studies using two different studies as references: the CS study used in the paper, and the HCC study. We find that they identify 706 and 708 common features respectively, with an overlap of 640 features (see Appendix 5—figure 2). This highlights that the choice of reference does matter to some extent. In the paper, the choice of CS as a reference was informed by its sample size and study population.

Appendix 5—figure 2

Overlap between the features identified as common to the three EPIC studies using either the CS study or the HCC study as a reference.

Data availability

The LC-MS data used to generate our simulated validation experiments can be downloaded from https://www.ebi.ac.uk/metabolights/MTBLS1684/files, at the bottom of the "Files" section, under the filename 'FILES/metabolomics_normalized_data.xlsx'. The EPIC data are considered sensitive and are therefore not publicly available. They are centralised at IARC and can be analysed through the IARC Scientific IT platform after a Data Use Agreement has been signed. Access requests should be submitted to the IARC Steering Committee (https://epic.iarc.fr/access/index.php). All code for the data preprocessing and figure generation, as well as the GromovMatcher algorithm and its comparison to other methods, is available at https://github.com/sgstepaniants/GromovMatcher (copy archived at Breeur and Stepaniants, 2024). Instructions and examples for running the GromovMatcher method are provided in the GitHub repository. The metabCombiner implementation written by the original authors was taken from their GitHub codebase: https://github.com/hhabra/metabCombiner (Habra, 2024). The M2S implementation written by the original authors was taken from their GitHub codebase: https://github.com/rjdossan/M2S (Rjdossan, 2024).

The following previously published data sets were used

(2020) MetaboLights
ID MTBLS1684. A multi-omic analysis of birthweight in newborn cord blood reveals new underlying mechanisms related to cholesterol metabolism.

https://www.ebi.ac.uk/metabolights/editor/MTBLS1684/descriptors

References

Agresti A, Coull BA (1998) Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician 52:119–126. https://doi.org/10.1080/00031305.1998.10480550
Alfano R, Chadeau-Hyam M, Ghantous A, Keski-Rahkonen P, Chatzi L, Perez AE, Herceg Z, Kogevinas M, de Kok TM, Nawrot TS, Novoloaca A, Patel CJ, Pizzi C, Robinot N, Rusconi F, Scalbert A, Sunyer J, Vermeulen R, Vrijheid M, Vineis P, Robinson O, Plusquin M (2020) A multi-omic analysis of birthweight in newborn cord blood reveals new underlying mechanisms related to cholesterol metabolism. Metabolism 110:154292. https://doi.org/10.1016/j.metabol.2020.154292
Alvarez-Melis D, Jaakkola T (2018) Gromov-Wasserstein Alignment of Word Embedding Spaces. EMNLP Brussels, Belgium: Association for Computational Linguistics. pp. 1881–1889. https://doi.org/10.18653/v1/D18-1214
(2019) Towards optimal transport with global Invariances. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics. pp. 1870–1879.
Bedia C (2022) Metabolomics in environmental toxicology: Applications and challenges. Trends in Environmental Analytical Chemistry 34:e00161. https://doi.org/10.1016/j.teac.2022.e00161
(2022) Multi-Marginal Gromov-Wasserstein Transport and Barycenters. arXiv. https://arxiv.org/abs/2205.06725
Breeur M, Stepaniants G (2024) Gromovmatcher, version swh:1:rev:c38a56b55e8746d874f94e371c6cdd1074b28b74. Software Heritage. https://archive.softwareheritage.org/swh:1:dir:50b50a1a6db39925adf98e2590b931405370ad0f;origin=https://github.com/sgstepaniants/GromovMatcher;visit=swh:1:snp:0aaffd41891c81ac2f957cc0ea084767876eb756;anchor=swh:1:rev:c38a56b55e8746d874f94e371c6cdd1074b28b74
(2001) Interval estimation for a binomial proportion. Statistical Science 16:133. https://doi.org/10.1214/ss/1009213286
(2016) Large-scale untargeted LC-MS metabolomics data correction using between-batch feature alignment and cluster-based within-batch signal intensity drift correction. Metabolomics 12:173. https://doi.org/10.1007/s11306-016-1124-4
Chen L, Lu W, Wang L, Xing X, Chen Z, Teng X, Zeng X, Muscarella AD, Shen Y, Cowan A, McReynolds MR, Kennedy BJ, Lato AM, Campagna SR, Singh M, Rabinowitz JD (2021) Metabolite discovery through global annotation of untargeted metabolomics data. Nature Methods 18:1377–1385. https://doi.org/10.1038/s41592-021-01303-3
(2018) Unbalanced optimal transport: Dynamic and Kantorovich formulations. Journal of Functional Analysis 274:3090–3123. https://doi.org/10.1016/j.jfa.2018.03.008
Climaco Pinto R, Karaman I, Lewis MR, Hällqvist J, Kaluarachchi M, Graça G, Chekmeneva E, Durainayagam B, Ghanbari M, Ikram MA, Zetterberg H, Griffin J, Elliott P, Tzoulaki I, Dehghan A, Herrington D, Ebbels T (2022) Finding correspondence between metabolomic features in untargeted liquid chromatography-mass spectrometry metabolomics datasets. Analytical Chemistry 94:5493–5503. https://doi.org/10.1021/acs.analchem.1c03592
(2017) Joint distribution optimal transportation for domain adaptation. NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 3733–3742.
(2022) SCOT: Single-Cell Multi-Omics Alignment with Optimal Transport. Journal of Computational Biology 29:3–18. https://doi.org/10.1089/cmb.2021.0446
Franzosa EA, Sirota-Madi A, Avila-Pacheco J, Fornelos N, Haiser HJ, Reinker S, Vatanen T, Hall AB, Mallick H, McIver LJ, Sauk JS, Wilson RG, Stevens BW, Scott JM, Pierce K, Deik AA, Bullock K, Imhann F, Porter JA, Zhernakova A, Fu J, Weersma RK, Wijmenga C, Clish CB, Vlamakis H, Huttenhower C, Xavier RJ (2019) Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nature Microbiology 4:293–305. https://doi.org/10.1038/s41564-018-0306-4
Gasull M, Pumarega J, Kiviranta H, Rantakokko P, Raaschou-Nielsen O, Bergdahl IA, Sandanger TM, Goñi F, Cirera L, Donat-Vargas C, Alguacil J, Iglesias M, Tjønneland A, Overvad K, Mancini FR, Boutron-Ruault M-C, Severi G, Johnson T, Kühn T, Trichopoulou A, Karakatsani A, Peppa E, Palli D, Pala V, Tumino R, Naccarati A, Panico S, Verschuren M, Vermeulen R, Rylander C, Nøst TH, Rodríguez-Barranco M, Molinuevo A, Chirlaque M-D, Ardanaz E, Sund M, Key T, Ye W, Jenab M, Michaud D, Matullo G, Canzian F, Kaaks R, Nieters A, Nöthlings U, Jeurnink S, Chajes V, Matejcic M, Gunter M, Aune D, Riboli E, Agudo A, Gonzalez CA, Weiderpass E, Bueno-de-Mesquita B, Duell EJ, Vineis P, Porta M (2019) Methodological issues in a prospective study on plasma concentrations of persistent organic pollutants and pancreatic cancer risk within the EPIC cohort. Environmental Research 169:417–433. https://doi.org/10.1016/j.envres.2018.11.027
(2022) Variational autoencoders learn transferrable representations of metabolomics data. Communications Biology 5:645. https://doi.org/10.1038/s42003-022-03579-3
Gromov M (2001) Metric Structures for Riemannian and Non-Riemannian Spaces. Birkhäuser Boston, Inc. https://doi.org/10.1007/978-0-8176-4583-0
Habra H, Kachman M, Bullock K, Clish C, Evans CR, Karnovsky A (2021) metabCombiner: Paired Untargeted LC-HRMS Metabolomics Feature Matching and Concatenation of Disparately Acquired Data Sets. Analytical Chemistry 93:5028–5036. https://doi.org/10.1021/acs.analchem.0c03693
Habra H (2024) metabCombiner, version d248824. GitHub. https://github.com/hhabra/metabCombiner
Hsu YHH, Churchhouse C, Pers TH, Mercader JM, Metspalu A, Fischer K, Fortney K, Morgen EK, Gonzalez C, Gonzalez ME, Esko T, Hirschhorn JN (2019) PAIRUP-MS: Pathway analysis and imputation to relate unknowns in profiles from mass spectrometry-based metabolite data. PLOS Computational Biology 15:e1006734. https://doi.org/10.1371/journal.pcbi.1006734
Ivanisevic J, Want EJ (2019) From Samples to Insights into Metabolism: Uncovering Biologically Relevant Information in LC-HRMS Metabolomics Data. Metabolites 9:308. https://doi.org/10.3390/metabo9120308
Kantorovich LV (2006) On the translocation of masses. Journal of Mathematical Sciences 133:1381–1382. https://doi.org/10.1007/s10958-006-0049-2
Li L, Zheng X, Zhou Q, Villanueva N, Nian W, Liu X, Huan T (2020) Metabolomics-based discovery of molecular signatures for triple negative breast cancer in asian female population. Scientific Reports 10:370. https://doi.org/10.1038/s41598-019-57068-5
Liu Q, Walker D, Uppal K, Liu Z, Ma C, Tran V, Li S, Jones DP, Yu T (2020) Addressing the batch effect issue for LC/MS metabolomics data in data preprocessing. Scientific Reports 10:13856. https://doi.org/10.1038/s41598-020-70850-0
Loftfield E, Stepien M, Viallon V, Trijsburg L, Rothwell JA, Robinot N, Biessy C, Bergdahl IA, Bodén S, Schulze MB, Bergman M, Weiderpass E, Schmidt JA, Zamora-Ros R, Nøst TH, Sandanger TM, Sonestedt E, Ohlsson B, Katzke V, Kaaks R, Ricceri F, Tjønneland A, Dahm CC, Sánchez M-J, Trichopoulou A, Tumino R, Chirlaque M-D, Masala G, Ardanaz E, Vermeulen R, Brennan P, Albanes D, Weinstein SJ, Scalbert A, Freedman ND, Gunter MJ, Jenab M, Sinha R, Keski-Rahkonen P, Ferrari P (2021) Novel biomarkers of habitual alcohol intake and associations with risk of pancreatic and liver cancers and liver disease mortality. Journal of the National Cancer Institute 113:1542–1550. https://doi.org/10.1093/jnci/djab078
Mémoli F (2011) Gromov–wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics 11:417–487. https://doi.org/10.1007/s10208-011-9093-5
Monge G (1781) Mémoire sur la théorie des déblais et des remblais. Mem. Math. Phys. Acad. Royale Sci pp. 666–704.
(2019) Gene expression cartography. Nature 576:132–137. https://doi.org/10.1038/s41586-019-1773-3
Patti GJ (2011) Separation strategies for untargeted metabolomics. Journal of Separation Science 34:3460–3469. https://doi.org/10.1002/jssc.201100532
(2016) Gromov-wasserstein averaging of kernel and distance matrices. ICML. pp. 2664–2672. https://doi.org/10.5555/3045390.3045671
Peyré G, Cuturi M (2019) Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning 11:355–607. https://doi.org/10.1561/2200000073
Pirhaji L, Milani P, Leidl M, Curran T, Avila-Pacheco J, Clish CB, White FM, Saghatelian A, Fraenkel E (2016) Revealing disease-associated pathways by network integration of untargeted metabolomics. Nature Methods 13:770–776. https://doi.org/10.1038/nmeth.3940
(2014) The blood exposome and its role in discovering causes of disease. Environmental Health Perspectives 122:769–774. https://doi.org/10.1289/ehp.1308015
Reuther A, Kepner J, Byun C, Samsi S, Arcand W, Bestor D, Bergeron B, Gadepally V, Houle M, Hubbell M, Jones M, Klein A, Milechin L, Mullen J, Prout A, Rosa A, Yee C, Michaleas P (2018) Interactive Supercomputing on 40,000 Cores for Machine Learning and Data Analysis. 2018 IEEE High Performance Extreme Computing Conference. pp. 1–6. https://doi.org/10.1109/HPEC.2018.8547629
Riboli E, Hunt KJ, Slimani N, Ferrari P, Norat T, Fahey M, Charrondière UR, Hémon B, Casagrande C, Vignat J, Overvad K, Tjønneland A, Clavel-Chapelon F, Thiébaut A, Wahrendorf J, Boeing H, Trichopoulos D, Trichopoulou A, Vineis P, Palli D, Bueno-de-Mesquita HB, Peeters PHM, Lund E, Engeset D, González CA, Barricarte A, Berglund G, Hallmans G, Day NE, Key TJ, Kaaks R, Saracci R (2002) European Prospective Investigation into Cancer and Nutrition (EPIC): study populations and data collection. Public Health Nutrition 5:1113–1124. https://doi.org/10.1079/PHN2002394
Rjdossan (2024) M2S, version aaedc0a. GitHub. https://github.com/rjdossan/M2S
Schiebinger G, Shu J, Tabaka M, Cleary B, Subramanian V, Solomon A, Gould J, Liu S, Lin S, Berube P, Lee L, Chen J, Brumbaugh J, Rigollet P, Hochedlinger K, Jaenisch R, Regev A, Lander ES (2019) Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 176:928–943. https://doi.org/10.1016/j.cell.2019.01.006
(2019) Sinkhorn Divergences for Unbalanced Optimal Transport. arXiv. https://arxiv.org/abs/1910.12958
(2021) The unbalanced gromov wasserstein distance: Conic formulation and relaxation. Advances in Neural Information Processing Systems 34. pp. 8766–8779.
(2022) Alignstein: Optimal transport for improved LC-MS retention time alignment. GigaScience 11:giac101. https://doi.org/10.1093/gigascience/giac101
Slimani N, Bingham S, Runswick S, Ferrari P, Day NE, Welch AA, Key TJ, Miller AB, Boeing H, Sieri S, Veglia F, Palli D, Panico S, Tumino R, Bueno-De-Mesquita B, Ocké MC, Clavel-Chapelon F, Trichopoulou A, Van Staveren WA, Riboli E (2003) Group level validation of protein intakes estimated by 24-hour diet recall and dietary questionnaires against 24-hour urinary nitrogen in the European Prospective Investigation into Cancer and Nutrition (EPIC) calibration study. Cancer Epidemiology, Biomarkers & Prevention 12:784–795.
Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G (2006) XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical Chemistry 78:779–787. https://doi.org/10.1021/ac051437y
Solomon J, Peyré G, Kim VG, Sra S (2016) Entropic metric alignment for correspondence problems. ACM Transactions on Graphics 35:1–13. https://doi.org/10.1145/2897824.2925903
Stepien M, Duarte-Salles T, Fedirko V, Floegel A, Barupal DK, Rinaldi S, Achaintre D, Assi N, Tjønneland A, Overvad K, Bastide N, Boutron-Ruault M-C, Severi G, Kühn T, Kaaks R, Aleksandrova K, Boeing H, Trichopoulou A, Bamia C, Lagiou P, Saieva C, Agnoli C, Panico S, Tumino R, Naccarati A, Bueno-de-Mesquita HBA, Peeters PH, Weiderpass E, Quirós JR, Agudo A, Sánchez M-J, Dorronsoro M, Gavrila D, Barricarte A, Ohlsson B, Sjöberg K, Werner M, Sund M, Wareham N, Khaw K-T, Travis RC, Schmidt JA, Gunter M, Cross A, Vineis P, Romieu I, Scalbert A, Jenab M (2016) Alteration of amino acid and biogenic amine metabolism in hepatobiliary cancers: Findings from a prospective cohort study. International Journal of Cancer 138:348–360. https://doi.org/10.1002/ijc.29718
Stepien M, Keski-Rahkonen P, Kiss A, Robinot N, Duarte-Salles T, Murphy N, Perlemuter G, Viallon V, Tjønneland A, Rostgaard-Hansen AL, Dahm CC, Overvad K, Boutron-Ruault M-C, Mancini FR, Mahamat-Saleh Y, Aleksandrova K, Kaaks R, Kühn T, Trichopoulou A, Karakatsani A, Panico S, Tumino R, Palli D, Tagliabue G, Naccarati A, Vermeulen RCH, Bueno-de-Mesquita HB, Weiderpass E, Skeie G, Ramón Quirós J, Ardanaz E, Mokoroa O, Sala N, Sánchez M-J, Huerta JM, Winkvist A, Harlid S, Ohlsson B, Sjöberg K, Schmidt JA, Wareham N, Khaw K-T, Ferrari P, Rothwell JA, Gunter M, Riboli E, Scalbert A, Jenab M (2021) Metabolic perturbations prior to hepatocellular carcinoma diagnosis: Findings from a prospective observational cohort study. International Journal of Cancer 148:609–625. https://doi.org/10.1002/ijc.33236
Tautenhahn R, Patti GJ, Kalisiak E, Miyamoto T, Schmidt M, Lo FY, McBee J, Baliga NS, Siuzdak G (2011) metaXCMS: second-order analysis of untargeted metabolomics data. Analytical Chemistry 83:696–700. https://doi.org/10.1021/ac102980g
Vaughan AA, Dunn WB, Allwood JW, Wedge DC, Blackhall FH, Whetton AD, Dive C, Goodacre R (2012) Liquid chromatography-mass spectrometry calibration transfer and metabolomics data fusion. Analytical Chemistry 84:9848–9857. https://doi.org/10.1021/ac302227c
Villani C (2021) Topics in Optimal Transportation. American Mathematical Soc.
Wang TJ, Larson MG, Vasan RS, Cheng S, Rhee EP, McCabe E, Lewis GD, Fox CS, Jacques PF, Fernandez C, O’Donnell CJ, Carr SA, Mootha VK, Florez JC, Souza A, Melander O, Clish CB, Gerszten RE (2011) Metabolite profiles and the risk of developing diabetes. Nature Medicine 17:448–453. https://doi.org/10.1038/nm.2307
Wishart DS (2019) Metabolomics for investigating physiological and pathophysiological processes. Physiological Reviews 99:1819–1875. https://doi.org/10.1152/physrev.00035.2018
(2020) Predicting cell lineages using autoencoders and optimal transport. PLOS Computational Biology 16:e1007828. https://doi.org/10.1371/journal.pcbi.1007828
Zhou B, Xiao JF, Tuli L, Ressom HW (2012) LC-MS-based metabolomics. Molecular bioSystems 8:470–481. https://doi.org/10.1039/c1mb05350g

Article and author information

Author details

Marie Breeur

Nutrition and Metabolism Branch, International Agency for Research on Cancer, Lyon, France

Contribution
Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing

Contributed equally with
George Stepaniants

Competing interests
No competing interests declared

ORCID iD: 0000-0003-1251-8360

George Stepaniants

Massachusetts Institute of Technology, Department of Mathematics, Boston, United States

Contribution
Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing

Contributed equally with
Marie Breeur

Competing interests
No competing interests declared

ORCID iD: 0000-0002-7834-7536

Pekka Keski-Rahkonen

Nutrition and Metabolism Branch, International Agency for Research on Cancer, Lyon, France

Contribution
Data curation, Writing - original draft, Writing - review and editing

Competing interests
No competing interests declared

ORCID iD: 0000-0001-9437-3040

Philippe Rigollet

Massachusetts Institute of Technology, Department of Mathematics, Boston, United States

Contribution
Conceptualization, Resources, Supervision, Funding acquisition, Investigation, Methodology, Writing - original draft, Writing - review and editing

Competing interests
No competing interests declared

Vivian Viallon

Nutrition and Metabolism Branch, International Agency for Research on Cancer, Lyon, France

Contribution
Conceptualization, Resources, Supervision, Funding acquisition, Investigation, Methodology, Writing - original draft, Writing - review and editing

For correspondence
viallonv@iarc.who.int

Competing interests
No competing interests declared

ORCID iD: 0000-0002-9799-4421

Funding

National Science Foundation (Graduate Research Fellowship Program 1745302)

George Stepaniants

National Science Foundation (IIS-1838071)

Philippe Rigollet

National Science Foundation (DMS-2022448)

Philippe Rigollet

National Science Foundation (CCF-2106377)

Philippe Rigollet

World Cancer Research Fund International (IIG_FULL_2022_013)

Marie Breeur
Vivian Viallon

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Jörn Dunkel for helpful advice on our manuscript. We acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center (Reuther et al., 2018) for providing HPC resources that have contributed to the research results reported within this paper. GS acknowledges support through a National Science Foundation Graduate Research Fellowship under Grant No. 1745302. PR is supported by NSF grants IIS-1838071, DMS-2022448, and CCF-2106377. MB and VV acknowledge support from World Cancer Research Fund (UK) through the World Cancer Research Fund International grant program (grant number: IIG_FULL_2022_013). We are grateful to the Principal Investigators of each of the EPIC centers for sharing the data for our experimental application.

Ethics

The EPIC study, and in particular the three studies nested within EPIC, were conducted according to the Declaration of Helsinki and approved by the ethics committee at the International Agency for Research on Cancer (IARC) (IEC 10-16 for the HCC and pancreatic cancer studies, IEC 12-29 for the cross-sectional study). Written informed consent was obtained from all subjects involved in the study.

Version history

Sent for peer review: September 11, 2023
Preprint posted: September 13, 2023
Reviewed Preprint version 1: December 11, 2023
Reviewed Preprint version 2: April 9, 2024
Version of Record published: June 18, 2024

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.91597. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


Cite this article

Marie Breeur, George Stepaniants, Pekka Keski-Rahkonen, Philippe Rigollet, Vivian Viallon (2024) Optimal transport for automatic alignment of untargeted metabolomic data. eLife 12:RP91597. https://doi.org/10.7554/eLife.91597.3

Categories and tags

Research organism

Human
