Optimal transport for automatic alignment of untargeted metabolomic data
Abstract
Untargeted metabolomic profiling through liquid chromatographymass spectrometry (LCMS) measures a vast array of metabolites within biospecimens, advancing drug development, disease diagnosis, and risk prediction. However, the low throughput of LCMS poses a major challenge for biomarker discovery, annotation, and experimental comparison, necessitating the merging of multiple datasets. Current data pooling methods encounter practical limitations due to their vulnerability to data variations and hyperparameter dependence. Here, we introduce GromovMatcher, a flexible and userfriendly algorithm that automatically combines LCMS datasets using optimal transport. By capitalizing on feature intensity correlation structures, GromovMatcher delivers superior alignment accuracy and robustness compared to existing approaches. This algorithm scales to thousands of features requiring minimal hyperparameter tuning. Manually curated datasets for validating alignment algorithms are limited in the field of untargeted metabolomics, and hence we develop a dataset split procedure to generate pairs of validation datasets to test the alignments produced by GromovMatcher and other methods. Applying our method to experimental patient studies of liver and pancreatic cancer, we discover shared metabolic features related to patient alcohol intake, demonstrating how GromovMatcher facilitates the search for biomarkers associated with lifestyle risk factors linked to several cancer types.
eLife assessment
The authors describe an important tool, GromovMatcher, that can be used to compare proteomic data from various experimental approaches. The underlying method is innovative, the algorithm is clearly described, and the validation that is presented is convincing.
https://doi.org/10.7554/eLife.91597.3.sa0Important: Findings that have theoretical or practical implications beyond a single subfield
 Landmark
 Fundamental
 Important
 Valuable
 Useful
Convincing: Appropriate and validated methodology in line with current stateoftheart
 Exceptional
 Compelling
 Convincing
 Solid
 Incomplete
 Inadequate
During the peerreview process the editor and reviewers write an eLife Assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate). Learn more about eLife Assessments
Introduction
Untargeted metabolomics is a powerful analytical technique used to identify and measure a large number of metabolites in a biological sample without preselecting targets (Patti, 2011). This approach allows for a comprehensive overview of an individual’s metabolic profile, provides insights into the biochemical processes involved in cellular and organismal physiology (Wishart, 2019; Pirhaji et al., 2016), and allows for the exploration of how environmental factors impact metabolism (Rappaport et al., 2014; Bedia, 2022). It creates new opportunities to investigate healthrelated conditions, including diabetes (Wang et al., 2011), inflammatory bowel diseases Franzosa et al., 2019, and various cancer types (Loftfield et al., 2021; Li et al., 2020). However, a major challenge in biomarker discovery, metabolic signature identification and other untargeted metabolomic analyses lies in the low throughput of experimental data, necessitating the development of efficient pooling algorithms capable of merging datasets from multiple sources (Loftfield et al., 2021).
A common experimental technique in untargeted metabolomics is liquid chromatographymass spectrometry (LCMS) which assembles a list of thousands of unlabeled metabolic features characterized by their masstocharge ratio ($m/z$), retention time (RT; Zhou et al., 2012), and intensity across all biological samples. Combining LCMS datasets from multiple experimental studies remains challenging due to variation in the $m/z$ and RT of a feature from one study to another (Zhou et al., 2012; Ivanisevic and Want, 2019). This problem is further compounded by differing instruments and analytical protocols across laboratories, resulting in seemingly incompatible metabolomic datasets.
Manual matching of metabolic features can be a laborious and errorprone task (Loftfield et al., 2021). To address this challenge, several automated methods have been developed for metabolic feature alignment. One such method is MetaXCMS, which matches LCMS features based on userdefined $m/z$ and RT thresholds (Tautenhahn et al., 2011). More advanced tools use information on feature intensities measured in samples. For instance, PAIRUPMS uses known shared metabolic features to impute the intensities of all features from one dataset to another Hsu et al., 2019. MetabCombiner (Habra et al., 2021) and M2S (Climaco Pinto et al., 2022) compare average feature intensities, along with their $m/z$ and RT values, to align datasets without requiring extensive knowledge of shared features. These automated alignment methods have accelerated our ability to pool and annotate datasets as well as extract biologically meaningful biomarkers. However, they demand substantial finetuning of userdefined parameters and ignore correlations among metabolic features which provide a wealth of additional information on shared features.
Here, we introduce GromovMatcher, a userfriendly flexible algorithm which automates the matching of metabolic features across experiments. The main technical innovation of GromovMatcher lies in its ability to incorporate the correlation information between metabolic feature intensities, building upon the powerful mathematical framework of computational optimal transport (OT; Peyré and Cuturi, 2019; Villani, 2021). OT has proven effective in solving various matching problems and has found applications in multiomics analysis (Demetci et al., 2022), cell development (Schiebinger et al., 2019; Yang et al., 2020), and chromatogram alignment (Skoraczyński et al., 2022). Here, we leverage the GromovWasserstein (GW) method (Mémoli, 2011; Solomon et al., 2016), which matches datasets based on their distance structure and has been seminally applied to spatial reconstruction problems in genomics Nitzan et al., 2019. GromovMatcher builds upon the GW algorithm to automatically uncover the shared correlation structure among metabolic feature intensities while also incorporating $m/z$ and RT information in the final matching process.
To assess the performance of GromovMatcher, we systematically benchmark it on synthetic data with varying levels of noise, feature overlap, and data normalizations, outperforming prior stateoftheart methods of metabCombiner (Habra et al., 2021) and M2S (Climaco Pinto et al., 2022). Next, we apply GromovMatcher to align experimental patient studies of liver and pancreatic cancer to a reference dataset and associate the shared metabolic features to each patient’s alcohol intake. Through these efforts, we demonstrate how GromovMatcher data pooling improves our ability to discover biomarkers of lifestyle risk factors associated with several types of cancer.
Results
GromovMatcher algorithm
GromovMatcher uses the mathematical framework of OT to find all matching metabolic features between two untargeted metabolomic datasets (Figure 1). It accepts two LCMS datasets with possibly different numbers of metabolic features and samples. Each feature, $\text{fx}}_{i$ in Dataset 1 and $\text{fy}}_{j$ in Dataset 2, is identified by its $m/z$, RT, and vector of feature intensities across samples (Figure 1a). The primary tenet of GromovMatcher is that shared metabolic features have similar correlation patterns in both datasets and can be matched based on the distance/correlations between their feature intensity vectors. Specifically, GromovMatcher computes the pairwise distances between the feature intensity vectors of each metabolic feature in a dataset and saves them into a distance matrix, one per dataset (Figure 1b). In practice, we use either the Euclidean distance or the cosine distance (negative of correlation) to perform this step (Materials and methods). The resulting distance matrices contain information about the feature intensity similarity within each study. Using optimal transport, we can deduce shared subsets of metabolic features in both datasets which have corresponding feature intensity distance structures.
OT was originally developed to optimize the transportation of soil for the construction of forts (Monge, 1781) and was later generalized through the language of probability theory and linear programming (Kantorovich, 2006), leading to efficient numerical algorithms and direct applications to planning problems in economics. The ability of OT to efficiently match source to target locations found applications in data science for the alignment of distributions (Courty et al., 2017; AlvarezMelis et al., 2019) and was generalized by the GromovWasserstein (GW) method (Peyré et al., 2016; AlvarezMelis and Jaakkola, 2018) to align datasets with features of differing dimensions.
In practice, a sizeable fraction of the metabolic features measured in one study may not be present in the other. Hence, in most cases only a subset of features in both datasets can be matched. Recent GW formulations for unbalanced matching problems (Sejourne et al., 2021) allow for matching only subsets of metabolic features with similar intensity structures (Figure 1c). To incorporate additional feature information, we modify the optimization objective of unbalanced GW to penalize feature matches whose $m/z$ differences exceed a fixed threshold (Materials and methods, Appendix 1). The optimization of this objective computes a coupling matrix$\stackrel{~}{\mathrm{\Pi}}$ where each entry ${\stackrel{~}{\mathrm{\Pi}}}_{ij}\ge 0$ indicates the level of confidence in matching metabolic feature $\text{fx}}_{i$ in Dataset 1 to $\text{fy}}_{j$ in Dataset 2.
Differences in experimental conditions can induce variations in RT between datasets that can be nonlinear and large in magnitude (Zhou et al., 2012; Climaco Pinto et al., 2022; Habra et al., 2021). In the spirit of previous methods for LCMS batch or dataset alignment (Smith et al., 2006; Brunius et al., 2016; Liu et al., 2020; Vaughan et al., 2012; Habra et al., 2021; Climaco Pinto et al., 2022; Skoraczyński et al., 2022), the learned coupling $\stackrel{~}{\mathrm{\Pi}}$ is used to estimate a nonlinear map (drift function) between RTs of both datasets by weighted spline regression, which allows us to filter unlikely matches from the coupling matrix to obtain a refined coupling matrix $\hat{\mathrm{\Pi}}$ (Figure 1d, Materials and methods). An optional thresholding step removes matches with small weights from the coupling matrix. The final output of GromovMatcher is a binary matching matrix $M$ where $M}_{ij$ is equal to 1 if features $\text{fx}}_{i$ and $\text{fy}}_{j$ are matched and 0 otherwise. Throughout the paper, we refer to the two variants of GromovMatcher, with and without the optional thresholding step as GMT and GM respectively.
Validation on groundtruth data
We first evaluate the performance of GromovMatcher using a realworld untargeted metabolomics study of cord blood across 499 newborns containing 4712 metabolic features characterized by their $m/z$, RT, and feature intensities (Alfano et al., 2020). To generate groundtruth data, we randomly divide the initial dataset into two smaller datasets sharing a subset of features (Figure 2). We simulate diverse acquisition conditions by adding noise to the $m/z$ and RT of dataset 2, and to the feature intensities in both datasets. Moreover, we introduce an RT drift in dataset 2 to replicate the retention time variations observed in real LCMS experiments (Materials and methods). For comparison, we also test M2S (Climaco Pinto et al., 2022) and metabCombiner (Habra et al., 2021), both of which use $m/z$, RT, and median or mean feature intensities to match features (Figure 3). MetabCombiner is supplied with 100 known shared metabolic features to automatically set its hyperparameters, while M2S parameters are manually finetuned to optimize the F1score in each scenario (Appendix 2). We assess the performance of GM, GMT, metabCombiner, and M2S across 20 randomly generated dataset pairs in terms of their precision (fraction of true matches among the detected matches) and recall/sensitivity (fraction of true matches detected) averaged across 20 dataset pairs.
To investigate how the number of shared features affects dataset alignment, we generate pairs of LCMS datasets with low, medium, and high feature overlap (25%, 50%, and 75%), while maintaining a medium noise level (Materials and methods). Here, we find that GM and GMT generally outperform existing alignment methods, with a recall above 0.95 while metabCombiner and M2S tend to be less sensitive (Figure 3b). All methods drop in precision as the feature overlap is decreased, with GM and GMT still maintaining an average precision above 0.8.
Next we evaluate all four methods at low, moderate, and high noise levels for pairs of datasets with 50% overlap in their features (Materials and methods). Our results show that GMT, GM, and M2S maintain an average recall above 0.89, while metabCombiner’s recall drops below 0.6 for high noise. At large noise levels, RT drift estimation becomes more challenging, leading to a higher rate of false matches between metabolites (lower precision) for all four methods (Figure 3—figure supplement 1). Nevertheless, GMT obtains a high average precision and recall of 0.86 and 0.92, respectively.
A notable difference between GM, metabCombiner, and M2S lies in their use of feature intensities. MetabCombiner expects that the mean feature intensity rankings are identical across studies, while M2S assumes that shared features have similar median intensities. In contrast, GM uses both the mean feature intensities and their variances and covariances. In practice, differences in experimental assays or study populations can lead to greater variation in feature intensities, making matchings based on these statistics less reliable. Centering and scaling the feature intensities to unit variance avoids potential biases arising from inconsistent feature intensity magnitudes, but preserves correlations that GM leverages.
Exploring this further, we test how sensitive all four methods are to centering and scaling of feature intensities. MetabCombiner and M2S are tuned using the same methodology as for noncentered and nonscaled data. For M2S, we match features solely based on their $m/z$ and RT. In this experiment (Figure 3—figure supplement 2), the absence of intensity magnitude information significantly affects metabCombiner’s performance and, to a lesser extent, M2S. GM and GMT still obtain accurate matchings, due to their use of correlation structures which are preserved under centering and scaling.
Application to EPIC data
Next, we apply GM, metabCombiner and M2S to align datasets from the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort, a prospective study conducted across 23 European centers. EPIC comprises more than 500,000 participants who provided blood samples at recruitment (Riboli et al., 2002). Untargeted metabolomics data were successively acquired in several studies nested within the full cohort.
In the present work, we use LCMS data from the EPIC crosssectional (CS) study (Slimani et al., 2003) and two matched casecontrol studies nested within EPIC, on hepatocellular carcinoma (HCC; Stepien et al., 2016; Stepien et al., 2021) and pancreatic cancer (PC; Gasull et al., 2019). LCMS untargeted metabolomic data were acquired at the International Agency for Research on Cancer, making use of the same platform and methodology (Materials and methods). The number of samples and features in each study is displayed in Figure 4a.
Loftfield et al., 2021 previously matched features from the CS, HCC, and PC studies in EPIC for alcohol biomarker discovery. The authors first identified 205 features (163 in positive and 42 in negative mode) associated with alcohol intake in the CS study. These features were then manually matched by an expert to features in both the HCC and PC studies (Materials and methods, Table 1). In our analysis, we use these features as a validation set and compare each method’s matchings to the expert manual matchings on this subset. Due to the imbalance between the number of positive and negative mode features in the validation subset, our main analysis focuses on the alignment results of CS with HCC and CS with PC in positive mode (Table 2). We delegate the matching results between the negative mode studies (Table 3) to Appendix 4.
In this section, we use the same settings for GM as in our simulation study, and do not apply an additional thresholding step. The parameters of metabCombiner and M2S are calibrated using the validation subset as prior knowledge (Appendix 2).
Preliminary analysis of the validation subset reveals inconsistencies in the mean feature intensities (Figure 4—figure supplement 1), but Figure 4b shows that on centered and scaled data, the 90 expert matched features shared between the CS and HCC studies have similar correlation structures. Hence, to avoid potential errors we center and scale the feature intensities which improves the performance of all three methods tested below (Appendix 4, Appendix 4—table 1).
Hepatocellular carcinoma
Here, we analyze the quality of the matchings obtained by GM, M2S, and metabCombiner between the CS and HCC datasets in positive mode. Both GM and M2S identify approximately 1000 shared features while metabCombiner finds a smaller number of about 700 shared features. We refer the reader to Figure 4—figure supplement 2a for the precise matched feature sizes and details on the agreement between the feature matchings of all three methods.
We evaluate the performance of metabCombiner, M2S, and GM on the validation subset in positive mode (Figure 4c, Table 2), which consist of 90 features from the CS study manually matched to features from the HCC study and 73 features specific to the CS study. MetabCombiner demonstrates precise matching but lacks sensitivity. M2S’s precision and recall are comparable with GM, in contrast to its performance on simulated data. This can be attributed to the RT drift shape between the CS and HCC studies (Appendix 2), which is estimated to be close to linear (Figure 4—figure supplement 3). Because the parameters of M2S are finetuned in the validation subset, it is able to learn this linear drift and apply tight RT thresholds to achieve accurate matchings. In contrast to metabCombiner and M2S, the GM algorithm is not given any prior knowledge of the validation subset, and nevertheless demonstrates the highest precision and recall rates of the three methods (Figure 4c). Figure 4b shows how GM recovers the majority of the expert matched pairs by leveraging the shared correlations.
Pancreatic cancer
Matching features between the CS and PC studies in positive mode, GM and M2S identify approximately 1000 common features, while metabCombiner detects approximately 600 matches (Figure 4—figure supplement 2b). We examine the performance of all three methods on the validation subset consisting of 66 manually matched features between CS and PC along with 97 features specific to the CS study. As before, GM and M2S have high recall while the recall of metabCombiner is less than 0.5.
A decrease in precision is observed for both GM and M2S compared to the previous CSHCC matchings. We therefore manually inspect the false positive matches; the set of CS features matched by the method to the PC study but explicitly examined and left unmatched in the expert manual matching. Assessing the GM results, we identify seven false positive feature matches. Upon secondary inspection, three pairs are revealed as correct matches that were not initially identified in the expert matching. M2S finds 11 false positive matches which include the 7 false positives recovered by GM. Manual examination of the four remaining pairs reveals two clear mismatches. These results highlight the advantage of using automated methods for data alignment, as both GM and M2S detect correct matches that were not identified by experts, with GM being more precise than M2S.
Illustration for alcohol biomarker discovery
Loftfield et al., 2021 identified biomarkers of habitual alcohol intake by first performing a discovery step, where they examined the relationship between alcohol intake and metabolic features in the CS study. They then manually matched the significant features in CS to features from the HCC and PC studies, and repeated the analysis with samples from the HCC and PC studies to determine whether the association with alcohol intake persisted. This led to the identification of 10 features possibly associated with alcohol intake (Figure 5a).
To extend this analysis and illustrate the benefit of GM automatic matching for biomarker discovery, we use GM to pool features from the CS, HCC, and PC studies, and examine the relationship between metabolic features and alcohol intake in the pooled study (Materials and methods and Figure 5b).
Applying an FDR correction on the pooled study, we identify 243 features associated with alcohol intake, including 185 features consistent with the discovery step of Loftfield et al., 2021, and 55 newly discovered features (Figure 5c). Using the more stringent Bonferroni correction on the pooled data, we identify 36 features shared by all three studies that are significantly associated with alcohol intake. These features include all 10 features identified in Loftfield et al. (Figure 5c). These findings highlight the potential benefits of using GM automatic matching for biomarker discovery in untargeted metabolomics data. Additional information regarding the methodology and findings of our GM and Loftfield et al. analyses can be found in Materials and methods and Appendix 4.
Discussion
LCMS metabolomics has emerged as an increasingly powerful tool for biological and biomedical research, offering promising opportunities for epidemiological and clinical investigations. However, integrating data from different sources remains challenging. To address this issue, we introduce GromovMatcher, a method based on optimal transport that automatically aligns LCMS data from pairs of studies. Our method exhibits superior performance on both simulated and real data when compared to existing approaches. Additionally, it presents a userfriendly interface with few hyperparameters.
While GromovMatcher is robust to noise and variations in data, it may face limitations when aligning LCMS studies from populations with different characteristics, where the correlation structures between features may be inconsistent across studies. In this case, the base assumption of GromovMatcher can be relaxed by focusing on subsamples with similar characteristics, as exemplified in a recent study (Gomari et al., 2022).
A current limitation is that GromovMatcher does not account for more than two datasets simultaneously, although this can be overcome by aligning multiple studies to a chosen reference dataset, as demonstrated in our biomarker experiments. The extension of GromovWasserstein to multiple distributions (Beier et al., 2022) is another promising approach for generalizing GromovMatcher to multiple dataset alignment. Further improvements can be made by incorporating existing knowledge about the studies being matched, such as known shared features, samples in common, or MS/MS data.
The results obtained from GromovMatcher are highly promising, opening the door for various analyses of metabolomic datasets acquired in different experimental laboratories. Here, we demonstrated the potential of GromovMatcher in expediting the combination and metaanalysis of data for biomarker and metabolic signature discovery. The matchings learned by GromovMatcher also allow for comparison between experimental protocols by assessing the drift in $m/z$, RT, and feature intensities across studies. Finally, interinstitutional annotation efforts can directly benefit from incorporating this method to transfer annotations between aligned datasets. Bridging the gap between otherwise incompatible LCMS data, GromovMatcher enables seamless comparison of untargeted metabolomics experiments.
Materials and methods
GromovMatcher method overview
GromovMatcher accepts as input two feature tables from separate LCMS untargeted metabolomics studies. Each feature table for dataset 1 and dataset 2 consists of $n}_{1},{n}_{2$ biospecimen samples respectively and $p}_{1},{p}_{2$ metabolic features respectively detected in the study. Features in dataset 1 are given the label $\text{fx}}_{i$ for $i=1,\dots ,{p}_{1}$. Every feature is characterized by a masstocharge ratio ($m/z$) denoted by $m}_{i}^{x$, a retention time (RT) denoted by $R{T}_{i}^{x}$, and a vector of intensities across all samples written as ${X}_{i}\in {\mathbb{R}}^{{n}_{1}}$. Similarly, features in dataset 2 are labeled as $\text{fy}}_{j$ for $j=1,\dots ,{p}_{2}$ and are characterized by their $m/z$, retention time $R{T}_{j}^{y}$, and a vector of intensities across all samples $Y}_{i}\in {\mathbb{R}}^{{n}_{2}$.
Our goal is to identify pairs of indexes $(i,j)$ with $i\in \{1,\dots ,{p}_{1}\}$ and $j\in \{1,\mathrm{\dots},{p}_{2}\}$, such that $\text{fx}}_{i$ and $\text{fy}}_{j$ correspond to the same metabolic feature. More formally, we aim to identify a matching matrix $M\in \{0,1{\}}^{{p}_{1}\times {p}_{2}}$ such that ${M}_{ij}=1$ if $\text{fx}}_{i$ and $\text{fy}}_{j$ correspond to the same feature, hereafter referred to as matched features. Otherwise, we set ${M}_{ij}=0$.
Because the $m/z$ and RT values of metabolomic features are often noisy and subject to experimental bias, our matching algorithm leverages metabolite feature intensities $X}_{i},{Y}_{j$ to produce accurate dataset alignments. The GromovMatcher method is based on the idea that signal intensities of the same metabolites measured in two different studies should exhibit similar correlation structures, in addition to having compatible $m/z$ and RT values. Here, we define the Pearson correlation for vectors $u,v\in {\mathbb{R}}^{n}$ as
where we define
as the mean value, Euclidean norm and inner product respectively. If measurements $X}_{i},{Y}_{j$ correspond to the same underlying feature, and similarly, measurements $X}_{k},{Y}_{l$ share the same an underlying feature, we expect that
This idea that the feature intensities of shared metabolites have the same correlation structure in both datasets also holds more generally for distances, under a suitable choice of distance. For example, the correlation coefficient $\text{corr}(u,v)$ can be turned into a dissimilarity metric by defining
commonly referred to as the cosine distance. Preservation of feature intensity correlations then trivially amounts to the preservation of cosine distances.
Another classical notion of distance between vectors $u,v\in {\mathbb{R}}^{n}$ is the normalized Euclidean distance
which is equal to the cosine distance (up to constants) when the vectors $u,v$ are centered and scaled to have zero mean and a standard deviation of one. The Euclidean distance depends on the magnitude or mean intensity of metabolic features, and hence is a useful metric for matching metabolites as long as these mean feature intensities are reliably collected.
To summarize, the main tenant of GromovMatcher is that if measurements $X}_{i},{Y}_{j$ correspond to the same feature and $X}_{k},{Y}_{l$ correspond to the same feature, then for suitably chosen distances $d}_{x}:{\mathbb{R}}^{{n}_{1}}\times {\mathbb{R}}^{{n}_{1}}\to \mathbb{R$ and ${d}_{y}:{\mathbb{R}}^{{n}_{2}}\times {\mathbb{R}}^{{n}_{2}}\to \mathbb{R}$, these distances are preserved
across both datasets. In this paper, the distances $d}_{x},{d}_{y$ are taken to be the normalized Euclidean distances in Equation 5. We take care to specify those experiments where the metabolic features $X$ and $Y$ are centered and scaled. In these cases, implicitly the Euclidean distance between normalized feature vectors becomes the cosine distance Equation 4 between the original (unnormalized) feature vectors.
Unbalanced Gromov–Wasserstein
Request a detailed protocolThe goal of GromovMatcher is to learn a matching matrix $M\in {\{0,1\}}^{{p}_{1}\times {p}_{2}}$ that gives an alignment between a subset of metabolites in both datasets. However, searching over the combinatorially large set of binary matrices would be an inefficient approach for dataset alignment. The mathematical framework of optimal transport Peyré and Cuturi, 2019 instead enlarges this space of binary matrices to the set of coupling matrices with real nonnegative entries $\mathrm{\Pi}\in {\mathbb{R}}_{+}^{{p}_{1}\times {p}_{2}}$. The entries ${\mathrm{\Pi}}_{ij}$ with large weights indicate that feature $\text{fx}}_{i$ in dataset 1 and feature ${\text{fy}}_{j}$ in dataset 2 are a likely match. Taking inspiration from Equation 6, we minimize the following objective function
to estimate the coupling matrix $\mathrm{\Pi}$.
A standard approach is to optimize this objective over all coupling matrices $\mathrm{\Pi}$ under exact marginal constraints $\mathrm{\Pi}{\mathrm{\U0001d7cf}}_{{p}_{2}}=\frac{1}{{p}_{1}}{\mathrm{\U0001d7cf}}_{{p}_{1}},{\mathrm{\Pi}}^{T}{\mathrm{\U0001d7cf}}_{{p}_{1}}=\frac{1}{{p}_{2}}{\mathrm{\U0001d7cf}}_{{p}_{2}}$. Here, we define $\mathbf{1}}_{n$ is the ones vector of length $n$, and $\mathrm{\Pi}}_{1}=\mathrm{\Pi}{\mathbf{1}}_{{p}_{2}},{\mathrm{\Pi}}_{2}={\mathrm{\Pi}}^{T}{\mathbf{1}}_{{p}_{1}$ denote the column and row sums of the coupling matrix. Objective Equation 7 under these exact marginal constraints defines a distance between the two sets of metabolic feature vectors ${\{{X}_{i}\}}_{i=1}^{{p}_{1}},{\{{Y}_{i}\}}_{i=1}^{{p}_{2}}$ known as the Gromov–Wasserstein distance Mémoli, 2011, a generalization of optimal transport to metric spaces. Note that for pairs $X}_{i},{Y}_{j$ and ${X}_{k},{Y}_{l}$ for which ${d}_{x}({X}_{i},{X}_{k})\approx {d}_{y}({Y}_{j},{Y}_{l})$, the entries ${\mathrm{\Pi}}_{ij},{\mathrm{\Pi}}_{kl}$ are penalized less and hence matches between features $\text{fx}}_{i},{\text{fy}}_{j$ and features ${\text{fx}}_{k},{\text{fy}}_{l}$ are more favored. In our optimization, we avoid enforcing exact marginal constraints on the marginal distributions $\mathrm{\Pi}{\mathbf{1}}_{{p}_{2}}$ and ${\mathrm{\Pi}}^{T}{\mathrm{\U0001d7cf}}_{{p}_{1}}$ of our coupling matrix as this would enforce that all metabolites in both datasets are matched (Appendix 1). However, without any marginal constraints on the coupling $\mathrm{\Pi}$, the objective function Equation 7 is trivially minimized by $\mathrm{\Pi}=0$, leaving all metabolites in both datasets unmatched.
To account for this, we follow the ideas of unbalanced Gromov–Wasserstein (UGW) (Sejourne et al., 2021) and add three regularization terms to our objective
where $\rho ,\epsilon >0$ and we define $a={\mathbf{1}}_{{p}_{1}},b={\mathbf{1}}_{{p}_{2}}$. Here ⊗ denotes the Kronecker product. We define $D}_{\text{KL}$ as the Kullback–Leibler (KL) divergence between two discrete distributions $\mu ,\nu \in {\mathbb{R}}_{+}^{p}$ by
which measures the closeness of probability distributions.
The first two regularization terms in Equation 8 enforce that the row sums and column sums of the coupling matrix $\mathrm{\Pi}$ do not deviate too much from a uniform distribution, leading our optimization to match as many metabolic features as possible. The magnitude of the regularizer $\rho$ roughly enforces the fraction of metabolites in both datasets that are matched where large $\rho $ implies most metabolites are matched across datasets. The final regularization term $\epsilon$ in Equation 8 controls the smoothness (entropy) of the coupling matrix $\mathrm{\Pi}$ where larger values of $\epsilon$ encourage $\mathrm{\Pi}$ to put uniform weights on many of its entries, leading to less precision in the metabolite matches. However, increasing $\epsilon$ also leads to better numerical stability and a significant speedup of the alternating minimization algorithm used to optimize the objective function (Appendix 1). In our implementation, we set $\rho $ and $\u03f5$ to the lowest possible values under which our optimization converges, with $\rho =0.05$ and $\u03f5=0.005$.
Our full optimization problem can now be written as
The UGW objective function is optimized through alternating minimization based on the code of Sejourne et al., 2021 using the unbalanced Sinkhorn algorithm Séjourné et al., 2019 from optimal transport (Appendix 1).
Constraint on $m/z$ ratios
Request a detailed protocolMatched metabolic features must have compatible $m/z$ so we enforce that ${\mathrm{\Pi}}_{ij}=0$ when ${m}_{i}^{x}{m}_{j}^{y}>{m}_{\text{gap}}$ where $m}_{\text{gap}$ is a userspecified threshold. Based on prior literature (Loftfield et al., 2021; Hsu et al., 2019; Climaco Pinto et al., 2022; Habra et al., 2021; Chen et al., 2021), we set ${m}_{\text{gap}}$ = 0.01 ppm. Note that $m}_{\text{gap}$ is not explicitly used in Equation 10 but is rather enforced in each iteration of our alternating minimization algorithm for the UGW objective (Appendix 1).
Unlike the $m/z$ ratios discussed above, RTs often exhibit a nonlinear deviation (drift) between studies so we cannot enforce compatibility of RTs directly in our optimization. Instead, in the following step of our pipeline we ensure matched metabolite pairs have compatible RTs by estimating the drift function and subsequently using it to filter out metabolite matches whose RT values are inconsistent with the estimated drift.
Estimation of the RT drift and filtering
Request a detailed protocolEstimating the drift between RTs of two studies is a crucial step in assessing the validity of metabolite matches and discarding those pairs which are incompatible with the estimated drift.
Let $\stackrel{~}{\mathrm{\Pi}}\in {\mathbb{R}}_{+}^{{p}_{1}\times {p}_{2}}$ be the minimizer of Equation 10 obtained after optimization. We seek to estimate the RT drift function $f:{\mathbb{R}}_{+}\to {\mathbb{R}}_{+}$ which relates the retention times of matched features between the two studies. Namely, if feature $\text{fx}}_{i$ and feature ${\text{fy}}_{j}$ correspond to the same metabolic feature, then we must have that $R{T}_{j}^{y}\approx f(R{T}_{i}^{x})$.
We propose to learn the drift $f$ through the weighted spline regression
where $\mathcal{B}}_{n,k$ is the set of $n$order Bsplines with $k$ knots. All pairs $(R{T}_{i}^{x},R{T}_{j}^{y})$ in objective Equation 11 are weighted by the coefficients of $\stackrel{~}{\mathrm{\Pi}}$ so that larger weights are given to pairs identified with high confidence in the first step of our procedure. The order of the Bsplines was set to $n=3$ by default, while the number of knots $k$ was selected by 10fold crossvalidation.
Pairs identified as incompatible with the estimated RT drift are then discarded from the coupling matrix. To do this, we first take the estimated RT drift $\widehat{f}$, and the set of pairs $\mathcal{S}=\{i,j:{\stackrel{~}{\mathrm{\Pi}}}_{i,j}\ne 0\}$ recovered in $\stackrel{~}{\mathrm{\Pi}}$. We then define the residual associated with $(i,j)\in \mathcal{S}$ as
The 95% prediction interval and the median absolute deviation (MAD) of these residuals are given by
where $\mathcal{S}$ is the size of $\mathcal{S}$ and the functions std and median denote the standard deviation and median respectively. Similar to the approach in Climaco Pinto et al., 2022, we create a new filtered coupling matrix $\widehat{\mathrm{\Pi}}\in {\mathbb{R}}_{+}^{{p}_{1}\times {p}_{1}}$ given by
where ${r}_{\text{thresh}}$ is a given filtering threshold. Following Habra et al., 2021, the estimation and outlier detection step can be repeated for multiple iterations, to remove pairs that deviate significantly from the estimated drift and improve the robustness of the drift estimation. In our main algorithm, we use two preliminary iterations where estimate the RT drift and discard outliers outside of the 95% prediction interval by setting ${r}_{\text{thresh}}=\text{PI}$. We the reestimate the drift and perform a final filtering step with the more stringent MAD by setting ${r}_{\text{thresh}}=2\times \text{MAD}$.
At this stage, it is possible for $\widehat{\mathrm{\Pi}}$ to still contain coefficients of very small magnitude. As an optional postprocessing step, we discard these coefficients by setting all entries smaller than $\tau \text{max}(\widehat{\mathrm{\Pi}})$ to zero, for some userdefined $\tau \in [0,1]$. Lastly, a feature from either study could have multiple possible matches, since $\widehat{\mathrm{\Pi}}$ can have more than one nonzero coefficient per row or column. Although reporting multiple matches can be helpful in an exploratory context, for the sake of simplicity in our analysis, the final output of GromovMatcher returns a onetoone matching, as we only keep those metabolite pairs $(i,j)$ where the entry ${\widehat{\mathrm{\Pi}}}_{ij}$ is largest in its corresponding row and column. All nonzero entries of $\widehat{\mathrm{\Pi}}$ which do not satisfy this criterion are set to zero. Finally, we convert $\widehat{\mathrm{\Pi}}$ into a binary matching matrix $M\in {\{0,1\}}^{{p}_{1}\times {p}_{2}}$ with ones in place of its nonzero entries and this final output is returned to the user.
As a naming convention, we use the abbreviation GM for our GromovMatcher method, and use the abbreviation GMT when running GromovMatcher with the optional $\tau $thresholding step with $\tau =0.3$.
Metrics for dataset alignment
Request a detailed protocolEvery alignment method studied in this paper returns a binary partial matching matrix $M\in {\{0,1\}}^{{p}_{1}\times {p}_{1}}$ which has at most one nonzero entry in each row and column. Specifically, ${M}_{ij}=1$ if metabolic features $i$ and $j$ in both datasets correspond to each other and ${M}_{ij}=0$ otherwise. In our simulated experiments, we compare the partial matching $M$ to a known groundtruth partial matching matrix ${M}^{*}\in {\{0,1\}}^{{p}_{1}\times {p}_{2}}$.
To do this, we first compute the number of true positives, false positives, true negatives, and false negatives as
where 1 denotes the indicator function. Then we use these values to compute the precision and recall as
Precision measures the fraction of correctly found matches out of all discovered metabolite matches, while recall, also know as sensitivity, measures the fraction of correctly matched pairs out of all truly matched pairs. These two statistics can be summarized into one metric called the F1score by taking their harmonic mean
These three metrics, precision, recall, and the F1score, are used throughout the paper to assess the performance of dataset alignment methods, both on simulated data where the groundtruth matching is known, and on the validation subset in EPIC, using results from the manual examination as the groundtruth benchmark.
Validation on simulated data
To assess the performance of GromovMatcher and compare it to existing dataset alignment methods, we simulate realistic pairs of untargeted metabolomics feature with known groundtruth matchings. This allows us to analyze the dependence of alignment methods on the number of shared metabolites, dataset noise level, and feature intensity centering and scaling.
Dataset generation
Request a detailed protocolOur pairs of synthetic feature tables are generated from one real untargeted metabolomics study of 500 newborns within the EXPOsOMICS project, which uses reversed phase liquid chromatographyquadrupole timeofflight mass spectrometry (UHPLCQTOFMS) system in positive ion mode Alfano et al., 2020. The original dataset is first preprocessed following the procedure detailed in Alfano et al., 2020, resulting in p=4712 features measured in $n=499$ samples available for subsequent analysis. Features and samples from the original study are then divided into two feature tables of respective size $({n}_{1},{p}_{1})$ and $({n}_{2},{p}_{2})$, with ${n}_{1}+{n}_{2}=n$ and ${p}_{1},{p}_{2}\le p$. In order to do this, ${n}_{1}=\lfloor n/2\rfloor $ randomly chosen samples from the original study are placed into dataset 1 and the remaining ${n}_{2}=\lceil n/2\rceil $ samples from the original study are placed into dataset 2. Here, $\lfloor \cdot \rfloor $ and $\lceil \cdot \rceil $ denote integer floor and ceiling functions. The features of the original study are randomly assigned to dataset 1, dataset 2, or both, allowing the resulting studies to have both common and studyspecific features (Figure 2). Specifically, for a fixed overlap parameter $\lambda \in [0,1]$, we assign a random subset of $\approx \lambda p$ features into both dataset 1 and dataset 2 while the remaining $\approx (1\lambda p)$ features are divided equally between the two studies such that ${p}_{1}={p}_{2}$. We choose $\lambda \in \{0.25,0.5,0.75\}$ corresponding to low, medium and high overlap. For more detailed information on how the dataset split is performed and for additional validation experiments with unbalanced dataset splits (e.g. ${n}_{1}\ne {n}_{2},{p}_{1}\ne {p}_{2}$) we refer the reader to Appendix 3.
After generating a pair of studies, random noise is added to the $m/z$, RT and intensity levels of features in dataset 2 to mimic variations in data acquisition across two different experiments. The noise added to each $m/z$ value in study 2 is sampled from a uniform distribution on the interval $[{\sigma}_{\text{M}},{\sigma}_{\text{M}}]$ with ${\sigma}_{\text{M}}=0.01$ (Climaco Pinto et al., 2022). The RTs of dataset 2 are first deviated by the function $f(x)=1.1x+1.3\mathrm{sin}(1.2\sqrt{x})$, corresponding to a systematic interdataset drift (Habra et al., 2021; Climaco Pinto et al., 2022; Brunius et al., 2016). A uniformly distributed noise on the interval $[{\sigma}_{\text{RT}},{\sigma}_{\text{RT}}]$ is added to the deviated RTs of dataset 2, with ${\sigma}_{\text{RT}}\in \{0.2,0.5,1\}$ (in minutes) corresponding to low, moderate and high variations (Climaco Pinto et al., 2022; Habra et al., 2021; Vaughan et al., 2012). Finally, we add a Gaussian noise $\mathcal{N}(0,{\sigma}_{\text{FI}}^{2})$ to the feature intensities of both studies where ${\sigma}_{FI}$ is the scalar variance of the noise. This noise perturbs the correlation matrices of dataset 1 and dataset 2, making matching based on feature intensity correlations more challenging. We vary ${\sigma}_{FI}$ over the set of values {0.1, 0.5, 1}.
Given this data generation process, we test the performance of the four alignment methods (M2S, metabCombiner, GM, and GMT) under the parameter settings described below.
Dependence on overlap
Request a detailed protocolWe first assess how the performance of the four methods is affected by the number of metabolic features shared in both datasets. For each value of $\lambda =0.25,0.5,0.75$ (low, medium, and high overlap), we randomly generate 20 pairs of datasets with noise on the $m/z$, RT and feature intensities set to ${\sigma}_{\text{M}}=0.01,{\sigma}_{\text{RT}}=0.5,{\sigma}_{\text{FI}}=0.5$. The precision and recall of each method at low, medium, and high overlap is recorded for each of the repetitions.
Noise robustness
Request a detailed protocolNext, we test the robustness to noise of each method by fixing the metabolite overlap fraction at $\lambda =0.5$ and generating 20 random pairs of datasets at low (${\sigma}_{\text{RT}}=0.2,{\sigma}_{\text{FI}}=0.1$), medium (${\sigma}_{\text{RT}}=0.5,{\sigma}_{\text{FI}}=0.5$), and high (${\sigma}_{\text{RT}}=1,{\sigma}_{\text{FI}}=1$) noise levels. Similarly, the precision and recall of each method is saved for each noise level across the 20 repetitions.
Feature intensity centering and scaling
Request a detailed protocolIn order to test how all four methods are affected when the mean feature intensities and variance are not comparable across studies, we assess their performance when the feature intensities in both studies are mean centered and standardized to have unit standard deviation across all samples. We again generate 20 random pairs of datasets with medium overlap and medium noise, normalize the feature intensities in each pair of datasets, and compute the precision and recall of each method across the 20 repetitions.
EPIC data
We also evaluate our method on data collected within the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort, an ongoing multicentric prospective study with over 500,000 participants recruited between 1992 and 2000 from 23 centers in 10 European countries, and who provided blood samples at the inclusion in the study (Riboli et al., 2002). In EPIC, untargeted metabolomics data were successively acquired in several studies nested within the full cohort.
In the present work, we use untargeted metabolomics data acquired in three studies nested in EPIC, namely the EPIC crosssectional (CS) study (Slimani et al., 2003) and two matched casecontrol studies nested within EPIC, on hepatocellular carcinoma (HCC; Stepien et al., 2016; Stepien et al., 2021) and pancreatic cancer (PC; Gasull et al., 2019), respectively. All data were acquired at the International Agency for Research on Cancer, making use of the same plateform and methodology: UHPLCQTOFMS (1290 Binary Liquid chromatography system, 6550 quadrupole timeofflight mass spectrometer, Agilent Technologies, Santa Clara, CA) using reversed phase chromatography and electrospray ionization in both positive and negative ionization mode.
In a previous analysis aiming at identifying biomarkers of habitual alcohol intake in EPIC, the 205 features associated with alcohol intake in the CS study were manually matched to features in both the HCC and PC studies Loftfield et al., 2021. The results from this manual matching are presented in Table 1. This matching process was based on the proximity of $m/z$ and RT, using a matching tolerance of ± 15 ppm and ± 0.2 min, and on the comparison of the chromatograms of features in a quality control samples from both studies.
Preprocessing
Request a detailed protocolIn the HCC and PC studies, samples corresponding to participants selected as cases in either study (i.e. participants selected in the study because of a diagnosis of incident HCC or PC) are excluded. Indeed, the metabolic profiles of participants selected as controls are expected to be more comparable across studies than those of cases, especially if certain features are associated with the risk of HCC or PC. Apart from this additional exclusion criterion, the untargeted metabolomics data of each study is preprocessed following the steps described in Loftfield et al., 2021, to eliminate unreliable features and samples, impute missing values and minimize technical variations in the feature intensity levels.
Alcohol biomarker discovery
Request a detailed protocolLoftfield et al., 2021 used the untargeted metabolomics data of the CS, HCC and PC studies in their alcohol biomarker discovery study in EPIC, without being able to automatically match their common features and pool the three datasets. Instead, the authors first implemented a discovery step, examining the relationship between alcohol intake and metabolic features measured in the CS study and accounting for multiple testing using a false discovery rate (FDR) correction. This led to the identification of 205 features significantly associated with alcohol intake in the CS study. In order to gauge the robustness of these associations, the authors of Loftfield et al., 2021 then implemented a validation step using data from two independent test sets. The first test set was composed of data from the EPIC HCC and PC studies, while the second was derived from the Finnish AlphaTocopherol, BetaCarotene Cancer Prevention (ATBC) study. The 205 features identified in the discovery step were manually investigated for matches in the EPIC test set, and 67 features were effectively matched to features in the HCC or PC study, or both. The authors then evaluated the association between alcohol intake and those 67 features, applying a more conservative Bonferroni correction to determine whether the association with alcohol intake persisted. This step led to the identification of 10 features associated with alcohol intake (Extended Data Figure 5a). The second test set was then used to determine whether those 10 features were also significant in the ATBC population, which was indeed the case.
To conduct a more indepth investigation of the matchings produced by the GromovMatcher algorithm, we build upon the analysis previously conducted by Loftfield et al., 2021 by exploring potential alcohol biomarkers using a pooled dataset created from the CS, HCC, and PC studies. Our goal is to assess whether pooling the data leads to increased statistical power and allows for the detection of more features associated with alcohol intake. Namely, we generate the pooled dataset by aligning a chosen reference dataset (CS study) with the HCC and PC studies successively using the GM matchings computed in both positive and negative mode (Materials and methods and Extended Data Figure 5b). Features that are not detected in either the HCC or PC studies are designated as ‘missing’ in the final pooled dataset for samples belonging to the respective studies where the feature is not found.
To evaluate the potential relationship between alcohol consumption and pooled metabolic features, we use a methodology akin to that of Loftfield et al., 2021. The selfreported alcohol intake data is adjusted for various demographic and lifestyle factors (age, sex, country, bodymassindex, smoking status and intensity, coffee consumption, and study) via the residual method in linear regression models. Feature intensities are also adjusted for technical variables (plate number and position within the plate) via linear mixed effect models. The significance of the association is assessed using correlation coefficients computed from the residuals for both selfreported alcohol intake and feature intensities. pValues are corrected using either false discovery rate (FDR) or Bonferroni correction to account for multiple testing. Corrected pvalues less than 5% are considered significant.
Materials and correspondence
Request a detailed protocolAll correspondence and material requests should be addressed to V.V.
IARC disclaimer
Request a detailed protocolWhere authors are identified as personnel of the International Agency for Research on Cancer/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy, or views of the International Agency for Research on Cancer/World Health Organization.
Appendix 1
In this paper, we study how to match metabolic features across two datasets where Dataset 1 has p_{1} metabolic features measured across n_{1} patients and Dataset 2 has p_{2} metabolic features measured across n_{2} patients. Our goal is to identify pairs of indexes $(i,j)$ with $i\in \{1,\mathrm{\dots},{p}_{1}\}$ and $j\in \{1,\mathrm{\dots},{p}_{2}\}$, such that feature $i$ in Dataset 1 and feature $j$ in Dataset 2 correspond to the same metabolic feature. More formally, we aim to identify a matching matrix${M}^{*}\in {\{0,1\}}^{{p}_{1}\times {p}_{2}}$ such that ${M}_{i,j}^{*}=1$ if features $i$ in Dataset 1 and feature $j$ in Dataset 2 correspond to the same feature, hereafter referred to as matched features. Otherwise we set ${M}_{i,j}^{*}=0$ otherwise. We emphasize that a matching matrix ${M}^{*}$ can have at most one nonzero entry in each row and column.
Both of the datasets we aim to match are obtained from liquid chromatographymass spectrometry (LCMS) experiments. Hence, for Dataset 1 each metabolite $i\in [{p}_{1}]$ is labeled with a masstocharge ($m/z$) ratio ${m}_{i}^{1}$ as well as a retention time (RT) given by $R{T}_{i}^{1}$. Additionally, each metabolite has a vector of intensities across patients denoted by ${X}_{i}\in {\mathbb{R}}^{{n}_{1}}$. Similarly, each metabolite $j\in [{p}_{2}]$ in Dataset 2 is labeled by its $m/z$ ratio ${m}_{j}^{2}$, its retention time $R{T}_{j}^{2}$ and its vector of intensities across samples ${Y}_{j}\in {\mathbb{R}}^{{n}_{2}}$.
Correlations and distances between metabolomic features
Features cannot be aligned based on their $m/z$ and RT alone as they are often too inconsistent across studies. Our method is based on the idea that, in addition to their $m/z$ and RT being compatible, the signal intensities of metabolites measured in two different studies should exhibit similar correlation structures, or more generally exhibit similar distances between their intensity vectors. In other words, if feature intensity vectors ${X}_{i}\in {\mathbb{R}}^{{n}_{1}},{Y}_{j}\in {\mathbb{R}}^{{n}_{2}}$ correspond to the same underlying feature (${M}_{ij}^{*}=1$) and similarly if ${X}_{k}\in {\mathbb{R}}^{{n}_{1}},{Y}_{l}\in {\mathbb{R}}^{{n}_{2}}$ correspond to the same feature (${M}_{kl}^{*}=1$), then we expect that
Here we define $\text{corr}(u,v)$ to be the Pearson correlation coefficient between two feature intensity vectors $u,v\in {\mathbb{R}}^{n}$ by
where we define
as the mean value, Euclidean norm and inner product respectively. More generally, with d_{x} and d_{y} denoting two given distances on ${\mathbb{R}}^{{n}_{1}}$ and ${\mathbb{R}}^{{n}_{2}}$ respectively, we expect that
Throughout this paper, we use the normalized Euclidean distance defined for any $u,v\in {\mathbb{R}}^{n}$ as
where for d_{x} and d_{y} we take $n={n}_{1},{n}_{2}$ respectively. If the signal intensity vectors $u,v$ are mean centered and normalized by their standard deviation as
and likewise for $v$, then it follows that
where we denote ${d}^{\text{cos}}(u,v)=\sqrt{1\text{corr}(u,v)}$ as the cosine distance. For the purposes of this paper, we will always assume that d_{x} and d_{y} denote the normalized Euclidean distance from Equation 22. As shown above, this will be implicitly equal to the cosine distance from Equation 24 on centered and scaled data.
The goal of metabolomic feature matching is to learn the binary matching matrix ${M}^{*}$ that aligns the distances between pairs of features in the most consistent way possible as shown in Equation 21. To formalize this notion into a practical algorithm, we use the mathematical theory of optimal transport (Peyré and Cuturi, 2019) which we discuss next.
Optimal transport
Optimal transport (OT) applies in the setting when the points ${\{{X}_{i}\}}_{i=1}^{{p}_{1}}$ and ${\{{Y}_{j}\}}_{j=1}^{{p}_{2}}$ being matched live in the same dimensional space ${n}_{1}={n}_{2}=n$. It aims to find a matching between each point ${X}_{i}$ and its corresponding point ${Y}_{j}$ such that the sum of distances between matches is minimized. Matches between each pair of points can be stored in a matching matrix $M\in {\{0,1\}}^{{p}_{1}\times {p}_{2}}$ such that ${M}_{ij}=1$ if ${X}_{i}$ and ${Y}_{j}$ are matched, and ${M}_{ij}=0$ otherwise. Again we note that $M$ must have at most one nonzero entry in each row and column to be a valid matching matrix.
Instead of searching over this space of binary matching matrices, optimal transport places masses ${a}_{i}\ge 0$ at all points ${X}_{i}$ for $i=1,\mathrm{\dots},{p}_{1}$ and masses ${b}_{j}\ge 0$ at all points ${Y}_{j}$ for $j=1,\mathrm{\dots},{p}_{2}$ and optimizes over the space of probabilistic couplings $\mathrm{\Pi}\in {\mathbb{R}}_{+}^{{p}_{1}\times {p}_{2}}$ which move a ${\mathrm{\Pi}}_{ij}$ amount of mass from ${X}_{i}$ to ${Y}_{j}$. We assume here for simplicity that the sum of masses in both datasets are equal to one ${\sum}_{i=1}^{{p}_{1}}{a}_{i}={\sum}_{i=1}^{{p}_{2}}{b}_{i}=1$ and that the coupling $\mathrm{\Pi}$ transports all mass from $a$ into $b$. More formally, optimal transport optimizes over the constrained set of couplings
where ${\mathrm{\U0001d7cf}}_{p}$ denotes the all ones vector of length $p$. In practice, the points ${X}_{i}$ and ${Y}_{j}$ in each dataset are all treated the same and the masses placed on the data are chosen to be uniform $a=\frac{1}{{p}_{1}}{\mathrm{\U0001d7cf}}_{{p}_{1}}$ and $b=\frac{1}{{p}_{2}}{\mathrm{\U0001d7cf}}_{{p}_{2}}$.
The cost function which optimal transport minimizes is the sum of squared distances of its transported mass
where $d(u,v)={d}^{\text{euc}}(u,v)=\frac{1}{\sqrt{n}}\parallel uv\parallel $ is the Euclidean distance. The distance matrix $d({X}_{i},{Y}_{j})$ in the OT objective can be replaced more generally with a cost matrix $C\in {\mathbb{R}}^{{p}_{1}\times {p}_{2}}$ that is not necessarily a distance matrix. In this case the cost function becomes
When the transport cost ${C}_{ij}$ is a distance, the OT optimization defines a valid distance metric known as the optimal transport distance between discrete distributions ${\{({a}_{i},{X}_{i})\}}_{i=1}^{{p}_{1}}$ and ${\{({b}_{j},{Y}_{j})\}}_{j=1}^{{p}_{2}}$ in ${\mathbb{R}}^{n}$ given by
When $d(u,v)$ is Euclidean, this OT distance is also referred to as the ${L}^{1}$ optimal transport distance, the Wasserstein 1distance, or the Earth mover’s distance. As formulated, the computation of the optimal transport objective involves an optimization over coupling matrices $\mathrm{\Pi}$ which can be solved by linear programming (Peyré and Cuturi, 2019). The OT optimization problem becomes time consuming for problems with many points ${p}_{1},{p}_{2}\gg 1$. We show in the next section how augmenting this distance with a regularization term leads to a more efficient algorithm for learning the optimal coupling $\mathrm{\Pi}$.
Entropic regularization
Define the Kullback–Leibler (KL) divergence between two positive vectors $\mu ,\nu \in {\mathbb{R}}_{+}^{p}$ as
Given fixed marginals $a\in {\mathbb{R}}^{{p}_{1}}$ and $b\in {\mathbb{R}}^{{p}_{2}}$ from the previous section, we can define the entropy of a coupling matrix $\mathrm{\Pi}\in {\mathbb{R}}_{+}^{{p}_{1}\times {p}_{2}}$ with respect to these fixed marginals as
where ${(a\otimes b)}_{ij}={a}_{i}{b}_{j}$ denotes the outer product. This can be further simplified as
where we define $H(\mathrm{\Pi})$ by
In the second line of the derivation above, we used the fact that the entries of $a,b$, and $\mathrm{\Pi}$ summed to one, and in the third line we used the fact that the marginals $a$ and $b$ were uniform. Under these assumptions, we see that the KL divergence ${D}_{\text{KL}}(\mathrm{\Pi},a\otimes b)$ is independent of the values of the marginals $a,b$ and is equal to $H(\mathrm{\Pi})$ up to constants.
Although here the general definition of entropy through the KL divergence reduces to the simpler formula of $H(\mathrm{\Pi})$, in the following sections we will need to extend our analysis to cases when $a,b$, and $\mathrm{\Pi}$ have positive values that do not sum to one (i.e. not distributions). In this context, we will no longer have that ${D}_{\text{KL}}(\mathrm{\Pi},a\otimes b)=H(\mathrm{\Pi})+const$ but we will still be able to use ${D}_{\text{KL}}(\mathrm{\Pi},a\otimes b)$ as a general notion of entropy for $\mathrm{\Pi}$.
The entropy of a coupling ${D}_{\text{KL}}(\mathrm{\Pi},a\otimes b)$ is an important notion because it quantifies how uniform or smooth $\mathrm{\Pi}$ is with respect to the product distribution $a\otimes b$. In particular, if $a$ and $b$ are set to uniform distributions as commonly done in practice, then ${D}_{\text{KL}}(\mathrm{\Pi},a\otimes b)$ is small when $\mathrm{\Pi}$ has close to uniform entries and is large otherwise. This notion of smoothness allows us to use ${D}_{\text{KL}}(\mathrm{\Pi},a\otimes b)$ as a regularizer in our optimal transport distance as
where $\epsilon $ is a small regularization parameter. Note that here we have denoted the transport cost matrix by $C\in {\mathbb{R}}^{{p}_{1}\times {p}_{2}}$ which is not necessarily a distance matrix. The introduction of the regularizer $\epsilon {D}_{\text{KL}}(\mathrm{\Pi},a\otimes b)$ gives us an efficient iterative algorithm known as the Sinkhorn algorithm for optimizing $\mathrm{\Pi}$ which we describe in the following sections.
Unbalanced optimal transport
Before we introduce the Sinkhorn algorithm, we introduce a final modification to our optimal transport distance that allows us to learn couplings between distributions $a,b\in {\mathbb{R}}_{+}^{p}$ that do not preserve mass. In other words, the coupling $\mathrm{\Pi}$ is not required to perfectly satisfy the marginal constraints $\mathrm{\Pi}{\mathrm{\U0001d7cf}}_{{p}_{2}}=a$ and ${\mathrm{\Pi}}^{T}{\mathrm{\U0001d7cf}}_{{p}_{1}}=b$. In our metabolite matching problem, this is particularly useful as not all metabolites in one dataset necessarily appear in the other dataset and hence should be left unmatched. This modification of optimal transport, known as unbalanced optimal transport (UOT) Chizat et al., 2018, optimizes the following cost function
where we have added two KL terms with regularization parameter $\rho $ to enforce that the marginals of the coupling $\mathrm{\Pi}{\mathrm{\U0001d7cf}}_{{p}_{2}},{\mathrm{\Pi}}^{T}{\mathrm{\U0001d7cf}}_{{p}_{1}}$ are approximately close to the prescribed marginals $a,b$ respectively. We have also kept the smoothness/entropy regularizer $\epsilon {D}_{\text{KL}}(\mathrm{\Pi},a\otimes b)$ from the previous section.
Unbalanced Sinkhorn algorithm
Now we are ready to present the unbalanced Sinkhorn algorithm Peyré and Cuturi, 2019 for optimizing the unbalanced optimal transport cost defined above. First we rewrite our optimization as
The inner minimization can be solved exactly by introducing dual variables $f\in {\mathbb{R}}^{{p}_{1}},g\in {\mathbb{R}}^{{p}_{2}}$ and writing out the Lagrange dual problem
where we have removed the terms $\rho {D}_{\text{KL}}(u,a)$ and $\rho {D}_{\text{KL}}(v,b)$ since they do not depend on $\mathrm{\Pi}$. Taking the gradient in $\mathrm{\Pi}$ in the inner minimization and setting it to zero we get
which implies that
Now we can substitute this expression for $\mathrm{\Pi}$ back into our Lagrange dual problem. First we compute
which implies that
Hence, the outer maximization in our Lagrange dual problem for $f$ and $g$ can now be written as
where we have removed the last constant sum in ${a}_{i}{b}_{j}$. Finally we can rewrite our entire minimization from the start of this section as
By strong duality, we can interchange the minimum and maximum above to write
where we define the functions
In fact, we can solve the minimizations in ${U}^{*}$ and ${V}^{*}$ in closed form to get the minimizers ${u}^{*}=a\odot \mathrm{exp}(f/\rho )$ and ${v}^{*}=b\odot \mathrm{exp}(g/\rho )$ which we can substitute back in to get
Likewise we can see that
Thus, we can rewrite our full optimization as
where we have removed the terms independent of $f$ and $g$.
Note that now we can optimize the cost function above by performing an alternating minimization on the dual variables $f$ and $g$. Taking the gradient in $f$ and setting it to zero we see that
which implies that
Similarly, we can write out
We are now ready to write out the full unbalanced Sinkhorn algorithm which performs an alternating minimization on the dual potentials $f,g$ as outlined above. We remind the reader that the coupling matrix can be recovered from the dual potentials by the formula
The unbalanced Sinkhorn algorithm proceeds as follows.
Algorithm 1. UnbalancedSinkhorn 

input: Transport cost $C$, marginals $a,b$, marginal relaxation $\rho $, entropic regularization $\epsilon $ output: Return the coupling matrix $\mathrm{\Pi}$ Initialize $g=0$ while $\mathrm{(}f\mathrm{,}g\mathrm{)}$ has not converged do Set ${f}_{i}\leftarrow \frac{\epsilon \rho}{\epsilon +\rho}\mathrm{ln}(\sum _{j=1}^{{p}_{2}}{b}_{j}\mathrm{exp}(\frac{{g}_{j}{C}_{ij}}{\epsilon}))$ for $i\in [{p}_{1}]$ Set ${g}_{j}\leftarrow \frac{\epsilon \rho}{\epsilon +\rho}\mathrm{ln}(\sum _{i=1}^{{p}_{1}}{a}_{i}\mathrm{exp}(\frac{{f}_{i}{C}_{ij}}{\epsilon}))$ for $j\in [{p}_{2}]$ Return the coupling matrix ${\mathrm{\Pi}}_{ij}={a}_{i}{b}_{j}\mathrm{exp}\left(\frac{{f}_{i}+{g}_{j}{C}_{ij}}{\epsilon}\right)$ for $i\in [{p}_{1}]$ and $j\in [{p}_{2}]$ 
The final output of the Sinkhorn algorithm optimization is a realvalued coupling matrix $\mathrm{\Pi}\in {\mathbb{R}}_{+}^{{p}_{1}\times {p}_{2}}$. In some cases, it is desirable to transform the coupling matrix into a binaryvalued matching matrix $M\in {\{0,1\}}^{{p}_{1}\times {p}_{2}}$ with possibly an added restriction that there is at most one nonzero element in each row and column (to obtain a valid partial matching). This can be done by either thresholding the real matrix $\mathrm{\Pi}$ or by assigning all maximal entries in each row (or column) to one and setting the remaining entries to zero. For our metabolomics matching problem, we describe our procedure for transforming our realvalued coupling into a binary matching matrix in the section on the GromovMatcher algorithm below.
Gromov–Wasserstein
Now that we have introduced the general formulation of unbalanced optimal transport and its corresponding Sinkhorn algorithm, we can extend this formulation to matching problems between distributions of points that live in different dimensional spaces. In our metabolomics setting, we aim to match two datasets of p_{1} and p_{2} metabolic features respectively where each feature in a dataset is associated with a feature intensity vector ${\{{X}_{i}\}}_{i=1}^{{p}_{1}}\subset {\mathbb{R}}^{{n}_{1}}$ and ${\{{Y}_{j}\}}_{j=1}^{{p}_{2}}\subset {\mathbb{R}}^{{n}_{2}}$ respectively across samples. We assume that there exists a true matching matrix ${M}^{*}\in {\{0,1\}}^{{p}_{1}\times {p}_{2}}$ with at most one nonzero entry in each row and column such that two metabolites $(i,j)$ are matched if ${M}_{ij}^{*}=1$.
We make the further assumption that if feature vectors ${X}_{i},{Y}_{j}$ are matched and feature vectors ${X}_{k},{Y}_{l}$ are matched under ${M}^{*}$, then we expect that
where d_{x} is a distance metric on ${\mathbb{R}}^{{n}_{1}}$ and d_{y} is a distance metric n ${\mathbb{R}}^{{n}_{2}}$. In practice, we always choose these distance metrics to be the normalized Euclidean distance defined for any $u,v\in {\mathbb{R}}^{n}$ as
which is equal to the cosine distance ${d}^{\text{cos}}$ (i.e. one minus the correlation) for centered and scaled data. Given these two distance matrices ${D}^{x}={[{d}_{x}({X}_{i},{X}_{k})]}_{i,k=1}^{{p}_{1}}\in {\mathbb{R}}^{{p}_{1}\times {p}_{1}}$ and ${D}^{y}={[{d}_{y}({Y}_{j},{Y}_{l})]}_{j,l=1}^{{p}_{2}}\in {\mathbb{R}}^{{p}_{2}\times {p}_{2}}$ we would like to infer the true matching matrix ${M}^{*}$ by solving an optimization problem.
Consider the following objective function
where the matching matrices $M\in {\{0,1\}}^{{p}_{1}\times {p}_{2}}$ we optimize over are constrained to satisfy marginal constraints $\mathrm{\Pi}{\mathbf{1}}_{{p}_{2}}>0$ and ${\mathrm{\Pi}}^{T}{\mathbf{1}}_{{p}_{1}}>0$. These marginal constraints simply impose that there is at least one nonzero entry in each row and column (i.e. each metabolite in both datasets has at least one corresponding match). Searching for the $\mathrm{\Pi}$ minimizing ${\mathcal{E}}_{X,Y}\left(\mathrm{\Pi}\right)$ consists of putting the nonzero entries in $\mathrm{\Pi}$ such that the distance profiles of the matched features are similar, so that the minimizer of this criterion provides a good candidate estimate of ${\mathrm{\Pi}}^{*}$. This is closely related to the Gromov–Hausdorff distance Gromov, 2001, an extension of optimal transport to the case where the sets to be coupled do not lie in the same metric space.
In practice, it is often desirable to optimize over a different set of matrices in order to make the optimization problem more tractable. Here we take intuition from optimal transport, and search over the set of coupling matrices with marginal constraints
where as before, $a\in {\mathbb{R}}_{+}^{{p}_{1}}$ and $b\in {\mathbb{R}}_{+}^{{p}_{2}}$ are desired marginals which are typically set to be uniform distributions $a=\frac{1}{{p}_{1}}{\mathrm{\U0001d7cf}}_{{p}_{1}}$ and $b=\frac{1}{{p}_{2}}{\mathrm{\U0001d7cf}}_{{p}_{2}}$. These marginal vectors can be interpreted as distributions of masses a_{i} and b_{j} on the feature vectors ${X}_{i}$ and ${Y}_{j}$ respectively for $i\in [{p}_{1}],j\in [{p}_{2}]$.
Coupling matrices in $\mathbf{\mathbf{U}}(a,b)$ transport the distribution of masses $a$ in the first dataset to the distribution of masses $b$ in the second dataset. Now we can formulate the Gromov–Wasserstein (GW) distance, introduced by Mémoli, 2011, as
By optimizing this objective, each entry ${\mathrm{\Pi}}_{ij}$ now reflects the strength of the matched pair $({X}_{i},{Y}_{j})$. Optimizing $\text{GW}(a,b)$ then amounts to placing larger entries in $\mathrm{\Pi}$ whose paired features have similar distance profiles. Before we develop an algorithm to optimize this objective, we first modify it to allow for unbalanced matchings where marginal constraints are not enforced exactly (e.g. features in both datasets can remain unmatched).
Unbalanced Gromov–Wasserstein
In an untargeted context, all features measured in one study are not necessarily observed in another, either because these features are truly not shared or because of measurement error. However, the constraint $\mathrm{\Pi}\in \mathbf{\mathbf{U}}(a,b)$ in the original GW optimization criterion Equation 40 ensures that all the mass is transported from one set to another, resulting in all features being matched across studies. In order to discard studyspecific features during the GW computation, we use the unbalanced Gromov–Wasserstein (UGW) distance with an additional entropic regularization for computational purposes, described in Sejourne et al., 2021. The optimization problem therefore reads
with $\rho ,\epsilon >0$. Here ${D}_{\text{KL}}$ is the Kullback–Leibler divergence defined in the previous sections and we define the tensor product ${\left(P\otimes P\right)}_{i,j,k,l}={P}_{i,j}{P}_{k,l}$. Here we set the desired marginal constraints to $a=\frac{1}{{p}_{1}}{\mathrm{\U0001d7cf}}_{{p}_{1}}$ and $b=\frac{1}{{p}_{2}}{\mathrm{\U0001d7cf}}_{{p}_{2}}$ as before.
As in the case of unbalanced optimal transport (Chizat et al., 2018), the regularization $\rho $ times the Kullback–Leibler divergences allows for the relaxation of the marginal constraints $\mathrm{\Pi}{\mathrm{\U0001d7cf}}_{{p}_{2}}=a$ and ${\mathrm{\Pi}}^{T}{\mathrm{\U0001d7cf}}_{{p}_{1}}=b$. The value of $\rho >0$ controls the extent to which we allow for mass destruction. Smaller values of $\rho $ tend to lessen the constraint on the marginals of $\mathrm{\Pi}$, while balanced GW is recovered when $\rho \to +\mathrm{\infty}$. As proposed in the original paper (Sejourne et al., 2021), our UGW cost modifies the UOT formulation by using the quadratic KullbackLeibler divergence in $\mathrm{\Pi}{\mathrm{\U0001d7cf}}_{{p}_{2}}\otimes \mathrm{\Pi}{\mathrm{\U0001d7cf}}_{{p}_{2}}$ and ${\mathrm{\Pi}}^{T}{\mathrm{\U0001d7cf}}_{{p}_{1}}\otimes {\mathrm{\Pi}}^{T}{\mathrm{\U0001d7cf}}_{{p}_{1}}$ instead, hence preserving the quadratic form of the GW cost function $\mathcal{E}(\mathrm{\Pi})$.
The term $\u03f5{D}_{KL}(\mathrm{\Pi}\otimes \mathrm{\Pi},{\left(a\otimes b\right)}^{\otimes 2})$ serves as an entropic regularization, inspired again by optimal transport. Adding such a penalty is a standard way to compute an approximate solution to the optimal transport problem using the Sinkhorn algorithm as we shall show in the following section. Here again, we modify the entropic penalty in UGW to have a quadratic form in $\mathrm{\Pi}\otimes \mathrm{\Pi}$ to agree with the quadratic form of the GW cost $\mathcal{E}(\mathrm{\Pi})$. The parameter $\epsilon $ controls the smoothness (entropy) of the coupling matrix $\mathrm{\Pi}$ where larger values of $\epsilon $ encourage $\mathrm{\Pi}$ to put uniform weights on many of its entries, leading to less precision in the feature matches. However, increasing $\epsilon $ also leads to better numerical stability and a significant speedup of the alternating Sinkhorn algorithm used to optimize the objective function described below.
UGW optimization algorithm
Now we are ready to write out an algorithm to optimize the UGW objective in Equation 42. First write our objective as
Using the quadratic nature of our cost function, we aim to perform an alternating minimization in the two copies of $\mathrm{\Pi}$. For the moment, let’s differentiate these two copies by $\mathrm{\Pi}$ and $\mathrm{\Gamma}$ and write the new cost
Before we expand this cost, we introduce the notation $m(\pi )$ to denote the sum of the elements of $\pi $ which can be a vector, matrix or tensor. In general, for four positive distributions $\pi ,a\in {\mathbb{R}}_{+}^{p}$ and $\gamma ,b\in {\mathbb{R}}_{+}^{q}$ we have that the KL satisfies the tensorization property
Specifically, if we remove those terms that do not depend on $\gamma $ we are left with
This allows us to write for the marginal constraints $a\in {\mathbb{R}}_{+}^{{p}_{1}},b\in {\mathbb{R}}_{+}^{{p}_{2}}$ and couplings $\mathrm{\Pi},\mathrm{\Gamma}\in {\mathbb{R}}_{+}^{{p}_{1}\times {p}_{2}}$ that
where in the expansions above we have removed all terms that are independent of $\mathrm{\Gamma}$. Finally, expanding out ${\mathcal{F}}_{\rho ,\epsilon}(\mathrm{\Pi},\mathrm{\Gamma})$ and keeping only those terms that depend on $\mathrm{\Gamma}$ we get
where the cost matrix ${C}^{\mathrm{\Pi}}\in {\mathbb{R}}^{{p}_{1}\times {p}_{2}}$ is defined as
where we have hidden the dependence of ${C}^{\mathrm{\Pi}}$ on the distance matrices ${D}^{x},{D}^{y}$, the marginals $a,b$, and the regularization parameters $\rho ,\epsilon $ for ease of notation.
Remarkably, the cost above in $\mathrm{\Gamma}$ for fixed $\mathrm{\Pi}$ is in the form of an unbalanced optimal transport problem which can be solved through unbalanced Sinkhorn iterations (Algorithm 1). Note that in our derivation above, it did not matter whether we optimized $\mathrm{\Gamma}$ with $\mathrm{\Pi}$ fixed or vice versa because the cost ${\mathcal{F}}_{\rho ,\epsilon}(\mathrm{\Pi},\mathrm{\Gamma})$ is symmetric in both of its arguments.
Our iterative algorithm for solving the unbalanced GW problem will proceed at each iteration by optimizing $\mathrm{\Gamma}$ to minimize the cost above using the unbalanced Sinkhorn method, setting $\mathrm{\Pi}$ equal to $\mathrm{\Gamma}$ and repeating. With each iteration, we expect this iterative procedure to make smaller and smaller updates to $\mathrm{\Gamma}$ until convergence. By definition, at the end of each iteration we assign $\mathrm{\Pi}=\mathrm{\Gamma}$ so the minimizer of ${\mathcal{F}}_{\rho ,\epsilon}(\mathrm{\Pi},\mathrm{\Gamma})$ we converge to should also be a minimizer of the original UGW cost ${\mathcal{L}}_{\rho ,\epsilon}(\mathrm{\Pi})$ in the sense that the relaxation of ${\mathcal{L}}_{\rho ,\epsilon}(\mathrm{\Pi})$ to ${\mathcal{F}}_{\rho ,\epsilon}(\mathrm{\Pi},\mathrm{\Gamma})$ is tight. This is proven rigorously under strict mathematical assumptions in Sejourne et al., 2021. We state the full UGW optimization algorithm below.
Algorithm 2. UnbalancedGromovWasserstein 

input: Distance matrices $D}^{x},{D}^{y$, marginals a, b marginal relaxation $\rho$, entropic regularization $\epsilon$ output: Return the coupling matrix $\mathrm{\Pi}$ Initialize $\mathrm{\Pi}=\mathrm{\Gamma}=a\otimes b/\sqrt{m(a)m(b)}$ while $\mathrm{(}\mathrm{\Pi}\mathrm{,}\mathrm{\Gamma}\mathrm{)}$ has not converged do Update $\mathrm{\Pi}\leftarrow \mathrm{\Gamma}$ Update $\mathrm{\Gamma}=\text{UnbalancedSinkhorn}({C}^{\mathrm{\Pi}},a,b,\rho m(\mathrm{\Pi}),\epsilon m(\mathrm{\Pi}))$ Rescale $\mathrm{\Gamma}\leftarrow \sqrt{m(\mathrm{\Pi})/m(\mathrm{\Gamma})}\mathrm{\Gamma}$ Update $\mathrm{\Pi}\leftarrow \mathrm{\Gamma}$ Return $\mathrm{\Pi}$. 
Following the implementation of the UGW algorithm in Sejourne et al., 2021, we initialize both $\mathrm{\Pi}$ and $\mathrm{\Gamma}$ to be the product distribution of the marginals $a\otimes b/\sqrt{m(a)m(b)}$ before we begin the optimization. Also, we note that if $(\mathrm{\Pi},\mathrm{\Gamma})$ is a minimizer of our UGW objective ${\mathcal{F}}_{\rho ,\epsilon}(\mathrm{\Pi},\mathrm{\Gamma})$, then so is $(\frac{1}{s}\mathrm{\Pi},s\mathrm{\Gamma})$ for any scale factor $s>0$. Hence, we can set $m(\frac{1}{s}\mathrm{\Pi})=m(s\mathrm{\Gamma})$ by choosing $s=\sqrt{m(\mathrm{\Pi})/m(\mathrm{\Gamma})}$. This motivates the final step in the while loop of the UGW algorithm where the rescaling of $\mathrm{\Gamma}$ by the factor $\sqrt{m(\mathrm{\Pi})/m(\mathrm{\Gamma})}$ leads to mass equality $m(\mathrm{\Pi})=m(\mathrm{\Gamma})$ and also stabilizes the convergence of the algorithm.
Returning to our metabolomics matching problem, we further guide our UGW optimization procedure by discouraging it from matching metabolic feature pairs whose masstocharge ratios are incompatible. Namely, we choose a value ${m}_{\text{gap}}$ such that for all pairs $(i,j)$ with $i\in [{p}_{1}],j\in [{p}_{2}]$ and masstocharge ratios ${m}_{i}^{x},{m}_{j}^{y}$ we enforce that
In practice, this is done by taking the optimal transport cost ${C}^{\mathrm{\Pi}}$ in every iteration of the UGW algorithm and premultiplying it elementwise by a factor $W\in {\mathbb{R}}_{+}^{{p}_{1}\times {p}_{2}}$ given by
where ${\mathrm{\U0001d7cf}}_{\mathcal{X}}$ denotes the indicator function that is one when the condition $\mathcal{X}$ is satisfied and zero otherwise. Such a prefactor changes the transport cost to be very large for feature matches with incompatible masstocharge ratio times, and hence, the entries of $\mathrm{\Pi}$ set small weights at these entries. Our weighted UGW algorithm is rewritten below.
Algorithm 3. WeightedUnbalancedGromovWasserstein 

input : Distance matrices ${D}^{x},{D}^{y}$, marginals $a,b$, marginal relaxation $\rho $, entropic regularization $\epsilon $, masstocharge ratios ${m}^{x},{m}^{y}$, masstocharge ratio gap ${m}_{\text{gap}}$ output: Return the coupling matrix $\mathrm{\Pi}$ Initialize $\mathrm{\Pi}=\mathrm{\Gamma}=a\otimes b/\sqrt{m(a)m(b)}$ Set ${W}_{ij}=99\cdot {\mathbf{1}}_{\{{m}_{i}^{x}{m}_{j}^{y}>{m}_{\text{gap}}\}}+1$ for $i\in [{p}_{1}]$ and $j\in [{p}_{2}]$ while $\mathrm{(}\mathrm{\Pi}\mathrm{,}\mathrm{\Gamma}\mathrm{)}$ has not converged do Update $\mathrm{\Pi}\leftarrow \mathrm{\Gamma}$ Update $\mathrm{\Gamma}=\text{UnbalancedSinkhorn}(W\odot {C}^{\mathrm{\Pi}},a,b,\rho m(\mathrm{\Pi}),\epsilon m(\mathrm{\Pi}))$ Rescale $\mathrm{\Gamma}\leftarrow \sqrt{m(\mathrm{\Pi})/m(\mathrm{\Gamma})}\mathrm{\Gamma}$ Update $\mathrm{\Pi}\leftarrow \mathrm{\Gamma}$ Return $\mathrm{\Pi}$. 
As mentioned before, the coupling matrix returned by our weighted UGW algorithm is a realvalued matrix rather than a binary matching matrix. In the next section, we describe how we incorporate metabolite retention time information to filter out unlikely pairs in our coupling matrix and transform it into a valid onetoone matching of features across two datasets.
Retention time drift estimation and filtering
To filter out unlikely matches from the coupling matrix returned by Algorithm 3 above, we use the retention times (RTs) of the metabolites in both datasets. We remind the reader that RTs were not incorporated into the weighted UGW algorithm since they often exhibit a nonlinear deviation between datasets, and hence are not directly comparable. However, using the metabolite coupling $\stackrel{~}{\mathrm{\Pi}}\in {\mathbb{R}}_{+}^{{p}_{1}\times {p}_{2}}$ obtained from Algorithm 3, it is possible to estimate this RT drift. The estimated RT drift $\widehat{f}:{\mathbb{R}}_{+}\to {\mathbb{R}}_{+}$ allows us to assess the plausibility of the pairs recovered by the restricted UGW coupling $\stackrel{~}{\mathrm{\Pi}}$, and discard pairs incompatible with the estimated drift.
We propose to learn the drift $\widehat{f}$ through the weighted spline regression
where ${\mathcal{B}}_{n,k}$ is the set of $n$order Bsplines with $k$ knots. All pairs $(R{T}_{i}^{x},R{T}_{j}^{y})$ in objective Equation 51 are weighted by the coefficients of $\stackrel{~}{\mathrm{\Pi}}$ so that larger weights are given to pairs identified with high confidence in the first step of our procedure.
Pairs identified as incompatible with the estimated RT drift are then discarded from the coupling matrix. To do this, we first take the estimated RT drift $\widehat{f}$, and the set of pairs $\mathcal{S}=\{i,j:{\stackrel{~}{\mathrm{\Pi}}}_{i,j}\ne 0\}$ recovered in $\stackrel{~}{\mathrm{\Pi}}$ with nonzero entries. We then define the residual associated with $(i,j)\in \mathcal{S}$ as
The 95% prediction interval and the median absolute deviation (MAD) of these residuals are given by
where $\mathcal{S}$ is the size of $\mathcal{S}$ and the functions std, median denote the standard deviation and median respectively. Following Habra et al., 2021, we then create a new filtered coupling matrix $\widehat{\mathrm{\Pi}}\in {\mathbb{R}}_{+}^{{p}_{1}\times {p}_{1}}$ given by
where ${r}_{\text{thresh}}$ is a given filtering threshold. The procedure of estimating the drift function $\widehat{f}$ in Equation 51 and filtering the coupling can be repeated for multiple iterations, to improve the drift and coupling estimation. In our main algorithm, we use two preliminary iterations where we estimate the RT drift and discard outliers with ${r}_{\text{thresh}}=\text{PI}$, defined as points falling outside of the 95% prediction interval. We the reestimate the drift and perform a final filtering step with the more stringent MAD by setting ${r}_{\text{thresh}}=2\times \text{MAD}$.
At this stage, it is possible for $\widehat{\mathrm{\Pi}}$ to still contain coefficients of very small magnitude. As an optional postprocessing step, we discard these coefficients by setting all entries smaller than $\tau \text{max}(\widehat{\mathrm{\Pi}})$ to zero for some scaling constant $\tau \in [0,1]$. Lastly, a feature from either study could have multiple possible matches, since $\widehat{\mathrm{\Pi}}$ can have more than one nonzero coefficient per row or column. Although reporting multiple matches can be helpful in an exploratory context, for the sake of simplicity in our analysis, the final output of GromovMatcher returns a onetoone matching. Consequently, we only keep those metabolite pairs $(i,j)$ where the entry ${\widehat{\mathrm{\Pi}}}_{ij}$ is largest in its corresponding row and column. All nonzero entries of $\widehat{\mathrm{\Pi}}$ which do not satisfy this criterion are set to zero. Finally, we convert $\widehat{\mathrm{\Pi}}$ into a binary matching matrix $M\in {\{0,1\}}^{{p}_{1}\times {p}_{2}}$ with ones in place of its nonzero entries and this final output is returned to the user.
As a naming convention, we use the abbreviation GM for our GromovMatcher method, and use the abbreviation GMT when running GromovMatcher with the optional $\tau $thresholding step.
GromovMatcher algorithm summary
In summary, our full GromovMatcher algorithm consists of (1) UGW optimization followed by (2) retention time drift estimation and filtering.
The tuning of $\rho $ and $\u03f5$ was computationally driven and the two parameters were set as low as possible, with $\rho =0.05$ and $\u03f5=0.005$. Based on literature (Loftfield et al., 2021; Hsu et al., 2019; Climaco Pinto et al., 2022; Habra et al., 2021; Chen et al., 2021) and what is considered to be a plausible variation of a feature’s $m/z$, we set ${m}_{\text{gap}}$ = 0.01 ppm. For RT drift estimation, the order of the Bsplines was set to $n=3$ by default, while the number of knots $k$ was selected by 10fold crossvalidation. If the optional thresholding step was applied in GMT, we set $\tau =0.3$. Otherwise, we let $\tau =0$ which gives the unthresholded GM algorithm.
Algorithm 4. GromovMatcher 

input : Distance matrices ${D}^{x},{D}^{y}$, marginals $a,b$, marginal relaxation $\rho $, entropic regularization $\epsilon $, masstocharge ratios ${m}^{x},{m}^{y}$, masstocharge ratio gap ${m}_{\text{gap}}$, retention times $R{T}^{x},R{T}^{y}$, Bspline order $n$, filtering threshold $\tau $ output: Return the matching matrix $M$ and the retention time drift $\widehat{f}$ # Step 1: Weighted UGW optimization Compute $\stackrel{~}{\mathrm{\Pi}}=\text{WeightedUnbalancedGromovWasserstein}({D}^{x},{D}^{y},a,b,\rho ,\epsilon ,{m}^{x},{m}^{y})$ # Step 2: Retention time drift estimation and filtering for $i=1:3$ do Perform weighted spline regression Equation 51 for RT drift $\hat{f}\in {\mathcal{B}}_{n,k}$ where k is chosen by 10fold cross validation Initialize ${r}_{\text{thresh}}=0$ if $i<3$ then Set ${r}_{\text{thresh}}=\text{PI}$ from Equation 53 else Set ${r}_{\text{thresh}}=2\times \text{MAD}$ from Equation 53 Set $\stackrel{~}{\mathrm{\Pi}}=\widehat{\mathrm{\Pi}}$ Compute $\mathcal{U}=\mathrm{max}(\widehat{\mathrm{\Pi}})$ Set ${\widehat{\mathrm{\Pi}}}_{ij}=0$ if $\hat{\mathrm{\Pi}}}_{ij}<\tau \mathcal{U$ for $i\in [{p}_{1}]$ and $j\in [{p}_{2}]$ Set ${\widehat{\mathrm{\Pi}}}_{ij}=0$ if $i\ne {\text{argmax}}_{k}{\widehat{\mathrm{\Pi}}}_{k,j}$ or $j\ne {\text{argmax}}_{k}{\widehat{\mathrm{\Pi}}}_{i,k}$ for $i\in [{p}_{1}]$ and $j\in [{p}_{2}]$ Define the binarized matching $M}_{ij}={\mathbf{1}}_{\{{\hat{\mathrm{\Pi}}}_{ij}>0\}$ Return $M$ and $\widehat{f}$. 
Appendix 2
Here we discuss existing metabolomic alignments methods and the hyperparameter experiments we perform on these methods. We consider two existing alignment methods for comparison, metabCombiner (Habra et al., 2021) and M2S (Climaco Pinto et al., 2022). Both of them take the same kind of input as GromovMatcher, i.e. feature tables with features identified with their $m/z$, RT, and intensities across samples.
MetabCombiner hyperparameter experiments
MetabCombiner (Habra et al., 2021) is a threestep process that begins by grouping features based on their $m/z$ within userspecified bins. This creates a search space for potential feature pairs. In the second step, MetabCombiner estimates the RT drift using the potential feature pairs identified in the first step, and eliminates outlying pairs over several iterations. This step can incorporate prior knowledge by identifying shared features and marking them as anchors, which are not discarded. In the final step, MetabCombiner scores the remaining feature pairs based on their $m/z$, RT, and relative intensity compatibility to discriminate between multiple matches for one feature. The scoring system relies on weights assigned to $m/z$, RT, and feature intensities, with the magnitude of those weights reflecting the reliability of the corresponding measurements across studies.
MetabCombiner (Habra et al., 2021) includes adjustable parameters throughout the pipeline. We set most of them to default values unless otherwise stated. MetabCombiner first establishes candidate pairs by binning features in the $m/z$ dimension with a width of binGap, and pairing the features sorted by relative intensities. The ‘binGap’ parameter sets the $m/z$ tolerance of metabCombiner, similar to ${m}_{\text{gap}}$ in GromovMatcher. We used the same value of 0.01 as in GromovMatcher.
MetabCombiner then estimates the RT drift using basis splines, and removes pairs associated with a high residual (twice the mean model error) from the candidate set.
In our main experiment, the RT drift is estimated exclusively using candidate pairs selected by the pipeline. However, it is also possible to include known ground truth pairs as ’anchors’ to estimate the RT drift. We choose not to rely on prior knowledge for drift estimation as Habra et al., 2021 show their drift estimation to be efficient and robust, even without prior knowledge. To confirm this claim, we conduct a sensitivity analysis comparing the results obtained in our main experiment with those obtained when supplying metabCombiner with known shared metabolites to anchor the RT drift estimation. We randomly select 100 anchors from the ground truth matching and compute the metabCombiner matchings with otherwise identical settings as in our main experiment. The results from this analysis (reported in Appendix 2—figure 1) show that the unsupervised RT drift estimation (using anchors selected by the pipeline only) performs as well as the supervised RT drift estimation, showing the drift estimation to be very consistent, with or without shared entities.
After establishing candidate pairs and filtering out those that contradict the estimated RT drift, metabCombiner discriminates between multiple matches using a scoring system that considers $m/z$, RT, and rankings of the median feature intensities. Each dimension has a specific weight that can be left at default, manually adjusted, or automatically tuned using known matched pairs. Habra et al., 2021 provide qualitative guidelines for tuning the weights manually, mainly based on the experimental conditions and visual inspection of the RT drift plot. Since this approach is difficult to implement in the various settings we consider for our simulation study, we rely on the quantitative tuning function included in the metabCombiner pipeline. This function takes into account known shared features and tunes the weights to optimize the scores of those known matches. We randomly select 100 known true matches to define the objective function metabCombiner maximizes. We search over the recommended range of values, with the $m/z$ weight $A\in [50,150]$, the RT weight $B\in [5,20]$ and the feature intensities weight $C\in [0,1]$. Appendix 2—figure 1 presents the results obtained with the weights set at default values ($A=100,B=15,C=0.5$), as a sensitivity analysis.
M2S hyperparameter experiments
Climaco Pinto et al., 2022 introduce M2S as a more versatile alternative to metabCombiner, while still adhering to most of its core principles. Like metabCombiner, M2S follows a threestep process. First it searches for matches within userdefined thresholds for $m/z$, retention time, and mean feature intensity. Next, M2S estimates $m/z$, RT and feature intensity drifts between datasets and removes any outlier pairs. Finally, M2S selects the best match using a scoring system that weighs each measurement, similar to metabCombiner. M2S notably stands out by providing greater flexibility in the methods and measurements used at each step of the procedure, resulting however in a larger number of parameters that require manual finetuning. To address this, we adopt two different approaches for the simulation study and the EPIC study alignment. In the simulation study, we set the initial thresholds to oracle values and investigate technical parameters. For the EPIC study alignment, we use the combination of technical parameters with the best average F1score in the simulation study and select the best threshold values based on the performance on the validation subset.
More precisely, M2S first matches all pairs of metabolic features whose absolute difference in $m/z$, RT, and median of $lo{g}_{10}$ FI are within the userdefined thresholds ‘MZ_intercept’, ‘RT_intercept’ and ‘log10FI_intercept’. On simulated data experiments, we set these thresholds to MZ_intercept = 0.01, RT_intercept = 3.5 and log10FI_intercept = 0.2 which are large enough to not exclude any true feature matches in any of the scenarios for our simulated data under low, medium, and high overlap/noise (see Methods). M2S also offers more detailed options to match features whose absolute difference stays within two lower and upper bound lines with a given slope where the intercepts of these lines are defined using the values above. In our analysis, we set the slopes of these linear boundaries to zero so as to not remove any true matches. Because the reference and target studies we are matching in the simulated analysis are on the same scale, we set the FI adjust method to ‘none’.
The second step of M2S involves calculating penalization scores for every pair of matches which are used to determine the best set of matches between metabolic features of both datasets. This step depends on a set of hyperparameters which we perform a grid search over to optimize the performance of M2S. For estimating the $m/z$, RT, and FI drift, the hyperparameters are the percentage of neighbors ‘nrNeighbors‘, the neighborhood shape ‘neighMethod’, and the LOESS span percentage ‘pctPointsLoess’ used to smooth the estimated drift functions. After the drifts are estimated, they are normalized using a method specified by ‘residPercentile’ that puts the $m/z$, RT, and FI residuals on the same scale. We always fix residPercentile = NaN which defaults to the standard 2 × MAD normalization. Next, for every remaining metabolic feature match, the residuals/drifts of the $m/z$, RT, and FI are added together by taking the weighted square root sum of squares. For unnormalized data where feature intensity magnitudes are important, we weight all three drifts equally using $W=(1,1,1)$ and for data with normalized feature intensities we set the FI drift weight to zero such that $W=(1,1,0)$. Finally, using these weighted penalization scores, M2S selects the best matched pair within a multiple match cluster to obtain a onetoone matching between datasets.
The third and final step of M2S involves removing those remaining matches which have large differences in $m/z$, RT, or FI. This can be performed using several methods indicated by the hyperparameter ‘methodType’. Each method excludes those matched pairs whose differences in $m/z$, RT, or FI exceed a certain number of median absolute deviations indicated by the parameter ‘nrMAD’. The remaining onetoone metabolic feature matches are returned as the final result of the M2S algorithm.
To optimally tune M2S on our simulated experiments, we determine the optimal M2S parameter combination for each individual simulation setting (low, medium, high overlap and noise) by performing a grid search over the product of parameter lists
nrNeighbors = [0.01, 0.05, 0.1, 0.5, 1]
neighMethod = [‘cross’, ‘circle’]
pctPointsLoess = [0, 0.1, 0.5]
methodType = [‘none’, ‘scores’, ‘byBins’, ‘trend_mad’, ‘residuals_mad’]
nrMAD = [1, 3, 5]
Each parameter combination for M2S is tested across 20 randomly generated datasets at the same overlap and noise settings. For each setting, the combination of parameters above with the best average F1score across these 20 trials is used as the optimal parameter choice.
M2S applies initial RT thresholds to search for candidate pairs, which may favor settings where the RT drift follows a linear trend. Therefore, as a sensitivity analysis, we apply M2S to simulated data with a linear drift. The simulation process is identical to that of our main simulation study, except for the deviation of the RT in dataset 2. Specifically, for a given overlap value, we divide the original realworld dataset into two smaller datasets and introduce random noise to the $m/z$, RT and intensities of the features, without introducing a systematic deviation to the RT in dataset 2. M2S parameters are kept identical to the ones used in our main analysis in comparable settings. The results obtained by M2S on three pairs of datasets generated for three overlap values (0.25, 0.5 and 0.75) and a medium noise level are reported in Appendix 2—table 1. While the results obtained in a high overlap setting are close to those obtained in our main analysis M2S demonstrates better performance in a low overlap setting when the RT drift is linear than in our main analysis. This observation is consistent with the results obtained by M2S on EPIC data, considering the relatively low estimated overlap between the aligned EPIC studies in our main analysis.
For the EPIC data, we select the parameter combination that yields the highest F1score across all simulated settings. However, due to the unavailability of oracle values for setting initial thresholds, we perform a search over several MZ intercept values (0.01, 0.05, and 0.1), RT intercept values (0.1, 0.5, 1, and 5), and logFI intercept values (1, 10, and 100).
Appendix 3
In this section, we study the sensitivity of all three alignment methods GMT, M2S, and mC to the validation dataset split when creating two validation studies for matching. As described in the section "Validation on groundtruth data" and depicted in Figure 2 of the main text, we generate two datasets to be matched by splitting an initial LCMS dataset with $p$ features and $n$ samples into two smaller overlapping datasets. The first dataset has p_{1} features and n_{1} samples while the second dataset has p_{2} features and n_{2} samples. The sets of samples in both datasets are disjoint such that ${n}_{1}+{n}_{2}=n$. However, the dataset split is constructed such that both datasets share $\approx \lambda p$ of their features where $\lambda \in [0,1]$ is an overlap fraction. Namely, this is done by defining the dataset feature sizes as
and the dataset sample sizes as
As before, $\lfloor \cdot \rfloor $ and $\lceil \cdot \rceil $ denote integer floor and ceiling functions. Then taking the original LCMS dataset and randomly permuting its samples and features, the first p_{1} features and first n_{1} samples are placed into dataset 1 while the last p_{2} features and last n_{2} samples are placed into dataset 2. It is indeed easy to check here that with such a splitting procedure, the feature overlap between both datasets is ${p}_{1}+{p}_{2}p\approx \lambda p$.
Here ${\lambda}_{f}\in [0.5,1]$ controls the fraction of features in dataset 1 that is not shared with dataset 2 and ${\lambda}_{s}\in (0,1)$ controls the fraction of samples in dataset 1 vs. dataset 2. In particular, if ${\lambda}_{f}=1$ then the features in dataset 2 are entirely a subset of those in dataset 1. In the experiments described in the main text, we always set ${\lambda}_{f}={\lambda}_{s}=0.5$ as to balance the number of features and samples in both resulting datasets.
Now we study how the performance of all three alignment methods changes when ${\lambda}_{f},{\lambda}_{s}$ and the feature overlap $\lambda $ are varied. Here we vary the feature overlap $\lambda \in \{0.25,0.5,0.75\}$, the feature fraction ${\lambda}_{f}\in \{0.5,0.6,0.7,0.8,0.9\}$, and the sample fraction ${\lambda}_{s}\in \{0.1,0.2,\mathrm{\dots},0.9\}$. In Appendix 3—figures 1–3 we show how the precision and recall of GMT, M2S, and mC depend on these parameters. Here we use the same unnormalized validation data and experimental setup as decribed in the main text section "Validation on groundtruth data" and in the Methods and Materials section "Validation on simulated data". For each triple $(\lambda ,{\lambda}_{f},{\lambda}_{s})$ we randomly generate 20 dataset splits with these parameters and show the average precision and recall for each method over these trials. Our method GMT (thresholded GromovMatcher) is applied outofbox with the default hyperparameter settings. The algorithm hyperparameters for mC and M2S are chosen optimally for each individual triple of dataset parameters $(\lambda ,{\lambda}_{f},{\lambda}_{s})$ to maximize the average F1 score in each setting. The hyperparameters searched over when optimizing mC and M2S are described in detail in Appendix 2.
Consistent with prior validation experiments, we find that GromovMatcher outperforms both mC and M2S in all dataset regimes, for low overlap and high overlap $\lambda $ as well as for varying balances of features ${\lambda}_{f}$ and samples ${\lambda}_{s}$. Remarkably, all three methods exhibit the same sensitivity to variations of $(\lambda ,{\lambda}_{f},{\lambda}_{s})$. All methods exhibit a monotonic decrease in their precision as ${\lambda}_{f}$ drops from 0.9 to 0.5. In other words, the most challenging setting for matching both datasets is when dataset 1 and dataset 2 both have an equal number of unshared features (e.g. ${\lambda}_{f}=0.5$). Likewise, the simplest setting for matching is when the features in dataset 2 are exactly a subset of the features in dataset 1 (e.g. ${\lambda}_{f}=1$). Sensitivity to this parameter ${\lambda}_{f}$ is most noticeable at low feature overlap $\lambda =0.25$.
Appendix 4
Here we describe additional preprocessing details and analyses of the EPIC data.
Centered and scaled data  Negative mode
In this section, we present the results obtained on centered and scaled EPIC data in negative mode, shown in Figure 4 of our main paper. However, due to the smaller size of the validation subset (42 features examined in negative mode compared to 163 in positive mode), the evaluation of the performance of the three methods may be less reliable than in positive mode.
First, we align the CS and HCC studies in negative mode and detect a total of 449, 492, and 180 matches with GM, M2S, and metabCombiner, respectively. Similar to the positive mode analysis, we evaluate the precision and recall of the three methods on the 42 feature validation subset, of which 19 were manually matched. GM and M2S demonstrate identical F1scores of 0.98, while metabCombiner performs poorly in comparison. GM is able to recover all 19 true matches and identified only 1 false positive, while M2S recovers no false positives but missed 1 true positive.
Next, we align the CS and PC studies in negative mode and detect a total of 485, 569, and 314 matches with GM, M2S, and metabCombiner, respectively. Again, we evaluate the precision and recall of the three methods on the 42 feature validation subset, of which 26 were manually matched. MetabCombiner performs better than in the other EPIC pairings with an F1score of 0.857, but is still outperformed by the other two methods. GM is slightly outperformed by M2S in this setting, with an almost identical precision of 0.93, but a slightly higher recall for M2S due to detecting 1 additional true positive. However, this remains a good performance for GM since M2S was optimally tuned using the validation subset itself.
Noncentered and nonscaled data
As a sensitivity analysis, we apply the three methods to EPIC data that has not been centered or scaled. The detailed results can be found in Appendix 4—table 1.
M2S was tuned manually on the validation subset to ignore feature intensities in both cases. As a result, it maintains its performance compared to our main experiment. On the other hand, the performance of GM and metabCombiner is affected by the lack of consistency in feature intensities. MetabCombiner’s recall drops slightly but its precision remains comparable to that of our main experiment, with the method clearly favoring the latter. Although GM’s recall decreases slightly in positive mode, it remains more precise than the optimally tuned M2S, and it balances precision and recall better than metabCombiner. Interestingly, GM’s results in negative mode are improved compared to our main experiment, and it outperforms both mC and M2S. However, since the validation subset in negative mode is relatively small, these differences may not be significant. Nonetheless, GM maintains a good performance, similar to that of the optimally tuned M2S.
Similar to the analysis we conducted on centered and scaled data, we find a high number of false positives when aligning the CS study and the PC study in positive mode. Therefore, we manually examine the matches recovered by GM. Our examination reveals 2 false positives, 4 unclear matches, and 3 additional good matches that GM also identifies in our main analysis. This demonstrates that the lack of centering and scaling results in two additional false positives for GM that are not present in our main results.
Illustration for alcohol biomarker discovery
Loftfield et al., 2021 identified 205 features associated with alcohol intake in the CS study, using a false discovery rate (FDR) correction to account for multiple testing. By applying an FDR correction in our pooled analysis, we identify 243 features associated with alcohol intake. Out of those 243 features, 185 are consistent with the features identified in the discovery step of Loftfield et al., 2021, while 55 features are newly discovered (Figure 5c). We examine the 20 features identified as significant in Loftfield et al.’s discovery analysis but that are not significant in our pooled analysis. Both manual and GM matching yield identical results for these features, indicating that the loss of significance is not due to incorrect matching. Upon further investigation, we find that these features do not demonstrate a meaningful association with alcohol intake in the HCC and PC studies. This observation is reinforced by the fact that none of these features are among the 10 features that persisted after the validation step in Loftfield et al.
Out of the 205 features initially discovered in Loftfield et al., 2021, 10 are replicated in the EPIC HCC and PC studies using the more stringent Bonferroni correction. When using a Bonferroni correction in our pooled analysis, we find significant association between alcohol intake and 92 features, 36 of which are effectively shared by the three studies. Notably, these features include all 10 features that were retained in Loftfield et al. (Figure 5c).
This analysis illustrates how GromovMatcher can be used in the context of biomarker discovery, and its potential to allow for increased statistical power.
Appendix 5
Here we investigate how the choice of the reference dataset influences the discovery of metabolites shared across the CS, HCC and PC EPIC studies by GromovMatcher. All three methods considered in this paper, GromovMatcher, M2S, and metabCombiner, are limited to the comparison of two datasets. However, they can still be used to compare and pool multiple datasets using a multistep procedure. Namely, this can be done by designating a ‘reference’ dataset and aligning all studies to it one by one. We take this exact approach in our analysis when aligning the CS, HCC, and PC studies of the EPIC data in positive mode. Namely, the HCC and PC studies are both aligned to the CS study (see main text Figure 5b). However, this method raises two critical questions: (i) how does the use of a reference dataset affect matching results, and (ii) how is the matching affected by the choice of reference dataset.
To address these questions, we compare the features identified as common to the three studies using two different studies as references: the CS study used as reference in the main analysis, and the HCC study. For simplicity, let’s denote ${M}_{\text{study 1},\text{study 2}}$ the matching matrix obtained when aligning study 1 and study 2.
Changes in matching results when reference dataset is used
Concerning question (i), we compare two matchings: HCC to CS to PC (the matrix product ${M}_{\text{CS},\text{HCC}}^{T}{M}_{\text{CS},\text{PC}}$) which we will refer to as the reference matching, and the direct matching of PC to HCC (${M}_{\text{HCC},\text{PC}}$). Note that these matchings are not fully comparable as the former considers only features found in CS, potentially missing unique HCC and PC matches. We can however compare the two matchings on the subset of 706 features common to all three studies, as determined by the reference matching. We find that the direct matching supports 683 out of them, indicating that the matching via a reference still yields good results compared to the direct matching (see Appendix 5—figure 1).
Effect of reference dataset choice on matching results
Concerning question (ii), we compare the features identified as common to the three studies using two different studies as references: the CS study used in the paper, and the HCC study. We find that they identify 706 and 708 common features respectively, with an overlap of 640 features (see Appendix 5—figure 2). This highlights that the choice of reference does matter to some extent. In the paper, choosing CS as a reference was informed by CS’s sample size, and study population.
Data availability
The LCMS data used to generate our simulated validation experiments can be downloaded at https://www.ebi.ac.uk/metabolights/MTBLS1684/files at the bottom of the "Files" section in under filename 'FILES/metabolomics\_normalized\_data.xlsx'. The EPIC data is considered sensitive data and is therefore not publicly available. It is centralised at IARC and can be analysed through the IARC Scientific IT platform after a Data Use Agreement has been signed. Access requests should be submitted to the IARC Steering Committee https://epic.iarc.fr/access/index.php. All code for the data preprocessing, figure generation, as well as the GromovMatcher algorithm and its comparison to other methods are available at: https://github.com/sgstepaniants/GromovMatcher (copy archived at Breeur and Stepaniants, 2024). Instructions and examples for how to run the GromovMatcher method are provided in the Github repository. The metabCombiner implementation written by the original authors was taken from their Github codebase: https://github.com/hhabra/metabCombiner (Habra, 2024). The M2S implementation of the original authors was taken from their Github codebase: https://github.com/rjdossan/M2S (Rjdossan, 2024).

MetaboLightsID MTBLS1684. A multiomic analysis of birthweight in newborn cord blood reveals new underlying mechanisms related to cholesterol metabolism.
References

Approximate is better than “exact” for interval estimation of binomial proportionsThe American Statistician 52:119–126.https://doi.org/10.1080/00031305.1998.10480550

ConferenceGromovWasserstein Alignment of Word Embedding SpacesEMNLP Brussels, Belgium: Association for Computational Linguistics. pp. 1881–1889.https://doi.org/10.18653/v1/D181214

ConferenceTowards optimal transport with global InvariancesProceedings of the TwentySecond International Conference on Artificial Intelligence and Statistics. pp. 1870–1879.

Metabolomics in environmental toxicology: Applications and challengesTrends in Environmental Analytical Chemistry 34:e00161.https://doi.org/10.1016/j.teac.2022.e00161

SoftwareGromovmatcher, version swh:1:rev:c38a56b55e8746d874f94e371c6cdd1074b28b74Software Heritage.

Interval estimation for a binomial proportionStatistical Science 16:133.https://doi.org/10.1214/ss/1009213286

Unbalanced optimal transport: Dynamic and Kantorovich formulationsJournal of Functional Analysis 274:3090–3123.https://doi.org/10.1016/j.jfa.2018.03.008

ConferenceJoint distribution optimal transportation for domain adaptationNIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 3733–3742.

SCOT: SingleCell MultiOmics Alignment with Optimal TransportJournal of Computational Biology 29:3–18.https://doi.org/10.1089/cmb.2021.0446

BookMetric Structures for Riemannian and NonRiemannian SpacesBirkhäuser Boston, Inc.https://doi.org/10.1007/9780817645830

On the translocation of massesJournal of Mathematical Sciences 133:1381–1382.https://doi.org/10.1007/s1095800600492

Novel biomarkers of habitual alcohol intake and associations with risk of pancreatic and liver cancers and liver disease mortalityJournal of the National Cancer Institute 113:1542–1550.https://doi.org/10.1093/jnci/djab078

Gromov–wasserstein distances and the metric approach to object matchingFoundations of Computational Mathematics 11:417–487.https://doi.org/10.1007/s1020801190935

Mémoire sur la théorie des déblais et des remblaisMem. Math. Phys. Acad. Royale Sci pp. 666–704.

Separation strategies for untargeted metabolomicsJournal of Separation Science 34:3460–3469.https://doi.org/10.1002/jssc.201100532

ConferenceGromovwasserstein averaging of kernel and distance matricesICML. pp. 2664–2672.https://doi.org/10.5555/3045390.3045671

Computational optimal transport: With applications to data scienceFoundations and Trends in Machine Learning 11:355–607.https://doi.org/10.1561/2200000073

The blood exposome and its role in discovering causes of diseaseEnvironmental Health Perspectives 122:769–774.https://doi.org/10.1289/ehp.1308015

ConferenceInteractive Supercomputing on 40,000 Cores for Machine Learning and Data Analysis2018 IEEE High Performance Extreme Computing Conference. pp. 1–6.https://doi.org/10.1109/HPEC.2018.8547629

European Prospective Investigation into Cancer and Nutrition (EPIC): study populations and data collectionPublic Health Nutrition 5:1113–1124.https://doi.org/10.1079/PHN2002394

ConferenceThe unbalanced gromov wasserstein distance: Conic formulation and relaxationAdvances in Neural Information Processing Systems 34. pp. 8766–8779.

Group level validation of protein intakes estimated by 24hour diet recall and dietary questionnaires against 24hour urinary nitrogen in the European Prospective Investigation into Cancer and Nutrition (EPIC) calibration studyCancer Epidemiology, Biomarkers & Prevention 12:784–795.

Entropic metric alignment for correspondence problemsACM Transactions on Graphics 35:1–13.https://doi.org/10.1145/2897824.2925903

Alteration of amino acid and biogenic amine metabolism in hepatobiliary cancers: Findings from a prospective cohort studyInternational Journal of Cancer 138:348–360.https://doi.org/10.1002/ijc.29718

Metabolic perturbations prior to hepatocellular carcinoma diagnosis: Findings from a prospective observational cohort studyInternational Journal of Cancer 148:609–625.https://doi.org/10.1002/ijc.33236

metaXCMS: secondorder analysis of untargeted metabolomics dataAnalytical Chemistry 83:696–700.https://doi.org/10.1021/ac102980g

Liquid chromatographymass spectrometry calibration transfer and metabolomics data fusionAnalytical Chemistry 84:9848–9857.https://doi.org/10.1021/ac302227c

Metabolite profiles and the risk of developing diabetesNature Medicine 17:448–453.https://doi.org/10.1038/nm.2307

Metabolomics for investigating physiological and pathophysiological processesPhysiological Reviews 99:1819–1875.https://doi.org/10.1152/physrev.00035.2018

Predicting cell lineages using autoencoders and optimal transportPLOS Computational Biology 16:e1007828.https://doi.org/10.1371/journal.pcbi.1007828
Article and author information
Author details
Funding
National Science Foundation (Graduate Research Fellowship Program 1745302)
 George Stepaniants
National Science Foundation (IIS1838071)
 Philippe Rigollet
National Science Foundation (DMS2022448)
 Philippe Rigollet
National Science Foundation (CCF2106377)
 Philippe Rigollet
World Cancer Research Fund International (IIG_FULL_2022_013)
 Marie Breeur
 Vivian Viallon
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Jörn Dunkel for helpful advice on our manuscript. We acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center Reuther et al., 2018 for providing HPC resources that have contributed to the research results reported within this paper. GS acknowledges support through a National Science Foundation Graduate Research Fellowship under Grant No. 1745302. PR is supported by NSF grants IIS1838071, DMS2022448, and CCF2106377. MB and VV acknowledge support from World Cancer Research Fund (UK) through the World Cancer Research Fund International grant program (grant number: IIG_FULL_2022_013). We are grateful to the Principal Investigators of each of the EPIC centers for sharing the data for our experimental application.
Ethics
The EPIC study, and in particular the three studies nested within EPIC, were conducted according to the Declaration of Helsinki and approved by the ethics committee at the International Agency for Research on Cancer (IARC) (IEC 1016 for the HCC and pancreatic cancer studies, IEC 1229 for the crosssectional study). Written informed consent was obtained from all subjects involved in the study.
Version history
 Sent for peer review:
 Preprint posted:
 Reviewed Preprint version 1:
 Reviewed Preprint version 2:
 Version of Record published:
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.91597. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2023, Breeur, Stepaniants et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 622
 views

 50
 downloads

 0
 citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Cancer Biology
 Genetics and Genomics
Solitary fibrous tumors have gene expression signatures similar to those of neuroendocrine tumors.

 Cancer Biology
Pediatric diffuse midline gliomas (pDMG) are an aggressive type of childhood cancer with a fatal outcome. Their major epigenetic determinism has become clear, notably with the identification of K27M mutations in histone H3. However, the synergistic oncogenic mechanisms that induce and maintain tumor cell phenotype have yet to be deciphered. In 20 to 30% of cases, these tumors have an altered BMP signaling pathway with an oncogenic mutation on the BMP type I receptor ALK2, encoded by ACVR1. However, the potential impact of the BMP pathway in tumors nonmutated for ACVR1 is less clear. By integrating bulk, singlecell, and spatial transcriptomic data, we show here that the BMP signaling pathway is activated at similar levels between ACVR1 wildtype and mutant tumors and identify BMP2 and BMP7 as putative activators of the pathway in a specific subpopulation of cells. By using both pediatric isogenic glioma lines genetically modified to overexpress H3.3K27M and patientsderived DIPG cell lines, we demonstrate that BMP2/7 synergizes with H3.3K27M to induce a transcriptomic rewiring associated with a quiescent but invasive cell state. These data suggest a generic oncogenic role for the BMP pathway in gliomagenesis of pDMG and pave the way for specific targeting of downstream effectors mediating the K27M/BMP crosstalk.