Microbiomes Through the Looking Glass: Linking Species Interactions to Dysbiosis through a Disordered Lotka-Volterra Framework

Jacopo Pasqualini; Amos Maritan; Andrea Rinaldo; Sonia Facchin; Edoardo Savarino; Ada Altieri; Samir Suweis

doi:10.7554/eLife.105948.2

1 Introduction

Microbial communities are a fundamental reservoir of ecological functions and biological diversity. They are relevant for any environmental and host-associated ecosystem, ranging from soil to the human gut. In particular, the human gut microbiome, by interacting with the host’s metabolism and immune system Levy et al. [2016], Barroso-Batista et al. [2015], Barreto and Gordo [2023], is a key regulator of human health. Dysbiosis is defined as an alteration in the composition of healthy microbiomes associated with the gastrointestinal tract Lloyd-Price et al. [2019], Das and Nair [2019], Nishida et al. [2018].

Recently, stochastic non-interacting neutral Zeng et al. [2015], Venkataraman et al. [2015], Sala et al. [2016], Sieber et al. [2019], Descheemaeker and De Buyl [2020], Grilli [2020], Zaoli and Grilli [2022], and logistic models have been successfully used to describe empirical patterns such as species abundance distribution (SAD) and species presence/absence statistics in different types of communities. However, focusing on single-species properties Azaele et al. [2016], Grilli [2020] fails to characterize altered states of microbiome Seppi et al. [2023], Pasqualini et al. [2024]. In particular, recent studies indicate that gut dysbiosis is associated with shifts in microbial species interaction patterns. Network-level properties of species interactions – such as the balance of positive and negative interactions, average interaction strength, and connectivity of the inferred interaction matrix – have been shown to systematically differ between healthy and dysbiotic gut communities (see, for instance, Bashan et al. [2016], Seppi et al. [2023]). Pairwise species interactions have been quantified in simplified in-vitro consortia Kehe et al. [2021], Venturelli et al. [2018]; we assume that the same classes of interactions also operate – albeit in a more complex form – in the native gut microbiome.

For these reasons, several approaches have been proposed in recent years to infer microbial interactions Faust and Raes [2012], Xiao et al. [2017], Camacho-Mateu et al. [2024]. However, such inference protocols remain very challenging and problematic in many respects Faisal et al. [2010], Angulo et al. [2017], Tu et al. [2019], Armitage and Jones [2019], Holt [2020]: a) the very high dimension of typical microbiome datasets and the lack of longitudinal (long-term) experiments; b) the time-dependent nature of microbial species interactions Pacciani-Mori et al. [2021]; c) abiotic factors such as resources, temperature, and pH, can introduce environmental filtering effects Sireci et al. [2022] and induce effective interactions; d) experimental and technical challenges that introduce sampling effects, false positive species Tovo et al. [2020], spurious correlation, and bias in the data Weiss et al. [2016], Gloor et al. [2017], Dohlman and Shen [2019]. Therefore, even in a scenario where metagenomic samples are noise- and bias-free, reconstructing species-interaction networks of the underlying microbial dynamics remains a hard task.

An alternative, theoretically more elegant approach was pioneered by May May [1972], proposing to use random matrices to model species interaction networks, i.e., each entry of the adjacency matrix is extracted at random from a given distribution. Given the impossibility of empirically measuring the interaction strengths, the advantage of such an approach is the reduction from ∼ S² (if S is the number of species) to just a few parameters (e.g., 2) used to parameterize the distribution Allesina and Tang [2012].

1.1 Insights from Disordered Systems Theory

Although previous studies have largely focused on neutral theory and logistic models – including variants like the Stochastic Logistic Model that account for environmental fluctuations – they often neglect inter-species interactions, limiting their ability to reproduce macroecological patterns at a global scale. Moreover, higher-order correlations among species are known to generate non-trivial effects, such as the emergence of persistent fluctuations or chaotic dynamics due to the non-reciprocity of interactions.

In recent years, increasing attention has been devoted to studying population dynamics through the generalized Lotka-Volterra (gLV) equations, employing disordered systems techniques (also known as glassy), such as replica and cavity methods and Dynamical Mean-Field Theory approaches, originally developed in the context of statistical physics Bunin [2017], Galla [2018], Biroli et al. [2018], Altieri et al. [2021], Lorenzana and Altieri [2022]. Indeed, due to the inherently high dimensionality of microbial datasets, random matrix theory and methods from disordered systems turn out to be particularly well suited. A striking feature of their application to ecological dynamics is that the resulting properties and dynamical regimes do not depend on species-specific details, provided that species are statistically equivalent under relabeling or time-averaging. Phase diagrams can thus be established in terms of a few effective parameters (e.g., the mean μ and the variance σ² of the random interactions strengths) Barbier et al. [2018]. However, despite a range of very interesting results, this approach has remained confined mainly to purely theoretical domains (but see Hatton et al. [2024]) and has only recently been used to give an interpretation of controlled experiments with synthetic microbial communities Hu et al. [2022].

2 Results

We now provide a proof of concept of the applicability of a high-dimensional disordered setting to human microbiomes. Specifically in the following section, we introduce the disordered generalized Lokta-Volterra (dgLV) model, infer its parameters from healthy and unhealthy cohorts of gut microbiomes, and test whether the resulting interaction patterns and stability metrics discriminate healthy from diseased macro-ecological states. Finally, we will introduce quantitative metrics to define microbiome stability and estimate the contribution of distinct ecological forces to the dynamics.

2.1 Disordered generalized Lotka-Volterra model

The dgLV model describes the time evolution of the concentration abundances of a local pool of S interacting species, i.e.,

where N_i is the populations of species i-th, K_i is its carrying capacity, and r_i the growth rate where is a constant, which we will assume will not depend on the species, i.e. ρ_i = ρ. The coefficients α_ij are random i.i.d random variables with and . We also include an immigration species-independent rate λ, which will be treated as a reflecting wall mathematically regularizing the problem, and a demographic noise term Altieri et al. [2021] η_i, with , and where δ denotes the Dirac delta and T sets the scale of noise intensity. For notational purposes, we also introduce its inverse β = T⁻¹. Finally, we require that the interactions are symmetric, i.e., α_ij = α_ji. We can thus map this problem into an equilibrium thermodynamic one that is exactly solvable. As shown in the Appendix, section S1, it is possible to justify this symmetric assumption and a linear dependence between the growth rate and the carrying capacity, namely r_i = ρK_i, by considering the quasi-stationary approximation of the MacArthur consumer-resource model. In the following, we set ρ =1 (following also Biroli et al. [2018]).

The presence of a noise term in Eq. (1) allows us to write the Fokker-Planck equation of the system and study its stationary solution (see Altieri et al. [2021], Altieri and Biroli [2022] for a detailed derivation in a similar Hamiltonian formalism). Here, we adopt the Itô prescription for the stochastic dynamics in such a way as to prevent species resurgence by noise.

The replica formalism, a well-known technique in disordered systems, comes into play and allows us to derive a non-interacting Hamiltonian corresponding to the dgLV symmetric interactions α_ij (see Methods and Appendix, Section S2, for a complete derivation). In simple words, instead of trying to solve the problem for one random setup, the formalism considers many replicas of the system and averages their behaviors. This approach helps to smooth out the randomness and reveals the typical behavior of the system. We also assume a single equilibrium scenario, known as replica-symmetric (RS) regime. Although the SADs of empirical microbial communities display fat tails, a feature more compatible with the multi-attractors phase of the asymmetric 1RSB Mallmin et al. [2024] case or with the gLV with time-dependent disordered interactions Suweis et al. [2024] (annealed version), we consider this simplification as it is the regime in which we can obtain explicit analytical relations between the model parameters and the data and where model inversion is feasible. By employing a cavity argument Mezard and Montanari [2009], Altieri and Baity-Jesi [2024], one can indeed analytically derive the SAD of the model (see Methods and Appendix, Section S4):

where the auxiliary variable takes the disorder interactions into account; z is a standardized Gaussian variable, ν = βλ > 0. We introduce, respectively, the mean abundance , the self-overlap , and the overlap , which we will collectively refer to as order parameters in the following sections. Specifically, q₀ measures the similarity between two different configurations at stationarity of the system with two different disorder realizations, while q_d measures the similarity of two stationary configurations generated with the same disorder realization. Finally, we introduce a constant that, inspired by the Field Theory jargon, we dub as mass m = 1 − βσ² (q_d − q₀) > 0 for the theory to be consistent Biroli et al. [2018].

The RS ansatz can also be characterized by its stability to external perturbations, such as the external temperature, the immigration rate, or the interaction heterogeneity. To investigate the stability of the RS phase, we consider the Hessian matrix of the theory by performing a harmonic expansion of the replicated free energy, as originally pointed out in de Almeida and Thouless [1978]. When the leading eigenvalue of the Hessian matrix, the so-called replicon mode, goes below zero, unstable equilibria appear: either the system moves towards a One-step Replica Symmetric Breaking (1-RSB) phase with multiple locally stable minima, or it develops a marginal full-RSB phase, leading to a hierarchical structure of states and therefore to an extremely rough landscape (also known as Gardner amorphous-like phase). In other words, the RS ansatz no longer represents a thermodynamically stable phase in the latter cases.

Based on standard calculations in disordered systems, the replicon mode ℜ can be analytically evaluated and reads:

where is the species response to an external perturbation, keeping track of non-extinct species only. For more details, see S3 in the Appendix.

(The replicon eigenvalue refers to a particular type of fluctuation around the saddle-point (mean-field) solution in the replica framework. When diagonalizing the Hessian matrix of the replicated free energy, fluctuations split into three sectors: longitudinal, anomalous, and replicon. The replicon mode is the most sensitive to criticality signaling – by its vanishing trend – the emergence of many nearly-degenerate states. It essentially describes how ‘soft’ the system is to microscopic rearrangements in configuration space.)

2.2 Data Through the Glass

We now consider cross-sectionally sampled gut microbiomes of two different cohorts: one of healthy and another of diseased individuals. Here we focus on chronic inflammation syndromes, but, in principle, what we present holds for groups of phenotypically distinct populations. We can idealize the samples of each group as generated from the stationary distribution of the dgLV Eq. (1), with different realizations of the disorder α, but with shared μ and σ (see Material and Methods). The approach we propose considers all samples in the same group as coming from the same statistical ensemble. On the other hand, it distinguishes ecological regimes characterized by distinct phenotypes through different statistics (e.g., dissimilar μ and σ) of the random species interactions α. We aim to test the hypothesis that each cohort will be described by a different set of ecological parameters, precisely with θ = (μ, σ, T, λ).

To calculate the order parameters (h, q_d, q₀, K) from the data, we thus need to specify how we perform the ensemble and the disorder averages empirically (see Figure 1). Since time series data are rarely available, we rely on an effective mean-field description and estimate the ensemble average by averaging over species (that is, different species are realizations of the same underlying stochastic process Azaele et al. [2016], known as neutral hypothesis), i.e., . Then, we assume that the average over the disorder can be computed as a sample average. In other words, within a given phenotype (healthy/unhealthy), each measured microbiome configuration is a sample from the stationary distribution of the dgLV model, with a given realization of the disorder, i.e., , where R is the number of samples and typically R ≫ 1. Moreover, due to our limited knowledge of the fine details governing the interactions in each microbiome, it is reasonable to assume that all communities with a given macrostate – a point in the (h, q₀, q_d, K) space – experience the same demographic noise T, immigration rate λ, and disorder parameters μ and σ. Eventually, we can compute the order parameters from the data as:

where N_j,a represents the population density of species j in sample a.

The inference protocol of the dgLV generative model is performed by a moment matching optimization procedure.
We aim to infer the free parameters θ = (μ, σ, T, λ) – shown on the right – so that to match the mean abundance, higher-order correlations between species abundances, and the average carrying capacity for the two cohorts, i.e. (h, q₀, *q_d*, K) – on the left. This procedure allows us to extract ecological dynamics information for cross-sectional data of healthy (blue) and diseased (red) microbiomes, which are conceptualized as independent disorder realizations.

To evaluate the carrying capacities K_i from the data, as done in most of the works using dgLV equations Bunin [2017], Biroli et al. [2018], Altieri et al. [2021], Suweis et al. [2024], Mallmin et al. [2024], we assume that in each cohort all species are characterized by the same carrying capacity K_i = K. Here, we define K as the average of the maximum relative abundances of species across the available samples for each cohort, so that .

Note also that, from metagenomics, we only have access to compositional data for species abundances Pasqualini et al. [2024]. This is crucial to properly treat different samples and end up with a consistent analysis. Moreover, this approach allows us to give an ecological interpretation of some of the order parameters. In particular, because of the compositionality of the data, we have that and , while q_d provides information on pairwise products between species abundances within each replica. Figure 2 shows the values of such order parameters between healthy and unhealthy cohorts for two randomized instances. In the data, we observe a systematic difference between healthy and unhealthy cohorts, pointing to a higher average local diversity in healthy samples (panels a, c), as also observed in Pasqualini et al. [2024]. Panel b highlights a higher q_d in unhealthy patients, a signature of the weakening species interactions in those samples. We will further investigate this aspect by inferring the species interactions in both cohorts.

Order parameters inferred from the data using Eqs. (4).
Panel a) shows h, b) *q_d*, and panel c) q₀ in healthy (blue) and unhealthy (red) cohorts. The gray points show the values for the corresponding null models where sample labels have been randomized. Standard deviations have been computed over 5000 realizations.

From the modeling point of view, it can be rigorously shown that constraining the abundances to be normalized to one corresponds to introducing a Lagrange multiplier in the expression of the free energy. However, this only acts on the linear term, contributing to shifting the mean of the random interactions, μ, but does not affect the heterogeneity σ, and therefore the phase diagram overall. For more details, we refer the interested reader to the Appendix (end of Section S2) along with Altieri et al. [2021]. In Altieri et al. [2021] the authors also showed how a global normalizing constraint on species abundances reflects in a one-to-one mapping with the random replicant model Biscari and Parisi [1995] (the replicator equations originally introduced by Diederich and Opper [1989] and recast within the replica formalism Biscari and Parisi [1995], Altieri et al. [2021] describe the dynamics of an ensemble of replicants that evolve via random couplings).

2.3 Species interactions patterns characterize the state of microbiomes

We thus collect all the parameters estimated from the data in a vector π = (h, q_d, q₀, K). As we will better detail in the Methods, we develop a moment matching inference algorithm to infer the model parameters θ, as depicted in Figure 1. The idea of the method is to introduce a cost function 𝒞(θ|π), representing a total relative error for some self-consistent equations. If the parameters θ are such that the right part of the self-consistent equation equals the left part, the problem is considered solved. Because the landscape associated with this cost function presents several minima, we perform multiple optimization procedures to collect an ensemble of possible solutions, from which we retain the top 30. First, we find that different solutions θ* of the optimization problem provide ecological insights into the underlying microbiome populations.

As originally predicted in Altieri et al. [2021], among all the parameters that define Eq. (1), the only ones relevant for reproducing the theoretical phase diagram are the amplitude of demographic noise and the heterogeneity of interactions. The mean interaction strength, provided it is sufficiently positive, does not play a significant role. This prediction is fully confirmed by the inference procedure applied to the two microbiome datasets, allowing us to identify a universal signature that distinguishes healthy from unhealthy states. Figure 3a shows, indeed, that inferred noise (T) and interaction heterogeneity strength (σ) for healthy and diseased microbiomes are clustered in the two-dimensional plane.

In particular, the SAD for the healthy cohort is robust among the different solutions of the inference procedure, as depicted by the superposition of the different curves in the inset of Figure 3A. On the other hand, SADs inferred from unhealthy patients have high sensitivity to different solutions. In particular, some of them display a mode for high-abundance species (light red lines in Figure 3a), a signature of dominant strain in the gut. Consistently, the distribution of the interactions P (α_i,j) generated through the inferred parameters μ and σ is different between healthy and diseased cohorts, giving a distinct pattern of interactions (see Figure 3b), a result that is compatible with that found by Bashan and collaborators Bashan et al. [2016]. Remarkably, we find that dysbiosis reduces the heterogeneity of interaction strengths, a result also observed when taking correlations as a proxy for interactions Seppi et al. [2023].

Panel a: inferred T (demographic noise strength) and σ (interactions heterogeneity) for healthy (blue) and diseased (red) microbiomes are clustered. Darker dots correspond to better solutions (i.e., solutions with a lower value of the cost function 𝒜), while the two points with hexagonal markers correspond to the best two (healthy and diseased, respectively) solutions. In the first panel inset, we also show (in log-log scale) the SADs corresponding to each solution. To have a more concise representation, we present each SAD fixing the disorder to its average . Panel b: the probability density function of the inferred interactions *α_i,j* for healthy (blue) and diseased (red) microbiomes. Dysbiosis reduces the heterogeneity of the interaction strengths. The quantities reported in the legend are the average and standard deviation of *α_i,j*. They have are calculated as and , where *S_X* is the species pool size, estimates as the set of all the observed species in a dataset, X. X can be healthy (H) or diseased (D).

We then assess how close the inferred σ and β = 1/T (a.k.a. inverse temperature in a statistical physics approach) are to the critical replica symmetry-breaking (RSB) line of the dgLV (ℜ = 0), evaluated by keeping all the other parameters constant (see Methods). We find again that the replicon values ℜ corresponding to each solution of our optimization protocol are significantly different for the two investigated microbiome phenotypes (see Figure 4a). In particular, diseased microbiomes are closer to marginal stability within the replica-symmetric ansatz Altieri et al. [2021], Mézard et al. [1987], de Almeida and Thouless [1978]. Furthermore, by investigating the shape of the SAD given by Eq. (2), we can estimate the ratio between niche (represented by species interaction) and neutral (represented by birth/death and immigration) ecological forces, which can be captured by the quantity ψ Wu et al. [2021]. It detects the emergence of peaks in the species abundance distribution as a hallmark of niche processes (see Appendix 2).

Panel a: The replicon eigenvalue corresponding to each solution of our optimization procedure (shaded dots). The solid hexagon represents the replicon corresponding to the best solutions that minimize the error in predicting the order parameters of the theory (minimum 𝒞). The two investigated microbiome phenotypes (healthy in blue, diseased in red) are significantly different. In particular, diseased microbiomes are closer to the marginal stability of replica-symmetric ansatz (grey horizontal line). Panel b: Solutions of the moment-matching objective function are shown as a function of ψ and m, which in turn depend on the SAD parameters (see main text). Healthy (blue) and diseased (red) microbiomes appear to be clustered. Therefore, distinct ecological organization scenarios (strong neutrality/emergent neutrality) emerge. Darker dots correspond to solutions with lower values of the cost function, while hexagonal markers correspond to the two best solutions.

Inspired by field-theory arguments (see Methods and Appendix, Section S2), we can call mass of the theory the m parameter, as defined above in Eq. (2). In classical and quantum field theory, the particle-particle interaction embedded in the quadratic term is typically referred to as a mass source. In our context, m = 1 − βσ² (q_d − q₀) captures quadratic fluctuations of species abundances, as also appearing in the expression of the leading eigenvalue of the stability matrix. When m → 0, the analytical order parameters diverge and the system enters the unphysical regime of unbounded growth. As such, the mass term can be considered a complementary stability measure, capable of capturing the transition to the unbounded growth regime.

In the model, two kinds of effects compete to shape the community structure. On the one hand, we have niche effects, encoded in disordered interactions and thus tracked by the parameters μ, σ, and K. Their overall effect is selective and tends to concentrate the SAD around the typical abundance value. On the other hand, we have neutral effects encoded in the stochastic dynamics and immigration, governing the low abundance regime of the SAD. When the demographic noise amplitude is stronger than immigration (ν < 1, as in our case), the SAD exhibits a low-abundance integrable divergence. In the opposite scenario, for ν > 1, there is no divergence and the SAD is modal. Since interactions are random, the probability of observing an internal mode can be estimated as the fraction of SADs realizations having non-trivial solutions to the stationary point equation. Such a quantity, dubbed as the niche-neutral ratio, can be analytically evaluated:

where and . When ψ ≈ 1, niche and neutral forces give comparable contributions to the dynamics, as both low abundance divergence and a finite abundance mode coexist in the SAD. Finally, if the typical abundance diverges, we enter the unbounded growth phase, which means that the mass m and the niche-neutral ratio ψ are not independent, as suggested by the analytical expression for ψ. For an exhaustive derivation of this result, see Appendix 2. With the obtained model parameters, we are able to evaluate m and ψ for healthy and diseased microbiomes. Also in this case healthy and diseased microbiomes are visibly clustered, as shown in Figure 4. Unhealthy microbiomes turn out to be closer to the unbounded growth phase, and the niche-neutral ratio is larger by five orders of magnitude than the healthy case ψ_D ≈ 10⁵ψ_H. This leads us to argue that selective pressure is way larger in diseased states, while in the healthy one, birth and death effects are the key drivers of the dynamics. These results are also confirmed by the SAD shapes in the inset of Figure 3 (panel a).

In summary, in the Result section we demonstrate that: (i) the inference pipeline robustly recovers demographic noise and interaction heterogeneity by calculating h, q₀ and q_d from the data; (ii) These parameters cluster by health status and (iii) diseased microbiomes lie closer to the replica-symmetry-breaking threshold, indicating reduced ecological resilience.

3 Discussion and Conclusions

In our exploration of the gut microbiome through the lens of disordered systems and random matrix theory, we have proposed a connection between the theoretical framework of disordered systems and practical analyses of environmental microbiome data. In particular, we have characterized healthy and unhealthy gut microbiomes using the disordered generalized Lotka-Volterra model (dgLV) for population dynamics. We now interpret our theoretical findings in a biological context, contrast them with previous work, and outline limitations and future directions.

The first major result of our work suggests a different role for the various ecological forces shaping the human gut microbial community. In this sense, the niche-neutral ratio ψ, highlights the different roles of interactions in healthy and diseased microbiomes. In the healthy case, neutral forces, such as random birth and death of individuals, characterize the dynamics, making configurations corresponding to this state alike. On the contrary, in diseased microbiomes, disordered, sample-specific interactions are the dominant ecological force, making individual realizations differ significantly from one another. An ecological interpretation of our findings suggests that healthy microbiomes are governed primarily by demographic stochasticity, reflecting a quasi-neutral regime characterized by similar community structures across individuals. Conversely, microbiomes from diseased patients exhibit significantly greater variability, suggesting that deterministic ecological factors – such as weakened species interactions – override neutrality, leading to structural instability and distinct microbial compositions. This observation aligns with the “Anna Karenina principle” Pasqualini et al. [2024], Ma [2020] holding for gut microbiomes, which can be phrased as: “All healthy gut microbiomes are alike; each unhealthy gut microbiome is unhealthy in its own way”. Supporting this interpretation, our analysis of the replicon eigenvalue ℜ shows that the healthy state is associated with pronounced stability to external perturbations. The unhealthy state, instead, being closer to the RSB line, exhibits diminished stability and, consequently, reduced robustness against external perturbations.

Our study also sheds light on the role of demographic noise within the context of the dgLV model. In the limit T → 0, the SAD transitions to a truncated Gaussian, with a prominent peak at zero reflecting the fraction of extinct species. In contrast to a zero-noise scenario, where species extinctions are observed, the inclusion of demographic noise in the dgLV model suggests a picture where no species goes extinct, supporting the everything is everywhere hypothesis Grilli [2020], Pigani et al. [2024]. In other words, within this framework, zeros in the data are due to sampling effects and not due to local species extinctions Grilli [2020], Pasqualini et al. [2024].

A notable limitation of our study lies in the discrepancy between the empirical Species Abundance Distribution (SAD) observed in the data and the theoretical distribution predicted by the quenched dgLV model Suweis et al. [2024]. While empirical data showcase a diverse range of species abundance, following a power-law distribution, the model predictions tend to exhibit exponential decay in SAD tails. This mismatch underscores the need for further refinement of the model to accurately capture the nuanced patterns observed in real-world data. For example, it has recently been proposed that introducing an annealed disorder can generate SADs that more closely resemble the empirical ones Suweis et al. [2024] (unlike the quenched approximation, the annealed version assumes that random couplings are not fixed but rather fluctuate over time, with their covariance governed by independent Ornstein-Uhlenbeck processes). Another possibility is to set the dgLV model parameters in such a way as to reproduce the multi-attractor phase. In fact, in this region, the SADs display a more heterogeneous shape Arnoulx de Pirey and Bunin [2024]. We plan to explore this follow-up direction by combining one-step replica-symmetry-breaking (1RSB) computations in the multiple-attractor phase with Dynamical Mean-Field Theory analysis for the asymmetric interaction case. This latter approach is particularly well suited for studying inherently non-equilibrium dynamics and for extending the framework to systems subject to environmental fluctuations in addition to demographic noise.

Another related limitation is the challenge of generating species abundance samples from the dgLV model that mirrors the statistical properties of the observed empirical data. In our current framework, each microbiome sample could be extracted from p(N|ζ), where ζ is a realization of the disorder. However, p(N|ζ) near N → 0 presents a power-law exponent ν − 1, with ν_H,D ≈ 10⁻³. This results in numerical instabilities and dominates the sampling process, posing difficulties in generating representative synthetic samples. Moreover, while microbiome data are inherently compositional Pasqualini et al. [2024], the dgLV model species populations N_i are positive real numbers. However, as already noted, it can be shown that introducing a normalization in such equations does not change the structure of the proposed solutions Altieri et al. [2021] and therefore should not affect the conclusions of our work.

In conclusion, our work proposes a bridge between theory and data, particularly in refining the theoretical models to better align with empirical observations and in exploring the nuances of species abundance distributions within the microbiome context. Moreover, the integration of other forms of environmental variability and species-specific traits could provide a more holistic view of ecological dynamics, as also proposed by the One Health-One Microbiome framework Tomasulo et al. [2024].

Overall, our study builds a quantitative link between metagenomic data and the disordered gLV framework, revealing how dysbiosis alters gut species interaction networks. By doing so, it lays the groundwork for more advanced, mechanistically informed models to better interpret and ultimately manage complex microbial ecosystems.

Material and Methods

Microbiome dataset and code

We have selected gut microbiome data from three studies Franzosa et al. [2019], Lloyd-Price et al. [2019], Mars et al. [2020], focusing on inflammatory syndromes of the gastrointestinal tract (Crohn’s Disease, Ulcerative Colitis, and Irritable Bowel Syndromes). Considering all the available metadata, we have selected the patients less affected by possible perturbing factors, such as drugs. Finally, our dataset consists of R_Healthy = 91 shotgun metagenomic samples from healthy control individuals and R_Diseased = 202 shotgun metagenomic samples. All metagenomic preprocessing and reads classification are extensively described in Pasqualini et al. [2024]. Species abundance profiles from metagenomic data and the parameters values obtained from the moment matching inference are available at the associated Zenodo page.

All the scripts implementing the moment matching inference and the jupyter notebooks to generate the figures are available at the associated GitHub page.

Free-energy landscape exploration: replica formalism

In the case of symmetric interactions – corresponding to conservative forces in the dynamics – a one-to-one mapping between the multi-species dynamics and a thermodynamic formalism can be safely worked out. The first step consists of writing the Fokker-Planck equation in the presence of a white Gaussian noise defined by a zero mean and a variance of amplitude 2T. All technical details, leveraging a Fokker-Planck derivation, can be found in Altieri and Biroli [2022] for a similar pairwise interacting model, but a different self-regulation term accounting for non-logistic behavior (1).

Once the (quenched) disordered Hamiltonian of the model is obtained, we can resort to techniques known in statistical physics of disordered systems, such as the replica and cavity methods Mezard and Montanari [2009], Mézard et al. [1987], Zamponi [2010]. The replica trick, in particular, allows us to handle disordered quantities, such as the free energy and the partition function, which would be otherwise unaffordable (see Appendix below).

We summarize the main findings here along with the expressions of the (RS) order parameters of the model, (h, q_d, q₀). The three expressions below, originally obtained in Altieri et al. [2021], have offered the starting point of this work, allowing for a thorough comparison with the same order parameters measured from metagenomic data. Their analytical expressions are self-consistently determined by the system of equations:

where the calligraphic notation stands for the Gaussian integration . In other words, the external average is equivalent to averaging over the quenched disorder , whereas the most internal one – over the continuous variable N – is interpreted as a thermal average over the single-equilibrium Hamiltonian ℋ_RS(N; q_d, q₀, h). The latter is denoted by . See Appendix (section S2) for more details.

As long as the system of Eqs. (6) admits physically reasonable solutions, we might claim that the RS ansatz safely holds. This condition is nevertheless necessary but not sufficient because the stability of the RS solution must also be checked. It therefore requires studying the Hessian matrix of free energy and diagonalizing it on a suitable subspace, called replicon, ℜ. The main outcome of this computation is captured by Eq. (3) of the main text. The averaged difference describes the fluctuations between the first and second moments of the species abundances within one state, namely between the diagonal value q_d and the off-diagonal contribution q₀ of the overlap matrix. Detecting a vanishing value of ℜ corresponds to the appearance of a marginally stable RS solution (see section s3 of the Appendix).

Moment matching inference

The parameters h, q₀, and q_d can be self-consistently determined through the saddle point of the dgLV free energy in Eqs. (6) (see also Altieri et al. [2021]). We thus aim to estimate which set of model parameters (i.e. θ = (μ, σ, T, λ)) will generate values of the order parameters (h, q_d, q₀) matching those directly estimated from the data. The solution of such an inference problem may not be unique or exact. We have thus developed an optimization algorithm so to find a pool of possible solutions that minimize the difference between the order parameters estimated by the model and those directly obtained from the data. To infer the parameters θ, we can thus use the self-consistent equations for the order parameters and solve the inverse problem to find the dgLV parameters that match the empirical observations. For each self- consistent equation, we can introduce a relative error δH(θ|π) = (H(θ|π) − h)/h, δQ_d(θ|π) = (Q_d(θ|π) − q_d)/q_d and δQ₀(θ|π) = (Q₀(θ|π) − q₀)/q₀. By summing the square of each of these contributions, we introduce the cost function C for our moment matching problem, i.e.,

As already observed, the cost function has multiple local minima. To explore the rich structure of minima, we adopt a greedy search optimization strategy. First, we generate a vector θ₀ so that m₀ = m(θ₀|π) = 1/2 max(m) = 1/2. This condition ensures that the starting point of the optimization is far from the unbounded growth phase. In particular, it allows us to randomly generate an initial value of the interactions heterogeneity from a broad range σ ∼ Uniform(0,10) and to get the corresponding initial value of the demographic noise by mean of the relation . The other two parameters are randomly drawn, respectively, as μ ∼ Uniform(−1,1) and log₁₀ λ₀ ∼ Uniform(−8, −3). The choice of the μ₀ range is justified by the fact that we do not want to bias the interactions to be mutualistic or competitive. Since the unbounded growth phase emerges at μ₀ = −1 Biroli et al. [2018], Bunin [2017], one reasonable choice for the initial condition upper bound is μ₀ = 1. Secondly, λ₀ is introduced as a regularizing term for the Langevin dynamics and can be considered small: in this way, we can bound its values as described above. Once the initialization is set, we optimize the cost function. To explore the largest set of solutions, we employ the Broyden–Fletcher–Goldfarb–Shanno algorithm provided by the scipy Virtanen et al. [2020] routine. Briefly, this method allows us to optimize scalar functions of multiple variables using a generalized secant method. Since 𝒞 is flat almost everywhere except in the region where local minima are clustered, other methods tend to provide – with the same initialization procedure – results on the boundary of the optimization region, signaling poor convergence performance when tested for our problem. To explore a large subset of solutions and take the flatness of the cost function into account, we repeat the process 10⁵ times and bound the solutions into a region way larger than the initial conditions μ ∈ [−1,100], σ ∈ [0,10], T ∈ [10⁻⁴, 10⁻²] and λ ∈ [10⁻⁹, 10⁻¹]. In the downstream analysis, we only retain the best 30 solutions, minimizing 𝒞. At the end of the procedure, we obtain a set of parameters θ = (μ, σ, T, λ) that, if used in the self-consistent equations, are capable of satisfying them with mean relative error ɛ = δH + δQ_d + δQ₀ of order ɛ ≈ 10⁻², 10⁻³. As a consistency check, we report the value of λ* = 2 × 10⁻⁶ ≈ min N_data = 9 × 10⁻⁶ (constant for all of the top 30 solutions), which is slightly below the minimum species relative abundance of the data.

Cavity method for the species abundance distribution

Another powerful technique rooted in disordered systems is the cavity method, which turns out to be particularly convenient for deriving the species abundance distribution at equilibrium. Without demographic noise, the SAD in the single equilibrium phase is typically captured by a truncated Gaussian distribution Yoshino et al. [2008], Bunin [2017], Altieri and Franz [2019]. In the presence of noise and finite migration, the computation gets more involved but is still doable within the cavity approach Mezard and Montanari [2009]. The basic idea consists of adding a new species to the ecosystem and investigating the resulting joint probability distribution of the typical species. In the thermodynamic limit, the difference between a system composed of S species and the corresponding one with S + 1 species is negligible. Therefore, one can write:

By gathering all relevant information about the so-called cavity field, , and the higher-order correlation term, we obtain the field distribution, which is defined by the two moments

For compactness, we skip all technical details at this stage. Proceeding step-by-step – the full derivation can nevertheless be found in the Appendix – we end up writing the expression for the marginal probability distribution:

where N_c has been replaced by N denoting the typical species abundance, m = [1 − σ²β(q_d − q₀)] denotes the mass term borrowing field-theory terminology, and ζ is an auxiliary Gaussian variable.

Acknowledgements

We thank Silvia De Monte for insightful discussions. S.S. acknowledges Iniziativa PNC0000002-DARE – Digital Lifelong Prevention. A.M. acknowledges financial support under the National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.1, Call for tender No. 104 by the Italian Ministry of University and Research (MUR), funded by the European Union – NextGenerationEU – Project Title “Emergent Dynamical Patterns of Disordered Systems with Applications to Natural Communities” – CUP 2022WPHMXK - Grant Assignment Decree No. 2022WPHMXK adopted on 19/09/2023 by the Italian Ministry of Ministry of University and Research (MUR). A.A. acknowledges the support received from the Agence Nationale de la Recherche (ANR), under the grant ANR-23-CE30-0012-01 (project “SIDECAR”).

Additional information

Author contributions

S.S. and A.A. designed and supervised research, J.P. performed research, E.V.S. and S.F. provided data filtering criteria; all the authors wrote the paper.

Appendices

S1. Mapping between MacArthur and Lotka-Volterra Models

We aim to investigate the mapping between the Lokta-Volterra model and the consumer-resource (CR) or MacArthur (MA) dynamics, as analyzed in Tikhonov and Monasson [2017], Altieri and Franz [2019] for a high-dimensional version. We thus consider Eq. (1) of the main text, which describes the evolution of S randomly interacting species. The self-interaction term parameters are the growth rate r_i, and the carrying capacity K_i, whereas all the inter-species interaction parameters are collected into the matrix α. For this argument, we will consider a simplified model version, where stochasticity (a.k.a. effective temperature) and immigration rate are set to zero, leading to a deterministic version, as in Biroli et al. [2018]. This model describes the dynamics of the relative species abundances N_i – previously normalized with respect to the total number of individuals in the pool – according to:

where we have introduced , the proportionality constant between growth rate and carrying capacity. Our goal is to re-derive here a generalized Lotka-Volterra (gLV) model with symmetric interactions and therefore to justify the assumption adopted in the main text. To do that, we introduce the MacArthur model, which describes how S species compete to grab M available resources. For more details, see also Wu et al. [2021]. We thus consider the following set of coupled ODEs, describing the joint dynamics of resources abundances R_α and species abundances N_i:

where describes resources logistic regulation in the absence of consumer species. The parameters here are the death rate of each species m_i, the carrying capacity of each resource , and the uptake matrix C, which describes the consumption or production versus microbial growth inhibition of the resource β per unit of time by the species i. Each row of the matrix can be considered an M-dimensional vector, , which we will refer to as the uptake profile of the species i. The key point is that such vectors might have negative entries, as it happens in the case of microbial growth inhibition. If the dynamics of resources is much faster than the one of species, we can consider the associated equation in the stationary state. With this approximation, provided that the resources abundance is larger than zero (R_α > 0), we can evaluate the stationary value of the resource abundances

By plugging the obtained resource abundance in the equation for species dynamics, we precisely recover the Generalized Lotka-Volterra model

(16)

Note that, between the second and third passage, we decomposed the sum over j in the diagonal and off-diagonal terms, respectively j = i and j ≠ i, then swapped the summation over resources and species. Finally, we have introduced the quantities

Since the interactions are expressed as a scalar product between uptake profiles, the derived equations correspond to a gLV model with symmetric interactions.

S2. Replica formalism for the disordered Generalized Lotka-Volterra model

To obtain a full characterization of the different emergent phases in the disordered Generalized Lotka-Volterra model, we can take advantage of the replica method. Using the replica identity Mézard et al. [1987], Altieri and Baity-Jesi [2024], i.e. – which has been rigorously proven in some specific instances Guerra and Toninelli [2002], Panchenko and Talagrand [2007], Dia et al. [2016] – the disordered average is directly computed on the replicated partition function, for n replicas of the same sample. The index n is initially treated as an integer, however relying on the assumption that the analytical continuation n → 0 exists.

where β = 1/T is the inverse of the demographic noise amplitude. The overline in the partition function denotes the average over the disorder, namely over the Gaussian variables α_ij with finite mean μ/S and variance σ²/S. The index a takes the number of replicas into account, with a = 1,…, n.

The analytical treatment requires the introduction of the overlap matrix, Q_ab – whose diagonal value is , which can be easily parametrized in the replica space, as well as the average abundance For the latter, we will consider the uniformity condition, with h_a = h, Ɐa. We can thus re-express the free energy as a function of the aforementioned order parameters

which remains, at this level, as general as possible. The action 𝒜 reads

to be eventually evaluated by the Laplace method, or saddle-point approximation, in the large S limit.

By resorting to the replica-symmetric (RS) ansatz – according to which the overlap matrix is parametrized by two values only, q_d and q₀, the diagonal value or self-overlap, and the off-diagonal value, respectively – the last piece in Eq. (20) can be expressed as a function of an effective Hamiltonian . It essentially embeds the contribution of the quadratic Lotka-Volterra potential and the additional Lagrange multipliers enforcing the above expressions for Q_ab, Q_aa and h_a.

The free energy is thus re-written as . However, integrating the quenched disorder out implies a further complication: replica indices turn out to be coupled, as is evident in the second term above. The next step will then require the introduction of an auxiliary Gaussian variable z, with zero mean and unit variance so that:

with the associated replica symmetric Hamiltonian:

Note that, although in the computation above we have considered a species-dependent K_i to be as general as possible, in our work we have assumed K_i = K for all species. The full expressions of the parameters h, q₀, and q_d, to be obtained by a saddle-point approximation of the RS free energy have been reported in the Methods. They can be solved iteratively as also explained in Altieri et al. [2021], Lorenzana and Altieri [2022].

Comment on compositional abundances

For purely theoretical purposes, dealing with absolute or relative abundances makes little difference. However, for the metagenomics data available to us, it is essential to perform the analysis as a function of relative abundances. Forcing the abundances to be normalized is formally equivalent to adding a global constraint in the Hamiltonian such that through the Lagrange multiplier γ. Accordingly, the original partition function can be recast as

The procedure explained above remains exactly the same along with the introduction of n replicas of the reference system. Optimizing over Q_ab, H_a, and the parameter γ_a things get easier within the RS approximation. In fact, γ_a = γ, satisfying a uniformity condition for all replicas. The only visible difference is in the linear term of (23) which now incorporates the explicit dependence on the multiplier γ leading to

We have already discussed in the text how the μ parameter does not play a significant role in determining the different phases, confirming prior results by Bunin and collaborators Bunin [2017], Lorenzana and Altieri [2022]. Once more, we can safely claim that the actual differences and emerging data clustering between healthy and sick samples are driven by the noise amplitude T and the interaction heterogeneity σ.

S3. Stability analysis: Hessian matrix and zero modes

To properly understand the stability of the single equilibrium phase and highlight possibly emergent multiple attractor regimes, we need to perform a stability computation. Computing the stability against external perturbations requires first the definition of the replicated partition function:

then the study the harmonic fluctuation of the free energy around the RS solution. For the latter, the Hessian matrix of the action 𝒜 is needed

where the subscript denotes the connected part of the correlation function. Leveraging well-known techniques in field theory and disordered systems De Dominicis and Giardina [2006], we decompose the Hessian matrix [27] as a function of three different correlators among which the one living in the replicon subspace De Dominicis and Giardina [2006], Altieri et al. [2016] provides the leading contribution:

Three different elements appear in the computation depending on the constraint/equality condition of their replica indices

Sticking to a single-equilibrium regime, namely in the RS approximation, the expression for the replicon eigenvalue can be further simplified

which precisely corresponds to Eq. (3) of the main text. A strictly positive or a vanishing value of the replicon are associated with a stable or marginally stable phase, respectively. In the second case, the RS ansatz is no longer valid and requires the use of a one-step or multiple-step replica-symmetry breaking.

S4. Cavity argument for the species abundance distribution

In the absence of demographic fluctuations and immigration, it is well-known that the species abundance distribution reflects a truncated Gaussian Yoshino et al. [2008], Bunin [2017], Tikhonov and Monasson [2017], Lorenzana and Altieri [2022]. Albeit the computation gets more complicated in the presence of demographic noise, we can nevertheless employ the cavity method Mezard and Montanari [2009], Zamponi [2010], Altieri and Baity-Jesi [2024] to obtain the single-species marginal probability distribution. Based on a cavity argument, we pretend to add a new species to the ecosystem and eventually investigate the resulting joint probability distribution. In the thermodynamic limit, for S ≫ 1, it reads

where, for the sake of simplicity, we have assumed the carrying capacities to be species-independent, i.e. K_i = K.

The last piece in parentheses denotes the cavity field, which, as long as the single equilibrium phase is concerned, can be assumed to be a Gaussian random variable. The probability distribution is formally factorized and the only coupling with the new species occurs via h_c:

The probability distribution conditioned to h_c then reads:

where the thermal and the disordered averages of the cavity field, h_c, must eventually be evaluated. Furthermore, to go beyond a naïve mean-field approximation, we need to subtract the background effect exerted by all other species on the picked one through an Onsager reaction term.

In the above expression, and stand respectively for the first moment and the variance of the field with respect to the thermal average, in the thermodynamic limit. The normalization, Z, can formally be expressed as a combination of error and hypergeometric functions as follows

The next step requires the computation of the disordered average. By doing so, we end up with the distribution of the second field, which is defined in terms of the two moments:

A similar derivation was obtained in the case of a generalized Lotka-Volterra model with noise to be compared with numerical distributions in different regimes Wu et al. [2021]. The resulting species abundance distribution typically exhibits a divergence at N = 0 and a secondary maximum, therefore closer to what is referred to as niche scenario.

Then, by introducing a rescaled random variable ζ and integrating over it, the marginal probability distribution becomes

where N_c is just replaced by N denoting the typical species abundance and T → 1/β. The expression for the normalization is roughly similar to the one in Eq. (34) with the only difference being that the mean and the variance of the cavity field must be replaced by the quenched average of and σ² (q_d − q₀), respectively.

S5. Estimation of Neutral and Niche processes

Following the results of the cavity method in Section 5, we can describe the species abundance distribution of the system (see Eq. (37)) via the quantities ν = βλ > 0, the quadratic coefficient m = 1 − σ²β(q_d − q₀) and the disorder variable , where z is a standardized normal variable.

As we specified in the main text, we refer to the quadratic coefficient m as a mass. When the mass becomes negative, the self-consistent equations (6) in the Methods section are no longer well-defined, and the order parameters diverge. This signals a transition to an unbounded growth regime.

So far, the obtained SAD is flexible enough to accommodate different qualitative shapes frequently appearing in theoretical ecology Azaele et al. [2016]. When ν > 1, the migration strength turns out to be stronger than demographic fluctuations and the distribution resembles a modal log-normal. On the other hand, when 0 < ν < 1 demographic noise overcomes immigration, and a low-abundance (integrable) divergence appears. In our case, ν ≈ 10⁻³, so demographic noise is much stronger than immigration.

When the divergence in zero appears, two further shapes are possible. If there exists a secondary extremum for the species abundance distribution, then there is an coexistence of niche (driven by species interactions) and neutral (driven by demographic noise) effects. Conversely, if no secondary extrema exist, birth and death effects are the key drivers of the dynamics. In our interpretation of the model, the species abundance distribution is the stationary distribution of a single realization of the Lotka-Volterra dynamics. Given a set of parameters θ = (μ, σ, β, λ), it is possible to evaluate how many realizations of the Lotka-Volterra dynamics will display a secondary extremum. Taking the form of a fraction, we introduce a quantity ψ that captures the chance of displaying such an extremum and measures the relative contribution between selective and neutral effects. Namely, we can evaluate the probability that ℋ_RS has a minimum different from zero:

Being an algebraic equation of degree two, it has two solutions . Introducing the condition for the local extrema to appear if . This probability amounts to

where are the moments of the variable tracking the different disorder realizations. When the quantity ψ is of order 1, niche and neutral contributions to the dynamics can be considered comparable. In principle, we should check which of the two ± solutions is actually a maximum. The convexity of the SAD is easy to evaluate:

where . From the first inequality, straightforward algebraic manipulations confirm that is the stable solution. Intuitively, since P(N|ζ) has a divergence in zero, any secondary maximum – if it exists – must be greater than the minimum .

Significance of findings

Strength of evidence

Abstract

1 Introduction

1.1 Insights from Disordered Systems Theory

2 Results

2.1 Disordered generalized Lotka-Volterra model

2.2 Data Through the Glass

The inference protocol of the dgLV generative model is performed by a moment matching optimization procedure.

Order parameters inferred from the data using Eqs. (4).

2.3 Species interactions patterns characterize the state of microbiomes

3 Discussion and Conclusions

Material and Methods

Microbiome dataset and code

Free-energy landscape exploration: replica formalism

Moment matching inference

Cavity method for the species abundance distribution

Acknowledgements

Additional information

Author contributions

Appendices

S1. Mapping between MacArthur and Lotka-Volterra Models

S2. Replica formalism for the disordered Generalized Lotka-Volterra model

Comment on compositional abundances

S3. Stability analysis: Hessian matrix and zero modes

S4. Cavity argument for the species abundance distribution

S5. Estimation of Neutral and Niche processes

References

Article and author information

Author information

Jacopo Pasqualini

Amos Maritan

Andrea Rinaldo

Sonia Facchin

Edoardo Savarino

Ada Altieri

Samir Suweis

Version history

Cite all versions

Copyright

Metrics