Community diversity is associated with intra-species genetic diversity and gene loss in the human gut microbiome

  1. Naïma Madi
  2. Daisy Chen
  3. Richard Wolff
  4. B Jesse Shapiro  Is a corresponding author
  5. Nandita R Garud  Is a corresponding author
  1. Département de sciences biologiques, Université de Montréal, Canada
  2. Computational and Systems Biology, University of California, Los Angeles, United States
  3. Bioinformatics and Systems Biology Program, University of California, San Diego, United States
  4. Department of Ecology and Evolutionary Biology, University of California, Los Angeles, United States
  5. McGill Genome Centre, McGill University, Canada
  6. Quebec Centre for Biodiversity Science, Canada
  7. McGill Centre for Microbiome Research, Canada
  8. Department of Microbiology and Immunology, McGill University, Canada
  9. Department of Human Genetics, University of California, Los Angeles, United States

Abstract

How the ecological process of community assembly interacts with intra-species diversity and evolutionary change is a longstanding question. Two contrasting hypotheses have been proposed: Diversity Begets Diversity (DBD), in which taxa tend to become more diverse in already diverse communities, and Ecological Controls (EC), in which higher community diversity impedes diversification. Previously, using 16S rRNA gene amplicon data across a range of microbiomes, we showed a generally positive relationship between taxa diversity and community diversity at higher taxonomic levels, consistent with the predictions of DBD (Madi et al., 2020). However, this positive 'diversity slope' plateaus at high levels of community diversity. Here we show that this general pattern holds at much finer genetic resolution, by analyzing intra-species strain and nucleotide variation in static and temporally sampled metagenomes from the human gut microbiome. Consistent with DBD, both intra-species polymorphism and strain number were positively correlated with community Shannon diversity. Shannon diversity is also predictive of increases in polymorphism over time scales up to ~4-6 months, after which the diversity slope flattens and becomes negative – consistent with DBD eventually giving way to EC. Finally, we show that higher community diversity predicts gene loss at a future time point. This observation is broadly consistent with the Black Queen Hypothesis, which posits that genes with functions provided by the community are less likely to be retained in a focal species' genome. Together, our results show that a mixture of DBD, EC, and Black Queen may operate simultaneously in the human gut microbiome, adding to a growing body of evidence that these eco-evolutionary processes are key drivers of biodiversity and ecosystem function.

Editor's evaluation

This paper analyses meta-genomic human gut microbiome data to understand how biodiversity arises and can be maintained. It makes an important contribution by strengthening the diversity-begets-diversity hypothesis and linking it to signatures of gene loss expected from the Black Queen hypothesis. While only correlative data is used to draw conclusions, the methods are solid and alternative hypotheses are clearly outlined.

https://doi.org/10.7554/eLife.78530.sa0

Introduction

Our understanding of microbial evolution and diversification has been enriched by experimental studies of bacterial isolates in the laboratory, but it remains a challenge to study evolution in the context of more complex communities (Lenski, 2017). Ongoing advances in culture-independent technologies have allowed us to study bacteria in the complex and dense communities in which they naturally occur (Garud and Pollard, 2020). Within a community, individual players engage in many negative and positive ecological interactions. Negative interactions can originate from competition for resources and biomolecular warfare (Hibbing et al., 2010; Mitri and Foster, 2013), while positive interactions can stem from secreted metabolites that are used by other members of the community (cross-feeding) (Venturelli et al., 2018). These ecological interactions can create new niches and selective pressures, leading to eco-evolutionary feedbacks whose nature are yet to be fully understood.

Ecological interactions can yield positive or negative effects on the diversification of a focal species. Under the ‘diversity begets diversity’ (DBD) hypothesis, higher levels of community diversity increase the rate of speciation (or diversification, more generally) due to positive feedback mechanisms such as niche construction (Calcagno et al., 2017; Schluter and Pennell, 2017). Competition for limited niche space could also drive DBD if species diversify into new niches to avoid competition (Meyer and Kassen, 2007; Mitri and Foster, 2013; Schluter, 2000). By contrast, the ‘ecological controls’ (EC) hypothesis posits that competition for a limited number of niches at high levels of community diversity results in a negative effect on further diversification. Metabolic models predict that DBD may initially spur diversification due to cross-feeding, but the diversification rate eventually slows and reaches a plateau as metabolic niches are filled (San Roman and Wagner, 2021). These theoretical predictions are largely supported by our previous study involving 16S rRNA gene amplicon sequencing data from the Earth Microbiome Project, in which we observed a generally positive relationship (which we call the diversity slope; Figure 1) between community diversity and focal-taxon diversity at most taxonomic levels, reaching a plateau at the highest levels of diversity (Madi et al., 2020).

Diversity begets diversity (DBD) and ecological controls (EC) hypotheses illustrated.

Hypothetical microbial communities are illustrated as gray circles containing assemblages of microbial species, shown in different colors. 'DBD' means that the focal species is more likely to acquire diversity – through de novo mutation, invasion of a different strain of the same species, or a combination of both – in a community with high diversity. This is because new niches are created in a more diverse community. By contrast, 'EC' means that the focal species is more likely to acquire diversity through strain invasion or mutation in a community with low diversity. This is because niches remain unfilled in a low-diversity community, while niche space is saturated in a high-diversity community, impeding further diversification.

In this previous study, we found stronger support for DBD in the animal gut relative to more diverse microbiomes such as soils and sediments, which were closer to a plateau of diversity (Madi et al., 2020). While diversity slopes were generally positive at taxonomic levels as fine as amplicon sequence variants (akin to species or strains) within a genus, they were most positive at higher levels such as classes or phyla. A recent experiment on soil bacteria also found evidence of DBD at the family level, likely driven by niche construction and metabolic cross-feeding (Estrela et al., 2022). It therefore remains unclear if the predictions of DBD hold primarily at these higher taxonomic levels, involving the ecological process of community assembly, or if they also apply at the finer intra-species level. Within-host intra-species diversity can arise by co-colonization of a host by genetically distinct strains belonging to the same species or evolutionary diversification of a lineage via de novo mutation and gene gain/loss events within a host.

Such fine-scale strain-level variation has important functional and ecological consequences; among other things, strains are known to engage in interactions that cannot be predicted from their species identity alone (Goyal et al., 2022). Although closely-related bacteria are expected to have broadly similar niche preferences, finer-scale niches may differ below the species level (Martiny et al., 2015). For example, the acquisition of a carbohydrate-active enzyme by Bacteroides plebeius allows it to exploit a new dietary niche in the guts of people consuming nori (seaweed) (Hehemann et al., 2010), and single nucleotide adaptations permit Enterococcus gallinarum translocation across the intestinal barrier resulting in inflammation (Yang et al., 2022). Despite their potential phenotypic effects, it is unknown if such fine-scale genetic changes are favored by higher community diversity (due, for example, to niche construction, as predicted by DBD) or suppressed (due to competition for limited niche space, as predicted by EC). Competition could also lead to DBD if focal species evolve new niche preferences to avoid extinction (Mitri and Foster, 2013; Schluter, 2000) – an idea with some support in experimental microcosms (Meyer and Kassen, 2007) but largely unexplored in natural communities.

Here, we investigate the relationship between intra-species genetic diversity and community diversity in the human gut microbiome, a well-studied system in which we previously found support for DBD at higher taxonomic levels. We use static and temporal shotgun metagenomic data from a large panel of healthy adult hosts from the Human Microbiome Project (HMP) (Lloyd-Price et al., 2017; Human Microbiome Project Consortium, 2012) as well as from four healthy individuals sampled almost daily over the course of 1 year (Poyet et al., 2019). Using metagenomic data allows us to track change in single nucleotide variation, strain diversity, and gene gain or loss events within relatively abundant species in the microbiome, and to study how these measures of intra-species diversity are associated with community diversity. Although such analyses of natural diversity cannot fully control for unmeasured confounding environmental factors, they are an important complement to controlled experimental and theoretical studies which lack real-world complexity.

Results

We investigated the relationship between community diversity and within-species genetic diversity in human gut microbiota using two shotgun metagenomic datasets. First, we analyzed data from a panel of 249 healthy hosts (Lloyd-Price et al., 2017; Human Microbiome Project Consortium, 2012), in which stool samples were collected one to three times from each host at approximately 6-month intervals. Second, we analyzed data from four individuals sampled more densely over the course of ~18 months (Poyet et al., 2019). In both cases, we only consider intra-species diversity of relatively abundant species that are well sampled in these metagenomic datasets (Methods).

We examined several metrics of community diversity and intra-species diversity and calculated the slope of their relationship, defined as the diversity slope (Figure 1). We note that intra-species diversity can arise within hosts via de novo point mutation, gene gain or loss, or the coexistence of genetically distinct strains that diverged before colonizing the host. To quantify community diversity, we calculated Shannon diversity and richness at the species level. Shannon diversity is relatively insensitive to sampling effort (Madi et al., 2020; Walters and Martiny, 2020) but richness can be underestimated in low sample sizes. We therefore computed richness on data rarefied to an equal number of reads per sample, yielding generally similar results to unrarefied data (described below). In all cases, we included the number of reads per sample (coverage) as a covariate in our models, as this could affect estimates of both community diversity and intra-species diversity. To quantify intra-species diversity, we used a reference genome-based approach to call single nucleotide variants (SNVs) and gene copy number variants (CNVs) within each focal species and computed polymorphism rates, measured as the fraction of synonymous nucleotide sites in a species’ core genome with intermediate allele frequencies (between 0.2 and 0.8) within a host (Methods). We also repeated the analysis on nonsynonymous sites, as these are subject to stronger selective constraints. As an additional metric of intra-species diversity, we inferred the number of strains within each species using StrainFinder applied to all polymorphic sites (including those outside the 0.2–0.8 frequency range) (Smillie et al., 2018).

Community diversity is positively associated with intra-species polymorphism in the human gut microbiome

As an exploratory visualization, we began by plotting the relationship between community diversity and intra-species polymorphism rate calculated at synonymous sites in cross-sectional HMP metagenomes for the nine most prevalent species (Figure 2A and B). The slope of this relationship (the diversity slope; Figure 1) provides an indicator of the evidence for DBD (positive slope) or EC (flat or negative slope). The relationship between polymorphism rate and community diversity was mostly positive in the top nine most prevalent species in HMP hosts (Figure 2A and B). These nine species are used as a simple illustration of the diversity slope, not as a formal hypothesis-testing framework.

Figure 2 with 2 supplements see all
Positive association between community diversity and within-species polymorphism in cross-sectional Human Microbiome Project (HMP) samples.

(A) Scatter plots showing the relationship between community Shannon diversity and within-species polymorphism rate (estimated at synonymous sites) in the nine most prevalent species in HMP. (B) Scatter plots showing the relationship between species richness and within-species polymorphism rate in the nine most prevalent species in HMP. These are simple correlations to show the relationships in the raw data. Significant correlations are shown with red trendlines (Spearman correlation, p<0.05); non-significant trendlines are in gray. Results of generalized additive models (GAMs) predicting polymorphism rate in a focal species as a function of (C) Shannon diversity, (D) species richness estimated on all sequence data, and (E) species richness estimated on rarefied sequence data. GAMs are based on data from 69 bacterial species across 249 HMP stool donors. Adjusted R2 and Chi-square p-values corresponding to the predictor effect are displayed in each panel. Shaded areas show the 95% confidence interval of each model prediction. See Supplementary file 1a and Supplementary file 2 section 1 for detailed model outputs.

To generalize across species and to formally test the predictions of DBD, we fit generalized additive models (GAMs) to the HMP data. Using GAMs, we are able to model non-linear relationships and account for random variation in the strength of the diversity slope across bacterial species, the uneven number of samples per host, and the non-independence of samples from the same host (Methods; see Supplementary file 1a and Supplementary file 2 section 1 for additional model details). These GAMs included 69 focal species with sufficient coverage to quantify within-species polymorphism (Methods); the results therefore apply to relatively abundant species in the human gut microbiome. GAMs showed an overall positive association between within-species polymorphism and Shannon diversity (Figure 2C, GAM, p=0.031, Chi-square test) as well as between within-species polymorphism and community richness after controlling for coverage as a covariate (Figure 2D, GAM, p=0.017, Chi-square test) or rarefying samples to an equal number of reads (Figure 2E, GAM, p=2.63e-04, Chi-square test). The random effect of species identity is highly significant in all models, indicating that each bacterial species has its own characteristic diversity slope (Supplementary file 1a). It appears that synonymous polymorphism reaches a plateau at high levels of community richness, which is particularly evident when using rarefied data (Figure 2E). Using the same GAMs applied to nonsynonymous polymorphism, we found no significant associations between diversity and within-species polymorphism rate (GAM, p>0.05, Chi-square test) (Supplementary file 1b, Supplementary file 2 section 4). This could be due to lower statistical power, since there are fewer nonsynonymous than synonymous sites, or it could reflect a true difference in the diversity slope between these site categories.

These generally positive correlations between focal species polymorphism and species-level measures of community diversity also hold when community diversity is measured at higher taxonomic levels; specifically, synonymous polymorphism rate was significantly positively associated with Shannon diversity calculated at the genus and family levels (GAMs, p<0.05, Chi-square test) (Figure 2—figure supplement 1, Supplementary file 1c). However, synonymous polymorphism rate was not significantly associated with Shannon diversity calculated at the highest taxonomic levels (order, class, and phylum, GAMs, p>0.05, Chi-square test). The positive correlation between polymorphism rate and richness held at all taxonomic levels (GAMs, p<0.05, Chi-square test) (Figure 2—figure supplement 1, Supplementary file 1c, Supplementary file 2 sections 2 and 3). When estimated at nonsynonymous sites, polymorphism rate was not significantly correlated with Shannon diversity at any taxonomic level (GAMs, p>0.05, Chi-square test), but was positively correlated with richness at the highest levels (phyla, class, and order, p=3e-04, p=0.017, and p=6.11e-04, respectively, Chi-square test from GAMs) (Figure 2—figure supplement 2, Supplementary file 1d, Supplementary file 2 sections 5 and 6). Even when not statistically significant, the diversity slopes were generally positive at all taxonomic levels for both synonymous and nonsynonymous polymorphism (Figure 2—figure supplements 1 and 2). Overall, these results are consistent with the predictions of DBD at most taxonomic levels. However, slightly different relationships are observed when considering different measures of community diversity (Shannon or richness) and different components of within-species diversity (nonsynonymous or synonymous).

Different measures of community diversity have contrasting associations with intra-species strain diversity

Within host polymorphism rates span several orders of magnitude (10–5/bp to 10–2/bp), largely due to the fact that strain content is variable across hosts. As previously argued (Garud et al., 2019), with conservatively high estimates for mutation rate (μ~10−9) (Sung et al., 2012), generation times (~10/day) (Poulsen et al., 1995), and time since colonization (<100 years), polymorphism rates of ~10–2/bp or more are inconsistent with within-host diversification of a single colonizing lineage. Therefore, hosts with relatively high intra-host polymorphism rates are likely colonized by mixtures of multiple strains that diverged long before colonizing a host. Moreover, recent work suggests that the numbers and genetic composition of strains colonizing a host can vary from host to host (Garud et al., 2019; Olm et al., 2017; Russell and Cavanaugh, 2017; Truong et al., 2017; Verster et al., 2017). The associations between polymorphism and community diversity (Figure 2) are likely driven by a combination of de novo mutation and co-colonization by multiple strains.

To separate these two sources of diversity and to explicitly account for the strain structure within hosts, we inferred the number of strains per focal species with StrainFinder (Smillie et al., 2018) (Methods) and used strain number as another quantifier of intra-species diversity. The relationship between community diversity and strain number varied depending on the focal species and the measure of community diversity. For example, the inferred number of Bacteroides vulgatus strains increased with community diversity, while Bacteroides uniformis strain count decreased or remained flat (Figure 3A and B). Expanding upon these examples, we used generalized linear mixed models (GLMMs) to investigate the relationship between the number of strains per focal species and community diversity, while taking into account coverage per sample as a covariate and variation between species, hosts and samples as random effects (Methods). GLMMs are a special case of GAMs that can handle overdispersed, zero-truncated count data such as strain counts. The number of strains per focal species was positively correlated with community Shannon diversity (GLMM, p=3.58e-07, likelihood ratio test [LRT]) (Figure 3C, Supplementary file 1e, Supplementary file 2 section 7.1). This suggests that the positive correlation between polymorphism rate and Shannon diversity (Figure 2) is due at least in part to strain diversity.

Figure 3 with 1 supplement see all
Associations between community diversity and strain number in cross-sectional Human Microbiome Project (HMP) samples.

(A) Scatter plots showing the relationship between Shannon diversity and the inferred number of strains within each of the nine most prevalent species in HMP. (B) Scatter plots showing the relationship between species richness and the inferred number of strains within each of the nine most prevalent species in HMP. Significant linear correlations are shown with red trendlines (Pearson correlation, p<0.05); non-significant trend lines are in gray. Results of generalized linear mixed models (GLMMs) predicting strain count in a focal species as a function of (C) Shannon diversity, (D) species richness estimated on all data, and (E) species richness estimated on rarefied sequence data. Diversity estimates (X-axis) are standardized to zero mean and unit variance in the models. The Y-axis shows the mean number of strains per focal species predicted by the GLMM. GLMMs are based on data from 184 bacterial species across 249 HMP stool donors. p-Values (likelihood ratio test) are displayed in each panel. Shaded areas show the 95% confidence interval of each model prediction. See Supplementary file 1e and Supplementary file 2 section 7 for detailed model outputs.

By contrast, species richness was negatively correlated with strain number (GLMM, p=1.67e-06, LRT) (Figure 3D, Supplementary file 1e, Supplementary file 2 section 7.2). The negative relationship with richness was unlikely to be confounded by sequencing depth, since the same result was obtained using rarefied data (Figure 3E, Supplementary file 1e, Supplementary file 2 section 7.3). The negative strain number-richness relationship also held at all other taxonomic ranks (GLMM, p<0.05, LRT), while the strain number-Shannon diversity relationship was generally positive (Figure 3—figure supplement 1, Supplementary file 1f, Supplementary file 2 sections 8 and 9). These effects also appear to be species-specific: for example, the number of B. vulgatus strains per host is positively correlated with both Shannon diversity and richness (consistent with DBD predictions) whereas Bacteroides ovatus has no relationship with Shannon diversity but a negative correlation with richness (consistent with EC; Figure 2A and B). Together, these results reveal that different components of community diversity can have contrasting effects on the diversity slope.

Community Shannon diversity is a predictor of intra-species polymorphism and gene loss in time series data

Our analyses thus far have considered only individual time points, which represent static snapshots of the dynamic processes of community assembly and evolution in the microbiome. To interrogate these phenomena over time, we analyzed 160 HMP subjects who were sampled two to three times ~6 months apart. Under a DBD model, we expect community diversity at an earlier time point to result in higher within-species polymorphism at a future time point. To test this expectation, we defined ‘polymorphism change’ as the difference between polymorphism rates at the two time points (Methods). We also investigated the effects of community diversity on gene loss and gain events within a focal species, as such changes in gene content are known to occur frequently within host gut microbiomes (Garud et al., 2019; Groussin et al., 2021; Yaffe and Relman, 2020; Zhao et al., 2019). Here, a gene was considered absent if its copy number (c) was <0.05 and present if 0.6 ≤c ≤ 1.2. As in the cross-sectional analyses above, we also controlled for sequencing depth of the sample and excluded genes with aberrant coverage or presence in multiple species (Methods).

In HMP samples, polymorphism change showed no significant relationships with community diversity at the earlier time point, whether it was estimated with Shannon index or species richness (GAM, p>0.05) (Supplementary file 2 section 10.1). These results suggest that DBD is negligible or undetectable over ~6 month time lags in the human gut. By contrast, we found that gene loss in a focal species between two consecutive time points was positively correlated with community diversity at the earlier time point (Figure 4; GLMM, p=0.028, p=0.034, and p=0.049, LRT for Shannon, richness and rarefied richness, respectively) (Supplementary file 1g, Supplementary file 2 section 10.3). Gene gains did not show any significant relationships with community diversity (GLMM, p>0.05). Selection for gene loss in more diverse communities is a prediction of the Black Queen hypothesis (BQH), provided that higher community diversity results in more redundant gene functions that compensate for losses in a focal species (Morris et al., 2012). Most species in HMP samples lost fewer than 10 genes over ~6 months – consistent with de novo deletion events of a few genes – but occasionally hundreds of genes were lost from a host, suggesting that strains with smaller genomes were selected in more diverse communities (Figure 4A and B).

Positive association between community diversity and gene loss in Human Microbiome Project (HMP) time series.

(A) Scatter plots showing the relationship between Shannon diversity at time point 1 (tp1) and gene loss between tp1 and tp2 within each of the nine most prevalent species in HMP. (B) Scatter plots showing the relationship between species richness at tp1 and gene loss between tp1 and tp2 within each of the nine most prevalent species in HMP. Significant linear correlations are shown with red trendlines (Pearson correlation, p<0.05); non-significant trend lines are in gray. The Y-axis is plotted on a log10 scale for clarity. Results of generalized linear mixed models (GLMMs) predicting gene loss in a focal species as a function of (C) Shannon diversity, (D) species richness estimated on all data, and (E) species richness estimated on rarefied sequence data. p-Values (likelihood ratio test) are displayed in each panel. Shaded areas show the 95% confidence interval of each model prediction. The Y-axis is plotted on the link scale, which corresponds to log for negative binomial GLMMs with a count response. GLMMs are based on data from 54 bacterial species across 154 HMP stool donors sampled at more than one time point. See Supplementary file 1g and Supplementary file 2 section 10 for detailed model outputs.

To study these dynamics at higher temporal resolution, we analyzed shotgun metagenomic data from four more frequently sampled healthy individuals from a previous study (Poyet et al., 2019). Stool from donor am was sequenced over 18 months with a median of 1 day between samples; an over 12 months (median 2 days between samples); ao over 5 months (median 1 day between samples); and ae over 7 months (median 2 days between samples). In this data, we tracked both polymorphism change and gene gains and losses between two successive time points in 15 species with a minimal marker gene coverage of 10 in at least 10 samples. These include seven species of Bacteroides, two Eubacterium, two Faecalibacterium, two Ruminococcus, as well as Alistipes putredinis and Parabacteroides merdae.

Using the Poyet dataset, we asked whether community diversity in the gut microbiome at one time point could predict polymorphism change at a future time point by fitting GAMs with the change in polymorphism rate as a function of the interaction between community diversity at the first time point and the number of days between the two time points. Shannon diversity at the earlier time point was correlated with increases in polymorphism (consistent with DBD) up to ~150 days (~4.5 months) into the future (Figure 5—figure supplement 1), but this relationship became weaker and then inverted (consistent with EC) at longer time lags (Figure 5A, Supplementary file 1h, GAM, p=0.023, Chi-square test). The diversity slope is approximately flat for time lags between 4 and 6 months, which could explain why no significant relationship was found in HMP, where samples were collected every ~6 months. No relationship was observed between community richness and changes in polymorphism (Supplementary file 1h, GAM, p>0.05).

Figure 5 with 1 supplement see all
Community diversity is associated with increases in focal species polymorphism over short time lags and net gene loss in dense gut microbiome time series.

(A) Results of a generalized additive model (GAM) predicting polymorphism change in a focal species as a function of the interaction between Shannon diversity at the first time point and the time lag (days) between two time points in data from Poyet et al. The response (Y-axis) was log-transformed in the Gaussian GAM. Results of generalized linear mixed models (GLMMs) predicting (B) number of genes lost and (C) number of genes gained between two time points in a focal species as a function of the interaction between Shannon diversity at the first time point and the time lag between the two time points. (D) Results of the GLMM predicting the number of genes gained in a focal species as a function of the interaction between rarefied species richness at the first time point and the time lag between the two time points. The illustrated time lags correspond to the first quartile (50 days), the median (130 days), and the third quartile (250 days). See Supplementary file 1h and i and Supplementary file 2 section 11 for detailed model outputs. These analyses are based on data from 15 bacterial species across four stool donors from Poyet et al. Only statistically significant relationships are plotted. Non-significant relationships are not shown: the GAM predicting polymorphism change as a function of rarefied richness (p>0.05) and the GLMM predicting the number of genes lost as a function of rarefied richness (p>0.05).

We next asked if community diversity at one time point could predict gene gains or losses at future time points by fitting GLMMs (analogous to the GAMs above, but more appropriate for gain/loss count data). Our method does not explicitly distinguish between gene gain/loss arising from recombination or deletion versus replacement of strains with different gene content. We found that community Shannon diversity predicted future gene loss in a focal species, and this effect became stronger with longer time lags (Figure 5B, Supplementary file 1i, GLMM, p=0.006, LRT for the effect of the interaction between the initial Shannon diversity and time lag on the number of genes lost). The model predicts that increasing Shannon diversity from its minimum to its maximum would result in the loss of 0.075 genes from a focal species after 250 days. In other words, about one of the 15 focal species considered would be expected to lose a gene in this time frame.

Higher Shannon diversity was also associated with fewer gene gains, and this relationship also became stronger over time (Figure 5C, Supplementary file 1i, GLMM, p=1.11e-09, LRT). We found a similar relationship between community species richness and gene gains, although the relationship was slightly positive at shorter time lags (Figure 5D, Supplementary file 1i, GLMM, p=3.41e-04, LRT). No significant relationship was observed between richness and gene loss (Supplementary file 1i, GLMM, p>0.05). Taken together with the HMP results (Figure 4), these longer time series reveal how the sign of the diversity slope can vary over time and how community diversity is generally predictive of reduced focal species gene content.

Discussion

How eco-evolutionary feedbacks shape biological communities is an open question that, to date, has received substantial experimental and theoretical attention but is challenging to address in nature. In our previous study using 16S rRNA amplicon sequences from the Earth Microbiome Project, we found generally positive diversity slopes that eventually flattened at high levels of community diversity (Madi et al., 2020). This pattern is generally consistent with the predictions of DBD during the early stages of community assembly, but at later stages becomes more consistent with EC as niches become filled. Based on the time series metagenomic data analyzed here, the predictions of DBD also tend to hold over short time scales but fail over longer time scales of several months. Whether this leads to a terminal plateau of diversity, or whether ecological disturbances lead to cycles of DBD and EC, deserves further study.

In our previous study, the animal gut microbiome had one of the highest positive diversity slopes, making it an ideal candidate for investigating eco-evolutionary interactions at greater intra-species resolution using metagenomic data. In this follow-up study, we investigate the same phenomenon at a subspecies level, with results that are broadly consistent with the predictions of DBD giving way to EC over long time scales. We note that experiments supporting DBD have generally been conducted over short time scales ranging from 2 to 20 days (Estrela et al., 2022; Jousset et al., 2016), consistent with the importance of DBD early in community assembly. We also identify several nuances and caveats to this general conclusion, which are discussed below in detail.

Another recent study also found evidence for eco-evolutionary feedbacks in the HMP, in the form of a positive relationship between evolutionary modifications or strain replacements in a focal species and community diversity (Good and Rosenfeld, 2022). Using a model, they further showed that these eco-evolutionary dynamics could be explained by resource competition and did not require the cross-feeding interactions previously invoked (Estrela et al., 2022; San Roman and Wagner, 2021; San Roman and Wagner, 2018) to explain DBD at higher taxonomic levels. This could be because cross-feeding operates at the family or genus level and is less relevant at finer evolutionary scales.

There are several noteworthy caveats to our study. First, using metagenomic data from human microbiomes allowed us to study genetic diversity, but limited us to considering only relatively abundant species with genomes that were well covered by short sequence reads. Deeper or more targeted sequencing may permit us to determine whether the same patterns hold for rarer members of the microbiome. However, it is notable that the majority of the dozens of species across the two datasets analyzed support DBD, suggesting that the phenomenon may generalize.

Second, we cannot establish causal relationships without controlled experiments. We are therefore careful to conclude that positive diversity slopes are consistent with the predictions of DBD, and negative slopes with EC, but unmeasured environmental drivers could be at play. For example, increased dietary diversity could simultaneously select for higher community diversity and also higher intra-species diversity. In our previous study, we found that positive diversity slopes persisted even after controlling for potential abiotic drivers such as pH and temperature (Madi et al., 2020), but a similar analysis was not possible here due to a lack of metadata. Neutral processes can account for several ecological patterns such as species-area relationships (Hubbell, 2001), and must be rejected in favor of niche-centric models like DBD or EC. Using neutral models without DBD or EC, we found generally flat or negative diversity slopes due to sampling processes alone and that positive slopes were hard to explain with a neutral model (Madi et al., 2020). These models were intended mainly for 16S rRNA gene sequence data, but we expect the general conclusions to extend to metagenomic data. Nevertheless, further modeling and experimental work will be required to fully exclude a neutral explanation for the diversity slopes we report in the human gut microbiome.

Based on controlled experiments (Estrela et al., 2022) and modeling studies (San Roman and Wagner, 2021), DBD is a plausible causal explanation for positive diversity slopes in the gut microbiome. Although they also note that causality is difficult to establish, Good and Rosenfeld, 2022 suggest the importance of focal species evolution as a driver of changes in community structure, as shown in an experimental study of Pseudomonas in compost communities (Padfield et al., 2020). Clearly, further work is needed to establish the extent and relative rates of eco-evolutionary feedback in both directions. How these feedbacks among bacteria are influenced by abiotic factors and by interactions with fungi, archaea, and phages also deserves further study.

Third, the diversity slope changes depending on which component of within-species diversity or community diversity is considered. Notably, the number of strains within a focal species is positively correlated with Shannon diversity, but inversely correlated with species richness, suggesting that the ability of strains to colonize a host may be associated with higher community evenness rather than total species count. Higher evenness might maximize the chance of inter-species interactions, whereas higher richness might be driven by rare species that are less likely to interact. Although Shannon diversity is considered to be more robust and informative than richness in estimating bacterial diversity (He et al., 2013; Reese and Dunn, 2018), we observe the same contrasting results between Shannon diversity and richness when community diversity is calculated at higher taxonomic levels, suggesting that this pattern is not due to artifacts such as sequencing effort.

Our measures of intra-species diversity included both synonymous and nonsynonymous SNVs, inferred strain richness, and gene content. Synonymous nucleotide variation was consistently and positively associated with both community richness and Shannon diversity at all taxonomic levels (although not always with statistical significance). Nonsynonymous variation also tended to track positively with both measures of community diversity but was only statistically significantly associated with phylum and class richness. This suggests that evolutionarily older, less selectively constrained synonymous mutations and more recent nonsynonymous mutations that affect protein structure both track similarly with measures of community diversity. Nonetheless, a parsimonious explanation for possible differences between the two classes is that while they are affected similarly, we have more statistical power to identify correlations in the more numerous synonymous mutations. This merits further investigation.

Metagenomes from the same individual sampled over time allowed us to detect gene gain and loss events. In both HMP and Poyet et al. time series, community diversity was predictive of future gene loss in a focal species. This phenomenon is not explicitly predicted by either DBD or EC but it is compatible with aspects of the BQH, with some caveats. BQH predicts that a focal species will be less likely to encode genes with functions provided by other members of the surrounding community if such functions are ‘leaky’ and available as diffusible public goods (Morris et al., 2012). The BQH could also act as a driver of polymorphism within a species (Morris et al., 2014). Gene loss may be adaptive, provided that there is a cost to encoding and expressing the relevant genes (Albalat and Cañestro, 2016; Koskiniemi et al., 2012; Simonsen, 2022). The tendency for reductive genome evolution in bacteria is well established (Albalat and Cañestro, 2016; Koskiniemi et al., 2012; Puigbò et al., 2014). Genome reduction is a particular hallmark of endosymbiotic bacteria, which depend on their hosts for many metabolic gene products (McCutcheon and Moran, 2011; Nikoh et al., 2011). It has been shown that uncultivated bacteria from the gut have undergone considerable genome reduction, which may be an adaptive process that results from reliance on public goods (Nayfach et al., 2019). In the gut microbiome, the BQH has been invoked to explain the distribution of genes involved in vitamin B metabolism (Sharma et al., 2019) and iron acquisition (Vatanen et al., 2019).

Our findings in human gut metagenomes are compatible with the BQH under the assumption that increasing community diversity also increases the availability of leaky gene products – which may not be the case if genomes in the gut microbiome are functionally redundant, as inferred in a recent study (Tian et al., 2020). This study found that species in the gut microbiome were highly redundant at the level of annotated metabolic pathways (KEGG orthologs) and that more functionally redundant microbiomes were more resistant to colonization by fecal transplants. Relatively low redundancy microbiomes could therefore be more easily colonized but might also require migrants to encode more gene functions in order to persist. Importantly, functional redundancy may be high at the level of well-annotated metabolic functions, but low at the finer level of individual gene families, as demonstrated in marine microbiomes (Galand et al., 2018) but not yet studied explicitly in the gut. Here, we report that genome reduction in the gut is higher in more diverse gut communities. This could be due to de novo gene loss, preferential establishment of migrant strains encoding fewer genes, or a combination of the two. The mechanisms underlying this correlation remain unclear and could be due to biotic interactions – including metabolic cross-feeding as posited by some models (Estrela et al., 2022; San Roman and Wagner, 2021; San Roman and Wagner, 2018) but not others (Good and Rosenfeld, 2022) – or due to unknown abiotic drivers of both community diversity and gene loss. Finally, we measured community diversity from the phylum to the species level, not below. We therefore did not investigate how the BQH could extend to maintain gene content variation within a species, as has been shown experimentally in Escherichia coli (Morris et al., 2014). This could be an avenue for future work.

In our previous analysis of lower-resolution 16S rRNA amplicon sequences, we reported a tendency for focal genera with larger genomes to have higher diversity slopes, perhaps because they experience stronger DBD (Madi et al., 2020). At face value, this tendency seems at odds with the BQH, which predicts genome reduction in more diverse communities. This apparent contradiction may be reconciled by considering eco-evolutionary dynamics on different time scales. A recent study used phylogenetic and metabolic reconstructions to show that gene gains often drive metabolic dependencies among bacteria (Goyal, 2021), potentially explaining why genera with larger maximum genome size could experience stronger DBD. Our earlier study only had the genetic resolution to consider focal taxa down to the genus level, and by using the maximum genome size observed in a public database we did not capture the dynamic process of gene gain and loss within a species, as was possible in the current metagenomic study. It is therefore possible that on longer (ecological) time scales, larger genomes have more metabolic interactions and thus experience stronger DBD, while genome reduction in more diverse communities occurs on shorter (evolutionary) time scales.

In summary, we demonstrate how metagenomic data can be used to test the predictions of eco-evolutionary theory, including DBD, EC, and the BQH. It remains to be seen whether the distinct eco-evolutionary processes proposed by DBD and the BQH operate orthogonally or whether they interact. If BQH leads to gene losses that remain polymorphic rather than being lost entirely from the species (Morris et al., 2014) – or invasions of strains with fewer genes that remain incomplete and do not replace the resident strain – this could be viewed as a form of diversification and perhaps a special case of DBD. Here, we considered gene loss as a directional process; we did not attempt to distinguish between directional changes in gene copy number and the complete extinction of a gene, which is difficult to show using metagenomic data. Future work could attempt to resolve this point and to potentially combine DBD and BQH into a unified theory.

Methods

Metagenomic analyses

Estimation of species, gene, and SNV content of metagenomic samples

We used MIDAS (Metagenomic Intra-Species Diversity Analysis System, version 1.2, downloaded on November 21, 2016) (Nayfach et al., 2016) to estimate within-species nucleotide and gene content of raw metagenomic whole genome shotgun sequencing data for HMP1-2 and Poyet et al., 2019, data. MIDAS relies on a reference database comprised of 31,007 bacterial genomes that are clustered into 5952 species, covering roughly 50% of species found in human stool metagenomes from ‘urban’ individuals. Described below are the parameters used to estimate species abundances, SNVs, and gene CNVs with MIDAS.

Estimation of species content

We estimated species abundances, SNVs, and CNVs by mapping metagenomic shotgun reads to reference genomes. Since a component of this work relies on quantifying polymorphism and CNV changes over time, we constructed a ‘personal’ reference database to avoid spurious inferences of allele frequency and CNV changes due to errors in mapping of reads to regions of the genome shared by multiple species (Garud et al., 2019). This per-host reference database was comprised of the union of all species present at one or more time points so as to be as inclusive as possible to prevent reads from being ‘donated’ to reference genome, while also being selective to prevent a reference genome from ‘stealing’ reads from a species truly present.

To estimate the species relative abundances for each host × time point sample, we mapped reads to 15 universal single-copy marker genes that are a part of the MIDAS pipeline (Nayfach et al., 2016; Wu et al., 2013) and belong to the 5952 species in the MIDAS reference database. A species with an average marker gene coverage ≥3 was considered present for the purposes of building a per-host database for mapping reads to infer SNVs and CNVs below. The per-host database was constructed by including all species present at one or more time points with coverage ≥3. However, more stringent thresholds were imposed for calling SNVs and CNVs, as described below.

Estimation of CNVs

To estimate gene CNVs, we mapped reads to the pangenomes of species present in a host’s personal database using Bowtie2 (Langmead and Salzberg, 2012) with default MIDAS settings (local alignment, MAPID ≥94.0%, READQ ≥20, and ALN_COV ≥0.75). Each gene’s coverage was estimated by dividing the total number of reads mapped to a given gene by the gene length. These genes included the aforementioned 15 universal single-copy marker genes. A given gene’s copy number (c) was estimated by taking the ratio of its coverage and the median coverage of the species’ single-copy marker genes.

With these copy number values, we estimated the prevalence of genes in the between-host population, defined as the fraction of samples with copy number c≤3 and c≥0.3 (conditional on the mean single gene marker coverage being ≥5×). For each species, we computed ‘core genes’, defined as genes in the MIDAS reference database that are present in at least 90% of samples within a given cohort. Within-host polymorphism rates were computed in core genes.

Orthologous genes present in multiple species can result in read ‘stealing’ and read ‘donating’ to species from which the reads did not originate. Thus, we excluded a set of genes belonging to a ‘blacklist’ composed of genes present in multiple species. This blacklist was constructed in Garud et al., 2019, using USEARCH (Edgar, 2010) to cluster all genes in human-associated reference genomes with a 95% nucleotide identity threshold. Since some genes may be absent from the MIDAS database but may nevertheless be shared across species, we implemented another filter (as in Garud et al., 2019) in which genes with c≥3 in at least one sample in our cohort were excluded from analysis of polymorphism rate or gene changes over time.

Inferring SNVs within bacterial species

To call SNVs, we mapped reads to a single representative reference genome as per the default MIDAS software. Reads were mapped with Bowtie2, with default MIDAS mapping thresholds: global alignment, MAPID ≥94.0%, READQ ≥20, ALN_COV ≥0.75, and MAPQ ≥20. Species were excluded from further analysis if reads mapped to ≤40% of their genome. We additionally excluded samples from further analysis if they had low median read coverage (D_) at protein coding sites. Specifically, samples with D_ < 5 across all protein coding sites with nonzero coverage were excluded. This MIDAS SNV output was then used for computing within-species polymorphism rates and inferring the number of strains present for each species in each sample (see below).

To compute polymorphism rates, additional bioinformatic filters were imposed to avoid read stealing and donating across different species. First, we did not call SNVs in blacklisted genes present in multiple species. Additionally, we excluded sites in a given sample if D<0.3 D_ or D>3 D_ as these sites harbor anomalously low or high coverage compared to the genome-wide average D_ . Additional filters are described below.

Shannon diversity, species richness, and polymorphism rate calculations

Shannon diversity and richness were computed within each sample by including any species with abundance greater than zero. Rarefied species richness estimates are based on HMP1-2 samples rarefied to 20 million reads and Poyet samples rarefied to 5 million reads. SNV and gene content variation within a focal species were ascertained only from the full dataset and not the rarefied dataset.

The polymorphism rate of a species in a sample was computed as the proportion of synonymous sites in core genes with intermediate allele frequencies (0.2 ≤f ≤0.8), as was done in Garud et al., 2019. Only species with a MIDAS marker gene coverage of 10 or more in 10 or more samples were included, yielding 69 species in 249 HMP stool donors and 15 species in four Poyet et al., 2019 donors. As explained in SI text 1 in Garud et al., 2019, this is quantitatively similar to the more traditional population genetic measure of heterozygosity, H=E[2f(1 − f)], in which intermediate frequency alleles contribute the most weight to heterozygosity. By computing polymorphism with the criteria 0.2 ≤f ≤0.8, we avoid inclusion of low-frequency sequencing errors, which can otherwise greatly influence the mean heterozygosity. Polymorphism rates were computed separately for synonymous (fourfold degenerate) and nonsynonymous (onefold degenerate) sites. The degeneracy of sites was determined based on MIDAS output.

Temporal changes in polymorphism rates and gene content

Polymorphism change was computed as the difference in polymorphism rates between time points within a host. Gene gains and losses between time points were computed in species with sufficient prevalence (at least 10 samples with marker gene coverage of at least 10, as in the polymorphism analysis above) by identifying genes with copy number c≤0.05 (indicating gene absence) in one sample and 0.6≤c≤1.2 (with marker coverage ≥20×) in another (indicating single copy gene presence). These thresholds were used in Garud et al., 2019, when inferring gene changes in temporal data and reflect a range of copy numbers expected in either the absence of a gene or presence of a single copy of a gene given typical coverage values in growing cells (Korem et al., 2015). These copy number cutoffs were chosen to avoid spuriously analyzing genes linked to multiple species. In such cases, mapping artifacts in which reads can be arbitrarily assigned to multiple species cannot be disentangled. For example, a gene present in multiple species would likely have copy number significantly deviating from 1 (including values that lie in an ambiguous zone of 0.05–0.6, as well as >>1), reflecting the joint abundances of the multiple species. Thus, although we may miss many biologically interesting multi-copy genes (e.g. transporter genes in Bacteroides; Wexler and Goodman, 2017), our thresholds avoid confounding our analysis with read stealing or donating among different species. Filters for coverage and blacklisted genes were applied as described above.

Strain number inference

We used StrainFinder (Smillie et al., 2018) to infer the number of strains present within each species in each HMP1-2 metagenomic sample. To do so, we used allele frequencies from MIDAS SNV output, generated as described above. For each species in each host, all multi-allelic sites with coverage of 20× or greater were passed as input to StrainFinder. Species/host pairs which had fewer than 100 sites with 20× coverage were removed from the analysis. StrainFinder was then run on each sample separately for strain numbers 1, 2, 3, and 4, and the optimal strain number was chosen based on the Bayesian information criterion. This range of strain number was chosen for biological reasons. A number of studies have demonstrated that at most a small handful of strains (between 1 and 4) not sharing a common ancestor within the host are ever observed within a single gut microbiome at any one time (Garud et al., 2019; Truong et al., 2017; Verster et al., 2017; Yassour et al., 2018). Additionally, for the four densely longitudinally sampled hosts in Poyet et al., 2019, multiple analyses employing distinct sequencing strategies and strain phasing techniques have similarly concluded that a maximum of four strains were present at any one time within a host for the ~30 most prevalent species (Poyet et al., 2019; Wolff et al., 2021; Zheng et al., 2022). Thus, four strains were chosen as the maximum to accommodate the range of observed possibilities.

Statistical analyses

Model construction and evaluation

Using data from the HMP and Poyet et al., 2019, we examined the relationship between within-species genetic diversity and the gut microbiome community diversity. Within-species genetic diversity was estimated with polymorphism rate and strain richness. Community diversity was estimated with the Shannon index, species richness estimated on the whole data, and species richness calculated on the data rarefied to an equal number of reads per sample (as described above). Generalized additive mixed models (mgcv function from the mgcv R package – RStudio version 1.2.5042) were used for most analyses, except when the response data were counts, such as the number of strains, gene gains, or gene losses. In these cases, we used GLMMs (glmmTMB function from the glmmTMB R package – RStudio version 1.2.5042). GLMMs are currently more flexible than GAMs in the range of count models that it can fit (https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html; Bolker, 2023). glmmTMB can deal with overdispersion in count data via two versions of negative binomial distributions negative binomial1 and negative binomial2, respectively, with linear and quadratic parameterization (Hardin and Hilbe, 2018), and can handle zero-truncated count data (Shonkwiler, 2016) with truncated Poisson and truncated negative binomial for both linear and quadratic parameterizations. In our case, strain count is an overdispersed positive variable, so a zero-truncated distribution was needed. We fit three different GLMMs with truncated-Poisson, truncated-negative binomial1 and truncated-negative binomial2, and then selected the best model based on the Akaike information criterion (AIC) as described in Brooks et al., 2017. The same methods were used to fit GLMMs for gene gains and losses.

To account for variation in sequencing depth, which can affect estimates of both community diversity and within-species genetic diversity, we added read count per sample (coverage) as a covariate to all generalized mixed models. Species name, subject identifier, and sample identifier were added as random effects to account for variation between different species and subjects, and to account for non-independence between observations. The R syntax and statistics of all generalized models are reported in Supplementary file 2.

In GLMMs, the predictors were standardized to zero mean and unit variance before analyses. We first assessed random effects significance by comparing nested models where each random effect was dropped one at a time using the LRT (ANOVA function from the R stats package) and only significant random effects were included in the final models. We then assessed the fixed effects’ significance with LRTs implemented in the drop1 function in the R stats package. This function drops individual terms from the full model and reports the AIC and the LRT p-value. All the p-values reported for the GLMMs correspond to LRT and not to the Wald p-values reported by glmm.summary function from the R package glmmTMB, as was recommended in https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html; (Bolker, 2023). We again used LRTs to compare the full significant models to null models including all random effects but no fixed effects other than the intercept. The difference in Akaike information criterion (∆AIC) between full and null model and their associated p-values are reported in Supplementary file 1e–g. As an additional evaluation of the goodness of fits, we estimated the coefficient of determination (R2) using the r2 function from the performance R package. Two values are reported: the marginal R2, a measure of the variance explained only by fixed effects, and the conditional R2, a measure of the variance explained by the entire model.

We evaluated model fits by inspecting the residuals using the DHARMa library in R (simulateResiduals and plot functions) for the GLMMs and by inspecting residual distributions and fitted-observed value plots using the gam.check function from the mgcv R package for the GAMs. Adjusted R2 values (from gam.summary function from the mgcv R package) are reported as a goodness of fit for the GAMs. All model outputs (summary function from mgcv and glmmTMB R packages) are reported in the Supplementary file 2.

To study the relationship between focal species polymorphism and community diversity calculated at higher taxonomic ranks (from genus to phylum), we used GTDBK and the Genome Taxonomy Database (GTDB) (Chaumeil et al., 2019) to annotate MIDAS reference genomes. Richness at each level was estimated with the total number of distinct taxonomic units in the sample. The Shannon index was calculated based on the relative abundances table from MIDAS: at each taxonomic level, we used the sum of the abundances of all species belonging to that taxonomic level to calculate the Shannon index (using the diversity function from the R vegan library). We then fit two separate GAMs for each taxonomic rank (from genus to phylum) with either Shannon diversity or richness as the predictors of within-species polymorphism rate (with the coverage per sample as a covariate and species name, sample and subject identifiers as random effects). These GAMs were fitted with a beta error distribution with logit-link function, chosen because polymorphism rate is a continuous value strictly bounded by 1, and all the terms were smoothed terms (see Supplementary file 1c and Supplementary file 2 sections 1–3 for additional model details).

We repeated the same methods for focal species synonymous and nonsynonymous polymorphism separately. See Supplementary file 1b and d and Supplementary file 2 section 4–6 for details of the models applied to nonsynonymous polymorphism.

Analysis of strain counts per focal species

To study the relationship between community diversity and the number of strains within a focal species in the HMP, we restricted the analysis to 184 focal species genomes with at least 100 nucleotide sites with 20× coverage in a sample. We fit separate GLMMs with strain count in a focal species as a function of community diversity estimated with Shannon diversity, species richness, or rarefied species richness. Since strain number is positive count data, we compared alternative zero-truncated count models based on the AIC score (AICtab function from bbmle R library) (Brooks et al., 2017). We fit the model with the truncated negative binomial distribution (truncated_nbinom2 or truncated_nbinom1 in glmmTMB; the second best fit) in order to resolve the overdispersion detected in the best fit (the truncated Poisson model). Overdispersion was tested using the check_overdispersion function from the performance R package as described here: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html; (Bolker, 2023).

As described above for focal species polymorphism, we tested the relationship between focal species strain count and community diversity at higher taxonomic levels from genus to phylum, fitting a separate GLMM with strain count in a focal species as a function of each metric of diversity (Shannon and richness) at higher taxonomic levels (from genus to phylum). All GLMM details are reported in Supplementary file 1f and Supplementary file 2 sections 7–9.

Analysis of time series data

To test the predictions of DBD over time, we used HMP samples with multiple time points from the same person to look at the relationship between within-species polymorphism change, defined as the difference between polymorphism rate at two time points, and community diversity at the earlier time point. We fit GAMs with log-transformed polymorphism change as a function of community diversity at the earlier time point, and added the coverage per sample at the earlier time point as a covariate as well as species name, sample, and subject identifiers as random effects (Supplementary file 2 section 10.1).

In addition, we investigated the effect of community diversity at one time point on gene content variation (gains and losses considered separately) at the subsequent time point. Gene gains and losses were both overdispersed count data, so we selected the best negative binomial model (between linear and quadratic parameterization) based on the AIC, and fit separate negative binomial GLMMs with gene gain as the response and each of the metrics of community diversity as the predictor, with the same covariates and random effects used in the previous models (Supplementary file 2 section 10.2). The same method was used to test how gene loss was related to community diversity (Supplementary file 1g, Supplementary file 2 section 10.3).

HMP longitudinal data consisted of hosts sampled at a time lag of ~6 months. To assess the relationship between within-species genetic diversity and community diversity at higher temporal resolution, we used the same methods to analyze longitudinal metagenomic data from four more frequently sampled healthy stool donors (hosts am, an, ao, and ae) (Poyet et al., 2019). Stool from donor am was sequenced over 18 months with a median of 1 day between samples; an over 12 months (median 2 days between samples); ao over 5 months (median 1 day between samples); and ae over 7 months (median 2 days between samples). We looked at polymorphism change and gene gains and losses between two time points in the 15 species with a minimal marker gene coverage of 10 in at least 10 samples. Community diversity was estimated with Shannon diversity (unrarefied) and richness calculated on rarefied data to 5 million reads per sample.

We used the same methods as in HMP time series to study the relationship between community diversity at the initial time point and polymorphism change between the initial time point and all the future time points. We fit Gaussian generalized additive mixed models with log-transformed polymorphism change as the response and the interaction between community diversity at the first time point and the number of days between time points as the predictor. Covariates included coverage, species name, sample, and subject identifiers as random effects (Supplementary file 1h, Supplementary file 2 sections 11.1 and 11.2). To study the relationship between gene variation (gains and losses separately) and diversity at the first time point, we fit negative binomial GLMMs with gene variation as a function of the interaction between diversity at the first time point and the number of days between the two time points, with the same covariate and random effects as used above for polymorphism change over time (Supplementary file 1i, Supplementary file 2 sections 11.3–11.6).

Data availability

The raw sequencing reads for the metagenomic samples used in this study were downloaded from Human Microbiome Project Consortium, 2012 and Lloyd-Price et al., 2017 (URL: https://aws.amazon.com/datasets/human-microbiome-project/); and Poyet et al., 2019 (NCBI accession number PRJNA544527). All computer code for this paper is available at https://github.com/Naima16/DBD_in_gut_microbiome, (copy archived at swh:1:rev:eef9dc1643f279e4bb2f6d7e19c4fd7d5acc384f; Madi et al., 2023).

The following previously published data sets were used
    1. Human Microbiome Project
    (2013) Registry of Open Data on AWS
    ID human-microbiome-project. Human Microbiome Project Data Set.
    1. Poyet M
    2. Groussin M
    3. Gibbons SM
    4. Avila-Pacheco J
    5. Jiang X
    6. Kearney SM
    7. Perrotta AR
    8. Berdy B
    9. Zhao S
    10. Lieberman TD
    11. Swanson PK
    12. Smith M
    13. Roesemann S
    14. Alexander JE
    15. Rich SA
    16. Livny J
    17. Vlamakis H
    18. Clish C
    19. Bullock K
    20. Deik A
    21. Scott J
    22. Pierce KA
    23. Xavier RJ
    24. Alm EJ
    (2019) NCBI BioProject
    ID PRJNA544527. BIO-ML: The Broad Institute-OpenBiome Microbiome Library enables mechanistic studies with isolates, genomes, and longitudinal metagenomics and metabolomic data.

References

  1. Book
    1. Hardin JW
    2. Hilbe JM
    (2018)
    Generalized Linear Models and Extensions
    Stata Press.
  2. Book
    1. Hubbell SP
    (2001)
    The Unified Neutral Theory of Biodiversity and Biogeography
    Princeton University Press.
  3. Book
    1. Schluter D
    (2000)
    The Ecology of Adaptive Radiation
    Oxford Series in Ecology and Evolution.

Decision letter

  1. Sara Mitri
    Reviewing Editor; University of Lausanne, Switzerland
  2. Detlef Weigel
    Senior Editor; Max Planck Institute for Biology Tübingen, Germany
  3. Djordje Bajić
    Reviewer; Yale University, United States

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Community diversity is associated with intra-species genetic diversity and gene loss in the human gut microbiome" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Detlef Weigel as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Djordje Bajić (Reviewer #1).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

Three main weaknesses of the work were pointed out that would require extensive work to address, which would be beyond the scope of the current paper. The reviewers are requesting instead that these weaknesses be made very clear in the manuscript. The first is that DBD is only one possible process that can explain the observed patterns, and alternative hypotheses are not sufficiently outlined. The second is that the sequencing depth of the analyzed samples is little for gut microbiomes and hence allows only to look at a fraction of the strain-level diversity present. Third, it is not clear what the mechanism would link the BQH and DBD. Based on these points, it would be important to

(i) tone down the conclusions,

(ii) elaborate on certain concepts,

(iii) discuss these limitations of your study, and

(iv) consider whether any further analyses of your data might address these points (more detail below).

Reviewer #2 (Recommendations for the authors):

The study is purely correlative, some of the correlations are weak or not significant, only a subset of the analysed species shows a significant positive correlation, and when changing the method to assess the diversity not the same species are being identified to have a positive correlation (diversity-SNV analysis) or opposite trends are being observed (diversity-strain number). This means that the detected patterns are not very robust. The authors refer to a recent pre-print (Estrela et al) that provides experimental support for DBD. However, the environmental conditions in that study were very different from the one found in the gut, in that only a single carbon source was provided. DBD is only one possible process that may explain the observed patterns (see above). Therefore, I recommend the authors tune down their conclusions, discuss their findings more critically, i.e. offer alternative explanations, or argue better why these alternative explanations are less likely than DBD to explain the observed patterns.

From the discussion, it is not clear what would explain the DBD at finer taxonomic scale. It would be great to have a bit more insights about how the authors believe that diversity at e.g. the species, genus, or even family level could influence strain-level diversity within an individual species. What type of niches can be created by the presence of additional families? And how can these niches be occupied by distinct strains of the same species? Although it is true that some strains may be ecologically distinct, I would still assume that most strains occupy similar niches and are competing for similar nutrients. Moreover, wouldn't we expect that there is so much functional redundancy in the gut microbiome already that a slight increase in taxonomic diversity would not necessarily create new metabolic niches from which closely related strains with similar metabolic capacities would profit? I think it would help to explain a bit more how the DBD should work out at the strain level and to what extent there is evidence that the increased strain-level diversity can really be adaptation to new niches created by other microbes in the system.

The HMP1-1 and Poyet samples were rarified to 20 and 5 mio reads. This seems very little sequencing depth considering the high amount of genus/species level diversity in the human gut microbiome and the fact that the authors want to look at strain-level diversity. I have my doubts that this amount of data will allow to quantitatively assess diversity at the strain level in these samples, which may in part explain the not very robust correlations observed. How much of the total diversity is assessed with this number of reads? Rarefaction curves of SNVs or genes per species discovered when sub-setting the datasets across samples would be helpful. In addition to the low overall sequencing depth, only polymorphisms with a frequency of 0.2-0.8 were considered, which further limits the number of strains that can be detected. What was the idea behind applying this cut-off?

Why was only one species analysed across the entire Poyet dataset? If this is a general pattern it should be observable for more species than just B. vulgatus. While I acknowledge that read coverage may not be high enough for all samples for other species, there should be enough consecutive timepoints with sufficient coverage for some species. Unless there is a good reason why such analysis can only be done for B. vulgatus, I strongly recommend to extend this analysis to other community members to find further support for DBD-like patterns.

Along the same lines, the Poyet dataset would offer an opportunity to follow how diversity of B. vulgatus changes over more timepoints than just two in response to the species/genus/family diversity (e.g. over the entire 18 months?). This could provide interesting new insights that may help to understand how diversity changes over time at different levels. Why not look at changes in diversity across all timepoints? Would we expect that DBD continues over a longer period of time or would we see some type of negative feedback, because of ecological control kicking in at one point leading to oscillations of diversity?

The figures with the correlations should be improved. Specifically, Figure 2, 3, and 5 include too many data points of different species on top of each other. It is impossible to look at the distribution of the data of individual species and appreciate the existence of correlations. In panel A of Figure 2, I see only one data point for Dialister invisus. Why? Is the color legend missing next to panel B in Figure 3?

For the StrainFinder analysis, it was assumed that species for which no site passed the 20x threshold are presented by a single strain. This seems wrong in my opinion. Such data should be excluded as the shallow sequencing simply does not allow assessing the number of strains of that species in these samples.

Out of 68 species only 15 or 18 show a significant slope. This means that there are more species that display either EC or no correlation (as EC may not always detectable). This takes away from the conclusion that just DBD explains the patterns of diversity found in the gut microbiome, and should be acknowledged.

Line 148: What is the overlap of the 15 and 18 detected positive correlations between Shannon and richness analysis? From comparing the legends of Figure 2A and B, it seems that the overlap is not great. It would be helpful if the authors could state the overlap in their paper and discuss it.

Line 147: Significance of the correlations should be corrected for multiple testing. Would the identified correlations still be significant?

How does the method to identify gene loss/gene gain differentiate between gene loss and strain-level dynamics? That is, we do not know whether a certain genome lost genes or a strain lacking those genes happened to dominate the community. If such strain happened to migrate into the community for example, this would not be evidence for the BQH.

Reviewer #3 (Recommendations for the authors):

The paper is very well written and generally well-integrated with previous work by the group and by others.

I found the main novelty of the present work to be the usage of time series data to enquire about how present microbiome community diversity may influence within species polymorphism at a future time point, motivated by the mechanism underlying the Black Queen Hypothesis (BQH) put forward ten years ago.

There are however several points that I think need to be revised:

1. It was not clear to me in the authors' introduction and discussion if and how the DBD hypothesis is integrated with the BQH.

Do the authors consider these two hypotheses to be independent? What sort of mechanisms do the authors envisage to drive a positive association between community diversity and polymorphism? And is the mechanism underlying the BQH assumed to result in the fixation of a gene loss within the focal species or may it also result in polymorphism as in Morris, Papoulis and Lenski 2014 Coexistence of evolving bacteria stabilized by a shared black queen function?

I think clarifying these points would strengthen the manuscript.

2. Another issue that was not clear to me was why the authors compute the polymorphism rates with only synonymous sites. If there is a reason to exclude non-synonymous sites it should be mentioned in the manuscript. In addition, the abstract and the conclusions should precisely state that the significant correlations are with polymorphism rates at synonymous sites.

3. The manuscript emphasizes the finding of a positive correlation in several species but does not emphasize that for the majority of species no correlation was found. Thus, the conclusion that DBD prevails (pg. 7 line 150 and abstract) looks a bit exaggerated.

https://doi.org/10.7554/eLife.78530.sa1

Author response

Essential revisions:

Three main weaknesses of the work were pointed out that would require extensive work to address, which would be beyond the scope of the current paper. The reviewers are requesting instead that these weaknesses be made very clear in the manuscript. The first is that DBD is only one possible process that can explain the observed patterns, and alternative hypotheses are not sufficiently outlined. The second is that the sequencing depth of the analyzed samples is little for gut microbiomes and hence allows only to look at a fraction of the strain-level diversity present. Third, it is not clear what the mechanism would link the BQH and DBD. Based on these points, it would be important to

(i) tone down the conclusions,

(ii) elaborate on certain concepts,

(iii) discuss these limitations of your study, and

(iv) consider whether any further analyses of your data might address these points (more detail below).

We thank the reviewers and editors for their thorough and thoughtful assessment of our paper and we have worked to address these essential revisions. First, we have extensively re-written this manuscript to be more balanced in evaluating alternative hypotheses to DBD, including ecological controls (EC) and abiotic drivers of diversity. Second, we clarify that our results apply only to relatively abundant species with sufficient sequencing depth to study sub-species diversity. We explicitly state that further work will be required to extend these results to rarer species (please see Discussion). Third, we acknowledge it is not clear which mechanism would link DBD and BQH – or if these models operate orthogonally. This is also an avenue for future work, which we mention in the Discussion.

As suggested, we:

i) Tone down our conclusions and mention alternative explanations for the observed patterns;

ii) Elaborate on certain concepts, including why sub-species genetic diversity might alterniches and thus be impacted by community diversity, and also how DBD or EC could arise by de novo mutation and/or strain migration across hosts;

iii) Include several paragraph of caveats and limitations of our study (please see Discussion); and

iv) We comprehensively analyze 15 species from four hosts instead of only Bacteroides vulgatus from a single host from Poyet et al. to reinforce the correlation between community diversity and gene loss over time. We also add new analyses to determine whether the DBD relationship differs between non-synonymous and synonymous nucleotide variants. Finally, we closely evaluated the pathway-level analysis and found that the gene-level analyses were more robust. We have therefore removed the pathway analyses.

Below we elaborate on these points and include passages from the revised text.

Reviewer #2 (Recommendations for the authors):

The study is purely correlative, some of the correlations are weak or not significant, only a subset of the analysed species shows a significant positive correlation, and when changing the method to assess the diversity not the same species are being identified to have a positive correlation (diversity-SNV analysis) or opposite trends are being observed (diversity-strain number). This means that the detected patterns are not very robust. The authors refer to a recent pre-print (Estrela et al) that provides experimental support for DBD. However, the environmental conditions in that study were very different from the one found in the gut, in that only a single carbon source was provided. DBD is only one possible process that may explain the observed patterns (see above). Therefore, I recommend the authors tune down their conclusions, discuss their findings more critically, i.e. offer alternative explanations, or argue better why these alternative explanations are less likely than DBD to explain the observed patterns.

While we still think that DBD provides a plausible mechanism to explain the observed patterns, we are now careful to acknowledge alternative mechanisms. We more prominently list possible caveats and limitations in the Discussion. We also give more weight in the Results to contrasting patterns that we observe in the data. For example:

“The negative strain number-richness relationship also held at all other taxonomic ranks (GLMM, P<0.05, LRT) (Figure S3, Table S6, Supplementary File 1 section 9), while the strain number-Shannon diversity relationship was generally positive (Figure S3, Supplementary File 1 section 8). These effects also appear to be species-specific: for example, the number of Bacteroides vulgatus strains per host is positively correlated with both Shannon diversity and richness (consistent with DBD predictions) whereas B. ovatus has no relationship with Shannon diversity but a negative correlation with richness (consistent with EC; Figure 2A, B). Together, these results reveal that different components of community diversity can have contrasting effects on the diversity slope.”

From the discussion, it is not clear what would explain the DBD at finer taxonomic scale. It would be great to have a bit more insights about how the authors believe that diversity at e.g. the species, genus, or even family level could influence strain-level diversity within an individual species. What type of niches can be created by the presence of additional families? And how can these niches be occupied by distinct strains of the same species? Although it is true that some strains may be ecologically distinct, I would still assume that most strains occupy similar niches and are competing for similar nutrients. Moreover, wouldn't we expect that there is so much functional redundancy in the gut microbiome already that a slight increase in taxonomic diversity would not necessarily create new metabolic niches from which closely related strains with similar metabolic capacities would profit? I think it would help to explain a bit more how the DBD should work out at the strain level and to what extent there is evidence that the increased strain-level diversity can really be adaptation to new niches created by other microbes in the system.

Thank you for raising this important point. We have now added the following Introduction paragraph to outline the reasons we might expect strain-level diversity to affect niche preferences, and thus be relevant to DBD:

“Such fine-scale strain-level variation has important functional and ecological consequences; among other things, strains are known to engage in interactions that cannot be predicted from their species identity alone (Goyal et al., 2022). Although closely-related bacteria are expected to have broadly similar niche preferences, finer-scale niches may differ below the species level (Martiny et al., 2015). For example, the acquisition of a carbohydrate-active enzyme by Bacteroides plebeius allows it to exploit a new dietary niche in the guts of people consuming nori (seaweed) (Hehemann et al., 2010), and single nucleotide adaptations permit Enterococcus gallinarum translocation across the intestinal barrier resulting in inflammation (Yang et al., 2022). Despite their potential phenotypic effects, it is unknown if such fine-scale genetic changes are favored by higher community diversity (due for example to niche construction, as predicted by DBD) or suppressed (due to competition for limited niche space, as predicted by EC). Competition could also lead to DBD if focal species evolve new niche preferences to avoid extinction (Mitri and Foster, 2013; Schluter, 2000) – an idea with some support in experimental microcosms (Meyer and Kassen, 2007) but largely unexplored in natural communities.”

Regarding functional redundancy, we have added the following Discussion paragraph:

“Our findings in human gut metagenomes are compatible with the BQH under the assumption that increasing community diversity also increases the availability of leaky gene products – which may not be the case if genomes in the gut microbiome are functionally redundant, as inferred in a recent study (Tian et al., 2020). This study found that species in the gut microbiome were highly redundant at the level of annotated metabolic pathways (KEGG orthologs) and that more functionally redundant microbiomes were more resistant to colonization by fecal transplants. Relatively low-redundancy microbiomes could therefore be more easily colonized, but might also require migrants to encode more gene functions in order to persist. Importantly, functional redundancy may be high at the level of well-annotated metabolic functions, but low at the finer level of individual gene families, as demonstrated in marine microbiomes (Galand et al., 2018) but not yet studied explicitly in the gut. Here we report that genome reduction in the gut is higher in more diverse gut communities. This could be due to de novo gene loss, preferential establishment of migrant strains encoding fewer genes, or a combination of the two. The mechanisms underlying this correlation remain unclear and could be due to biotic interactions – including metabolic cross-feeding as posited by some models (Estrela et al., 2022; San Roman and Wagner, 2021, 2018) but not others (Good and Rosenfeld, 2022) – or due to unknown abiotic drivers of both community diversity and gene loss.”

The HMP1-1 and Poyet samples were rarified to 20 and 5 mio reads. This seems very little sequencing depth considering the high amount of genus/species level diversity in the human gut microbiome and the fact that the authors want to look at strain-level diversity. I have my doubts that this amount of data will allow to quantitatively assess diversity at the strain level in these samples, which may in part explain the not very robust correlations observed. How much of the total diversity is assessed with this number of reads? Rarefaction curves of SNVs or genes per species discovered when sub-setting the datasets across samples would be helpful. In addition to the low overall sequencing depth, only polymorphisms with a frequency of 0.2-0.8 were considered, which further limits the number of strains that can be detected. What was the idea behind applying this cut-off?

We thank the reviewer for their concerns about potentially under-calling SNVs and genes when subsetting datasets. However, we only rarefied the data when ascertaining species richness, since richness is most affected by sampling effort. We realize that it may have been confusing as to when rarefaction was applied to the data and therefore have modified the Methods to be clearer about our approach as follows:

“SNV and gene content variation within a focal species were ascertained only from the full dataset and not the rarefied dataset.”

As mentioned above, we now clearly mention the caveat that our results apply only to relatively abundant focal species for which we are able to estimate sub-species diversity. For example, Strain Finder only estimates the number of strains within species with sufficient coverage. We also include sequence coverage (number of total reads in the metagenome) in all of our GLMM and GAM analyses, so the effects of sampling effort are taken into account throughout.

We thank the reviewer for asking about the 0.2-0.8 thresholds, as it provides us an opportunity to clarify our methodology. First we wish to emphasize that these thresholds of 0.2 – 0.8 were not used for strain detection. Rather, all SNVs were used for strain detection with StrainFinder, provided that the site had 20x or greater coverage. We clarify this in the main text as well as the methods.

The 0.2-0.8 threshold was applied when computing polymorphism rates, as in Garud, Good et al. 2019, where we analyzed the same HMP dataset. This measure of polymorphism rate is similar to a more traditional measure of heterozygosity H=E[2f(1−f)], in which intermediate frequency alleles contribute the most weight. Our approach is preferable, however, because it is more robust to low-frequency sequencing errors that can overwhelm the average in H. We clarify our citations to our previous paper in the Methods section, where our rationale is explained.

Why was only one species analysed across the entire Poyet dataset? If this is a general pattern it should be observable for more species than just B. vulgatus. While I acknowledge that read coverage may not be high enough for all samples for other species, there should be enough consecutive timepoints with sufficient coverage for some species. Unless there is a good reason why such analysis can only be done for B. vulgatus, I strongly recommend to extend this analysis to other community members to find further support for DBD-like patterns.

We now expand the analysis to 15 species and 4 individuals as we elaborate in our response above.

Along the same lines, the Poyet dataset would offer an opportunity to follow how diversity of B. vulgatus changes over more timepoints than just two in response to the species/genus/family diversity (e.g. over the entire 18 months?). This could provide interesting new insights that may help to understand how diversity changes over time at different levels. Why not look at changes in diversity across all timepoints? Would we expect that DBD continues over a longer period of time or would we see some type of negative feedback, because of ecological control kicking in at one point leading to oscillations of diversity?

Thank you for this very useful suggestion. As described above, the analysis of these time series now includes time lag in the model. We find that the positive correlation between community diversity and focal species polymorphism at a future time point holds on shorter time lags, then flattens and becomes negative at longer time lags (Figure 5 and S4). The relationship between community diversity and gene loss is always positive, with slight variation depending on the time lag. It is intriguing to think about possible oscillations of DBD and EC, and we hope this can be explored in future work. We now mention this in the first Discussion paragraph:

“Based on the time series metagenomic data analyzed here, the predictions of DBD also tend to hold over short time scales, but fail over longer time scales of several months. Whether this leads to a terminal plateau of diversity, or whether ecological disturbances lead to cycles of DBD and EC, deserves further study.”

The figures with the correlations should be improved. Specifically, Figure 2, 3, and 5 include too many data points of different species on top of each other. It is impossible to look at the distribution of the data of individual species and appreciate the existence of correlations. In panel A of Figure 2, I see only one data point for Dialister invisus. Why? Is the color legend missing next to panel B in Figure 3?

We agree that the original figures were difficult to read and we have remade them completely. The new figures 2, 3, and 4 now show each of the nine most prevalent focal species as a separate panel, to avoid overlapping points, and also show the GLMM or GAM model fits in lower panels to illustrate the broad patterns supported by all the data.

For the StrainFinder analysis, it was assumed that species for which no site passed the 20x threshold are presented by a single strain. This seems wrong in my opinion. Such data should be excluded as the shallow sequencing simply does not allow assessing the number of strains of that species in these samples.

We thank the reviewer for this suggestion, and we have incorporated it into our revised analysis. Now, for the purposes of strain finding, we consider absent from a given sample any species in which there are fewer than 100 polymorphic sites with greater than 20x coverage. 9% of species/sample pairs used in the original analysis were excluded after applying this cutoff.

100 sites was chosen as a conservative threshold for strain detection. Prior work has shown that when multiple strains of a species are present within a single host, fixed genetic differences between strains tend to greatly outnumber genetic differences between individuals belonging to the same strain (Yassour et al. 2016; Garud, Good et al. 2019; Zheng et al. 2022). In particular, while tens of thousands of sites or more typically segregate between strains, only hundreds to a few thousands of sites typically segregate within a strain. Additionally, assuming a conservative sequencing error rate of 10^-3 per base pair, thousands of sites may be spuriously marked as polymorphic due to technical error—such polymorphisms will, however, be found at low frequency, and are explicitly accounted for by StrainFinder. In total, genetic differences between strains should account for a large proportion of the total polymorphisms present within a sample when multiple strains coexist. Therefore, we expect that a significant majority of the (minimum) 100 polymorphic sites will be informative as to the presence of multiple strains.

Out of 68 species only 15 or 18 show a significant slope. This means that there are more species that display either EC or no correlation (as EC may not always detectable). This takes away from the conclusion that just DBD explains the patterns of diversity found in the gut microbiome, and should be acknowledged.

Our original intention in reporting these simple linear correlations for each species was to give illustrative examples, not to establish statistical significance. We have now removed these individual correlations for each species, which were not corrected for multiple tests, nor did they include important covariates like coverage. We now focus entirely on the more robust GLMM or GAMs, which include these covariates and provide better statistical descriptions of the entire dataset in question.

Line 148: What is the overlap of the 15 and 18 detected positive correlations between Shannon and richness analysis? From comparing the legends of Figure 2A and B, it seems that the overlap is not great. It would be helpful if the authors could state the overlap in their paper and discuss it.

As mentioned in the response above, these individual linear correlations have been removed.

Line 147: Significance of the correlations should be corrected for multiple testing. Would the identified correlations still be significant?

As mentioned above, these correlations were meant to be anecdotal and illustrative. Since they are not statistically robust, we have removed them in favor of the more appropriate GLMMs and GAMs. This obviates the need for multiple test corrections, since the entire dataset is used to test a single overarching hypothesis.

How does the method to identify gene loss/gene gain differentiate between gene loss and strain-level dynamics? That is, we do not know whether a certain genome lost genes or a strain lacking those genes happened to dominate the community. If such strain happened to migrate into the community for example, this would not be evidence for the BQH.

Indeed, our method does not distinguish between these possibilities, and we now clarify this as follows in the Results:

“Our method does not explicitly distinguish between gene gain/loss arising from recombination or deletion versus replacement of strains with different gene content.”

However, we disagree with the reviewer that gene loss due to strain dynamics rather than de novo gene loss would not be evidence for BQH. We think that selection for strains encoding fewer genes in more diverse communities (with the proviso that these more diverse communities encode more public goods) is also compatible with the general predictions of BQH. This is now mentioned in the Discussion as follows:

“Here we report that genome reduction in the gut is higher in more diverse gut communities. This could be due to de novo gene loss, preferential establishment of migrant strains encoding fewer genes, or a combination of the two.”

Reviewer #3 (Recommendations for the authors):

The paper is very well written and generally well-integrated with previous work by the group and by others.

I found the main novelty of the present work to be the usage of time series data to enquire about how present microbiome community diversity may influence within species polymorphism at a future time point, motivated by the mechanism underlying the Black Queen Hypothesis (BQH) put forward ten years ago.

Thank you for these kind words.

There are however several points that I think need to be revised:

1. It was not clear to me in the authors' introduction and discussion if and how the DBD hypothesis is integrated with the BQH.

Do the authors consider these two hypotheses to be independent? What sort of mechanisms do the authors envisage to drive a positive association between community diversity and polymorphism? And is the mechanism underlying the BQH assumed to result in the fixation of a gene loss within the focal species or may it also result in polymorphism as in Morris, Papoulis and Lenski 2014 Coexistence of evolving bacteria stabilized by a shared black queen function?

I think clarifying these points would strengthen the manuscript.

We thank the reviewer for their thoughtful comments. In the Discussion, we now elaborate further on how our results are consistent with DBD and BQH. However, we fully acknowledge that future work is needed to integrate DBD and BQH into a single unified model. We have now written a new final Discussion paragraph on this topic:

“In summary, we demonstrate how metagenomic data can be used to test the predictions of eco-evolutionary theory, including DBD, EC, and the BQH. It remains to be seen whether the distinct eco-evolutionary processes proposed by DBD and the BQH operate orthogonally or whether they interact. If BQH leads to gene losses that remain polymorphic rather than being lost entirely from the species (Morris et al., 2014) – or invasions of strains with fewer genes that remain incomplete and do not replace the resident strain – this could be viewed as a form of diversification and perhaps a special case of DBD. Here we considered gene loss as a directional process; we did not attempt to distinguish between directional changes in gene copy number and the complete extinction of a gene, which is difficult to show using metagenomic data. Future work could attempt to resolve this point and to potentially combine DBD and BQH into a unified theory.”

Additionally, we appreciate the reviewer mentioning the possibility that BQH could lead to polymorphism within a species. We acknowledge this could be true especially if multiple strains of a species colonize a host. In the Discussion we now mention this possibility needs to be investigated further:

“Finally, we measured community diversity from the phylum to the species level, not below. We therefore did not investigate how the BQH could extend to maintain gene content variation within a species, as has been shown experimentally in E. coli (Morris et al., 2014). This could be an avenue for future work.”

2. Another issue that was not clear to me was why the authors compute the polymorphism rates with only synonymous sites. If there is a reason to exclude non-synonymous sites it should be mentioned in the manuscript. In addition, the abstract and the conclusions should precisely state that the significant correlations are with polymorphism rates at synonymous sites.

We thank the reviewer for this comment. We have now added an analysis of nonsynonymous sites, which largely follow the same patterns observed for synonymous sites. This can be seen by comparing Figures S1 (synonymous) and S2 (non-synonymous), which both show generally positive associations between community diversity and focal species polymorphism. The trends are very similar for both synonymous and non-synonymous polymorphism, although generally less statistically significant for the non-synonymous sites. This is to be expected since non-synonymous sites are less numerous, providing a smaller sample size that could reduce statistical power. This is now reported in the Results, and in the Discussion as follows:

“Nonsynonymous variation also tended to track positively with both measures of community diversity but was only statistically significantly associated with phylum and class richness. This suggests that evolutionarily older, less selectively constrained synonymous mutations and more recent nonsynonymous mutations that affect protein structure both track similarly with measures of community diversity. Nonetheless, a parsimonious explanation for possible differences between the two classes is that while they are affected similarly, we have more statistical power to identify correlations in the more numerous synonymous mutations. This merits further investigation.”

3. The manuscript emphasizes the finding of a positive correlation in several species but does not emphasize that for the majority of species no correlation was found. Thus, the conclusion that DBD prevails (pg. 7 line 150 and abstract) looks a bit exaggerated.

The simple linear correlations for each species were meant to be illustrative, but were not properly representative of statistical significance. We have now removed these individual correlations for each species, which were not corrected for multiple tests, nor did they include important covariates like coverage. We now focus on the more robust GLMM or GAMs, which include these covariates and provide better statistical descriptions of the entire dataset in question. This now shows that a positive relationship between community diversity and focal species polymorphism is the predominant pattern, although of course this varies by time scale and by species. We make this clear in the Results in this passage, among others:

“These effects also appear to be species-specific: for example, the number of Bacteroides vulgatus strains per host is positively correlated with both Shannon diversity and richness (consistent with DBD predictions) whereas B. ovatus has no relationship with Shannon diversity but a negative correlation with richness (consistent with EC; Figure 2A, B). Together, these results reveal that different components of community diversity can have contrasting effects on the diversity slope.”

We have now toned down the conclusion that “DBD prevails” and have rewritten the Abstract to include the following text, which we think is more balanced:

“We find that both intra-species polymorphism and strain number are positively correlated with community Shannon diversity. This trend is consistent with DBD, although we cannot exclude abiotic drivers of diversity. Shannon diversity is also predictive of increases in polymorphism over time scales up to ~4-6 months, after which the diversity slope flattens and then becomes negative—consistent with DBD eventually giving way to EC. Also supporting a complex mixture of DBD and EC, the number of strains per focal species is positively associated with Shannon diversity but negatively associated with richness.”

https://doi.org/10.7554/eLife.78530.sa2

Article and author information

Author details

  1. Naïma Madi

    Département de sciences biologiques, Université de Montréal, Montréal, Canada
    Contribution
    Conceptualization, Software, Formal analysis, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-8793-7389
  2. Daisy Chen

    1. Computational and Systems Biology, University of California, Los Angeles, Los Angeles, United States
    2. Bioinformatics and Systems Biology Program, University of California, San Diego, San Diego, United States
    Contribution
    Data curation, Formal analysis, Methodology, Writing – review and editing
    Contributed equally with
    Richard Wolff
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6516-7029
  3. Richard Wolff

    Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, United States
    Contribution
    Formal analysis, Investigation, Methodology, Writing – review and editing
    Contributed equally with
    Daisy Chen
    Competing interests
    No competing interests declared
  4. B Jesse Shapiro

    1. Département de sciences biologiques, Université de Montréal, Montréal, Canada
    2. McGill Genome Centre, McGill University, Montreal, Canada
    3. Quebec Centre for Biodiversity Science, Montreal, Canada
    4. McGill Centre for Microbiome Research, Montreal, Canada
    5. Department of Microbiology and Immunology, McGill University, Montreal, Canada
    Contribution
    Conceptualization, Supervision, Funding acquisition, Investigation, Visualization, Writing – original draft, Project administration, Writing – review and editing
    For correspondence
    jesse.shapiro@mcgill.ca
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6819-8699
  5. Nandita R Garud

    1. Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, United States
    2. Department of Human Genetics, University of California, Los Angeles, Los Angeles, United States
    Contribution
    Conceptualization, Data curation, Supervision, Funding acquisition, Investigation, Methodology, Writing – original draft, Project administration, Writing – review and editing
    For correspondence
    ngarud@ucla.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4217-4407

Funding

Allen Institute (Paul Allen Frontiers Group)

  • Nandita R Garud

Research Corporation for Science Advancement

  • Nandita R Garud

Natural Sciences and Engineering Research Council of Canada

  • B Jesse Shapiro

Canada Research Chairs

  • B Jesse Shapiro

National Institutes of Health (R25 MH 109172)

  • Daisy Chen

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We sincerely thank members of the Garud and Shapiro labs, and Pleuni Pennings, for their feedback during the development of this paper. NRG received support from the Paul Allen Frontiers Group, a University of California Hellman fellowship, a UCLA Faculty Career Development award, and the Research Corporation for Science Advancement. DWC received funding support from NIH R25 MH 109172. BJS was supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant and a Canada Research Chair. We also thank Djordje Bajić and two anonymous reviewers for their constructive suggestions that substantially improved the manuscript.

Ethics

Human subjects: All human-derived samples used in this study were previously published. We include no additional identifiable or sensitive information.

Senior Editor

  1. Detlef Weigel, Max Planck Institute for Biology Tübingen, Germany

Reviewing Editor

  1. Sara Mitri, University of Lausanne, Switzerland

Reviewer

  1. Djordje Bajić, Yale University, United States

Version history

  1. Preprint posted: March 8, 2022 (view preprint)
  2. Received: March 16, 2022
  3. Accepted: February 8, 2023
  4. Accepted Manuscript published: February 9, 2023 (version 1)
  5. Version of Record published: February 28, 2023 (version 2)

Copyright

© 2023, Madi et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,350
    Page views
  • 199
    Downloads
  • 3
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Naïma Madi
  2. Daisy Chen
  3. Richard Wolff
  4. B Jesse Shapiro
  5. Nandita R Garud
(2023)
Community diversity is associated with intra-species genetic diversity and gene loss in the human gut microbiome
eLife 12:e78530.
https://doi.org/10.7554/eLife.78530

Further reading

    1. Ecology
    2. Evolutionary Biology
    Hannah J Williams, Vivek H Sridhar ... Amanda D Melin
    Review Article

    Groups of animals inhabit vastly different sensory worlds, or umwelten, which shape fundamental aspects of their behaviour. Yet the sensory ecology of species is rarely incorporated into the emerging field of collective behaviour, which studies the movements, population-level behaviours, and emergent properties of animal groups. Here, we review the contributions of sensory ecology and collective behaviour to understanding how animals move and interact within the context of their social and physical environments. Our goal is to advance and bridge these two areas of inquiry and highlight the potential for their creative integration. To achieve this goal, we organise our review around the following themes: (1) identifying the promise of integrating collective behaviour and sensory ecology; (2) defining and exploring the concept of a ‘sensory collective’; (3) considering the potential for sensory collectives to shape the evolution of sensory systems; (4) exploring examples from diverse taxa to illustrate neural circuits involved in sensing and collective behaviour; and (5) suggesting the need for creative conceptual and methodological advances to quantify ‘sensescapes’. In the final section, (6) applications to biological conservation, we argue that these topics are timely, given the ongoing anthropogenic changes to sensory stimuli (e.g. via light, sound, and chemical pollution) which are anticipated to impact animal collectives and group-level behaviour and, in turn, ecosystem composition and function. Our synthesis seeks to provide a forward-looking perspective on how sensory ecologists and collective behaviourists can both learn from and inspire one another to advance our understanding of animal behaviour, ecology, adaptation, and evolution.

    1. Ecology
    2. Plant Biology
    Daniel Fuks, Yoel Melamed ... Ehud Weiss
    Research Article

    Global agro-biodiversity has resulted from processes of plant migration and agricultural adoption. Although critically affecting current diversity, crop diffusion from Classical antiquity to the Middle Ages is poorly researched, overshadowed by studies on that of prehistoric periods. A new archaeobotanical dataset from three Negev Highland desert sites demonstrates the first millennium CE&'s significance for long-term agricultural change in southwest Asia. This enables evaluation of the 'Islamic Green Revolution' (IGR) thesis compared to 'Roman Agricultural Diffusion' (RAD), and both versus crop diffusion during and since the Neolithic. Among the finds, some of the earliest aubergine (Solanum melongena) seeds in the Levant represent the proposed IGR. Several other identified economic plants, including two unprecedented in Levantine archaeobotany-jujube (Ziziphus jujuba/mauritiana) and white lupine (Lupinus albus)-implicate RAD as the greater force for crop migrations. Altogether the evidence supports a gradualist model for Holocene-wide crop diffusion, within which the first millennium CE contributed more to global agricultural diversity than any earlier period.