1. Genetics and Genomics
Download icon

Single-amino acid variants reveal evolutionary processes that shape the biogeography of a global SAR11 subclade

  1. Tom O Delmont
  2. Evan Kiefl
  3. Ozsel Kilinc
  4. Ozcan C Esen
  5. Ismail Uysal
  6. Michael S Rappé
  7. Steven Giovannoni
  8. A Murat Eren  Is a corresponding author
  1. The University of Chicago, United States
  2. University of Chicago, United States
  3. University of South Florida, United States
  4. University of Hawaii at Manoa, United States
  5. Oregon State University, United States
  6. Marine Biological Laboratory, United States
Research Article
  • Cited 0
  • Views 903
  • Annotations
Cite this article as: eLife 2019;8:e46497 doi: 10.7554/eLife.46497

Abstract

Members of the SAR11 order Pelagibacterales dominate the surface oceans. Their extensive diversity challenges emerging operational boundaries defined for microbial 'species' and complicates efforts of population genetics to study their evolution. Here, we employed single-amino acid variants (SAAVs) to investigate ecological and evolutionary forces that maintain the genomic heterogeneity within ubiquitous SAR11 populations we accessed through metagenomic read recruitment using a single isolate genome. Integrating amino acid and protein biochemistry with metagenomics revealed that systematic purifying selection against deleterious variants governs non-synonymous variation among very closely related populations of SAR11. SAAVs partitioned metagenomes into two main groups matching large-scale oceanic current temperatures, and six finer proteotypes that connect distant oceanic regions. These findings suggest that environmentally-mediated selection plays a critical role in the journey of cosmopolitan surface ocean microbial populations, and the idea ‘everything is everywhere but the environment selects’ has credence even at the finest resolutions.

https://doi.org/10.7554/eLife.46497.001

Introduction

The SAR11 order Pelagibacterales (Thrash et al., 2011; Ferla et al., 2013) is one of the most ubiquitous free-living lineages of heterotrophic bacteria in the world’s oceans (Giovannoni et al., 1990; Morris et al., 2002; Carlson et al., 2009; Eiler et al., 2009; Schattenhofer et al., 2009; Treusch et al., 2009). Successful cultivation efforts and single amplified genomes from the environment have led to studies revealing their critical role in marine carbon cycling (Rappé et al., 2002; Giovannoni et al., 2005; Stingl et al., 2007; Oh et al., 2011; Tsementzi et al., 2016; White et al., 2019), and environmental sequencing surveys have offered detailed insights into the ecology of this ancient branch of life in aquatic environments across the globe (Zinger et al., 2011; Brown et al., 2012).

The evolution of SAR11 is an active area of research (Giovannoni, 2017) that is critically important to understanding the determinants of its remarkable ability to maintain abundant populations in the global ocean. The evolutionary origins of SAR11 and thus its precise placement in the Tree of Life is debated (Thrash et al., 2011; Rodríguez-Ezpeleta and Embley, 2012; Ferla et al., 2013; Viklund et al., 2013), and our understanding of the evolutionary processes that define the biogeography of SAR11 cells is not complete. At the level of major SAR11 clades, previous studies have attributed markedly distinct patterns of distribution in the global ocean to both niche-based (Brown et al., 2012; Eren et al., 2013a) and neutral processes (Manrique and Jones, 2017). At the level of individual populations, a key simulation by Hellweger et al. (2014) showed that the intra-population sequence divergence that reflects the geographic patterns of distribution for SAR11 cells could emerge solely as a function of ocean currents, without selection (Hellweger et al., 2014). Between the extremes of inter-clade and intra-population diversity lies a wealth of variation that potentially can yield insights into the ecological and genetic forces that determine genomic diversity and fitness between closely-related, naturally occurring SAR11 populations.

High-throughput sequencing of metagenomes provides access to genome-wide heterogeneity within environmental populations (Simmons et al., 2008), and current computational strategies can reveal associations between ecological parameters and microdiversity patterns at various levels of resolution (Eren et al., 2015; Scholz et al., 2016; Nayfach et al., 2016; Costea et al., 2017; Truong et al., 2017). However, SAR11 poses multiple challenges for such investigations, including their remarkable intra-population genomic diversity and the limited success of reconstructing SAR11 genomes from metagenomic data. Comprehensive investigations of the genetic contents of naturally occurring microbial populations (see Denef, 2018) for a review) often rely on population genomes directly reconstructed from metagenomes (Simmons et al., 2008; Bendall et al., 2016; Anderson et al., 2017; Garcia et al., 2018). While advances in genome-resolved metagenomics have made microbial clades more accessible without cultivation (Spang et al., 2015; Brown et al., 2015; Anantharaman et al., 2016), reconstructing SAR11 genomes from the surface ocean remains a difficult endeavor, as evident in recent comprehensive surveys of metagenome-assembled genomes (MAGs) from seawater samples from around the globe (Tully et al., 2018; Delmont et al., 2018). In the absence of population genomes recovered directly from the environment, genomes from isolates can also offer insights into environmental populations through genome-wide recruitment analyses in which short metagenomic reads are aligned to a reference (Denef, 2018).

Using metagenomic read recruitment to investigate the structure of environmental populations is confounded by the challenge of defining the boundaries of microbial populations. Without an established species concept in microbiology, defining units of microbial diversity and their boundaries is a significant challenge (see Shapiro, 2018 and Cohan, 2019 for discussions). Nevertheless, from analyses of isolated microbial strains with formal taxonomic descriptions, a genome-wide average nucleotide identity (gANI) cutoff of 95% emerged as an operational delineation of species (Konstantinidis and Tiedje, 2005; Varghese et al., 2015) and was confirmed in a recent analysis of eight billion pairwise comparisons of whole genomes (Jain et al., 2018). Both gANI calculations using complete genomes, as well as the average nucleotide identity of metagenomic short reads (ANIr) recruited from environmental metagenomes using reference genomes, show an interesting discontinuity among sequence‐discrete populations at sequence identity levels between 80% and 90–95% (Konstantinidis and DeLong, 2008; Caro-Quintero and Konstantinidis, 2012; Jain et al., 2018). Regardless of their theoretical significance, these cutoffs are essential for multiple practical purposes, such as the identification and subsequent exclusion of metagenomic reads that originate from non-target environmental populations, to avoid inflating variants arising from contaminating non-specific reads in microbial population genetics studies.

Interestingly, the boundaries of environmental SAR11 populations appear to not comply with the 95% ANIr cutoff. For instance, Tsementzi et al. (2016) observed substantial sequence diversity within sequence-discrete SAR11 subclades in the environment, and suggested that an ANIr as low as 92% would be required to adequately define the boundaries of the SAR11 populations recovered in their study (Tsementzi et al., 2016). These findings are consistent with a comprehensive study of isolate genomes and marine metagenomes by Nayfach et al. (2016), which suggested that SAR11 is one of the most genetically heterogeneous marine microbial clades (Nayfach et al., 2016). The substantial sequence diversity within environmental SAR11 populations not only explains the absence of SAR11 population genomes in genome-resolved metagenomics studies, but also challenges conventional approaches to the study of population genetics in microorganisms. For instance, the multiple occurrence of single-nucleotide variants in individual codon positions would render commonly used computational strategies that classify synonymous and non-synonymous variations based on independent nucleotide sites (such as in Schloissnig et al., 2013; Bendall et al., 2016) unfeasible. Despite these challenges, SAR11, with its ubiquity in surface seawater samples, extensive diversity in sequence space, and unique evolutionary history, remains one of the exciting puzzles of contemporary microbiology.

Here we investigated the evolutionary processes that maintain genetic diversity within a natural SAR11 lineage accessible through a single isolate genome that recruited more than 1% of surface ocean metagenomic reads from a global dataset. Using single-amino acid variants, we were able to (1) delineate multiple proteotypes whose distributions were more closely linked to large-scale oceanic current temperatures than they were to geographic proximity, and (2) resolve positive and negative selection mediated by temperature and its co-variables. Our findings suggest that environmentally mediated selection, rather than neutral processes, dominate the biogeographic partitioning of SAR11 at fine scales of taxonomic resolution. Our study also offers new computational approaches to characterize variation within complex microbial populations, including additional means to integrate amino acid and protein biochemistry into microbial population genetics.

Results and discussion

To find the most appropriate SAR11 isolate genome to study the population genetics of naturally occurring SAR11, we used the complete genomes of 21 SAR11 isolates in a competitive recruitment of short reads from 103 metagenomes. Most of these metagenomes were from the TARA Oceans Project (Sunagawa et al., 2015), and correspond to 93 stations across four oceans and two seas. We also included an additional 10 metagenomes from the Ocean Sampling Day Project (Kopf et al., 2015) to cover high-latitude areas of the Northern hemisphere. All metagenomes correspond to small planktonic cells (0.2–3 μm in size) from the surface (0–15 meters depth; n = 71) and deep chlorophyll maximum (17–95 meters depth; n = 32) layers of the water column (Supplementary file 1a). The isolates we used belonged to SAR11 subclades Ia.1 (n = 6), Ia.3 (n = 11), II (n = 1), IIIa (n = 2) and the related alphaproteobacterium Va (n = 1) (Supplementary file 1b), which collectively recruited 1,029,716,339 reads from all metagenomes, or 3.3% of the dataset (Supplementary file 1c).

The metapangenome of SAR11

To investigate associations between ecology and gene content of SAR11 lineages, we first performed a pangenomic analysis in conjunction with read recruitment from the metagenomic data. The pangenome of SAR11 genomes consisted of all 29,719 genes grouped into 6175 gene clusters (Supplementary file 1d). The clustering of genomes based on shared gene clusters (Supplementary file 1e) matched that of the previously described phylogenetic clades (Grote et al., 2012) (Figure 1A; an interactive version of which is available at http://anvi-server.org/p/4Q2TNo). The SAR11 pangenome across metagenomes (i.e., the SAR11 metapangenome) revealed distinct distribution patterns for each clade within SAR11 (Figure 1A). Clade Ia recruited the most reads compared to other clades (Supplementary file 1b), consistent with previous studies that found this clade to be highly abundant in surface seawater (Field et al., 1997; Brown et al., 2012; Eren et al., 2013a; Manrique and Jones, 2017). Gene clusters divided clade Ia into two main clusters corresponding to the high-latitude subclade Ia.1 and the low-latitude subclade Ia.3 (Figure 1A). While all high-latitude genomes displayed a bi-polar geographic distribution in the metagenomic dataset, gene clusters in low-latitude genomes revealed multiple sub-groups that also showed different patterns of geographic distribution (Figure 1A). This emphasized the need to further refine subclade 1a.3, in which each genome pair had over 98.6% sequence identity at the 16S rRNA gene level (Supplementary file 1f). Our consideration of geographical co-occurrence patterns, phylogenomic characteristics, and pangenomic properties in this metapangenome revealed six subclades within 1a.3 with cultured representatives (Figure 1A, also see Supplementary file 1g for gANI estimates between SAR11 genomes). We tentatively name them SAR11 subclade 1a.3.I (HTCC7211, HTCC7214 and HTCC7217; gANI of >93% and 16S rRNA gene identity of >99.4%), 1a.3.II (HIMB5), 1a.3.III (HIMB4 and HIMB1321; gANI of 94.8% and 16S rRNA gene identity of 100%), 1a.3.IV (HTCC8051 and HTCC9022; gANI of 86.9% and 16S rRNA gene identity of 100%), 1a.3.V (HIMB83) and 1a.3.VI (HIMB122 and HIMB140; gANI of 94.6% and 16S rRNA gene identity of 99.7%). Overall, the refinement of SAR11 subclades reveals a striking agreement between phylogeny, pangenome, and the ecology of the members of the SAR11 clade Ia.

Figure 1 with 1 supplement see all
The SAR11 metapangenome.

Panel A describes the pangenome of 21 SAR11 isolate genomes based on the occurrence of 6175 gene clusters, in conjunction with their phylogeny (clade level) and relative distribution of recruited reads in 103 metagenomes ordered by latitude from the North Pole to the South Pole (top right heat map). The relative distributions were displayed for a minimum value of 0.1% and a maximum value of 1%. The layer named ‘Core 1a.3.V genes’ displays the occurrence of the 799 core 1a.3.V genes (in green) and those found in HIMB83 but not in the 1a.3.V lineage (in purple). Panel B describes the relative distribution of reads the 799 core 1a.3.V genes recruited across surface metagenomes from TARA Oceans.

https://doi.org/10.7554/eLife.46497.002

A remarkably abundant and widespread SAR11 lineage at low latitudes

While Ia.3 was the most abundant SAR11 subclade in our dataset, the new subclades we defined in this group differed remarkably in their competitive recruitment of short reads from metagenomes (Figure 1A, Supplementary file 1b). For example, while the least abundant subclade (1a.3.II; represented by HIMB5) recruited 22.6 million reads, the most abundant one (1a.3.V; represented by HIMB83), recruited 390.9 million reads, or 1.18% of the entire metagenomic dataset (Supplementary file 1b). For perspective, this is roughly two times more reads than the most abundant Prochlorococcus isolate genome recruited from the same dataset (Delmont and Eren, 2018) (Supplementary file 1h). Strain HIMB83 contains a 1.4 Mbp genome with 1470 genes, and was isolated from coastal seawaters off Hawai’i, USA. But it also recruited large numbers of reads from locations that were distant to the source of isolation (Supplementary file 1c). The gANI between HIMB83 and the most similar genome in our dataset, HIMB122 (1a.3.VI) was 82.6%, and the remarkable abundance of HIMB83 has also been recognized by others (Brucks, 2014; Nayfach et al., 2016). To the best of our knowledge, 1a.3.V is the most abundant and widespread SAR11 subclade in the euphotic zone of low-latitude oceans and seas.

Although it is a member of the subclade 1a.3.V, the genomic context HIMB83 provides does not exhaustively describe the gene content of all members of 1a.3.V. Nevertheless, it gives access to the core 1a.3.V genes through read recruitment. To identify core 1a.3.V genes, we used a conservative two-step filtering approach. First, we defined a subset of the 103 metagenomes within the main ecological niche of 1a.3.V using genomic mean coverage values (Supplementary file 1c). Our selection of 74 metagenomes in which the mean coverage of HIMB83 was >50X encompassed three oceans and two seas between −35.2° and +43.7° latitude, and water temperatures at the time of sampling between 14.1°C and 30.5°C (Figure 1—figure supplement 1, Supplementary file 1i). We then defined a subset of HIMB83 genes as the core 1a.3.V genes if they occurred in all 74 metagenomes and their mean coverage in each metagenome remained within a factor of 5 of the mean coverage of all HIMB83 genes in the same metagenome. This criterion accounted for biological characteristics influencing coverage values in metagenomic surveys of the surface ocean such as cell division rates and variations in coverage as a function of changes in GC-content throughout the genomic context. Figure 1—figure supplement 1 displays the coverage of all HIMB83 genes across all metagenomes, and Supplementary file 1j reports underlying coverage statistics. While the 799 genes that met these criteria systematically occurred within the niche boundaries of 1a.3.V, 40% of the remaining 671 HIMB83 genes that were filtered out were present in five or fewer metagenomes and coincided with hypervariable genomic loci (Figure 1—figure supplement 1). Hypervariable genome regions are common features of surface ocean microbes (Coleman et al., 2006; Zaremba-Niedzwiedzka et al., 2013; Kashtan et al., 2014; Delmont and Eren, 2018) that are not readily addressed through metagenomic read recruitment but do influence pangenomic trends. Here, less than 10% of gene clusters unique to HIMB83 were among core 1a.3.V genes (Figure 1A), indicating HIMB83’s unique genes are mostly accessory to the members of 1a.3.V. In contrast, more than 80% of gene clusters that were core to the 21 SAR11 genomes matched to the core 1a.3.V genes. The overlap between environmental core genes of 1a.3.V revealed by the metagenomic read recruitment and the genomic core of SAR11 revealed by the pangenomic analysis of isolate genomes suggests that these genes represent a large fraction of the 1a.3.V genomic backbone (Figure 1A). Core 1a.3.V genes recruited on average 1.25% of reads in the 74 metagenomes (Figure 1B, Supplementary file 1j). The broad geographic prevalence of core 1a.3.V genes represents a unique opportunity to study the population genetics of an abundant marine microbial subclade across distant geographies.

SAR11 subclade 1a.3.V maintains a substantial amount of genomic heterogeneity

To investigate the amount of genomic heterogeneity within 1a.3.V, we first studied individual short reads that the HIMB83 genome recruited from metagenomes. The percent identity of reads that matched to the 799 core 1a.3.V genes ranged from 88% to 100% (Figure 2), which is considerably more diverse than those observed in similar reference-based metagenomic studies (Konstantinidis and DeLong, 2008; Tsementzi et al., 2016; Meziti et al., 2019). Notably, we also observed similar trends for the other SAR11 genomes included in this study (Figure 2—figure supplement 1), suggesting that the relatively high sequence diversity observed among core 1a.3.V genes may be a characteristic shared with other SAR11 lineages in the surface ocean.

Figure 2 with 2 supplements see all
Statistics of recruited reads.

Left panel shows percent identity distributions in each of the 74 metagenomes. Curves are colored based on height. Metagenomes are ordered according to how the percent identity distributions hierarchically cluster based on Euclidean distance (dendrogram). Right panels display a summary of distribution statistics for each percent identity distribution compared against in situ temperature in a linear regression (correlations to all other available parameters are summarized in Figure 2—figure supplement 2). Each point is a metagenome and black lines are lines of best fit. For visual clarity, the data in left panel considers only the median read length and interpolates between data points, whereas the data in right panels consider all read lengths with no interpolation.

https://doi.org/10.7554/eLife.46497.004

Overall, our data confirm that ANIr values of >95% used previously to delineate sequence-discrete populations does not apply to SAR11. One immediate implication of this substantial amount of sequence diversity that defies previous empirical observations is our inability to explicitly define what we are accessing in the environment. This challenge is partially because a precise and exhaustive description of what constitutes a ‘population’ remains elusive (Cohan and Perry, 2007; Shapiro and Polz, 2015; Cohan, 2019), which creates significant practical challenges (Rocha, 2018), such as the accurate determination of the boundaries of naturally occurring microbial populations especially in metagenomic read recruitment results. Nevertheless, the term ‘population’ is frequently used in literature (Simmons et al., 2008; Kashtan et al., 2014; Bendall et al., 2016), which implies that Charles Darwin’s observation in his historical work ‘On the Origin of Species’ continues to summarize our struggle in life sciences to describe theoretical boundaries of fundamental units of life even though contemporary enviornmental microbiology has gone beyond the term species in this pursuit: 'no one definition of species has yet satisfied all naturalists; yet every naturalist knows vaguely what [they mean] when [they speak] of a species’ (Darwin, 1859). Our study is not well-positioned to offer a precise theoretical definition for the term 'population', either. Instead, similar to previous studies, we resort to an operational definition that suggests a population is 'an agglomerate of naturally occurring microbial cells, genomes of which are similar enough to align to the same genomic reference with high sequence identity’ (Delmont and Eren, 2018; also see Denef, 2018 and references therein for a comprehensive discussion of what constitutes a population from a metagenomic perspective). By outsourcing the hypothetical radius of a population in sequence space to the minimum sequence identity of short reads recruited from metagenomes, this approach offers a practical means to study very closely related environmental sequences without invoking theoretical considerations. The broad heterogeneity continuum that possesses no discernible sequence-discrete components we observed within the narrow sequence set defined this way, i.e., the metagenomic reads that match competitively to conserved HIMB83 genes (Figure 2), supports the assumption that this set originates within a population boundary (Figure 1—figure supplement 1). However, due to the incomplete theoretical foundation and limitations associated with the use of short metagenomic reads, in discussions here we more conservatively assume that our reads originate from multiple closely related yet intertwined SAR11 populations within subclade 1a.3.V.

Both high recombination rates between cells displaying low gANI values and frequent transfer of adaptive genes between ecologically distinct clades could explain the high-level of cohesion between SAR11 populations in the surface ocean (Cohan, 2019; Vergin et al., 2007). The high density of closely related 1a.3.V cells in the surface ocean suggests the strength of these two forces could be high within populations as well. At least two hypotheses reconcile extensive SAR11 sequence diversity and aide in understanding its implications. One hypothesis is that the members of 1a.3.V we access are in the process of evolving into multiple sequence-discrete populations and we are simply observing an emerging fork in the evolutionary journey of SAR11. Alternatively, the observed diversity may represent a cloud of random sequence variants akin to a quasispecies (Domingo et al., 2012). To examine these hypotheses, we tested the correlation between basic statistical properties of these curves (i.e., mean, standard deviation, and skewness) and environmental parameters via linear regression (Figure 2—figure supplement 2, Supplementary file 1k). This analysis revealed a significant correlation between in situ temperature and distribution shape (mean p-value: 2.0×10-3; standard deviation p-value: 3.4×10-8; skewness p-value: 1.0×10-8), which suggests a strong influence of temperature and its co-variables on the sequence heterogeneity within 1a.3.V (Figure 2) and is incompatible with the hypothesis of random sequence variants.

SAAVs: Accurate characterization of non-synonymous variation

Percent identity distributions are useful to assess overall alignment statistics of short reads to a reference; however, they do not convey information regarding allele frequencies, their functional significance, or association with biogeography. To bridge this gap, we implemented a framework to characterize amino acid substitutions in metagenomic data and to study genomic variation that impacts amino acid sequences (see Materials and methods). Briefly, our approach employs only metagenomic short reads that cover all three nucleotides in a given codon to determine the frequency of single-amino acid variants (SAAVs) in translated protein sequences. While synonymity is a codon characteristic, in practice it is often determined from a single-nucleotide variant (SNV) with the assumption that the two remaining nucleotides are invariant. However, populations with extensive nucleotide variation can violate this assumption. Indeed, in the case of the core 1a.3.V genes, on average 22.5% of SNVs per metagenome co-occurred with other SNVs in the same codon. Thus, quantifying frequencies of full codon sequences as implemented in the SAAV workflow is a requirement to correctly assess synonymity.

Among the 799 core 1a.3.V genes and 74 metagenomes, we identified 1,074,096 SAAVs in which >10% of amino acids diverged from the consensus (i.e., the most frequent amino acid for a given codon position and metagenome). The SAAV density (the percentage of codon positions that harbor a SAAV) of core 1a.3.V genes averaged 5.76% and correlated with SNV density (19.3% on average) across the 74 metagenomes (linear regression, p-value <2.2×1016; R2: 0.90; Figure 1—figure supplement 1 and Supplementary file 1L). SNV and SAAV density metrics did not decrease in metagenomes sampled closest to the source of isolation (Supplementary file 2a, 2b, and 2c), suggesting that the location of isolation for strain HIMB83 does not predict the biogeography and population genetics of 1a.3.V. To improve downstream beta-diversity analyses, we discarded codon positions if their coverage in any of the 74 metagenomes was <20X, which resulted in a final collection of 738,324 SAAVs occurred in 37,416 codon positions that harbored a SAAV in at least one metagenome among the total of 252,333 codon positions (14.8%) within the core 1a.3.V genes (Supplementary file 2d). We considered a protein to be ‘invariant’ (i.e., absence of variation due to intensive purifying or positive selection) in a given metagenome if it lacked SAAVs. They were rare in our data: in total, we detected 2,548 invariant proteins (only 4.3% of all possibilities across the 74 metagenomes) that encompassed only 113 genes (Supplementary file 2e). In addition, all genes, except one 679 nucleotide long ABC transporter (gene id 1469), contained at least one SAAV in at least one metagenome (Supplementary file 2d), revealing a wide range of amino acid sequence diversification among core 1a.3.V proteins.

Hydrophobicity influences the strength of purifying selection acting on amino acids

To understand how commonly each amino acid was found in variant sites, we compared the amino acid composition of SAAVs to the amino acid composition of the core 1a.3.V genes (see Materials and methods). In a scenario in which amino acids are as common in SAAVs as they are across all 799 core genes, the frequency that an amino acid occurred in SAAVs (variant sites) would share one-to-one correspondence with its frequency within the core genes (all sites). While these variables were correlated (linear regression, p-value: 9.8×10-6; R2: 0.65), we observed large deviations from this null expectation, implying strong differential occurrence of amino acids in SAAVs relative to their occurrence in core genes (Figure 3A, Figure 3—figure supplement 1, Supplementary file 2f). All negatively charged (Asp, Glu) and uncharged polar (Thr, Asn, Ser, Gln) amino acids were significantly enriched in SAAVs compared to the core 1a.3.V genes (Figure 3A). For instance, while asparagine made up only 6.34% of all amino acids in the core genes, on average 10.7% (±0.16%) of SAAVs involved asparagine substitutions across the 74 metagenomes (Supplementary file 2f). Interestingly, unlike negatively charged amino acids, positively charged amino acids did not exhibit substantial differences (<4% deviation between core 1a.3.V genes and SAAVs). Thus, hydrophilic amino acids were either overrepresented or exhibited little change in SAAVs with respect to their frequency within core genes. In stark contrast, all hydrophobic amino acids, with the very notable exceptions of isoleucine and valine, were underrepresented in SAAVs (Figure 3A, Figure 3—figure supplement 1, Supplementary file 2f).

Hydrophobic interactions within the solvent inaccessible core of proteins are known to be critical for maintaining the stability required for folding and activity, which enforces a strong purifying selection placed on mutations occurring in buried (solvent inaccessible) positions (Bustamante et al., 2000; Chen and Zhou, 2005; Worth et al., 2009). Since hydrophobic amino acids form the majority of buried positions, they are on average under stronger purifying selection, which is the likely explanation for the underrepresentation of hydrophobic amino acids within SAAVs. On the other hand, mutations in exposed (solvent accessible) positions on the surface of proteins are tolerated more, as they are less likely to disrupt protein architecture. Overall, our compositional analysis revealed that the occurrence of amino acids in SAAVs is roughly correlate with the occurrence of amino acids within the core 1a.3.V genes, and that deviations from this expectation are driven in part by levels of purifying selection that depend upon the suitability of an amino acid’s hydrophobicity for a given physicochemical environment (Figure 3—figure supplement 1).

Amino acid exchange rates reveal hallmarks of neutral, purifying, and adaptive evolution

Next, we sought to investigate amino acids that co-occur in variable sites. SAAVs were often dominated by a few amino acids; hence, the frequency vector for a given SAAV contained many zero values. To reduce sparsity, we first simplified our data by associating each SAAV with an amino acid substitution type (AAST), defined as the two most frequent amino acids in a given SAAV. In 738,324 SAAVs, we observed 182 of 210 theoretically possible unique AASTs and a highly skewed AAST frequency distribution (Supplementary file 2g, Figure 3B boxplots). For example, the two most frequent AASTs, ‘isoleucine/valine’ and ‘aspartic/glutamic acid’, together comprised 20% of all SAAVs (Figure 3B). This is not surprising, since the amino acids in both of these AASTs (1) are common in the genome, (2) share very similar chemical structure (both differing by only a single methylene bridge), and (3) can be substituted through a single nucleotide substitution. On the other hand, the ‘glycine/tryptophan’ pair represents an opposite example: these amino acids (1) are uncommon in the genome, (2) share no chemical or structural similarity to one another, and (3) can only be substituted through a triple nucleotide substitution. Expectedly, ‘glycine/tryptophan’ was exceedingly rare in our data and occurred only once in 738,324 SAAVs (Supplementary file 2h).

Figure 3 with 4 supplements see all
Physico-chemical properties of amino acid variants.

The top panel describes the structure of 20 amino acids grouped by their main chemical properties. Panel A describes the solvent accessibility of amino acids, their relative distribution in both the core 1a.3.V genes and SAAVs, and their percentage increase in SAAVs as compared to the core 1a.3.V genes. The solvent accessibility of amino acids derives from the analysis of 55 proteins (Bordo and Argos, 1991). Panel B describes the relative abundance of the top 25 most prevalent amino acid substitution types (AASTs) across 74 metagenomes (boxplots), along with the classes their amino acids belong to and the correlation coefficient between AAST prevalence and in situ temperature calculated via linear regression (see Figure 3—figure supplement 2 for p-values). The area shaded in light gray shows bounds for the expected frequency distribution given strictly neutral processes. The upper bound is Model one and the lower bound is Model 2 (see Materials and methods). The four insets example the relationship between AAST prevalence and in situ temperature for the AASTs 'aspartic/glutamic acid', 'isoleucine/threonine', 'alanine/serine', and 'leucine/phenylalanine' (Figure 3—figure supplement 2 illustrate similar plots for all 25 of the most prevalent AASTs). The 25 AASTs included in the analysis cover 87.1% of all SAAVs. Panel C displays SAAVs on the predicted protein structures of four core 1a.3.V genes across six metagenomes from distant locations.

https://doi.org/10.7554/eLife.46497.007

While such a skewed AAST frequency distribution cannot be explained by strictly random mutational process (Figure 3B light-gray shaded area), it is compatible with standard theories of neutral or nearly-neutral evolution, since such theories consider the role of purifying selection (Ohta and Gillespie, 1996). Within subclade 1a.3.V the distribution of AAST frequencies was notably constrained across geographies (Figure 3B). For example, the relative standard deviation of ‘aspartic/glutamic acid’ frequencies across the 74 metagenomes was just 3.0%, and the statistical spread of other AASTs was comparable (Figure 3B). The overall consistency of AAST frequency distributions across geographies supports the hypothesis that purifying selection controls the permissibility of amino acid exchangeability within 1a.3.V and enables an interpretation of these data through a neutral model: SAAVs composing the AAST frequency distribution represent primarily neutral mutations that have drifted to measurable levels, and the lack of SAAVs in AASTs of dissimilar amino acids that likely represent deleterious mutations reflect the influence of purifying selection. However, a closer inspection reveals a subtle divergence of amino acid exchangeabilities that correlates with water temperature and/or its co-variables (Figure 3B insets, Figure 3—figure supplement 2). Note that this divergence is AAST specific; for example, positions with mixed proportions of glutamic and aspartic acid are less commonly found in warm waters (linear regression, uncorrected p-value: 2.7×10-6), yet for isoleucine and valine such a correlation is nonexistent (linear regression, uncorrected p-value: 0.418). These findings suggest that amidst a signal that is predominantly indicative of purifying selection, there appears to be a fingerprint of adaptive/divergent processes caused by temperature and/or its co-variables that subtly shift the mutational profile within 1a.3.V. We were unable to attribute the magnitude or direction of these correlations to differences between amino acids (i.e., changes in hydrophobicity, size, or charge). This was likely due to the insufficiency of characterizing SAAVs with only the chemical properties of the involved amino acids, and disregarding position-specific information, such as the surrounding physicochemical environment that can only be studied with knowledge of the protein’s structure.

To address this shortcoming, we next sought to link SAAVs to predicted protein structures of the core 1a.3.V genes, 436 of which had significant matches in Protein Data Bank for template-based structure modeling (see Materials and methods). Placing SAAVs on predicted protein structures revealed that their occurrence was not randomly distributed but was instead strongly dependent on the local physicochemical environment of the structure (Figure 3C, Supplementary file 3a and http://data.merenlab.org/sar11-saavs). Within the subset of the 1a.3.V proteome accessible to us, we found that buried amino acids (0-10% relative solvent accessibility) were approximately 4.4 times less likely to be variant than those that were exposed (41-100% relative solvent accessibility) (ANOVA, p-value: <2×1016). This observation was strikingly apparent in TIM barrels, where SAAVs mostly occurred in the outer alpha helix and loop regions (e.g., Figure 3C gene 2,128). This trend directly confirmed our previous inference (based on the underrepresentation of hydrophobic amino acids) that solvent inaccessible positions are subject to higher levels of purifying selection and thus contain fewer SAAVs. The local physicochemical environment therefore shapes variation, and visual inspection of Figure 3C indicates that this is conserved across distant geographies; that is positions that vary in one metagenome are likely to vary in others, as well. Overall, 91.7% of variant positions in the core 1a.3.V genes varied in 10 or more metagenomes, and 21.7% varied in all 74 metagenomes (Supplementary file 2i).

Temperature correlates with amino acid allele frequency trajectories

In addition to considering patterns of variability that emerged when we pooled data across 37,416 codon positions exhibiting variation within the core 1a.3.V genes, we also investigated the allele frequency trajectories of individual positions (i.e., the relative frequency between the two most prevalent amino acids across the 74 metagenomes) and sought to identify those that correlate with in situ temperature and/or its co-variables. Amino acid allele frequencies in 4592 of the 37,416 positions were correlated with temperature (Supplementary file 3b; Benjamini–Hochberg multiple testing correction on linear regression p-values, false discovery rate 5%). Figure 3—figure supplement 3 illustrates example cases and correlation statistics per AAST. It is statistically implausible that such correlations with temperature could have arisen from neutral evolution, given that distant oceans share similar temperatures (Supplementary file 1a). It is therefore most plausible to conclude that these allele frequency trajectories are the result of environmentally mediated selection. Although we note that, considering the pervasive effect of genetic hitchhiking in microbial evolution (Good et al., 2017), variation in a considerable fraction of positions may be neutral despite their association with temperature.

We then sought to investigate which positions are under selection, and whether the variation at these positions can be explained by differing levels of purifying selection, or diversifying selection that could be evidence of adaptive evolution. Scrutinizing all 4592 positions to address these critical questions is an intractable problem, so we narrowed our focus to genes possessing disproportionately high ratios of temperature-correlated to temperature-uncorrelated SAAV positions, since we expected this to be a reasonable criterion for identifying likely candidates of adaptive evolution (Supplementary file 3c). Of the 10 genes fitting this criterion (see Materials and methods), the permease subunit of a glycine betaine ATP-binding cassette (ABC) transporter stood out due to its appreciated relevance to SAR11 biology: glycine betaine transporters of SAR11 are highly translated proteins in the environment and transport osmolyte compounds into cells for energy production (Noell and Giovannoni, 2019). To investigate the positioning of amino acids in the tertiary structure of the permease relative to the cellular membrane, we first categorized the location of each residue as transmembrane, cytosolic (inside the inner membrane), or periplasmic (outside the inner membrane) (Figure 3—figure supplement 4). Positions that were not correlated with temperature were commonly transmembrane, and infrequently periplasmic. In contrast, most positions that correlated with temperature were periplasmic (Figure 3—figure supplement 4). The probability of observing a similar distribution between temperature-correlated and temperature-uncorrelated positions across transmembrane, periplasmic, and cytosolic regions was only 0.034 (analytic trinomial test, temperature-uncorrelated distribution as prior), which indicates temperature-correlated positions are subjected to unique evolutionary forces. A previous study suggested that periplasmic residues of transmembrane proteins undergo higher rates of adaptive evolution due to their increased exposure to changing environmental conditions (Sojo et al., 2016). This observation lends additional support to the hypothesis that periplasmic SAAV positions within this gene that correlate with temperature are more likely shaped by adaptive processes.

Allele frequency trajectories also provide an opportunity to study the directionality of exchange rates of AASTs. For example, of the 1066 positions dominated by 'alanine/serine' SAAVs, 158 positions correlated with temperature (Figure 3—figure supplement 3). If there was no temperature-driven preference for either amino acid in this subset of positions, the frequency of alanine should positively correlate with temperature as often as the frequency of serine does. Yet this expectation is grossly violated: in 103 of 158 positions alanine frequencies positively correlated with temperature (binomial test, Bonferroni-corrected p-value: 0.004). Overall, this result indicates temperature-dependent amino acid substitution preferences that are independent of site (Figure 3—figure supplement 3).

SAAV partitioning between warm and cold currents

We finally sought to extend the concept of allele frequency tracking at individual SAAV positions to investigate large-scale geographic partitioning of metagenomes. For this, we simplified the 738,324 SAAVs into a presence-absence matrix for codon position-specific AASTs across 74 metagenomes (Supplementary file 2i, also see ‘Recovering codon position-specific AASTs from SAAVs’ in Materials and methods). Of 57,277 codon position-specific AASTs affiliated with 37,415 unique codon positions, we detected 1.94% in all 74 metagenomes, while 33.3% were found in single metagenomes (Supplementary file 2i). To estimate distances between metagenomes based on these data, we used a Deep Learning approach. Briefly, this approach relies on a graph-based activity regularization technique for competitive learning from hyper-dimensional data, modified to reveal latent groups of variants in a fully unsupervised manner through frequent random sampling of variants (Kilinc and Uysal, 2017). Hierarchical clustering of samples based on Deep Learning-estimated distances (Supplementary file 4a) resulted in two main groups: the Western (warm) and Eastern (cold) boundary currents (Figure 4A). High latitude, relatively cold, and relatively nutrient rich waters are the source of Eastern boundary currents, which warm up and typically decline in nutrients as they transit in an equatorial direction. The opposite is true of Western boundary currents, which move poleward. The first group of 41 metagenomes, which matched cold currents (Benguela, Canary, California and Peru), encompassed most metagenomes from the Eastern Pacific Ocean, as well as the East side of the Atlantic Ocean (except near the southern tip of Africa) and the Mediterranean Sea (Figure 4B). The second group of 33 metagenomes, which matched warm currents (Agulhas, Somali, Mozambique, Brazil and Gulf stream), encompassed all metagenomes from the Red Sea and Indian Ocean, as well as metagenomes from the West side of the Atlantic Ocean (Figure 4B). Samples collected from the deep chlorophyll maximum layer of the water column mirrored trends observed in the surface samples (Figure 4—figure supplement 1). The association between SAAVs and ocean current type revealed a strong, global signal at the amino acid-level for 1a.3.V and suggested the presence of two main ecological niches for this lineage. Warm and cold currents are dynamic environments that differ in a host of factors in addition to the latitude and temperature of source waters. Factors that could drive adaptive changes in amino acid sequences between warm and cold currents include major differences in phytoplankton communities, altered composition of dissolved organic carbon pools, and the water temperature itself. Interestingly, the niche defined by cold currents exhibited significantly more SAAVs (ANOVA, p-value:1.66×10-12). This observation could be explained either by (1) extinction/re-emergence events that operate continually on specific codon positions (adaptive evolution), or (2) changes in abundances within a large seed bank of variants due to positive and negative selection as the lineage transits. A recent study using Lagrangian particle tracking and network theory suggested that all regions of the surface ocean are connected to each other with less than a decade of transit (Jönsson and Watson, 2016), which might favor the latter scenario due to lack of time for the extinction and reemergence of variants in abundant marine microbial lineages.

Figure 4 with 5 supplements see all
Biogeography of SAR11 subclade 1a.3.V based on single amino acid variants.

Panel A describes the organization of 74 metagenomes based on 57,277 codon position-specific AASTs affiliated with 37,415 unique codon positions and summarizes the number of detected SAAVs and percent identity of reads HIMB83 recruited for each metagenome. The world map in panel B displays the geographic partitioning of the two main metagenomic groups and six proteotypes. Panel B also describes the relative abundance of 1a.3.V and the number of invariant proteins across the six proteotypes.

https://doi.org/10.7554/eLife.46497.012

To explore more detailed trends of the relationships between metagenomes, we further divided our dendrogram into six sub-clusters based on the elbow of the intra-cluster sum-of-squares curve of k-means clusters (Figure 4—figure supplement 2). These 1a.3.V ‘proteotypes’ grouped samples with similar amino acid variations (Figure 4A) and could not have been predicted from the clustering of samples based on percent identity distributions of short reads alone Figure 4—figure supplement 3). Among the environmental measurements for each metagenome (Supplementary file 4b, and 4c), latitude and temperature at the time of sampling were the most significant predictors of the proteotypes (ANOVA, p-values: 8.56 × 1013 and 3.57 × 10-7, respectively).These two variables were followed by the concentrations of nitrate, phosphate, oxygen, and to a lesser extent, silicate and latitude (Supplementary file 4d). The number of SAAVs and the number of invariant proteins, however, were more significant predictors of these groups compared to all environmental parameters (ANOVA, p-values: <2×1016, Supplementary file 4c, and 4d). Strikingly, most 1a.3.V proteotypes linked samples from distant geographical regions (Figure 4B). An exception to this was the proteotype A, which only contained Pacific Ocean metagenomes (Figure 4B). For instance, proteotypes E and F occurred both in the Indian Ocean and the West side of the Atlantic Ocean and associated with distinct warm currents: E was characteristic of the Mozambique and Brazil currents while F dominated the Agulhas current (Figure 4B). One of the most interesting proteotypes, D, whose reads most closely resembled the HIMB83 genome itself (Figure 4B), contained a distinctively low number of SAAVs, and grouped metagenomes sampled from both sides of the Panama Canal with metagenomes from the Red Sea and North of the Indian Ocean (Figure 4B). We also clustered the same data set using fixation index, a widely-used metric to measure population structure (Weir, 2012), which we modified in accordance with (Schloissnig et al., 2013) to permit multi-allelic variant positions. Both approaches preserved assocaitions between distant geographies (i.e., Proteotype D; Figure 4 and Figure 4—figure supplement 4), however, they were not identical in their organization of metagenomes (i.e., Proteotype E was associated with colder currents according to fixation index rather than warmer ones; Figure 4—figure supplement 4), highlighting the non-trivial nature of establishing individual proteotypes from SAAVs.That said, the significance of in situ temperature to explain clustering of metagenomes into two main groups and six proteotypes was higher with Deep Learning (Figure 4—figure supplement 4), suggesting that Deep Learning was able to better capture the strong association between temperature and the genomic heterogeneity within 1a.3.V through SAAVs.

The striking connection between geographically distant regions of the oceans through SAAVs suggests a likely role for adaptive processes to maintain the genomic heterogeneity of closely related SAR11 populations within 1a.3.V (Figure 4—figure supplement 5). In fact, both the main ecological niches and more refined proteotypes indicate that SAAVs are not primarily structured by the global dispersal of water masses but instead tend to link distant geographic regions with similar environmental conditions (Figure 4B). Overall, these results indicate that environmentally-mediated selection is a strong determinant of SAR11 evolution and biogeography.

One question remains: what is the proportion of distinct evolutionary processes acting upon closely related SAR11 populations within 1a.3.V? Offering a precise answer to this critical question is compounded by multiple theoretical and technical factors. These factors include, but are not limited to, (1) the phenomenon of genetic hitchhiking that prevents accurate determination of amino acid positions that likely confer fitness, (2) the metagenomic short-read recruitment strategy that prevents absolute confidence regarding the origin of each fragment, (3) heavy reliance on temperature as the sole environmental stressor to predict associations between environmental parameters and variation due to limited insights into in situ physiochemistry, (4) the lack of a complete understanding of syntrophic relationships between taxa in the environment, and (5) computational bottlenecks to gain rapid and accurate insights into the role of variable amino acid residues even when protein structures are available. With these significant limitations in mind, we could nevertheless speculate that among the 252,333 total codon positions, 37,416 were variable, suggesting purifying selection maintains the conservancy of 85% of the positions within 799 core 1a.3.V genes. Of those 37,416 positions that were within the scope of permissible mutations, 4592 had amino acid frequency trajectories that significantly correlated with temperature, suggesting an upper-bound of 12% for the variable positions that are likely under the influence of temperature-driven adaptive processes, while neutral processes explain at least 88% of the variation. In summary, this global view of the data suggests that among the remarkable amount of variation within some of the most abundant and prevalent microbial populations in the ocean, adaptive evolutionary processes operating on core genes are responsible for variation in about 2% of all codon positions.

Conclusions

We took advantage of billions of metagenomic reads to investigate single-amino acid variants (SAAVs) within the environmental core genes of the remarkably abundant and closely related SAR11 populations within subclade Ia.3.V, which we defined from a SAR11 metapangenome. The results elicit a highly-resolved quantitative description of purifying selection constraining the scope of permissible mutations to those that are not detrimental to protein stability requirements. Of permissible variation, thousands of codon positions harbored allele frequencies that systematically correlated with in situ temperatures and, overall, patterns of amino acid diversity reflected the temperature trends of large-scale ocean currents. This was especially apparent regarding the clear SAAV partitioning between Western and Eastern boundary currents. Previous studies have subdivided SAR11 clade Ia into cold-water (Ia.1) and warm-water (Ia.3) subclades with distinct latitudinal distributions (Brown et al., 2012), and reported sinusoidal oscillations between their abundances as a function of seawater temperature at a single temperate ocean site (Eren et al., 2015). At a much finer evolutionary scale (i.e., closely related populations within Ia.3.V), we observed significantly more protein variants in cold currents and more invariant proteins in warm currents, revealing a global pattern of alternating diversity for SAR11 in surface ocean currents in temperate and tropical latitudes. We were able to track this variation to changes in amino acid sequence preserved by selection.

Trends that emerged from our culture-independent survey of SAR11 were consistent with a recent study that also suggested an important role for environmental and ecological selective processes defining the spatial and temporal distribution of a widespread diatom species (Whittaker and Rynearson, 2017). Overall, these findings suggest that environmentally-mediated selection plays a critical role in the journey of cosmopolitan microbial populations in the surface ocean, lending credence to the idea for marine systems that ‘everything is everywhere but the environment selects’ (Baas-Becking, 1934). However, identifying environmental variables and their contributions to genomic heterogeneity within microbial populations is shrouded by both the dynamism and complexity of natural habitats, as well as the rich evolutionary dynamics that arise even in the simplest of conceivable environments (Good et al., 2017). These formidable challenges stress the importance of designing appropriate experiments to uncover variables that underpin the evolutionary divergence of closely related lineages, and drive transitions between them through space and time.

Materials and methods

The URL http://merenlab.org/data/sar11-saavs contains a reproducible bioinformatics workflow that extends the descriptions and parameters of programs used here for (1) the metapangenome of SAR11 using cultivar genomes, (2) the profiling of metagenomic reads that the cultivar genomes recruited, (3) the analysis of single nucleotide variants using Deep Learning, and (4) the visualization of single nucleotide variants in the context of protein structures.

SAR11 cultivar genomes

Request a detailed protocol

We acquired the genomic content of 21 SAR11 isolates from NCBI and simplified the deflines using anvi’o (Eren et al., 2015). We then concatenated all contigs into a single FASTA file, and generated an anvi’o contigs database, during which Prodigal (Hyatt et al., 2010) v2.6.3 identified open reading frames in contigs, and we annotated them with InterProScan (Zdobnov and Apweiler, 2001) v1.17. Supplementary file 1b reports the main genomic features.

Metagenomic datasets

Request a detailed protocol

We acquired 103 metagenomes from the European Bioinformatics Institute (EBI) repository under the project IDs ERP001736 (n = 93; TARA Oceans project) and ERP009703 (n = 10; Ocean Sampling Day project), and removed noisy reads with the illumina-utils library (Eren et al., 2013b) v1.4.1 (available from https://github.com/meren/illumina-utils using the program ‘iu-filter-quality-minoche’ with default parameters, which implements the method previously described by Minoche et al. (2011). Supplementary file 1a reports accession numbers and additional information (including the number of reads and environmental metadata) for each metagenome.

Pangenomic analysis

Request a detailed protocol

We used the anvi’o pangenomic workflow (Delmont and Eren, 2018) to organize translated gene sequences from SAR11 genomes into gene clusters. Briefly, anvi’o uses BLAST (Altschul et al., 1990) to assess the similarity between each pair of amino acid sequences among all genomes, and then resolves this graph into gene clusters using the Markov Cluster algorithm (Enright et al., 2002). We built the gene clustering metric using a minimum percent identity of 30%, an inflation value of 2, and a maxbit score of 0.5 for high sensitivity. Anvi’o used the occurrence of gene clusters across genomes data, which are also reported in Supplementary file 1e, to compute clustering dendrograms both for SAR11 genomes and gene clusters using Euclidian distance and Ward linkage algorithm.

Estimating distances between isolate genomes based on full-length 16S ribosomal RNA gene sequences

Request a detailed protocol

We used the program 'anvi-get-sequences-for-hmm-hits' (with parameters ‘--hmm-source Ribosomal_RNAs’ and ‘--gene-name Bacterial_16S_rRNA’) to recover full-length 16S ribosomal RNA gene sequences from the anvi’o contigs database for the 21 isolate genomes. We then used PyANI (Pritchard et al., 2016) through the program 'anvi-compute-ani' to estimate pairwise distances between each sequence.

Competitive recruitment and profiling of metagenomic reads

Request a detailed protocol

We mapped reads competitively from each metagenome against a single FASTA file containing all SAR11 genomes using Bowtie2 (Langmead and Salzberg, 2012) v.2.0.5 with default parameters, and converted the resulting SAM files into BAM files using samtools (Li et al., 2009a) v1.3.1. Competitive read recruitment ensures that short reads that match to more than one genome are assigned uniquely and randomly to one of the matching genomes. This minimizes computational biases at the mapping level and avoid inflated coverage statistics. To confirm our observations, we also used BWA (Li and Durbin, 2009b) to recruit reads (with the option n = 0.05). We used anvi’o to generate profile databases from the BAM files and combine these mapping profiles into a merged profile database, which stored coverage and variability statistics as outlined in Eren et al. (2015). Supplementary file 1c reports the mapping results (number of recruited reads, as well as mean coverage and detection statistics) per genome across the 103 metagenomes.

Determining the coverage of HIMB83 genes across metagenomes

Request a detailed protocol

The anvi’o merged profile database contains the coverage of individual genes across metagenomes. We normalized the coverage of HIMB83 genes in each metagenome (summarized in Supplementary file 1j) and calculated their coefficient of gene variation. We used the coefficient of gene variation estimates to identify metagenomes in which HIMB83 was well detected, yet the coverage values of its genes were highly unstable, which is an indicator of non-specific read recruitment from other lineages.

Determining the main ecological niche and core genes of 1a.3.V

Request a detailed protocol

We considered metagenomes in which HIMB83 was sufficiently abundant (mean genomic coverage >50X) with a stable detection of its genes (coefficient of gene variation <1.25) to represent the main ecological niche of 1a.3.V. To determine the core 1a.3.V genes, we first disregarded metagenomes that displayed an unusually high coefficient of gene coverage variation (Figure 1—figure supplement 1), which can indicate non-specific read recruitments from other abundant populations. The 74 metagenomes fitting these criteria are summarized in Supplementary file 1i. We defined the subset of HIMB83 genes as the core 1a.3.V genes if in each of the 74 metagenomes, the mean coverage of a gene remained within a factor of 5 of the mean coverage across all genes. The 799 genes fitting this criterion are summarized in Supplementary file 1j.

Calculation of percent identity distributions of recruited metagenomic short reads

Request a detailed protocol

We used percent identity distributions to broadly characterize how well short reads within a metagenome matched to the reference sequences by which they were recruited. We determined the percent identity for each read as 100×(Nn)/N where n is the number of mismatches to the reference and N is the read length. For simplicity, visualization of these distributions only included reads lengths of which matched to the median read length, and we defined bins to contain only one unique value. For example, if the median length of reads was 100, the bin domains for visualization purposes were 99, 100, 98, 99, 97, 98,,[0,1]. In contrast, all statistical calculations were carried out using all read lengths.

Generating single-nucleotide variants (SNV) data

Request a detailed protocol

We used the program 'anvi-gen-variability-profile' to report variability tables describing the nucleotide frequency (i.e., ratio of the four nucleotides) in recruited metagenomic reads per SNV position. To study the extent of variation of the core 1a.3.V genes across all metagenomes, we instructed anvi’o to report positions with more than 1% variation at the nucleotide level (i.e., at least 1% of recruited reads differ from the consensus nucleotide). To compare the densities of SAAVs to SNVs, we instructed anvi’o to report only positions with more than 10% variation at the nucleotide level. Supplementary file 1c reports the density of SNVs for all SAR11 genomes across all metagenomes. We also used anvi’o to report SNVs for a subset of genes and metagenomes, and by considering only nucleotide positions with a minimum coverage cut-off across metagenomes under consideration. Controlling the minimum coverage of single nucleotide positions across metagenomes improves confidence in variability analyses. Supplementary file 1L reports the SNV density values for all core 1a.3.V genes.

Definitions of ‘SAAV’, ‘allele frequency’ and ‘AAST’

Request a detailed protocol

A single amino acid variant (SAAV) is a codon position that exhibits variation in a metagenome, and the unique identifier of a SAAV is a single codon position and a metagenome. The position of a SAAV in the reference sequence, and a vector of 21 elements that contain the allele frequencies of each amino acid as well as the stop codon fully characterize a SAAV. The allele frequency of an amino acid is equal to the number of short reads that fully cover the codon that resolves to the amino acid, divided by the total number of reads that fully cover the same position (the sum of all 21 allele frequencies is therefore 1). We also attributed to each SAAV an amino acid substitution type (AAST), which corresponds to the two amino acids with the largest and second largest allele frequencies.

Generating single-amino acid variants (SAAVs) data

Request a detailed protocol

The program `anvi-gen-variability-profile` (with an additional ‘--engine AA’ flag) reported variability tables describing the allele frequencies for each SAAV. Anvi’o only considers short reads that cover the entire codon to determine amino acid frequencies at a given codon position in a metagenome. We instructed anvi’o to report only positions with more than 10% variation at the amino acid-level (i.e., at least 10% of recruited reads differ from the consensus amino acid). Supplementary file 1L reports the density of SAAVs for all SAR11 genome across all metagenomes. We also used anvi’o to report SAAVs for a subset of genes and metagenomes, and by considering only gene codons with a minimum coverage cut-off of 20X across all metagenomes of interest. Controlling the minimum coverage of gene codons across metagenomes improves confidence in variability analyses.

Differential occurrence of amino acids in SAAVs and in the core 1a.3.V genes

Request a detailed protocol

We determined the amino acid composition in the 799 core 1a.3.V genes as well as in SAAVs maintained in each metagenome using anvi’o programs `anvi-get-aa-counts` and `anvi-get-codon-frequencies` (with the flag `--return-AA-frequencies-instead`). We quantified the amino acid composition of all core 1a.3.V genes of in HIMB83 using the program `anvi-get-aa-counts`. In contrast, we quantified the amino acid composition of SAAVs by calculating the frequency of a given amino acid being one of the two dominant alleles. We then calculated p-values via a binomial test that represents the probability of observing the difference between amino acid frequencies computed over all core 1a.3.V genes versus only 1a.3.V SAAVs, given the null hypothesis that amino acids in 1a.3.V SAAVs are distributed according to the same distribution as the amino acids in the core 1a.3.V genes.

Estimating a neutral AAST frequency distribution

Request a detailed protocol

This calculation provides an estimate for the AAST frequency distribution given strictly neutral mutations. Unlike the neutral theory of evolution, it excludes the influential effects of purifying selection (negative selection coefficients). Since all mutations are equally likely to drift to detectable frequencies under a neutral model, the expected number of variant positions that have Ci and Cj as their two dominant alleles, is proportional to the rate that Ci mutates to Cj plus the rate that Cj mutates to Ci. Expressed mathematically,

ENCi, CjPCimPCiCjCi,m+ PCjmP(CjCi|Cj,m)

Where ENCi, Cj is the expected number of variant positions that have Ci and Cj as their two dominant alleles, PCim is the probability that a Ci position mutates given that a mutation has occurred, and PCiCjCi,m is the probability that such a mutation will mutate to Cj. Assuming all sites are equally likely to mutate, PCim is equivalent to the fraction of codons in the reference sequence that are Ci, and we denote this quantity as fCi. To extend the equation to the expected number of variant positions that have amino acids A1 and A2 as their two dominant alleles, that is a quantity proportional to the AAST frequency, one must enumerate over all codons in A1 and A2:

E(NAAST={A1, A2})CiA1CjA2P({Ci, Cj})

In general, PCiCjCi,m will depend primarily upon the nucleotide edit distance between Ci and Cj, which we denote as d, as well as the transition/transversion rate ratio, which we will denote κ. How the model handles these aspects will critically influence the expected frequency distribution. To encapsulate the broadest possible interpretation of the neutral model, we evaluate expressions for two extreme cases: In the first case (Model 1), we assume that the probability of an edit distance d > 1 is 0 (in reality, estimates at least for eukaryotes range from 0.003 [Smith et al., 2003] to 0.03 [Schrider et al., 2011]). We also impose a κ value of 2 so that transitions are twice as likely as transversions. Intuitively, these impositions have the effect of skewing the AAST frequency distribution towards AASTs that possess highly similar codons. In the second case (Model 2), we assume all codon transitions are equally likely regardless of edit distance or the number of transitions/transversions (κ=1). Intuitively, this has the effect of homogenizing the AAST frequency distribution towards a more uniform-like distribution.

In Model 1, PCiCjCi,m=13δd,1P(m), where δd,1 is a Kronecker delta function describing the probability the mutation has an edit distance d, 1/3 is the probability that the correct nucleotide position is mutated, and P(m) is the probability that the mutation occurs based on whether or not it is a transition. Formally,

P(m)={κ/κ+2; m=transition1/κ+2; m=transversion

In Model 2, P(CiCj|Ci,m)=1/63, since all 63 possible mutations are permissible and equally probable. The expressions for ENAAST=A1, A2 for Model 1 and Model 2 thus simplify to:

M1E(NAAST={A1, A2})CiA1CjA2fCi, fCjδd,1P(m)
M2E(NAAST={A1, A2})CiA1CjA2fCi, fCj

where M1 and M2 refer to Model 1 and Model 2, respectively. To compare directly with observation, we extracted fCi for the 64 codons from the HIMB83 reference sequence using 'anvi-get-codon-frequencies' and the distributions under both models were calculated from the above equations.

Predicting 3D structure of proteins using template-based modeling

Request a detailed protocol

We used a template-based structure modeling tool, RaptorX Structure Prediction (Källberg et al., 2012), to predict structures of 1a.3.V amino acid sequences based on available data from the Protein Data Bank (PDB) (Bernstein et al., 1977). We used the program blastp in NCBI’s BLAST distribution to identify core 1a.3.V genes that matched to an entry with at least 30% similarity over the length of the given core gene. We then programmatically mapped SAAVs from metagenomes onto the predicted tertiary structures, and used PyMOL (DeLano, 2002; Schrödinger LLC, 2015) to visualize these data. We colored SAAVs based on RaptorX-predicted structural properties, including solvent accessibility and secondary structure.

Identifying genes with disproportionately high number of temperature-correlated positions

Request a detailed protocol

First, we calculated the number of temperature-correlated and temperature-uncorrelated positions for each of the 1a.3.V core genes. Then, we performed a one-sided binomial test that these numbers are biased towards higher proportion of temperature-correlated positions compared to a model distribution defined from the total number of temperature-correlated positions in 1a.3.V. Since there were 4,592 such positions out of 37,416, the model probability of success was defined as p0=459237416=0.123. In other words, the expected proportion of variant positions in a gene that are temperature-correlated is 0.123 under the model. We corrected the resulting p-values for each gene for multiple testing using Benjamini & Hochberg’s method (Benjamini and Hochberg, 1995).

Predicting transmembrane, periplasmic, and cytosolic regions in the glycine betaine permease

Request a detailed protocol

To categorize amino acid positions as transmembrane, periplasmic, and cytosolic, we used Phobius (Käll et al., 2004; Käll et al., 2007), a membrane topology and prediction software through the webserver at http://phobius.sbc.su.se. The output is a probability of the four classes for each residue, and to simplify the data we categorized each residue into the class found to be most probable. We removed residues with signaling peptide association from downstream analyses.

Recovering codon position-specific AASTs from SAAVs

Request a detailed protocol

We simplified the hyper-dimensional SAAV data into a simpler presence-absence matrix for downstream analyses. For this, we defined codon position-specific AASTs (cAASTs) and summarized their occurrence across metagenomes. In such a table the value of ‘1’ indicates that a given metagenome had a SAAV at a given codon position that resolved to a given AAST. In contrast, the value ‘0’ indicates that the metagenome did not have a SAAV that resolved to this AAST. In the latter case a given metagenome may have another AAST in this particular codon position (in which case this information would appear in another row in the same table that is affiliated with the same AAST with the same codon position). Hence, each AAST listed in the first column of the table will be unique to a single codon position, yet a given codon position may have different AASTs in different metagenomes, resulting in multiple AASTs in the resulting table that belong to the same codon position. Combining AAST with the codon position would then result in a unique cAAST.

Application of deep learning to codon-position-specific AASTs data

Request a detailed protocol

To estimate an unbiased distance between our metagenomes based on SAAVs, we used a novel deep neural network modification called the auto-clustering output layer (ACOL). Briefly, ACOL relies on a recently introduced graph-based activity regularization (GAR) technique for competitive learning from hyper-dimensional data to demarcate fine clusters within user-defined ‘parent’ classes (Kilinc and Uysal, 2017). In this application of ACOL, however, we modified the algorithm so it can reveal latent groups in our SAAVs in a fully unsupervised manner through frequent random sampling of SAAVs to create pseudo-parent class labels instead of ​user-defined classes (Kilinc and Uysal, 2018). See the URL http://merenlab.org/data/sar11-saavs for the details of the pseudo parent-class generation algorithm, and the reproducible distance estimation workflow in Python.

Other statistical tests and visualization

Request a detailed protocol

We used the aov function in R to perform one way ANOVA tests, used the ggplot2 (Ginestet, 2011) package for R to visualize the relative distribution of 1a.3.V genes and geographic distribution of proteotypes, and finalized all figures using an open-source vector graphics editor, Inkscape (available from http://inkscape.org/).

Code and data availability

Request a detailed protocol

The vast majority of analyses relied on the open-source software platform anvi’o v2.4.0 (available from http://merenlab.org/software/anvio). The URL http://merenlab.org/data/sar11-saavs serves the remaining custom code used in our analyses. We made available (1) SAR11 isolate genomes (doi:10.6084/m9.figshare.5248945), (2) the anvi’o contigs database and merged profile for SAR11 genomes across metagenomes (doi:10.5281/zenodo.835218) and the static HTML summary for the mapping results (doi:10.6084/m9.figshare.5248453), (3) the SAR11 metapangenome (doi:10.6084/m9.figshare.5248459), single-nucleotide and single-amino acid variant reports for 1a.3.V across 74 TARA Oceans metagenomes (doi:10.6084/m9.figshare.5248447), and (4) SAAVs overlaid on predicted tertiary structures of 58 core 1a.3.V genes (doi:10.6084/m9.figshare.5248432). The URL http://anvi-server.org/p/4Q2TNo serves an interactive version of the SAR11 metapangenome, and the URL http://data.merenlab.org/sar11-saavs serves an interactive web page to investigate the link between SAAVs and predicted protein structures.

References

  1. 1
  2. 2
  3. 3
  4. 4
    Geobiologie of Inleiding Tot De Milieukunde
    1. LGM Baas-Becking
    (1934)
    Den Haag: W.P. Van Stockum & Zoon.
  5. 5
  6. 6
  7. 7
    The protein data bank
    1. FC Bernstein
    2. TF Koetzle
    3. GJ Williams
    4. EF Meyer
    5. MD Brice
    6. JR Rodgers
    7. O Kennard
    8. T Shimanouchi
    9. M Tasumi
    (1977)
    European Journal of Biochemistry 80:319–324.
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
    On the Origin of Species
    1. C Darwin
    (1859)
    Routledge.
  21. 21
  22. 22
  23. 23
  24. 24
    Population Genomics: Microorganisms
    1. VJ Denef
    (2018)
    Peering into the genetic makeup of natural microbial populations using metagenomics, Population Genomics: Microorganisms, Cham, Springer, 10.1007/13836_2018_14.
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
    Diversity and depth-specific distribution of SAR11 cluster rRNA genes from marine planktonic Bacteria
    1. KG Field
    2. D Gordon
    3. T Wright
    4. M Rappé
    5. E Urback
    6. K Vergin
    7. SJ Giovannoni
    (1997)
    Applied and Environmental Microbiology 63:63–70.
  33. 33
  34. 34
    ggplot2: elegant graphics for data analysis
    1. C Ginestet
    (2011)
    Journal of the Royal Statistical Society: Series A 174:245–246.
    https://doi.org/10.1111/j.1467-985X.2010.00676_9.x
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
    The ocean sampling day consortium
    1. A Kopf
    2. M Bicak
    3. R Kottmann
    4. J Schnetzer
    5. I Kostadinov
    6. K Lehmann
    7. A Fernandez-Guerra
    8. C Jeanthon
    9. E Rahav
    10. M Ullrich
    11. A Wichels
    12. G Gerdts
    13. P Polymenakou
    14. G Kotoulas
    15. R Siam
    16. RZ Abdallah
    17. EC Sonnenschein
    18. T Cariou
    19. F O'Gara
    20. S Jackson
    21. S Orlic
    22. M Steinke
    23. J Busch
    24. B Duarte
    25. I Caçador
    26. J Canning-Clode
    27. O Bobrova
    28. V Marteinsson
    29. E Reynisson
    30. CM Loureiro
    31. GM Luna
    32. GM Quero
    33. CR Löscher
    34. A Kremp
    35. ME DeLorenzo
    36. L Øvreås
    37. J Tolman
    38. J LaRoche
    39. A Penna
    40. M Frischer
    41. T Davis
    42. B Katherine
    43. CP Meyer
    44. S Ramos
    45. C Magalhães
    46. F Jude-Lemeilleur
    47. ML Aguirre-Macedo
    48. S Wang
    49. N Poulton
    50. S Jones
    51. R Collin
    52. JA Fuhrman
    53. P Conan
    54. C Alonso
    55. N Stambler
    56. K Goodwin
    57. MM Yakimov
    58. F Baltar
    59. L Bodrossy
    60. J Van De Kamp
    61. DM Frampton
    62. M Ostrowski
    63. P Van Ruth
    64. P Malthouse
    65. S Claus
    66. K Deneudt
    67. J Mortelmans
    68. S Pitois
    69. D Wallom
    70. I Salter
    71. R Costa
    72. DC Schroeder
    73. MM Kandil
    74. V Amaral
    75. F Biancalana
    76. R Santana
    77. ML Pedrotti
    78. T Yoshida
    79. H Ogata
    80. T Ingleton
    81. K Munnik
    82. N Rodriguez-Ezpeleta
    83. V Berteaux-Lecellier
    84. P Wecker
    85. I Cancio
    86. D Vaulot
    87. C Bienhold
    88. H Ghazal
    89. B Chaouni
    90. S Essayeh
    91. S Ettamimi
    92. elH Zaid
    93. N Boukhatem
    94. A Bouali
    95. R Chahboune
    96. S Barrijal
    97. M Timinouni
    98. F El Otmani
    99. M Bennani
    100. M Mea
    101. N Todorova
    102. V Karamfilov
    103. P Ten Hoopen
    104. G Cochrane
    105. S L'Haridon
    106. KC Bizsel
    107. A Vezzi
    108. FM Lauro
    109. P Martin
    110. RM Jensen
    111. J Hinks
    112. S Gebbels
    113. R Rosselli
    114. F De Pascale
    115. R Schiavon
    116. A Dos Santos
    117. E Villar
    118. S Pesant
    119. B Cataletto
    120. F Malfatti
    121. R Edirisinghe
    122. JA Silveira
    123. M Barbier
    124. V Turk
    125. T Tinta
    126. WJ Fuller
    127. I Salihoglu
    128. N Serakinci
    129. MC Ergoren
    130. E Bresnan
    131. J Iriberri
    132. PA Nyhus
    133. E Bente
    134. HE Karlsen
    135. PN Golyshin
    136. JM Gasol
    137. S Moncheva
    138. N Dzhembekova
    139. Z Johnson
    140. CD Sinigalliano
    141. ML Gidley
    142. A Zingone
    143. R Danovaro
    144. G Tsiamis
    145. MS Clark
    146. AC Costa
    147. M El Bour
    148. AM Martins
    149. RE Collins
    150. AL Ducluzeau
    151. J Martinez
    152. MJ Costello
    153. LA Amaral-Zettler
    154. JA Gilbert
    155. N Davies
    156. D Field
    157. FO Glöckner
    (2015)
    GigaScience 4:27.
    https://doi.org/10.1186/s13742-015-0066-5
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
    The {PyMOL} Molecular Graphics System
    1. Schrödinger LLC
    (2015)
    The {PyMOL} Molecular Graphics System, ~1.8.
  73. 73
    Population Genomics: Microorganisms
    1. BJ Shapiro
    (2018)
    What microbial population genomics has taught Us about speciation, Population Genomics: Microorganisms, Chem, Springer, 10.1007/13836_2018_10.
  74. 74
    Microbial speciation
    1. BJ Shapiro
    2. MF Polz
    (2015)
    Cold Spring Harbor Perspectives in Biology 7:a018143.
    https://doi.org/10.1101/cshperspect.a018143
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
  82. 82
  83. 83
  84. 84
  85. 85
  86. 86
  87. 87
  88. 88
  89. 89
  90. 90
  91. 91
  92. 92
  93. 93
  94. 94
  95. 95

Decision letter

  1. Paul B Rainey
    Reviewing Editor; Max Planck Institute for Evolutionary Biology, Germany
  2. Detlef Weigel
    Senior Editor; Max Planck Institute for Developmental Biology, Germany
  3. Paul B Rainey
    Reviewer; Max Planck Institute for Evolutionary Biology, Germany
  4. Fred Cohan
    Reviewer; Wesleyan

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your work entitled "The global biogeography of a single SAR11 population is governed by natural selection" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Paul Rainey as the Reviewing Editor, and the evaluation has been overseen by Detlef Weigel as the Senior Editor. The outside reviewers have opted to remain anonymous.

Our decision has been reached after extensive consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work will not be considered further for publication at this time. However, we do see the potential for worthy eLife publication, provided all the issues raised by the reviewers can be addressed. Any resubmission would need to be a very substantial revision and would be treated as a new submission.

All reviewers agree that the topic is of considerable interest. They also indicate that your methods appear to have many strengths, but at the same time all are in agreement that much more needs to be done to justify your approaches. For example, Reviewer 3 requests calibration against standard population genetic approaches. Additional problems concern the main claims of the paper such as selective sweeps, the effect of selection, and so forth, which are stated, but not evidently backed by data. Further issues surround use and understanding of terms in population biology/ecology: Reviewer 2 points out the problems with refereeing to diversification within a population when the "population" of interest is really a substantial portion of a genus.

Reviewer #1:

This deals with the causes of patterns of diversity in a widespread marine microbe. The data come from a set of contigs assembled from shot gun sequencing of 21 genomes onto which are read-mapped metagenomic data from 103 samples. The focus is a particular clone that is particularly abundant in southern samples. A core genome is assembled from the metagenomic data and this is interrogated for signatures of selection using a curious approach based on physico-chemical properties of amino acids. A further step looks for correlations with water currents.

At first glance I found the paper reasonably clear, but the closer I read, the more murky things became. In fact I am not too sure what the authors have really done and I am not persuaded that their main claim, that selection is responsible for biogeography in S-LLPA, is supported by their data. Overall the level of explanation is insufficient to allow the reader to understand what has been done.

Among the difficulties:

Title/Abstract: The title states that biogeography is governed by selection, but in the Abstract the authors use more temperate language. So I guess they are not so sure. The Abstract claims that systematic purifying selection and adaptive mechanisms governing non-synonymous variation have been identified, but at best what they have are signatures consistent with expectations. I disagree that analysis of sequence data reveals different niches (it may reveal the existence of ecotypes that may be indicative of distinct niches). What is a proteotype?

Place of isolation of the focal 21 SAR11 genomes is important in a study of biogeography, but no attention has been given to this.

There are many challenges in drawing conclusions from metagenomic data via read-mapping to a reference. Particularly problematic is the fact that diversity of closely related organisms within a sample is not known. Thus when many SNPs are detected is this indicative of a diverse population, or is it a signature that comes from a single over-represented clone that is rather divergent? Similarly, the authors take high similarity as signature of a recent sweep, but this might reflect a single over-represented genome from a closely related organism. I'm not sure how the authors dealt with these and similar problems.

With regard to selective sweeps, the results of the analysis appear not to be given (subsection “SNVs to SAAVs: Accurate characterization of non-synonymous variation”, gives the conditions that need to be fulfilled in order for a protein to have undergone a sweep, but not the results).

Patterns of amino acid substitution types seems an interesting way to go, but this needs to be unpacked, explained and justified.

Partitioning of SAAV between warm and cold currents: there seems to be a correlation with clustering based on SAAVs, but as above, I need to have more explanation. Much relies on the robustness of the SAAV profile clustering (Figure 3A) and there is no indication of its statistical likelihood.

The crunch of the paper is rejection of the neutral model, but this is done by simply saying that neutral models could not explain the observed patters of amino acid substitutions, but no evidence is provided. Also, Figure 4, which I hoped would enlighten me, does not exist.

In general I found the figures overly dense and very difficult to understand given the paucity of explanation in the captions. Take Figure 1: latitudinal gradient, I presume north is at the top? Colouring seems inconsistent and so on.

Reviewer #2:

It's hard to tell who this paper is written for. The bioinformatics applications are the most interesting: identifying the core genome of a lineage of interest from metagenomics data, using a novel algorithm for (somehow) identifying significant subclades within the lineage, and finding a surprisingly high frequency of certain amino acid substitutions that are not predicted by Blosum. There are also interesting implications for ecological diversification, but this aspect of the paper is written without enough explanation or interpretation of data, as I will detail. Also, I think most microbial ecologists would be amazed to see a paper on diversification within "a single population" that is actually not a population at all, but rather a fair chunk of a genus.

It is so inappropriate to call the S-LLPA lineage a "single population," since it has nearly 20% sequence divergence within it, and because there is clearly profound ecological diversification within this group. (Given that the Eren group has made tremendous progress toward discovering newly divergent, ecologically distinct populations of bacteria with their MED algorithm, it seems ironic that the group is now calling this phylogenetically huge group a single population.) Calling this group a population seems to have led to the conclusion that "their broad geographic prevalence suggest<s> dispersion is not a limiting factor for SAR11 in surface seawater…." But this conclusion is based on a false reification of all the various true populations within what the authors are calling a population.

The authors talk about S-LLPA offers "a unique opportunity to study the genomic diversity and evolutionary genomics of a single marine microbial population…." Not so unique when the "population" is a huge chunk of a genus.

The authors do justify their calling this clade a population because it constitutes what is recruited by a single isolate from various metagenomes. But their own work clearly shows, to my mind, that this is not a good way to discover populations.

The authors have used their Deep Learning algorithm for demarcating six phylogenetic groups, which they call proteotypes, from their S-LLPA lineage. It is frustrating that the authors neglected to give any rationale for their approach. Is it based on sequence clustering of shared genes? Is it based on genome content sharing? Or both? (Even though the algorithm has been published, they should say something briefly about what the inputs are and how it works.)

On two occasions the authors write about protein sweeps. They state that the sweeps are "rare." I think most evolutionary ecologists of bacteria would want to know how rare, and in which genes. Also, it would be important to indicate whether the sweeps traveled across ecologically distinct groups. In the first paragraph of the Discussion, the authors mention that there are more sweeps in warm currents, but this should be fleshed out and made clearer.

Subsection “Purifying selection governs the identity of amino acid substitution types”, first paragraph. It wasn't clear to me whether the authors were claiming that their focus group was unique in having a stable proportion of amino acids on a global scale, or whether other groups are showing the same pattern.

In the subsection “Purifying selection governs the identity of amino acid substitution types”, I don't see how the authors are finding evidence of purifying selection. Rather, it seems they are finding the constraints on adaptive substitutions of amino acids. That's different from purifying selection, which is usually the evolutionary stability of a particular sequence.

It's interesting that there are two clades within S-LLPA that are associated with different temperature regimes. And it's interesting that the subclades within these clades are different in the geographic regions where they are most abundantwith similar temperatures. The interpretation of this result depends on whether different cold-adapted subclades are never found together (meaning that they may have no adaptation differences) or whether they sympatric to some extent (meaning that they are coexisting based on ecological differences other than temperature partitioning). It seems that the latter is the case, but the authors should have made this clearer. For example, pie charts showing the relative abundances of the subclades in each region would be useful.

Reviewer #3:

The work by Delmont et al. describes a large-scale and thorough analysis of an important marine bacterium population, SAR111. They use multiple approaches to identify signals of selection to understand the evolution and ecology of these microbes. The authors deserve credit for their very clear presentation of data and methods. However, I have multiple concerns regarding the analysis and interpretation of the data:

1) The nature, and strength of, selection pressures acting on this bacterial population are not clearly defined.

a) The description of the substitution analysis "suggests a powerful influence of purifying selection", while also "suggesting a role for adaptive processes for functional diversification of S-LLPA proteins". Which of these dominates? Could this just be noisy neutral evolution – what is the expected variation of the S statistic, and the SAAV patterns, under a neutral scenario?

b) The neutral model of Hellweger et al. still predicts geographic separation of strains, which should be made clear in the Introduction. A falsification of this would be a repeated association of bacterial strains with ecological niche. This appears to be the case at the highest level of the hierarachy inferred with "deep learning", but at lower levels, predictors are genetic, not ecological, and therefore likely represent characteristics that were identified by the clustering algorithm itself. This might be expected under a neutral model.

2) The authors have opted against traditional population genetics methods. It is possible that the new methods are a substantial improvement, but they need to be justified. Some examples:

a) The authors justify using the "S" statistic rather than d(N)/d(S) based on there being many instances of multiple substitutions within the same codon. But this is exactly what d(N)/d(S) is meant to correct for, and S does not capture – there may be many synonymous substitutions on an evolutionary pathway, but if only one of them is non-synonymous, then S will only count this only as a non-silent change, underestimating purifying selection. What is the neutral expectation for S?

b) The BLOSUM matrices are used as "as a proxy to assess the functional difference between the pairs of amino acids", which are then compared to the substitution frequencies – but this is circular reasoning, as these matrices are calculated from substitution frequencies. The versions used in the paper are also inappropriate for this dataset, as they were calculated using more closely-related sequences than are compared in this paper.

c) "Deep learning" is applied to cluster populations together through a method designed without genetic data in mind. This does not mean it is wrong, but as it is not an intuitive approach, it would be helpful to see validation against a more standard method, such as the fixation index, F(ST). This would help understand the distinct properties of the six subclusters – for instances, proteotype D looks to be a single successful "strain" – is this expected to be picked up as a signal by the "deep learning"?

3) Quality of read mapping. Many of the sequences being compared are highly divergent, but most mapping software is intended to align reads to highly similar reference sequences. While there is a promising correlation between the SNP densities identified by BWA and BOWTIE2, there is a greater than two-fold difference in the absolute densities. This suggests a very high false positive or false negative rate, depending on the method. The authors need to validate the performance of these algorithms when using highly divergent reads. Additionally, what is the justification for using the "the base-5 logarithm of the mean coverage of a gene remained within {plus minus}1 of that for the mean coverage across all genes" for assessing consistency of coverage?

[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]

Thank you for submitting your article "Single-amino acid variants reveal evolutionary processes that shape the biogeography of a global SAR11 lineage" for consideration by eLife. Your article has been further reviewed by three peer reviewers, including Paul B Rainey as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Detlef Weigel as the Senior Editor. The following individual involved in review of your submission has agreed to reveal his identity: Fred Cohan (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a further revised submission.

All reviewers appreciate and acknowledge the significant work that has gone into the paper during the course of the revision, the reviewers are all equally enthusiastic about the paper. However, there remain issues that need attention in order for the paper achieve the impact that it deserves. The reviewers' comments are pasted below, with the collective comments boiling down to a request to supplement the deep learning analysis with a more standard evolutionary approach in order for the reader to know whether the results are consistent with what most people in the field know. Here it is important to clearly distinguish between neutral evolution, purifying selection, and adaptive divergence. It is the latter that is particularly relevant to addressing the question of the possible ecological distinctions among populations.

Collected reviewer comments:

The paper is vastly improved since the original version. It is now much clearer and more compelling, and the paper deals with the issue of populations better.

The paper is fascinating for showing that whatever cohesion occurs at >95% ANI in pretty much every other bacterial group does not occur here in SAR11. The paper is also really interesting for giving a robust and sensitive way to identify ecologically distinct metagenome groups from sequence data.

My concerns are mostly focused on what the authors have found out about ecological diversity within the 1a.3.V lineage.

First, I appreciate that the authors are trying to be more careful about using the concept of population, but there are still places where they are still unclear about what they mean. The authors write of "remarkable intra-population genomic diversity…." It's not clear exactly what the authors mean by "population" here, but in this context what they mean is very important. If there isn't a good definition of population being used, then how can one talk about one population's intra-population genomic diversity being higher than another? It all depends on what you mean by population, and that's not clear here.

Figure 2 purports to show that there are not sequence-discrete populations within 1a.3.v, at least not in the sense that Kostas Konstantinidis demonstrates them in his studies, most recently in Jain et al., 2018. It is really interesting that all the SAR11 clades (with just this one clade shown in Figure 2) do not show the drop-off in recruitment of metagenome sequences below about 95% ANI. The authors argue that part of the great sequence diversity of what is recruited into this lineage (that is with much lower ANI) emerges from there being ecological diversity within the lineage (adaptations to different temperatures among different sublineages). I don't think this is a reasonable explanation, as almost every species taxon, nearly all obeying the 95% ANI drop-off rule, is also ecologically heterogeneous. I have suggested, in an interpretation of the recent Jain et al. paper, that the cohesion holding a species taxon together stems from an ability of recombination to act frequently at >95% ANI, plus the possibility of adaptive genes passing between ecologically distinct populations that are >95% ANI (Cohan, 2019) I'd argue that there is more potential for either recurrent recombination to limit divergence among ecologically distinct populations in SAR11, or greater opportunities for a single gene to be adaptive across different populations.

I'd like to add that whatever cohesion occurs at >95% ANI in most species taxa does not address the ecological diversity within species taxa, or whether there are sequence-discrete populations within each taxon. In fact, most species taxa have sequence-discrete and sequence-discoverable ecologically distinct populations. (This is the point of the second half of my Current Biology dispatch I mentioned.) So, I don't think that Figure 2 makes an argument that there are not sequence-discrete, ecologically distinct populations within the focus lineage.

The authors have noted that there is a small minority of genes that are invariant at the amino acid level across a given metagenome. In the earlier version, they interpreted these observations as evidence of sweeps, but here they have chosen not to address the dynamics that can be inferred by these invariant genes. I think that such single-gene sweeps are very interesting, and have been the topic of discussion of some very interesting papers, including papers by Jesse Shapiro and by Rex Malmstrom. I'd encourage the authors to bring this discussion back. I'll mention that these instances provide evidence that a generally adaptive gene has passed by recombination across all the ecologically distinct populations within a lineage, but I wouldn't expect the authors to necessarily buy in to that interpretation.

I found the authors' partitioning of metagenomes into ecologically distinct groups fascinating. And it's particularly interesting that these groups could not be revealed (at least entirely) by analyzing single reads (as shown in Figure 4—figure supplement 3). I'll encourage the authors to venture from their interpretation regarding whole metagenomes to inferring that there appear to be multiple ecologically distinct populations that cause these metagenomes to hold different niches. What might be the ecological distinctions of the populations? Clearly they are different for their temperature adaptation, as revealed by Figure 4 (mislabeled Figure 5). Perhaps the authors could glean something about what is different about the constituent populations from their Figure 4—figure supplement 3, that is by comparing the clusters that they do find with known environmental parameters.

Re the concluding statement about "everything is everywhere," I'd temper that by saying 'at least for marine bacteria'. This is because the statement does talk about "everything," so I think there needs to be some limitation to the conclusion.

Concerns about evolutionary analyses:

In response to point (1), the authors state:

"Simply, we observed that certain amino acids (mostly hydrophilic), and most notably, few AASTs (e.g., alanine/isoleucine) predominated in our SAAV table."

It has been well-established for decades that hydrophobic amino acids in solvent-inaccessible positions are more conserved (e.g. see reviews cited in the Introduction to Ramsey et al. 2011 Genetics 188(2):478-88). Reproducing this observation in the SAR11 dataset simply confirms that purifying selection can be observed over long timescales. It does not help identify what the changes are that enable ecological adaptation.

Delmont et al: "These points are interesting, but they do not favor natural selection over neutral evolution. Instead they provide information regarding permissible versus non-permissible diversification"

I assume "permissible" refers to purifying selection? Which is an example of natural selection – rather than random, neutral diversification. It is distinct from the adaptive/Darwinian evolution that allows the ecological differentiation of the bacteria. Not separating neutrality, purifying selection and adaptive evolution plagues the whole manuscript. The authors have a strain distribution that appears to reflect ecological differentiation, rather than neutral diffusion. They have a mutation distribution that mainly seems to reflect purifying selection, rather than neutral evolution. Some analyses (e.g. below) suggest adaptive evolution, which is potentially very interesting, but the overall trend is consistent with purifying selection. The authors need to clearly distinguish their different conclusions.

Delmont et al: "we determined that thousands of 1a.3.V allele frequency trajectories correlated with in situ temperature, in line with the biogeography of proteotypes"

It is not surprising that many of the variable sites correlate with temperature, because the proteotypes are defined using the variable sites, and the proteotypes correlate with temperature. The authors should seek to synthesize these data – do the variable sites have properties that suggest selection between niches? Do they concentrate in particular genes that might drive adaptation to different niches? Or do they simply correlate with the population structure?

Overall, I return to my original point from the first round of reviews: is the diversification mainly neutral; dominated by purifying selection; or dominated by adaptive evolution? The most interesting signals may not be from the dominant evolutionary process, of course.

In response to point 2, the authors state:

Delmont et al: "As far as we know, there is no published study that effectively takes into account multiple SNVs per codon without calculating the exact codon frequencies to estimate synonymity accurately…SNV density was high for 1a.3.V and many SNVs co-occurred in the same codon, rendering classical d(N)/d(S) analyses limited, which had never been challenged with extremely complex environments, if not completely irrelevant."

Yang (2007) MBE 24(8):1586-91 – with over 6,000 citations – describes how baseml can be used to calculate d(N)/d(S) using base substitution, rather than codon substitution, models. This accounts for multiple substitutions per codon, and provides a statistical test for evidence of selection versus neutrality, unlike the methods presented in the current version. It does not matter how complex the environments are – indeed, d(N)/d(S) has been applied to the question of SAR11 adaptation to different ecologies by Brown et al., 2012 and Luo and Hughes, 2012 Mol. Syst. Biol. 8:625. The authors should at least apply standard methods to the assembled genomes, based on the ecological separation they have determined. Methods also exist for calculation of this statistic from metagenomic data, and might be more informative.

Delmont et al: "The significant shift in SAAVs/SNVs ratio between cold currents and warm currents is of relevance to our study, especially in light of other insights (especially, the linkage between allele frequency trajectories and in situ temperature, and the biogeography of proteotypes), favors the predominant role of natural selection acting on the permissible amino acid diversification traits of this global SAR11 lineage."

The correlation between allele frequencies and temperature, and proteotype and temperature, is confounded by the correlation between proteotypes and allele frequencies. This should be explained (if this is wrong, please also make that clear in the text). The authors need to make clear whether the natural selection they are referring to is stronger/weaker purifying selection changing the SAAV/SNV ratio; or positive selection driving adaptive diversification increasing the SAAV/SNV in some populations.

Delmont et al: "Regarding the fixation index, we agree with the reviewer that this would have been a valuable addition to our work…We believe the significant amount of methodological development in this work that offers descriptions of the biogeography of the most abundant lineage in surface oceans conveys a story that will not benefit from the inclusion of additional analyses. "

I don't understand how this statistic can be "a valuable addition", yet the story "will not benefit from the inclusion of additional analyses.". The deep learning analysis is still not intuitively explained in the new version (not the authors' fault, as these methods are designed to be complex). It should be complemented by fixation index calculations, which are simple, and would allow the readers a much greater insight into the basis of the results.

https://doi.org/10.7554/eLife.46497.040

Author response

[Editors’ note: the author responses to the first round of peer review follow.]

Reviewer #1:

This deals with the causes of patterns of diversity in a widespread marine microbe. The data come from a set of contigs assembled from shot gun sequencing of 21 genomes onto which are read-mapped metagenomic data from 103 samples. The focus is a particular clone that is particularly abundant in southern samples. A core genome is assembled from the metagenomic data and this is interrogated for signatures of selection using a curious approach based on physico-chemical properties of amino acids. A further step looks for correlations with water currents.

At first glance I found the paper reasonably clear, but the closer I read, the more murky things became. In fact I am not too sure what the authors have really done and I am not persuaded that their main claim, that selection is responsible for biogeography in S-LLPA, is supported by their data. Overall the level of explanation is insufficient to allow the reader to understand what has been done.

We thank the reviewer for their interest in our study. We agree with the reviewer’s criticism regarding the lack of clarity of our initial submission, which made it difficult for readers to follow our analyses and findings. These concerns prompted us to go back to the drawing board and improve the clarity of our manuscript as well as substantiate our claims with new analyses.

Our revised manuscript includes a simplified terminology, an entirely re-written Introduction, and a vastly modified Results section that clarifies both the question we set out to investigate, and the novelty of the approaches we developed to address it. It also includes new analyses to support our main claim. These analyses include (1) the percent identity distribution of recruited short metagenomic reads for each SAR11 genome, (2) the investigation of the relative frequency of amino acids involved in SAAVs, to demonstrate the link between purifying selection and hydrophobicity, and (3) the identification of thousands of SAAV allele frequency trajectories correlating with in situ temperature. One critical point is that we now follow more common naming conventions for SAR11 clades and sub-clades. Thus, in our revision we now define six sublineages within 1a.3 by linking ecology, phylogeny, and gene content of available SAR11 genomes, and replace the term S-LLPA with 1a.3.V.

Overall, these new analyses support our main claim that environmentally-mediated selection is a strong determinant of SAR11 evolution and biogeography by providing more evidence for the linkage between SAAVs and in situ temperature at the highly resolved level of allele frequency trajectories for individual codon positions across metagenomes.

Among the difficulties:

Title / Abstract: The title states that biogeography is governed by selection, but in the Abstract the authors use more temperate language. So I guess they are not so sure. The Abstract claims that systematic purifying selection and adaptive mechanisms governing non-synonymous variation have been identified, but at best what they have are signatures consistent with expectations. I disagree that analysis of sequence data reveals different niches (it may reveal the existence of ecotypes that may be indicative of distinct niches). What is a proteotype?

We have replaced “natural selection” by “environmentally-mediated selection” throughout the manuscript to clarify our findings, and incorporated new analyses corroborating the role of selection in the global biogeography of 1a.3.V. We have also changed our title to better reflect our findings.

We define a proteotype as homogeneous configuration of environmental single-amino acid variants in the context of a microbial lineage we access through a reference genome and metagenomic read recruitment. We have defined the six ‘proteotypes’ in our revised manuscript from sets of metagenomes that are clustered together based on environmental single-amino acid variant profiles. Thus, these proteotypes correspond to oceanic regions that shared similar amino acid variations for 1a.3.V. Respecting the controversy surrounding the term, we adopt a simple definition of ‘niche’ as in “the fit of a [population] living under specific environmental conditions” (doi:10.1007/978-94-017-9014-7_26). If proteotypes reflect environmental conditions, which is the main premise of our study, as they reflect oceanic current temperatures, we do not see why the sequence data analyses would not be insufficient to distinguish niches. Hence, we prefer to keep the term in our manuscript, and hope the reviewer agrees with our resolution.

Place of isolation of the focal 21 SAR11 genomes is important in a study of biogeography, but no attention has been given to this.

We agree with the reviewer that it is an intuitive consideration that we should have been more careful in addressing. Our analyses show that the location of isolation (coastal Hawai’i) has very little relevance to the biogeography and population genetics of the genome. 1a.3.V is not particularly more prevalent, or less diverse in samples closer to the location of isolation. We now clarify this point in two sections of the text:

“Strain HIMB83 contains a 1.4 Mbp genome with 1,470 genes, and was isolated from coastal seawaters off Hawai’i, USA. Despite this, it recruited more reads from distant locations than from samples closest to the source of isolation (Supplementary file 1C).”

“The SAAV density (the percentage of codon positions that harbor a SAAV) of core 1a.3.V genes averaged 5.76% and correlated with SNV density (19.3% on average) across the 74 metagenomes (R2=0.89; Figure 1—figure supplement 1). SNV and SAAV density metrics did not decrease in metagenomes sampled closest to the source of isolation, suggesting that the location of isolation for strain HIMB83 does not predict the biogeography and population genetics of 1a.3.V.”

Our revision also includes a new analysis that summarizes percent identity statistics of all metagenomic reads recruited by HIMB83 to substantiate this observation.

There are many challenges in drawing conclusions from metagenomic data via read-mapping to a reference. Particularly problematic is the fact that diversity of closely related organisms within a sample is not known. Thus when many SNPs are detected is this indicative of a diverse population, or is it a signature that comes from a single over-represented clone that is rather divergent? Similarly, the authors take high similarity as signature of a recent sweep, but this might reflect a single over-represented genome from a closely related organism. I'm not sure how the authors dealt with these and similar problems.

We thank the reviewer for their insightful question and an opportunity to clarify this point, which in fact is an important aspect of our study. While we rely on reference genomes for metagenomic read recruitment, our analyses of genetic heterogeneity in the environment through singlenucleotide variants (SNVs) and single-amino acid variants (SAAVs) rely on the consensus of the environment rather than divergence from the reference sequence. In other words, we established our SNVs and SAAVs based on the information content within a metagenome, rather than by calculating distance between a metagenome and a reference genome. Thus, if there was a single over-represented clone in the environment as the reviewer suggests, there would have been no SNVs or SAAVs identified from that metagenome: despite the disagreement between that clone and HIMB83, the environment would reveal subtle variation within the clone. In contrast, we found many variants and remarkable diversity that differed across metagenomes, which allowed us to to place this variation in the context of geography. The new Figure 2 in our revised manuscript substantiates this point.

Sweeps could occur at the genome-level causing over-representation of a single genome (which we did not observe for SAR11 in any of the metagenomes analyzed here) or at the gene-level when all closely related members of a diverse lineage share the same gene (which we also rarely observed). Thanks to SAAVs, we also had the opportunity to study conservancy of genes at the level of amino acid sequences. These analyses also suggested a lack of sweeps, and instead revealed the systematic occurrence of different variants in sets of metagenomes that originated from distant geographic locations. These two observations support the hypothesis that there was not a single over-represented clone in any of the samples analyzed here.

With regard to selective sweeps, the results of the analysis appear not to be given (subsection “SNVs to SAAVs: Accurate characterization of non-synonymous variation”, gives the conditions that need to be fulfilled in order for a protein to have undergone a sweep, but not the results).

Supplementary file 2D describes SAAV density for each gene across the 74 metagenomes; values of zero indicate no variation in the environment. The reviewer’s comment prompted us to be more careful with the terminology we use to describe this particular observation, which is the lack of variation rather than an event of gene or protein sweep, which requires more evidence than what our data can afford. Our Figure 5 now summarizes the percentage of invariant proteins across proteotypes and, to address this particular reviewer concern, we extended the section where invariant proteins are described with the following addition:

“We considered a protein to be ‘invariant’ (i.e., absence of variation due to intensive purifying selection acting on introduced mutations) in a given metagenome if it lacked SAAVs; such invariant proteins were rare in our data. In total, we detected 2,548 invariant proteins (only 4.3% of all possibilities across the 74 metagenomes) that encompassed only 113 genes. In addition, all genes, except one 679 nucleotide long ABC transporter (gene id 1469), contained at least one SAAV in at least one metagenome, revealing a wide range of amino acid sequence diversification among core 1a.3.V proteins (Supplementary file 2D).”

Patterns of amino acid substitution types seems an interesting way to go, but this needs to be unpacked, explained and justified.

We thank the reviewer for their guidance. AASTs indeed effectively summarize the multidimensional information the complex SAAV table carries. We now have significantly edited the related section for clarity:

“Next, we sought to investigate amino acids that co-occur in variable sites. SAAVs were often dominated by a few amino acids; hence, the frequency vector for a given SAAV contained many zero values. […] Expectedly, ‘glycine/tryptophan’ was exceedingly rare in our data and occurred only once in 738,324 SAAVs (Supplementary file 2H).”

In addition, where we investigated the correlation between AASTs and temperature, we described our results the following way, further justifying its relevance:

“Within subclade 1a.3.V the distribution of AAST frequencies was notably constrained across geographies (Figure 3B). […] Yet, close inspection of AAST frequencies revealed that amino acid exchangeabilities subtly diverge in a pattern that correlates with water temperature and/or its co-variables (Figure 3B insets, Figure 3—figure supplement 2).”

Finally, we relied on AASTs to clarify temperature-associated exchange rates between amino acids in AASTs, which further explains their use:

“The results of this analysis also provide an opportunity to study the directionality of exchange rates of AASTs. […] The most statistically significant asymmetry was between the most common AAST, isoleucine and valine, in which valine was highly preferred in warmer waters (binomial test p-value: 5.7×10-6) despite chemical structure similarities.”

Partitioning of SAAV between warm and cold currents: there seems to be a correlation with clustering based on SAAVs, but as above, I need to have more explanation. Much relies on the robustness of the SAAV profile clustering (Figure 3A) and there is no indication of its statistical likelihood.

Here we reanalyzed our data using also a more conventional approach rather than Deep Learning analysis to demonstrate that our findings are preserved even with a different clustering strategy. Deep Learning provided us with a distance matrix that estimates the dissimilarity between 74 metagenomes based on their SAAV profiles, and a hierarchical clustering of these data enabled us to identify 6 clusters. To confirm this outcome was not a by-product of our bioinformatics approach, we also performed an analysis of the SAAV table using Euclidian distance and Ward’s linkage. The figures in Author response images 1 and 2 shows the two main clusters, which are identical to the two main clusters Deep Learning revealed.

Author response image 1

Asking for 6 clusters also provided us with near identical organization of metagenomes:

Author response image 2

We favored the Deep Learning approach since its unbiased distance estimation resulted in more highly-resolved branches in the resulting dendrogram.

To further link the clustering of metagenomes based on SAAV profiles, and the underlying sequence composition, in our revised Figure 4A we added the percent identity curves of recruited reads at the bottom of the dendrogram. The strong association between percent identity curves and clusters indicate that our SAAV clustering approach was able to infer the signal that underpins the differentiation among these metagenomes in the context of a single lineage:

We hope these additional insights give the reviewer more confidence in the patterns we observe, and adequately demonstrate that these results are not due to an algorithmic anomaly.

The crunch of the paper is rejection of the neutral model, but this is done by simply saying that neutral models could not explain the observed patters of amino acid substitutions, but no evidence is provided. Also, Figure 4, which I hoped would enlighten me, does not exist.

We regret that Figure 4 was not available to the reviewer:

We hope the remarkable differences between the neutral model and the clusters identified through our analysis (most strikingly the group D, C, and E that connect distant geographies) offers the reviewer better insights.

Our new analysis of allele frequency trajectories adds further statistical support to our claim, corroborating the biogeography of proteotypes. We wrote:

“In addition to considering patterns of variability that emerged when we pooled data across 37,416 codon positions exhibiting variation within the core 1a.3.V genes, we also investigated the allele frequency trajectories of individual positions (i.e., the relative frequency between the two most prevalent amino acids across the 74 metagenomes) and sought to identify those that correlate with in situ temperature and/or its co-variables. […] It is therefore most plausible to conclude that allele frequencies at these positions are predominantly shaped by environmentally mediated selection.”

And we now end the Results and Discussion section with:

“The striking connection between geographically distant regions of the oceans through SAAVs render neutral processes an unlikely candidate for evolutionary processes that maintain the genomic heterogeneity within 1a.3.V (Figure 4—figure supplement 5). […] Overall, these results indicate that environmentally-mediated selection is a strong determinant of SAR11 evolution and biogeography.”

In general I found the figures overly dense and very difficult to understand given the paucity of explanation in the captions. Take Figure 1: latitudinal gradient, I presume north is at the top? Colouring seems inconsistent and so on.

We thank the reviewer for their attention. We now have clarified the latitudinal gradient in Figure 1 (the reviewer was correct in their guess: the organization of metagenomes top to bottom followed North to South). We have also tried to strategically use color to enhance the clarity of new figures incorporated into the resubmission.

We agree with the reviewer that some of the figures are dense with information. We have dedicated considerable time to create them so that they could convey relevant information from our holistic analyses (e.g., the SAR11 metapangenome in Figure 1A) and the new concepts introduced in this study (e.g., SAAVs in the context of inferred protein structures in Figures 3C and 4A). Carefully selected main figures cover significant aspects of the study, and altogether support our main claim that environmentally-mediated selection plays a role in governing the global biogeography of SAR11.

We thank the reviewer for their efforts to evaluate our study and for their helpful insights.

Reviewer #2:

It's hard to tell who this paper is written for. The bioinformatics applications are the most interesting: identifying the core genome of a lineage of interest from metagenomics data, using a novel algorithm for (somehow) identifying significant subclades within the lineage, and finding a surprisingly high frequency of certain amino acid substitutions that are not predicted by Blosum. There are also interesting implications for ecological diversification, but this aspect of the paper is written without enough explanation or interpretation of data, as I will detail. Also, I think most microbial ecologists would be amazed to see a paper on diversification within "a single population" that is actually not a population at all, but rather a fair chunk of a genus.

It is so inappropriate to call the S-LLPA lineage a "single population," since it has nearly 20% sequence divergence within it, and because there is clearly profound ecological diversification within this group. (Given that the Eren group has made tremendous progress toward discovering newly divergent, ecologically distinct populations of bacteria with their MED algorithm, it seems ironic that the group is now calling this phylogenetically huge group a single population.) Calling this group a population seems to have led to the conclusion that "their broad geographic prevalence suggest<s> dispersion is not a limiting factor for SAR11 in surface seawater…." But this conclusion is based on a false reification of all the various true populations within what the authors are calling a population.

We are very thankful for the reviewer’s criticism. In fact, their comments and concerns motivated us to perform new set of comprehensive analyses that helped us improve our intellectual and technical approaches and led to significant improvements in our manuscript. We agree with the reviewer the occurrence of multiple sequence-discrete populations would have influenced our conclusions dramatically, although our new analyses suggest that it is not the case. The new Figure 2 provides new insights into the percent identity of reads core 1a.3.V genes recruited across the 74 metagenomes. These results show that the data suggest the absence of various true populations within the niche of 1a.3.V, and much more clearly reveal the complex nature of this lineage, which was also observed in other studies that investigated sequence-discrete SAR11 populations (we cited multiple of these studies in our new introduction).

We agree with the reviewer that the use of the term “population” to describe 1a.3.V without extensive justification was an oversight on our part. However, we respectfully disagree with the reviewer’s suggestion that the metagenomic reads we competitively recruit through the HIMB83 genome represent a fair chunk of a genus. There is no consensus among microbiologists on what the appropriate theoretical definition of a population should be. While there are considerably wellsupported computational attempts to offer operational definitions for practical conveniences, SAR11 is a well-known and previously documented outlier of these operational definitions (our new introduction makes a more appropriate attempt to cover the literature). Where does a genus start and end? In the absence of an appropriate theoretical definition for a population, the boundaries between different levels of taxonomy are subject to interpretation. We believe, however, that most microbiologists and microbial ecologists would agree that a genus describing multiple ‘species’ must contain multiple sequence-discrete populations (SDPs). In other words, it is conceivable to expect that if metagenomic reads we recruit from the environment using HIMB83 represent a fair chunk of a genus, then there should be multiple SDPs that can be inferred from their levels of identity. However, what we found in our data was a lack of any evidence for SDPs. In fact, the recruited reads showed a much broader identity distribution pattern, similar to what is proposed for SAR11 by leading scientists in the field. Thus, our new short-read-level analyses do not support the hypothesis of the presence of multiple SDPs in our dataset. But ultimately, after much consideration we elected to not use the controversial term ‘population’. Our data gives access to a large number of environmental cells that differ from each other rather dramatically, but not randomly: in a cloud of sequences that can’t be distilled into multiple SDPs, disagreements across the genomes that make up this cloud emerge from very specific locations of genes without the presence of individually abundant clones. Our study distills that information to show that amino acid sequence-level diversification connects geographically distinct stations, and we accomplish this by focusing on a prevalent and abundant subclade of SAR11 that has a single cultivated representative.

We have extensively modified our text to better introduce and describe our work. We thank the reviewer for their insights, and hope that our new framework satisfies the reviewer.

The authors talk about S-LLPA offers "a unique opportunity to study the genomic diversity and evolutionary genomics of a single marine microbial population…." Not so unique when the "population" is a huge chunk of a genus.

We have removed this sentence as a part of the changes described above.

The authors do justify their calling this clade a population because it constitutes what is recruited by a single isolate from various metagenomes. But their own work clearly shows, to my mind, that this is not a good way to discover populations.

Based on the reviewer’s comments we have modified the wording accordingly. HIMB83 provided the opportunity to study an abundant and widespread SAR11 lineage (1a.3.V) that genomeresolved metagenomics has thus far failed to provide access. We do agree that read recruitment to isolate genomes can in principle lead to mistakes due to non-specific recruitment; however, we have taken care to guard against this error and show that our approach offers valuable insights when sufficient samples of a population’s genomes are unavailable. Our gene-level recruitment results and the reliance on the core genes of this lineage minimize shortcomings of the use of isolate genomes to support our claims.

The authors have used their Deep Learning algorithm for demarcating six phylogenetic groups, which they call proteotypes, from their S-LLPA lineage. It is frustrating that the authors neglected to give any rationale for their approach. Is it based on sequence clustering of shared genes? Is it based on genome content sharing? Or both? (Even though the algorithm has been published, they should say something briefly about what the inputs are and how it works.)

This important point requires clarification. We first identified 738,324 SAAVs across the 799 core 1a.3.V genes and 74 metagenomes. We then simplified our data by taking into account two most frequent amino acids in each SAAV. To explain our reasoning we have added the following description to the text:

“SAAVs were often dominated by a few amino acids; hence, the frequency vector for a given SAAV contained many zero values. To reduce sparsity, we first simplified our data by associating each SAAV with an amino acid substitution type (AAST), defined as the two most frequent amino acids in a given SAAV.”

We have also clarified the data used by the Deep Learning algorithm (we regret the poor description in our initial submission) and further explained the Deep Learning approach. The relevant section now reads:

“We then sought to extend the concept of allele frequency tracking at individual SAAV positions to investigate the potential for large-scale geographic partitioning within the metagenome dataset. […] Hierarchical clustering of samples based on Deep Learning-estimated distances (Supplementary file 4A) resulted in two main groups: the Western (warm) and Eastern (cold) boundary currents (Figure 4A).”

On two occasions the authors write about protein sweeps. They state that the sweeps are "rare." I think most evolutionary ecologists of bacteria would want to know how rare, and in which genes. Also, it would be important to indicate whether the sweeps traveled across ecologically distinct groups. In the first paragraph of the Discussion, the authors mention that there are more sweeps in warm currents, but this should be fleshed out and made clearer.

We agree with the reviewer that protein sweeps and their linkage to ecologically distinct groups and functions are of particular interest. We have modified the relevant section for clarity and included in our revision a new supplementary table that describes additional statistics:

“We considered a protein to be ‘invariant’ (i.e., absence of variation due to intensive purifying selection acting on introduced mutations) in a given metagenome if it lacked SAAVs; such invariant proteins were rare in our data. In total, we detected 2,548 invariant proteins (only 4.3% of all possibilities across the 74 metagenomes) that encompassed only 113 genes. In addition, all genes, except one 679 nucleotide long ABC transporter (gene id 1469), contained at least one SAAV in at least one metagenome, revealing a wide range of amino acid sequence diversification among core 1a.3.V proteins (Supplementary file 2D).”

While we are happy to see the reviewer finds this of interest, we believe protein sweeps do not represent significant aspects of our study. Yet the data we have made available provide access to the occurrence of invariant proteins in the context of metagenomes (Supplementary file 2D) along with extensive information on metagenomes (Supplementary file 1A) as well as each gene (Supplementary file 1J). We hope the reviewer agrees with our allocation of the limited manuscript space.

Subsection “Purifying selection governs the identity of amino acid substitution types”, first paragraph. It wasn't clear to me whether the authors were claiming that their focus group was unique in having a stable proportion of amino acids on a global scale, or whether other groups are showing the same pattern.

We have revised the section in question to improve its clarity using our new analyses:

“Within subclade 1a.3.V the distribution of AAST frequencies was notably constrained across geographies (Figure 3B). For example, the relative standard deviation of ‘aspartic/glutamic acid’ frequencies across the 74 metagenomes was just 3.0%, and the statistical spread of other AASTs was comparable (Figure 3B). This suggests that there exists some evolutionary mechanism(s) maintaining amino acid exchangeability within subclade 1a.3.V throughout the global surface ocean.”

Our study does not intend to make a claim regarding how unique observed patterns are to microbial taxa. However, we find the question posed by the reviewer an interesting one.

In the subsection “Purifying selection governs the identity of amino acid substitution types”, I don't see how the authors are finding evidence of purifying selection. Rather, it seems they are finding the constraints on adaptive substitutions of amino acids. That's different from purifying selection, which is usually the evolutionary stability of a particular sequence.

We thank the reviewer for bringing up this point. Our revised study includes significant changes to clarify our observations.

Purifying selection, the selective removal of alleles that are deleterious, indeed does not impact permissible amino acid substitutions, but we found evidence that purifying selection shapes the relative frequency of amino acids involved in SAAVs as a function of hydrophobicity, which is supported by the existing literature:

“Hydrophobic interactions within the solvent inaccessible core of proteins are known to be critical for maintaining the stability required for folding and activity, which enforces a strong purifying selection placed on mutations occurring in buried (solvent inaccessible) positions (Bustamante, Townsend and Hartl, 2000; Chen and Zhou, 2005; Worth, Gong and Blundell, 2009). […]Overall, our compositional analysis revealed that the occurrence of amino acids in SAAVs is roughly correlated with the occurrence of amino acids within the core 1a.3.V genes, and that deviations from this expectation are driven in part by levels of purifying selection that depend upon the suitability of an amino acid’s hydrophobicity for a given physicochemical environment (Figure 3—figure supplement 1).”

Besides the relative proportion of amino acids involved in SAAVs, our analysis of inferred 3D structures for hundreds of core 1a.3.V genes provides additional insights into the apparent effect of purifying selection in the occurrence of SAAVs as a function of solvent accessibility:

“Within the subset of the 1a.3.V proteome accessible to us, we found that buried amino acids (0-10% relative solvent accessibility) were approximately 4.4 times less likely to be variant than those that were exposed (41-100% relative solvent accessibility) (ANOVA, pvalue: <2×10-16). […] The local physicochemical environment therefore shapes variation, and visual inspection of Figure 3C indicates that this is conserved across distant geographies; i.e. positions that vary in one metagenome are likely to vary in others, as well.”

Strong biases associated with hydrophobicity and solvent accessibility both expose a predominant role of purifying selection in observed amino acid frequencies in SAAVs within the members of SAR11 lineage 1a.3.V. We hope the reviewer appreciates our clarifications.

It's interesting that there are two clades within S-LLPA that are associated with different temperature regimes. And it's interesting that the subclades within these clades are different in the geographic regions where they are most abundantwith similar temperatures. The interpretation of this result depends on whether different cold-adapted subclades are never found together (meaning that they may have no adaptation differences) or whether they sympatric to some extent (meaning that they are coexisting based on ecological differences other than temperature partitioning). It seems that the latter is the case, but the authors should have made this clearer. For example, pie charts showing the relative abundances of the subclades in each region would be useful.

The reviewer brings up a very important point, which was of significant interest to us as well. However, our short-read level analyses suggested that an increased fitness to a given temperature regime does not yield sequence-discrete populations that are distinguishable, which prevents us from making quantitative estimates of their differential abundance across geography. Deconvoluting haplotypes found in metagenomic read recruitment results is possible thanks to the increasing availability of tools (i.e., Lineage, DESMAN, or ConStrains), however, none of them are suitable to resolve the extensive diversity we observe in SAR11 subclade 1a.3.V. We are looking forward to new contributions from single-cell genomics and cultivation to address this significant question our study could not contribute.

Reviewer #3:

The work by Delmont et al. describes a large-scale and thorough analysis of an important marine bacterium population, SAR111. They use multiple approaches to identify signals of selection to understand the evolution and ecology of these microbes. The authors deserve credit for their very clear presentation of data and methods. However, I have multiple concerns regarding the analysis and interpretation of the data:

1) The nature, and strength of, selection pressures acting on this bacterial population are not clearly defined.

We agree with the reviewer. The manuscript now better describes the nature and strength of selection pressures acting on 1a.3.V. Especially, we show that (1) hydrophobicity and solvent accessibility influences the strength of purifying selection acting on amino acids, and (2) few AASTs along with thousands of allele frequency trajectories significantly correlate with in situ temperatures, in line with the biogeography of proteotypes. Our response below clarifies these points further.

a) The description of the substitution analysis "suggests a powerful influence of purifying selection", while also "suggesting a role for adaptive processes for functional diversification of S-LLPA proteins". Which of these dominates? Could this just be noisy neutral evolution – what is the expected variation of the S statistic, and the SAAV patterns, under a neutral scenario?

This is an important consideration and requires a comprehensive clarification. Our revision no longer contains the sentences the reviewer cited, and better clarifies that our investigation of purifying selection acting on 1a.3.V is independent of the debate regarding neutral evolution versus natural selection. Simply, we observed that certain amino acids (mostly hydrophilic), and most notably, few AASTs (e.g., alanine/isoleucine) predominated in our SAAV table.

Regarding the amino acids, we now write:

“Hydrophobic interactions within the solvent inaccessible core of proteins are known to be critical for maintaining the stability required for folding and activity, which enforces a strong purifying selection placed on mutations occurring in buried (solvent inaccessible) positions (Bustamante, Townsend and Hartl, 2000; Chen and Zhou, 2005; Worth, Gong and Blundell, 2009). […] Overall, our compositional analysis revealed that the occurrence of amino acids in SAAVs is roughly correlated with the occurrence of amino acids within the core 1a.3.V genes, and that deviations from this expectation are driven in part by levels of purifying selection that depend upon the suitability of an amino acid’s hydrophobicity for a given physicochemical environment (Figure 3—figure supplement 1).”

Besides the relative proportion of amino acids involved in SAAVs, the analysis of inferred 3D structures for hundreds of core 1a.3.V genes provides additional insights into the apparent effect of purifying selection on the occurrence of SAAVs (role of solvent accessibility this time):

“Within the subset of the 1a.3.V proteome accessible to us, we found that buried amino acids (0-10% relative solvent accessibility) were approximately 4.4 times less likely to be variant than those that were exposed (41-100% relative solvent accessibility) (ANOVA, pvalue: <2×〖10〗^(-16)). […] The local physicochemical environment therefore shapes variation, and visual inspection of Figure 3C indicates that this is conserved across distant geographies; i.e. positions that vary in one metagenome are likely to vary in others, as well.”

Regarding the AASTs, we found that predominant ones corresponded to competing amino acids with similar properties, shedding a different light on purifying selection. We described our findings in our revision:

“In 738,324 SAAVs, we observed 182 of 210 theoretically possible unique AASTs at varying frequencies (Supplementary file 2H). […] Expectedly, ‘glycine/tryptophan’ was exceedingly rare in our data and occurred only once in 738,324 SAAVs (Supplementary file 2H).”

These points are interesting, but they do not favor natural selection over neutral evolution. Instead they provide information regarding permissible versus non-permissible diversification. On the other hand, analysis of the permissible SAAVs can shed light in the natural selection versus neutral evolution debate. This is now covered in subsequent sections of our revision to help the reader. For instance, we found that some AASTs, and thousands of allele frequency trajectories, significantly correlated with in situ temperature, suggesting an important role for natural selection. We first wrote:

“Within subclade 1a.3.V the distribution of AAST frequencies was notably constrained across geographies (Figure 3B). […] Yet, close inspection of AAST frequencies revealed that amino acid exchangeabilities subtly diverge in a pattern that correlates with water temperature and/or its co-variables (Figure 3B insets, Figure 3—figure supplement 2).”

We then found that for some AASTs, one amino acid tended to be avoured by temperature increases compared to its competing amino acid genome-wide. We added:

“The results of this analysis also provide an opportunity to study the directionality of exchange rates of AASTs. […] The most statistically significant asymmetry was between the most common AAST, isoleucine and valine, in which valine was highly preferred in warmer waters (binomial test p-value: 5.7×10-6) despite chemical structure similarities.”

Finally, we determined that thousands of 1a.3.V allele frequency trajectories correlated with in situ temperature, in line with the biogeography of proteotypes. We concluded:

“In addition to considering patterns of variability that emerged when we pooled data across 37,416 codon positions exhibiting variation within the core 1a.3.V genes, we also investigated the allele frequency trajectories of individual positions (i.e., the relative frequency between the two most prevalent amino acids across the 74 metagenomes) and sought to identify those that correlate with in situ temperature and/or its co-variables. In 2,740 of the 37,416 positions, the null hypothesis that no correlation with temperature exists was rejected (Supplementary file 3B; p-value < 0.01). Figure 3—figure supplement 3 illustrates example cases and correlation statistics per AAST. It is statistically implausible that such correlations with temperature could have arisen from neutral evolution, given that distant oceans share similar temperatures (Supplementary file 1A). It is therefore most plausible to conclude that allele frequencies at these positions are predominantly shaped by environmentally mediated selection.”

Regarding the last question of the reviewer: under a neutral scenario, SAAV patterns should have reproduced trends depicted in Panel C of this supplemental figure (patterns we observed are different, and favor natural selection due to the distant metagenomes sharing similar SAAV profiles, for which Proteotype D offers a remarkable demonstration), see Figure 4—figure supplement 5.

b) The neutral model of Hellweger et al. still predicts geographic separation of strains, which should be made clear in the Introduction. A falsification of this would be a repeated association of bacterial strains with ecological niche. This appears to be the case at the highest level of the hierarachy inferred with "deep learning", but at lower levels, predictors are genetic, not ecological, and therefore likely represent characteristics that were identified by the clustering algorithm itself. This might be expected under a neutral model.

In principle, the reviewer is correct. But we did not observe variation at the level of SAAVs that fit the neutral model, even at finer scales of evolutionary distance. At the level of two main groups (Western versus Eastern boundary currents) as well as at the level of individual proteotypes, the clustering of metagenomes based on differential occurrence of SAAVs favors natural selection (now introduced as environmentally-mediated selection). Proteotype D is a particularly good example connecting the Red Sea and two sides of the Panama despite considerable distances.

Proteotype C connects remarkably distant, yet polar regions of the ocean. Similarly, Proteotype E also connects distant locations. Neutral evolution alone could lead to such patterns, as indicated by the simulation (Figure 4—figure supplement 3, panel C).

To cover this point, the Introduction now reads:

“At the level of individual populations, a key simulation by Hellweger et al., 2014, showed that the intra-population sequence divergence that reflects the geographic patterns of distribution for SAR11 cells could emerge solely as a function of ocean currents, without selection (Hellweger et al., 2014).

And the Results and Discussion reads:

“The striking connection between geographically distant regions of the oceans through SAAVs renders neutral processes an unlikely candidate for evolutionary processes that maintain the genomic heterogeneity within 1a.3.V (Figure 4—figure supplement 5). In fact, both the main ecological niches and more refined proteotypes indicate that SAAVs are not primarily structured by the global dispersal of water masses but instead tend to link distant geographic regions with similar environmental conditions (Figure 4B). This is corroborated by thousands of allele frequency trajectories correlated with in situ temperatures. Overall, these results indicate that environmentally-mediated selection is a strong determinant of SAR11 evolution and biogeography.”

2) The authors have opted against traditional population genetics methods. It is possible that the new methods are a substantial improvement, but they need to be justified. Some examples:

a) The authors justify using the "S" statistic rather than d(N)/d(S) based on there being many instances of multiple substitutions within the same codon. But this is exactly what d(N)/d(S) is meant to correct for, and S does not capture – there may be many synonymous substitutions on an evolutionary pathway, but if only one of them is non-synonymous, then S will only count this only as a non-silent change, underestimating purifying selection. What is the neutral expectation for S?

The “S” statistic (introduced as SAAVs/SNVs ratio) is not central to our study but provides an opportunity to link SNV and SAAV densities at the level of individual genes or entire metagenomes. We observed interesting trends, such as the increase of this ratio in cold waters (proteotypes A, B and C).

The accuracy of d(N)/d(S) deteriorates if a given codon contains more than a single SNV (the URL http://merenlab.org/2015/07/20/analyzing-variability/#single-codon-variants also contains some visual support for this statement with a mock example). A substantial fraction of codons analyzed in our study contained more than one SNV, a critical aspect of our study that is now better introduced in our revised manuscript. In those situations, SNVs need to be analyzed simultaneously, since the effect of one SNV (whether it is synonymous or not) will depend on the occurrence of others. As far as we know, there is no published study that effectively takes into account multiple SNVs per codon without calculating the exact codon frequencies to estimate synonymity accurately. Since they rely on actual codon frequencies and not pileups of nucleotide positions that are independent of each other, we hope the reviewer agrees that SAAVs will provide a better metric of synonymity compared to d(N)/d(S) in the case of lineages with extremely high SNV densities.

Under the neutral model, changes in the SAAVs/SNVs ratio must tend to associate with geographical adjacency as a function of the movement of water mases, rather than environmental parameters. The significant shift in SAAVs/SNVs ratio between cold currents and warm currents is of relevance to our study, especially in light of other insights (especially, the linkage between allele frequency trajectories and in situ temperature, and the biogeography of proteotypes), favors the predominant role of natural selection acting on the permissible amino acid diversification traits of this global SAR11 lineage.

Overall, we do not suggest that the SAAVs/SNVs ratio should replace d(N)/d(S). Simply, SNV density was high for 1a.3.V and many SNVs co-occurred in the same codon, rendering classical d(N)/d(S) analyses limited, which had never been challenged with extremely complex environments, if not completely irrelevant.

We appreciate reviewer’s input as it helped us better clarify the relevance of SAAVs regarding synonymy in our revision in multiple places. Specifically, the Introduction now reads:

“The substantial sequence diversity within environmental SAR11 populations not only explains the absence of SAR11 population genomes in genome-resolved metagenomics studies, but also challenges conventional approaches to the study of population genetics in microorganisms. For instance, the multiple occurrence of single-nucleotide variants in individual codon positions would render commonly used computational strategies that classify synonymous and non-synonymous variations based on independent nucleotide sites (such as in (Schloissnig et al., 2012; Bendall et al., 2016)) unfeasible.”

And the revised Results and Discussion reads:

“Percent identity distributions are useful to assess overall alignment statistics of short reads to a reference; however, they do not convey information regarding allele frequencies, their functional significance, or association with biogeography. […] Thus, quantifying frequencies of full codon sequences is a requirement to correctly assess synonymity, and it is this principle we employed in our workflow to extract SAAVs.”

And finally, regarding the specific case of the SAAVs/SNVs ratio, we added the following section:

“Interestingly, the niche defined by cold currents exhibited significantly more SAAVs (ANOVA, p-value: 1.66×〖10〗^(-12)) and a significantly higher SAAVs/SNVs ratio (ANOVA, p-value: 1.07×〖10〗^(-10)). This observation could be explained either by (1) extinction/re-emergence events that operate continually on specific codon positions (adaptive evolution), or (2) changes in abundances within a large seed bank of variants due to positive and negative selection as the lineage transits.”

We hope the reviewer agrees with our changes.

b) The BLOSUM matrices are used as "as a proxy to assess the functional difference between the pairs of amino acids", which are then compared to the substitution frequencies – but this is circular reasoning, as these matrices are calculated from substitution frequencies. The versions used in the paper are also inappropriate for this dataset, as they were calculated using more closely-related sequences than are compared in this paper.

We agree with the reviewer. We have removed BLOSUM matrices from our revision.

c) "Deep learning" is applied to cluster populations together through a method designed without genetic data in mind. This does not mean it is wrong, but as it is not an intuitive approach, it would be helpful to see validation against a more standard method, such as the fixation index, F(ST). This would help understand the distinct properties of the six subclusters – for instances, proteotype D looks to be a single successful "strain" – is this expected to be picked up as a signal by the "deep learning"?

First, we would like to address reviewer’s suggestion regarding the nature of Proteotype D. Our analysis of SAAVs using Deep Learning estimated distances between 74 metagenomes based on single amino acid variants we observed in them in the context of a single genome. This analysis linked 74 metagenomes to six proteotypes. Proteotype D emerged as a tight cluster despite its geographical spread, as the metagenomes it represents displayed similar SAAVs. If this was a single successful strain, we would have expected to find zero divergence from the environmental consensus for SAAVs. However, this was not the case: although much less when compared to other proteotypes, proteotype D did contain >5,000 SAAVs per metagenome (Figure 5, panel A). In addition, our new analysis of the percent identity of recruited reads also showed that proteotype D still maintains a large degree of variability and is far from a state that can be considered nearclonal.

Regarding the fixation index, we agree with the reviewer that this would have been a valuable addition to our work. However, we respectfully disagree with the reviewer regarding the level of intuitiveness of our approach, and we hope that our revision did a better job at introducing its relevance to study this lineage. Besides, the application of classical population genetics methods to these data are challenging due to technical reasons we outlined in our response regarding the remarkable complexity of these populations. We are hoping that our study will cultivate interest from the experts of population genetics to develop new methods that can deal with intra-population diversity of complex environmental agglomerates through ‘omics data. Our study formally introduces strategies and analytical tools for (1) the concept of SAAVs and AASTs, (2) the linkage between environmental variants and predicted protein structures, and (3) the linkage between allele frequency trajectories and environmental variables. We believe the significant amount of methodological development in this work that offers descriptions of the biogeography of the most abundant lineage in surface oceans conveys a story that will not benefit from the inclusion of additional analyses.

On the reviewer concern regarding the validation of Deep Learning: as we mentioned in our response to the Reviewer #1, we did use a more standard approach to cluster metagenomes based on the same SAAV frequency table as well. Here we would like to share parts of our previous response with the Reviewer #3 in case they do not have access to other responses: To make sure our findings with Deep Learning were not a by-product of our bioinformatics approach, we also performed an analysis of the SAAV table using Euclidian distance and Ward’s linkage. The figure in Author response images 1 and 2 shows the two main clusters, which are identical to the two main clusters Deep Learning revealed

Despite these results we elected to use the Deep Learning approach since the unbiased distance estimation afforded by this approach resulted in more highly-resolved branches in the resulting dendrogram.

3) Quality of read mapping. Many of the sequences being compared are highly divergent, but most mapping software is intended to align reads to highly similar reference sequences. While there is a promising correlation between the SNP densities identified by BWA and BOWTIE2, there is a greater than two-fold difference in the absolute densities. This suggests a very high false positive or false negative rate, depending on the method. The authors need to validate the performance of these algorithms when using highly divergent reads. Additionally, what is the justification for using the "the base-5 logarithm of the mean coverage of a gene remained within {plus minus}1 of that for the mean coverage across all genes" for assessing consistency of coverage?

We agree with the reviewer that mapping stringency is indeed critical to study population genetics, especially when one relies on reference genomes and recruitment of metagenomic reads. In most microbial lineages, the extent of diversity within a population is rather low and stringent mapping stringencies (e.g., >95% sequence identity) can be applied. This, however, is not the case for SAR11. To better clarify this point, we have significantly modified the Introduction. The relevant section now reads:

“Interestingly, the boundaries of environmental SAR11 populations appear to not comply with the 95% ANIr cutoff. For instance, Tsementzi et al., 2016, observed substantial sequence diversity within sequence-discrete SAR11 subclades in the environment, and suggested that an ANIr as low as 92% would be required to adequately define the boundaries of the SAR11 populations recovered in their study (Tsementzi et al., 2016). These findings are consistent with a comprehensive study of isolate genomes and marine metagenomes by Nayfach et al., 2016, which suggested that SAR11 is one of the most genetically heterogeneous marine microbial clades (Nayfach et al., 2016). The substantial sequence diversity within environmental SAR11 populations not only explains the absence of SAR11 population genomes in genome-resolved metagenomics studies, but also challenges conventional approaches to the study of population genetics in microorganisms.”

In line with observations from Tsementzi et al., 2016, our new analyses of short reads recruited by HIMB83 revealed extensive diversity, suggesting lower mapping stringencies would benefit deeper insights into SAR11 population genetics. During our in-house tests we found that using a stringency cut-off of 95% resulted in drastic changes in coverage values for Ia.3.V, and led decreased stability of coverage across metagenomes which resulted in much less number of core genes and metagenomes to work with. After exploring different software mapping algorithms and parameters, we concluded that the decreased SNV density with BWA was most likely due to false negatives (gene sections with high degree of SNVs were entirely overlooked). Using Bowtie2 allowed us to study the heterogeneity within this lineage at the amino acid variants level using DNA sequences that were up to 12% divergent from the reference (Figure 2—figure supplement 1 displays similar trends for all other isolate genomes).

Regarding the “base-5 logarithm”: we thank the reviewer for pointing it out. We have now better clarified our strategy:

“We then defined a subset of HIMB83 genes as the core 1a.3.V genes if they occurred in all 74 metagenomes and their mean coverage in each metagenome remained within a factor of 5 of the mean coverage of all HIMB83 genes in the same metagenome. This criterion accounted for biological characteristics influencing coverage values in metagenomic surveys of the surface ocean such as cell division rates and variations in coverage as a function of changes in GC-content throughout the genomic context. Figure 1—figure supplement 1 displays the coverage of all HIMB83 genes across all metagenomes, and Supplementary file 1J reports the coverage statistics.”

We would also like to acknowledge that given the complexity of both SAR11 lineages and metagenomics, it is indeed difficult to defend any heuristic to identify core genes. We fully appreciate this limitation, and to address further concerns we made an attempt to better explain the relevance of the 799 core 1a.3.V genes we identified:

“While the 799 genes that met these criteria systematically occurred within the niche boundaries of 1a.3.V, 40% of the remaining 671 HIMB83 genes that were filtered out were present in five or less metagenomes and coincided with hypervariable genomic loci (Figure 1—figure supplement 1). Hypervariable genome regions are common features of surface ocean microbes (Coleman et al., 2006; Zaremba-Niedzwiedzka et al., 2013; Kashtan et al., 2014; Delmont and Eren, 2018) that are not readily addressed through metagenomic read recruitment but do influence pangenomic trends. Here, less than 10% of gene clusters unique to HIMB83 were among core 1a.3.V genes (Figure 1A), indicating HIMB83’s unique genes are mostly accessory to the 1a.3.V lineage. In contrast, more than 80% of gene clusters that were core to the 21 SAR11 genomes matched to the core 1a.3.V genes. The overlap between environmental core genes of 1a.3.V revealed by the metagenomic read recruitment and the genomic core of SAR11 revealed by the pangenomic analysis of isolate genomes suggests that these genes represent a large fraction of the 1a.3.V genomic backbone (Figure 1A).”

We hope the reviewer agrees with our resolution.

[Editors' note: the author responses to the re-review follow.]

Collected reviewer comments:

The paper is vastly improved since the original version. It is now much clearer and more compelling, and the paper deals with the issue of populations better.

The paper is fascinating for showing that whatever cohesion occurs at >95% ANI in pretty much every other bacterial group does not occur here in SAR11. The paper is also really interesting for giving a robust and sensitive way to identify ecologically distinct metagenome groups from sequence data.

My concerns are mostly focused on what the authors have found out about ecological diversity within the 1a.3.V lineage.

First, I appreciate that the authors are trying to be more careful about using the concept of population, but there are still places where they are still unclear about what they mean. The authors write of "remarkable intra-population genomic diversity…." It's not clear exactly what the authors mean by "population" here, but in this context what they mean is very important. If there isn't a good definition of population being used, then how can one talk about one population's intra-population genomic diversity being higher than another? It all depends on what you mean by population, and that's not clear here.

The reviewer comment made us realize that our attempt to tip-toe around this issue by trying to avoid the use of the term ‘population’ did not improve the clarity of our study. Instead, it became more confusing due to uncommon and imprecise use of other terms (such as ‘lineage’) to describe what we are studying. We also came to the realization that most studies that use the term ‘population’ do not offer an explicit description (i.e., doi:10.1126/science.1248575), adding more to the confusion. We have now updated the manuscript with the following paragraph, which includes an explicit operational definition to clarify what we mean by a ‘population’ from a practical perspective with its shortcomings and cites insightful studies that offer deeper theoretical and practical discussions for this historical challenge:

“Overall, our data confirm that ANIr values of >95% used previously to delineate sequence-discrete populations does not apply to SAR11. […] However, due to the incomplete theoretical foundation and limitations associated with the use of short metagenomic reads, in discussions here we more conservatively assume that our reads originate from multiple closely related yet intertwined SAR11 populations within subclade 1a.3.V.”

We thank the reviewer for their persistence on this matter.

Figure 2 purports to show that there are not sequence-discrete populations within 1a.3.v, at least not in the sense that Kostas Konstantinidis demonstrates them in his studies, most recently in Jain et al., 2018. It is really interesting that all the SAR11 clades (with just this one clade shown in Figure 2) do not show the drop-off in recruitment of metagenome sequences below about 95% ANI. The authors argue that part of the great sequence diversity of what is recruited into this lineage (that is with much lower ANI) emerges from there being ecological diversity within the lineage (adaptations to different temperatures among different sublineages). I don't think this is a reasonable explanation, as almost every species taxon, nearly all obeying the 95% ANI drop-off rule, is also ecologically heterogeneous. I have suggested, in an interpretation of the recent Jain et al. paper, that the cohesion holding a species taxon together stems from an ability of recombination to act frequently at >95% ANI, plus the possibility of adaptive genes passing between ecologically distinct populations that are >95% ANI (Cohan, 2019) I'd argue that there is more potential for either recurrent recombination to limit divergence among ecologically distinct populations in SAR11, or greater opportunities for a single gene to be adaptive across different populations.

Very helpful insights. We believe the reviewer will find in our revision additional analyses that more clearly demonstrate that the temperature-driven changes in variable positions constitute a very small fraction of variable amino acids. We also offer a clearer discussion of populations in general and alternative interpretations of sequence heterogeneity within 1a.3.V. As a reminder, we also reworded a related paragraph to highlight possible forces driving cohesiveness within a species:

“Both high recombination rates between cells displaying low gANI values and frequent transfer of adaptive genes between ecologically distinct clades could explain the high-level of cohesion between SAR11 populations in the surface ocean (Vergin et al., 2007; Cohan, 2019). […] This analysis revealed a significant correlation between in situ temperature and distribution shape (mean p-value: 2.0×10^(-3); standard deviation p-value: 3.4×10^(-8); skewness p-value: 1.0×10^(-8)), which suggests a strong influence of temperature and its co-variables on the sequence heterogeneity within 1a.3.V (Figure 2) and is incompatible with the hypothesis of random sequence variants.”

I'd like to add that whatever cohesion occurs at >95% ANI in most species taxa does not address the ecological diversity within species taxa, or whether there are sequence-discrete populations within each taxon. In fact, most species taxa have sequence-discrete and sequence-discoverable ecologically distinct populations. (This is the point of the second half of my Current Biology dispatch I mentioned.) So, I don't think that Figure 2 makes an argument that there are not sequence-discrete, ecologically distinct populations within the focus lineage.

We agree with the reviewer. ANI >95% is nothing but an empirical observation that delineates species boundaries and does not suggest anything regarding the within-species population structures. But various values of ANI that are above 95% are often used by others as a practical expedient to define sequence-discrete populations in metagenomic read-recruitment studies (i.e., doi:10.1038/ismej.2015.241, doi: 10.1371/journal.pbio.0060177, etc), and these practical cutoffs slowly become de facto expectations. The data in Figure 2 shows that if we were to follow a similar strategy for SAR11 by selecting a sequence similarity cutoff of ANI >95%, we would have been arbitrarily cutting a biologically relevant continuum. We are (and the entirety of our study is) in agreement with the reviewer on this point; the purpose of the sentence the reviewer quotes from our study is to clarify the inappropriateness to apply the commonly used cutoffs to our data.

The authors have noted that there is a small minority of genes that are invariant at the amino acid level across a given metagenome. In the earlier version, they interpreted these observations as evidence of sweeps, but here they have chosen not to address the dynamics that can be inferred by these invariant genes. I think that such single-gene sweeps are very interesting, and have been the topic of discussion of some very interesting papers, including papers by Jesse Shapiro and by Rex Malmstrom. I'd encourage the authors to bring this discussion back. I'll mention that these instances provide evidence that a generally adaptive gene has passed by recombination across all the ecologically distinct populations within a lineage, but I wouldn't expect the authors to necessarily buy in to that interpretation.

We removed the section on ‘protein sweeps’ because we realized that in the absence of timeseries data it was not straightforward to suggest invariant proteins were evidence of protein sweeps or intensive purifying selection. We changed our text to remind the reader both alternatives by adding this sentence: “We considered a protein to be ‘invariant’ (i.e., absence of variation due to intensive purifying or positive selection) in a given metagenome if it lacked SAAVs”. That said, we have further explored their dynamic across metagenomes to contribute to the exchange with the reviewer. For this, we have clustered metagenomes in a similar fashion to their clustering in the manuscript, but this time only using the differential occurrence of invariant genes at the amino acid level (see Supplementary File 2h and 2i). This analysis recapitulated Proteotype D, a set of distantly related metagenomes central to our study.

Author response image 3

As one interpretation, it is possible that the same adaptive genes have passed by recombination across sub lineages in distant geographies (Proteotype D encompasses two sides of Panama as well as Red Sea and north of Indian Ocean). Yet, predominance of the same sub lineage in these regions due to natural selection is also a valid interpretation of the results. As mentioned above, while such analyses may seem to contribute to our study, we believe it would be misleading to suggest invariant proteins used for this figure to have resulted from events of protein sweeps. We hope the reviewer agrees with our resolution.

I found the authors' partitioning of metagenomes into ecologically distinct groups fascinating. And it's particularly interesting that these groups could not be revealed (at least entirely) by analyzing single reads (as shown in Figure 4—figure supplement 3). I'll encourage the authors to venture from their interpretation regarding whole metagenomes to inferring that there appear to be multiple ecologically distinct populations that cause these metagenomes to hold different niches. What might be the ecological distinctions of the populations? Clearly they are different for their temperature adaptation, as revealed by Figure 4 (mislabeled Figure 5). Perhaps the authors could glean something about what is different about the constituent populations from their Figure 4—figure supplement 3, that is by comparing the clusters that they do find with known environmental parameters.

We thank the reviewer very much for their suggestion (and catching the mislabeled Figure 5, which is now fixed). Taking their advice, we have now created a new table (Supplementary File 4d) to investigate and make available to others the ANOVA statistic for each known environmental parameter released by the Tara Oceans Project along with data our study revealed.

As shown in Figure 4, the strong association with latitude can easily be explained by the geographic partitioning of oceanic currents. Besides longitude, temperature best explains clusters of proteotypes as a function of SAAVs. We have updated the results in our revision to cite this table.

Re the concluding statement about "everything is everywhere," I'd temper that by saying 'at least for marine bacteria'. This is because the statement does talk about "everything," so I think there needs to be some limitation to the conclusion.

Fair. The sentence now reads:

“Overall, these findings suggest that environmentally-mediated selection plays a critical role in the journey of cosmopolitan microbial populations in the surface ocean, lending credence to the idea for marine systems that “everything is everywhere but the environment selects” (Baas-Becking, 1934).”

Concerns about evolutionary analyses:

In response to point (1), the authors state:

"Simply, we observed that certain amino acids (mostly hydrophilic), and most notably, few AASTs (e.g., alanine/isoleucine) predominated in our SAAV table."

It has been well-established for decades that hydrophobic amino acids in solvent-inaccessible positions are more conserved (e.g. see reviews cited in the Introduction to Ramsey et al. 2011 Genetics 188(2):478-88). Reproducing this observation in the SAR11 dataset simply confirms that purifying selection can be observed over long timescales. It does not help identify what the changes are that enable ecological adaptation.

Indeed, this section does not contribute to our investigation of ecological adaptation within the SAR11 lineage. It sheds light into purifying selection.

Delmont et al: "These points are interesting, but they do not favor natural selection over neutral evolution. Instead they provide information regarding permissible versus non-permissible diversification"

I assume "permissible" refers to purifying selection? Which is an example of natural selection – rather than random, neutral diversification. It is distinct from the adaptive/Darwinian evolution that allows the ecological differentiation of the bacteria. Not separating neutrality, purifying selection and adaptive evolution plagues the whole manuscript. The authors have a strain distribution that appears to reflect ecological differentiation, rather than neutral diffusion. They have a mutation distribution that mainly seems to reflect purifying selection, rather than neutral evolution. Some analyses (e.g. below) suggest adaptive evolution, which is potentially very interesting, but the overall trend is consistent with purifying selection. The authors need to clearly distinguish their different conclusions.

We are extremely thankful for the reviewer for bringing up this point, which made us revise our manuscript carefully to elucidate which aspects of our data are consistent with neutral evolution, purifying selection, and adaptive evolution. We hope that our revisions summarized below will satisfy the reviewer.

First, to convey that a skewed AAST frequency distribution is incompatible with a purely neutral process, but is indicative of purifying selection that is consistent with modern neutral theories of molecular evolution, we have added the following text at the introduction of the AAST frequency distribution:

"While such a skewed AAST frequency distribution cannot be explained by strictly random mutational process (Figure 3B light-gray shaded area), it is compatible with standard theories of neutral or nearly-neutral evolution, since such theories consider the role of purifying selection (Ohta and Gillespie, 1996)."

To support this point and to illustrate how neutral events alone are unable to recapitulate the observed AAST distribution, we have calculated the expected AAST frequency distribution under strictly neutral mutational events by considering the codon frequencies observed in the HIMB83 reference sequence, the edit distance between codons, and the transition/transversion ratio in a probabilistic model (we added a new Materials and methods section "Estimating a neutral AAST frequency distribution" to detail our approach, and updated our reproducible workflow with the details of the procedure). Briefly, we used extremes for the transition probabilities between codons to calculate upper and lower bounds for a strictly neutral AAST frequency distribution, and updated panel B in our Figure 3 to illustrate the expected AAST frequency distribution in the absence of both purifying and adaptive selection (light gray shaded area).

Next, we revised the text to explicitly state and offer interpretation to the apparent role that purifying selection has in maintaining the AAST frequency distribution across distant geographies. We replaced the following text in our previous submission with the following one:

"The overall consistency of AAST frequency distributions across geographies supports the hypothesis that purifying selection maintains amino acid exchangeability within 1a.3.V and enables an interpretation of these data through a neutral model: SAAVs composing the AAST frequency distribution represent primarily neutral mutations that have drifted to measurable levels, and the lack of SAAVs in uncommon AASTs arises due to a strong purifying selection acting against deleterious mutations that occur frequently between dissimilar amino acids."

Moreover, in our previous manuscript we had stated: “Yet, close inspection of AAST frequencies revealed that amino acid exchange abilities subtly diverge in a pattern that correlates with water temperature and/or its co-variables (Figure 3B insets, Figure 3—figure supplement 2).”

But after careful consideration of the reviewer’s point, we found that this statement does not do justice in describing the subtlety of this signal in comparison to the predominant signal of purifying selection. To address this, we have amended the text so that it now reads:

"However, a closer inspection reveals a subtle divergence of amino acid exchangeabilities that correlates with water temperature and/or its co-variables (Figure 3B insets, Figure 3—figure supplement 2). Note that this divergence is AAST specific; for example, positions with mixed proportions of glutamic and aspartic acid are less commonly found in warm waters (linear regression, uncorrected p-value: 2.7×10^(-6)), yet for isoleucine and valine such a correlation is nonexistent (linear regression, uncorrected p-value: 0.418). These findings suggest that amidst a signal that is predominantly indicative of purifying selection, there appears to be a fingerprint of adaptive/divergent processes caused by temperature and/or its co-variables that subtly shift the mutational profile within 1a.3.V."

In addition, in the section entitled “Temperature correlates with amino acid allele frequency trajectories”, we have made more substantial changes that are detailed in our response to the next reviewer concern. Finally, we added a final paragraph that summarizes the ratio between these processes as suggested by the data, which is also described later in this document.

Delmont et al: "we determined that thousands of 1a.3.V allele frequency trajectories correlated with in situ temperature, in line with the biogeography of proteotypes"

It is not surprising that many of the variable sites correlate with temperature, because the proteotypes are defined using the variable sites, and the proteotypes correlate with temperature. The authors should seek to synthesize these data – do the variable sites have properties that suggest selection between niches? Do they concentrate in particular genes that might drive adaptation to different niches? Or do they simply correlate with the population structure?

We are again very thankful for this question, which helped us to make a remarkable observation that not only supports our findings but also will be helpful for those who study SAR11 biology. Unfortunately, as of today, scrutinizing all 4,592 positions where amino acid frequencies correlated with temperature with their structural context in protein structures is a computationally intractable problem. However, to tackle this important point we surveyed all genes to quantify the ratio of temperature-correlated and temperature-uncorrelated SAAV positions in each, and then focused on a single gene, which encoded glycine betaine ATP-binding cassette transporter, for an in-depth analysis using its predicted structure. The reason we chose this gene is two-fold. First, betaine transporters of SAR11 are highly translated proteins in the environment that transport osmolyte compounds into cells for energy production, thus playing an essential role in SAR11 biology. Second, this gene exhibited a high ratio of temperature-correlated and temperature-uncorrelated SAAV positions, enabling us to tease apart the distribution of these SAAVs in the structure of the gene while being able to rely on the large number of positions to test whether our observations are statistically meaningful. We laid out our findings in a new paragraph, with this critical section:

“(…) To investigate the positioning of amino acids in the tertiary structure of the permease structure relative to the cellular membrane, we first categorized the location of each residue as transmembrane, cytosolic (inside the inner membrane), or periplasmic (outside the inner membrane) (Figure 3—figure supplement 4). Positions that were not correlated with temperature were commonly transmembrane, and infrequently periplasmic. In contrast, most positions that correlated with temperature were periplasmic (Figure 3—figure supplement 4). The probability of observing a similar distribution between temperature-correlated and temperature-uncorrelated positions across transmembrane, periplasmic, and cytosolic regions was only 0.034 (analytic trinomial test, temperature-uncorrelated distribution as prior), which indicates temperaturecorrelated positions are subjected to unique evolutionary forces. A previous study suggested that periplasmic residues of transmembrane proteins undergo higher rates of adaptive evolution due to their increased exposure to changing environmental conditions (Sojo et al., 2016). This observation lends additional support to the hypothesis that periplasmic SAAV positions within this gene that correlate with temperature are more likely shaped by adaptive processes.”

We also added a new supplementary figure (Figure 3—figure supplement 4) to put this observation into a visual context.

We hope the reviewer appreciates our effort to show that correlation with temperature is not evenly or randomly distributed in the context of this gene. In contrast, there is a significant coupling between the kind of variation and the structural context. We are striving to improve upon current computational limitations to ensure that in the future investigations of such phenomena will be able to explore a wider range of structural diversity.

Overall, I return to my original point from the first round of reviews: is the diversification mainly neutral; dominated by purifying selection; or dominated by adaptive evolution? The most interesting signals may not be from the dominant evolutionary process, of course.

We are certain that the reviewer agrees that quantifying these differential forces accurately is a challenging task. However, we also acknowledge that the reviewer is raising a valid question which will be asked by many readers, and a gross summary is necessary. We addressed this need by including the following paragraph in our revised manuscript to solidify our consideration of the primary evolutionary forces, and to offer a broad summary of our data that shows their ratios. We also used this opportunity to include a limitations statement to warn careful readers with the hope that they will take these suggestions with a grain of salt:

“One question remains: what is the proportion of distinct evolutionary processes acting upon closely related SAR11 populations within 1a.3.V? Offering a precise answer to this critical question is compounded by multiple theoretical and technical factors. […] In summary, this gross summary of the data suggests that among the remarkable amount of variation within some of the most abundant and prevalent microbial populations in the ocean, adaptive evolutionary processes operate on core genes are responsible for variation in about 2% of all codon positions.”

We thank the reviewer for pushing us to better communicate our findings.

In response to point 2, the authors state:

Delmont et al: "As far as we know, there is no published study that effectively takes into account multiple SNVs per codon without calculating the exact codon frequencies to estimate synonymity accurately…SNV density was high for 1a.3.V and many SNVs co-occurred in the same codon, rendering classical d(N)/d(S) analyses limited, which had never been challenged with extremely complex environments, if not completely irrelevant."

Yang (2007) MBE 24(8):1586-91 – with over 6,000 citations – describes how baseml can be used to calculate d(N)/d(S) using base substitution, rather than codon substitution, models. This accounts for multiple substitutions per codon, and provides a statistical test for evidence of selection versus neutrality, unlike the methods presented in the current version. It does not matter how complex the environments are – indeed, d(N)/d(S) has been applied to the question of SAR11 adaptation to different ecologies by Brown et al., 2012 and Luo and Hughes, 2012, Mol. Syst. Biol. 8:625. The authors should at least apply standard methods to the assembled genomes, based on the ecological separation they have determined. Methods also exist for calculation of this statistic from metagenomic data, and might be more informative.

We apologize for the vague statement. Indeed, when performing comparative genomics, it is possible to calculate d(N)/d(S) even in the case of multiple substitutions per codon. The critical point here is that both Yang as well as Brown et al. relied on comparative genomics to study genes under positive selection, i.e., they did have genomes to compare with each other (for instance, Brown et al. wrote in their manuscript: “Signatures of positive selection were detected in the two polar genomes compared with their most closely related tropical counterparts”). Our study differs from these studies and other previous attempts as it deals with novel challenges that arise when working with metagenomes. To address these additional challenges that none of the previous studies addressed, we developed a new workflow to determine SAAVs regardless of the number of SNVs per codon in a given metagenome without relying on comparative genomics. We regret that our previous responses may have compounded this misunderstanding; however, the manuscript clearly states that our contribution is from the perspective of metagenomics rather than comparative genomics:

“SAAVs: Accurate characterization of non-synonymous variation

Percent identity distributions are useful to assess overall alignment statistics of short reads to a reference; however, they do not convey information regarding allele frequencies, their functional significance, or association with biogeography. […] Thus, quantifying frequencies of full codon sequences is a requirement to correctly assess synonymity, and it is this principle we employed in our workflow to extract SAAVs.”

We hope this clarification will satisfy the reviewer.

Delmont et al: "The significant shift in SAAVs/SNVs ratio between cold currents and warm currents is of relevance to our study, especially in light of other insights (especially, the linkage between allele frequency trajectories and in situ temperature, and the biogeography of proteotypes), favors the predominant role of natural selection acting on the permissible amino acid diversification traits of this global SAR11 lineage."

The correlation between allele frequencies and temperature, and proteotype and temperature, is confounded by the correlation between proteotypes and allele frequencies. This should be explained (if this is wrong, please also make that clear in the text).

We agree with the reviewer that allele frequency trajectories confound proteotype classification, and so we have removed the sentence, “This is corroborated by thousands of allele frequency trajectories correlated with in situ temperatures” from the text.

The authors need to make clear whether the natural selection they are referring to is stronger/weaker purifying selection changing the SAAV/SNV ratio; or positive selection driving adaptive diversification increasing the SAAV/SNV in some populations.

We agree that the manuscript needed to clearly distinguish between varying levels of purifying selection versus adaptive evolution, and we hope the reviewer is satisfied with our consistency in delineating the evolutionary forces that can be gleaned through our data in the latest version of our manuscript. In the specific case of the SAAV/SNV ratio, we have opted to remove our SAAV/SNV ratio analysis from our revision altogether. Our rationale stems from planned future efforts to create a more robust and meaningful metric, in contrast to the current one that has weaknesses as the reviewer pointed. Nevertheless SAAV/SNV ratio was only ever briefly mentioned in our study, and the exclusion of it does not influence our findings. We thank the reviewer for pointing this.

Delmont et al: "Regarding the fixation index, we agree with the reviewer that this would have been a valuable addition to our work…We believe the significant amount of methodological development in this work that offers descriptions of the biogeography of the most abundant lineage in surface oceans conveys a story that will not benefit from the inclusion of additional analyses. "

I don't understand how this statistic can be "a valuable addition", yet the story "will not benefit from the inclusion of additional analyses.". The deep learning analysis is still not intuitively explained in the new version (not the authors' fault, as these methods are designed to be complex). It should be complemented by fixation index calculations, which are simple, and would allow the readers a much greater insight into the basis of the results.

We thank the reviewer for their persistence in benchmarking the Deep Learning method against classically used and well-understood methods. We have now calculated fixation indices for the SAAV data, and created a new computational tool to make sure this approach is accessible to other researchers in a more straightforward manner, as well. The results recapitulate our primary observations of distant geographies being linked, and we have added a new supplementary figure (Figure 4—figure supplement 4) comparing the dendrograms and placement of the clusters onto the world map.

https://doi.org/10.7554/eLife.46497.041

Article and author information

Author details

  1. Tom O Delmont

    Department of Medicine, The University of Chicago, Chicago, United States
    Contribution
    Conceptualization, Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft
    Contributed equally with
    Evan Kiefl
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-7053-7848
  2. Evan Kiefl

    1. Department of Medicine, The University of Chicago, Chicago, United States
    2. Graduate Program in Biophysical Sciences, University of Chicago, Chicago, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—review and editing
    Contributed equally with
    Tom O Delmont
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-6473-0921
  3. Ozsel Kilinc

    Department of Electrical Engineering, University of South Florida, Tampa, United States
    Contribution
    Software, Writing—review and editing
    Competing interests
    No competing interests declared
  4. Ozcan C Esen

    Department of Medicine, The University of Chicago, Chicago, United States
    Contribution
    Software, Visualization, Methodology
    Competing interests
    No competing interests declared
  5. Ismail Uysal

    Department of Electrical Engineering, University of South Florida, Tampa, United States
    Contribution
    Software, Supervision
    Competing interests
    No competing interests declared
  6. Michael S Rappé

    Hawaii Institute of Marine Biology, University of Hawaii at Manoa, Kaneohe, United States
    Contribution
    Supervision, Validation, Writing—review and editing
    Competing interests
    No competing interests declared
  7. Steven Giovannoni

    Department of Microbiology, Oregon State University, Corvallis, United States
    Contribution
    Supervision, Validation, Project administration, Writing—review and editing
    Competing interests
    No competing interests declared
  8. A Murat Eren

    1. Department of Medicine, The University of Chicago, Chicago, United States
    2. Marine Biological Laboratory, Woods Hole, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    meren@uchicago.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-9013-4827

Funding

University of Chicago

  • A Murat Eren

Marine Biological Laboratory

  • A Murat Eren

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank the TARA Oceans consortium for generating metagenomic datasets of great legacy, as well as all researchers involved in the characterization of SAR11 isolates. We thank Kostas Konstantinidis, Edward Delong, Lois Maignien, Mike Lee, and the members of the Meren Lab for helpful discussions. We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) to EK, as well as the Frank R Lillie Research Innovation Award and the start-up funds from the University of Chicago to AME.

Senior Editor

  1. Detlef Weigel, Max Planck Institute for Developmental Biology, Germany

Reviewing Editor

  1. Paul B Rainey, Max Planck Institute for Evolutionary Biology, Germany

Reviewers

  1. Paul B Rainey, Max Planck Institute for Evolutionary Biology, Germany
  2. Fred Cohan, Wesleyan

Publication history

  1. Received: March 1, 2019
  2. Accepted: August 13, 2019
  3. Version of Record published: September 3, 2019 (version 1)

Copyright

© 2019, Delmont et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 903
    Page views
  • 115
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Genetics and Genomics
    2. Microbiology and Infectious Disease
    Isabella Vlisidou et al.
    Research Article
    1. Genetics and Genomics
    2. Plant Biology
    Mark Zander et al.
    Research Article Updated