Hybridization breaks species barriers in long-term coevolution of a cyanobacterial population

  1. Gabriel Birzu  Is a corresponding author
  2. Harihara Subrahmaniam Muralidharan
  3. Danielle Goudeau
  4. Rex R Malmstrom
  5. Daniel S Fisher  Is a corresponding author
  6. Devaki Bhaya  Is a corresponding author
  1. Department of Applied Physics, Stanford University, United States
  2. Department of Physics, University of Florida, United States
  3. Department of Computer Science, University of Maryland, United States
  4. DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, United States
  5. Department of Plant Biology, Carnegie Institution for Science, United States

eLife Assessment

This study provides important insights into bacterial genome evolution by analyzing single-cell genome sequences of cyanobacteria from Yellowstone hot springs. Using compelling evidence, the authors demonstrate that both homologous recombination within species and frequent hybridization across species are major drivers of genome diversification. Despite the challenges that are inherent to sparse and fragmented single-cell data, the analyses are thorough, carefully controlled, and supported by multiple complementary approaches, making the conclusions highly robust. This work represents a significant advance in our understanding of microbial evolution in natural environments.

https://doi.org/10.7554/eLife.90849.3.sa0

Abstract

Bacterial species often undergo rampant recombination yet maintain cohesive genomic identity. Ecological differences can generate recombination barriers between species and sustain genomic clusters in the short term. But can these forces prevent genomic mixing during long-term coevolution? Cyanobacteria in Yellowstone hot springs comprise several diverse species that have coevolved for hundreds of thousands of years, providing a rare natural experiment. By analyzing more than 300 single-cell genomes, we show that despite each species forming a distinct genomic cluster, much of the diversity within species is the result of hybridization driven by selection, which has mixed their ancestral genotypes. This widespread mixing is contrary to the prevailing view that ecological barriers can maintain cohesive bacterial species and highlights the importance of hybridization as a source of genomic diversity.

eLife digest

Bacteria are among the most successful organisms on Earth. These small, single-cell microbes inhabit nearly every environment on the planet and play essential roles in health, disease, ecology and industry. Their rapid cell division and ability to take up DNA from the environment allow them to adapt quickly to new challenges. After uptake, bacteria foreign DNA into genomes through a process known as recombination. This enables them to acquire beneficial genetic variants, such as antibiotic resistance genes, and spread them through the population.

Although the molecular mechanisms responsible for bacterial recombination have been extensively studied, its evolutionary consequences remain poorly understood. Unlike sexually reproducing organisms, in which the entire genome is recombined every generation, bacteria exchange only a small fraction – typically less than 1% of their genome. In theory, recombination could reduce genetic differences within a population, making them more similar. Alternatively, if genetic divergence between bacterial strains continued to accumulate despite recombination, those strains could eventually evolve into distinct species.

Previous research on thermophilic cyanobacteria from Yellowstone National Park revealed evidence of frequent DNA exchange between these microbes. This recombination produced genomes essentially containing – even in their conserved core – a random assortment of gene variants from across the population. Building on this work, Birzu et al. investigated whether such recombination affects the cohesiveness of different species or whether DNA exchange occurs primarily within a species. Answering this question is an important step toward developing quantitative models of evolution in natural microbial populations.

Birzu et al. quantified the impact of recombination between species – referred to as hybridization – on genetic diversity within species. To do so, they developed a method to identify hybrid DNA segments transferred from other species. When the donor species is known, hybrid segments can be identified simply by comparing genomes, a method typically used to identify Neanderthal DNA in human genomes. However, this approach may fail when donor genomes are unavailable or extinct.

To overcome this limitation, Birzu et al. exploited the fact that genetic distances between species are typically much larger than those within species. Hybridization would thus generate long sequences of highly correlated mutations within the recipient population. Using this idea, Birzu et al. identified hybrid segments and showed that hybridization accounted for up to 95% of the genetic diversity within one of the species.

These results suggest that – rather than diverging – the Yellowstone cyanobacterial population is being homogenized by recombination, leading to a gradual blending between different species. These results have broader implications for understanding microbial evolution in contexts such as the emergence of new pathogens or the adaptation of marine microbes to climate change. They suggest that microbial species may function less as a small number of cohesive strains, and more as quasi-sexual populations with continual DNA exchange within and between species. Extending the approach from this paper to other microbial populations and developing new methods to quantify the impact of hybridization will help clarify which of these evolutionary scenarios is most common in nature.

Introduction

The origin, persistence, and extinction of species in obligate sexual populations is a fundamental problem in evolutionary theory (Coyne and Orr, 2004). But in bacteria, long thought to be primarily asexual, our understanding is still very limited (Fraser et al., 2009; Rocha, 2018). While it has long been known that bacteria can acquire non-essential genes from distantly related strains (Rivera et al., 1998), in recent years, studies across a wide range of bacteria have found that homologous recombination of core genes is also ubiquitous (Didelot and Maiden, 2010; Yahara et al., 2016; Rosen et al., 2015; Garud et al., 2019; Sakoparnig et al., 2021). How does the interplay between selection acting both asexually and on transferred elements, and barriers to recombination from geographical separation, mechanistic effects, and epistatic interactions lead to formation and maintenance of cohesive species?

One explanation is that ecological specialization provides barriers to recombination (Kashtan et al., 2014; Arevalo et al., 2019). This view is supported by evidence that selective sweeps of individual genes through parts of a population can lead to distinct ecological subtypes, or ecotypes, even in highly recombining bacterial populations (Croucher et al., 2011; Shapiro et al., 2012; Bao et al., 2016). But it also raises an important open question: if species coexist and recombine with each other, do genetic differences continue to increase over evolutionary time, or does recombination gradually break down barriers between species?

Answering this question is challenging because data on the underlying evolutionary dynamics is lacking. The time scales of direct observations and laboratory experiments are too short, while those corresponding to phylogenetic relationships between extant species are too long. And since spatial dynamics over these intermediate time scales are unknown in most bacteria (Hanson et al., 2012; Garud et al., 2019), it is difficult to control for changes in local community structure during evolution. Therefore, microbial communities that coevolved for long time periods can act as natural experiments for studying bacterial speciation. The cyanobacterial populations found in hot springs from Yellowstone National Park form a natural incubator in which one such natural experiment is ongoing. These communities have two key features that allow us to identify and quantify the effects of coevolution on intermediate time scales.

First, like other thermophilic communities, these populations are geographically isolated (Papke et al., 2003; Whitaker et al., 2003). To date, the two most abundant species, Synechococcus A and B, have not been found outside of North America, suggesting that their global mixing time is much longer than for most other bacteria (Papke et al., 2003). Comparison of two isolate genomes from these species, OS-A and OS-B’, suggest a divergence time considerably longer than the formation of the Yellowstone caldera around 640,000 years ago (Rosen et al., 2018). This would imply their ancestors colonized the caldera independently.

Second, these populations have considerable sub-species diversity, suggesting ongoing evolutionary processes generate and maintain it. Analysis of variants of 16S rRNA and marker genes was initially interpreted as evidence of asexual ecotypes occupying distinct niches (Ward et al., 2006; Ward et al., 1998). However, deep amplicon data from a diverse, but largely OS-B’-like sample revealed extensive recombination (Rosen et al., 2015). Genetic exchange with other genomic clusters—which we refer to as hybridization—was also evident, particularly with relatives of OS-A. To date, evidence of hybridization in bacteria has mainly been from comparisons between reference genomes from large genomic databases (Smillie et al., 2011; Diop et al., 2022), with only a handful of studies investigating hybridization in natural environments (Tyson et al., 2004; Lo et al., 2007; Sheppard et al., 2008). The primary focus of the current paper is to determine the impact of hybridization on the long-term coevolution of a natural microbial community.

Characterizing hybridization requires long-distance linkage between variants that cannot be obtained from traditional metagenomics. Strain isolates are often not representative of the population diversity (Browne et al., 2016; Zheng et al., 2022), while metagenome assembly destroys linkage. Recruiting metagenomic reads to reference genomes can, in some cases, be used to infer linkage (Garud et al., 2019), but the high diversity of Synechococcus prevents us from using such approaches. To circumvent these problems, we used single-cell genomics (Rinke et al., 2013; Zheng et al., 2022) to obtain over 300 genomes from the Mushroom and Octopus Springs, which contain two of the most well-studied communities from the Yellowstone caldera (Ward et al., 2012). This is to our knowledge the first large sample of genomes from hot spring cyanobacterial populations without the compositional bias due to strain isolation.

Results

Population comprises three distinct highly recombining genomic clusters

Previous analysis of 16S amplicons identified two main clusters similar to OS-A and OS-B’, but also several others at frequencies below 1% (Rosen et al., 2018). But whether clusters within 16S amplicons correspond to clusters at the whole-genome level in such a rapidly recombining population is not clear (Rosen et al., 2015; Miller and Carvey, 2019). Therefore, we first compared each single-amplified genome (SAG) to Synechococcus OS-A and OS-B’. Almost all SAGs (330 out of 331) comprised two clusters with an average nucleotide divergence of ∼2% from one of the reference genomes and ∼15% from the other. Note that the divergence between clusters is well above the 5% threshold that is commonly used to distinguish different species (Nayfach et al., 2016; Jain et al., 2018; Garud et al., 2019). We labeled the cluster closer to OS-A by α (128 SAGs) and the other by β (202 SAGs), to distinguish them from the reference genomes. Surprisingly, one SAG (approximately 70% complete) was ∼15% diverged from both OS-A and OS-B’ and formed a separate cluster which we labeled γ (Figure 1A). Clustering methods based on different metrics gave identical results (Appendix 2—figures 2 and 3).

Synechococcus population comprises three genomic clusters.

(A) The mean divergence across homologous genes from the two reference genomes is shown for every genome in the data. Each dot represents a single genome and is colored according to its assigned species. (B) The relative abundances of all three species across samples from different temperatures within Mushroom Spring. Abundances were estimated based on read recruitment from the metagenome to core genes from the γ genome, as well as representative genomes from the α and β populations. Dashed lines show the minimum detection frequencies from the single-cell (black) and metagenome (gray) data, respectively (see Appendix 3). Shaded gray region shows temperature range of the eight single-cell samples.

To investigate the extent that these clusters characterize the distribution of genomes throughout the two springs, we used single-cell genomes as references and recruited reads from a large collection of metagenome samples to determine the abundance of α, β, and, particularly, γ across different environments (see Appendix 3). We found that 95% of reads were part of a main cloud of sequences with ≤5% nucleotide divergence from one of the three main clusters. We used the fraction of reads within each main cloud to infer the abundances of each cluster in the metagenomic samples. The abundance of γ was highly correlated with α (Appendix 3—figures 3 and 4) and increased rapidly with temperature in the Mushroom Spring samples (Figure 1B). One interpretation of this pattern is that α, β, and γ represent distinct species adapted to different temperatures. Alternatively, genotypes within the same cluster could be adapted to different temperatures or other environmental conditions. The relative abundances of the clusters would then reflect the complex eco-evolutionary processes that maintain the diversity in the population.

Previous analysis using a much more limited dataset proposed that β formed a quasisexual population, in which rapid recombination led to random assortment of alleles within the cluster (Rosen et al., 2015). Motivated by these results, we investigated the impact of recombination on the diversity within the two main clusters.

We quantified recombination rates in α and β using a standard measure of linkage disequilibrium, σd2. Given a pair of sites with alleles a/A and b/B, σd2 is defined as

(1) σd2=(fabfafb)fa(1fa)fb(1fb),

where is the average across pairs of sites. In asexual populations with conventional demographic drift, σd2=5/11 independently of the location of the sites. Recombination events affecting one but not both sites can unlink mutations, leading to a decrease in σd2. Because sites further apart have more opportunities to become unlinked, σd2 decreases with the distance between sites.

In β, the linkage disequilibrium σd2 decreased by around a factor of 10 over distances of ∼1 kbp (Figure 2A). This agreed quantitatively with earlier results based on deep amplicon sequencing from Rosen et al., 2015, which showed a decrease by a factor of ∼1 over distances of up to 300 bp, using a different but related measure of linkage, conventionally denoted as r2. Over genome-wide length scales, linkage was small (σd2=0.06) and close to a fully unlinked control (σd2=0.04). We found similar results for α, but with a quantitatively slower decrease in linkage by a factor of ∼8 over ∼1 kbp. Nevertheless, genome-wide linkage values were still close to the fully unlinked control (σd2=0.14 for the data vs σd2=0.12 for the control). As a consequence of rapid recombination, we found that diversity in marker genes, such as 16S rRNA, was poor predictors of whole-genome diversity (Figure 2B). Further details and analyses can be found in Appendices 4 and 8. Appendix 8 also provides estimates of the recombination parameters for each species and compares these to estimates obtained from previously proposed methods (Didelot and Wilson, 2015; Lin and Kussell, 2019). Altogether, these results are consistent with each cluster representing a quasisexual population and imply that phylogenies inferred from whole genomes do not reflect the evolutionary history of the population.

Genetic linkage analysis shows extensive recombination within α and β.

(A) Linkage disequilibrium σd2 as a function of separation between SNPs x averaged across all core genes from α (orange circles) and β (blue squares). Genome-wide estimates from SNPs in different genes are shown to the right of the black dashed line using the same symbols. Fully unlinked controls for the genome-wide linkage are shown by crosses of corresponding colors. Dashed lines show fit to linkage decay curves in an infinite sample using Equation 6 for α (orange) and β (blue) separately. (B) Scatter plot of 16S and genome-wide divergences between all pairs of genomes within the same species. Solid lines show linear regressions indicating very weak Pearson correlation coefficient, R, within each species. This demonstrates that 16S divergence is a poor proxy of whole-genome divergence.

While linkage decay in α and β was similar, we found diversity patterns at the gene level were qualitatively different (Figure 3A). The typical diversity at synonymous sites in β was πS0.04, with ∼2× variation across genes (Figure 3A), consistent with previous results (Rosen et al., 2018). In contrast, πS in α varied wildly by over two orders of magnitude (Figure 3A), ranging from πS0.2 in some genes to others that did not have a single mutation. Note that because here we only used synonymous sites, these variations cannot be explained by differences in the strength of purifying selection or the accumulation of non-synonymous mutations during rapid adaptation (Neher, 2022). Importantly, the low-diversity segments of the α genome spanned multiple samples and springs and are thus unlikely to be the result of a very recent clonal expansion as is often seen in human commensal and pathogenic bacteria (Didelot and Wilson, 2015; Garud et al., 2019; Sakoparnig et al., 2021; Weimann et al., 2024).

α and β have distinct patterns of within-species diversity.

(A) The synonymous diversity πS within α (orange) and β (blue) at all core genes ordered according to their respective reference genomes. The solid line shows the value of πS, with lighter colors showing one standard deviation across different pairs around the mean. Both lines were smoothed by averaging over a sliding window of five genes with a step size of two genes. (B) Distribution of the fraction of polymorphic fourfold degenerate (4D) sites for each α core gene (blue histogram) is shown together with Poisson fit based on the number of genes without any polymorphic sites (black line and dots). The inset shows zoomed-in region of low-diversity genes (<5% polymorphic 4D sites) together with the same fit. (C) The same as (B) using data from β. The Poisson distribution is fitted using the mean of the distribution, which provides a better fit than the number of genes without polymorphic sites (data not shown). (D) Histogram of synonymous diversities πS within genes for β (blue), α low-diversity backbone (orange), and α high-diversity (hybrid) genes (red). Note the overlap between the diversity in the β core genome and the hybrid α genes.

Beyond the low-diversity genes in α, we found genomes in both species containing segments very similar or identical to homologous regions from the other species. We performed extensive checks to verify that these segments were not assembly artifacts or cross-contamination between reaction wells during whole-genome amplification (Appendix 10). Such segments could represent accessory genes transferred between species, as has been widely reported in both Synechococcus and other species (Bhaya et al., 2007; Smillie et al., 2011). Alternatively, they could be the result of homologous recombination with other species within the core genome. We refer to the later process as hybridization throughout. The rest of our analysis aims to quantify the extent of hybridization and its impact on the evolution of the population. We began by identifying the core genes shared by α and β.

Gene-level analysis reveals distinct patterns of hybridization between species

We defined core genes by constructing groups of orthologous genes (orthogroups) and identifying those that were likely present in all genomes. Specifically, we clustered gene sequences in several steps using standard methods based on similarity (Appendix 1). We used phylogenetic distances to identify and separate distant paralogs within the same orthogroup. To assign clusters corresponding to each species, we analyzed the abundance distributions of orthogroups in α and β separately. Because of the incomplete nature of our genomes (average coverage ≈0.3, with a range of ≈0.05–0.95; see Appendix 1), the assignment has to be done statistically. Distributions for both species were bimodal, with a peak near zero corresponding to flexible genes and another at high abundances corresponding to core genes. We used a binomial fit to determine single-copy genes that were likely present in all genomes and found 1825 core α and 1737 core β orthogroups. Of these, 1448 orthogroups were shared by both α and β and make up the Synechococcus core genes. We focus the rest of our analysis on these core genes.

We first performed a quantitative comparison of the synonymous diversity between α and β across the core genes. We hypothesized that the low-diversity segments of the α genome were the result of mutation accumulation since their common ancestor. If mutation rates across genes are similar, we expect the fraction of polymorphic sites fp for each gene to follow a Poisson distribution. Empirically, we found that for fp<0.05, the data closely matched the Poisson expectation with a single fit parameter (Figure 3B). This group, which we call the ancestral backbone ofα, contained around half of the core genes (675 out of 1448). Note that this result is distinct from the distribution of divergences between species, which was shown previously to also follow a Poisson distribution (Rosen et al., 2018). The remaining half (773 out of 1448) had up to ten times higher fraction of segregating sites and were inconsistent with the backbone polymorphism rate. As our subsequent analysis shows, the high diversity in these genes is the result of hybridization (see Figures 47 and subsequent discussion). We therefore refer to this group as the hybrid α genes. In β, the distribution of fp had a single peak, but a simple Poisson process was a poor fit to the data (Figure 3C). This suggests a substantial fraction of the diversity is not simply due to mutations but is introduced through recombination with other strains or species. Consistent with this interpretation, we found the distribution of synonymous diversities in the hybrid α overlapped that of β, and both were two orders of magnitude higher than in the α backbone (Figure 3D). These results suggest that hybridization had a large impact on the genetic diversity of both α and β. We therefore developed a method to identify hybrid segments directly and test this hypothesis.

Alignments reveal distinct patterns of hybridization across different genes.

Three core gene alignments, each chosen to illustrate a distinct pattern of hybridization, are shown. Rows represent different cells and columns different positions along the gene. One sequence was chosen randomly as a reference (top row), with nucleotides shown in different colors. The species assigned to each cell are indicated by the colored rectangles on the left, using the same color scheme as the species clusters (α shown in orange, β in blue, and γ in green). Rows were sorted hierarchically based on pairwise distances. Nucleotides are shown in gray if they are identical to the reference genome and in a different color if different. Species clusters were assigned as described in the main text and Appendix 1. (A) Segment of an alignment containing three distinct species clusters of a non-hybrid gene. (B) Segment of an alignment containing two distinct clusters (α and β) and six different simple hybrids. The direction of the transfer was inferred as described in the main text and is indicated on the left. Transfers from α into β and β into α are highlighted in orange and blue, respectively. Note that the presence of more than one type of transfer is rare and was chosen here for illustration. (C) Segment of an alignment without distinct species clusters chosen to illustrate mosaic hybrid genes. Regions with clear hybrid blocks are highlighted in purple. Note the extensive hybridization on short length scales, despite the fact that sequences from the same species cluster with each other based on pairwise distances.

Transfers of whole-gene alleles between species are common.

The frequency of hybrid gene alleles along the α (A) and β (B) genomes. Different colors show the donor species, with unknown species labeled X and shown in magenta. The orange lines on top show mosaic loci without distinct gene clusters. Note that the x-axis in (B) is shared with (C). (C) The mean synonymous divergence between α and β at homologous loci, ordered according to OS-B’. Smoothed line over a sliding window of five genes with a step size of two genes is shown in black. Maximum and minimum divergences at each locus are shown in red and blue, respectively. (Insets, middle) A zoomed-in region from a typical genomic region (left) and a divergence trough (right), with dots representing individual genes. (Insets, bottom) Alignment segments of individual genes, subsampled to 40 sequences for clarity. The first sequence is arbitrarily chosen as the reference with each nucleotide represented by a different color. Other sites are colored gray if same as reference or using nucleotide colors if mutated. Left-colored rectangles show the species of each genome.

Diversity within hybrid genes and divergence troughs reveals mixture of soft and hard sweeps.

(A) The distribution of synonymous diversities in α genomes in the divergence troughs (purple histogram) is compared to a random sample of genes from the α backbone of the same size (orange histogram). The orange line shows the bootstrapped distribution of the α backbone. (B) Same as (A), but comparing divergence troughs in β genomes with the β core genome. (C) Each circle shows the synonymous diversity of a non-singleton simple hybrid locus (see Figure 5AB) in α. The donor species are shown in different colors as labeled on the x-axis. For transfers between α and β, the diversity within β is shown as squares of the same color as a control, with sequences from the same locus connected by a line. The orange horizontal lines show the synonymous diversity in the low-diversity α backbone. Transfers above this line likely contain multiple independent events.

Hybridization between species leads to mixing of short DNA segments within single genes.

(A, top) Alignment segment of α genomes from a representative locus with four SNP blocks. The inferred ancestral α haplotype (labeled α1) is highlighted in orange and the hybrid haplotype (all β here; labeled α2) is shown in blue. Note the SNPs in different blocks in different sets of cells. (Middle) Illustration of SNP block analysis pipeline. (Right) The consensus alignment of the two α haplotypes, β, and γ is constructed. (Left) Pairwise divergences between consensus sequences are calculated. Note that α2 is identical to β here. (B) The joint distribution of divergences of α SNP block haplotypes from the other species for all SNP blocks. Each point represents a haplotype with coordinates showing the divergence from β (x-axis) and γ (y-axis). The orange and blue dots show the two haplotypes from the example in (A). Marginal distributions are shown on the top and right. Note the high density of haplotypes near the axes, indicating transfer from one of the other known species. (C) Same as panel (B) for β block haplotypes. Note the much larger number of haplotypes far from the axes compared to (B), indicating transfers of unknown origin. (D) Histogram of linkage coefficients between pairs of block haplotypes at different loci for α (orange histogram) and β (blue histogram) compared to the expectation under a fully unlinked model (solid orange and blue lines, respectively).

We used hierarchical clustering to identify and characterize hybridization patterns between species. If hybridization is common, as in our case, inferences based on phylogenetic trees would not provide reliable estimates of pairwise divergences. We therefore used pairwise nucleotide distances directly. The distribution of pairwise distances was bimodal, with a clear gap around d=0.075, consistent with our comparisons to the reference genomes (Figure 1). Based on this observation, we defined species clusters for each gene using a simple cutoff at dc=0.075 (Appendix 1—figure 2) together with additional constraints to ensure that clusters were well separated and consistent with asexual divergence from a common ancestor (Appendix 1).

Detailed analysis of species clusters revealed evidence of hybridization in roughly half of the genome. We found two distinct patterns of hybridization. Around 75% of core genes (1078 out of 1448) had two to three clusters of sizes roughly proportional to abundances of α, β, and γ in the samples (Appendix 5—figure 1). The majority of these genes (844 out of 1078) had clusters, each of which contained sequences from a single species (Figure 4A), as expected if they were asexually diverged. We call these non-hybrid loci. A significant minority (234 out of 1078) contained a mixture of sequences from different species (Figure 4B), representing recent transfers of whole genes between species. We call these simple hybrid loci. The remaining 25% of core genes (370 out of 1448) formed either a single cluster or had one very large cluster containing most sequences (Figure 4C). As we show below, these large clusters result from complex patterns of hybridization within single genes. We thus call these mosaic hybrid loci. In around a quarter of these loci, we found evidence of complete or near-complete gene sweeps, but in most, the divergence between α and was typical. Almost half (297 out of 773) of high-diversity α genes were mosaic hybrids. To better understand the dynamics of hybridization in the population, we analyzed both hybridization patterns in detail, starting with the simple hybrid loci.

Simple hybrid genes are common and reveal distinct patterns of hybridization across species

We used a simple set of heuristic criteria to assign the transfer direction of simple hybrids. If species evolved by asexual divergence from a common ancestor, we expect each orthogroup to contain two to three clusters of sizes roughly proportional to the species abundances. For example, based on the average coverage and the proportion of β in our samples (Figure 1A), we expect the largest cluster to contain ∼80 sequences from β genomes. In some cases (45 out of 1063), we also find a minority of sequences from α genomes in this β cluster, which we infer to be the result of a hybridization event from β into α (Figure 4B). An alternative explanation would be a transfer from a third species that underwent a full selective sweep through β and only a partial sweep through α. This scenario would imply that hybridization is even more common than we infer, so our method is conservative. We infer transfers from α into β in a similar way (Figure 4B). When the third cluster is present and contains γ, we assign all other sequences in the cluster to hybridization with γ. Because we only have a single γ sequence, these clusters are difficult to classify. We cannot distinguish between transfer from and to γ, and it is possible that some simple hybrid loci involving γ may actually be mosaic hybrids. When the third cluster is present but does not contain γ or in cases where we find a fourth cluster, we assign those sequences to hybridization with an unknown species X. Using this procedure, we systematically determine the frequencies of different hybridization events in α and β.

We found simple hybrids were present in 5−10% of core genes in both species (Figure 5A, B). Specifically, we found 185 simple hybrid loci in α and 68 in β. Note that we conservatively excluded contigs that were fully hybrid from this analysis, so our numbers likely underestimate hybridization rates in the population (Appendix 5).

In both species, simple hybrids were scattered along the genome and contained alleles from different donor species (shown in different colors in Figure 5A, B), implying multiple hybridization events occurred independently, as expected when recombined segments are much shorter than the genome. Around 80% of α hybrid alleles were present in multiple cells, while in β the fraction was lower, at around 50%. The frequencies of simple hybrids were highly variable, ranging from singletons to frequencies close to 100%, consistent with a continual hybridization scenario. Surprisingly, around half of hybrids in β and a fifth of those in α were from unknown donors. However, a systematic search over the 34 metagenomic samples did not find evidence of other Synechococcus species in either spring (Appendix 3—figures 57). These sequences could be the result of hybridization with species that were formerly present in the caldera and are now either extinct or at very low abundances. Alternatively, the hybrid sequences could have resulted from hybridization with, as yet, undiscovered species from other springs. But regardless of the exact scenario, overall, these results suggest the current population is descended from multiple cyanobacterial species that hybridized with each other.

Soft and hard selective gene sweeps drive hybridization of whole genes

Is the genomic mixing between species the result of neutral processes or does positive selection for beneficial hybrids play an important role? Distinguishing between these alternatives is difficult in general, so we instead focused on cases where beneficial hybrid genes swept through the entire host species, that is, full gene sweeps. Full gene sweeps would greatly reduce the synonymous divergence between α and β, making them much easier to detect. Visually, we observed the synonymous divergence between α and β contained several divergence troughs, in which the average diversity was well below typical values. We identified the genes contained in these troughs by imposing a simple cutoff dtrough=0.2 on the synonymous divergence between α and β, which we chose close to the inflection point of the synonymous divergence distribution across genes (data not shown).

A systematic search revealed 91 very low divergence loci, spread across more than half a dozen divergence troughs (Figure 5C). Comparing alignments of typical genes and divergence through genes revealed a dramatic difference in fixed substitutions between species, which can only be explained by a recent selective sweep. The largest divergence trough contained 27 genes, representing around 25% of divergence trough genes, most of which were part of a nitrogen-fixation pathway. In addition, the nitrogen-fixation pathway is one of the largest genomic segments in which the gene order between α and β is conserved (Bhaya et al., 2007), suggesting all genes were transferred simultaneously. The most likely explanation is that at least one of the ancestors acquired the ability to fix nitrogen after it colonized the Yellowstone caldera and that this gave it a substantial selective advantage. The genetic diversity within the pathway genes gave further clues about the hybridization process. The diversity within the trough was similar in both species (πS=0.052 in α and πS=0.039 in β) and was close to the average diversity within β (πS=0.042), suggesting the β ancestor already had the ability to fix nitrogen when it colonized the caldera. However, the trough diversity was more than an order of magnitude larger than the diversity within the α backbone (Figure 6A, B). This large discrepancy strongly suggests α acquired the pathway from β through multiple independent transfers after they colonized the Yellowstone caldera.

We performed a similar analysis of diversity within simple hybrid genes to determine the direction and timing of transfers. We focused our analysis on α, where we had more data. Presence of hybrid genes at high frequencies suggests the hybridization process may be driven by selection. Two distinct patterns of diversity could result. If a single allele was transferred into α and then underwent a selective sweep, we would see very little diversity among the hybrid alleles. This is known as a hard sweep. Alternatively, if multiple hybrid alleles were transferred into α while undergoing a sweep, the diversity would be comparable to that in the host species. This is the pattern we observed in the divergence troughs and is known as a soft sweep.

We found evidence that both hard and soft sweeps contribute to hybridization in the population (Figure 6C). Around 15% of α simple hybrids (31 out of 153 genes) had no diversity at all, regardless of the donor species, consistent with hard sweeps driving their spread. In the remaining 85% (122 out of 153 genes), diversity within α was 1–2 orders of magnitude higher than the α backbone diversity. As a result, the diversity could not be due to mutation accumulation since the common α ancestor, but must have originated from multiple independent transfers—that is, a soft sweep. We further verified the presence of soft sweeps within α by comparing the diversity of individual genes. In most cases, the diversity within α at these loci was similar to β, with a handful of cases of α having significantly less, consistent with a subset of β alleles sweeping through the population simultaneously (Figure 6C).

Recombination within hybrid genes leads to extensive genomic mixing on short genomic length scales

Selective sweeps of individual genes could also lead to mixing of other genomic segments from the donor species into the host population through genetic hitchhiking. If the diversity within the two hybridizing species is low, this process would result in blocks of tightly linked SNPs consisting of primarily two haplotypes, one from the host and another from the donor (Figure 7A). Anecdotally, we frequently observed such patterns in mosaic hybrid genes (Figures 4C and 7A). We developed a systematic method to search for SNP blocks based on clustering the SNP linkage matrix. Briefly, we search for consecutive SNPs that have perfect linkage (see Appendix 6). We used α hybrid genes to calibrate our method and chose very conservative parameters to avoid false positives. Consistent with this, only around 17% of SNPs from the α hybrid genes belonged to blocks.

Despite the very conservative thresholds used in our algorithm, we found over one thousand SNP blocks in each species, scattered across hundreds of genes, including in highly conserved ribosomal genes and genes in which distinct species clusters could be assigned (Figure 7BC; Appendix 6—figures 1 and 2). Most blocks were short, with typical lengths around 70 bp in α and around 30 bp in β (Appendix 6—figure 1). The divergences between SNP block haplotypes were similar to that between species, consistent with hybridization (Appendix 6—figure 3). To confirm their hybrid origin, we mapped the haplotypes from each block to the two other species in our sample. We found 66% (406 out of 605) of SNP blocks within α and 15% of those within β (128 out of 776) contained a haplotype identical to consensus of other species (Figure 7B). Note that we only considered blocks in which all three species were present for this comparison. In addition, we found a comparable number of blocks that were statistically similar but did not match the other species α, β, and γ (Figure 7B, C). The most likely explanation is that these blocks resulted from hybridization with other species, similar to the whole-gene hybrids found previously (Figure 5).

How much linkage is maintained between blocks? Given the relatively high divergences between species, one might expect epistatic interactions between blocks to constrain the genotypic structure of the population. Conversely, if recombination is dominant, each genome would be a mosaic of segments from the two ancestral species. To answer this question, we calculated the linkage coefficients between haplotypes for each pair of blocks and compared their distribution to a fully unlinked control (see Appendix 6 for details). Within the β species, the two distributions were almost identical, implying that the genomes are consistent with a random mixture of block haplotypes, as was previously proposed (Rosen et al., 2015). Within α the linkage coefficients were statistically larger than those predicted from the unlinked control (Figure 7D), but the overall value was still low, with r20.12 for data compared to r20.05 for unlinked control. These results are consistent with the slower rate of linkage decrease within α compared to β discussed previously. We also found similar results for the linkage between hybrid gene alleles (Appendix 4). Thus, rather than finding clonal subpopulations, every one of the ∼300 cells in our sample has a unique combination of SNP block haplotypes.

Discussion

By analyzing over 300 Synechococcus single-cell genomes, we show that hybridization plays a major role in shaping their genomic diversity. Our results suggest a simple scenario for the evolutionary history of the population (Figure 8; see also Appendix 7). Multiple strains were geographically separated and evolved independently into distinct species from an ancestral population. They then colonized the Yellowstone caldera after its formation around 640,000 years ago and have since been gradually hybridizing. The presence of genes with multiple-sequence clusters and previous analysis of 16S sequences (Rosen et al., 2018) suggests at least four independent colonizers, but the current population is dominated by the descendants of only three: the ancestors of α, β, and γ. The β population has been hybridizing with others and recombining with itself long enough that its genomes are a mosaic of different ancestral segments, consistent with a quasisexual population. But α still exhibits a large fraction of low-diversity genomic segments, interlaced with clearly hybridized segments. Molecular clock estimates using the low-diversity regions of the α genomes suggest it emerged around 1045 years ago and on these time scales the population is well-mixed across springs (Appendix 9—figure 1). Using the hybridized genomic segments of α as a control, we also estimated the contribution of de novo mutations to the β diversity. This analysis suggests that the ancestor of β colonized the Yellowstone caldera soon after its formation (Appendix 7).

Hybridization between previously isolated lineages is the main source of diversity within the population.

(Main panel) An illustration of the evolutionary scenario consistent with the main results of this paper. A single common ancestor (gray) for the population (t0) diverged into several species (four shown here) following spatial isolation (t1). The species colonized the Yellowstone caldera, represented by the yellow rectangle, at different times (t2t3) and began hybridizing. This process continued up to the present (t4). (Top inset) Illustration of the mechanism for generating unlinked SNP blocks through recombination between a hybrid segment (blue) containing a beneficial region (red star) and the rest of the host species (orange). Note the four distinct haplotypes highlighted at the end. (Bottom inset) Zoom in of the ancestry of a particular genome (middle) compared with the donor (orange, top) and host (green, bottom) ancestral genomes. After the initial hybridization, recombination with other host genomes (red arrows) creates a mosaic structure across a range of genomic length scales.

Diversity within species is often modeled using neutral processes (Didelot and Wilson, 2015; Rocha, 2018; Chen et al., 2022). But in this Synechococcus population, there is overwhelming evidence for the crucial role played by selection. The divergence troughs resulted from selective sweeps through the whole population that likely overcame mechanistic barriers to recombination, genetic incompatibilities, and ecological differences, evident in the apparent preferences to different temperatures of the three species. These results are consistent with previous analyses showing that recombination events between pathogenic Streptococcus species were primarily driven by selection (Lefébure and Stanhope, 2007; Shapiro et al., 2009). Moreover, the thousands of blocks composed of tightly linked SNPs we observe are difficult to explain through neutral drift, due to the long time it would take for them to reach observable frequencies. Instead, genetic hitchhiking during selective sweeps is by far the most likely explanation.

Selective sweeps of single genes through subpopulations have been argued to lead to the formation of separate ecotypes, recombination barriers, and eventual speciation (Croucher et al., 2011; Shapiro et al., 2012). But we find multiple cases of genes from one species partially sweeping through another, sometimes via several independent transfer events. Thus, gene sweeps do not necessarily lead to ecological specialization, but can do the opposite, acting to homogenize the population. Overall, the population shows abundant evidence of hybridization on a wide range of genomic scales leading to both homogenization between species, through the erosion of genomic clusters, and diversification within species.

Our results suggest several directions for future work. A significant fraction of SNP blocks in α and the majority of those in β were transferred from species not present in our samples. But further work is needed to determine the precise number of hybridizing species and whether some may still be present at high abundances in other springs. Metagenome samples from a wider range of springs would provide valuable information for addressing this question (see Appendix 3). Another important question is how do variants reach observable frequencies in the population. Our analysis revealed direct evidence for genetic sweeps at several hundred loci within divergence troughs and simple hybrid loci. But these loci are only a lower bound on the effect of the selection within the population. A more quantitative analysis of SNP statistics would provide important insights into evolutionary dynamics at these intermediate time scales, which could be informative for understanding bacterial evolution more broadly.

Whether the erosion of genomic clusters we observe would continue indefinitely remains an open question. The hybridization process appears to be very slow: we estimate that the genomic mixing we observe occurred over at least 104 years and likely started soon after α became abundant (see Appendix 7). It is therefore possible that the current population is an evolutionary transient that would eventually lead to a single hybrid population. This scenario would be contrary to the prevailing view that microbial evolution leads to increased specialization (Shapiro et al., 2012; Jain et al., 2018). Distinguishing between these two scenarios would require a detailed understanding of how the diversity changes during the hybridization process. For example, a better theoretical understanding of how SNP correlations within and between SNP blocks change after an initial transfer could generate testable predictions about single- and multi-site frequency spectra that could be checked in the data. Some statistical features will likely be more sensitive to selective pressures that act at the whole-genome level and could help maintain clusters indefinitely. Similar approaches could shed light on the evolutionary history of other highly recombining bacteria, such as H. pylori or SAR11 (Vergin et al., 2007; Kennemann et al., 2011). Thus, how extensive hybridization affects, and is affected by, ecology and how the interplay between selection at both gene and genome scales shapes the diversity of long-term coexisting populations are intriguing questions for future work.

The importance of hybridization in eukaryotic evolution is increasingly being recognized, but there have been few investigations in prokaryotes (Sankararaman et al., 2014; Moran et al., 2021). Metagenomic studies on communities in acid mine drainages revealed several hybrid bacterial and archaeal populations whose genomes were mosaics of hundreds of kbp segments from two different ancestors (Tyson et al., 2004; Lo et al., 2007). Those results are consistent with very recent hybridization, after the start of mining operations (Denef and Banfield, 2012). Our study shows the effects of this process over much longer time scales. At the level of individual genes, evidence for hybridization has been observed among commensal and pathogenic species of Campylobacter (Sheppard et al., 2008; Mourkas et al., 2022). But unlike the thermophilic Synechococcus, such host-associated bacteria are globally dispersed and have likely sampled many different environments, on a wide range of time scales, during their evolutionary history. It is thus difficult to identify which evolutionary forces drive their hybridization.

For the Yellowstone Synechococcus, a fortuitous combination of spatial separation and later mixing on the same time scales on which speciation and then hybridization can extensively shape—but not completely overwrite—their diversity has enabled us to infer a great deal about the evolutionary history and the effects of full and partial sweeps. This provides a window into processes that shape species-level bacterial diversity. But for most bacterial populations, the effects of these processes are obscured by the broad range—and little knowledge—of the time scales on which different strains have co-occurred and are able to exchange DNA. Investigating whether hybridization plays a major role in other communities will thus require developing a predictive theory that can distinguish between possible evolutionary scenarios.

Appendix 1

Materials and methods

Sample collection and sequencing

Request a detailed protocol

Cork borers (approximate diameter 0.5–1 mm) were used to collect mat core samples from Octopus Spring (4453401N, 11079781W) and Mushroom Spring (4453861N, 11079791W). Samples were collected in 2005, 2006, 2007, and 2009 under YNP Park Permits (YELL-5494, YELL-5660, YELL 5694). Cores were immediately frozen until later use. Frozen samples were shipped to the Joint Genome Institute (JGI) in 2019, where they were processed and sequenced.

At JGI, for disaggregation of sample, frozen mat cores were cut into pieces using a razor blade on glass over dry ice and aliquoted chunks of approximately 5 mm diameter. Next, 500 µl PBS and 50 µl of 500 mM NaEDTA were added. Samples were vortexed briefly, sonicated for 1 min on benchtop sonicator, and vortexed briefly again. Passed 25 times through a 25-gauge needle and run through a 35-µm strainer cap. Next, 100 µl were placed in 900 µl PBS and stained with SYBR Green at 1X. Samples were run on the Influx with Gate SYBR+ Cells and Gate 670/30 off of 632 autofluorescence. Next, samples were sorted on a BD influx following the protocol outlined in Rinke et al., 2014, and single-cell genomes were amplified with WGAX protocol as described in Stepanauskas et al., 2017. Certain samples from plate YuBhay.Octo06.RedA.3 received an additional treatment with lysozyme (20 min at room temp in 50 U/µl prior to alkaline lysis) because of difficulty lysing the cells. Otherwise, they followed the standard alkaline lysis described in Stepanauskas et al., 2017. In brief, libraries were made with Nextera XT v2 kits and sequenced on an Illumina Novaseq in 2 × 150 bp mode. Reads were assembled into contigs and annotated using the Integrated Microbial Genomes (IMG) platform as described in Rinke et al., 2013. All of the data generated for this study is available on the JGI website, under the project ID 503441.

Data processing

Request a detailed protocol
Appendix 1—figure 1
The genome coverage across single-amplified genomes (SAGs) is highly variable.

The SAG coverage, defined as the total length of contigs, is shown for all 331 SAGs (after quality control). SAGs were grouped by species (see Appendix 2) sorted in descending order of coverage within each species. The solid orange and blue horizontal lines show the genome sizes of OS-A and OS-B’. The dashed orange and blue lines show the threshold used to define high-coverage SAGs. The values of the thresholds are set at 75% of the length of the reference genomes.

In this section, we describe the procedure we used to process the single-amplified genomes (SAGs). Our analysis is based on grouping genes into orthologous gene clusters or orthogroups, which are then analyzed separately. This procedure makes it much easier to analyze the fragmented and incomplete genomes resulting from single-cell sequencing (Appendix 1—figure 1) compared to multiple-sequence alignment (MSA) of entire genomes, which are sometimes used for analyzing genomes obtained from isolates. An alternative approach is to align the assemblies or even the raw reads to reference genomes. In our case, there are only two reference genomes available, OS-A and OS-B’, so aligning to them would likely result in biases in the inferred diversity.

There are several ways of constructing orthogroups: (1) pairwise comparisons or (2) graph methods. Graph methods have the advantage that they produce sets of mutually homologous genes so we decided to use this approach. There are several packages available for identifying homologous gene clusters among samples of different genomes. Most widely used packages, such as OrthoDB (Kriventseva et al., 2007) and OrthoFinder (Emms and Kelly, 2015), were designed for finding homologs between different eukaryotic species and are inappropriate for studying large bacterial pangenomes (Ding et al., 2018). More recently, several groups have developed packages that allow for efficient construction of large bacterial pangenomes (Page et al., 2015; Ding et al., 2018). However, such packages are primarily aimed at studying large collections of isolate genomes and were not designed to handle samples with low genome coverage, which are common in single-cell sequencing projects such as ours. As a result, we implemented a custom pipeline for processing the single-cell genome assemblies and constructing the pangenome from these sequences. The pipeline is divided into two parts: (1) a pre-processing step where assemblies are filtered for contamination and checked for misassemblies and misannotations, and (2) a gene clustering step, in which clusters of homologous genes are identified and paralogs are separated from true orthologs. The second half of the pipeline closely follows the approach taken in Ding et al., 2018. A more detailed outline of the pipeline is given below. For details on the assignment of sequence clusters within orthogroups to different species, see Appendix 5.

Outline of data processing pipeline

  1. Data pre-processing

  2. Orthogroup clustering

  3. Orthogroup MSA

  4. Deep branch splitting

  5. Orthogroup species clustering

  6. Reference genome mapping

The rest of this section explains how each step in the pipeline was performed. Readers not interested in the details may wish to move on to the next section.

Data pre-processing
Request a detailed protocol

The goal of this step is to remove (1) SAGs from non-Synechococcus OS species, (2) SAGs that might have resulted from sequencing multiple cells in the same reaction well, and (3) contigs that might be the result of contamination from external DNA or other sequencing wells. We first removed all SAGs that were positive controls in which multiple cells were amplified in the same reaction well (all SAGs from the third column of the 384-well plate in our case). We then aligned all of the genes in our data to the OS-A and OS-B’ reference genomes using BLAST (alignment parameters: -evalue 1E-10 -word_size 8 -qcov_hsp_perc 75.0). We then removed all SAGs with >50% of genes that did not map to either reference genome. The remaining SAGs are assumed to be from Synechococcus OS cells. Next, we filtered out all contigs without any hits to OS-A and OS-B’. This was done to avoid external DNA or cross-contamination from affecting the rate of hybridization we observed.

Next, we removed any remaining SAGs that were suspected to have been the result of sequencing multiple cells in the same well. We reasoned that sequencing multiple cells would lead to an increase in the number of duplicated genes in the SAG. We used the probability of sampling a gene with more than one copy in the two reference genomes, both ∼10%, as a benchmark and removed all SAGs with a multicopy frequency >20%.

Preliminary analysis of the data following the above filtering procedure revealed the existence of a small number of palindromic contigs, in which the first half of the sequence was mirrored exactly in the second half, that were very likely the result of assembly artifacts. To remove these artifacts, we labeled the genes in each contig based on the hits to the reference genomes and used these to search for palindromic sequences within each contig. We then removed any contig that contained a palindromic subsequence that was either longer than four genes or contained >50% of genes in the contig. Finally, we removed all partial genes from the remaining contigs to simplify the analysis of the multiple-sequence alignments (MSA), which we performed next.

Orthogroup clustering
Request a detailed protocol

The goal of this step is to obtain clusters of orthologous genes or orthogroups, which were used in subsequent analysis. To do this, we followed (Page et al., 2015; Ding et al., 2018) and used a Markov clustering algorithm to find groups of sequences that were similar to each other and well-separated from all other sequences (Enright, 2002). The Markov clustering algorithm takes as input a similarity graph, where each node represents a gene sequence and the weight of each edge represents the similarity between the two sequences.

We constructed the gene similarity graph by translating all of the gene sequences in the filtered dataset and performing an all-to-all protein alignment using BLAST (alignment parameters: -evalue 1E-3 -word_size 3 -qcov_hsp_perc 75.0). Other approaches either perform a pre-alignment clustering (Page et al., 2015) or use faster but less sensitive methods for all-to-all alignments (Ding et al., 2018), but we found that this was not necessary in our case. The higher sensitivity provided by BLAST could help distinguish between paralogs within large gene families and minimize gene misassignments, but we did not extensively test how the performance would be affected by using other methods. We used the bitscore from the alignments to define the weight of each edge in the similarity graph. Using the bitscore instead of the e-value is a simple way to correct for the minimum e-value of 1E180 in BLAST and has been shown previously to improve the clustering performance (Gibbons et al., 2015).

We used MCL (v. 14-137) to cluster the resulting similarity graph and obtain orthogroups (Enright, 2002). We used a relatively small inflation parameter (-I 1.5) in order to cluster genes aggressively and used subsequent steps in the pipeline to split clusters resulting from incorrect cluster assignments or the presence of paralogs (steps 5 and 6). As a first step, we did an additional clustering based on sequence length as follows. We constructed a graph in which all sequences in an orthogroup were represented as nodes and edges were attached between pairs of sequences if the difference between their lengths was less than 30% of their average length. We then identified all connected components in this graph. Because homologous genes in OS-A and OS-B’ sometimes have different lengths, we hypothesized that each orthogroup should contain at most two main clusters. We therefore used the following heuristic algorithm to filter out orthogroups with unusual gene length patterns that could possibly result from overly aggressive clustering. If a single-length cluster was present or the largest cluster contained >80% of sequences, only the sequences from the largest cluster were kept and the rest were removed. If two clusters accounted for >80% of sequences, only the sequences in those two clusters were kept and the rest removed. In all other cases, we removed the orthogroup from our dataset and did not include it in subsequent analysis.

Orthogroup MSA
Request a detailed protocol

The filtered orthogroups from the previous step were aligned using a custom codon-aware alignment algorithm. Briefly, nucleotide sequences were translated into protein sequences and aligned using MAFFT (v. 7.475 with default parameters) (Katoh and Standley, 2013). The alignments were then reverse-translated using nucleotide sequences as templates. After alignment, phylogenetic trees for each orthogroup were constructed using FastTree (v. 2.1.11 with parameters -nj -noml) (Price et al., 2010).

Deep branch splitting
Request a detailed protocol

In this step, we separated gene clusters within the initial orthogroups that were joined by unusually long branches in the phylogenetic tree. Specifically, we implemented the following algorithm. For each orthogroup, we checked the branch length for each node in the phylogenetic tree and grouped any sequences with a branch length greater than a cutoff value bc=0.3 into separate orthogroups. Each newly created orthogroup was realigned as in step 3, and the procedure was repeated until no new orthogroups could be separated. We used the results from previous analysis of homologous genes in OS-A and OS-B’, which showed that typical divergences were ∼0.15 (Rosen et al., 2018), as a benchmark to set bc. Based on those results, we expected very few genuine homologs with divergences above bc=0.3. The clusters that resulted after this step represent the final orthogroups that were used in subsequent analysis.

Orthogroup species clustering
Request a detailed protocol

We next sought to determine which orthogroups could be reliably partitioned into distinct species clusters, which would be expected if the different species asexually diverged from a common ancestor. To do this, we implemented the following algorithm in a custom Python script. For each orthogroup, the alignments were trimmed to remove excess gaps at the edges using a custom script. We then calculated the fraction of sites with different nucleotide between each pair of sequences and hierarchically clustered the sequences using average linkage as implemented in the Python Scipy package (v. 1.9.3) (Virtanen et al., 2020). Preliminary clusters were assigned using a distance cutoff dc=0.075 (Appendix 1—figure 2). To ensure that the clusters were well-separated and robust, we implemented an additional check by requiring that the minimum distance between any two sequences from different clusters be at least c=1.5 times greater than the maximum distance between any two sequences within either cluster. Any clusters that did not satisfy this criterion were designated as mosaic orthogroups. We validated the robustness of the cluster assignments by dc and c and found only small changes in the results.

Appendix 1—figure 2
Distribution of pairwise nucleotide distances within orthogroups shows clear gap around πij0.075.

(A) Histogram of nucleotide distances between all pairs of cells across the 1448 core orthogroups. The dashed red line shows the distance cutoff used to define species clusters (dc=0.075). (B) Distribution of orthogroups according to the number of species clusters. Species clusters were generated as described in the text using the distance cutoff shown in (A).

Manual examination of the mosaic orthogroup alignments revealed a small fraction with a large number of gaps. Such alignments could be the result of rapidly evolving genes in the presence of hybridization, but may also result from misalignments or the inclusion of non-orthologous sequences during the orthogroup construction. To simplify subsequent analysis, we therefore removed mosaic orthogroups with an unusually high fraction of gaps.

Mapping to reference
Request a detailed protocol

For this step, we use reciprocal best hit (RBH) given by BLAST to map the consensus sequences to the OS-A and OS-B’ reference CDS. Since at this point in the pipeline, most sequences within one OG are less than 10% diverged from each other, we use nucleotide sequences for this step to increase the accuracy of the mapping. The consensus sequence for each OG is constructed by finding the consensus codon at each residue along the CDS after replacing all codons containing gaps by a gap codon. The resulting sequences that have RBH to OS-B’ are mapped onto those respective genes. The genes without RBH to OS-B’ are searched against OS-A and mapped to it if an RBH is found.

Appendix 2

SAG species assignment

In this section, we outline the method we used to assign genomes to each species cluster. We also verify the robustness of the species assignment by using three different methods and show that they give identical results.

Comparisons of genomes from bacterial populations often reveal clusters at different length and divergence scales (Jain et al., 2018). But the interpretation of such clusters is controversial (Cohan, 2002; Fraser et al., 2009). There are three main possible interpretations for such patterns. One interpretation is that the clusters represent the bacterial analog of eukaryotic species, which are sexually and ecologically isolated from each other. In practice, a whole-genome divergence cutoff of 5% is frequently used to define bacterial species (Jain et al., 2018). An alternative interpretation is that clusters are the result of clonal expansions of particular strains from a much more diverse and possibly more genetically homogeneous population. The third interpretation is that the clusters are the result of uneven sampling from a larger and more diverse population. The unique combination of having a large collection of genomes, with minimal compositional bias, from a geographically isolated population allows us to test these alternatives. We will mainly aim to distinguish between the first two interpretations and the third one by checking whether there are robust genomic clusters in our sample. More discussion on the evidence concerning the first two hypotheses can be found in Appendices 3 and 5.

We used three different methods to check whether the genomes form robust clusters: the phylogenetic tree of the 16S rRNA sequences, average genome-wide divergence from the two reference genomes OS-A and OS-B’, and patterns of divergences between triplets formed by the SAG and the two reference sequences at different loci. The phylogenetic tree of the 16S is the simplest and still the most commonly used method to characterize the diversity of natural microbial communities (Yarza et al., 2014). Because of its practical importance, we included it for comparison with the other two methods. However, in a highly recombining population like the Yellowstone Synechococcus, it is not clear to what extent the 16S divergence accurately reflects the whole-genome diversity. To address this question, we used the divergences across the entire genome as a comparison. The main limitation of using summary statistics such as the average divergence across the entire genome is that it obscures heterogeneities along the genome that could have important ecological and evolutionary implications. There are different ways to address this concern. Here, we used an extension of a method first used in Rosen et al., 2018, which provides a coarse description of the variation in the diversity patterns across different loci. We present the results from each of these analyses next.

We first constructed the phylogenetic tree of 16S sequences. Predicted 16S rRNA sequences (117 from 331 filtered SAGs) were extracted and aligned using MAFFT (v7.453) (Katoh and Standley, 2013). Two sequences (SAGs MA02M11_3 and MuA02C9) were incomplete and were removed from the analysis. The remaining (115) sequences were hierarchically clustered using the average linkage (UPGMA) method implemented in Scipy. Results are shown in Appendix 2—figure 3, with the colored background showing the species assignment using whole-genome divergences (see below) for comparison. We found perfect agreement between the whole-genome assignment and the largest three clusters in the 16S phylogenetic tree.

The average divergence within the α cluster is approximately half that of β (1103 vs 2103). However, a significant fraction of the average divergence within β was due to two outliers (MuA1L16 and MusA1J8). After removing the outliers, the average divergence within β decreased to 1.7103. Closer investigation of the alignments revealed that both outlier β sequences contained a short segment of three SNPs that were identical to all α sequences and different from all other β sequences, at positions 1067–1072. In addition, MuA1L16 contained a larger segment of almost 200 bp (positions 390–570) containing 10 SNPs and an insertion that were different from all other β and were identical to the consensus α. Apart from these segments, the outlier sequences were typical of the β species. This pattern is similar to the SNP blocks discussed in the main text and in Appendix 6 and therefore is likely the result of hybridization between α and β.

We compared the 16S rRNA clusters to whole-genome divergences from the two reference genomes, OS-A and OS-B’. We aligned all of the predicted genes from each SAG to all of the coding DNA sequences (CDS) from each reference genome using BLAST (v2.13.0+) (Altschul et al., 1990). For each SAG, we calculated the divergence to each reference genome (d=1pident/100) and calculated the average over the best hits across all genes. The results for all SAGs with more than 100 hits to both references are shown in Figure 1A. Three clear clusters were observed, which we labeled as α, β, and γ. The species of each SAG was assigned as the label corresponding to the cluster it belonged to in Figure 5A.

The previous two metrics do not reflect possible heterogeneity in the patterns of diversity along the genome. To characterize this variation on a gene-by-gene basis, we extended a method that was previously used to analyze amplicon data from the Mushroom Spring in Rosen et al., 2018. We refer to this method as gene triplet analysis. This analysis exploits the separation of scales between and within species clusters. At most loci and in most cells, we observed previously in order to coarse grain the high-dimensional matrix of pairwise divergences into a small number of discrete patterns. Specifically, we take triplets of sequences from the SAGs, OS-A, OS-B’, at loci where all three are homologous and calculate the divergences between all three pairs of sequences. These divergences typically form a triangle as shown in Appendix 2—figure 1. Note that it is possible to have triplets of divergences that violate the triangle inequality, which cannot be represented as triangles in a two-dimensional space. In our data, such cases were rare, but this may not be generally true, particularly for sequences that are not clearly separated and which are the result of extensive recombination. Note, however, that the classification into distinct patterns given below can still be done even in these cases. To reduce the dimensionality of the data further, we define a main cloud diameter dM=0.1 for the diversity within species. We then classify the shape of the triangle into the 5 distinct patterns shown in Appendix 2—figure 1, depending on whether each side of the triangle is longer or shorter than dM. For example, denote the divergence triplet by (dXA,dXB,dAB), representing the divergences between the SAG sequence and OS-A, the SAG sequence and OS-B’, and OS-A and OS-B’, respectively. Then the first pattern in Appendix 2—figure 1A (labeled ‘XA–B’) contains loci where dXA<dM, dXBdM, and dAB<dM, the second pattern (labeled ‘XAB’) contains loci where dXA<dM, dXB<dM, and dAB<dM, and so on. The cutoff of dM=0.1 is the same as the one used in Rosen et al., 2018 and provides a good separation between species at a gene by gene basis. The results of our analysis, however, do not depend on the specific value of dM as discussed at the end.

For each SAG, we assigned all genes that were orthologous to OS-A and OS-B’ to one of the five patterns described above. The three rows in Appendix 2—figure 1 show typical examples of an α (Appendix 2—figure 1A), a β (Appendix 2—figure 1B), and the γ (Appendix 2—figure 1C) SAG. The majority of genes across each row in Appendix 2—figure 1 are found in the XA–B, XB–A, and X–A–B patterns, respectively. This is consistent with the species clusters we saw in the genome-wide average divergences from before. However, we also find a significant minority of genes forming other patterns, most notably XAB, representing genes where all three sequences are within the main cloud (d<dM). Some of these genes were contained in the divergence troughs discussed in the main text. In the case of the γ SAG, the fact that X–A–B is the dominant pattern rules out the possibility that γ is a hybrid resulting from the mixture of α and β. Such hybrids have been observed after recent clonal expansions in acid mine drainage microbial communities (Tyson et al., 2004; Lo et al., 2007), but do not appear to be present at substantial frequency in the Yellowstone community. Instead, it appears that γ is a distinct species that diverged from the other two around the same time α and β diverged from each other.

Using the pattern described above allows us to compare different SAGs to each other and allows us to test the robustness of the species assignment based on the average genome-wide divergence. Specifically, we assigned to each SAG a gene triple fingerprint f=(fXAB,fXAB,fXAB,fXAB,fXBA), representing the fraction of genes with each pattern (note that ifi=1). We then compared the fingerprints of each SAG with the species assignment based on the average divergence from OS-A and OS-B’ from Figure 1A. We found that the fingerprints formed three clear patterns that exactly matched the assignments based on the average divergence (Appendix 2—figure 2). We tested this assignment for different values of dM ranging from 1% to 20% and found that the fingerprint patterns were robust across the entire range (data not shown). These results show that the classification into distinct species is robust and consistent across the three methods described.

Appendix 2—figure 1
Species clusters at the gene level approximately match those at the genome level.

Figures show gene-by-gene comparison of three cells to the OS-A and OS-B’ reference genomes. The cells shown are representative of the α (A; OctA1J9) and β (B; MA02H14_2) clusters and the one γ cell (C; OcA3L13). Each core gene with homologs in both OS-A and OS-B’ was compared to the references and the divergences plotted as a triangle with each side length equal to the divergence between a given pair. The cell sequence is represented by a black dot, with the OS-A and OS-B’ sequences represented by the orange upward and blue downward triangles, respectively. The black circle is a 10% radius which was chosen as a cutoff to classify each gene. The different columns show genes in one of the five possible patterns labeled as follows: XA–B (the cell sequence is less than 10% diverged from OS-A and more than 10% diverged from OS-B’), XAB (all three sequences are within 10% of each other), X–A–B (all three sequences are more than 10% diverged from each other), X–AB (the cell is more than 10% diverged from both OS-A and OS-B’, but both references are less than 10% diverged from each other), and XB–A (the cell is less than 10% diverged from OS-B’ and more than 10% diverged from OS-A).

Appendix 2—figure 2
The proportion of gene triplet patterns shows three distinct clusters.

The proportion of genes in each of the five patterns shown in Appendix 2—figure 1 is shown for each cell passing our quality control criteria (see Appendix A). Lines represent different cells and are colored according to the species classification based on whole-genome divergences.

Appendix 2—figure 3
Phylogenetic tree inferred from 16S rRNA agrees with whole-genome species classification.

The pairwise divergences between all 16S sequences passing quality control were calculated from the multiple sequence alignment performed using MAFFT (Katoh and Standley, 2013). The tree was generated using a hierarchical agglomerative clustering algorithm with average linkage (UPGMA) as implemented in SciPy (Virtanen et al., 2020).

Appendix 3

Metagenome analysis

Phylogeny of γ

Here, we present the details of the phylogenetic analysis of the γ SAG, OA3L13, which shows that it represents a new uncharacterized species within the Yellowstone cyanobacteria population. To establish its phylogeny, we used a concatenated alignment of marker genes, following Wu and Eisen, 2008. Of the 31 marker genes used in Wu and Eisen, 2008, 16 were covered in the γ SAG (dnaG, frr, pgk, rplA, rplK, rplL, rplM, rplS, rplT, rpsB, rpsI, rpsJ, rpsK, rpsM, smpB, and tsf) and were used to determine its phylogeny. We used a large collection of cyanobacterial genomes from Chen et al., 2021 to check for closely related species to γ. The concatenated alignment was constructed as follows. For each of the 16 marker genes, we extracted the alignment of all species labeled ‘Thermal springs’ that were within one of the subsections labeled I through V in Chen et al., 2021 and added the γ sequence to the alignment. All ‘?’ characters were replaced with ‘X’. The sequences were then realigned using MAFFT (v7.453) (Katoh and Standley, 2013) using default parameters and the alignments were concatenated and trimmed using trimal (v1.4.rev22) (Capella-Gutiérrez et al., 2009) with default parameters. Finally, the resulting sequences were realigned using MAFFT and the phylogenetic trees were inferred using FastTree (v2.1.11) (Price et al., 2010) with default parameters. The resulting tree is shown in Appendix 3—figure 1.

To verify that the lack of other closely related genomes to γ was not due to the incompleteness of our reference database, we searched for homologs to the 16 γ marker genes in the NCBI database. We used the Web version of blast on NCBI with the megablast settings and word size 16. For all 16 marker genes, the top two hits were from OS-A and OS-B’, while all other high-scoring hits had less than 85% nucleotide identity. We are also matching sequences to the 16S rRNA using the SILVA database (v138.1) (Quast et al., 2013). Using the non-redundant version of the database (SSU Ref NR99) only gave hits to OS-A and OS-B’ sequences. Using the full database we did find several sequences that were less than 10 SNPs away from the γ sequence across the entire 16S rRNA (>99.7% identity), including one from a previous study of Yellowstone microbiomes (Meyer-Dombard et al., 2011). The corresponding divergence of 3103 is comparable to the typical diversity within species in our SAGs of 12103 and much less than the typical divergence between species, which is approximately 24102 (Appendix 2).

Appendix 3—table 1
Best hits to OA3L13 16S rRNA in SILVA SSU Ref database.
SILVA IDMatches/hit lengthIdentityNotes
AFSR010000501424/14540.978OS and MS metagenome, Klatt et al., 2011
AY8840521424/14540.978OS-A isolate, Allewalt et al., 2006
HM4483611382/13860.997Meyer-Dombard et al., 2011
HM4483601381/13860.997Meyer-Dombard et al., 2011
KU3821411381/13890.994
CP0002391295/13240.978Synechococcus OS-A
CP0002401261/13240.952Synechococcus OS-B’
Appendix 3—figure 1
γ SAG represents a new distinct species within Yellowstone Synechococcus.

Shows the maximum likelihood tree obtained from an alignment of 16 marker genes present in the γ SAG together with all other thermophilic cyanobacteria from Chen et al., 2021. Colored labels show the two Yellowstone Synechococcus, OS-A (orange) and OS-B’ (blue), and the γ SAG OA3L13 (green).

Inferred abundances of γ from metagenomic samples are consistent with a single γ SAG in our data (see Appendix 3). To further verify that no other γ SAGs were accidentally excluded during quality control (see Appendix A), we performed a BLAST search (alignment parameters: -evalue 1E-10 -word_size 9 -qcov_hsp_perc 75.0) of the γ genes against the contigs from the SAGs that were filtered out. The results were consistent with the majority of the excluded SAGs containing mostly α and β genes, with a smaller number of SAGs with divergences around twice as large (Appendix 3—figure 2). This confirms that no other γ SAGs were present in our data.

Appendix 3—figure 2
Single-amplified genomes (SAGs) excluded during quality control do not include additional γ genomes.

The distribution of divergences from the best BLAST hit to the γ SAG (OcA3L13) across genes for each SAG excluded during the quality control process (see Appendix A) is shown. Each line represents the probability density of divergences across genes for one of the 44 SAGs.

Estimating abundances of α, β, and γ

To estimate abundances of the three main species α, β, and γ in the Octopus and Mushroom Springs, we recruited metagenomic reads from 34 samples taken from both springs to SAGs from each species. We chose the one γ SAG (OA3L13, ≈70% coverage) along with representative SAGs from α (MA02D15, ≈77% coverage) and β (OcA2H14, ≈96% coverage) as reference genomes. Hybridization on length scales longer than the typical read size (150 bp in our case) can lead to errors in our estimates of species abundances. To minimize these errors, we only used core genes where three distinct species clusters could be identified and there was no evidence of hybridization in the single-cell data (see Appendices 1 and 5). This resulted in a set of ≈400 genes for each reference. We aligned all the metagenomic reads to these target gene sequences using BowTie2 (v2.3.4, default parameters) (Langmead and Salzberg, 2012) and extracted reads aligning to each of the three reference sequences using Samtools (v1.7.0) (Li et al., 2009) and bedtools bamtofastq command (v2.26.0) (Quinlan and Hall, 2010).

We assigned the recruited reads to each species cluster if their divergence from their respective reference was less than the ‘main cloud’ radius dM. We determined dM self-consistently from the data as follows. For each reference, counted the number of reads that had divergences less than dM from that reference sequence and more than dM from the other two reference sequences, for dM from 1% to 15%. If the population consists of species clusters with typical diversity π, the number of reads within the cloud will increase until dMπ. As dM approaches the typical divergence between species d15%, more reads will start to map to multiple reference sequences, so the number of reads uniquely assigned to each species cluster will decrease. Therefore, the number of reads uniquely assigned to each species cluster should have a maximum as a function of dM. We indeed found this to be the case for all three species (data not shown). Empirically, we found that dM=5% was close to the maximum for all three species, so we used this value throughout our analysis.

We found that the average depth of recruited reads across genes varied within the same sample. However, the variations across samples were highly correlated, which confirmed that most reads aligning to our chosen genes came from genomes that were part of the same species cluster rather than from genes transferred across species (Appendix 3—figure 3). We found that the genes that recruited the largest numbers of reads had read depths that were much more similar within the same sample compared to other genes. Interestingly, we found much more variation in depth between genes within the γ cluster compared to α and β (Appendix 3—figure 3). One likely explanation for this is that γ contains much more diversity compared to either α or β.

Appendix 3—figure 3
The number of reads recruited to different loci from the three main species is highly correlated across samples.

The three panels show the average depth of reads recruited to representative single-amplified genomes (SAGs) from α (top), β (middle), and γ (bottom), respectively. Each panel shows 100 genes with the highest average read depth across samples, with each line representing one gene. Only reads that were <5% diverged from the reference genomes were included.

We estimated the abundance of each species cluster by averaging the depth of the 100 genes with the highest average depth across all samples for each cluster. We found that γ was more than an order of magnitude more abundant than our detection threshold given by the metagenomic read depth in 32 out of the 34 metagenome samples (Appendix 3—figure 4). Moreover, its abundance in the two samples for which we had single-cell and metagenome data (MSe4 and OSM3) was close to the detection threshold of the single-cell samples, which was around 2%. This explains why we only obtained a single γ cell from the single-cell samples after our quality controls. Finally, we found that the abundance of γ was highly correlated with α across all samples and was only present at abundances higher than O(1)% in the Mushroom Spring sample taken from 65°C. This suggests that γ is, together with α, primarily adapted to growth at higher temperatures.

Appendix 3—figure 4
Abundance of γ is significantly above detection threshold in 32 of 34 metagenomic samples and is highly correlated with α.

The relative abundances of α (orange circles), β (blue squares), and γ (green diamonds) across all 34 metageomic samples. The detection threshold for the single cell (ft=1/480.021, based on the maximum number of single-amplified genomes [SAGs] per sample) and metagenomic samples is shown in black and gray dashed lines, respectively. The asterisks on the x-axis label the two samples (MSe4 and OSM3) with both single-cell and metagenomic data.

Other Synechococcus species

Are other Synechococcus species present in Mushroom or Octopus Spring apart from α, β, and γ? To answer this question, we again used the 34 samples from which we have high-depth metagenomes. We used BLAST with highly sensitive parameter choices to minimize the number of true Synechococcus reads not captured. For computational efficiency, we focused on the 100 genes with highest coverage depth in the metagenomes across all three references.

To determine how much of the population is in the α–βγ main cloud, we used the same three SAGs as in the previous subsection as references and recruited reads using BLAST (-task blastn -evalue 1e-3 -word_size 6 -num_threads 32). Using these parameters, we could recruit reads belonging to the dominant Roseiflexus species in our samples, which is in a different phylum from Synechococcus. We were therefore confident that any reads from other potential Synechococcus species would be included in our analysis. To limit the analysis to just Synechococcus species, we removed all reads that were more than 20% diverged from all of our references from subsequent analysis.

Because of the large amount of data in this analysis, we focus on just the Mushroom Spring samples taken from different temperatures (MS temperature series). We recruited a total of 4,522,833 reads with an average depth per gene of 4980. We calculated the minimum divergence from the three reference sequences for each read and found 94% within 5% divergence of one of them (Appendix 3—figure 5A). For comparison, we determined the typical divergences between species by dividing the reference sequences into 150 bp segments (the same size as our reads) and calculating the divergences between all three pairs of species (Appendix 3—figure 5B). The middle 95-percentile of the combined distributions extended down to a divergence of around 5% (eight SNPs), consistent with our cutoff for the intraspecies divergences at these loci. These results are consistent with all of the Synechococcus reads being from one of the three species and set an upper bound on the relative abundance of any other species at below 4%.

To check how sensitive our detection method is to the presence of other species, we used the Mushroom Spring sample from 65 in which γ had a relatively high abundance. To mimic the presence of another species, we compared the previous distributions to the distribution of the minimum divergence from α and β only (dashed red line in Appendix 3—figure 5A). We found a 19% decrease in cumulative number of reads within 5% from the references in this case, which was very close to the 21% abundance of γ we estimated in the previous subsection based on the average depth across loci. This shows that our method is sensitive to the presence of other species and illustrates the benefits of combining single-cell and metagenome analysis for identifying unknown species.

Appendix 3—figure 5
A vast majority of Synechococcus core gene sequences are within the α–β–γ main clouds.

(A) Cumulative distribution of the minimum divergence of each read from the reference sequences across all loci used in the left panel. Different colors represent samples taken from different temperatures from Mushroom Spring. Solid lines show minimum divergence from all three references, and dashed lines from α and β only. Shaded olive region shows the typical divergences between species shown in the right panel. (B) Histogram of pairwise divergences between three representative single-amplified genomes (SAGs) for α, β, and γ at the 100 core loci chosen for illustration. Distributions are across all 150 bp segments from each locus. Shaded olive region is the middle 95% percentile for the combined distribution.

To further verify that no other Synechococcus species were present in our samples, we closely examined the diversity of 16S sequences. We used only single reads in our analysis instead of trying to assemble full sequences to prevent misinference caused by hybridization. Because the diversity along the 16S sequence is highly heterogeneous, variation between read recruitment locations can dominate the variation between species. We controlled for this by limiting the analysis to reads that aligned within a given region of the 16S. For 150 bp reads, we found that using segment sizes of 200 bp provided the best compromise between increasing depth and minimizing variance due to alignment location. To find the most informative region, we calculated the nucleotide diversity at each site Hi=2fi(1f), where fi is the minor allele frequency at site i, and averaged it over all 200 bp segments along the 16S (Appendix 3—figure 6A). We found the diversity varied by more than a factor of four, consistent with the presence of conserved and variable regions. The diversity was nearly identical when using all three species as reference genomes, suggesting our recruitment captured all sequences within the main cloud. The highest average diversity was close to 400–600 bp segments, which we chose for further analysis.

We calculated the distribution of divergences from the γ allele for all reads recruited from this segment for MS temperature series. We found temperatures of 60°C and below had two distinct peaks close to Appendix 3—figure 6B 4% and 8% divergences. For comparison, we calculated the range of divergences from α and β for all 150 bp subsegments within the 400–600 bp segment and found these matched with the two peaks we observed. At 65°C, we also found only two peaks, one close to zero and another close to 4%, consistent with our previous estimates showing that this sample mainly consisted of α and γ cells.

Appendix 3—figure 6
Diversity of Synechococcus 16S rRNA sequences from Mushroom Spring temperature series samples is entirely contained within the α–β–γ main clouds.

(A) Average diversity H (see text) of metagenome reads from all 34 samples mapping to Synechococcus along the length of the 16S sequence. Averaging was done over all sites within a 200-bp window starting at the location given by the x-axis. Results for reads recruited using each of the three main species as a reference sequence are shown in different colors. A black dashed line indicates the 400–600 bp segment with the highest diversity which was chosen for subsequent analysis. (B) Distribution of nucleotide divergences from the γ allele within the 400–600 bp segment of the 16S for four different samples taken from Mushroom Spring. Shaded regions show the range of divergences of OS-A (orange) and OS-B’ (blue) from γ across all 51 windows of 150 bp length, corresponding to the range of expected divergences of reads from α and β.

Do peaks correspond to single-species clusters or are they amalgamations of several distinct species with similar divergences from each other? To answer this question, we constructed a multiple-sequence alignment of all of the 16S reads recruited across the 34 metagenome samples and performed a hierarchical clustering on the pairwise divergence matrix using average linkage. Pairwise divergences were calculated by padding the reads with gaps to ensure all sequences were 200 bp in length and calculating the nucleotide divergence ignoring any gaps. For the MS temperature series, we found 97% of reads were in one of the three clusters associated with α, β, or γ (Appendix 3—figure 7A–D). Consistent with our previous results, samples below 65°C had two main clusters close to OS-A and OS-B’ (Appendix 3—figure 7A–C), while at 65°C they were close to γ and OS-A (Appendix 3—figure 7D). Aggregating across all 34 metagenome samples, we found that α and β were the two dominant clusters, with γ being the next largest (Appendix 3—figure 7E, F). We also observed a large difference in the overall abundances of α and β across the two springs, but the very large heterogeneity across samples we found previously (Appendix 3—figure 4) suggests this is likely due to the particular samples used in this study and may not necessarily be an accurate reflection of the overall abundances in the two springs.

Appendix 3—figure 7
All Synechococcus 16S rRNA sequences from the metagenome samples belong to one of the three known species clusters.

(A–D) Each subpanel shows the matrix of pairwise divergences between all reads recruited to the 400−600 bp segment of the γ 16S rRNA sequence for the four samples from the MS temperature series. Dashed lines highlight the OS-A, OS-B’, and γ reference sequences in this segment for comparison. (E, F) Same as above using a subsample of reads recruited from all 34 metagenome samples divided by spring.

Appendix 4

Linkage analysis

In this section, we summarize our analysis of linkage disequilibrium within α and β. The structure of this section is as follows. We begin by reviewing definitions of two commonly used ways to quantify linkage disequilibrium. We then compare the rate of decay of linkage disequilibrium within and between the two species. We next compare the decrease in linkage in the data to theoretical predictions from the neutral model and show that while the decay is similar to the neutral model prediction, the overall shape of the linkage decay curves cannot be fully explained by the classical neutral model. We then show that within α, there is substantial heterogeneity in the rate of linkage decay in low-diversity regions compared to high-diversity ones. Surprisingly, we find that linkage decreases more rapidly in low-divergence regions, suggesting the absence of a clonal frame for α even on short time scales. The final subsection presents a technical validation of our method for processing alignments showing that our results are robust to filtering for outlier gene alleles.

Statistical measures of linkage disequilibrium

We used two distinct but related methods for quantifying linkage disequilibrium, σd2 and r2. Both metrics are related to the correlation coefficient between alleles at two loci but differ in their normalization. For a single pair of loci A and B with alleles A/a and B/b, we can define the correlation coefficient between their respective frequencies as

(2) rAB2=(fABfAfB)2fA(1fA)fB(1fB).

Using this expression, we can define a statistical measure of linkage as a function of distance along the genome by averaging over pairs of loci with a given separation x=|xAxB|, where xA and xB are the positions of the two loci along the genome. The two most common ways to perform the average are

(3) σd2(x)=(fABfAfB)2|xAxB|=xfA(1fA)fB(1fB)|xAxB|=x,

and

(4) r2(x)=(fABfAfB)2fA(1fA)fB(1fB)|xAxB|=x.

For a neutral panmitic and well-mixed population, both metrics decrease as σd2(x)r2(x)1/x. For σd2, there also exists an analytic expression (Ohta and Kimura, 1969):

(5) σd2=10+2NR22+26NR+4(NR)2,

where N is the population size, and R the recombination rate per base pair per generation. When recombination is rare and occurs via crossover R(x)R0x, and the expression is consistent with the scaling σd2(x)1/x mentioned previously. While numerical and theoretical studies have shown that r2 is more sensitive to low-frequency alleles compared to σd2 (Hudson, 1985; Song and Song, 2007), our analysis did not reveal significant differences between two metrics. We therefore focus here on σd2 which is more commonly used and easier to compare to theoretical predictions.

We calculated σd2(x) and r2(x) for α and β separately and together. Both procedures are similar, so we focus on the analysis for all α and β sequences. Unless otherwise stated, all analyses were performed on alignments of individual genes and the results were averaged across genes. Using this method, we can easily obtain linkage information over distances of up to around 1 kbp. Over longer distances, the results can be dominated by specific recombination events in a small number of unusually long genes and are therefore less useful for quantifying genome-wide statistical patterns. We thus perform most of our analysis up to distances of around 1 kbp, except when calculating typical genome-wide linkage as discussed below.

Alignments for each orthogroup were first trimmed to remove segments with excess numbers of gaps. We then determined a consensus sequence by taking the most common codon sequence at each location along the reading frame, excluding gaps. Each nucleotide was labeled 0 if it was a gap, 1 if it was the same as the consensus, and 2 if it was different from the consensus. While this does not distinguish between biallelic and multi-allelic sites, the latter are much less common than the former and should not affect our conclusions. Finally, we calculated the two linkage coefficients using Equations 3 and 4. To estimate linkage across distances longer than a single gene, we performed the average across SNPs spanning a pair of genes and then averaged the values across a random sample of 1000 pairs of genes.

The linkage calculations within each species were performed in an analogous way, with the exception of a two-step filtering process of the alignments at the beginning. We describe this procedure in detail at the end of this section.

Linkage decrease within vs between species

Appendix 4—figure 1
Genetic linkage analysis shows extensive recombination within α and β.

Linkage disequilibrium σd2 as a function of separation between SNPs x averaged across all core genes among the α (orange circles), β (blue squares), or all of the cells (black diamonds). Genome-wide estimates from SNPs in different genes are shown to the right of the black dashed line using the same symbols. Fully unlinked controls for the genome-wide linkage are shown as horizontal lines of different colors. The main cloud cutoff for both α and β was set to c=0.05.

Our analysis showed that within both α and β, linkage quickly decreased by approximately an order of magnitude over a distance of ∼1 kbp (orange circles and blue squares in Appendix 4—figure 1). However, when considering the population as a whole, we find linkage disequilibrium is maintained across the whole length of the genome (black diamonds in Appendix 4—figure 1). Within β, the linkage at distances of 1 kbp was identical to the genomewide linkage between random pairs of genes. Within α, the decrease was slower, but followed a similar pattern. The value of the genome-wide linkage was slightly higher than at 1 kbp, likely because of the reduced sample size at long genomic distances caused by poor genome coverage. Consistent with this, we found that the genomewide-linkage values in both species were close to completely unlinked controls obtained by randomly permuting each column of the alignments (σd2=0.14 vs σd2=0.12 in α and σd2=0.06 vs σd2=0.04 in β). Overall, these results suggest that the dynamics leading to the reassortment of alleles are similar in the two species and only differ in the length scale over which alleles become unlinked. We discuss this in more detail in the next section.

The fact that both species have similar levels of genomewide linkage despite the sample size of α being less than half of that of β shows that our estimates for the genomewide linkage are not significantly affected by the sample size effect. We further validated this by taking random samples from the β main cloud equal to the size of the α main cloud at each locus and recalculating the linkage curves. We found the results for the full β main cloud and the subsampled main cloud were almost identical (Appendix 4—figure 2). This shows that the sample size does not have a large effect on our results and cannot explain the differences in linkage decay between the two species.

Appendix 4—figure 2
Sample size has a small effect on linkage estimates.

Linkage disequilibria curves for the whole α and β populations shown in Appendix 4—figure 1 are compared to a random subsample of β sequences chosen to match the α sample size at each locus (shown in gray).

The rapid decrease in linkage sets important limits on how much information about the diversity in the population is contained in samples of marker genes. To illustrate this, we performed a linear regression between the whole-genome divergences and the 16S rRNA divergence. Consistent with our previous analysis, we found slightly stronger linkage in α compared to β, but in both cases, the correlations were weak and the 16S divergences explained less than 10% of the genome-wide divergences (Appendix 4—figure 1B). These results show that frequent recombination in the population makes marker genes, such as the 16S rRNA, poor predictors of the genome-wide diversity.

Comparison to model of neutral drift with recombination

To get a more quantitative understanding of the effects of recombination on the genetic diversity in our population, we compared our previous results to theoretical predictions from population genetics.

At short distances, the prediction from Equation 5 is that σd2=5/11. However, in our data, the value of σd2 for nearby sites was very close to one. While the Yellowstone Synechococcus have a strong mutational bias toward GC nucleotides (Rosen et al., 2018), Equation 5 is independent of the relative rates of mutations. As a result, we conclude that the neutral model cannot explain the linkage curves we observe.

Appendix 4—figure 3
Linkage decay curves for α and β collapse onto the same curve after rescaling distances by average recombination rate ρ.

Shows the linkage decay as a function of distances between sites rescaled by the recombination rates ρ fitted from Figure 2 (main text). The solid black line shows the theoretical prediction from the neutral model given by Equation 5.

While the overall level of σd2 in our data is inconsistent with the neutral prediction, we can still ask whether the rate of decay with distance has a similar functional form. To do this, we followed Garud et al., 2019 and fit the linkage curves to the following functional form:

(6) σd2(x)=A10+ρx22+13ρx+(ρx)2,

where ρ=2NR is the population recombination rate. We performed the fit by choosing A such that the above expression matched the value of σd2 in the data for neighboring sites (x=1). We then fitted the curves by eye by adjusting the value of ρ.

We found that this functional form could fit the data very well for both species up to distances of x300 bp (Figure 2 in main text). For longer distances, the linkage decrease was slightly slower than predicted by Equation 6. One possible interpretation of this slowdown is that longer distances have a significant overlap with the distribution of lengths of recombined segments Lr, beyond which we expect linkage to reach the genomewide level (Garud et al., 2019). This suggests that recombined segments as short as 300 bp might play an important role in the recombination process within the population.

We further validated that the linkage curves have the same functional form by rescaling distances between sites by the recombination rate ρ inferred previously. We found a very good collapse of the linkage curves for the two species when plotted against the rescaled distances (Appendix 4—figure 3B). This strongly implies that the dynamics leading to the decay of linkage are the same in both species, with the differences between them being length scale over which sites are effectively linked.

Linkage on intermediate length scales

Our analysis so far was limited to either distances up to the length of typical genes (∼1 kbp) or across the entire genome. Extending our analysis over distances in between these two extremes is difficult due to several technical challenges. The first is the fact that typical coverages for our SAGs are around 30%. Together with a typical contig size of ∼10 kbp, this means that our sample size quickly decreases with distance. The second challenge is that at longer distances, gene order might vary across genomes, making it difficult to define a genomic distance at the population level.

Linkage between multi-SNP alleles

We calculated the linkage disequilibrium between multi-SNP features, such as SNP blocks or hybrid alleles, using Equation 2 as in the case of biallelic SNPs. In the case of SNP blocks shown in Figure 7D in the main text, we calculate the frequencies for the four haplotypes within SAGs that are covered at both loci. The unlinked controls (represented by solid lines in Figure 7D) were obtained by a random permutation of the alleles at each locus independently, while keeping the SAGs that are covered at each locus fixed. This procedure preserves any correlations in genome coverage between nearby loci and provides the most accurate estimate of rAB2 for a completely unlinked genome.

The linkage between hybrid gene alleles was calculated similarly as for the SNP blocks, except for the genotype assignment. At each locus, we labeled the most abundant allele using upper case letters (A,B,) and grouped all of the remaining ones into a single allele labeled by a lower case letter (a,b,). The linkage matrix rAB2 and the unlinked control were calculated the same as previously described. The results for α were rAB2=0.04±0.15 compared to rAB2=0.02±0.11 for the unlinked control. For β, the results were rAB2=0.01±0.09 compared to rAB2=0.009±0.08 for the unlinked control. Note that the average linkage disequilibrium was close to the unlinked controls in both species, with α having slightly higher linkage. These agree with those from the analysis of SNP blocks shown in the main text.

Determining the main cloud cutoff for α and β

Appendix 4—figure 4
The effect of the main cloud diameter c on the rate of linkage decay.

The linkage decay within α (A) and β (B) is plotted for different main cloud cutoffs c (see main text). Note the convergence of the curves for c0.05.

For each species, we first selected all of the sequences that belonged to SAGs assigned to that species (see Appendix 2). We then calculated the consensus sequence for this group of sequences, as described above, and removed all sequences with divergences greater than a main cloud cutoff c. This is a simple way to remove sequences that are the result of gene-level hybridization (see Appendix 5 and Rosen et al., 2015 for a similar procedure). We chose the main cloud cutoff for each species by calculating the linkage curves for different values and choosing the largest value for which the curves did not noticeably change (Appendix 4—figure 4). Based on this criterion, we use c=0.05 for α and c=0.1 for β throughout.

Appendix 5

Gene-level hybridization

This section describes the method we used to quantify hybridization at the gene level. Our method uses a set of simple heuristics to assign labels to the species clusters defined in Appendix A based on the composition of species to which the sequences belonged. We then identified hybridization events by searching for discrepancies between the gene and whole-genome labels as explained below.

Simple hybrid genes

We first assigned labels to the species-level clusters constructed as described in Appendix 1. Based on the gene triplet results presented in Appendix 2, we expect most sequences to be contained in two main clusters, close to OS-A and OS-B’, respectively, with at least one sequence forming a separate cluster whenever the orthogroup was covered in the γ SAG. Based on this expectation, we considered all of the clusters from each orthogroup in decreasing order of their sizes and labeled them as described in the main text.

We next determined which orthogroups are part of the core genome. We defined a gene as a core gene if it is likely present in all of the genomes in our sample. The average coverage for each SAG in our sample was approximately 30%, so it is not possible to directly determine whether some genes are always present. Instead, we use a statistical measure of the coverage to assign orthogroups to the core genome. Specifically, we used the rank-coverage plot across all orthogroups and SAGs (equivalent to the complementary cumulative distribution) to determine an appropriate cutoff for the core genome. The rank-coverage plot showed three distinct regions. First, there were around a dozen orthogroups that had unusually high abundances, with the number of sequences in some cases being greater than the number of SAGs in our sample. These are likely multiple-copy genes and IS elements and were excluded from further analysis. Then there was a large plateau comprising over 1000 orthogroups with coverages in the range 100−150, followed by a sharp decrease (on a logarithmic scale) in the coverage. We found that setting a coverage threshold for the presence at 20% of the number of SAGs from each species approximately matched the shoulder between the plateau and the sharp decrease, and so we used this as a criterion for assigning core genes.

The final step in our algorithm is to catalog hybridization events, which we do by simply searching for gene sequences whose labels were different from the whole-genome species assignment. Analysis of these results showed a large number of transfers, mainly from β into α, which came from contigs that did not contain any genes labeled as the host species. One explanation for the presence of such contigs is that they represent larger genomic segments flanked by repetitive regions (such as IS elements) that are transferred and integrated into the host genome. Long repetitive genomic segments can lead to contig breaks during assembly, which would explain the complete absence of genes belonging to the host cluster. Alternatively, such contigs could be the result of contamination with external DNA. However, the contamination hypothesis does not explain why there seem to be more transfers from β into than vice versa. We also did not find any evidence of differences in the number of such transfers between the sample containing only α SAGs (OS2005, see Appendix 10) and samples containing mixtures of and β. While this evidence suggests these transfers are very likely the result of hybridization, we cannot rule out the possibility of contamination. We therefore excluded all hybridization events on contigs that did not contain any genes labeled as the host genome from the analysis. The results of our analysis are summarized in Figure 5A, B in the main text.

Appendix 5—figure 1
The majority of orthogroups contain hybrid sequences.

Orthogroups were classified into four mutually exclusive groups depending on the amount of hybridization observed. The majority of mixed-species orthogroups are mosaic hybrids (see main text) and are shown in orange. Orthogroups in which the local sequence clusters matched the whole genome ones were labeled as ‘no hybrids’ and shown in blue, those in which a single cell contained a sequence from a different cluster were labeled ‘singleton hybrids’ and shown in green, and those in which multiple cells contained sequences from different clusters were labeled ‘non-singleton hybrids’ and shown in red.

Appendix 6

Linkage block detection and analysis

In this section, we describe the method we used to discover SNP blocks and the statistical analyses we performed.

Processing alignments for SNP block detection

In order to identify linkage blocks originating from hybridization, the raw alignments from the pangenome construction were processed as follows. We divided core OGs into distinct species clusters, as explained in Appendices 1 and 5, and removed any sequences from SAGs assigned to a different species than the cluster. This was done to avoid false positive blocks due to the presence of recently transferred whole-gene alleles from different species. For OGs with mixed-species mosaic clusters, all sequences in the cluster were initially included. Alignments were then trimmed and codons that contained a high frequency of gaps were removed. Finally, we defined a main cloud of sequences for α and β comprising sequences within a divergence diameter dM=0.1 from the consensus for each species and removed all sequences outside it.

SNP block detection method

We used a linkage-based approach to search for evidence of hybridization within genes. Visual inspection of the alignments in the highly diverse α genes and in most β genes revealed segments with high SNP density grouped into two haplotypes in strong linkage disequilibrium. We hypothesized that these SNP blocks were the result of intragenic hybridization, and we developed an algorithm to systematically identify them.

The algorithm can be summarized by the following steps:

  1. Read orthogroup alignment

  2. Pre-process alignment

    1. Trim alignment edges and remove columns with high fraction of gap codons

    2. For each species, infer consensus sequence and remove sequences outside a predetermined diameter around the consensus

    3. Replace all SNPs with absolute frequencies n<nc=4 with the consensus for their respective columns

  3. For each species s, subsample sequences from cells assigned to s

  4. Calculate the linkage matrix rij2 for all pairs of SNPs

  5. Identify groups of consecutive SNPs with rij2=1, up to a rounding error of ϵ=105, containing at least lmin=5 SNPs

The inclusion of step 2c accounts for the possibility of more recent low-frequency SNPs emerging after the initial hybridization took place. Note that typical alignments depths are ∼30 for α and ∼70 for β and are considerably larger than nc.

We applied the above algorithm to the alignments of all of the core genes and obtained a total of 1000 blocks within α and 1483 blocks within β. Most blocks were short, but those in α were on average more than twice as long as those in β (L=68 in α vs L=29 in β; Appendix 6—figure 1A). Similarly, α blocks contained more SNPs on average compared to β, although here the difference was less pronounced (l=11.3 in α vs l=7.9 in β; Appendix 6—figure 1B).

Appendix 6—figure 1
Typical blocks are short and contain a handful of SNPs.

(A) The distribution of lengths of SNP blocks within α (orange) and β (blue) genomes, detected using the algorithm described in this section. (B) The distribution of the number of SNPs within the same blocks as in (A).

To identify hybrid haplotypes, we first determined the consensus sequences for the two block haplotypes, together with the other two species within the block segment. We then labeled any haplotypes whose sequence was identical to one of the other two species as a hybrid segment with the corresponding species as the donor. These results are shown in Appendix 6—table 1.

Appendix 6—table 1
Number of SNP blocks within α and β containing haplotypes that mapped to different donor species in core genome.
Host speciesBlock donor speciesTotal blocksGenes with blocksTotal gene transfersMajority mosaic genesCore genes
αβγX
α-29134724410003651943701448
β310-1259411483669703701448

To control for other processes that could potentially produce similar patterns of linkage disequilibrium, we compared the distributions of divergences between haplotypes and block lengths for cases where a hybrid haplotype could be identified directly to cases we could not. No obvious difference in the distributions could be seen between the two cases, consistent with our interpretation that the blocks without matches to other species in our sample are also the result of hybridization.

Appendix 6—figure 2
Examples of blocks and hybrid sequences within ribosomal genes.

Alignments are represented as described previously with blocks highlighted in purple. (A) Alignment of 16S sequences with a β cell containing a hybrid segment starting at around position 340 and ending around position 580 in the alignment. Note the shorter block shared by the first two β sequences at the end of the alignment (around position 1070). (B) Alignment of the SSU ribosomal protein rpsN with α cells ordered at the top and β cells at the bottom of the alignment. The first α cell was chosen as the reference sequence and is shown in color. Note the large fraction of α cells around 200–300 bp that are similar to the β sequences and distinct from the α reference. (C) Segment of alignment of β sequences at the LSU ribosomal protein rplM. The reference sequence was chosen to illustrate short SNP block in the center of the alignment.

Haplotype divergences

We further verified that the divergences between block haplotypes were consistent with what we would expect from hybridization. We calculated the fraction of nonsynonymous and synonymous nucleotide substitutions, pN and pS, respectively, between the consensus sequences of each haplotype (see Appendix 7). We then removed blocks with either pN or pS were above a threshold of 0.6. This was done to avoid large errors in estimating the divergence, which we calculated using the Jukes–Cantor correction d=(3/4)ln(14p/3) (Jukes and Cantor, 1969). The overall number of such blocks was small (46 out of 1442 in α and 143 out of 2008 in β) so their removal does not affect our conclusions about the typical blocks. We found that the distributions of both the nonsynonymous and synonymous divergences were similar to those previously reported from comparisons of OS-A and OS-B’ (Rosen et al., 2018), and consistent with our hybridization hypothesis (Appendix 6—figure 3).

Appendix 6—figure 3
Block haplotypes have high synonymous and non-synonymous divergences, consistent with them being generated through hybridization.

Panels show histograms of nonsynonymous (blue) and synonymous (orange) divergences between the two haplotypes of SNP blocks from α (A) and β (B), respectively. Solid lines show the mean nonsynonymous (blue) and synonymous (orange) divergences between α and β.

Appendix 7

Genetic diversity analysis

In this section, we present our analysis of the genetic diversity within individual genes in α and β. In the first subsection, we define the different metrics we used and present the variations in diversity along the genome of both species. In the next four subsections, we compare the diversity of different genes and use the α backbone to roughly infer the time scales of the evolutionary process. In the last subsection, we present a scenario for the evolution of the population.

Nucleotide diversity along the core genome

We quantified the nucleotide diversity across all genes in both α and β using the following procedure. Alignments were trimmed for excess gaps and sequences were assigned to each species as described in Appendix 1. We then define the nucleotide diversity in the usual way as

(7) π=1S(S1)ijIπij,

where I is an index for sequences in the alignment, S is the sample size, equal to the number of elements in I, and πij is the fraction of nucleotide substitutions between i and j, excluding gaps. To calculate synonymous (πS) and non-synonymous diversities (πN) we replace πij with the fraction of synonymous (πijS) and non-synonymous substitutions (πijN), respectively. We calculated πijS and πijN using two different methods: the fraction of fourfold degenerate (4D) and onefold degenerate (1D) sites, respectively, at which i and j differ and using the Nei–Gojobori method (Nei and Gojobori, 1986). We found that both methods gave similar results and used the one that was most convenient for each analysis.

To calculate the divergence between species, we first determined the proportion of sites that were substituted as follows:

(8) p=1SαSβiIα,jIβπij

where Iα and Iβ are the indices of sequences from α and β in the alignment, and Sα and Sβ are number of sequences from each species, respectively. The fraction of pairwise substitutions πij is the same as above. We then use the Jukes–Cantor correction (Jukes and Cantor, 1969) to correct for the effect of multiple substitutions at a single site

(9) d=34ln(14p3).

The synonymous (dS) and non-synonymous divergences (dN) were calculated in a similar way by replacing πij with πijS and πijN in Equation 8, respectively. Finally, to calculate the diversity within each species separately, we take the sum in Equation 7 over Iα or Iβ only.

Diversity heterogeneity along core genome

There are two plausible explanations for the heterogeneity within α found in Figure 3 (main text). The low-diversity genes could be due to local sweeps similar to the ones that formed the divergence troughs shown in Figure 5C in the main text. Under this scenario, the high-diversity genes would represent the typical diversity within α that has accumulated since its common ancestor. Since the high-diversity genes are as diverse as typical β genes, this scenario would imply α and β each had common ancestors at similar times in the past. Alternatively, the low-diversity genes could represent the typical diversity within α, with the high-diversity genes being the result of hybridization with other species.

To distinguish between these two possibilities, we developed a method to separate low diversity from high-diversity genes within α. Specifically, we calculated the fraction flS=PlS/LlS of 4D sites at locus l that were polymorphic within the α alignment. Here, PlS is the number of synonymous polymorphisms and LlS the number synonymous sites at that gene. By using 4D sites, we naturally control for differences in gene conservation. If mutation rates across loci do not vary too much, we expect the distribution of flS across genes to be given by a Poisson distribution. The mean of this Poisson distribution can be fitted from the fraction of sites that are not polymorphic

(10) P(k=0|μ)=eμ=1lPlSlLlS,

from which we get the following estimate for the mean:

(11) μ^=ln[1lPlSlLlS].

When we compared this prediction to the data for α, we found that the Poisson distribution gave an excellent fit for around half of the core genes (675 out of 1448) with flS0.05. In the remaining core genes (773 out of 1448), flS was much larger than predicted from the Poisson (Figure 3B in main text). Note that the agreement in the inset of Figure 3B is based on a single fitting parameter. Based on these results, we chose a threshold of flS0.05 to define the α ancestral backbone, with all other genes labeled as hybrid.

To validate our partition, calculate the synonymous diversity πS within each group. We found that our criterion produced a clear separation in the synonymous diversity, with low-diversity genes having a mean πS1.7×103, compared to πS3.4×102 for the high-diversity genes. As a result, we estimate that more than 95% of the genetic diversity within α is the result of hybridization. This result shows that the molecular clock estimate of the time since the most common ancestor based on TMRCA=πS/(2μτ), where μ is the mutation rate per division and τ the generation time, should be interpreted with caution. In the case of α the estimate using the average πS across the whole genome would be around an order of magnitude less than the true value, even without considering how mutation rates might change over such long time scales.

We performed a similar analysis for β but did not find any large peak near flS=0 (Figure 3C in main text). Instead, the distribution of flS was close to unimodal but with significantly fatter tails compared to the Poisson. Given the excellent fit to the Poisson for α in the region where SNPs are purely due to mutations, this suggests that other mechanisms play an important role in generating the diversity within β. The most parsimonious explanation is that the hybridization process we observed in α has covered most of the original β genome by this point. While it is not clear what the distribution of flS should be in this case, it is plausible that it need not be Poisson (Sakoparnig et al., 2021). Further evidence in favor of this hypothesis is the presence of SNP blocks across a large fraction of β core genes, as discussed in the main text and Appendix 6.

Contribution of hybridization to genetic diversity

The large number of SNP blocks found throughout the genomes of both α and β shows that hybridization is an important source of genetic diversity for the Synechococcus population. However, because of the requirement for perfect linkage within SNP blocks, the total number of SNPs that are the result of hybridization is likely much higher than the one we infer from our algorithm. To obtain quantitative estimates of the total contribution from hybridization to the diversity, we use α as a control. As shown previously, only ≈5% of the total diversity in the high-diversity loci from α can be attributed to mutations. These loci contained 603,588 sites and 64,462 SNPs, of which 10,391 SNPs were found inside SNP blocks. Thus, only around 17% of the SNPs that are likely the result of hybridization, or hybrid SNPs, are included in the SNP blocks.

Within the β core genome, we have 1,591,476 sites and 146,906 SNPs, of which 14,819 were within SNP blocks. Assuming a similar ratio of hybrid SNPs to SNPs in SNP blocks as in α, we estimate that around 90,000 or approximately 60% of SNPs in β are the result of hybridization. This is consistent with β having evolved over longer time compared to α. However, our knowledge of how multi-locus SNP patterns change over the course of evolution is limited, so the estimates for β should be interpreted with caution. In particular, if larger segments continue to be broken up due to recombination, it is likely that the fraction of hybrid SNPs within blocks would decrease over time for a constant minimum block size lmin (see Appendix 6). Thus, the overall contribution of hybridization to the β genetic diversity may be even higher.

Geological constraints on microbial evolution

Combining genetic diversity analysis with evidence from the geological record of the Yellowstone caldera allows us to make inferences about the evolutionary history of the Synechococcus population. If we assume that most of the fixed variants between species are the result of mutation accumulation, then we can use the typical synonymous divergence between species to infer the time since the most recent common ancestor for the whole population. Typical bacterial mutation rates for natural populations are around μ109.5 per bp per generation (Chen et al., 2022). Using this rate as a baseline and assuming generation times of τ110 days gives an estimate of TMRCA106.5107.5 years (Rosen et al., 2018).

For our argument, the lower bound is the more important one since it determines whether we can exclude the single-colonizer hypothesis. A TMRCA below 106.5 years would imply a shorter division time or higher mutation rate than our estimates. Our lower bound of 1 day for the division time is around the same as the fastest division rate observed in the laboratory. If the population is in steady state, as it appears to be, the division time sets the maximum generation time, which can be attained if the death rate is equally high. If the death rate is lower than this, other factors would limit cell division and the generation times would be longer. Taken together, we can be fairly confident that a generation time of 1 day is a hard lower bound.

The mutation rate is less certain since we do not have any reliable measurements and experimentally measured mutation rates might not be a good predictor of substitution rates over evolutionary times (Rocha, 2018). We can instead use other bacteria to estimate a range of values. In wild-type E. coli, mutation rates of 21010 per bp per generation have been measured in laboratory conditions (Lee et al., 2012), but mutator strains with rates that are up to 100 times higher are also known (Sniegowski et al., 1997). In natural populations, rates as high as 4109 have been inferred using longitudinal sequencing (Moran et al., 2009). If we use 10−85 mutations per bp per generation as an upper bound, we would infer a lower bound of TMRCA105.5 years, which very narrowly overlaps with the time since the formation of the Yellowstone caldera. Thus, while it is not inconceivable that the three species could have diverged within the caldera, such a scenario would require all the relevant parameters to be at the limit of currently known values. In addition, such a scenario would require additional hypotheses to explain why the strains initially diverged into distinct species, with seemingly little genetic exchange between them (Rosen et al., 2018), and then only much later started to hybridize. Taken together, current evidence suggests the most likely scenario is that α, β, and γ diverged from a common ancestor before the formation of the caldera around 640,000 years ago.

Upper bound on mixing time between springs

Appendix 7—figure 1
Non-hybrid regions of α have similar levels of diversity within and between springs.

The distribution of pairwise synonymous diversity was calculated across non-hybrid genes from the alpha α population for different subgroups of cells. Square symbols show the distributions for pairs of cells from the same spring (purple for Mushroom Spring and cyan for Octopus Spring). The cyano triangles show the distribution for a random subsample of cells from the Octopus Spring of the same size as the Mushroom Spring sample (n=35) for comparison. The red diamonds show the distribution for pairs of cells from different springs. Non-hybrid genes were determined using a 5% cutoff in the fraction of polymorphic synonymous sites (see text).

By using the diversity within α as a proxy for time, we can obtain estimates on the time scales of spatial mixing that influences the diversity within the population. To do this, we calculated the pairwise distances in the low-diversity regions of the α genome between pairs of cells in each spring separately and pairs of cells from different springs. We found that all three distributions were similar (Appendix 7—figure 1). The mean of the distribution for cells Mushroom Spring was slightly lower than those from the Octopus spring (πS=8.6104 vs πS=1.1103) and this difference could not be explained by differences in sample sizes between the two springs. However, both values were very close to the mean of cells from different springs (πS=1.0103). This suggests that the time scale of spatial processes relevant for the long-term evolution of the population, τm is much shorter time scales than TMRCA for α. Taking TMRCA3103 years as a reference point sets an upper bound of τm3103 years.

Proposed scenario for Synechococcus evolution

We propose the following scenario that is consistent with our previous discussion. First, we argued that a small number (3–6, but possibly more) of Synechococcus populations that had been evolving separately for 106107 years independently colonized the Yellowstone caldera after its formation 640,000 years ago. After colonization, they started to hybridize, leading to a gradual mixing of their genomes (Figure 8). This can be most clearly seen in the α species, where the diversity in core genes that do not show signs of hybridization is more than an order of magnitude less than those that do. Moreover, genes that have undergone hybridization account for the majority of diversity within α. Within β, we do not observe such heterogeneity, but we find SNP blocks are even more prevalent than in α (Appendix 6—table 1). The simplest explanation for this pattern is that β has been undergoing hybridization for longer, and by now, all of its core genome has been mixed with other species. Consistent with this explanation, we find that only a small fraction of SNP blocks within β can be traced back to α (Appendix 6—table 1). Most of the β SNP blocks cannot be traced to either α or γ and are likely the result of a combination of hybridization with species that are not currently within the Lower Geyser Basin, have abundances lower than our detection threshold, or have gone extinct.

Appendix 8

Estimate of recombination parameters

There are several important parameters that determine the impact of recombination and hybridization on the evolution of the population. First are the rates of recombination and mutation in the population. We denote the population recombination rate by ρ=2TR and the population mutation rate by θ=2Tμ, where T is the average time since the most common ancestor, R is the recombination rate per site, and μ is the mutation rate per site. Note that in the panmitic Wright–Fisher model, T in units of generation time is equal to the population size N, but this is not generally true (Neher and Hallatschek, 2013; Weissman and Hallatschek, 2014; Birzu et al., 2018; Birzu et al., 2021). A second important parameter is typical length of recombined segments, which we denote by Lr. Finally, there is the typical divergence of recombined segments. In this Appendix, we use the nucleotide distance p or the synonymous distance pS, as defined in Appendix 7. We estimated these parameters for the hybridization process, which we argued previously is the dominant source of diversity in the population.

Before describing the analysis, it is important to note that while some of the simple statistics discussed here may be captured by simple neutral models, the underlying dynamics may be far more complex. Indeed, the analysis in the main text shows extensive evidence of selective sweeps affecting the genetic diversity across a broad range of genomic length scales, contrary to simple neutral theory. Moreover, while our analysis shows that—over time scales implied by our sequencing depth—the Synechococcus population is effectively well-mixed within the caldera, the actual population is obviously not. Synechococcus cells live in a dense microbial mat spread across thousands of springs separated by distances up to ∼102 km (Ward et al., 1998). As a result, we expect the long-term evolution of the population to be described by effective parameters, which may have little relation to recombination rates and population sizes directly measurable in the field or in the laboratory (Neher and Hallatschek, 2013; Weissman and Hallatschek, 2014; Birzu et al., 2018; Birzu et al., 2021). Understanding the relationship between the measurable parameters describing recombination, selection, and spatial dynamics of individual cells and the effective parameters at the population scale is an important and interesting topic for future work.

Many different methods have been proposed to estimate these or other closely related parameters (Croucher et al., 2015; Didelot and Wilson, 2015; Lin and Kussell, 2019). All of the methods we are aware of are based on models that make various assumptions that may not be valid for any given population, such as ours. In our case, an accurate estimate of these parameters requires a quantitative model of the hybridization process and is beyond the scope of the paper. Here, we instead used the results of our previous analysis to get rough estimates of these parameters and compared them to two other commonly used methods.

We considered each species separately, starting with α. For most evolutionary models, θ can be estimated from the nucleotide diversity π (Neher and Hallatschek, 2013). As in the main text, we used only synonymous sites to control for differences in the strength of purifying selection or the accumulation of non-synonymous mutations during rapid adaptation. Based on our previous analysis, we estimated the population synonymous mutation rate θS2103 (Appendix 7). Because of the clear distinction between the ancestral backbone and hybrid genes in α, we expect this estimate to be reasonably accurate. A rough estimate of ρ can be obtained from the linkage disequilibrium curves (Figure 2 in main text). Our previous analysis showed that ρ0.03 and ρ0.12 gave good fits for α and β, respectively. Taking the ratio, we found a ratio between recombination and mutations of ρ/θS15—much higher than previous estimates.

Estimating the mutation rate in β is more difficult because there are no genes unaffected by hybridization. We instead used the fraction of SNPs found in SNP blocks to estimate the diversity purely due to mutations. In Appendix 7, we estimated that around 40% of SNPs are the result of mutations. Multiplying with the synonymous diversity gave θS0.02. Using this estimate, we found ρ/θS6, which is lower than in α but still significantly higher than values that have been reported previously.

The other recombination parameters can also be inferred from our analysis. As we showed previously, the typical divergence of hybrid segments is given by the synonymous distance between species pS0.4. We estimated the average size of recombined segments from the linkage curves. The recombination rate between two sites on the genome increases linearly with the separation x for distances shorter Lr. In the standard neutral model, this leads to the scaling σd2(x)1/x at intermediate distances (see Appendix 4). For x>Lr, the recombination rate saturates to the genomewide recombination rate ρLr, which results in deviations of the linkage decrease from the predicted σd2(x)1/x scaling. In α, we observe a deviation around x500 bp. Note that we verified that this plateau cannot be explained by the smaller finite sample size in α. A similar argument gives a shortened segment length Lr300 bp in beta. While we cannot directly test that the plateau is not due to sample size, the small effect of subsampling we observed earlier suggests it is an unlikely explanation (Appendix 4—figure 2). Note that in both species, our estimated genomewide recombination rate ρLr is much greater than one, consistent with the genomewide linkage being close to the unlinked control, as discussed in the main text.

Comparison to other methods

We compared our analysis to standard methods. ClonalFrameML and other related software are perhaps the most widely used methods to infer bacterial recombination (Didelot and Wilson, 2015). Such methods assume that recombination rates are rare enough that they can be described by a Poisson process on an asexual tree, similar to mutations. We refer to the assumption as the clonal frame hypothesis. ClonalFrameML uses the inferred phylogenetic tree from pairwise distances to identify discrete recombination events and assign them to branches on the tree. The recombination rate, genetic divergence, and length of recombined segments are then inferred from these events.

Note that the clonal frame hypothesis is inconsistent with the rapid decrease in linkage we observed (Figure 2 in main text). Nevertheless, it is useful to see how the method performs on our data.

We used a subset of high-coverage SAGs to create whole-genome alignments for α and β, which we then analyzed with ClonalFrameML. We first selected SAGs more than 75% complete, which resulted in a dataset of 2 α and 12 β SAGs (Appendix 1—figure 1). For each species, we chose the genes that mapped to OS-A or OS-B’ and created concatenated alignments of the entire genome in the order of the references. Missing genes or sequences were replaced by gaps. We constructed the phylogenetic tree using FastTree and ran ClonalFrameML on the alignments and trees with default parameters.

Appendix 8—figure 1
ClonalFrameML does not converge to a self-consistent set of recombination events in high-coverage β SAGs.

Whole-genome alignments for 12 high-coverage β SAGs were generated as described in the main text. The phylogenetic tree for this subset was inferred using FastTree with default parameters and used as input for ClonalFrameML. The histogram shows the distribution of the length of recombined segments inferred by ClonalFrameML. The solid black line shows the maximum-likelihood exponential distribution used to fit the inferred segments. Note the heavy tail of the inferred distribution which is inconsistent with the exponential fit.

For both species, ClonalFrameML inferred a ratio of recombination to mutation rates ρ/θ1 (R/θ in Didelot and Wilson, 2015) and a fraction of nucleotide differences for imported segments of p=0.070.09 (ν in Didelot and Wilson, 2015). Note that both θ and p reported by ClonalFrameML refer to nucleotide rather than synonymous differences. The values of p are around half of the distance between species p0.15 and are inconsistent with our metagenome analysis which showed no genomes present in OS or MS at these divergences. The main difference between the species was in the average length of recombined segments, which was almost 10 times higher in α (Lr=365) compared to β (Lr=45).

A detailed comparison of the results of ClonalFrameML with our analysis is beyond the scope of this paper. Nevertheless, we found several inconsistencies in the assignment of recombined segments by ClonalFrameML. Because of the lack of data in α, we focused this analysis on β. First, we found that the distribution of recombined segment lengths was inconsistent with the exponential distribution assumed by ClonalFrameML and had a mean that was almost double the inferred δ=45 (Appendix 8—figure 1). Second, manual validation of a small sample of randomly chosen recombined segments revealed identical patterns present in other cells that were not assigned to the recombination event (data not shown). We suspect that these misassignments are due to the inconsistency of the observed segments with the asexual tree.

The failure of ClonalFrameML to fit the data should not be surprising. As we showed, the low levels of linkage at whole-genome scales in α and β are inconsistent with the existence of a clonal frame assumed by the model. In addition, manual validation of some of the recombined segments detected revealed spurious segments caused by gaps in coverage due to the incomplete nature of the SAGs. These results caution against the naive application of methods such as ClonalFrameML to SAG datasets in general, but also to other populations for which the clonal frame hypothesis may not hold.

Appendix 8—figure 2
SNP correlation profiles in α agree with mcorr fits but show significant discrepancies at short distances in β.

An alternative method for fitting the clonal frame model was proposed recently by Lin and Kussell, 2017; Lin and Kussell, 2019. Instead of identifying recombination events directly, they use the correlations between SNPs to infer the recombination parameters statistically. Specifically, they model a local (sample) population that evolves according to a neutral model with recombination. They assume that the population recombines with a larger and more diverse ‘pool’ that evolves according to the same model but with different parameters. Under this model and assuming the recombination to mutation rate ratios are the same in the sample and pool populations, they show that the conditional probability that two sequences differ at a site given that they differ at a second site l bp away, P(l), depends only on population mutation rate (θ in our notation), the population recombination rate (ρ) and the average length of recombined segments (Lr). By fitting the analytical expression for P(l) from the model to the data, all of the model parameters can be inferred.

Appendix 8—table 1
Comparison between recombination parameters obtained from our analysis, mcorr (Lin and Kussell, 2019), and ClonalFrameML (Didelot and Wilson, 2015).

For each species, the four columns show the estimated ratio of recombination and mutation rates ρ/θ, the average length of recombined segments Lr, the fraction of nucleotide differences between host and recombined segments, p and the ancestral nucleotide diversity θ estimated by the different methods. All quantities were converted to the notation used in the text as follows. The mcorr columns are the best fit values of the parameters ϕs/θs,f¯,dp, and θs from Lin and Kussell, 2019. The ClonalFrameML columns are the maximum likelihood values of the parameters R/θ,δ,ν, and θ from Didelot and Wilson, 2015. Note that our analysis and mcorr only used synonymous sites, while ClonalFrameML used all sites. The diversity parameters (ρ/θ, p, and θ) from ClonalFrameML that are not directly comparable to the other two analyses are marked by asterisks.

Speciesρ/θLrpθ
Our analysisα∼15≳5000.42 · 10-3
β∼6≳3000.4~2 · 10-2
mcorrα0.384110.113 · 10-3
β0.455730.061.5 · 10-2
ClonalFrameMLα0.88*3650.07*3 · 10-4*
β1.5*450.09*3 · 10-4*

We generated alignments of the high-coverage α and β SAGs using Mauve (Darling et al., 2004) and followed Lin and Kussell, 2019 to fit P(l). Note that unlike in ClonalFrameML, we used only synonymous sites for the mcorr analysis, so the divergences cannot be directly compared between the two methods. We found the model fit the data for α reasonably well up to two distances around 300 bp, with the exception of the first two points. However, the model was not able to capture the slower decay in correlations at longer genomic distances (see also Appendix 4). In β, the fit at short distances was considerably poorer (Appendix 8—figure 2). In particular, correlations between nearby SNPs were ∼2× higher in the data compared to the model. The increased correlations at short distances could be due to genetic hitchhiking during gene sweeps or could be a sign of subtle forms of population structure within β. Distinguishing between these scenarios is an interesting topic for future work.

Compared to ClonalFrameML, the recombined fragment lengths inferred by mcorr were larger (especially for β) and were reasonably close to our estimate. In addition, the mcorr estimate of the diversity due to mutations (θ in our notation) agreed remarkably well with our estimates in both species. Similar to ClonalFrameML, the synonymous divergence of hybrid segments was between the divergence between species and the diversity within species and was inconsistent with our metagenome analysis. Finally, the ratio of recombination and mutation rates was similar to the one inferred from ClonalFrameML and more than an order of magnitude smaller than our estimates. These results are summarized in Appendix 8—table 1. We leave a more careful analysis of the differences as a topic for future work.

Appendix 9

Variation in hybridization across samples

In this section, we describe how we quantified associations between genomic features and populations from different samples. Our primary focus was on differences in hybridization patterns, so we analyzed the variations in hybrid gene and SNP block haplotype frequencies across different samples and locations. Given the small number of samples in our study, we limited the number of statistical tests we performed in order to avoid false positives. For each of the two main species, α and β, we tested for the effect of sample location and sample composition separately, as described below.

Variation in SNP block haplotype frequencies

We considered two distinct hypotheses that could lead to associations between block haplotypes and environmental conditions from which the sample was taken. First, we considered if the community composition was associated with specific haplotypes by comparing samples in which a single species was present to samples which contained a mixture of α and β. For α, we chose the 48 cells from pure-α sample OS2005 as the test group and the 45 α cells from MS2004, MS2006, and OS2009 as the control. For β, we chose the 44 cells from the pure-β sample MS2005 as the test group and the 46 β cells from MS2004 and MS2006 as the control. We denote these two tests as ‘α-composition’ and ‘β-composition’ tests, respectively. Second, we considered the hypothesis that sample location was associated with specific haplotypes. For this test, we divided all cells from each species into two groups labeled MS and OS, based on the location from which the sample was taken. These tests are denoted by ‘α-location’ and ‘β-location’, respectively.

We used the following permutation test to find block haplotypes that were significantly associated with each environmental variable. We first removed all blocks in which the minority haplotype was below an arbitrary cutoff frequency, set at 5 sequences. This was done in order to avoid tests which were statistically underpowered. We then calculated the two-sided p-value for the association for each block individually using Fisher’s exact test.

Appendix 9—table 1
Result of tests for associations between SNP block haplotypes and two environmental variables related to the samples.

The rows show the results of the different tests described in the main text. The p-value p for each block was determined using Fisher’s exact test and minimum p-value from the randomized control prandmin was used to correct for false-discovery rates (see main text for details). The first four columns show the total number of blocks in each category. The last column shows the largest value of the prandmin/pmax (most significant association) for each test.

TestTota blocksBlocks testedSignificant (prandmin/p>1)Highly significant (prandmin/p>10)Largest prandmin/p
α-Composition100060213112.0
α-Location100072461522
β-Composition1483848000.8
β-Location1483130615132.9

Because the sampling depth for each block is variable due to variations in coverage, standard methods that correct for the false-discovery rate are not well-suited in our case. We use the following procedure to determine significant associations. We randomized the composition of the test and control cell groups for each test, while keeping the total number within each group fixed. We then recalculated the p-values using the same procedure as before and recorded the lowest p-value across all blocks prandmin. We used the ratio prandmin/p between the minimum p-value in the randomized control and the p-value in the test group as a measure of significance of the association. We arbitrarily set a threshold value of prandmin/pt=10, above which associations were deemed highly significant.

The results of these tests are summarized in Appendix 9—table 1.

Overall, we found very few blocks that were associated with either the sample composition or location. This is consistent with the results from the main text (Figure 7D), which showed that the block haplotypes are close to unlinked even on distances comparable to the whole genome. The largest number of associations was with the sample composition in the α species, consistent with the previously mentioned results, which showed a small but statistically significant increase in linkage within the α species compared to the β. Nevertheless, the overall fraction of blocks whose haplotypes were strongly associated with the sample composition was ≈1.2% and still small. We therefore concluded that neither the composition of the population within samples nor the location has a large effect on the hybridization patterns revealed by the SNP blocks. A complete table with the results of these tests can be found in the supplementary data tables.

Variation in hybrid gene transfers

We also examined variations in the rate of transfers of whole-gene alleles across species. The lower overall number of such events compared to the SNP blocks makes statistical tests for associations much more likely to yield false positives in this case. As a result, we limited the analysis to a qualitative comparison of the number of different types of transfers across samples shown in Appendix 9—figure 1. We did not observe any systematic variations with either sample location or sample composition for either species.

Appendix 9—figure 1
Numbers and types of gene transfers between species are similar across samples.

The number of hybrid gene transfers from each donor found in α (A) and β (B) cells across all eight samples in the data. The labels A, B, and C, denote α, β, and γ, respectively, while X denotes other Synechococcus clusters.

Appendix 10

Data quality control

Beyond the processing steps outlined in Appendix 1, we also explicitly verified that the hybridization patterns we observe cannot be explained by sequencing and genome assembly artifacts. Specifically, we tested if misassembly of reads from distinct alleles of the same gene could explain the presence of hybrid sequences. We considered two main sources of such reads: cross-contamination from neighboring reaction wells during whole-genome amplification (WGA) and the presence of paralogs horizontally transferred from other species.

Distinguishing between hybridization and artifacts is hard because both can lead to similar signals. Here, the structure of data provides us natural controls. First, α has genes with low diversity and little sign of hybridization and high-diversity genes with extensive hybridization. We can use former to establish a baseline for genome coverage and check for deviations in the later. If the hybridized genes are caused by misassemblies of actual between α genes and contamination from β or between different alleles of the same gene within α cells, we should see an increase in the coverage at these loci. Second, the variability in species composition across samples provides controls for cross-contamination between reaction wells. Specifically, OS2005 contained 48 cells, all classified as α, while MS2004, MS2006, and OS2009 together contained 45 α cells amplified on plates with mixtures of α and β cells in similar proportions (Appendix 10—figure 1). We will refer to the former as the control SAGs sample and the latter as the test SAGs sample. We used these two groups as a measure of the contribution of cross-contamination from neighboring β cells to the observed hybridization in α.

Appendix 10—figure 1
Species composition is highly variable across samples.

We first divided the core α OGs into test and control groups as follows. We first divided OGs into low- and high-diversity groups based on the probability that synonymous sites are segregating in the sample using a cutoff at 0.05. From the low-diversity OGs, we chose the subset that were grouped into α and β, or α, β, and γ clusters during the pagenome construction (approximately 300 each out of a total of ∼800). These OGs were nearly clonal in the α sample and showed little to no evidence of hybridization in our entire data. These OGs will form the control OG group. From the high-diversity OGs, we selected those where all sequences were grouped together in a single mixed-species cluster (∼300 out of a total of ∼600). These include the cases with the most extensive hybridization and thus formed the test OG group.

We validated that core gene coverage can be described by a random process in which each gene has a fixed probability p of being present in the assembly. We fitted the baseline probability p using the low-diversity genes from the OS2005 α cells. Many of these genes were clonal within the entire α sample and showed no evidence of hybridization with the other species in our sample. They thus provide a good baseline for the expected coverage in the absence of contamination. While the coverage among different SAGs was highly variable (Appendix 10—figure 2A), we found that the distribution of coverages of different genes was well fitted by a Gaussian distribution with the mean and variance given by assuming each gene is an independent Bernoulli random variable with success probability p0.29. We tested the independence assumption by calculating the covariance between the presence of different genes (Appendix 10—figure 2C). Under our very simple model, any correlations between the coverage at different loci would be purely due to statistical fluctuations caused by finite sample sizes. The distribution of diagonal (Caa) and off-diagonal (Cab) components of the covariance matrix can then be calculated exactly given p. We compared the values obtained from the data with these predictions and found a very good agreement (Appendix 10—figure 2D).

Appendix 10—figure 2
Genome coverage across core genes is well-explained by independent presence–absence model.

(A) shows the presence-absence matrix Pia of low-diversity OGs in the control single-amplified genome (SAG) sample OS2005 (see main text for definitions). Rows and columns were ordered in decreasing order of the sums across the columns and rows, respectively. (B) shows the correlation matrix between OG presences defined as Cab=N1i=1N(PiaPi¯)(PibPi¯), where Pi¯=N1a=1LPia. Note the use of the mean abundance for each SAG rather than p=(NL)1i,aPia to account for the large variability in coverage between SAGs. (C) shows the distribution of OG presences Pa¯=N1i=1NPia. The solid orange line shows the Gaussian fit under the assumption that each gene is a Bernoulli random variable with success probability p. (D) shows the distribution of the entries in the covariance matrix from (B) together with the predictions for the diagonal (orange) and off-diagonal (green) components under the assumption that each gene is independent and present in the final assembly with probability p. Note the value of p was obtained from panel (C), and there are no free parameters for the predicted curves.

Having validated our gene presence model, we can now use it to perform simple but quite powerful tests for different artifacts. Consider the extreme case where all of the hybrid genes we observe are the result of misassemblies of reads from two distinct alleles, one from α and one from either β or γ, from the same reaction well. Note that for this argument, it is unimportant whether the alternative allele is due to contamination or genuine horizontal transfer that resulted in two distinct, unmixed copies of the gene. In either case, the probability of observing the gene in our assembly would be Pobs=p(2p)0.5, which is significantly higher than for a true single-copy gene. We tested for such an increase by comparing the distribution of coverages among α cells in the OS2005 sample of high-diversity genes that formed a single mixed cluster across all sequences in our data (see Appendix 1) and low-diversity genes with distinct α and β or α, β, and γ clusters. We found the two distributions were very similar to each other (Appendix 10—figure 3A), with a small but statistically significant increase in the mean coverage at low-diversity genes with distinct species clusters (pcontrol0.3 vs ptest0.28). The difference is in the opposite direction from the one expected from the misassembly hypothesis, but it corresponds to less than an extra copy of a gene on average and is unlikely to have any consequence for our conclusions.

Note that the misassembly hypothesis above would also predict a specific pattern for the sequences within α, with a large fraction 2(1p)/(2p)0.83 having either the pure α or the pure β alleles, and p/(2p)0.17 containing a mixture. We do not find evidence of such patterns in our data. More importantly, the 0.17 fraction is an upper bound on the maximum frequency of mixtures that can be explained through misassembly, as any lower amounts of contamination would lead to fewer chimeric sequences. Note that both of the above tests take advantage of the fact that the typical genome coverage is low.

Using a similar method to the one described above, we can also check for contributions from cross-contamination between reaction wells on the formation of chimeric sequences. Here, we compared the coverage distribution of mixed-species orthogroups in the mixed αβ samples described previously with the control group from the all α sample OS2005. If cross-contamination with β was a significant contributor to the presence of mixed-species orthogroups, one would expect a large difference between the two distributions. We did observe a large systematic difference between the two groups in the overall coverage p across all genes (data not shown), which was likely due to batch effects in the amplification and sequencing. We therefore compared the mean-centered distributions between the two groups and found no difference between them and the fit to the independent presence-absence model from before (Appendix 10—figure 3B).

Appendix 10—figure 3
Variability in core gene presence is inconsistent with observed hybridization being dominated by artifacts.

(A) compares the presence distributions among the control single-amplified genome (SAG) sample of the test OG group (blue) with the control group (orange), as defined in the main text. (B) compares the presence distributions of the test OG group in the control (blue) and test (orange) SAG samples. The green line shows the fit from Appendix 10—figure 2C. All distributions were centered at zero to control for batch effects (see main text).

Beyond the indirect presence of hybridization given by the presence of mixed-species orthogroups, we also tested for any effect of cross-contamination on the presence of linkage blocks. We applied the same algorithm described in Appendix 6 to the subsample of 48 α cells from OS2005 and compared the results with those from random samples of α cells from samples with mixtures of α and β cells. We found that the number of blocks detected in OS2005 was similar to the bootstrapped distribution from mixed αβ samples (Appendix 10—figure 4A). We also found the distribution of block lengths and divergences between block haplotypes to be similar as well (Appendix 10—figure 4BC).

Appendix 10—figure 4
Linkage blocks are robust to large variations in species composition across samples.

(A) Bootstrapped distribution of the number of linkage blocks detected in alignments across fourfold degenerate sites. The distribution was obtained by randomly choosing groups of 48 single-amplified genomes (SAGs) from samples containing mixtures of α and β SAGs (MS2004, MS2006, OS2006, OS2007, and OS2009). The arrow shows the number of blocks detected using the 48 SAGs from OS2005, which only contained α SAGs. Comparison of the distributions of block lengths (B) and divergence between the two haplotypes from each block (C) is shown in the other two panels. OS2005 (shown in blue) is compared to a randomly chosen sample from panel A (shown in orange).

Based on the evidence presented above, we concluded that cross-contamination between reaction wells and genome assembly artifacts does not contribute significantly to the presence of mixed-species orthogroups or linkage blocks. While we cannot definitively exclude that some sequences we observe are the result of amplification or assembly errors, the evidence above shows that they are rare enough to not have a significant effect on our main conclusions.

Data availability

All SAG assemblies and metagenome samples used in this study can be found on the JGI website (https://genome.jgi.doe.gov/portal/) under the project ID 503441. The analyzed data and code used to produce the figures in the main text are publicly available at https://github.com/gbirzu/yellowstone_cyanobacteria_hybridization (copy archived at Birzu, 2025) and https://doi.org/10.5281/zenodo.17534465.

The following data sets were generated
    1. Birzu G
    (2025) Zenodo
    Yellowstone Cyanobacteria hybridization.
    https://doi.org/10.5281/zenodo.17534464

References

  1. Book
    1. Coyne JA
    2. Orr HA
    (2004)
    Speciation
    Sinauer Associates.
  2. Book
    1. Jukes TH
    2. Cantor CR
    (1969) Evolution of protein molecules
    In: Jukes TH, Cantor CR, editors. Mammalian Protein Metabolism. Academic Press. pp. 21–132.
    https://doi.org/10.1016/B978-1-4832-3211-9.50009-7
  3. Book
    1. Ward DM
    2. Castenholz RW
    3. Miller SR
    (2012) Cyanobacteria in geothermal habitats
    In: Ward DM, editors. In Ecology of Cyanobacteria II: Their Diversity in Space and Time. Springer Dordrecht. pp. 39–63.
    https://doi.org/10.1007/978-94-007-3855-3_3

Article and author information

Author details

  1. Gabriel Birzu

    1. Department of Applied Physics, Stanford University, Stanford, United States
    2. Department of Physics, University of Florida, Gainesville, United States
    Contribution
    Conceptualization, Investigation, Methodology, Writing – original draft, Writing – review and editing
    For correspondence
    gbirzu@ufl.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-3561-024X
  2. Harihara Subrahmaniam Muralidharan

    Department of Computer Science, University of Maryland, College Park, United States
    Contribution
    Investigation, Writing – review and editing
    Competing interests
    No competing interests declared
  3. Danielle Goudeau

    DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, United States
    Contribution
    Resources, Data curation
    Competing interests
    No competing interests declared
  4. Rex R Malmstrom

    DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, United States
    Contribution
    Resources, Data curation, Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4758-7369
  5. Daniel S Fisher

    Department of Applied Physics, Stanford University, Stanford, United States
    Contribution
    Conceptualization, Supervision, Funding acquisition, Methodology, Writing – original draft, Writing – review and editing
    Contributed equally with
    Devaki Bhaya
    For correspondence
    dsfisher@stanford.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-5559-2491
  6. Devaki Bhaya

    Department of Plant Biology, Carnegie Institution for Science, Washington, DC, United States
    Contribution
    Conceptualization, Funding acquisition, Writing – original draft, Project administration, Writing – review and editing
    Contributed equally with
    Daniel S Fisher
    For correspondence
    dbhaya@carnegiescience.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-7965-4258

Funding

Simons Foundation (730295)

  • Gabriel Birzu
  • Daniel S Fisher

National Science Foundation (PHY-1607606)

  • Daniel S Fisher

National Science Foundation (PHY-2210386)

  • Daniel S Fisher

National Science Foundation (1921429)

  • Devaki Bhaya

National Science Foundation (2125965)

  • Devaki Bhaya

Joint Genome Institute (503441)

  • Devaki Bhaya

Joint Genome Institute (509352)

  • Devaki Bhaya

Simons Foundation (Sabbatical Fellowship)

  • Daniel S Fisher

Carnegie Institution for Science

  • Devaki Bhaya

National Institutes of Health (R01-AI-100947)

  • Harihara Subrahmaniam Muralidharan

United States Department of Energy

https://doi.org/10.46936/10.25585/60001132
  • Devaki Bhaya

The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Benjamin H Good, Alana Papula, and Freddy Bunbury for helpful comments and discussions. GB and DSF gratefully acknowledge support from the Simons Foundation via a Postdoctoral Fellowship Award 730295 to GB and Sabbatical Fellowship to DSF. This work was also supported by NSF Grants PHY-1607606 and PHY-2210386 to DSF. HSM was supported by the National Institutes of Health grant R01-AI-100947 to Mihai Pop. DB acknowledges support from the NSF/BIO-BBSRC collaborative research grant 1921429, NSF MIM grant 2125965, Joint Genome Institute Community Sequencing Project 503441 and 509352, and the Carnegie Institution for Science. Samples processed for single-cell genome amplification were collected using NP Park Permits: YELL-5494 to David Ward (multi-year), YELL-5660 to DB (2007–2008), and YELL 5694 to DB (2007–2009). The work (proposal: https://doi.org/10.46936/10.25585/60001132) conducted by the U.S. Department of Energy Joint Genome Institute (https://ror.org/04xm1d337), a DOE Office of Science User Facility, is supported by the Office of Science of the U.S. Department of Energy operated under Contract No. DE-AC02-05CH11231.

Version history

  1. Preprint posted:
  2. Sent for peer review:
  3. Reviewed Preprint version 1:
  4. Reviewed Preprint version 2:
  5. Version of Record published:

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.90849. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Metrics

  • 840
    views
  • 48
    downloads
  • 4
    citations

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Citations by DOI

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Gabriel Birzu
  2. Harihara Subrahmaniam Muralidharan
  3. Danielle Goudeau
  4. Rex R Malmstrom
  5. Daniel S Fisher
  6. Devaki Bhaya
(2025)
Hybridization breaks species barriers in long-term coevolution of a cyanobacterial population
eLife 12:RP90849.
https://doi.org/10.7554/eLife.90849.3

Share this article

https://doi.org/10.7554/eLife.90849