Figures and data

Synechococcus population comprises three genomic clusters.
(A) The mean divergence across homologous genes from the two reference genomes is shown for every genome in the data. Each dot represents a single genome and is colored according to its assigned species. (B) The relative abundances of all three species across samples from different temperatures within Mushroom Spring. Abundances were estimated based on read recruitment from the metagenome to core genes from the γ genome, as well as representative genomes from the α and β populations. Dashed lines show the minimum detection frequencies from the single-cell (black) and metagenome (gray) data, respectively (see Appendix 3). Shaded gray region shows temperature range of the 8 single-cell samples.

Genetic linkage analysis shows extensive recombination within α and β.
(A) Linkage disequilibrium 

α and β have distinct patterns of within-species diversity.
(A) The synonymous diversity πS within α (orange) and β (blue) at all core genes ordered according to their respective reference genomes. The solid line shows the value of πS, with lighter colors showing one standard deviation across different pairs around the mean. Both lines were smoothed by averaging over a sliding window of 5 genes with a step size of 2 genes. (B) Distribution of the fraction of polymorphic four-fold degenerate (4D) sites for each α core gene (blue histogram) is shown together with Poisson fit based on the number of genes without any polymorphic sites (black line and dots). Inset shows zoomed-in region of low-diversity genes (< 5% polymorphic 4D sites) together with the same fit. (C) The same as (B) using data from β. The Poisson distribution is fitted using the mean of the distribution, which provides a better fit than the number of genes without polymorphic sites (data not shown). (D) Histogram of synonymous diversities πS within genes for β (blue), α low-diversity backbone (orange), and α high-diversity (hybrid) genes (red). Note the overlap between the diversity in the β core genome and the hybrid α genes.

Alignments reveal distinct patterns of hybridization across different genes.
Three core gene alignments, each chosen to illustrate a distinct pattern of hybridization are shown. Rows represent different cells and columns different positions along the gene. One sequence was chosen randomly as a reference (top row), with nucleotides shown in different colors. The species assigned to each cell are indicated by the colored rectangles on the left, using the same color scheme as the species clusters. Rows were sorted hierarchically based on pairwise distances. Nucleotides are shown in grey if they are identical to the reference genome and in a different color if different. Species clusters were assigned as described in the main text and Appendix 1. (A) Segment of an alignment containing three distinct species clusters (α shown in orange, β shown in blue, and γ shown in green) of a non-hybrid gene. (B) Segment of an alignment containing two distinct clusters (α and β) and six different simple hybrids. The direction of the transfer was inferred as described in the main text and and is indicated on the left. Transfers from α into β and β into α are highlighted in orange and blue, respectively. Note that the presence of more than one type of transfer is rare and was chosen here for illustration. (C) Segment of a alignment without distinct species clusters chosen to illustrate mosaic hybrid genes. Regions with clear hybrid blocks are highlighted in purple. Note the extensive hybridization on short length scales, despite the fact that sequences from the same species cluster with each other based on pairwise distances.

Transfers of whole-gene alleles between species is common.
The frequency of hybrid gene alleles along the α (A) and β (B) genomes. Different colors show the donor species, with unknown species labeled X and shown in magenta. The orange lines on top show mosaic loci without distinct gene clusters. Note that the x-axis in (B) is shared with (C). (C) The mean synonymous divergence between α and β at homologous loci, ordered according to OS-B’. Smoothed line over a sliding window of five genes with a step size of two genes is shown in black. Maximum and minimum divergences at each locus are shown in red and blue, respectively. (Insets, middle) A zoomed-in region from a typical genomic region (left) and a diversity trough (right), with dots representing individual genes. (Insets, bottom) Alignment segments of individual genes, subsampled to 40 sequences for clarity. The first sequence is arbitrarily chosen as the reference with each nucleotide represented by a different color. Other sites are colored gray if same as reference or using nucleotide colors if mutated. Left colored bars show the species of each genome.

Diversity within hybrid genes and diversity troughs reveal mixture of soft and hard sweeps.
(A) The distribution of synonymous diversities in α genomes in the diversity troughs (purple histogram) is compared to a random sample of genes from the α backbone of the same size (orange histogram). The orange line shows the bootstrapped distribution of the α backbone. (B) Same as (A), but comparing diversity troughs in β genomes with the β core genome. (C) Each circle shows the synonymous diversity of a non-singleton simple hybrid locus (see Fig. 5AB) in α. The donor species are shown in different colors as labeled on the x-axis. For transfers between α and β, the diversity within β is shown as squares of the same color as a control, with sequences from the same locus connected by a line. The orange horizontal lines shows the synonymous diversity in the low diversity α backbone. Transfers above this line likely contain multiple independent events.

Hybridization between species leads to mixing of short DNA segments within single genes.
(A, top) Alignment segment of α genomes from a representative locus with four SNP blocks. The inferred ancestral α haplotype (labeled α1) is highlighted in orange and the hybrid haplotype (all β here; labeled α2) is shown in blue. Note the SNPs in different blocks in differrent sets of cells. (Middle) Illustration of SNP block analysis pipeline. (Right) The consensus alignment of the two α haplotypes, β, and γ is constructed. (Left) Pairwise divergences between consensus sequences are calculated. Note that α2 is identical to β here. (B) The joint distribution of divergences of α SNP block haplotypes from the other species for all SNP blocks. Each point represents a haplotype with coordinates showing the divergence from β (x-axis) and γ (y-axis). The orange and blue dots show the two haplotypes from the example in (A). Marginal distributions are shown on the top and right. Note the high density of haplotypes near the axes, indicating transfer from one of the other known species. (C) Same as panel (B) for β block haplotypes. Note the much larger number of haplotypes far from the axes compared to (B), indicating transfers of unknown origin. (D) Histogram of linkage coefficients between pairs of block haplotypes at different loci for α (orange histogram) and β (blue histogram) compared to the expectation under a fully unlinked model (solid orange and blue lines, respectively).

Hybridization between previously isolated lineages is main source of diversity within the population.
(Main panel) An illustration of the evolutionary scenario consistent with the main results of this paper. A single common ancestor (gray) for the population (t0) diverged into several species (four shown here) following spatial isolation (t1). The species colonized the Yellowstone caldera, represented by the yellow rectangle, at different times (t2−t3) and began hybridizing. This process continued up to the present (t4). (Top inset) Illustration of the mechanism for generating unlinked SNP blocks through recombination between a hybrid segment (blue) containing a beneficial region (red star) and the rest of the host species (orange). Note the four distinct haplotypes highlighted at the end. (Bottom inset) Zoom in of the ancestry of a particular genome (middle) compared with the donor (orange, top) and host (green, bottom) ancestral genomes. After the initial hybridization, recombination with other host genomes (red arrows) creates a mosaic structure across a range of genomic length scales.

The genome coverage across SAGs is highly variable.
The SAG coverage, defined as the total length of contigs, is shown for all 331 SAGs (after quality control). SAGs were grouped by species (see 2) sorted in descending order of coverage within each species. The solid orange and blue horizontal lines show the genome sizes of OS-A and OS-B’. The dashed orange and blue lines show the threshold used to define high-coverage SAGs. The values of the thresholds are set at 75% of the length of the reference genomes.

Distribution of pairwise nucleotide distances within orthogroups shows clear gap around πij ≈ 0.075.
(A) Histogram of nucleotide distances between all pairs of cells across the 1,448 core orthogroups. The dashed red line showns the distance cutoff used to define species clusters (dc = 0.075). (B) Distribution of orthogroups according to the number of species clusters. Species clusters were generated as described in the text using the distance cutoff shown in (A).

Species clusters at the gene level approximately match those at the genome level.
Figures show gene-by-gene comparison of three cells to the OS-A and OS-B’ reference genomes. The cells shown are representative of the α (A; OctA1J9)and β (B; MA02H14_2) clusters and the one γ cell (C; OcA3L13). Each core gene with homologs in both OS-A and OS-B’ was compared to the references and the divergences plotted as a triangle with each side length equal to the divergence between a given pair. The cell sequence is represented by a black dot, with the OS-A and OS-B’ sequences represented by the orange upward and blue downward triangles, respectively. The black circle is a 10% radius which was chosen as a cutoff to classify each gene. The different columns show genes in one of the five possible patterns labeled as follows: XA-B (the cell sequence is less than 10% divergend from OS-A and more than 10% diverged from OS-B’), XAB (all three sequences are within 10% of each other), X-A-B (all three sequences are more than 10% diverged from each other), X-AB (the cell is more than 10% diverged from both OS-A and OS-B’, but both references are less than 10% diverged from each other), and XB-A (the cell is less than 10% diverged from OS-B’ and more than 10% diverged from OS-A).

Proportion of gene triplet patterns shows three distinct clusters.
The proportion of genes in each of the five patterns shown in 1 is show of each cell passing our quality control criteria (see Appendix 1). Lines represent different cells and are colored according to the species classification based on whole-genome divergences.

Phylogenetic tree inferred from 16S rRNA agrees with whole-genome species classification.
The pairwise divergences between all 16S sequences passing quality control was calculated from the multiple sequence alignment performed using MAFFT Katoh and Standley (2013). The tree was generated using a hierarchical agglomerative clustering algorithm with average linkage (UPGMA) as implemented in SciPy Virtanen et al. (2020)

Best hits to OA3L13 16S rRNA in SILVA SSU Ref database.

γ SAG represents a new distinct species within Yellowstone Synechococcus.
Shows the maximum likelyhood tree obtained from an alignment of 16 marker genes present in the γ SAG together with all other thermophilic cyanobacteria from Ref. Chen et al. (2020). Colored labels show the two Yellowstone Synechococcus, OS-A (orange) and OS-B’ (blue), and the γ SAG OA3L13 (green).

SAGs excluded during quality control do not include additional γ genomes.
The distribution of divergences from the best BLAST hit to the γ SAG (OcA3L13) across genes for each SAG excluded during the quality control process (see Sec. 1) is shown. Each line represents the probability density of divergences across genes for one of the 44 SAGs.

Number of reads recruited to different loci from the three main species are highly correlated across samples.
The three panels show the average depth of reads recruited to representative SAGs from α (top), β (middle), and γ (bottom), respectively. Each panel shows 100 genes with the highest average read depth across samples, with each line representing one gene. Only reads that were < 5% diverged from the reference genomes were included.

Abundance of γ is significantly above detection threshold in 32 of 34 metagenomic samples and is highly correlated with α.
The relative abundances of α (orange circles), β (blue squares), and γ (green diamonds) across all 34 metageomic samples. The detection threshold for the single-cell (ft = 1/48 ≈ 0.021, based on the maximum number of SAGs per sample) and metagenomic samples are shown in black and grey dashed lines, respectively. The asterisks on the x-axis label the two samples (MSe4 and OSM3) with both single-cell and metagenomic data.

Vast majority of Synechococcus core gene sequences are within the α-β-γ main clouds.
(A) Cumulative distribution of the minimum divergence of each read from the reference sequences across all loci used in the left panel. Different colors represent samples taken from different temperatures from Mushroom Spring. Solid lines show minimum divergence from all three references and dashed lines from α and β only. Shaded olive region shows the typical divergences between species shown in the right panel. (B) Histogram of pairwise divergences between three representative SAGs for α, β, and γ at the 100 core loci chosen for illustration. Distributions are across all 150 bp segments from each locus. Shaded olive region the middle 95% percentile for the combined distribution.

Diversity of Synechococcus 16S rRNA sequences from Mushroom Spring temperature series samples is entirely contained within the α-β-γ main clouds.
(A) Average diversity ℋ (see text) of metagenome reads from all 34 samples mapping to Synechococcus along the length of the 16S sequence. Averaging was done over all sites within a 200 bp window starting at the location given by the x-axis. Results for reads recruited using each of the three main species as a reference sequence are shown in different colors. Black dashed line indicates the 400 − 600 bp segment with the highest diversity which was chosen for subsequent analysis. (B) Distribution of nucleotide divergences from the γ allele within the 400 − 600 bp segment of the 16S for four different samples taken from Mushroom Spring. Shaded regions show the range of divergences of OS-A (orange) and OS-B’ (blue) from γ across all 51 windows of 150 bp length, corresponding to the range of expected divergences of reads from α and β.

All Synechococcus 16S rRNA sequences from the metagenome samples belong to one of the three known species clusters.
(A-D) Each subpanel shows the matrix of pairwise divergences between all reads recruited to the 400 − 600 bp segment of the γ 16S rRNA sequence for the four samples from the MS temperature series. Dashed lines highlight the OS-A, OS-B’, and γ reference sequences in this segment for comparison. (E, F) Same as above using a subsample of reads recruited from all 34 metagenome samples divided by spring.

Genetic linkage analysis shows extensive recombination within α and β.
Linkage disequilibrium 

Samples size has small effect on linkage estimates.
Linkage disequilibria curves for the whole α and β populations shown in 1 are compared to a random subsample of β sequences chosen to match the α sample size at each locus (shown in gray).

Linkage decay curves for α and β collapse on to the same curve after rescaling distances by average recombination rate ρ.
Shows the linkage decay as a function of distances between sites rescaled by the recombination rates ρ fitted from Fig. 2A (main text). The solid black line shows the theoretical prediction from the neutral model given by Eq. 5.

The effect of the main cloud diameter c on the rate of linkage decay.
The linkage decay within α (left panel) and β (right panel) is plotted for different main cloud cutoffs c (see main text). Note the convergence of the curves for c ≤ 0.05.

Majority of orthogroups contain hybrid sequences.
Orthogroups were classiffed into four mutually exclusive groups depending on the amount of hybridization observed. The majority of mixed-species orthogroups are mosaic hybrids (see main text) and shown in orange. Orthogroups in which the local sequence clusters matched the whole genome ones were labeled as “no hybrids and shown in blue, those in which a single cell contained a sequence from a different cluster were labeled “singleton hybrids” and shown in green, and those in which multiple cells contained sequences from different clusters were labeled “non-singleton hybrids” and shown in red.

Typical blocks are short and contain a handful of SNPs.
(A) The distribution of lengths of SNP blocks within α (orange) and β (blue) genomes, detected using the algorithm described in this section. (B) The distribution of the number of SNPs within the same blocks as in (A).

Number of SNP blocks within α and β containing haplotypes that mapped to different donor species in core genome.

Examples of blocks and hybrid sequences within ribosomal genes.
Aligments are represented as described previously with blocks highlighted in purple. (A) Alignment of 16S sequences with a β cell containing a hybrid segment starting at around position 340 and ending around position 580 in the alignment. Note the shorter block shared by the first two β sequences at the endo of the alignment (around position 1070). (B) Alignment of the SSU ribosomal protein rpsN with α cells ordered at the top and β cells at the bottom of the alignment. The first α cell was chosen as the reference sequence and is shown in color. Note the large fraction of α cells around 200-300 bp that are similar to the β sequences and distinct from the α reference. (C) Segment of alignment of β sequences at the LSU ribosomal protein rplM. The reference sequence was chosen to illustrate short SNP block in the center of the alignment.

Block haplotypes have high synonymous and non-synonymous divergences, consistent with them being generated through hybridization.
Panels show histograms of nonsynonymous (blue) and synonymous (orange) divergences between the two haplotypes of SNP blocks from α (A) and β (B), respectively. Solid vertical lines show the mean

Non-hybrid regions of α have similar levels of diversity within and between springs.
The distribution of pairwise synonymous diversity was calculated across non-hybrid genes from the alpha α population for different subgroups of cells. Square symbols show the distributions for pairs of cells from the same spring (purple for Mushroom Spring and cyan for Octopus Spring). The cyano triangles show the distribution for a random subsample of cells from the Octopus Spring of the same size as the Mushroom Spring sample (n = 35) for comparison. The red diamonds show the distribution for pairs of cells from different springs. Non-hybrid genes were determined using a 5% cutoff in the fraction of polymorphic synonymous sites (see text).

ClonalFrameML does not converge to self-consistent set of recombination events in high-coverage β SAGs.
Whole-genome alignments for 12 high-coverage β SAGs were generated as described in the main text. The phylogenetic tree for this subset was inferred using FastTree with default parameters and used as input for ClonalFrameML. The histogram shows the distribution of the length of recombined segments inferred by ClonalFrameML. The solid black line shows the maximum-likelihood exponential distribution used to fit the inferred segments. Note the heavy tail of the inferred distribution which is inconsistent with the exponential fit.

SNP correlation profiles in α agree with mcorr fits, but show significant discrepancies at short distances in β.

Comparison between recombination parameters obtained from our analysis, mcorr Lin and Kussell (2019), and ClonalFrameML Didelot and Wilson (2015).
For each species, the four columns show the estimated ratio of recombination and mutation rates ρ/θ, the average length of recombined segments Lr, the fraction of nucleotide differences between host and recombined segments p, and the ancestral nucleotide diversity θ estimated by the different methods. All quantities were converted to the notation used in the text as follows. The mcorr columns are the best fit values of the parameters ϕs/θs, f, dp, and θs from Lin and Kussell (2019). The ClonalFrameML columns are the maximum likelihood values of the parameters 

Result of tests for associations between SNP block haplotypes and two environmental variables related to the samples.
The rows show the results of the different tests described in the main text. The p-value p for each block was determined using Fisher’s exact test and minimum p-value from the randomized control 


Numbers and types of gene transfers between species are similar across samples.
The number of hybrid gene transfers from each donor found in α (A) and β (B) cells across all eight samples in the data. The labels A, B, and C, denote α, β, and γ, respectively, while X denotes other Synechococcucus clusters.

Species composition is highly variable across samples.

Genome coverage across core genes is well-explained by independent presence-absence model.
(A) shows the presence-absence matrix Pia of low-diversity OGs in the control SAG sample OS2005 (see main text for definitions). Rows and columns were ordered in decreasing order of the sums across the columns and rows, respectively. (B) shows the correlation matrix between OG presences defined as 



Variability in core gene presence is inconsistent with observed hybridization being dominated by artifacts.
(A) compares the presence distributions among the control SAG sample of the test OG group (blue) with the control group (orange), as defined in the main text. (B) compares the presence distributions of the test OG group in the control (blue) and test (orange) SAG samples. The green line shows the fit from 2C. All distributions were centered at zero to control for batch effects (see main text).

Linkage blocks are robust to large variations in species composition across samples.
(A) Bootstrapped distribution of the number of linkage blocks detected in alignments across 4-fold degenerate sites. The distribution was obtained by randomly choosing groups of 48 SAGs from samples containing mixtures of α and β SAGs (MS2004, MS2006, OS2006, OS2007, and OS2009). The arrow shows the number of blocks detected using the 48 SAGs from OS2005, which only contained α SAGs. Comparison of the distributions of block lengths (B) and divergence between the two haplotypes from each block (C) are shown in the other two panel. OS2005 (shown in blue) is compared to a randomly chosen sample from panel A (shown in orange).