Synechococcus population comprises three genomic clusters.

(A) The mean divergence across homologous genes from the two reference genomes is shown for every SAG in the data. Each dot represents a single SAG and is colored according to its assigned species. (B) The relative abundances of all three species across samples from different temperatures within Mushroom Spring. Abundances were estimated based on read recruitment from the metagenome to core genes from the γ SAG, as well as representative SAGs from the α and β populations. Dashed lines show the minimum detection frequencies from the SAG (black) and metagenome (gray) data, respectively (see SM, Sec. III). Shaded gray region shows temperature range of the 8 SAG samples.

Transfers of whole-gene alleles between species is common.

The frequency of hybrid gene alleles along the α (A) and β (B) genomes. Different colors show the donor species, with unknown species labeled X and shown in magenta. The orange lines on top show mosaic loci without distinct gene clusters. (C) The mean synonymous divergence between α and β at homologous loci, ordered according to OS-B’. Smoothed line over a sliding window of five genes with a step size of two genes is shown in black. Maximum and minimum divergences at each locus are shown in red and blue, respectively. (Insets, middle) A zoomed-in region from a typical genomic region (left) and a diversity trough (right), with dots representing individual genes. (Insets, bottom) Alignment segments of individual genes, subsampled to 40 sequences for clarity. The first sequence is arbitrarily chosen as the reference with each nucleotide represented by a different color. Other sites are colored gray if same as reference or using nucleotide colors if mutated. Left colored bars show the species of each genome.

Hybridization between species leads to mixing of short DNA segments within single genes.

(A, top) Alignment segment of α cells from a representative locus with four SNP blocks. The inferred ancestral α haplotype is highlighted in orange and the hybrid haplotype (all β here) is shown in blue. (Middle) Illustration of SNP block analysis pipeline. (Right) The consensus alignment of the two α haplotypes, β, and γ is constructed. (Left) Pairwise divergences between consensus sequences are calculated. Note that α2 is identical to γ here. (B) The joint distribution of divergences of α SNP block haplotypes from the other species for all SNP blocks. Each point represents a haplotype with coordinates showing the divergence from β (x-axis) and γ (y-axis). The orange and blue dots show the two haplotypes from the example in (A). Marginal distributions are shown on the top and right. Note the high density of haplotypes near the axes, indicating transfer from one of the other known species. (C) Same as panel (B) for β block haplotypes. Note the much larger number of haplotypes far from the axes compared to (B), indicating transfers of unknown origin. (D) Histogram of linkage coefficients between pairs of block haplotypes at different loci for α (orange histogram) and β (blue histogram) compared to the expectation under a fully unlinked model (solid orange and blue lines, respectively).

Hybridization between previously isolated lineages is main source of diversity within the population.

(Main panel) An illustration of the evolutionary scenario consistent with the main results of this paper. A single common ancestor (gray) for the population (t0) diverged into several species (four shown here) following spatial isolation (t1). The species colonized the Yellowstone caldera, represented by the yellow rectangle, at different times (t2t3) and began hybridizing. This process continued up to the present (t4). (Top inset) Illustration of the mechanism for generating unlinked SNP blocks through recombination between a hybrid segment (blue) containing a beneficial region (red star) and the rest of the host species (orange). Note the four distinct haplotypes highlighted at the end. (Bottom inset) Zoom in of the ancestry of a particular genome (middle) compared with the donor (orange, top) and host (green, bottom) ancestral genomes. After the initial hybridization, recombination with other host genomes (red arrows) creates a mosaic structure across a range of genomic length scales.

Examples of alignments of orthogroups with mixed-species clusters.

A random sample of 5 orthogroup alignments labeled as mixed-species clusters are shown. All alignments were trimmed to remove segments containing large fraction of gaps. Panels A-D (YSG 1519, YSG 0498, YSG 0215c, and YSG 1364) show varying degrees of hybridization between α and β. Panel E (YSG 0291) is the result of a poor initial alignment of the orthogroup leading to a large number of indels before trimming.

Species clusters at the gene level approximately match those at the genome level.

Figures show gene-by-gene comparison of three cells to the OS-A and OS-B’ reference genomes. The cells shown are representative of the α (A; OctA1J9)and β (B; MRedA02H14 2) clusters and the one γ cell (C; OcA3L13). Each core gene with homologs in both OS-A and OS-B’ was compared to the references and the divergences plotted as a triangle with each side length equal to the divergence between a given pair. The cell sequence is represented by a black dot, with the OS-A and OS-B’ sequences represented by the orange upward and blue downward triangles, respectively. The black circle is a 10% radius which was chosen as a cutoff to classify each gene. The different columns show genes in one of the five possible patterns labeled as follows: XA-B (the cell sequence is less than 10% divergend from OS-A and more than 10% diverged from OS-B’), XAB (all three sequences are within 10% of each other), X-A-B (all three sequences are more than 10% diverged from each other), X-AB (the cell is more than 10% diverged from both OS-A and OS-B’, but both references are less than 10% diverged from each other), and XB-A (the cell is less than 10% diverged from OS-B’ and more than 10% diverged from OS-A).

Proportion of gene triplet patterns shows three distinct clusters.

The proportion of genes in each of the five patterns shown in S2 is show of each cell passing our quality control criteria (see SI, Sec. I.2). Lines represent different cells and are colored according to the species classification based on whole-genome divergences.

Phylogenetic tree inferred from 16S rRNA agrees with whole-genome species classification.

The pairwise divergences between all 16S sequences passing quality control was calculated from the multiple sequence alignment performed using MAFFT (10). The tree was generated using a hierarchical agglomerative clustering algorithm with average linkage (UPGMA) as implemented in SciPy (13)

Best hits to OA3L13 16S rRNA in SILVA SSU Ref database.

γ SAG represents a new distinct species within Yellowstone Synechococcus. Shows the maximum likelyhood tree obtained from an alignment of 16 marker genes present in the γ SAG together with all other thermophilic cyanobacteria from Ref. (22). Colored labels show the two Yellowstone Synechococcus, OS-A (orange) and OS-B’ (blue), and the γ SAG OA3L13 (green).

SAGs excluded during quality control do not include additional

γ genomes. The distribution of divergences from the best BLAST hit to the γ SAG (OcA3L13) across genes for each SAG excluded during the quality control process (see Sec. I.2) is shown. Each line represents the probability density of divergences across genes for one of the 44 SAGs.

Number of reads recruited to different loci from the three main species are highly correlated across samples.

The three panels show the average depth of reads recruited to representative SAGs from α (top), β (middle), and (bottom), respectively. Each panel shows 100 genes with the highest average read depth across samples, with each line representing one gene. Only reads that were < 5% diverged from the reference genomes were included.

Abundance of

γ is significantly above detection threshold in 32 of 34 metagenomic samples and is highly correlated with α. The relative abundances of α (orange circles), β (blue squares), and γ (green diamonds) across all 34 metageomic samples. The detection threshold for the single-cell (ft = 1/48 0.021) and metagenomic samples are shown in black and grey dashed lines, respectively. The asterisks on the x-axis label the two samples (MSe4 and OSM3) with both single-cell and metagenomic data.

Vast majority of Synechococcus core gene sequences are within the

α-β-γ main cloud. (A) Cumulative distribution of the minimum divergence of each read from the reference sequences across all loci used in the left panel. Different colors represent samples taken from different temperatures from Mushroom Spring. Solid lines show minimum divergence from all three references and dashed lines from α and β only. Shaded olive region shows the typical divergences between species shown in the right panel. (B) Histogram of pairwise divergences between three representative SAGs for α, β, and γ at the 100 core loci chosen for illustration. Distributions are across all 150 bp segments from each locus. Shaded olive region the middle 95% percentile for the combined distribution.

Diversity of Synechococcus 16S rRNA sequences from Mushroom Spring temperature series samples is entirely contained within the α-β-γ main cloud. (A) Average diversity H of metagenome reads from all 34 samples mapping to Synechococcus along the length of the 16S sequence. Averaging was done over all sites within a 200 bp window starting at the location given by the x-axis. Results for reads recruited using each of the three main species as a reference sequence are shown in different colors. Black dashed line indicates the 400 600 bp segment with the highest diversity which was chosen for subsequent analysis. (B) Distribution of nucleotide divergences from the γ allele within the 400 600 bp segment of the 16S for four different samples taken from Mushroom Spring. Shaded regions show the range of divergences of OS-A (orange) and OS-B’ (blue) from γ across all 51 windows of 150 bp length, corresponding to the range of expected divergences of reads from α and β.

All Synechococcus 16S rRNA sequences from the metagenome samples belong to one of the three known species clusters.

(A-D) Each subpanel shows the matrix of pairwise divergences between all reads recruited to the 400 600 bp segment of the γ 16S rRNA sequence for the four samples from the MS temperature series. Dashed lines highlight the OS-A, OS-B’, and γ reference sequences in this segment for comparison. (E, F) Same as above using a subsample of reads recruited from all 34 metagenome samples divided by spring.

Genetic linkage analysis shows extensive recombination within

α and β. (A) Linkage disequilibrium σd2 as a function of separation between SNPs x averaged across all core genes among the α (orange circles), β (blue squares), or all of the cells (black diamonds). Genome-wide estimates from SNPs in different genes are shown to the right of the black dashed line using the same symbols. Fully unlinked controls for the genome-wide linkage are shown as horizontal lines of different colors. The main cloud cutoff for both α and β was set to c = 0.05. (B) Scatter plot of 16S and genome-wide divergences between all pairs of cells within the same species. Solid line shows a linear regression from which the Pearson correlation coefficient, R, within each species was calculated.

Samples size has small effect on linkage estimates.

Linkage disequilibria curves for the whole α and β populations shown in S12 are compared to a random subsample of β sequences chosen to match the α sample size at each locus (shown in gray).

Linkage decay curves for

α and β are collapse on to the same curve after rescaling distances by average recombination rate ρ. (A) Shows same data as Fig. S12A for α and β together with the fit usings Eq. (S5). (B) Shows the linkage decay as a function of distances between sites rescaled by the recombination rates ρ fitted from (A). The solid black line shows the theoretical prediction from the neutral model given by Eq. S4.

Hybridized and non-hybridized segments of the α genome have similar rates of linkage decrease. The decay of the linkage disequilibrium σd2 averaged over α loci that have undergone hybridization (orange circles) and loci that have not undergone hybridization (red crosses) is compared. Hybridized and non-hybridized loci were determined based on a 5% cutoff in the fraction of synonymous polymorphic sites within α (see Sec. VII).

The effect of the main cloud diameter c on the rate of linkage decay. The linkage decay within α (left panel) and β (right panel) is plotted for different main cloud cutoffs c (see main text). Note the convergence of the curves for c ≤ 0.05.

Majority of orthogroups contain hybrid sequences.

Orthogroups were classified into four mutually exclusive groups depending on the amount of hybridization observed. The majority of mixed-species orthogroups are mosaic hybrids (see main text) and shown in orange. Orthogroups in which the local sequence clusters matched the whole genome ones were labeled as “no hybrids and shown in blue, those in which a single cell contained a sequence from a different cluster were labeled “singleton hybrids” and shown in green, and those in which multiple cells contained sequences from different clusters were labeled “non-singleton hybrids” and shown in red.

Simple hybrid genes have bimodal distribution of diversity across loci.

Diversity among hybrid sequences at simple hybrid loci. Each circle shows one of the non-singleton loci from Fig. 2AB within the α (left) and β (right) cells. The colors represent the different donor species, as labeled on the x-axis. For transfers between α and β the diversity within the donor group is shown as squares of the same color as a control. Sequences from the same locus are connected by a black line. The horizontal lines show the nucleotide diversity within the low diversity genomic regions of α (left panel, in orange) and the core genome of β (right panel, in blue), respectively.

Typical diversity within diversity troughs is similar to

β and much higher than the non-hybrid regions from α. (Left panel) The distribution of synonymous nucleotide diversity within α across all diversity trough genes is shown in purple. The distribution across a random sample of non-hybridized α genes of the same size is shown in orange for comparison. Non-hybridized α genes were determined using a 5% cutoff in the fraction of polymorphic synonymous sites (see Sec. VII). (Right panel) The distributio of synonymous nucleotide diversity with β core genes across all diversity trough genes is shown in purple. Here, the control distribution shown in blue is from a random sample of core β genes.

Typical blocks are short and contain a handful of SNPs.

(A) The distribution of lengths of SNP blocks within α (orange) and β (blue) genomes, detected using the algorithm described in this section. (B) The distribution of the number of SNPs within the same blocks as in (A).

Number of SNP blocks within α and β containing haplotypes that mapped to different donor species.

Examples of blocks and hybrid sequences within ribosomal genes.

(A) Alignment of 16S sequences with a β cell containing a hybrid segment starting at around position 340 and ending around position 580 in the alignment. (B) Alignment of the SSU ribosomal protein rpsN with α cells ordered at the top and β cells at the bottom of the alignment. The first β cell was chosen as the reference sequence and is shown in color. (C) Segment of alignment of β sequences at the LSU ribosomal protein rplM. The reference sequence was chosen to illustrate short SNP block in the center of the alignment.

Block haplotypes have high synonymous and non-synonymous divergences, consistent with them being generated through hybridization.

Panels show histograms of nonsynonymous (blue) and synonymous (orange) divergences between the two haplotypes of SNP blocks from α (A) and β (B), respectively.

α and β show distinct patterns of within-species diversity. The two panels show the nucleotide diversity π within α (top) and β (bottom) at all core genes ordered according to their respective reference genomes. The solid line shows the value of π estimated from all sequences, with values calculated from a subsample of 10 randomly chosen sequences shown in lighter colors. Both lines were smoothed by averaging over a sliding window of 5 genes with a step size of 2 genes.

Core genome diversity within

α has bimodal distribution. (Left) Distribution of the fraction of polymorphic four-fold degenerate (4D) sites for each gene (blue histogram) is shown together with Poisson fit based on the number of genes without any polymorphic (black line and dots). Inset shows zoomed in region of low-diversity genes (< 5% polymorphic 4D sites) together with the fit for clarity. (Right) Shows histogram of synonymous nucleotide diversity πSfor both low-diversity (blue) and high-diversity (orange) genes. Genes without polymorphic 4D sites were arbitrarily assigned a value πS= 105 for illustration purposes.

Core genome diversity within

β is relatively homogeneous. (Left) Shows distribution of the fractions of polymorphic 4D sites for each gene (blue histogram) together with a Poisson fit based on the mean across all genes. (Right) Shows histogram of synonymous nucleotide diversity πS. Genes without polymorphic 4D sites were arbitrarily assigned a value πS = 105 for illustration purposes.

Non-hybrid regions of

α have similar levels of diversity within and between springs. The distribution of pairwise synonymous diversity was calculated across non-hybrid genes from the alpha α population for different subgroups of cells. Square symbols show the distributions for pairs of cells from the same spring (purple for Mushroom Spring and cyan for Octopus Spring). The cyano triangles show the distribution for a random subsample of cells from the Octopus Spring of the same size as the Mushroom Spring sample (n = 35) for comparison. The red diamonds show the distribution for pairs of cells from different springs. Non-hybrid genes were determined using a 5% cutoff in the fraction of polymorphic synonymous sites (see text).

Result of tests for associations between SNP block haplotypes and two environmental variables related to the samples. The rows show the results of the different tests described in the main text. The p-value p for each block was determined using Fisher’s exact test and minimum p-value from the randomized control pminrand was used to correct for false-discovery rates (see main text for details). The first four columns show total number of blocks in each category. The last column shows the largest value of the pminrand /pmax (most significant association) for each test.

Numbers and types of gene transfers between species are similar across samples.

The number of hybrid gene transfers from each donor found in α (A) and β (B) cells across all eight samples in the data. The labels A, B, and C, denote α, β, and γ, respectively, while X denotes other Synechococcucus clusters.

Species composition is highly variable across samples.

Genome coverage across core genes is well-explained by independent presence-absence model.

(A) shows the presence-absence matrix Pia of low-diversity OGs in the control SAG sample OS2005 (see main text for definitions). Rows and columns were ordered in decreasing order of the sums across the columns and rows, respec-tively. (B) shows the correlation matrix between OG presences defined as Cab = N 1 LN ((Pia − Pi)(Pib − Pi)), where Pi = N 1 LL Pia. Note the use of the mean abundance for each SAG rather than p = (NL)1 L i,a Pia to account for the large variability in coverage between SAGs. (C) shows the distribution of OG presences Pa = 1 N i=1 Pia. The solid orange line shows the Gaussian fit under the assumption that each gene is a Bernoulli random variable with success probability p. (D) shows the distribution of the entries in the covariance matrix from (B) together with the predictions for the diagonal (orange) and off-diagonal (green) components under the assumption that each gene is independent and present in the final assembly with probability p. Note the value of p was obtained from panel (C) and there are no free parameters for the predicted curves.

Variability in core gene presence is inconsistent with observed hybridization being dominated by artifacts.

(A) compares the presence distributions among the control SAG sample of the test OG group (blue) with the control group (orange), as defined in the main text. (B) compares the presence distributions of the test OG group in the control (blue) and test (orange) SAG samples. The green line shows the fit from S29C. All distributions were centered at zero to control for batch effects (see main text).

Linkage blocks are robust to large variations in species composition across samples. (A)

Boot-strapped distribution of the number of linkage blocks detected in alignments across 4-fold degenerate sites. The distribution was obtained by randomly choosing groups of 48 SAGs from samples containing mixtures of α and β SAGs (MS2004, MS2006, OS2006, OS2007, and OS2009). The arrow shows the number of blocks detected using the 48 SAGs from OS2005, which only contained α SAGs. Comparison of the distributions of block lengths (B) and divergence between the two haplotypes from each block (C) are shown in the other two panel. OS2005 (shown in blue) is compared to a randomly chosen sample from panel A (shown in orange).