Whole genome phylogenies reflect the distributions of recombination rates for many bacterial species
Abstract
Although recombination is accepted to be common in bacteria, for many species robust phylogenies with wellresolved branches can be reconstructed from whole genome alignments of strains, and these are generally interpreted to reflect clonal relationships. Using new methods based on the statistics of singlenucleotide polymorphism (SNP) splits, we show that this interpretation is incorrect. For many species, each locus has recombined many times along its line of descent, and instead of many loci supporting a common phylogeny, the phylogeny changes many thousands of times along the genome alignment. Analysis of the patterns of allele sharing among strains shows that bacterial populations cannot be approximated as either clonal or freely recombining but are structured such that recombination rates between lineages vary over several orders of magnitude, with a unique pattern of rates for each lineage. Thus, rather than reflecting clonal ancestry, whole genome phylogenies reflect distributions of recombination rates.
Introduction
The only illustration that appears in Darwin’s Origin of Species (Darwin, 1859) is of a phylogenetic tree. Indeed, the tree has become the archetypical concept representing biological evolution. Since every biological cell that has ever lived was the result of a cell division, all cells are connected through cell divisions in a giant tree that stretches all the way back to the earliest cells that existed on earth. Thus, the study of biological evolution in some sense corresponds to the study of the structure of this giant celldivision tree.
It is therefore natural that the first step in the analysis of a set of related biological sequences is to reconstruct the phylogenetic tree that reflects the cell division history of the sequences, that is their ‘clonal phylogeny’. Once the ancestral relationships between the sequences are known, the evolution of the sequences can then be modeled along the branches of this tree. Indeed, virtually all models of evolutionary dynamics are formulated as occurring along the branches of a tree, and many mathematical and computational methods have been developed for their inference, see for example (Felsenstein, 1981; Page and Holmes, 1998). This strategy has been employed from the earliest days of sequence analysis (Zuckerkandl and Pauling, 1965) and is almost invariably applied in the analysis of microbial genome sequences, which is the main topic of this work.
A second key concept in models of evolutionary dynamics is the idea of a ‘population’ of organisms that are mutually competing for resources and that, for purposes of mathematical modeling, can be considered exchangeable in the sense that they are subjected to the same environment. Indeed, populations of exchangeable individuals form the basis of almost all mathematical population genetics models (see e.g. [Hartl and Clark, 2006]), including coalescent models for phylogenies (Wakeley, 2008). Although it is of course well recognized that, in the real world, populations are structured into subpopulations with varying degrees of interaction between them, population genetics models of evolutionary dynamics almost by definition assume that at some level there are subpopulations of exchangeable individuals sharing a common environment.
In this paper, we present evidence that we believe challenges the usefulness of applying these two concepts for describing genome evolution in prokaryotes. First, we find that for most bacterial species recombination is so frequent that, within an alignment of strains, each genomic locus has been overwritten by recombination many times and the phylogeny typically changes tens of thousands of times along the genome. Moreover, for most pairs of strains, none of the loci in their pairwise alignment derives from their ancestor in the clonal phylogeny, and the vast majority of genomic differences result from recombination events, even for very close pairs. Consequently, apart from a few groups of very closely related strains, clonal ancestry cannot be reconstructed from the genome sequences using currently available methods and, more generally, the strategy of modeling microbial genome evolution as occurring along the branches of a single phylogeny breaks down.
Second, we show that recombination among bacterial strains is not random but subject to strong and complex population structure, that is recombination occurs at very different rates between different lineages. Although almost every short segment in the whole genome alignment of a set of strains follows a different phylogeny, these phylogenies are not uniformly randomly sampled from all possible phylogenies, but sampled from highly biased distributions that reflect this population structure. In particular, while a large diversity of phylogenetic trees occurs along the whole genome alignment, some subsets of strains share alleles much more frequently than others. In particular, the frequencies with which different subsets of strains share alleles follow approximately powerlaw distributions. In addition, almost every individual strain has a distinct distribution of frequencies with which it shares alleles with other strains. This suggests that the rates at which the genetic ancestors of each strain have recombined with the genetic ancestors of other strains are unique for each strain, so that the assumption that at some level the strains can be considered as exchangeable members of a population may also fundamentally break down.
The structure of the paper is as follows. To present our analyses, we will focus on a collection of 91 wild Escherichia coli strains that were isolated over a short period from a common habitat (Ishii et al., 2007). Using these strains, we introduce the main puzzle of bacterial whole genome phylogeny: although the phylogenies of individual genomic loci are all distinct, the phylogeny inferred from any large collection of genomic loci converges to a common structure, for example (Wolf et al., 2002; Lerat et al., 2003; Touchon et al., 2009). This convergence has led many researchers to assume that the phylogeny reconstructed from a whole genome alignment must represent the clonal phylogeny, and that recombination can be detected and quantified by measuring deviations from this whole genome phylogeny, for example (Bobay et al., 2015; Didelot and Wilson, 2015; Yahara et al., 2016; Mostowy et al., 2017; Garud et al., 2019). However, as far as we are aware, there are no rigorous justifications for assuming that this whole genome phylogeny corresponds to the clonal phylogeny or that deviations from this global phylogeny can be used to quantify recombination. Here, we introduce methods for quantifying the role of recombination that do not rely on such assumptions and show that the whole genome phylogeny in fact does not correspond to the clonal phylogeny for many bacterial species, but instead reflects population structure.
We first study recombination by studying pairs of strains, extending a recent approach by Dixit et al., 2015, to model each pairwise alignment as a mixture of clonally inherited and recombined regions. We show that, as the distance to the pair’s clonal ancestor increases, the fraction of the genome covered by recombined segments increases, and at some pairwise distance all clonally inherited DNA disappears. Importantly, this distance is far below the typical divergence of pairs of strains such that for the vast majority of pairs, none of the DNA in their genome alignment stems from their clonal ancestor.
Much of the new analysis methodology that we introduce is based on biallelic singlenucleotide polymorphisms (SNPs; which constitute almost all SNPs in the core alignment). Although biallelic SNPs have been studied to estimate the number of recombinations along alignments of sexually reproducing species (Hudson and Kaplan, 1985), they have received relatively little attention in the study of prokaryotic genomes. We show that almost every biallelic SNP corresponds to a singlenucleotide substitution in the history of the strains at the corresponding genomic position, so that the subset of strains sharing a common nucleotide must form a clade in the phylogeny at that position. We show various ways in which these biallelic SNPs can be used to investigate which SNPs are consistent with given phylogenies, or each other, and use them to quantify the amount of phylogenetic variation along the alignment. Using such analysis, we show that the phylogeny must change every few SNPs along the genome alignment, and derive a lower bound on the ratio of recombination to mutation events in the genome alignment.
To obtain a separate validation of our methods, we not only apply our methods to the real data from the E. coli strains but also to data from simulations of a simple evolutionary dynamics in which genomes of a wellmixed population of fixed size N evolve under reproduction, (neutral) mutation at a fixed rate μ per base per generation, and homologous recombination at a rate ρ per base per generation. By comparing the known ground truth for the simulated data with the results of our methods, we confirm the accuracy of our methods on the simulated data. We also use results of our methods on simulated data to clarify the precise meaning of several statistics that we calculate.
Applying the methods and statistics that we developed for E. coli to a set of other bacterial species, that is Bacillus subtilis, Helicobacter pylori, Mycobacterium tuberculosis, Salmonella enterica, and Staphylococcus aureus, we show that, with the exception of M. tuberculosis where all strains are very closely related and most if not all DNA has been clonally inherited, all other species follow the same general behavior as E. coli.
To explain how a robust core genome phylogeny can emerge in spite of the fact that a very large number of different phylogenies occurs along the genome, we use data from human genomes as an illustration. We show that phylogenies reconstructed from large numbers of loci from human genomes also converge to a robust phylogeny. Moreover, this phylogeny reflects the known human population structure, and we propose that bacterial whole genome phylogenies similarly reflect population structure, that is the rates with which different lineages have recombined. To support this interpretation, we show how biallelic SNPs can also be used to quantify the relative frequencies with which different subsets of strains share alleles. We find that these frequencies follow approximately scalefree distributions, indicating that there is population structure at every scale. Finally, we define entropy profiles of the phylogenetic variability of each strain and show that these entropy profiles provide a unique phylogenetic fingerprint of almost every strain. In addition, we show that simple evolutionary models of fully mixed populations cannot reproduce the statistics we observe for real genomic data, further supporting that these phylogenetic patterns reflect complex population structure. We conclude that our observations necessitate a new way of thinking about how to model genome evolution in prokaryotes.
Results
To illustrate our methods, we focus on the SC1 collection of wild E. coli isolates that were collected in 2003–2004 near the shore of the St. Louis river in Duluth, Minnesota (Ishii et al., 2006; Ishii et al., 2007). We sequenced 91 strains from this collection together with the K12 MG1655 lab strain as a reference. In a companion paper (Field, 2020, in preparation), we discuss this collection in more detail and extensively analyze the evolution of gene content and phenotypes of this collection. Here, we focus on sequence evolution in the core genome of these strains, that is the genomic regions that are shared by all strains. Although the SC1 strains were collected from a common habitat over a short period of time, they show a remarkable diversity, with no two identical strains, all known major phylogroups of E. coli represented, as well as an ‘out group’ of 9 strains that are more than 6% diverged at the nucleotide level from other E. coli strains (see Figure 1—figure supplement 1 for a phylogenetic tree constructed using maximum likelihood on the joint core genome of the SC1 strains and 189 reference strains [Field, 2020, in preparation]).
Phylogenies of individual loci disagree with the phylogeny of the core genome
To construct a core genome alignment of the SC1 strains and K12 MG1655, that is the genomic regions that occur in all strains, we used the Realphy software (Bertels et al., 2014; see Materials and methods), resulting in a multiple alignment across all 92 strains of ${2}^{\prime}{756}^{\prime}541$ base pairs long, with ${299}^{\prime}077$ (10.8%) of the positions exhibiting polymorphism. Note that the core genome length corresponds to about 56% of the median genome length of the strains. Although all strains are unique when their full genomes are considered, there are some groups of strains that are so close that they are identical in their core genomes, so that there are only 82 unique core genomes in total. Realphy used PhyML (Guindon et al., 2010) to reconstruct a phylogeny from the core genome alignment and we will refer to this tree as the core tree from here on (Figure 1).
We first checked to what extent the alignments of individual genomic loci are statistically consistent with the core tree. For each 3 kb block of the core alignment, we used PhyML to reconstruct a phylogeny and then compared its loglikelihood with the loglikelihood that can be obtained when the phylogeny is constrained to have the topology of the core tree. We find that essentially all 3 kb alignment blocks reject the core tree topology (Figure 1—figure supplement 2, left panel). Moreover, each alignment block rejects the topologies that were constructed from all other alignment blocks (Figure 1—figure supplement 2, right panel).
Although it thus appears that the phylogeny at each genomic locus is statistically significantly distinct, it is still possible that all these phylogenies are highly similar. In order to quantify the differences between the core tree and the phylogenies of 3 kb blocks we calculated, for each split in the core tree, the fraction of 3 kb blocks for which the same split occurred in the phylogeny reconstructed from that alignment block. As shown in the top row of Figure 1, the phylogenies of individual blocks differ substantially from the core tree: roughly twothirds of the splits in the core tree occur in less than half of 3 kb block phylogenies and half of the core tree splits occur in less than a quarter of all 3 kb block phylogenies. Particularly, the splits higher up in the core tree do not occur in the large majority of block phylogenies.
These observations are not particularly novel. There is by now a vast and sometimes contentious literature on the role of recombination in prokaryotic genome evolution which is beyond the scope of this article to review. We thus focus on a few key points that are central to the questions and methods we study here. First, systematic studies of complete microbial genomes have shown that horizontal gene transfer is relatively common and can significantly affect phylogenies of individual loci, for example (Guttman and Dykhuizen, 1994; Lawrence and Ochman, 1998), and many studies have observed that different genomic loci support different phylogenies, for example (Shapiro et al., 2012). Such observations caused some researchers to question whether trees can be meaningfully used to describe genome evolution (Doolittle, 1999). Interestingly, it has recently been shown that recombination can play a major role in genome evolution even within relatively shortterm laboratory evolution experiments (Maddamsetti and Lenski, 2018).
In spite of this, many researchers in the field feel that a major phylogenetic backbone can still be extracted from genomic data. For example, it has been observed that, whenever a phylogeny is reconstructed from the alignments of a large number of genomic loci, one obtains the same or highly similar phylogenies, for example (Wolf et al., 2002; Lerat et al., 2003; Touchon et al., 2009). We also observe this behavior for our strains. Phylogenies reconstructed from a random sample of 50% of all 3 kb blocks look highly similar to the core tree, that is with two thirds of the core tree’s splits occurring in all phylogenies (Figure 1, bottom row).
How should we interpret this convergence of phylogenies to the core phylogeny as increasing numbers of genomic loci are included? One interpretation that has been proposed is that once a large number of genomic segments is considered, effects of horizontal transfer are effectively averaged out and the phylogeny that emerges corresponds to the clonal ancestry of the strains, for example (Wolf et al., 2002; Bobay et al., 2015). Indeed, it has become quite common for researchers to detect and quantify recombination using methods that compare local phylogenetic patterns with an overall reference phylogeny constructed from the entire genome, for example (Bobay et al., 2015; Didelot and Wilson, 2015; Yahara et al., 2016; Mostowy et al., 2017; Garud et al., 2019; Croucher et al., 2015). However, the validity of such approaches rests on the assumption that this reference phylogeny really represents the clonal phylogeny, and it is currently unclear whether this is justified.
Indeed, some recent studies have argued that recombination is so common in some bacterial species that it is impossible to meaningfully reconstruct the clonal tree from the genome sequences, and that these species should be considered freely recombining, for example (Rosen et al., 2015). However, if members of the species are freely recombining, one would expect the core tree to take on a starlike structure as opposed to the clear and consistent phylogenetic structure that phylogenies converge to as more genomic regions are included in the analysis. Addressing this puzzle is one of the topics of this work.
Simulations of a simple evolutionary model including drift, mutation, and recombination
To confirm the validity of our methods introduced below, and to provide a reference for comparison of the results we observe on the E. coli data with those from a wellunderstood evolutionary dynamics, we performed simulations of a relatively simple evolutionary model that includes drift, (neutral) mutation, and recombination. As described in the Materials and methods, we simulated populations of constant size N with nonoverlapping generations, a constant mutation rate μ per generation per base, and assuming all mutations are neutral. We also included recombination by inserting genomic fragments of length ${L}_{r}$ from randomly chosen other individuals in the population at a rate ρ per position per generation. In order to make the simulation results comparable to our core genome alignment of E. coli strains, we focused on genome alignments of a sample of $S=50$ individuals from the population and set the mutation rate μ such that the fraction of polymorphic columns in the simulated alignment is similar to that observed in the real data, that is 10%. Apart from simulations without recombination, we performed simulations with a wide range of recombination to mutation rates from $\rho /\mu =0.001$ to $\rho /\mu =10$. We used ${L}_{r}=12kb$ for the length of the recombination segments which is on the lower end of the recombinant segments we observe in E. coli, as we will see below (Figure 2J). In addition, in order to facilitate comparison of statistics across simulations with different recombination rates, we made sure to design these simulations such that a clonal tree of the sample of $S=50$ genomes is drawn once according to the wellknown Kingman coalescent (Wakeley, 2008), and then used the same clonal tree in all simulations. In each simulation, we explicitly tracked how many times each position in the genome was overwritten by recombination along each branch of the clonal phylogeny (see Materials and methods).
As the ratio of the recombination and mutation rate $\rho /\mu $ is increased, progressively more of the clonal history is lost. Figure 1—figure supplement 3 shows, for different values of $\rho /\mu $, the clonal tree with branches colored according to the fraction of positions in the genome that were clonally inherited (that is not overwritten by recombination) along the branch. At a very low rate of recombination of $\rho /\mu =0.001$, the large majority of all positions in each branch are clonally inherited. When the ratio $\rho /\mu $ is raised to 0.01, evolution is still mostly clonal along most of the branches toward the leaves of the tree, but for the few long internal branches near the root that separate the major ’clades’, most positions have already been affected by recombination. Once the ratio $\rho /\mu $ is further raised to 0.1, almost all branches are dominated by recombination except for a few branches near the leaves. For $\rho /\mu =0.3$ or larger, almost every position in every branch has been overwritten by recombination.
Quantifying recombination through analysis of pairs of strains
As a first analysis of the impact of recombination, we follow an approach recently proposed by Dixit et al. based on the pairwise comparison of strains (Dixit et al., 2015). The simplest measure of the distance between a pair of strains is their nucleotide divergence, that is the fraction of mismatching nucleotides between the two strains in the core genome alignment. For pairs of strains with very low divergence, for example D6 and F2 with divergence $4\times {10}^{4}$ (Figure 2A), the effects of recombination are almost directly visible in the pattern of SNP density along the genome. While the SNP density is very low along most of the genome, that is $02$ SNPs per kilobase, there are a few segments, typically tens of kilobases long, where the SNP density is much higher and similar to the typical SNP density between random pairs of E. coli strains, that is $1030$ SNPs per kilobase. These high SNP density regions almost certainly result from horizontal transfer events in which a segment of DNA from another E. coli strain, for example carried by a phage, made it into one of the ancestor cells of this pair, and was incorporated into the genome through homologous recombination. For pairs of increasing divergence, for example the pair C10D7 with divergence 0.002 in Figure 2B, the frequency of these recombined regions increases, until eventually most of the genome is covered by such regions (pair D6H10 in Figure 2C).
For close pairs, the histograms of SNP densities also clearly separate into two components: a majority of clonally inherited regions with up to at most 3 SNPs per kilobase, and a long tail of recombined regions with up to 50 or 60 SNPs per kilobase (Figure 2D–E). As explained in the methods, the distributions of SNP densities are well described by mixtures of a Poisson distribution for the clonally inherited regions plus a negative binomial for the recombined regions (solid line fits in Figure 2D–F). In this way we can estimate, for each pair of strains, the fraction f_{c} of the genome that was clonally inherited, and the fraction f_{r} of SNPs that fall in clonally inherited versus recombined regions. In addition, to estimate the lengths of the recombined fragments we focused on very close pairs for which recombination events are sparse enough so that overlapping events are very unlikely, and used a Hidden Markov model to estimate the distribution of lengths of recombined regions (see Materials and methods). We find that most recombined segments are in the range of $1070$ kilobases long with a mean of 31 kilobases (Figure 2, and see Appendix 2 for additional comments on the accuracy of this estimate).
From this analysis we see that, whenever the pairwise divergence is less than 0.001, the large majority of blocks is clonally inherited, which is indicated as the lightgreen segment in Figure 2G. However, over a narrow range of divergence between 0.001 and 0.01 the fraction of clonally inherited DNA drops dramatically (yellow segment in Figure 2G) and at a divergence of about 0.014 essentially the entire alignment has been overwritten by recombination and all clonally inherited DNA is lost (blue segment in Figure 2G). Notably, 80% of all strain pairs lie in this fully recombined regime (Figure 2I). Thus, for the large majority of pairs of strains, none of the DNA in their alignment derives from their clonal ancestor, making it impossible to estimate the distance to their clonal ancestor from comparing their sequences.
Moreover, as shown in Figure 2H, even for pairs that are so close that most of their genomes are clonally inherited, the large majority of the substitutions derives from the recombined regions. That is, for pairs of strains that are so close that the clonally inherited fraction is around ${f}_{c}=0.9$, the fraction of substitutions deriving from recombination is roughly ${f}_{r}=0.9$ as well. We note that in previous studies the quantitative importance of recombination has often been summarized by a ratio $R/M$ of the rates at which substitutions are introduced by recombination and mutation, for example (Vos and Didelot, 2009). Note that, if we assume a fixed ratio of recombinationtomutation $\rho /\mu $, then $R/M$ equals $\rho /\mu $ times the average number of substitutions that are introduced per recombination event. In our pairwise analysis, the ratio ${f}_{r}/(1{f}_{r})$ corresponds to the ratio of the number of substitutions deriving from recombination and mutation, so that one can obtain an estimate of $R/M$ by calculating ${f}_{r}/(1{f}_{r})$ for very close pairs with ${f}_{c}\approx 1$, that is the dots at the right border in Figure 2H. However, we see that in this limit the fraction f_{r} of SNPs deriving from recombination becomes highly variable, giving a first indication that $R/M$ may vary across different pairs of closely related strains. Indeed, we will see below that recombination rates vary over a wide range across different lineages, so that it is inherently misleading to quantify recombination by a single ratio $R/M$.
To test the accuracy of the pair analysis, we applied the same pairwise analysis to alignments of the $S=50$ sample genomes from each of the simulations. Figure 2—figure supplement 1 shows the results, analogous to those of Figure 2G–I, for the data from simulations with recombination to mutation ratios of $\rho /\mu =0,0.001,0.01,0.1$, and 0.3. For the simulations without recombination, the pairwise analysis correctly infers that all of the genome is clonally inherited for all pairs. For the very low recombination rate $\rho /\mu =0.001$, the model also correctly infers that clonal evolution dominates, that is more than 50% of the genome is clonally inherited for all pairs, and more than 90% of the genome is clonally inherited for about 40% of all pairs. The pairwise analysis of the simulation data with recombination rate $\rho /\mu =0.01$ are also consistent with the ground truth of Figure 1—figure supplement 3, i for half of the pairs there is a substantial fraction of clonally inherited genome, whereas for the other half of more distally related pairs recombination has already affected more than 90% of the genome. Similarly, for $\rho /\mu =0.1$ the pairwise analysis correctly infers that pairs which are not yet fully recombined have become rare, and for $\rho /\mu =0.3$ essentially all pairs have become fully recombined. The pairwise analysis of the simulated data thus paints a correct picture of the ground truth shown in Figure 1—figure supplement 3.
To get a more quantitative assessment of the accuracy of the pairwise analysis, Figure 2—figure supplement 2 directly compares the true fraction of clonally inherited genome for each pair with the estimated fraction from the pairwise analysis for the simulations with $\rho /\mu =0.01$ and $\rho /\mu =0.1$. We see that for all pairs the estimated fraction of the genome that was clonally inherited is close to the true fraction. In fact, it appears that the estimation method tends to slightly overestimate the fraction of clonally inherited genome for almost all pairs. Finally, as for the real data, we also used close pairs from the simulations with $\rho /\mu =0.01$ and $\rho /\mu =0.1$ to estimate the sizes of the recombined segments. As shown in Figure 2—figure supplement 2, the sizes of the recombined regions estimated by the method are very close to the ground truth of 12 kb recombination segments.
In summary, application of the pairwise analysis to the simulation results strongly support that the pairwise analysis accurately quantifies the fraction of the genome that was affected by recombination for each pair, as well as estimate the lengths of the recombined segments.
For later comparison with the data on other species, we summarize our observations from the pairwise analysis by a few key statistics. First, half of the genome is recombined at a critical divergence of 0.0032. Second, at this critical divergence, the fraction of all SNPs that is in recombined regions is 0.95. Third, the fraction of mostly clonal pairs is 0.077, and finally, the fraction of fully recombined pairs is 0.78 (see Materials and methods). All these statistics suggest that pairwise divergences between strains are almost entirely driven by recombination and do not reflect distances to their clonal ancestors. To understand how a consistent phylogenetic structure can still emerge when the full core genomes of all strains are compared, we need to go beyond studying pairs.
SNPs in the core genome alignment correspond to splits in the local phylogeny
Although there may not be a single phylogeny that captures the evolution of our genomes, each single position in the core genome alignment, that is each alignment column, will have evolved according to some phylogenetic tree. A key observation is that our set of strains is sufficiently closely related that there are almost no alignment columns for which more than one substitution occurred in its evolutionary history. In particular, of the almost 2.8 million columns in the core genome alignment, almost 90% show an identical nucleotide for all 92 strains, that is only 10.85% are polymorphic. Moreover, almost all of these SNP columns are biallelic, that is for 93.6% of the SNPs only two nucleotides appear, 6.3% have three nucleotides, and in 0.2% all four nucleotides occur. These statistics strongly suggest that most positions have not undergone any substitutions, and that columns with multiple substitutions are rare. Notably, these statistics are still inflated due to the occurrence of an outgroup of nine strains that is far removed from the other strains (the clade from B5 to B3 visible on the right in Figure 1). We observe that almost 36% of all SNPs correspond to SNPs in which all nine strains of this outgroup have one nucleotide, and all other 83 strains have another nucleotide. If we remove the outgroup from our alignment, the fraction of SNPs in the alignment drops from 10.85% to 6.7%, and the fraction of SNPs that are biallelic increases to 95.5%.
Whenever a biallelic SNP corresponds to a single substitution in the evolutionary history of the position, the SNP pattern provides an important piece of information about the phylogeny at that position in the alignment: whatever this phylogeny is, it must contain a split, that is a branch bipartitioning the set of strains, such that all strains with one letter occur on one side of the split, and all strains with the other letter on the other side (Figure 3).
As illustrated in Figure 3, pairs of SNPs can either be consistent with a common phylogeny, that is columns X and Y or columns Y and Z, or they can be inconsistent with a common phylogeny, that is columns X and Z. The pairwise comparison of SNP columns for consistency with a common phylogeny is known as the fourgamete test and is very commonly used in the literature on sexual species, for example to give a lower bound on the number of recombination events in an alignment (Hudson and Kaplan, 1985). However, so far it has rarely been used for quantifying recombination in bacteria (Lai and Ioerger, 2018 and (Arnold et al., 2018) being the only exceptions we are aware of). In the rest of this paper, we show how analysis of biallelic SNPs (which from now on we will just call SNPs) can be systematically used to quantify not only the overall amount of recombination in alignments, but also the relative rates with which different lineages have recombined.
Since all these analyses assume that biallelic SNPs correspond to single substitutions, it is important to quantify how accurate this assumption is. In particular, some apparent SNPs might correspond to sequencing errors rather than true substitutions and, more importantly, some biallelic SNPs may correspond to multiple substitution events (often called homoplasies). In the Materials and methods, we show that sequencing errors must be so rare that they can be safely neglected. In addition, to estimate the fraction of biallelic SNPs that correspond to homoplasies we analyzed the frequencies of columns with 1, 2, 3, and 4 different nucleotides using a simple substitution model, separately analyzing positions that are under least selection (third positions of fourfold degenerate codons) and positions under most selection (second positions in codons), and either including or excluding the outgroup (see Materials and methods, and Figure 3—figure supplement 1). These analyses indicate that only 2–6% of biallelic SNPs correspond to homoplasies. We confirmed the accuracy of this estimation procedure using our simulation data, for which the true fraction of homoplasies is 2.56% and our simple method estimates 2.43%.
In addition, to put an upper bound on how much our results could be affected by a small fraction of homoplasies, we developed a method that removes, from the core genome alignment, a given fraction of alignment columns that exhibit most inconsistencies with alignment columns in their neighborhood (see Materials and methods), and checked how much the various statistics that we calculate are altered when we remove either 5% or 10% of such potentially homoplasic positions. Finally, we also apply each of our analyses to the data from the simulations with known ground truth, to further validate the accuracy of our methods.
SNP statistics are inconsistent with a single phylogeny
One of the key uses of a phylogeny is in describing the differences between the strains by an evolutionary dynamics that takes place along the branches of the phylogeny. However, for such an approach to apply, it is important that the large majority of evolutionary changes were indeed introduced along the branches of the phylogeny.
That there are significant differences between the SNP statistics as predicted by the core tree, and the actual SNP statistics observed in the data is already evident from the fact that the observed pairwise divergences between strains are systematically less than their divergences along the branches of the core phylogeny ( Figure 4—figure supplement 1, left panel). In addition, for many branches of the core tree, the observed number of SNPs corresponding to that branch is up to 100fold lower than number of SNPs that are predicted to occur on that branch (Figure 4—figure supplement 1, right panel).
To test to what extent the observed evolutionary changes occurred along the branches of the core tree, we calculated the fraction of observed SNP splits that fall on the branches of the core phylogeny. Overall, 58% of the SNPs that are shared by at least two strains correspond to a branch of the core tree, whereas 42% clash with it (SNPs that occur in only a single strain are consistent with any phylogeny). However, this relatively high fraction results almost entirely from SNPs on the single branch connecting the outgroup to the other strains, which is responsible for almost 36% of all SNPs. When the outgroup is removed, only 27.4% of all SNPs are consistent with the core tree. Since the core tree was constructed using a maximum likelihood approach that assumes the entire alignment follows one common tree, we investigated to what extent the number of tree supporting SNPs can be improved by specifically constructing a tree to maximize the number of supporting SNPs (see Materials and methods). However, this only marginally improves the number of supporting SNPs by 0.1%.
Since the overall fraction of SNPs that correspond to branches of the core phylogeny is dominated by a few very common SNP patterns, it is more informative to assess to what extent each individual branch of the core tree is consistent with the observed SNPs. We thus calculated, for each branch in the core tree, the number of supporting SNPs S that match the split, and the number of clashing SNPs C that are inconsistent with the split, to calculate the fraction $f=S/(S+C)$ of SNPs supporting the branch. Figure 4 shows that, for twothirds of the branches, there are more clashing than supporting SNPs. Moreover, for half of the branches in the core tree, the fraction of supporting SNPs is less than 5%, that is there are at least 20fold more clashing than supporting SNP columns. To check to what extent homoplasies may affect these statistics, we calculated the same statistics on core genome alignments from which either 5% or 10% of potentially homoplasic sites were removed, and observed that the distribution of SNP support is almost unchanged (Figure 4—figure supplement 2).
To further elucidate the meaning of these SNP support distributions, we calculated the same statistics for the simulations with different recombination rates, including on alignments where 5% or 10% of potentially homoplasic columns were removed (Figure 4—figure supplement 3). In general the removal of 5% or even 10% of potentially homoplasic sites has only a minor effect on the distribution of SNP support except for the simulations without recombination. When there is no recombination all inconsistencies are due to homoplasies, and we indeed see that all branches are fully supported after 5% of potential homoplasic sites have been removed. For a very small recombination rate of $\rho /\mu =0.001$, most branches in the core tree still have strong SNP support but already at a recombination rate of $\rho /\mu =0.01$ more than 80% of the branches are supported by less than half of the informative SNPs, and half of the branches have less than 20% support. It is instructive to compare this result with the fractions of the genome that are clonally inherited along each branch, that is the bottom left panel in Figure 1—figure supplement 3, and the pair statistics (middle row of Figure 2—figure supplement 1) for this simulation. We see that, although there is some recombination in most branches, most of the genome is clonally inherited for all but the long inner branches. However, even though most of the genome is clonally inherited along most branches, for most pairs the large majority of the SNPs derive from recombination (middle panel in Figure 2—figure supplement 1). This means that, at $\rho /\mu =0.01$, we are in a parameter regime where most of the genome is still clonally inherited, and the clonal tree can thus also be reconstructed from the data, but the large majority of the differences between strains already derive from recombination events. This shows that, even if recombination is rare enough that the clonal tree can be successfully reconstructed from the genome sequences, it might already be incorrect to assume most genomic changes were introduced along the branches of this clonal phylogeny.
For recombination rates of $\rho /\mu =0.1$ or larger, recombination dominates completely in that there are essentially no branches with significant SNP support. The distribution of support for E. coli does not look like any of the distributions from the simulations, but could be described as a hybrid of one third of branches with strong clonal support and two thirds of branches dominated by recombination. Besides the branch to the outgroup, all supported branches lead to groups of highly similar strains near the bottom of the tree. We thus wondered if it would be possible to construct well supported subtrees for clades of closely related strains near the bottom of the tree. We devised a method that builds subtrees bottomup by iteratively fusing clades so as to minimize the number of clashing SNPs at each step (see Materials and methods and Figure 4—figure supplement 4, left panel). As shown in Figure 4—figure supplement 4, while the fraction of clashing SNPs is initially low, it rises quickly as soon as the average divergence within the reconstructed subtrees exceeds ${10}^{4}$, which is more than 100fold below the typical pairwise distance between E. coli strains. Thus, while some groups of very closely related strains that have a recent common ancestor can be unambiguously identified, only a minute fraction of the overall sequence divergence falls within these groups, and the bulk of the sequence variation between the strains is not consistent with a single phylogeny.
To assess whether any of these statistics could be skewed due to a few strains with aberrant behavior, we also investigated whether SNPs are more consistent with a dominant phylogeny when we do not consider all strains, but only subsets of the strains. We focused on the smallest subsets of strains that have meaningfully different phylogenetic tree topologies. For a quartet of strains $(I,J,M,N)$, there are three possibly binary trees, that is with $(I,J)$ and $(M,N)$ nearest neighbors, with $(I,M)$ and $(J,N)$, or with $(I,N)$ and $(J,M)$ (see Figure 4—figure supplement 5). We selected quartets of roughly equidistant strains and checked, for each quartet, whether the SNPs clearly supported one of the tree possible topologies. However, we find that alternative topologies are always supported by a substantial fraction of the SNPs, and that for most quartets the most supported topology is supported by less than half of the SNPs (Figure 4—figure supplement 5).
In summary, consistent with the picture that emerged from our analysis of pairs of strains, most of the differences between the E. coli strains did not occur along the branches of a single phylogeny. This suggests that, rather than describing the relationships between the strains by a single phylogeny, we should think of multiple different phylogenies occurring along the genome alignment.
The phylogeny changes every few SNPs along the core alignment
So far, we have analyzed SNP consistency without regard to their relative positions in the alignment. We now analyze to what extent mutually consistent SNPs are clustered along the alignment. In particular, we calculate the lengths of segments along the alignment that are consistent with a single phylogeny.
We first assessed the lengthscale over which phylogenies are correlated by calculating a standard linkage disequilibrium (LD) measure as a function of distance along the alignment (Figure 5A and Materials and methods). LD drops quickly over the first 100 base pairs and becomes approximately constant at distances beyond $200300$ base pairs, indicating that segments of correlated phylogenies are much shorter than the typical length of a gene. Very short linkage profiles were recently also observed in thermophilic Cyanobacteria isolated in Yellowstone National Park (Rosen et al., 2015). Instead of using correlation between SNPs at different distances, one can also calculate the probability for a pair of SNPs to be consistent with a common phylogeny, as a function of their genome distance (Arnold et al., 2018). As shown in in Figure 5—figure supplement 1, like LD, pairwise compatibility of SNPs also drops quickly over the first 100 base pairs. Note, however, that even at large distances the pairwise compatibility of SNPs is close to 90%. The reason for this is that most SNPs are shared by only a small subset of strains, and as long as two SNPs are shared by nonoverlapping subsets of strains, they will be compatible with a tree. In order to more efficiently detect recombination using SNP compatibility, we need to check for the mutual consistency of all SNPs within a given segment of the alignment.
Starting from each SNP s, we determined the number of consecutive SNPs n that are all mutually consistent with a common phylogeny. As shown in Figure 5B, the distribution of the lengths of treecompatible stretches has a mode at $n=4$, and stretches are very rarely longer than $n=20$ consecutive SNPs. In terms of number of base pairs along the genome, treecompatible segments are typically just a few tens of base pairs long, and very rarely more than 300 base pairs (Figure 5C). Thus, stretches of treecompatible segments are very short. For comparison, we also calculated the distribution of treecompatible segment lengths in an alignment where the positions of all columns have been completely randomized and observe that these are still a bit shorter (blue distributions in Figure 5B and C). Thus, while there is some evidence that neighboring SNPs are more likely to be compatible than random pairs of SNPs, this compatibility is lost very quickly, typically within a handful of SNPs.
To elucidate how the recombinationtomutation ratio determines the lengths of treecompatible stretches for the simulations, we calculated the same distribution for the data from the simulations with different recombination rates. In addition, to assess the affect of homoplasies on the distribution of the lengths of treecompatible stretches, we also calculated these distributions for alignments from which 5% or 10% of potentially homoplasic sites were removed (Figure 5—figure supplement 2). In addition, Appendix 1—table 1 shows the average length of treecompatible segments for each of these datasets. We see that removal of homoplasies only has a significant effect on the distribution of treecompatible stretches for $\rho /\mu \le 0.01$. That is, homoplasies only significantly affect the distribution of treecompatible stretches when $\rho /\mu $ is less than the rate of homoplasies (which is 2.5%). For the E. coli data, removal of homoplasies has only a minor effect, that is the average number of consecutive treecompatible SNPs increases from 7.6 to 8.8 when 5% potentially homoplasic sites are removed. Although there is of course no reason to assume that the simple evolutionary dynamics of any of our simulations realistically describes the genome evolution of the E. coli strains, we note that the distribution of treecompatible stretches for E. coli is between what is observed for simulations with $\rho /\mu =0.3$ and $\rho /\mu =1$. Notably, as is clear from Figure 1—figure supplement 3, Figure 2—figure supplement 1, and Figure 4—figure supplement 3, when $\rho /\mu \ge 0.3$ the evolution along every branch of the phylogeny is almost completely dominated by recombination.
In summary, our analysis shows that, as one moves along the core genome alignment, the phylogeny changes typically every 5–10 SNP columns, that is every 50–100 nucleotides. We next use this to put a lower bound on the ratio of the number of phylogeny changes to mutations in the alignment.
A lower bound on the ratio of phylogeny changes to substitution events
Every time inconsistent SNP columns are encountered as one moves along the core genome alignment, the local phylogeny must change. For example, somewhere between columns X and Z in Figure 3 the phylogeny must change. This in turn implies that the start (or end) of at least one recombination event must occur between columns X and Z. By going along the core genome, and determining the minimum number of times the phylogeny must change, one can thus derive a lower bound on the total number of recombination events and, in the study of sexually reproducing species, this has been a standard method to put a lower bound on the number of recombination events within a genome alignment (Hudson and Kaplan, 1985) (see Materials and methods). Using this we find that the phylogeny must change at least $C={43}^{\prime}575$ times along the core phylogeny. However, this neglects that some of the inconsistencies may result from homoplasies. To correct for this, we remove 5% of potential homoplasic positions by removing 5% of the SNP columns that are most inconsistent with neighboring columns (Materials and methods) and find that the phylogeny must still change $C={34}^{\prime}030$ times along this 5% homoplasycorrected alignment. Because homoplasies are relatively rare, the number of biallelic SNPs in the alignment is a good estimate for the total number of mutations in the alignment and the ratio $C/M$ thus provides a lower bound for the ratio between the total number of phylogeny changes and substitutions that occur along genome alignment. For the 5% homoplasycorrected alignment we obtain a ratio $C/M=0.129$.
Apart from the full alignment, we can calculate the lower bound on the ratio of phylogeny changes to substitution events $C/M$ for any subset of strains. Figure 6 shows the ratio $C/M$ for random subsets of our 92 strains as a function of the number of strains in the subset, using again the 5% homoplasycorrected alignment. For comparison, Figure 6—figure supplement 1 also shows the same results for the full alignment, which shows ratios $C/M$ that are about 20% higher.
We see that, for small subsets of strains, the ratio $C/M$ shows substantial fluctuations. For example, for subsets of $n=10$ strains, the ratio $C/M$ ranges from 0.026 to 0.112, with a median of 0.075. However, as the number of strains in the subset increases, the ratio converges to the value $C/M=0.129$ and for large subsets of strains there is little variation in the ratio $C/M$. Thus, for alignments of large sets of strains, the phylogeny must change at least every $78$ SNPs.
$C/M$ within phylogroups
Apart from random subsets of strains, we also calculated C, M and $C/M$ for subsets of strains from the same phylogroup (Appendix 1—table 2, and see Figure 1—figure supplement 1 for the phylogroup annotation). The ratios $C/M$ that are observed for the phylogroups increase with the overall divergence within the phylogroup. For phylogroups that are more than 1% diverged (B1, B2, and D), the ratio $C/M$ is close to that of the full alignment, and there are at least thousands of phylogeny changes along their subalignments. The two phylogroups with lower divergence, that is A with divergence 0.0024 and the outgroup with divergence 0.003 have lower ratios $C/M$. The ratio is particularly low for the outgroup O for which only two phylogeny changes are detected. While this may suggest that recombination rates are particularly low within lineages of the outgroup, it should be noted that such low values of $C/M$, while rare, were also observed for some random subsets of strains of the same size (Figure 6).
$C/M$ for the simulation data
Note that, because each recombination event introduces at most two phylogeny changes (one at the start of the recombined segment and one at its end), $C/2$ is a lower bound on the number of recombination events that occurred in the evolutionary history of an alignment. However, this lower bound may significantly underestimate the true number of recombination events. To obtain more insight into the relationship between this lower bound and recombination rates, we compared the ratios $C/M$ of each simulated dataset (using the 5% homoplasycorrected alignment) with the ratio $\rho /\mu $ of recombination and mutation rate used in the simulation (Figure 6—figure supplement 2). The results show that when recombination rates are very low, that is $\rho /\mu \le 0.01$, the ratio $C/(2M)$ is almost exactly equal to $\rho /\mu $. In this regime recombination events are so sparse on the alignment that many SNP columns occur between every two consecutive phylogeny changes, and this causes almost every phylogeny change to introduce an inconsistency. Since each recombination event introduces two breaks, $C/M$ equals twice the number of recombinations per mutation $\rho /\mu $. However, as $\rho /\mu $ increases, fewer SNP columns occur between consecutive phylogeny changes, and more and more of the phylogeny changes go undetected because they do not introduce inconsistencies between the SNPs. Consequently, the ratio $C/M$ becomes systematically lower than $\rho /\mu $ and the difference can become very large. For example, at $\rho /\mu =1$, the observed ratio $C/M$ is almost tenfold lower than $\rho /\mu $. Note also that since the lower bound $C/M$ cannot exceed 1 (i.e. a phylogeny change at every SNP), the ratio $\rho /\mu $ can exceed $C/M$ by arbitrarily large factors at high $\rho /\mu $.
Each genomic position has likely been overwritten hundreds of times by recombination
We can also provide a rough estimate for the average number of times T that a randomly chosen position in the core genome alignment has been overwritten by recombination in its history, that is since the genetic ancestors of the position in the alignment diverged from a common ancestor.
If L is the total genome length, ${L}_{r}$ the average length of recombination segments, and $C/2$ the lower bound on the number of recombination events in the alignment, then a lower bound on the average number of times positions in the genome have been overwritten by recombination is $T={L}_{r}C/(2L)$. For the E. coli data with $C={34}^{\prime}030$ phylogeny changes for the 5% homoplasycorrected alignment of length $L={2}^{\prime}{756}^{\prime}541$, and the value of ${L}_{r}\approx {31}^{\prime}000$ base pairs that we estimated previously (Figure 2J), we obtain $T\approx 190$. That is, on average each position in the genome has been overwritten at least 190 times by recombination. We note that, since this lower bound for T is proportional to the estimated average length of recombination segments ${L}_{r}$, it of course depends on this estimate as discussed in Appendix 2.
In the simulations, we specifically tracked the number of times each position in the alignment was overwritten by recombination along each branch of the clonal phylogeny, and we thus know precisely how many times each position of the alignment was overwritten by recombination in its evolutionary history. Figure 6—figure supplement 3 shows the distribution of the number of times each position in the alignment was overwritten for the simulations with $\rho /\mu $ ranging from 0.001 to 10. We see that, in line with all previous analyses, positions that are clonally inherited along the whole tree only occur for the lowest recombination rate of $\rho /\mu =0.001$, at $\rho /\mu =0.01$ each position has been overwritten $525$ times, and at $\rho /\mu =0.1$ each position has already been overwritten $120200$ times. For each $\rho /\mu $ we calculated the average number of times ${T}_{\mathrm{true}}$ that positions were overwritten by recombination and compared this with the estimated number of times using the simple estimate ${T}_{\mathrm{est}}={L}_{r}C/(2L)$. Figure 6—figure supplement 4 shows that, while the estimate is in fact quite accurate at very low mutation rate $\rho /\mu =0.001$, as the recombination rate increases the estimate severely underestimates the true number of times positions in the alignment were overwritten by recombination, for example at $\rho /\mu =1$ the true number ${T}_{\mathrm{true}}=1561$ is more than twenty times as large as the estimated number ${T}_{\mathrm{est}}=67$. Thus, for the E. coli data it is certainly not implausible that each position in the alignment was in fact overwritten by recombination more than 1000 times. As discussed in Appendix 2, a rough estimate of T can also be derived from an analysis of the pair statistics of Figure 2, and this estimate suggests that T lies somewhere between $T=200$ and $T=500$, which is consistent with our lower bound based on C.
The observed ratio $C/M$ for the E. coli alignments is very similar to the value of $C/M$ observed in the simulations with $\rho /\mu =1$ (Figure 6—figure supplement 2). Although it is tempting to conclude from this that the ratio of recombination to mutation rate must be close to one for the E. coli strains, such a conclusion would be unwarranted. The evolutionary dynamics of the simulations makes several strong simplifying assumptions, that is that the clonal phylogeny is drawn from the Kingman coalescent process, and that the population is completely mixed so that all strains are equally likely to recombine. Both these assumptions may not apply to the evolution of E. coli in the wild. Indeed, we will see below that there is strong evidence that relative recombination rates vary highly across lineages so that instead of a single recombination rate, there is a wide distribution of recombination rates between different lineages. Therefore, it could be misleading to describe E. coli’s evolution by a single recombination rate ρ and we instead focus on providing a lower bound $C/M$ on the number of phylogeny changes per SNP column, which is a meaningful quantity independent of the precise evolutionary dynamics that caused the substitutions and phylogeny changes, and can be calculated independently for any subset of strains.
Other bacterial species show similar patterns of recombinationdominated genome evolution
To investigate to what extent the observations we made for E. coli generalize to other species of bacteria, we selected five additional species from different bacterial groups for which sufficiently many complete genome sequences of strains were available, and used Realphy to obtain a core genome alignment of the strains for each species (see Appendix 1—table 3 for a list of the species, the number of strains, and other core genome statistics for each species). We then performed most of the analyses that we presented above for E. coli on each of these core alignments. Figure 7 presents a summary of the results that we observe across the species.
Figure 7A shows the cumulative distributions of pairwise divergences between strains for all species. We see that, while among our E. coli strains that were sampled from a common habitat there is a small percentage of very close pairs with divergence below ${10}^{6}$, for the strains of the other species the closest pairs are at divergence ${10}^{5}$. With the exception of M. tuberculosis, where the median pair divergence is around ${10}^{4}$, the median pairwise divergence in all other species is around ${10}^{2}$ or larger. The vertical lines in Figure 7A indicate the critical divergences, for each species, where half of the alignment is recombined. With the exception of M. tuberculosis, where all pairs are mostly clonal, the critical divergences lie in a fairly narrow range of 0.003–0.01. Figure 7B shows the cumulative distributions, across pairs of strains, of the fraction of the alignment that is clonally inherited, that is as for Figure 2I for E. coli. Note that, for all species except M. tuberculosis, the large majority of the pairs is fully recombined, ranging from about 15% of pairs with a substantial fraction of clonally inherited DNA for S. aureus, to virtually no pairs with clonally inherited DNA for H. pylori. Thus, we see that for almost all species the situation is similar to what we observed in E. coli: for most pairs the distance to their common ancestor cannot be estimated from their alignment, because the entire alignment has been overwritten by recombination events. Note also that, for all species, there is only a relatively small fraction of pairs that lie in the partially recombined regime (yellow segment in Figure 7B).
For E. coli we found that, even for close pairs for which a substantial fraction of the genome was clonally inherited, most of the SNPs between them still derive from recombination (Figure 2H). We find that, with the exception of M. tuberculosis, this applies to the other species as well. Figure 7C shows, for each species, the fraction of all SNPs that derive from recombination, for pairs of strains that are at the critical divergence where half of the alignment is recombined. Even though this critical divergence occurs for pairs that are relatively close compared to the typical distance between pairs, for all species more than 90% of the SNPs derive from recombination. That is, we also see that for all five species the divergence between close strains is dominated by SNPs that are introduced through recombination.
Another way to quantify to what extent the observed sequence variation is consistent with evolution along the branches of the core genome each branch of the core tree. Figure 7D summarizes the distributions of support of the branches of the core tree as violin plots, that is as shown for E. coli as a cumulative distribution in Figure 4. In E. coli most branches have many more SNPs that reject the split than support it, and even stronger rejection of the branches of the core tree are observed for B. subtilis and H. pylori. For the other species, an almost uniform distribution of branch support is shown, that is for these species there are roughly as many branches that are strongly supported by the SNPs, strongly rejected by the SNPs, or supported and rejected by roughly equally many SNPs.
Figure 7E summarizes, for each species, the distribution of distances between SNPs along the core alignment as boxwhisker plots (green) as well as the distribution of distances between phylogeny breakpoints (blue), that is as shown in Figure 5C for E. coli. The figure shows that, with the exception of M. tuberculosis, the interSNP distances range from a few to a few dozen base pairs, with a median interSNP distance of 4 (H. pylori) to 15 (S. aureus) base pairs. For these five species, the median distances between phylogeny breakpoints range from around 10 (H. pylori) to about 100 base pairs for S. aureus. Note that, for all species, the left tails of the distributions stretch to very short distances between breakpoints, whereas distances between breakpoints of more than 200 bps are very rare for all these five species. Thus, for these species the segments that are consistent with a single phylogeny are always much shorter than the typical length of a gene. In contrast, for M. tuberculosis both the distances between SNPs and the distances between breakpoints are almost two orders of magnitude larger, indicating that these strains are much more closely related than the strains of the other species.
Figure 7F shows boxwhisker plots for the distributions of the number of consecutive SNPs between breakpoints, as was shown for E. coli in Figure 5B. We see that for all species, including M. tuberculosis, there are typically less than a handful of SNPs in a row before a phylogeny breakpoint occurs, and very rarely more than a dozen SNPs. The smallest number of SNPs per breakpoint is observed for H. pylori, that is typically less than 2 SNPs per breakpoint, but the range of SNPs per breakpoint is very similar across all species.
Finally, Appendix 1—table 4 shows the fraction of alignment columns that are SNPs, the lower bound C on the number of phylogeny changes, and the lower bound $C/M$ on the ratio of phylogeny changes to SNP columns, for each of the six species. We see that with the exception of H. pylori, which has a high ratio $C/M\approx 0.3$, and M. tuberculosis which has significantly lower ratio $C/M\approx 0.08$, the other species have ratios $C/M$ similar to that observed for E. coli. Note that the high rate of recombination that we infer for H. pylori is consistent with the consensus in the literature that genome evolution is dominated by recombination in this species, for example (Suerbaum et al., 1998).
In contrast, the consensus in the field of M. tuberculosis evolution is that essentially no recombination occurs in this species. Although there have been reports of evidence for recombination (Liu et al., 2006; Namouchi et al., 2012; Phelan et al., 2016), followup analyses suggested that these may well be a result of problems with genome assemblies and genome alignments (Godfroid et al., 2018). Although our statistics indicate that evolution in M. tuberculosis is predominantly clonal, we do find a significant fraction of SNPs that clash with the whole genome tree. However, because SNPs are relatively rare in M. tuberculosis, that is on average one SNP every 1200 base pairs, SNPs that clash with the core phylogeny occur only once every 4700 base pairs, and it is not hard to imagine that problems with the genome assemblies of just a few strains could cause artefactual SNPs to occur at that rate. Indeed, we found that 5 of the M. tuberculosis genomes in our dataset have recently been retracted from the database because of evidence of contamination. Notably, we found that these strains were responsible for the large majority of the clashing SNPs. That is, after removal of these five strains, clashing SNPs now only occur once every 12'700 base pairs on average. Since it is not implausible that there may be problems with the genome assemblies of one or more of the remaining strains, this suggests that these remaining clashing SNPs may well be artefacts as well. In conclusion, although we cannot definitively conclude there is no recombination at all in M. tuberculosis, our analysis confirms that it must be rare.
In summary, with the exception of M. tuberculosis, all other species show the same pattern as E. coli with genome evolution being dominated by recombination. For most pairs, no DNA in their alignment was clonally inherited, even for close pairs most of the SNPs derive from recombination events, the phylogeny changes thousands of times along the core genome alignment, typically within a handful of SNPs, and each position in the alignment has been overwritten by recombination many times in its history.
Phylogenetic structures reflect population structure
All our results so far show that the core tree cannot represent the evolutionary relationships between the strains and that a large number of different phylogenies occurs across the alignment. It may thus seem all the more puzzling that, when trees are constructed from sufficiently many genomic loci, the core tree reliably emerges (Figure 1, bottom).
As we mentioned in the introduction, some researchers interpret this convergence to the core tree to mean that the core tree must correspond to the clonal phylogeny of the strains. The interpretation is that the SNP patterns in the data are a combination ‘clonal SNPs’ that fall on the clonal phylogeny, plus a substantial number of ‘recombined SNPs’ that act so as to introduce noise on the clonal phylogenetic signal. In this interpretation, trees build from individual loci can differ from the core tree because the ‘recombination noise’ can locally drown out the clonal signal, but once sufficiently many loci are considered, this recombination noise ‘averages out’ and the true clonal structure emerges. However, this interpretation rests on two assumptions that do not necessarily hold. First, in order for the effects of recombination to ‘average out’ when many loci are considered, one has to assume that the recombination ‘noise’ has no systematic structure itself. That is, one has to assume that there is no population structure, that is that all lineages are equally likely recombine with each other. Second, for the clonal phylogeny to emerge one has to also assume that there are sufficiently many loci that have not been affected by recombination. However, the results above show that recombination is so common that each locus has been overwritten many times by recombination, that is there are essentially no loci that are unaffected by recombination.
Note that if the phylogenies along the alignment were a combination of clonal and randomly recombined SNPs, then one would expect that removal of all the clonal SNPs would destabilize the core tree. However, if we remove all SNP columns from the core genome alignment that correspond to branches of the core tree, and then reconstruct a phylogeny from this edited alignment, the resulting tree is still highly similar to the core tree (Figure 8—figure supplement 1). That is, we obtain a very similar tree from this edited alignment, even though virtually all SNPs in this alignment are inconsistent with this tree. This confirms that the structure in the core tree does not derive from the subset of SNPs that fall on the core tree, but rather reflects overall statistical properties of all SNPs.
We propose that, rather than thinking of the SNP patterns as deriving from a single ‘true’ phylogeny plus unstructured recombination noise, the statistics of the SNP patterns reflect structure in the recombination process itself. In particular, we propose that because of population structure, that is the fact that different lineages recombine at different rates, the recombination process is itself structured, and that the distribution of phylogenies that occur along the alignment reflects this population structure. To illustrate this interpretation, we will use data from a species for which there is no question that recombination dominates and no ‘clonal’ structure can exist.
Phylogenies reconstructed from human genome sequences also exhibit robust phylogenetic structure
We randomly selected 40 genomes from the 1000 Genomes project, used PhyML to build a phylogenetic tree from chromosomes 1–12 of these genomes, and then investigated to what extent branches in this tree also occur in trees build from random subsets of the genomic loci. As shown in Figure 8, a robust phylogeny also emerges for the human data. In particular, individuals with the same geographic ancestry consistently form clades in the trees build from large numbers of loci.
However, it is of course clear that this phylogeny cannot correspond to the ‘clonal phylogeny’ of the human sequences, because there is no such thing as a ‘clonal phylogeny’ for the human sequences. Perhaps, the closest analog of a ‘clonal tree’ for the human data would be the phylogeny of the strictly maternal lineage. However, since at each generation roughly half of the autosomal chromosomes derives from the mother, and half from the father, there are few if any loci in the genome that follow this strictly maternal lineage, that is almost every locus was paternally inherited in at least some generations. Instead, it is clear that there are many different phylogenies along the genome, each with different ancestries.
The reason that a robust phylogeny still emerges is that the different phylogenies that occur along the genome are not completely random. That is, human populations are not completely mixed but people are more likely to mate with others from the same geographic area. Because of this, recombination tends to occur among people of the same geographic area, and because of this population structure the phylogenies that occur along the alignment of a set of human genomes are not sampled uniformly from all possible phylogenies, but some topologies are more likely to occur than others. In particular, individuals from the same geographic area will have recent ancestors in a larger fraction of the phylogenies along the genome than individuals from different geographic areas. Indeed, a simple principal component analysis of SNP statistics in genomes of European ancestry recapitulates geographic structure in remarkable detail (Novembre et al., 2008).
Of course, there are good reasons that human population geneticists do not typically represent the ancestral population structure of human sequences by a single phylogeny. We know that many different phylogenies occur along the genome and it would be misleading to pretend that this distribution of phylogenies can be summarized by a single tree. However, if one insists on representing the population structure by a single tree and asks PhyML to build a single phylogeny from many loci of the human sequences, then one does get convergence to a common phylogeny for the human data as well. The reason one gets this convergence is because asking for a single tree is analogous to asking for the ‘average’ of a distribution. The average of almost any distribution becomes very robust given enough samples, even if this average does not actually represent any typical sample. For example, instead of the total divergence between a pair of individuals representing the distance to their clonal ancestor, this divergence represents the average distance to the many different common ancestors of the pair across the many different phylogenies along the genome, and this average is a function of the rates at which their genetic ancestors have recombined. That is, the distances in this tree reflect the relative rates of recombination among the different lineages.
We propose that the situation for the bacterial data is very similar. Our analysis above has shown that many different phylogenies occur along the alignment and that the core tree must reflect the statistics of this distribution of phylogenies. We propose that, in the same way as for the human data, the core tree that results from applying PhyML to the core genome of a set of bacterial strains reflects population structure, that is the relative rates with which different lineages have recombined. To support this, we return to the E. coli data and use the SNP statistics to quantify the relative rates with which different lineages have recombined.
Recombination rates across lineages follow approximately scalefree distributions
The analyses above have shown that the core alignment of the E. coli strains consists of tens of thousands of short segments with different phylogenies. Thus, one approximate way of thinking about the core genome alignment is that the phylogeny at every genomic locus is independently drawn from some distribution over all possible phylogenies. Conceptually, whereas for purely clonal evolution each strain will have a unique clonal ancestor at each time t in the past, due to the frequent recombination each strain will have a large number of different genetic ancestors, that are responsible for different parts of the strain’s genome. We will refer to the genetic ancestors of each strain as one moves back in time as the lineages of each strain. Population structure corresponds to the fact that different lineages, that is the genetic ancestors of different strains, did not all recombine at the same rate, but that some recombined much more frequently than others.
The distribution of observed SNP types in fact contains extensive information about the relative frequencies with which different lineages have recombined at different times in the past. For example, imagine a SNP where two strains share a nucleotide which differs from the nucleotide that all other strains possess. We will denote such SNPs as 2SNPs or pairSNPs. If, at some genomic locus g, we find a 2SNP shared by strains s_{1} and s_{2}, then it follows that, whatever the phylogeny is at locus g, the strains s_{1} and s_{2} must be nearest neighbors in the tree, and the SNP corresponds to a substitution that occurred on the branch connecting the ancestor of s_{1} and s_{2} to all other strains. As an example, Figure 9A graphically shows the frequencies of all pairSNPs $(A1,s)$ in which A1 shares a SNP with one other strain s. Note that, if the data consisted of a clonal phylogeny plus random recombinations, we would expect A1 to have a substantial number of 2SNPs with its nearest neighbor in the clonal phylogeny, plus a small number of 2SNPs with a random collection of other strains. Also, if all lineages were freely recombining, we would expect roughly equal frequencies of all possible 2SNPs $(A1,s)$. However, this is not what we see. A1 shares 2SNPs with 17 of the 92 strains in our collection at a wide range of frequencies. Instead of a clear ‘clonal’ neighbor with most SNPs, there are three strains that share 2SNPs almost equally commonly with A1, that is A2, A11, and D8 with about 200 2SNPs each. There are some strains with intermediate numbers of 2SNPs, that is D4, and H6 with about 70 occurences each, 10 strains with 10 or less 2SNPs each, and four strains with only a single 2SNP.
We propose that these relative numbers of SNPsharing directly reflect relative rates of recombination between the genetic ancestors of these strains. Note that since all loci are overwritten many times by recombination, essentially every locus g along the alignment has its own unique lineage of genetic ancestors, as determined by the recombination events overlapping this locus, resulting in its own unique phylogeny ${\varphi}_{g}$. Tracing the genetic ancestors of a pair of strains s_{1} and s_{2} into the past, we will say that their lineages ‘have recombined’ at time t in the past, when their most recent common genetic ancestor occurs at time t in the past. In order for strains s_{1} and s_{2} to share a 2SNP at locus g, they must be nearest neighbors in phylogeny ${\varphi}_{g}$, that is their lineages must thus have recombined with each other before recombining with the lineages of any of the other strains. In addition, a substitution must have occurred on the lineage from their common ancestor to the next common ancestor in phylogeny ${\varphi}_{g}$.
Thus, the number of times ${n}_{({s}_{1},{s}_{2})}$ we expect to see a 2SNP of type $({s}_{1},{s}_{2})$ is proportional to the fraction ${f}_{({s}_{1},{s}_{2})}$ of genomic loci g for which s_{1} and s_{2} have recombined before recombining with any of the other lineages, times the average length ${t}_{({s}_{1},{s}_{2})}$ of the branches from the genetic ancestors of s_{1} and s_{2} to the next common ancestor in these phylogenies ${\varphi}_{g}$, that is
The fraction ${f}_{({s}_{1},{s}_{2})}$ reflects the rate with which the lineages of s_{1} and s_{2} have recombined, relative to the rates with which the lineages of s_{1} and s_{2} recombine with the lineages of all other strains. In contrast, the average branch length ${t}_{({s}_{1},{s}_{2})}$ reflects the total rate at which the lineages of the ancestors of s_{1} and s_{2} recombine with any of the other lineages. Since the total rates of recombination are determined by sums of many rates, we believe that it is likely that variation in the average branch lengths ${t}_{({s}_{1},{s}_{2})}$ across pairs $({s}_{1},{s}_{2})$ is significantly less than the variation in ${f}_{({s}_{1},{s}_{2})}$. That is, we believe the variation in the number of occurrences ${n}_{({s}_{1},{s}_{2})}$ across pairs $({s}_{1},{s}_{2})$ mostly reflects variation in the fractions ${f}_{({s}_{1},{s}_{2})}$, which directly reflect variation in the rates of recombination.
To investigate how the relative rates of recombination vary across all pairs, we calculated the frequencies ${n}_{({s}_{1},{s}_{2})}$ of 2SNPs across all pairs of strains. Figure 9B shows a graph representation of all observed pairSNPs, with the thickness of the edges proportional to the logarithm of the frequency of occurrence of each 2SNP type. We see that each strain is connected through 2SNPs to a substantial number of other strains, indicating a high diversity of recent recombination events across the strains. At the same time, the large variability in the thickness of the edges indicates that some pairs occur much more frequently than others. Figure 9C shows the reverse cumulative distribution of the frequencies of all observed 2SNPs (blue dots), that is the distribution of the thickness of the edges in Figure 9B. Note that, if the strains were to recombine freely, each 2SNP would be equally likely to occur, and 2SNP frequencies would show little variation. Instead, we see that 2SNP frequencies f vary over more than 3 orders of magnitude, that is from an occurrence of just $f=1$ for many 2SNPs to $f=2965$ occurrences for the most common 2SNP. Since the reverse cumulative distribution of 2SNP frequencies follows an approximate straightline in a loglog plot, the frequency distribution roughly follows a powerlaw distribution $P(f)\propto {f}^{\alpha}$. Fitting the 2SNP data to a powerlaw (see Materials and methods) we find that the exponent equals approximately $\alpha \approx 1.41$ (blue line in Figure 9C). While clearly not a perfect powerlaw, the distribution is long tailed and much better fit by a straight line in a loglog plot than by a straight line in either a linear or semilog plot.
Beyond SNPs shared by pairs of strains, we can of course also look at SNPs shared by triplets, quartets, and so on. Besides the distribution of 2SNP frequencies, Figure 9C also shows the reverse cumulative distributions of 3SNPs (orange dots), 4SNPs (green dots), and 12SNPs (red dots). We see that all these distributions can be approximated with powerlaw fits (solid lines). We find that essentially all nSNP distributions are approximately scalefree, that is can be fitted with powerlaws. Thus, while some sets of n strains share SNPs much more often than others, their frequencies fall along a scalefree continuum, so that there is no natural way of dividing the clades of n strains into ‘common’ and ‘rare’ clades. Note also that each nSNP corresponds to a mutation that occurred on the branch leading to the ancestor of a group of n strains. Therefore, nSNPs for larger n typically correspond to mutational events that occurred further back in time. As shown in Figure 9D, the exponent α of the nSNP distribution increases with n, ranging from $\alpha \approx 1.25$ for singlets, that is $n=1$, to $\alpha \approx 2.8$ for $n\ge 20$. The fact that nSNP distributions become more steep as n increases means that the average number of occurrences per nSNP decreases as n increases. Thus, the diversity of nSNPs tends to be larger further back in time (see also Appendix 4).
One might wonder to what extent ‘clonal’ SNPs also contribute to the observed distributions of nSNPs, for example whether the most frequent nSNPs in the tails of the distributions might correspond to clonal SNPs. However, removal of all nSNPs that correspond to branches of the core tree has little effect on the nSNP distributions (Figure 9—figure supplement 1). In addition, the nSNP distributions for the 5% homoplasycorrected alignments also look virtually identical (Figure 9—figure supplement 1), showing that homoplasies do not significantly affect the nSNP distributions either.
If, as we have argued, these longtailed distributions of nSNP frequencies indicate complex population structure, that is that recombination rates of different lineages vary along a wide continuum, then one would expect that the nSNP statistics that are observed for clonally evolving populations, or populations that are freely recombining, look fundamentally different. This is indeed what we observe. Appendix 4 presents an indepth comparison of nSNP statistics observed for the E. coli data, with the nSNPs observed for simulations with different amounts of recombination. To summarize, when recombination rates are so low that populations evolve mostly clonally, that is $\rho /\mu \le 0.01$, the diversity of nSNP types is low. Although the distribution of nSNP frequencies exhibits long tails, these frequencies mostly reflect the lengths of the branches in the clonal tree, and the exponents are much smaller and approximately independent of n. When recombination rates are higher, and more similar to what we inferred for the E. coli data, there is a very large diversity of nSNPs, but the nSNP distributions do not exhibit long tails. Instead, all nSNP types have similar frequencies. Thus, the combination of the high diversity of nSNP types with the longtailed distributions that get more steep as n increases, is unique for the E. coli data, supporting that these distributions reflect population structure.
Finally, although it is tempting to interpret the nSNP distributions as reflecting an almost ‘scale free’ population structure, they depend in a complex manner on both the distributions of topologies and branch lengths of the phylogenies across the alignment. As we do not yet have a general theory for the evolutionary process that generates the distribution of phylogenies along the alignment, it is at this point difficult to disentangle the contributions from variations in topologies and branch lengths to the nSNP distributions. Consequently, it is difficult to give a precise interpretation of either the approximate powerlaw form or the meaning of the exponents.
Phylogenetic entropy profiles of individual strains
Another way to characterize the structure of the nSNP distributions is to quantify, for each strain s, how diverse the clades are that s occurs in at different n. To illustrate the idea, we refer back to Figure 9A, which shows all the 2SNPs in which strain A1 occurs. Note that these 2SNPs are all mutually inconsistent, that is in any one phylogenetic tree the strain A1 can only occur in a pair with one other strain. Therefore, the relative frequencies of the different 2SNPs in which A1 occurs reflect the relative frequencies of phylogenies in which A1 is paired with different strains, and the diversity of these pairs can be quantified by the entropy of this distribution. That is, if ${n}_{(s,A1)}$ is the number of 2SNPs of type $(s,A1)$, then the fraction of pairs for which A1 occurs with s is ${f}_{s}={n}_{(A1,s)}/[{\sum}_{{s}^{\prime}}{n}_{(A1,{s}^{\prime})}]$, and the entropy of the distribution of 2SNPs for A1 is ${H}_{A1}(2)={\sum}_{s}{f}_{s}\mathrm{log}({f}_{s})$.
Note that the same calculation can be done for any strain s and any n. For example, if the strain s occurs in 10 different quartets of strains q, then all quartets q are mutually inconsistent, and the diversity of quartets in which s occurs can be calculated as the entropy ${H}_{4}(s)={\sum}_{q}{f}_{q}\mathrm{log}({f}_{q})$ of the relative frequencies with which the different quartets q occur. In this way, for each strain s we can calculate an entropy profile ${H}_{s}(n)$ that contains the entropies of the nSNP distributions in which strain s occurs, as a function of n (see Materials and methods). Figure 10 shows the entropy profiles of all strains (right panel), as well for a selection of six example strains (left panel), as calculated on the 5% homoplasycorrected alignments.
Note that, if evolution were strictly clonal, then all entropies ${H}_{n}(s)$ would be zero, but instead we see entropies going up to 6–8 bits for all strains. However, probably the most striking feature of the entropy profiles is their great variability across the strains. If all lineages were recombining with each other at equal rates, we would expect the diversity of nSNPs to increase in the same way for each strain, but the data show almost as many distinct entropy profiles as there are strains. This shows directly that the recombination rates across lineages must be highly heterogeneous.
For most strains the entropy increases quickly with n, for example for strain A10 the entropy rises to ${H}_{4}(A10)\approx 6$ bits, which is equivalent to strain A10 occurring in approximately 64 different quartets with equal frequency. In contrast, for strain E7 the entropy stays zero until $n>6$, and for strain G8 even until $n>20$. As can be seen in the core tree (Figure 1), E7 and G8 are part of groups of $n=6$ and $n=20$ very closely related strains, respectively. Both these groups had a clonal ancestor so recently that essentially no recombination events occurred since. Consequently, the strains in these groups occur in at most 1 type of nSNP for $n\le 6$ and $n\le 20$, respectively. Thus, for groups of m strains with a very recent common ancestor, the entropies ${H}_{n}(s)$ are essentially zero when $n\le m$. In this way, the entropy profiles directly show when strains are parts of a clonal clade.
Although the entropies generally go up with n, they do not increase monotonically. For example, for strain B11, the entropy increases to around ${H}_{n}(B11)\approx 4$ for $n=5$ to $n=8$, but then drops back to almost zero at $n=9$. Notably, this strain B11 is a member of the outgroup of nine strains that are highly diverged from all the other strains (Figure 1) and the fact that ${H}_{9}(B11)=0$ reflects the fact that 9SNPs shared by all strains of the outgroup are vastly more numerous than any other 9SNP in which B11 occurs. That is, whereas B11 occurs in different clades of sizes $n=5$ through 8, the phylogenies of almost all its genomic loci contain the same clade of $n=9$ strains corresponding to the outgroup.
Another feature that is evident is that the entropy profiles of some strains appear to nearly merge at large n. For example, the entropy profiles of A10 and E7 become very similar from $n=34$ onward, and from about $n=40$ onwards the profile of K12 becomes very similar to these two as well. However, not all entropy profiles merge, and different groups of profiles remain up until $n=46$. As can be seen in Figure 10—figure supplement 1, we generally observe that strains from the same phylogroup tend to merge their entropy profiles at large n. Importantly, however, although the entropy profiles become highly similar at large n, they do not become identical. In contrast, for strains with a recent clonal ancestor, the entropy profiles are perfectly identical. For example, the entropy profiles of the close pairs (B6, C1) from phylogroup F, and (A7, B2) from phylogroup B2, completely overlap so that only one of the two colors of each pair is visible in the plot (Figure 10—figure supplement 1). We developed a simple statistical test, based on the Fisherexact test (Materials and methods) to quantify the extent to which the nSNP statistics of a pair of strains are significantly distinct. As shown in Figure 10—figure supplement 2, we find that only very closely related pairs of strains that had a recent common ancestor, corresponding to about 5% of all pairs, have statistically indistinguishable nSNP statistics. Thus, while strains with a recent clonal ancestor have identical entropy profiles, the entropy profiles of strains from a phylogroup have similar but not identical nSNP distributions at large n. Moreover, strains from different phylogroups can have very different entropy profiles even at large n. To provide additional insights into what the nSNP statistics show about recombination patterns, Appendix 5 provides an indepth analysis of the nSNP statistics for the relatively small phylogroup B2.
Thus, the general picture that emerges from the entropy profiles is that, while some of the structure at small n is due to clonal relationships, at large n the entropy profiles reflect the statistics of recombination of different lineages. While strains from the same phylogroup tend to show highly similar nSNP statistics at high n, suggesting that their lineages have similar recombination statistics sufficiently far into the past, strains from different lineages have clearly different recombination statistics even far into the past.
Finally, the entropy profiles observed for the E. coli strains differ fundamentally from what is observed for the data from simulations (Figure 10—figure supplement 3). For the simulations without recombination, the entropies are almost all zero and at the very low recombination rate of $\rho /\mu =0.001$, where clonal SNPs dominate along all branches of the clonal tree, all entropies are small. In contrast, in the regime where recombination dominates along all branches of the clonal tree, that is for $\rho /\mu \ge 0.3$, the entropy profiles of all strains are highly similar. This confirms that when lineages recombine at equal rates, highly similar entropy profiles result. Therefore, the fact that the entropy profiles are highly heterogeneous for E. coli, and that virtually every strain has a distinct entropy profile, directly shows that the lineages of almost every strain have a distinct pattern of recombination rates with the lineages of the other strains.
nSNP statistics and entropy profiles for the other bacterial species
We next investigated whether the nSNPs of the other species also exhibit approximately powerlaw distributions, as observed in E. coli. Figure 11—figure supplement 1 shows the reverse cumulative distributions of 2SNPs, 3SNPs, 4SNPs, and 12SNPs across all six species together with powerlaw fits. Although the curves often deviate substantially from simple straight lines, they all exhibit long tails and range over several orders of magnitudes, that is up to five orders of magnitude for 2SNPs in S. enterica. Note that M. tuberculosis is again an exception. The total number of different nSNP types is so small for this species that the nSNP distributions are not well defined, suggesting again that this species is mostly if not completely clonal, and we will not further comment on its nSNP statistics.
Figure 11 (left panel) shows the fitted exponents of the powerlaw distributions of nSNPs as a function of n for all species. We see that the exponents generally increase with n indicating that the phylogenetic diversity generally increases as one moves further back in time, that is to larger n. Consistent with other observations, H. pylori shows the highest exponents, that is the highest diversity. It is also interesting that, while the exponents become roughly constant when $n>20$ for E. coli, H. pylori and S. aureus, B. subtilis and S. enterica, exhibit more complex patterns with sudden drops in exponent at particular values of n, suggesting more complex population structures at large n for these species.
Figure 11—figure supplement 2 shows the entropy profiles ${H}_{n}(s)$ for all strains s in each of the species. Consistent with all other statistics, all entropies are low for the M. tuberculosis strains, further supporting that this species is predominantly clonal. With the exception of H. pylori, where the entropy profiles are quite similar for most strains, consistent with an almost ‘freely recombining’ population (Suerbaum et al., 1998), all other species exhibit highly diverse entropy profiles across strains, showing that also in these other species almost every strain has a unique ‘fingerprint’ of frequencies with which its lineages recombine with the lineages of other strains. That is, the entropy profile analysis suggests that, in most species, population structure is complex, with highly heterogeneous recombination rates across the lineages of different strains.
Although the entropy rises quickly to values in the range $48$ for most strains, we also see strains for which the entropy only rises after n exceeds some fairly large value of n, for example at $n=10$ for some strains in H. pylori, and at $n=24$ and $n=62$ for some S. enterica strains. This suggests that the corresponding strains are part of clonal groups of very closely related strains. To summarize the entropy profiles of each species, the right panel of Figure 11 shows the mean and standarderror of the entropy profiles, averaged over all strains, as a function of n. As for most other statistics, M. tuberculosis is an outlier whose strains generally only show low phylogenetic entropy. For all other species, the average entropy clearly increases as n increases, indicating again that the phylogenetic diversity increases further back in the past. For four of the six species, the mean entropy at large n falls in a narrow range between 7 and 8 bits, suggesting that the effective number of different ancestries far back in the past is relatively similar for these species.
Finally, for comparison we decided to also calculate these patterns on the human data. In particular, we extracted SNP data for chromosomes $112$ for 2504 humans from the 1000 Genomes project (1000 Genomes Project Consortium et al., 2015). Figure 11—figure supplement 3 shows examples of the nSNP distributions for human together with the fitted exponents for n ranging from 1 to 30, as well as the entropy profiles for the 40 randomly chosen individuals used previously in Figure 8. Apart from the 2SNPs, which appear to show a biphasic pattern, all other nSNP distributions in human are all well fit by powerlaw distributions. Moreover, for $n\ge 5$ the exponents are almost constant around a value of 2.8. Interestingly, the entropy profiles of all 40 individuals quickly rise to around 10 bits. However, from that point onward the entropy profiles of individuals with African ancestry deviate from those of the other individuals, showing consistently higher entropy, confirming the wellknown fact that African populations have higher genetic diversity. However, in contrast to what we observe for most bacterial species, the entropy profiles for individuals with the same geographic ancestry are all very similar. Among the bacterial species we studied, only H. pylori shows similar entropy profiles. In this respect it is noteworthy that previous studies have argued that H. pylori is almost freely recombining (Suerbaum et al., 1998), and that the population structure that it exhibits in fact reflects the population structure of its human host (Falush et al., 2003).
Discussion
Despite the fact that complete bacterial genomes have been available for more than 20 years, we still lack a clear picture of the relative contributions of recombination and clonal evolution in shaping the observed sequence variation among strains of a bacterial species. In our view, a key problem is that previous methods for quantifying the role of recombination virtually all make specific simplifying assumptions about the underlying evolutionary dynamics. However, we do not know to what extent these assumptions hold and affect the conclusions. To address this problem, we have here developed new methods that avoid making specific assumptions about the evolutionary dynamics as much as possible. In particular, showing that almost all biallelic SNPs in the core genome alignment correspond to single mutations in the history of the coresponding positions, we showed several new ways in which these SNPs can be used to quantify phylogenetic structures and the role of recombination in genome evolution.
Our analysis shows that, for all but one of the species studied here, evolution of the core genome is almost entirely driven by recombination. Recombination is so common that for most pairs of strains none of the DNA in their pairwise alignment derives from their clonal ancestor. Although pairs of strains exist that are so closely related that most of their DNA was clonally inherited, even for such pairs the large majority of substitutions that separate them derive from recombination events. When considering the core genome alignment of a collection of roughly 100 strains, we found that each genomic locus has been overwritten many times by recombination in the history of the strains, that is hundreds if not more than a thousand times for E. coli. Consequently, few if any of the loci follow the clonal phylogeny, and we found that the core genome alignment fragments into many thousands of short segments with different phylogenies. Thus, approximating sequence evolution in the core genome as occurring along the branches of a fixed phylogenetic tree is clearly inappropriate.
One might infer from these results that the dynamics of genome evolution within a bacterial species could be better approximated as quasisexual with free recombination between lineages. However, as has been noted previously (Yang et al., 2019), such an approximation is also inconsistent with the data. Strains do not appear roughly equidistant and phylogenies build from large numbers of genomic loci clearly converge to a welldefined phylogeny. We here argued that this clear phylogenetic structure reflects population structure, that is variation in the relative rates with which different lineages recombine. As a consequence of this population structure, the many different phylogenies that occur along the genome alignment are drawn from a highly nonuniform distribution, and the core genome phylogeny represents an effective average of this distribution of phylogenies.
These conclusions are clearly at odds with the prevailing view that, although E. coli is subject to substantial amounts of recombination, an overall phylogenetic backbone representing clonal relationships can still be identified (Touchon et al., 2009; Didelot et al., 2012; Bobay et al., 2015; Denamur et al., 2021), either by constructing a maximumlikelihood phylogeny from whole genome alignments (Touchon et al., 2009; Didelot and Wilson, 2015), or using more sophisticated approaches based on detecting loci that have not been affected by recombination (Didelot et al., 2012; Croucher et al., 2015), and inferring the clonal tree from those. If, as we are arguing, this prevailing view is mistaken, this raises the question why these previous approaches reached misleading conclusions. The answer is that these approaches rely on assumptions that do not apply. First, methods that aim to reconstruct the clonal phylogeny from a subset of positions that have not been affected by recombination are inherently problematic because our analysis shows that, for most species, such positions simply do not exist. An arguably more fundamental problem with previous analyses is that they assume recombination is completely random, that is without population structure. For example, the sophisticated ClonalFrame method (Didelot and Falush, 2007), which has been used to reconstruct a clonal phylogeny of E. coli strains (Didelot et al., 2012), uses a model in which along each branch of the clonal phylogeny a fraction of the loci is overwritten by recombination, while the rest is inherited clonally. However, instead of explicitly modeling recombination between lineages, it models the effects of recombination as simply introducing random substitutions at a fixed rate into the recombined segments. Consequently, recombination cannot introduce any phylogenetic structure by construction, and any phylogenetic structure evident in the data will thus be interpreted as reflecting clonal relationships. In short, a key problem with previous methods is that, by either explicitly or implicitly relying on models that assume that there is no population structure, the phylogenetic patterns resulting from population structure have been mistaken for clonal relationships. Note that this also implies that methods that aim to detect or quantify recombination by deviation from clonal phylogenies inferred by such methods are also fundamentally flawed.
Many studies have aimed to gain insight into bacterial genome evolution by simulating simple models of evolutionary dynamics, and comparing statistics in the resulting data with analogous statistics in real data. However, when these simple models make assumptions that do not hold in the real world, one may easily reach misleading conclusions. For example, a recent study used simulations of a simple genome evolution model with completely random recombination to conclude that, even which recombination is very frequent, phylogenies reconstructed from whole genome alignments can still correctly identify clonal relationships (Stott and Bobay, 2020). This is not surprising because, as we just noted, when recombination is random any phylogenetic structure in the data can only reflect clonal relationships by construction. However, in the real world, recombination is not completely random, that is there is population structure.
To further support that phylogenetic structures in whole genome alignments reflect population structure, we also developed methods for quantifying this population structure in more detail. In particular, we showed that biallelic SNPs contain much information about the statistics of relative recombination rates across lineages. Whenever a particular nSNP occurs in the alignment, the n strains sharing the SNP must occur as a clade in the phylogeny at that locus. Consequently, the relative frequencies of different nSNP types reflect the relative frequencies with which different subsets of strains occur as clades across the phylogenies. We find that the frequencies of nSNP types vary over 3–5 orders of magnitude and follow roughly powerlaw distributions. Notably, since the nSNP distributions follow smooth longtailed distributions that do not appear to have a characteristic scale, it is not possible to naturally subdivide subsets into highly and rarely recombining types. Rather, there seems to be a large continuum of relative rates, with population structure on all scales. Given that recombination rates vary over orders of magnitude across different lineages, the idea of an effective single recombination rate for a bacterial species might be misleading, and it thus seems problematic to fit the data to models that assume a constant rate of recombination within a species, for example (Vos and Didelot, 2009; Lin and Kussell, 2019).
Essentially, all population genetics and coalescent models start from assuming one or more populations of individuals that, for the purpose of the model, are exchangeable. However, our analysis of the entropy profiles and nSNP statistics showed that, apart from some groups of extremely closely related strains that share a common ancestor before any recombination events occurred, almost every strain has its own distinct nSNP statistics. In particular, virtually every strain s has a unique entropy profile ${H}_{n}(s)$ of the distributions of frequencies with which it occurs in different clades of size n. This suggests that almost every strain has a unique pattern of relative recombination rates between its lineages and the lineages of the other strains. Consequently, models that start from populations of exchangeable individuals may be inappropriate by definition.
Given that models that assume either a single consensus tree, a fixed rate of recombination across strains, or exchangeable individuals, are all clearly at odds with the data on prokaryotic genome evolution, this raises the question of what would be an appropriate mathematical ‘null model’ that can capture the statistics that we observed here. In such a model, almost every lineage must have distinct rates of recombination with other lineages, these rates must vary over multiple orders of magnitude, and the model should reproduce the roughly powerlaw distributions of nSNP frequencies, ideally with exponents that can be tuned by parameters in the model. It is currently unclear how to construct such a model.
Apart from the question of how to mathematically model the observed patterns, a second key question is what determines the highly variable relative rates of recombination across lineages. Just natural selection acts on mutations and thereby shapes the observed patterns of substitutions, selection also undoubtedly strongly acts on recombination events and shapes the recombination patterns that we detect in the core genome alignments. Indeed, strong selection on horizontal transfer events between two B. subtilis strains was recently observed in a laboratory evolution experiment (Power et al., 2020). However, it is currently not clear to what extent the differences in observed recombination rates are shaped mainly by natural selection, for example that due to epistatic interactions only recombinant segments from other strains with similar ‘ecotypes’ are not removed by purifying selection, as suggested by some previous works (Shapiro et al., 2012; CadilloQuiroz et al., 2012), or that recombination rates may be set mostly by which lineages cooccur at the same geographical location. It is also conceivable that phages are a major source of transfer of DNA between strains, so that recombination rates may reflect the rates at which different lineages are infected by the same types of phages.
It is wellknown that homologous recombination requires sufficient homology between the endpoints of the DNA fragment and the homologous segment in the host genome. Thus, recombination rates will intrinsically decrease with the nucleotide divergence between strains. An intriguing and theoretically attractive proposal is that relative recombination rates are simply set by sequence divergence and that bacterial species may essentially be defined by the collection of strains that are sufficiently close to allow efficient recombination (Dixit et al., 2017). Previous studies have estimated that the rate of successful recombination decreases exponentially with nucleotide divergence (Vulić et al., 1997; Oliveira et al., 2016), and in Vulić et al., 1997 it was estimated that every 1% of sequence divergence leads to a roughly twofold reduction in recombination efficiency. However, since most E. coli strains are within 1–3% sequence divergence, this would suggest less than 10fold variability in the relative recombination rates across our strains, whereas the nSNP statistics suggest a much larger range of relative recombination rates.
Finally, while we here studied the frequency distribution of nSNP types as well as the entropies ${H}_{n}(s)$ of the nSNP distributions for each strain, it appears to us that this is just the tip of the iceberg of possible ways in which nSNPs can be used to study the evolution of a set of strains from their core genome alignment. Our analyses indicate that prokaryotic genome evolution is driven by recombination that occurs at a very wide distribution of different rates between different lineages, and there is now a strong need for identifying what sets these rates, and the development of new mathematical tools and models that can accurately describe this kind of genome evolution.
Materials and methods
Data
The E. coli sequences analyzed here can be accessed on NCBI Bioproject via the accession number PRJNA432505 (Ishii et al., 2007; Field, 2020, in preparation). These genomes were sequenced with Illumina HiSeq 2000 technology. Samples were multiplexed, 24 per lane, producing 100 bp paired end reads which resulted in an average coverage depth of between 125x and 487x (with four strains overrepresented at more than 1000x).
Genome sequences for all other species were downloaded from http://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria. The following two source data files contain accessing numbers for all the genomes used in this study.
Figure 1—source data 1. Table with source data for Figure 1—figure supplement 1. Figure 7—source data 1. Table with source data for Figure 7.
In order to facilitate reproduction of the results presented here, we also provide a comprehensive collection of data files containing not only all raw genome sequences and the core genome alignments for each of the species, but also processed data files containing the results of the pairwise analysis and SNP statistics. These data files are available as object 10.5281/zenodo.4420880 from https://zenodo.org/record/4420880.
Core genome alignment and core tree
Request a detailed protocolTo build a core genome alignment for the SC1 strains, we used the Realphy tool (Bertels et al., 2014) with default parameters and Bowtie 2 (Langmead and Salzberg, 2012) for the alignments. The Illumina sequencing used to sequence the E. coli strains can still make sequencing errors at a certain rate and obtaining accurate base calls genomewide relies on having sufficient read coverage at each position to confidently call the consensus nucleotide. We note that, in order to avoid sequencing errors, Realphy is quite conservative in its construction of the core genome alignment. In particular, in order for a position to be included each strain has to be represented, the coverage has to be at least 10 in each strain, and at least 95% of the reads from each strain have to agree on the nucleotide. A rough bound on the rate of sequencing errors can be obtained by noting that some of our strains are extremely closely related, that is while all 92 strains are unique in their full genome, there are only 82 unique core genomes. This means that there are 10 genomes that are identical in their core to another genome. Given that the length of the core alignment is $L={2}^{\prime}{756}^{\prime}541$, this means there were $10\times L$ nucleotides sequenced without a sequencing error and a simple Bayesian calculation shows that this implies that, with 99% posterior probability, the rate of sequencing errors is less than ${\mu}_{s}=1.1\times {10}^{7}$.
Realphy used PhyML (Guindon et al., 2010) with parameters m GTR b 0 to infer trees from the whole and parts of the core alignment. The tree visualizations were made using the Figtree software (Rambaut, 2018).
Analysis of core alignment blocks
Request a detailed protocolFor each 3 kb block of the core alignment, we used PhyML using the option c one to infer a phylogeny while restricting the number of relative substitution rate categories to one. Furthermore, to calculate the loglikelihood of a given 3 kb block under the tree topologies of other blocks, we reran PhyML using the o ’lr’ option, which only optimizes the branch lengths as well as the substitution rate parameters but does not alter the topology of the phylogeny.
Pairwise analysis and mixture modeling
Request a detailed protocolFor each pair of strains we slide a 1 kb window over the core genome alignment of the pair, shifting by 100 bp at a time, and build a histogram of the number of SNPs per kilobase by counting the number of SNPs in each window. That is, we obtain the distribution ${P}_{n}$ of the fraction of 1 kb windows that have n SNPs. We then assumed that the one kilobase blocks can be separated into a fraction f_{a} of ‘ancestral blocks’, that is regions that were inherited from the clonal ancestor of the pair, and a fraction $(1{f}_{a})$ that have been recombined since the pair diverged from a common ancestor. Although in previous work a simple ad hoc scheme was used in which it was assumed that blocks with less than a particular number of SNPs are ancestral and blocks with more SNPs are recombined (Dixit et al., 2015), we found that this approach is not satisfactory and results significantly depend on the cutoff chosen.
We thus decided to employ a more principled mixture model approach. For the ancestral regions, the number of SNPs per kilobase should follow a simple Poisson distribution ${P}_{n}={\mu}^{n}{e}^{\mu}/n!$, with μ the expected number of mutations per block. For the recombined regions, we note that these regions themselves will consist of mosaics of subregions that have been recombined previously. Consequently, the recombined regions will consist of a mixture of Poisson distributions with different rates. It is wellknown that mixtures of Poisson distributions with rates that are (close to) Gammadistributed follow a negative binomial distribution and we found empirically that negative binomial distributions give excellent fits to the observed SNP distributions in our data. For the recombined regions, we thus assume a negative binomial distribution of the form
where $0<\lambda <1$ and $\alpha \ge 1$ are parameters of the distribution. We thus fit the observed distribution of SNPs per block ${P}_{n}$ using the following mixture:
where f_{a} is the fraction of the genome that is inherited from the clonal ancestor. Fits were obtained using maximum likelihood. While expectation maximization was used to optimize the parameters f_{a}, μ, and λ, a grid search was employed to find the optimal dispersion parameter α.
Note that, in terms of the fitted parameters, the total number of mutations in ancestral blocks is $\mu {f}_{a}$, and the number of mutations in recombined blocks is $(1{f}_{a})\alpha \lambda /(1\lambda )$.
To estimate the lengths of recombination events, we first extracted pairs that are sufficiently close (divergence less than 0.002) such that multiple overlapping recombination events are unlikely. We then used a twostate HMM with the same two components, that is a Poisson and a negative binomial component corresponding to ancestral and recombined segments, and having fixed transition rates from the ancestral to the recombined state and vice versa, to parse the pairwise alignment into ancestral and recombined segments. Note that the parameters of the HMM are fitted separately for each pair of strains. For each pair, we took as recombined segments those contiguous stretches that were assigned to the recombined state by the HMM.
We define mostly clonal pairs as pairs with more than 90% of the alignment classified as ancestral, fully recombined pairs as pairs with less than 10% of the alignment classified as ancestral, and all other pairs as transition pairs. In order to estimate the critical divergence at which half of the genome is recombined we fit a linear model to the observed relationship between divergence and clonal fraction in all transition pairs, and define the critical divergence as the divergence at which the linear fit has a clonal fraction of 50%. To calculate the fraction of mutations that derive from recombined segments at the critical divergence, we compute the fraction of mutations in recombined segments for all transition pairs (using the results from the mixture model) and fit a linear model to the observed dependence between the ancestral fraction an the fraction of mutations in recombined segments. We then define the fraction of mutations in recombined regions at the critical divergence as the fraction of mutations in the linear fit when the ancestral fraction is 50%.
Simulated data sets
In order to simulate genome evolution under a simple evolutionary model that includes drift, mutation, and recombination, we developed new software (manuscript in preparation) based on general purpose GPU programming (GPGPU), which is available from https://github.com/thomsak/GPUprokEvolSim (Sakoparnig, 2021; copy archived at swh:1:rev:ed15c4b013068d684874c3d7c9d710e289e3f31c). Our software explicitly simulates the evolution of a population of N DNA sequences of length ${L}_{g}$ using a WrightFisher model with nonoverlapping generations. In order to be able to not only evolve the sequences, but also track recombination events occurring along the clonal history of a sample of S genomes from the population of N, we proceed as follows.
The evolution of the N genomes is simulated for $8N$ generations. First, a ‘guide’ clonal history for the sample of S sequences is determined using the Kingman coalescent (Wakeley, 2008) for S individuals within the population of N. In particular, for the subset of S genomes, their ancestry is determined for $8N$ generations into the past. Notably, because the S sequences will have a common ancestor much more recently than $8N$ generations in the past, for many of these generations there will only be a single ancestor.
We then simulate $8N$ generations of the evolution of N genomes of length ${L}_{g}$ forward in time by iterating:
For each individual corresponding to an ancestor from the clonal phylogeny of the sample of S genomes, its parent in the previous generation is chosen according to the guide clonal phylogeny. For each other individual, a random parent is chosen from the previous population of N individuals.
For each individual of the new generation, the genome of the ancestor is copied, scanned from left to right, and at each position a recombination event is initiated with probability ρ. If a recombination is chosen to occur at position i, one of the $N1$ other members of the population is chosen at random, and the section of its genome from position $i+1$ to $i+{L}_{r}$ is copied into the current genome. After this, each position in the genome is mutated with probability μ. The target nucleotide is chosen randomly using a transitiontransversion ratio of 3to1.
Apart from tracking the N genome sequences, we also track, for each branch of the clonal phylogeny of the S individuals, how many times each position was overwritten by recombination during the evolution along the branch. This allows us to calculate what fraction of positions are inherited clonally along each branch, how many times each position in the final alignment of S genomes was overwritten by recombination during its clonal history, and what fraction of positions was clonally inherited for each pair of strains.
Parameter settings
Request a detailed protocolTo keep the simulations computationally feasible, we simulated a population of $N=3600$ individuals. To make the simulations directly comparable with the data from E. coli, we focused on samples of $S=50$ individuals, used a genome size of ${L}_{g}={2}^{\prime}{560}^{\prime}000$ bp, and used a size of ${L}_{r}={12}^{\prime}000$ bp for the recombined fragments. Note that this length of recombined fragments is toward the lower range of the sizes of recombined fragments estimated from close pairs of E. coli strains. For a Kingman coalescent, the expected fraction of columns that is not polymorphic in an alignment of S individuals is given by
and we set μ such that this fraction is 0.9, that is similar to what is observed for the alignment of E. coli strains. In particular we set $\mu =3.28\times {10}^{6}$ so that $N\mu =0.012$.
Finally, apart from simulations without recombination, that is $\rho =0$, we performed simulations with six different recombination rates, corresponding to ratios of recombination to mutation of $\rho /\mu =0.001$, $\rho /\mu =0.01$, $\rho /\mu =0.1$, $\rho /\mu =0.3$, $\rho /\mu =1$, and $\rho /\mu =10$, that is covering four orders of magnitude. Note that the clonal phylogeny for the subset of S strains was created once, and then used in all simulations with different recombination rates to facilitate direct comparisons.
Estimating the fraction of SNPs that correspond to single substitution events
Our analyses assume that biallelic SNPs correspond to single substitution events in the evolutionary history of the genomic position. This assumption may break down due to apparent SNPs caused by sequencing errors, as well as due to homoplasies, that is biallelic columns for which more than one substitution event occurred. We here estimate the frequency of both types of events.
SNPs due to sequencing errors are negligible
Request a detailed protocolAbove we estimated, from the fact that we observe 10 strains that are identical in their core genome to another strain, that the rate of sequencing errors is at most ${\mu}_{s}=1.1\times {10}^{7}$. Using this, we expect a sequencing error to occur in at most $L\left(1{(1{\mu}_{s})}^{92}\right)=28$ of the columns of our length $L={2}^{\prime}{756}^{\prime}541$ core genome alignment. Note that, because the fraction of columns with more than one sequencing error is negligible, all these sequencing errors will produce a mutant nucleotide in only one strain, because the expected number of columns with two sequencing errors is negligible. Thus, if a sequencing error occurs in a column that is otherwise nonpolymorphic, it will create a biallelic SNP in which only one strain carries a differing base and such SNPs do not affect any of the phylogenetic analyses. Thus, sequencing errors affecting the phylogenetic analysis only occur when the sequencing error occurs in a SNP column, and creates one of the two nucleotides that already existed. Since only about 10% of columns are polymorphic, the expected number of SNP columns that are informative for phylogeny and affected by sequencing errors is at most $28/10=2.8$. Given that there are ${247}^{\prime}822$ phylogeny informative SNPs in total, the expected fraction of these that are affected by sequencing errors is roughly ${10}^{5}$.
Estimating the rate of homoplasy
Request a detailed protocolThe relatively low frequency of SNPs and the fact that almost all SNPs are biallelic strongly suggests that almost all biallelic SNPs correspond to single substitution events in the evolutionary history of the position. Here, we use a simple substitution model to estimate the fraction of biallelic SNPs that correspond to multiple substitution events, that is homoplasies. To do this, we will analyze the observed frequencies of columns with 1, 2, 3, and 4 different nucleotides under a simple substitution model. Note that, in this simple model we assume that all substitutions are neutral, so that there is essentially no difference between mutations and substitutions. In the real data some mutations are deleterious and some of these are removed by purifying selection, leading to lower rates of substitutions at some positions than at others. Indeed, for our core genome alignment of E. coli strains, we observed that SNPs occur almost 10 times more frequently at synonymous sites than at second positions in codons. To assess the effects of selection, we will fit our model to the frequencies of columns with 1, 2, 3, and 4 different nucleotides both for the subset of positions that should be under relatively little selection, that is third positions in fourfold degenerate codons, and positions that should be under relatively strong selection, that is second positions in codons. Since a large fraction of all SNPs corresponds to SNPs in which the outgroup of nine strains has a nucleotide that differs from the nucleotide of all other strains, we will also fit the model separately for all strains, and all strains minus the nine strains of the outgroup.
We will consider the following simple model for the substitutions in a single alignment column. Note that, because the phylogeny may change across positions in the alignment, the sum of the length of the branches of the phylogeny at a given position will vary from position to position. Let μ denote the product of the substitution rate times the total length of the branches in the phylogeny at the position of interest. The variable μ thus corresponds to the expected number of substitutions in the evolutionary history of this position. The probability that n substitutions took place at this position is then given by a Poisson distribution:
and each of these substitutions is equally likely to occur anywhere on the branches of the phylogeny.
We want to calculate the probability $P(dn)$, that if n substitutions occurred along the phylogeny, that d different nucleotides will occur at the leaves. Clearly, if no substitutions occurred, there will be only one nucleotide, so that $P(d0)={\delta}_{d1}$, with ${\delta}_{ij}$ the Kronecker delta function. Also, if one substitution occurred, then we know there must be two nucleotides occuring at the leaves, that is one nucleotide in all leaves downstream of the branch in which the substitution occurred, and one other nucleotide at all other leaves, that is $P(d1)={\delta}_{d2}$. However, when two or more substitutions occur, the situation is more complicated. Let α denote the ancestral nucleotide and assume that the first substitution mutated α to β. The second substitution either occurs in one of the branches carrying the ancestral nucleotide α, or in one of the branches carrying β. To remain at $d=2$ nucleotides, one has to either have another substitution $\alpha \to \beta $, or a substitution $\beta \to \alpha $. In all other cases the number of nucleotides will increase to 3.
Under this simple model, we can calculate the general probabilities $T({d}^{\prime}d)$ that, if there are currently d different nucleotides and another substitution is added, that there will be ${d}^{\prime}$ different nucleotides after the substitution. These probabilities depend on the relative probability for transitions and transversions to occur and we will assume that transitions are r times as likely to occur as transversions. Besides having $T(21)=1$, we find that $T(22)=(2+{r}^{2})/{(2+r)}^{2}$, $T(32)=(2+4r)/{(2+r)}^{2}$, $T(33)=2/3$, $T(43)=1/3$, and $T(44)=1$. All other transition probabilities are zero.
Starting from a single nucleotide in the column, the probability $P(dn)$ to end up with d different nucleotides after n mutations is given by the nth power of the transition matrix T, that is $P(dn)={T}^{n}(d1)$. From this, we can work out the probability $P(d\mu )$ to end up with d different nucleotides as a function of the expected number of mutations μ as
which can be easily numerically evaluated to sufficient precision.
Assume we observe c_{d} columns with d different nucleotides, with d running from 1 to 4. The loglikelihood of this count data given μ is
and by maximizing this loglikelihood with respect to μ (numerically), we obtain an estimate ${\mu}_{*}$ given the counts c_{d}. Finally, given ${\mu}_{*}$, the fraction f_{h} of homoplasies, that is biallelic SNPs that correspond to multiple substitutions, is given by the fraction of all columns for which $d=2$ but $n>1$. This fraction is given by
Table 1 shows the estimated expected number of mutations per column ${\mu}_{*}$ and the estimated fraction of homoplasies f_{h} for the five different subsets of columns, using a transitiontotransversion ratio of $r=3$. We see that the fraction f_{h} is at most 6.3% and less than 1% for second positions in codons.
In addition, Figure 3—figure supplement 1 shows a comparison of the observed and predicted frequencies of columns with 1, 2, 3, and 4 letters. Since effects of selection are likely least for the synonymous positions, we expect the simple model to fit the data best and we indeed observe that, for the synonymous positions, the simple model can reasonably accurately fit the observed frequencies, and even for the set of all alignment columns the fits are quite accurate (Figure 3—figure supplement 1). In contrast, for the second positions in codons, we can see the effects of selection in that, from the larger fractions of columns without SNPs, the model infers a lower ${\mu}_{*}$, and this leads to an underestimation of columns with four nucleotides. Thus, the true fraction f_{h} is more likely close to the values inferred from the synonymous positions. Note that ${f}_{h}=0.063$ when including the outgroup and ${f}_{h}=0.033$ when the outgroup is excluded. The difference between these two estimates derives from the very high fraction of SNP columns in which the 9 strains of the outgroup have another nucleotide than all other strains. For this subset of SNPs, the fraction of columns that have more than one mutation is much higher than for any other SNP column. Thus, for all other SNP columns, the estimate that a fraction of 3.3% correspond to homoplasies is likely the most accurate. For completeness, we provide the full statistics of the observed polymorphisms in Table 2.
As an additional test, we also applied this simple method to estimate the rate of homoplasy for the simulation data. For the simulation data without recombination we explicitly kept track of homoplasies and determined that a fraction ${f}_{h}=0.0256$ of biallelic positions correspond to homoplasies. Applying our estimation method to the alignment of the $S=50$ sample genomes from this simulation, we estimated ${f}_{h}=0.0243$, which is within 2% of the true value, further supporting our method for estimating the homoplasy rate.
In summary, our estimates strongly suggest that the rate of homoplasies among biallelic SNPs in our core genome alignment of the E. coli strains lies somewhere in the range of 26%.
Removing potentially homoplasic positions from the core genome alignment
Request a detailed protocolEven though the analysis of the previous section has shown that homoplasies are only a very small fraction of all biallelic SNPs, we decided to investigate to what extent this small fraction of homoplasies may affect the various statistics that we calculate. Ideally, if we knew which alignment columns correspond to homoplasies, we could simply remove all homoplasic columns and recalculate all statistics of interest on the reduced alignment from which these homoplasic sites were removed. However, since we only know the approximate fraction f_{h} of homoplasies and do not know which columns are homoplasies, a conservative approach is to remove those columns that are most phylogenetically inconsistent with other columns in their neighborhood.
In particular, for each biallelic SNP s in the alignment, we check its phylogenetic consistency with the nearest 200 SNP columns to the left and nearest 200 SNP columns to the right, that is its consistency with others SNPs within a roughly 4 kb region. We then assign each SNP column an inconsistency score ${I}_{s}$ corresponding to the fraction of the 400 neighboring columns that are phylogenetically inconsistent with it. We then sort all SNP columns by their inconsistency ${I}_{s}$ and remove a fraction f of SNP columns with the highest inconsistency. After this, we can recalculate all statistics of interest on the alignment from which these potentially homoplasic columns have been removed. In particular, we generated reduced alignments from which a fraction $f=0.05$ and a fraction of $f=0.1$ of most inconsistent columns were removed.
Constructing a tree that maximizes the number of compatible SNPs
Request a detailed protocolWe classify all SNPs in the core genome alignment into SNP types as follows. For each biallelic SNP, we map all letters with the majority nucleotide to a 0 and the minority nucleotide to a one and sort the bits according to the alphabetic order of the strain names. For SNPs where one allele occurs in exactly half of the strains the minority allele is not defined and the ambiguity is resolved by setting the first bit of the string to 0. In this way, each SNP is mapped to a binary sequence of length 92. This binary sequence defines the SNP type. Note that a SNP type corresponds to a particular bipartition of the strains.
We next counted the number of occurrences n_{t} of each SNP type t and sorted the SNP types from most to least common. We then used the following greedy algorithm to a collect a subset T of mutually compatible SNP types that accounts for as many SNPs as possible. We seed T with the most common SNP type, that is the SNP type occurring at the top of the list. We then go down the list of SNP types, iteratively adding SNP types t to the set T that are compatible with all previous types in the set T.
Bottom up tree building
Request a detailed protocolIn this procedure, we build phylogenies of subclades in a bottomup manner, starting from the full set of 92 strains and iteratively fusing pairs, minimizing the number of incompatible SNPs at each step.
For any subset of strains S, we define the number of supporting SNPs ${n}_{S}$ as the number of SNPs that fall on the branch between the subset S and the other strains, that is the number of SNPs in which all strains in S have one letter, and all other strains another letter. Similarly, we define the number of clashing SNPs ${c}_{S}$ as the number of SNPs that are incompatible with the strains in S forming a subclade in the tree.
The iterative merging procedure is initiated with each of the 92 strains forming a subclade S. At each step of the iteration we calculate, for each pair of existing subclades S_{1}, S_{2}, the number of clashing SNPs ${c}_{S}$ and supporting SNPs ${n}_{S}$ for the set of strains $S={S}_{1}\cup {S}_{2}$ consisting of the union of the strains in S_{1} and S_{2}. We then merge the pair $({S}_{1},{S}_{2})$ that minimizes the clashes ${c}_{S}$ and, when their are ties, maximizes the number of supporting SNPs ${n}_{S}$. At each step of the calculation, we keep track of the total number of SNPs on the branches of the subtrees build so far, as well as the total number of SNPs that are inconsistent with the subtrees build so far. In addition, we calculate the average pairwise divergences of the strains within the subclades. Figure 4—figure supplement 4 shows the ratio of clashing to supporting SNPs as a function of the divergence within the subclades.
Quartet analysis
Request a detailed protocolQuartets were assembled in the following way. We construct a grid of target distances d starting at 0.00001 and having 50 points with 0.0005 sized distance. For every target distance d, we scan the alignment for four strains which have all pairwise distances within 1.25fold of distance d. Every target distance d for which no quartet can be found fulfilling these criteria is ignored.
For each quartet, we extract all SNP columns where two strains have a specific nucleotide and the other two strains have another nucleotide. Every such SNP column unambiguously supports one out of three possible tree topologies for this quartet. For each quartet, we determine which topology has the largest number of supporting SNPs, and what the fraction of SNPs is that support this topology.
Linkage disequilibrium measure
Request a detailed protocolA standard measure of linkage disequilibrium of SNPs at a given distance is given by the average squaredcorrelation of the genotypes at these positions (Lewontin, 1988). For a pair of loci with biallelic SNPs there are four possible genotypes which we indicate as binary patterns 00, 01, 10, and 11. If the frequencies of these genotypes are f_{00}, f_{01}, f_{10}, and f_{11}, then the squared correlation is calculated as
where the variables with dots correspond to marginal probabilities, for example ${f}_{1.}={f}_{10}+{f}_{11}$, ${f}_{.1}={f}_{01}+{f}_{11}$, and so on.
Distribution of treecompatible stretches
Request a detailed protocolTo calculate the distribution of the number of consecutive treecompatible SNPs, we start from each SNP s in the core genome alignment and count the number n_{s} of SNP columns immediately following s, until a SNP column occurs that is incompatible with at least one of the n_{s} SNP columns. Similarly, to obtain the distribution of the number of consecutive treecompatible nucleotides we start from each position p in the core genome alignment and count the number n_{p} of consecutive nucleotides until a SNP column occurs that is incompatible with at least one of the SNP columns among the n_{p} nucleotides.
Minimum number of phylogeny changes C
Request a detailed protocolWe iterate over all SNP columns along the core genome alignment and add the current SNP to a list if it is pairwise compatible with all SNPs currently in the list. If it is incompatible with at least one SNP in this list, we empty the list, reinitialize the list with the current SNP, and increase the phylogeny counter C by one.
Phylogenies of human genome sequences
Request a detailed protocolWe selected 40 individuals at random from the 1000 Genome project (1000 Genomes Project Consortium et al., 2015) and build a ‘core tree’ by applying PhyML to the sequences of chromosomes 1 through 12. To investigate the robustness of this core tree, we cut the alignment in 1000 bp blocks and did 100 random resamplings of half of the blocks, building a phylogeny for each resampling using PhyML. We then determined the fraction of times each split in the core tree occurred in the trees of the 100 resamplings.
Powerlaw fits of nSNP distributions
Request a detailed protocolWe extract each nSNP from the core genome alignment and count the frequency, that is the number of occurrences, f_{t} of each nSNP type t as well as the total number T of nSNP types that occur at least once. We assume the nSNP type occurrences are drawn from a powerlaw of the form
where $\zeta (\alpha )$ is the Riemann zeta function defined by
The loglikelihood of the frequencies f_{t} as a function of α is given by
where $\u27e8\mathrm{log}[f]\u27e9$ is the average of the logarithm of the SNPtype frequencies. Using a uniform prior on α, the posterior distribution of α is simply proportional to the likelihood function. The optimal exponent ${\alpha}_{*}$ is the solution of
To calculate errorbars on the fitted exponentials we approximate the posterior by a Gaussian by expanding the loglikelihood to second order around the optimal exponent ${\alpha}_{*}$. We then find for the standarddeviation of the posterior distribution:
Entropy profiles of nSNP distributions
Request a detailed protocolFor a given strain X, we first extract all SNP types t for which X is one of the strains that shares the minority nucleotide. We then further stratify these SNP types by the number of strains n sharing the minority nucleotide. For each n, we thus obtain a set $S(X,n)$ of nSNPs in which strain X is one of the strains sharing the SNP. We denote the number of occurrences of an SNP of type t by f_{t} and the total number of nSNPs within set $S(X,n)$ as $S(X,n)$, that is
The entropy $H(X,n)$ of the nSNP distribution of strain X is then defined as
Comparing nSNP statistics of pairs of strains
Request a detailed protocolThe considerable variability of the entropy profiles of E. coli’s strains suggests that the lineages of different strains must have recombined at different rates with other lineages. Indeed, we observe much less variation in the entropy profiles of the data from simulations in which each strain recombines at an equal rate with each other strain, than we observe for the E. coli data (Figure 10—figure supplement 3). However, since we currently lack a concrete evolutionary model that can reproduce all the statistics that we observe in the data, it is difficult to quantify how different the recombination rates of different lineages have to be in order to reproduce the observed variation in entropy profiles.
Nonetheless, it is straightforward to define a simple statistical measure of the difference in the nSNP statistics of a given pair of strains $(x,y)$. Each SNP in the core genome alignment can be categorized by the subset of strains S that share the minority allele. Given a pair of strains $(x,y)$, this subset S can take on four possible types: S_{0} = (Z), S_{2} = ($xyZ$), ${S}_{x}$ = ($xZ$), and ${S}_{y}$ = ($yZ$), where Z is a subset of strains that does not include x or y. That is, either neither x or y carry the minority allele, they both do, or only one of them does. We are interested in comparing the relative frequencies with which x and y cooccur with different subsets Z in SNPs across the alignment. Note that to this end, we can ignore SNPs of the type S_{0} and S_{2} because x and y occur together in these SNPs, so that the frequency with which different subsets Z occur in types S_{0} and S_{2} is per definition the same for both x and y. Thus, the relevant SNPs are of the type ${S}_{x}$ and ${S}_{y}$.
Let ${n}_{xZ}$ be the number of SNPs of the type ${S}_{x}=(xZ)$, ${n}_{yZ}$ the number of SNPs of type ${S}_{y}=(yZ)$, and the totals ${N}_{x}={\sum}_{Z}{n}_{xZ}$, and ${N}_{y}={\sum}_{Z}{n}_{yZ}$. We can then define two probability distributions over all possible subsets Z, that is ${p}_{Z}^{x}={n}_{xZ}/{N}_{x}$ and ${p}_{Z}^{y}={n}_{yZ}/{N}_{y}$. These two probability distributions give the relative frequencies with which x and y are observed to occur in SNPs with all other sets of strains Z and there are standard methods to quantify to what extent these two probability distributions are statistically significantly different. A standard orthodox statistical test is the Fisher exact test, which uses the test statistic
Note that, if the counts are large enough so that we can use the Stirling approximation $\mathrm{log}(n!)\approx n\mathrm{log}(n)n$, this can be rewritten as
where the distribution p is the marginal distribution over the sets Z
and $D(pq)$ is the KullbackLeibler divergence (or relative entropy) of the distribution p with respect to distribution q. Note that the probability $P(x,y)$ corresponds directly to the pvalue of the Fisher exact test. The cumulative distribution of pvalues across all pairs of strains is shown in Figure 10—figure supplement 2, left panel, showing that the nSNP profiles are different for about 95% of all pairs. In the right panel we show a scatter of the distance $\mathrm{log}[P(x,y)]$ of the nSNP frequency profiles of each pair of strains $(x,y)$ as a function of their nucleotide divergence $d(x,y)$. This shows that only very close pairs that differ by less than 10 SNPs in their core genomes have statistically identical nSNP frequency profiles. In addition, there is a very high correlation between the nucleotide divergence $d(x,y)$ and the distance $\mathrm{log}[P(x,y)]$ of the nSNP frequency profiles.
Finally, one caveat of the Fisher exact test is that it presumes that all the observed nSNPs of the types ($Zx$) and ($Zy$) are statistically independent, and in the absence of a concrete stochastic model for the evolutionary dynamics, we do not know whether this assumption holds. However, the pvalues for most pairs are so low that, even if we assume the true numbers of independent events are 500fold less, the fraction of significantly different pairs would still be larger than 90%.
Appendix 1
Appendix 2
Comment on the estimation of recombination segment lengths
The recombination segment lengths that we inferred from comparing closely related pairs of strains (Figure 2J), and the average segment length of ${L}_{r}=31kb$, are significantly longer than estimates that have been reported previously, for example (Vos and Didelot, 2009; Touchon et al., 2009). In contrast to our method, which directly identifies recombined segments as regions of high SNP density in pairwise genome alignments, which can be easily detected by eye (i.e. see Figure 2A and B), these previous estimates were based on fitting of more complex population genetics models to the entire multiple alignment. Since these models make a number of assumptions that our results suggest do not hold, this may explain why these models inferred significantly lower fragment lengths. However, one may still ask whether our estimate of the average recombination segment length could be significantly inflated, for example due to our method failing to detect short recombination events.
We do not believe this is plausible for a number of reasons. First, the simple HMM that we used to detect recombined segments uses 1 kb as a minimal segment length, and as much as 30% of the recombination segments that we infer are only $12$ kb long (Figure 2J). This shows that our method has no difficulty in detecting relative short recombination segments.
Second, a simple backoftheenvelope calculation suggests that segments have to be very short in order for our method to fail to detect them. For the close pairs that we use to estimate the lengths of recombination segments, SNPs in clonally inherited regions are so rare that essentially only 1 kb blocks with 0, 1, or at most 2 SNPs will be assumed to be clonally inherited. Since the recombined segments derive from strains that are typically 1–2% diverged, even a recombination segment of length 200 bp would cause three or more SNPs, and would be detected by our method. Thus, very roughly speaking, segments of length 200 bp or more are expected to be reliably detected and only segments of 100 bp or less could easily be missed.
Third, to substantially affect the average length of the recombination segments, undetectable short segments (i.e. of length less than 100 bp) would have to be extremely frequent. That is, if we denote by ${L}_{r}=31000$ the average length of the detected segments, by ${L}_{u}$ the average length of undetected segments, by ${L}_{t}$ the true average segment length, and by f_{d} the fraction of all segments that were detected, we have
In the limit that ${L}_{u}$ is much smaller than both ${L}_{r}$ and ${L}_{t}$, we have approximately ${f}_{d}\approx {L}_{t}/{L}_{r}$. Thus, in order for the average segment length to be overestimated 10fold, we would have to assume that for every detected recombination segment of length 200 bp or more, there are 10 undetected recombination segments of length 100 or less. This seems rather implausible to us.
Finally, the average length of the recombination segments ${L}_{r}$ is immaterial for almost all of our results, with the exception of our estimated lower bound on the number of times T each locus has been overwritten by recombination. In particular, this lower bound T is directly proportional to the average length of the recombination segments ${L}_{r}$. However, the estimate T is also directly proportional to the lower bound C on the number of phylogeny changes along the genome, and we saw in the main text that in our simulations the true number of phylogeny changes can easily be 10fold larger than this lower bound C. Therefore, our lower bound for T can only be larger than the true value if the average segment length ${L}_{r}$ is overestimated by a large factor.
Appendix 3
Estimating T using statistics from the pair analysis
For a pair of strains $(s,{s}^{\prime})$ our pairwise analysis estimates the fraction f_{c} of the genome that was clonally inherited and the divergence d_{c} (i.e. fraction of positions for which s and ${s}^{\prime}$ have different nucleotides) in the clonally inherited regions, from the density of SNPs along their core genome alignment. If we make the assumptions that, since s and ${s}^{\prime}$ diverged from a common clonal ancestor, substitutions have occurred at a constant rate μ per unit time, recombination events have occurred at a constant rate ρ per unit time, and the average length of the recombination segments is ${L}_{r}$, then f_{c} is approximately given by
with ${L}_{g}$ the length of the core genome alignment, and t the time since the common ancestor of s and ${s}^{\prime}$. Similarly, the divergence d_{c} is given by
Using these, we can solve for the clonal fraction f_{c} as a function of the divergence d_{c} and find
Thus, given an estimate of the fraction of clonally inherited genome f_{c} and divergence d_{c} in these clonally inherited regions, we can infer an effective value of ${L}_{r}\rho /\mu $ for this pair, that is
In order for these simple equations to be applicable, we have to be in a regime where there are enough recombined regions so that their number and average size is not too noisy, but also not so many that the recombined regions start to significantly overlap. We thus extracted all pairs of strains for which $1/4\le {f}_{c}\le 3/4$ and plotted the estimate of $L\rho /\mu $ against f_{c} for each pair (Figure 1).
We see that based on this simple model, the estimates of ${L}_{r}\rho /\mu $ range from about 2100–4600.
Finally, we can use these estimates of ${L}_{r}\rho /\mu $ to derive an estimate for the average number of times T that a given locus in the core genome alignment has been overwritten by recombination. Note that, because recombination is very frequent, many thousands of different phylogenies occur along the core genome, but each single position i has some definite phylogeny ${\varphi}_{i}$. If we denote by $t({\varphi}_{i})$ the total length of the branches of phylogeny ${\varphi}_{i}$, then the expected number of times position i has been substituted is roughly $\mu t({\varphi}_{i})$ and the expected number of times position i has been recombined is ${L}_{r}\rho t({\varphi}_{i})$. From the observation that 10% of the positions in the core genome alignment are polymorphic, it follows that $\mu \u27e8t\u27e9\approx 0.1$, with $\u27e8t\u27e9$ the average of $t({\varphi}_{i})$ across all positions. Consequently, the average number of times T that a position in the genome has been overwritten by recombination is given by
Thus, this simple model predicts, from the statistics of the partially recombined pairs, that T lies somewhere between 210 and 460, which is consistent with our lower bound of $T=190$ that we estimated based on $C/M$ and the estimated average length of recombination segments ${L}_{r}$.
In spite of this consistency, we want to stress that this simple model assumes that there is a fixed rate of recombination, whereas our data indicates that recombination rates vary strongly across lineages. As such, it is inherently misleading to try to summarize the strength of recombination by a single recombination rate ρ. Moreover, in going from the estimate of ${L}_{r}\rho /\mu $ of each pair, to the estimate of T, we implicitly assume that the rates inferred for these partially recombined pairs can be extended to the full core genome alignment. This too is potentially problematic. We therefore consider this method for estimating T to be only a rough orderofmagnitude check on the lower bound on T that we estimated from the lower bound on the number of phylogeny changes C.
Appendix 4
E. coli's nSNP statistics differ from nSNP statistics of simulations with free recombination
In this section, we systematically compare the nSNP statistics observed for the E. coli data with nSNP statistics observed for the simulations with different rates of recombination. We show that the nSNP statistics for E. coli differ in several fundamental ways from those observed for simulations of populations of freely recombining individuals.
An often considered statistic in the population genetic analysis of sexually reproducing species is the socalled site frequency spectrum, which in our terminology corresponds to the total number of nSNPs $T(n)$, that is the total number of SNPs shared by n strains, as a function of n. Figure 1 shows the site frequency spectrum $T(n)$ for the E. coli data, the simulations without recombination, and the simulations with recombination to mutation rates ranging from $\rho /\mu =0.001$ to 10, using the 5% homoplasycorrected alignments. To distinguish the contribution of potentially clonal SNPs falling on the core tree, the solid lines show $T(n)$ for all SNPs, and the dashed lines show $T(n)$ for all but the nSNPs falling on the core tree.
We first focus on the simulation data. It is wellknown that, for populations evolving under a Kingman coalescent, $T(n)\propto 1/n$, that is the total frequency of SNPs shared by n strains falls as $1/n$ when averaged over many instantiations of the coalescent process. Indeed, for recombination rates sufficiently high that essentially all branches in the clonal history are dominated by recombination, that is $\rho /\mu \ge 0.3$, the observed site frequency spectrum falls exactly as $1/n$, and removing the SNPs that fall on the core tree has virtually no effect on the nSNP frequencies $T(n)$. In contrast, for the simulations without recombination (top center panel in Figure 1) virtually all SNPs fall on the clonal tree, that is after removal of the core tree SNPs only a tiny number of SNPs remain that derive from homoplasies that escaped the homoplasy correction. Moreover, because the counts $T(n)$ for the simulation without recombination derive from a single tree, we observe significant deviations from the average trend $T(n)\propto 1/n$, for example the clonal tree happens to have particularly long branches toward clades with n=5, n=7, and n=12 .
For $\rho /\mu =0.001$ all loci are still predominantly clonally inherited across all branches of the tree (see Figure 1—figure supplement 3), which is reflected in the fact that the number of SNPs drops dramatically when the clonal SNPs are removed, and that we clearly see the peaks in $T(n)$ at $n=5$, $n=7$, and $n=12$ that derive from the clonal tree. The recombination rate $\rho /\mu =0.01$ is an interesting intermediate case. While most branches of the tree are predominantly clonally inherited, a few longer branches are already mostly recombined (Figure 1—figure supplement 3). Reflecting this, we see that for $\rho /\mu =0.01$ clonal SNPs dominate the nSNP counts for small n, whereas from about $n=10$ onward, the nSNP counts derive mostly from recombination. At recombination rate, $\rho /\mu =0.1$ clonal SNPs only affect nSNP counts for $n<5$, and for higher recombination rates the site frequencies $T(n)$ perfectly follow the function $1/n$.
For the E. coli data, we see that core tree SNPs make a significant contribution to the nSNP counts for $n\le 9$. In particular, the peak at $n=9$, corresponding to the extremely common SNPs toward the outgroup, as well as a smaller peak at $n=20$, corresponding to a group of 20 extremely close strains, both disappear after removal of the core tree SNPs. From $n=10$ onward the core tree SNPs do not contribute significantly to the nSNP counts $T(n)$. Interestingly, while the counts $T(n)$ roughly follow the $1/n$ trend until $n=20$, for $n>20$ the counts $T(n)$ do not further decrease but appear virtually constant. This behavior is not observed in any of the simulated datasets.
Next, we investigated how many different types of nSNPs occur as a function of n for the different datasets, as well as the average number of occurrence of the each of the nSNP types. Since the exponents of the powerlaw fits are determined by the geometric average of the nSNP type occurrences (see Materials and methods), we calculated the geometric average of the counts per nSNP type as a function of n for each dataset. To focus on the nSNPs associated with recombination, we used the SNPs from the 5% homoplasycorrected alignment and removed all SNPs corresponding to branches of the core tree. The results are shown in Figure 2.
We first note that the number of 1SNP types is 50 for all the simulated data (which consisted of a sample of $N=50$) genomes. That is 1SNPs are observed that are exclusive to each of the 50 genomes. However, the number of 1SNP types observed for the E. coli data is 55, that is much less than the 92 possible types given that there are 92 strains. This reflects the fact that, in our collection of wild E. coli isolates, there are some groups of extremely closely related strains with identical core genomes. Note that such closely related strains are unlikely to occur for random samples from a Kingman coalescent.
Second, we see that, for the E. coli data, the number of nSNP types increases with n, saturating at around $500800$ unique nSNP types for each n at $n\ge 5$. In the right panel of Figure 2 we see that, as the number of nSNP types increases, the geometric mean frequency of the nSNP types decreases smoothly. This smooth decrease of the geometric mean mirrors the increasing exponents of the nSNP powerlaws. None of the simulation data resemble the pattern that is observed for the E. coli data. First, at the lowest recombination rate $\rho /\mu =0.001$ the number of nSNP types is low and decreasing with n. Second, for recombination rate $\rho /\mu =0.01$, the number of SNP types increases more gradually with n, saturating at about 200 nSNP types for large n. However, in contrast to what we observe for the E. coli data, the geometric mean number of occurrences for the simulations with $\rho /\mu =0.01$ is much larger and virtually constant for $n\ge 2$. At higher recombination rates $\rho /\mu \ge 0.1$, the number of nSNP types rises quickly with n, reaching a maximum of thousands of unique nSNP types between $n=3$ and $n=5$, and then decreases with larger n. At the same time, the geometric mean of the number of occurrences per type decreases quickly with n and stays at low values of about three for $\rho /\mu =0.1$ and close to one for the highest recombination rates. That is, when recombination completely dominates, and there is no population structure by construction, almost all SNP types are distinct, occurring only once or a few times.
Note that our previous analysis has shown that the number of phylogeny changes per SNP column C/M for E. coli is similar to that observed for $\rho /\mu =1$. However, as Figure 2 shows, at such high recombination rates one would expect much higher nSNP diversity still than is observed for the E. coli data. The fact that the E. coli data shows evidence of high recombination on the one hand, in combination with lower SNP diversity, is consistent with SNP diversity being constrained by population structure.
We next compared the approximately powerlaw nSNP frequency distributions that we observed for E. coli, with the nSNP frequency distributions observed for the simulation data. Figure 3 shows the nSNP distributions for six different values of n ranging from $n=2$ to $n=12$.
Whereas, for the E. coli data, all distributions are approximately powerlaw, almost all of the distributions for the simulated data are clearly bending downwards in the loglog plot, often severely so. Long tailed distributions are only observed for the very low recombination rates $\rho /\mu =0.001$ and $\rho /\mu =0.01$, for which we have seen that nSNPs are still largely dominated by clonal SNPs at small values of n. However, as we have seen above, such low recombination rates are not consistent with the E. coli data. In addition, as we will see in more detail below, the exponents of the nSNP distributions at recombination rates $\rho /\mu =0.001$ and $\rho /\mu =0.01$ are much lower than is observed for E. coli, and the average number of occurrence per nSNP type are significantly higher than observed for E. coli (Figure 2, right panel).
All nSNP distributions at recombination rates $\rho /\mu \ge 0.1$ differ clearly from powerlaws. That is, instead of longtailed distributions most of these nSNP frequencies fall within a fairly narrow range. For example, at $\rho /\mu =1$ most 2SNPs have frequencies between $1050$, most 3SNPs have frequencies between $220$, and 4SNPs that occur more than 10 times are very rare.
Although the nSNP distributions at mutation rates $\rho /\mu \ge 0.1$ cannot reasonably be fitted to powerlaws, we can still determine the exponents of the maximum likelihood fits since these only depend on the geometric average number of occurrences of the nSNP types (see Materials and methods). Figure 4 shows the resulting exponents for the simulated data, as well as for the E. coli data for reference.
We see that, for the simulation data, exponents increase systematically with recombination rate. For low recombination rates $\rho /\mu =0.001$ and $\rho /\mu =0.01$ all exponents are below 1.5, whereas for recombination rate $\rho /\mu =1$ the exponents lie in the range $2.32.6$. In contrast to the E. coli data, for which exponents increase smoothly from about 1.25–2.7 as n increases, for the simulations the exponents tend to be more constant as a function of n.
In summary, none of the nSNP statistics of the data from simulations resemble the nSNP statistics observed for the E. coli data. For example, while fairly longtailed distributions of nSNPs are observed at very low recombination rates, the exponents of these distributions are much lower than found for the E. coli data, and the nSNP diversity is lower than observed for the E. coli data. In addition, such low recombination rates are inconsistent with all of our other analyses of the E. coli data. In contrast, while high nSNP diversity is observed at higher recombination rates, there the nSNP distributions are not long tailed, but instead nSNP frequencies vary only over a relatively narrow range.
Appendix 5
Detailed SNP statistics for Phylogroup B2
In the main text, we illustrated 2SNPs using strain A1 (Figure 9), which is part of the phylogroup B2. The phylogroup B2, which is represented by the six strains A1, A2, A7, A11, B2, and D8 in our dataset, is small enough to allow for an illustrative discussion of what our various analyses show about the role of recombination in the evolutionary history of these strains.
Pairwise analysis for the sextet of strains (A1,A2,A7,A11,B2,D8)
Appendix 5—table 1 shows the pairwise divergences between strain A1 and the other strains of this phylogroup.
Note that these divergences are in the regime where, in general, most of the pairwise alignment has already been overwritten by recombination (see Figure 2G and note that the critical divergence at which 50% of the genome has recombined is 0.0032). Indeed, if one looks at the pairwise alignments of A1 with these strains (Figure 1), one finds that for each of these five strains there is little if any ancestrally inherited DNA left.
This is in fact true for all pairs of the phylogroup B2, with the exception of the pair (A7,B2). The divergence of A7 and B2 is less than ${10}^{4}$ and their alignment is fully clonal, that is A7 and B2 share a recent common ancestor. However, according to the analysis of SNP densities along the pairwise alignments, all other pairs in this clade are close to fully recombined. Note that this also means that the pairwise divergences are dominated by mutations in recombined regions, not by ancestrally inherited mutations (see Figure 2H). Thus, the pairwise analysis rejects that the core genome phylogeny for this phylogroup corresponds to the clonal phylogeny.
PairSNPs of A1 along the core genome alignment
In the main text, we introduced the 2SNP concept by showing the distribution of 2SNPs for the strain A1, which belongs to phylogroup B2. This analysis showed that strain A1 occurs in 2SNPs with 17 different other strains from our collection, and most commonly with strains D8, A2, and A11, which each have around 200 2SNPs with strain A1. Besides these three strains from phylogroup B2, strains D4 (from phylogroup F) and H6 (which does not have a clear phylogroup) each also have over 70 2SNPs with A1. Figure 2 shows how these five most common 2SNP types involving strain A1 are distributed along the core genome alignment. Note that, wherever a 2SNP occurs of type $(A1,s)$, it means that the strains A1 and s are nearest neighbors in the phylogeny at that locus. Consequently, the pattern of 2SNP types along the core genome alignment gives an indication of how different phylogenies are distributed along the genome.
The results in Figure 2 are consistent with the pairwise analysis, showing that the different 2SNP types are fairly uniformly spread along the core genome alignment, and that the phylogeny changes many times. As shown in the middle panel of Figure 2, the tails of the distributions of distances between 2SNPs of the same type are approximately exponential, indicating that 2SNPs of the same type occur approximately at a constant rate on a large lengthscale. However, on short lengthscales there is some evidence for clustering of 2SNPs of the same type. To further elucidate this, the right panel of Figure 2 shows the reverse cumulative distribution of consecutive runs of the same 2SNP type. This indicates that these runs are mostly very short, that is 3 or less 2SNPs of the same type in a row for 90% of the runs. The longest runs are one run of 16 SNPs in a row of the most common type (A1,D8), and one run of 16 SNPs in a row where strain A1 shares a SNP with the strain D4 from phylogroup F. The fact that 2SNPs of A1 with strains D4 and H6, both of which do not belong to phylogroup B2, are dispersed all across the core genome alignment underscores again that phylogroups should not be thought of as reflecting clonal ancestry, that is there is no single common ancestor cell of the strains in phylogroup B2. Instead, different positions in the core genome alignment have different ancestries and the strains of phylogroup B2 form a ‘group’ in the sense that, for many positions in the alignment, these strains share ancestors with each other more recently than with other strains.
Phylogeny changes along the core alignment of phylogroup B2
That recombination is pervasive within this set of strains with relatively low divergence is confirmed by the phylogenetic inconsistencies along the alignment made from the core genomes of just this sextet of strains (A1,A2,A7,A11,B2,D8). In particular, in the 5% homoplasycorrected alignment there are $M={30}^{\prime}196$ SNPs and at least $C=2664$ changes in phylogeny along this core genome alignment. This means that the lower bound on the ratio of the number of phylogeny changes to mutation events is $C/M=0.088$, that is a break on average every 11 SNPs. Figure 3 shows the distribution of segment lengths between phylogeny breaks (either counting the number of consecutive SNPs or total length of the segments) which shows that breaks typically occur within 10 SNPs and within a few hundred base pairs.
Figure 3 also shows (right panel) that the SNP density varies by as much as hundredfold across the 2664 segments, consistent with the fact that the transferred fragments are themselves mosaics of previous recombination events.
SNP types for the strains in phylogroup B2
Another indication that these singlephylogeny segments are the result of recombination comes from looking at what the actual SNP types are that occur in the segments. Appendix 5—table 2 shows, for the 10 longest segments, the lengths, the total number of SNPs, and SNPtypes that occur in each segment.
Note that the SNPtypes that occur in these longest unbroken segments are not only almost all inconsistent with the core genome phylogeny, they are also almost all inconsistent with each other. That is, each of these segments suggests a different phylogeny. Note also that, in each segment, there are typically multiple SNPs of the same type.
Finally, the fact that the alignment of this sextet is a mixture of segments that follow different phylogenies is also confirmed by the overall relative frequencies of the SNPtypes that are observed for these strains. Appendix 5—table 3 shows the most common 2SNPs, 3SNPs, and 4SNPs together with their number of occurrences, for this sextet of strains. Note that these are SNPtypes obtained from the entire alignment. That is, a 4SNP (A1 A11 A7 B2) means that these strains share a nucleotide that differs from the nucleotide that all other 88 strains have.
Appendix 5—table 3 shows that, apart from the clonal pair (A7 B2), which is consistently supported by the observed SNPs, the others strains occur at similar frequencies in different mutually inconsistent combinations. For example, the four most common 3SNPs all consist of the pair (A7 B2) with a different third strain. This diversity of phylogenetic partners for each strain is well summarized by the entropy profiles of these six strains, as shown in Figure 4.
First, note that the strains of the pair (A7 B2) have virtually identical entropy profiles, indicating that they had a clonal ancestor before much recombination took place, that is their entropy profiles correspond to the entropy profile of their common ancestor. They also have zero pair entropy because they only occur in pairs with each other. However, all other strains have pair entropies around 2.5, which is equivalent to occurring with equal frequency in $56$ different pairs. For quartets, the entropies range from 3.6 to 4, equivalent to $1216$ equally frequent quartets. That is, there is a diverse collection of phylogenetic relationships in which these six strains occur.
Summary: the phylogeny of strains (A1,A2,A7,A11,B2,D8)
In summary, all statistics show that even for the relative close sextet of strains (A1, A2, A7, A11, B2, D8), the only clear clonal signal left is the close pair (A7 B2). The alignments between all other pairs have been mostly overwritten by recombination, the phylogeny changes thousands of times along the core genome alignment, and each strain occurs in a diverse collection of multiply inconsistent nSNPs with the other strains. However, as shown in the bottom panel of Figure 1 of the main text, the branches in the core tree of this sextet are well supported when a tree is constructed from sufficiently many loci. That is, the core genome phylogeny is just the best compromise capturing the statistics with which different phylogenetic patterns appear, and one reliably converges to this best compromise when sufficiently many loci are taken into account.
Data availability
All data generated or analyzed during this study are available from public databases and links to all the source data are provided in the article.

NCBI BioProjectID PRJNA432505. Sequencing, assembly and annotation of naturalized E. coli isolates from Lake Superior Watersheds.
References

Automated reconstruction of wholegenome phylogenies from shortsequence readsMolecular Biology and Evolution 31:1077–1088.https://doi.org/10.1093/molbev/msu088

BookOn the Origin of Species by Means of Natural Selection, Or, the Preservation of Favoured Races in the Struggle for LifeJohn Murray.

The population genetics of pathogenic Escherichia coliNature Reviews Microbiology 19:37–54.https://doi.org/10.1038/s415790200416x

ClonalFrameML: efficient inference of recombination in whole bacterial genomesPLOS Computational Biology 11:e1004041.https://doi.org/10.1371/journal.pcbi.1004041

Evolutionary trees from DNA sequences: a maximum likelihood approachJournal of Molecular Evolution 17:368–376.https://doi.org/10.1007/BF01734359

Recombination signal in Mycobacterium tuberculosis stems from Referenceguided assemblies and alignment artefactsGenome Biology and Evolution 10:1920–1926.https://doi.org/10.1093/gbe/evy143

Statistical properties of the number of recombination events in the history of a sample of dna sequencesGenetics 111:147–164.

Presence and growth of naturalized Escherichia coli in temperate soils from lake superior watershedsApplied and Environmental Microbiology 72:612–621.https://doi.org/10.1128/AEM.72.1.612621.2006

Beach sand and sediments are temporal sinks and sources of Escherichia coli in lake superiorEnvironmental Science & Technology 41:2203–2209.https://doi.org/10.1021/es0623156

Fast gappedread alignment with bowtie 2Nature Methods 9:357–359.https://doi.org/10.1038/nmeth.1923

On measures of gametic disequilibriumGenetics 120:849–852.https://doi.org/10.1093/genetics/120.3.849

Evidence for recombination in Mycobacterium tuberculosisJournal of Bacteriology 188:8169–8177.https://doi.org/10.1128/JB.0106206

Efficient inference of recent and ancestral recombination within bacterial populationsMolecular Biology and Evolution 34:1167–1182.https://doi.org/10.1093/molbev/msx066

BookMolecular Evolution: A Phylogenetic ApproachNew Jersey, United States: John Wiley & Sons, Inc.

SoftwareGPUprokEvolSim, version swh:1:rev:ed15c4b013068d684874c3d7c9d710e289e3f31cSoftware Heritage.

Genome trees and the tree of lifeTrends in Genetics 18:472–479.https://doi.org/10.1016/S01689525(02)027440

The landscape of realized homologous recombination in pathogenic BacteriaMolecular Biology and Evolution 33:456–471.https://doi.org/10.1093/molbev/msv237

Evolutionary divergence and convergence in proteinsEvolving Genes and Proteins 97:97–166.https://doi.org/10.1016/B9781483227344.500176
Decision letter

Armita NourmohammadReviewing Editor; University of Washington, United States

Aleksandra M WalczakSenior Editor; École Normale Supérieure, France
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
This work offers a careful, quantitative analysis of a population of E. coli genomes sampled at the same time from one location. Combining genomic analysis, with population genetics models and very extensive simulations to test different evolutionary scenarios, the authors have shown that the observed genetic diversity is inconsistent with both pure clonal evolution and a freely recombining population. Instead, population structure which leads to a broad range of recombination rates within a species must be considered. This careful study, while it does not offer a unique solution, is an important starting point to a broader discussion, both at the level of population genetics modelling and the level of the biological and ecological sources of the population structure.
Decision letter after peer review:
Thank you for submitting your work entitled "Whole genome phylogenies reflect the distributions of recombination rates for many bacterial species" for consideration by eLife. Your article has been evaluated by Aleksandra Walczak (Senior Editor) and a Reviewing Editor.
This work offers a careful, quantitative analysis of a population of E. coli genomes sampled at the same time from one location. Combining genomic analysis, with population genetics models and very extensive simulations to test different evolutionary scenarios, the authors have shown that the observed genetic diversity is inconsistent with both pure clonal evolution and a freely recombining population. Instead, population structure which leads to a broad range of recombination rates within a species must be considered. This careful study, while it does not offer a unique solution, is an important starting point to a broader discussion, both at the level of population genetics modelling and the level of the biological and ecological sources of the population structure.
Both reviewers had access to the multiple rounds of reviews from a previous journal that the authors provided with their submission, and their responses to the concerns raised in these previous reviews. These concerns especially addressed some of the technical aspects of the work, and the reviewers at eLife felt that your responses were not only adequate, but also that the additional analyses further elevated the work.
Taking this into account, the reviewers agreed that the work should be published largely as is, with just a few clarifications, noted below:
1) A recent study by J. Power et al. (https://doi.org/10.1101/2020.04.23.057174) has shown prevalence of strong selection on horizontal gene transfer in B. subtilis in an experimental evolution setup in the lab. I understand that population diversity in the work of Power et al. arises from a clone in the lab and the timescale of evolution is substantially different from what you are considering in this manuscript. Nonetheless, I was wondering if you could comment how (if at all) selection on horizontal gene transfer (or recombination) could impact some of the results and the interpretations presented in the manuscript; a few sentences in the Discussion section would be sufficient.
2) Figure 11B very interesting. Is there a biological explanation (from the literature) as why H. Pylori has such high entropy level and a rapid saturation at n~5?
3) Please thoroughly address reviewer #1 comments on language edits.
Reviewer #1:
I suggest that you carefully go through the text and make language edits throughout. For example, there are issues with the usage of past and present tense and sometimes the narrative switched between the two even within a single paragraph. Also, there are typos and issues with the usage of articles and commas, which at times cause confusion.
https://doi.org/10.7554/eLife.65366.sa1Author response
Both reviewers had access to the multiple rounds of reviews from a previous journal that the authors provided with their submission, and their responses to the concerns raised in these previous reviews. These concerns especially addressed some of the technical aspects of the work, and the reviewers at eLife felt that your responses were not only adequate, but also that the additional analyses further elevated the work.
Taking this into account, the reviewers agreed that the work should be published largely as is, with just a few clarifications, noted below:
1) A recent study by J. Power et al. (https://doi.org/10.1101/2020.04.23.057174) has shown prevalence of strong selection on horizontal gene transfer in B. subtilis in an experimental evolution setup in the lab. I understand that population diversity in the work of Power et al. arises from a clone in the lab and the timescale of evolution is substantially different from what you are considering in this manuscript. Nonetheless, I was wondering if you could comment how (if at all) selection on horizontal gene transfer (or recombination) could impact some of the results and the interpretations presented in the manuscript; a few sentences in the Discussion section would be sufficient.
Just as there is no doubt that selection acts on mutations, there is of course also no doubt that selection acts on recombination events. As we already noted in the Discussion, it is likely that natural selection plays an important role in determining the relative rates of recombination that we observe in the data. To what extent natural selection versus other mechanisms such as geographic population structure determine the relative rates of observed recombinations is extremely interesting, but beyond the scope of this work. Here we are just trying to provide quantitative estimates for how much recombination occurred in the history of a set of strains and the statistics of which strains have recombined more or less frequently. These results are not affected by the extent to which selection has shaped the observed patterns. We’ve extended our remarks in the Discussion and added a reference to the preprint that the reviewer mentioned.
2) Figure 11B very interesting. Is there a biological explanation (from the literature) as why H. Pylori has such high entropy level and a rapid saturation at n~5?
Previous studies have noted that recombination rates are very high in H. pylori and suggested that this species is’ freely recombining’ (Suerbaum et al., PNAS 1998). That is, H. pylori is the one species for which the consensus in the literature already is that its evolution is pretty much dominated by recombination. In addition, it has also been argued that population structure evident in H. pylori sequences reflects the population structure of its human host (Falush et al., 2003). We have added some remarks in the text, including citations to these papers, to make clear that the high amount of recombination that we infer for H. pylori is consistent with the literature.
3) Please thoroughly address reviewer #1 comments on language edits.
We have followed the reviewer’s advice and gone through the text in search for typos, superfluous commas, and unwarranted switches between tenses. However, we feel that in many places a mixture of tenses is appropriate, e.g. we say that we found that nSNP distribution have long tails, because while this finding occurred in the past, the distributions did not cease to have long tails.
[Editors' note: we include below the reviews that the authors received from another journal, along with the authors’ responses.]
Reviewer 1
1.1) This manuscript tackles an interesting question and tries some new and worthwhile approaches. It brings a commendable amount of fresh energy to the problem. However makes many provocative claims in the Abstract that are not justified by the analysis performed. Furthermore, several of the sections have conceptual problems and are unclear. In general the authors rely far too much on verbal argument and intuition about what the quantities they calculate mean. There is a total absence of testing to show that the methods are actually measuring what is claimed. If published in anything like its current form, it would be an extremely confusing addition to the literature.
After damning us with faint praise, the reviewer starts the review by making a number of strong and very critical claims, including:
1) That claims in the Abstract are not justified by our analysis.
2) That we ‘rely on verbal arguments’, provide no quantitative support, nor any testing, to support that the quantities we calculate mean what we claim.
Since the reviewer makes these claims without identifying any concrete issue, method, or statement here, let alone providing an argument for these criticisms, we can at this point only note that we strongly disagree with these claims.
As detailed below, we have performed many additional analyses and validations to comprehensively address every concrete issue raised by the reviewer. These analyses include extensive comparisons with data from simulations of populations evolving under drift, mutation, and recombination (for which we specifically developed new simulation code). All these analyses confirm the accuracy of our methods. Moreover, although some valid technical issues were raised, we also show that quite a number of the critical claims made by the reviewer do not hold up to careful scrutiny, e.g. see responses to points 1.3, 1.4, 1.8, 1.9, 1.10, and 1.12.
In contrast, we do accept the reviewer’s criticism that several conceptual issues were not sufficiently clearly explained and we have now significantly rewritten and extended discussions in the manuscript to make these points clear. Here too we have taken advantage of the simulation results, where the ground truth is known, to better explain the meaning of some of the quantities we calculate. In addition, we introduced a new section in which we illustrate conceptual issues regarding the interpretation of the core genome phylogeny by using data from human genomes, and we also expanded the section on the entropy profiles to better explain what the nSNP statistics show about the relative recombination rates among different strains. Finally, to further clarify the meaning of the results we added analyses on the phylogroups(Appendix 1—table 2, and Figure 10—figure supplement 1), including an indepth analysis of the SNP patterns for phylogroup B2.
1.2) As the authors remark, the phylogeny section (Figure 1) is not hugely new and the authors provide no way to quantify recombination patterns from degree of incongruence.
The main aim of the analysis of Figure 1 is to introduce the general problem that the paper addresses using wellknown previously used methods, i.e. that phylogenies reconstructed from individual 3Kb alignment blocks are incongruent, but that phylogenies reconstructed from a large number of such blocks converge to a robust phylogenetic tree.
We agree that these incongruence measures do not directly quantify recombination patterns. Indeed, this is precisely the problem that our manuscript addresses and much of the rest of the manuscript is exactly about quantifying recombination patterns.
However, it should be noted that the results of Figure 1 do already rule out several suppositions that are sometimes made in the literature. First, if there were a significant fraction of blocks that were unaffected by recombination, than all these blocks would have to be consistent with a common phylogeny. The fact that every 3Kb block significantly rejects the phylogeny of every other block shows that this cannot be the case, i.e. almost every block has to be affected by recombination to some extent. Second, it is also often presumed that, while phylogenies may vary along the genome, that deviations from the core tree reconstructed from all blocks are only minor. Figure 1 also shows that this is not the case but that the incongruences are very substantial, i.e. half of the branches of the core tree occur in less than a quarter of all blocks.
In the revision we have rephrased so as to stress these points.
1.3) The pair of strains analysis (Figure 2) assumes that the simple statistical model that the authors use is fully accurate in identifying recombined regions. The method that the authors used is reasonable (others have used similar models) but is sure to make errors. Moreover, these errors are unlikely to be random within the dataset, since distinguishing recombined regions from nonrecombined ones becomes harder as divergence between strains increases. This bias is likely to interfere materially with the arguments the authors want to make.
It is not true that the results in Figure 2 assume that the identification of recombined regions is ‘fully accurate’. We state very clearly that this method is used to estimate the recombined regions for each pair. We also completely agree with the reviewer that it becomes harder to estimate the fraction of clonally inherited regions as divergence increases. Indeed, one purpose of panels A, B, and C of Figure 2 was to show readers that the rare recombined regions in close pairs can almost be identified by eye whereas, as recombined regions get denser on the genome, it becomes harder to distinguish them from the clonally inherited regions.
In contrast to what the reviewer implies, the conclusions that we draw from Figure 2 are very robust to inaccuracies in these estimates. We are noting broad quantitative trends that are observed reproducibly across many independently analyzed pairs of strains. For example, all close pairs with divergence less than 0.001 have more than 90% clonally inherited regions and all pairs with divergence over 0.01 have a very low fraction of clonally inherited regions. In addition, in the transition regime of divergences between 0.001 and 0.01, the estimated fraction of clonally inherited regions drops systematically with divergence. That is, even though each pair is analyzed independently, the fraction of clonally inherited regions decreases in a very systematic and reproducible way as a function of the divergence between the strains. Our conclusions only depend on this very reproducible trend seen across all pairs of strains. We find it hard to imagine how inaccuracy in these estimates could artificially produce such a clear systematic dependence on divergence.
Nonetheless, to remove any doubts about the accuracy of this procedure, we now additionally apply the same pairwise analysis to data from simulations for which the ground truth is known. We show that the results of the pair statistics accurately reflect the known ground truth in those simulations (Figure 1—figure supplement 3 and Figure 2—figure supplement 2). In addition, in Figure 2—figure supplement 2 we directly show that the estimated fractions of clonally inherited loci are always very close to the true fractions of clonally inherited loci. This analysis shows that our method is in fact conservative in that it tends to slightly overestimate the fraction of clonally inherited loci.
1.4) Further, large recombination events are easier to detect than small ones. Also, multiple recombination events can easily be called as one, whether or not they actually overlap. This makes it very likely that the median size of recombination events is overestimated, probably substantially.
We agree that overlapping recombination events can lead to overestimation of the size of recombination events. Precisely for this reason, we estimated the sizes of recombination events using only comparison of very close strains that have only a handful of recombination events between them, i.e. we stated:”Using a Hidden Markov model on close pairs, we also estimated the distribution of lengths of recombined regions (see Materials and methods),” and in the Materials and methods it says:”To estimate the lengths of recombination events, we first extracted pairs that are sufficiently close (divergence less than 0.002) such that multiple overlapping recombination events are unlikely.” It appears the reviewer failed to note this. Nonetheless, to further confirm that our approach accurately estimates the size of recombination events, we applied the same approach to data from simulations for the revision. As shown in Figure 2—figure supplement 2, the sizes estimated from the simulated data are in fact very close to the true size of the recombined fragments used in the simulation.
Note also that, even if we did overestimate the size of the recombination events, it would leave virtually all the main conclusions of the paper unaffected, because almost none of our conclusions strongly depend it.
1.5) The authors nowhere acknowledge that the recombination detection method is less than perfect, attempt to quantify its errors or allow for those errors in interpreting the patterns they deduce from it.
It is not entirely clear which of our methods the reviewer refers to here. If this still refers to the pairwise analysis, then this comment repeats comment 1.3. We thus suspect that the reviewer must be referring to the method explained in Figure 3 and used in the results of Figures 4, 5, and 6.
First, as we pointed out in the paper, this use of biallelic SNPs to detect phylogenetic inconsistencies is in fact commonly used in the literature on sexually reproducing species. Second, it is not true that we pretended this method is ‘perfect’. We explicitly discuss that the method assumes that biallelic SNPs predominantly correspond to single substitution events, i.e. that homoplasies are rare. Indeed, to support this assumption, we included calculations in the Materials and methods to estimate the fraction of homoplasies and showed that those are very rare. Moreover, in the application in Figure 6 we explicitly discussed that the method gives only a lower bound on the number of recombination events.
However, we agree that we did not explicitly try to quantify the effects of a small fraction of homoplasies (or other errors) since we thought it was obvious that our interpretation of the results in Figures 4, 5, and 6 could not possibly be substantially affected by them. From the response of this reviewer and reviewer 2, we now see that this was apparently not obvious and that we should have done an explicit quantification. Thus, for the revision we have performed several new analyses to explicitly quantify the accuracy of this method:
1) Following questions from reviewer 2 we now quantify the rate of sequencing errors and show these are so rare that they cannot possible affect any of these analyses.
2) We have extended our model for estimating the homoplasy rate by including a 3to1 transition to transversion ratio in mutations (which raises the estimated homoplasy rate from 2.5% to 3.3%). In addition, we confirm the accuracy of the estimation of the homoplasy rate using data from simulations for which the exact homoplasy rate is known (Materials and methods).
3) We introduce a new method for correcting for homoplasies. In particular, we first quantify for each SNP column in the core genome to what extent it clashes with other SNP columns in its vicinity, and then remove the most inconsistent columns from the alignment. By recalculating the effect on all of our statistics (i.e. those of Figures 4, 5, and 6) of removing either 5% or 10% of potentially homoplasic columns, we show that our results are indeed not substantially affected by homoplasies (see Figure 4—figure supplement 2, Figure 5—figure supplement and Figure 6—figure supplement 1).
4) We also calculate all the relevant statistics from Figures 4, 5 and 6 for the data from the simulations and we do this both for the original data and for data from which 5% or 10% of potentially homoplasic columns have been removed. Comparisons with the known ground truth allow us to assess the accuracy and precise meaning of our statistics (see Figure 4—figure supplement 3, Figure 5—figure supplement 2, Figure 6—figure supplement 2, Figure 6—figure supplement 3 and Figure 6—figure supplement 4).
These very substantial additional analyses and validations all confirm the validity of our methods and results.
1.6) The LD plot is not new nor (more importantly) particularly well integrated with the rest of the manuscript.
We did not claim this analysis is new. On the contrary, we explicitly stated that this is a standard LD measure. We included it because this analysis has been used in other studies, e.g the recent analysis of thermophilic Cyanobacteria from Yellowstone, which we specifically refer to.
1.7) The section on the lower bound of recombination to substitution events fails to note that a single recombination event will introduce changes in phylogeny at both ends of the event. Therefore, even if the rest of the reasoning is correct, the estimate would be off by a factor of 2.
This is correct and we thank the reviewer for pointing this out. C/M is a lower bound on the number of phylogeny changes per SNP, not recombinations. Knowing that this lower bound likely strongly underestimates the true number, we sloppily failed to properly distinguish between phylogeny changes and recombination events. In the revision we now explicitly compare the lower bound C/M of phylogeny changes per SNP obtained for simulation data, with the ratio ρ/µ of recombination to mutation rate used in the simulations. This analysis shows that, at very low recombination rates, the ratio C/M is exactly twice ρ/µ, precisely for the reason that the reviewer mentions. Thus, for very low recombination rates, the ratio C/(2M) accurately estimates the recombination to mutation rate (Figure 6—figure supplement 1). At such low ratios ρ/µ, there are many SNPs columns that separate consecutive edges of recombination events on the alignment, so that each such edge leads to detectable phylogeny clashes. However, already when ρ/µ ≥ 0.05 or so (Figure 6—figure supplement 2), a substantial fraction of recombination events goes undetected because they do not result in phylogeny clashes, so that C/M is in fact substantially lower than ρ/µ. In the regime of the E. coli data, i.e. C/M ≈ 0.13, the ratio C/M likely underestimates the ratio of recombination to mutation events by more than a factor ten.
In the revision we have now paid attention to clearly distinguish the lower bound on the number of phylogeny changes C, from a lower bound on the number of recombination events, which is C/2. However, as just noted, this factor 2 correction is small compared to the factor by which the ratio C/M underestimates the true ratio of recombination to mutation.
1.8) The authors have also made no effort to quantify how large the effect of homoplasy is on these results. It may be larger than they imagine, since each homoplasy can potentially introduce two phylogeny changes. My intuition is that only a relatively small proportion of SNPs need to be affected to have quite substantial effects. Of course, this is not tested in any way.
As discussed already above, we have now explicitly tested this and show that, even using a very conservative procedure in which the most clashing columns are removed, and a larger fraction of columns is removed than the estimated homoplasy rate, the estimate of C/M only decreases by about 16% from 0.155 to 0.13 (see Figure 6—figure supplement 1).
1.9) Even if the estimate for C is accepted, the estimate for the number of times that each site has recombined is questionable for a few reasons. The estimate of average recombination size of 20kb is totally inappropriate to use for this purpose. As discussed above it is sure to be upwardly biased. We know in any case that there are multiple mechanisms of bacterial recombination and that they have very different length sizes. Long ones are disproportionately important in break up the clonal frame, while short ones break up the phylogeny. Dividing C by M effectively assumes, I think, that all recombinations are the mean estimated length.
We would like to again point out that the method for providing a lower bound C is a very standard method in the field of sexually reproducing species.
Regarding our order of magnitude estimate for the average number of times alignment positions have been overwritten by recombination, we do not make the assumption that all segments must have the same length. The argument is very simple and just based on the estimated average length L_{r} of the recombined segments. We tried to clarify this in the revision.
Second, as we explained above, the reviewer’s claim that our estimate of the mean size of recombination segments is biased upward is based on a misunderstanding of how this estimate was done (and we now provide results from analysis of simulations to validate the accuracy of our estimation method).
Third, in the revision we use data from simulations to show that, even if we use an average length of recombined segments approximately half of what we estimate from the E. coli data, already at recombination to mutation rates as low as ρ/µ = 0.01, positions in the genome have been overwritten on average more than 12 times, and at ρ/µ = 1, positions have been overwritten on average more than 1500 times (Figure 6—figure supplement 3). In addition, using the simulation data we also show that our simple order of magnitude estimate CL_{r}/(2L) in fact severely underestimates the true average number of times positions in the genome have been overwritten by recombination (Figure 6—figure supplement 4).
But independently of all this, we make very clear in the manuscript that this is only an order of magnitude estimate that we use to conclude that each position in the genome has been overwritten many times. The precise number is completely immaterial for the overall conclusions and message of the paper. Indeed, in the revision the number has been changed from 300 to 125 due to the factor 2 we overlooked (see response 1.7) and because we now use the 5% homoplasy corrected alignments. But, as suggested by the analysis of the simulated data (Figure 6—figure supplement 4), this estimate may well be a factor 10 below the true value. That is, it is clear that we are very far from having any loci in the genome that are not affected by recombination, and this is all that matters for our conclusions.
1.10) Furthermore, even if the estimate of C/M is taken at face value, it is difficult to interpret. There are 87 genomes in the sample, so it implies that each strain has recombined about 4 times on average at each site since the shared common ancestor. But the largest number of recombination events per site per strain that their method can detect is 1. So all that this “order of magnitude” estimate implies, I think, is that assumptions of the inference of C/M are wrong in some way, as I would expect based on the arguments above.
We frankly cannot follow the reasoning in this comment. There are 92 instead of 87 genomes in our sample, but this is a detail. Mutations and recombinations are not events that happen to ‘strains’, they are events that happen in the evolutionary history of a set of strains. Each mutation at a genomic position is an event that happened on a branch of the phylogenetic tree at that position. For recombinations the situation is even more complex, i.e to precisely describe the recombination events in the history of a set of strains one needs the more complex structure of a recombination graph.
The C/M value provides a lower bound on the ratio of the number of times the phylogeny changes along the alignment and the number of mutations along the alignment. Thus if the rate of SNPs is about 1/10 and C/M ≈ 1/7, it means that once every 10 alignment columns, a substitution occurred along some branch of the phylogeny at that position, and once every 7 SNPs (or 70 bps) a phylogeny change occurred, i.e. either the start or end of a recombination event that occurred somewhere in the evolutionary history of the strains.
We do agree that, per definition C/M cannot be larger than 1, and so in the limit that recombination rates are higher than mutation rates, i.e. ρ/µ > 1, the lower bound C/2 will hugely underestimate the true number of recombination events (see Figure 6—figure supplement 2). This is precisely why this method can only give a lower bound.
In any case, the comparisons with ground truth in the simulations show that the estimate C/(2M) accurately estimates the ratio of recombinationtomutation rate ρ/µ when this ratio is very low, and gives a valid lower bound when it is large. Similarly, we also show that the estimate L_{r}C/(2L) is a valid lower bound for the average number of times each position in the alignment was overwritten by recombination. These additional comparisons with data from simulations should remove any questions about the meaning of these estimates and confirms their validity.
1.11) In any case, the authors provide little guidance on what they think the value actually means. How much of the recombination do the authors think has taken place since the evolution of the phylogroups, for example? I have no idea.
Although we tried to be already quite specific in the original submission, for the revision we have made sure to spell out precisely in the text what the various lower bounds mean. In particular,
C gives a lower bound on the number of phylogeny changes in an alignment.
C/2 gives a lower bound on the number of recombination events in an alignment.
C/M gives a lower bound on the ratio of phylogeny changes to substitutions in the alignment.
C/(2M) gives a lower bound on the number of recombination events per substitution in the alignment.
T_{est} = L_{r}C/(2L) gives a lower bound on the average number of times each position in the genome has been overwritten by recombination.
Note that we confirm the validity of all these lower bounds using data from simulations.
At the request of the reviewer, we have now also calculated these quantities for each of the phylogroups represented in our dataset (Appendix 1—table 2). In addition, we have added a supplementary analysis where we look in great detail at all statistics for the phylogroup B2, which is represented by 6 strains in our collection.
1.12) The nSNP section is interesting, particular for 2SNPs. I have not seen this in bacterial genomics and believe it can potentially be very informative. However, this section, based on weird reasoning, completely ignores the impact of clonal relationships. Pairs of strains that share a recent common ancestor will share many 2SNPs. They have not undergone large amounts of recombination. This is obviously likely to account for many of the extremely high 2SNP values. The so called “scale free” distribution, which I do not think is terribly helpful in any case – parameter estimation would be a better idea – is very dependent on these high values. Note also that individual large recombination events can give high 2SNP values between pairs of less closely strains. The authors also do not discuss this at all.
The claims of the reviewer, i.e. that the pairs with highest 2SNP counts are ‘obviously likely’ due to clonal pairs, and that the longtailed distribution of 2SNP counts is ‘very dependent’ on these clonal pairs, are both incorrect. One only needs to carefully look at Figure 9 to see that the first claim cannot be maintained. In Figure 9A strain A1 is shown to share the highest number of 2SNPs with strains D8 (214 2SNPS), A2 (196 2SNPs) and A11 (194 2SNPs). Figure 9C shows that these pairs are among the top 10 highest of all 2SNP counts. However, only one of these three strains can share a most recent clonal ancestor with A1 so at least 2 of these three pairs do not correspond to clonal pairs.
In order to determine the exact effect of 2SNPs from clonal pairs on the longtailed distributions in Figure 9C we would of course have to know which pairs correspond to clonal pairs, and this is not always obvious, as the example in Figure 9A demonstrates. However, to make a conservative estimate of how much removal of clonal SNPs could have on Figures 9C and 9D we proceeded as follows: We first assumed that all SNPs that correspond to bipartitions that occur in the clonal tree are clonal SNPs. We then removed all these potentially clonal SNPs and remade Figures 9C and D without the nSNPs corresponding to clonal SNPs. As show in Figure 9—figure supplement 1, removal of these potentially clonal nSNPs has very little effect on the distributions of nSNPs.
1.13) Furthermore, there is an apparent confusion between lineages having high average rates of recombination, i.e. measured relative to time or to rates of mutation and lineages being very nonrandom in their patterns of recombination, i.e. which strains they recombine with. What the authors are actually measuring is much more related to the latter. These results gives no indication that strains differ in their absolute recombination rates, but this is the impression given, including in the headline conclusion that is made prominently in the Abstract. It’s really extremely unclear what the authors actually think they have shown.
As we already mentioned in our response 1.10, recombinations are not events that happen to strains. It therefore makes fundamentally no sense to talk about the recombination rate ‘of a strain’. We were in fact careful to talk in the Abstract about the ‘recombination rates between different lineages’ (emphasis added). Moreover, the statement in the Abstract is about the amount of variation in the recombination rates across different lineages and this is thus per definition a statement about the relative sizes of rates.
We thought that it was clear that, while the analysis in Figures 5 and 6 quantifies the absolute amount of recombination within alignments of subsets of strains, Figure 9 shows that the relative rates of recombination between different lineages vary along a longtailed continuum that spans several orders of magnitude. To avoid any confusion, we now explicitly emphasize this point in the description of these results.
1.14) In the discussion the authors conclude (which is not novel), that the ancestral phylogeny is very difficult, or impossible to reconstruct. This is true for most bacterial species, but this is typically because there are branches at the base of the tree that are extremely uncertain. It is still likely possible to infer a great deal about bacterial evolution from inferring clonal relationships closer to the tips. For most species, this includes a fairly substantial proportion of the clonal tree. While there are the kernel of some good ideas presented, I am not sure, once the substantial issues underlined above are dealt with, that this manuscript ultimately changes our view of this in a substantive way.
It seems that this final comment is not so much a specific criticism of any of our analyses, but more an expression of the reviewer’s own views regarding the role of recombination in bacterial genome evolution. We understand the results of our analyses are at least partially at odds with these views, but we hope the reviewer can agree that this is not in itself a valid reason to reject our work.
Also, there is by no means consensus in the field on these points. For example, while this reviewer claims that it is ‘not novel’ to conclude that the ancestral phylogeny cannot be reconstructed, reviewer 2 seems to be of the opinion that it can. In fact, since reviewer 2 identified himself, we know that this reviewer is coauthor on a wellknown paper on E. coli genome evolution that makes this claim. A main aim of our manuscript is to provide new methods that can help clarify these issues.
We are of course aware that our results challenge views held by many in the field of bacterial genome evolution, and because of this we not only spend great time and effort to develop our analysis methods, extensively testing and redesigning them over more than 6 years, but we also made sure to draw conclusions based not on the results of one single method, but on the very consistent quantitative picture that emerges from multiple different and complementary methods. Thus, even if there were valid technical issues with some of our methods, we find it very hard to see how several different and complementary methods can all paint the same quantitative picture that is wrong. Unfortunately, it seems that the reviewer has completely ignored this point.
Nonetheless, for the revision we have now addressed every single one of the technical issues that were raised in the review, providing extensive additional analyses and validations to confirm that our methods were in fact all valid and support the conclusions we draw. We hope the reviewer can appreciate this and will agree with us that, having addressed all issues that were raised, there are no good reasons left to obstruct publication of this work, and allowing other researchers to learn about our analyses.
Reviewer 2  Olivier Tenaillon:
Reviewer 2’s review was long and complex and there were multiple cases where the same or highlyrelated issues came up in different contexts. Therefore, instead of going through all comments one by one, we have attempted to bring some structure in both the issues raised and all our responses to these issues.
To provide structure to the reviewer’s comments and our responses, we start from points (i)(vi) of the review, which provide a fairly accurate summary of the key points of the paper (maybe with exception of the interpretation of the results of Figures 7 and 9). To anticipate some of the responses to specific comments of the reviewer, we’ve added short comments to each of these points.
• Ad (i): phylogenies constructed from blocks of 3kb are distinct from wholegenome phylogeny (”core tree”), and distinct across blocks. However, phylogenies of 50% of the 3kb blocks are similar to the core tree.
Note that we also show that the block phylogenies are not just minor variants but that they substantially differ from the core tree. that is, we do not just do a same/different but look individually at each branch of the core tree. Half of the branches in the core tree are found in less than 25% of the block phylogenies and only a small fraction of the branches is supported by the large majority of the blocks.
• Ad (ii):when comparing pairs of E. coli genomes, SNP density is either low (interpreted as clonal inheritance of mutations) or high (interpreted as recombination). The fraction of recombination SNPs (high density regions) increases with divergence between strains. Half of the genome is recombined at a low divergence of 0.0032.
Note also that. even for close pairs, most SNPs derive from recombination. For example, for the green pair in Figure 2, at divergence 0.002, even though almost 90% of the alignment is clonally inherited, 90% of the SNPs lie in the recombined regions. That is, there are virtually no pairs where their divergence reflects the distance to the clonal ancestor of the pair and most of the divergence derives from recombinations.
• Ad (iii): only about half of the individual SNPs support the core tree
Although this is correct, the total fraction of SNPs consistent with the core tree is not very informative because a significant fraction of all SNPs correspond to a single branch (i.e. 36% of all SNPs are of the type where the outgroup has one allele and all other strains another allele). To get a more informative picture of the SNP support for the core tree we analyzed the SNP support separately for each branch of the core tree. We find that for half of the branches in the tree, the branch is rejected by 20 times as many SNPs as support it, and this applies to virtually all branches deeper in the tree (with the exception of the branch to the outgroup). The only branches that do not have a high fraction of rejecting SNPs are branches to clades of strains that are all extremely close to each other.
• Ad (iv): phylogeny changes every ∼ 10 bp along the genome, implying a ratio of recombination to mutation events of at least 0.15
Note that we calculated the ratio of phylogeny changes to substitution events (C/M) not only for the full alignment, but across a very large number of random subsets of strains (Figure 6). The results show that, the smaller the subset of strains, the more variable the C/M ratio, suggesting recombination rates vary across lineages. However, as soon as one looks at groups of 10 or more strains, the C/M ratio is never small (i.e. larger than 0.05 meaning a recombination occurs every 20 SNPs in the alignment of the subset).
• Ad (v): For a focal strain, they identify all strains that share a SNP with the focal strain and none of the others. The frequency distribution of the number of 2SNPs (SNPs present in exactly 2 strains) shared by a focal strain with each other strain is scale free. This is taken as evidence that there is no group of highly recombining clades.
This is the one point where the reviewer’s summary does not quite match what we were intending to show. The nSNP patterns were used for two complementary analyses. When we plot the distribution of 2SNPs (Figure 9C), we are not looking at a focal strain but put all 2SNPs are put’ in a pot’ and we simply plot their overall frequency distribution. That is, we are plotting the distribution of the thickness of all lines in Figure 9B (and similarly for 3SNPs, 4SNPs, etc). We interpret the fact that all these distributions are smooth and approximately scalefree as showing that the strains cannot be easily clustered into groups that have high withingroup and low betweengroup recombination rates.
The second analysis, which does use focal strains, calculates entropy profiles. Whenever a focal strain occurs in different nSNP types, all these nSNP types are mutually inconsistent, so that each nSNP type must correspond to a different phylogeny. For each focal strain and n, we can summarize the diversity of mutually exclusive nSNP types by the entropy of nSNP type frequencies (Materials and methods). This gives the entropy profiles of Figure 10 of the revision. The main result of this analysis is that, especially when n gets larger, all strains show high entropy indicating that they occur in a diverse set of phylogenies in which they belong to different subgroups of strains. In addition, the fact that the entropy profiles look different for most strains shows that the nSNP distributions are distinct across strains.
The main criticism separated into 2 parts
The main criticism is that we have not provided sufficient validation for our two claims that 1. the structure of the phylogeny is generated by recombination patterns (as opposed to clonal ancestry) and that 2. there is currently no known model that can account for the patterns we observe. I would like to start by stressing that we are not contending that there is an unambiguous phylogenetic structure in the sense that when one reconstructs a phylogeny from sufficiently many genomic loci, one tends to obtain very similar phylogenies. Indeed we explicitly show this in Figure 1. The question is what this phylogenetic structure reflects. Second, for the purpose of the discussion, it is useful to separate our claims into 2 parts:
1) We show, using several complementary methods, that recombination is so dominant over clonal mutations that the clonal phylogeny cannot be reconstructed and it is unlikely that the core genome tree corresponds to the clonal phylogeny. Moreover, even if we could reconstruct the clonal phylogeny, there is such a large number of different phylogenies along the alignment that the observed mutational patterns cannot be meaningfully captured by a single phylogeny.
2) If the core genome phylogeny doesn’t reflect clonal structure, how does one explain there is a stable phylogeny? We argue that, since all genomic loci have been affected many times by recombination and thousands of different phylogenies occur along the alignment, the overall phylogenetic structure reflects the statistics of the distribution of phylogenies that occur along the alignment. We propose that these statistics reflect the relative frequencies with which recombination has occurred among different lineages. To support this we showed that, indeed, the relative frequencies with which different subsets of strains share SNPs varies over a wide continuum spanning several orders of magnitude and that almost every strain has its own distinct statistics of SNP sharing with other strains. No population genetics model that we are aware of can reproduce such statistics.
Issues affecting part 1 of our claims.
Let’s first discuss part 1 of our claims. The main criticism of our methods for quantifying recombination is that we fail to account for alternative explanations for phylogenetic inconsistencies and the following alternative explanations are put forward by the reviewer:
Tree inference errors.
Sequencing errors.
Homoplasies.
Let’s go through points (i)(v) and examine how each of these 3 alternative explanations would affect the results we report.
Tree inference errors
Tree inference errors can only affect the analysis of (i) because none of the other results rely on tree inference. As an aside, the main point of (i) is not to present new results but to introduce the main problem that our study addresses using a wellestablished method (PhyML).
We of course agree that there can be tree inference errors but Figure 1 shows large statistical patterns of the phylogenies across all 3Kb blocks, i.e. the fraction of 3Kb block phylogenies that support each branch of the core tree. We find that the majority of the core tree branches occur only in a minority of the 3Kb block phylogenies. We cannot see how tree inference errors could substantially affect such broad statistics. Note also that each block significantly rejects the phylogeny of each other block (Figure 1—figure supplement 2), i.e the loglikelihood differences between different phylogenies on a given alignment block are never small. To argue that the apparent inconsistencies between the phylogenies of 3Kb blocks are artefacts of tree inference errors one would have to argue that the ML inference on alignment blocks is so error prone that even overall statistical patterns (i.e. fractions of blocks agreeing on the occurrence of particular bipartitions) cannot be trusted at all. This would essentially invalidate any analysis using ML phylogeny reconstruction and it seems unlikely to us that the reviewer really wants to claim that.
However, since Figure 1 is only used to introduce the problem we analyze, even if the reviewer wants to discount the results of Figure 1 altogether, this does not really affect any of the methods and conclusions we present.
Sequencing errors
We completely agree that we should have included more about the sequencing methods and the potential rate of sequencing errors. For the revision we have now included a section at the start of the methods describing the sequencing procedures and the core alignment construction using Realphy. Realphy is in fact very conservative in what alignment columns to include in the core genome alignment and we also show that, from the occurrence of multiple independent genomes that are identical in the core genome, we can bound the rate of sequencing errors to be less than 1.1 ∗ 10^{−7}.
In the methods we now also provide a new section that shows, using a simple calculation, that sequencing errors can account for at most one in every hundred thousand informative SNPs, which obviously is not going to meaningfully affect any of the results.
Homoplasies
This leaves homoplasies. We agree with the reviewer that if the rate of homoplasy were very high (e.g. if the majority of SNPs were homoplasies), then this would be problematic. Indeed, to show that homoplasies must be rare we presented analysis in the methods to give a rough estimate of the rate of homoplasy. The most relevant estimate is for synonymous sites excluding the outgroup and gives that about 2.5% of all biallelic SNPs are homoplasies (i.e. more than 1 mutation occurred).
One criticism that the reviewer had is that we may be significantly underestimating the homoplasy rate because we ignore heterogeneity in mutation rates, in particular that transition mutations are more frequent than transversion mutations. For the revision we have now redone the analysis using a 3to1 transition to reversion ratio. We show that this only raises the estimated homoplasy rate from 2.5% to 3.3%, i.e. still well below 5%.
Note that, for the revision, we have developed new software for simulating populations evolving under drift, mutation, and recombination (of the type where a genomic segment from one individual is recombined in the genome of another) while at the same time keeping track of the clonal phylogeny, and the number of times each position on each branch of the clonal tree is affected by recombination. We then used this software to simulate populations evolving under a large range of recombination rates. Throughout the revision we use data from these simulations to validate our analysis methods and to compare results on E. coli with corresponding results from simulations for which the ground truth is known.
To confirm the accuracy of the estimation method for the rate of homoplasies, we used data from our simulations and compared the estimated rate of homoplasy with the true rate of homoplasy in the alignment, finding them to be virtually identical, i.e. 2.43% estimated versus 2.56% true rate.
We thought that it would be obvious that such a low rate of homoplasy, although of course affecting the precise values of the various statistics, could not meaningfully affect any of our conclusions. However, both this reviewer and reviewer 1 question this, suggesting that even a fairly low rate of homoplasy could have large effects on the results of the analysis. To dispel this concern we developed a method that estimates the maximal effect homoplasies can have on the results, by identifying those SNP columns in the alignment that lead to the largest number of clashes with other SNP columns, and removing either the 5% or 10% most clashing SNP columns. We then redid all the analyses with such ‘homoplasycorrected’ alignments and included comparison of the original and homoplasycorrected results in the revision. In addition, we also performed such comparisons for all the data from the simulations.
Note that our method is conservative because it does not just remove potentially homoplasic columns, but systematically removes those columns that lead to the largest number of inconsistencies. In addition, we remove up to 3fold more columns than the estimated rate of homoplasies. We now discuss what this new analysis showed about the effect of homoplasies on each of our results.
Effect of homoplasies on the results (ii): The analysis in (ii) is not based on phylogenetic inconsistencies at all and just looks at the patterns of SNP density in pairwise alignments. Therefore, these analyses are not affected at all by homoplasies and neither are the conclusions from the pairwise analysis, i.e. that for the vast majority of pairs no clonally inherited DNA is left in their pairwise alignment, and that even for close pairs for which most genomic segments were clonally inherited, the large majority of their SNPs comes from recombined regions.
Note also that in the revision we extensively validate the accuracy of the pairwise analyses using data from simulations (Figure 1—figure supplement 3, Figure 2—figure supplement 1 and Figure 2—figure supplement 2).
Effect of homoplasies on the results (iii): Regarding (iii), our main observation in Figure 4 is that for half of the branches in the core tree, there are 20 times as many conflicting as supporting SNPs. As shown in Figure 4—figure supplement 2, these results are virtually unchanged for the homoplasycorrected alignments for which 5% or 10% of the most clashing SNP columns have been removed.
To further explore the possible effect of homoplasies on this type of analyses, Figure 4—figure supplement 3 shows the same analysis on the simulated datasets. We see that homoplasies only significantly affect the results when there is no recombination at all. In this case all clashing SNPs result from homoplasies and the homoplasy removal has a substantial effect, i.e. only after 5% of homoplasies have been removed does the analysis confirm that there is essentially 100% SNP support for each branch.
The analysis of the simulation data also confirm what our pairwise analysis already indicated: that even in situations where most genomic loci are clonally inherited, the large majority of the SNPs can result from recombination events. In particular, for the low recombination rate of ρ/µ = 0.01, most genomic loci are clonally inherited along all but the long inner branches of the clonal tree (Figure 1—figure supplement 3). However, for most pairs of strains, including pairs for which almost all loci were clonally inherited, most of their divergence derives from recombination events (Figure 2—figure supplement 1), and this is confirmed by the analysis in Figure 4—figure supplement 3, which shows that for simulations with ρ/µ = 0.01, half of the branches have less than 20% support.
Effect of homoplasies on the results (iv): Homoplasies are most relevant for the analysis of (iv) and we agree with the reviewer that it is worthwhile to investigate how much homoplasies can affect the results shown in Figures 5 and 6. For the revision we have now explicitly done so by redoing all these analyses on the homoplasycorrected alignments. In addition, to further validate these analyses, we have also performed all these analyses on the data from the simulations, including investigating the effect of removing 5% or 10% of potentially homoplasic sites from the alignments of the simulation data.
As shown in the new Figure 5—figure supplement 2 and Appendix 1—table 1, the distribution of the number of consecutive treecompatible SNPs is only very modestly affected by the removal of potentially homoplasic SNPs, i.e. the average number of consecutive treecompatible SNPs increases from 7.6 to 8.8 when 5% of potentially homoplasic sites are removed. For the simulation data, the removal of homoplasies only significantly affects the results for low recombination rates ρ/µ ≤ 0.01, i.e. exactly when the frequency of true recombination events is lower than the frequency of homoplasies. However, the E. coli data is clearly in the regime where homoplasies have little effect on this statistic.
As shown in the new Figure 6—figure supplement 1, the observed ratios C/M for random subsets of strains are about 20% higher for the full alignment than for the 5% homoplasycorrected alignment, i.e. C/M increases from 0.13 to 0.155 for the full set of strains. For the revision, we now show the results from the 5% homoplasycorrected alignment in the main paper. In addition, following a request by reviewer 1, we also show C/M statistics for the subsets of strains of different phylogroups represented in our data (Appendix 1—table 2).
The analysis of C/M on the simulation data allows us to compare the lower bound on the recombination to mutation ratio C/(2M) with the ground truth, i.e. the ratio ρ/µ in the simulations. This analysis shows (Figure 6—figure supplement 2) that for very low recombination rates, i.e. ρ/µ ≤ 0.01, the ratio C/(2M) actually estimates the true recombination to mutation rate very accurately. However, as ρ increases the ratio C/(2M) severely underestimates the true ratio of recombination to mutation rate. For example, for the E. coli data we have C/(2M) ≈ 0.065, which is very close to the observed ratio C/(2M) for simulations with ρ/µ = 1, i.e. almost 20fold higher.
We also expanded the section on the lower bound for the average number of times each position in the genome has been overwritten by recombination. Using data from simulations for which the ground truth is known we show that only at ρ/µ = 0.001 are there any positions left that have not been overwritten by recombination, at ρ/µ = 0.1 positions have been overwritten 120 − 200 times, and at ρ/µ = 1 each position has been overwritten 1450 − 1700 times by recombination (Figure 6—figure supplement 3). We also show (Figure 6—figure supplement 4) that our lower bound on the number of times each position has been overwritten by recombination is strongly underestimating the true number as soon as ρ/µ ≥ 0.1. Thus, although the lower bound for the E. coli data is that each position has been overwritten at least 125 times by recombination, the true number is likely larger than 1000 times.
Effect of homoplasies on the results (v): For the revision, we show in Figure 9—figure supplement 1 that the nSNP distributions of the 5% homoplasycorrected alignment are virtually identical to nSNP distributions of the full alignment. Also, none of the 2SNPs in Figure 9A were removed by the 5% homoplasy correction.
Even though homoplasies hardly affect the overall patterns we see in the nSNP statistics, for the revision we nonetheless used the 5% homoplasycorrected alignments for all the E. coli results shown on nSNP distributions and entropy profiles.
Response to miscellaneous comments regarding part 1 of our claim
For completeness, we will also address the remaining other comments that reviewer 2 made regarding part 1 of our claims.
Using a strict criterion of phylogenetic consistency
First, there are these 3 related comments:
A lot of the issues come from the use of a strict criterion for phylogeny support, that with more strains added gives an evident discrepancy as soon as a bit of recombination or homoplasy is present. So the present manuscript strongly validates the idea that there is some recombination but falls short of telling significantly how strong.
The larger the dataset, the most likely it is that each fragment will be affected to some extent by both of these processes, and the phylogeny will be different, but if one strain out of 100 is changing place in the phylogeny, the phylogeny will be said to be incompatible, though it still holds some strong information. Similarly, for the bipartition analysis, the more strains there are below a branch, the more likely it is that one of the strains will not fit the bipartition. So once again the analysis is not conclusive regarding the quantitative importance of recombination.
Analyses based on comparison of phylogenies (i), (iii), (iv) would benefit from a more progressive metric of agreement between two phylogenies. Two phylogenies can conflict for very minor reasons (e.g. different structure of small terminal subtrees) or major topological changes. The authors choose a binary measure to compare phylogenies based on the presence or not of a specific strain bipartition. But one could use a progressive measure of distance between two bipartitions, e.g. based on the transfer distance (number of taxa that must be transferred to make the two bipartitions identical) Lemoine et al. Nature 2018. In principle changes in topologies along the genome should be due to minor differences (small transfer distances).
We of course agree that, the more strains are involved, the easier it becomes to detect phylogenetic inconsistencies. We also agree that more ‘soft’ or graded measures of inconsistency across phylogenies could be useful to measure the statistics of similarities of the different phylogenies that occur along the alignment. The problem is, however, that if there is a lot of population structure such that relative recombination rates are very different across lineages (as we indeed claim is the case) then even when there is a saturating amount of recombination, the phylogenies at different loci may still show significant similarities. Therefore, one cannot reliably take similarity of phylogenies as a measure for the amount of recombination that has occurred.
A strict measure of inconsistency is in fact a more reliable way to quantify the amount of recombination. In particular, whether ‘small’ or ‘large’, each inconsistency is still an inconsistency. For each such inconsistency, there either is another explanation (e.g. homoplasy), or else it indicates the occurrence of a recombination event. The methods of Figures 36 precisely use this fact to quantify how much recombination must have occurred. And as mentioned already, the methods we use here are variations of very standard approaches used in the field of sexually reproducing eukaryotes.
We cannot understand how the reviewer can claim that we did not provide quantitative estimates of the importance of recombination. In fact, we provide several very direct quantifications of the amount and importance of recombination using several different methods. These include:
Figure 2G quantifies what fraction of the core genome alignment of pairs of strains has been recombined as a function of their total divergence. This is a direct quantitative measure of the strength of recombination that does not rely on any phylogenetic consistency evaluations at all.
Figure 2H quantifies, for each pair of strain, what fraction of nucleotide differences between them were derived ancestrally versus from recombination events. This directly quantifies how much of the pairwise divergence of each pair is due to recombination, and again this measure does not use phylogenetic consistency at all.
Figure 4 shows the relative numbers of supporting vs. inconsistent SNPs, separately for each branch of the tree. Although this is a more indirect measure of the amount of recombination, it does directly quantify the amount of inconsistency separately for each branch of the core tree.
Figure 6 provides a direct lower bound on the ratio C/M of the number of phylogeny changes to mutation events in the core alignments not only of the full set of strains, but for arbitrary subsets of any number of strains. We can hardly see what more direct quantitative bound on the amount of recombination one could give. Indeed, our simulation analyses confirm that C/(2M) is a direct lower bound on the overall recombination to mutation ratio.
Heterogeneity of rates along the genome and effects of selection:
The distribution of differences along the genome is also not supposed to be homogenous along the core genome. Some genes are much more conserved than others even at synonymous sites, and the mutation rate has been shown to be locally heterogenous. Selection on recombinant may also differ along the genome, and some regions like the Oantigene lead to strong selection of recombinant that affect the neighbouring core genes. These regions may increase the estimated contribution of recombination. Selection may also counter select recent recombinant and lead to an overestimation recombination: in a similar way based on recent phage insertion in genomes we could conjecture that ancestral genomes should be only composed of prophages.
We think it is important to distinguish between the actual rates at which mutations and recombinations are introduced into individuals of the evolving populations, and the number of substitutions and recombination events that have occurred in the evolutionary history of the sequences in the core genome alignment. We are nowhere in the article attempting to estimate the actual mutation and recombination rates precisely because it is obviously very hard to accurately estimate the size of the effects of selection. Instead, we aim to provide estimates and bounds for the number of substitutions and recombination events that have occurred in the history of the observed strains. We tried to make this point clear in the revision.
Reconciliation with approaches that claim to remove recombination events:
It is difficult to reconcile this set of analysis with how people deal with recombination in bacterial genomics. Softwares exist to remove recombination events, some of them similar to the approach used by the authors (Figure 2), e.g. Gubbins or others (Bratnextgen). These allow estimating ratios of SNPs introduced by recombination over mutation (r/m ratios). Phylogenies can then be inferred on alignments without recombination. How does this approach compare to the authors’?
Although we did not refer to the two specific methods mentioned here (i.e. Gubbins and Bratnextgen), we discussed this general issue in the discussion of the results in Figure 1, i.e. we wrote:
“How should we interpret this convergence of phylogenies to the core phylogeny as increasing numbers of genomic loci are included? One interpretation that has been proposed, is that once a large number of genomic segments is considered, effects of horizontal transfer are effectively averaged out, and the phylogeny that emerges corresponds to the clonal ancestry of the strains, e.g. [8,11]. Indeed, it has become quite common for researchers to detect and quantify recombination using methods that compare local phylogenetic patterns with an overall reference phylogeny constructed from the entire genome, e.g. [11–15]. However, the validity of such approaches rests on the assumption that this reference phylogeny really represents the clonal phylogeny, and it is currently unclear whether this is justified. Indeed, some recent studies have argued that recombination is so common in some bacterial species that it is impossible to meaningfully reconstruct the clonal ancestry from the genome sequences, and that these species should be considered freely recombining, e.g. [27].”
Like the methods we cite here, methods like Gubbins and Bratnextgen simply start from the assumption that recombination is sufficiently limited such that the clonal phylogeny can be inferred from the core genome alignment. In particular, these methods assume that there are sufficiently many genomic loci whose phylogeny has not been disturbed by recombination so that the clonal phylogeny can be inferred from those loci. However, we are not aware of any demonstration that real data of bacterial genomes are in the regime where this assumption holds and, as we point out, there is no agreement in the community on this fact. Indeed, reviewer 1 expresses the opinion that it is true and ‘not novel’ that the precise clonal phylogeny cannot be reconstructed for most bacterial species.
One of the aims of our work is to rigorously investigate this assumption and we feel that our results clearly show that recombination occurs so often that no loci evolve free from recombination, and the clonal phylogeny cannot be reconstructed at all. In other words, we maintain that methods like Gubbins and Bratnextgen start from assumptions that do not hold for the bacterial species that we analyzed. We made this point clear in the revision.
Summary, status of part 1 of our claims
In summary, all our additional analyses and validations using data from simulations show that none of the technical issues raised can substantially effect any of the results (i)(v). But even if there were valid questions about the reliability of some of our statistics, we feel that the validity of our overall conclusions are supported by something far stronger. The results of (i)(v) are based on very different and mutually complementary methods, and these analyses all independently paint the same quantitatively consistent picture, i.e. that recombination is completely dominating and has overwritten virtually all clonal structure. Even if there were a technical problem with one or other of these methods, we find it extremely hard to see how all of the methods could consistently create the wrong impression that recombination is dominating.
In order to maintain the belief that, in spite of all our results, the clonal structure is what is determining the core genome phylogeny, one would have to argue that all results (i)(v) are not only all wrong, but that they somehow also all fortuitously paint the same quantitative picture, and that for some reason they also by sheer luck happen to paint a correct picture on simulation data. We submit that this is simply not believable.
Part 2 of our claims
To most constructively discuss part 2 of our claims, we think it useful to start with the following comment from the end of the review:
Overall, a lot of remaining questions are not solved by the present model: if recombination is so massive, we would expect each transferred fragment to be itself the result of multiple fragments and therefore to leave a somehow fragmented signature of recombination, how does this model allow diversification to occur and most importantly why is the phylogeny overall stable since the use of MLST two decades ago…
First, as already mentioned above, we do not at all dispute that a clear phylogenetic structure is present in the data. Indeed we start by explicitly showing in Figure 1 that, given sufficiently many genomic loci, the phylogeny reconstructed from them converges to the core genome phylogeny. As we pointed out in the discussion of this result, some in the field see no alternative but to conclude that this stable phylogeny must correspond to the clonal phylogeny. But it is not at all impossible for there to be massive recombination and a stable genomewide phylogeny at the same time. Even if, through massive recombination, there is a large number of different phylogenies along the genome, an algorithm that attempts to fit the data assuming a single phylogeny (like PhyML), will generally still converge to a stable result when given sufficient data. To give an analogy, imagine that we fit data coming from a complex mixture of Gaussian distributions to a single Gaussian. Given sufficiently many samples from the complex distribution, the fit to a single Gaussian will become very stable, even though the underlying distribution is not a simple Gaussian at all.
Since there clearly was confusion on this point we have included new sections in the revision explaining these points in detail. In particular, we illustrate the point by constructing phylogenies (using PhyML) from autosomal sequences from the 1000 genomes project, i.e. human genomes. We show that, even though there is of course no such thing as a clonal phylogeny for human autosomal chromosomes, phylogenies reconstructed from sufficiently many loci robustly converge to the’ core’ phylogeny. Moreover, this phylogeny reflects known population structure, i.e. individuals from the same geographic regions consistently form clades in the phylogeny (Figure 8). That is, humans with ancestors who predominantly come from a common geographic area end up closer in this phylogeny because they are more likely to share recent ancestors on average across the different phylogenies along the genome. In exactly the same way, we propose that the whole genome phylogenies for many bacterial species reflect the statistics of recombination between their ancestors.
Do 'close' strains recombine more often?
It seems that strains share SNPs much more frequently with close strains. This information can be found for strain A1 from comparing Figure 7A and Figure 1—figure supplement 1, showing that strain A1 shares a lot of 2SNPs with its neighbours H5, D8, A2. If these SNPs are indeed shared because of the action of recombination, this means that a good description of recombination patterns would be one where the rate of recombination between pairs of strains is a decreasing function of their divergence. This relationship must be systematically investigated and presented. This relationship has been known for a while, has been used in models of diversification (e.g. Fraser et al., 2007) and seems a natural way to describe recombination structure. Yet this is not mentioned at all in the text.
I think this comment is very valuable because it illustrates so clearly the confusion that results from continuing to think of pairwise divergence as reflecting the distance to a clonal ancestor. That is, the reviewer’s comment presupposes that the pairwise divergences and rates of recombination are separate independent entities which do not a priori need to correlate, but that the data just happen to suggest that they do.
However, in the regime where recombination dominates almost every SNP can be thought of as deriving from an independent phylogeny. In that limit, in order for two strains x and y to have a different nucleotide in a given SNP column, the mutation must be more recent than the common ancestor of x and y in the phylogeny at that position. Thus, the divergence of a pair of strains does not reflect the distance to their clonal ancestor, but the average distance to their common ancestors across the many different trees. And these distances depend precisely on how often the lineages of strains x and y have recombined. If their lineages recombine rarely, then on average their common ancestors will lie further in the past, and if they recombine a lot, their common ancestors will on average be recent. Thus, when recombination dominates the pairwise divergence of two strains is a direct reflection of the recombination rates of their lineages, and it becomes almost a tautology that the lineages of strains that have low pairwise divergence must have recombined frequently. That is, rather than a surprising observation this correlation is in fact predicted by our interpretation of the data.
In the revision we have added comments to make this point clear. Also, we have added an Appendix in which we perform an indepth analysis of the recombination structure of phylogroup B2 (to which the strains the reviewer mentions belong) to further illustrate and confirm this interpretation.
Entropy profiles of nSNPs
We are not convinced about the interpretation of the 2SNP, 3SNP, etc., analysis. This analysis is interpreted as implying that each strain has a unique recombination profile, that “there is no natural way of dividing the strains into groups of highly recombining clades” and that individuals are not exchangeable. How do the authors quantify the uniqueness of the’ entropy profiles’ (Figure 10)?
For the revision, we have significantly extended the analysis of the entropy profiles and this answers the question of the reviewer in several ways. Apart from showing the entropy profiles of all E. coli strains in the main text, and discussing several example profiles, we now also show the entropy profiles of the strains of each of the phylogroups that are represented in the dataset (Figure 10—figure supplement 1). This analysis shows that, when a group of strains has a very recent common ancestor, their entropy profiles H_{n}(s) are not just similar, they are exactly identical. In contrast, entropy profiles H_{n}(s) for strains of the same phylogroup look similar for large n, but they are never identical. Moreover, entropy profiles of strains from different phylogroups tend to look clearly different, even at large n.
Following the reviewer’s comments, we also developed a statistical test (based on the Fisher exact test) for testing the difference in nSNP statistics for a pair of strains. This analysis shows that about 95% of all pairs of strains have significantly different entropy profiles (Figure 10—figure supplement 2). Finally, we also show that the entropy profiles of the data of the simulations look very different from those of the E. coli data. Instead of highly variable entropy profiles, entropy profiles from simulations show consistently low entropies at low recombination rates, and very similar looking entropy profiles for all strains at high recombination rates (Figure 10—figure supplement 3).
What do we mean with a population of exchangeable individuals?
Finally, the article starts by posing two cornerstones of bacterial epidemiology: a robust phylogeny and a population like structure. It is not clear to us what is the meaning of the second assertion and how recombination differs from mutation in that process. In a coalescent as much as in coalescent with recombination, the genealogy has fractal structure and any of the possible clustering relies on some arbitrary choice. Whether mutation or recombination contribute to that process, there is no clear definition of population. Only at the ecological level can one find within a population of hosts some discrete bacterial populations.
We are here referring specifically to the concept of a population of exchangeable individuals in mathematical models of evolutionary dynamics. In mathematical population genetics models members of the ‘same population’ are treated as exchangeable. For example, in coalescent models, each pair of lineages is equally likely to coalesce. In models with recombination, such as the one we use in the simulations, each pair of individuals is equally likely to recombine. Even in models with substructure, e.g. with different subpopulations with migration between them, the members within each subpopulation are still treated as exchangeable.
We believe that, in order to successfully model the nSNP statistics and entropy profiles that we observe, we would have to introduce almost as many subpopulations as there are strains in our collection. If one needs almost as many subpopulations as there are strains to successfully model the observations, it becomes reasonable to question whether models that start from subpopulations of exchangeable individuals are appropriate at all. In the revision we have attempted to clarify these points.
What would nSNP distributions look like for no or free recombination?
Moreover, while the distribution of 2SNP for one strain may reflect some recombination, it is not clear what is expected for the overall distribution in the absence of recombination. A simple coalescent process could also presumably lead to a longtailed distribution. The arguments made in the text on the typical mode of 2SNP in the case of free recombination is not substantiated and should be validated by simulations or an analytical derivation. The whole topology of the tree will affect how many sites are shared by two strains. If a pair of strains coalesced, what matter for 2SNP is the time of the coalescence of their ancestor. This is not a homogeneous distribution. We need to know how much the distribution is informative about the recombination process, but the manuscript in its present form is not addressing that question: what is expected in the absence of recombination, in the presence of homoplasy, how much a little bit of recombination is affecting the patterns… These are key question if one want as the author to challenge the whole idea of phylogeny.
These are valid points. We could have discussed much more the interpretation of these distributions of nSNPs. In addition, we agree that it is useful to see what such distributions look like for simulations of either a simple Kingman coalescent without recombination, or a model with mutation and free recombination within a wellmixed population. As already discussed, for the revision we have developed simulation code and performed such simulations with a large range of recombinationtomutation rate ratios.
In the revision we now use the 5% homoplasycorrected alignments for the nSNP analysis, and in Appendix 4 we present extensive comparison of the nSNP distributions observed for E. coli with those observed for the simulations, either without recombination, or with different recombination rates. This analysis shows that nSNP statistics observed for E. coli are fundamentally different from the nSNP statistics observed for any of the simulations. In particular, we confirm that at recombination rates as high as we observe for the E. coli data, no longtailed distributions of nSNPs are observed in the simulations.
Finally, we agree that the number of nSNPs of a given type does not only depend on the fraction of the phylogenies along the genome for which that group of n strains form a clade in the tree, but also on the average length of the branch from their ancestor to the next coalescence. In the revision we have now explicitly pointed this out.
Is our set of strains special?
Of note the strains were also isolated in some atypical reservoir of E. coli. In this ecological niche it is not impossible that recent transfer may be magnified due to higher action of phage and larger diversity than within a regular gut (colonised by a few strains maximum). However the fact that the pattern is similar to the one described by Dixit and Maslov suggest this may not be too much of an issue.
The fact that we observed very similar statistics for the other species already shows that our observations are not specific to our set of E. coli strains. However, we have also performed our analyses on a collection of E. coli genomes from public databases and confirmed that these show the same overall patterns. We could in principle also include these analyses but this would be quite some extra work and we don’t think this would bring any useful additional information.
Reviewer 3  Daniel Weissman
3.1) This is a very exciting manuscript that changed the way that I think about bacterial evolution. The key findings are:
• For several different species of bacteria, there are essentially no parts of the genome that are inherited along the clonal phylogeny in samples of size 100.
• Despite this, there is a clear preferred genome tree, indicating that there are patterns in which some strains preferentially recombine with each other. These patterns persist deep into the past, beyond the point at which every gene has recombined off the clonal phylogeny.
We thank the reviewer for these comments. It is very rewarding to see that some readers do appreciate the key points of our work.
3.2) These findings indicate that there is some hidden structure in the studied bacterial populations. I think the nature of this structure is a real mystery. It could be ecological, epistatic, spontaneously arising from genome divergence suppressing recombination, or something else entirely. I hope that this paper will spark a wave of interest in trying to figure out what population genetics should look like in most of the tree of life. The authors make a first attempt at this by looking at some interesting distributions of shared SNPs across strains. I found these distributions hard to interpret and wasn’t sure about all the details of the analysis, but I think it’s good to try out some new ways of looking at the data.
We completely agree with the reviewer. Indeed, the key question is now to understand what this ‘population structure’ setting these very different relative rates of recombination between different lineages represents. We also agree that our nSNP distributions (and nSNP entropy profiles) are only a first step toward solving this puzzle. To make clearer what these nSNP distributions represent, we now also calculate these distributions for data from simulations of completely mixed populations evolving under (neutral) mutation, drift, and recombination and compare and contrast the patterns that are observed for the real data with the patterns that are observed from these simulations. We also show what the entropy profiles look like for the traditional E. coli phylogroups.
3.3) I would partially reorganize the paper so that it’s organized more by method/analysis than species. Concretely, I would move Figure 7 (nonE. coli bacteria) and the accompanying analysis up so that they come before Figure 9 (SNPsharing distributions) and its accompanying analysis. This way, you would introduce the tree analysis in E. coli, then apply it to other species showing that you get similar results, then move on to SNP sharing.
Upon reflection, we agree with the reviewer that it makes sense to first show that the dominance of recombination is also observed for other species, and then turn to the question about what the phylogenetic structure in the data represents. Thus, for the revision we have reorganized the presentation as suggested by the reviewer.
3.4) I would try changing the analysis comparing SNPs to putative branches. Right now, it’s just consistent vs inconsistent, which is nice because it’s simple. But I don’t think it’s that informative. If you have a putative branch that splits the sample in half, and a SNP at 50% frequency, and they totally agree except for a single sample, that gets called “inconsistent”, whereas a random SNP that’s only found in a pair of individuals has a 50% chance of being consistent. Similarly, a putative branch that just lies above a pair of samples will be consistent with lots of SNPs just by chance. What about using mutual information, either instead of consistency or as an additional analysis? I think that Figure 8—figure supplement 1 indicates that you have lots of SNPs that are nearly or partly consistent, given that you recover something very close to the original tree using only inconsistent SNPs. Hopefully an MI analysis would pick these up as still being correlated with the branches. (Of course, MI can’t go negative, but you can still classify branchSNP pairs as correlated or not depending on whether the MI exceeds what you would get if you drew them both at random conditioning on number of strains under the branch and SNP frequency.)
In Figures 4, 5, and 6 the aim is to show that the SNP inconsistencies imply certain lower bounds on the number of recombination events and show that very large fractions of SNPs do not correspond to mutations along the branches of the core tree, i.e. that one cannot approximate the dynamics as taking place along the branches of the tree. For that it is essential to categorize SNPs as simply consistent/inconsistent with particular tree branches or with each other.
At the same time, we of course agree that the distribution of SNP types is not random at all but that some lineages are much more likely to recombine with each other than others. That is, whereas recombination is so frequent that many thousands of phylogenies occur along the alignment, these phylogenies are not completely random but drawn from some biased distribution over phylogenies. How to efficiently capture or summarize this distribution over phylogenies is an entirely separate question. In the paper we have proposed the nSNP distributions and nSNP entropy profiles of individual strains. For the revision we now also provide a new statistic that quantifies the similarity of the nSNP statistics of two strains. We agree that additional more graded measures of similarity between SNPs could be helpful but it is not entirely clear to us what precisely measures such as the mutual information that the reviewer proposes would capture. It is still the case that a single recombination event can either make a large or a small change to the SNP pattern.
Figure 8—figure supplement 1 simply shows that, if one is forced to summarize the patterns of SNP sharing across the strains (which really reflects the distribution of phylogenies along the alignment) by a single ‘average’ phylogeny, then this average is very similar whether one includes the SNPs that actually fall on this tree or not. This observation is analogous to observing that the median or average of a sample from a distribution is typically unchanged under removing data points near this median/average.
3.5) I don’t understand the argument that because Figure 9C shows broad distributions, that must mean that strains are not organized into subpopulations. Figure 11—figure supplement 3 shows a similar pattern for humans – are the authors claiming that subpopulations are not a good description of the human data? At the very least, couldn’t the subpopulation sizes be broadly distributed? But more basically, do we know what these distributions should look like in a simple wellmixed, recombining WrightFisher population? I’m sorry, I really meant to simulate this, but I’m turning in this review late as it is. Maybe you get that the occurrence of a particular kind of nSNP can be dominated by its occurrence on a single genome block?
We agree that interpreting what the nSNP distributions indicate regarding the population structure is still challenging at this point. We presume that, if there were clear subpopulations with high recombination rates within them, and low recombination rates between them, then this would be visible as specific substructure in the nSNP distributions. However, we agree we have no strong argument to support this and for the revision we have rewritten this section to make clear that the precise interpretation of the nSNP distributions is currently unclear.
Following the suggestion of the reviewer, we now extensively compare the observed nSNP statistics with statistics from simulations of a simple wellmixed WrightFisher model with the product of population size and mutation rate set to match the sequence diversity observed in our E. coli strains, and recombinationtomutation rates ranging from ρ/µ = 0.001 to ρ/µ = 10 (Appendix 4). We show that the nSNP distributions observed in these simulations differ significantly from what we observe for the real data. In the limit where recombination is very weak there are not many nSNP types and they essentially capture the structure of the clonal tree. When recombination is high (like for the real data) the complete mixing causes different nSNP patterns to occur with similar frequency, and the distributions are no longer longtailed.
With regard to the evidence of the population separating into subpopulations, we have moved our discussion of this issue to the discussion of the entropy profiles, since these contain more relevant information about this question.
Detailed questions/comments:
3.6) What’s the allele frequency spectrum look like? I don’t think I saw this anywhere in the manuscript.
This was shown in Figure 4—figure supplement 2. We have extended this analysis to include data from the simulations and the allele frequency spectra are now shown in Appendix 4—figure 1.
3.7) How well does the consensus tree capture the pairwise distances among strains?
We now show this in Figure 4—figure supplement 1. Since the pairwise divergences vary over 4 orders of magnitude, there is a good correlation between the pairwise divergences and the pairwise distances in the core tree, but the distances along the tree are systematically lower than the true distance. Moreover, the branch lengths of the core tree and the observed frequencies of SNPs falling on each branch differ by up to a factor 100 (Figure 4—figure supplement 1, right panel).
3.8) Could you say more about the potential effects of the SNPs that correspond to multiple mutational events? If I understand the numbers correctly, in one of the 3kb blocks you expect about 5 SNPs to actually come from two independent mutations. Does PhyML take this into account? I don’t think it should be a big deal for most of the rest of the analysis, since Figure 5 indicates that you see inconsistencies after about 10 SNPs, and you don’t expect a multiorigin till you have about 50 SNPs, but I think it would be worth going through this explicitly.
We are glad to see that the reviewer appreciates that a backoftheenvelope calculation already shows that multiple mutations (homoplasies) cannot substantially affect our analyses. However, we agree a explicit calculations are helpful. For the revision we have done extensive analyses of the potential effects of homoplasies. First, we updated our estimate of the rate of homoplasy taking transitiontotransversion bias into account. We now estimate about 3.3% of biallelic SNPs to correspond to homoplasies (this previously was 2.5%). Second, we developed a method for ranking SNP columns by the overall amount of phylogenetic clashes they generate with other SNPs in their neighborhood. To correct for homoplasies we then removed either 5% or 10% of the most clashing SNPs from the alignment and redid all our analysis with this homoplasycorrected alignment. We also investigated the effects of homoplasy removal for all the simulation data. These analyses show that, indeed, none of our results and conclusions are significantly affected by homoplasies.
With regards to PhyML, this indeed implements a full likelihood model that does allow for multiple substitutions.
3.9) What do the wiggles in Figure 9C mean? They’re produced by the tree structure, right, and correspond to approximately exchangeable strains?
We’re sorry but we are not sure what the reviewer has in mind. We can see wiggles in Figure 7C (now 9C) but we do not know what causes them and cannot think of an easy way to figure this out. Frankly, we would already be happy to have a theory for the overall slope of these approximate powerlaws.
3.10) For the human analysis, was it too much computationally to do more than just chromosome 21? Could you do all the autosomes for individuals from just one subpopulation at least?
For the revision we have now done chromosomes 112.
3.11) Did you ever think about just running STRUCTURE on the E. coli data and seeing what it said for different values of k? Maybe it would find “subpopulations” that would be clustered on the consensus tree.
We hope the reviewer can understand that, given very large number of additional analyses we already performed in response to reviewers 1 and 2, we feel that this rather exploratory additional analysis is beyond the scope of this manuscript.
3.12) I don’t find the power law fits to the SNPsharing distributions very convincing, particularly the ones shown in Figure 11—figure supplement 1. I would cut the left panel of Figure 9, and I would qualify the sentence on p13 starting “Moreover…”
We do not agree with the reviewer here. For some reason, whereas one rarely hears complaints about fitting a linear trend to scatters that are far from straight, nor complaints about fitting an approximate exponential increase or decay to scatters on a semilog plot, the moment one fits a line on a loglog plot there are immediately complaints that the data are ‘not really power laws’. We are not claiming that these are perfect power laws. However, it is a fact that these scatters look far more straight in loglog scales than on linear or semilog scales. The fits just capture the overall trend of these longtailed distributions. We have now explicitly pointed out that we are not claiming that these distributions are perfect powerlaws.
3.13) What do the entropy curves look like for the human data?
For the revision we have performed this analysis, which is shown in Figure 11—figure supplement 3. Interestingly, the entropy profiles are virtually identical for all individuals, except for the individuals with African ancestry, which have clearly larger entropy at large n. Note that the similarity of the entropy profiles doesn’t imply that there aren’t any human subpopulations. It just implies that the diversity of different nSNPs across the genome increases in a similar fashion for all individuals.
[Editors' note: we include below the second set of reviews that the authors received from another journal, along with the authors’ responses.]
Reviewer 1  Daniel Falush:
1.1) This paper is substantially improved, thanks especially to the introduction of simulations and extensive reworking of many sections and the fixing of logical and other errors. I still have a number of substantial problems with it, as I will explain below. I do find the response more confrontational and even highhanded than is necessary, to the point of rudeness. Within the cover letter this is exemplified by bringing out one clich´e of the selfproclaimed oppressed revolutionary scientist, Max Planck’s quote that science progresses one funeral at a time. Which comes perilously close to implying that the field would be in better shape if me and the other people who were not happy to accept the work in its current form were no longer in it. The counterpoint to that quote is the opinion that I first saw in a movie about the life of Sigmund Freud but I believe has another original source: “The work contains much that is new and much that is true. However that which is new is not true and that which is true is not new.” Which is what the old turks in most scientific fields think of the great majority attempts to foment revolution. And at least 90% of the time they are right.
We are sorry that our attempt at levity toward the end of the cover letter appears to have misfired. We hope that the reviewer noticed that we did not approvingly cite Max Planck, but instead pointed out that we are much more optimistic than Max Planck and expected that we can convince the ‘old turks’ that our work contains results that are both new and true. We’ll leave it up to the reviewer to decide whether this optimism is warranted.
1.2) The manuscript is greatly improved and some important errors have been fixed. I thank them for the extensive time and effort they have put into this; once they did it they made an effort to do it properly but I wish to remind them that reviewers and editors put their time in gratis. We could be doing our own research instead. The onus is on authors, not editors or reviewers to make the manuscript and its relationship to previous literature clear and unfortunately this still has not been achieved. Despite the many improvements, establishing what is new and what is true in this manuscript remains a substantial problem, firstly because the Introduction continues to set up a straw man for what is actually believed within the field. The authors correctly point out that neither a phylogenetic approach or a population genetic approach is fully applicable to bacteria. However this is common knowledge in the field. For example Didelot and Falush, 2007, when attempting to apply ClonalFrame to entire bacterial species state: A more ambitious use of the method is to attempt to define the clonal relationships and imports for a sample from an entire bacterial species. The assumptions of the model fit less well since in many instances recombination will reassort existing polymorphisms rather than introduce novel ones. In addition, the deeper branches in the genealogy might be difficult or impossible to resolve accurately, especially if there has been substantial recombination so that most of the genome of each strain has been exchanged since they shared a common ancestor. The authors imply that it is believed within the bacterial population genetics field that phylogenetic methods should recover the true genealogy but this is simply not true. ClonalFrame, Gubbins and other methods reconstruct genealogies while simultaneously trying to account for recombination events, in the latter case by removing recombined regions from the alignment and clonalframe attempts to use MCMC to represent the substantial uncertainty that exists about the genealogy while other methods (BURST, BAPS, PopPunk) forgo attempts to reconstruct an entire genealogy and instead try to identify relationships amongst closely related strains at the tip of the genealogy. Insofar as people in the field do use phylogenetic methods to infer relationships, this is a compromise, most usually due to computational issues. Unfortunately the MCMC approach to inferring genealogies used by ClonalFrame does not scale up, so ClonalframeML uses a maximum likelihood tree but this does not mean that the authors of that paper believe it to be the correct answer and I myself am currently involved in attempting to make a method that is based on inferred local genealogies and that attempts to indicate which branches cannot be inferred with any statistical confidence. The Introduction of the manuscripts states: First, we find that for most bacterial species recombination is so frequent that, within an alignment of strains, each genomic locus has been overwritten by recombination many times and the phylogeny typically changes tens of thousands of times along the genome. Moreover, for most pairs of strains, none of the loci in their pairwise alignment derives from their ancestor in the ancestral phylogeny, and the vast majority of genomic differences result from recombination events, even for very close pairs. Consequently, the ancestral phylogeny cannot be reconstructed from the genome sequences using currently available methods and, more generally, the strategy of modelling microbial genome evolution as occurring along the branches of an ancestral phylogeny breaks down.
In my view, none of these claims are new, or indeed depart obviously from the informed consensus in the field.
In his previous review the reviewer characterized our paper as making ‘provocative claims that are not supported by the evidence provided’, and the review focused on pointing out perceived problems with our methodology, leading the reviewer to claim that our methods were sufficiently flawed that our conclusions could not be trusted. In response we thoroughly addressed all methodological issues raised and showed that our conclusions were in fact fully warranted. The reviewer appears to accept this because (with the exception of a discussion of potential underestimating of the average length of recombined regions that we respond to below) the reviewer no longer raises any technical issues. In light of this, we hope the reviewer can understand that we are quite surprised that he now takes the position that it is unclear to what extent any of our claims are novel. In response, we first of all note that this position is clearly at odds with those of reviewers 2 and 3.
The central novel insight of our work is that the robust phylogenetic structure that is evident in whole genome alignments of bacterial strains does not reflect clonal ancestry, but reflects population structure. We indicated this in the title of our manuscript, we explained it at length in the paper, and again in our responses to the reviews. In addition, to avoid any conceptual misunderstanding on this central point, we included a new section in the first revision showing how population structure can cause robust phylogenetic relationships to appear even in a manifestly nonclonal species such as human.
Unfortunately, the reviewer simply does not engage at all with this central result. It is unclear to us whether the reviewer has not understood this central claim, or that he simply rejects it without giving a justification for this rejection. It is clear from the reviewer’s comments, however, that he continues to assume that robust phylogenetic structures in whole genome alignments can only reflect ancestry. Although the reviewer quotes from Didelot and Falush, 2007, to point out that it has been acknowledged that approaches such as ClonalFrame are not perfect, and that it may be impossible to accurately reconstruct branches deep in the clonal tree, the entire premise of such methods is that much of the clonal tree can be inferred from the whole genome alignment. Indeed, in the section of Didelot and Falush, 2007 following the passage that the reviewer quotes, these authors apply ClonalFrame to a dataset of sequences from multiple Bacillus species, implying that they believe clonal relationships can be successfully inferred even beyond individual species. This is also consistent with statements of this reviewer in his first review: “In the discussion the authors conclude (which is not novel), that the ancestral phylogeny is very difficult, or impossible to reconstruct. This is true for most bacterial species, but this is typically because there are branches at the base of the tree that are extremely uncertain. It is still likely possible to infer a great deal about bacterial evolution from inferring clonal relationships closer to the tips. For most species, this includes a fairly substantial proportion of the clonal tree.” In summary, while the reviewer acknowledges that it may be difficult to reconstruct the entire clonal tree, it is clearly assumed that approaches such as ClonalFrame, Gubbins, and ClonalFrameML can be used to recover much of the clonal tree.
Our work fundamentally challenges this assumption and several passages in our manuscript explicitly stated this:
Subsection “Phylogenies of individual loci disagree with the phylogeny of the core genome” introduces the main assumption, explicitly cites methods such as clonalframeML as examples of it, and makes clear we are going to question this assumption: “How should we interpret this convergence of phylogenies to the core phylogeny as increasing numbers of genomic loci are included? One interpretation that has been proposed, is that once a large number of genomic segments is considered, effects of horizontal transfer are effectively averaged out, and the phylogeny that emerges corresponds to the clonal ancestry of the strains, e.g. [8,11]. Indeed, it has become quite common for researchers to detect and quantify recombination using methods that compare local phylogenetic patterns with an overall reference phylogeny constructed from the entire genome, e.g. [11–15,27]. However, the validity of such approaches rests on the assumption that this reference phylogeny really represents the clonal phylogeny, and it is currently unclear whether this is justified.”
In subsection “Phylogenetic structures reflect population structure” after having shown that recombination is so common that hardly any clonally inherited DNA is left in the genome alignments, we explicitly point out that these results invalidate this assumption: “As we mentioned in the Introduction, some researchers interpret this convergence to the core tree to mean that the core tree must correspond to the clonal phylogeny of the strains. The interpretation is that the SNP patterns in the data are a combination ‘clonal SNPs’ that fall on the clonal phylogeny, plus a substantial number of ‘recombined SNPs’ that were affected by recombination, which act so as to introduce noise on the clonal phylogenetic signal. In this interpretation, trees build from individual loci can differ from the core tree because the ‘recombination noise’ can locally drown out the true clonal signal, but once sufficiently many loci are considered, the recombination noise ‘averages out’ and the true clonal structure emerges. However, this interpretation makes two assumptions that do not necessarily hold. First, assuming that the effects of recombination will ‘average out’ when sufficiently many loci are considered only holds if recombination is effectively random and does not have clear population structure itself. Second, for the clonal phylogeny to emerge one has to assume that there are sufficiently many genomic loci left that have not been affected by recombination, whereas our analyses above have shown that there is no clonally inherited DNA left for most pairs, and that each locus in the genome has been overwritten many times by recombination.”We then go on to show that the first assumption, i.e. that there is no population structure, also has to be rejected, and that instead of averaging out, this population structure can cause robust phylogenetic structures to appear.
Finally, in the discussion we summarized our findings by stressing that methods which assume that much of the clonal tree can be recovered are inherently flawed: “Given this, there is no reason to assume that a phylogeny reconstructed using maximum likelihood from the core genome alignment corresponds to the clonal phylogeny of the strains. Moreover, methods that aim to reconstruct the clonal phylogeny from a subset of positions that have not been affected by recombination are also inherently problematic because our analysis shows that, for most species, such positions simply do not exist.”
As a general principle, we try to avoid directly attacking previous work of other researchers as fundamentally flawed in our paper. Instead, we assumed that the passages cited above would make it obvious to readers that approaches such as ClonalFrame, Gubbins, and ClonalFrameML are fundamentally flawed because they rely on assumptions that do not hold.
However, since this was apparently not clear to the reviewer, we can explain more explicitly what the problems are with an approach such as ClonalFrame. For a given set of sequences, ClonalFrame aims to reconstruct the clonal tree of these sequences using a model that assumes that
The clonal tree follows Kingman coalescent statistics.
Along each branch of the tree, some genomic segments evolve clonally at a constant mutation rate, while other segments are overwritten by recombination events.
The lengths of the recombined segments are exponentially distributed with a mean length fitted from the data.
It is assumed these recombination events do not copy a DNA segment from another member of the population, but rather simply bring in random substitutions at a fixed rate.
There are several problems with these assumptions. First, there is no reason to assume that the clonal relationships of real sequences should follow Kingman coalescent statistics. Second, the average length of the recombined segments is typically fitted to something relative small, e.g. 1 kilobase. Given the assumed exponential distribution, this would make it virtually impossible to ever see recombined segments of 10Kb or more. In contrast, based on comparison of very close pairs of strains in E. coli, as much as half of the recombined segments range from 20 to 80 Kb. However, these are not our main concern. The main concern is the last assumption, i.e that recombination events simply bring in random substitutions at a fixed rate. It is not only unrealistic to assume that all recombined segments are always at a fixed divergence from the sequence they recombine into, more importantly, it assumes that there is no population structure at all. As a consequence, in this model recombination cannot introduce any phylogenetic structure by construction and any phylogenetic structure evident in the data will per definition be assumed to reflect clonal relationships. Thus, approaches such as ClonalFrame will mistake the phylogenetic structure introduced by population structure for clonal relationships by construction.
In the latest revision we have edited the text in a number of places to make these points clearer, including some comments in the Discussion.
1.3) There is a significant (and not cited) literature on estimating ratio of changes due to recombination and mutation (most notably an article by Vos and Didelot from 2008) which show that there are many species where R/M is substantially above one. Are the predictions of the authors actually substantially different from these either methodologically or in terms of the estimates produced. If so why? Even if they are, the qualitative conclusion is not different.
In order to determine R/M, the study of Vos and Didelot uses ClonalFrame to determine both the clonal tree, as well as the recombination events. For reasons just explained, we do not believe one can validly estimate the extent of recombination in this way. More importantly, however, we never estimated a value for R/M in the paper, because we believe this is conceptually misleading. That is, one of the key results of our paper is that there is no such thing as a single ratio R/M for a bacterial species. The ratio R/M depends both on the ratio of mutation and recombination ρ/µ as well as on the number of substitutions that are introduced per recombination event, which depends both on the lengths of the recombination events, and on the divergence between the donor and acceptor strains of the recombination events. Since our results show that there is a significant population structure, with rates of recombination varying dramatically between different lineages, it would be misleading to pretend that a species can be represented by a single ratio R/M.
We note that we explicitly pointed this out in subsection “lower bound on the ratio of phylogeny changes to substitution events” of the previous revision: “The observed ratio C/M for the E. coli alignments is very similar to the value of C/M observed in the simulations with ρ/µ = 1 (Figure 6—figure supplement 2). Although it is tempting to conclude from this that the ratio of recombination to mutation rate must be close to 1 for the E. coli strains, such a conclusion would be unwarranted. The evolutionary dynamics of the simulations makes several strong simplifying assumptions, i.e. that the clonal phylogeny is drawn from the Kingman coalescent process, and that the population is completely mixed so that all strains are equally likely to recombine. Both these assumptions may not apply to the evolution of E. coli in the wild. Indeed, we will see below that there is strong evidence that relative recombination rates vary highly across lineages so that instead of a single recombination rate, there is a wide distribution of recombination rates between different lineages. Therefore, it could be misleading to describe E. coli’s evolution by a single recombination rate ρ and we instead focus on providing a lower bound C/M on the number of phylogeny changes per SNP column, which is a meaningful quantity independent of the precise evolutionary dynamics that caused the substitutions and phylogeny changes, and can be calculated independently for any subset of strains.”
This point was reiterated in the Discussion section: “Given that recombination rates vary over orders of magnitude across different lineages, the idea of an effective single recombination rate for a species might be misleading, and it seems problematic to fit the data to models that assume a constant rate of recombination within a species [34].”For the latest revision we have added the citation to Vos and Didelot at this point.
We note that closest to estimates of R/M are the results shown in Figure 2H. These show, for each pair of strains, the fraction f_{r} of the single nucleotide differences between the pair of genomes that is due to recombination. The ratio R/M thus roughly corresponds to f_{r}/(1 − f_{r}) in the limit of very close pairs. However, as can be seen from the figure, the ratio f_{r}/(1 − f_{r}) becomes highly variable for close pairs that are mostly clonal. This already suggests that R/M can vary significantly across different pairs of closely related strains, and our results on nSNP statistics confirm this. In the latest revision we have pointed this out in the section on the pair statistics.
In summary, both our approach and our results are qualitatively different in that our results show there is no single ratio R/M.
1.4) The claim about most pairs of strains having no genome shared by recombination is also true for many species. See for example our preprint”why panmictic bacteria are rare”, which shows that for many species, the recombination scaled effective population size is 100 or more, which indeed means that for most pairs of strains the proportion of shared inherited clonal frame will be vanishingly small.
Like in the other studies that the reviewer cites, the cited paper uses a model that assumes that the clonal phylogeny obeys Kingman coalescent statistics, that there is no population structure (i.e. panmixia), and that recombination introduces segments of DNA at a fixed divergence from the donor strain. In particular, it uses such a model to estimate N_{e}×r. The main result of this work is that the data rejects the assumption of panmixia and we in fact cited this paper in our Discussion section in this context.
However, given that the panmictic model was rejected by the data, we do not see how it can be used to accurately estimate what fraction of pairs have still clonally inherited DNA in their alignment. In fact, the methods used in the paper seem to assume a priori that pairwise divergences are dominated by recombination.
1.5) Phylogeny (I would prefer the more exact terminology local tree or local genealogy) changing tens of thousands of times is not a quantitively surprising prediction (and is very dependent on the size of the sample). We do not currently have the technology to make robust inferences of tens of thousands of distinct local trees but we know that the tree changes many times. This is just not new either. The second set of claims potentially has more novel elements but has a number of problems of coherence. Second, we show that the structure represented in whole genome phylogenies of microbial strains does not reflect ancestry, but instead the relative rates with which different lineages have recombined in the past.
“ancestry” is normally used (e.g. in human genetics) to reflect the gene pools from which DNA comes. I think that here it is meant to mean “clonal inheritance”. But if this is what is meant it is an overstatement since the authors do analyse pairs of closely related strains in order to estimate relative ratio of recombination and mutation. In so far as it is a coherent statement it simply seems to be a restatement of the claims made in the paragraph above.
We agree with the reviewer that the term ancestry may be ambiguous and have rephrased in our latest revision to make clear that we are referring to the clonal phylogeny.
Apart from this, the reviewers comments just repeat what we already addressed in response 1.2, i.e. the reviewer does not engage with the central claim made in the title of our paper that the whole genome tree reflects population structure. We have rephrased this section of the Introduction in an attempt to be clearer.
1.6) Whereas almost every short genomic segment follows a different phylogeny, these phylogenies are not uniformly randomly sampled from all possible phylogenies, but sampled from highly biased distributions.
Some parts of every local tree will follow the clonal frame and thus be a highly nonrandom subset of all possible local genealogies. So this is predicted even under a standard coalescent with gene conversion model. The bias can of course be made larger by nonrandom recombination.
This comment simply revisits the view that substantial parts of the local genealogies will reflect the clonal phylogeny. However, this is only true if a substantial fraction of the genome is unaffected by recombination along each branch of the clonal tree, and our results show that this is not the case. While the reviewer acknowledges that nonrandom recombination, i.e. population structure, can increase biases in the distribution of phylogenies along the genome, he somehow fails to acknowledge that this nonrandom recombination can itself induce phylogenetic structure. We expected that our analysis of the human data would have driven this point home, but for the revision we have rewritten this section in an attempt to further clarify this point.
1.7) In particular, the relative frequencies with which particular subclades of strains occur in the phylogenies at different loci follow roughly powerlaw distributions and each strain has a distinct distribution of cooccurrence frequency with the other strains.
It took a great deal of effort before I was able to start to parse this key sentence and I am still left guessing what cooccurrence frequency refers exactly to.
Whenever a SNP occurs that is shared only by the triplet of strains (A,B,C), then the strains (A,B,C) form a clade in the phylogenetic tree of that particular locus. The statement simply means that we observe that the frequencies of different triplets (A,B,C) follows a powerlaw. For the revision we have rephrased so as to make this clear.
1.8) Since almost every strain has a unique ‘finger print’ of recombination rates with the lineages of other strains, the assumption that at some level the strains can be considered as exchangeable members of a population, also fundamentally breaks down.
The idea that recombination is nonrandom is also extremely well known within the bacterial field. This is firstly due to laboratory experiments that have shown that in many species recombination rates are suppressed as a function of the number of mismatches between donor and recipient. Secondly, there have been studies done on E. coli, which show that recombination is highly nonrandom, with higher rates occurring within phylogroups than between them. Therefore this claim is not new either. See Didelot, Meric, Falush and Darling, BMC Microbiology 2012, and also the literature cited therein.
Therefore, I have got to the end of the Introduction and still have little idea what the novel claims of the manuscript actually are.
We are well aware that biases in recombination have been shown previously, and in fact explicitly discussed the suppression by mismatches in the Discussion section, including citations. However, we are not aware of any studies showing or even suggesting that the phylogenetic trees that are reconstructed from whole genome alignments reflect the statistics of biases in recombination rates.
1.9) We can also provide a rough estimate for the typical number of times T that each position in the genome has been overwritten by recombination in its history.
It needs to be made clear what “its history” means. This means anywhere across the clonal genealogy of these strains. It does not cover events before their common ancestor. It is the property of a sample of strains. It is not the property of any one strain for example, since this will also have undergone recombination events going back to LUCA.
We are referring to the history of a particular position in the core genome alignment. Note that, because of recombination, the nucleotides at different loci derive from different ancestors with different histories and a different common ancestor. It is thus not only a property of the set of strains but also of the specific position and T corresponds to the average across positions. We have rephrased in the revision to make this point clear.
1.10) I thank the authors for doing simulations. The reasons I am personally sceptical of these estimates of T is that I think 20,000 base pairs is a substantial overestimate of the mean length of the recombination events that change the local trees. Some recombination events will be this long and these are the easiest events to detect. However there can also be many shorter events that still introduce 2 changes in local genealogies. It is known for example that individual events can lead to introduction of several fragments, many of which are short, within a long sequence block. For example see Golubchik et al. Nature genetics 2012. These events naturally get fused together in the kind of pairwise analysis that the authors do. It can be that the events of more than 20,000 bases are responsible for half of the Introduction of DNA but only responsible for a very small fraction of changes in local genealogy. If the authors estimated mean recombination fragment size by methods based on LD rather than by attempting to characterize individual events they would I think get much shorter estimates. I am not exactly sure what the authors should do to address this issue. One is to ask whether the pattern of LD as a function of genetic distance is really consistent with 20,000 bp imports. Another is simply to acknowledge the methodological difficulties.
The reviewer imagines that our identification of recombination segments in very close pairs may miss short segments, and that this may cause us to significantly overestimate the average length of recombined segments.
For this revision we have included a new Appendix 2 discussing why we believe it is unlikely that we are underestimating the average length of the recombination segments by a significant factor. To summarize: Our method does not have trouble detecting relatively short segments, i.e. fully 30% of the recombination segments we identify are only 12Kb long. Second, because typical pairs of E. coli strains are 1 − 3% diverged, even a recombined segment of only 250bp long would bring in 47 mutations, and this would easily be detectable in the alignment of a very close pairs of strains, which have at most 2 SNPs or less per kilobase in clonally inherited regions. In order for our method to plausibly miss a recombined segment it would have to be shorter than 100bp. Third, as we discuss in the Appendix 2, in order for our estimate of the average length of the recombination segments to be too large by a factor n, we would have to assume that for each recombination event of length 250 bp or more that we can detect, there are n recombination events of length 100 bp or less that we cannot detect.
The LD analyses that the reviewer mentions are based on explicit population genetics models that make a number of assumptions that, according to the results we present, simply do not hold. Consequently, we do not believe the results of such analyses can be trusted. To give just one example, such analyses typically assume that lengths of recombination segments are exponentially distributed and typically fit a mean length of 1 kilobase or less. Under such a model, recombination segments of length 10Kb or more would be essentially impossible. However, our pairwise analysis of very close pairs clearly shows that recombination segments of tens of kilobases long are in fact common (Figure 2A, B, and J).
The only statistic that we report that is affected by the estimated average length of the recombination segments is the average number of times T each position in the alignment has been overwritten by recombination. This latter estimate is based on the lower bound on the number of phylogeny changes in the alignment. Comparison with simulations suggests that this estimate may well be 10fold below the true value. Thus, in order for our estimate to still be an overestimate one would have to assume that our 20Kb recombination segment length (which is already 50% less than the actual mean that we estimated) is an overestimate by 10fold or more. We think it is very implausible that there are 10 times as many recombination events of length 100 bp or less as all recombination events of length 250bp or more.
In addition, in response to a suggestion by reviewer 3 we have included an Appendix 3 with a complementary estimate of the number of times each position has been overwritten by recombination, based on the statistics from the pair analysis (see response 3.5). The results of this analysis are entirely consistent with our estimate based on the minimal number of phylogeny changes across the genome alignment, i.e. suggesting that each position has been overwritten somewhere between 200 and 500 times by recombination on average.
Finally, as already pointed out in our previous response, even if our estimate were still a significant overestimate, this would still not affect any of the main conclusions of our paper.
For the revision we have further rephrased the discussion of this estimate to make clear that it could be affected if there were very large numbers of very short recombination events that we cannot detect.
1.11) The consensus in the field based on careful work of many scientists is that M. tuberculosis has undergone no recombination at all in its history back to its last common ancestor and that when authors detect it, it is due to methodological issues (either with the analysis method or with the data). If the authors wish to claim that they have detected positive evidence for recombination in the species, they need to spell out why. Otherwise, they need to comment on which the deviation from this established truth says about the practical limitations of their methods.
We agree that, while there are some reports in the literature of evidence for recombination with M. tuberculosis, the consensus is that, if there is any recombination, it must be very rare. In line with this, almost all of our statistics on M. tuberculosis confirm that, if there is any recombination, it is really rare. The pair statistics infers that all pairs are almost fully clonal, there is no diversity of nSNP patterns, and the nSNP entropy profiles all have low entropy. The only potential evidence of recombination is the occurrence of a (small) fraction of SNPs that clash with the core tree. Since these clashing SNPs are relatively rare, i.e. roughly one every 4.7Kb along the genome, we do not think it can excluded that they are due to problems with either the assembly of some strains, or their alignment.
Since the reviewer asked we have performed some more analysis of the M. tuberculosis SNPs for this revision. We discovered that the majority of clashing SNPs involved a particular pair of strains and found that these two strains have since been removed from the database because of evidence of contamination in these sequences. In total the genomes of 5 of the strains that we used have since been retracted. We removed these strains from the alignments and redid all analysis to find that the rate of clashing SNPs has now been reduced to once every 12.7Kb. Although there are still clashing SNPs left, we obviously cannot exclude that there are problems with the assemblies of some of the other strains as well.
For the revision we have edited the way we describe our M. tuberculosis results, and have added comments about the retracted strains and their effect on the frequency of SNPs that clash with the core genome phylogeny.
1.12) Figure 10—figure supplement 3 and Figure 11—figure supplement 2 are interesting because they show a deviation of the E. coli and other species data from the standard model of evolution with gene conversion model. It is nice to see that H. pylori looks like a freely recombining species, while M. tb looks like a nonrecombining species according to this pattern (I think these plots should be given the same scale). This is the first part of the manuscript where I think I am seeing a new and interesting way of visualizing the data. However, frustratingly, the difference is not worked through and explained. My view is that these figures should be brought into the main text and the difference in pattern between simulated data with intermediate recombination rates and what is seen in E. coli should be explained carefully. The problem with the current main text figures is that they are very dry; it is far from obvious what they actually say about what is happening in E. coli and it takes a great deal of effort to fight through and understand them.
As stated above, in our 2012 paper we showed that recombination was inhomogeneous within E. coli. We used ClonalOrigin to infer recombination events and found more events within clades than between them. The visualization in Figure 10—figure supplement 3 and Figure 11—figure supplement 2 is interesting because it shows that inhomogeneity is a direct feature of the data compared to simulated data and does not need complex inference methods to detect. This I believe should be the centre point of the manuscript.
We are very glad that the reviewer appreciates that the results in Figure 10—figure supplement 3 and Figure 11—figure supplement 2 directly show the heterogeneity in recombination without the need of postulating specific models. In the first revision we already moved some of these entropy profile results to the main text (Figure 10) and we do not think it is necessary to move these supplementary figures to the main text as well. We also note that reviewer 2 has the opposite opinion, i.e. that we should spend less space in the main text on these results. We believe that the current presentation strikes a reasonable balance.
At the suggestion of the reviewer, we have reworked the text describing these results to improve the explanation of what they show.
1.13) I asked a talented postdoc to look at the preprint of this paper and tell me what he thought. He spent 4 hours and was not able to decide whether anything could be concluded from all these new methods. The real task of the authors now is to focus down on making what is genuinely new accessible and with the time I can feasibly devote to it these are my comments to help the author to get there. I hope they are grateful!
We’re convinced that many of the comments of the reviewer have ultimately positively impacted the paper.
Reviewer 2
2.1) The manuscript has largely improved since last version and is making a clear point. It now combines multiple previous observations and some new statistics to discuss the impact of recombination on bacterial genetic diversity and the possibility to represent bacterial evolutionary history as a phylogeny.
The addition of situations added clearly some idea for comparison with a standard population genetic background.
Yet the picture is still not clear and can become quite technical. Here are a few propositions to improve the manuscript.
In light of this reviewer’s strong skepticism of the original version of our manuscript, we are very grateful to the reviewer for these gracious comments.
2.2) The main problem is that the nSNPs statistics remain difficult to interpret. Specifically, the authors give an intuition for the nSNP distribution in a freely recombining population. It should be peaked around a typical number of occurrence per type (Appendix 4—figure 3, brown line), In a clonal population, the nSNP distribution can also resemble a power law (Appendix 4—figure 3), a fact that is somewhat buried in the Supplement of the manuscript. A first clear description of the expected nSNPs distribution in these two extreme scenarios is warranted before going to more complex statistics (e.g. the entropy profiles).
We note that reviewer 3 also asked about the interpretation of the nSNP distributions in the first round of review, and did feel that our revisions had cleared this up. Nonetheless, for this revision we have significantly edited the text to make as clear as possible what these nSNPs represent, including a formal description of how we imagine relative rates of recombination determine the relative frequencies of different nSNP patterns. We have also edited our descriptions of what is observed in the clonal and freely recombining limits, and how those observations fundamentally differ from what we see in the E. coli data.
2.3) At the 2 SNPs, it makes sense that having multiple strains with almost equivalent level of 2 SNPs links is a reflection of recombination. But it would be interesting to plot the genomic localisation of the 2SNPs of the Figure 9A. The initial part of the paper mentions that for closely related strains a single recombination event will bring most variable SNPs. As such a recombination between two distant strains may lead to a high 2SNPs score just based on one transfer. Conversely, consider that strain A and B are more closely related and C is their closest relative. All mutations that occurred between on the branch leading to clone ABC will appear as 2 SNPs between either A and B, B and C or A and C if they occurred in regions that have later been overwritten by a recombination in branch C, A or B respectively. Here, the topology will play a major role. The 2SNPs connection will reflect clonality perturbed by recombination, not preferential recombination between strains. The 2 SNPs statistics is both dependent on the topology (its asymmetry, long branches…) and on the recombination. The simulations using coalescent approach did not capture the highly biased sampling structure of the population of genomes studied. So it would be interesting to illustrate on the example of Figure 9A where are the 2SNPs to substantiate the underlying structure of these 2 SNPs.
For this revision we have followed the reviewer’s suggestion and included a supplementary figure showing the distribution along the genome of the 5 most common 2SNPs from Figure 9A (Appendix 5—figure 2) within the section analyzing phylogroup B2 in detail (A1 belongs to phylogroup B2). Consistent with our observation (from the pair statistics) that little clonal signal is left for this set of strains, and that recombination has overwritten each locus already multiple times, we observe a fairly uniform distribution of these 2SNPs along the genome. Apart from just visualizing where these SNPs fall along the alignment, we have also shown the distributions of distances between consecutive occurrences of the same 2SNP, and the distribution of the length of runs of the same 2SNP. This shows that, although there are some locations where a cluster of the same 2SNP occurs, runs of identical 2SNPs are mostly short. Finally, among the top 5 pairSNP partners of A1 are strains D4 and H6, neither of which belong to phylogroup B2. Appendix 5—figure 2 also shows that 2SNPs of A1 with D4 and H6 are dispersed all across the core genome alignment and in the latest revision we point out that this also underscores that phylogroups should not be thought of as reflecting clonal ancestry.
2.4) The trouble in the later part of the paper is that the idea of preferential recombination is not clearly supported by the data. The nSNPs distribution in data do not perfectly follow the simulation model (although on Appendix 4—figure 3, the profiles are not that distinct either). But this does not imply that strains have to present “unique recombination profiles”. Moreover, the fact that the distribution is scale free does not necessarily imply that there are no groups of preferential recombination. All in all, we had trouble following the whole paragraph on nSNPs distribution and even more that on the entropy profiles. All we see is that indeed the simulations do not perfectly match these profiles. There could be multiple processes behind this poor match, like population structure for example or sampling biases. We would therefore suggest to considerably shorten that part (which comes at the end of an already long paper) and to present the “unique recombination profile” as one hypothesis by which these patterns could possibly arise (although this is only based on a verbal argument).
Reviewer 2’s opinion differs from the opinions of both reviewers 1 and 3 on these points. Reviewer 3 agrees we have explained the interpretation of these nSNP distributions and feels they clearly differ from the simulations and show evidence of population structure. Note also that, whereas reviewer 2 appears to contrast ‘population structure’ and ‘preferential recombination’, these two terms mean essentially the same in this context (as reviewer 3 has stressed). Regarding how long this section should be, we note that here the reviewer is also in disagreement with those of the other two reviewers, e.g. reviewer 1 feels we should expand the section on the entropy profiles. Overall we believe that the amount of space invested on the nSNP statistics is appropriate.
For this revision we have rewritten the section in question once more to stress that our results show that the nSNP distributions and entropy profiles show that different lineages recombine at significantly different rates, and that (apart from strains that had a common ancestor very recently) the lineages of each strain have a unique pattern of recombination with the lineages of other strains.
2.5) Heterogeneity in selection: One question that is not addressed is the heterogeneity in selective pressure along the genome? This heterogeneity could contribute to an over estimation of the fraction that is linked to recombination. Specifically, in the figure showing the accumulation or recombination with divergence, the last panel could be misleading because to the lack of resolution. Some genes are much more conserved than others and that intergenic regions. Synonymous are much more frequent (more than 10 fold) than nonsynonymous…. All these factors could play a role in the variability observed at intermediate levels of divergence. Could a zoom on some regions reveal the heretogeneity of divergence and its decorrelation from the underlying genetic structure. Is this pattern different from simulations that lack these diversity on the level of constraints.
The reviewer is presumably referring to the panels of Figure 2AC and wondering whether we may be overestimating the amount of recombination because of heterogeneity in purifying selection. If we understand it correctly, the reviewer imagines that one could mistake a region with little purifying selection, and thus, increased mutations, for a recombined region. Or vice versa, mistake regions under strong purifying selection for ancestral regions. We agree with the reviewer that, for pairs that are close to fully recombined, it becomes harder to clearly distinguish ancestral from recombined regions, but this does not affect the overall statistics that we derive from the pair analysis, which are driven by closer pairs that contain substantial fractions of ancestrally inherited DNA, i.e. as in Figure 2A and 2B, where the recombined regions can be easily detected.
Moreover, we stress again that our quantification of recombination is not only based on the pair analysis, but also on the consistency of SNP splits, and those statistics are not affected by selection.
2.6) For the detection of homoplasy, the impact of synonymous and transitions transversion, the model proposed could be better explained and the supporting data (number of transitions transversions leading to bi, tri or quadri allelic data for different types of mutation (synonymous, intergenic...)) should be provided.
For the revision we have edited this section to improve clarity and expanded the explanation. In addition, we have added a table reporting the statistics the reviewer requested (Table 2).
2.7) Overall the article has improved but we fill the nSNP interpretation, could still be clarified and/or shortened.
As mentioned above, we have now further revised the section on the interpretation of the nSNP statistics to improve clarity and expand the explanation.
2.8) Please specify the length of recombination fragments (20 kb) when you explain how recombination is implemented in the simulations.
In the simulations the lengths of the recombination segments were 12Kb, which we now also specify in the main text.
2.9) In paragraph “Recombination rates across lineages follow approx... scalefree distributions”, did not follow the sentence starting in “Formally,.…” are nearest neighbour in the phylogeny”  which phylogeny are you mentioning?
It is the different phylogenies along the genome alignment, i.e. each phylogeny at each locus. We have edited the text to make this clear.
2.10) Discussion “nSNP types reflect the relative frequencies with which different subsets of n strains share a common ancestor across the phylogenies”  is that statement correct?
We have rewritten this section to make sure the statement is clear.
Reviewer 3  Daniel B Weissman:
3.1) I was very positive in my first review, and I remain so: all the things that I really liked about the manuscript are still there, with more support. The authors have addressed my main comment (reorganizing the manuscript) and one of my main comments (interpreting the nSNP distributions), and partially addressed my other main comment (quantitative analysis of agreement between SNPs and the core tree), while explaining why they have not done more (it’s just a hard problem that will take some more serious analysis than my dashedoff idea). My main concern is just that the first submission was already quite a lot to take in (e.g., despite trying to read it carefully, I missed the site frequency spectrum) and now after addressing all the comments it’s really sprawling. I can’t think of magic fix to this, but a few of the minor suggestions below may help.
We are happy that reviewer 3 feels we have addressed his comments about the interpretation of the nSNP distributions. We agree with the reviewer that, in response to the many additional controls that reviewers 1 and 2 requested, including simulations, there is now so much extra material that it will make it harder for readers to follow the main thread of the argument. In this second revision we have gone over the whole text again and tried to improve the presentation so that the overall thread is clear.
3.2) There’s a lot to the simulation analysis, and I got kind of lost at points. To me, there are two main points of the simulations. The first is just to provide a quick check on the all the statements the authors are making in the first part of the manuscript, where they show that recombination has erased the clonal tree. This part was pretty easy to follow.
We agree with the reviewer.
3.3) The second main point are the nSNP distributions. The authors’ main claim about the E. coli nSNP distributions is that they are roughly scalefree. The human ones exhibit this even more clearly. The simulations show a different pattern. This suggests that a more complex history, with a mix of population size changes, population structure, and gene flow, as in humans, is a better starting point than the simple simulation model.
Indeed, in the first round of reviews the reviewers questioned whether a simple model of neutral evolution plus fully mixed recombination would not lead to the same nSNP distributions and the second point of the simulations is to show that it doesn’t. In this revision we also tried to make clearer that this is the main second point to take from these simulations.
3.4) I didn’t care so much about the fact that the simulations don’t match all the E. coli diversity statistics in detail – it would be weird if they did, given that they don’t even have the same sample size and that the core tree doesn’t look like a Kingman tree. If I’m correct about the two main points, it would be helpful if the authors could emphasize this to help guide the reader.
We agree and have tried to make this clearer in the revision.
3.5) Reviewer 1 mentioned the issue of biases in the estimation of the length distribution of the recombined segments. I don’t think the simulations fully address this, particularly the problem of missing small recombination events, since they assume that all recombination events are 12 kb. The authors mention that most of their conclusions don’t depend on the distribution of recombined segment lengths, which I agree with, but I think that the estimate of how many times each part of the genome is overwritten by recombination does depend on it. One could imagine a situation in which most of the phylogeny switches from alignment column are caused by very short recombined segments, while the segments detectable in close pairs are much longer. I don’t think that’s the case here, but the authors should make an argument why it’s not. What does ρ/µ look like if you just estimate from the close pairs? Is it roughly consistent with the ρ ≈ 0.3 − 1 that gives frequency of phylogeny changes in the alignment in simulations?
We agree with the reviewer that the estimate of the average size of recombination events is unimportant for almost all of our conclusions, and only affects the estimate of the number of times T each locus has been overwritten by recombination. Regarding the question of whether it is plausible that our lower bound estimate for T is still an overestimate, we refer the reviewer to our response 1.10. In short, for our estimate of the average length L_{r} of the recombined segments to be an overestimate by n fold one would need to have that for each recombined segment of length 250 or more, there are at least n recombined segments of length 100 or less. We feel that it is therefore not plausible that are estimate is off by a large factor.
The reviewer further asks whether we cannot get an estimate of ρ/µ using the analysis of close pairs. In response we have now added a Appendix 3 that estimates L_{r}ρ/µ using statistics from partially recombined pairs for which the fraction of clonally inherited genome is between 0.25 and 0.75. This analysis suggests that L_{r}ρ/µ ranges from 2100 to 4600 for these pairs.
Assuming that these rates also apply to the full evolutionary history of the strains in the genome alignment, the Appendix 3 also explains that, because a fraction 0.1 of columns in the alignment are polymorhic, this implies that the average number of times T each position in the alignment has been overwritten by recombination lies between T = 210 and T = 460. Note that this is consistent with our estimate based on the lower bound on the number of phylogeny changes C, which suggests T is at least 191 and may be as large as 1000.
However, we feel that this simple estimate based on the pair statistics should only be considered an orderofmagnitude check on the lower bound obtained from C for several reasons. First, the simple model assumes that, for a given pair, recombination has occured at a constant rate ρ since they diverged from a common ancestor. However, our results show that recombination rates vary strongly across lineages. More importantly, to estimate T we have to assume that the estimates of L_{r}ρ/µ for the partially recombined pairs can be extended to the full core genome alignment, and it is not clear at all whether this is valid. We have noted this as well in Appendix 3.
3.6) I would replace “recombination rates between lineages” with “population structure” as much as possible. First, technically, with so much recombination, each sample corresponds to many lineages, not one, and the recombination will almost never actually occur between two lineages ancestral to sampled individuals. Rather, recombination will split one ancestral lineage, and then deeper in the past one of the resulting lineages will coalesce with a different ancestral lineage. Second and more importantly, population structure is a familiar notion in population genetics, so it will hopefully reduce cognitive load on readers. You can emphasize early on that this structure doesn’t need to be geographic. For “variation of relative recombination rates among lineages”, maybe “variation in proportions of shared ancestry among lineages”? Not sure, could also see leaving it asis.
Precisely what terminology to use is something we have been thinking about quite a bit and it is very useful to have the reviewer give his views on this. We agree that ‘population structure’ is a familiar notion to some, but this may not include all readers of our paper and some may mistakenly think we are talking only about geographic distribution or something like that. Note that, in his last review, reviewer 2 refers to population structure and ‘preferential recombination’ as if they were two separate things. In our latest revision we have attempted to make sure that it is clear how we are using the terminology and we have in general increased the use of the term population structure.
The reviewer’s comments also made us realize that with ‘lineages’ we were not only thinking of the clonal ancestors of the sample, but the lineages of all ancestors that have contributed genetically to the genomes in the sample. In this revision we have also tried to make this point clearer by pointing how we use the term ‘lineages’.
3.7) In the human section: “few if any loci in the genome that follow this strictly maternal lineage”. I would explicitly mention the mitochondrion, and actually build a mitochondrial tree for the human sample that you can compare to the wholegenome tree. This would really help make things concrete, which I think would help people understand the bacterial results. If you want to, you could even include a cartoon human population tree, showing African and European admixture into Native Americans. Seeing the three trees (genomic consensus, mitochondrial/clonal, and model population structure) next to each other would drive the point home.
We agree that contrasting a mitochondrial tree with the wholegenome tree would help drive the point home. Unfortunately, the mitochondrial sequences are not available for these genome sequences.
3.8) I didn’t really understand the motivation behind the simulation setup. Why was it important that there be a single clonal tree shared by all the simulations? If anything, this seems to introduce a lot of noise, and I would be worried about unluckily drawing a weird tree. I would think it would be easier to use a pure coalescent simulation. If there’s an issue with the ARG getting too huge because recombination is too frequent, then I think that means that there is a corresponding issue in the forwardtime simulations where multiple recombined segments are too frequently being drawn from the same lineages. The authors might even be able to use existing software. ms and I believe msprime can do gene conversion, and if the exploding ARG is a problem, FastSimBac uses a Sequential Markov approximation and also keeps track of the clonal frame. I don’t know if any of them record all the stats that the authors use in this manuscript though.
The reason we use the same clonal tree is to be able to more directly show the effect of different recombination rates on various statistics. For example, in Figure 1—figure supplement 3 we show, for different recombination rates, what the fraction is of the genome that has been overwritten by recombination along each branch of the clonal tree. If there had been different clonal trees for each simulation, it would be much harder to meaningfully compare these statistics for different recombination rates.
We have extensively looked into using existing software for doing our simulations but none allow us to gather all statistics we want to gather.
3.9) I would try to maintain a clear distinction between a locus being overwritten many times by recombination in the phylogeny as a whole, and being overwritten many times along the line of descent of any given individual. As I understand it, the authors are making the second, stronger claim for most of the species studied, but I wasn’t sure what the actual numbers quoted in, e.g., Figures 6—figure supplement 3 and 4 correspond to.
In Figures 6—figure supplement 3 and 4, we are referring to the number of times positions have been overwritten along the whole phylogeny. In this revision we have made sure to be explicit about this so that no confusion can occur.
3.10) “Indeed, if we remove…”: I’m not sure that the argument in this paragraph is quite correct as written. Suppose we had a large sample from a basic unstructured WrightFisher population with a low rate of recombination. Then even if we were to take only recombined loci, wouldn’t we still get back something like the clonal phylogeny? I think the key point is the one you make in the sentence before this paragraph, that every locus has recombined many times. So then yes, maybe those all those trees look like the clonal phylogeny, but there’s no particular reason to expect them to, any more than we expect all the loci in humans to have trees that look like the mitochondrion’s.
We don’t think that there is anything incorrect with the argument in this paragraph, but we clearly did not explain ourselves well enough. This section of the paper concerns the question of why there is a robust core phylogeny in the face of so much recombination. At the start of this section we introduced a viewpoint that many in the field take regarding this question: “some researchers interpret this convergence to the core tree to mean that the core tree must correspond to the clonal phylogeny of the strains. The interpretation is that the SNP patterns in the data are a combination ‘clonal SNPs’ that fall on the clonal phylogeny, plus a substantial number of ‘recombined SNPs’ that were affected by recombination, which act so as to introduce noise on the clonal phylogenetic signal. In this interpretation, trees build from individual loci can differ from the core tree because the ‘recombination noise’ can locally drown out the true clonal signal, but once sufficiently many loci are considered, the recombination noise ‘averages out’ and the true clonal structure emerges.”
If this interpretation were correct, one would expect that explicit removal of all SNPs that fall on the core tree would destabilize the phylogeny. However, this is not what we observe. For this revision we have rephrased this section with the aim of making this point clearer.
3.11) “Essentially all population genetics…”: I’m not sure that I see the contradiction in this paragraph. As I understand, say, STRUCTURE or ADMIXTURE, they have no problem with each sampled individual having a unique distribution of ancestry across populations, which would seem to correspond to the claim being made about the E. coli strains.
Software such as STRUCTURE fit the data to a general purpose mixture model which, at least as far as we understand it, is not based on any explicit model for the evolutionary dynamics. We suspect that it would not be easy to device an explicit dynamical model to reproduce a specific fit to the mixture model of STRUCTURE.
3.12) The SI would be easier to navigate if it had a table of contents and if Table S5 were a separate data file.
We agree with the reviewer and, for the revision, we have made these changes.
[Editors' note: we include below the third round of reviews that the authors received from another journal, there is no corresponding authors’ response, as the authors opted to submit to eLife.]
https://doi.org/10.7554/eLife.65366.sa2Thank you for the revisions and for the lengthy rebuttal.
The problem with this paper continues to be that while the analyses are interesting, the description of the results and their interpretation remains garbled and confusing. Previous literature is mischaracterized but beyond that the manuscript fails to be clear in its own terms. These two problems need to be solved in parallel.
It is quite tiring to spend a long time wading through unclear text and then to get an overly combative response yet again. For this reason, I have again not attempted a comprehensive review of the manuscript but tried to focus on the most important problem, which is that the manuscript needs a clear statement of what is actually new  using instructive examples.
If correctly explained, the authors do have more than enough material for a solid and exciting contribution to bacterial population genetics that would be suitable for publication. Alas, unless the core issues of scientific clarity are addressed, the manuscript remains unsuitable for publication anywhere.
The authors are correct to highlight the issue, we do not know how many wellresolved branches can be attributed to clonal structure and how many can be attributed to nonrandom patterns of recombination (and how many can be attributed e.g. to technical factors in phylogenetic reconstruction) and that we indeed have limited understanding of the effect of gene flow on overall patterns of variation. It is an interesting, deep and underappreciated question in bacterial evolution. The best parts of the paper (still alas buried in the supplement) show that the null single genepool model, with homogeneous recombination rates between lineages (which is indeed used as the basis for many inference methods and simulations, especially due to the difficulty of implementing other models) is very far from being correct in many species.
The paper examines some summary statistics that avoid several of the assumptions made by more parameterized inference approaches. But they do not provide meaningful quantitative estimates of, for example, how many wellresolved branches of the phylogeny in any given species are inconsistent with the clonal tree. Furthermore, they do not provide estimates of how nonrandom recombination actually is. These problems are not easy to address directly.
The title, which describes “longtailed distribution of recombination rates” is therefore extremely misleading. The authors do not provide estimates of recombination rates between pairs of lineages. The longtailed distribution refer to nSNP patterns, not directly to recombination patterns. This issue of jumping between one and the other is repeated throughout the manuscript and destroys its scientific clarity.
“Each genomic locus has been overwritten so many times by recombination that it is impossible to reconstruct the clonal phylogeny and, instead of a consensus phylogeny, the phylogeny typically changes many thousands of times along the core genome alignment.”
This section of the Abstract is entirely garbled. The question of whether it is possible to recover the clonal genealogy is a technical question rather than a biological one. We always know there is a clonal genealogy, because bacteria always reproduce by binary fission, the question is just whether it is recoverable from the data. For example, if the level of recombination between strains was determined entirely by how closely related they were (which is close to the truth if recombination is extremely homology dependent), then you could have infinite recombination rates and still be able to successfully reconstruct clonal relationships. I can only guess what is meant by the sentence “instead of a consensus phylogeny”. A consensus phylogeny is something that it is always possible to calculate but is often difficult to interpret. So in what sense is it “instead” that the phylogeny (which means actually local genealogies) changes many thousands of times. I cannot see what possible logic this sentence could be conveying.
“Our findings show that bacterial populations are neither clonal nor freely recombining, but structured such that recombination rates between different lineages vary along a continuum spanning several orders of magnitude, with a unique pattern of rates for each lineage.”
The first part is most definitely not new while the second part might be true but is not demonstrated. Recombination rates have not been measured directly, only nSNP patterns.
“The reviewer no longer raises any technical issues.”
The reason I did not raise many technical issues in my review of Revision 1 is not because I had gone carefully through the entire manuscript and was unable to find any but because, as I stated in the review, I was already exhausted and spent far more time on the review than I wanted by the time I got to the end of the manuscript introduction. Now we are at revision 2 and there was a real danger of exhaustion by the time I get to the end of reading and responding to the cover letter. Do they actually want me to state every technical issue I have?
“Our conclusions were in fact fully warranted.”
I have pointed out several substantive and important mistakes, such as the factor of 2 misestimate of recombination in the previous version. The authors did actually make many scientifically substantive changes in response to my comments in this iteration as well on M. tb and on variation in tract length, which I thank them for.
The main point here is that the conclusions were not, and still are not, clearly stated. So how can they possibly be fully warranted?
I’m not going to respond linebyline to the rebuttal of my reviewer points made in the cover letter, although it’s tempting, but the whole premise of the argument that is made is false in a key respect:
“It is clear from the reviewer’s comments, however, that he continues to assume that robust phylogenetic structures in whole genome alignments can only reflect ancestry.”
In 2003, my colleagues and I published a paper entitled “Traces of human migrations in Helicobacter pylori populations” in Science. Helicobacter pylori is a particularly recombinogenic species (as their analysis confirms) and recombination has abolished most obvious traces of clonal structure in the genomic data, at least if strains are sampled from unrelated individuals within large human populations. Nevertheless, in population level H. pylori data one recovers a tree, which has several wellsupported deep branches. Indeed, figure 1 in that paper is a population tree, showing those branches. These robust phylogenetic structures clearly reflect patterns of gene flow (populations refers to population genetics which is the study of gene pools) and not clonal structure. (as I noted, their usage of the term ancestry is nonstandard – it also seems to be extremely vague  and should be avoided).
If strains are sampled from related individuals, one often finds clones and can infer some clonal structure, as I did for example in a paper published in 2015 in the journal Gut which found that the strains sampled from a single patient derived from two distinct clones but had recombined substantially.
Moreover, in 2019 my colleagues and I published a paper in ISME journal showing that the marine pathogen Vibrio parahaemolyticus also has populations defined by gene flow, while showing nearto free recombination within each of those populations. Note however that in this species there is plenty of evidence for clonal structure. For example, if you sample strains that cause human disease, you find a handful of lineages of recent origin predominate. Within these lineages, there are nice timeresolved phylogenies that make total sense. It is easy to infer clonal structure and it is also known that recombination rates can differ substantially within these lineages.
“The central novel insight of our work is that the robust phylogenetic structure that is evi dent in whole genome alignments of bacterial strains does not reflect clonal ancestry, but reflects population structure.”
This “central novel insight” is thus neither new nor true. In Helicobacter pylori we have already long ago and more recently in Vibrio parahaemolyticus shown that robust phylogentic structure in some parts of the tree results from clonal ancestry and in other parts of the tree from population structure. It all depends on where in the tree you look.
“We are much more optimistic than Max Planck and expected that we can convince the ‘old turks’ that our work contains results that are both new and true. We’ll leave it up to the reviewer to decide whether this optimism is warranted.”
I believe that there are new and true results in this work but unfortunately my doubts persist about whether they will ever be described accurately enough to allow people in the field to establish what they are.
Article and author information
Author details
Funding
Swiss National Science Foundation (31003A_135397)
 Erik van Nimwegen
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
The authors thank Olin Silander and Diana Blank for their laboratory work on the sequencing of the SC strains. During the development of this work, the authors have benefited from discussions with many researchers including Olin Silander, Frederic Bertels, Sergei Maslov, Purushottam Dixit, Edo Kussell, Boris Shraiman, Eugene Koonin, Daniel Fisher, Paul Rainey, Oskar Hallatschek, Mikhail Tikhonov, Richard Neher, Bruce Levin, Otto Cordero, and Daniel Weissman. This work was supported by the Swiss National Science Foundation grant No. 31003A_135397. In addition, this work was done in part while the authors were visiting the Simons Institute for the Theory of Computing and was thus supported in part by NSF Grant No. PHY1748958, NIH Grant No. R25GM067110, and the Gordon and Betty Moore Foundation Grant No. 2919.01. Calculations were performed at sciCORE (http://scicore.unibas.ch/), the scientific computing core facility of the University of Basel.
Senior Editor
 Aleksandra M Walczak, École Normale Supérieure, France
Reviewing Editor
 Armita Nourmohammad, University of Washington, United States
Publication history
 Received: December 2, 2020
 Accepted: January 7, 2021
 Accepted Manuscript published: January 8, 2021 (version 1)
 Version of Record published: February 15, 2021 (version 2)
Copyright
© 2021, Sakoparnig et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 6,901
 Page views

 815
 Downloads

 19
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Developmental Biology
 Evolutionary Biology
The gill skeleton of cartilaginous fishes (sharks, skates, rays, and holocephalans) exhibits a striking anterior–posterior polarity, with a series of fine appendages called branchial rays projecting from the posterior margin of the gill arch cartilages. We previously demonstrated in the skate (Leucoraja erinacea) that branchial rays derive from a posterior domain of pharyngeal arch mesenchyme that is responsive to Sonic hedgehog (Shh) signaling from a distal gill arch epithelial ridge (GAER) signaling centre. However, how branchial ray progenitors are specified exclusively within posterior gill arch mesenchyme is not known. Here, we show that genes encoding several Wnt ligands are expressed in the ectoderm immediately adjacent to the skate GAER, and that these Wnt signals are transduced largely in the anterior arch environment. Using pharmacological manipulation, we show that inhibition of Wnt signalling results in an anterior expansion of Shh signal transduction in developing skate gill arches, and in the formation of ectopic anterior branchial ray cartilages. Our findings demonstrate that ectodermal Wnt signalling contributes to gill arch skeletal polarity in skate by restricting Shh signal transduction and chondrogenesis to the posterior arch environment and highlights the importance of signalling interactions at embryonic tissue boundaries for cell fate determination in vertebrate pharyngeal arches.

 Ecology
 Evolutionary Biology
Spider venoms are a complex concoction of enzymes, polyamines, inorganic salts, and disulfiderich peptides (DRPs). Although DRPs are widely distributed and abundant, their bevolutionary origin has remained elusive. This knowledge gap stems from the extensive molecular divergence of DRPs and a lack of sequence and structural data from diverse lineages. By evaluating DRPs under a comprehensive phylogenetic, structural and evolutionary framework, we have not only identified 78 novel spider toxin superfamilies but also provided the first evidence for their common origin. We trace the origin of these toxin superfamilies to a primordial knot – which we name ‘Adi Shakti’, after the creator of the Universe according to Hindu mythology – 375 MYA in the common ancestor of Araneomorphae and Mygalomorphae. As the lineages under evaluation constitute nearly 60% of extant spiders, our findings provide fascinating insights into the early evolution and diversification of the spider venom arsenal. Reliance on a single molecular toxin scaffold by nearly all spiders is in complete contrast to most other venomous animals that have recruited into their venoms diverse toxins with independent origins. By comparatively evaluating the molecular evolutionary histories of araneomorph and mygalomorph spider venom toxins, we highlight their contrasting evolutionary diversification rates. Our results also suggest that venom deployment (e.g. prey capture or selfdefense) influences evolutionary diversification of DRP toxin superfamilies.