Whole genome phylogenies reflect the distributions of recombination rates for many bacterial species

  1. Thomas Sakoparnig
  2. Chris Field
  3. Erik van Nimwegen (corresponding author)
  1. Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, Switzerland
20 figures, 9 tables and 1 additional file

Figures

Figure 1 with 3 supplements
Whereas phylogenies of individual alignment blocks differ substantially from the core tree, phylogenies reconstructed from a large number of blocks are highly similar to the core tree.

Top left: For each split (i.e. branch) in the core tree, the color indicates what fraction of the phylogenies of 3 kb blocks support that bi-partition of the strains. Top right: Cumulative …
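
As an illustration of the split-support idea in this caption, the following minimal Python sketch computes, for each split of a reference tree, the fraction of block trees that contain the same bi-partition. It assumes each tree has already been reduced to a set of splits. The function names and the toy strains are hypothetical and are not taken from the paper.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

Split = FrozenSet[str]


def canonical_split(side: Iterable[str], all_strains: Iterable[str]) -> Split:
    """Represent a bi-partition by its smaller side (ties broken by name),
    so that both sides of the same split map to the same key."""
    a = frozenset(side)
    b = frozenset(all_strains) - a
    return a if (len(a), sorted(a)) <= (len(b), sorted(b)) else b


def split_support(core_splits: Iterable[Split],
                  block_trees: List[Set[Split]]) -> Dict[Split, float]:
    """Fraction of block trees that contain each core-tree split."""
    n = len(block_trees)
    return {s: sum(s in tree for tree in block_trees) / n for s in core_splits}


# Toy data (hypothetical): five strains and four block trees.
strains = {"A", "B", "C", "D", "E"}
ab = canonical_split({"A", "B"}, strains)
de = canonical_split({"A", "B", "C"}, strains)  # same split as {D, E}
cd = canonical_split({"C", "D"}, strains)
ac = canonical_split({"A", "C"}, strains)

block_trees = [{ab, de}, {ab, cd}, {ac, de}, {ab, de}]
support = split_support([ab, de], block_trees)
```

In this toy example each of the two core splits occurs in three of the four block trees.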

Figure 1—source data 1

List of database accessions of the genomes used.

https://cdn.elifesciences.org/articles/65366/elife-65366-fig1-data1-v2.txt
Figure 1—figure supplement 1
Joint maximum likelihood phylogeny of our strains and 189 E. coli reference strains.

Maximum likelihood tree reconstructed from the core genome alignments of the SC1 strains (red font names), the K-12 lab strain (green font name), and 189 E. coli reference strains (black font …

Figure 1—figure supplement 2
Each 3 kb alignment block rejects the core tree topology as well as the topologies of the phylogenies reconstructed from all other blocks.

Left panel: For each 3 kb block in the core alignment, we …

Figure 1—figure supplement 3
Fractions of positions that are clonally inherited (not affected by recombination) along each branch of the clonal phylogeny for the simulated datasets.

Each panel shows the clonal phylogeny with …

Figure 2 with 2 supplements
Pairwise analysis of recombination in the SC1 strains.

(A-C) SNP densities (SNPs per kilobase) along the core genome for three pairs of strains at overall nucleotide divergences of 4×10⁻⁴ (D6–F2), 0.002 (C10–D7), and 0.0048 (D6–H10). (D-F) Corresponding …
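
A SNP-density profile of the kind described in this caption can be sketched as follows. This is a minimal illustration, assuming two aligned sequences of equal length and non-overlapping windows. The function name, the gap handling, and the toy sequences are assumptions rather than the authors' pipeline.

```python
from typing import List


def snp_counts_per_window(seq_a: str, seq_b: str, window: int = 1000) -> List[int]:
    """Count mismatches between two aligned sequences in consecutive,
    non-overlapping windows. Gap positions ('-') are ignored. The last
    window may be shorter than `window`."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = []
    for start in range(0, len(seq_a), window):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        counts.append(sum(x != y for x, y in zip(a, b)
                          if x != "-" and y != "-"))
    return counts


# Toy example (hypothetical sequences) with a window of 4 columns.
seq_a = "ACGTACGTACGT"
seq_b = "ACGTACCTAGGA"
counts = snp_counts_per_window(seq_a, seq_b, window=4)
divergence = sum(counts) / len(seq_a)
```

With `window=1000` the counts are SNPs per kilobase. A histogram of these counts is the quantity to which a mixture model could then be fitted.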

Figure 2—figure supplement 1
Statistics of the pairwise analysis on the simulated data.

Each row of panels corresponds to simulations performed with a different recombination rate (indicated in each row) and shows, in blue, the …

Figure 2—figure supplement 2
The pairwise analysis accurately estimates both the clonally inherited fractions and the sizes of the recombined segments for the simulation data.

Left panels: Comparison of the true fraction of the genome that was …

Figure 3 with 1 supplement
Bi-allelic SNPs correspond to phylogeny splits.

A segment of a multiple alignment of 6 strains containing three bi-allelic SNPs, X, Y, and Z. Assuming that each SNP corresponds to a single substitution in the evolutionary history of the position, …
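
The correspondence between a bi-allelic SNP and a split can be sketched in a few lines of Python. The example below is a minimal illustration: the strain names and alignment columns are hypothetical, and representing a split by the set of strains that carry the minority allele is a convention chosen here for convenience.

```python
from collections import Counter
from typing import Dict, FrozenSet, Optional


def column_to_split(column: Dict[str, str]) -> Optional[FrozenSet[str]]:
    """Map an alignment column (strain -> nucleotide) to the split it
    implies: the set of strains carrying the minority allele. Returns
    None if the column is not bi-allelic. If both alleles are equally
    frequent, the allele seen second is treated as the minority."""
    counts = Counter(column.values())
    if len(counts) != 2:
        return None
    (_major, _), (minor, _) = counts.most_common(2)
    return frozenset(s for s, nt in column.items() if nt == minor)


# Hypothetical columns for six strains S1..S6.
col1 = dict(zip(["S1", "S2", "S3", "S4", "S5", "S6"], "AAGGGG"))
col2 = dict(zip(["S1", "S2", "S3", "S4", "S5", "S6"], "CCCTTC"))
col3 = dict(zip(["S1", "S2", "S3", "S4", "S5", "S6"], "ACGAAA"))  # three alleles

split1 = column_to_split(col1)
split2 = column_to_split(col2)
split3 = column_to_split(col3)
```

Under the single-substitution assumption stated in the caption, `split1` and `split2` each correspond to one branch of the local phylogeny, while `col3` carries no split information in this representation.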

Figure 3—figure supplement 1
Comparison of the observed frequencies of columns with different numbers of nucleotides with predictions from a simple substitution model.

Comparison of the observed frequencies of columns with 1, 2, 3, and 4 different nucleotides under the simple model described in the methods. Different colored dots correspond to different subsets of …

Figure 4 with 5 supplements
Most branches of the core genome tree are rejected by the statistics of individual SNPs.

Left panel: Fraction of supporting versus clashing SNPs for each branch of the core tree. Right panel: Cumulative distribution of the fraction of supporting SNPs across all branches. The purple and …
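
One way to count supporting and clashing SNPs for a branch is sketched below. It assumes that a SNP supports a branch when it induces exactly the same split, and clashes when the two splits cannot occur on a common tree (all four intersections of their sides are non-empty). These definitions and all names are assumptions made for illustration. The paper's Materials and methods should be consulted for the exact criteria.

```python
from typing import FrozenSet, Iterable, Tuple


def compatible(a: Iterable[str], b: Iterable[str], strains: Iterable[str]) -> bool:
    """Two splits are compatible with a common tree if at least one of the
    four intersections of their sides is empty."""
    a, b, u = set(a), set(b), set(strains)
    return not (a & b) or not (a - b) or not (b - a) or not (u - a - b)


def support_and_clash(branch: Iterable[str],
                      snp_splits: Iterable[FrozenSet[str]],
                      strains: Iterable[str]) -> Tuple[int, int]:
    """Return (number of supporting SNPs, number of clashing SNPs)."""
    u = frozenset(strains)
    branch = frozenset(branch)
    snp_splits = list(snp_splits)
    supporting = sum(s == branch or s == u - branch for s in snp_splits)
    clashing = sum(not compatible(branch, s, u) for s in snp_splits)
    return supporting, clashing


# Toy data (hypothetical): six strains, one branch, five SNPs.
strains = {"A", "B", "C", "D", "E", "F"}
snps = [frozenset(s) for s in
        ({"A", "B"}, {"A", "B"}, {"B", "C"}, {"C", "D"}, {"A", "B", "C"})]
n_support, n_clash = support_and_clash({"A", "B"}, snps, strains)
fraction_supporting = n_support / (n_support + n_clash)
```

SNPs that are compatible with the branch but do not fall on it, such as `{C, D}` here, count neither as supporting nor as clashing.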

Figure 4—figure supplement 1
The pairwise distances and number of SNPs on each branch as predicted by the core tree do not match the pairwise distances and SNP numbers observed in the data.

Comparison of pairwise distances and number of SNPs on each branch as predicted by the core tree, with pairwise distances and SNP numbers observed in the data. Left panel: Scatter plot of the …

Figure 4—figure supplement 2
Homoplasies do not significantly affect the fractions of supporting SNPs for E. coli.

Cumulative distribution of the fraction of supporting SNPs across all branches of the core tree for the original alignment of E. coli strains (blue) as well as for alignments from which 5% (orange) …

Figure 4—figure supplement 3
Distributions of SNP support for the data from the simulations.

Cumulative distributions of the fraction of supporting SNPs across all branches of the core tree for the alignments resulting from the simulations with ρ/μ=0 (dark blue), ρ/μ=0.001 (light blue), ρ/μ=0.01 (light …

Figure 4—figure supplement 4
Supporting versus clashing SNPs for trees that were built bottom-up, while minimizing SNP clashes.

Left panel: Illustration of the iterative bottom-up tree reconstruction. At each step, the pair of …

Figure 4—figure supplement 5
Quartets of roughly equidistant strains have no consensus phylogeny.

Left: Using the distribution of pairwise distances (top panel) we select, for each pairwise distance D, quartets of strains whose …
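
For a quartet of strains there are only three possible unrooted topologies, so SNP support can be tallied directly. The sketch below is illustrative only: it assumes each SNP is given as the set of strains carrying the minority allele, and it counts a SNP toward a topology when exactly two of the four quartet strains share that allele. The names and data are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, Tuple


def quartet_support(quartet: Tuple[str, str, str, str],
                    snp_splits: Iterable[FrozenSet[str]]) -> Dict[str, int]:
    """Count SNPs supporting each of the three quartet topologies."""
    a, b, c, d = quartet
    members = frozenset(quartet)
    # Each topology is identified by either of its two cherries.
    topology_of = {}
    for x, y, z, w in ((a, b, c, d), (a, c, b, d), (a, d, b, c)):
        label = f"{x}{y}|{z}{w}"
        topology_of[frozenset({x, y})] = label
        topology_of[frozenset({z, w})] = label
    counts = Counter({label: 0 for label in topology_of.values()})
    for split in snp_splits:
        restricted = members & split
        if len(restricted) == 2:  # informative for the quartet
            counts[topology_of[restricted]] += 1
    return dict(counts)


# Toy data (hypothetical); strain E lies outside the quartet.
snps = [frozenset(s) for s in
        ({"A", "B"}, {"C", "D", "E"}, {"A", "C"}, {"B", "D"},
         {"A", "D"}, {"A"}, {"A", "B", "C"})]
support = quartet_support(("A", "B", "C", "D"), snps)
```

A quartet with a consensus phylogeny would show one topology clearly dominating. Comparable counts for all three topologies indicate the absence of a consensus.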

Figure 5 with 2 supplements
SNP compatibility along the core genome alignment shows tree-compatible segments are short.

(A) Linkage disequilibrium (squared correlation, see Materials and methods) as a function of the separation of a pair of columns in the core genome alignment. (B) Probability distribution of the …

Figure 5—figure supplement 1
Pairwise SNP compatibility as a function of genomic distance.

The plot shows the fraction of SNP pairs that are compatible with a common phylogeny as a function of the genomic distance between the …
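
The quantity described here can be sketched as a binned pairwise compatibility test. The code below is a minimal illustration that assumes each SNP is given as a position plus the set of strains carrying the minority allele, and that two SNPs are compatible when at least one of the four intersections of their split sides is empty. The binning parameters and all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Tuple


def compatible(a: Iterable[str], b: Iterable[str], strains: Iterable[str]) -> bool:
    """True if the two splits can occur on a common tree."""
    a, b, u = set(a), set(b), set(strains)
    return not (a & b) or not (a - b) or not (b - a) or not (u - a - b)


def compatibility_by_distance(snps: List[Tuple[int, FrozenSet[str]]],
                              strains: Iterable[str],
                              bin_size: int,
                              max_dist: int) -> Dict[int, float]:
    """Fraction of compatible SNP pairs per distance bin (bin index = d // bin_size)."""
    totals: Dict[int, int] = defaultdict(int)
    ok: Dict[int, int] = defaultdict(int)
    for (p1, s1), (p2, s2) in combinations(snps, 2):
        d = abs(p2 - p1)
        if d >= max_dist:
            continue
        k = d // bin_size
        totals[k] += 1
        ok[k] += compatible(s1, s2, strains)
    return {k: ok[k] / totals[k] for k in sorted(totals)}


# Toy data (hypothetical): four strains, four SNPs.
strains = {"A", "B", "C", "D"}
snps = [(10, frozenset({"A", "B"})), (20, frozenset({"A", "B"})),
        (35, frozenset({"A", "C"})), (120, frozenset({"C", "D"}))]
fractions = compatibility_by_distance(snps, strains, bin_size=50, max_dist=150)
```

The all-pairs loop is quadratic in the number of SNPs, so for genome-scale data one would restrict the inner loop to SNPs within `max_dist` of each other.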

Figure 5—figure supplement 2
Lengths of tree-compatible segments for the data from the simulations.

Probability distributions of the length of tree-compatible segments along the alignments of the E. coli genomes (black line) and the alignments of the sequences from the simulations with ρ/μ=0 (dark …

Figure 6 with 4 supplements
Ratio C/M of the minimal number of phylogeny changes C to substitutions M for random subsets of strains using the alignment from which 5% of potentially homoplasic positions have been removed.

For strain numbers ranging from n=4 to n=92, we collected random subsets of n strains and calculated the ratios C/M of phylogeny changes to SNPs in the alignment. The figure shows box-whisker plots that …
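
A lower bound on the number of phylogeny changes can be sketched with a greedy left-to-right scan: extend the current segment while every new SNP is pairwise compatible with all SNPs already in it, and start a new segment otherwise. The code below is an illustrative sketch under that assumption. Whether it matches the authors' exact procedure is not confirmed by this page, and all names are hypothetical.

```python
from typing import FrozenSet, Iterable, List


def compatible(a: Iterable[str], b: Iterable[str], strains: Iterable[str]) -> bool:
    """True if the two splits can occur on a common tree."""
    a, b, u = set(a), set(b), set(strains)
    return not (a & b) or not (a - b) or not (b - a) or not (u - a - b)


def min_phylogeny_changes(splits: List[FrozenSet[str]],
                          strains: Iterable[str]) -> int:
    """Greedy count of breakpoints needed so that every segment of
    consecutive SNPs is pairwise compatible."""
    strains = set(strains)
    changes = 0
    segment: List[FrozenSet[str]] = []
    for s in splits:
        if all(compatible(s, t, strains) for t in segment):
            segment.append(s)
        else:
            changes += 1       # a phylogeny change is required here
            segment = [s]      # start a new tree-compatible segment
    return changes


# Toy data (hypothetical): SNP splits in genomic order.
strains = {"A", "B", "C", "D"}
splits = [frozenset(s) for s in ({"A", "B"}, {"A", "B"}, {"A", "C"}, {"C", "D"})]
C = min_phylogeny_changes(splits, strains)
M = len(splits)
ratio = C / M
```

Because any subset of a compatible set of splits is itself compatible, extending each segment as far as possible never requires more breakpoints than any other partition into compatible segments.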

Figure 6—figure supplement 1
Ratios C/M for homoplasy-corrected and full alignments of random subsets of strains.

Ratio C/M of the minimal number of phylogeny changes C to substitutions M for random subsets of strains using the alignment for which 5% of potentially homoplasic positions have been removed (orange) …

Figure 6—figure supplement 2
Lower bound C/M on the ratio of phylogeny changes to mutations, versus the ratio ρ/μ of recombination to mutation rate, for the data of the simulations.

Observed ratio C/M of the minimal number of phylogeny changes C and SNPs M in the alignment (vertical axis) as a function of the ratio of recombination and mutation rate ρ/μ used in the simulation …

Figure 6—figure supplement 3
Distributions of the number of times each position in the alignment has been overwritten by recombination for the data of the simulations.

Histograms of the number of times each position in the genome was overwritten by recombination along the branches of the clonal phylogeny, for the simulations with recombination-to-mutation ratios …

Figure 6—figure supplement 4
Comparison of the estimated and true average number of times each position in the alignment has been overwritten by recombination for the data of the simulations.

Comparison of the true average number of times T_true that positions were overwritten by recombination along the branches of the clonal phylogeny (horizontal axis) versus the estimated number of times T_est = L_r C/(2L)
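
The estimate quoted in this caption is a simple ratio and can be evaluated as below. The sketch assumes that L_r denotes a typical length of recombined segments, C the number of phylogeny changes, and L the alignment length. The reading of L_r is an assumption, and the numerical values are hypothetical and not taken from the paper.

```python
def estimated_overwrites(segment_length: float, changes: int,
                         alignment_length: int) -> float:
    """Evaluate T_est = L_r * C / (2 * L)."""
    if alignment_length <= 0:
        raise ValueError("alignment_length must be positive")
    return segment_length * changes / (2 * alignment_length)


# Hypothetical inputs chosen only to illustrate the arithmetic.
t_est = estimated_overwrites(segment_length=1000.0, changes=200,
                             alignment_length=100_000)
```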

Figure 7
Quantification of the importance of recombination across species.

(A) The cumulative distribution of pairwise divergences is shown as a different colored line for each species (see legend in panel B). Both axes are shown on logarithmic scales. The vertical lines …

Figure 7—source data 1

List of database accessions of the genomes used.

https://cdn.elifesciences.org/articles/65366/elife-65366-fig7-data1-v2.txt
Figure 8 with 1 supplement
Core tree built by PhyML from the sequences of chromosomes 1–12 of 40 randomly chosen human genomes of the 1000 Genomes Project.

The colors indicate what fraction of the time each split in the core tree occurred in trees built from random subsets of half of the genomic loci. The colors on the leaves indicate the annotated …

Figure 8—figure supplement 1
The E. coli core tree is relatively insensitive to removal of all SNPs that correspond to branches of the core tree.

Differences between the core tree T and the tree reconstructed from the alignment from which all SNPs that fall on branches of the core tree have been removed. Each branch of the core tree T is …

Figure 9 with 1 supplement
SNP-type frequencies follow approximately power-law distributions.

(A) Frequencies of 2-SNPs of the type (A1,s) in which a SNP is shared between strain A1 and one other strain s. Each edge corresponds to a 2-SNP (A1,s) and the thickness of the edge is proportional to the …

Figure 9—figure supplement 1
The n-SNP distributions are insensitive to removal of potential homoplasies and removal of n-SNPs corresponding to branches of the core tree.

Distributions of n-SNP frequencies (left panels) and exponents of the power-law fits (right panels) for original E. coli core genome alignment (top row), the 5% homoplasy-corrected core genome …

Figure 10 with 3 supplements
Phylogenetic entropy profiles of the E. coli strains.

Left panel: Entropy profiles Hs(n) (in bits) for six example strains, indicated in the legend. Right panel: Entropy profiles Hs(n) for all E. coli strains.
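
An entropy profile of this kind can be sketched as the Shannon entropy, in bits, of the distribution over n-SNP types that involve a given strain. The sketch assumes that an n-SNP is a SNP shared by exactly n strains and that its type is the set of those strains. That reading, the function name, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Optional


def entropy_profile(snp_splits: List[FrozenSet[str]], strain: str,
                    n_values: Iterable[int]) -> Dict[int, Optional[float]]:
    """For each n, the entropy (bits) of the distribution of n-SNP types
    containing `strain`; None if the strain has no n-SNPs."""
    profile: Dict[int, Optional[float]] = {}
    for n in n_values:
        counts = Counter(s for s in snp_splits if len(s) == n and strain in s)
        total = sum(counts.values())
        if total == 0:
            profile[n] = None
            continue
        profile[n] = -sum((c / total) * math.log2(c / total)
                          for c in counts.values())
    return profile


# Toy data (hypothetical): SNP types given as sets of strains.
snps = ([frozenset({"A1", "A2"})] * 2 + [frozenset({"A1", "B2"})] * 2 +
        [frozenset({"A1", "A2", "B2"})] * 3 + [frozenset({"A2", "B2"})])
profile = entropy_profile(snps, "A1", n_values=[2, 3, 4])
```

An entropy of zero means that all n-SNPs of the strain are of a single type, as expected when a single clade of n strains dominates. Larger values mean the strain shares SNPs with many different groups of strains.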

Figure 10—figure supplement 1
Entropy profiles of the n-SNP distributions for each of the E. coli phylogroups.

Each panel shows the entropy Hn(s) of the n-SNP distribution (vertical axis) as a function of n for each strain s

Figure 10—figure supplement 2
Statistical significance of the difference in n-SNP statistics for all pairs of strains.

Left panel: Cumulative distribution of the p-values of the Fisher exact test (Materials and methods) for the n

Figure 10—figure supplement 3
Comparison of the entropy profiles of the n-SNP distributions of E. coli with those of the data of the simulations.

Entropy profiles of the n-SNP distributions for the E. coli data (top left panel) and the data from the simulations (other panels), with the recombination rate ρ/μ indicated in the title of each panel. …

Figure 11 with 3 supplements
Left panel: Exponents of the power-law fits to the n-SNP frequency distributions, as a function of the number of strains sharing a SNP n for each of the species (different colors).

Error bars correspond to 95% posterior probability intervals. Right panel: Mean entropy of the entropy profiles Hn(s), averaged over all strains s, as a function of the number n of strains sharing the …

Figure 11—figure supplement 1
Power-law fits of the n-SNP distributions for all six species.

Each panel shows the reverse cumulative distributions of the frequencies of all observed 2-SNPs (blue dots), 3-SNPs (orange dots), …

Figure 11—figure supplement 2
Entropy profiles of all the strains for each of the six species.

Each panel corresponds to one species (indicated at the top) and shows the entropy profiles Hs(n) of the distributions of n-SNPs in …

Figure 11—figure supplement 3
n-SNP distributions and entropy profiles for the human data.

Left panel: Reverse cumulative distributions of the frequencies of all observed 2-SNPs (blue dots), 3-SNPs (orange dots), 4-SNPs (green …

Appendix 3—figure 1
Estimate of the effective recombination strength L_r ρ/μ (vertical axis) as a function of the estimated fraction of clonally inherited genome f_c (horizontal axis) for each pair of strains with f_c between 1/4 and 3/4.

Each point corresponds to a pair of strains.

Appendix 4—figure 1
SNP frequency spectra.

Total number of occurrences of n-SNPs, that is SNPs shared by n strains (vertical axes) as a function of n (horizontal axes) for the E. coli data (top left panel) and all simulated data with …

Appendix 4—figure 2
Diversity of n-SNPs.

Left panel: Number of n-SNP types (vertical axis) as a function of n (horizontal axis) for the E. coli data (black line) and for simulations with different recombination to mutation rates ρ/μ

Appendix 4—figure 3
Example n-SNP distributions for E. coli (black lines) as well as for the simulations with different rates of recombination with ρ/μ=0 shown in blue, ρ/μ=0.001 in cyan, ρ/μ=0.01 in light green, ρ/μ=0.1 in dark green, ρ/μ=0.3 in orange, ρ/μ=1 in red, and ρ/μ=10 in brown.

Each panel corresponds to the observed n-SNP distributions with the value of n indicated at the top of each panel. All axes are shown on logarithmic scales.

Appendix 4—figure 4
Fitted exponents of the n-SNP distributions for the E. coli data (black) and the data from the simulations with different recombination rates (colors, see legend).

The bars show the fitted exponent plus and minus one standard deviation of the posterior distribution.

Appendix 5—figure 1
Left panels: SNP densities (SNPs per kilobase) along the core genome for the five pairs of strains (A1-B2), (A1-A7), (A1-A11), (A1-D8), and (A1-A2).

Right panels: Corresponding histograms for the number of SNPs per kilobase (dots) together with fits of the mixture model. Note the vertical axis is on a logarithmic scale.

Appendix 5—figure 2
Distribution of the five most common 2-SNP patterns involving strain A1 along the core genome alignment.

Left: Direct visualization of the positions of each of the 2-SNP patterns along the core genome alignment. Each dashed line corresponds to an SNP and SNPs are colored according to the 2-SNP type …

Appendix 5—figure 3
Left: Histogram of the number of consecutive SNPs before an inconsistency in the core genome alignment of the sextet of strains (A1,A2,A7,A11,B2,D8) of phylogroup B2.

Note that the vertical axis corresponds to the number of segments with the corresponding number of consecutive SNPs. Middle: Histogram of the length of segments without phylogeny breaks. Right: …

Appendix 5—figure 4
Entropy profiles for the six strains (A1, A2, A7, A11, B2, D8).

Tables

Table 1
Estimated expected number of mutations per position μ* and estimated fraction of homoplasies for five different subsets of core alignment columns: all columns, all synonymous positions (third positions in fourfold degenerate codons), second positions in codons, synonymous positions excluding the outgroup, and second positions in codons excluding the outgroup.
Column set | μ* | f_h
All columns | 0.118 | 0.026
Synonymous positions | 0.287 | 0.063
Second positions in codons | 0.0258 | 0.006
Synom. pos. without outgroup | 0.149 | 0.033
Sec. pos. without outgroup | 0.0172 | 0.004
Table 2
Detailed statistics on the polymorphisms in the whole core genome alignment, at synonymous positions (third positions in fourfold degenerate codons) and at second codon positions (where all substitutions are non-synonymous).

First, for each set of positions the table lists the total number of positions, and the number of positions at which 1, 2, 3 or 4 different nucleotides appear. Second, for the subset of positions …

Statistic | All columns | Synom. codon pos. | Sec. codon pos.
Total columns | 2,880,516 | 349,311 | 960,172
1-letter columns | 2,484,831 | 299,536 | 936,588
2-letter columns | 363,164 | 46,807 | 22,505
3-letter columns | 30,611 | 2838 | 1029
4-letter columns | 1910 | 130 | 50
Transitions | 275,134 | 36,866 | 13,420
Transversions | 88,030 | 9941 | 9085
A ↔ G | 138,728 | 21,767 | 6676
C ↔ T | 136,406 | 15,099 | 6744
G ↔ T | 23,679 | 2198 | 963
A ↔ C | 23,636 | 3294 | 3510
A ↔ T | 21,036 | 2416 | 2549
C ↔ G | 19,679 | 2033 | 2063
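
The transition and transversion totals in Table 2 can be cross-checked against the per-pair counts with a short script. The sketch below uses the "All columns" values from the table. The classification function and variable names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(x: str, y: str) -> bool:
    """A transition swaps two purines or two pyrimidines; any other
    substitution between distinct nucleotides is a transversion."""
    pair = {x, y}
    return x != y and (pair <= PURINES or pair <= PYRIMIDINES)


# Per-pair counts of 2-letter columns, "All columns" in Table 2.
pair_counts = {
    ("A", "G"): 138_728,
    ("C", "T"): 136_406,
    ("G", "T"): 23_679,
    ("A", "C"): 23_636,
    ("A", "T"): 21_036,
    ("C", "G"): 19_679,
}

transitions = sum(n for (x, y), n in pair_counts.items() if is_transition(x, y))
transversions = sum(n for (x, y), n in pair_counts.items() if not is_transition(x, y))
ti_tv_ratio = transitions / transversions
```

The two sums reproduce the Transitions and Transversions rows of the table, and together they equal the number of 2-letter columns.
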
Appendix 1—table 1
Average length of tree compatible segments, both in terms of number of consecutive SNPs and number of consecutive nucleotides, for the full E. coli and simulation data, as well as for alignments from which 5% or 10% of potentially homoplasic sites were removed.
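Statistic | Average nb. of SNPs | | | Average nb. of nucleotides | |
Perc. homoplasies removed | 0% | 5% | 10% | 0% | 5% | 10%
E. coli | 7.59 | 8.79 | 10.75 | 123 | 145 | 179
ρ/μ=0 | 32 | 2757 | 11606 | 387 | 30435 | 108413
ρ/μ=0.001 | 32 | 430 | 702 | 371 | 5020 | 8396
ρ/μ=0.01 | 28 | 57 | 85 | 365 | 733 | 1158
ρ/μ=0.1 | 14 | 16 | 18 | 211 | 256 | 312
ρ/μ=0.3 | 8 | 9 | 10 | 135 | 157 | 183
ρ/μ=1 | 5 | 5 | 5 | 88 | 100 | 114
ρ/μ=10 | 3 | 3 | 3 | 57 | 62 | 68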
Appendix 1—table 2
Number of substitutions and phylogeny changes within sub-alignments corresponding to known phylogroups.

Starting from the full 5% homoplasy-corrected core genome alignment we extracted, for each phylogroup, the sub-alignment of all strains belonging to that phylogroup and determined the number of …

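Phylogroup | No. of strains | SNP rate M/L | Phyl. changes C | C/M | T_est
A | 6 | 0.0024 | 178 | 0.027 | 0.65
B1 | 35 | 0.013 | 4540 | 0.130 | 16.5
B2 | 6 | 0.011 | 2664 | 0.088 | 9.7
D | 29 | 0.017 | 5426 | 0.114 | 19.7
E | 1 | - | - | - | -
F | 3 | 0.007 | - | - | -
O | 9 | 0.003 | 2 | 0.0002 | 0.007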
Appendix 1—table 3
Summary statistics of the core genome alignments of the different bacterial species.

For each species, the number of strains, the median genome size, the size of the core genome alignment, and the total number of informative SNPs (that is SNPs that occur in at least two strains) are …

Species | Strains | Genome size | Core size | Inf. SNPs
Escherichia coli | 92 | 4,929,299 | 2,756,541 (56%) | 247,822
Bacillus subtilis | 75 | 4,155,843 | 2,341,553 (56%) | 182,535
Helicobacter pylori | 83 | 1,655,288 | 850,827 (51%) | 114,993
Mycobacterium tuberculosis | 40 | 4,465,985 | 4,150,139 (93%) | 3502
Salmonella enterica | 155 | 4,810,980 | 2,846,634 (59%) | 192,117
Staphylococcus aureus | 95 | 2,881,899 | 2,002,833 (69%) | 73,756
Appendix 1—table 4
Summary statistics on mutation and recombination for each of the six species.

For each species, the table shows the SNP rate (SNPs per alignment column) M/L, the lower bound on the number of phylogeny changes C, and the lower bound on the ratio of phylogeny changes to …

Species | SNP rate M/L | Phyl. changes C | C/M
Escherichia coli | 0.101 | 43,575 | 0.156
Bacillus subtilis | 0.113 | 40,811 | 0.155
Helicobacter pylori | 0.202 | 50,743 | 0.295
Mycobacterium tuberculosis | 0.002 | 755 | 0.078
Salmonella enterica | 0.085 | 31,598 | 0.131
Staphylococcus aureus | 0.053 | 13,215 | 0.124
Appendix 5—table 1
Pairwise divergence of strain A1 with each of the other strains of phylogroup B2.
Strain | Divergence
B2 | 0.00626
A7 | 0.00627
A11 | 0.00702
D8 | 0.00729
A2 | 0.00778
Appendix 5—table 2
Lengths, total SNP count, and SNP types (that is the strains that carry the minority allele) for the ten longest segments (in terms of number of consecutive SNPs) along the core genome of the sextet (A1,A2,A7,A11,B2,D8) of phylogroup B2.
Length segment | Number of SNPs | SNP types
9698 | 109 | (A7,B2) (A1,A7,B2) (A2,D8)
5672 | 100 | (A7,B2) (A1,A2) (A11,D8)
2068 | 100 | (A2,D8) (A11,A7,B2)
2255 | 95 | (A7,B2) (A11,A2) (A1,A11,A2)
11726 | 95 | (A7,B2) (A2,A7,B2) (A1,D8)
11790 | 93 | (A7,B2) (A2,D8)
3614 | 86 | (A11,A2)
4564 | 86 | (A11,A7,B2) (A1,A2)
2890 | 75 | (A7,B2) (A1,A7,B2)
2390 | 71 | (A1,A11)
Appendix 5—table 3
All 2-SNPs, 3-SNPs, and 4-SNPs involving strains from the sextet (A1 A2 A7 A11 B2 D8), that occur at least 100 times, sorted by their frequency of occurrence.
Strains | Number of occ.
A7 B2 | 1227
A11 D8 | 306
A2 D8 | 291
A11 A2 | 284
A1 D8 | 214
A1 A2 | 196
A1 A11 | 194
A1 A7 B2 | 389
A7 B2 D8 | 303
A11 A7 B2 | 265
A2 A7 B2 | 208
A11 A2 D8 | 179
A1 A11 D8 | 172
A1 A11 A2 | 161
A1 A2 D8 | 153
A1 A11 A7 B2 | 265
A1 A11 A2 D8 | 248
A11 A7 B2 D8 | 232
A1 A7 B2 D8 | 226
A1 A2 A7 B2 | 139
A2 A7 B2 D8 | 136
A11 A2 A7 B2 | 110
