Template switching during DNA replication is a prevalent source of adaptive gene amplification

Julie N Chuong
Nadav Ben Nun
Ina Suresh
Julia Cano Matthews
Titir De
Grace Avecilla
Farah Abdul-Rahman
Nathan Brandt
Yoav Ram
David Gresham author has email address

Department of Biology, Center for Genomics and Systems Biology, New York University, New York, United States
School of Zoology, Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
Edmond J. Safra Center for Bioinformatics, Tel Aviv University, Tel Aviv, Israel
Department of Natural Sciences, Baruch College CUNY, New York, United States
Department of Ecology and Evolutionary Biology, Yale University, New Haven, United States
Microbial Sciences Institute, Yale University, New Haven, United States
Department of Biological Sciences, North Carolina State University, Raleigh, United States

https://doi.org/10.7554/eLife.98934.2

Open access
Copyright information

Figures and data

A local DNA replication origin contributes to CNV dynamics during adaptive evolution.
(A) The Saccharomyces cerevisiae GAP1 gene is located on the short arm of chromosome XI (beige rectangle). Light blue rectangle - Ty Long terminal repeats (LTR). Purple rectangle - Autonomously replicating sequences (ARS). Orange rectangles - tRNA genes. GAP1 ORF - white rectangle. The GAP1 gene (white rectangle) is flanked by Ty1 LTRs (YKRCδ11, YKRCδ12), which are remnants of retrotransposon events, and is directly upstream of an autonomously replicating sequence (ARS1116). Variants of the GAP1 locus were engineered to remove either both LTRs, the single ARS, or all three elements. All engineered genomes contain a CNV reporter. (B) We evolved the four different strains in 5-8 replicate populations, for a total of 27 populations, in glutamine-limited chemostats and monitored the formation and selection of de novo GAP1 CNVs for 137 generations using flow cytometry. Population samples were taken every 8-10 generations and 100,000 cells were assayed using a flow cytometer. Colored lines show the median proportion of cells in a population with GAP1 amplifications across 5-8 replicate populations of the labeled strain. The shaded regions represent the median absolute deviation across the replicates. (C) We summarized CNV dynamics and found that strain has a significant effect on CNV appearance (Kruskal-Wallis, p =0.001384). There are significant differences in CNV appearance between LTRΔ (blue) and ARSΔ (red), and LTRΔ (blue) and ALLΔ (yellow) (pairwise wilcoxon test with Bonferroni correction, p=0.0059 and p=0.0124, respectively). (D) Strain has a significant effect on the per generation increase in proportion of cells with CNV (ANOVA, p = 0.00318) calculated as the slope during CNV selection phase. There is a significant difference between LTRΔ (blue) and ALLΔ (yellow) (pairwise t-test with Bonferroni correction, p = 0.0026). (E) Strain has a significant effect on time to CNV equilibrium phase (ANOVA, p = 0.00833). There is a significant difference in time to CNV equilibrium between WT and ARSΔ (pairwise t-tests with bonferroni correction, p =0.050).

CNV reporter failure does not impact parameter inference.
(A) Flow cytometry of a representative WT population with a persistent one-copy GFP subpopulation, bottom in each panel. FSC-A is forward scatter-area which is a proxy for cell size (x-axis). GFP fluorescence was measured in arbitrary units (a.u.) (y-axis). Hierarchical gating was performed to define the one-copy GFP and two-or-more copy subpopulations (see Methods). (B) Model illustration. X_A is the frequency of ancestor cells in the chemostat; X_C⁺, X_C⁻ are the frequencies of cells with GAP1 duplications with two or one reporters, respectively, and a selection coefficient s_C; X_B is the frequency of cells with other beneficial mutations and a selection coefficient s_B. GAP1 duplications form with a rate δ_C, other beneficial mutations occur with rate δ_B. At generation 0, only genotypes C⁻ and A are present, with frequencies of X_C⁻ = φ and X_A = 1 − φ. (C) Examples of total CNV proportions (dashed) and reported CNV proportions (solid) for two parameter combinations, both with s = 0. 15, φ = 10⁻⁴.

Inference of CNV formation rate and selection coefficient from experimental evolutionary data.
(A) Collective MAP estimate (black markers) and 50% HDR (colored areas) of GAP1 CNV formation rate, δ_C, and selection coefficient, s_C. Marginal posterior distributions are shown on the top and right axes. (B) Collective posterior prediction of Shannon diversity of CNV lineages (e^{−Σi[p_ilog(p_i)]}, Jost, 2006). Line and shaded area show mean and 50% HDI. (C) CNV reported frequency (X_C⁺) prediction using collective MAP (solid line) compared to empirical observations (dotted lines).

Local and distal elements contribute to generation of GAP1 CNV alleles.
(A) Schematic of Saccharomyces cerevisiae GAP1 locus on Chromosome XI: 513332-518060 with LTR, ARS elements and tRNA genes labeled. Long terminal repeat non-allelic homologous recombination (LTR NAHR) is defined on the basis of both CNV breakpoints occuring at LTR sites as revealed by read depth plots (left, pink and blue vertical lines) and increased read depth relative to the genome-wide read depth (left). In some cases we detect a hybrid sequence between two LTR sequences, a result of recombination between the two LTRs (right). LTR NAHR typically forms tandem duplications.
(B) ODIRA is a DNA replication-error based CNV mechanism generated by template switching of the leading and lagging strand template at short inverted repeats. In this clone example, the relative read depth estimate of 2.67 copies of GAP1 is rounded to 3 copies (left) and has apparent breakpoints in the CAF4 and AIM29 genes. We classify a clone as being formed by ODIRA if it has a de novo inverted sequence in at least one breakpoint. In the clone example, the short inverted repeat pairs are CAAT, ATTG (ChrXI:508938, ChrXI: 508986) in CAF4 and TAAAA, AAAAT (ChrXI:582561, ChrXI582610) in AIM29. The contig sequence at the breakpoint (rectangle) is aligned to the reference within the CAF4 coding sequence fragment. The 5’ and 3’ ends of the contig are labeled and a dashed line indicates contiguity (no gaps). The contig spans the CAF4 breakpoint junction and contains a de novo inversion, i.e.) two fragments of the CAF4 gene are in opposite orientations with the mediating short inverted repeats shown in blue and underlined. The contig was generated using CVish (Methods) and supported by split reads at the breakpoint junction (not shown).. The contig containing a de novo inversion across the AIM29 breakpoint is not shown. ODIRA typically forms tandem triplications with an inverted middle copy and contains an ARS (bottom).
(C) Non-allelic homologous recombination (NAHR) is defined by having at least one CNV breakpoint not at the proximate LTR sites, ie. other homologous sequences in the genome. In the example clone, we detect a hybrid sequence between the two homologous sequences in BET3 (ChrXI) and SPT4 (ChrVI). Because these two sequences are on different chromosomes we infer that an interchromosomal translocation occurred. The other breakpoint is unresolved. A read depth plot supports the amplified segment containing the GAP1 gene. NAHR is able to produce supernumerary chromosomes as is the case in Clone 2968 (Figure 4G)
(D) Transposon-mediated mechanism is defined by an inference of at least one intermediate novel LTR retrotransposon insertion followed by LTR NAHR. In the ALLΔ strains which have the flanking LTRs deleted, we find novel LTR retrotransposon insertions near previously deleted LTR sites. The newly deposited LTR sequence (downstream of GAP1) recombines with another LTR sequence (upstream of GAP1), either pre-existing or introduced by a second de novo retrotransposition, to form a resulting CNV (focal amplification or segmental amplification). Read depth estimation (not shown) supports the CNV breakpoints at pre-existing or newly deposited LTRs.
(E) Violin plot of CNV length in each genome-sequenced clone, n = 177. Strain has a significant effect on CNV length, Kruskal-Wallis test, p = 1.008 x 10^-5. Pairwise wilcoxon rank sum test with bonferroni correction show significant CNV length differences between WT and LTRΔ (p = 0.00490), WT and ALLΔ (p = 0.01230), LTRΔ and ARSΔ (p=0.00056), ARSΔ and ALLΔ (p=0.002).
(F) Barplot of inferred CNV mechanisms, described in A-D, for each CNV clone isolated from glutamine-limited evolving populations. Complex CNV is defined by a clone having more than two breakpoints in chromosome XI, indicative of having more than one amplification event. Inference came from a combination of read depth, split read, and discordant read analysis and the output of CVish (see Methods). Strain is significantly associated with CNV Mechanism, Fisher’s Exact Test, p = 5.0 x 10^-4. There is a significant increase in ODIRA prevalence between WT and LTRΔ, chi-sq, p = 0.02469. There is a significant decrease in ODIRA prevalence from WT to ARSΔ and ALLΔ, chi-sq, p = 0.002861 and 0.002196, respectively. There is a significant decrease of LTR NAHR from WT to LTRΔ, chi-sq, p = 0.03083.
(G) Top: Schematic of S. cerevisiae chromosome XI, with LTR, ARS elements, tRNA genes annotated. LTR-blue, ARS-purple, tRNA-orange, GAP1 ORF-white rectangle. Using a combination of read depth, split read, and discordant read analysis, we defined the extent of the amplified region, the precise CNV breakpoints, and GAP1 copy number. GAP1 copy numbers were estimated using read depth relative to the average read depth of chromosome XI. Bottom: Dumbbell plots represent the amplified region (>1 copy) relative to the WT reference genome. The ends of the dumbbells mark the approximate CNV breakpoints shown relative to the start codon of the GAP1 ORF (vertical dotted line). Select clones were chosen as representative of the observed diversity of amplifications.
(H) Scatterplots of CNV length for all genome-sequenced clones, n = 177. We defined the upstream and downstream breakpoints as kilobases away from the start codon of the GAP1 ORF (vertical dashed line in (G) dumbbell plot and this scatterplot). CNV mechanisms are defined in Figure 4A-D and Methods. Select clones from (G) dumbbell plots are annotated. Note in the focal amplications resulting for LTR NAHR in WT clones and ARSΔ clones, respectively. In ARSΔ, note NAHR between one local and one distal LTR ∼60 kb upstream. Note in ALLΔ focal amplications mediated by two newly deposited LTR sequences from two transposon insertions. Note in ALLΔ amplifications formed between one newly inserted LTR and one distal pre-existing LTR sequence, 30kb or 60 kb upstream.

Local and distal elements contribute to generation of GAP1 CNV alleles.
(A) Schematic of Saccharomyces cerevisiae GAP1 locus on Chromosome XI: 513332-518060 with LTR, ARS elements and tRNA genes labeled. Long terminal repeat non-allelic homologous recombination (LTR NAHR) is defined on the basis of both CNV breakpoints occuring at LTR sites as revealed by read depth plots (left, pink and blue vertical lines) and increased read depth relative to the genome-wide read depth (left). In some cases we detect a hybrid sequence between two LTR sequences, a result of recombination between the two LTRs (right). LTR NAHR typically forms tandem duplications.
(B) ODIRA is a DNA replication-error based CNV mechanism generated by template switching of the leading and lagging strand template at short inverted repeats. In this clone example, the relative read depth estimate of 2.67 copies of GAP1 is rounded to 3 copies (left) and has apparent breakpoints in the CAF4 and AIM29 genes. We classify a clone as being formed by ODIRA if it has a de novo inverted sequence in at least one breakpoint. In the clone example, the short inverted repeat pairs are CAAT, ATTG (ChrXI:508938, ChrXI: 508986) in CAF4 and TAAAA, AAAAT (ChrXI:582561, ChrXI582610) in AIM29. The contig sequence at the breakpoint (rectangle) is aligned to the reference within the CAF4 coding sequence fragment. The 5’ and 3’ ends of the contig are labeled and a dashed line indicates contiguity (no gaps). The contig spans the CAF4 breakpoint junction and contains a de novo inversion, i.e.) two fragments of the CAF4 gene are in opposite orientations with the mediating short inverted repeats shown in blue and underlined. The contig was generated using CVish (Methods) and supported by split reads at the breakpoint junction (not shown).. The contig containing a de novo inversion across the AIM29 breakpoint is not shown. ODIRA typically forms tandem triplications with an inverted middle copy and contains an ARS (bottom).
(C) Non-allelic homologous recombination (NAHR) is defined by having at least one CNV breakpoint not at the proximate LTR sites, ie. other homologous sequences in the genome. In the example clone, we detect a hybrid sequence between the two homologous sequences in BET3 (ChrXI) and SPT4 (ChrVI). Because these two sequences are on different chromosomes we infer that an interchromosomal translocation occurred. The other breakpoint is unresolved. A read depth plot supports the amplified segment containing the GAP1 gene. NAHR is able to produce supernumerary chromosomes as is the case in Clone 2968 (Figure 4G)
(D) Transposon-mediated mechanism is defined by an inference of at least one intermediate novel LTR retrotransposon insertion followed by LTR NAHR. In the ALLΔ strains which have the flanking LTRs deleted, we find novel LTR retrotransposon insertions near previously deleted LTR sites. The newly deposited LTR sequence (downstream of GAP1) recombines with another LTR sequence (upstream of GAP1), either pre-existing or introduced by a second de novo retrotransposition, to form a resulting CNV (focal amplification or segmental amplification). Read depth estimation (not shown) supports the CNV breakpoints at pre-existing or newly deposited LTRs.
(E) Violin plot of CNV length in each genome-sequenced clone, n = 177. Strain has a significant effect on CNV length, Kruskal-Wallis test, p = 1.008 x 10^-5. Pairwise wilcoxon rank sum test with bonferroni correction show significant CNV length differences between WT and LTRΔ (p = 0.00490), WT and ALLΔ (p = 0.01230), LTRΔ and ARSΔ (p=0.00056), ARSΔ and ALLΔ (p=0.002).
(F) Barplot of inferred CNV mechanisms, described in A-D, for each CNV clone isolated from glutamine-limited evolving populations. Complex CNV is defined by a clone having more than two breakpoints in chromosome XI, indicative of having more than one amplification event. Inference came from a combination of read depth, split read, and discordant read analysis and the output of CVish (see Methods). Strain is significantly associated with CNV Mechanism, Fisher’s Exact Test, p = 5.0 x 10^-4. There is a significant increase in ODIRA prevalence between WT and LTRΔ, chi-sq, p = 0.02469. There is a significant decrease in ODIRA prevalence from WT to ARSΔ and ALLΔ, chi-sq, p = 0.002861 and 0.002196, respectively. There is a significant decrease of LTR NAHR from WT to LTRΔ, chi-sq, p = 0.03083.
(G) Top: Schematic of S. cerevisiae chromosome XI, with LTR, ARS elements, tRNA genes annotated. LTR-blue, ARS-purple, tRNA-orange, GAP1 ORF-white rectangle. Using a combination of read depth, split read, and discordant read analysis, we defined the extent of the amplified region, the precise CNV breakpoints, and GAP1 copy number. GAP1 copy numbers were estimated using read depth relative to the average read depth of chromosome XI. Bottom: Dumbbell plots represent the amplified region (>1 copy) relative to the WT reference genome. The ends of the dumbbells mark the approximate CNV breakpoints shown relative to the start codon of the GAP1 ORF (vertical dotted line). Select clones were chosen as representative of the observed diversity of amplifications.
(H) Scatterplots of CNV length for all genome-sequenced clones, n = 177. We defined the upstream and downstream breakpoints as kilobases away from the start codon of the GAP1 ORF (vertical dashed line in (G) dumbbell plot and this scatterplot). CNV mechanisms are defined in Figure 4A-D and Methods. Select clones from (G) dumbbell plots are annotated. Note in the focal amplications resulting for LTR NAHR in WT clones and ARSΔ clones, respectively. In ARSΔ, note NAHR between one local and one distal LTR ∼60 kb upstream. Note in ALLΔ focal amplications mediated by two newly deposited LTR sequences from two transposon insertions. Note in ALLΔ amplifications formed between one newly inserted LTR and one distal pre-existing LTR sequence, 30kb or 60 kb upstream.

Model parameters and priors.
Fixed parameters from (Avecilla et al., 2022; Hall et al., 2008; Joseph & Hall, 2004; Venkataram et al., 2016).

Summary of genome sequence analysis of clones containing a single copy of the GAP1 CNV reporter.
Estimated copy number of the GAP1 gene and inserted GFP gene of sequenced clones from five 1-copy-GFP minor subpopulations of the WT genome architecture strain. Copy number estimation is defined as the read depth of the target gene relative to the average read depth of the chromosome XI. Populations 1, 2, 4, 5 contain clones harboring GAP1 CNVs but only 1 copy of GFP. Clones from population 3 and 5 harbor 1 copy each of GAP1 and GFP suggesting these lineages have beneficial mutations elsewhere in the genome, allowing coexistence with the GAP1 CNV major subpopulation. CN, copy number, CNV = copy number variant, GAP1 = general amino acid permease gene

Estimation of network confidence.
The coverage, defining the probability that the true parameter falls within the 95% highest density interval (HDI) of the posterior distribution, for 829 synthetic simulations in which the final reported GAP1 CNV proportion is at least 0.3. 95% HDI was calculated for each simulation using 200 posterior samples. Our neural density estimator is slightly over-confident for φ (coverage of 0.934), and under-confident for GAP1 CNV selection coefficient and formation rate (coverage of 0.992 for s_C and 0.995 for δ_C). Despite this under-confidence, the posterior distributions are narrow in biological terms: the 95% HDI represents less than an order of magnitude for both s_C and δ_C. Thus, we did not apply post-training adjustments to the neural density estimator, such as calibration (Cook et al., 2006) or ensembles (Caspi et al., 2023; Hermans et al., 2022).

Inferred CNV mechanisms by strain.
Counts of inferred CNV mechanisms for each sequenced clone, n=177, separated by strain.

Independent GAP1 amplifications lacking CNV reporter amplification.
(A) Read depth plots of the GAP1 CNV reporter locus, ChrXI:500-600kb, of sequenced clones from 1-copy-GFP subpopulations isolated across five chemostats. Identification of eleven distinct CNVs, shown above, indicate the occurrence of at least eleven independent amplifications of GAP1 without GFP co-amplification. Sequences were aligned to a custom reference genome containing the CNV reporter upstream of the GAP1 gene. The CNV reporter comprises a GFP gene and kanamycin resistance gene. GFP reference gene - green rectangle, GAP1 reference gene - blue rectangle. (B) Inset of the left-most CNV junction at the GFP, kanamycin, and GAP1 region, ChrXI: 513193-519171 with genome read depth and split read depth tracks for each sample. The location of the split reads pileup (blue and red clipping marks) show the precise CNV breakpoint which is downstream of the GFP gene and upstream of the GAP1 coding sequence for every clone indicating each lack an amplification of the inserted GFP gene.

Raw flow cytometry plots over the long term experimental evolution
FSC-A is forward scatter-area which is a proxy for cell size. GFP fluorescence was measured using the B2-A channel in arbitrary units. Hierarchical gating was performed to identify zero-, one-, and two-or-more-copy populations. Within the two-or-more copy gate, distinct subpopulations formed consistent with having a two-, three-, four- copies of GFP.

Raw flow cytometry plots over the long term experimental evolution
FSC-A is forward scatter-area which is a proxy for cell size. GFP fluorescence was measured using the B2-A channel in arbitrary units. Hierarchical gating was performed to identify zero-, one-, and two-or-more-copy populations. Within the two-or-more copy gate, distinct subpopulations formed consistent with having a two-, three-, four- copies of GFP.

Population GFP Ridgeplots.
Density plots of cell-size normalized GFP fluorescence in arbitrary units (a.u.) for every population and timepoint over the course of long-term experimental evolution in glutamine-limited chemostats.

Population GFP Ridgeplots.
Density plots of cell-size normalized GFP fluorescence in arbitrary units (a.u.) for every population and timepoint over the course of long-term experimental evolution in glutamine-limited chemostats.

MAP estimates of GAP1 CNV formation rates (δ_C) and selection coefficients (s_C) for all replicate populations.
Markers show MAP estimates from individual replicates, crosses show 50% HDI of collective posteriors. Extreme points are marked for comparison to data and posterior prediction, see Supplementary Figure 5 for posterior predictive checks.

Posterior predictive checks for all replicates.
Black markers are the empirical observations, dashed line shows MAP prediction. The leftmost plot of each row shows the collective MAP prediction with empirical data’s interquartile range (gray bars).

Pairwise and marginal collective posteriors for all estimated model parameters.
Diagonals show marginal collective posteriors per parameter per strain. Below-diagonal plots show pairwise KDEs for all pairs of model parameters. Collective joint MAPs (which may differ from collective marginal MAPs, as the marginal distribution integrates over all other parameters), are marked by a red vertical line. Panels are separated by strain: (A) WT, (B) ARSΔ, (C) LTRΔ, (D) ALLΔ.

Parameter estimation accuracy on synthetic data.
Log-ratio of MAP estimate and true parameter value for 829 synthetic simulations in which the final reported GAP1 CNV proportion is at least 0.3.

Neural density estimator training and validation loss during training.
Convergence threshold of 100 unimproved epochs (no decrease in minimal validation loss) was reached after 569 epochs.

Total GAP1 CNV frequency.
Solid lines show collective MAP predictions, dashed lines show the total proportion of GAP1 CNVs, comprising unreported CNVs and reported CNVs generated during the experiment, as predicted by the evolutionary model.

Error estimation of parameter inference.
Average root mean square errors (RMSE) of 50 posterior samples against the observed data. (A) Individual posteriors and individual replicates. (B) Collective posterior and individual replicates. (C) Collective posterior and empirical mean.

Pairwise evolutionary competition predictions.
We simulated evolutionary competitions in the experimental conditions of WT vs. genomic architecture mutants, starting from equal frequencies. The proportion at generation 116 of WT was predicted using 10,000 combinations of collective posterior samples for each pairwise competition. Overall, WT outcompetes all mutants because it adapts faster (due to faster CNV formation rate), but its advantage over ARSΔ and ALLΔ is much higher than its advantage over LTRΔ. (A) Histograms for three pairwise competitions. Note that ARSΔ and ALLΔ values overlap at this scale and are all in the rightmost bar. (B) High-resolution histograms for ARSΔ and ALLΔ.

No significant interaction between strain and generation on CNV length.
Boxplot of CNV length of clones by strain and generation of isolation. There is no significant interaction between strain and generation of isolated clone, and no significant effect of generation on CNV length (Two-way ANOVA, Strain x Generation, p = 0.33)

Types of ODIRA detected.
We found 87 ODIRA clones total regardless of strain. The majority of ODIRA clones fit the canonical definition of having two inverted junctions and 3 copies, 55/87 clones (63%) (ODIRA_3). We found four non-canonical types. We found 17 clones (20%) with only one inverted junction detected and 3 copies (ODIRA_oneEnd_3). We found 11 clones (13%) with two inverted junctions but only 2 copies (ODIRA_2) which may result from hairpin-capped double strand break repair. We found 3 clones (3.4%) with only one inverted junction detected and 2 copies (ODIRA_oneEnd_2). We found 1 clone (1.1%) with two inverted junctions but the amplified region did not contain an ARS.

CNV mechanisms in ARSΔ clones.
Two CNV sizes in ARSΔ clones correspond to different CNV mechanisms. We found two different groups of CNV lengths in the ARSΔ clones. 100% of smaller CNVs (6-8kb) correspond with a mechanism of NAHR between LTRs flanking the GAP1 gene. Larger CNVs (8kb-200kb) correspond with other mechanisms that tend to produce larger CNVs, including ODIRA and NAHR between distal LTR elements. The smaller CNVs are indeed focal amplifications of GAP1 that are 8kb or less.

Sign up for email alerts