Long read sequencing reveals poxvirus evolution through rapid homogenization of gene arrays

  1. Thomas A Sasani
  2. Kelsey R Cone
  3. Aaron R Quinlan  Is a corresponding author
  4. Nels C Elde  Is a corresponding author
  1. University of Utah, United States

Abstract

Poxvirus adaptation can involve combinations of recombination-driven gene copy number variation and beneficial single nucleotide variants (SNVs) at the same loci. How these distinct mechanisms of genetic diversification might simultaneously facilitate adaptation to host immune defenses is unknown. We performed experimental evolution with vaccinia virus populations harboring a SNV in a gene actively undergoing copy number amplification. Using long sequencing reads from the Oxford Nanopore Technologies platform, we phased SNVs within large gene copy arrays for the first time. Our analysis uncovered a mechanism of adaptive SNV homogenization reminiscent of gene conversion, which is actively driven by selection. This study reveals a new mechanism for the fluid gain of beneficial mutations in genetic regions undergoing active recombination in viruses and illustrates the value of long read sequencing technologies for investigating complex genome dynamics in diverse biological systems.

https://doi.org/10.7554/eLife.35453.001

Introduction

Gene duplication is long recognized as a potential source of genetic innovation (Ohno, 1970). Following duplication events, the resulting stretches of homologous sequence can promote recombination between gene copies. Gene conversion, the nonreciprocal transfer of sequence between homologous genetic regions, is one outcome of recombination evident in diverse eukaryotes (Brown et al., 1972; Semple and Wolfe, 1999; Drouin, 2002; Rozen et al., 2003; Ezawa et al., 2006; Chen et al., 2007), as well as in bacterial and archaeal genomes (Santoyo and Romero, 2005; Soppa, 2011). Gene conversion can result in a high degree of identity among duplicated gene copies, as in ribosomal RNA gene arrays (Liao, 1999; Eickbush and Eickbush, 2007). However, in other cases, such as the human leukocyte antigen gene family (Zangenberg et al., 1995) or the transmembrane protein gene cassettes of some pathogenic bacteria (Santoyo and Romero, 2005), gene conversion can also generate sequence diversity. Because multiple gene copies create more targets for mutation, variants that arise within individual copies can be efficiently spread or eliminated through gene conversion (Mano and Innan, 2008; Ellison and Bachtrog, 2015).

Although recombination might influence genetic variation in populations on very short time scales, many studies to date have used phylogenetic analyses in relatively slow-evolving populations to infer outcomes of gene conversion, including cases of concerted evolution, where similarity between genes of a gene family within a species exceeds the similarity of orthologous genes between species (Chen et al., 2007; Ohta, 2010). Extensive studies in yeast and bacteria (reviewed in Petes and Hill, 1988; Perkins, 1992; Haber, 2000; Santoyo and Romero, 2005), and some recent work in viruses (Hughes, 2004; Ba Abdullah et al., 2017), consider gene conversion on shorter time scales. In order to expand our understanding of how recombination might influence virus variation during the course of adaptation, we focused on large DNA viruses, in which rapidly evolving populations can simultaneously harbor both adaptive gene copy number variation and beneficial single nucleotide variants (SNVs) at the same locus.

Poxviruses are an intriguing system to study mechanisms of rapid adaptation, as they possess high rates of recombination (Ball, 1987; Evans et al., 1988; Spyropoulos et al., 1988; Merchlinsky, 1989) that lead to the recurrent emergence of tandem gene duplications (Slabaugh et al., 1989; Elde et al., 2012; Brennan et al., 2014; Erlandson et al., 2014; Cone et al., 2017). The poxvirus DNA polymerase gene encodes both replicase and recombinase activities, reflecting a tight coupling of these essential functions for virus replication (Colinas et al., 1990; Willer et al., 1999; Hamilton and Evans, 2005). Polymerase-associated recombination may underlie the rapid appearance of gene copy number variation (CNV), which was proposed as a potentially widespread mechanism of vaccinia virus adaptation in response to strong selective pressure (Elde et al., 2012; Cone et al., 2017). In these studies, recurrent duplications of the K3L gene, which encodes a weak inhibitor of the human innate immune factor Protein Kinase R (PKR; Davies et al., 1992), were identified following serial infections of human cells with a vaccinia strain lacking a strong PKR inhibitor encoded by the E3L gene (ΔE3L; Chang et al., 1992; Beattie et al., 1995). In addition to copy number amplification, a beneficial single nucleotide variant arose in the K3L gene in some populations, resulting in a His47Arg amino acid change (K3LHis47Arg), which encodes enhanced inhibition of PKR activity and aids virus replication (Kawagishi-Kobayashi et al., 1997; Elde et al., 2012). However, the mechanisms by which these distinct processes of adaptation might synergize or compete in virus populations during virus evolution remain unknown.

To investigate how heterogeneous virus populations adapt to cellular defenses, we performed courses of experimental evolution with a vaccinia virus population containing both K3L CNV and the K3LHis47Arg SNV (Elde et al., 2012). To overcome the challenge of genotyping point mutations in repetitive arrays from evolving populations, we sequenced virus genomes with the Oxford Nanopore Technologies (ONT) MinION platform. Sequencing extremely long DNA molecules allowed us to develop an integrated pipeline to analyze CNV at single-genome scale. Long sequencing reads, which can completely span tandem arrays of K3L duplications (reads measuring up to 99 kbp and comprising up to 16 K3L copies in this study), provided a means for tracking the K3LHis47Arg mutation as it spread through K3L gene arrays within an evolving virus population. Altering conditions in this experimental system allowed us to assess the impact of selection and recombination on K3LHis47Arg variant accumulation within K3L arrays. These analyses of variant dynamics reveal a mechanism of virus adaptation involving genetic homogenization and demonstrate how long read sequencing can facilitate studies of recombination-driven genome evolution.

Results

Rapid accumulation of a single nucleotide variant following gene copy number amplification

In previous work, we collected a virus population adapted over ten serial infections that contained gene copy number amplifications of K3L, and the beneficial K3LHis47Arg point mutation at a frequency of roughly 0.1 in the population (Figure 1A and B, up to passage 10; Elde et al., 2012). To study the fate of the K3LHis47Arg variant among repetitive arrays of K3L, we performed ten additional serial infection passages (P11-P20) in human cells. Comparative replication in human cells showed that virus titers remained well above parent (ΔE3L) levels through P20 (Figure 1A). While there was no major gain in replication between P5 and P20, the resolution of titer measurements may limit detection of more subtle increases in virus replication. A clear increase in replication around P5 coincided with the emergence of K3L CNV within virus genomes (Elde et al., 2012), and gene copy number increases appeared to stabilize by P10 (Figure 1B). Notably, the K3LHis47Arg SNV, though apparently stable at an estimated frequency of 0.1 in the population from P5 through P10 (Elde et al., 2012), accumulated to near fixation between P10 and P20 (maximum frequency of approximately 0.9; Figure 1B). Thus, despite a plateau in replication across later experimental passages, the accumulation of the beneficial point mutation suggests that there is selection for the K3LHis47Arg variant in the heterogeneous virus population.

Figure 1 with 3 supplements see all
A single nucleotide variant accumulates following increases in K3L copy number.

(A) Following 20 serial infections of the ΔE3L strain (MOI 0.1 for 48 hr) in HeLa cells (see Materials and methods for further details), replication was measured in triplicate in HeLa cells for every fifth passage, and compared to wild-type (VC-2) or parent (ΔE3L) virus. (B, C) Digested viral DNA from every 5th passage (B) and four plaque-purified clones (C) were probed with a K3L-specific probe by Southern blot analysis. Number of K3L copies (left) and size in kbp (right) are shown. K3LHis47Arg allele frequency for each population (shown below) was estimated by PCR and Sanger sequencing of viral DNA. (D) Replication of plaque purified clones from (C) was measured in HeLa (D) or BHK (Figure 1—figure supplement 1) cells in triplicate. Statistical analysis was performed to compare the means of populations b or c relative to a, or between the means of populations b or c relative to d by one-way ANOVA followed by Dunnett’s multiple comparison test. *p<0.05, **p<0.01, ***p<0.005. K3LHis47Arg and E9LGlu495Gly population-level allele frequencies estimated from Illumina MiSeq reads are shown in Figure 1—figure supplement 2. Replication of clone a compared to ΔE3L is shown in Figure 1—figure supplement 3. All titers were measured multiple times in BHK cells by plaque assay, shown with median and 95% confidence intervals.

https://doi.org/10.7554/eLife.35453.002
Figure 1—source data 1

Data used to generate Figure 1A.

https://doi.org/10.7554/eLife.35453.006
Figure 1—source data 2

Data used to generate Figure 1D.

https://doi.org/10.7554/eLife.35453.007
Figure 1—source data 3

Statistics for Figure 1D, One-way ANOVA followed by Dunnett’s multiple comparison test.

https://doi.org/10.7554/eLife.35453.008

To determine how changes in K3L copy number and the K3LHis47Arg mutation might individually contribute to virus replication in heterogeneous populations, we isolated distinct variants from single virus clones. Following plaque purification, we obtained viruses containing a single copy of the K3L gene, either with or without the K3LHis47Arg variant (Figure 1C). We also obtained plaque purified clones containing K3L CNV that were homogeneous for either wild-type K3L (K3LWT) or K3LHis47Arg. While viruses with CNV were not clonal due to recurrent recombination between multicopy genomes, they possessed nearly uniform K3L copy number, collapsing under relaxed selection to contain mainly 2 copies of K3L following plaque purification (ranging from 1 to 5; Figure 1C), consistent with observations from our previous study (Elde et al., 2012). As a result, these plaque purified clones do not represent the total diversity of copy number observed in the passaged virus populations, and the individual contributions of each K3LWT or K3LHis47Arg copy to virus replication are difficult to pinpoint given the heterogeneity of the clones. However, these populations do provide a useful tool to approximate the replication of viruses containing distinct genetic changes. Comparing the ability of these viruses to replicate in human cells revealed that either the K3LHis47Arg variant or K3L CNV is sufficient for a replication gain, and the combination of the two genetic changes increases replication more than either one alone (Figure 1D). In contrast, in a permissive hamster cell line, there was less than a 2-fold difference in titer between any of these four viruses (Figure 1—figure supplement 1). These results are consistent with earlier reports showing that K3L containing the His47Arg variant is a more potent inhibitor of human PKR than wild-type K3L (Kawagishi-Kobayashi et al., 1997), and that CNV was sufficient for overexpression of the K3L protein, which increased virus replication in human cells (Elde et al., 2012). However, the K3LHis47Arg variant only reached high frequency in our experimental population following increases in K3L copy number. This suggests that copy number affected K3LHis47Arg accumulation, which could occur through changes to selection for the variant, recombination rate, replication rate, or some combination of these processes.

In addition to K3L, variation at other loci could influence vaccinia adaptation during experimental evolution. Therefore, we assessed the full complement of single nucleotide variants by sequencing virus genomes from passages 10, 15, and 20 using the Illumina MiSeq platform. Apart from the K3LHis47Arg variant, only one other SNV was identified above an allele frequency of 0.01 in any of the sequenced populations compared to the parent ΔE3L virus. This variant, a point mutation encoding a Glu495Gly amino acid change in the viral DNA polymerase (E9LGlu495Gly), decreased in frequency from P10-P20, in contrast to the K3LHis47Arg SNV (Figure 1—figure supplement 2). While the E9LGlu495Gly SNV reached high frequency at P10 (0.64), this variant alone did not provide a measurable increase in virus replication (Figure 1—figure supplement 3), suggesting that it may be non-adaptive, and could have accumulated as a hitchhiker mutation. These observations, and the lack of any other detectable genetic changes, suggest that virus gains in replication were dominated by changes in K3L. Furthermore, the detection of only two high-frequency SNVs reflects the rarity of point mutations during poxvirus replication relative to RNA viruses and suggests that recombination-based mechanisms could be a major means of adaptation.

ONT reads reveal precise K3L copy number and distributions of K3LHis47Arg in individual genomes

To track the rise of the K3LHis47Arg variant in virus populations containing K3L CNV, we needed a means to analyze large and repetitive arrays of sequence. Duplication of K3L produces breakpoints between flanking genetic regions that mark the boundaries of recombination (Figure 2A; Elde et al., 2012; Cone et al., 2017). We previously identified two distinct breakpoints flanking K3L in the P10 population that differ by only 3 bp (Elde et al., 2012). Each of these breakpoint pairs demarcates a duplicon approximately 500 bp in length (Figure 2A), and encompasses both the K3L open reading frame (ORF, 267 bp) and predicted K3L promoter (Yang et al., 2010). In heterogeneous virus populations, short (e.g. 150 bp) Illumina reads cannot discriminate the presence or absence of the K3LHis47Arg variant either in single-copy K3L genomes or within multicopy arrays of K3L (Figure 2A). Therefore, we sequenced virus genomes using the ONT MinION sequencing platform and routinely generated reads with a mean length of ~3 kbp and a N50 between 5–8 kbp (Table 1). These reads reached a maximum aligned length of 40 kbp with a standard library preparation (see Materials and methods for further details) and allowed us to directly measure both K3L copy number and the presence or absence of K3LHis47Arg in each K3L copy within individual virus genomes (Figure 2A).

Figure 2 with 3 supplements see all
ONT reads capture SNVs and copy number expansions in individual virus genomes.

(A) Representative structure of the K3L locus in the VC-2 reference genome is shown on top, with representative Illumina MiSeq and ONT MinION reads shown to scale below. The K3LHis47Arg variant within reads is indicated by an asterisk. ONT reads that split and re-align to the K3L duplicon are indicative of individual multicopy arrays (shown below). Tandem duplication breakpoints flanking the duplicon are indicated by arrowheads. (B) Population-level K3LHis47Arg allele frequency was estimated using Illumina or ONT reads from different passages. E9LGlu495Gly allele frequencies are shown in Figure 2—figure supplement 1. Error rate calculations for different flow cell chemistries are shown in Figure 2—figure supplement 2. (C) For each sequenced passage, K3L copy number was assessed within each ONT read that aligned at least once to the K3L duplicon (see Materials and methods for further details). Detailed plot of reads containing 6 + K3L copies is shown in Figure 2—figure supplement 3. (D) Representative reads from the specific long read preparation are depicted relative to the VC-2 reference genome. The locations of relevant genes are indicated by colored boxes (gene name above or below), and the locations of high frequency variants in K3L and E9L are indicated by arrowheads.

https://doi.org/10.7554/eLife.35453.014
Figure 2—source data 1

Single nucleotide variants in virus populations from Illumina or ONT datasets, used to generate Figure 2B.

https://doi.org/10.7554/eLife.35453.018
Table 1
Summary of ONT sequencing datasets
https://doi.org/10.7554/eLife.35453.020
Population*Total sequenced readsMean read length (bp)Read length N50 (bp)Total sequenced bases (Gbp)Reads containing K3L
P5239,737216859320.521190
P1091,815352376930.32912
P15388,502449369081.754317
P2094,050289377020.27789
  1. *ONT sequencing datasets for all populations are available in Table 1-source data 1

Table 1-source data 1

Complete summary of ONT sequencing datasets

https://doi.org/10.7554/eLife.35453.021

Using ONT reads, we performed variant calling on the two high frequency SNVs (K3LHis47Arg and E9LGlu495Gly) from every fifth passage, yielding similar population-level allele frequencies to those estimated using Illumina MiSeq data for the same samples (Figure 2B, Figure 2—figure supplement 1; Elde et al., 2012). ONT sequencing error rates, which vary according to the specific k-mer being sequenced (Jain et al., 2018), are higher than in Illumina sequencing. To determine how error rates might influence variant allele frequency estimates, we calculated the proportions of sequencing errors (mismatches and deletions) at the 5-mers containing the wild-type or variant sequence for each SNV (K3LWT: TATGC; K3LHis47Arg: TACGC; E9LWT: ATTCG, E9LGlu495Gly: ATCCG; see Materials and methods for further details). While error rates differ between ONT flow cell chemistries, we found that the highest error rate was only 2.6% (Table 2, Figure 2—figure supplement 2), supporting the utility of ONT reads for identifying high frequency SNVs in virus genomes. We also identified the same two K3L duplication breakpoints previously described in the P10 population (Elde et al., 2012) using nanopore reads from P5, P10, P15, and P20, which revealed that the ~500 bp duplicons were maintained in the virus population throughout passaging (Table 3). Given the high quality of the long ONT reads, we restricted subsequent data analysis to reads containing complete K3L arrays (reads that mapped ≥150 bp upstream and downstream of the duplicon), thereby excluding reads that only contain a subset of total K3L copies in a virus genome (see Materials and methods for further details).

Table 2
Median sequencing error rates using various ONT flowcell chemistries
https://doi.org/10.7554/eLife.35453.022
Mutation and context (amino acid change)R7.3R9R9.4
TA[T > C]GC (His47Arg)0.0230.0230.005
TA[C > T]GC (Arg47His)0.0150.0240.026
AT[T > C]CG (Glu495Gly)0.0140.0180.009
AT[C > T]CG (Gly495Glu)0.0250.0090.009
Table 3
Structural variant breakpoint frequencies during passaging
https://doi.org/10.7554/eLife.35453.023
Breakpoint frequency*
BreakpointK2L breakK4L breakP5P10P15P20
130,284-0.760.690.760.66
1-30,8370.760.630.720.62
230,287-0.140.060.100.08
2-30,8400.120.040.090.05
  1. *Due to sequencing errors, a proportion of reads do not match either breakpoint

The long read datasets revealed that K3L copy number expansions occurred as early as P5 (Figure 2C), consistent with Southern blot analysis (Figure 1B). Over the course of the next five passages, K3L copy number steadily increased, and nearly 70% of virus genomes contained multiple copies of the gene by P10 (Figure 2C). From passages 10 to 20, there was a modest shift toward higher copy number arrays, but the distribution of K3L copy number within the population appears to have reached a point near equilibrium. Throughout the experiment, over 90% of sequenced virus genomes contained between 1 and 5 copies of K3L (Figure 2C). These results, combined with the rapid collapse of copy number under relaxed selection (Figure 1C, Elde et al., 2012), are consistent with a fitness trade-off between additional K3L duplications and increased genome size at very high copy numbers. However, using ONT, we captured reads from genomes containing up to 16 total copies of K3L (Figure 2—figure supplement 3), confirming that rare, large gene arrays exist in the population. These virus genomes highlight the ability of long read sequencing to analyze large arrays of gene repeats.

High K3L copy number reads are likely underrepresented in our initial data sets, both because there is a higher probability of capturing a complete short array, and because an average sequencing read length of ~3 kbp limits the discovery of virus genomes with greater than 6–8 K3L copies. Therefore, we re-sequenced virus genomes from P15 using a DNA library preparation protocol designed to generate extremely long reads (see Materials and methods for further details). The alternate protocol produced a mean read length of 9,392 bp (N50 = 19,288 bp), with a maximum aligned read length of 99,214 bp (Figure 2D). Even with increased read lengths, we did not recover larger proportions of high K3L copy number genomes in this dataset, suggesting that standard library preparations captured a representative sample of K3L copy number in virus populations. Using this specific long read preparation, we were also able to identify nearly 100 reads that span a ~ 30 kbp region separating the two high frequency single nucleotide variants in this population, K3LHis47Arg and E9LGlu495Gly (Figure 2D). These extremely long sequencing reads enable the direct phasing of distant variants within single poxvirus genomes, and suggest that single reads may routinely capture entire poxvirus genomes in future studies.

The K3LHis47Arg variant rapidly homogenizes in multicopy vaccinia genomes

To determine how single nucleotide variants spread throughout tandem gene duplications within virus genomes, we assessed the presence or absence of the K3LHis47Arg variant in each copy of K3L within our ONT reads (see Materials and methods for further details). Specifically, we categorized multicopy vaccinia genomes as containing either homogeneous K3L arrays, in which every K3L copy contains either the variant or wild-type sequence, or ‘mixed’ arrays, in which both K3LWT and K3LHis47Arg copies are present in a single read. At P5, the K3LHis47Arg SNV was observed almost exclusively in single-copy reads (four multicopy reads contained the SNV, compared to 304 single-copy reads; Figure 3), indicating that the variant likely originated in a single-copy genome. From P5 to P10, while the K3LHis47Arg variant remained at a nearly constant population-level frequency (Figure 2B), we observed a slight increase in the proportion of multicopy genomes containing the SNV, which was enriched for mixed, rather than homogeneous K3LHis47Arg arrays (Figure 3). Although overall K3L copy number increased from P5 to P10, this pattern was driven mainly by an increase in homogeneous K3LWT arrays (Figure 3). Thus, even though the K3LHis47Arg SNV was present in a high proportion of single-copy reads by P5, the scarcity of homogeneous K3LHis47Arg arrays at P10 suggests that the SNV entered multicopy arrays through recombination, rather than gene amplification from single copies of K3LHis47Arg.

Figure 3 with 4 supplements see all
The K3LHis47Arg variant homogenizes within multicopy arrays throughout experimental evolution.

Stacked bar plots representing the proportions of mixed and homogeneous K3L arrays were generated from ONT reads for the indicated virus populations (passages are listed above each plot). The proportions of reads containing homogeneous K3LWT, homogeneous K3LHis47Arg, or any combination of mixed alleles are shown for reads containing 1–5 K3L copies. A simulation of SNV accumulation under a binomial distribution is shown in Figure 3—figure supplement 1, and results from sequencing with different flow cell chemistries is shown in Figure 3—figure supplement 2. Simulations of the effects of ONT sequencing error rates on the identification of mixed and homogeneous arrays are shown in Figure 3—figure supplement 3, and the proportions of each combination of K3L alleles in 3, 4, and 5-copy arrays are shown in Figure 3—figure supplement 4.

https://doi.org/10.7554/eLife.35453.024

From passages 10–20, as the K3LHis47Arg allele frequency drastically increased in the population (Figure 2B), we observed a very different pattern in the K3L arrays. By the last passages of the experiments, homogeneous K3LHis47Arg arrays became increasingly prevalent, and were observed more frequently than mixed arrays (Figure 3). In contrast, when we simulated K3LHis47Arg accumulation in passages 10–20 according to a binomial distribution, we observed a markedly lower prevalence of homogeneous K3L multicopy arrays (Figure 3—figure supplement 1, see Materials and methods for further details). One possible explanation for our observations is that amplification of K3LHis47Arg or K3LWT copies drives the higher proportion of homogeneous arrays, and that sequencing errors at the K3LHis47Arg site could lead us to label truly homogeneous multicopy K3L arrays as containing mixed alleles. Therefore, to address the impact of ONT sequencing errors on our observations of mixed arrays, we took two approaches. First, we sequenced the P15 population with a variety of flowcell chemistries. Despite having distinct error rate profiles, each chemistry yielded nearly identical distributions of homogeneous and mixed K3L arrays (Figure 3—figure supplement 2). Second, we simulated a population of K3L arrays that matched the copy number distribution of the P15 population, in which all arrays were homogeneous for K3LWT or K3LHis47Arg alleles (see Materials and methods for further details). Then, as a proxy for sequencing errors, we randomly switched K3LWT alleles to K3LHis47Arg (and vice versa) at a frequency equal to the median error rate for each flowcell chemistry used in our experiments. From this analysis, the experimental data consistently returned substantially higher fractions of mixed arrays compared to the 1000 simulations, confirming that our observed enrichment of mixed arrays was not an artifact of sequencing errors (Figure 3—figure supplement 3).

We further probed the mechanism underlying rapid homogenization of the K3LHis47Arg variant by analyzing patterns of alleles within multicopy arrays from passages 10–20. In K3L arrays with 3 or 4 copies of the gene, we observed every possible combination of K3LWT and K3LHis47Arg alleles at P15 (Figure 3—figure supplement 4A–B). A closer examination of 3-copy K3L arrays revealed steady homogenization of the K3LHis47Arg SNV from P10 to P20 (Figure 3—figure supplement 4A). Mixed arrays remained prevalent in these populations, comprising between 23–42% of all 3-copy K3L arrays (Figure 3, Figure 3—figure supplement 4A). At P15, we also observed a similar overall proportion of various 3, 4, and 5-copy mixed arrays, suggesting that the trends observed for 3-copy arrays are as diverse, or potentially even more so, for mixed arrays at higher copy numbers (Figure 3—figure supplement 4). The abundance of mixed arrays, coupled with our observation of numerous allele combinations and the low point mutation rate of poxviruses (Gago et al., 2009; Sanjuán et al., 2010), strongly support a recombination-based mechanism of variant homogenization. Consistent with this idea, negligible sequencing error rates observed in our identification of K3L alleles demonstrate that mixed arrays are not technical artifacts (Figure 2—figure supplement 2, Figure 3—figure supplement 3) and that replacement of K3LWT arrays with pure K3LHis47Arg arrays is not the primary means of variant homogenization (Figure 3). These findings reveal recombination-driven genetic homogenization in the rapid rise of adaptive variation in homologous sequences.

Throughout the passaging experiment, the K3LHis47Arg SNV was nearly equally distributed throughout K3L arrays, regardless of K3L copy number (Figure 4A). To determine whether the presence of homogeneous K3L arrays, either K3LWT or K3LHis47Arg, influenced this pattern, we reanalyzed P15 virus genomes after removing all homogeneous K3L arrays from the dataset (Figure 4B). We observed the same striking pattern of homogeneity for the SNV, regardless of array copy number or a copy’s position within the array. Because the same K3LHis47Arg SNV frequency is observed even when homogeneous K3LWT or K3LHis47Arg genomes are excluded, these results suggest that once the K3LHis47Arg variant entered multicopy genomes, variant accumulation was independent of copy number.

The K3LHis47Arg variant homogenizes in K3L arrays regardless of copy number.

(A) ONT reads from every 5th passage were grouped by K3L copy number, and each K3L copy was assessed for the presence or absence of the K3LHis47Arg SNV. Reads containing 1–5 K3L copies are shown. (B) Using reads from the P15 population, homogeneous K3L arrays were removed from the dataset, and K3LHis47Arg SNV frequency was plotted exclusively in mixed arrays. The number of reads of each copy number is indicated to the right of each row. Reads are oriented 5’ to 3’ relative to the VC-2 reference sequence, and the K3LHis47Arg allele frequency in each copy is indicated in blue.

https://doi.org/10.7554/eLife.35453.029

Recombination and selection drive patterns of K3LHis47Arg homogenization

To investigate the influence of intergenomic recombination between co-infecting viruses on gene homogenization, we conducted serial infections at various multiplicity of infection (MOI). We repeated passages 11 to 15 using a range of MOI from 1.0 to 0.001 (the original experiments are MOI = 0.1), in order to determine whether increasing or virtually eliminating the occurrence of intergenomic recombination would affect the accumulation of the K3LHis47Arg variant. While intergenomic recombination has been observed following passaging of wild-type viruses at a low MOI (0.02; Qin and Evans, 2014), our lowest MOI (0.001), combined with the reduced replication ability of these virus populations (Figure 1A), greatly diminishes the probability of co-infection even over the course of a 48 hour passage. Analysis of all four P15 populations returned similar distributions of K3L copy number, as well as distributions of homogeneous and mixed arrays, regardless of MOI (Figure 5). Thus, even when co-infection is rare at the lowest MOI, these distributions are consistent, suggesting that patterns of variant accumulation are robust to changes in the probability of intergenomic recombination. Therefore, intragenomic recombination during replication from single virus infections is likely the main source for rapid homogenization of the K3LHis47Arg variant within gene arrays. Frequent crossover recombination events between virus genomes could contribute to the diversity of mixed K3L arrays. However, the initial presence of multicopy arrays containing mixed alleles, quickly followed by the predominance of multicopy arrays homogeneous for the variant (Figure 3) is also consistent with abundant intragenomic recombination resulting in a process reminiscent of gene conversion.

K3LHis47Arg homogenization within multicopy arrays is independent of intergenomic recombination rate.

The P10 population was serially passaged in HeLa cells at different MOIs (listed above each plot), and each of the resulting P15 populations was sequenced with ONT. Stacked bar plots representing the proportions of mixed and homogeneous arrays were generated as in Figure 3.

https://doi.org/10.7554/eLife.35453.030

Gene conversion is a driving force behind sustained sequence identity among repeated sequences, promoting concerted evolution (Chen et al., 2007; Ohta, 2010). While the precise role and outcomes of recombination during the spread of the K3LHis47Arg variant are difficult to test without a clear understanding of the recombination machinery in poxviruses (Gammon and Evans, 2009), two lines of evidence highlight the importance of natural selection for homogenization of the K3LHis47Arg variant in multicopy gene arrays. First, we repeated passages 11 to 15 under relaxed selection on K3L by infecting BHK cells, in which neither the K3LHis47Arg variant nor K3L CNV provides a measurable replication benefit (Figure 1—figure supplement 1; Elde et al., 2012). Consistent with related protocols of plaque purification in BHK cells (Figure 1C), we observed a uniform reduction of K3L copy number in the population following five passages in BHK cells (P15-BHK; Figure 6A). Further analysis of virus genomes from the P15-BHK population revealed that the K3LHis47Arg variant had not increased in frequency compared to P10, in contrast to its accumulation in HeLa cells (Figure 6B). Additionally, the proportion of reads homogeneous for K3LHis47Arg decreased following passaging in BHK cells (Figure 6C), as opposed to the rapid homogenization observed in HeLa cells. These results suggest that variant accumulation and homogenization are dependent on selective pressure imposed by the host environment.

K3LHis47Arg variant homogenization is dependent on selection.

The P10 population was serially passaged five times in BHK cells (MOI = 0.1, 48 hr; P15-BHK). P10 and P15 data are included from previous figures for comparison with P15-BHK. (A) K3L copy number was assessed for all sequenced reads that unambiguously aligned to K3L at least once, as in Figure 2C. (B) K3LHis47Arg and E9LGlu495Gly allele frequencies in each population were estimated using ONT reads, as in Figure 2B. Allele frequencies for all sequenced populations are included in Figure 6—source data 1. (C) Stacked bar plots representing the proportions of mixed and homogeneous arrays were generated from sequenced ONT reads, as in Figure 3. (D) ONT reads were assessed for the presence of each breakpoint (shown relative to the genome to the right) by aligning reads to a query sequence containing K3L using BLAST and extracting the starts and ends of individual alignments to the K3L duplicon. Due to sequencing errors, a proportion of reads do not match either breakpoint 1 or breakpoint 2.

https://doi.org/10.7554/eLife.35453.031
Figure 6—source data 2

Data used to generate Figure 6D.

https://doi.org/10.7554/eLife.35453.032
Figure 6—source data 1

Single nucleotide variants in all sequenced virus populations from Illumina or ONT datasets.

Data used to generate Figure 6B.

https://doi.org/10.7554/eLife.35453.033

Next, we considered how selection influenced the accumulation of the K3LHis47Arg SNV relative to variation in the population at two other loci. Unlike the K3LHis47Arg variant, the only other observed point mutation, E9LGlu495Gly, did not provide a replication benefit in human cells, indicating differential selection for each SNV (Figure 1D, Figure 1—figure supplement 3). Indeed, only the K3LHis47Arg variant reached near-fixation following passaging in human cells, while the E9LGlu495Gly variant instead decreased over the course of passaging in both HeLa and BHK cells (Figure 2—figure supplement 1 and Figure 6B). Similarly, the presence of two distinct recombination breakpoints (located three base pairs apart) allowed us to compare structural variation expected to be neutral relative to amino acid position 47 in K3L (Figure 2A), because both breakpoints fall outside of the K3L ORF and predicted promoter sequence (Yang et al., 2010). Across all analyzed populations, we observed that one breakpoint was dominant over the other, and that the frequency of these breakpoints did not appreciably change over the course of passaging in either cell line (Table 3 and Figure 6D). The consistency of these neutral genetic changes sharply contrasts the rapid accumulation of the K3LHis47Arg variant in response to selection, providing further support for the idea that selection is required to drive rapid homogenization of the K3LHis47Arg allele. Additionally, we observed that the frequencies of the K3LHis47Arg SNV and the K3L breakpoints are uncoupled (Figure 6B, Figure 6D), despite their proximity in the genome (Figure 2A). This observation supports gene conversion as a potentially major mechanism driving only the SNV to high frequency, rather than reciprocal recombination linking a particular breakpoint to the point mutation as it accumulated within the population. Taken together, our results support a model of adaptation resembling gene conversion, in which a beneficial variant that enters an amplified gene array can be rapidly spread among the remaining copies through recombination and selection (Figure 7).

Model of K3LHis47Arg homogenization within K3L CNV via gene conversion.
https://doi.org/10.7554/eLife.35453.034

Discussion

In this study, we investigated how a virus population evolves given the simultaneous presence of distinct adaptive variants at a single locus. In vaccinia virus populations harboring recombination-driven gene copy number amplifications and a beneficial point mutation in the same gene, courses of experimental evolution revealed a process of variant homogenization resembling gene conversion. Repeated gene amplification likely also contributes to the accumulation of the beneficial SNV; however, the large proportions and various patterns of mixed allele combinations in multicopy arrays suggest that this is not the only mechanism of genome diversification. Furthermore, while we cannot directly distinguish between gene conversion and crossover events, the consistency of neutral variants in close proximity to the rapidly homogenized SNV strongly support a model of recombination-driven homogenization. Future analyses might inform additional means of adaptation, but these results support gene conversion as a potentially critical mechanism underlying rapid homogenization of a SNV within repeated gene copies. This process could be a unique adaptive feature of large DNA viruses, because evolution through mechanisms of gene duplication are widespread in DNA viruses (McLysaght et al., 2003; Shackelton and Holmes, 2004; Filée, 2009; Elde et al., 2012; Filée, 2015; Gao et al., 2017), but rare in RNA viruses (Simon-Loriere and Holmes, 2013). For DNA viruses, which possess significantly lower point mutation rates than RNA viruses (Gago et al., 2009; Sanjuán et al., 2010), the rapid fixation of rare beneficial variants within multiple gene copies could be key to the process of adaptation.

A major outcome from genetic homogenization of beneficial point mutations in multicopy genes might be an enhanced persistence of large gene families. Under this model, the rapid spread of point mutations in gene arrays would counter the advantages of single or low-copy genomes enriched for the SNV to dominate DNA virus populations. Among poxviruses, for example, nearly half of the Canarypox genome consists of 14 gene families, which may have been regularly shaped by mechanisms of genetic homogenization (Afonso et al., 2000; Tulman et al., 2004). Within the emerging classes of giant viruses, the Bodo saltan virus is notable for a large gene family of 148 ankyrin repeat proteins at the ends of its linear genome, with some copies being nearly identical (Deeg et al., 2018). In cases like these, point mutations that are sampled among tandem arrays of genes might quickly spread to fixation through homogenization. Indeed, repeated homogenization through gene conversion has been suggested as the process driving concerted evolution of viral genes (Hughes, 2004), which has been observed in nanoviruses (Hughes, 2004; Hu et al., 2007; Savory and Ramakrishnan, 2014), baculoviruses (de Andrade Zanotto and Krakauer, 2008), and Epstein-Barr virus, a human herpesvirus (Ba Abdullah et al., 2017). In these studies, comparing naturally existing strains led the authors to propose cases of concerted evolution, however, short read sequencing of fixed populations restricted the possibility of investigating underlying mechanisms. In contrast, our experimental system allowed us to track an actively evolving virus population, and using long DNA sequencing reads, uncover a model consistent with gene conversion driving the rapid homogenization of a variant within gene arrays. We present a set of tools to study poxviruses and other DNA virus populations to determine whether similar mechanisms of diversification underlie adaptation during natural virus infections.

Our work also demonstrates the power of long read sequencing to perform high resolution analyses of complex genome dynamics. Using the Oxford Nanopore Technologies platform, we investigated two simultaneous adaptations at single-genome resolution. This type of analysis provides a framework to definitively determine the sequence content of tandem gene duplications, and accurately call variants within these repetitive regions. Additionally, we demonstrate the ability to phase extremely distant genetic variants, and the longest reads we obtained suggest that entire poxvirus genomes could routinely be captured in single reads. Sequencing entire DNA virus genomes with this level of detail could expand our understanding of DNA virus adaptation as a population evolves, either in an experimental system or during infection of a host. Together, these methods allow for high-resolution analyses of complex genomes and could be used to explore the evolution of diverse organisms in new and exciting detail.

Materials and methods

Key resources table
Reagent type (species)
or resource
DesignationSource or referenceIdentifiersAdditional information
Gene (Vaccinia virus)K3LNANCBI_Gene ID:3707649
Strain, strain
background (Vaccinia virus)
VC-2, Copenhagen(Goebel et al., 1990)
PMID: 2219722
NCBI_txid:10249;
NCBI_GenBank:M35027.1
Strain, strain
background (Vaccinia virus)
ΔE3L, Copenhagen(Beattie et al., 1995)
PMID: 7527085
Cell line (Homo sapiens)HeLaOtherObtained from Geballelab, University of Washington
Cell line
(Mesocricetus auratus)
BHKOtherObtained from Geballelab, University of Washington
Commercial assay
or kit
Covaris g-TUBECovaris, Inc.Catalog no: 520079
Commercial assay
or kit
DIG High-Prime DNA
Labeling and Detection
Starter Kit II
RocheCatalog no: 11585614910
Commercial assay
or kit
Nextera XT DNA library
preparation kit
IlluminaCatalog no: FC-131–1024
Commercial assay
or kit
SQK-NSK007; SQK-LSK208;
SQK-LSK308; SQK-RAD002
Oxford Nanopore
Technologies
Catalog no: SQK-NSK007;
SQK-LSK208; SQK-LSK308;
SQK-RAD002
Commercial assay
or kit
FLO-MIN104; FLO-MIN106;
FLO-MIN107
Oxford Nanopore
Technologies
Catalog no: FLO-MIN104;
FLO-MIN106; FLO-MIN107
Chemical compound,
drug
DMEMHyClone, VWRCatalog no: 16777–129
Chemical compound,
drug
FBSHyClone, VWRCatalog no: 26-140-079
Chemical compound,
drug
Penicillin-streptomycinGE Life Sciences, VWRCatalog no: 16777–164
Chemical compound,
drug
SG-2000GE Life Sciences, VWRCatalog no: 82024–258
Software, algorithmGraphPad PrismGraphPad Software
Software, algorithmBWA-MEM(Li, 2013)v0.7.15arxiv.org/abs/1303.3997
Software, algorithmsamblaster(Faust and Hall, 2014)
PMID: 24812344
v0.1.24https://github.com/GregoryFaust/samblaster
Software, algorithmfreebayes(Garrison and Marth,
2012)
v1.0.2–14arxiv.org/abs/1207.3907
Software, algorithmMetrichorOxford Nanopore
Technologies
v2.40
Software, algorithmAlbacoreOxford Nanopore
Technologies
v1.2.4
Software, algorithmporetools(Loman and Quinlan,
2014) PMID: 25143291
v0.6.0https://github.com/arq5x/poretools
Software, algorithmPorechopOtherv0.2.3https://github.com/rrwick/Porechop
Software, algorithmnanopolish(Loman et al., 2015)
PMID: 26076426
v0.8.4https://github.com/jts/nanopolish
Software, algorithmsource codethis paperSee Materials andmethods, https://github.com/tomsasani/vacv-ont-manuscript;
copy archived at
https://github.com/elifesciences-publications/vacv-ont-manuscript)
Software, algorithmraw sequencing datathis paperSRP128569; SRP128573;
DOI: 10.5281/zenodo.1319732
See Materials and methods
Software, algorithmraw sequencing data(Elde et al., 2012)
PMID: 22901812
SRP013146

Cells

HeLa and BHK cells were maintained in Dulbecco’s modified Eagle’s medium (DMEM; HyClone, Logan, UT) supplemented with 10% fetal bovine serum (HyClone), 1% penicillin-streptomycin (GE Life Sciences, Chicago, IL), and 1% stable L-glutamine (GE Life Sciences). Cell lines were authenticated by STR analysis of 24 loci (DNA Sequencing Core Facility, University of Utah, Salt Lake City, UT) for HeLa cells, and species-specific PCR for both HeLa and BHK cells as previously described (Steube et al., 2008). Both cell lines tested negative for mycoplasma contamination, using the MycoSensor PCR Assay Kit (Agilent Technologies, Inc., Santa Clara, CA) according to the manufacturer’s protocol.

Experimental evolution

Request a detailed protocol

The P5-P10 populations of vaccinia virus were previously established following serial passages of the ΔE3L virus (Beattie et al., 1995) in HeLa cells (Elde et al., 2012). Briefly, 150 mm dishes were seeded with an aliquot from the same stock of HeLa cells (5 × 106 cells/dish) and infected (MOI = 1.0 for P1, and MOI = 0.1 for subsequent passages) for 48 hr. Cells were then collected, washed, pelleted, and resuspended in 1 mL of media. Virus was released by one freeze/thaw cycle followed by sonication. P10 (replicate C passage 10 in Elde et al., 2012) virus was expanded in BHK cells, and titer determined by 48 hr plaque assay in BHK cells. Passages 11–20 were performed as above, starting with the P10 virus population. Following passage 20, replication ability was assayed simultaneously by 48 hr infection (MOI = 0.1) in at least triplicate in HeLa cells.

For the intergenomic recombination passages, P10 virus was passaged in HeLa cells five times as above at a range of MOI (MOI = 0.001–1.0 as indicated) for 48 hr. For BHK passages, the P10 virus population was passaged five times as above, using an aliquot from the same stock of BHK cells (5 × 106 cells/dish) infected (MOI = 0.1) for 48 hr.

All replication comparisons were performed on biological triplicates, assessing virus titers by 48 hr plaque assay in BHK cells performed with at least three technical replicates, as shown in the source data associated with each figure. Statistical analyses were performed using GraphPad Prism (GraphPad Software, La Jolla, CA).

Southern blot analysis

Request a detailed protocol

Viral DNA from purified viral cores was digested with EcoRV (New England Biolabs, Ipswich, MA), and separated by agarose gel electrophoresis. DNA was transferred to nylon membranes (GE Life Sciences) using a vacuum transfer, followed by UV-crosslinking. Blots were probed with PCR-amplified K3L using the DIG High-Prime DNA Labeling and Detection Starter Kit II (Roche, Basel, Switzerland) according to the manufacturer’s protocol.

Isolation and testing of plaque purified clones

Request a detailed protocol

BHK cells were infected for 48 hr with dilutions of P15 or P20 virus, and overlayed with 0.4% agarose. Single plaques were harvested and transferred to new BHK dishes, and resulting wells harvested after 48 hr. Virus was released by one freeze/thaw cycle followed by sonication. The process was repeated three additional times for a total of four plaque purifications. Viral DNA was extracted from individual clones from each population as previously described (Esposito et al., 1981) from infected BHK cells (MOI = 0.1) 24 hr post-infection, and assessed for K3L CNV and the K3LHis47Arg SNV by PCR and Sanger sequencing.

One clone of each genotype (single-copy K3LWT or K3LHis47Arg, and multicopy K3LWT or K3LHis47Arg) was expanded in BHK cells. Replication ability was assessed by 48 hr infection (MOI = 0.1) in triplicate in either HeLa or BHK cells. Virus titers were determined by 48 hr plaque assay in BHK cells performed with at least three technical replicates.

Deep sequencing of viral genomes

Illumina

Request a detailed protocol

Total viral genomic DNA was collected as above (Esposito et al., 1981). Libraries were constructed using the Nextera XT DNA sample prep kit (Illumina, Inc., San Diego, CA). Barcoded libraries were pooled and sequenced on an Illumina MiSeq instrument (High-Throughput Genomics Core Facility, University of Utah). Reads were mapped to the Copenhagen reference strain of vaccinia virus (VC-2; accession M35027.1, modified on poxvirus.org; Goebel et al., 1990) using default BWA-MEM v0.7.15 (Li, 2013) parameters. PCR duplicates were removed using samblaster (Faust and Hall, 2014). Variant calling was performed using freebayes v1.0.2–14 (Garrison and Marth, 2012), using the following parameters:

freebayes -f $REF $BAM --pooled-continuous -C 1 --genotype-qualities --report-genotype-likelihood-max -F 0.01.

Oxford Nanopore Technologies

Request a detailed protocol

Virus particles were isolated from infected BHK cells (MOI = 1.0) 24 hr post-infection, and viral cores were purified by ultracentrifugation through a 36% sucrose cushion at 60,000 rcf for 80 min. Total viral genomic DNA was extracted from purified cores as above (Esposito et al., 1981). Purified DNA was then size-selected in a Covaris g-TUBE (Covaris, Inc., Woburn, MA) 2 × 1 min at 6000 rpm. Sequencing libraries for the P10, P15, and P20 populations were prepared using the ONT SQK-NSK007 kit and sequenced on R9 chemistry MinION flow cells (FLO-MIN104); libraries for the P5, P6, and P7 populations were prepared using the ONT SQK-LSK208 kit and sequenced on R9.4 chemistry flow cells (FLO-MIN106); libraries for the P15 MOI 0.01, P15 MOI 0.1, and P15-BHK populations were prepared using the SQK-LSK308 kit and sequenced on R9.5 chemistry flow cells (FLO-MIN107); libraries for the P15 MOI 0.001 and P15 MOI 1.0 populations were sequenced using both R9.4 and R9.5 chemistry flow cells; additional libraries for the P15 population were sequenced using R9.5 chemistry flow cells (Oxford Nanopore Technologies Ltd., Oxford, UK). For the specific long read library preparation, we used purified, un-sheared P15 viral DNA and a SQK-RAD002 sequencing kit; libraries were sequenced on a FLO-MIN106 flow cell. All sequencing reactions were performed using a MinION Mk1B device and run for 48 hr; base calling for R9 reactions was performed using the Metrichor cloud suite (v2.40), while a command line implementation of the Albacore base caller (v1.2.4) was used to base call data from the remaining sequencing runs. For Albacore base calling on R9.4 runs, the following command was used:

read_fast5_basecaller.py -k SQK-LSK208 -f FLO-MIN106 -o fast5 -t 16 r -i $RAW_FAST5_DIRECTORY.

For Albacore base calling on R9.5 runs, the following command was used:

full_1dsq_basecaller.py -k SQK-LSK308 -f FLO-MIN107 -o fastq,fast5 -t 16 r -i $RAW_FAST5_DIRECTORY.

For all R9 and R9.4 data, FASTQ sequences were extracted from base-called FAST5 files using poretools v0.6.0 (Loman and Quinlan, 2014), while FASTQ were automatically generated by Albacore during base-calling on the R9.5 populations. Prior to alignment, adapter trimming on all ONT reads was performed using Porechop (https://github.com/rrwick/Porechop). Only the highest quality reads (2D and 1D2 for R9/R9.4 and R9.5 chemistries, respectively) were used for downstream analysis. Pooled FASTQ files for each sample were then aligned to the VC-2 reference genome with BWA-MEM v0.7.15 (Li, 2013), using the default settings provided by the -x ont2d flag. Population-level estimates of SNV frequencies were determined from our nanopore data using nanopolish v0.8.4 (Loman et al., 2015).

Copy number and allele frequency estimation

Request a detailed protocol

Custom Python scripts (www.github.com/tomsasani/vacv-ont-manuscript [copy archived at https://github.com/elifesciences-publications/vacv-ont-manuscript] DOI: 10.5281/zenodo.1320424) were used to calculate both K3L copy number and K3LHis47Arg allele frequency within individual aligned virus reads. Briefly, to identify individual ONT reads containing K3L, we first selected all reads that aligned at least once to the duplicon containing K3L. We next categorized ONT reads containing K3L as single-copy or multicopy. Reads that unambiguously aligned once to the K3L locus were classified as single-copy. The VC-2 reference contains a single K3L gene; therefore, if more than one distinct portion of a read aligned to the full length of K3L, that read was instead classified as multicopy. For every one of a read’s alignments to K3L, we examined the read base aligned to reference position 30,490. If the read base matched the reference, we catalogued that alignment as being K3LWT, and if the read base was a cytosine, we catalogued it as K3LHis47Arg. We filtered reads to include only those that aligned to the K3L duplicon, as well as 150 bp of unique VC-2 sequence upstream and downstream of the K3L duplicon, to ensure that the read fully contained the estimated number of K3L copies and was derived from a single vaccinia genome. Finally, we removed sequencing reads that contained one or more truncated alignments to the K3L duplicon, one or more alignments to K3L with mapping qualities less than 20, or one or more alignments to K3L with a non-reference or non-K3LHis47Arg base (i.e., not a C or T) at reference position 30,490.

Breakpoint characterization

Request a detailed protocol

To characterize the proportions of K3L duplicon breakpoint pairs in the ONT data, we first extracted all reads from each population that unambiguously aligned to K3L at least once, and that also aligned to unique sequence 150 base pairs up- and downstream of the K3L duplicon. We then aligned a 1000 bp query sequence (containing VC-2 reference sequence from genomic position 30,000 to 31,000) to each of the extracted reads using BLAST (Zhang et al., 2000). For every read, we extracted the start and end coordinates of all high-quality alignments to the K3L query, ±50 bp from the expected K3L breakpoints (Elde et al., 2012).

Estimating ONT sequencing error

Request a detailed protocol

We analyzed ONT sequencing accuracy at each k-mer centered on the reference or non-reference base for the K3LHis47Arg or E9LGlu495Gly SNVs in our ONT datasets (K3LWT: TATGC, K3LHis47Arg: TACGC, E9LWT: ATTCG, E9LGlu495Gly: ATCCG). For each instance of the 5-mer in the reference genome (excluding the SNV location and the first or last 10 kbp of repetitive sequence in the reference genome), we calculated the proportions of each non-reference base aligned to the middle nucleotide as a proxy for sequencing errors. Additionally, we calculated the proportion of alignments in which there was a deletion at the middle nucleotide. We then generated kernel density plots representing the distributions of these error proportions across all instances of the 5-mer in the reference genome (K3LWTn = 126, K3LHis47Argn = 83, E9LWTn = 152, E9LGlu495Glyn = 170), and calculated the median proportion of each sequencing error across all 5-mer sites.

Generating simulated distributions of the K3LHis47Arg variant within multicopy arrays

Request a detailed protocol

To simulate the accumulation of the K3LHis47Arg allele in K3L arrays, we first created a simulated population of K3L arrays for each population that matched the copy number distribution of the passage of interest. Each copy within this population was initially K3LWT; to simulate de novo accumulation of K3LHis47Arg within these arrays, we looped over every copy within every K3L array. After randomly sampling a single value from a uniform distribution (0.0 to 1.0), if that value was less than or equal to the observed population allele frequency of K3LHis47Arg, we ‘mutated’ the selected copy. This process was repeated until the population allele frequency of the hypothetical population matched the observed population allele frequency at that passage. This process effectively simulated a binomial distribution of K3LHis47Arg alleles within the hypothetical population, with the probability of mutating any particular K3L copy equal to the observed K3LHis47Arg allele frequency in the passage.

Simulating the effects of error rate on observations of mixed arrays

Request a detailed protocol

After empirically determining error rates for the 5-mers containing K3LWT or K3LHis47Arg, we simulated the effects of these error rates on our observed proportions of mixed and homogeneous arrays. To do this, we converted every vaccinia array in our experimental data into a homogeneous array using the following heuristic: if the array is already homogeneous, keep it as such; if the array contains a majority of K3LWT or K3LHis47Arg alleles, convert it into an array of identical copy number that is homogeneous for the majority allele; if the array has an equal number of K3LWT and K3LHis47Arg copies, randomly convert the array into a homogeneous K3LHis47Arg or homogeneous K3LWT array of identical copy number. Then, for each K3L copy in each of these ‘converted’ arrays, we randomly introduced sequencing errors (i.e., switched K3LHis47Arg alleles to K3LWT alleles, and vice versa) at a rate equivalent to the median C > T or T > C error rate for that sequencing chemistry.

Accession numbers

Request a detailed protocol

All deep sequencing data are available on the Sequence Read Archive under accessions SRP128569 (Oxford Nanopore reads) and SRP128573 (Illumina MiSeq reads from the P15 and P20 populations). Previously published Illumina MiSeq reads from the P10 population are available on the SRA under accession SRP013146 (Elde et al., 2012). All Illumina MiSeq and Oxford Nanopore data are additionally archived on Zenodo at the following DOI: 10.5281/zenodo.1319732.

Data availability

Sequencing data are publicly available at DOI: 10.5281/zenodo.1169394 Source data files are provided in the revised submission

The following data sets were generated
The following previously published data sets were used

References

    1. Ball LA
    (1987)
    High-frequency homologous recombination in vaccinia virus DNA
    Journal of Virology 61:1788–1795.
    1. Beattie E
    2. Denzler KL
    3. Tartaglia J
    4. Perkus ME
    5. Paoletti E
    6. Jacobs BL
    (1995)
    Reversal of the interferon-sensitive phenotype of a vaccinia virus lacking E3L by expression of the reovirus S4 gene
    Journal of Virology 69:499–505.
    1. Davies MV
    2. Elroy-Stein O
    3. Jagus R
    4. Moss B
    5. Kaufman RJ
    (1992)
    The vaccinia virus K3L gene product potentiates translation by inhibiting double-stranded-RNA-activated protein kinase and phosphorylation of the alpha subunit of eukaryotic initiation factor 2
    Journal of Virology 66:1943–1950.
    1. Evans DH
    2. Stuart D
    3. McFadden G
    (1988)
    High levels of genetic recombination among cotransfected plasmid DNAs in poxvirus-infected mammalian cells
    Journal of Virology 62:367–375.
    1. Haber JE
    (2000) Lucky breaks: analysis of recombination in Saccharomyces
    Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis 451:53–69.
    https://doi.org/10.1016/S0027-5107(00)00040-3
    1. Merchlinsky M
    (1989)
    Intramolecular homologous recombination in cells infected with temperature-sensitive mutants of vaccinia virus
    Journal of Virology 63:2030–2035.
  1. Book
    1. Ohno S
    (1970)
    Evolution by Gene Duplication
    Berlin, Heidelberg: Springer Berlin Heidelberg.
    1. Perkins DD
    (1992)
    Neurospora: the organism behind the molecular revolution
    Genetics 130:687–701.
    1. Spyropoulos DD
    2. Roberts BE
    3. Panicali DL
    4. Cohen LK
    (1988)
    Delineation of the viral products of recombination in vaccinia virus-infected cells
    Journal of Virology 62:1046–1054.

Decision letter

  1. Richard A Neher
    Reviewing Editor; University of Basel, Switzerland
  2. Patricia J Wittkopp
    Senior Editor; University of Michigan, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

[Editors’ note: the authors were asked to provide a plan for revisions before the editors issued a final decision. What follows is the editors’ letter requesting such plan.]

Thank you for sending your article entitled "Long read sequencing reveals poxvirus evolution through rapid homogenization of gene arrays" for peer review at eLife. Your article is being evaluated by three peer reviewers, and the evaluation is being overseen by Richard Neher as the Reviewing Editor and Patricia Wittkopp as the Senior Editor. The reviewers discussed your manuscript at length and one critical issue came up.

Sasani and colleagues use long read nanopore sequencing to characterize pox virus populations from cell culture evolution experiments. In these experiments, poxviruses adapt initially by amplification of the target gene or by a point mutation. The central claim of the paper is that the point mutation subsequently spreads among amplified gene arrays via a gene-conversion like mechanism. This is an intriguing possibility and long read sequencing seems in principle well suited to characterize this complex evolutionary dynamics. However, the evidence presented is also compatible with the following more parsimonious explanation:

The initial response of the virus population is gene amplification and the rise of single copy mutant genes. Once the latter are common, it becomes likely that they undergo gene amplification as well resulting in multi-copy mutant arrays. These homogeneous mutant arrays than gradually replace the WT arrays. This mechanism explains the homogeneity of the arrays and their late dominance via processes we already know to be common in this system without the need to invoke gene conversion. We failed to identify evidence in the data you presented that favors gene conversion over this simpler mechanism, especially given the high error rate of nanopore sequencing.

Given this critical issue, the editors and reviewers invite you to respond within the next two weeks either with existing data/explanations in favor of gene conversion, or suggestions for experiments/analyses that would decisively answer this question. We plan to share your responses with the reviewers and then issue a binding recommendation.

[Editors’ note: formal revisions were requested, following approval of the authors’ plan of action.]

Thank you for submitting your article "Long read sequencing reveals poxvirus evolution through rapid homogenization of gene arrays" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Richard A Neher as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Patricia Wittkopp as the Senior Editor.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

Sasani and colleagues use long read nanopore sequencing to characterize pox virus populations from cell culture evolution experiments. The central claim of the paper is that pox virus achieves immune evasion initially by either amplification of the target gene or a point mutation. Subsequently, the point mutation spreads among amplified gene arrays via a gene-conversion like mechanism. This is an intriguing possibility and long read sequencing is well suited to characterize these complex evolutionary dynamics.

Essential revisions:

Sasani et al., have already responded to our most serious concern, an alternative mechanism via amplification of mutant alleles, and their additional analyses provide convincing evidence that this simple mechanism can be ruled out. We would like these analyses to be included in the manuscript and have a number of additional essential points that need to be addressed:

1) Statistics on the arrangements of alleles in multi-copy arrays. How often are different patterns AAAAA, AABAA, ABBAB, etc. observed? Do identical alleles tend to be neighbors or are they randomly distributed? Are specific subpatterns over-represented across 3,4,5 copy arrays? These statistics should be used to explicitly discuss the plausibility of different scenarios and ideally quantitatively model the dynamics across time. Simply showing examples/pictures and stating that this result is "compatible" with your hypothesis is not enough. For follow-up analysis, it would be useful to provide alignments of 1, 2, 3, 4, 5, copy arrays such that readers don't have to go back to the FASTQ files (maybe after stripping non-reference insertions that are likely ONT errors).

2) Recombination and MOI: Viral titer will increase dramatically during the 48h experiments and there is, therefore, no single MOI for any experiment. Furthermore, even experiments started at low MOI likely have high MOI towards the end. Subsection “Recombination and selection drive patterns of K3LHis47Arg homogenization” should explain these caveats and discuss carefully the degree to which inter-genomic recombination can be ruled out. Statistical signatures of WT/mutant patterns in the multicopy arrays might help to differentiate such patterns (cross-over recombination might result in mutant copies being preferentially at the 3' or 5' end of the array).

Qin and Evans, 2014 provide careful estimates of pox virus recombination rates. In particular, they find short recombination tract length (that look like gene conversion) and frequent recombination even though experiments start at MOI 0.02. This further calls into question your assumption that little inter-genomic recombination is happening.

If inter-genomic recombination is common, the observed patterns could be explained by non-allelic homologous recombination that results in frequent deletion, concatentation, and replacements of arrays or parts thereof.

3) ONT error rates. A detailed analysis of sequencing errors needs to be part of the manuscript. It would be useful to indicate the ONT error threshold in Figure Figure 2B. How does one reconcile the fact that ONT finds 10% mutant alleles in P5 while these are not detected in Illumina reads?

4) An effort should be made to clarify the logic of the manuscript. The reviews below highlight problems and provide a number of concrete suggestions to improve the manuscript.

5) "Homology" is used incorrectly: homology is a binary quality. Two sequences either share a common ancestor (are homologous) or they don't. You use homology as a synonym for similarity, which it is not (see https://www.ncbi.nlm.nih.gov/books/NBK20255/).

6) Better graphs: bar graphs are a sub-optimal way to present data in many cases. In Figure 1A/D, for example, please show all data points and the median, there is no need for the bars. Figure Figure 1—figure supplement 1 is even worse - all the relevant differences happen within 10% of the figure. Figure 2C might be better as line graph with copy number as x-axis and one line per passage. We would like to see graphs and statistics that explicitly show the WT/mutant arrangements in multi-copy arrays with more information than the current Figure 3 and Figure 4. Overall, quantitative analysis and presentation of the data should be improved.

Finally, you need to be careful not to overstate your case: Evidence for gene conversion remains indirect (experiments with mixtures of viruses recoded at synonymous positions might provide more direct evidence), the importance of this process in the real-world pox-virus infections remains speculative, and other mechanisms to homogenize the array can't be ruled out.

https://doi.org/10.7554/eLife.35453.044

Author response

[Editors’ note: what follows is the authors’ plan to address the revisions.]

Sasani and colleagues use long read nanopore sequencing to characterize pox virus populations from cell culture evolution experiments. In these experiments, poxviruses adapt initially by amplification of the target gene or by a point mutation. The central claim of the paper is that the point mutation subsequently spreads among amplified gene arrays via a gene-conversion like mechanism. This is an intriguing possibility and long read sequencing seems in principle well suited to characterize this complex evolutionary dynamics. However, the evidence presented is also compatible with the following more parsimonious explanation:

The initial response of the virus population is gene amplification and the rise of single copy mutant genes. Once the latter are common, it becomes likely that they undergo gene amplification as well resulting in multi-copy mutant arrays. These homogeneous mutant arrays than gradually replace the WT arrays. This mechanism explains the homogeneity of the arrays and their late dominance via processes we already know to be common in this system without the need to invoke gene conversion. We failed to identify evidence in the data you presented that favors gene conversion over this simpler mechanism, especially given the high error rate of nanopore sequencing.

Given this critical issue, the editors and reviewers invite you to respond within the next two weeks either with existing data/explanations in favor of gene conversion, or suggestions for experiments/analyses that would decisively answer this question. We plan to share your responses with the reviewers and then issue a binding recommendation.

We appreciate the concern and outline here how the existing data support our proposed mechanism of gene conversion. We also took this opportunity to perform some new analyses. Taking the existing data and new analyses, we highlight three lines of evidence supporting a model of gene conversion. Although some homogeneous replacement of wild type K3L arrays likely occurs, we think you will agree that a closer examination of the data strongly supports a primary role for recombination-driven homogenization of the K3L arrays, revealing the importance of gene conversion for poxvirus adaptation. We feel that this remains a novel and important finding with potentially far reaching implications, particularly for large DNA virus evolution, meriting further consideration for publication in eLife.

Here we outline the major points supporting our model, with details of our new analyses. We expect that this and related modifications would be included in a revised manuscript:

The prevalence and complexity of the mixed (mutant and WT K3L) multicopy genomes we observe favors a recombination-based mechanism for variant homogenization. We agree that the current presentation of data, for example in Figure 3 and Figure 4, makes it difficult to distinguish the mechanisms underlying our conclusions. However, random sampling and genotyping of vaccinia genomes (Author response image 1), strongly supports the existence and variety of mixed arrays, even when accounting for error rates in nanopore sequencing, which we further analyzed in the next point.

Author response image 1
Variety of mixed vaccinia genomes in P15 sequencing data.

We selected a random set of 20 vaccinia genomes of the specified copy number from the P15 sequencing data (a passage with a large proportion of mixed genomes), and plotted the distribution of K3LHis47Arg and K3LWT alleles within each genome.

https://doi.org/10.7554/eLife.35453.036

We calculated the sequencing error rate for the exact sequence change encoding the K3LHis47Arg variant using our nanopore data. We demonstrate that the relatively low error rate for calling this mutation (<3%) cannot account for the abundance of mixed genomes (see Point 2). Furthermore, the proportion of mixed genomes we observe is consistent across multiple nanopore sequencing chemistries (3 versions over the course of our study) and is therefore robust to changes in error rate (see Point 2, Figure 2 and Table 1). This new analysis also includes simulations of K3L array variant calling given the nanopore error rates, which further support the veracity of our original interpretation of the data (Point 2, Figure 3 and Table 2).

Independent evidence for mixed array homogenization emerged from our observations during the course of the experiments. Specifically, the K3LHis47Arg variant is present at a consistent frequency of roughly 10% from P5 to P10, such that the substantial expansion of multicopy genomes that occurs during these passages are almost entirely K3LWT (see Figure 1B of original manuscript). As the expansion of K3L arrays reaches a plateau by P10, the frequency of mixed genomes rapidly increases, with K3LHis47Arg reaching nearly 90% frequency by P20. We simultaneously tracked a nearby variant, a 3 bp shift at the site of the recombination breakpoints. The resulting two breakpoint variants appear to be neutral and, in sharp contrast to K3LHis47Arg, they remain at nearly identical frequencies from P5 to P20 of the experiments (see Figure 6D and Table S1 in the original manuscript). Our interpretation of these data is that virus populations “top out” in K3LWT copy number before the His47Arg mutation begins infiltrating K3L arrays. Under the alternate model proposed in review, we would predict that one of the two neutral breakpoints would preferentially ‘hitchhike’ in homogeneous K3LHis47Arg arrays, and at the very least change substantially in frequency. More consistent with these data, a model of recombination-driven homogenization can account for the consistency of the neutral variants at a nearby genomic location (a point we propose to clarify and emphasize in a revised manuscript).

We anticipate including these new analyses and also sharpening the presentation of evidence for gene conversion in a revised version of the manuscript. Please find below details supporting and clarifying our claims:

Point 1: The large proportion of mixed alleles is consistent with recombination-based accumulation and homogenization of the K3LHis47Arg variant.

As the K3LHis47Arg variant accumulated in the population, the high proportion of multicopy genomes containing mixed sets of alleles suggests that rampant recombination occurred, in line with a model involving gene conversion (subsection “The K3LHis47Arg variant rapidly homogenizes in multicopy vaccinia genomes”). For example, these mixed K3L gene arrays comprise nearly ¼ of all vaccinia genomes sequenced at P15 (Figure 3, original submission). Given the high recombination rate and relatively low point mutation rate of poxviruses, these mixed arrays are likely a product of recombination events rather than repeated de novo point mutations.

The reviewers raise the point that the mixed arrays could also be a result of the higher error rate of nanopore sequencing (which we further address below). Were this the case, we would expect the vast majority of these mixed genomes to be homogeneous except for one K3L copy containing a different allele. However, we observed a large variety of mixed K3LWT and K3LHis47Arg copies in multicopy genomes (Author response image 1).

Furthermore, we observed various combinations of K3L alleles in mixed genomes, rather than alleles of each type consistently grouped together (Author response image 1). These data are indicative of successive (rather than single) recombination events, further supporting a recombination-based mechanism of variant accumulation and homogenization. Copy number expansions from a single-copy K3LHis47Arg genome can, and likely do, occur, but the abundance of mixed arrays suggests that this is not the primary mechanism of homogenization, as proposed under the alternative model. The presence of the mixed arrays better fits a model of recombination spreading the variant through multicopy genomes.

Point 2: Nanopore sequencing error rates at the K3LHis47Arg locus are low, and do not explain the large fraction of mixed genomes we observe.

One possible concern is that the high error rate of nanopore sequencing may be impacting our observations of homogeneous and mixed arrays. Such a concern understandably could stem from our lack of clarity regarding the sources and rates of nanopore sequencing error in our data as originally presented. Because nanopore sequencing involves the measurement of changes in ionic current caused by specific kmers occupying the nanopore, rather than individual nucleotides, sequencing accuracy varies depending on the particular k-mer being sequenced. Indeed, counts of homopolymer k-mers are often underrepresented in ONT reads, when compared to their counts in the reference genome of the sequenced sample (Loman et al., 2015).

Therefore, we have more clearly defined the error rates in our sequencing data. To do this, we calculated the proportions of sequencing errors (mismatches and deletions) at all other 5-mers in the vaccinia genome that match the two 5-mer sequences that include the nucleotide encoding either K3LWT or K3LHis47Arg (WT: TATGC, n=126, His47Arg: TACGC, n=83; see Extended Methods below). Using three separate nanopore sequencing chemistries, we find that T>C and C>T errors at these 209 loci are slightly more common than other single nucleotide errors; however, the median proportions of these errors are never greater than 2.6% (Author response image 2 and Response table 1). In stark contrast, using these R7.3, R9, and R9.4 sequencing chemistries to sequence P15, we estimate the population allele frequency of K3LHis47Arg (encoded by a T>C variant) in multicopy genomes to be 53.8%, 55.0%, and 57.0%, respectively. These observed allele frequencies at the K3L locus are clearly well above the basal ONT error rates for these 5-mers, and demonstrate that sequencing error cannot account for the prevalence and complexity of mixed genomes we observe.

Author response image 2
Error rate distribution within K3LHis47Arg sequence context.

(A) Kernel density plots representing the distribution of error rates for T>C, T>A, T>G, and T>deletion error across all TATGC 5-mers in data from each flowcell chemistry. (B) Kernel density plots representing the distribution of error rates for C>T, C>A, C>G, and C>deletion errors across all TACGC 5-mers in data from each flowcell chemistry.

https://doi.org/10.7554/eLife.35453.037

Response table 1: Median T>C and C>T sequencing error rates using various ONT chemistries

R7.3 ONT chemistryR9 ONT chemistryR9.4 ONT chemistry
T > C0.0230.0230.005
C > T0.0150.0240.026

Although we used different flow cell chemistries throughout the course of our experiments, and each chemistry has a different error profile, we observed a nearly identical proportion of mixed genomes when sequencing the same population with each of the different flow cell chemistries (Author response image 3A). This suggests that our observations of mixed arrays are robust to both flow cell chemistry and sequencing error rates. Our calculations of error for the specific 5-mer surrounding the K3LHis47Arg allele show that the later flow cell chemistries have little T>C bias (Author response image 2), and yet we still observed a high proportion of mixed reads using these flow cells. As mentioned above, if rare sequencing errors were driving our observations of mixed genomes, we might expect that the most common types of mixed genomes in our data would be “near homogeneous” (i.e., homogeneous except for one copy due to a sequencing error). Therefore, we calculated binomial p-values to assess whether we observe more of these “near homogeneous” genomes than expected given the T>C and C>T error rates in our data (Extended Methods). For all flow cell chemistries, these pvalues are extremely small (Response Table 2), suggesting that the mixed genomes we observe are truly heterogeneous, rather than artifacts.

Author response image 3
Oxford Nanopore sequencing error does not impact observed patterns of K3L copy number or K3LHis47Arg allele heterogeneity.

The P15 vaccinia population was sequenced with R7.3, R9, and R9.4 chemistry ONT flowcells. (A) Stacked bar plots representing the diversity of allele combinations within single-copy and multicopy reads were generated from sequenced ONT reads from each chemistry, as in Figure 3 and Figure 5 (original manuscript). (B) All mixed genomes observed in the data from each flowcell chemistry were converted to homogeneous genomes (Extended Methods), and sequencing errors were randomly distributed throughout these genomes at a rate equivalent to the median C>T or T>C error rate for the particular chemistry (see Response Table 1). Stacked bar plots representing the resulting diversity of allele combinations were then created as in (A). (C) We performed the simulation in (B) a total of 1000 times, and generated kernel density plots representing the distribution of mixed genome proportions recovered in each simulation. The red line indicates the proportion of mixed genomes observed in the experimental data for the sequencing chemistry of interest..

https://doi.org/10.7554/eLife.35453.038

Response Table 2: Binomial p-values calculated for observed numbers of near-homogeneous (NH) mixed genomes

R7.3 ONT chemistryR9 ONT chemistryR9.4 ONT chemistry
P (NH WT)4.46E-53.21E-382.48E-42
P (NH H47R)2.86E-201.54E-2356.03E-123

Furthermore, if we assume that all genomes in our data are truly homogeneous and then randomly distribute T>C and C>T errors throughout these genomes at a frequency equivalent to their median error rates for each chemistry, we find that these errors do not account for the high proportion of mixed arrays that we observe in our experimental data (Author response image 3A, 3B). We repeated this experiment (i.e., randomly distributing sequencing errors) a total of 1,000 times, and find that our observed proportion of mixed genomes is always greater than the results of simulations (Author response image 3C). Using R7.3, R9, and R9.4 chemistry flowcells, the observed rate was roughly five times the rate expected from sequencing error (defined as the median of 1,000 simulations). Thus, while the error rate of nanopore sequencing is higher than that of technologies like Illumina, it does not have a large impact on our data. These internal controls further support our accurate identification of mixed genomes, suggesting that recombination is rampant, and thus the spread of the K3LHis47Arg variant cannot be due to expansion of homogeneous genomes alone.

Point 3: Multicopy homogeneous K3LHis47Arg genomes do not initially accumulate, despite abundant single-copy K3LHis47Arg genomes.

Under a model of homogeneous array replacement, single-copy K3LHis47Arg genomes undergo gene amplification once they become common in the population, and subsequently replace WT homogeneous genomes. However, this model does not match our observations. As early as P5, the K3LHis47Arg variant was present in nearly 40% of single-copy genomes. At P10, however, we observed very few multicopy genomes homogeneous for the K3LHis47Arg allele (original manuscript, Figure 3). Were selection for homogeneous K3LHis47Arg genomes the main driver of variant accumulation, we would expect to see a larger fraction of these genomes at P10. Furthermore, although overall K3L copy number increases from P5 to P10, this pattern is driven almost entirely by the amplification of homogeneous K3LWT alleles.

Across the same time points (P5-10), the K3LHis47Arg allele remains at a constant population-level frequency (original manuscript, Figure 2B, Figure 6—source data 1), but there is a shift in the proportion of multicopy genomes that contain K3LHis47Arg alleles. At P5, K3LHis47Arg is present almost exclusively in single-copy genomes, but by P10 the allele has spread to a larger proportion of multicopy genomes, which are mainly mixed rather than homogeneous (original manuscript, Figure 3). These results suggest that the K3LHis47Arg allele was present in a large proportion of single-copy genomes for many passages, but did not undergo significant gene amplification as predicted by the reviewer's model. Instead, we observed the rapid accumulation and homogenization of the K3LHis47Arg allele only from P10 to P20, accompanied by a large population of mixed arrays (original manuscript, Figure 3). This suggests that the variant is introduced into multicopy arrays through recombination, and only once these mixed genomes are more abundant did we observe an increase in homogeneous K3LHis47Arg multicopy genomes. Gene amplification of single-copy K3LHis47Arg genomes is of course possible, and likely occurs as well, but does not seem to be the main mechanism of array homogenization.

Coupled with our observations of consistent recombination breakpoint variant frequencies (original manuscript, Figure 6D and Table S1), these data favor a recombination-driven mechanism of adaptation.

Extended Methods:

Calculating sequence-dependent error rates in nanopore sequencing data.

We scanned the vaccinia reference genome for every instance of the “TATGC” 5-mer sequence; the thymine at the 3rd position of this sequence (at reference position 30,490) is the mutated base that encodes the K3LHis47Arg amino acid change. For each instance of this 5-mer in the reference genome (ignoring the 5-mer that spans reference positions 30,388-30,492), we calculated the proportions of As, Cs, and Gs aligned to the site of the middle nucleotide in the reference genome; observations of these non-reference bases likely represent sequencing errors. Additionally, we calculated the proportion of alignments in which there was a deletion at the middle nucleotide. We then generated kernel density plots representing the distributions of these error proportions across all instances of the 5-mer in the reference genome (n = 126, excluding any kmer matches in the first or last 10 kbp of the vaccinia genome, which comprise highly repetitive sequence). We repeated this analysis for all “TACGC” 5-mer sequences (n = 83, excluding sequence in the first or last 10 kbp of the genome); a cytosine to thymine change at the center nucleotide would encode a K3LHis47Arg to K3LWT change, and is therefore a proxy for sequencing errors that would lead us to incorrectly call a K3LHis47Arg copy as being K3LWT.

Simulating the effects of error rate on observations of mixed and homogeneous genomes.

We then simulated the effects of the empirically determined T>C and C>T error rates on our observed proportions of mixed and homogeneous genomes. In other words, we addressed the following question: if all of the vaccinia genomes in our passaged populations were, in fact, homogeneous for K3LWT or K3LHis47Arg alleles, would we expect to observe a large fraction of mixed genomes simply due to sequencing errors? To do this, we converted every vaccinia genome in our experimental data into a homogeneous array using the following heuristic: if the genome is already homogeneous, keep it as such; if the genome contains a majority of K3LWT or K3LHis47Arg alleles, convert it into a genome of identical copy number that is homogeneous for the majority allele; if the genome has an equal number of K3LWT and K3LHis47Arg copies, randomly convert the genome into a homogeneous K3LHis47Arg or homogeneous K3LWT genome of identical copy number. Then, for each K3L copy in each of these “converted” genomes, we randomly introduced sequencing errors (i.e., switched K3LHis47Arg alleles to K3LWT alleles, and vice versa) at a rate equivalent to the median C>T or T>C error rate for that sequencing chemistry.

If we perform this simulation using the median error rates for each flow cell chemistry, we observe far fewer mixed genomes than in our experimental data, suggesting that the large proportions of mixed genomes in Figure 3 (main manuscript) represent true heterogeneous K3L arrays (Author response image 3). To address the statistical significance of this observation, we performed two experiments.

First, we performed the simulation outlined above a total of 1000 times. For each simulation, we calculated the proportion of genomes with mixed alleles, and plotted the distributions of mixed genome proportions across all 1000 simulations (Author response image Figure 3C). For all sequencing chemistries, we find that our observed proportion of mixed genomes greatly exceeds (~5 times) the expected proportion of mixed genomes recovered by simulations (Author response image 3C).

Second, for all genomes of copy number two or greater, we calculated the proportion of genomes that were homogeneous for either K3LWT or K3LHis47Arg alleles, as well as the proportion of genomes that were “near-homogeneous” (containing all but one K3LWT or K3LHis47Arg allele). To test if we observe more of these “near-homogeneous” mixed arrays than expected given our C>T and T>C error rates, we calculated a binomial pvalue as follows. In the following examples, “near-homogeneous WT genomes” are those with all but one K3LWT copy, and “near-homogeneous H47R genomes” are those with all but one K3LHis47Arg copy.

WTNH = # of near-homogeneous WT genomes

WTH = # of homogeneous WT genomes

H47RNH = # of near-homogeneous H47R genomes

H47RH = # of homogeneous H47R genomes

ECT = median C>T error rate for the flowcell chemistry

ETC = median T>C error rate for the flowcell chemistry

P(NH WT) = binom(WTNH, WTNH + WTH, probability = ETC) = probability of observing WTNH given ETC

P(NH H47R) = binom(H47RNH, H47RNH + H47RH, probability = ECT) = probability of observing H47RNH given ECT

For all genomes with 2 or more copies of K3L, the p-values of each binomial test are presented in Response Table 2.

[Editors’ notes: the authors’ response after being formally invited to submit a revised submission follows.]

Summary:

Sasani and colleagues use long read nanopore sequencing to characterize pox virus populations from cell culture evolution experiments. The central claim of the paper is that pox virus achieves immune evasion initially by either amplification of the target gene or a point mutation. Subsequently, the point mutation spreads among amplified gene arrays via a gene-conversion like mechanism. This is an intriguing possibility and long read sequencing is well suited to characterize these complex evolutionary dynamics.

Essential revisions:

Sasani et al.et al., have already responded to our most serious concern, an alternative mechanism via amplification of mutant alleles, and their additional analyses provide convincing evidence that this simple mechanism can be ruled out. We would like these analyses to be included in the manuscript and have a number of additional essential points that need to be addressed:

In response to this previous concern, we performed an extensive analysis of error rates and related issues, which we sent as a response and supporting PDF before receiving full reviews. The pertinent points of this analysis are now incorporated as follows into the revised manuscript (note: all line numbers included below refer to the final revised version of the manuscript and may differ from those in the original document).

1) Additional analyses and explanations of ONT error rates. See Figure 2—figure supplement—figure supplement 2, Figure 3—figure supplement 2, Figure 3—figure supplement 3, Table 2 descriptions in the manuscript (subsection “The K3LHis47Arg variant rapidly homogenizes in multicopy vaccinia genomes”, subsection “Deep sequencing of viral genomes”), and an updated Materials and methods section.

2) More careful consideration of the combined data supporting a model resembling gene conversion rather than expansion of homogeneous K3LHis47Arg arrays in Figure 3—figure supplement 4, Figure 4B, descriptions in the manuscript (subsection “Copy number and allele frequency estimation” and subsection “Breakpoint characterization”), and a revised Discussion section.

1) Statistics on the arrangements of alleles in multi-copy arrays. How often are different patterns AAAAA, AABAA, ABBAB, etc. observed? Do identical alleles tend to be neighbors or are they randomly distributed? Are specific subpatterns over-represented across 3,4,5 copy arrays? These statistics should be used to explicitly discuss the plausibility of different scenarios and ideally quantitatively model the dynamics across time. Simply showing examples/pictures and stating that this result is "compatible" with your hypothesis is not enough. For follow-up analysis, it would be useful to provide alignments of 1, 2, 3, 4, 5, copy arrays such that readers don't have to go back to the FASTQ files (maybe after stripping non-reference insertions that are likely ONT errors).

We have added a supplement to Figure 3 (Figure 3—figure supplement 4) that shows a more comprehensive analysis of 3-copy K3L arrays containing every possible combination of K3LWT and K3LHis47Arg alleles in passages 10, 15, and 20 (subsection “Copy number and allele frequency estimation”). We chose to focus on combinations in 3-copy arrays for two main reasons: first, because 3-copy arrays make up a large fraction of multi-copy K3L arrays in these passages (and represent a useful cross-section of the diversity of multi-copy arrays); and second, because our sequencing datasets for passages 10 and 20 did not contain enough reads to create robust and meaningful plots of 4- and 5-copy array combinations. In addition, we performed additional deep sequencing to profile virus genomes at passage 15 (which contained the largest fraction of mixed arrays among any of our passaged datasets), generating an additional ~2,700 reads that fully aligned to the K3L duplicon. Using these additional data from P15, we generated plots of mixed array combinations for 4- and 5-copy arrays. Although the technical limitations of our sequencing approach prevent us from generating large numbers of reads containing 6+ K3L copies, we expect that the trends we observe in 3, 4, and 5-copy arrays hold for higher copy-number arrays as well, based on our current dataset for robustly scoring large arrays.

We also considered a number of possible statistical analyses to address whether particular allele combinations were over- or under-represented in mixed K3L arrays. Ultimately, though, we struggled to build a null expectation of K3LHis47Arg accumulation to which we could compare our experimental data. For example, to simulate the accumulation of K3LHis47Arg through de novomutation alone, we considered uniformly distributing K3LHis47Arg alleles throughout a population that initially contained only K3LWT alleles, and subsequently counting the numbers of arrays with each combination of K3LWT and K3LHis47Arg alleles. However, given the low point mutation rates of poxviruses, this seemed like a highly unrealistic model of K3LHis47Arg accumulation. Additionally, we considered a model in which we grouped all K3L arrays of a given copy number at P10, P15, and P20 by the number of crossovers needed to generate them (an “AAAA” or “BBBB” array would require zero crossovers, an “ABBB” or “BBBA” array would require one, etc.). Then, we calculated the average number of crossovers we observed among all K3L arrays in a passage, and estimated the number of arrays with 0, 1, … n crossovers we would expect based upon a Poisson distribution. Although this model is relatively simple, it suffers from the requirement that we build an expectation from our observed experimental data. This inherent circularity is troubling and therefore, we ultimately decided to present our analyses of K3L allele combinations in multicopy K3L arrays without explicit statistical treatment. We believe that the trends shown in Figure 3—figure supplement 4 clearly demonstrate that the allelic patterns observed in mixed arrays are distinct between passages, sum to a significant fraction of all arrays, and support our proposed mechanism of genetic homogenization.

Additionally, we have added a panel to Figure 4 in which we excluded all homogeneous arrays from the P15 dataset, and plotted the frequency of K3LHis47Arg in each copy of the remaining mixed K3L arrays. We show that K3LHis47Arg homogenizes within multicopy K3L arrays even when we exclusively examine mixed arrays, and demonstrate that K3LHis47Arg accumulation is independent of K3L array copy number. This new Figure 4B and modified text (subsection “Breakpoint characterization”) provides strong evidence of K3L homogenization through a recombination-based mechanism, rather than amplification of purely K3LWT or K3LHis47Arg arrays.

Finally, for ease of future analysis, we have uploaded aligned sequencing reads (in BAM format) for all analyzed passages to the SRA (accession SRP128569) and Zenodo (DOI 10.5281/zenodo.1319732). These alignment files contain only those reads that aligned to the K3L locus.

2) Recombination and MOI: Viral titer will increase dramatically during the 48h experiments and there is, therefore, no single MOI for any experiment. Furthermore, even experiments started at low MOI likely have high MOI towards the end. Subsection “Recombination and selection drive patterns of K3LHis47Arg homogenization” should explain these caveats and discuss carefully the degree to which inter-genomic recombination can be ruled out. Statistical signatures of WT/mutant patterns in the multicopy arrays might help to differentiate such patterns (cross-over recombination might result in mutant copies being preferentially at the 3' or 5' end of the array).

Qin and Evans, 2014 provide careful estimates of pox virus recombination rates. In particular, they find short recombination tract length (that look like gene conversion) and frequent recombination even though experiments start at MOI 0.02. This further calls into question your assumption that little inter-genomic recombination is happening.

If inter-genomic recombination is common, the observed patterns could be explained by non-allelic homologous recombination that results in frequent deletion, concatentation, and replacements of arrays or parts thereof.

We agree that multiple rounds of replication influence MOI and increase the chances of intergenomic recombination during the experiments. However, in specific consideration of results from Qin and Evans, (2014), our experimental approach differs in three notable ways. First, our viruses have reduced replication ability compared to WT viruses; therefore, the potential for increases in MOI over the course of 48 hours is diminished. Second, our lowest MOI is 20-fold lower than reported in previous work, which reduces the occurrence of co-infection and intergenomic recombination in our protocol. A quick estimation of the maximum titer following a 48-hour passage in our MOI 0.001 experiment predicts 2-3 infectious particles per cell, which further supports the rarity of co-infections in our setup by the time virus is harvested. Finally, the previous observations of short recombination tract length (Qin and Evans, 2014), which are also supported by Fathi et al.et al., 1991, are only seen in high MOI (10) single passage experiments.

Therefore, while intergenomic recombination seems to favor short conversion tracts, the more modest potential for co-infection in our MOI 0.001 experiment suggests that this is not the main mechanism by which the SNV accumulates. Given the similar proportions of mixed and homogeneous K3L arrays we observed across a 10,000-fold difference in MOI (Figure 5), the data support the idea that intergenomic recombination does not play a significant role in the homogenization of K3LHis47Arg. We have more clearly acknowledged these important caveats and explain these points more clearly in the revised manuscript (subsection “Generating simulated distributions of the K3LHis47Arg variant within multicopy arrays”).

3) ONT error rates. A detailed analysis of sequencing errors needs to be part of the manuscript. It would be useful to indicate the ONT error threshold in Figure 2B. How does one reconcile the fact that ONT finds 10% mutant alleles in P5 while these are not detected in Illumina reads?

As described in our previous response to the reviewers’ major concern noted above, we performed additional analyses to more clearly explain and calculate the error rates in our ONT data. This is described in subsection “The K3LHis47Arg variant rapidly homogenizes in multicopy vaccinia genomes”, subsection “Deep sequencing of viral genomes”, as well as in the Materials and methods section. To illustrate these error rates, we have added supplements for Figure 2 and Figure 3 (Figure 2—figure supplement 2, Figure 3—figure supplement 2, and Figure 3—figure supplement 3).

We have chosen not to annotate Figure 2B with ONT error thresholds, as the methods we used to generate allele frequency estimates differ from the methods used to calculate error rates. Specifically, our estimates of ONT error rate were calculated directly from read alignments, by counting every nonreference base aligned to the middle nucleotide of the 5-mer of interest. ONT allele frequencies, on the other hand, were estimated using the nanopolish software tool. Nanopolish examines the aligned sequences in a BAM file, as well as the raw current signal output by the MinION device, to increase the accuracy of SNV calling. Therefore, we do not feel that the two measurements are directly comparable. However, we have included our measurements of error rate for each sequencing chemistry and SNV in a supplement to Figure 2 (Figure 2—figure supplement 2), and in an updated version of Table 2, so that readers are made aware of the error rates in the manuscript. Also, in regard to Figure 2B, we did not sequence P5 using Illumina, as described in the updated figure legend. Thus, it is not the case that the mutant allele was not detected in Illumina reads at P5. We simply do not have Illumina data for that passage and apologize for the confusion. For clarity, we have updated text annotations in the figure (Figure 2B) and its legend from “n.d.” (not determined – but potentially interpreted as not detected) to “n.t.” (not tested).

4) An effort should be made to clarify the logic of the manuscript. The reviews below highlight problems and provide a number of concrete suggestions to improve the manuscript.

We agree and have reorganized the first three sections of the manuscript to improve the logical progression and grouping of ideas and experiments. First, we moved the description of the only other high-frequency SNV (E9LGlu495Gly) into the first section of the manuscript (subsection “ONT reads reveal precise K3L copy number and distributions of K3LHis47Arg in individual genomes”) to clarify our reasoning for focusing on genetic changes at the K3L locus (see also response to reviewer #3’s last comment). Second, we combined the second and third sections of the manuscript under a new heading (subsection “The K3LHis47Arg variant rapidly homogenizes in multicopy vaccinia genomes”) to better organize and emphasize the information obtained from our initial ONT sequencing experiments (see also response to reviewer #3’s first comment).

5) "Homology" is used incorrectly: homology is a binary quality. Two sequences either share a common ancestor (are homologous) or they don't. You use homology as a synonym for similarity, which it is not (see https://www.ncbi.nlm.nih.gov/books/NBK20255/).

We agree and have corrected the manuscript accordingly (Introduction). We maintain the use of homology/homologous in instances where referring to sequences that share a common ancestor between species, and also for sequences within a gene family that share a common ancestor within the genome.

6) Better graphs: bar graphs are a sub-optimal way to present data in many cases. In Figure 1A/D, for example, please show all data points and the median, there is no need for the bars. Figure 1—figure supplement 1 is even worse -- all the relevant differences happen within 10% of the figure. Figure 2C might be better as line graph with copy number as x-axis and one line per passage. We would like to see graphs and statistics that explicitly show the WT/mutant arrangements in multi-copy arrays with more information than the current Figure 3 and Figure 4. Overall, quantitative analysis and presentation of the data should be improved.

We have updated Figure 1A, Figure 1D, Figure 1—figure supplement 1, and Figure 1—figure supplement 3 to more clearly represent the data. These have been changed from bar graphs to dot plots showing each data point, median, and 95% CI. As stated above in response to the first point, we have included additional analyses of the variant patterns within multicopy arrays in Figure 3—figure supplement 4. We have also included an example of Figure 2C redrawn as a line plot below (Author response image 4); however, we feel that the current figure design, using stacked bars to represent the proportions of arrays with a given copy number, is more useful.

Finally, you need to be careful not to overstate your case: Evidence for gene conversion remains indirect (experiments with mixtures of viruses recoded at synonymous positions might provide more direct evidence), the importance of this process in the real-world pox-virus infections remains speculative, and other mechanisms to homogenize the array can't be ruled out.

We have updated the Discussion section to more clearly acknowledge other potential mechanisms and better summarize the evidence from our experiments. While we think tracking the noncoding variation in breakpoints near K3L (Figure 2A) offer some evidence, we agree that a synonymous site re-coding and mixing experiment would be more incisive and thank the reviewer for making the suggestion.

https://doi.org/10.7554/eLife.35453.045

Article and author information

Author details

  1. Thomas A Sasani

    Department of Human Genetics, University of Utah, Salt Lake, United States
    Contribution
    Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft
    Contributed equally with
    Kelsey R Cone
    Competing interests
    TAS received travel and accommodation expenses to speak at an Oxford Nanopore Technologies conference.
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-2317-1374
  2. Kelsey R Cone

    Department of Human Genetics, University of Utah, Salt Lake, United States
    Contribution
    Validation, Investigation, Visualization, Methodology, Writing—original draft
    Contributed equally with
    Thomas A Sasani
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4547-7174
  3. Aaron R Quinlan

    Department of Human Genetics, University of Utah, Salt Lake, United States
    Contribution
    Conceptualization, Supervision, Funding acquisition, Writing—review and editing
    Contributed equally with
    Nels C Elde
    For correspondence
    aaronquinlan@gmail.com
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-1756-0859
  4. Nels C Elde

    Department of Human Genetics, University of Utah, Salt Lake, United States
    Contribution
    Conceptualization, Supervision, Funding acquisition, Writing—review and editing
    Contributed equally with
    Aaron R Quinlan
    For correspondence
    nelde@genetics.utah.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0426-1377

Funding

National Institutes of Health (R01GM114514)

  • Nels C Elde

Burroughs Wellcome Fund (1015462)

  • Nels C Elde

University of Utah (Equipment Grant)

  • Aaron R Quinlan
  • Nels C Elde

H.A and Edna Benning Presidential Endowed Chair

  • Nels C Elde

National Institutes of Health (R01HG006693)

  • Aaron R Quinlan

National Institutes of Health (R01GM124355)

  • Aaron R Quinlan

National Institutes of Health (T32GM007464)

  • Thomas A Sasani

National Institutes of Health (T32AI055434)

  • Kelsey R Cone

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Ryan Layer for his assistance in early MinION sequencing experiments, and members of the Quinlan and Elde labs for helpful discussions during preparation of the manuscript.

Senior Editor

  1. Patricia J Wittkopp, University of Michigan, United States

Reviewing Editor

  1. Richard A Neher, University of Basel, Switzerland

Version history

  1. Received: January 27, 2018
  2. Accepted: August 12, 2018
  3. Version of Record published: August 29, 2018 (version 1)

Copyright

© 2018, Sasani et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 2,460
    Page views
  • 339
    Downloads
  • 17
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Thomas A Sasani
  2. Kelsey R Cone
  3. Aaron R Quinlan
  4. Nels C Elde
(2018)
Long read sequencing reveals poxvirus evolution through rapid homogenization of gene arrays
eLife 7:e35453.
https://doi.org/10.7554/eLife.35453

Further reading

    1. Evolutionary Biology
    Hironori Funabiki, Isabel E Wassing ... Thomas Carroll
    Research Article

    5-Methylcytosine (5mC) and DNA methyltransferases (DNMTs) are broadly conserved in eukaryotes but are also frequently lost during evolution. The mammalian SNF2 family ATPase HELLS and its plant ortholog DDM1 are critical for maintaining 5mC. Mutations in HELLS, its activator CDCA7, and the de novo DNA methyltransferase DNMT3B, cause immunodeficiency-centromeric instability-facial anomalies (ICF) syndrome, a genetic disorder associated with the loss of DNA methylation. We here examine the coevolution of CDCA7, HELLS and DNMTs. While DNMT3, the maintenance DNA methyltransferase DNMT1, HELLS, and CDCA7 are all highly conserved in vertebrates and green plants, they are frequently co-lost in other evolutionary clades. The presence-absence patterns of these genes are not random; almost all CDCA7 harboring eukaryote species also have HELLS and DNMT1 (or another maintenance methyltransferase, DNMT5). Coevolution of presence-absence patterns (CoPAP) analysis in Ecdysozoa further indicates coevolutionary linkages among CDCA7, HELLS, DNMT1 and its activator UHRF1. We hypothesize that CDCA7 becomes dispensable in species that lost HELLS or DNA methylation, and/or the loss of CDCA7 triggers the replacement of DNA methylation by other chromatin regulation mechanisms. Our study suggests that a unique specialized role of CDCA7 in HELLS-dependent DNA methylation maintenance is broadly inherited from the last eukaryotic common ancestor.

    1. Developmental Biology
    2. Evolutionary Biology
    Nico Posnien, Vera S Hunnekuhl, Gregor Bucher
    Review Article

    Gene expression has been employed for homologizing body regions across bilateria. The molecular comparison of vertebrate and fly brains has led to a number of disputed homology hypotheses. Data from the fly Drosophila melanogaster have recently been complemented by extensive data from the red flour beetle Tribolium castaneum with its more insect-typical development. In this review, we revisit the molecular mapping of the neuroectoderm of insects and vertebrates to reconsider homology hypotheses. We claim that the protocerebrum is non-segmental and homologous to the vertebrate fore- and midbrain. The boundary between antennal and ocular regions correspond to the vertebrate mid-hindbrain boundary while the deutocerebrum represents the anterior-most ganglion with serial homology to the trunk. The insect head placode is shares common embryonic origin with the vertebrate adenohypophyseal placode. Intriguingly, vertebrate eyes develop from a different region compared to the insect compound eyes calling organ homology into question. Finally, we suggest a molecular re-definition of the classic concepts of archi- and prosocerebrum.