1. Genetics and Genomics
  2. Microbiology and Infectious Disease
Download icon

Tiled-ClickSeq for targeted sequencing of complete coronavirus genomes with simultaneous capture of RNA recombination and minority variants

  1. Elizabeth Jaworski
  2. Rose M Langsjoen
  3. Brooke Mitchell
  4. Barbara Judy
  5. Patrick Newman
  6. Jessica A Plante
  7. Kenneth S Plante
  8. Aaron L Miller
  9. Yiyang Zhou
  10. Daniele Swetnam
  11. Stephanea Sotcheff
  12. Victoria Morris
  13. Nehad Saada
  14. Rafael RG Machado
  15. Allan McConnell
  16. Steven G Widen
  17. Jill Thompson
  18. Jianli Dong
  19. Ping Ren
  20. Rick B Pyles
  21. Thomas G Ksiazek
  22. Vineet D Menachery
  23. Scott C Weaver
  24. Andrew L Routh  Is a corresponding author
  1. Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, United States
  2. ClickSeq Technologies LLC, United States
  3. World Reference Center for Emerging Viruses and Arboviruses, University of Texas Medical Branch, United States
  4. Department of Microbiology and Immunology, The University of Texas Medical Branch, United States
  5. Department of Pediatrics, University of Texas Medical Branch, United States
  6. Department of Pathology, University of Texas Medical Branch, United States
  7. Institute for Human Infections and Immunity, University of Texas Medical Branch, United States
  8. Next-Generation Sequencing Core, The University of Texas Medical Branch, United States
  9. Sealy Centre for Structural Biology and Molecular Biophysics, University of Texas Medical Branch, United States
Tools and Resources
  • Cited 0
  • Views 295
  • Annotations
Cite this article as: eLife 2021;10:e68479 doi: 10.7554/eLife.68479

Abstract

High-throughput genomics of SARS-CoV-2 is essential to characterize virus evolution and to identify adaptations that affect pathogenicity or transmission. While single-nucleotide variations (SNVs) are commonly considered as driving virus adaption, RNA recombination events that delete or insert nucleic acid sequences are also critical. Whole genome targeting sequencing of SARS-CoV-2 is typically achieved using pairs of primers to generate cDNA amplicons suitable for next-generation sequencing (NGS). However, paired-primer approaches impose constraints on where primers can be designed, how many amplicons are synthesized and requires multiple PCR reactions with non-overlapping primer pools. This imparts sensitivity to underlying SNVs and fails to resolve RNA recombination junctions that are not flanked by primer pairs. To address these limitations, we have designed an approach called ‘Tiled-ClickSeq’, which uses hundreds of tiled-primers spaced evenly along the virus genome in a single reverse-transcription reaction. The other end of the cDNA amplicon is generated by azido-nucleotides that stochastically terminate cDNA synthesis, removing the need for a paired-primer. A sequencing adaptor containing a Unique Molecular Identifier (UMI) is appended to the cDNA fragment using click-chemistry and a PCR reaction generates a final NGS library. Tiled-ClickSeq provides complete genome coverage, including the 5’UTR, at high depth and specificity to the virus on both Illumina and Nanopore NGS platforms. Here, we analyze multiple SARS-CoV-2 isolates and clinical samples to simultaneously characterize minority variants, sub-genomic mRNAs (sgmRNAs), structural variants (SVs) and D-RNAs. Tiled-ClickSeq therefore provides a convenient and robust platform for SARS-CoV-2 genomics that captures the full range of RNA species in a single, simple assay.

Introduction

Virus genomics and next-generation sequencing (NGS) are essential components of viral outbreak responses (Grubaugh et al., 2019b). Reconstruction of consensus genetic sequences is essential to identify adaptations correlated with changes in pathogenicity or transmission (Gussow et al., 2020). In addition to single nucleotide variations, studies of SARS-CoV-2 have identified numerous genomic structural variants (SVs) (Yi, 2020) that arise due to non-homologous RNA recombination. SVs typically comprise small insertions/deletions that nonetheless allow the variant genome to independently replicate and transmit. Numerous SVs have been described for CoVs including deletions of the accessory open-reading frames (aORFs) (Yc et al., 2020; Muth et al., 2018) and changes in spike protein observed in the B.1.1.7 (Alpha) and other variants of concern (Kemp et al., 2020). Adaptation of SARS-CoV-2 also occurs during passaging in cell-culture, such as small deletions that arise near the furin cleavage site of spike protein during amplification on Vero cells (Ogando et al., 2020). These deletions can alter the fitness and virulence of SARS-CoV-2 isolates and thus must be genetically characterized at the within-culture population level prior to passaged stock use in subsequent studies.

Similar to SVs, non-homologous RNA recombination also gives rise to defective-RNAs (D-RNAs), also known as defective viral denomes (DVGs). D-RNAs have been observed in multiple studies of coronaviruses (CoVs), including mouse hepatitis virus (MHV) (Makino et al., 1984; Makino et al., 1985; Makino et al., 1988a; Makino et al., 1988b), bovine CoV (Chang et al., 1994), avian infectious bronchitis virus (IBV) (Penzes et al., 1995), human CoV 299E (Viehweger et al., 2019; Banerjee et al., 2001; Joo et al., 1996; Kim et al., 1993). We recently demonstrated that SARS-CoV-2 is >10-fold more recombinogenic in cell culture than other CoVs such as MERS (Gribble et al., 2021) and generates abundant D-RNAs containing RNA recombination junctions that most commonly flank U-rich RNA sequences. D-RNAs may change the fitness, disease outcomes, and vaccine effectiveness for SARS-CoV-2 similar to other respiratory pathogens such as influenza and RSV (Vignuzzi and Lopez, 2019). Together, these findings highlight the need to identify these RNA species and their impact on SARS-CoV-2 infection and pathogenesis.

Whole genome sequencing can be achieved through a range of approaches including non-targeted (random) NGS of virus isolates amplified in cell culture or directly from patient samples. However, when input material is limited, low viral genome copy numbers necessitate a template-targeted approach followed by molecular amplification by PCR or iso-thermal amplification to generate sufficient nucleic acid for sequencing. Generally, these require knowledge of the virus genome and the design of pairs of primers that anneal to the target genome. Perhaps the most popular method for SARS-CoV-2 sequencing is the ‘ARTIC’ approach (Tyson et al., 2020), which can reliably identify SNVs and minority variants present in as little as 3% of genomes (Grubaugh et al., 2019a). However, the requirement for pairs of primers constrains where amplicons can be designed and imparts sensitivity to single nucleotide variants (SNVs). Multiple PCR reactions containing different pools of paired-primers must also be performed in order to obtain cDNA amplicons of the correct size and to prevent the interaction or mis-priming of PCR primers. Importantly, pairs of primers that do not flank RNA recombination junctions will be unable to detect unexpected or unpredicted RNA recombinant species. Finally, paired-primer approaches also necessitate the re-design and validation of alternative sets of primer-pairs for each specific NGS platform used (e.g. Illumina amplicons are 200–500 nts, Nanopore amplicons are ~2000–5000 nts).

To address these limitations and optimize the ability of NGS to quantify all types of viral genetic variants, we have combined ‘ClickSeq’ with tiled-amplicon approaches. ClickSeq (Routh et al., 2015b; Jaworski and Routh, 2018) is a click-chemistry-based platform for NGS that prevents artifactual sequence chimeras in the output data (Gorzer et al., 2010). Using ClickSeq, the 3’end of an amplified cDNA segment is generated by the stochastic incorporation of terminating 3’ azido-nucleotides (AzNTPs) during reverse transcription. A downstream adaptor is ‘click-ligated’ onto the cDNA using copper-catalyzed azide-alkyne cycloaddition (CuAAC). Therefore, ‘Tiled-ClickSeq’ only requires one template-specific primer per cDNA amplicon. To achieve whole genome sequencing of a virus isolate or sample, multiple tiled primers are designed evenly along the virus genome. Only one pool of RT-primers is required, even when > 300 template-specific primers and their corresponding cDNA amplicons are generated in the same reaction. This simplifies the assay design, and importantly removes constraints imposed in paired-primer strategies (Itokawa et al., 2020). Furthermore, the same primer set can be used for both Illumina and Nanopore platforms even when requiring different cDNA amplicon sizes. The library construction allows for additional quality control features including the use of unique molecular identifiers (UMIs) in the ‘click-adaptor’ as well as the ability to identify each RT-primer that gives rise to specific cDNA amplicon when using paired-read NGS.

Here, we utilize the Tiled-ClickSeq method to analyze multiple isolates of SARS-CoV-2 both from cell-culture and clinical specimens used in routine diagnostics for COVID19 and demonstrate that ‘Tiled-ClickSeq’ accurately reconstructs full-length viral genomes. The method also captures recombinant RNA species including sgmRNAs, SVs, and D-RNAs. Overall, Tiled-ClickSeq therefore provides a convenient and robust platform for full genetic characterization of viral isolates.

Results

Overview of sequencing strategy

Most tiled approaches for complete viral genomes sequencing from viral isolates require the design of pairs of primers that generate pre-defined overlapping amplicons in multiple pools (Figure 1A and B). However, this can prevent the detection of recombinant viral genomic materials such as sub-genomic mRNAs (sgmRNAs) or Defective-RNAs (D-RNAs). To overcome these challenges, we designed a template directed tiled-primer approach to reverse transcribe segments of the SARS-CoV-2 genome based upon the ‘ClickSeq’ method for NGS library synthesis (Routh et al., 2015b). Instead of random-hexamer or oligo-dT primers as used in ClickSeq and Poly(A)-ClickSeq, respectively (Routh et al., 2017), we use multiple ‘tiled’ RT-primers designed at regular intervals along the viral genome (Figure 1C). In ‘Tiled-ClickSeq’, pooled primers initiate a reverse transcription in a reaction that has been supplemented with 3’-azido-nucleotides (AzNTPs). This yields stochastically terminated 3’-azido-cDNA fragments, which can be click-ligated onto a hexynyl-functionalized Illumina i5 sequencing adaptor (Figure 1D). After click-ligation, the single-stranded triazole-linked cDNA is PCR-amplified using indexing p7 adaptors to fill in the ends of the NGS library, yielding the final library schema shown in Figure 1E. We designed the click-adaptor with an additional 12 random nucleotides at its 5’ end. As each adaptor can only be ligated once onto each unique cDNA molecule, this provides a unique molecular identifier (UMI) (Jabara et al., 2011). Due to the stochastic termination of cDNA synthesis in the RT step, a random distribution of cDNA fragments is generated from each primer, giving rise to the hypothetical read coverage depicted in Figure 1F. The lengths of these fragments, and thus the obtained read coverage can be optimized to ensure overlapping read data from each amplicon by adjusting the ratio of AzNTPs to dNTPs in the RT reaction (Routh et al., 2015b). With this approach, we found that we could robustly make NGS libraries from as little as 8 ng of total cellular RNA with only 18 PCR cycles (Figure 1G). Final libraries are excised from agarose gels (300-600nt cDNA size), pooled, and are compatible with Illumina sequencing platforms. A computational pipeline was compiled into a batch script (Source data 2) depicted by the flow-chart in Figure 1H.

Schematic of Tiled-ClickSeq and computational pipeline: (A) Schematic of SARS-CoV-2 genome with two examples of sub-genomic mRNAs.

(B) Paired-primer approaches typically generate short amplicons flanked by upstream and downstream primers that are PCR amplified in non-overlapping pools. (C) Tiled-ClickSeq uses a single pool of primers at the reverse-transcription step with the upstream site generated by stochastic termination by azido-nucleotides. (D) 3’-Azido-blocked single-stranded cDNA fragments are ‘click-ligated’ using copper-catalyzed azide alkyne cycloaddition (CuAAC) to hexynyl functionalized Illumina i5 sequencing adaptors. Triazole-linked ssDNA is PCR amplified to generate a final cDNA library. (E) The structure of the final cDNA is illustrated indicating the presence of the i5 and i7 adaptors, the 12 N unique molecular identifier (UMI), the expected location of the triazole linkage, and the origins of the cDNA in the reads including the tiled primer-derived DNA, which is captured using paired-end sequencing. (F) The hypothetical read coverage over a viral genome is indicated in red, yielding overlapping ‘saw-tooth’ patterns of sequencing coverage. Longer fragment lengths with more extensive overlapping can be obtained using decreased AzNTP:dNTP ratios. (G) Final cDNA libraries are analyzed and size-selected by gel electrophoresis (2 % agarose gel). Duplicates of libraries synthesized from 8, 80, and 800 ng of input SARS-CoV-2 RNA input are shown. (H) Flowchart of the data processing and bioinformatic pipeline. Input data is in Blue, output data are in Green, scripts/processes are Purple.

Validation with WA-1 strain

To test this approach, we obtained 200 ng RNA from an SARS-CoV-2 isolate deposited at the World Reference Center for Emerging Viruses and Arboviruses (WRCEVA) at UTMB (Harcourt et al., 2020b) and performed Tiled-ClickSeq using CoV2 primer pool v1 and a 1:35 AzNTP:dNTP mix. NGS libraries were sequenced on an Illumina MiSeq (2 × 150 reads). Reads were quality processed using fastp (Chen et al., 2018) and mapped to the virus genome using bowtie2 (Langmead and Salzberg, 2012). A ‘saw-tooth’ pattern of read coverage over the genome was generated (Figure 2A, orange plot) with ‘teeth’ appearing as expected upstream of each tiled primer. Peaks of coverage for each ‘tooth’ ranged from ~13,000 x to ~100 x. Overall, we obtained genome coverage >25 X from nucleotide 3–29823 (50nts from the 3’ end of the genome). This depth is sufficient to reconstruct a consensus genome sequence that was found to be identical to that already deposited (MT020881) for this isolate (Harcourt et al., 2020a).

Figure 2 with 2 supplements see all
Read coverage over the SARS-CoV-2 genome using Tiled-ClickSeq.

(A) Read coverage obtained from Tiled-ClickSeq over the whole viral genome is depicted when sequencing using an Illumina MiSeq (orange) or on an Oxford Nanopore Technologies MinION device (blue). A ‘saw-tooth’ pattern of coverage is observed with ‘teeth’ upstream of tiled-primers, indicated at the bottom of the plot by short black lines. (B) Zoomed in read coverage of nts 1–2400 of the SARS-CoV-2 genome with coverage of Illumina MiSeq reads from five individual primers coloured to illustrate coverage from downstream amplicons overlapping the primer-binding sites of upstream tiled-primers (Blue: Read coverage from primer 1; Orange: coverage from primer 2; Green: coverage from primer 3; Red: coverage from primer 4; Purple: coverage from primer 5).

When using paired-end sequencing, the ‘forward’/’R1’ read is derived from the click-adaptor and contains the UMI. The ‘reverse’/’R2’ read is derived directly from the tiled primer (see schematic in Figure 1E). We wrote a custom python3 script to split all the forward ‘R1’ reads into multiple individual FASTQ files based upon which primer generated each fragment. The mapping coverage obtained from five individual tiled-primers is shown in Figure 2B. The coverage for each primer (denoted by individual colours in Figure 2B) spans approximately 500–600 nts and extends 5’-wards from the tiled RT-primer. Read coverage from each primer overlaps the read coverage of the upstream primer. This allows for continuous gap-free read coverage over the viral genome which, importantly, allows a downstream cDNA amplicon to provide sequence information over and beyond an upstream primer. Additionally, we can determine the frequency with which each primer either successfully maps to the viral genome, mis-primes from the host RNA, or gives rise to adaptor-dimers or other sequencing artifacts. This information can be used to identify primers that yield poor viral priming efficiency and therefore a more specific primer can be designed and substituted as needed.

For nanopore sequencing, we also synthesized Tiled-ClickSeq libraries but using a 1:100 AzNTP:dNTP ratio to generate cDNA amplicons of increased lengths. We retained cDNA fragments > 600 nts, yielding a few nanograms of dsDNA. This library, though containing the Illumina adaptors, can nonetheless be used as input in the default Oxford Nanopore Technologies (ONT) Ligation-Sequencing protocol (LSK-109) that appends ONT adaptors directly onto the ends of A-tailed dsDNA fragments. We sequenced this library using an ONT MinION device and obtained 279,192 reads greater than 1kbp in length. These were mapped to the WA-1 viral genome using minimap2 yielding continuous genome coverage (Figure 2A, blue). A similar profile of read coverage to the Illumina data was observed, with peaks of coverage upstream of tiled-primer sites. The deeper dips in coverage were avoided however, due to the longer reads lengths that give greater overlap between cDNA amplicons.

Genome reconstruction of 12 isolates: ClickSeq, Tiled-ClickSeq, and Nanopore-Tiled-ClickSeq

To validate the suitability of Tiled-ClickSeq for whole virus genome reconstruction, we obtained RNA extracted from 12 outgrowth samples of SARS-CoV-2 deposited at WRCEVA from nasopharyngeal swabs collected between March and April 2020. We synthesized 12 Tiled-ClickSeq libraries and 12 random-primed ClickSeq libraries in parallel. These were submitted for sequencing on a NextSeq (2 × 150) yielding ~2–5 M reads per sample (Table 1). Random-primed ClickSeq data were quality-filtered and adaptor trimmed using fastp (Chen et al., 2018Chen et al., 2018) retaining only the forward R1 reads. Tiled-ClickSeq read data were processed and mapped following the scheme in Figure 1H.

Table 1
Read counts and mapping rates for random-primed versus Tiled-ClickSeq approaches.
SampleCTClickSeq readsVirus mapped% Viral ReadsTiled v1 readsVirus mapped% Viral Reads
WRCEVA_0050112.94,665,869116,0362.5%2,359,7952,204,75093.4%
WRCEVA_0050212.94,989,513118,2602.4%1,962,5811,820,92592.8%
WRCEVA_0050512.73,894,32571,8091.8%2,779,6722,482,85489.3%
WRCEVA_0050612.54,979,989108,5322.2%2,395,7502,148,25689.7%
WRCEVA_0050712.95,659,073161,0592.8%2,056,6701,867,01290.8%
WRCEVA_0050816.83,987,00991,4522.3%1,787,4181,433,00580.2%
WRCEVA_0050917.14,057,92857,4241.4%2,202,6611,856,63384.3%
WRCEVA_0051016.25,328,82965,2811.2%2,040,3321,601,54478.5%
WRCEVA_0051316.04,391,17569,1691.6%1,641,2131,455,99188.7%
WRCEVA_0051412.94,340,08484,2111.9%2,089,2411,902,74891.1%
WRCEVA_0051515.75,416853102,1791.9%2,205,1661,915,12986.8%
WRCEVA_0051617.44,290,92961,0171.4%1,988,9391,715,44886.2%

In the Tiled-ClickSeq data, after UMI deduplication, each isolate had an average coverage between 4500 and 7500 reads and a coverage of 25 reads in greater than 99.5% (29753/29903 nts) of the SARS-CoV-2 genome. Read coverage was also obtained covering the 5’UTR of each strain ( > 25 reads for all isolates from nucleotide three onwards (Figure 3A and B)). When using paired-primer approaches, the 5’UTR is ordinarily obscured by the 5’-most primer used in each pool (nts 30–54 for the ARTIC primer set depicted in Figure 3A). As the 5’ end is resolved here due to stochastic incorporation of a single AzNTP in a template-specific manner, the entirety of the viral genome can be resolved. We reconstructed reference genomes from mapped reads using pilon (Walker et al., 2014Walker et al., 2014) requiring 25 x coverage for variant calling. In all cases, the reconstructed reference genomes were identical with or without controlling for PCR duplicates using the UMIs. We found 5–12 SNVs per viral genome (Source data 3), including the prevalent D614G (A23403G) spike adaptation, which enhances SARS-CoV-2 transmission (Plante et al., 2020), in 11 out of the 12 isolates (Figure 3C).

Genome Reconstruction of 12 SARS-CoV-2 isolates deposited at the World Reference Center for Emerging Viruses and Arboviruses (WRCEVA).

(A) Read coverage is depicted over the 5’ UTR of the SARS-CoV-2 genome for each isolate revealing capture of this region. The 5’-most primer from the ARTICv3 protocol at nts-30–54 is illustrated. (B) Snapshot of read data from Tiled-ClickSeq is depicted using the Tablet Sequencing Viewer from WRCEVA_000508 over the same region of the 5’UTR as A. (C) The most common single-nucleotide variants (SNVs) found in complete genome reconstructions from all 12 isolates are illustrated and colour-coded to depict the underlying viral protein. (D) Phylogenetic tree of 12 WRCEVA isolates with their corresponding clade indicated.

Genome reconstruction was similarly performed using the random-primed ClickSeq data reads. Identical genomes to the Tiled data were obtained for 11 out of 12 isolates, with only one SNV difference in one sample (WRCEVA _000510: T168C). In this case, the read coverage was too low in the random-primed data for pilon to report an SNV. Nevertheless, visual inspection of the mapped data revealed that all nucleotides at this locus were indeed C’s, as reported for the Tiled-ClickSeq data. Phylogenetic tree reconstruction using NextStrain (Hadfield et al., 2018) placed 10 of the isolates in the A2a clade (Figure 3D) and two isolates (WRCEVA_00508, WRCEVA_00513) were Clade B/B1.

We also retained cDNA fragments > 600 bps from the Tiled-ClickSeq libraries and sequenced these using an ONT MinION device. We used the ONT native barcoding kit to multiplex the 12 samples and the Ligation-Sequencing protocol (LSK-109) to generate final libraries. Reads were mapped with minimap2 (Li, 2016) yielding at least 100 x coverage over >99.6% of the genome for each isolate (Figure 2—figure supplement 1). Again, reference genomes were reconstructed from the mapped data using pilon (Source data 3). With the exception of WRCEVA_000514 which contained a single additional SNV (C14220T), the reference genomes reconstructed from the nanopore data were identical to those generated from the Tiled-ClickSeq Illumina data. These data illustrate that Tiled-ClickSeq performs as well as random-primed methods either on Illumina or Nanopore platforms for whole genome reconstruction.

To further validate our approaches, we used the well described ARTIC v3 protocol for amplicon sequencing of whole SARS-CoV-2 viral genomes (Tyson et al., 2020) using the same input RNA as above for Tiled-ClickSeq NGS library synthesis. In every case, the reported SNVs were identical between the ARTIC data and the Tiled-ClickSeq data. Read coverage over the viral genomes is illustrated in Figure 2—figure supplement 2.

Minority variants

Our initial primer design (v1) (Figure 4A, blue plots) successfully yielded coverage suitable for complete genome reconstruction. However, some regions still received low coverage with fewer than a 100 deduplicated reads, preventing identification of minority variants in these regions. Therefore, we redesigned our primer scheme by adding an additional 326 primers (v2) previously reported (Guo et al., 2020) for tiled coronavirus sequencing (Source data 1) to make a pool comprising a total of 396 unique primers (v3). We re-sequenced the 12 WRCEVA isolates analyzed as described above plus an additional four that subsequently became available. An example of mapping coverage for isolate WCREVA_000508 is illustrated in Figure 4A, where the coverage over the viral genome is more even with less extreme ranges of read depth.

Additional tiled-primers improves read coverage and allows identification of minority variants.

(A) Read coverage obtained from Tiled-ClickSeq over the whole viral genome is depicted using an Illumina MiSeq when using the original primers as in Figure 2 (v1 - blue) or with an additional 326 tiled-primers (v3 - pink). Tiled-primers are indicated at the bottom of the plot by short blue (v1) or pink (v3) lines. (B) The rates of mismatching nucleotides found in mapped NGS reads is depicted across the SARS-CoV-2 genome for isolate WRECVA_000508 prior to trimming the tiled primers from forward/‘R1’ reads and without PCR deduplication. (C) The rates of mismatching is also depicted after data quality processing to remove PCR duplicates and primer-derived nucleotides in the reads, revealing three minority variants in this sample with frequencies > 2%.

Figure 4—source data 1

The frequency of all mapped nucleotides at each genome coordinate for each WRCEVA isolate is provided.

The reference genome, nucleotide coordinate and expected reference Nucleotide is provided. Total read coverage and the numbers of each non-reference nucleotide are also shown. Finally, the mismatch/error rate at each site is provided which reveals minority variants in each isolate.

https://cdn.elifesciences.org/articles/68479/elife-68479-fig4-data1-v1.xlsx

Using the R2 read, we can determine which primer gives rise to each R1 read and trim primer-derived nucleotides from the R1 read. This is an important quality control as it prevents the assignment (or failure thereof) of SNVs and/or the mapping of recombination events due to primer mis-priming. If reads are mapped without trimming away the primer-derived nucleotides found in the R1 read (as depicted in Figure 4B), we see numerous high frequency (2–50%) minority variants. The majority of these apparent minority variants overlap primer-target sites and are likely artefactual. Furthermore, the same high-frequency events are often seen across multiple independent samples. To control for this, we mapped reads after trimming away primer-derived nucleotides from the R1 reads as per our pipeline described above (schematic in Figure 1H). Finally, to control for PCR duplication events, we make use of the UMIs embedded in the click-adaptor. The final de-duplicated mapped, primer-trimmed reads (Figure 4C) provide a robust readout of minority variants in these isolates (Table 2 and Figure 4—source data 1). Across 10 WRCEVA isolates we found only 26 minority variants present at >2% all of which were unique within this dataset. Six isolates reported no minority variants at all.

Table 2
Minority variants and rates ( > 2%) found across 16 WRCEVA isolates.
SampleNtNucReadDepthAUGCVariantRateLocationResult
WRCEVA_00050112,049C2,116095120204.5%ORF1abN3928K
WRCEVA_00050210,207C2,240011802,1225.3%--
WRCEVA_00050216,050U3,85303,322053113.8%--
WRCEVA_00050217,489A4,5974,433162113.6%ORF1abE5742V
WRCEVA_00050221,526A8,7496,50802,240125.6%ORF1abI7088V
WRCEVA_00050314,220C1,638146301,17428.3%--
WRCEVA_0005041,556A2,8282,4990328111.6%ORF1abI431V
WRCEVA_00050427,925C2,857013402,7234.7%ORF8T11I
WRCEVA_00050719,515A2,3932,29519704.1%--
WRCEVA_0005089,756G1,3762801,34802.1%ORF1abR3164H
WRCEVA_00050826,056G2092086200604.1%ORF3aD222Y
WRCEVA_00050827,556G20661280193806.2%ORF7aA55T
WRCEVA_00050911,956C1962019901,76310.1%--
WRCEVA_00050917,245C4,062247003,59011.6%ORF1abR5661C
WRCEVA_00050918,005U5,40814,94945808.5%ORF1abL5915R
WRCEVA_00050925,569U3,44843,32611353.5%--
WRCEVA_00050927,919U83908090303.6%ORF8I9T
WRCEVA_00050928,767C20110109019025.4%NT165I
WRCEVA_0005113,003U2,880792,7871132.7%ORF1abV913E
WRCEVA_00051110,738U4,58004,44001403.1%--
WRCEVA_00051125,892U1330130032.3%ORF3aI167T
WRCEVA_00051128,001G1,4141291,38402.1%--
WRCEVA_00051327,046C5,539013805,4012.5%MT175M
WRCEVA_00051411,603A5,4055,075033006.1%ORF1abM3780V
WRCEVA_00051426,526G52502050503.8%MA2S

RNA recombination: sgmRNAs, structural variants, and defective RNAs

To characterize RNA recombination, we used our bespoke ViReMa pipeline (Routh and Johnson, 2014) to map RNA recombination events in NGS reads that correspond to either sgmRNAs, SVs, or D-RNAs. ViReMa can detect agnostically a range of expected and unusual RNA recombination events including deletions, insertions, duplications, inversions as well as virus-to-host chimeric events and provides BED files containing the junction sites and frequencies of RNA recombination events. We mapped the Tiled-ClickSeq data to the corrected reference genome for each WRCEVA isolate using ViReMa. We also took total cellular RNA and RNA extracted from the supernatants of Vero cells transfected with RNA derived from an in vitro infectious clone of SARS-CoV-2 (icSARS-CoV-2) (Xie et al., 2020). These clone-derived RNAs contained either the WT SARS-CoV-2, or were engineered with a deletion near the furin cleavage site of the spike protein, which we recently demonstrated is a common adaption to Vero cells and which alters SARS-CoV-2 pathogenesis in mammalian models of infection (Johnson et al., 2021).

The identities and frequencies of the 13 most abundant RNA recombination events are illustrated in Figure 5A. We found all the expected sgmRNAs previously annotated for SARS-CoV-2 (Kim et al., 2020) as well as non-canonical sgmRNAs. An overview of mapped data over the SARS-CoV-2 illustrating large recombination events (depicted by the blue horizontal lines) is provided in Figure 5—figure supplement 1. We found that sgmRNAs were highly enriched in the cellular fractions from expressed icSARS-CoV-2 isolates (comprising >95% of the total viral genetic materials) but were relatively depleted in the supernatant fraction. This reflects a strong restriction of the packaging of these RNA species into virions. In the icSARS-CoV-2 samples, Tiled-ClickSeq and ViReMa accurately reported the expected deletion (Δ23603^23616). Interestingly, we also identified small structural variants (Δ23583^23599) in seven of the WRCEVA isolates with a frequency of 2–50%, similar to reports of the selection of variants containing deletions at this site after in vitro passaging on Vero cells (Klimstra et al., 2020). We also found a novel SV in one isolate (WRCEVA_000504: Δ27619^27642) present in 3.5% of the reads resulting in an eight amino acid deletion in ORF7a. We additionally identified a small number of micro-indels (Table 3) in some isolates.

Table 3
Micro-indels and rates ( > 2%) found across 16 WRCEVA isolates.
SampleMicroInDelNucsVariantRateLocationResult
WRCEVA_000502Δ519^523UGGUU2.2%ORF1ABFrameshift
WRCEVA_000504Δ29686^29,693CAGUGUGU3.5%3’UTR-
WRCEVA_000505Δ519^523UGGUU2.9%ORF1ABFrameshift
WRCEVA_000506Δ519^523UGGUU3.8%ORF1ABFrameshift
WRCEVA_000509Δ1237^1,239UCA2.9%ORF1ABΔH325
WRCEVA_000510Δ686^694AAGUCAUUU5.1%ORF1abΔLSF141-143
WRCEVA_000511Δ519^523UGGUU3.7%ORF1ABFrameshift
WRCEVA_000511Δ10811^10,813CUU3.1%ORF1ABΔL3516
WRCEVA_000512Δ29750^29,759GAUCGAGUG10.0%3’UTR-
Figure 5 with 1 supplement see all
Tiled-ClickSeq identifies sub-genomic mRNAs, structural variants, and defective-RNAs.

(A) A table of the most common RNA recombination events found using Tiled-ClickSeq to study ‘World Reference Center for Emerging Viruses and Arboviruses’ (WRCEVA) isolates. The recombination junctions are indicated on the left of the table, with their relative frequencies indicated in the table and colour-matched for each sample analyzed. All canonical sgmRNAs are found with their open-reading frame (ORF) indicated, in addition to one non-canonical sgmRNA (*). Three common structural variants including two deletions in spike protein and a deletion in ORF7a were also detected. (B) Unique RNA recombination events are plotted for 16 WRCEVA isolates as a scatter plots whereby the upstream ‘donor’ site is plotted on the y-axis and a downstream ‘acceptor’ site is plotted on x-axis. The read count for each unique RNA recombination event is indicated by the size of the point, while the number of samples in which this each RNA recombination event is found is indicated by the colour-bar. Insertions/duplication/back-splicing events are found above the x = y axis, while deletions and RNA recombination events yielding sgmRNAs are found below.

Figure 5—source data 1

Snapshot of Tiled-ClickSeq reads from icSARS-CoV-2 delta-PRRA.

https://cdn.elifesciences.org/articles/68479/elife-68479-fig5-data1-v1.zip

Finally, we observed thousands of RNA recombination events corresponding to D-RNAs (BED files for each sample are provided in Figure 5—source data 1). Despite their individual low frequencies, these events (displayed as a Recombination Heatmap in Figure 5B) reveals interesting features of D-RNAs of SARS-CoV-2. Apparent duplication events or insertions were most commonly observed with recombination events enriched around the 3’UTR of the genome, consistent with our previous characterization of RNA recombination in distinct coronavirus isolates including MHV, MERS and SARS-CoV-2 (Gribble et al., 2021). Finally, large deletions comprising RNA recombination events stretching from nucleotides ~ 6000–7000 to the 3’UTR were also observed, again, consistent with our previous observations. Altogether, these results demonstrate RNA recombination is a common and conserved feature of SARS-CoV-2 and that the emergence of D-RNAs is a prevalent source of genetic diversity amongst these isolates and is captured using Tiled-ClickSeq.

Clinical isolates of SARS-CoV-2

We assayed nasopharyngeal (NP) swabs from 60 individuals who tested positive for SARS-CoV-2 in UTMB clinics between 5th January and 4th February 2021, which corresponds to the beginning of the ‘third wave’ driven by the B.1.1.7 ‘Alpha’ variant of concern (VOC). CTs available for 51 of the samples ranged from 15 to 35 (Source data 3). RNA was extracted from 200 µl of the inactivated Viral Transport Medium (VTM) for each of these samples and used as input for Tiled-ClickSeq library synthesis using the v3 tiled primers and only 18 cycles in the PCR amplification step. Similar to the SARS-CoV-2 isolates, libraries were pooled and sequenced on either a MiSeq or NextSeq and raw data were processed using the computational pipeline established above.

The percentage coverage of the genome with greater than 10 reads is illustrated in Figure 6A as a function of the CT value (where available) for each sample and coloured according to the sequencing batch. At low CT values < 20, all but one sample returned complete genome coverage. Between CTs values of 20 and 25, coverage began to drop with the only one complete genome being obtained with a CT >30. The raw read coverage of the Tiled-ClickSeq data is shown for each sample in Figure 6—figure supplement 1. At low CTs ( < 17.5), genome coverage ranging from 100x to 10,000x is seen for all the samples for almost the entire genome. Coverage reduces corresponding with increased CT values, with only patchy coverage observed for samples with CT >30. Therefore, from a total of 60 samples, we obtained 36 complete genomes passing quality filters ( > 99.5% total genome coverage) which were deposited into GISAID. The majority of these were assigned to the B.1.2 Pango lineage (Source data 3).

Figure 6 with 1 supplement see all
Tiled-ClickSeq for surveillance of SARS-CoV-2 from nasopharyngeal (NP) swabs collected during routine diagnostics of COVID19 at UTMB.

(A) The percent genome coverage with greater than 10 reads is plotted as a function of the measured CT value (x-axis) for each clinical sample sequenced. Each point is colour-coded according to the batch of NGS libraries synthesized. (B) Sequence reads for one of the SARS-CoV-2 samples from the B.1.1.7 lineage are illustrated using Tablet sequence viewer to indicate the U to C transition at nt 2 (U2C) of the SARS-CoV-2 genome. (C) Sequence alignments of the 5’UTR of the consensus genomes of 18 clinical samples assayed illustrates the U2C SNVs found in each B.1.1.7 variant as well as a U13C and C21U in two other B.1.2 variants. (D) The structure of the first 35 nts of the SARS-CoV-2 5’UTR is illustrated which contains Stem Loop 1. The three SNVs identified in the consensus genomes of clinical samples are indicated.

As Tiled-ClickSeq can capture the entire genome including the 5’UTR, we inspected the raw data mapping to the reconstructed references to verify the SNV assignment. We found multiple isolates with SNVs in the 5’UTR. From the 36 reconstructed genomes, 8 were from the B.1.1.7 lineage (Source data 3). Although read coverage is generally diminished at the 5’ ends, 7 of these had sufficient coverage ( > 10 x coverage) to assign a U-to-C transition at the second nucleotide of the genome (U2C) (Figure 6B and C). To determine whether this was a novel SNV, we queried all SARS-CoV-2 genomes deposited in the GISAID database (2’867’767 genomes on the 19th August 2021) and extracted the first 30 nts of each genome. Only 176 genomes contained the U2C SNV; 70 of these were from B.1.1.7 ‘Alpha’ VOC and 43 were from the B.1.617.2 ‘Delta’ VOC. We additionally found C21U in MicroGNL_134 and U13C and C21U in MicroGNL_135. Both of these SNVs are located in the first stem-loop of the SNV 5’UTR (SL1) (Figure 6D).

Recombination analysis using ViReMa revealed all the expected sgmRNAs (Figure 7 and Figure 7—figure supplement 1, Figure 7—source data 1). Similar to the distribution of RNA recombination frequencies observed for the cell-culture isolates (both here and in our previous studies [Gribble et al., 2021]), we observe an enrichment of small recombination events near the 3’UTR of SARS-CoV-2. The two common recombination events in this region that correspond to small duplications (29173^27,802 and 29442^29323, using WA-1 coordinates, indicated Figure 7) were found in 14 and 17 of the clinical samples respectively (Figure 7—figure supplement 1) at frequencies ranging up to approximately 8% of the sequencing reads mapping in these regions. Interestingly, although it is a conserved feature in clinical isolates, it is not found in any of the cell-culture derived samples assayed (Figure 4—source data 1). Larger RNA recombination events corresponding to D-RNAs similar to those we have previously characterized in samples derived from cell-culture are also observed. These predominantly comprise recombination events with donor sites at nt 6000–7000 and acceptor sites in the 3’UTRs. Interestingly, only one sample (MicroGNL-163: B.1.2) had any evidence of InDels at the furin cleavage site of Spike protein with 83 unique reads mapping over a 23583^23,599 deletion, corresponding to an approximate frequency of 5% of the total virus reads mapping across this locus.

Figure 7 with 1 supplement see all
Tiled-ClickSeq identifies sub-genomic mRNAs, structural variants and Defective-RNAs in clinical samples of SARS-CoV-2.

Similarly to Figure 5B, unique RNA recombination events are plotted for 36 clinical samples as a scatterplot whereby the upstream ‘donor’ site is plotted on the y-axis and a downstream ‘acceptor’ site is plotted on x-axis using the WA-1 reference coordinates for each sample. The read count for each unique RNA recombination event is indicated by the size of the point, while the number of samples in which this each RNA recombination event is found is indicated by the colour-bar. Insertions/duplication/back-splicing events are found above the x = y axis, while deletions and RNA recombination events yielding sgmRNAs are found below.

Figure 7—source data 1

BED files of RNA recombination events detected by ViReMa in the Tiled-ClickSeq data from each WRCEVA isolate and clinical sample.

https://cdn.elifesciences.org/articles/68479/elife-68479-fig7-data1-v1.zip

Discussion

Tiled-ClickSeq provides a simple method for whole genome sequencing of virus isolates such as SARS-CoV-2 that can simultaneously map SNVs, minority variants as well as recombination events. Importantly, having only a single template-targeted primer per amplicon provides the opportunity to sequence any RNA template regardless of what expected or unknown sequence is found upstream, including recombinant RNA molecules such as sgmRNAs and D-RNAs. The targeted approach requires a relatively small number of reads to be collected, allowing 10 s of samples to be processed on a MiSeq platform or potentially 100 s on a single flowcell of a NextSeq. Furthermore, the same library preps can be used as input in Oxford Nanopore Sequencing pipelines to yield longer reads, providing the convenience and portability inherent to the platform. We demonstrated that this method can reconstruct full-length SARS-CoV-2 genomes in a manner equivalent to random-primed methods. Full length-genome sequencing is achieved, including the 5’UTR which is missed in the bulk of current high-throughput sequencing efforts, removing the need for 5’RACE.

We further validated our method by sequencing SARS-CoV-2 genomes from discarded clinical specimens (nasopharyngeal swabs in viral transport media) used for routine COVID19 diagnostics. From 60 specimens, we obtained 36 consensus genomes with greater than 10 x coverage across >99.5% of the viral genome. With only 18 PCR cycles, genome coverage began to drop concomitantly with the viral titer of the specimen, with samples with a CT value lower than 25 providing the best quality sequence data. From these samples, we found that all the B.1.1.7 variants characterized contained a U2C SNV. It is yet to be determined whether this SNV is unique to clinical samples collected in the geographical region of this study and the small number of SARS-CoV-2 deposited in GISAID that resolved the 5’UTRs. It is possible many B.1.1.7 isolates/samples bear this mutation but have been missed due to the prevalent use of paired-primer approaches whose 5’-most primer is downstream of or annealing to this locus. In either case, this may be significant as the 5’UTR contains many important regulatory elements that impact replication and translation as well as recombination events that generate the viral sgmRNAs (Li et al., 2008; Zuniga et al., 2004) and therefore may impact viral fitness.

An additional interesting feature of the clinical specimens was in the absence of small InDels in and adjacent to the furin cleavage site of Spike. Small deletions such as the actual furin cleave site ‘PRRA’ and the ‘QTQTN’ motif upstream of the site are prevalent among cell-passaged and amplified isolates of SARS-CoV-2 (Liu et al., 2020). Indeed, we recently demonstrated that these deletions confer a strong selection advantage in Vero E6 cells, rationalizing their common occurrence in in vitro preparations (Johnson et al., 2021). It is surprising therefore that despite observing these deletions in almost all of the cell-culture isolates to some small degree, we did not find any evidence of small deletions in the 60 clinical specimens analyzed here with the exception of one sample (MicroGNL_163). This suggests that while SARS-CoV-2 may be poised to generate structural variants in this region, they are not observed in primary human samples and thus not selected for during human infections.

Recombination analysis using ViReMa confirms that D-RNAs are a true component of the intrahost diversity of clinical specimens of SARS-CoV-2, albeit at low abundance and are highly variable from sample to sample in terms of their precise junctions. In contrast, we observed a small number of unique recombination events in multiple clinical samples, such as 29442^29,323. The 29442^29,323 event is a 120nt duplication that maintains the open reading frame in the C-terminal end of Nucleocapsid (N) at residues 350–389, which contains the last 10 structured amino acids of the C-terminal domain (Peng et al., 2020). Interestingly, this event is not found in any of the cell-culture derived samples suggesting that it is purged during expansion and isolation. Multiple small insertions and duplications have been previously reported in this locus of SARS-CoV-2 (Garushyants et al., 2021), although these reported events are smaller than those detected here ( ≤ 24 nts). Genomic duplication followed by ‘resolution’ through a series of recombination events is a common mechanism observed in RNA viruses that allows for virus diversification and evolution of the duplicated regions (Scheel et al., 2013; Bentley et al., 2021). Such large duplications may not be resolved when using paired-primer genomics approaches as these would generate amplicons larger than expected and so may be filtered out during final NGS library size selection or during bioinformatic processing.

The design of Tiled-ClickSeq imparts built-in quality control tools, including UMIs in the click-adaptor and the opportunity to use paired-end sequencing to identify the primer that gives rise to each amplicon. In addition to controlling for aberrant SNVs, minority and structural variants, this information can be used to determine the relative sensitivity and specificity of each primer in the primer mix allowing the scheme to be pruned and optimized. Our final primer scheme contained over 390 unique SARS-CoV-2 primers. This purposely thorough design demonstrates how the Tiled-ClickSeq pipeline can accommodate complex mixtures of overlapping primers within the same RT reaction. This built-in redundancy reduces the chance of primer dropout due to the presence of SNVs, SVs or recombination events found in primer-annealing sites. This feature is especially important considering the emergence of SARS-CoV-2 variants with deletions and mutations that disrupt sequencing efforts (Plante et al., 2021). Interestingly, we detected very few minority variants in our samples present above 2%. This is consistent with other reports of minority variant detection in SARS-CoV-2 isolates and likely reflects the well-characterized activity of the coronavirus ExoN-nsp14 as a ‘proof-reader’ enzyme (Smith and Denison, 2013). As a result, the greatest source of genetic diversity in coronavirus isolates may well be due to RNA recombination.

On the nanopore sequencing platform, we could obtain sequence reads within the same day as RNA extraction. While the baseline accuracy rate of the nanopore platform prevents the reliable annotation of minority variants present at <5%, this platform can reconstruct novel SARS-CoV-2 variants as well as identify abundant sgmRNAs. Nanopore sequencing also allows for identification of long-range epistatically linked variants (Gallardo et al., 2021). Epistatic linkage can also be computationally leveraged to identify minority variants present at levels below the baseline error-rate of the sequencing platform, for example, using CliqueSNV (Knyazev et al., 2020) or CoVaMa (Routh et al., 2015a). Therefore, the nanopore platform in combination with Tiled-ClickSeq provides a robust pipeline for high-throughput SARS-CoV-2 variant detection with minimal infrastructure.

The ARTIC protocol contains a primer cognate to the 5’UTR of SARS-CoV-2 (nts 30–54) to capture and quantitate sub-genomic mRNAs (Parker et al., 2020). However, recombination events including non-canonical sgmRNAs will be missed by primer-pools that do not happen to flank RNA recombination junctions. In contrast, Tiled-ClickSeq is capable of ‘agnostically’ detecting any unanticipated RNA recombination including D-RNAs that can be characterized by RNA recombination events in or between any region of the viral genome, often in an unpredictable manner. As ClickSeq was originally designed to avoid artefactual recombination with fewer than three artefactual chimeric reads found per million reads, Tiled-ClickSeq provides a useful tool to identify D-RNAs and to robustly characterize rates of RNA recombination. Together, using the Tiled ClickSeq approach, we have the opportunity to identify rare and unexpected recombination events and are not biased by the limitation of primer-pair approaches. Coupled with its cross-sequencing platform capabilities, the work highlights the utility of Tiled-ClickSeq for analysis of SARS-CoV-2.

Materials and methods

Key resources table
Reagent type (species) or resourceDesignationSource or referenceIdentifiersAdditional information
Chemical compound, drugAzNTPsBaseClick
  • BCT-25, BCT-26, BCT-27, BCT-28

Sequence-based reagent5’-hexynyl-functionalized i5 oligoIDT
  • DNA oligo

Sequence-based reagentTCSv3This PaperPCR PrimersSource data 1
Software, algorithmBash and python3 text filesThis Paper.txt and.py filesSource data 2

Viruses and RNA extraction

Request a detailed protocol

For the World Reference Center for Emerging Viruses and Arboviruses (WRCEVA) isolates, viral RNA was obtained from supernatant materials of viral isolates amplified on Vero cells originally obtained from nasopharyngeal swab samples that tested positive in clinical laboratory assays for SARS-CoV-2 RNA, as described previously (Harcourt et al., 2020b). The use of deidentified human samples was approved by the UTMB IRB under protocol 20–0088. The recombinant wild-type and ‘PRRA-deletion’ mutant SARS-CoV-2 are based on the sequence of USA-WA1/2020 isolate provided by the WRCEVA as previously described (Xie et al., 2020; Johnson et al., 2021). Wild-type and mutant SARS-CoV-2 were titrated and propagated on Vero E6 cells. RNAs were extracted from either total cellular materials or supernatants as indicated in the main text.

Collection of RNA from clinical samples

Request a detailed protocol

Discarded nasopharyngeal swabs Viral Transport Media (VTM) used for COVID19 diagnostics at UTMB were collected and de-identified before being discarded. Only the CT value was retained. Of the remaining VTM, 200 µl was put into 1000 µl of Trizol LS for inactivation and stored at –80°C overnight. The next day, samples were thawed, vortexed and 266.6 µl of chloroform was added. Samples were centrifuged at 12,500 g for 15 min at 4°C to separate the phases. The top aqueous phase (400 µl-450µl) was aspirated and added to 667 µl isopropanol containing 2 µl of glycoblue and vortexed for 10 s. Next, samples were incubated at room temperature for 10 minutes and centrifuged at 20,800 g for 20 min at 4°C to pellet RNA. Isopropanol was removed and 1 ml 75% ethanol was added, vortexed and inverted several times. RNA was precipitated by centrifugation at 10,000 g for 10 min at 4°C. Ethanol was removed and the pellet allowed to air-dry for 5–10 min. The RNA pellet was then resuspended in 20 µl of RNase/DNase free water. Five µl of the resuspended RNA was subsequently used as input in Tiled-ClickSeq library preparation.

SARS-CoV-2 reverse transcription primer design

Request a detailed protocol

A ‘first’ tiled-primer set (v1) containing 71 primers was designed cognate to the WA-1 SARS-CoV-2 genome (accession number: NC_045512.2) using the primalseq webserver (Grubaugh et al., 2019a; http://primal.zibraproject.org/) with an amplicon distance of approximately 500nt in between each primer pair. We used only the ‘right’ primer sequences generated by primalseq and appended the Illumina p7 adaptor to these (e.g. GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT+ NNNN + TGTCTCACCACTACGACCGTAC). We also included an additional primer designed to target the 3’-most 25 nts of the SARS-CoV-2 genome. A ‘second’ tiled-primer set was synthesized in a similar fashion using 326 loci described previously (Guo et al., 2020). A ‘third’ tiled-primer pool (v3) was generated by combining the v1 and v2 pools. Primers used in this study are provided as BED files with their loci and the corresponding sequence in Source data 1. Each primer was pooled in equimolar ratios to yield a SARS-CoV-2 specific primer pool used for the RT step of Tiled-ClickSeq.

ClickSeq library preps

Request a detailed protocol

Random-primed ClickSeq NGS libraries were synthesized as described in previously published protocols from our lab (Jaworski and Routh, 2018; Routh et al., 2015b). In Jaworski and Routh, 2018, MiMB (Jaworski and Routh, 2018), we provided detailed descriptions of each step of the protocol and emphasize issues that affect the success and quality of the final libraries. Here, for Tiled-ClickSeq, the protocol has two important adjustments: (1) firstly, the primers used to initiate reverse transcription comprised pools of 10 s-100s of virus-specific primer oligos (primer pool v1, v2, and/or v3, described above); and (2) we annealed the RT-primers to the RNA template by incubating the RNA and primer mixture at 65°C for 5 min, followed by a slow-cool of 1 degree per second to a final temperature or 12°C. RT-enzyme mixes were added at 12°C and primer extension is performed for 10 min at 55 °C. All subsequent steps of the Tiled-ClickSeq reaction, comprising RT cleanup, click-ligation, PCR amplification and cDNA library size-selection are identical to those used in the random-primed ClickSeq method and described previously (Jaworski and Routh, 2018). The i5 ‘Click-Adaptor’ was a reverse complement of the full Illumina Universal Adaptor sequence, plus an additional twelve ‘N’s at 5’-end to provide a Unique Molecular Identifier and functionalized with a 5’-hexynyl group (Integrated DNA Technologies). Final NGS libraries containing fragment sizes ranging 300–700 nts were pooled and sequenced on Illumina MiSeq, MiniSeq or NextSeq platforms using paired-end sequencing.

ARTIC amplicon sequencing

Request a detailed protocol

We used the standard protocol for ARTIC sequencing of SARS-CoV-2 as is well described (Tyson et al., 2020; dx.doi.org/10.17504/protocols.io.bbmuik6w, last accessed 26th Aug 2021) using the version3 ARTIC primer sets. Briefly, cDNA was generated using random-primed reverse-transcription reactions. DNA amplicons 300-500nts in length were generated in two separate PCR reactions containing pairs of primers targeting the SARS-CoV-2 that together provide overlapping amplicons to generate full-length genome sequence coverage. DNA amplicons were pooled in equimolar ratios and we used the Illumina NEBNext Multiplexed sequencing kit to append barcoded adaptors and performed 2 × 150 nt PE sequencing. Whole SARS-CoV-2 genomes were reconstructed by mapping both forward and reverse reads to the reference WA-1 strain and calling SNVs using pilon, RRID:SCR_014731 (Walker et al., 2014), in an identical fashion performed for the Tiled-ClickSeq data.

Nanopore sequencing

Request a detailed protocol

Final cDNA libraries generated by the Tiled-ClickSeq protocol, although containing Illumina adaptors, are compatible with the Direct Sequencing by Ligation Kit (LSK-109) provided by Oxford Nanopore Technologies. cDNA library fragments > 600 nts in length were gel extracted and processed for nanopore sequencing using the manufacturer’s protocols. The addition of demultiplexing barcodes can be achieved using the Native Barcoding by Ligation module (NBD104), again following the manufacturer’s protocols. Single-plex or pooled cDNA libraries with ONT adaptors were loaded onto MIN-FLO109 flowcells on a MinION Mk1C and sequenced using the MinKNOW controller software for >24 hours. Raw FAST5 reads were base-called and demultiplexed using Guppy.

Bioinformatics

Request a detailed protocol

All batch scripts and custom python scripts used in this manuscript are available in Source data 2. Specific command-line entries and parameters can be found therein.

For Illumina reads, raw data were filtered and trimmed using fastp, RRID:SCR_016962 (Chen et al., 2018Chen et al., 2018) to remove Illumina adaptors, quality filter reads and extract Unique Molecular Identifiers (UMIs). A custom python3 script was written to split the raw ‘forward’/R1 reads into multiple individual FASTQ files depending upon the tiled-sequencing primer that is present in the first 30 nts of the ‘reverse’/R2 paired-read (Source data 2). These split FASTQ files were then trimmed using cutadapt, RRID:SCR_011841 (Martin, 2011) to remove primer-derived sequences from the R1 reads. After trimming, all the split R1 files were re-combined to yield a final processed dataset. These reads were mapped to the WA-1 strain (NC_045512.2) of SARS-CoV-2 using bowtie2, RRID:SCR_016368 (Langmead and Salzberg, 2012) and a new reference consensus genome was rebuilt for each dataset using pilon, RRID:SCR_014731 (Walker et al., 2014). Next, we mapped the processed read data to the reconstructed reference genome using ViReMa, RRID:SCR_000566 (Routh and Johnson, 2014) to map to both the virus and the host (chlSab2 or hg19) genome. SAM files were manipulated using samtools, RRID:SCR_002105 (Li et al., 2009) and de-duplicated using umi-tools, RRID:SCR_017048 (Smith et al., 2017). Minority variants were extracted using the mpileup command in samtools, RRID:SCR_002105 (Li et al., 2009) and a custom python3 script to count nucleotide frequency at each coordinate to find minority variants (Source data 2). Mapped data were visualized using Tablet, RRID:SCR_000017 (Milne et al., 2010).

For Nanopore reads, porechop, RRID:SCR_016967 (https://github.com/rrwick/Porechop) was used to remove Illumina adaptor sequences and reads greater than 100nts in length were retained. These were mapped to the WA-1 SARS-CoV-2 genome (NC_045512.2) using minimap2, RRID:SCR_018550 (Li, 2016) with the -splice option selected. Output SAM files were processed using samtools, RRID:SCR_002105 (Li et al., 2009Li et al., 2009) and bedtools, RRID:SCR_006646 (Quinlan, 2014Quinlan, 2014) to generate coverage maps.

Data availability

All raw sequencing data (Illumina and Nanopore in FASTQ format) are available in the NCBI Small Read Archive with BioProject PRJNA707211. Consensus genomes for WRCEVA SARS-CoV-2 isolates reported in this manuscript are deposited at GenBank (MW047307-MW047318 and MW703487-MW703490). Genome sequences of clinical samples of SARS-CoV-2 were deposited in GISAID with accession numbers listed in SData 3 and are searchable in GISAID using 'clickseq'.

The following data sets were generated
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW047307. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000501/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW047308. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000502/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW047309. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000505/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW047310. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000506/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW047311. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000507/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW047312. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000508/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW047313. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000509/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW047314. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000510/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW047315. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000513/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW047316. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000514/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW047317. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000515/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW047318. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000516/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW703487. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000503/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW703488. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000504/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW703489. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000511/2020.
    1. Jaworski E
    2. Langsjoen RM
    3. Judy B
    4. Newman P
    5. Plante K
    6. Pyles R
    7. Ksiazek T
    8. Weaver SC
    9. Routh AL
    (2021) NCBI Nucleotide
    ID MW703490. Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/WRCEVA_000512/2020.

References

Decision letter

  1. Anurag Agrawal
    Reviewing Editor; CSIR Institute of Genomics and Integrative Biology, India
  2. K VijayRaghavan
    Senior Editor; National Centre for Biological Sciences, Tata Institute of Fundamental Research, India
  3. Anurag Agrawal
    Reviewer; CSIR Institute of Genomics and Integrative Biology, India

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Acceptance summary:

The proposed method provides an alternate approach for viral whole genome sequencing that appears to have advantages over current amplicon methods in terms of detecting variants such as duplications. This may be important to explore viral variants and evolution.

Decision letter after peer review:

Thank you for submitting your article "Tiled-ClickSeq for targeted sequencing of complete coronavirus genomes with simultaneous capture of RNA recombination and minority variants" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Anurag Agrawal as Reviewing editor and Reviewer #1, and the evaluation has been overseen by K VijayRaghavan as the Senior Editor.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

While the work is of interest and the proposed new method seems useful, extensive additional data is required, specifically by comparisons to existing approaches in clinical samples, since the value of this approach can only be judged after seeing the additional data.

Essential revisions:

1) The method must be tested on actual clinical samples, with comparison to at least one of the existing methods in wide use, so that claims of better performance can be judged with evidence. Average read depth (relative to total nucleotides sequenced per sample) across multiple samples with confidence intervals or equivalent to visualize run-to-run variability should be shown in comparison to other approaches.

2) Provide evidence to support the claim of identifying recombinant RNAs that are otherwise undetectable.

Reviewer #1 (Recommendations for the authors):

1. The claims for sensitivity and specificity must be backed by actual comparisons.

2. Is there an example of a recombinant RNA identified by this approach missed by alternative approaches?

Reviewer #2 (Recommendations for the authors):

I would be much more supportive of this manuscript if the authors were able to expand on the current body of knowledge about SARS-CoV-2 using their sequencing method, even if this is limited to identifying recombinant RNAs that are otherwise undetectable.

Reviewer #3 (Recommendations for the authors):

The main critique and weakness are that the authors did not use the method described to sequence directly from the primary nasopharyngeal samples, but rather from Vero cell-grown SARS-CoV-2 isolates. This has dampened the significance of this potentially important contribution and should be easily addressed in a revised form. One of the greatest obstacles in the viral sequencing space, especially with the emergence of SARS-CoV-2 variants, is in resolving the sequence of primary samples and not from cell grown-cells. A lot of public laboratories may not have the time, resource or expertise to grow viral isolates in Vero cells first and then sequence. If the authors can provide evidence that their approach is comparable or perhaps better than the current method (i.e., ARTIC protocol) in generating sequences directly from samples then this method can significantly contribute to field.

https://doi.org/10.7554/eLife.68479.sa1

Author response

Essential revisions:

1) The method must be tested on actual clinical samples, with comparison to at least one of the existing methods in wide use, so that claims of better performance can be judged with evidence. Average read depth (relative to total nucleotides sequenced per sample) across multiple samples with confidence intervals or equivalent to visualize run-to-run variability should be shown in comparison to other approaches

We have provided a substantial update to the manuscript by providing an analysis of 60 clinical specimens of SARS-CoV-2, as described below in the detailed responses to the reviewers that addresses all of these issues.

2) Provide evidence to support the claim of identifying recombinant RNAs that are otherwise undetectable.

In the revision, we describe the recombination analysis of multiple clinical samples of SARS-CoV-2. To illustrate the potential of Tiled-ClickSeq to characterize novel recombination events, we describe the observation of a large genome duplication (annotated as 29442^29323) found in multiple clinical samples, but not any cell-culture samples (providing support that these are not sequence artifacts). We are not so bold as to state that ARTIC approaches or other could never detect such an event, but simply state that this event is an example of one that has not been observed before.

In the manuscript, we now present data using the ARTIC protocol and Illumina sequencing to study the 12 WRCEVA isolates described in the manuscript and presented in Figure 3. We performed an analysis of recombination events using a similar pipeline to that described for the Tiled-ClickSeq data. After adaptor trimming and quality filtering the read data, recombination events were characterized using ViReMa. The recombination heatmaps (analogous to Figure 5B) for these data are shown in Author response image 1.

Author response image 1
Similarly to Figure 5B in the manuscript, unique RNA recombination events are plotted for WRCEVA samples as a scatter plots whereby the upstream ‘donor’ site is plotted on the y-axis and a downstream ‘acceptor’ site is plotted on x-axis.

The read count for each unique RNA recombination event is indicated by the size of the point, while the number of samples in which this each RNA recombination event is found is indicated by the colour-bar. Insertions/duplication/back-splicing events are found above the x=y axis, while deletions and RNA recombination events yielding sgmRNAs are found in Author response image 2.

Similarly to Figure 5B in the manuscript, unique RNA recombination events are plotted for WRCEVA samples as a scatter plot whereby the upstream ‘donor’ site is plotted on the y-axis and a downstream ‘acceptor’ site is plotted on x-axis. The read count for each unique RNA recombination event is indicated by the size of the point, while the number of samples in which this each RNA recombination event is found is indicated by the colour-bar.Insertions/duplication/back-splicing events are found above the x=y axis, while deletions and RNA recombination events yielding sgmRNAs are found in Author response image 2.

While this does indeed capture the expected sgmRNAs (large purple circles in bottom right), there are many spurious recombination events also reported. These are represented by the characteristic vertical and horizontal striations present in this heatmap and which are absent in our Tiled-ClickSeq data. As we suspected that these events are derived from mis-priming events in during the ARTIC PCR step, we additionally trimmed the first 25 nts from each read and re-mapped the data using ViReMa. A new heat map is shown in Author response image 2.

Author response image 2
Same as Author response image 1, except that read data used in the analysis have been trimmed by 25nts to remove any potential primer-derived sequences that may give rise to artifactual chimeric reads.

This removes some of the striations, indicating that recombination/chimeric artefacts have been removed. However, many events remain, in particular we still see numerous small InDels that are not found in the Tiled-ClickSeq data. Furthermore, this primer-trimming results in the loss of the detection of the sgmRNAs. This is likely due to the closeness of the primer annealing site (nts30-54) to the sgmRNAs recombination site (nt70). Trimming these 24 nts from the reads will leave only ~14nts of the Leader sequence, which is shorter than the seed length used in ViReMa.

These issues do not demonstrate that ARTIC or similar approaches cannot be used for recombination analysis with careful optimization or primer designs, but merely illustrate that there is substantial background noise and that analyses are at least problematic. We feel that it would be infeasible to exhaustively demonstrate that other approaches cannot solve recombination events, and indeed the capture of these may vary greatly among different experimentalists using different wet-lab and bioinformatics protocols.

Rather, we hope that in the revised manuscript, the power of Tiled-ClickSeq to reliably detect and report novel recombination events is clear as presented. We also emphasize in the manuscript how Tiled-ClickSeq has the capacity to resolve the entire 5’ UTR of SARS-CoV-2 genomes, which paired-primer amplicons inherently cannot. We found a U2C SNV that is pervasive throughout all B.1.1.7 variants sequenced here. This is an interestingly and novel feature that we are following up on. Altogether, we believe that Tiled-ClickSeq provides a unique and flexible platform for targeted virus genomics that will have applications in many settings beyond those presented here.

Reviewer #1 (Recommendations for the authors):

1. The claims for sensitivity and specificity must be backed by actual comparisons.

As described above, we provide a comprehensive analysis of clinical specimens and report the sensitivity and specificity of our approach. These new data are presented in Figure 6 and described in the main text.

2. Is there an example of a recombinant RNA identified by this approach missed by alternative approaches?

In the manuscript we provide an analysis of RNA recombination events found in clinical specimens of SARS-CoV-2 wherein we observe a pervasive 29422^29323 duplication. We do not feel we can state that this event cannot be captured using other approaches (indeed it would not be feasible to exhaustively demonstrate this), however, we simply state that it has not been previously observed despite being so prevalent here. These new data we presented in Figure 7 and supplementary materials and discussed in the main text.

Reviewer #2 (Recommendations for the authors):

I would be much more supportive of this manuscript if the authors were able to expand on the current body of knowledge about SARS-CoV-2 using their sequencing method, even if this is limited to identifying recombinant RNAs that are otherwise undetectable.

Thank you for your comments. In the revised manuscript we have provided substantial additional information as described above with an analysis of multiple clinical samples of SARS-CoV-2. In particular, we identify SNVs in 5’UTRs and novel recombinant RNAs that have not previously been described.

Reviewer #3 (Recommendations for the authors):

The main critique and weakness are that the authors did not use the method described to sequence directly from the primary nasopharyngeal samples, but rather from Vero cell-grown SARS-CoV-2 isolates. This has dampened the significance of this potentially important contribution and should be easily addressed in a revised form. One of the greatest obstacles in the viral sequencing space, especially with the emergence of SARS-CoV-2 variants, is in resolving the sequence of primary samples and not from cell grown-cells. A lot of public laboratories may not have the time, resource or expertise to grow viral isolates in Vero cells first and then sequence. If the authors can provide evidence that their approach is comparable or perhaps better than the current method (i.e., ARTIC protocol) in generating sequences directly from samples then this method can significantly contribute to field.

Thank you for your comments. We have addressed this issue in the revised manuscript.

https://doi.org/10.7554/eLife.68479.sa2

Article and author information

Author details

  1. Elizabeth Jaworski

    1. Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, Galveston, United States
    2. ClickSeq Technologies LLC, Galveston, United States
    Contribution
    Conceptualization, Data curation, Investigation, Methodology, Resources, Validation, Writing – review and editing
    Competing interests
    E.J. and A.R. are co-founders and owners of ‘ClickSeq Technologies LLC’, a Texas-based Next-Generation Sequencing provider offering ClickSeq kits and services including the methods described in this manuscript. E.J. and A.R have filed a patent application (PCT/US2021/038048) on the method and use of single-primer tiled sequencing.
  2. Rose M Langsjoen

    Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, Galveston, United States
    Contribution
    Investigation, Methodology, Resources, Writing – review and editing
    Competing interests
    No competing interests declared
  3. Brooke Mitchell

    1. World Reference Center for Emerging Viruses and Arboviruses, University of Texas Medical Branch, Galveston, United States
    2. Department of Microbiology and Immunology, The University of Texas Medical Branch, Galveston, United States
    Contribution
    Investigation, Resources
    Competing interests
    No competing interests declared
  4. Barbara Judy

    Department of Pediatrics, University of Texas Medical Branch, Galveston, United States
    Contribution
    Investigation, Resources
    Competing interests
    No competing interests declared
  5. Patrick Newman

    Department of Pediatrics, University of Texas Medical Branch, Galveston, United States
    Contribution
    Investigation, Resources
    Competing interests
    No competing interests declared
  6. Jessica A Plante

    1. World Reference Center for Emerging Viruses and Arboviruses, University of Texas Medical Branch, Galveston, United States
    2. Department of Pathology, University of Texas Medical Branch, Galveston, United States
    3. Institute for Human Infections and Immunity, University of Texas Medical Branch, Galveston, United States
    Contribution
    Investigation, Resources
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4768-7458
  7. Kenneth S Plante

    1. World Reference Center for Emerging Viruses and Arboviruses, University of Texas Medical Branch, Galveston, United States
    2. Department of Pathology, University of Texas Medical Branch, Galveston, United States
    3. Institute for Human Infections and Immunity, University of Texas Medical Branch, Galveston, United States
    Contribution
    Investigation, Resources
    Competing interests
    No competing interests declared
  8. Aaron L Miller

    Department of Pediatrics, University of Texas Medical Branch, Galveston, United States
    Contribution
    Investigation, Resources
    Competing interests
    No competing interests declared
  9. Yiyang Zhou

    Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, Galveston, United States
    Contribution
    Data curation, Formal analysis, Validation, Visualization
    Competing interests
    No competing interests declared
  10. Daniele Swetnam

    Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, Galveston, United States
    Contribution
    Data curation, Formal analysis, Validation, Visualization
    Competing interests
    No competing interests declared
  11. Stephanea Sotcheff

    Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, Galveston, United States
    Contribution
    Investigation, Resources
    Competing interests
    No competing interests declared
  12. Victoria Morris

    Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, Galveston, United States
    Contribution
    Investigation, Resources
    Competing interests
    No competing interests declared
  13. Nehad Saada

    1. World Reference Center for Emerging Viruses and Arboviruses, University of Texas Medical Branch, Galveston, United States
    2. Department of Microbiology and Immunology, The University of Texas Medical Branch, Galveston, United States
    Contribution
    Resources
    Competing interests
    No competing interests declared
  14. Rafael RG Machado

    1. World Reference Center for Emerging Viruses and Arboviruses, University of Texas Medical Branch, Galveston, United States
    2. Department of Microbiology and Immunology, The University of Texas Medical Branch, Galveston, United States
    Contribution
    Resources
    Competing interests
    No competing interests declared
  15. Allan McConnell

    1. World Reference Center for Emerging Viruses and Arboviruses, University of Texas Medical Branch, Galveston, United States
    2. Department of Microbiology and Immunology, The University of Texas Medical Branch, Galveston, United States
    Contribution
    Resources
    Competing interests
    No competing interests declared
  16. Steven G Widen

    1. Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, Galveston, United States
    2. Next-Generation Sequencing Core, The University of Texas Medical Branch, Galveston, United States
    Contribution
    Resources
    Competing interests
    No competing interests declared
  17. Jill Thompson

    Next-Generation Sequencing Core, The University of Texas Medical Branch, Galveston, United States
    Contribution
    Resources
    Competing interests
    No competing interests declared
  18. Jianli Dong

    1. Department of Pediatrics, University of Texas Medical Branch, Galveston, United States
    2. Institute for Human Infections and Immunity, University of Texas Medical Branch, Galveston, United States
    Contribution
    Investigation, Resources
    Competing interests
    No competing interests declared
  19. Ping Ren

    Department of Microbiology and Immunology, The University of Texas Medical Branch, Galveston, United States
    Contribution
    Investigation, Resources
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4022-6667
  20. Rick B Pyles

    Department of Pediatrics, University of Texas Medical Branch, Galveston, United States
    Contribution
    Resources, Supervision
    Competing interests
    No competing interests declared
  21. Thomas G Ksiazek

    1. World Reference Center for Emerging Viruses and Arboviruses, University of Texas Medical Branch, Galveston, United States
    2. Department of Pathology, University of Texas Medical Branch, Galveston, United States
    Contribution
    Resources, Supervision
    Competing interests
    No competing interests declared
  22. Vineet D Menachery

    1. World Reference Center for Emerging Viruses and Arboviruses, University of Texas Medical Branch, Galveston, United States
    2. Department of Microbiology and Immunology, The University of Texas Medical Branch, Galveston, United States
    3. Institute for Human Infections and Immunity, University of Texas Medical Branch, Galveston, United States
    Contribution
    Funding acquisition, Investigation, Resources, Supervision, Writing – review and editing
    Competing interests
    No competing interests declared
  23. Scott C Weaver

    1. World Reference Center for Emerging Viruses and Arboviruses, University of Texas Medical Branch, Galveston, United States
    2. Department of Microbiology and Immunology, The University of Texas Medical Branch, Galveston, United States
    3. Institute for Human Infections and Immunity, University of Texas Medical Branch, Galveston, United States
    Contribution
    Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing – review and editing
    Competing interests
    No competing interests declared
  24. Andrew L Routh

    1. Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, Galveston, United States
    2. Institute for Human Infections and Immunity, University of Texas Medical Branch, Galveston, United States
    3. Sealy Centre for Structural Biology and Molecular Biophysics, University of Texas Medical Branch, Galveston, United States
    Contribution
    Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review and editing
    For correspondence
    alrouth@utmb.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-2874-5990

Funding

National Institutes of Health (R21AI151725)

  • Andrew L Routh

National Institutes of Health (R01AI153602)

  • Vineet D Menachery

National Institutes of Health (R00AG049042)

  • Vineet D Menachery

National Institutes of Health (R24AI120942)

  • Scott C Weaver

The Sealy and Smith Foundation

  • Scott C Weaver

University of Texas Medical Branch at Galveston

  • Andrew L Routh

Centers for Disease Control and Prevention (200-2021-11195)

  • Scott Weaver
  • Andrew L Routh

Institute for Translational Sciences, University of Texas Medical Branch (TL1TR001440)

  • Rose M Langsjoen

Fundação de Amparo à Pesquisa do Estado de São Paulo (Grant 2019/27803-2)

  • Rafael RG Machado

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was funded by: a Technology Commercialization Program grant (UTMB) to EJ and ALR; a ‘Data collection grant’ and ‘COVID-19 funding’ from the Institute for Human Infections & Immunity (UTMB) to ALR; NIH grant R21AI151725 from to ALR; CDC contract 200-2021-11195 to ALR and S.W.; NIH Grant R01AI153602 and R00AG049042 to VDM; NIH grant R24 AI120942 “World Reference Center for Emerging Viruses and Arboviruses” to SW; and The Sealy and Smith Foundation for financial support to SW and UTMB. RML was supported by an NRSA Clinical and Translational Science (NCATS) TL-1 Training Core Award (TL1TR001440). We thank Brittany Dennler for graphic design support and figure illustrations.

Senior Editor

  1. K VijayRaghavan, National Centre for Biological Sciences, Tata Institute of Fundamental Research, India

Reviewing Editor

  1. Anurag Agrawal, CSIR Institute of Genomics and Integrative Biology, India

Reviewer

  1. Anurag Agrawal, CSIR Institute of Genomics and Integrative Biology, India

Publication history

  1. Preprint posted: March 11, 2021 (view preprint)
  2. Received: March 17, 2021
  3. Accepted: September 8, 2021
  4. Version of Record published: September 28, 2021 (version 1)

Copyright

© 2021, Jaworski et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 295
    Page views
  • 38
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

  1. Further reading

Further reading

    1. Genetics and Genomics
    Tingting Zhao et al.
    Research Article

    In Saccharomyces cerevisiae, RNA Polymerase II (Pol II) selects transcription start sites (TSS) by a unidirectional scanning process. During scanning, a preinitiation complex (PIC) assembled at an upstream core promoter initiates at select positions within a window ~40-120 basepairs downstream. Several lines of evidence indicate that Ssl2, the yeast homolog of XPB and an essential and conserved subunit of the general transcription factor (GTF) TFIIH, drives scanning through its DNA-dependent ATPase activity, therefore potentially controlling both scanning rate and scanning extent (processivity). To address questions of how Ssl2 functions in promoter scanning and interacts with other initiation activities, we leveraged distinct initiation-sensitive reporters to identify novel ssl2 alleles. These ssl2 alleles, many of which alter residues conserved from yeast to human, confer either upstream or downstream TSS shifts at the model promoter ADH1 and genome-wide. Specifically, tested ssl2 alleles alter TSS selection by increasing or narrowing the distribution of TSSs used at individual promoters. Genetic interactions of ssl2 alleles with other initiation factors are consistent with ssl2 allele classes functioning through increasing or decreasing scanning processivity but not necessarily scanning rate. These alleles underpin a residue interaction network that likely modulates Ssl2 activity and TFIIH function in promoter scanning. We propose that the outcome of promoter scanning is determined by two functional networks, the first being Pol II activity and factors that modulate it to determine initiation efficiency within a scanning window, and the second being Ssl2/TFIIH and factors that modulate scanning processivity to determine the width of the scanning widow.

    1. Evolutionary Biology
    2. Genetics and Genomics
    Gabriela Santos-Rodriguez et al.
    Research Article Updated

    Many primate genes produce circular RNAs (circRNAs). However, the extent of circRNA conservation between closely related species remains unclear. By comparing tissue-specific transcriptomes across over 70 million years of primate evolution, we identify that within 3 million years circRNA expression profiles diverged such that they are more related to species identity than organ type. However, our analysis also revealed a subset of circRNAs with conserved neural expression across tens of millions of years of evolution. By comparing to species-specific circRNAs, we identified that the downstream intron of the conserved circRNAs display a dramatic lengthening during evolution due to the insertion of novel retrotransposons. Our work provides comparative analyses of the mechanisms promoting circRNAs to generate increased transcriptomic complexity in primates.