Targeted genomic sequencing with probe capture for discovery and surveillance of coronaviruses in bats

  1. Kevin S Kuchinski
  2. Kara D Loos
  3. Danae M Suchan
  4. Jennifer N Russell
  5. Ashton N Sies
  6. Charles Kumakamba
  7. Francisca Muyembe
  8. Placide Mbala Kingebeni
  9. Ipos Ngay Lukusa
  10. Frida N’Kawa
  11. Joseph Atibu Losoma
  12. Maria Makuwa
  13. Amethyst Gillis
  14. Matthew LeBreton
  15. James A Ayukekbong
  16. Nicole A Lerminiaux
  17. Corina Monagin
  18. Damien O Joly
  19. Karen Saylors
  20. Nathan D Wolfe
  21. Edward M Rubin
  22. Jean J Muyembe Tamfum
  23. Natalie A Prystajecky
  24. David J McIver
  25. Christian E Lange
  26. Andrew DS Cameron  Is a corresponding author
  1. Department of Pathology and Laboratory Medicine, University of British Columbia, Canada
  2. Public Health Laboratory, British Columbia Centre for Disease Control, Canada
  3. Department of Biology, Faculty of Science, University of Regina, Canada
  4. Institute for Microbial Systems and Society, Faculty of Science, University of Regina, Canada
  5. Metabiota Inc, Democratic Republic of the Congo
  6. Institut National de Recherche Biomédicale, Democratic Republic of the Congo
  7. Labyrinth Global Health Inc, United States
  8. Metabiota Inc, United States
  9. Development Alternatives, United States
  10. Mosaic, Cameroon
  11. Metabiota, Canada
  12. Southbridge Care, Canada
  13. One Health Institute, School of Veterinary Medicine, University of California, Davis, United States
  14. Nyati Health Consulting, Canada
  15. Institute for Global Health Sciences, University of California, San Francisco, United States
9 figures, 5 tables and 3 additional files

Figures

Custom hybridization probe panel provided broadly inclusive coverage of known bat coronavirus diversity in silico.

Bat coronavirus (CoV) sequences were obtained by downloading all available alphacoronavirus, betacoronavirus, and unclassified coronaviridae and coronavirinae sequences from GenBank on 4 October 2020 and searching for bat-related keywords in sequence headers. A custom panel of 20,000 probes was designed to target these sequences using the makeprobes module in the ProbeTools package. The ProbeTools capture and stats modules were used to assess probe coverage of bat CoV reference sequences. (A) Each bat CoV sequence is represented as a dot plotted according to its probe coverage, that is, the percentage of its nucleotide positions covered by at least one probe in the custom panel. (B) The same analysis was performed on the subset of sequences representing full-length genomes (>25 kb in length).

De novo assembly of probe captured libraries yielded more genome sequence than standard amplicon sequencing methods for most specimens.

Reads from probe captured libraries were assembled de novo with coronaSPAdes, and coronavirus contigs were identified by local alignment against a database of all coronaviridae sequences in GenBank. (A) The size distribution of contigs from all libraries is shown. Dots are coloured to indicate whether the length of the contig exceeded partial RNA-dependent RNA polymerase (RdRP) gene amplicons previously sequenced from these specimens. (B) Total assembly size and assembly N50 distributions for all libraries. (C) Each contig is represented as a dot plotted according to its length. Assembly N50 sizes and total assembly sizes are indicated by the height of their bars.

Figure 3 with 4 supplements
Coverage of reference sequences by probe captured libraries was used to assess extent and location of recovery.

Reference sequences were chosen for each previously identified phylogenetic group (indicated in panel titles). Coverage of these reference sequences was determined by mapping reads and aligning contigs from probe captured libraries. Dark grey profiles show depth of read coverage along reference sequences. Blue shading indicates spans where contigs aligned. The locations of spike and RNA-dependent RNA polymerase (RdRP) genes are indicated in each reference sequence and shaded light grey. This figure shows the six libraries with the most extensive reference sequence coverage. Similar plots are provided as figure supplements for all libraries where any coronavirus sequence was recovered (Figure 3—figure supplements 14) .

Figure 3—figure supplement 1
Coverage of reference sequence by probe captured libraries for specimens from phylogenetic group Q-Alpha-4.

Coverage of reference sequence was determined by mapping reads and aligning contigs from probe captured libraries. Dark grey profiles show depth of read coverage along reference sequence. Blue shading indicates spans where contigs aligned. The locations of spike and RNA-dependent RNA polymerase (RdRP) genes are indicated and shaded light grey.

Figure 3—figure supplement 2
Coverage of reference sequence by probe captured libraries for specimens from phylogenetic group W-Beta-2.

Coverage of reference sequence was determined by mapping reads and aligning contigs from probe captured libraries. Dark grey profiles show depth of read coverage along reference sequence. Blue shading indicates spans where contigs aligned. The locations of spike and RNA-dependent RNA polymerase (RdRP) genes are indicated and shaded light grey.

Figure 3—figure supplement 3
Coverage of reference sequence by probe captured libraries for specimens from phylogenetic group W-Beta-3.

Coverage of reference sequence was determined by mapping reads and aligning contigs from probe captured libraries. Dark grey profiles show depth of read coverage along reference sequence. Blue shading indicates spans where contigs aligned. The locations of spike and RNA-dependent RNA polymerase (RdRP) genes are indicated and shaded light grey.

Figure 3—figure supplement 4
Coverage of reference sequence by probe captured libraries for specimens from phylogenetic group W-Beta-4.

Coverage of reference sequence was determined by mapping reads and aligning contigs from probe captured libraries. Dark grey profiles show depth of read coverage along reference sequence. Blue shading indicates spans where contigs aligned. The locations of spike gene are indicated and shaded light grey. Ambiguous bases (Ns) are shaded orange.

Probe captured libraries provided more extensive coverage of reference genomes than standard amplicon sequencing protocols for most specimens.

Reference sequences were selected for the previously identified phylogenetic groups to which these specimens had been assigned by Kumakamba et al., 2021. (A) Coverage of these reference sequences was determined by mapping reads and aligning contigs from probe captured libraries. Each library is represented as a dot, and dots are coloured according to whether reference sequence coverage exceeded the length of the partial RNA-dependent RNA polymerase (RdRP) gene sequence that had been previously generated by amplicon sequencing. (B) The number of reference sequence positions covered by probe captured libraries was divided by the length of the partial RdRP amplicon sequences from these specimens. This provided the fold-difference in recovery between probe capture and standard amplicon sequencing methods. (C) Percent coverage of the spike and RdRP genes were calculated for each specimen.

Recovery of coronavirus (CoV) genomic material was limited in vitro by method sensitivity.

(A) Sensitivity was assessed by evaluating recovery of partial RNA-dependent RNA polymerase (RdRp) gene regions that had been previously sequenced in these specimens by amplicon sequencing. Probe coverage of partial RdRp sequences was assessed in silico to exclude insufficient probe design as an alternate explanation for incomplete recovery of these targets. (B) Input RNA concentration, RNA integrity numbers (RINs), and CoV genome abundance were measured for each specimen. The impact of these specimen characteristics on recovery by probe capture (as measured by reference sequence coverage) was assessed using Spearman’s rank correlation (test results stated in plots). An outlier was omitted from this analysis: RNA concentration for specimen CDAB0160R was recorded as 190 ng/μl, a value 4.7 SDs from the mean of the distribution.

In silico assessment of probe panel coverage for reference genomes.

Reference sequences were chosen for each previously identified phylogenetic group (indicated in panel titles). Blue profiles show the number of probes covering each nucleotide position along the reference sequence. Probe coverage, that is, the percentage of nucleotide positions covered by at least one probe, is stated in panel titles. Ambiguity nucleotides (Ns) are shaded in orange, and these positions were excluded from the probe coverage calculations. The locations of spike and RNA-dependent RNA polymerase (RdRP) genes are indicated in each reference sequence (where available) and shaded grey.

Coronavirus (CoV) genomic material was low abundance in swab specimens but effectively enriched by probe capture.

(A) Reads from uncaptured, deep metagenomic sequenced libraries were mapped to complete genomes recovered from these specimens to assess abundance of CoV genomic material. On-target rate was calculated as the percentage of total reads mapping that mapped to the CoV genome sequence. (B) Reads from probe captured libraries were also mapped to assess enrichment and removal of background material. Most libraries used for probe capture (-PRE and -TRI) had insufficient volume remaining for deep metagenomic sequencing, so new libraries were prepared (-DEEP) from the same specimens.

Phylogenetic tree of translated spike gene sequences from alphacoronaviruses.

Spike sequences are coloured according to whether they were from study specimens (blue), human CoVs (red), RefSeq (black), or GenBank (grey). Only the 25 closest-matching spike sequences from GenBank were included, as determined by blastp bitscores. GenBank and RefSeq accession numbers are provided in parentheses. The scale bar measures amino acid substitutions per site.

Phylogenetic tree of translated spike gene sequences from betacoronaviruses.

Spike sequences are coloured according to whether they were from study specimens (blue), human coronaviruses (CoVs) (red), RefSeq (black), or GenBank (grey). Only the 25 closest-matching spike sequences from GenBank were included, as determined by blastp bitscores. GenBank and RefSeq accession numbers are provided in parentheses. The scale bar measures amino acid substitutions per site.

Tables

Table 1
Bat specimens and sequencing libraries analysed in this study.

Rectal and oral swabs collected for a previous study were used to evaluate hybridization probe capture (Kumakamba et al., 2021). For 19 swabs, archived RNA extracted during the previous study was assayed. For 6 swabs, freshly extracted RNA (using a conventional Trizol method) was assayed. As part of the previous study, Kumakamba et al. generated partial sequences from the RNA-dependent RNA polymerase gene, which were used to assign alpha- and betacoronaviruses in these specimens to four novel phylogenetic groups.

Specimen IDLibrary IDHostSwab typeRNA extraction methodPhylogenetic group
CDAB0017RSVCDAB0017RSV-PREMicropteropus pusillusRectalPreviously extractedW-Beta-2
CDAB0040RCDAB0040R-PREMyonycteris sp.RectalPreviously extractedW-Beta-2
CDAB0040RSVCDAB0040RSV-PREMyonycteris sp.RectalPreviously extractedW-Beta-2
CDAB0305RCDAB0305R-PREMicropteropus pusillusRectalPreviously extractedW-Beta-2
CDAB0146RCDAB0146R-PREEidolon helvumRectalPreviously extractedW-Beta-3
CDAB0158RCDAB0158R-PREEidolon helvumRectalPreviously extractedW-Beta-3
CDAB0160RCDAB0160R-PREEidolon helvumRectalPreviously extractedW-Beta-3
CDAB0173RCDAB0173R-PREEidolon helvumRectalPreviously extractedW-Beta-3
CDAB0174RCDAB0174R-PREEidolon helvumRectalPreviously extractedW-Beta-3
CDAB0203RCDAB0203R-PREEidolon helvumRectalPreviously extractedW-Beta-3
CDAB0212RCDAB0212R-PREEidolon helvumRectalPreviously extractedW-Beta-3
CDAB0217RCDAB0217R-PREEidolon helvumRectalPreviously extractedW-Beta-3
CDAB0113RSVCDAB0113RSV-PREHipposideros cf. ruberRectalPreviously extractedW-Beta-4
CDAB0486RCDAB0486R-PREChaerephon sp.RectalPreviously extractedQ-Alpha-4
CDAB0488RCDAB0488R-PREMops condylurusRectalPreviously extractedQ-Alpha-4
CDAB0488RCDAB0488R-TRIMops condylurusRectalTrizol re-extractionQ-Alpha-4
CDAB0491RCDAB0491R-PREMops condylurusRectalPreviously extractedQ-Alpha-4
CDAB0491RCDAB0491R-TRIMops condylurusRectalTrizol re-extractionQ-Alpha-4
CDAB0492RCDAB0492R-PREMops condylurusRectalPreviously extractedQ-Alpha-4
CDAB0492RCDAB0492R-TRIMops condylurusRectalTrizol re-extractionQ-Alpha-4
CDAB0494OCDAB0494O-TRIMops condylurusOralTrizol re-extractionQ-Alpha-4
CDAB0494RCDAB0494R-PREMops condylurusRectalPreviously extractedQ-Alpha-4
CDAB0494RCDAB0494R-TRIMops condylurusRectalTrizol re-extractionQ-Alpha-4
CDAB0495OCDAB0495O-PREMops condylurusOralPreviously extractedQ-Alpha-4
CDAB0495RCDAB0495R-TRIMops condylurusRectalTrizol re-extractionQ-Alpha-4
Table 2
Sequencing metrics for probe captured libraries.

Total reads and sequencing output were measured for each library. Raw metrics describe unprocessed FASTQ files directly from the sequencer. Valid metrics describe FASTQ files following pre-processing to trim adapters, trim trailing low-quality bases, remove index hops, and remove PCR chimeras. On-target metrics were estimated by mapping valid data to the coronavirus reference sequence selected for each specimen and to the contigs assembled from each specimen.

Library IDRaw reads(#)Raw output (kb)Valid reads(#)Valid output(kb)Mapped reads(#)Mapped size(kb)
CDAB0017RSV-PRE115,28014,225.847,609736236,7164919
CDAB0040R-PRE37,9504708.9695121.100
CDAB0040RSV-PRE373,25445,333.7176,78326,783.348,1367068.4
CDAB0113RSV-PRE31,394426116,861287516,3022772.2
CDAB0146R-PRE11,870152019323.718622.6
CDAB0158R-PRE48,0146422.54189706.91548239.8
CDAB0160R-PRE83,5249948.52513376.61363191.2
CDAB0173R-PRE10,628140320634.420333.7
CDAB0174R-PRE900,578118,821107,97917,525.482,67913,290.9
CDAB0203R-PRE6,832,218849,284.91,186,186188,188.9456,47468,384.7
CDAB0212R-PRE60,83875264158681.44152678.3
CDAB0217R-PRE20,078,1422,617,427.38,946,9351,467,955.35,173,44881,5381.3
CDAB0305R-PRE27,0543182.73971594.91787250
CDAB0486R-PRE442,4565,8326.756,838937720,6873385.1
CDAB0488R-PRE188,29424,6792913506.22867493.1
CDAB0488R-TRI343,91645,867.184151381.282251346.2
CDAB0491R-PRE791,12096,136.346,5097081.845,2896854.6
CDAB0491R-TRI1,561,144204,995.4173,53329,280.5157,88926,421.2
CDAB0492R-PRE3,453,456448,217.9277,17648,023.8185,66531,641.3
CDAB0492R-TRI4,200,520518,837.7139,80420,074.793,44213,294.3
CDAB0494O-TRI141,49418,980.929049.76011.3
CDAB0494R-PRE82,36011,162.722400
CDAB0494R-TRI95,92412,762.192.100
CDAB0495O-PRE27,07437760000
CDAB0495R-TRI470,85063,26788961440.986721399.2
Table 3
Alignments between translated spike sequences from study specimens and phylogenetically proximate entries from GenBank and RefSeq.

Alignments were conducted with blastp. Reference sequence host and collection location were obtained from GenBank entry summaries.

SpecimenSpecimen hostReference sequence GenBank accession numberReference sequence hostReference sequence collection locationAlignment query coverage(%)Alignment identity(%)Alignment positivity(%)
CDAB0492RMops condylurusHQ728486.1Chaerephon sp.Kenya10071.280.1
CDAB0492RMops condylurusMZ081383.1Chaerephon plicatusYunnan, China10065.877.5
CDAB0017RSVMicropteropus pusillusHQ728482.1Eidolon helvumKenya9976.585.7
CDAB0017RSVMicropteropus pusillusMG693168.1Eidolon helvumCameroon9963.777.7
CDAB0040RSVMyonycteris sp.HQ728482.1Eidolon helvumKenya9975.984.7
CDAB0040RSVMyonycteris sp.MG693168.1Eidolon helvumCameroon9964.477.7
CDAB0203REidolon helvumHQ728482.1Eidolon helvumKenya10073.785.3
CDAB0203REidolon helvumMG693168.1Eidolon helvumCameroon10065.678.8
CDAB0217REidolon helvumHQ728482.1Eidolon helvumKenya10073.585.1
CDAB0217REidolon helvumMG693168.1Eidolon helvumCameroon10065.279.0
Table 4
Nucleotide alignments between novel spike genes from study specimens and phylogenetically related sequences from GenBank and RefSeq.

Alignments were conducted with blastn. Discontinuous alignments are represented as multiple lines in the table, for example, CDAB0217R vs. MG693168.1.

SpecimenReference sequence GenBank accession numberAlignment query coverage(%)Alignment identity(%)
CDAB0492RHQ728486.16081.0
CDAB0492RMZ081383.11871.5
CDAB0040RSVHQ728482.18375.4
CDAB0203RHQ728482.17875.5
CDAB0203RMG693168.14576.6
CDAB0217RHQ728482.17176.0
CDAB0217RMG693168.14775.7
CDAB0217RMG693168.14784.6
Table 5
RT-qPCR primer sequences.
SpecimenPrimersPrimer sequencesStandard curve R2
CDAB0146R-PRE
CDAB0158R-PRE
CDAB0160R-PRE
CDAB0173R-PRE
CDAB0174R-PRE
CDAB0203R-PRE
CDAB0212R-PRE
CDAB0217R-PRE
Beta-3_rdrp_FWD
Beta-3_rdrp_REV
ATA TAT GTC AGG CCG TTA GTG C
CCA TAT AGA GGC GAT GTT GC
0.995
CDAB0486R-PRE
CDAB0488R-PRE
CDAB0488R-TRI
CDAB0491R-PRE
CDAB0491R-TRI
CDAB0492R-PRE
CDAB0492R-TRI
CDAB0494O-TRI-PRE
CDAB0494R-PRE
CDAB0494R-TRI
CDAB0495O-PRE
CDAB0495R-TRI
Alpha_4_rdrp_FWD
Alpha_4_rdrp_REV
GCG ACT ACC TGG TAA ACC TAT C
CTT TGC CGC ACT CAC AAA C
0.989
CDAB0017R-PRE
CDAB0040R-PRE
CDAB0040RSV-PRE
Beta-2_rdrp_FWD
Beta-2_rdrp_REV
CAC TAC TTG TAC CAC CAG GTT T
TTG TAG TGG TTC TGA TCG GTT T
0.998
CDAB0305R-PRED0305_rdrp_FWD
D0305_rdrp_REV
GAC GGC AAT AAG GTG CAT AAC
AGT CAG AAA CCA AGT CCT CAT C
0.999
CDAB0113RSV-PRED0113_rdrp_FWD D0113_rdrp_REVGTA CGT TGA GTG AGC GGT ATT
GAT GAA GTT CCA CCT GGC TTA
0.998

Additional files

Supplementary file 1

Probe sequences, fasta format.

https://cdn.elifesciences.org/articles/79777/elife-79777-supp1-v1.txt
Supplementary file 2

gBlock sequences, fasta format.

Two gBlocks were generated for this study. ‘Gblck_Beta2_17_40_B3_Alpha’ is a composite of RdRp nucleotide sequences from W-Beta-2 (samples CDAB0017 and CDAB0040), W-Beta-3, and Q-Alpha-4 (Table 1). ‘Gblck_Beta2_0305_B4_Alpha’ is a composite of RdRp nucleotide sequences from W-Beta-2 (samples CDAB0305), W-Beta-4, and Q-Alpha-4. The Q-Alpha-4 sequence is identical in both gBlocks.

https://cdn.elifesciences.org/articles/79777/elife-79777-supp2-v1.txt
MDAR checklist
https://cdn.elifesciences.org/articles/79777/elife-79777-mdarchecklist1-v1.docx

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Kevin S Kuchinski
  2. Kara D Loos
  3. Danae M Suchan
  4. Jennifer N Russell
  5. Ashton N Sies
  6. Charles Kumakamba
  7. Francisca Muyembe
  8. Placide Mbala Kingebeni
  9. Ipos Ngay Lukusa
  10. Frida N’Kawa
  11. Joseph Atibu Losoma
  12. Maria Makuwa
  13. Amethyst Gillis
  14. Matthew LeBreton
  15. James A Ayukekbong
  16. Nicole A Lerminiaux
  17. Corina Monagin
  18. Damien O Joly
  19. Karen Saylors
  20. Nathan D Wolfe
  21. Edward M Rubin
  22. Jean J Muyembe Tamfum
  23. Natalie A Prystajecky
  24. David J McIver
  25. Christian E Lange
  26. Andrew DS Cameron
(2022)
Targeted genomic sequencing with probe capture for discovery and surveillance of coronaviruses in bats
eLife 11:e79777.
https://doi.org/10.7554/eLife.79777