
Viral dark matter and virus–host interactions resolved from publicly available microbial genomes

  1. Simon Roux
  2. Steven J Hallam
  3. Tanja Woyke
  4. Matthew B Sullivan (corresponding author)
  1. University of Arizona, United States
  2. University of British Columbia, Canada
  3. U.S. Department of Energy Joint Genome Institute, United States
Tools and Resources
Cite this article as: eLife 2015;4:e08490 doi: 10.7554/eLife.08490
8 figures, 1 table and 1 data set

Figures

Figure 1 with 2 supplements
Distribution of viral sequences from the VirSorter curated data set across the bacterial and archaeal phylogeny.

For each bacteria or archaea phylum (or phylum-level group), corresponding viruses in RefSeq (gray) and VirSorter curated data set (red) are indicated with circles proportional to the number of sequences available. Groups for which no viruses were available in RefSeq are highlighted in black.

https://doi.org/10.7554/eLife.08490.003
Figure 1—source data 1

List of data sets mined for viral signal.

Bacterial and archaeal genomes searched with VirSorter for viral sequences originated from NCBI Refseq and WGS, as well as the Microbial Dark Matter data set (MDM, Rinke et al., 2013) and the SUP05 SAGs data set (Roux et al., 2014).

https://doi.org/10.7554/eLife.08490.004
Figure 1—source data 2

New virus–host associations detected in VirSorter sequences.

The star (*) marks the questionable detection of an Inoviridae genome in a Caldiserica SAG, which could originate from another bacterium contaminating MDA reagents (see ‘Materials and methods’).

https://doi.org/10.7554/eLife.08490.005
Figure 1—source data 3

Summary table of VirSorter data set sequences.

All sequences currently identified as plasmids on NCBI that did not display any viral gene in the automatic annotation from NCBI are gathered at the bottom of the table and highlighted in orange. The ‘Detection tag’ column indicates how the sequence was detected as viral by VirSorter: ‘hallmark’ for the presence of viral hallmark gene(s), ‘refseq’ for an enrichment in bacterial and archaeal virus genes, ‘noncaudo’ for an enrichment in non-Caudovirales genes, and ‘vdb’ for an enrichment in virome-like genes.

https://doi.org/10.7554/eLife.08490.006
Figure 1—figure supplement 1
Viral diversity in the VirSorter data set.

The best BLAST hits of predicted proteins along each sequence (i.e., within 75% of the best BLAST hit for this sequence) were used in a Lowest Common Ancestor affiliation (here displayed at the family level). ‘Unclassified Caudovirales’ gathers viruses only affiliated to the Caudovirales level without confident affiliation to the Myo-, Sipho-, or Podoviridae. The number and percentage of sequences affiliated is indicated next to each family.
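The affiliation rule in this legend can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the hit tuples, the lineage lists, and the function names are hypothetical, and "within 75% of the best BLAST hit" is assumed here to mean a bit score of at least 75% of the top bit score.

```python
from typing import List, Sequence, Tuple

# Hypothetical hit record: (bit_score, lineage listed from root to leaf).
Hit = Tuple[float, Sequence[str]]


def lowest_common_ancestor(lineages: List[Sequence[str]]) -> List[str]:
    """Return the longest prefix shared by all root-to-leaf lineages."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


def affiliate(hits: List[Hit], fraction: float = 0.75) -> List[str]:
    """LCA of all hits scoring at least `fraction` of the best bit score.

    Assumption: the 75% window is interpreted as a bit-score cutoff.
    """
    if not hits:
        return []
    best = max(score for score, _ in hits)
    kept = [list(lineage) for score, lineage in hits if score >= fraction * best]
    return lowest_common_ancestor(kept)


example_hits = [
    (200.0, ["Viruses", "Caudovirales", "Siphoviridae"]),
    (180.0, ["Viruses", "Caudovirales", "Myoviridae"]),
    (90.0, ["Viruses", "Microviridae"]),  # below the cutoff, ignored
]
print(affiliate(example_hits))
```

In this toy case the two retained hits disagree at the family level, so the sequence would only be affiliated to the Caudovirales level, which corresponds to the ‘Unclassified Caudovirales’ category.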

https://doi.org/10.7554/eLife.08490.007
Figure 1—figure supplement 2
Genome map comparison (A) and recruitment plot (B) of Bacteroidia virus sequences from a putative new order.

Replication-associated, Relaxase, and hypothetical proteins are depicted in blue, orange, and gray respectively. The recruitment plot includes two viromes from human feces samples from two different studies (Human gut assembly, Minot et al., 2012, and Human feces, Kim et al., 2011). Identity percentage is based on a blastn between virome contigs and the reference genome.

https://doi.org/10.7554/eLife.08490.008
Figure 2 with 2 supplements
Degree of novelty of viruses detected in VirSorter curated data set.

(A) Viral clusters (VCs) are considered as putative new genera when including at least one sequence larger than 30 kb, circular, or known to be a complete genome (from RefSeq). These putative genera were considered as ‘new’ when the VC did not include any RefSeq sequence, and ‘known’ otherwise. (B) The proportion of new VCs (containing no RefSeqABVir), VCs with only one RefSeqABVir sequence, and VCs with more than one RefSeqABVir sequence is displayed for host classes associated with more than 10 viral sequences. Only ‘putative genera’ VCs were considered (i.e., clusters containing a RefSeqABVir genome, a circular sequence, or a sequence with more than 30 predicted genes).
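The decision rule of panel A can be written as a short sketch. The `ViralSequence` record and `classify_cluster` function are hypothetical names introduced for illustration. The size criterion follows the 30 kb wording of panel A; the panel B wording (more than 30 predicted genes) would need a different field.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ViralSequence:
    # Hypothetical minimal description of one cluster member.
    length_bp: int
    circular: bool
    from_refseq: bool  # complete reference genome from RefSeq


def classify_cluster(members: List[ViralSequence], min_length_bp: int = 30_000) -> str:
    """Label a viral cluster as 'new', 'known', or 'not a putative genus'."""
    is_putative_genus = any(
        s.length_bp > min_length_bp or s.circular or s.from_refseq
        for s in members
    )
    if not is_putative_genus:
        return "not a putative genus"
    # A putative genus is 'known' as soon as it contains one RefSeq genome.
    return "known" if any(s.from_refseq for s in members) else "new"


cluster = [
    ViralSequence(length_bp=42_000, circular=False, from_refseq=False),
    ViralSequence(length_bp=12_000, circular=False, from_refseq=False),
]
print(classify_cluster(cluster))
```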

https://doi.org/10.7554/eLife.08490.009
Figure 2—source data 1

Summary table of virus clusters (VCs).

Cluster affiliation is based on the combination of BLAST-based taxonomic affiliation of its members. For VCs with more than 10 proteins, those composed only of VirSorter sequences are highlighted in green and those with only one sequence from RefSeqABVir are marked in blue. Cases where sequences affiliated to both ssDNA and dsDNA viruses are clustered together are highlighted in red. ‘Detection tags’ lists the different detection tags for the cluster members, with ‘NCBI_RefSeq’ for complete genomes from the RefSeq database. These NCBI RefSeq sequences are counted as ‘complete’ in the ‘type of sequences’ column.

https://doi.org/10.7554/eLife.08490.010
Figure 2—figure supplement 1
Structure of viral sequence space sampled in VirSorter data set.

Network of virus clusters (VCs) based on gene content comparison between viral genome sequences from RefSeqABVir and VirSorter data set. VCs including only VirSorter sequences are highlighted with a black outline. The size of nodes is proportional to the number of sequences in the cluster and the color of the node corresponds to the BLAST-based affiliation (at the family level) of its members when consistent (i.e., agreement between >75% of the cluster members, otherwise clusters are indicated as ‘unaffiliated’).
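The consistency rule used to color nodes (agreement between more than 75% of the cluster members) can be sketched as below. The function name is hypothetical, and the sketch assumes every member already carries a family-level label; how members without any affiliation are counted is not stated in the legend.

```python
from collections import Counter
from typing import List


def cluster_affiliation(member_families: List[str], min_agreement: float = 0.75) -> str:
    """Return the majority family if strictly more than `min_agreement` of members agree."""
    if not member_families:
        return "unaffiliated"
    family, count = Counter(member_families).most_common(1)[0]
    if count / len(member_families) > min_agreement:
        return family
    return "unaffiliated"


print(cluster_affiliation(["Siphoviridae"] * 4 + ["Myoviridae"]))
print(cluster_affiliation(["Siphoviridae"] * 3 + ["Myoviridae"]))
```

The second call sits exactly at 75% agreement and is therefore reported as ‘unaffiliated’ under the strict ">75%" reading.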

https://doi.org/10.7554/eLife.08490.011
Figure 2—figure supplement 2
Benchmarks used to determine the best value for inflation and significance thresholds for virus clustering.

For each pair of values (inflation and significance threshold), the genome network was computed and its overall shape evaluated with ICCC (intra-cluster clustering coefficient). The chosen values are highlighted in green in the table and with a star on the associated plot.

https://doi.org/10.7554/eLife.08490.012
Figure 3 with 1 supplement
Extrachromosomal prophages in VirSorter curated data set and improvement in virome affiliation.

(A) The distribution of VirSorter curated data set as ‘integrated’ (i.e., prophages integrated in the host chromosome), ‘extrachromosomal’ (i.e., >30 kb or circular sequences with no microbial genes), or ‘undetermined’ (<30 kb linear with no microbial genes) is indicated for each host class with at least five VirSorter curated data set sequences. The number of sequences associated with each host class is indicated above the histogram. (B) Improvement in the proportion of affiliated genes from viromes with VirSorter data set. Predicted genes from the Pacific Ocean Viromes (Hurwitz and Sullivan, 2013), Tara Ocean Viromes (Brum et al., 2015), and Human Gut Viromes (Minot et al., 2012) were compared to RefSeqVirus (May 2015) and the VirSorter data set (BLASTp, threshold of 50 on bit score and 0.001 on e-value). Predicted proteins affiliated to VirSorter (in blue) did not display any significant similarity to a RefSeq sequence.
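The affiliation logic of panel B can be sketched as follows. The hit tuples and function names are hypothetical, and the thresholds are assumed to mean a bit score of at least 50 together with an e-value of at most 0.001; the legend does not say whether the bounds are inclusive.

```python
from typing import List, Tuple

# Hypothetical BLASTp hit record: (database, bit_score, e_value).
Hit = Tuple[str, float, float]


def is_significant(bit_score: float, e_value: float,
                   min_bits: float = 50.0, max_evalue: float = 1e-3) -> bool:
    """Assumed reading of the thresholds: both conditions must hold."""
    return bit_score >= min_bits and e_value <= max_evalue


def affiliate_protein(hits: List[Hit]) -> str:
    """RefSeq takes priority; VirSorter is reported only without a RefSeq hit."""
    significant = {db for db, bits, evalue in hits if is_significant(bits, evalue)}
    if "RefSeqVirus" in significant:
        return "RefSeqVirus"
    if "VirSorter" in significant:
        return "VirSorter"
    return "unaffiliated"


protein_hits = [
    ("RefSeqVirus", 35.0, 0.5),    # too weak to count
    ("VirSorter", 120.0, 1e-30),   # significant
]
print(affiliate_protein(protein_hits))
```

Giving RefSeq priority mirrors the legend's statement that proteins counted as affiliated to VirSorter had no significant similarity to a RefSeq sequence.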

https://doi.org/10.7554/eLife.08490.013
Figure 3—figure supplement 1
Contig map of a putative new extrachromosomal prophage.

Contig Spirochaetia_gi_359585655 represents a complete genome (the contig was detected as circular) from a new genus (affiliated to a VC with no RefSeqABVir sequence). Functional affiliation of predicted genes is indicated on the map, notably two genes (ParA/ParB) indicative of extrachromosomal prophages, as well as two genes (in orange) affiliated to the ACR_tran efflux pump family, of which some members are involved in antibiotic resistance phenotypes. This contig belongs to the virus cluster VC_61, composed of 35 new putative extrachromosomal prophages from different Spirochetes genomes.

https://doi.org/10.7554/eLife.08490.014
Figure 4
Scale and range of co-infection.

(A) Number of different viral sequences detected by host genome. Numbers are based on the set of microbial genomes with at least one viral sequence detected (5492 genomes). (B) Affiliation of viruses involved in multiple infections of the same host. Affiliations are deduced from best BLAST hits alongside the viral sequences, as in Figure 1. Co-infections involving dsDNA and ssDNA viruses are highlighted in bold.

https://doi.org/10.7554/eLife.08490.015
Figure 5 with 1 supplement
Virus–host network between virus clusters and host classes (matrix visualization).

A cell in the matrix is colored when at least one virus from a virus cluster (VC, rows) was retrieved in a genome from a host class (columns). This virus–host network is detected as significantly modular by lp-Brim (modularity Q = 0.45; the same index computed from 99 randomly permuted matrices ranged from 0.02 to 0.17, with an average of 0.08). The different modules are highlighted in color, with inter-module links in gray. Virus clusters are identified by their number, and their family-level affiliation (based on BLAST-based affiliation of the cluster members) is indicated next to each cluster when available. Virus clusters with inconsistent member affiliations are considered as ‘unclassified’, and affiliations are spread along the x-axis for spacing purposes. Host phylum and class are indicated for each host column, with domains indicated above the corresponding hosts.
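To make the modularity index more concrete, the sketch below computes a bipartite modularity score for a given assignment of rows (virus clusters) and columns (host classes) to modules. It assumes the Barber formulation of bipartite modularity, which is the quantity BRIM-type methods seek to maximize. It does not perform the lp-Brim optimization itself, nor the permutation test, and the toy matrix and function name are hypothetical.

```python
from typing import List, Sequence


def bipartite_modularity(matrix: Sequence[Sequence[int]],
                         row_modules: Sequence[int],
                         col_modules: Sequence[int]) -> float:
    """Q = (1/m) * sum over same-module (i, j) of (B_ij - k_i * d_j / m)."""
    m = sum(sum(row) for row in matrix)  # total number of links
    if m == 0:
        return 0.0
    row_deg: List[int] = [sum(row) for row in matrix]
    col_deg: List[int] = [sum(col) for col in zip(*matrix)]
    q = 0.0
    for i, row in enumerate(matrix):
        for j, cell in enumerate(row):
            if row_modules[i] == col_modules[j]:
                q += cell - row_deg[i] * col_deg[j] / m
    return q / m


# Toy network: two perfectly separated blocks of virus clusters and host classes.
toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(bipartite_modularity(toy, [0, 0, 1, 1], [0, 0, 1, 1]))
```

A significance assessment like the one reported in the legend would repeat the module search on each randomly permuted matrix and compare the resulting distribution of Q with the observed value.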

https://doi.org/10.7554/eLife.08490.016
Figure 5—figure supplement 1
Virus–host network between virus clusters and host classes (network visualization).

An edge is displayed between a virus cluster (VC) and a host class when at least one virus from this cluster was retrieved in a genome from the host class. This network is detected as significantly modular by lp-Brim (modularity Q = 0.45; the same index computed from 99 randomly permuted matrices ranged from 0.02 to 0.17, with an average of 0.08). The different modules are highlighted in color, with inter-module links in gray. VCs are identified by their number, and their family-level affiliation (based on BLAST-based affiliation of the cluster members) is indicated below each cluster when available (VCs with inconsistent member affiliations are considered as ‘unclassified’). Host phylum and class are indicated for each host node, with phyla (when multiple classes from the same phylum are included in the network) and domains indicated above the corresponding host nodes.

https://doi.org/10.7554/eLife.08490.017
Figure 6 with 2 supplements
Adaptation of viral genome composition and codon usage to the host genome.

K–S distances between distributions of virus–host distances and virus–non-host distances for each metric (in color) and different subsets of the viral sequences (all sequences, by type, and by taxonomy). Only families with more than 5 genomes are displayed (although it should be noted that the VirSorter data set includes only 6 Microviridae sequences). The number of sequences in each category is indicated in brackets. Distributions used to compute distances are displayed in Figure 6—figure supplement 1.
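The K–S distance used here is the two-sample Kolmogorov–Smirnov statistic, i.e., the largest vertical gap between the empirical cumulative distributions of the two sets of distances. A minimal standard-library sketch is given below; the function name and the toy distance values are hypothetical and are not taken from the study.

```python
from bisect import bisect_right
from typing import Sequence


def ks_distance(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample K-S statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Both ECDFs are step functions that only change at observed values,
    # so it is enough to evaluate the gap at every observed value.
    for x in a + b:
        ecdf_a = bisect_right(a, x) / len(a)
        ecdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap


virus_host = [1.0, 2.0, 3.0, 4.0]      # toy virus-host distances
virus_non_host = [3.0, 4.0, 5.0, 6.0]  # toy virus-non-host distances
print(ks_distance(virus_host, virus_non_host))
```

A value close to 1 means virus–host distances are almost entirely separated from virus–non-host distances for that metric, while a value close to 0 means the metric carries little host signal.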

https://doi.org/10.7554/eLife.08490.018
Figure 6—figure supplement 1
(A) K–S distances between distributions of virus–host distances and virus–non-host distances for each metric (in color) and different subsets of the viral sequences (based on the number of tRNA genes detected).

The number of sequences in each category is indicated below the number of tRNA. (B) Distribution of k-mer distances between viral and cellular genomes and codon usage adaptation index for host, host genus, host family, and non-host (different order) genomes. For each viral genome, the distance to the host is displayed, as well as 10 randomly taken distances to genomes from each category and different subsets of the viral sequences (by taxonomy on the left column, and by number of tRNA genes on the right column).

https://doi.org/10.7554/eLife.08490.019
Figure 6—figure supplement 2
Distance between k-mer frequency vectors of virus genome subsamples and host genomes for Caudovirales.

Viral genomes (1000) were randomly sub-sampled at different sizes (from 2000 to 20,000 bp). Only Caudovirales genomes were selected for this subsample analysis. For each size of k-mer, the result of a linear regression of distance between host or non-host and viral subsample size is indicated. The same distances for the Microviridae and Inoviridae (taken from Figure 6A) are indicated for comparison, and associated with the size of the reference genome of each group (Enterobacteria phage phiX174 and Enterobacteria phage M13). For clarity's sake, the almost-identical values for 2-mer, 3-mer, and 4-mer for Microviridae are slightly horizontally shifted.

https://doi.org/10.7554/eLife.08490.020
Author response image 1
Improvement in the proportion of affiliated genes from viromes with VirSorter dataset.

Predicted genes from the Pacific Ocean Viromes (Hurwitz and Sullivan, 2013), Tara Ocean Viromes (Brum et al., 2015) and Human Gut Viromes (Minot et al., 2013) were compared to RefSeqVirus (May 2015) and the 12.5k VirSorter dataset (BLASTp, threshold of 50 on bit score and 0.001 on e-value). Predicted proteins affiliated to VirSorter (in blue) did not display any significant similarity to a RefSeq virus, but can now be associated with a phage and a host through the VirSorter database.

https://doi.org/10.7554/eLife.08490.024
Author response image 2
Viral sequences distribution of RefSeq and VirSorter dataset.

For each host group, a circle proportional to the number of viral genomes available is noted in red for RefSeq and blue for VirSorter. Hosts for which no RefSeq references were available are highlighted in bold.

https://doi.org/10.7554/eLife.08490.025

Tables

Table 1

Accuracy of host prediction based on distance (d) between tetranucleotide frequencies of viral and microbial genomes

https://doi.org/10.7554/eLife.08490.021
                                            Host order          Host family         Host genus
Distance (d)                     Predicted  Correct  Ratio (%)  Correct  Ratio (%)  Correct  Ratio (%)
All reference sequences
 d < 4 × 10^-04                         98       97      98.98       97      98.98       97      98.98
 4 × 10^-04 ≤ d < 1 × 10^-03        10,173     9361      92.02     8971      88.18     5261      51.72
 1 × 10^-03 ≤ d                       2508     1872      74.64     1757      70.06      917      36.56
Host species excluded
 d < 4 × 10^-04                         21       20      95.24       20      95.24       20      95.24
 4 × 10^-04 ≤ d < 1 × 10^-03        10,003     9067      90.64     8372      83.69     2992      29.91
 1 × 10^-03 ≤ d                       2755     1981      71.91     1840      66.79      818      29.69
Host genus excluded
 d < 4 × 10^-04                          1        0       0.00        0       0.00        0       0.00
 4 × 10^-04 ≤ d < 1 × 10^-03          9085     7303      80.39     6181      68.04        0       0.00
 1 × 10^-03 ≤ d                       3693     1768      47.87     1388      37.58        0       0.00
  1. For each viral genome, the order, family, and genus of its host were predicted from the taxonomy of the closest microbial genome (based on the mean absolute difference between tetranucleotide frequency vectors) and compared to the order, family, and genus of the actual host (i.e., the taxonomy of the genome with which the virus was identified). These predictions were computed (i) with all microbial genomes, (ii) excluding all genomes from the host species, and (iii) excluding all genomes from the host genus. Cases with a prediction accuracy over 75% are highlighted in gray.
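
The distance-based prediction described in this note can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: tetranucleotides are counted on a single strand with overlapping windows, windows containing characters other than A, C, G, or T are skipped, and all function names and toy sequences are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 words


def tetranucleotide_frequencies(sequence: str) -> List[float]:
    """Relative frequency of each tetranucleotide (overlapping, one strand)."""
    sequence = sequence.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    total = 0
    for i in range(len(sequence) - 3):
        word = sequence[i:i + 4]
        if word in counts:  # skips windows with ambiguous bases
            counts[word] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence has no valid tetranucleotide")
    return [counts[word] / total for word in TETRAMERS]


def mean_absolute_difference(u: List[float], v: List[float]) -> float:
    """Distance d between two frequency vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(u, v)) / len(u)


def predict_host(virus: str, genomes: Dict[str, str]) -> Tuple[str, float]:
    """Return the name of the closest microbial genome and its distance d."""
    virus_freq = tetranucleotide_frequencies(virus)
    distances = {
        name: mean_absolute_difference(virus_freq, tetranucleotide_frequencies(seq))
        for name, seq in genomes.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


toy_genomes = {
    "host_A": "ACGTACGTACGTACGTACGT",
    "host_B": "GGGGCCCCGGGGCCCCGGGG",
}
closest, d = predict_host("ACGTACGTACGTACGT", toy_genomes)
print(closest, d)
```

The taxonomy of the closest genome would then be taken as the predicted host order, family, and genus, and the value of d would place the prediction in one of the three distance bins of the table.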

Data availability

The following data sets were generated
