Figures and data in Viral dark matter and virus–host interactions resolved from publicly available microbial genomes

8 figures and 1 table

Figures

Figure 1 with 2 supplements

Distribution of viral sequences from the VirSorter curated data set across the bacterial and archaeal phylogeny.

For each bacteria or archaea phylum (or phylum-level group), corresponding viruses in RefSeq (gray) and VirSorter curated data set (red) are indicated with circles proportional to the number of sequences available. Groups for which no viruses were available in RefSeq are highlighted in black.

https://doi.org/10.7554/eLife.08490.003

Figure 1—source data 1 List of data sets mined for viral signal. Bacterial and archaeal genomes searched with VirSorter for viral sequences originated from NCBI RefSeq and WGS, as well as the Microbial Dark Matter data set (MDM, Rinke et al., 2013) and the SUP05 SAGs data set (Roux et al., 2014). https://doi.org/10.7554/eLife.08490.004
Download elife-08490-fig1-data1-v3.xls
Figure 1—source data 2 New virus–host associations detected in VirSorter sequences. The star (*) marks the questionable detection of an Inoviridae genome in a Caldiserica SAG, which could originate from another bacterium contaminating MDA reagents (see ‘Materials and methods’). https://doi.org/10.7554/eLife.08490.005
Download elife-08490-fig1-data2-v3.xls
Figure 1—source data 3 Summary table of VirSorter data set sequences. All sequences currently identified as plasmids on NCBI and which did not display any viral gene in the automatic annotation from NCBI are gathered at the bottom of the table and highlighted in orange. The ‘Detection tag’ column indicates how the sequence was detected as viral by VirSorter: ‘hallmark’ for the presence of viral hallmark gene(s), ‘refseq’ for an enrichment in bacterial and archaeal virus genes, ‘noncaudo’ for an enrichment in non-Caudovirales genes, and ‘vdb’ for an enrichment in virome-like genes. https://doi.org/10.7554/eLife.08490.006
Download elife-08490-fig1-data3-v3.xls

Figure 1—figure supplement 1

Viral diversity in the VirSorter data set.

The best BLAST hits of predicted proteins along each sequence (i.e., within 75% of the best BLAST hit for this sequence) were used in a Lowest Common Ancestor affiliation (here displayed at the family level). ‘Unclassified *Caudovirales*’ gathers viruses only affiliated to the *Caudovirales* level without confident affiliation to the *Myo*-, *Sipho*-, or *Podoviridae*. The number and percentage of sequences affiliated are indicated next to each family.

https://doi.org/10.7554/eLife.08490.007
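
The Lowest Common Ancestor affiliation described in the caption above can be illustrated with a short Python sketch. It is not the authors' code: the hit layout, the function name `lca_affiliation`, and the reading of "within 75% of the best BLAST hit" as "bit score at least 0.75 times the best bit score" are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Each hit is (bit_score, lineage), with the lineage ordered from the
# broadest rank to the narrowest, e.g. ("Caudovirales", "Siphoviridae").
Hit = Tuple[float, Sequence[str]]


def lca_affiliation(hits: List[Hit], fraction: float = 0.75) -> Tuple[str, ...]:
    """Return the lineage prefix shared by all near-best hits.

    Assumption: a hit is "near-best" when its bit score is at least
    `fraction` times the best bit score observed for the sequence.
    """
    if not hits:
        return ()
    best = max(score for score, _ in hits)
    kept = [lineage for score, lineage in hits if score >= fraction * best]
    shared = []
    # Walk down the ranks until the retained lineages disagree.
    for ranks in zip(*kept):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return tuple(shared)


hits = [
    (200.0, ("Caudovirales", "Siphoviridae")),
    (180.0, ("Caudovirales", "Myoviridae")),
    (90.0, ("Microviridae",)),  # below 75% of the best score, ignored
]
print(lca_affiliation(hits))
```

In this toy input the two retained hits agree only at the order level, which corresponds to the ‘Unclassified *Caudovirales*’ case mentioned in the caption.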

Figure 1—figure supplement 2

Genome map comparison (A) and recruitment plot (B) of *Bacteroidia* virus sequences from a putative new order.

Replication-associated, Relaxase, and hypothetical proteins are depicted in blue, orange, and gray respectively. The recruitment plot includes two viromes from human feces samples from two different studies (Human gut assembly, Minot et al., 2012, and Human feces, Kim et al., 2011). Identity percentage is based on a blastn between virome contigs and the reference genome.

https://doi.org/10.7554/eLife.08490.008

Figure 2 with 2 supplements

Degree of novelty of viruses detected in VirSorter curated data set.

(A) Viral clusters (VCs) are considered as putative new genera when including at least one sequence larger than 30 kb, circular, or known to be a complete genome (from RefSeq). These putative genera were considered as ‘new’ when the VC did not include any RefSeq sequence, and ‘known’ otherwise. (B) The proportion of new VCs (containing no RefSeqABVir), VCs with only one RefSeqABVir sequence, and VCs with more than one RefSeqABVir sequence is displayed for host classes associated with more than 10 viral sequences. Only ‘putative genera’ VCs were considered (i.e., clusters containing a RefSeqABVir genome, a circular sequence, or a sequence with more than 30 predicted genes).

https://doi.org/10.7554/eLife.08490.009

Figure 2—source data 1 Summary table of virus clusters (VCs). Cluster affiliation is based on the combination of BLAST-based taxonomic affiliation of its members. For VCs with more than 10 proteins, those composed only of VirSorter sequences are highlighted in green and those with only one sequence from RefSeqABVir are marked in blue. Cases where sequences affiliated to both ssDNA and dsDNA viruses are clustered together are highlighted in red. ‘Detection tags’ lists the different detection tags for the cluster members, with ‘NCBI_RefSeq’ for complete genomes from the RefSeq database. These NCBI RefSeq sequences are counted as ‘complete’ in the ‘type of sequences’ column. https://doi.org/10.7554/eLife.08490.010
Download elife-08490-fig2-data1-v3.xls
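
The rule in panel (A) of Figure 2 for labelling a viral cluster as a ‘new’ or ‘known’ putative genus can be written as a small decision function. The sketch below is only an illustration under stated assumptions: the `ViralSequence` record and the `classify_cluster` name are hypothetical, and the 30 kb criterion from panel (A) is used rather than the gene-count wording of panel (B).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ViralSequence:
    length_bp: int
    circular: bool
    from_refseq: bool  # RefSeq genomes are treated as complete


def classify_cluster(
    members: List[ViralSequence], min_length_bp: int = 30_000
) -> Optional[str]:
    """Return 'new', 'known', or None when the VC is not a putative genus."""
    is_putative_genus = any(
        s.length_bp > min_length_bp or s.circular or s.from_refseq
        for s in members
    )
    if not is_putative_genus:
        return None
    # A putative genus is 'known' as soon as one member comes from RefSeq.
    has_refseq = any(s.from_refseq for s in members)
    return "known" if has_refseq else "new"


vc = [ViralSequence(42_000, False, False), ViralSequence(12_000, False, False)]
print(classify_cluster(vc))
```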

Figure 2—figure supplement 1

Structure of viral sequence space sampled in VirSorter data set.

Network of virus clusters (VCs) based on gene content comparison between viral genome sequences from RefSeqABVir and VirSorter data set. VCs including only VirSorter sequences are highlighted with a black outline. The size of nodes is proportional to the number of sequences in the cluster and the color of the node corresponds to the BLAST-based affiliation (at the family level) of its members when consistent (i.e., agreement between >75% of the cluster members, otherwise clusters are indicated as ‘unaffiliated’).

https://doi.org/10.7554/eLife.08490.011
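
The consensus rule used to color the nodes (a family is assigned only when more than 75% of the cluster members agree) can be sketched as follows. The function name `cluster_affiliation` and the input format (one family label per member) are assumptions; how members without any BLAST-based affiliation were counted is not stated in the caption and is not modelled here.

```python
from collections import Counter
from typing import List


def cluster_affiliation(member_families: List[str], min_agreement: float = 0.75) -> str:
    """Return the majority family if it exceeds `min_agreement`, else 'unaffiliated'."""
    if not member_families:
        return "unaffiliated"
    family, count = Counter(member_families).most_common(1)[0]
    # The caption requires agreement between strictly more than 75% of members.
    if count / len(member_families) > min_agreement:
        return family
    return "unaffiliated"


members = ["Siphoviridae"] * 4 + ["Myoviridae"]
print(cluster_affiliation(members))
```

With four of five members in agreement (80%), the toy cluster is assigned to the majority family; with exactly three of four (75%), it would be left ‘unaffiliated’.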

Figure 2—figure supplement 2

Benchmarks used to determine the best value for inflation and significance thresholds for virus clustering.

For each pair of values (inflation and significance threshold), the genome network was computed and its overall shape evaluated with ICCC (intra-cluster clustering coefficient). The chosen values are highlighted in green in the table and with a star on the associated plot.

https://doi.org/10.7554/eLife.08490.012

Figure 3 with 1 supplement

Extrachromosomal prophages in VirSorter curated data set and improvement in virome affiliation.

(A) The distribution of VirSorter curated data set sequences as ‘integrated’ (i.e., prophages integrated in the host chromosome), ‘extrachromosomal’ (i.e., >30 kb or circular sequences with no microbial genes), or ‘undetermined’ (<30 kb linear with no microbial genes) is indicated for each host class with at least five VirSorter curated data set sequences. The number of sequences associated with each host class is indicated above the histogram. (B) Improvement in the proportion of affiliated genes from viromes with VirSorter data set. Predicted genes from the Pacific Ocean Viromes (Hurwitz and Sullivan, 2013), Tara Ocean Viromes (Brum et al., 2015), and Human Gut Viromes (Minot et al., 2012) were compared to RefSeqVirus (May 2015) and the VirSorter data set (BLASTp, threshold of 50 on bit score and 0.001 on e-value). Predicted proteins affiliated to VirSorter (in blue) did not display any significant similarity to a RefSeq sequence.

https://doi.org/10.7554/eLife.08490.013
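
The two rules stated in the Figure 3 caption, the three-way sequence typing of panel (A) and the BLASTp significance cut-offs of panel (B), are simple enough to express as code. The sketch below is illustrative only: the function names are hypothetical, a sequence carrying microbial genes is assumed to be ‘integrated’, a linear sequence of exactly 30 kb is assumed to be ‘undetermined’, and both BLASTp thresholds are assumed to be inclusive.

```python
def classify_sequence(length_bp: int, circular: bool, has_microbial_genes: bool) -> str:
    """Assign a viral sequence to one of the three categories of panel (A)."""
    if has_microbial_genes:
        # Assumption: flanking microbial genes indicate an integrated prophage.
        return "integrated"
    if circular or length_bp > 30_000:
        return "extrachromosomal"
    return "undetermined"


def is_significant_hit(bit_score: float, evalue: float) -> bool:
    """Apply the BLASTp thresholds of panel (B): bit score 50, e-value 0.001."""
    return bit_score >= 50 and evalue <= 0.001


print(classify_sequence(45_000, False, False))
print(is_significant_hit(62.0, 1e-5))
```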

Figure 3—figure supplement 1

Contig map of a putative new extrachromosomal prophage.

Contig Spirochaetia_gi_359585655 represents a complete genome (the contig was detected as circular) from a new genus (affiliated to a VC with no RefSeqABVir sequence). Functional affiliation of predicted genes is indicated on the map, with notably two genes (ParA/ParB) indicative of extrachromosomal prophages, as well as two genes (in orange) affiliated to the ACR_tran efflux pump family, of which some members are involved in antibiotic resistance phenotypes. This contig belongs to the virus cluster VC_61, composed of 35 new putative extrachromosomal prophages from different Spirochetes genomes.

https://doi.org/10.7554/eLife.08490.014

Figure 4

Scale and range of co-infection.

(A) Number of different viral sequences detected by host genome. Numbers are based on the set of microbial genomes with at least one viral sequence detected (5492 genomes). (B) Affiliation of viruses involved in multiple infections of the same host. Affiliations are deduced from best BLAST hits alongside the viral sequences, as in Figure 1. Co-infections involving dsDNA and ssDNA viruses are highlighted in bold.

https://doi.org/10.7554/eLife.08490.015

Figure 5 with 1 supplement

Virus–host network between virus clusters and host classes (matrix visualization).

A cell in the matrix is colored when at least one virus from a virus cluster (VC, rows) was retrieved in a genome from a host class (columns). This virus–host network is detected as significantly modular by lp-Brim (modularity Q = 0.45; the same index computed from 99 randomly permuted matrices ranged from 0.02 to 0.17, with an average of 0.08). The different modules are highlighted in color, with inter-module links in gray. Virus clusters are identified by their number, and their family-level affiliation (based on the BLAST-based affiliation of the cluster members) is indicated next to each cluster when available (virus clusters with inconsistent member affiliations are considered ‘unclassified’; affiliations are spread along the x-axis for spacing purposes). Host phylum and class are indicated for each host column, with domains indicated above the corresponding hosts.

https://doi.org/10.7554/eLife.08490.016
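
To make the modularity index Q quoted in the caption more concrete, the sketch below computes Barber's bipartite modularity for a given assignment of rows (virus clusters) and columns (host classes) to modules. This is an assumption about the index being reported: lp-Brim is generally described as searching for the partition that maximizes this quantity, but the code here only scores a partition that is already known and does not implement the lp-Brim search or the permutation test. The function name and the toy matrix are hypothetical.

```python
from typing import Sequence


def bipartite_modularity(
    matrix: Sequence[Sequence[int]],
    row_modules: Sequence[int],
    col_modules: Sequence[int],
) -> float:
    """Barber's Q = (1/m) * sum_ij (B_ij - k_i * d_j / m) over same-module pairs."""
    m = sum(sum(row) for row in matrix)  # total number of links
    if m == 0:
        return 0.0
    row_deg = [sum(row) for row in matrix]
    col_deg = [sum(col) for col in zip(*matrix)]
    q = 0.0
    for i, row in enumerate(matrix):
        for j, cell in enumerate(row):
            if row_modules[i] == col_modules[j]:
                # Observed link minus the link expected from the degrees alone.
                q += cell - row_deg[i] * col_deg[j] / m
    return q / m


# Two virus clusters, each found in a single, distinct host class.
toy = [[1, 0],
       [0, 1]]
print(bipartite_modularity(toy, [0, 1], [0, 1]))
```

Placing every row and column in a single module always gives Q = 0, which is why values are compared against those obtained from permuted matrices rather than against zero alone.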

Figure 5—figure supplement 1

Virus–host network between virus clusters and host classes (network visualization).

An edge is displayed between a virus cluster (VC) and a host class when at least one virus from this cluster was retrieved in a genome from the host class. This network is detected as significantly modular by lp-Brim (modularity Q = 0.45; the same index computed from 99 randomly permuted matrices ranged from 0.02 to 0.17, with an average of 0.08). The different modules are highlighted in color, with inter-module links in gray. VCs are identified by their number, and their family-level affiliation (based on the BLAST-based affiliation of the cluster members) is indicated below each cluster when available (VCs with inconsistent member affiliations are considered ‘unclassified’). Host phylum and class are indicated for each host node, with phyla (when multiple classes from the same phylum are included in the network) and domains indicated above the corresponding host nodes.

https://doi.org/10.7554/eLife.08490.017

Figure 6 with 2 supplements

Adaptation of viral genome composition and codon usage to the host genome.

K–S distances between distributions of virus–host distances and virus–non-host distances for each metric (in color) and for different subsets of the viral sequences (all sequences, by type, and by taxonomy). Only families with more than 5 genomes are displayed (although it should be noted that the VirSorter data set includes only 6 *Microviridae* sequences). The number of sequences in each category is indicated in brackets. Distributions used to compute distances are displayed in Figure 6—figure supplement 1.

https://doi.org/10.7554/eLife.08490.018
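
The K–S distance used in Figure 6 is taken here to be the two-sample Kolmogorov–Smirnov statistic, that is, the largest gap between the empirical cumulative distributions of the two samples. The sketch below computes it in plain Python; the function name `ks_distance` and the toy distance values are assumptions for illustration, not data from the study.

```python
from bisect import bisect_right
from typing import Sequence


def ks_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample K-S statistic: max |F_a(x) - F_b(x)| over all observed x."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)
        fb = bisect_right(sb, x) / len(sb)
        d = max(d, abs(fa - fb))
    return d


virus_host = [0.10, 0.12, 0.15]       # hypothetical virus-host distances
virus_non_host = [0.30, 0.35, 0.40]   # hypothetical virus-non-host distances
print(ks_distance(virus_host, virus_non_host))
```

A value close to 1 means the virus–host and virus–non-host distributions barely overlap for that metric, while a value close to 0 means the metric does not separate hosts from non-hosts.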

Figure 6—figure supplement 1

(A) K–S distances between distributions of virus–host distances and virus–non-host distances for each metric (in color) and for different subsets of the viral sequences (based on the number of tRNA genes detected).

The number of sequences in each category is indicated below the number of tRNA genes. (B) Distribution of k-mer distances between viral and cellular genomes and codon usage adaptation index for host, host genus, host family, and non-host (different order) genomes. For each viral genome, the distance to the host is displayed, as well as 10 randomly taken distances to genomes from each category and different subsets of the viral sequences (by taxonomy on the left column, and by number of tRNA genes on the right column).

https://doi.org/10.7554/eLife.08490.019

Figure 6—figure supplement 2

Distance between k-mer frequency vectors of virus genome subsamples and host genomes for *Caudovirales*.

Viral genomes (1000) were randomly sub-sampled at different sizes (from 2000 to 20,000 bp). Only *Caudovirales* genomes were selected for this subsample analysis. For each size of k-mer, the result of a linear regression of distance between host or non-host and viral subsample size is indicated. The same distances for the *Microviridae* and *Inoviridae* (taken from Figure 6A) are indicated for comparison, and associated with the size of the reference genome of each group (*Enterobacteria* phage phiX174 and *Enterobacteria* phage M13). For clarity's sake, the almost-identical values for 2-mer, 3-mer, and 4-mer for *Microviridae* are slightly horizontally shifted.

https://doi.org/10.7554/eLife.08490.020

Author response image 1

Improvement in the proportion of affiliated genes from viromes with VirSorter dataset.

Predicted genes from the Pacific Ocean Viromes (Hurwitz and Sullivan, 2013), Tara Ocean Viromes (Brum et al., 2015), and Human Gut Viromes (Minot et al., 2013) were compared to RefSeqVirus (May 2015) and the 12.5k VirSorter dataset (BLASTp, threshold of 50 on bit score and 0.001 on e-value). Predicted proteins affiliated to VirSorter (in blue) did not display any significant similarity to a RefSeq virus, but can now be associated with a phage and a host through the VirSorter database.

https://doi.org/10.7554/eLife.08490.024

Author response image 2

Distribution of viral sequences in the RefSeq and VirSorter datasets.

For each host group, a circle proportional to the number of viral genomes available is noted in red for RefSeq and blue for VirSorter. Hosts for which no RefSeq references were available are highlighted in bold.

https://doi.org/10.7554/eLife.08490.025

Tables

Table 1

Accuracy of host prediction based on distance (d) between tetranucleotide frequencies of viral and microbial genomes

https://doi.org/10.7554/eLife.08490.021

Distance (d)	Predicted	Host order: correct	Host order: ratio (%)	Host family: correct	Host family: ratio (%)	Host genus: correct	Host genus: ratio (%)
All reference sequences
d < 4 × 10⁻⁴	98	97	98.98	97	98.98	97	98.98
4 × 10⁻⁴ ≤ d < 1 × 10⁻³	10,173	9361	92.02	8971	88.18	5261	51.72
1 × 10⁻³ ≤ d	2508	1872	74.64	1757	70.06	917	36.56
Host species excluded
d < 4 × 10⁻⁴	21	20	95.24	20	95.24	20	95.24
4 × 10⁻⁴ ≤ d < 1 × 10⁻³	10,003	9067	90.64	8372	83.69	2992	29.91
1 × 10⁻³ ≤ d	2755	1981	71.91	1840	66.79	818	29.69
Host genus excluded
d < 4 × 10⁻⁴	1	0	0.00	0	0.00	0	0.00
4 × 10⁻⁴ ≤ d < 1 × 10⁻³	9085	7303	80.39	6181	68.04	0	0.00
1 × 10⁻³ ≤ d	3693	1768	47.87	1388	37.58	0	0.00

For each viral genome, the order, family, and genus of its host were predicted from the taxonomy of the closest microbial genome (based on the mean absolute difference between tetranucleotide frequency vectors) and compared to the order, family, and genus of the actual host (i.e., the taxonomy of the genome with which the virus was identified). These predictions were computed with (i) all microbial genomes, (ii) excluding specifically all genomes from the host species, and (iii) excluding all genomes from the host genus. Cases with over 75% of prediction accuracy are highlighted in gray.
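
The prediction procedure summarized under Table 1 (nearest microbial genome by mean absolute difference between tetranucleotide frequency vectors) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the toy sequences, and the choice to count k-mers on a single strand without reverse complements are all assumptions.

```python
from itertools import product
from typing import Dict, List, Tuple

# All 256 tetranucleotides in a fixed order, so vectors are comparable.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freq(seq: str) -> List[float]:
    """Tetranucleotide frequency vector (k-mers with ambiguous bases are skipped)."""
    counts = dict.fromkeys(KMERS, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in KMERS]


def mean_abs_diff(u: List[float], v: List[float]) -> float:
    """Distance d: mean absolute difference between two frequency vectors."""
    return sum(abs(x - y) for x, y in zip(u, v)) / len(u)


def predict_host(virus_seq: str, genomes: Dict[str, str]) -> Tuple[str, float]:
    """Return the name of the closest genome and its distance d to the virus."""
    v = tetra_freq(virus_seq)
    distances = {name: mean_abs_diff(v, tetra_freq(g)) for name, g in genomes.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]


genomes = {"hostA": "ACGTACGTACGTACGTACGT", "hostB": "GGGGCCCCGGGGCCCCGGGG"}
name, d = predict_host("ACGTACGTACGTACGT", genomes)
print(name, d)
```

The taxonomy of the returned genome would then be read as the predicted host order, family, and genus, and the distance d compared with the thresholds of Table 1 (4 × 10⁻⁴ and 1 × 10⁻³) to gauge how reliable the prediction is likely to be.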

Cite this article

Simon Roux, Steven J Hallam, Tanja Woyke, Matthew B Sullivan (2015) Viral dark matter and virus–host interactions resolved from publicly available microbial genomes. eLife 4:e08490. https://doi.org/10.7554/eLife.08490
