Microbiology and Infectious Disease

Research: A comprehensive and quantitative exploration of thousands of viral genomes

California Institute of Technology, United States

Apr 6, 2018

https://doi.org/10.7554/eLife.31955

Open access
Copyright information

Download
Cite
CommentOpen annotations (there are currently 0 annotations on this page).
Share

11 figures, 1 table and 1 additional file

Figures

Figure 1

Download asset Open asset

Schematics of several viral classification systems explored in this study.

(A) The Baltimore classification divides all viruses into seven groups based on how the viral mRNA is produced. DNA strands are denoted in red (+ssDNA in darker shade of red than -ssDNA). Similarly RNA strands are denoted in green (+ssRNA in darker shade of green than -ssRNA). In the case of Baltimore groups 1,2,6, and 7, the genome either is or is converted to dsDNA, which is then converted to mRNA through the action of DNA-dependent RNA polymerase. In the case of Baltimore groups 3, 4 and 5, the genome is or is converted to +ssRNA, which is mRNA, through the action of RNA-dependent RNA polymerase. (B) Nucleotide type classification divides viruses based on their genomic material into DNA and RNA viruses. Baltimore viral groups 1, 2, and 7 are all considered DNA viruses, and the remaining viral groups are considered RNA viruses. (C) Host Domain classification groups viruses based on the host domain that they infect. Three groups are formed: eukaryotic, bacterial and archaeal viruses.

https://doi.org/10.7554/eLife.31955.002

Figure 2 with 1 supplement

Download asset Open asset

A census of all viruses with complete genomes reported to NCBI that were matched to a host (N= 2399).

(A) Percentage of viruses infecting hosts from the three domains of life. 1) Eukaryotic, 2) bacterial and 3) archaeal viromes are further classified according to the (B) Nucleotide Type, (C) Baltimore, and D) ICTV classification systems. (E) Distributions of host phyla (or supergroups) infected by the (1) eukaryotic, (2) bacterial, and (3) archaeal viruses is shown. As in the case of panel F, the host taxonomic identification is derived from the NCBI Taxonomy database (see Materials and methods). (F) Histograms of the number of known viruses infecting host species. Median and mean number of viruses infecting a host species is provided in each plot. The full-range of x-values for the bacterial and eukaryotic histograms extends beyond n=20 (see virusHostHistograms.ipynb in our GitHub repository [Mahmoudabadi, 2018]). Further exploration of the largest fraction of the eukaryotic virome (i.e. animal viruses) is shown in Figure 2—figure supplement 1.

https://doi.org/10.7554/eLife.31955.003

Figure 2—figure supplement 1

Download asset Open asset

Further exploration of the largest fraction of the eukaryotic virome: viruses of Opisthokonta supergroup (animals).

The x-axis corresponds to the number of viruses infecting each host group. In a recursive fashion, the host group with the largest number of known viruses is further zoomed in on (host groups infected by only a few known viruses are not shown). The host classification was obtained from the NCBI taxonomic database.

https://doi.org/10.7554/eLife.31955.004

Figure 3 with 1 supplement

Download asset Open asset

Describing viral genomes through distributions of genome length, gene length and gene density.

(A) Box plots of genome lengths (Log10) across all viruses included in our dataset (top), further partitioned based on the Baltimore classification categories (bottom). The number of viruses included in each group is denoted by N. (B) A closer examination of dsDNA and ssDNA viral genome lengths through the overlay of Host Domain and ICTV classification systems. Distributions of genome lengths associated with eukaryotic, bacterial and archaeal viruses are shown in salmon, blue, and teal, respectively. ICTV viral families with only a few members are omitted. Distributions of genome lengths across different classification systems along with various statistics are shown in Figure 3—figure supplement 1. and Figure 3—source data 1. Note that the bimodal distribution of eukaryotic ssDNA viruses, which also appears in the next figure, arises from the Begomoviruses, which are plant viruses with circularized monopartite and bipartite genomes (Melgarejo et al., 2013). (C) Median gene length is plotted against the number of genes for each genome for all genomes in our dataset, color-coded according to different classification systems. (D) Number of genes per genome length (gene density) for dsDNA viruses based on the overlay of Host Domain (bottom) and ICTV family classification categories (top) (Pearson correlations and their statistical significance, two-tailed t-test P values, are denoted).

https://doi.org/10.7554/eLife.31955.005

Figure 3—source data 1 Genome length statistics for viral groups across different classification systems (rounded to the nearest kilobase).: https://doi.org/10.7554/eLife.31955.007
Download elife-31955-fig3-data1-v2.docx

Figure 3—figure supplement 1

Download asset Open asset

Histograms of genome length (Log10) across all complete viral genomes associated with a host.

Histograms are grouped according to four viral classification systems: (A) Baltimore classification, (B) Nucleotide type classification, (C) Host Domain Classification, and D) ICTV classification. Instead of showing absolute viral counts on the y-axis, the counts are normalized by the total number of viruses in each viral category (the total counts of viruses in each category is denoted as N inside the plots). The mean of each distribution is denoted as a dot on the boxplots. The relevant statistics for each distribution is provided in Figure 3—source data 1. In each histogram the number of bins and their width is set by Freedman-Diaconis rule (Reich et al., 1966).

https://doi.org/10.7554/eLife.31955.006

Figure 4

Download asset Open asset

Normalized histograms of median gene lengths (log10) across all complete viral genomes associated with a host.

Instead of showing absolute viral counts on y-axes, the counts are normalized by the total number of viruses in each viral category (denoted as N inside each plot). The mean of each distribution is denoted as a dot on the boxplot. For all histograms, bin numbers and bin widths are systematically decided by the Freedman-Diaconis rule (Reich et al., 1966). Viral schematics on the right of the figure are modified from ViralZone (Hulo et al., 2011). Key statistics describing these distributions can be found in Table 1 and Figure 4—source data 1.

https://doi.org/10.7554/eLife.31955.009

Figure 4—source data 1 Median gene length statistics for viral groups across different classification systems (rounded to the nearest base). It is important to clarify that the median values in this table represent the median of median gene lengths.: https://doi.org/10.7554/eLife.31955.010
Download elife-31955-fig4-data1-v2.docx

Figure 5

Download asset Open asset

Normalized histograms of noncoding DNA/RNA percentage across all complete viral genomes associated with a host.

The counts of viruses are normalized by the total number of viruses in each viral category (denoted as N inside each plot). The mean of each distribution is denoted as a dot on the boxplot. For all histograms, bin numbers and bin widths are systematically decided by the Freedman-Diaconis rule (Reich et al., 1966). Viral schematics are modified from ViralZone (Hulo et al., 2011). Key statistics describing these distributions can be found in Table 1 and Figure 5—source data 1.

https://doi.org/10.7554/eLife.31955.011

Figure 5—source data 1 Percent noncoding DNA (or RNA) for viral groups across different classification systems (rounded to the nearest percentage).: https://doi.org/10.7554/eLife.31955.012
Download elife-31955-fig5-data1-v2.docx

Figure 6

Download asset Open asset

Normalized abundance of functional gene categories across different viral groups.

(A) Abundances of functional gene categories across 8 viral groups normalized to the number of labeled genes in each viral group (the total number of genes in each viral group is shown above the panel, and in brackets are the number of labeled genes for each viral group). (B) Abundances of functional gene subcategories across 8 viral groups: RNA, ssDNA, and dsDNA viral groups (top plot); eukaryotic and bacterial dsDNA viral groups (middle); *Siphoviridae*, *Myoviridae*, and *Podoviridae* viral groups (bottom). A few examples of the types of genes contained as part of each functional subcategory are provided.

https://doi.org/10.7554/eLife.31955.013

Figure 7 with 1 supplement

Download asset Open asset

Alignment of the most common gene order patterns for dsDNA bacterial viruses.

Each genome is summarized by a sequence of letters, with each letter corresponding to a gene, positioned in the order that it appears on the genome. As an example, the gene order sequence for *Salmonella phage FSL SP-004* is shown. Note the letters shown serve to only denote genes with similar functions. Structural genes are assigned colors, whereas other genes are denoted in black. Across all three panels, each row corresponds to the gene order sequence for a given virus, and thus, the length of the sequence denotes the number of genes within a given genome. The left two columns accompanying each panel provide further information on hosts and viral morphologies. Panel A, B, and C, represent gene order patterns A, B, and C, respectively. Geneious global alignment (Steitz et al., 2011) was used to align gene order sequences (see Materials and methods). Refer to Figure 7—figure supplement 1 to see the percent identity heat maps of terminases (large and small subunits) across dsDNA bacterial viruses.

https://doi.org/10.7554/eLife.31955.014

Figure 7—figure supplement 1

Download asset Open asset

Percent identity heat maps of A) 320 terminase (large subunit) amino acid sequences, and B) 191 terminase (small subunit) amino acid sequences from dsDNA bacteriophages.

The sidebars denote the host phylum for each bacteriophage sequence.

https://doi.org/10.7554/eLife.31955.015

Figure 8

Download asset Open asset

Attachment site length, position, and sequence diversity for 164 dsDNA bacterial viruses.

(A) Histogram of attachment site length. (B) Histogram of attachment site start positions (left attachment: blue, right attachment: red). (C) Histogram of attachment site start positions normalized by the genome length. (D) Percent sequence similarity matrix across attachment sites. (E) Attachment site locations along viral genomes (left attachment: blue, right attachment: red). Figure 8—source data 1 demonstrates several bacteriophages shown in panel E with similar or identical attachment site sequences.

https://doi.org/10.7554/eLife.31955.016

Figure 8—source data 1 Several bacteriophages from Figure 8D with similar or identical attachment site sequences.: https://doi.org/10.7554/eLife.31955.017
Download elife-31955-fig8-data1-v2.docx

Figure 9

Download asset Open asset

The result of BLASTP for all dsDNA bacteriophage proteins against the NCBI Refseq protein database (limited to bacterial proteins).

The numbers reported correspond to the number of dsDNA bacteriophage proteins (rounded to the nearest thousand).

https://doi.org/10.7554/eLife.31955.018

Figure 10 with 1 supplement

Download asset Open asset

A depiction of the taxonomic distance between the bacteriophage host organism and the bacterium containing the closest homolog to a bacteriophage protein.

All circles are drawn to scale with respect to the number of proteins (N) that they each represent. Note, the number of proteins denoted at each taxonomic layer includes proteins in lower taxonomic layers. For example, the 20,000 figure denoted at the genus layer already includes the 11,000 proteins shown at the species layer. N values are rounded to the nearest thousand. Histograms of the fraction of proteins with bacterial homologs per bacteriophage genome are shown in Figure 10—figure supplement 1.

https://doi.org/10.7554/eLife.31955.019

Figure 10—figure supplement 1

Download asset Open asset

Histogram of the fraction of proteins per bacteriophage genome with bacterial homologs (Left) and the same histogram with an additional filter to identify possible prophages and their lytic relatives (right).
https://doi.org/10.7554/eLife.31955.020

Figure 11 with 1 supplement

Download asset Open asset

Histograms of bit scores describing the match between each bacteriophage protein and its closest bacterial homolog.

Histograms are created according to the proteins belonging to three different layers corresponding to an increasing taxonomic distance between the host organism and the bacterium containing the closest homolog. (A) When the host and the homolog-containing bacterium belong to the same species, the median bit score is significantly higher (one sided Mann-Whitney U test, P<0.001) than it is for those that are only part of the same genus. (B) Similarly, when comparing proteins from the “same species” layer to the “same phylum” layer, the median bit score is significantly higher for the “same species” layer (one sided Mann-Whitney U test, P<0.001). Note that for each layer, when comparing the “same species” to the “same genus” layers, we are comparing the 11,000 proteins in the “same species” layer to the 9,000 proteins from the “same genus” layer that do not also belong to the “same species” layer. The same principle applies when we are comparing the “same species” layer to the “same phylum” layer. Distributions of bacteriophage proteins with homologs from a different phylum than their host phylum are shown in Figure 11—figure supplement 1.

https://doi.org/10.7554/eLife.31955.021

Figure 11—figure supplement 1

Download asset Open asset

Distributions of bacteriophage proteins with a homolog in a bacterium from a different phylum than their host phylum.

These proteins are categorized based on their host’s phylum (top), and then based on the phylum where their closest homolog appears (bottom). There are 26 different phyla that bacterial homologs appear in, however, only the ones containing the highest number of homologs are annotated for visual clarity.

https://doi.org/10.7554/eLife.31955.022

Tables

Table 1

Viral genomic statistics based upon different classification systems.

Only median values are reported in this table. Genome length data is rounded to the nearest kilobase. N corresponds to the number of viruses from which data is obtained.

https://doi.org/10.7554/eLife.31955.008

Classification		N	Genome length (kb)	Percent noncoding (DNA/RNA)	Median gene length (bases)
Host Domain	Eukaryotic Viruses	1384	8	10	1055
	Bacteria Viruses	969	43	9	408
	Archaea Viruses	46	24	10	400
Baltimore	Group I (dsDNA)	1211	44	9	429
	Group II (ssDNA)	431	3	14	588
	Group III (dsRNA)	123	8	8	2291
	Group IV (+ssRNA)	482	9	5	2366
	Group V (-ssRNA)	101	12	7	1353
	Group VI (ssRNA-RT)	14	8	16	1799
	Group VII (dsDNA-RT)	37	8	11	558
Nucleotide Type	DNA Viruses	1679	38	10	444
Nucleotide Type	RNA Viruses	720	9	6	2072
ICTV (orders)	Caudovirales	879	44	9	408
	Herpesvirales	55	159	19	1107
	Ligamenvirales	11	37	12	372
	Mononegavirales	71	12	8	1266
	Nidovirales	35	27	3	672
	Picornavirales	89	8	11	7056
	Tymovirales	73	8	4	693
Combinations of different classifications	All Eukaryotic dsDNA viruses	271	33	11	990
	All Bacterial dsDNA viruses	899	44	9	408
	All Archaeal dsDNA viruses	41	28	10	396
	All Eukaryotic ssDNA viruses	375	3	14	732
	All Bacterial ssDNA viruses	51	7	14	348

Additional files

Transparent reporting form: https://doi.org/10.7554/eLife.31955.023
Download elife-31955-transrepform-v2.docx

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Mendeley

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Gita Mahmoudabadi
Rob Phillips

(2018)

Research: A comprehensive and quantitative exploration of thousands of viral genomes

eLife 7:e31955.

https://doi.org/10.7554/eLife.31955

Share this article

Cite this article

Schematics of several viral classification systems explored in this study.

A census of all viruses with complete genomes reported to NCBI that were matched to a host (N= 2399).

Further exploration of the largest fraction of the eukaryotic virome: viruses of Opisthokonta supergroup (animals).

Describing viral genomes through distributions of genome length, gene length and gene density.

Figure 3—source data 1

Histograms of genome length (Log10) across all complete viral genomes associated with a host.

Normalized histograms of median gene lengths (log10) across all complete viral genomes associated with a host.

Figure 4—source data 1

Normalized histograms of noncoding DNA/RNA percentage across all complete viral genomes associated with a host.

Figure 5—source data 1

Normalized abundance of functional gene categories across different viral groups.

Alignment of the most common gene order patterns for dsDNA bacterial viruses.

Percent identity heat maps of A) 320 terminase (large subunit) amino acid sequences, and B) 191 terminase (small subunit) amino acid sequences from dsDNA bacteriophages.

Attachment site length, position, and sequence diversity for 164 dsDNA bacterial viruses.

Figure 8—source data 1

The result of BLASTP for all dsDNA bacteriophage proteins against the NCBI Refseq protein database (limited to bacterial proteins).

A depiction of the taxonomic distance between the bacteriophage host organism and the bacterium containing the closest homolog to a bacteriophage protein.

Histogram of the fraction of proteins per bacteriophage genome with bacterial homologs (Left) and the same histogram with an additional filter to identify possible prophages and their lytic relatives (right).

Histograms of bit scores describing the match between each bacteriophage protein and its closest bacterial homolog.

Distributions of bacteriophage proteins with a homolog in a bacterium from a different phylum than their host phylum.

Viral genomic statistics based upon different classification systems.

Transparent reporting form

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)