Research: A comprehensive and quantitative exploration of thousands of viral genomes

  1. Gita Mahmoudabadi
  2. Rob Phillips (corresponding author)
  1. California Institute of Technology, United States
11 figures, 1 table and 1 additional file

Figures

Figure 1
Schematics of several viral classification systems explored in this study.

(A) The Baltimore classification divides all viruses into seven groups based on how the viral mRNA is produced. DNA strands are denoted in red (+ssDNA in a darker shade of red than -ssDNA). Similarly, RNA strands are denoted in green (+ssRNA in a darker shade of green than -ssRNA). In Baltimore groups 1, 2, 6, and 7, the genome either is or is converted to dsDNA, which is then converted to mRNA through the action of DNA-dependent RNA polymerase. In Baltimore groups 3, 4, and 5, the genome either is or is converted to +ssRNA, which is mRNA, through the action of RNA-dependent RNA polymerase. (B) The Nucleotide Type classification divides viruses into DNA and RNA viruses based on their genomic material. Baltimore groups 1, 2, and 7 are all considered DNA viruses, and the remaining groups are considered RNA viruses. (C) The Host Domain classification groups viruses by the host domain that they infect, forming three groups: eukaryotic, bacterial, and archaeal viruses.
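The relationship between panels A and B can be written as a small lookup. This is an illustrative sketch only: the names `BALTIMORE_GROUPS`, `DNA_GROUPS`, and `nucleotide_type` are hypothetical, and the genome-type labels follow those used in Table 1.

```python
# Baltimore group number -> genome type, using the labels from Table 1.
BALTIMORE_GROUPS = {
    1: "dsDNA",
    2: "ssDNA",
    3: "dsRNA",
    4: "+ssRNA",
    5: "-ssRNA",
    6: "ssRNA-RT",
    7: "dsDNA-RT",
}

# Per panel B: groups 1, 2, and 7 are DNA viruses; all others are RNA viruses.
DNA_GROUPS = {1, 2, 7}


def nucleotide_type(baltimore_group: int) -> str:
    """Return "DNA" or "RNA" for a Baltimore group number (1-7)."""
    if baltimore_group not in BALTIMORE_GROUPS:
        raise ValueError(f"unknown Baltimore group: {baltimore_group}")
    return "DNA" if baltimore_group in DNA_GROUPS else "RNA"
```

Note that group 6 (ssRNA-RT) counts as an RNA virus under this scheme even though its genome is converted to dsDNA.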

https://doi.org/10.7554/eLife.31955.002
Figure 2 with 1 supplement
A census of all viruses with complete genomes reported to NCBI that were matched to a host (N= 2399).

(A) Percentage of viruses infecting hosts from the three domains of life. The (1) eukaryotic, (2) bacterial, and (3) archaeal viromes are further classified according to the (B) Nucleotide Type, (C) Baltimore, and (D) ICTV classification systems. (E) Distributions of host phyla (or supergroups) infected by the (1) eukaryotic, (2) bacterial, and (3) archaeal viruses are shown. As in the case of panel F, the host taxonomic identification is derived from the NCBI Taxonomy database (see Materials and methods). (F) Histograms of the number of known viruses infecting host species. The median and mean number of viruses infecting a host species are provided in each plot. The full range of x-values for the bacterial and eukaryotic histograms extends beyond n=20 (see virusHostHistograms.ipynb in our GitHub repository [Mahmoudabadi, 2018]). Further exploration of the largest fraction of the eukaryotic virome (i.e. animal viruses) is shown in Figure 2—figure supplement 1.

https://doi.org/10.7554/eLife.31955.003
Figure 2—figure supplement 1
Further exploration of the largest fraction of the eukaryotic virome: viruses of Opisthokonta supergroup (animals).

The x-axis corresponds to the number of viruses infecting each host group. At each level, the host group with the largest number of known viruses is expanded recursively into its subgroups (host groups infected by only a few known viruses are not shown). The host classification was obtained from the NCBI Taxonomy database.

https://doi.org/10.7554/eLife.31955.004
Figure 3 with 1 supplement
Describing viral genomes through distributions of genome length, gene length and gene density.

(A) Box plots of genome lengths (Log10) across all viruses included in our dataset (top), further partitioned based on the Baltimore classification categories (bottom). The number of viruses included in each group is denoted by N. (B) A closer examination of dsDNA and ssDNA viral genome lengths through the overlay of the Host Domain and ICTV classification systems. Distributions of genome lengths associated with eukaryotic, bacterial, and archaeal viruses are shown in salmon, blue, and teal, respectively. ICTV viral families with only a few members are omitted. Distributions of genome lengths across different classification systems, along with various statistics, are shown in Figure 3—figure supplement 1 and Figure 3—source data 1. Note that the bimodal distribution of eukaryotic ssDNA viruses, which also appears in the next figure, arises from the Begomoviruses, which are plant viruses with circularized monopartite and bipartite genomes (Melgarejo et al., 2013). (C) Median gene length is plotted against the number of genes for each genome in our dataset, color-coded according to different classification systems. (D) Number of genes plotted against genome length (gene density) for dsDNA viruses, based on the overlay of Host Domain (bottom) and ICTV family classification categories (top). Pearson correlations and their statistical significance (two-tailed t-test P values) are denoted.

https://doi.org/10.7554/eLife.31955.005
Figure 3—source data 1

Genome length statistics for viral groups across different classification systems (rounded to the nearest kilobase).

https://doi.org/10.7554/eLife.31955.007
Figure 3—figure supplement 1
Histograms of genome length (Log10) across all complete viral genomes associated with a host.

Histograms are grouped according to four viral classification systems: (A) Baltimore classification, (B) Nucleotide Type classification, (C) Host Domain classification, and (D) ICTV classification. Instead of showing absolute viral counts on the y-axis, the counts are normalized by the total number of viruses in each viral category (the total count of viruses in each category is denoted as N inside the plots). The mean of each distribution is denoted as a dot on the boxplots. The relevant statistics for each distribution are provided in Figure 3—source data 1. In each histogram, the number of bins and their width are set by the Freedman-Diaconis rule (Reich et al., 1966).
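For reference, the Freedman-Diaconis rule sets the bin width to 2·IQR/n^(1/3), where IQR is the interquartile range and n the number of observations. The following is a minimal standard-library sketch of that rule, not the code used to produce the figures; the function name `freedman_diaconis_bins` and the choice of quartile convention (`method="inclusive"`) are assumptions.

```python
import math
import statistics


def freedman_diaconis_bins(values):
    """Return (bin_width, n_bins) for a sample under the Freedman-Diaconis rule."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    # Quartile convention is an assumption; other conventions shift the IQR slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the rule gives no usable bin width")
    width = 2 * iqr / n ** (1 / 3)
    # Enough bins of this width to cover the full data range.
    n_bins = max(1, math.ceil((max(values) - min(values)) / width))
    return width, n_bins
```

For log-scaled histograms such as these, the rule would be applied to the log10-transformed values rather than to the raw lengths.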

https://doi.org/10.7554/eLife.31955.006
Figure 4
Normalized histograms of median gene lengths (log10) across all complete viral genomes associated with a host.

Instead of showing absolute viral counts on the y-axes, the counts are normalized by the total number of viruses in each viral category (denoted as N inside each plot). The mean of each distribution is denoted as a dot on the boxplot. For all histograms, bin numbers and bin widths are systematically decided by the Freedman-Diaconis rule (Reich et al., 1966). Viral schematics on the right of the figure are modified from ViralZone (Hulo et al., 2011). Key statistics describing these distributions can be found in Table 1 and Figure 4—source data 1.

https://doi.org/10.7554/eLife.31955.009
Figure 4—source data 1

Median gene length statistics for viral groups across different classification systems (rounded to the nearest base).

It is important to clarify that the median values in this table represent the median of median gene lengths.
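To make the "median of medians" concrete, the sketch below first takes the median gene length within each genome and then the median of those per-genome values across a group. The function name and the toy gene lengths are hypothetical and are not drawn from the dataset.

```python
from statistics import median


def median_of_median_gene_lengths(gene_lengths_by_genome):
    """gene_lengths_by_genome: dict mapping genome name -> list of gene lengths (bases)."""
    # Step 1: one median per genome (genomes with no annotated genes are skipped).
    per_genome = [median(lengths) for lengths in gene_lengths_by_genome.values() if lengths]
    # Step 2: the median of those per-genome medians for the whole group.
    return median(per_genome)


# Toy example with made-up values.
toy_group = {
    "virus_a": [300, 900, 1200],  # per-genome median: 900
    "virus_b": [400, 500],        # per-genome median: 450
    "virus_c": [2000],            # per-genome median: 2000
}
group_value = median_of_median_gene_lengths(toy_group)
```

Each genome contributes a single value regardless of how many genes it has, so large genomes do not dominate the group statistic.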

https://doi.org/10.7554/eLife.31955.010
Figure 5
Normalized histograms of noncoding DNA/RNA percentage across all complete viral genomes associated with a host.

The counts of viruses are normalized by the total number of viruses in each viral category (denoted as N inside each plot). The mean of each distribution is denoted as a dot on the boxplot. For all histograms, bin numbers and bin widths are systematically decided by the Freedman-Diaconis rule (Reich et al., 1966). Viral schematics are modified from ViralZone (Hulo et al., 2011). Key statistics describing these distributions can be found in Table 1 and Figure 5—source data 1.
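One way to obtain a noncoding percentage from gene coordinates is sketched below. This is an assumption about the calculation, not a description of the authors' method: it merges overlapping gene intervals so that shared bases are counted once, treats coordinates as 1-based and inclusive, and ignores genes that wrap around the origin of a circular genome. The function name is hypothetical.

```python
def percent_noncoding(genome_length, gene_intervals):
    """Percent of the genome not covered by any gene.

    gene_intervals: iterable of (start, end) pairs, 1-based and inclusive.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_start = current_end = None
    for start, end in sorted(gene_intervals):
        if current_end is None:
            current_start, current_end = start, end
        elif start <= current_end:
            # Overlapping genes: extend the current merged interval.
            current_end = max(current_end, end)
        else:
            # Gap before this gene: close the merged interval and start a new one.
            covered += current_end - current_start + 1
            current_start, current_end = start, end
    if current_end is not None:
        covered += current_end - current_start + 1
    return 100 * (genome_length - covered) / genome_length
```

For a 1000-base genome with genes at (1, 300), (250, 500), and (601, 800), 700 bases are coding and the function returns 30.0.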

https://doi.org/10.7554/eLife.31955.011
Figure 5—source data 1

Percent noncoding DNA (or RNA) for viral groups across different classification systems (rounded to the nearest percentage).

https://doi.org/10.7554/eLife.31955.012
Figure 6
Normalized abundance of functional gene categories across different viral groups.

(A) Abundances of functional gene categories across 8 viral groups, normalized to the number of labeled genes in each viral group (the total number of genes in each viral group is shown above the panel, with the number of labeled genes in brackets). (B) Abundances of functional gene subcategories across 8 viral groups: RNA, ssDNA, and dsDNA viral groups (top plot); eukaryotic and bacterial dsDNA viral groups (middle); Siphoviridae, Myoviridae, and Podoviridae viral groups (bottom). A few examples of the types of genes contained in each functional subcategory are provided.

https://doi.org/10.7554/eLife.31955.013
Figure 7 with 1 supplement
Alignment of the most common gene order patterns for dsDNA bacterial viruses.

Each genome is summarized by a sequence of letters, with each letter corresponding to a gene, positioned in the order that it appears on the genome. As an example, the gene order sequence for Salmonella phage FSL SP-004 is shown. Note that the letters serve only to denote genes with similar functions. Structural genes are assigned colors, whereas other genes are denoted in black. Across all three panels, each row corresponds to the gene order sequence for a given virus, and thus the length of the sequence denotes the number of genes within a given genome. The two columns to the left of each panel provide further information on hosts and viral morphologies. Panels A, B, and C represent gene order patterns A, B, and C, respectively. Geneious global alignment (Steitz et al., 2011) was used to align gene order sequences (see Materials and methods). Refer to Figure 7—figure supplement 1 for the percent identity heat maps of terminases (large and small subunits) across dsDNA bacterial viruses.
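The encoding of a genome as a gene order sequence can be sketched as follows. The letter assignments in `FUNCTION_TO_LETTER`, the placeholder letter `OTHER`, and the function name are all hypothetical; the actual letter codes used in the figure are not reproduced here.

```python
# Hypothetical mapping from functional label to a single letter.
FUNCTION_TO_LETTER = {
    "terminase small subunit": "S",
    "terminase large subunit": "T",
    "portal": "P",
    "major capsid": "C",
    "tail": "L",
}
OTHER = "X"  # placeholder letter for genes without an assigned label


def gene_order_sequence(genes):
    """genes: iterable of (start_position, function_label) pairs for one genome."""
    # Order genes by where they start on the genome, then emit one letter per gene.
    ordered = sorted(genes, key=lambda gene: gene[0])
    return "".join(FUNCTION_TO_LETTER.get(label, OTHER) for _, label in ordered)


# Toy genome with made-up coordinates.
toy_genes = [
    (5000, "portal"),
    (100, "terminase small subunit"),
    (1500, "terminase large subunit"),
    (9000, "hypothetical protein"),
]
toy_sequence = gene_order_sequence(toy_genes)
```

Because each gene contributes exactly one letter, the string length equals the number of genes, and the resulting strings can be aligned like any other sequence.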

https://doi.org/10.7554/eLife.31955.014
Figure 7—figure supplement 1
Percent identity heat maps of (A) 320 terminase (large subunit) amino acid sequences and (B) 191 terminase (small subunit) amino acid sequences from dsDNA bacteriophages.

The sidebars denote the host phylum for each bacteriophage sequence.

https://doi.org/10.7554/eLife.31955.015
Figure 8
Attachment site length, position, and sequence diversity for 164 dsDNA bacterial viruses.

(A) Histogram of attachment site length. (B) Histogram of attachment site start positions (left attachment: blue, right attachment: red). (C) Histogram of attachment site start positions normalized by the genome length. (D) Percent sequence similarity matrix across attachment sites. (E) Attachment site locations along viral genomes (left attachment: blue, right attachment: red). Figure 8—source data 1 lists several bacteriophages shown in panel E with similar or identical attachment site sequences.
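The normalization in panel C amounts to dividing each start position by the genome length, which places genomes of different sizes on a common 0 to 1 scale. A minimal sketch with hypothetical names and made-up coordinates:

```python
def normalized_attachment_starts(left_start, right_start, genome_length):
    """Return the left and right attachment site starts as fractions of genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    for position in (left_start, right_start):
        if not 0 <= position <= genome_length:
            raise ValueError("start position lies outside the genome")
    return left_start / genome_length, right_start / genome_length


# Toy example: a 48,000-base genome with sites starting at bases 2,400 and 45,600.
left_fraction, right_fraction = normalized_attachment_starts(2400, 45600, 48000)
```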

https://doi.org/10.7554/eLife.31955.016
Figure 8—source data 1

Several bacteriophages from Figure 8D with similar or identical attachment site sequences.

https://doi.org/10.7554/eLife.31955.017
Figure 9
The result of BLASTP for all dsDNA bacteriophage proteins against the NCBI RefSeq protein database (limited to bacterial proteins).

The numbers reported correspond to the number of dsDNA bacteriophage proteins (rounded to the nearest thousand).

https://doi.org/10.7554/eLife.31955.018
Figure 10 with 1 supplement
A depiction of the taxonomic distance between the bacteriophage host organism and the bacterium containing the closest homolog to a bacteriophage protein.

All circles are drawn to scale with respect to the number of proteins (N) that they each represent. Note that the number of proteins denoted at each taxonomic layer includes proteins in lower taxonomic layers. For example, the 20,000 proteins denoted at the genus layer already include the 11,000 proteins shown at the species layer. N values are rounded to the nearest thousand. Histograms of the fraction of proteins with bacterial homologs per bacteriophage genome are shown in Figure 10—figure supplement 1.
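Because the layers are cumulative, the number of proteins that belong to a layer but not to the one inside it is a simple difference. The sketch below uses only the two rounded values quoted in the caption; the function name is hypothetical.

```python
def exclusive_layer_counts(cumulative_counts):
    """Convert cumulative layer counts into counts exclusive to each layer.

    cumulative_counts: list of (layer_name, count), ordered from the
    narrowest taxonomic layer outward.
    """
    exclusive = {}
    previous = 0
    for layer, count in cumulative_counts:
        if count < previous:
            raise ValueError("cumulative counts must not decrease")
        exclusive[layer] = count - previous
        previous = count
    return exclusive


# Rounded values from the caption: the genus layer includes the species layer.
layers = [("species", 11_000), ("genus", 20_000)]
exclusive = exclusive_layer_counts(layers)
```

The resulting 9,000 proteins at the genus layer are those whose closest homolog lies in the same genus as the host but not in the same species.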

https://doi.org/10.7554/eLife.31955.019
Figure 10—figure supplement 1
Histogram of the fraction of proteins per bacteriophage genome with bacterial homologs (left) and the same histogram with an additional filter to identify possible prophages and their lytic relatives (right).

https://doi.org/10.7554/eLife.31955.020
Figure 11 with 1 supplement
Histograms of bit scores describing the match between each bacteriophage protein and its closest bacterial homolog.

Histograms are created for the proteins belonging to three different layers, corresponding to an increasing taxonomic distance between the host organism and the bacterium containing the closest homolog. (A) When the host and the homolog-containing bacterium belong to the same species, the median bit score is significantly higher (one-sided Mann-Whitney U test, P<0.001) than it is for those that are only part of the same genus. (B) Similarly, when comparing proteins from the “same species” layer to the “same phylum” layer, the median bit score is significantly higher for the “same species” layer (one-sided Mann-Whitney U test, P<0.001). Note that when comparing the “same species” and “same genus” layers, we compare the 11,000 proteins in the “same species” layer to the 9,000 proteins in the “same genus” layer that do not also belong to the “same species” layer. The same principle applies when comparing the “same species” layer to the “same phylum” layer. Distributions of bacteriophage proteins with homologs from a different phylum than their host phylum are shown in Figure 11—figure supplement 1.
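As a reminder of what the Mann-Whitney U test measures, the sketch below computes the U statistic for one sample directly from its definition: the number of cross-sample pairs in which that sample's value is larger, with ties counted as one half. It does not compute a P value, the bit scores shown are made up, and the pairwise loop is far too slow for samples of thousands of proteins, where a library routine such as `scipy.stats.mannwhitneyu` would be the practical choice. The function name is hypothetical.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs (a, b) with a > b count 1, ties count 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Made-up bit scores for two non-overlapping groups of proteins.
same_species_scores = [5, 6, 7]
same_genus_only_scores = [1, 2, 6]
u_species = mann_whitney_u(same_species_scores, same_genus_only_scores)
u_genus = mann_whitney_u(same_genus_only_scores, same_species_scores)
```

The two U values always sum to the product of the sample sizes, and a U close to that product for the first sample indicates that its values tend to be larger, which is the direction a one-sided test assesses.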

https://doi.org/10.7554/eLife.31955.021
Figure 11—figure supplement 1
Distributions of bacteriophage proteins with a homolog in a bacterium from a different phylum than their host phylum.

These proteins are categorized based on their host’s phylum (top), and then based on the phylum where their closest homolog appears (bottom). Bacterial homologs appear in 26 different phyla; however, for visual clarity, only the phyla containing the highest number of homologs are annotated.

https://doi.org/10.7554/eLife.31955.022

Tables

Table 1
Viral genomic statistics based upon different classification systems.

Only median values are reported in this table. Genome length data is rounded to the nearest kilobase. N corresponds to the number of viruses from which data is obtained.

https://doi.org/10.7554/eLife.31955.008
Classification                     N      Genome length (kb)   Percent noncoding (DNA/RNA)   Median gene length (bases)
Host Domain
  Eukaryotic Viruses               1384   8                    10                            1055
  Bacteria Viruses                 969    43                   9                             408
  Archaea Viruses                  46     24                   10                            400
Baltimore
  Group I (dsDNA)                  1211   44                   9                             429
  Group II (ssDNA)                 431    3                    14                            588
  Group III (dsRNA)                123    8                    8                             2291
  Group IV (+ssRNA)                482    9                    5                             2366
  Group V (-ssRNA)                 101    12                   7                             1353
  Group VI (ssRNA-RT)              14     8                    16                            1799
  Group VII (dsDNA-RT)             37     8                    11                            558
Nucleotide Type
  DNA Viruses                      1679   38                   10                            444
  RNA Viruses                      720    9                    6                             2072
ICTV (orders)
  Caudovirales                     879    44                   9                             408
  Herpesvirales                    55     159                  19                            1107
  Ligamenvirales                   11     37                   12                            372
  Mononegavirales                  71     12                   8                             1266
  Nidovirales                      35     27                   3                             672
  Picornavirales                   89     8                    11                            7056
  Tymovirales                      73     8                    4                             693
Combinations of different classifications
  All Eukaryotic dsDNA viruses     271    33                   11                            990
  All Bacterial dsDNA viruses      899    44                   9                             408
  All Archaeal dsDNA viruses       41     28                   10                            396
  All Eukaryotic ssDNA viruses     375    3                    14                            732
  All Bacterial ssDNA viruses      51     7                    14                            348

  1. Gita Mahmoudabadi
  2. Rob Phillips
(2018)
Research: A comprehensive and quantitative exploration of thousands of viral genomes
eLife 7:e31955.
https://doi.org/10.7554/eLife.31955