Whole genome comparison of a large collection of mycobacteriophages reveals a continuum of phage genetic diversity
Abstract
The bacteriophage population is large, dynamic, ancient, and genetically diverse. Limited genomic information shows that phage genomes are mosaic, and the genetic architecture of phage populations remains ill-defined. To understand the population structure of phages infecting a single host strain, we isolated, sequenced, and compared 627 phages of Mycobacterium smegmatis. Their genetic diversity is considerable, and there are 28 distinct genomic types (clusters) with related nucleotide sequences. However, amino acid sequence comparisons show pervasive genomic mosaicism, and quantification of inter-cluster and intra-cluster relatedness reveals a continuum of genetic diversity, albeit with uneven representation of different phages. Furthermore, rarefaction analysis shows that the mycobacteriophage population is not closed, and there is a constant influx of genes from other sources. Phage isolation and analysis was performed by a large consortium of academic institutions, illustrating the substantial benefits of a disseminated, structured program involving large numbers of freshman undergraduates in scientific discovery.
https://doi.org/10.7554/eLife.06416.001eLife digest
Viruses are unable to replicate independently. To generate copies of itself, a virus must instead invade a target cell and commandeer that cell's replication machinery. Different viruses are able to invade different types of cell, and a group of viruses known as bacteriophages (or phages for short) replicate within bacteria. The enormous number and diversity of phages in the world means that they play an important role in virtually every ecosystem.
Despite their importance, relatively little is known about how different phage populations are related to each other and how they evolved. Many phages contain their genetic information in the form of strands of DNA. Using genetic sequencing to find out where and how different genes are encoded in the DNA can reveal information about how different viruses are related to each other. These relationships are particularly complicated in phages, as they can exchange genes with other viruses and microbes.
Previous studies comparing the genomes—the complete DNA sequence—of reasonably small numbers of phages that infect the Mycobacterium group of bacteria have found that the phages can be sorted into ‘clusters’ based on similarities in their genes and where these are encoded in their DNA. However, the number of phages investigated so far has been too small to conclude how different clusters are related. Are the clusters separate, or do they form a ‘continuum’ with different genes and DNA sequences shared between different clusters?
Here, Pope, Bowman, Russell et al. compare the individual genomes of 627 bacteriophages that infect the bacterial species Mycobacterium smegmatis. This is by far the largest number of phage genomes analyzed from a single host species. The large number of genomes analyzed allowed a much clearer understanding of the complexity and diversity of these phages to be obtained. The isolation, sequencing and analysis of the hundreds of M. smegmatis bacteriophage genomes was performed by an integrated research and education program, called the Science Education Alliance Phage Hunters Advancing Genomics and Evolutionary Science (SEA-PHAGES) program. This enabled thousands of undergraduate students from different institutions to contribute to the phage discovery and sequencing project, and co-author the report. SEA-PHAGES therefore shows that it is possible to successfully incorporate genuine scientific research into an undergraduate course, and that doing so can benefit both the students and researchers involved.
The results show that while the genomes could be categorized into 28 clusters, the genomes are not completely unrelated. Instead, a spread of diversity is seen, as genes and groups of genes are shared between different clusters. Pope, Bowman, Russell et al. further reveal that the phage population is in a constant state of change, and continuously acquires genes from other microorganisms and viruses.
https://doi.org/10.7554/eLife.06416.002Introduction
Bacteriophages are the dark matter of the biological universe, forming a vast, ancient, dynamic, and genetically diverse population, replete with genes of unknown function (Pedulla et al., 2003). Phages are the most abundant organisms in the biosphere, and the ∼1031 tailed phage particles participate in ∼1023 infections per second on a global scale, with the entire population turning over every few days (Suttle, 2007). The population is not only vast and dynamic, but comparisons of virion structures suggest that it is also extremely old (Krupovic and Bamford, 2010). It is thus not surprising that bacteriophages are genetically highly diverse, although their comparative genomics has lagged behind that of other microbes, largely due to the lack of individual isolates for genomic analyses (Hatfull and Hendrix, 2011). To date, there are approximately 2000 completely sequenced bacteriophage genomes in the GenBank database, a small number relative to the more than 30,000 sequenced prokaryotic genomes (http://www.ncbi.nlm.nih.gov/genome/browse/), in spite of phage genomes being only 1–5% of the size of their host genomes.
Double-stranded DNA tailed phages are proposed to have evolved with common ancestry but with different phages having differential access to a large common gene pool (Hendrix et al., 1999). Phage genomes are typified by their mosaic architectures generated by gene loss and gain through horizontal genetic exchange; however, the parameters influencing access to the common gene pool are numerous and likely include host range, genome size, replication mode, and life style (temperate vs lytic). Migration to new hosts is probably common, but is affected by local host diversity and mutation rates, as well as resistance mechanisms such as receptor availability, restriction, CRISPRs, and abortive infection systems (Buckling and Brockhurst, 2012; Jacobs-Sera et al., 2012; Hoskisson et al., 2015). Constraints on gene acquisition may also be imposed by synteny—particularly among virion structural genes—and by size limits of DNA packaging (Juhala et al., 2000; Hatfull and Hendrix, 2011).
We have previously described comparative analyses of modest numbers of mycobacteriophages and shown that they can be sorted by nucleotide sequence and gene content comparisons into groups of closely related genomes referred to as ‘clusters’ (designated Cluster A, B, C, etc.); phages without any close relatives are referred to as ‘singletons’. Some of the clusters can be further divided into subclusters (e.g., Subcluster A1, A2, A3, etc.) according to nucleotide sequence relatedness (Pedulla et al., 2003; Hatfull et al., 2006, 2010; Pope et al., 2011b). The genomes are mosaic whereby individual phages are constructed as assemblages of modules, many of which are single genes (Pedulla et al., 2003). Each mycobacteriophage cluster has features particular to that cluster (e.g., regulatory systems, repeated sequences, tRNA genes, etc. [Pope et al., 2011a, 2011b, 2013, 2014a, 2014b]), but because of the pervasive mosaicism, the relationships among phages within clusters and between clusters are complex. Collections of phages have been isolated on other hosts such as Bacillus spp., Escherichia coli, Pseudomonas spp., Propionibacterium spp. and Staphylococcus spp. (Kwan et al., 2005, 2006; Kropinski et al., 2007; Marinelli et al., 2012; Hatfull et al., 2013; Grose and Casjens, 2014; Grose et al., 2014; Lee et al., 2014) and these can be similarly divided into clusters based on DNA similarity. Recent analysis of 337 phages infecting 31 bacterial species within the Enterobacteriaceae (Grose and Casjens, 2014) reveals 56 clusters of phage genomes. It is thus clear that there is substantial diversity within the phage population, even when comparing phages of a common host and which are expected to be in direct genetic contact with each other in their natural environment (Hatfull and Hendrix, 2011). Nonetheless, the numbers of genomes isolated on a particular host generally are too small to define the nature and the size of the populations at large with any substantial resolution.
Viral metagenomic studies provide valuable insights into phage diversity and population dynamics, but typically generate few complete genome sequences or any specific information relating viral genomes to specific bacterial hosts (Hambly and Suttle, 2005; Rodriguez-Brito et al., 2010; Mokili et al., 2012). A recent analysis of Synechococcus phages using metagenomic analysis coupled with viral tagging showed that there are multiple ‘populations’ of these phages (similar to the clusters described above), but suggested that these represent distinct groups of related phages rather than a continuous spectrum of diversity (Deng et al., 2014). This differs from prior predictions that the phage population as a whole likely spans a continuum of diversity—albeit with uneven representation of different groups of related phages—because of genomic mosaicism (Hendrix, 2003; Hatfull, 2010, 2012). However, as the Synechococcus phage data are derived from a single sample using a single host, it is unclear if this extends to phages of other hosts (Deng et al., 2014).
Here we describe the comparative analysis of a large number of completely sequenced mycobacteriophage genomes and demonstrate that they represent a spectrum of diversity and do not constitute discrete populations. Rarefaction analyses of their constituent genes are consistent with populations of gene families shared among mycobacteriophages being augmented by the introduction of new gene families from outside sources. The assembling of a large and highly informative collection of bacteriophages by a consortium of students and faculty at multiple institutions demonstrates that a course-based research experience (CRE) can be successfully implemented at large scale without compromising the authenticity or richness of a scientific investigation imbued with discovery and project ownership.
Results and discussion
A genome-by-genome approach to defining phage diversity
Exploring phage diversity using a genome-by-genome approach has notable advantages and some potential disadvantages. The main advantage is that complete genome sequences give information about genome length and composition, providing key insights into genome mosaicism and how genome segments are shared and exchanged. A difficulty is that there are not large extant phage collections available for most bacterial hosts, and isolation, purification, and characterization of phages can be slow and time-consuming. Because isolation typically requires plaque formation and growth in the laboratory, some naturally occurring phages may escape isolation using standard methods. Thus, although the diversity of phages isolated and propagated in the laboratory may not capture all types of phage, it represents a minimum, not a maximum, index of diversity.
Authentic research in a CRE
The 2012 report from the President's Council of Advisors on Science and Technology (PCAST) focused on the poor retention of undergraduate students in science, technology, engineering and mathematics (STEM) as an impediment to meeting US economic demands (PCAST, 2012). One of the PCAST recommendations is to replace traditional introductory laboratory courses with research-based experiences that would inspire freshman students and promote STEM retention. A powerful strategy is to engage students in scientific discovery through CREs. The successful implementation of this strategy depends on (i) identifying research questions that can engage students in contributing genuine advances in scientific knowledge without requiring prior expert knowledge, and (ii) designing the project so that large numbers of students can participate in a meaningful fashion.
We have previously described the Howard Hughes Medical Institute (HHMI) Science Education Alliance Phage Hunters Advancing Genomics and Evolutionary Science (SEA-PHAGES) program, in which beginning undergraduate students isolate, purify, sequence, annotate, and compare bacteriophages, and have described its educational advantages (Jordan et al., 2014). By taking advantage of the massive diversity of the phage population so that each student can isolate a unique phage, the program encourages student ownership of their science. And because the collective discoveries by many students generate new scientific insights, the program creates a scientific community of students engaged in authentic research.
The SEA-PHAGES program has contributed to the growth of the collection of sequenced mycobacteriophages to nearly 700 individual isolates (http://phagesdb.org), of which 627 were selected for a detailed analysis (Supplementary file 1). This is by far the largest collection of sequenced phage genomes for any single host and thus promises to substantially advance our understanding of phage diversity. The phages were isolated using either direct plating or by enrichment using Mycobacterium smegmatis mc2155 as a host, and sequenced using next-generation approaches (see ‘Materials and methods’). More than 5000 students—primarily freshmen—at 74 institutions have been involved since inception of the SEA-PHAGES program in 2008, and the phages isolated represent a broad geographical distribution (Figure 1) and a variety of viral morphotypes (http://phagesdb.org). The new insights gained from comparative genomic analyses of these phages—as described below—demonstrate the effectiveness of viral discovery and genomics as a model for CRE development.
Assembling mycobacteriophages into clusters and subclusters
Using previously reported parameters based primarily on nucleotide sequence similarity spanning >50% genome length (Hatfull et al., 2006), the 627 genomes were assembled into 20 clusters (A–T) and eight singletons (with no close relatives) (Figure 2, Supplementary file 1); 11 clusters were subdivided into 2 to 11 subclusters (Table 1). There is considerable variation in cluster size with substantial differences in the numbers of genomes in each cluster (2–232), but there is relatively little variation in either genome length or the numbers of genes per genome in any given cluster (Table 1). Cluster assignment is of practical utility and is generally robust, with clustered phages typically sharing genome architectures, as noted for the Enterobacteriacea (Grose and Casjens, 2014). For example, Cluster A phages are similar in size and transcriptional organization, and share an unusual immunity system (Brown et al., 1997; Pope et al., 2011b). Cluster M phages all contain large numbers of tRNA genes (Pope et al., 2014a), Cluster K (Pope et al., 2011a) and Cluster O (Cresawn et al., 2015) phages have different but characteristic repeated sequences, and Cluster J phages have an unusual capsid with a triangulation (T) number of 13 (Pope et al., 2013). Therefore, the organization of related mycobacteriophages into clusters provides a framework for identifying and interpreting gene trafficking within and among potentially distinct groups of genomes.
-
Figure 2—source data 1
- https://doi.org/10.7554/eLife.06416.005
Gene content relationships among sequenced mycobacteriophages
Genome mosaicism is more apparent from comparison of gene product amino acid sequences than nucleotide sequence comparisons because of the accumulation of genome rearrangements over a longer period of evolution, during which indications of DNA similarity are lost. To compare mycobacteriophage gene contents we grouped related genes into protein families (‘phamilies’ or ‘phams’) using Phamerator (Cresawn et al., 2011), which we modified to use kClust (Hauser et al., 2013) so as to easily accommodate the large numbers of comparisons. The 69,633 genes assembled into 5205 phams of which 1613 (31%) are orphams (single-gene phamilies [Hatfull et al., 2010]). Approximately 25% of phams can be assigned functions in viral structure and assembly, DNA metabolism, integration, lysis, and regulation, but the vast majority are of unknown function. Representation of gene content relationships among all 627 phages as a network phylogeny reveals relationships that are in accord with the cluster and subcluster designations derived from nucleotide sequence comparisons (Figure 3). The multiple branches between clusters/subclusters reflect the phylogenetic complexities that arise from genome mosaicism, where genes within a genome have distinct evolutionary histories.
-
Figure 3—source data 1
- https://doi.org/10.7554/eLife.06416.011
The distribution of orphams (genes without mycobacteriophage homologues) provides additional support for cluster/subcluster assignments; Figure 4). A relatively high proportion of orphams is a characteristic of both singleton genomes and single-genome subclusters (Figure 4). At least 30% of genes in all of the singleton genomes are orphams, and the single-genome subclusters have a minimum of 15% orphams; genomes in other clusters and subclusters typically have fewer than 10% orphams (Figure 4). The presence of numerous orphams ensures that the lack of cluster inclusion did not result from sequence errors or insufficient or inappropriate gene annotation. Notable exceptions are Predator (Subcluster H1) and Mendokysei (Cluster T), both of which are in very small clusters/subclusters, and KayaCho (Subcluster B4). KayaCho may warrant separation into a new subcluster (e.g., B6), but overall the orpham distribution is consistent with the cluster/subcluster designations.
-
Figure 4—source data 1
- https://doi.org/10.7554/eLife.06416.013
The diversity of different clusters is highly varied
To determine the extent to which the various clusters/subclusters represent discrete groups, we generated a heat map showing pairwise shared gene content (Figure 5) and quantified the cluster/subcluster diversity (Table 1, Figure 6). The heat map strikingly illustrates that diversity is non-uniform, with genomes in some clusters (e.g., Subclusters B1, C1) being very closely related, whereas in others they display substantial differences (e.g., Subclusters A1, F1). The variation is also evident within the large Cluster A group, with some subclusters having low diversity (e.g., A4, A5, A6), some being highly diverse (e.g., A1, A2), and some plausibly further splitting into subgroups (A3) (Figure 5).
-
Figure 5—source data 1
- https://doi.org/10.7554/eLife.06416.017
-
Figure 6—source data 1
- https://doi.org/10.7554/eLife.06416.019
We quantified the cluster diversity using three different measures, CLuster Average Shared Phamilies (CLASP), Cluster Associated Phamilies (CAP), and Cluster Cohesion Index (CCI) (Tables 1, 2, Figure 6A). Both CAP (the number of phams present in all genomes within a cluster divided by the average number of genes per genome) and CCI (the average number of genes per genome as a percentage of the total number of phams in that cluster) show substantial variation between clusters (Table 1, S2), and little evidence for commonly conserved ‘core genes’, as suggested for T4-related phages (Petrov et al., 2010). However, both of these parameters are somewhat influenced by cluster/subcluster size, which varies from cluster to cluster. In contrast, CLASP (the percentage of phamilies shared between two genomes, then averaged across all possible pairs within a cluster or subcluster) is relatively insensitive to cluster/subcluster size (as seen by a resampling analysis; Figure 6—figure supplement 1), but still shows substantial variation from one cluster to another (Table 1, Figure 6A).
The discreteness of different clusters is highly varied
The heat map of genome comparisons (Figure 5) also illustrates the degrees to which clusters and subclusters share gene content, a reflection of cluster discreteness, or how isolated discrete clusters are from each other. For example, although the Cluster A phages are highly diverse, they also appear relatively isolated and share relatively few genes with other clusters (Figure 5). In contrast, phages in Cluster E share substantial numbers of genes with other clusters, including those in Clusters F, J, L, P, and several singletons. We have quantified these relationships with the Cluster Isolation Index (CII, the percentage of phams present within a cluster that are not present in other mycobacteriophage genomes), which demonstrates the considerable variation in isolation from phages of other clusters/subclusters (Table 1, Figure 6B). For example, at one extreme, 84.6% of Cluster C gene phamilies are found only in Cluster C and not elsewhere. At the other extreme, only 23.8% of Cluster I gene phamilies are constrained to that cluster, with the remainder having relatives present in genomes in other clusters. Other clusters form a spectrum of relationships between these extremes (Table 1, Figure 6B), and clusters such as I and P—which share recognizable DNA sequence similarity (Figure 2—figure supplement 1)—share >60% of their genes with other phages (low CII values; Table 1). Thus, although some clusters could be considered as discrete groups—as reported for the Synechococcus phages (Deng et al., 2014)—this is far from being a universal or characteristic feature of groups of related phages.
Cluster isolation analyses reveal additional complexities arising from highly mosaic genomes. For example, the singleton Dori is clearly related to Cluster B phages (Figure 3) with which it shares limited DNA similarity (Figure 2—figure supplement 2) with 20–26% of its genes (Figure 4—figure supplement 1), but also has nucleotide similarity and shares genes with Cluster N and I2 phages, among others (Figure 2—figure supplement 2, Figure 4—figure supplement 1), as reflected in its low CII (Table 1, Figure 6B). Likewise, the singleton MooMoo has segments of DNA similarity and shares ∼20% of its gene content (as determined by shared phams) with Cluster F phages (Figure 3, Figure 2—figure supplement 3, Figure 4—figure supplement 1), but also has similarity to Clusters N and I, as well as a low CII (Table 1, Figure 6B). It has low DNA similarity to Cluster O (Figure 2—figure supplement 3), but has several phams in common with the Cluster O phages, and has the same unusual prolate morphology (Figure 3). Complex relationships are also seen in the singletons Gaia and Sparky (Figure 4—figure supplement 2).
Taken together, the analyses of both cluster diversity and cluster isolation show that mycobacteriophage populations contain a continuum of diversity, with non-uniform abundance of different types of phages. The prevalence of isolated phages may not necessarily reflect the proportions of different types of phages in the environment, but the availability of a large collection of isolated phages enables capture and whole genome analysis of relatively rare phages that are critical to understanding the complexities of genome relationships. We recently reported genomic analysis of the singleton mycobacteriophage Patience, which has a substantially lower GC% than its host (50.3% vs 67.4%), has a different codon usage profile, but is undergoing codon selection for growth in a high GC% environment (Pope et al., 2014b). If there is a flux of phage genomes and genes entering the mycobacterial neighborhood, then we predict that the phages of a single host do not reflect a closed system with discrete populations, but one that is open with ever-expanding diversity.
The mycobacteriophage population is not a closed system
Both the huge diversity of phamilies in mycobacteriophages and the high frequency of orphams suggest that genes are constantly added to phage genomes from outside sources just as genes are added to the genomes of their bacterial hosts via horizontal gene transfer. Such gene influx—for example, from host-jumping phages such as Patience (Pope et al., 2014b)—would provide genetic novelty and enable phages to adapt to their ever-changing hosts. To examine gene flux into the mycobacteriophage population, we performed a rarefaction analysis by re-sampling the gene phamilies within the phage population (Figure 7). Remarkably, the rarefaction curves of the entire collection—including the 95% confidence limits—do not fit a hyperbola as would be expected if the mycobacteriophages were limited to an isolated set of genes, and about 2.5 new gene phamilies are predicted to be identified with each newly isolated phage (Figure 7A). Similar independent analyses on the phages of Cluster A or the phages of Cluster B show that this is also observed within these clusters (Figure 7B,C). Thus both individual clusters and the collection as a whole are not genetically fixed, but are in constant flux. While a hyperbola can model sampling of gene phamilies from a finite pool, it does not accommodate the influx of new phamilies. The addition of a linear term (see ‘Materials and methods’), representing the introduction of new phamilies from outside sources, results in a non-asymptotic curve which predicts the continual identification of new phams even after large numbers of genomes have been sampled (R > 0.999; Figure 7D). This linear term acts as a surrogate for the linear range of a second hyperbolic curve, one representing the resampling of a much larger set of gene phamilies available for introduction into mycobacteriophage genomes. Unfortunately, the current dataset remains insufficient to confidently extrapolate to give an estimate of the total number of viral protein families in the biosphere, which has been previously estimated to be anywhere between a half a million and 2 billion (Rohwer, 2003; Ignacio-Espinoza et al., 2013).
-
Figure 7—source data 1
- https://doi.org/10.7554/eLife.06416.024
We note that because of the generally slow pace of the advancement of phage genomics, we have little insight into the phage populations of other hosts. We retrieved all double-stranded DNA tailed phage genomes in GenBank that we could identify (a total of 1781), corresponding to about 120 host bacterial genera, with a median number of phages per host genus of two. Using similar parameters for pham building as described above, the 181,717 predicted genes assemble into 47,479 phamilies. The relatively low representation of each phamily (3.8 genes/phamily) compared to the mycobacteriophages (13.4 genes/phamily) is a further reflection of the gross under-sampling of the phage population as a whole.
Implications for bacteriophage taxonomy
Bacteriophage taxonomic classification reflecting phylogeny presents substantial challenges because of genome mosaicism (Lawrence et al., 2002). Classification by viral morphology is well established, but may not accurately reflect the genetic relationships, as illustrated for the prolate-headed MooMoo (Figure 3). We also note that the mycobacteriophage myoviruses have a high CII and form a discrete group (Table 1) as do the Synechococcus myophages (Deng et al., 2014), perhaps reflecting a virulent lifestyle that constrains productive gene exchange; T4-related phages from diverse hosts share a core set of 15–20% of their genes, and whole genome comparisons reveal extensive mosaicism (Petrov et al., 2010). Host range mutability thus may differ in phages with different morphotypes, limiting access to the gene pool, and although grouping phages into clusters and subclusters provides analytical advantages because of the wide range in prevalence of different phages (Table 1), it is not suitable as a broadly applicable hierarchical taxonomic system. The comparative analysis of these mycobacteriophages thus supports reticulate taxonomies that more accurately reflect the phylogenetic complexities (Lawrence et al., 2002; Lima-Mendez et al., 2007).
Implications for student learning through research experiences
A research experience can be a powerful vehicle that enables a person to gain an understanding of the process of science (Hunter et al., 2007). When the research experience occurs early and at a large scale, as described here, the focus can shift from selecting a few ‘qualified’ students to exploring the potential interests of many students. Clearly, an essential ingredient is the nature of the research project, as definitions of research may vary from an inquiry-based exercise to authentic research with the potential to contribute publishable findings. To optimize the educational benefits, the research project must be intellectually and technically accessible to beginning students (i.e., few prerequisites) and scalable so that many students can simultaneously make progress in parallel, yet independently (Hatfull et al., 2006). Importantly, each student's findings should contribute to a scientific question with integration of all students' discoveries advancing a scientific question of significance, as judged by scientific peer review. This, we believe, defines an ‘authentic’ research experience. We note that in the SEA-PHAGES platform, substantial student effort is invested in arriving at high-quality genome annotations by close manual inspection followed by expert verification, a critical component of the detailed comparative analysis of phage gene content described here.
Concluding comments
Bacteriophage genomics has progressed relatively slowly compared to that of other microbes in spite of their relatively small genome sizes. Here we have demonstrated that programmatically integrating the research and education missions at large scale provides an effective solution to expanding our knowledge of viral diversity, with a multitude of insights gained as a consequence of the scale of phage discovery. The nature of different genomic types, the variations of the diversity both within clusters and shared genome content among clusters, and the expanse of the mycobacteriophage population can be viewed at an unprecedented level of resolution. Our conclusions align well with comparative analyses of phages of Enterobacteriacea (Grose and Casjens, 2014) and Bacillus spp. (Grose et al., 2014) and we predict that these are general parameters of bacteriophage diversity, at least when sampling broadly across the environment. Both the rarefaction analysis described here and preliminary analysis of phamilies of all sequenced DNA phages illustrate how little of the global phage population has been genomically sampled. With a near endless supply of diverse viruses readily accessible for isolation and analyses, integrated research/education programs will continue to play substantial roles in defining the nature of the virosphere.
Materials and methods
Phages and genomes
View detailed protocolIn addition to extant GenBank sequence information, mycobacteriophages were isolated, sequenced, and annotated in the Phage Hunters Integrating Research and Education (PHIRE) or SEA-PHAGES programs. Phage genomes were shotgun sequenced using either 454, Ion Torrent, or Illumina platforms to at least 20-fold coverage. Shotgun reads were assembled de novo with Newbler versions 2.1 to 2.9. Assemblies were checked for low coverage or discrepant areas, and targeted Sanger reads were used to resolve weak areas and identify genome ends. All genome sequences are publically available at phagesDB.org or in GenBank. Nucleotide comparisons used BLASTN or Gepard (Krumsiek et al., 2007).
Database construction
View detailed protocolTo create Phamerator database Mykobacteriophage_627, phamilies were constructed by first clustering the entire database of 69,574 genes using strict kClust parameters (70% clustering threshold and 0.25 alignment coverage of the longer sequence). This was followed by multiple sequence alignment of each preliminary cluster using Kalign (Lassmann and Sonnhammer, 2005). Consensus sequences were then extracted using HHmake and HHconsensus (Remmert et al., 2012). The resulting list of sequences was subjected to a second—and less strict—round of clustering via kClust (30% clustering threshold and 0.5 alignment coverage of the longer sequence) to obtain the final phamily assignments.
Network phylogeny constructions were made using the NeighborNet function with default parameters in SplitsTree (Huson, 1998; Huson and Bryant, 2006).
Cluster diversity and isolation indices
View detailed protocolFour parameters were used to evaluate cluster diversity. The first is the CLASP index that calculates the percentage of phamilies shared between two genomes, then averages across all possible pairs within a cluster or subcluster. Because the pairwise similarities are averaged, CLASP is relatively insensitive to either the overall size of the cluster, or the heterogeneity of its diversity (such as in Cluster C in which of the 45 genomes in total, 44 are in Cluster C1, and only one is in Cluster C2). CLASP robustness with respect to cluster size was demonstrated through a resampling analysis. For each cluster with more than 30 members, a random subset (of 5, 10, 20, or 30 genomes) was selected and CLASP was calculated. For each sample size, 20 iterations were performed with replacement. As expected, there is substantial deviation among the iterations, especially at smaller sizes. However, there is little change in the average CLASP values with different sample sizes (Figure 4—figure supplement 1), showing that cluster size is not a primary driver of diversity. The resampling analyses also suggest that while a greater number of genomes helps refine the CLASP value, there is still predictive power when only 10 genomes are compared. On average, the maximum and minimum iteration values at a sample size of 10 genomes were within 8% of the whole-cluster CLASP value. This implies that, for example, increasing Cluster D from 10 to 50 or 100 genomes may raise or lower its current CLASP value of 88.1, but that value is likely to remain between ∼80 and ∼96.
The second measure used is the CAP, which is calculated as the number of phamilies present in all genomes within a cluster divided by the average number of phamilies per genome. These cluster-conserved genes could correspond to core genes that define a particular phage group such as cluster or subcluster. However, for those clusters with sufficient diversity to detect such core genes, these values are low. For example, among the 66 Cluster F genomes, only five phamilies are present in all genomes. None are virion structural genes, one is a glycosyltransferase whose role is unknown, one is a putative regulator, and the others are small proteins of unknown function. For the Cluster A genomes, 11 phamilies are conserved, seven of which are virion structural proteins, three are involved in DNA metabolism (DNA Pol, Helicase, Rec-Like protein), and one is of unknown function.
The third parameter is the Cluster Phamily Variation (CPV) index, which is the proportion of phams that are not present in all members of the cluster. CAP and CPV are inversely related but imperfectly as CPV varies with cluster size even among similarly diverse clusters; a plot of CAP values against CPV values is shown in Figure 6—figure supplement 2.
The CCI is calculated as the average number of genes per genome as a percentage of the total number of phams in that cluster. Thus if all genomes in a cluster are identical (and if phamilies occur only once in a genome), CCI would be 100; the CCI for two sets of five randomly chosen genomes is ∼2. CCI values correlate with cluster size, but similarly sized clusters as such G, J, and L, or E and K have substantially different CCI values (Table 1).
The CII is the percentage of phams present within a cluster that are not present in other mycobacteriophage genomes.
Rarefaction analysis
View detailed protocolRarefaction analysis was performed by randomly selecting subsets (without replacement) of between 1 and 627 (all), 232 (Cluster A) or 108 (Cluster B) mycobacteriophages and determining the numbers of phamilies represented. This was repeated 10,000 times to generate a mean number of phamilies observed given a number of phage genomes selected. The means of the accumulated numbers of phams and the numbers of new phages identified are plotted as the function of the number of genomes selected at random. The observed numbers were fit to a hyperbolic function for 50% of the sample (i.e., 1 to 314, 116 or 54 genomes for all, Cluster A or Cluster B phages, respectively); Hanes-Woolf regression was used to estimate PhamMax and Km of the hyperbola:
where NGenomes is the number of genomes sampled, NPhams is the number of total phams seen within those genomes, PhamMax is the total number of phams among all mycobacteriophage genomes, and Km is the number of genomes required to sample one half of PhamMax.
The lack of fit of the observed data to the hyperbola—with the observed data reflecting infinite size—suggests that the overall population is dynamic. The lack of hyperbolic fit of the data does not result from outliers such as phages with highly deviant GC%, because removing these does not improve the fit. The fit is also not substantially improved by analysis of the two largest clusters, Cluster A and Cluster B (Figure 7), suggesting that the dynamic nature of the gene pool is not an artifact of examining independent phage clusters with separate gene pools.
To model this behavior, we modified Equation 1 to include the introduction of novel phams via recombination with outside, non-mycobacteriophage genomes:
where CPhage is the number of outside phams seen in each phage. The value of CPhage was estimated from Figure 7B and new values for PhamMax and KPham were estimated by Hanes-Woolf regression following data normalization.
Data availability
-
Mycobacteriophage databasePublicly available at the Mycobacteriophage Database (http://phagesdb.org).
-
Mykobacteriophage_627Accessed via the program Phamerator, publicly available at http://phagesdb.org/Phamerator/faq/.
References
-
Transcriptional silencing by the mycobacteriophage L5 repressorThe EMBO Journal 16:5914–5921.https://doi.org/10.1093/emboj/16.19.5914
-
Bacteria-virus coevolutionAdvances in Experimental Medicine and Biology 751:347–370.https://doi.org/10.1007/978-1-4614-3567-9_16
-
The viriosphere, diversity, and genetic exchange within phage communitiesCurrent Opinion in Microbiology 8:444–450.https://doi.org/10.1016/j.mib.2005.06.005
-
Mycobacteriophages: genes and genomesAnnual Review of Microbiology 64:331–356.https://doi.org/10.1146/annurev.micro.112408.134233
-
The secret lives of mycobacteriophagesAdvances in Virus Research 82:179–288.https://doi.org/10.1016/B978-0-12-394621-8.00015-7
-
Bacteriophages and their genomesCurrent Opinions in Virology 1:298–303.https://doi.org/10.1016/j.coviro.2011.06.009
-
Comparative genomic analysis of 60 mycobacteriophage genomes: genome clustering, gene acquisition, and gene sizeJournal of Molecular Biology 397:119–143.https://doi.org/10.1016/j.jmb.2010.01.011
-
Complete genome sequences of 63 mycobacteriophagesGenome Announcements 1:.https://doi.org/10.1128/genomeA.00847-13
-
Bacteriophage genomicsCurrent Opinion in Microbiology 6:506–511.https://doi.org/10.1016/j.mib.2003.09.004
-
Evolutionary relationships among diverse bacteriophages and prophages: all the world's a phageProceedings of the National Academy of Sciences of USA 96:2192–2197.https://doi.org/10.1073/pnas.96.5.2192
-
Application of phylogenetic networks in evolutionary studiesMolecular Biology and Evolution 23:254–267.https://doi.org/10.1093/molbev/msj030
-
The global virome: not as big as we thought?Current Opinion in Virology 3:566–571.https://doi.org/10.1016/j.coviro.2013.07.004
-
Genomic sequences of bacteriophages HK97 and HK022: pervasive genetic mosaicism in the lambdoid bacteriophagesJournal of Molecular Biology 299:27–51.https://doi.org/10.1006/jmbi.2000.3729
-
Salmonella phages and prophages–genomics and practical aspectsMethods in Molecular Biology 394:133–175.https://doi.org/10.1007/978-1-59745-512-1_9
-
Order to the viral universeJournal of Virology 84:12476–12479.https://doi.org/10.1128/JVI.01489-10
-
Comparative genomic analysis of 18 Pseudomonas aeruginosa bacteriophagesJournal of Bacteriology 188:1184–1187.https://doi.org/10.1128/JB.188.3.1184-1187.2006
-
The complete genomes and proteomes of 27 Staphylococcus aureus bacteriophagesProceedings of the National Academy of Sciences of USA 102:5174–5179.https://doi.org/10.1073/pnas.0501140102
-
Imbroglios of viral taxonomy: genetic exchange and failings of phenetic approachesJournal of Bacteriology 184:4891–4905.https://doi.org/10.1128/JB.184.17.4891-4905.2002
-
Metagenomics and future perspectives in virus discoveryCurrent Opinion in Virology 2:63–77.https://doi.org/10.1016/j.coviro.2011.12.004
-
Viral and microbial community dynamics in four aquatic environmentsThe ISME Journal 4:739–751.https://doi.org/10.1038/ismej.2010.1
-
Marine viruses–major players in the global ecosystemNature Reviews Microbiology 5:801–812.https://doi.org/10.1038/nrmicro1750
Article and author information
Author details
Funding
Howard Hughes Medical Institute (HHMI) (54308198)
- Graham F Hatfull
National Institutes of Health (NIH) (GM51975)
- Graham F Hatfull
Howard Hughes Medical Institute (HHMI) (52007054)
- Graham F Hatfull
Brigham Young University
- Sandra Burnett
Cabrini College
- David Dunbar
National Institutes of Health—INBRE (GM103408)
- R Luke Daniels
National Science Foundation (NSF)
- John Dennehy
Queens College
- John Dennehy
Lehigh University
- Vassie Ware
Merrimack College
- Janine LeBlanc-Straceski
National Institutes of Health (NIH) (GM094712)
- Nicanor Austriaco
National Institutes of Health—INBRE (GM103430)
- Kathleen Cornely
Davis Foundational Grant
- Kathleen Cornely
Providence College
- Kathleen Cornely
St. Joseph's University
- Christina King Smith
University of Houston, Downtown
- Rachna Sadana
University of Maine, Honors College
- Keith Hutchinson
National Institutes of Health (NIH) (GM1003423)
- Keith Hutchinson
Howard Hughes Medical Institute (HHMI)
- Michael Rubin
- Kirk Anders
University of Puerto Rico
- Michael Rubin
University of Wisconsin, River Falls
- Karen Klyczek
Western Kentucky University
- Claire Rinehart
Gatton Academy of Science and Mathematics
- Rodney King
Georgia College
- Indiren Pillay
Del Mar College
- John Hatherill
Miami University
- Iddo Friedberg
National Science Foundation (NSF) (DUE-1205059)
- John Hatherill
National Science Foundation (NSF) (ABI-1146960)
- Iddo Friedberg
Howard Hughes Medical Institute (HHMI) (52007572)
- SK Ireland
Doris Duke Charitable Foundation
- Michelle Larsen
Gonzaga University
- Kirk Anders
National Science Foundation (NSF) (DUE-1245778)
- Kirk Anders
David Asai, Kevin Bradley, and Lucia Barker (formerly) are (or were) employees of Howard Hughes Medical Institute who also provided support for the SEA-PHAGES and PHIRE programs. DA, KB, and LB contributed to the design of the programs and the systems for data collection.
Acknowledgements
Students, faculty, and their contributions to authorship are listed in the Supplementary file 2. We thank Aileen Beard, Gerald Henkel-Johnson, and Larry McGahey at the College of St. Scholastica, the Core Facility for Imaging, Cellular and Molecular Biology at Queens College, Jennifer Kelly and Towanda Kirksey-Stanton at Jacksonville State Univeristy, Susan Crump at Merrimack College and Dr Gregory Hendricks at the University of Massachusetts Medical School Electron Microscopy Imaging Facility, Melissa Cox at North Carolina State University and Valerie Lapham at the NCSU Center for Electron Microscopy, Dr Karen M Snetselaar at Saint Joseph's University, and Rick Ellingworth at the University of Wisconsin-River Falls for excellent technical assistance. We also thank Drs Winston Anderson and Broderick Eribo for their roles as consultants at Howard University, and Dr R Edelmann and the Miami University Center for Advanced Microscopy and Imaging for their support and assistance with electron microscopy. We also thank John Morrell, Alicia Brighton, Joshua Fisher, Michael Shelfo, Brigham Wright, Jessica Engle, Brian Early, Kyle Smith, Kyler Haskell, Tambi Issac, Bryce Lunt, David Payne II, Lissenya Argueta, Bryan Merrill, Adam Gardner, Hailey Meadows, Adam Hansen, and Marshall Sheide for contributions to phage isolation.
This work was supported in part by the Howard Hughes Medical Institute SEA-PHAGES program, by the Howard Hughes Medical Institute through its Professorship grant to GFH, and by NIH grant GM51975 to GFH. Additional support was provided by the Department of Microbiology and Molecular Biology and the BYU College of Life Sciences; Cabrini College; NIH Grant No. P20 GM103408 to the College of Idaho; the National Science Foundation grant 0703449, the CUNY LSAMP program, the Office of the Provost, the Division of General Education, the Division of Mathematics and Natural Sciences, the Queens College UM/RE program and the Biology Department at Queens College; the Department of Biological Sciences at Lehigh University; the Center for Biotechnology and Biomedical Sciences of the Department of Biology in the School of Science and Engineering at Merrimack College; the NCSU Biotechnology Program and Department of Microbiology; a Davis Foundation grant, the Providence College Undergraduate Research Committee, RI-INBRE and NIGMS grant R15-GM094712 to Providence College; the Department of Biology, Saint Joseph's University; the Natural Science Department at the University of Houston-Downtown; the University of Maine Honors College, the University of Maine Department of Molecular and Biomedical Sciences, Maine-INBRE and NIH-INBRE Grant 8P20GM1003423-12; an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health (P20GM0103423) to the University of Maine at Machias; the Howard Hughes Medical Institute, RISE and BRIC Programs, the Department of Biology, and the Offices of the Academic Dean and Chancellor at the University of Puerto Rico at Cayey; the University of Wisconsin River Falls Biology Department; the Gatton Academy of Science and Mathematics in Kentucky and the Western Kentucky University Bioinformatics and Information Science Center; Georgia College STEM Initiative and a Georgia College Faculty Research Grant; NSF Grant REVISION DUE-1205059 and the Department of Natural Sciences at Del Mar College; Miami University Department of Microbiology and the College of Arts and Science Dean's office and National Science Foundation ABI award 1146960; the HHMI Scicomp Project (52007572) to Xavier University of Louisiana; the Doris Duke Foundation; the Gonzaga University Biology Department, NSF-TUES grant DUE-1245778 and HHMI Undergraduate Science Education grant to Gonzaga University; the School of Biological Sciences and the School of Molecular Biosciences at Washington State University; and the Benjamin Harris Memorial Fund through the Pittsburgh Foundation.
Copyright
© 2015, Pope et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 20,834
- views
-
- 2,500
- downloads
-
- 275
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Genetics and Genomics
Mitochondrial biogenesis requires the expression of genes encoded by both the nuclear and mitochondrial genomes. However, aside from a handful transcription factors regulating specific subsets of mitochondrial genes, the overall architecture of the transcriptional control of mitochondrial biogenesis remains to be elucidated. The mechanisms coordinating these two genomes are largely unknown. We performed a targeted RNAi screen in developing eyes with reduced mitochondrial DNA content, anticipating a synergistic disruption of tissue development due to impaired mitochondrial biogenesis and mitochondrial DNA (mtDNA) deficiency. Among 638 transcription factors annotated in the Drosophila genome, 77 were identified as potential regulators of mitochondrial biogenesis. Utilizing published ChIP-seq data of positive hits, we constructed a regulatory network revealing the logic of the transcription regulation of mitochondrial biogenesis. Multiple transcription factors in core layers had extensive connections, collectively governing the expression of nearly all mitochondrial genes, whereas factors sitting on the top layer may respond to cellular cues to modulate mitochondrial biogenesis through the underlying network. CG1603, a core component of the network, was found to be indispensable for the expression of most nuclear mitochondrial genes, including those required for mtDNA maintenance and gene expression, thus coordinating nuclear genome and mtDNA activities in mitochondrial biogenesis. Additional genetic analyses validated YL-1, a transcription factor upstream of CG1603 in the network, as a regulator controlling CG1603 expression and mitochondrial biogenesis.
-
- Genetics and Genomics
An unprecedented amount of SARS-CoV-2 data has been accumulated compared with previous infectious diseases, enabling insights into its evolutionary process and more thorough analyses. This study investigates SARS-CoV-2 features as it evolved to evaluate its infectivity. We examined viral sequences and identified the polarity of amino acids in the receptor binding motif (RBM) region. We detected an increased frequency of amino acid substitutions to lysine (K) and arginine (R) in variants of concern (VOCs). As the virus evolved to Omicron, commonly occurring mutations became fixed components of the new viral sequence. Furthermore, at specific positions of VOCs, only one type of amino acid substitution and a notable absence of mutations at D467 were detected. We found that the binding affinity of SARS-CoV-2 lineages to the ACE2 receptor was impacted by amino acid substitutions. Based on our discoveries, we developed APESS, an evaluation model evaluating infectivity from biochemical and mutational properties. In silico evaluation using real-world sequences and in vitro viral entry assays validated the accuracy of APESS and our discoveries. Using Machine Learning, we predicted mutations that had the potential to become more prominent. We created AIVE, a web-based system, accessible at https://ai-ve.org to provide infectivity measurements of mutations entered by users. Ultimately, we established a clear link between specific viral properties and increased infectivity, enhancing our understanding of SARS-CoV-2 and enabling more accurate predictions of the virus.