1. Genomics and Evolutionary Biology
  2. Microbiology and Infectious Disease
Download icon

The inherent mutational tolerance and antigenic evolvability of influenza hemagglutinin

  1. Bargavi Thyagarajan
  2. Jesse D Bloom Is a corresponding author
  1. Fred Hutchinson Cancer Research Center, United States
Research Article
Cited
35
Views
4,259
Comments
0
Cite as: eLife 2014;3:e03300 doi: 10.7554/eLife.03300

Abstract

Influenza is notable for its evolutionary capacity to escape immunity targeting the viral hemagglutinin. We used deep mutational scanning to examine the extent to which a high inherent mutational tolerance contributes to this antigenic evolvability. We created mutant viruses that incorporate most of the ≈104 amino-acid mutations to hemagglutinin from A/WSN/1933 (H1N1) influenza. After passaging these viruses in tissue culture to select for functional variants, we used deep sequencing to quantify mutation frequencies before and after selection. These data enable us to infer the preference for each amino acid at each site in hemagglutinin. These inferences are consistent with existing knowledge about the protein's structure and function, and can be used to create a model that describes hemagglutinin's evolution far better than existing phylogenetic models. We show that hemagglutinin has a high inherent tolerance for mutations at antigenic sites, suggesting that this is one factor contributing to influenza's antigenic evolution.

https://doi.org/10.7554/eLife.03300.001

eLife digest

Influenza is a major threat to human health largely because the flu virus evolves rapidly to escape recognition by the immune system. These ongoing changes also explain why flu vaccines become less effective over time and need to be reformulated every year.

Hemagglutinin is a protein on the surface of the flu virus that helps the virus bind to and infect host cells. The surface proteins of most viruses are recognized by the immune system, and influenza hemagglutinin is no exception. However, hemagglutinin is unusual in that it evolves exceptionally rapidly to avoid being recognized by the immune system. This raises an important question: what is it about the influenza hemagglutinin protein that allows it to change so readily?

Thyagarajan and Bloom address this question by making mutant copies of the gene that encodes the hemagglutinin protein. There are over 10,000 ways in which the protein can be mutated, and Thyagarajan and Bloom managed to make the vast majority of the possible changes. The mutated genes were then re-introduced into the virus, and the mutant viruses were allowed to replicate in cells for several generations.

Thyagarajan and Bloom sequenced the viruses that had replicated—which meant that the mutant copies of the hemagglutinin protein in these viruses still worked—and looked to see where in the protein the changes had occurred. Those regions that rarely changed included the part of the protein that binds to host cells, whereas other regions—especially those that are recognized by the immune system—were much more likely to contain mutations. Thyagarajan and Bloom then went on to show that not all influenza proteins share hemaglutinin's capacity to change the regions targeted by the immune system, suggesting that this capacity is possibly a unique feature of this protein.

Thyagarajan and Bloom also suggest that this capacity to tolerate mutations in parts of proteins that are recognized by the immune system might be important for shaping a virus's ability to evolve to escape this recognition. Future work is now needed to see how tolerant to mutations other viral proteins are, and to reveal which properties of a protein determine its tolerance to mutations.

https://doi.org/10.7554/eLife.03300.002

Introduction

Epidemic influenza poses an annual threat to human health largely because the virus rapidly evolves to escape the immunity elicited by previous infections or vaccinations. The most potent form of anti-influenza immunity is antibodies targeting the virus’s hemagglutinin (HA) protein (Yewdell et al., 1979; Wiley et al., 1981; Caton et al., 1982). The virus evades these antibodies primarily by accumulating amino-acid substitutions in HA's antigenic sites (Smith et al., 2004; Das et al., 2013; Koel et al., 2013; Bedford et al., 2014). Remarkably, HA undergoes this rapid evolution while retaining the ability to fold to a highly conserved structure that performs two functions essential for viral replication: receptor binding and membrane fusion (Wiley and Skehel, 1987; Russell et al., 2004). HA is therefore highly ‘antigenically evolvable’ in the sense that it can accommodate rapid antigenic change without compromising its structural and functional properties.

Two factors that undoubtedly contribute to HA's rapid antigenic evolution are influenza's high mutation rate and the strong selection that immunity exerts on the virus. However, it is unclear whether these factors are sufficient to fully explain HA's antigenic evolution. For instance, while some other error-prone viruses (such as HIV and hepatitis C) also exhibit rapid antigenic evolution of their surface proteins (Burton et al., 2012), other viruses with comparable mutation rates (such as measles) show little propensity for antigenic change (Sheshberadaran et al., 1983; Duffy et al., 2008), despite the fact that evasion of immunity would presumably confer a selective benefit. A variety of explanations ranging in scale from ecological to molecular can be posited to account for these differences in rates of antigenic evolution (Lipsitch and O’Hagan, 2007; Koelle et al., 2006; Heaton et al., 2013). One hypothesis is that HA has a high inherent tolerance for mutations in its antigenic sites, thereby conferring on influenza the evolutionary capacity to escape from anti-HA antibodies with relative ease.

Testing this hypothesis requires quantifying the inherent mutational tolerance of each site in HA. This cannot be done simply by examining variability among naturally occurring viruses, since the evolution of influenza in nature is shaped by a combination of inherent mutational tolerance and external immune selection. For example, the rapid evolution of HA's antigenic sites in nature could reflect the fact that these sites are especially tolerant of mutations, or it could be purely a consequence of strong immune selection. Traditional experimental approaches using site-directed mutagenesis or serial viral passage are also inadequate to quantify inherent mutational tolerance—while such experimental techniques have been used to determine the effect of specific mutations on HA, they cannot feasibly be applied to all possible individual amino-acid mutations. Recently Heaton et al. (2013) used transposon mutagenesis to show that HA is tolerant to the random insertion of five to six amino-acid sequences at several locations in the protein. However, the relevance of this tolerance to insertional mutations is unclear, since HA's actual antigenic evolution involves almost entirely point substitutions, with only a very low rate of insertions and deletions.

Here we use the new high-throughput experimental technique of deep mutational scanning (Fowler et al., 2010; Araya and Fowler, 2011) to comprehensively quantify the tolerance of HA to amino-acid mutations. Specifically, we create mutant libraries of the HA gene from the H1N1 strain A/WSN/1933 (WSN) that contain virtually all of the ≈4 × 104 possible individual codon mutations, and therefore virtually all of the ≈104 possible amino-acid mutations. We use these mutant libraries to generate pools of mutant influenza viruses, which we estimate incorporate at least 85% of the possible HA codon mutations and 97% of the possible amino-acid mutations. We then passage these viruses to select for functional variants, and use Illumina deep sequencing to determine the frequency of each HA mutation before and after this selection for viral growth. Since these experiments measure the impact of mutations in the absence of immune selection, they enable us to quantify HA's inherent preference for each amino acid at each site in the protein. We show that these quantitative measurements are consistent with existing knowledge about HA structure and function, and can be used to create an evolutionary model that describes HA's natural evolution far better than existing models of sequence evolution. Finally, we use our results to show that HA's antigenic sites are disproportionately tolerant of mutations, suggesting that a high inherent tolerance for mutations at key positions targeted by the immune system is one factor that contributes to influenza's antigenic evolvability.

Results

Strategy for deep mutational scanning of HA

Our strategy for deep mutational scanning (Fowler et al., 2010; Araya and Fowler, 2011) of HA is outlined in Figure 1. The wildtype WSN HA gene was mutagenized to create a diverse library of mutant HA genes. This library of mutant genes was then used to generate a pool of mutant viruses by reverse genetics (Hoffmann et al., 2000). The mutant viruses were passaged at a low multiplicity of infection to ensure a linkage between genotype and phenotype. The frequencies of mutations before and after selection for viral growth were quantified by Illumina deep sequencing of the mutant genes (the mutDNA sample in Figure 1) and the mutant viruses (the mutvirus sample in Figure 1). An identical process was performed in parallel using the unmutated wildtype HA gene to generate unmutated viruses in order to quantify the error rates associated with sequencing, reverse transcription, and virus growth (these are the DNA and virus samples in Figure 1). The entire process in Figure 1 was performed in full biological triplicate (the replicates are referred to as #1, #2, and #3). In addition, a repeat of the Illumina sample preparation and deep sequencing was performed for replicate #1 to quantify the technical variation associated with these processes.

Schematic of the deep mutational scanning experiment.

The Illumina deep-sequencing samples are shown in yellow boxes (DNA, mutDNA, virus, mutvirus). Experimental steps and associated sources of mutations are shown in blue text, while sources of error during Illumina sample preparation and sequencing are shown in red text. This entire process was performed in biological triplicate.

https://doi.org/10.7554/eLife.03300.003

Creation of HA codon-mutant libraries

The deep mutational scanning strategy in Figure 1 requires creating mutant libraries of the HA gene. We wanted to assess the impact of all possible amino-acid mutations. Most mutagenesis techniques operate at the nucleotide level, and so frequently introduce single-nucleotide codon changes (e.g., GGA → cGA) but only very rarely introduce multi-nucleotide codon changes (e.g., GGA → cat). However, several PCR-based techniques have recently been developed to introduce random codon mutations into full-length genes (Firnberg and Ostermeier, 2012; Bloom, 2014; Jain and Varadarajan, 2014). We used one of these techniques (Bloom, 2014) to create three replicate codon-mutant libraries of the WSN HA gene (Supplementary file 1).

Sanger sequencing of 34 individual clones indicated that the libraries contained an average of slightly over two codon mutations per gene, with a very low rate of insertions and deletions (less than 0.1 per gene). The number of mutations per clone was distributed around this average in an approximately Poisson fashion (Figure 2). The mutations consisted of a mix of one-, two-, and three-nucleotide codon changes, and were roughly uniform in their nucleotide composition and location in the gene (Figure 2).

Properties of the HA codon-mutant library as assessed by Sanger sequencing of 34 individual clones drawn roughly evenly from the three experimental replicates.

(A) There are an average of 2.1 codon mutations per clone, with the number per clone following a roughly Poisson distribution. (B) The codon mutations involve a mix of one-, two-, and three-nucleotide mutations. (C) The nucleotide composition of the mutant codons is roughly uniform. (D) The mutations are distributed uniformly along HA's primary sequence. (E) There is no tendency for mutations to cluster in primary sequence. Shown is distribution of observed pairwise distances between mutations in multiply mutated clones vs the expected distribution when the mutations are placed independently in the clones. All plots show results only for substitution mutations; insertion/deletion mutations are not shown. However, only two insertion/deletion mutations (0.06 per clone) were identified. The data and computer code used to generate this figure are at https://github.com/jbloom/SangerMutantLibraryAnalysis/tree/v0.2.

https://doi.org/10.7554/eLife.03300.004

The genes in each mutant library were cloned at high efficiency into a bidirectional influenza reverse-genetics plasmid (Hoffmann et al., 2000). Each of the library replicates contained at least six-million unique clones—a diversity that far exceeds the 104 unique single amino-acid mutations and the ≈4 × 104 unique single codon mutations to the HA gene. The vast majority of possible codon and amino-acid mutations are therefore represented many times in each plasmid mutant library, both individually and in combination with other mutations.

Generation of mutant viruses by reverse genetics

The HA plasmid mutant libraries were used to generate pools of mutant influenza viruses by reverse genetics (Hoffmann et al., 2000). Briefly, this process involves transfecting cells with the HA plasmid mutant library along with plasmids encoding the other seven genes from the WSN strain of influenza. Although the cells were transfected with a very large diversity of HA plasmids, we were uncertain what fraction of the genes encoded on these plasmids would actually be productively packaged into a virus. In an attempt to maximize the diversity in the viral pools, the mutant viruses were generated by transfecting several dozen wells of cells. The logic behind this scheme was to maintain substantial diversity even if only a subset of viral mutants stochastically predominated in each individual well of cells. A different replicate virus pool was generated for each of the three HA plasmid mutant libraries.

The mutant viruses generated for each replicate were passaged at a relatively low multiplicity of infection (MOI) of 0.1 to reduce the probability of co-infection, thereby creating a link between viral genotype and phenotype. This genotype-phenotype link is essential to ensure that the sequenced HA gene matches the protein on the surface of the virus. Our previous work with NP (Bloom, 2014) indicates that one low MOI passage is sufficient to create a strong genotype-phenotype link, since that previous work found that the results obtained after one viral passage were extremely similar to the results obtained after two viral passages. In order to maintain a diversity of over two-million infectious viral particles, we performed the passaging in a total of 2.4 × 107 cells.

Deep sequencing reveals purifying selection against many mutations

We used Illumina sequencing to quantify the frequencies of mutations before and after selection for viral growth. For each replicate, we sequenced HA from the unmutated plasmid, the plasmid mutant library, virus produced from the unmutated plasmid, and mutant virus produced from the plasmid mutant library—these are the DNA, mutDNA, virus, and mutvirus samples in Figure 1. For the DNA and mutDNA samples, the HA gene was amplified directly from the plasmids by PCR. For the virus and mutvirus samples, the HA gene was first reverse-transcribed from viral RNA and was then amplified by PCR. In all cases, template quantification was performed prior to PCR to ensure that >106 initial HA molecules were used as templates for subsequent amplification.

In order to reduce the sequencing error rate, the HA molecules were fragmented to roughly 50 nucleotide fragments using Illumina's transposon-based Nextera kit, and then sequenced with overlapping paired-end reads (Figure 3—figure supplement 1). We only called codon identities for which both paired reads concur—this strategy substantially increases the sequencing fidelity, since it is rare for the same sequencing error to occur in both reads. For each sample, we obtained in excess of 107 overlapping paired-end reads that could be aligned to HA (Figure 3—figure supplement 2). As shown in Figure 3—figure supplement 3, the read depth varied somewhat along the primary sequence, presumably due to known weak biases in the insertion sites for the Nextera transposon (Adey et al., 2010). However, these biases were fairly mild, and so we obtained well over 2 × 105 unique paired reads for nearly all HA codons.

Figure 3 shows the frequency of mutations in each sample as quantified by deep sequencing. The DNA samples derived from unmutated HA plasmid show a low frequency of apparent mutations which are almost exclusively composed of single-nucleotide codon changes—the frequency of these apparent mutations reflects the rate of errors from the PCR amplification and subsequent deep sequencing. The virus samples created from the unmutated plasmid show only a slightly higher frequency of mutations, indicating that reverse-transcription and viral replication introduce only a small number of additional mutations. As expected, the mutDNA samples derived from the plasmid mutant libraries show a high rate of one-, two-, and three-nucleotide mutations, as all three types of mutations were introduced during the codon mutagenesis. The mutvirus samples derived from the mutant virus pools exhibit a mutation rate that is substantially lower than that of the mutDNA samples. Most of the reduction in mutation frequency in the mutvirus samples is due to decreased frequencies of nonsynonymous and stop-codon mutations; synonymous mutations are only slightly depressed in frequency. As stop-codon and nonsynonymous mutations are much more likely than synonymous mutations to substantially impair viral fitness, these results are consistent with purifying selection purging deleterious mutations during viral growth.

Figure 3 with 3 supplements see all
The per-codon frequencies of mutations in the samples.

The samples are named as in Figure 1, with the experimental replicate indicated with the numeric label. The DNA samples have a low frequency of mutations, and these mutations are composed almost entirely of single-nucleotide codon changes—these samples quantify the baseline error rate from PCR and deep sequencing. The mutation frequency is only slightly elevated in virus samples, indicating that viral replication and reverse transcription introduce only a small number of additional mutations. The mutDNA samples have a high frequency of single- and multi-nucleotide codon mutations, as expected from the codon mutagenesis procedure. The mutvirus samples have a lower mutation frequency, with most of the reduction due to fewer stop-codon and nonsynonymous mutations—consistent with purifying selection purging deleterious mutations. The data and code used to create this plot is available via http://jbloom.github.io/mapmuts/example_WSN_HA_2014Analysis.html; this plot is the file parsesummary_codon_types_and_nmuts.pdf described therein. The sequencing accuracy was increased by using overlapping paired-end reads as illustrated in Figure 3—figure supplement 1. The overall number of overlapping paired-end reads for each sample is shown in Figure 3—figure supplement 2. A representative plot of the read depth across the primary sequence is shown in Figure 3—figure supplement 3.

https://doi.org/10.7554/eLife.03300.005

Inspection of Figure 3 also demonstrates an important advantage of introducing the mutations at the codon rather than the nucleotide level. While there is a low but non-zero rate of errors (from sequencing, PCR, or reverse-transcription) that lead to single-nucleotide codon changes (as judged by the DNA and virus samples), errors that lead to multi-nucleotide codon changes are negligible because it is extremely rare for a single codon to experience two errors. We similarly expect that any do novo mutations or reversions that arise during viral growth should be limited to single-nucleotide changes given the short duration of viral passage in our experiments. The fact that our mutant libraries were constructed at the codon rather than the nucleotide level means that the vast majority (54 of 63) possible mutations to each codon involve multiple nucleotide changes, and so the sequencing results for these mutations can be analyzed essentially at face value, without having to worry about confounding errors. For the remaining (9 of 63) possible mutations that only involve a single-nucleotide codon change, we have attempted to statistically correct for the error rates estimated from our controls as described in the ‘Materials and methods’.

Most mutations are sampled by the experiments

It is important to assess the completeness with which the experiments sampled all possible HA mutations. Several problems could limit mutational sampling: mutations might be absent from the plasmid mutant libraries due to biases in the codon mutagenesis, mutations that are present in the plasmid mutants might fail to be incorporated into viruses due to stochastic bottlenecks during virus generation by reverse genetics, or the sequencing read depth might be inadequate to sample the mutations that are present. The most straightforward way to assess these issues is to quantify the number of times that each possible multi-nucleotide codon mutation is observed in the mutDNA and mutvirus samples. Restricting the analysis to multi-nucleotide codon mutations avoids the confounding effects of sequencing and reverse-transcription errors, which cause almost exclusively single-nucleotide changes.

Figure 4 shows the number of times that each mutation was observed in the combined sequencing data for the three biological replicates; Figure 4—figure supplement 1 shows the same data for the replicates individually. More than 99.5% of multi-nucleotide codon mutations are observed at least five times in the combined sequencing data from the plasmid mutant libraries (mutDNA samples), and ≈ 97.5% of all such mutations are observed at least five times in sequencing of the mutDNA for each individual replicate. These results indicate that the vast majority of codon mutations are represented in the plasmid mutant libraries.

Figure 4 with 1 supplement see all
The number of times that each possible multi-nucleotide codon mutation was observed in each sample after combining the data for the three biological replicates.

Nearly all mutations were observed many times in the mutDNA samples, indicating that the codon mutagenesis was comprehensive. Only about half of the mutations were observed at least five times in the mutvirus samples, indicating either a bottleneck during virus generation or purifying selection against many of the mutations. If the analysis is restricted to synonymous multi-nucleotide codon mutations, then about 85% of mutations are observed at least five times in the mutvirus samples. Since synonymous mutations are less likely to be eliminated by purifying selection, this latter number provides a lower bound on the fraction of codon mutations that were sampled by the mutant viruses. The redundancy of the genetic code means that the fraction of amino-acid mutations sampled is higher. The data and code used to create this figure are available via http://jbloom.github.io/mapmuts/example_WSN_HA_2014Analysis.html; this plot is the file countparsedmuts_multi-nt-codonmutcounts.pdf described therein. Similar plots for the individual replicates are shown in Figure 4—figure supplement 1.

https://doi.org/10.7554/eLife.03300.009

In contrast, only 53% of multi-nucleotide codon mutations are observed at least five times in the combined sequencing data for the mutant viruses (mutvirus samples), and only ≈ 26% of such mutations are observed at least five times in sequencing of the mutvirus for each individual replicate (Figure 4, Figure 4—figure supplement 1). However, these numbers are confounded by the fact that many mutations are deleterious, and so may be absent because purifying selection has purged them from the mutant viruses. A less confounded measure is the frequency of synonymous multi-nucleotide mutations, since synonymous mutations are less likely to be strongly deleterious. About 85% of such mutations are observed at least five times in the combined mutvirus samples, and ≈ 51% of such mutations are observed at least five times in the mutvirus samples for the individual replicates (Figure 4, Figure 4—figure supplement 1). Note that these numbers are only a lower bound on the fraction of codon mutations sampled by the mutant viruses—even synonymous mutations to influenza are sometimes strongly deleterious (Marsh et al., 2008), and so some of the missing synonymous codon mutations may have been introduced into mutant viruses but then purged by purifying selection. Furthermore, the redundancy of the genetic code means that the fraction of possible amino-acid mutations sampled is substantially higher than the fraction of codon mutations sampled. Specifically, if 85% of possible codon mutations are sampled at least five times in the combined libraries (as Figure 4 indicates), then our simulations suggest that ≈ 97% of possible amino-acid mutations will have also been sampled at least five times (‘Materials and methods’).

Overall, these results indicate that nearly all mutations are represented in the plasmid mutant libraries. Virus generation by reverse genetics does introduce a bottleneck—but fortunately, this bottleneck is sufficiently mild that at least half of all possible codon mutations are still sampled at least five times by the mutant viruses in each individual replicate. Combining the data for the three replicates brings the coverage of possible codon mutations to around 85%, and the coverage of possible amino-acid mutations to 97%. Therefore, the sampling of mutations is sufficiently complete to provide information on the effects of most amino-acid mutations when the data from the three experimental replicates are combined.

Estimation of the effects of each amino-acid mutation to HA

We quantified the effects of mutations in terms of site-specific amino-acid ‘preferences’. These preferences are the expected frequency of each amino acid at each site in the mutant viruses in a hypothetical situation in which all amino acids are introduced at that site at equal frequency in the initial plasmid mutant library (Bloom, 2014). Because many of the HAs in our libraries contain several mutations, these preferences do not simply correspond to the fitness effect of each individual mutation to the WSN HA—rather, they represent the average effect of each mutation in a collection of closely related HA mutants. Mutations to amino acids with high preferences are favored by selection, while mutations to amino acids with low preferences are disfavored. The amino-acid preferences are inferred from the deep sequencing data using a Bayesian statistical framework in which the observed counts are treated as draws from multinomial distributions with unknown parameters representing the initial mutagenesis rate, the various error rates, and selection as represented by the preferences (see ‘Materials and methods’ for details).

Figure 5 shows the amino-acid preferences for the entire HA gene inferred from the combined data from all three biological replicates. As can be seen from this figure, some sites have strong preferences for one specific amino acid, while other sites are tolerant of a variety of different amino acids. As described in Table 1, the inferred amino-acid preferences are consistent with existing knowledge about mutations and residues affecting HA stability, membrane fusion, proteolytic activation, and receptor binding (Nakajima et al., 1986; Martin et al., 1998; Qiao et al., 1999; Stech et al., 2005). This concordance suggests that the deep mutational scanning effectively captures many of the structural and functional constraints on HA.

Figure 5 with 1 supplement see all
The amino-acid preferences inferred using the combined data from the three biological replicates.

The letters have heights proportional to the preference for that amino acid, and are colored by hydrophobicity. The first overlay bar shows the relative solvent accessibility (RSA) for residues in the HA crystal structure. The second overlay bar indicates Caton et al. antigenic sites or conserved receptor-binding residues. The sequence is numbered sequentially beginning with 1 at the N-terminal methionine—however, this first methionine is not shown as it was not mutagenized. Figure 5—figure supplement 1 shows the same data with H3 numbering of the sequence. The data and code used to create this figure are available via http://jbloom.github.io/mapmuts/example_WSN_HA_2014Analysis.html; this plot is the file sequentialnumbering_site_preferences_logoplot.pdf described therein.

https://doi.org/10.7554/eLife.03300.011
Table 1

The amino-acid preferences inferred from the combined experimental replicates are consistent with existing knowledge about HA structure and function

https://doi.org/10.7554/eLife.03300.013
Site in sequential numberingSite in H3 numberingExisting knowledgeInferred amino-acid preferences
127117 (HA1)Mutation from S to P creates a temperature-sensitive defect (Nakajima et al., 1986)The preference for S is 30 times higher than the preference for P
174161 (HA1)Mutation from Y to H creates a temperature-sensitive defect (Nakajima et al., 1986)The preference for Y is 25 times higher than the preference for H
3441 (HA2)Mutation from G to E abolishes HA fusion activity (Qiao et al., 1999)The preference for G is 11 times higher than for E
343327 (HA1)A basic residue (R or K) is required for HA proteolytic activation (Stech et al., 2005)The combined preferences for R and K (0.87) far exceed those of all other amino acids combined
10898 (HA1)Receptor-binding residue, is Y in >99% of natural H1 HAsThe preference for Y (0.61) exceeds those of all other amino acids combined
166153 (HA1)Receptor-binding residue, is W in >99% of natural H1 HAsThe preference for W (0.65) exceeds those of all other amino acids combined
196183 (HA1)Receptor-binding residue, is H in >99% of natural H1 HAsThe preference for H (0.69) exceeds those of all other amino acids combined
203190 (HA1)Receptor-binding residue, is D in 90% of natural H1 HAsThe highest preference is for the chemically similar E
207194 (HA1)Receptor-binding residue, is L in 97% of natural H1 HAsThe preference for L (0.55) exceeds those of all other amino acids combined
208195 (HA1)Receptor-binding residue, is Y in >99% of natural H1 HAsThe preference for Y (0.72) exceeds those of all other amino acids combined
239226 (HA1)Receptor-binding residue, is Q in ≈99% of natural H1 HAsQ is one of three amino acids with a high preference
241228 (HA1)Receptor-binding residue, is G in >99% of natural H1 HAsThe preference for G (0.57) exceeds those of all other amino acids combined
  1. The conserved receptor-binding residues listed in this table are those delineated in the first table of Martin et al. (1998) that also have at least 90% conservation among all naturally occurring H1 HAs in the Influenza Virus Resource (Bao et al., 2008).

Despite the general concordance between the inferred amino-acid preferences and existing knowledge, it is important to quantify the experimental error associated with the deep mutational scanning. We sought to quantify two factors: technical variation due to inaccuracies and statistical limitations during Illumina sample preparation and deep sequencing, and biological variation due to stochasticity in the viral mutants that were generated and enriched during each replicate of the experiment. Figure 6A shows the correlation between biological replicate #1 and a technical repeat of the Illumina sample preparation and deep sequencing for this biological replicate. There is a very high correlation between the preferences inferred from these two repeats, indicating that technical variation has only a very minor influence on the final inferred amino-acid preferences. Figure 6B–D show the correlation among the three different biological replicates. Although the biological replicates are substantially correlated, there is also clear variation. Most of this variation is attributable to amino acids which in one replicate are inferred to have preferences near the a priori expectation of 0.05 (there are 20 amino acids, which in the absence of data are all initially assumed to have an equal preference of 120), but in another replicate are inferred to have a much higher or lower preference. Such variation arises because the mutant viruses for each biological replicate only sample about 50% of the possible codon mutations (see previous section), meaning that there is little data for some mutations in any given replicate. Fortunately, combining the three biological replicates greatly increases the coverage of possible mutations (see previous section). Therefore, inferences made from the combined data (as in Figure 5) should be substantially more accurate than inferences from any of the individual replicates. This idea is supported by the results below, which quantify the extent to which the inferred preferences accurately describe natural HA evolution.

Correlations among the amino-acid preferences inferred using data from the individual biological replicates.

(A) The preferences from two technical repeats of the sample preparation and deep sequencing of biological replicate #1 are highly correlated. (B)(D) The preferences from the three biological replicates are substantially but imperfectly correlated. Overall, these results indicate that technical variation in sample preparation and sequencing is minimal, but that there is substantial variation between biological replicates due to stochastic differences in which mutant viruses predominate during the initial reverse-genetics step. The Pearson correlation coefficient (R) and associated p-value are shown in the upper-left corner of each plot. The data and code used to create this figure are available via http://jbloom.github.io/mapmuts/example_WSN_HA_2014Analysis.html; these plots are the files correlations/replicate_1_vs_replicate_1_repeat.pdf, correlations/replicate_1_vs_replicate_2.pdf, correlations/replicate_1_vs_replicate_3.pdf, and correlations/replicate_2_vs_replicate_3.pdf described therein.

https://doi.org/10.7554/eLife.03300.014

Comparison to another high-throughput study of mutations to HA

As our paper was under review, Wu et al. (2014) published the results of using a similar strategy to examine the effects of mutations to the WSN HA. In their study, the HA gene was mutated at the nucleotide level, so their experiments surveyed only amino-acid mutations accessible by single-nucleotide codon changes. As a result, they provide data on the effects of only about 20% of the 19 × 564 = 10716 amino-acid mutations examined in our study. Despite this limitation, their study provides a large dataset of mutational effects to which we can compare our results.

Figure 7 compares the mutational effects determined in our study to those from Wu et al. (2014). There is a highly significant correlation between the results of the two studies—but the inferred mutational effects are certainly not identical. Because Wu et al. (2014) do not provide the data for replicates of their experiment, we are unable to assess whether the variability between the two different studies exceeds the variability between experimental replicates within each study. So one can imagine both biologically interesting and uninteresting explanations for the imperfect correlation between the results of the two studies. The interesting explanation is that differences in experimental methodology could lead to different selection pressures on specific mutations: for instance, Wu et al. (2014) use A549 cells while we use MDCK-SIAT1 cells, and perhaps the impact of certain mutations is dependent on the cell line. The uninteresting explanation is that the imperfect correlation is simply due to noise in the experimental measurements. Unfortunately, it is not straightforward to distinguish between these two explanations. This difficulty in pinpointing reasons for inter-study variation highlights a limitation of the high-throughput experimental methodology employed by ourselves and Wu et al. (2014): while such experiments provide a wealth of data, numerous factors can create noise in these data (sequencing errors, population bottlenecks, epistasis among mutations, etc). Realizing the full potential of such studies will therefore require extensive experimental controls and biological replicates to quantify errors and noise to enable comparisons across data sets.

Correlation of the site-specific amino-acid preferences determined in our study with the “relative fitness” (RF) values reported by Wu et al. (2014). Wu et al. (2014) report RF values for 2350 of the 564×19 = 10716 possible amino-acid mutations to the WSN HA examined in our study (they only examine single-nucleotide changes and disregard certain types of mutations due to oxidative damage of their DNA).

To compare across the data sets, we have normalized their RF values by the RF value for the wildtype amino-acid (which they provide for only 2264 of the 2350 mutations). We then correlate on a logarithmic scale these normalized RF values with the ratio of our measurement of the preference for the mutant amino acid divided by the preference for the wildtype amino acid, using the preferences from our combined replicates. For mutations for which Wu et al. (2014) report an RF of zero, we assign a normalized RF equal to the smallest value for their entire data set. There is a significant Pearson correlation of 0.48 between the data sets, indicating that both our experiments and those of Wu et al. (2014) are capturing many of the same constraints on HA. The data and code used to create this figure are available via http://jbloom.github.io/mapmuts/example_WSN_HA_2014Analysis.html; this plot is the file correlation_with_Wu_et_al.pdf described therein.

https://doi.org/10.7554/eLife.03300.015

Nonetheless, Figure 7 shows that there is a highly significant correlation between the results of these two high-throughput studies, despite differences in experimental methodology and unquantified sources of experimental noise. This fact suggests that both studies capture fundamental constraints on HA’s mutational tolerance. In the remaining sections, we apply the more comprehensive data generated by our study to address questions about HA’s natural evolution and antigenic evolvability.

Experimental inferences are consistent with HA’s natural evolution

Do the results of our deep mutational scanning experiment accurately reflect the real constraints on HA? Table 1 uses an anecdotal comparison to a small number of existing experimental studies to suggest that they do. However, a more systematic way to address this question is to compare the inferred amino-acid preferences to the actual patterns of HA evolution in nature.

To make such a comparison, we created an alignment of HA sequences from human and swine influenza viruses descended from a common ancestor closely related to the virus that caused the 1918 influenza pandemic. Figure 8 shows a phylogenetic tree of these sequences. The WSN HA used in our deep mutational scanning falls relatively close to the root of this tree.

Figure 8 with 1 supplement see all
A phylogenetic tree of human and swine H1 HA sequences descended from a common ancestor closely related to the 1918 virus.

The WSN virus used in the experiments here is a lab-adapted version of the A/Wilson Smith/1933 strain. Human H1N1 that circulated from 1918 until 1957 is shown in blue. Human seasonal H1N1 that reappeared in 1977 is shown in purple. Swine H1N1 is shown in red. The 2009 pandemic H1N1 is shown in green. This tree was constructed using codonPhyML (Gil et al., 2013) with the substitution model of Goldman and Yang (1994). This plot is the file CodonPhyML_Tree_H1_HumanSwine_GY94/annotated_tree.pdf described at http://jbloom.github.io/phyloExpCM/example_2014Analysis_Influenza_H1_HA.html. Figure 8—figure supplement 1 shows a tree estimated for the same sequences using the substitution model of Kosiol et al. (2007).

https://doi.org/10.7554/eLife.03300.016

The crudest comparison is simply to correlate amino-acid frequencies in the natural sequences to the experimentally inferred amino-acid preferences. Figure 9 shows that the inferred preferences are substantially although imperfectly correlated with the natural amino-acid frequencies. However, this comparison is problematic because it fails to account for the contingent and limited sampling of mutations by natural evolution. While the deep mutational scanning is designed to sample all possible mutations, only a fraction of theoretically tolerable mutations have fixed in natural H1 HAs due to the finite timespan during which evolution has been exploring possible sequences (in other words, evolution is not at equilibrium; see Povolotskaya and Kondrashov, 2010). Therefore, an amino-acid frequency of close to one among the natural HA sequences in Figure 8 might imply an absolute functional requirement for that amino acid—or it might simply mean that natural evolution has not yet happened to fix a mutation to another tolerable amino acid at that site.

The frequencies of amino acids among the naturally occurring HA sequences in Figure 8 vs the amino-acid preferences inferred from the combined replicates (Figure 5).

Note that a natural frequency close to one or zero could indicate absolute selection for or against a specific amino acid, but could also simply result from the fact that natural evolution has not completely sampled all possible mutations compatible with HA structure and function. The Pearson correlation coefficient (R) and associated p-value are shown on the plot. This plot is the file natural_frequency_vs_preference.pdf described at http://jbloom.github.io/phyloExpCM/example_2014Analysis_Influenza_H1_HA.html.

https://doi.org/10.7554/eLife.03300.018

A better approach is therefore to treat natural evolution as a non-equilibrium dynamic process, and ask whether the inferred amino-acid preferences accurately describe this process. This type of analysis can be done using the likelihood-based statistical framework for phylogenetics developed by Felsenstein (1973, 1981). Specifically, we fix the phylogenetic tree topology to that shown in Figure 8 and then assess the likelihood of the natural sequences given a specific evolutionary model after optimizing the branch lengths of the tree. Evolutionary models that more accurately describe HA sequence evolution will have higher likelihoods, and the relative accuracy of models can be quantified by comparing their likelihoods after correcting for the number of free parameters using AIC (Posada and Buckley, 2004). Previous work has described how experimental measurements of amino-acid preferences can be combined with known mutation rates to create a parameter-free phylogenetic evolutionary model from deep mutational scanning data (Bloom, 2014).

Table 2 and Table 3 compare the fit of evolutionary models based on the experimentally inferred amino-acid preferences with several existing state-of-the-art models that do not utilize this experimental information (Goldman and Yang, 1994; Kosiol et al., 2007). The model based on amino-acid preferences inferred from the combined experimental data from the three replicates describes the evolution of the naturally occurring HA sequences far better than the alternative models, despite the fact that the latter have a variety of free parameters that are optimized to improve the fit. Models based on amino-acid preferences inferred from the individual experimental replicates also fit the data better than existing models—however, the fit is poorer than for the model that utilizes the data from all three replicates. This result is consistent with the fact that the individual replicates are incomplete in their sampling of the mutational effects, meaning that aggregating the data from several replicates improves the accuracy of inferred preferences. Overall, these comparisons show that the deep mutational scanning reflects the actual constraints on HA evolution substantially better than existing quantitative evolutionary models.

Table 2

An evolutionary model derived from the experimentally inferred amino-acid preferences describes the HA sequence phylogeny in Figure 8 far better than a variety of existing state-of-the-art models

https://doi.org/10.7554/eLife.03300.019
ModelΔ AICLog likelihoodParameters (optimized + empirical)
Combined0.0−24088.70 (0 + 0)
replicate 3303.2−24240.30 (0 + 0)
Combined, Halpern and Bruno500.6−24339.00 (0 + 0)
replicate 1535.4−24356.40 (0 + 0)
replicate 3, Halpern and Bruno657.8−24417.60 (0 + 0)
replicate 2876.2−24526.80 (0 + 0)
GY94, gamma ω, gamma rates882.6−24517.013 (4 + 9)
replicate 1, Halpern and Bruno983.2−24580.30 (0 + 0)
GY94, gamma ω, one rate1109.7−24631.512 (3 + 9)
replicate 2, Halpern and Bruno1190.0−24683.70 (0 + 0)
KOSI07, gamma ω, gamma rates1620.5−24834.964 (4 + 60)
GY94, one ω, gamma rates1859.4−25006.412 (3 + 9)
KOSI07, gamma ω, one rate1883.0−24967.263 (3 + 60)
KOSI07, one ω, gamma rates2378.8−25215.163 (3 + 60)
GY94, one ω, one rate2544.5−25350.011 (2 + 9)
KOSI07, one ω, one rate3040.0−25546.762 (2 + 60)
combined, randomized5632.8−26905.10 (0 + 0)
replicate 1, randomized6002.4−27089.90 (0 + 0)
replicate 3, randomized6138.8−27158.10 (0 + 0)
replicate 2, randomized6477.8−27327.60 (0 + 0)
combined, randomized, Halpern and Bruno7072.8−27625.10 (0 + 0)
replicate 1, randomized, Halpern and Bruno7795.0−27986.20 (0 + 0)
replicate 3, randomized, Halpern and Bruno7891.8−28034.60 (0 + 0)
replicate 2, randomized, Halpern and Bruno8494.4−28335.90 (0 + 0)
  1. The model is most accurate if it utilizes data from the combined experimental replicates, but it also outperforms existing models even if the data are only derived from individual replicates. Models are ranked by AIC (Posada and Buckley, 2004). GY94 indicates the model of Goldman and Yang (1994), and KOSI07 indicates the model of Kosiol et al. (2007). The nonsynonymous/synonymous ratio (ω) and the substitution rate are either estimated as a single value or drawn from a four-category gamma distribution. Randomizing the experimentally inferred preferences among sites makes the models far worse. The models work best fixation probabilities are computed from the preferences using the first equation proposed in Bloom (2014). The table also shows the results if the fixation probabilities are instead computed using the equation of Halpern and Bruno (1998) as described in Bloom (2014). This table is the file H1_HumanSwine_GY94_summary.tex described at http://jbloom.github.io/phyloExpCM/example_2014Analysis_Influenza_H1_HA.html. Table 3 shows the results when the tree topology is instead estimated using the substitution model of Kosiol et al. (2007).

Table 3

An evolutionary model derived from the experimentally inferred amino-acid preferences also outperforms existing models for the tree topology in Figure 8—figure supplement 1

https://doi.org/10.7554/eLife.03300.020
ModelΔAICLog likelihoodParameters (optimized + empirical)
Combined0.0−24082.50 (0 + 0)
replicate 3304.8−24234.90 (0 + 0)
Combined, Halpern and Bruno494.4−24329.70 (0 + 0)
replicate 1534.2−24349.60 (0 + 0)
replicate 3, Halpern and Bruno653.2−24409.10 (0 + 0)
replicate 2869.4−24517.20 (0 + 0)
GY94, gamma ω, gamma rates876.7−24507.813 (4 + 9)
replicate 1, Halpern and Bruno976.8−24570.90 (0 + 0)
GY94, gamma ω, one rate1101.0−24621.012 (3 + 9)
replicate 2, Halpern and Bruno1180.4−24672.70 (0 + 0)
KOSI07, gamma ω, gamma rates1609.0−24823.064 (4 + 60)
GY94, one ω, gamma rates1856.2−24998.612 (3 + 9)
KOSI07, gamma ω, one rate1867.3−24953.163 (3 + 60)
KOSI07, one ω, gamma rates2367.9−25203.463 (3 + 60)
GY94, one ω, one rate2548.3−25345.611 (2 + 9)
KOSI07, one ω, one rate3028.0−25534.562 (2 + 60)
Combined, randomized5628.0−26896.50 (0 + 0)
replicate 1, randomized5993.6−27079.30 (0 + 0)
replicate 3, randomized6138.0−27151.50 (0 + 0)
replicate 2, randomized6475.2−27320.10 (0 + 0)
combined, randomized, Halpern and Bruno7069.4−27617.20 (0 + 0)
replicate 1, randomized, Halpern and Bruno7786.8−27975.90 (0 + 0)
replicate 3, randomized, Halpern and Bruno7889.2−28027.10 (0 + 0)
replicate 2, randomized, Halpern and Bruno8496.0−28330.50 (0 + 0)
  1. This table differs from Table 2 in that it uses the tree topology inferred with the model of Kosiol et al. (2007) rather than Goldman and Yang (1994). This table is the file H1_HumanSwine_KOSI07_summary.tex described at http://jbloom.github.io/phyloExpCM/example_2014Analysis_Influenza_H1_HA.html.

The inherent evolvability of antigenic sites on HA

The amino-acid preferences inferred from the deep mutational scanning reflect the inherent mutational tolerance of sites in HA. In contrast, the evolution of HA in nature is shaped by a combination of HA's inherent mutational tolerance and external selection pressures. Specifically, the evolution of HA in humans is strongly driven by selection for mutations that alter antigenicity (Yewdell et al., 1979; Wiley et al., 1981; Caton et al., 1982; Smith et al., 2004; Das et al., 2013; Koel et al., 2013; Bedford et al., 2014). The fact that such antigenic mutations fix at high frequency implies some degree of mutational tolerance at antigenic sites, since no mutations would fix if these sites were under absolute structural or functional constraint. However, it is not possible to tell from natural sequences alone whether antigenic sites are unusually mutationally tolerant compared to the rest of HA, or whether their rapid evolution is solely because they are under strong external immune selection.

To address this issue, we used the results of the deep mutational scanning to compare the inherent mutational tolerance of antigenic sites to the rest of the HA protein. Caton et al. (1982) mapped the antigenic sites of the H1 HA from A/Puerto Rico/8/1934 (PR8), which is closely related to the WSN HA used in our experiments. We therefore defined the ‘Caton et al. antigenic sites’ as the WSN residues homologous to those mapped by Caton et al. (1982) with the exclusion of a single site that has gained glycosylation in the WSN HA relative to the PR8 HA (see ‘Materials and methods’ for details). One possible concern is that Caton et al. (1982) mapped antigenic sites largely by selecting monoclonal-antibody escape mutants, and so these sites might be biased towards being more mutationally tolerant. We therefore also made a broader classification of ‘antigenic sites and contacting residues’ consisting of the Caton et al. antigenic sites plus all surface-exposed residues in contact with these sites (see ‘Materials and methods’ for details). This broader classification includes all residues in regions of the HA surface targeted by antibodies, and so should not be biased by whether sites are amenable to the selection of monoclonal-antibody escape mutants. We hypothesized that both sets of antigenic sites would have unusually high mutational tolerance.

For comparison, we used two classifications of receptor-binding residues (‘Materials and methods’). The first classification consists of residues that have important roles in receptor binding (Martin et al., 1998) and are conserved in H1 HAs; these residues are mostly deep in the binding pocket. The second classification consists of all residues that contact the sialic-acid receptor in the crystal structure, regardless of their level of conservation. We hypothesized that the core set of conserved receptor-binding residues would have unusually low mutational tolerance, but that the set of all receptor-binding residues would have typical levels of mutational tolerance since influenza routinely escapes from antibodies that target the periphery of the receptor-binding pocket (Koel et al., 2013).

The positions of the Caton et al. antigenic sites and the conserved receptor-binding residues in the primary sequence are indicated by the top overlay bar in Figure 5. Visual inspection suggests that the conserved receptor-binding residues are indeed relatively intolerant of mutations (have a strong preference for one specific amino acid), whereas the Caton et al. antigenic sites are relatively tolerant of mutations (have roughly equivalent preferences for many amino acids).

For a more quantitative analysis, we computed a site entropy from the inferred amino-acid preferences—larger site entropies indicate a higher inherent tolerance for mutations. The site entropies of all residues are displayed on the HA protein structure in Figure 10. Visual inspection suggests that both classifications of antigenic sites have unusually high mutational tolerance, whereas the conserved receptor-binding residues have unusually low mutational tolerance.

Inherent mutational tolerance of HA’s receptor-binding residues and antigenic sites.

(A) Surface of HA with one monomer colored by site entropy as determined by the deep mutational scanning; blue indicates low mutational tolerance and red indicates high mutational tolerance. (B) The structure shows residues classified as antigenic sites by Caton et al. (1982) in colored spheres; the plot shows site entropy vs relative solvent accessibility (RSA) of these residues (red triangles) and all other HA1 residues in the crystal structure (blue circles). (C) Antigenic sites of Caton et al. (1982) plus all other surface-exposed residues that contact these sites. (D) Conserved receptor-binding residues. (E) All receptor-binding residues. Table 4 shows that residues in (B) and (C) have unusually high mutational tolerance, residues in (D) have unusually low mutational tolerance, and residues in (E) do not have unusual mutational tolerance. The data and code to create all panels of this figure is provided via http://jbloom.github.io/mapmuts/example_WSN_HA_2014Analysis.html. The structure is PDB 1RVX (Gamblin et al., 2004).

https://doi.org/10.7554/eLife.03300.021

We next tested whether these visual observations were supported by a rigorous statistical analysis. A confounding factor in comparing mutational tolerance across different sets of residues is that sites with higher solvent accessibility are typically more tolerant of mutations (Bustamante et al., 2000; Ramsey et al., 2011). To correct for this fact, we computed the relative solvent accessibility (RSA) for all residues in the HA crystal structure. Residues with RSAs close to zero are buried and are expected to be fairly intolerant of mutations, whereas residues with RSAs substantially greater than zero are surface exposed and are expected to be fairly tolerant of mutations. Figure 10 plots site entropy as a function of RSA for HA1 residues. This figure shows that sites with higher RSA are more mutationally tolerant as expected. However, the figure also suggests that both classifications of antigenic sites are more mutationally tolerant than other residues with equivalent RSA. The figure also suggest that the conserved receptor-binding residues are less mutationally tolerant than other residues with equivalent RSA, whereas the set of all receptor-binding residues have fairly typical mutational tolerance. These observations are supported by the statistical analyses in Table 4: even after correcting for RSA, there is a significant trend for antigenic sites to have high mutational tolerance, and for conserved receptor-binding residues to have low mutational tolerance.

Table 4

The antigenic sites are more significantly mutationally tolerant than other HA1 residues with similar relative solvent accessibility (RSA), the conserved receptor-binding residues are significantly less mutationally tolerant than other similar residues, and sites in the more expansive set of all receptor-binding residues have typical levels of mutational tolerance

https://doi.org/10.7554/eLife.03300.022
Model: site entropy ∼ RSA + (Caton et al. antigenic site) + intercept
 PropertyEstimateStandard errorp-value
 RSA1.290.12<10−10
 Caton et al. antigenic site0.300.091.6 × 10−3
Model: site entropy ∼ RSA + (antigenic site or contacting residue) + intercept
 PropertyEstimateStandard errorp-value
 RSA1.220.13<10−10
 antigenic site or contacting residue0.230.072.2 × 10−3
Model: site entropy ∼ RSA + (conserved receptor binding) + intercept
 PropertyEstimateStandard errorp-value
 RSA1.380.11<10−10
 conserved receptor binding−0.520.161.7 × 10−3
Model: site entropy ∼ RSA + (all receptor binding) + intercept
 PropertyEstimateStandard errorp-value
 RSA1.400.11<10−10
 all receptor binding−0.180.110.12
  1. The sets of residues analyzed here are those shown in Figure 10. Shown here are the results of multiple linear regression of the continuous dependent variable of site entropy (as computed from the amino-acid preferences) vs the continuous independent variable of RSA and the binary variable of being a receptor-binding residue or being an antigenic site. The data and code used to perform these analyses are available via http://jbloom.github.io/mapmuts/example_WSN_HA_2014Analysis.html.

Overall, these results show that antigenic sites in HA have unusually high inherent mutational tolerance, suggesting that this property combines with external immune selection to contribute to HA’s rapid antigenic evolution. These results also show that while a core group of conserved residues deep in the receptor-binding pocket have unusually low mutational tolerance, the bulk of residues that contact the receptor are not under exceptional constraint. This fact probably explains why HA is able to escape from antibodies targeting the periphery of the receptor-binding pocket (Koel et al., 2013), and why only rare antibodies that penetrate deep into this pocket are broadly neutralizing (Whittle et al., 2011).

HA's antigenic evolvability is not shared by all influenza proteins

The foregoing results show that the antigenic sites in HA have an unusually high inherent tolerance for mutations. Is this antigenic evolvability an exceptional feature of HA, or is it commonly shared by other viral proteins? Ideally one would compare HA to the major surface antigens of other viruses with high (e.g., HIV) and low (e.g., measles) rates of antigenic evolution—but unfortunately comparable data sets for these other viruses are not yet available. Therefore, we instead compared the antigenic evolvability of HA to that of influenza nucleoprotein (NP), a protein for which we have recently performed a similar deep mutational scanning experiment (Bloom, 2014).

The adaptive immune system targets NP via cytotoxic T-lymphocytes (CTLs) (Valkenburg et al., 2011). Although the selection exerted by these CTLs is believed to be weaker than the antibody-mediated selection on HA’s antigenic sites (Bhatt et al., 2011), influenza does benefit from mutations in NP that promote escape from CTLs (Berkhoff et al., 2007; Valkenburg et al., 2013). However, whereas HA rapidly evolves to escape from antibodies, NP does not appear to have any special propensity for rapid evolution of the epitopes targeted by CTLs. Instead, mutations in NP's CTL epitopes are often deleterious and require secondary permissive or compensatory mutations to fix without a fitness cost (Rimmelzwaan et al., 2004; Berkhoff et al., 2005, 2006; Gong et al., 2013). Therefore, we hypothesized that unlike HA's highly evolvable antigenic sites, NP's CTL-antigenic sites would not possess unusually high inherent mutational tolerance.

To test this hypothesis, we used a previously described delineation of epitopes in NP from the human H3N2 strain A/Aichi/2/1968 with experimentally validated human CTL responses (Gong and Bloom, 2014). In this delineation, less than a quarter of NP’s sites participate in multiple CTL epitopes. We used the results of our previous deep mutational scanning of NP to compare the inherent mutational tolerance of sites that participate in multiple CTL epitopes to all other sites in NP. As shown in Figure 11 and Table 5, the NP sites involved in multiple CTL epitopes have an inherent mutational tolerance that is indistinguishable from other sites in the protein. Therefore, NP does not possess any special inherent mutational tolerance in its CTL epitopes. This finding implies that a high level of antigenic evolvability is not a general feature of all viral proteins, but is instead at least somewhat unique to HA.

The inherent mutational tolerance of NP's CTL epitopes is indistinguishable from that of non-epitope sites in NP.

The plot shows the site entropy vs relative solvent accessibility (RSA) of NP residues that participate in multiple CTL epitopes (red triangles) and all other NP residues in the crystal structure (blue circles). Visual inspection suggests that the epitope sites have mutational tolerance comparable to other sites, and this result is supported by the statistical analysis in Table 5. Note that unlike for HA, there is no trend for RSA to correlate with site entropy—this could be because many of NP’s surface-exposed sites are constrained by interactions with viral RNA. The CTL epitopes are those delineated in the first supplementary table of Gong and Bloom (2014). The site entropies are computed from a previously described deep mutational scan of NP, and are the values in the first supplementary file of Bloom (2014); the RSA values are also taken from that reference. The data and code used to generate this plot is available via http://jbloom.github.io/mapmuts/example_WSN_HA_2014Analysis.html; the plot itself is the file NP_CTL_entropy_rsa_correlation.pdf described therein.

https://doi.org/10.7554/eLife.03300.023
Table 5

There is no statistically significant difference between the inherent mutational tolerance of NP sites involved in multiple CTL epitopes and all other NP residues

https://doi.org/10.7554/eLife.03300.024
Model: NP site entropy ∼ RSA + (multiple CTL epitopes) + intercept
PropertyEstimateStandard errorp-value
RSA−0.050.070.52
multiple CTL epitopes−0.040.040.31
  1. The table shows the result of multiple linear regression of the continuous dependent variable of site entropy (as computed from the amino-acid preferences) vs the continuous independent variable of RSA and the binary variable of participating in multiple CTL epitopes. The data set analyzed here is plotted in Figure 11. The data and code used to perform this analysis are available via http://jbloom.github.io/mapmuts/example_WSN_HA_2014Analysis.html.

Discussion

A fundamental challenge in studying the natural evolution of influenza is separating the effects of external selection pressures from inherent structural and functional constraints. The evolutionary patterns observed in natural sequences are shaped by a combination of inherent mutational tolerance and external pressures such as immune selection, and the analysis of such sequences is further confounded by the fact that influenza is not at evolutionary equilibrium.

Here we have quantified the inherent mutational tolerance of influenza HA by using deep mutational scanning (Fowler et al., 2010; Araya and Fowler, 2011) to simultaneously assess the impact on viral growth of the vast majority of the ≈104 possible amino-acid mutations to influenza HA. The information obtained from the deep mutational scanning is consistent with existing knowledge about the effects of mutations on HA function and structure. For instance, the deep mutational scanning shows strong selection for specific amino acids known to play important roles in HA's receptor-binding activity, fusion activity, and proteolytic activation (Martin et al., 1998; Qiao et al., 1999; Stech et al., 2005). Similarly, at the sites of known temperature-sensitive mutations to HA (Nakajima et al., 1986), the deep mutational scanning identifies the more stabilizing amino-acid as more favorable. Broader trends from the deep mutational scanning are also in agreement with current thinking about mutational effects. For example, the deep mutational scanning finds that there is strong purifying selection against stop-codon mutations and many nonsynonymous mutations, but that there is only weak selection against synonymous mutations. All of these results suggest that the deep mutational scanning faithfully captures both the specific and general effects of mutations on HA.

The comprehensive information generated by the deep mutational scanning can be used to create quantitative evolutionary models for analyzing HA sequence phylogenies. Here we have shown that an evolutionary model constructed from our deep mutational scanning data describes the evolution of human and swine H1 HAs far better than existing state-of-the-art models for sequence evolution. We anticipate that separating HA's inherent mutational tolerance from external selection should also eventually allow the external selection pressures to be studied in greater detail. For example, one might imagine that sites in HA that exhibit evolutionary patterns that deviate from the quantitative model created from our deep mutational scanning are likely to be under external selection. Future work that augments deep mutational scanning with specific experimentally defined selection pressures (such as antibodies against HA) could aid in further elucidation of the forces that shape influenza evolution. It also may be possible to utilize high-throughput experimental data on mutational effects to better estimate the fitness of naturally occurring strains in a way that aids in prediction of the year-to-year strain dynamics of influenza (Łuksza and Lässig, 2014).

The deep mutational scanning also enabled us to assess the extent to which HA's inherent mutational tolerance contributes to influenza’s antigenic evolvability. It remains a mystery why error-prone RNA viruses differ so widely in their capacity for evolutionary escape from immunity, with some (e.g., influenza and HIV) undergoing rapid antigenic evolution while others (e.g., measles) show little antigenic change on relevant timescales (Lipsitch and O’Hagan, 2007; Koelle et al., 2006; Heaton et al., 2013). Our data demonstrate that the antigenic sites in HA are unusually tolerant to mutations, implying that inherent evolutionary plasticity at sites targeted by the immune system is one factor that contributes to influenza's rapid antigenic evolution. This high mutational tolerance at antigenic sites could itself be a property that influenza has evolved to aid in its antigenic escape—or it might simply be an unfortunate coincidence that the immune system focuses on especially plastic portions of HA. In either case, it is intriguing to speculate whether a high inherent mutational tolerance in antigenic sites is also a feature of other antigenically variable RNA viruses. Application of the deep mutational scanning approach used here to additional viruses should provide a means to address this question.

Materials and methods

Availability of data and computer code

Illumina sequencing data are available at the SRA, accession SRP040983 (http://www.ncbi.nlm.nih.gov/sra/?term=SRP040983). Source code and a description of the computational process used to analyze the sequencing data and infer the amino-acid preferences is at http://jbloom.github.io/mapmuts/example_WSN_HA_2014Analysis.html. Source code and a description of the computational process used for the phylogenetic analyses is available at http://jbloom.github.io/phyloExpCM/example_2014Analysis_Influenza_H1_HA.html.

HA sequence numbering

A variety of different numbering schemes for HA are used in the literature. Unless noted otherwise, residues are numbered here using sequential numbering of the WSN HA protein sequence (Supplementary file 1) starting with one at the N-terminal methionine. In some cases, the number of the corresponding residues in the widely used H3 numbering scheme is also indicated. These numbering systems can be interconverted using the Python script available at https://github.com/jbloom/HA_numbering.

Generation of HA codon mutation library

The HA codon-mutant library was generated using the oligo-based PCR mutagenesis protocol described previously by Bloom (2014). The only differences from that protocol were that HA was used as the template rather than NP, and that only two overall rounds of mutagenesis were performed, rather than the three rounds used by Bloom (2014). This reduction in the number of rounds of mutagenesis reduced the average number of codon mutations from the ≈ three per clone in Bloom (2014) to the ≈ two per clone shown in Figure 2. The libraries were created in full biological triplicate, meaning that each experimental replicate was derived from an independent plasmid mutant library.

The end primers for the mutagenesis were 5′-cgatcacgtctctgggagcaaaagcaggggaaaataaaaacaac-3′ and 5′-gatacacgtctcatattagtagaaacaagggtgtttttccttatatttctg-3′ (these primers include BsmBI restriction sites). The mutagenic primers were ordered from Integrated DNA Technologies, and are listed in Supplementary file 2.

The final products from the codon mutagenesis PCR were gel purified and digested with BsmBI (R0580L; New England Biolabs, Ipswich, Massachusetts). The BsmBI-digested HA was ligated into a dephosphorylated (Antarctic Phosphatase, M0289L; New England Biolabs) and BsmBI-digested preparation of the bidirectional reverse-genetics plasmid pHW2000 (Hoffmann et al., 2000) using T4 DNA ligase (M0202S; New England Biolabs). Column-purified ligations were electroporated into ElectroMAX DH10B T1 phage-resistant competent cells (12033-015; Invitrogen, Carlsbad, California) and plated on LB plates containing 100 μg/ml of ampicilin. A 1:4000 dilution of each transformation was plated in parallel to enable estimation of the number of unique transformants—we obtained at least two-million unique colonies per transformation. For each replicate of the codon-mutant library, we performed three transformations to generate approximately six-million independent clones per replicate library. Control ligations lacking an insert yielded at least 100 times fewer colonies, indicating a very low rate of background self-ligation of the pHW2000 plasmid. The transformants from each HA mutant library replicate were pooled, cultured in LB supplemented with ampicillin, and mini-prepped to generate the HA codon mutant plasmid libraries.

For the Sanger sequencing analysis shown in Figure 2, we picked and prepped 34 independent colonies for sequencing. The full analysis of this Sanger sequencing is available at https://github.com/jbloom/SangerMutantLibraryAnalysis/tree/v0.2.

Virus rescue and passage in cells

The HA mutant plasmid libraries were used to generate pools of mutant influenza viruses by reverse genetics (Hoffmann et al., 2000). Cocultures of 293T and MDCK-SIAT1 cells were transfected with equal amounts of HA (either unmutated or one of the plasmid mutant libraries) cloned into pHW2000 as described above, plus the seven other WSN genes in bidirectional reverse-genetics plasmids (pHW181-PB2, pHW182-PB1, pHW183-PA, pHW185-NP, pHW186-NA, pHW187-NA, pHW188-NS), which were kind gifts from Robert Webster of St. Jude Children's Research Hospital. Overall, six viral rescues and passages were performed, each using a different HA plasmid preparation: the three HA mutant library replicates (eventually yielding the mutvirus samples in Figure 1) and three independent unmutated HAs (eventually yielding the virus samples in Figure 1).

Each of the viral rescues was performed by transfecting multiple wells of cells in an effort to increase the diversity of the rescued viruses. Specifically, two 12-well dishes were transfected per rescue. Cells were plated at 2 × 105 293T cells and 5 × 104 MDCK-SIAT1 cells per well in D10 (DMEM supplemented with 10% heat-inactivated FBS, 2 mM L-glutamine, 100 U of penicillin/ml, and 100 μg of streptomycin/ml), and then each well was transfected with 1 μg of total plasmid DNA (125 ng of each of the eight plasmids) using the BioT transfection reagent (Bioland B01-02, Paramount, California). At 12 to 18 hr post-transfection, the medium was changed to our WSN viral growth media: Opti-MEM supplemented with 0.5% heat-inactivated FBS, 0.3% BSA, 100 U of penicillin/ml, 100 μg of streptomycin/ml, and 100 μg of calcium chloride/ml. This media does not contain trypsin since viruses with the WSN HA and NA are trypsin independent (Goto and Kawaoka, 1998). Viral supernatants were collected 72 hr post-transfection, and the supernatants from the different wells were pooled for each viral rescue. These pooled supernatants were then clarified by centrifugation at 2000×g for 5 min, aliquoted, and frozen at −80 oC. Aliquots were then thawed and titered by TCID50 (see below).

For viral passage, each viral rescue replicate was passaged in four 10-cm dishes. Briefly, 6 × 106 MDCK-SIAT1 cells per 10-cm dish in WSN viral growth media were infected with 6×105 infectious particles (multiplicity of infection of 0.1). Since there are four dishes for each replicate, this maintains a diversity of 2.4 × 106 TCID50 units per replicate. The passaged viral supernatants were collected at 50 hr post-infection, and the supernatants for the four plates were pooled for each replicate. These pooled supernatants were clarified at 2000 × g for 5 min, aliquoted, and frozen at −80 oC. Aliquots were then thawed and titered by TCID50.

Virus titering by TCID50

The viruses were titered by TCID50 (50% tissue culture infectious dose). In this assay, 10 μl of a 1:10 dilution of the viral supernatant to be titered was added to the first row of a 96-well tissue culture plate containing 90 μl of WSN viral growth media. At least one no-virus control supernatant was included on each plate as a negative control. The virus was then serially diluted 1:10 down the rows of the plates, and then 5 × 103 MDCK-SIAT1 cells were added to each well. The plates were then incubated at 37°C, and scored for cytopathic effects caused by viral growth after for 65–72 hr. Virus titers were calculated by the method of Reed and Muench (1938) implemented via the Python script at https://github.com/jbloom/reedmuenchcalculator.

Generation of samples for Illumina deep sequencing

The deep sequencing samples were prepared from PCR amplicons that were generated exactly as described for the DNA, mutDNA, virus, and mutvirus samples in Bloom (2014). The viral RNA template for the virus and mutvirus were isolated using freshly purchased Trizol reagent (15596-026; Life Technologies) in order to avoid any oxidative damage associated with old reagents. After performing reverse transcription as described in Bloom (2014), quantitative PCR (qPCR) was used to quantify the number of HA cDNA molecules to ensure that there were at least 106 unique template molecules before beginning the subsequent PCR amplification. The qPCR primers were designed based on those described by Marsh et al. (2007), and were 5′-taacctgctcgaagacagcc-3′ and 5′-agagccatccggtgatgtta-3′.

The PCR amplicons were fragmented and barcoded using the custom modification of Illumina's Nextera kit using the protocol described in Bloom (2014). Samples were barcoded as follows: DNA–N701, mutDNA–N702, virus–N704, and mutvirus–N705. For each of the three biological replicates, these four samples were pooled and sequenced on their own Illumina lane with 50-nucleotide paired-end reads as described in Bloom (2014). For the technical sequencing repeat of biological replicate #1, the library preparation and sequencing were repeated from the same viral RNA templates. This technical repeat therefore only quantifies variation associated with sample preparation and sequencing, whereas the biological replicates also quantify variation associated with the processes of codon-mutant library creation, virus generation, and virus passage.

Analysis of deep sequencing data

The deep sequencing data was analyzed using the mapmuts computer program (Bloom, 2014). A description of the analysis approach and the resulting data files and figures produced are available at http://jbloom.github.io/mapmuts/example_WSN_HA_2014Analysis.html.

Briefly, paired reads were overlapped as illustrated in Figure 3—figure supplement 1 and then aligned to HA. Reads were retained only if both reads in the pair passed the default Illumina filter, had average Q-scores of at least 25, overlapped for at least 30 nucleotides with no more than one mismatch, and the overlap aligned to the HA gene with no more than six mismatches. Figure 3—figure supplement 2 shows the number of reads for each sample that met these criteria. Most reads that did not meet these criteria failed to do so because they could not be paired with at least 30 nucleotides of overlap—a situation that arises when the HA fragment produced by the Nextera fragmentation produces a fragment smaller than 30 nucleotides or larger than 70 nucleotides. Codon identities were called only if both overlapped paired reads agreed on the identity of the codon. This requirement reduces the error rate, because it is rare for both paired reads to independently experience the same sequencing error.

As shown in Figure 4, we estimated that 85% of possible codon mutations were sampled at least five times by the mutant viruses. To estimate the fraction of amino-acid mutations that would have been sampled, we simulated randomly selecting 85% of the mutant codons from the HA sequence, and determined that these codons encoded ≈97% of the amino-acid mutations.

Inference of amino-acid preferences and site entropies

The counts of each codon identity in the deep sequencing data was used to infer the ‘preference’ of each site for each amino acid as described in Bloom (2014). This inference was also done using the mapmuts computer program as detailed at http://jbloom.github.io/mapmuts/example_WSN_HA_2014Analysis.html.

Briefly, the preference πr,a of site r for amino-acid a represents the expected frequency of that amino acid in a hypothetical library where each amino-acid is introduced at equal frequency. Specifically, the expected frequency fr,xmutvirus of mutant codon x at site r in the mutvirus sample is related to the preference for its encoded amino-acid A(x) by

fr,xmutvirus=ϵr,x+ρr,x+μr,x×πr,A(x)yμr,y×πr,A(y),

where ϵr,x is the rate at which site r is erroneously read to be codon x, ρr,x is the rate at which site r is erroneously reverse-transcribed to codon x, and μr,x is the rate at which site r is mutagenized to codon x in the mutant DNA sample. These unknown error and mutation rate parameters are inferred from the DNA, virus, and mutvirus samples using the Bayesian approach described in Bloom (2014). Inferences of the posterior mean preferences πr,a were made separately for each replicate of the experiment, and the correlations among these inferences from different replicates are in Figure 6. The final ‘best’ inferred preferences from the combined data of the three biological replicates were obtained by averaging the preferences obtained from the three biological replicates. These final inferred preferences are provided in Supplementary file 3 and displayed graphically in Figure 5.

The site entropies in Figure 10 and Table 4 were calculated from the amino-acid preferences as hr=aπr,a×log2πr,a. These site entropies are therefore in bits. Higher site entropies indicate a higher inherent mutational tolerance.

Alignment of naturally occurring HAs and phylogenetic tree

The inferred amino-acid preferences were compared to amino-acid frequencies in an alignment of naturally occurring H1N1 HAs from swine and human lineages descended from a close relative of the 1918 virus. Briefly, all full-length H1 HAs from these hosts were downloaded from the Influenza Virus Resource (Bao et al., 2008). Up to three sequences per host and year were randomly subsampled and used to build a phylogenetic tree. Clear outliers from the molecular clock (typically lab artifacts or mis-annotated sequences) were iteratively excluded and the trees were rebuilt. The final sequence alignment is in Supplementary file 4. This alignment was used to build the phylogenetic trees in Figure 8 and Figure 8—figure supplement 1 with codonPhyML (Gil et al., 2013) using the codon-substitution model of (Goldman and Yang, 1994) or (Kosiol et al., 2007) with empirical codon frequencies determined using the CF3x4 method (Pond et al., 2010) or the F method, respectively. In both cases, the nonsynonymous-synonymous ratio (ω) was drawn from four gamma-distributed categories (Yang et al., 2000). A description of this process is at http://jbloom.github.io/phyloExpCM/example_2014Analysis_Influenza_H1_HA.html.

Comparison of evolutionary models

We compared the accuracy with which the naturally occurring HA phylogeny was described by an evolutionary model based on the experimentally measured amino-acid preferences vs several standard codon-substitution models. These comparisons were used made using HYPHY (Pond et al., 2005) and phyloExpCM (Bloom, 2014). A description of this analysis is at http://jbloom.github.io/phyloExpCM/example_2014Analysis_Influenza_H1_HA.html.

Briefly, the phylogenetic tree topology was fixed to that shown in Figure 8 or Figure 8—figure supplement 1. The branch lengths and any free parameters of the evolutionary model were then optimized by maximum likelihood. The experimentally determined evolutionary models were constructed from the inferred amino-acid preferences reported here and the experimentally measured mutation rates reported in Bloom (2014). The ‘fixation probabilities’ were computed using either the Metropolis-like relationship described in Bloom (2014) or the relationship proposed by Halpern and Bruno (1998). The results of these comparisons are in Tables 2 and 3. All of these comparisons show that the experimentally determined evolutionary models are far superior to the various standard models.

Structural analyses

The WSN HA studied here has a high degree of sequence identity to the HA crystallized in PDB 1RVX (Gamblin et al., 2004). It is this HA structure that is shown Figure 10. The relative solvent accessibilities (RSA) values in Figure 5 and Figure 10 were calculated by first determining the absolute solvent accessibilities of the residues in the full trimeric HA in PDB 1RVX with the DSSP (Joosten et al., 2011) webserver at http://www.cmbi.ru.nl/hsspsoap/, and then normalizing by the maximum solvent accessibilities given by Tien et al. (2013).

Classification of antigenic sites and conserved receptor-binding residues

Several sub-classifications of HA residues were performed.

Conserved receptor-binding sites were any residues listed in the first table of Martin et al. (1998) that are also conserved in at least 90% of H1 HAs. These residues are listed in Table 1.

All receptor-binding residues were any residues with any atom within 5 Å of the substrate in PDB 1RVX (Gamblin et al., 2004). No constraint is placed on whether or not these residues are conserved in natural sequences. The residues that fall into this classification are (in sequential numbering of the WSN HA): 108, 147, 148, 149, 150, 151, 158, 166, 168, 196, 198, 199, 203, 207, 238, 239, and 241.

The Caton et al. (1982) antigenic-site residues are classified based on antigenic mapping of the A/Puerto Rico/8/1934 (H1N1) HA. Specifically, these are any residues listed in the third table of Caton et al. (1982) with the following exceptions: residue 182 (H3 numbering) is not considered for the reason explained on page 421 of Caton et al. (1982), residue 273 (H3 numbering) is not considered for the reason explained on page 422 of Caton et al. (1982), and residue 129 (H3 numbering) is not considered because it has gained a glycosylation site in the WSN HA that is not present in the A/Puerto Rico/8/1934 (H1N1) HA and mutation of this WSN glycosylation site can strongly affect viral growth (Deom et al., 1986). Overall, this gives the following set of antigenic residues, listed by sequential numbering of the WSN HA with the H3 number in parentheses: 171 (158), 173 (160), 175 (162), 176 (163), 178 (165), 179 (166), 180 (167), 169 (156), 172 (159), 205 (192), 206 (193), 209 (196), 211 (198), 182 (169), 186 (173), 220 (207), 253 (240), 153 (140), 156 (143), 158 (145), 237 (224), 238 (225), 87 (78), 88 (79), 90 (81), 91 (82), 92 (83), and 135 (122).

A second classification is done that includes the Caton et al. (1982) plus any surface-exposed residues that are in contact with these residues, using an α-carbon to α-carbon distance of ≤6.0Å as the threshold for being in contact and classifying residues are solvent-exposed if they have an RSA of at least 20%. The rationale for this second classification is that the mapping by Caton et al. (1982) may have been biased towards inherently variable sites, and so other surface-exposed residues that contact these sites could also be antigenic. This classification adds the following 28 residues (listed by sequential numbering of the WSN HA) to the 29 Caton et al. (1982) residues: 85, 86, 89, 126, 132, 136, 137, 138, 142, 148, 150, 154, 155, 157, 170, 184, 185, 187, 202, 203, 207, 210, 212, 221, 235, 236, 239, and 252.

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
    A codon-based model of nucleotide substitution probabilities for protein-coding DNA sequences
    1. N Goldman
    2. Z Yang
    (1994)
    Molecular Biology and Evolution 11:725–736.
  23. 23
  24. 24
  25. 25
    A novel mechanism for the acquisition of virulence by a human influenza A virus
    1. H Goto
    2. Y Kawaoka
    (1998)
    Proceedings of the National Academy of Sciences of the United States of America 95:10224–10228.
    https://doi.org/10.1073/pnas.95.17.10224
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
    A simple method of estimating fifty per cent endpoints
    1. LJ Reed
    2. H Muench
    (1938)
    American Journal of Epidemiology 27:493–497.
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
    Codon-substitution models for heterogeneous selection pressure at amino acid sites
    1. Z Yang
    2. R Nielsen
    3. N Goldman
    4. AMK Pedersen
    (2000)
    Genetics 155:431–449.
  60. 60

Decision letter

  1. Mercedes Pascual
    Reviewing Editor; University of Michigan, United States

eLife posts the editorial decision letter and author response on a selection of the published articles (subject to the approval of the authors). An edited version of the letter sent to the authors after peer review is shown, indicating the substantive concerns or comments; minor concerns are not usually shown. Reviewers have the opportunity to discuss the decision before the letter is sent (see review process). Similarly, the author response typically shows only responses to the major concerns raised by the reviewers.

Thank you for sending your work entitled “The inherent mutational tolerance and antigenic evolvability of influenza hemagglutinin” for consideration at eLife. Your article has been favorably evaluated by Ian Baldwin (Senior editor), a Reviewing editor, and 3 reviewers, one of whom, Sarah Cobey, has agreed to reveal her identity.

The Reviewing editor and the reviewers discussed their comments before we reached this decision, and the Reviewing editor has assembled the following comments to help you prepare a revised submission.

The reviewers agree that this paper advances knowledge on an important topic in pathogen evolution: the extent to which rapidly adapting viruses like influenza have evolved particularly evolvable phenotypes. They also judge the set of experiments important and of high quality. Finally, they recognize the novelty of combining the experimental results with an evolutionary model. This combination demonstrates that the site-specific amino acid preferences inferred from the experiments describe HA evolution more parsimoniously than traditional codon substitution models.

The reviewers provide however a number of comments whose clarification and further discussion would strengthen the arguments made in the manuscript and its eventual impact. These comments primarily concern the uncertainty and assumptions over the identity of antigenic sites and the relevant RBS residues for control purposes. It was felt that both these assumptions could be made more explicit and examined further in the ways suggested below, together with other major points.

1) The conclusions from this study are dependent on knowing which HA residues are physically located in antigenic sites. The authors use the classic Caton et al. 1982 study to classify which HA residues are located in antigenic sites. This is a reasonable approach because the WSN/1933 virus used in the current study is antigenically similar to the PR8/1934 used in the Caton et al. 1982 study. The Caton et al. experiments defined antigenic sites by selecting viral escape mutants in the presence of saturating amounts of individual monoclonal antibodies. Based on the Caton et al. experimental approach, viruses that possess mutations at HA residues that severely compromise viral fitness are never selected-essentially, Caton et al. defined the antigenic sites by isolating mutant viruses that were capable of growing. With this in mind, it is perhaps not surprising that these same HA residues are tolerant of mutations in the current study.

When the Caton et al. escape mutants are mapped onto the 1RVX structure, there are clusters of mutations that correspond to each antigenic site. Within each cluster, there are a few HA residues that appear to be in the cluster (thus likely in the antibody binding site) but these residues never appeared in the Caton et al. viral mutant escape experiments. It has been assumed that this is the case because either (a) there was a limited number of mutant viruses screened by Caton et al. or (b) that these residues within the antigenic site were detrimental to viral fitness.

The authors should consider making an HA structure analogous to the one shown in Figure 9, with entropy color labeling for all surface residues. This type of analysis will demonstrate if tolerability is limited to those residues identified in selection assays (assays that only identify residues capable of tolerating mutations) or also found in adjacent residues that are likely located in antibody binding sites but were not identified in the original antigenic mapping studies. It might be helpful to show surface on the structure for this type of analyses.

Another partial solution is to examine antigenic sites previously defined by other means (such as epitopes defined by crystal structures).

This is not a critical flaw but should be considered either in the discussion or by completing additional analyses as described above.

2) It is not clear that the residues most essential for function must be the most conserved ones. We know that naturally occurring neutralizing antibodies can target the RBS (Whittle et al., PNAS, 2011). There is also variation in receptor type within hosts and between species.

The main text justifies using conserved RBS as a control because these sites are expected to be strongly conserved, and the RBS is defined to exclude non-conserved RBS-associated sites. One could also argue that high mutational tolerance in the RBS could be adaptive to facilitate escape in nearby epitopes;it would be really interesting if this were the case, with the control being some non-immunogenic area near the stalk. The simplest solution may be to show that the results are unchanged when the non-conserved RBS sites are included, assuming this is the case. If it is not, it would help to know which sites the conserved RBS correspond to on the HA structure. (One would hope they're deep-in-the-pocket).

3) The paper addresses the mutational tolerance and antigenic evolvability of HA. Relative to what? Are these metrics different for other influenza proteins? Other viral glycopropteins? A cursory analysis of the HA site preference data vs. NP site preference (in Bloom 2014, bioRxiv, currently in review in MBE?) would further support this claim.

4) “Big data” mutagenesis scans must sacrifice something in their broad scope; for example, quantitative as opposed to binary fitness measurements, intragenic epistasis (e.g. interactions among substitutions within HA as will occur in this experimental set-up), pool drop out bottlenecks. The authors already address many of these. It would be good to acknowledge them as shortcomings or areas for follow up work, and to discuss the recent publication by Wu et al. of a similar scanning mutagenesis that does report fitness values (Wu et al., High throughput profiling of influenza A virus hemagglutinin gene at the single-nucleotide resolution, Scientific Reports 2013).

Unlike the paper by Wu et al., however, the present manuscript places the mutational data within an evolutionary framework, which is something generally missing in most such studies. This novelty of the current manuscript could be drawn out more. In addition, both this paper and Wu et al. (2014) present a snapshot of the (mostly) single-mutation neighbors from a single point (the sequence of WSN33 HA) in the fitness landscape. Both find that viable paths occur disproportionately at antigenic sites. The fact that these rules, when extrapolated over the phylogeny, greatly improve the fit implies that epistasis is in some ways less pervasive than we think and may not be such a huge obstacle to evolution. Most substitutions appear to be concentrated among a permissive network of sites.

5) The following aspects of the experiments should be clarified/discussed: (a) The possibility of reversion between the plasmid pool and the passaged virus is not accounted for. Granted the wild type control shows that there is low background mutation, but a weakly functional mutant could revert. (b) During transfection, one would expect that there may be viruses produced from cells expressing two different HA genes. How would such chimeric viruses confound the results in a single passage experiment? (c) Similarly to #2, wouldn't there also be viruses in which the HA on the surface has a given amino acid sequence, while the packaged genomic segments code for a different one? Multiple passages would ferret these out. (d) How were HA molecules sheared and why are there biases in sequence fragmentation? (e) Are more exposed HA residues generally more tolerant to mutations? The authors address this by correcting for relative solvent accessibility, but this description is difficult to follow. The importance of relevant solvent accessibility (RSA) is shown in Table 3 and Figure 9, but the RSA result is not stated clearly in the main text. Can the authors simplify or clarify this part?

6) The conclusions would be further strengthened if there was some way to “predict” retrospectively where HA would go next based on the model. It would be interesting to discuss this possibility along the lines of what Luksza and Lassig published in Nature this year (Nature 507:57-61).

https://doi.org/10.7554/eLife.03300.029

Author response

The reviewers provide however a number of comments whose clarification and further discussion would strengthen the arguments made in the manuscript and its eventual impact. These comments primarily concern the uncertainty and assumptions over the identity of antigenic sites and the relevant RBS residues for control purposes. It was felt that both these assumptions could be made more explicit and examined further in the ways suggested below, together with other major points.

Thank you to the editors and reviewers for their very careful review of our manuscript. The critiques were very helpful, and we believe that the revisions that we have made in response have substantially improved the manuscript. Specifically:

We have included a second expanded definition of antigenic sites to show that the essential finding that HA has high antigenic evolvability is robust to the method used to classify the antigenic sites (see Response 2 below).

We have clarified that not all receptor-binding site residues have low inherent mutational tolerance – this is only true of conserved residues deep in the pocket. We now include separate classifications for conserved and all receptor-binding residues (see Response 3 below).

The suggestion to include a comparison to other viral proteins is very helpful; as suggested by the reviewers, we now compare to NP to show that high antigenic evolvability is a property of HA but not NP (see Response 4 below).

As suggested by the reviewers, we have included a comparison to the results of the Wu et al. study that was published while our study was in review. This comparison shows that the results of the studies are correlated, but also explains why our study provides substantially more information, and also comments on issues related to these types of high-throughput studies (see Response 5 below).

1) The conclusions from this study are dependent on knowing which HA residues are physically located in antigenic sites. The authors use the classic Caton et al. 1982 study to classify which HA residues are located in antigenic sites. This is a reasonable approach because the WSN/1933 virus used in the current study is antigenically similar to the PR8/1934 used in the Caton et al. 1982 study. The Caton et al. experiments defined antigenic sites by selecting viral escape mutants in the presence of saturating amounts of individual monoclonal antibodies. Based on the Caton et al. experimental approach, viruses that possess mutations at HA residues that severely compromise viral fitness are never selected-essentially, Caton et al. defined the antigenic sites by isolating mutant viruses that were capable of growing. With this in mind, it is perhaps not surprising that these same HA residues are tolerant of mutations in the current study.

When the Caton et al. escape mutants are mapped onto the 1RVX structure, there are clusters of mutations that correspond to each antigenic site. Within each cluster, there are a few HA residues that appear to be in the cluster (thus likely in the antibody binding site) but these residues never appeared in the Caton et al. viral mutant escape experiments. It has been assumed that this is the case because either (a) there was a limited number of mutant viruses screened by Caton et al. or (b) that these residues within the antigenic site were detrimental to viral fitness.

The authors should consider making an HA structure analogous to the one shown in Figure 9, with entropy color labeling for all surface residues. This type of analysis will demonstrate if tolerability is limited to those residues identified in selection assays (assays that only identify residues capable of tolerating mutations) or also found in adjacent residues that are likely located in antibody binding sites but were not identified in the original antigenic mapping studies. It might be helpful to show surface on the structure for this type of analyses.

Another partial solution is to examine antigenic sites previously defined by other means (such as epitopes defined by crystal structures).

This is not a critical flaw but should be considered either in the discussion or by completing additional analyses as described above.

This is a very astute comment. Essentially, the reviewers are noting that Caton et al. defined the antigenic sites largely by mapping monoclonal-antibody escape mutants. The Caton et al. study therefore may have had a bias towards mapping sites that were tolerant to mutations, potentially confounding our interpretation.

To address this concern, we have added a second expanded classification of antigenic sites which includes the specific antigenic sites mapped by Caton et al. plus all surface-exposed residues that are in contact with these antigenic sites. This second classification should alleviate the concern about any bias in the Caton et al. mapping, because even if only some residues in a cluster had mutations selected by Caton et al., our expanded structural classification will also include all other residues in the same physical cluster.

In the revised manuscript, we now perform the analysis for both the original Caton et al. classification and this second expanded classification. The results are described in Figure 10 and Table 4. These results show that the antigenic sites have significantly higher inherent mutational tolerance regardless of whether we use the original Caton et al. classification or the new expanded structural classification.

In addition, we have followed the reviewers’ recommendation and included an entropy-colored surface-rendered structure of HA in Figure 10 as well as individual structures showing just the antigenic sites. This new image is panel A of the revised Figure 10. We think this image visually supports the statistical analyses showing that the expanded classification of antigenic sites has higher inherent mutational tolerance.

Finally, we have substantially revised the text in the section entitled “The inherent evolvability of antigenic sites on HA” to explain the point made by this reviewer comment and how we have addressed it. We refer the reviewers to the revised manuscript for the full text, but here is the most relevant portion:

One possible concern is that Caton et al. (1982) mapped antigenic sites largely by selecting monoclonal-antibody escape mutants, and so these sites might be biased towards being more mutationally tolerant. We therefore also made a broader classification of ‘antigenic sites and contacting residues’ consisting of the Caton et al. (1982) antigenic sites plus all surface-exposed residues in contact with these sites (see Methods for details). This broader classification includes all residues in regions of the HA surface targeted by antibodies, and so should not be biased by whether sites are amenable to the selection of monoclonal-antibody escape mutants. We hypothesized that both sets of anti-genic sites would have unusually high mutational tolerance.

2) It is not clear that the residues most essential for function must be the most conserved ones. We know that naturally occurring neutralizing antibodies can target the RBS (Whittle et al., PNAS, 2011). There is also variation in receptor type within hosts and between species.

The main text justifies using conserved RBS as a control because these sites are expected to be strongly conserved, and the RBS is defined to exclude non-conserved RBS-associated sites. One could also argue that high mutational tolerance in the RBS could be adaptive to facilitate escape in nearby epitopes; it would be really interesting if this were the case, with the control being some non-immunogenic area near the stalk. The simplest solution may be to show that the results are unchanged when the non-conserved RBS sites are included, assuming this is the case. If it is not, it would help to know which sites the conserved RBS correspond to on the HA structure. (One would hope they're deep-in-the-pocket).

This is another very astute comment. In the original version, we chose conserved receptor-binding residues as a control group for the antigenic sites because we wanted residues that we expected to be under strong constraint. However, as the reviewers point out, this could give the incorrect impression that all receptor-binding residues are highly con-served. In fact, a handful of “deep in the pocket” residues are highly conserved, but many other residues on the periphery of the receptor binding pocket are relatively amenable to change.

To address this comment, we now define two groups of receptor binding residues: the conserved ones (mostly deep in the pocket) and the set of all residues that contact the receptor. In the revised Figure 10 and Table 4 we analyze both of these groups separately, and discuss the results in the revised section “The inherent evolvability of antigenic sites on HA.” We refer the reviewers to the revised section and figure/table for full details, here we summarize the main findings by quoting from the most relevant portions of the revised section:

“For comparison to the antigenic sites, we used two classifications of receptor-binding residues (see Methods). The first classification consists of residues that have important roles in receptor binding (Martin et al, 1998) and are conserved in H1 HAs; these residues are mostly deep in the binding pocket. The second classification consists of all residues that contact the sialic-acid receptor in the crystal structure, regardless of their level of conservation. We hypothesized that the core set of conserved receptor-binding residues would have unusually low mutational tolerance, but that the set of all receptor-binding residues would have typical levels of mutational tolerance since influenza routinely escapes from antibodies that target the periphery of the receptor-binding pocket (Koel et al, 2013)...”

We then describe results that validate this hypothesis, and conclude:

“These results also show that while a core group of conserved residues deep in the receptor-binding pocket have unusually low mutational tolerance, the bulk of residues that contact the receptor are not under exceptional constraint. This fact probably explains why HA is able to escape from antibodies targeting the periphery of the receptor-binding pocket (Koel et al, 2013), and why only rare antibodies that penetrate deep into this pocket are broadly neutralizing (Whittle et al, 2011).”

3) The paper addresses the mutational tolerance and antigenic evolvability of HA. Relative to what? Are these metrics different for other influenza proteins? Other viral glycopropteins? A cursory analysis of the HA site preference data vs. NP site preference (in Bloom 2014, bioRxiv, currently in review in MBE?) would further support this claim.

This was an extremely helpful suggestion, and we believe that the changes that we have made in response have substantially strengthened the manuscript. Specifically, we now compare our results for HA to our previous results for influenza NP (which are now published in Molecular Biology and Evolution) in a new Figure 11 and Table 5. We also include a new section in the Results (“HA’s antigenic evolvability is not shared by all influenza proteins”).

4) “Big data” mutagenesis scans must sacrifice something in their broad scope; for example, quantitative as opposed to binary fitness measurements, intragenic epistasis (e.g. interactions among substitutions within HA as will occur in this experimental set-up), pool drop out bottlenecks. The authors already address many of these. It would be good to acknowledge them as shortcomings or areas for follow up work, and to discuss the recent publication by Wu et al. of a similar scanning mutagenesis that does report fitness values (Wu et al., High throughput profiling of influenza A virus hemagglutinin gene at the single-nucleotide resolution, Scientific Reports 2013).

Unlike the paper by Wu et al., however, the present manuscript places the mutational data within an evolutionary framework, which is something generally missing in most such studies. This novelty of the current manuscript could be drawn out more. In addition, both this paper and Wu et al. (2014) present a snapshot of the (mostly) single-mutation neighbors from a single point (the sequence of WSN33 HA) in the fitness landscape. Both find that viable paths occur disproportionately at antigenic sites. The fact that these rules, when extrapolated over the phylogeny, greatly improve the fit implies that epistasis is in some ways less pervasive than we think and may not be such a huge obstacle to evolution. Most substitutions appear to be concentrated among a permissive network of sites.

These are all very good comments. We have tried to be as transparent as possible about the shortcomings as well as the strengths of the high-throughput approach that we have used. For instance, we dedicate an entire figure (Figure 6) to transparently showing the correlations between our replicates, and honestly discuss how this figure shows that although the experiment is definitely extracting reproducibly meaningful information, it is also subject to a large amount of noise as judged by the variation among replicates. We discuss how bottlenecks and other experimental problems could contribute to this.

As the reviewers point out, while our manuscript was under review, Wu et al. reported a somewhat similar high-throughput study of influenza HA. The study by Wu et al. is clearly very valuable. However, we believe that our study contains some advantages: for instance, we survey all amino-acid mutations while Wu et al. survey only about 20% of them, we provide full raw data for experimental replicates and controls for possible sources of error while Wu et al. do not, and we perform substantially more quantitative analyses of our data including in connection to natural evolution.

To help address the points that the reviewers have made, we now include a new figure (Figure 7) and a new section in the Results that compares our study with that of Wu et al. We also use this new section to highlight some of the limitations of “big data” studies that the reviewers noted in their comments. Below we quote the full text of this new section, and put in bold the parts that specifically address the limitations noted by the reviewers.

As our paper was under review, Wu et al. (2014) published the results of using a similar strategy to examine the effects of mutations to the WSN HA. In their study, the HA gene was mutated at the nucleotide level, so their experiments surveyed only amino-acid mutations accessible by single-nucleotide codon changes. As a result, they provide data on the effects of only about 20% of the 19 564 = 10716 amino-acid mutations examined in our study. Despite this limitation, their study provides a large dataset of mutational effects to which we can compare our results.

Figure 7 compares the mutational effects determined in our study to those from Wu et al. (2014). There is a highly significant correlation between the results of the two studies; but the inferred mutational effects are certainly not identical. Because Wu et al. (2014) do not provide the data for replicates of their experiment, we are unable to assess whether the variability between the two different studies exceeds the variability between experimental replicates within each study. So one can imagine both biologically interesting and uninteresting explanations for the imperfect correlation between the results of the two studies. The interesting explanation is that differences in experimental methodology could lead to different selection pressures on specific mutations: for instance, Wu et al. (2014) use A549 cells while we use MDCK-SIAT1 cells, and perhaps the impact of certain mutations is dependent on the cell line. The uninteresting explanation is that the imperfect correlation is simply due to noise in the experimental measurements. Unfortunately, it is not straightforward to distinguish between these two explanations. This difficulty in pinpointing reasons for inter-study variation highlights a limitation of the high-throughput experimental methodology employed by ourselves and Wu et al. (2014): while such experiments provide a wealth of data, numerous factors can create noise in these data (sequencing errors, population bottlenecks, epistasis among mutations, etc.). Realizing the full potential of such studies will therefore require extensive experimental controls and biological replicates to quantify errors and noise to enable comparisons across data sets.

Nonetheless, Figure 7 shows that there is a highly significant correlation between the results of these two high-throughput studies, despite differences in experimental methodology and unquantified sources of experimental noise. This fact suggests that both studies capture fundamental constraints on HA’s mutational tolerance. In the remaining sections, we apply the more comprehensive data generated by our study to address questions about HA’s natural evolution and antigenic evolvability.

However, beyond the text quoted above and the new Figure 7, we are unable to analyze the data of Wu et al. to with respect to the other questions about inherent mutational tolerance and viral evolution that we address in our manuscript. The reason is that Wu et al. only measure the fitness effects of an average of 4 of the possible 19 amino-acid mutations at each site, and this incomplete information is not sufficient to either assess inherent mutational tolerance or construct phylogenetic evolutionary models.

5) The following aspects of the experiments should be clarified/discussed: (a) The possibility of reversion between the plasmid pool and the passaged virus is not accounted for. Granted the wild type control shows that there is low background mutation, but a weakly functional mutant could revert. (b) During transfection, one would expect that there may be viruses produced from cells expressing two different HA genes. How would such chimeric viruses confound the results in a single passage experiment? (c) Similarly to #2, wouldn't there also be viruses in which the HA on the surface has a given amino acid sequence, while the packaged genomic segments code for a different one? Multiple passages would ferret these out. (d) How were HA molecules sheared and why are there biases in sequence fragmentation? (e) Are more exposed HA residues generally more tolerant to mutations? The authors address this by correcting for relative solvent accessibility, but this description is difficult to follow. The importance of relevant solvent accessibility (RSA) is shown in Table 3 and Figure 9, but the RSA result is not stated clearly in the main text. Can the authors simplify or clarify this part?

The reviewers ask in point (a) above about the possibility of reversions. We suspect that reversions are very rare, since the virus is only passaged once which should give a relatively small opportunity for such mutations to arise. But we acknowledge that none of our controls directly measure the rate of reversions. Our sequencing of the wildtype virus shows that mutations overall are rare, but it is possible that reversions are more common. However, a key advantage of our approach is that reversions would only affect a small fraction of our mutations – because we introduce mutations at the codon level, most (54 of 63) of the possible mutations involve multiple nucleotide changes. We deem it extremely unlikely that a single codon would revert two nucleotides during a single viral passage. To address this issue, we have revised the text.

The reviewers ask in points (b) and (c) above about chimeric viruses that express multiple HA proteins, and about viruses that lack a genotype-to-phenotype link because they possess different HA genes and proteins. We agree that this is a potential concern – however, the problem should be fixed during the low multiplicity of infection (MOI) passaging of the virus. In the current manuscript, we only perform one such passage – so the reviewers ask if it might not be desirable to also perform multiple passages to make sure the genotype-phenotype link is strong. This is a great suggestion, but in fact, in our previous work with NP (Bloom, 2014) we did just that. Specifically, in that work, we sequenced NP after both one and two passages, and found that the results after just one passage were essentially indistinguishable from the results after two passages (see the fifth figure of http://mbe.oxfordjournals.org/content/early/2014/06/12/molbev.msu173). One reason that a single low-MOI passage is su?cient may be that there is probably substantial selection that also occurs during viral rescue in co-cultures prior to the first low-MOI passage. In any case, based on our previous NP work, we judge that one low-MOI passage is sufficient to ensure a strong genotype-phenotype link for influenza genes, and so for this reason we only performed one low-MOI passage in the current work (there are good reasons for wanting to keep the number of passages as low as possible, primarily to avoid problems like the reversion one mentioned in the reviewers’ previous comment).

To clarify this point, we have revised the manuscript (section starting “The mutant viruses generated for each replicate were passaged at a relatively low multiplicity of infection”).

The reviewers ask in point (d) how the HA molecules were fragmented. The fragmentation was done using the Illumina Nextera kit, which is widely used in the preparation of Illumina sequencing libraries. The fragmentation bias almost certainly arises due to weak sequence preferences of the transposon that mediates the fragmentation in the Nextera kit. It is known that this transposon has weak biases for certain sequences; for instance, see http://genomebiology.com/2010/11/12/r119. However, as should be clear from Figure 3—figure supplement 3 these biases are fairly mild, and lead to at most a five-fold variation in coverage at different sites.

To clarify this issue, we have revised the manuscript (section starting “In order to reduce the sequencing error rate”).

The reviewers ask in point (e) whether residues with higher relative solvent accessibility (RSA) are generally more tolerant to mutations. The answer is yes as shown in Table 4 and Figure 10, but as the reviewers point out this result was not clearly stated in the main text. We have revised the text to make this point more clearly.

6) The conclusions would be further strengthened if there was some way to ’predict‘ retrospectively where HA would go next based on the model. It would be interesting to discuss this possibility along the lines of what Luksza and Lassig published in Nature this year (Nature 507:57-61).

The reviewers ask if it might be possible to predict the evolution of influenza using data from our deep mutational scanning. In particular, they mention the work of Luksza and Lassig (2014), who develop a fitness model that enables improved forecasting about which of a variety of closely related existing strains of epidemic influenza are likely to predominate in future years. We do see some ways in which our data might synergize with their approach. For instance, a key component of their model (see Equation 2 of their paper) is assigning a “mutational load” to non-antigenic mutations. One could imagine that deep mutational scanning might enable more accurate assignment of the “mutational load” caused by specific mutations, although it remains unproven whether this would actually work.

For now, we mention this possibility with the following sentence in the Discussion:

“It also may be possible to utilize high-throughput experimental data on mutational effects to better estimate the fitness of naturally occurring strains in a way that aids in prediction of the year-to-year strain dynamics of influenza (Luksza and Lassig, 2014).”

We have also provided links in the manuscript to our complete raw data and source code as well as detailed documentation of our computational analysis pipeline. We hope that this transparent availability of all our raw data will help facilitate the work of others who wish to develop approaches for viral forecasting. However, we are currently unable to offer any sort of effective predictive model ourselves, and believe that prediction of viral evolution remains a daunting problem.

https://doi.org/10.7554/eLife.03300.030

Article and author information

Author details

  1. Bargavi Thyagarajan

    1. Division of Basic Sciences, Fred Hutchinson Cancer Research Center, Seattle, United States
    2. Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, United States
    Contribution
    BT, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
    ORCID icon 0000-0003-3871-6410
  2. Jesse D Bloom

    1. Division of Basic Sciences, Fred Hutchinson Cancer Research Center, Seattle, United States
    2. Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, United States
    Contribution
    JDB, Conception and design, Analysis and interpretation of data, Drafting or revising the article
    For correspondence
    jbloom@fhcrc.org
    Competing interests
    The authors declare that no competing interests exist.
    ORCID icon 0000-0003-1267-3408

Funding

National Institutes of Health (R01GM102198)

  • Jesse D Bloom

Kinship Foundation (Searle Scholarship)

  • Jesse D Bloom

The funder had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Paul Edlefsen for assistance with the multiple linear regression. We thank Hugh Haddox for helpful comments on the manuscript.

Reviewing Editor

  1. Mercedes Pascual, Reviewing Editor, University of Michigan, United States

Publication history

  1. Received: May 7, 2014
  2. Accepted: July 3, 2014
  3. Accepted Manuscript published: July 8, 2014 (version 1)
  4. Accepted Manuscript updated: July 9, 2014 (version 2)
  5. Accepted Manuscript updated: July 11, 2014 (version 3)
  6. Version of Record published: July 29, 2014 (version 4)

Copyright

© 2014, Thyagarajan and Bloom

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 4,259
    Page views
  • 437
    Downloads
  • 35
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.

Comments

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Genes and Chromosomes
    Timothy W R Kelso et al.
    Research Article Updated