Molecular function limits divergent protein evolution on planetary timescales

Abstract
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

Functional conservation is known to constrain protein evolution. Nevertheless, the long-term divergence patterns of proteins maintaining the same molecular function and the possible limits of this divergence have not been explored in detail. We investigate these fundamental questions by characterizing the divergence between ancient protein orthologs with conserved molecular function. Our results demonstrate that the decline of sequence and structural similarities between such orthologs significantly slows down after ~1–2 billion years of independent evolution. As a result, the sequence and structural similarities between ancient orthologs have not substantially decreased for the past billion years. The effective divergence limit (>25% sequence identity) is not primarily due to protein sites universally conserved in all linages. Instead, less than four amino acid types are accepted, on average, per site across orthologous protein sequences. Our analysis also reveals different divergence patterns for protein sites with experimentally determined small and large fitness effects of mutations.

Editorial note: This article has been through an editorial process in which the authors decide how to respond to the issues raised during peer review. The Reviewing Editor's assessment is that all the issues have been addressed (see decision letter).

https://doi.org/10.7554/eLife.39705.001

Introduction

As proteins evolve from a common ancestor, their sequences and structures diverge from each other (Chothia and Lesk, 1986; Povolotskaya and Kondrashov, 2010). Multiple previous studies have investigated the relationship between the conservation of protein molecular function, sequence identity (Lee et al., 2007; Tian and Skolnick, 2003; Worth et al., 2009) and structural similarity (Chothia and Lesk, 1986; Wilson et al., 2000). For example, the likelihood that two proteins share the same molecular function, given their sequence (Tian and Skolnick, 2003) or structural (Wilson et al., 2000) similarity, has been used to investigate the emergence of new protein functions (Rost, 2002; Conant and Wolfe, 2008), and to perform functional annotations of protein sequences (Lee et al., 2007; Wilson et al., 2000). In this work, we focused on a different and currently unaddressed set of questions. Namely, how far can two sequences diverge while continuously maintaining the same molecular function? What are the temporal patterns of this divergence across billions of years of evolution? And how different protein sites contribute to the long-term divergence between orthologs with the same molecular function? We note that the requirement for the continuous conservation of molecular function is crucial in this context, as multiple examples of convergent evolution and protein engineering demonstrate that the same molecular function, such as catalysis of a specific chemical reaction, can in principle be accomplished by proteins with unrelated sequences and even different folds (Bork et al., 1993; Galperin et al., 1998; Omelchenko et al., 2010).

It was previously demonstrated that proteins with the same structural fold frequently diverge to very low (~10%) levels of sequence identity (Rost, 1997). These results suggest that conservation of protein fold, that is, the overall arrangement and topological connections of protein secondary structures (Murzin et al., 1995), exerts relatively minor constraints on how far protein sequences can diverge. In contrast to protein folds, it is possible that conservation of specific molecular functions will significantly limit the long-term divergence of protein orthologs. While only a relatively small fraction of protein residues (~3–5%) are usually directly involved in catalysis (Lehninger et al., 2013), recent analyses have demonstrated that even sites located far from catalytic residues may be significantly constrained in evolution. Because substitutions at these sites can have substantial effects on molecular function (Firnberg et al., 2016), it is likely that functionally-related sequence constraints extend far beyond catalytic residues.

In this study, we explored the long-term divergence patterns of protein orthologs by characterizing their pairwise sequence and structural similarity as a function of their divergence time. We used several models of molecular evolution to calculate the divergence rates, defined as the decrease in pairwise sequence identity or structural similarity per unit time, between orthologous proteins with the same molecular function. We also characterized the long-term divergence patterns at protein sites with different levels of evolutionary conservation, different locations in protein structures, and different experimentally measured fitness effects of amino acid substitutions. Finally, we explored how the limits of sequence and structural divergence after billions of years of evolution depend on the degree of functional conservation between orthologs.

Results

To study the evolution of proteins with the same molecular function, we initially focused our analysis on enzymes because their molecular function is usually well defined. The Enzyme Commission (EC) classifies enzymatic functions using a hierarchical 4-digit code (Bairoch, 1999), such that two enzymes that share all four EC digits catalyze the same biochemical reaction. We used protein sequences representing 64 EC numbers from 22 diverse model organisms across the three domains of life (Supplementary file 1). The considered activities include members of all six major enzyme classes: oxidoreductases, transferases, hydrolases, lyases, isomerases and ligases.

To investigate whether the conservation of enzymatic function limits the divergence between orthologous sequences, we first calculated global pairwise sequence identities between orthologs as a function of their divergence times (Figure 1, Figure 1—figure supplement 1). The pairwise divergence times reported in the literature (Hedges et al., 2006) between the considered 22 species (Supplementary file 1) were used as a proxy for the divergence times between corresponding orthologous proteins. For each enzymatic activity, we constructed phylogenetic trees based on the orthologous protein sequences and compared them to the corresponding species’ trees. Protein sequences with clear differences to the species'phylogenetic tree topologies, suggesting cases of horizontal gene transfer, were excluded from the analysis (see Materials and methods).

Figure 1 with 7 supplements see all

Download asset Open asset

Sequence divergence of enzyme orthologs as a function of time.

The global pairwise sequence identities between pairs of orthologous enzymes are shown as a function of divergence times between the corresponding species. Three models of amino acid substitution were used to fit the divergence data. Model 1 (black lines) assumes independent and equal substitution rates across all protein sites. Model 2 (red lines) assumes, in addition to independent and equal substitution rates, that a given fraction of protein sites remains identical at large divergence distances. Model 3 (blue lines) assumes a gamma distribution of amino acid substitution rates across sites. Best fits of the models are shown for four representative EC numbers: (a) 1.5.1.3, (b) 2.7.4.3, (c) 4.2.1.2, (d) 6.3.4.2. The horizontal dashed black lines represent the average sequence identity for the global alignment of unrelated protein sequences. The data and corresponding model fits for the other EC numbers considered in the analysis are given in Figure 1—figure supplement 1 and Supplementary file 2a.

https://doi.org/10.7554/eLife.39705.002

We next considered two simple models of long-term protein evolution, one without a long-term limit of sequence divergence and the other with an explicit divergence limit. The first model corresponds to sequence divergence with equal and independent amino acid substitution rates across all proteins sites (Dickerson, 1971; Zuckerkandl and Pauling, 1965); see Equation 1, where y represents global sequence identity, t represents divergence time, and R₀ represents the average substitution rate (Dickerson, 1971). In this model, back substitutions are not allowed, and sequence divergence slows down with time simply due to multiple substitutions at the same protein sites and progressively fewer non-mutated sites. The second model corresponds to sequence divergence where, in addition to sites with equal and independent substitution rates, there is a minimal fraction of identical sites at long divergence times; the fraction of identical sites is represented by Y₀ in Equation 2.

y = 100 * e^{- R_{0} * t}

y = Y_{0} + (100 - Y_{0}) * e^{- R_{0} * t}

We applied the two models to fit the sequence divergence for each of the considered enzymatic functions. The best model fits for four representative metabolic activities are shown in Figure 1 (black for the first model and red for the second); the fits for the remaining metabolic activities are shown in Figure 1—figure supplement 1. In 62 of the 64 cases, the second model fits the divergence data significantly better than the first model (F-test p-value<0.05, Supplementary file 2a). Moreover, in 95% of the cases (61/64) the maximum likelihood value of the parameter Y₀ is significantly higher (Wald test p-value<0.05) than the average sequence identity between random protein sequences based on their optimal global alignment (~13.5%, shown in Figure 1 and Figure 1—figure supplement 1 by dashed black lines). The distribution of the fitted parameter Y₀ suggests a long-term sequence identity >25% (with average ~40%) between considered orthologs (Figure 2a); this demonstrates that conservation of a specific enzymatic function significantly limits long-term protein sequence divergence. Notably, model two is mathematically equivalent (see Materials and methods) to a divergence model with equal substitution rates across sites, a limited number of amino acid types accepted per site, and allowed back substitutions (Tajima and Nei, 1984; Gilson et al., 2017; Yang, 2006). In this model, the parameter Y₀ represents the inverse of the effective number of acceptable amino acid types per site during protein evolution. Our results thus suggest that, on average, only 2 to 4 amino acids are acceptable per site for proteins that strictly conserve their molecular function (Figure 2a, top blue X axis); we note that this low average number does not contradict the fact that more than four amino acid types could be observed at a given protein site at low frequencies (Breen et al., 2012).

Figure 2

Download asset Open asset

The limit of long-term protein sequence divergence between orthologous proteins.

(a) The distribution of Y₀ parameter values across 64 EC numbers for fits based on Model 2 (Equation 2). The Y₀ parameter represents the minimum percentage of protein sites that remain identical at long divergence times. The parameter Y₀ (considered as a fraction) can also be interpreted as the inverse of the average number of amino acid types accepted per protein site during long-term protein evolution (top blue X axis). (b) Similar to panel a, but for 29 ancient protein families annotated with non-enzymatic functions. In panels a and b, the vertical red dashed lines represent the median values of the distributions (39% and 30%, respectively).

https://doi.org/10.7554/eLife.39705.014

The two aforementioned models simplify the process of sequence divergence by considering the same substitution rates across protein sites. A more realistic and commonly used model of protein evolution assumes a gamma distribution (Yang et al., 2000) of substitution rates across protein sites; see Equation 3; (Ota and Nei, 1994), where α represents the shape parameter of the gamma distribution. The best fits of such a variable-rate model (blue in Figure 1 and Figure 1—figure supplement 1) showed that the rates of protein sequence divergence between orthologous enzymes have decreased by more than 10 times during ~4 billion years of evolution (see Materials and methods and Supplementary file 2b). Although the third model does not explicitly consider a long-term divergence limit, the obtained model fits also show that the vast majority of orthologous enzymes with the same function will remain above 25% sequence identity on the timescales when Earth environments will be hospitable to life (1–3 billion years from the present O'Malley-James et al., 2014) (Figure 1—figure supplement 2).

y = 100 * {(\frac{R_{0} * t}{α} + 1)}^{- α}

The observed divergence limit is not an artifact due to difficulty detecting remote protein homologs, as it occurs at relatively high sequence identities (Figure 1 and Figure 1—figure supplement 1), for which corresponding orthologs can be easily identified by computational sequence comparison methods. Furthermore, the results remained similar when we restricted the analysis to orthologous enzyme pairs with experimentally validated molecular functions (Figure 1—figure supplement 3), based on publications referenced in the BRENDA database (Chang et al., 2015). The results also remain robust towards the variance in the estimates of divergence times between considered species (see Materials and methods). We note that the divergence limit between orthologs with the same molecular function does not imply that the rates of molecular substitutions decrease in evolution. It is also not simply due to the curvilinear relationship between time and sequence identity caused by multiple mutations at the same sites; specifically, the observed decrease in divergence rates is substantially higher (by >10 fold) than the one expected in model one simply due to multiple substitutions at the same protein sites. Instead, the effective limit is reached when, due to a small number of amino acids types accepted per protein site and back substitutions, additional amino acid replacements do not lead to a substantial further increase in protein sequence and structural divergence (Meyer et al., 1986).

Interestingly, following the previously introduced metaphor of the expanding protein universe (Povolotskaya and Kondrashov, 2010; Dokholyan et al., 2002), we can use the third model (Equation 3) to express the divergence rate between orthologs as a function of protein distance (D = 1 – y, where y is the fractional sequence identity ranging from 0 to 1), see Equation 4. This equation, similarly to Hubble’s law of universe expansion (Hubble, 1929), describes how the divergence rate depends on the distance between protein orthologs. In contrast to the real universe, the expansion rate of the protein universe significantly decreases with divergence time and with distance between protein orthologs. For example, the divergence rate between protein orthologs drops, on average, to only ~2% sequene identity decrease per billion years when their mutual sequence identity reaches 30% (corresponding to protein distance of 70%; Figure 1—figure supplement 4).

\frac{\partial D}{\partial t} = R_{0} * {(1 - D)}^{\frac{(α + 1)}{α}}

The analyses described above focused on the divergence of enzymes with the same molecular function. In order to investigate whether the observed divergence patterns are not specific to enzymes, we repeated the analysis using non-enzymatic ancient orthologs (Figure 1—figure supplement 5, Supplementary file 2c). The set of analyzed 29 protein families included ribosomal proteins, heat shock proteins, membrane transporters, and electron transfer flavoproteins (Supplementary file 2d). Based on the same set of 22 species used in the analysis of enzyme families, we found that model two fitted the data significantly better than model one, and that the parameter Y₀ was >25% for the majority (23/29) of the non-enzymatic protein families (Figure 2b, Supplementary file 2c). Interestingly, we also identified 19 additional orthologous groups showing two clearly different divergence patterns (Figure 1—figure supplement 6), with pairs of eukaryotic orthologs diverging substantially faster and farther than prokaryotic orthologs in the same protein family. The orthologous groups with this behavior included mitochondrial ribosomal proteins and initiation factors of mitochondrial translation (Supplementary file 2e). It has been previously postulated that mitochondrial ribosomal proteins diverged significantly faster in eukaryotes, compared to the divergence between their bacterial orthologs, due to compensatory protein substitutions following the accumulation of slightly deleterious substitutions in the mitochondrial ribosomal RNA (Barreto and Burton, 2013).

Having established, in the first half of the manuscript that conservation of molecular function significantly limits long-term sequence evolution, we investigated, in the second half, how different protein sites contribute to the observed divergence constraints. Specifically, whether the same protein sites are conserved between ancient orthologs in different phylogenetic lineages, how sites with different fitness effects of amino acid substitutions contribute to the divergence limit, and how structural locations of protein sites affect their long-term divergence patterns. We also explored how different levels of functional specificity constrain sequence and structural divergence.

To investigate whether the same protein sites are usually conserved between orthologs in different phylogenetic lineages, we aligned the sequences of ancient enzyme orthologs with the same molecular function (see Materials and methods). We then calculated how often each protein site is occupied by identical amino acids across pairs of orthologs from phylogenetically independent linages (Figure 3—figure supplement 1). Orthologous protein pairs from independent lineages were obtained using species' pairs that do not share any edges in the corresponding phylogenetic tree (Arnold and Stadler, 2010) (Figure 3a); for example, in Figure 3a the pair D-H is independent of the pair A-B but not of the pair E-F. We performed the above analysis using 30 enzymatic activities for which at least 20 independent pairs of orthologs could be identified based on annotations in the KEGG database (Kanehisa et al., 2016) (see Materials and methods). The results demonstrated that only a relatively small fraction of protein sites (10–20%) are universally conserved, that is, they are identical in a majority (>90%) of independent lineages (Figure 3b). Therefore, the observed long-term divergence limit is not primarily due to sets of universally conserved protein sites; instead, different sites contribute to the limit in independent phylogenetic lineages. By comparing the fractions of universally conserved sites to the average sequence identity between distant orthologs (~40%, Figure 2a) we found that these sites account, on average, for only ~35% of the observed sequence identity at long divergence distances. The analysis also revealed that different protein families show different probability distributions of identical sites (Figure 3—figure supplement 1). This is likely a consequence of diverse structural and functional requirements across protein families, leading to protein-family specific constraints on protein sites.

Figure 3 with 1 supplement see all

Download asset Open asset

Conservation of protein sites in phylogenetically independent lineages.

To identify the fractions of protein sites that are universally conserved ― defined as sites that are identical in at least 90% of orthologs ― we considered phylogenetically independent lineages. (a) Illustration of pairs of species (e.g. **A–B and D–H**) representing phylogenetically independent lineages. In the figure, A-B and D-H are pairs of species that diverged within a certain time window (illustrated by the blue shaded region); these species pairs do not share more recent edges in the phylogenetic tree. (b) The distribution of the fraction of universally conserved sites across 30 enzymatic families. The analysis was performed using 30 enzymatic families for which at least 20 independent pairs of orthologs with the same molecular function could be identified based on annotations in the KEGG database (Kanehisa et al., 2016) (see Materials and methods); pairs of orthologs that diverged >2 billion years ago were selected for this analysis. Error bars represent the S.E.M. based on three replicates using different sets of orthologous pairs. The dashed red line represents the median of the distribution (~13%).

https://doi.org/10.7554/eLife.39705.015

We next investigated the long-term divergence patterns at protein sites with different fitness effects of amino acid substitutions. To that end, we experimentally measured the fitness effects of all possible single amino acid substitutions in a representative enzyme, the Escherichia coli dihydrofolate reductase (FolA, EC 1.5.1.3). We selected FolA for the experiments due to its small size (159 amino acids) and essential role in the E. coli metabolism (Benkovic et al., 1988); also, the long-term protein sequence identity between FolA orthologs (~32%, see Figure 1a) is similar to other analyzed enzymes (Figure 2a). Following a recently described strategy (Kelsic et al., 2016), we used the Multiplex Automated Genome Engineering (MAGE) approach (Wang et al., 2009) to introduce every possible amino acid substitution at each FolA site in E. coli. To evaluate the relative fitness effects of protein substitutions we measured the growth rate of strains containing each protein variant compared to the ‘wild type’ (WT) strain into which substitutions were introduced. Relative growth rates were measured in parallel by performing growth competition experiments between the pooled mutants. Amplicon sequencing of the folA gene was then used to measure the relative changes of mutant and WT abundances as a function of time (see Materials and methods, Supplementary file 3).

Using the MAGE growth measurements in E. coli, we investigated the patterns of long-term sequence divergence at protein sites with different fitness effects of amino acid substitutions. Specifically, we sorted FolA protein sites into several groups according to their experimentally measured average fitness effects (Figure 4—figure supplement 1), and explored sequence divergence for sites within each fitness group (Figure 4a, different colors). We evaluated sequence identity between FolA orthologs across divergence times using all pairwise comparisons between ~300 orthologous sequences from the COG database (Galperin et al., 2015). Although, as expected, sites with stronger fitness effects diverged more slowly, our analysis revealed interesting differences in temporal divergence patterns for sites with small and large fitness effects. For sites in the least deleterious fitness group (Figure 4a, blue) we observed, similar to the global sequence identity, a substantial decrease (~10 fold, see Equation 5 in Materials and methods) in mutual divergence rates after ~1.5 billion years of evolution. Notably, even for FolA sites with mild fitness effects, sequence identity remains above 25% at long divergence distances. In contrast to sites with mild fitness effects, sites with the most deleterious mutations (Figure 4a, black) displayed a much slower, but approximately constant average divergence rate throughout evolutionary history. This pattern suggests that, in contrast to divergence at sites with small fitness effects, the divergence at sites with large effects is not yet close to saturation.

Figure 4 with 2 supplements see all

Download asset Open asset

Sequence divergence of protein sites with different fitness effects of mutations measured in *E. coli*.

(a) The divergence of sequence identity for FolA protein sites with different average fitness effects of mutations (measured in *E. coli*) is shown using different colors. The average sequence identities were calculated using bacterial FolA orthologs available in the COG database (Galperin et al., 2015); divergence times were estimated using bacterial 16S rRNA sequences (see Materials and methods). Error bars represent the S.D. of sequence identity in each bin. (b) Similar to panel (a), but for the sequence divergence at protein sites of the *E. coli* translation initiation factor InfA. (c) The probability that protein sites in FolA orthologs (upper panel) and InfA orthologs (lower panel) are occupied by identical amino acids as a function of the average mutant fitness (measured in *E. coli*) at the corresponding protein sites. The probability represents the fraction of phylogenetically independent pairs of orthologs in which sites are identical at long divergence times (2 ± 0.25 billion years for FolA, and 2.5 ± 0.25 billion years for InfA). Error bars represent the S.E.M. across sites.

https://doi.org/10.7554/eLife.39705.017

To assess the generality of the FolA results we used another dataset (Kelsic et al., 2016), obtained using MAGE, of fitness values for all possible amino acids substitutions in the E. coli translation initiation factor InfA (Figure 4b). Consistent with the relatively higher level of sequence conservation of InfA, we observed lower average mutant growth rates and lower rates of sequence divergence in each fitness group. Nevertheless, the long-term divergence patterns were qualitatively similar between the two proteins. For sites in the least deleterious InfA fitness group (Figure 4b, blue), we observed a substantial decrease in the divergence rate after ~2 billon years of evolution. In contrast, sites with strongest fitness effects (Figure 4b, pink) displayed a slower but approximately constant divergence rate.

Because the fitness effects of mutations at a protein site may change in evolution (Lunzer et al., 2010; Chan et al., 2017), it is interesting to investigate how fitness effects measured in one species, such as E. coli, correlate with the site conservation in other species at the divergence limit. To explore this question, we calculated the probability that a protein site is occupied at long evolutionary times (~2 billion years for FolA and ~2.5 billion years for InfA) by the same amino acid in phylogenetically independent lineages (Figure 3a). We then investigated how this probability changes as a function of the average fitness effects of substitutions at the site measured in E. coli (Figure 4c). For both FolA and InfA, the probability that a protein site is identical at large divergence distances first increases, approximately linearly with increasing average fitness effects, and then begins to saturate for sites with large (>30% growth rate decrease) fitness effects. Thus, the fitness effects at a protein site correlate with the site’s conservation even after billions of years of evolution.

The sequence constraints revealed by our analysis likely arise due to the conservation of corresponding protein structures required for efficient catalysis and molecular function (Wilson et al., 2000; Watson et al., 2005). Therefore, in addition to sequence divergence we also investigated the long-term structural divergence of orthologous proteins with the same function. For this analysis we used >1000 orthologous pairs of enzymes sharing all 4 EC digits with known 3D structures (Berman et al., 2000) (see Materials and methods); structures of orthologous enzymes were aligned using the TM-align algorithm (Zhang and Skolnick, 2005). This analysis demonstrated that the average root mean square deviation (RMSD) between C-alpha atoms of the orthologous enzymes increases (Spearman’s r = 0.44, p-value<1e-20) with divergence time (Figure 5a). Nevertheless, the C-alpha RMSD between orthologs rarely increases beyond 3 Å, even at long evolutionary distances. Consistent with sequence evolution (Figure 1), we also observed a substantial decrease in the rate of structural divergence after ~1.5 billion years of divergent evolution.

Figure 5

Download asset Open asset

Long-term structural evolution of enzymes with the same molecular function.

(a) The pairwise C-alpha root mean square deviation (RMSD) as a function of the divergence time between pairs of orthologs annotated with the same EC number. RMSD values were calculated based on structural protein alignments using the TM-align algorithm (Zhang and Skolnick, 2005). Gray dots represent pairs of considered orthologs, boxes indicate the median and 25–75 RMSD percentiles for the corresponding divergence times, the vertical lines indicate the 5–95 percentiles, and the red line shows the moving average of the data. (b) Long-term divergence of sequence identity of protein sites located at different distances to enzymes’ active site residues. In this analysis we considered the same species and enzymatic activities used to explore the global sequence divergence (Figure 1 and Figure 1—figure supplement 1); the average sequence identities within each distance shell (shown using different colors) were calculated across all pairs of orthologs annotated with the same EC number (see Materials and methods). Error bars represent the S.E.M. across ortholog pairs.

https://doi.org/10.7554/eLife.39705.020

Figure 5—source data 1 RMSD versus divergence times for proteins with the same enzymatic function.: https://doi.org/10.7554/eLife.39705.021
Download elife-39705-fig5-data1-v2.txt

Only a small fraction of all enzyme residues forms an active site and directly participates in catalysis. Therefore, we investigated next how the sequence divergence depends on the spatial proximity of protein positions to active site residues. It was previously demonstrated that evolutionary rates of amino acid substitutions correlate with protein sites’ spatial distance to catalytic residues (Jack et al., 2016). Notably, differences in short-term evolutionary rates do not imply the existence of a divergence limit, nor do they inform how site-specific divergence patterns correlate with sites' structural locations. Thus, the main goal of our analysis was to investigate the temporal patterns of the long-term divergence, and the effective divergence limit for sites at various distances to the active site. We considered catalytic site annotations available from the Protein Data Bank (Berman et al., 2000), UniProt-KB (UniProt Consortium, 2015) and the Catalytic Site Atlas (Porter, 2004) and quantified the average sequence divergence for sites at various distances from catalytic residues (see Materials and methods, Figure 5b). We based this analysis on the same set of enzymatic activities used to study global sequence divergence (Figure 1 and Figure 1—figure supplement 1). Although, as expected, residues close to the active site were the most highly conserved (Jack et al., 2016; Halabi et al., 2009), even distant sites displayed a substantial divergence limit (~40%) at long evolutionary distances. This result suggests that the spatial constraints required for specific molecular function usually propagate throughout the entire protein structure and significantly limit the long-term divergence even at sites distant from catalytic residues.

Finally, we investigated how various degrees of functional conservation affect the long-term divergence between orthologs. To that end, we compared the long-term sequence and structural similarities of enzymes sharing their full EC classification to those sharing only the first three digits of their EC classification (Figure 6a and b); for this comparison we only used orthologs from species with divergence times > 2 billion years (see Materials and methods). In contrast to enzymes sharing all four EC digits, conservation of the first three digits indicates only a general class of substrates or cofactors (Bairoch, 1999). This analysis revealed significantly lower long-term sequence identities (27% vs. 37% identity, Mann-Whitney p-value<10⁻²⁰) and structural similarities (2.4 vs. 1.8 Å RMSD, P-value 2 × 10⁻¹⁸) between orthologs sharing only partial EC numbers. Notably, orthologs sharing only the first three EC digits are still substantially more conserved, both in sequence and structure (p-values<10⁻²⁰), than pairs of enzymes with the same structural fold, but unrelated enzyme activities (i.e. enzymes sharing no digits in the EC classification) (Figure 6a and b, Dawson et al., 2017).

Figure 6

Download asset Open asset

Effect of functional specificity on long-term sequence and structural divergence between orthologs.

(a) Sequence identities between orthologous pairs of enzymes that diverged over two billion years ago. The long-term sequence divergence between pairs of orthologs sharing the same EC number (gray, n = 272) is compared to the divergence between pairs only sharing the first three digits of their EC numbers (red, n = 265), that is, enzymes conserving only a general class of substrates or cofactors. The results are based on enzyme COGs for the 22 species used to analyze global sequence divergence (Supplementary file 1). Blue points show sequence identities between pairs of proteins with the same structural fold (Dawson et al., 2017) but unrelated enzyme activities, that is, activities sharing no digits in the EC classification (n = 298, see Materials and methods). The right blue Y axis represents the average number of amino acid types accepted per protein site during long-term protein evolution. (b) Similar to panel a, but showing the corresponding C-alpha structural divergence (RMSD) between protein pairs. (c) Long-term sequence identities between orthologous enzyme pairs at the same levels of structural similarity. Results are shown for pairs of enzymes sharing their full EC classification (gray dots), or only sharing the first three digits of their EC classification (red dots). In all panels: *(p<0.05), **(p<1e-4), ***(p<1e-10) for the Mann-Whitney test.

https://doi.org/10.7554/eLife.39705.022

Figure 6—source data 1 Sequence identities between ancient orthologs sharing the same EC number and only the first three digits of their EC numbers.: https://doi.org/10.7554/eLife.39705.023
Download elife-39705-fig6-data1-v2.txt

We also investigated the sequence constraints at the same level of protein structural divergence for proteins with different degrees of functional conservation. To that end, we calculated the sequence identity between orthologs, sharing either their full or partial EC numbers, at different bins of long-term structural similarity (Figure 6c). Interestingly, we observed that even at the same level of C-alpha RMSD divergence, orthologs sharing full EC numbers usually have higher levels of sequence identity compared to orthologous pairs with the same level of structural divergence but sharing only three EC digits. This result indicates that conservation of molecular function constrains sequence divergence even beyond the requirement to maintain a specific spatial structure.

Discussion

Our analysis demonstrates that, in contrast to proteins with the same fold (Rost, 1997), the requirement to strictly conserve the same molecular function significantly limits the long-term sequence and structural divergence of protein orthologs. Although we confirmed the result by Povolotskaya and Kondrashov (2010) that ancient protein orthologs are still diverging from each other, our study reveals that the rate of this divergence becomes increasingly slow for orthologs that strictly conserve their function. Even a slight relaxation of functional specificity, for example from full to partial EC conservation (Figure 6a and b), leads to substantially more pronounced long-term sequence and structural divergence. Similarly, a substantial sequence identity between homologous restriction endonucleases is usually limited to isoschizomers, that is, proteins specific to the same target DNA sequence (Pingoud et al., 2014).

We believe that the observed divergence patterns can be explained by the following mechanistic model. Proteins with the same molecular function usually conserve the identity of their chemical and biological substrates and interaction partners. This conservation leads to functional pressure to closely preserve the spatial positions and dynamics of key protein residues necessary for efficient catalysis and function (Lehninger et al., 2013). In turn, the requirement to continuously preserve structural properties and functional dynamics of key protein residues likely imposes a strict requirement to preserve the overall protein structure, in other words, structural optimality necessary for efficient and specific protein function. The evolutionary pressure to preserve structural optimality (to within <3 Å C-alpha RMSD) required for a given molecular function leads, in agreement with the results by Chothia and Lesk (1986) and others (Gilson et al., 2017), to substantial levels of overall sequence conservation and the observed divergence limit. Our analysis further demonstrates that less than four amino acid types are accepted, on average, per site for proteins strictly conserving their molecular function.

We note that the observed conservation may reflect both the impact of amino acid substitutions on protein activity, due to changes in equilibrium positions and dynamics of protein residues, and on protein abundance, due to changes in overall protein stability (Bershtein et al., 2015; Adkar et al., 2017). Nevertheless, direct and comprehensive biochemical experiments demonstrated that the deleterious effects of mutations primarily arise from changes in specific protein activity rather than decreases in protein stability and cellular abundance (Firnberg et al., 2016). Our results support this model, demonstrating that conservation of functional specificity imposes substantially more stringent long-term sequence constraints than simply conservation of protein folds and protein stability. Indeed, while every folded protein should maintain its stability, our results demonstrate that maintaining stable folds does not constrain long-term divergence to more than 10-15% sequence identity, which is substantially less than the requirement for maintaining specific molecular functions (~40% sequence identity).

The presented results demonstrate that only about a third of the sequence conservation between distant orthologs with the same molecular function can be attributed to universally conserved protein sites, that is, sites occupied by identical amino acids in almost all lineages. In contrast, we found that different protein sites are usually identical between orthologs from different lineages. This result is likely due, at least in part, to the epistatic nature of protein sequence landscapes, where mutations that are neutral in one lineage are often prohibitively deleterious in another (Breen et al., 2012; Lunzer et al., 2010). In the context of the aforementioned divergence model, the evolution of mitochondrial ribosomal proteins in eukaryotes (Figure 1—figure supplement 6) provides an interesting example, suggesting that orthologs’ divergence can be substantially accelerated by co-evolution with their interaction partners or relaxation of selection pressures.

Our experimental and computational analyses also delineate two distinct stages of the long-term divergence of orthologs with the same molecular function. During the first 1–2 billion years of divergence, substitutions at protein sites with mild fitness effects lead to a substantial (40–60%) decrease in sequence identity. After the first stage, divergence at these sites effectively saturates. The saturation at sites with small fitness effects, combined with very slow divergence rate at sites with large fitness effects (Figure 4), leads to a substantially slower sequence and structural divergence during the second stage. Interestingly, as a consequence of this slowdown, for the past billion years there has not been a substantial decrease in sequence and structural similarity between ancient orthologs with the same molecular function. Further analyses of biochemical, biophysical and cellular constraints will reveal how various structural and functional properties influence proteins’ long-term evolution, and how protein functional efficiency may be compromised by deleterious mutations (Shendure and Akey, 2015).

Materials and methods

Key resources table

Reagent type (species) or resource	Designation	Source or reference	Identifiers	Additional information
Strain, strain background (Escherichia coli EcNR2)	MG1655, bla, bio-, λ-Red+, mutS-::cmR	PMID: 19633652	Addgene #26931
Sequence-based reagent	90 bp DNA oligos with phosphorothioated bases	This paper	See Supplementary file 4	100 nmole DNA Plate oligo, Integrated DNA Technologies
Commercial assay or kit	Miseq Reagent Kit V2	Illumina	MS-102–2002
Commercial assay or kit	sybr green	ThermoFisher	S7567
Commercial assay or kit	Qubit HS DNA kit	ThermoFisher	Q32854
Commercial assay or kit	Q5 Hot Start High- Fidelity Mastermix	NEB	M0494S
Commercial assay or kit	DNA clean and concentration kit 5	Zymo Research	D4013
Commercial assay or kit	illustra bacteria genomicPrep Mini Spin kit	GE life sciences	28904259
Commercial assay or kit	Agilent DNA 1000 kit	Agilent Genomics	5067–1504
Software, algorithm	SeqPrep v1.1	John St. John	https://github.com/jstjohn/SeqPrep
Software, algorithm	Bowtie2	PMID: 22388286
Software, algorithm	Perl scripts to count mutant reads	This paper	https://github.com/platyias/count-MAGE-seq (copy archived at https://github.com/elifesciences-publications/count-MAGE-seq).
Other	Turbidostat for growth competition assay	PMID: 23429717

Considered enzyme activities and corresponding protein orthologs

Request a detailed protocol

We selected for analysis the sequences annotated in UniProt (UniProt Consortium, 2015) with EC numbers associated with the following metabolic pathways (defined in the KEGG database) (Kanehisa et al., 2016): Glycolysis and gluconeogenesis, pentose phosphate pathway, TCA cycle, purine metabolism, pyrimidine metabolism. Using the protein sequences from 22 diverse organisms (Supplementary file 1) we constructed clusters of orthologous groups (COGs) using the EdgeSearch algorithm (Kristensen et al., 2010). Following previous studies, we considered any two proteins from different species in the same COG as orthologs (Tatusov et al., 1997). COGs were obtained using the COGsoft software (Kristensen et al., 2010), starting from an all-against-all psi-blast (Altschul et al., 1997) search, setting the database size at 10⁸, and using a maximum considered E-value of 0.1. To obtain the largest number of likely orthologs we did not apply a filter on low complexity or composition-based statistics. Only proteins sharing the same EC number and assigned to the same COG were compared, and only COGs with sequences in 10 or more of the 22 species were used.

In order to exclude proteins clearly showing evidence of Horizontal Gene Transfer (HGT), we constructed a maximum likelihood phylogenetic tree of the 12 prokaryotes considered in our analysis using a concatenated alignment of marker genes (Wu and Scott, 2012). The species tree was then manually compared to the individual trees of the prokaryotic sequences sharing the same molecular function within each COG; COG-specific trees were built using the GAMMA model of amino-acid substitution implemented in the RAxML software (Stamatakis, 2014). Proteins that showed clear differences in tree topologies, suggesting HGT, were excluded from further analysis. Ancient gene duplications, that is, duplications occurring prior to the divergence between considered species, often lead to cases in which enzymes in the same COG but from different species have diverged for longer than the corresponding species’ divergence times; thus, we did not consider COGs with tree topologies showing evidence of ancient gene duplications. Ancient gene duplications were defined as those occurring prior to the last common ancestor of 3 or more of the 22 species considered in the analysis.

The same procedure was used to select non-enzymatic COGs for analyses (Figure 1—figure supplement 5). However, in this case we only considered COGs for which none of the proteins were annotated in UniProt with metabolic EC numbers. Naturally, UniProt functional annotations for non-enzymes vary in terms of their source and format. Therefore, it is difficult to ascertain the degree of functional specificity and conservation between non-enzymatic orthologs. To address this, we manually checked that the molecular functions associated with proteins in the same COG were related, although we could not ascertain perfect conservation of molecular function.

Models of long-term protein sequence evolution

Request a detailed protocol

Global sequence identities for pairs of proteins annotated with the same molecular function in the same COG were calculated using pairwise alignments with ClustalW2 (Larkin et al., 2007). Sequence identity was computed as the number of identical sites at aligned positions, divided by the total number of aligned sites, excluding gaps. Divergence times between organisms were obtained from the TimeTree database (Hedges et al., 2006) (November, 2015) and used as a proxy for protein divergence times; in the analysis we used the mean divergence times across studies listed in the database. Divergence times between bacteria and archaea were set to 4 billion years based on current estimates for the occurrence time of their Last Common Ancestor (Sheridan et al., 2003; Battistuzzi et al., 2004) and existing evidence of an early origin of life on Earth (Bell et al., 2015). It is likely that ancient eukaryotic genes originated through episodic endosymbiotic gene transfer events and vertical inheritance from bacterial and archaeal genomes (Ku et al., 2015; Thiergart et al., 2012). Because of the discrete nature of such transfer events, the vast majority of individual prokaryotic-eukaryotic orthologous pairs are likely to have diverged from each other long before the origin of eukaryotes (1.8 billion years ago [Parfrey et al., 2011]); specifically, because most ancient prokaryotic species would not have transferred genes to eukaryotes. Thus, based on the median divergence time between the considered prokaryotes (~4 billion years, Supplementary file 1), divergence times between eukaryotes and prokaryotes were set in our analyses at 4 billion years. The results presented in the paper remain insensitive to the exact value of this divergence estimate (within the 3–4 billion year interval). Based on the recently proposed affiliation of eukaryotes and members from the Lokiarchaeota (Spang et al., 2015), divergence times between S. solfataricus and eukaryotes were set at 2.7 billion years, that is, the estimated age of the TACK superphylum (Guy and Ettema, 2011; Betts et al., 2018).

In order to study the long-term divergence patterns of orthologs, we only used COGs containing pairs of orthologs with at least five different divergence times distributed across 4 billion years. Sequence divergence data were fitted with models 1 to 3 using the least-squares minimization algorithm implemented in the MATLAB R2017a fitnlm function (The MathWorks, Inc, Natick, MA). The best fits of model 1 and model two were compared using the F-test. To test whether the conservation of molecular function limits protein sequence divergence, the minimum sequence identity parameter in model 2 (Y₀, from Equation 2) was compared, for each enzymatic activity, to the average global sequence identity between unrelated protein pairs using the Wald test.

To investigate the effect of the uncertainty of divergence times’ estimates, we repeated the analysis of the 64 enzymatic activities while randomly assigning either the maximum or minimum value of the divergence times between lineages reported in the TimeTree database. This analysis was performed for a total of 1000 independent assignment runs. Across the independent assignment runs, the expected long-term sequence identity between orthologs was higher than 25% for at least 90% of enzymes (based on model 2), and the projected sequence identity after 7.8 billion years was above 25% (based on model 3) for at least 75% of enzymes (Figure 1—figure supplement 7).

To assess the effect of computational functional annotations on the observed divergence results, we repeated the analysis using only sequences with experimentally validated molecular functions (Figure 1—figure supplement 3). To keep only sequences with validated molecular functions, we manually reviewed published references for enzyme annotations in the BRENDA database (Chang et al., 2015), and discarded any functional assignments that were based exclusively on computational or high-throughput studies. After filtering for the experimentally validated annotations, we only considered EC numbers corresponding to pairs of orthologs with at least four different divergence times distributed across 4 billion years.

Calculation of the divergence rate

View detailed protocol

Based on Model 3, we determined the divergence rate, that is, the rate of the decrease in sequence identity per time, at a given divergence time t by solving for the derivative of Equation 3 with respect to time:

\frac{d y}{d t} = \frac{d (100 * {(\frac{R_{0} * t}{α} + 1)}^{- α})}{d t} = - 100 * R_{0} {(\frac{R_{0} * t}{α} + 1)}^{- α - 1}

where y represents global sequence identity, t represents divergence time, R₀ represents the average substitution rate, and α represents the shape parameter of the gamma distribution.

Equivalency between model two and a poisson divergence model with allowed back substitutions

Request a detailed protocol

In the Jukes-Cantor model of nucleotide divergence (Tajima and Nei, 1984; Jukes and Cantor, 1969), the expected number of substitutions per site (δ) between two sequences after a divergence time t from a common ancestor is given by:

δ = - \frac{a - 1}{a} l n (1 - \frac{a}{a - 1} (1 - y))

where y is the proportion of identical sites and a is the number of allowed nucleotide types (usually 4). The same model can be applied to the divergence of protein sequences (Yang et al., 2000; Ota and Nei, 1994), by setting a to the number of allowed amino acid types per protein site. Furthermore, δ = 2λt, where λ represents the substitution rate per site per unit time, which is assumed to be equal across all sites. Substituting δ, and solving the above equation for y yields:

y = \frac{1}{a} + (1 - \frac{1}{a}) e x p (- \frac{2 λ a}{a - 1} t)

which is mathematically equivalent to model 2 Equation 2, with R₀= $\frac{2 λ a}{a - 1}$ , and Y₀ = $\frac{1}{a}$ . Thus, Y₀ can also be interpreted as the inverse of the average number of amino acids accepted per protein site during protein evolution.

FolA competition experiment in E. coli

Request a detailed protocol

To perform competition experiments we used the EcNR2 strain derived from E. coli K12 MG1655. Mutagenesis was performed using Multiplex Automated Genomic Engineering (MAGE), as previously described (Wang et al., 2009). 90 bp DNA oligomers were designed around each folA codon using the MG1655 wild type sequence as reference (Supplementary file 4). For each codon, all possible nucleotide variants were synthesized. To avoid simultaneous mutations of multiple codons, cells were transformed targeting ten consecutive codons at a time. After four rounds of electroporation, cells were recovered and pooled together at approximately the same concentration based on cell counts. Two competition growth experiments were carried out, one for each half of the protein. For the competition experiments, cells were grown in LB media in a turbidostat while maintaining constant volume and cell density. Samples were taken every 2 hr for a period of 16 hr, spun down, washed in PBS, spun down again and stored at −20°C until all samples were collected. For each competition, the corresponding FolA region was amplified through PCR while assigning a specific DNA barcode for each time point. PCR products were then pooled and paired-end sequenced using the MiSeq Reagent Kit two from Illumina. Sequence reads were deposited to the SRA database with accession number: SRP152339.

To determine, at each time point, the abundance of each mutant relative to wild type, we joined paired-end reads using SeqPrep (v 1.1) and aligned the joined reads to the folA gene sequence using Bowtie2 (Langmead and Salzberg, 2012). We then counted the number of reads per mutant using a custom script (Plata, 2018). Reads with more than a single mutated codon were discarded. Counts were median-normalized to control for noise due to mutagenesis performed in batches of 10 codons. At each time point we calculated the ratio R_t of mutant to wild type (WT) reads. In exponential growth, the growth rate difference between a given mutant and WT was calculated based on the slope of $\ln (R_{τ})$ as a function of time:

\ln (R_{t}) = (m_{i} - m_{w t}) * t + l n (R_{0})

where m_i and m_wt represent the mutant and WT growth rates, respectively. Growth rate differences were calculated only for mutants with at least five time points with 20 or more reads. Relative growth rates were calculated by dividing the slopes in the equation above by the number of e-fold increases given the average dilution rate of the turbidostat (1.37/h).

To calculate a single value characterizing the effect of all possible mutations at a protein site, we first averaged the relative growth rates of mutants resulting in the same amino acid change. We then calculated the average fitness effect of mutations at each protein site by averaging across 20 possible amino acids substitutions (Supplementary file 3).

To estimate the sensitivity of our results to sequencing errors, we calculated the average fitness effect of substitutions at each FolA site using the relative growth rates of mutant strains carrying only 32 mutated codons selected at random out of 64 possible codons. We observed a high correlation (Pearson’s r: 0.95, p-value<1e-20, Figure 4—figure supplement 2) between the average growth rate effects at each site calculated using two non-overlapping subsets of 32 codons. As expected, nonsense mutations and substitutions in the folA start codon had substantially stronger average effects on growth rates compared to other substitutions (26% versus 4% slower growth than WT, respectively. Mann Whitney U, p-value<10⁻²⁰). Also, the relative growth rates due to synonymous codon substitutions were usually very mild (0.2% higher growth compared to WT); 97% of synonymous substitutions had growth effects of less than 3%.

Contribution of different sites to the divergence limit

Request a detailed protocol

In order to identify phylogenetically independent pairs of species, we aligned the 16S rRNA gene sequences of bacterial species having orthologs annotated with the target 30 EC numbers (Figure 3—figure supplement 1). 16S rRNA sequences were obtained from the GreenGenes database (DeSantis et al., 2006) (October, 2016). We then built maximum likelihood phylogenetic trees based on the 16S alignments using RAxML (Stamatakis, 2014). Next, we used the Maximum Pairing Problem approach byArnold and Stadler (2010) to find the largest number of edge-disjoint pairs of species with 16S rRNA genetic distances corresponding to >2 billion years of divergence. Divergence times were estimated from the 16S genetic distances based on the linear regression of literature reported divergence times (Hedges et al., 2006) (Supplementary file 1). The F84 model of nucleotide substitution implemented in the phylip package (Felsenstein, 2005) was used to compute the genetic distances. Using the 16S alignment data, we calculated the probability that a protein site was identical across independent lineages. The probability was calculated as the fraction of orthologous pairs from phylogenetically independent species pairs with identical amino acids at the site. The amino acid identities at a given site were obtained based on the multiple sequence alignment of all orthologs associated with each EC number, obtained using ClustalW2 (Larkin et al., 2007). A similar procedure was applied to analyze FolA and InfA orthologs from the COG database (Figure 3c).

To investigate the divergence of sites with different fitness effects, we used sequences of FolA and InfA bacterial orthologs from the COG database (Galperin et al., 2015). The FolA orthologs annotated with the same EC number in UniProt (n = 311) and the InfA orthologs annotated with the same KEGG Orthology (KO) number in KEGG (n = 514) were used to build multiple sequence alignments with ClustalW2 (Larkin et al., 2007). Divergence times were estimated from the 16S genetic distances as described above. Within each divergence bin (Figure 4a,b), sequence identities of sites with different average fitness effects (represented by different colors in Figure 4a,b) were averaged across all pairs of orthologs at a given divergence time.

Analysis of global protein structural evolution

Request a detailed protocol

To study the divergence of protein structures as a function of time, we obtained PDB codes for all proteins associated with EC numbers in the BRENDA database (Chang et al., 2015). We then selected for the analysis species with experimentally solved enzyme structures for at least 10 different EC numbers. Psi-blast searches with a conservative E-value cutoff of 10⁻⁶ were used to identify orthologs (defined as bi-directional best hits) in the selected species. The 3D structures of orthologous pairs, annotated with the same EC number, were aligned using the TM-align program (Zhang and Skolnick, 2005) to obtain the C-alpha RMSD values. Pairs of proteins were not considered if more than 70% of the residues of the shortest protein could not be structurally aligned. We also removed from the analysis pairs of structures with flexibility between domains, as they could result in large RMSD values despite significant structural similarity. To identify such proteins we used the FATCAT (Ye and Godzik, 2003) software to perform flexible structural alignments of all structure pairs. We then filtered the structural pairs that were split into two or more domains by the FATCAT alignments.

Analysis of the enzyme active sites

Request a detailed protocol

To analyze divergence as a function of active site distance we used protein sequences associated with the 64 EC numbers and 22 species considered in Figure 1 and Figure 1—figure supplement 1. To that end, PDB (Berman et al., 2000) was searched for homologous sequences annotated with the same enzymatic activities and with known 3D structures. Annotations of active site residues for the corresponding structures were obtained from the Catalytic Site Atlas (Porter, 2004), PDB and UniProt-KB (UniProt Consortium, 2015). For each PDB structure with available active site information, protein sites were then stratified into different layers according to the distance between their alpha carbons and the centroid of the active site residues. Each pair of orthologs was then aligned using ClustalW2 (Larkin et al., 2007) with a homolog in PDB annotated with the same activity and with defined distance layers around the active site; the PDB sequence with the highest sequence identity to either member of the pair was used for the alignment. Sequence identities for different layers were calculated based on the structural positions in the corresponding PDB reference sequences.

Comparison of pairs of enzymes with the same structural folds

Request a detailed protocol

We used structural classifications of protein domains from the CATH database (v4.2.0) (Dawson et al., 2017). For structural comparisons, we only considered PDB structures with a single classified domain per chain. Protein pairs classified in CATH in the same homologous structural superfamily were considered as having the same fold. To obtain pairs of proteins in the same fold but with different functions, we only considered PDB structures annotated with different EC numbers in BRENDA. For this analysis we randomly selected 300 pairs of structures with the same fold that do not share any digits of their EC classification.

Data availability

Sequence reads were deposited to the SRA database with accession number: SRP152339. Source data files have been provided for figure 1, figure 5, figure 6 and figure 1 supplements 1, 3, 5 and 6.

The following data sets were generated

1. Konaté MM
2. Plata G
3. Park J
4. Usmanova DR
5. Wang HH
6. Vitkup D
(2018) SRA
ID SRP152339. Molecular function limits divergent protein evolution on planetary timescales.

https://www.ncbi.nlm.nih.gov/sra/SRP152339

The following previously published data sets were used

(2016) Supplementary material of: https://doi.org/10.1016/j.cels.2016.11.004
Data S1.

https://doi.org/10.1016/j.cels.2016.11.004

References

(2017) Optimization of lag phase shapes the evolution of a bacterial enzyme
Nature Ecology & Evolution 1:149.

https://doi.org/10.1038/s41559-017-0149
- PubMed
- Google Scholar
1. Altschul SF
2. Madden TL
3. Schäffer AA
4. Zhang J
5. Zhang Z
6. Miller W
7. Lipman DJ
(1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Research 25:3389–3402.

https://doi.org/10.1093/nar/25.17.3389
- PubMed
- Google Scholar
1. Arnold C
2. Stadler PF
(2010) Polynomial algorithms for the maximal pairing problem: efficient phylogenetic targeting on arbitrary trees
Algorithms for Molecular Biology 5:25.

https://doi.org/10.1186/1748-7188-5-25
- PubMed
- Google Scholar
1. Bairoch A
(1999) The ENZYME data bank in 1999
Nucleic Acids Research 27:310–311.

https://doi.org/10.1093/nar/27.1.310
- PubMed
- Google Scholar
1. Barreto FS
2. Burton RS
(2013) Evidence for compensatory evolution of ribosomal proteins in response to rapid divergence of mitochondrial rRNA
Molecular Biology and Evolution 30:310–314.

https://doi.org/10.1093/molbev/mss228
- PubMed
- Google Scholar
(2004) A genomic timescale of prokaryote evolution: insights into the origin of methanogenesis, Phototrophy, and the colonization of land
BMC Evolutionary Biology 4:44.

https://doi.org/10.1186/1471-2148-4-44
- PubMed
- Google Scholar
1. Bell EA
2. Boehnke P
3. Harrison TM
4. Mao WL
(2015) Potentially biogenic carbon preserved in a 4.1 billion-year-old zircon
PNAS 112:14518–14521.

https://doi.org/10.1073/pnas.1517557112
- PubMed
- Google Scholar
(1988) Insights into enzyme function from studies on mutants of dihydrofolate reductase
Science 239:1105–1110.

https://doi.org/10.1126/science.3125607
- PubMed
- Google Scholar
1. Berman HM
2. Westbrook J
3. Feng Z
4. Gilliland G
5. Bhat TN
6. Weissig H
7. Shindyalov IN
8. Bourne PE
(2000) The protein data bank
Nucleic Acids Research 28:235–242.

https://doi.org/10.1093/nar/28.1.235
- PubMed
- Google Scholar
1. Bershtein S
2. Serohijos AW
3. Bhattacharyya S
4. Manhart M
5. Choi JM
6. Mu W
7. Zhou J
8. Shakhnovich EI
(2015) Protein homeostasis imposes a barrier on functional integration of horizontally transferred genes in Bacteria
PLOS Genetics 11:e1005612.

https://doi.org/10.1371/journal.pgen.1005612
- PubMed
- Google Scholar
(2018) Integrated genomic and fossil evidence illuminates life's early evolution and eukaryote origin
Nature Ecology & Evolution 2:1556–1562.

https://doi.org/10.1038/s41559-018-0644-x
- PubMed
- Google Scholar
(1993) Convergent evolution of similar enzymatic function on different protein folds: the hexokinase, ribokinase, and galactokinase families of sugar kinases
Protein Science 2:31–40.

https://doi.org/10.1002/pro.5560020104
- PubMed
- Google Scholar
(2012) Epistasis as the primary factor in molecular evolution
Nature 490:535–538.

https://doi.org/10.1038/nature11510
- PubMed
- Google Scholar
(2017) Correlation of fitness landscapes from three orthologous TIM barrels originates from sequence and structure constraints
Nature Communications 8:14614.

https://doi.org/10.1038/ncomms14614
- PubMed
- Google Scholar
1. Chang A
2. Schomburg I
3. Placzek S
4. Jeske L
5. Ulbrich M
6. Xiao M
7. Sensen CW
8. Schomburg D
(2015) BRENDA in 2015: exciting developments in its 25th year of existence
Nucleic Acids Research 43:D439–D446.

https://doi.org/10.1093/nar/gku1068
- PubMed
- Google Scholar
1. Chothia C
2. Lesk AM
(1986) The relation between the divergence of sequence and structure in proteins
The EMBO Journal 5:823–826.

https://doi.org/10.1002/j.1460-2075.1986.tb04288.x
- Google Scholar
1. Conant GC
2. Wolfe KH
(2008) Turning a hobby into a job: how duplicated genes find new functions
Nature Reviews Genetics 9:938–950.

https://doi.org/10.1038/nrg2482
- PubMed
- Google Scholar
1. Dawson NL
2. Lewis TE
3. Das S
4. Lees JG
5. Lee D
6. Ashford P
7. Orengo CA
8. Sillitoe I
(2017) CATH: an expanded resource to predict protein function through structure and sequence
Nucleic Acids Research 45:D289–D295.

https://doi.org/10.1093/nar/gkw1098
- Google Scholar
1. DeSantis TZ
2. Hugenholtz P
3. Larsen N
4. Rojas M
5. Brodie EL
6. Keller K
7. Huber T
8. Dalevi D
9. Hu P
10. Andersen GL
(2006) Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB
Applied and Environmental Microbiology 72:5069–5072.

https://doi.org/10.1128/AEM.03006-05
- PubMed
- Google Scholar
1. Dickerson RE
(1971) The structures of cytochrome c and the rates of molecular evolution
Journal of Molecular Evolution 1:26–45.

https://doi.org/10.1007/BF01659392
- PubMed
- Google Scholar
(2002) Expanding protein universe and its origin from the biological big bang
PNAS 99:14132–14136.

https://doi.org/10.1073/pnas.202497999
- PubMed
- Google Scholar
Software
1. Felsenstein J
(2005) PHYLIP (Phylogeny inference package) , version 3.6
University of Washington, Seattle.

http://evolution.genetics.washington.edu/phylip.html
(2016) A comprehensive, High-Resolution map of a gene's Fitness Landscape
Molecular Biology and Evolution 33:1378.

https://doi.org/10.1093/molbev/msw021
- PubMed
- Google Scholar
(1998) Analogous enzymes: independent inventions in enzyme evolution
Genome Research 8:779–790.

https://doi.org/10.1101/gr.8.8.779
- PubMed
- Google Scholar
(2015) Expanded microbial genome coverage and improved protein family annotation in the COG database
Nucleic Acids Research 43:D261–D269.

https://doi.org/10.1093/nar/gku1223
- PubMed
- Google Scholar
(2017) The role of evolutionary selection in the dynamics of protein structure evolution
Biophysical Journal 112:1350–1365.

https://doi.org/10.1016/j.bpj.2017.02.029
- PubMed
- Google Scholar
1. Guy L
2. Ettema TJ
(2011) The archaeal 'TACK' superphylum and the origin of eukaryotes
Trends in Microbiology 19:580–587.

https://doi.org/10.1016/j.tim.2011.09.002
- PubMed
- Google Scholar
(2009) Protein sectors: evolutionary units of three-dimensional structure
Cell 138:774–786.

https://doi.org/10.1016/j.cell.2009.07.038
- PubMed
- Google Scholar
(2006) TimeTree: a public knowledge-base of divergence times among organisms
Bioinformatics 22:2971–2972.

https://doi.org/10.1093/bioinformatics/btl505
- PubMed
- Google Scholar
1. Hubble E
(1929) A relation between distance and radial velocity among extra-galactic nebulae
PNAS 15:168–173.

https://doi.org/10.1073/pnas.15.3.168
- PubMed
- Google Scholar
1. Jack BR
2. Meyer AG
3. Echave J
4. Wilke CO
(2016) Functional sites induce Long-Range evolutionary constraints in enzymes
PLOS Biology 14:e1002452.

https://doi.org/10.1371/journal.pbio.1002452
- PubMed
- Google Scholar
Book
1. Jukes TH
2. Cantor CR
(1969)
Evolution of Protein Molecules

In: Munro H. N, editors. Mammalian Protein Metabolism. Academic Press. pp. 21–132.
- Google Scholar
(2016) KEGG as a reference resource for gene and protein annotation
Nucleic Acids Research 44:D457–D462.

https://doi.org/10.1093/nar/gkv1070
- PubMed
- Google Scholar
1. Kelsic ED
2. Chung H
3. Cohen N
4. Park J
5. Wang HH
6. Kishony R
(2016) RNA structural determinants of optimal codons revealed by MAGE-Seq
Cell Systems 3:563–571.

https://doi.org/10.1016/j.cels.2016.11.004
- PubMed
- Google Scholar
1. Kristensen DM
2. Kannan L
3. Coleman MK
4. Wolf YI
5. Sorokin A
6. Koonin EV
7. Mushegian A
(2010) A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches
Bioinformatics 26:1481–1487.

https://doi.org/10.1093/bioinformatics/btq229
- PubMed
- Google Scholar
1. Ku C
2. Nelson-Sathi S
3. Roettger M
4. Sousa FL
5. Lockhart PJ
6. Bryant D
7. Hazkani-Covo E
8. McInerney JO
9. Landan G
10. Martin WF
(2015) Endosymbiotic origin and differential loss of eukaryotic genes
Nature 524:427–432.

https://doi.org/10.1038/nature14963
- PubMed
- Google Scholar
1. Langmead B
2. Salzberg SL
(2012) Fast gapped-read alignment with bowtie 2
Nature Methods 9:357–359.

https://doi.org/10.1038/nmeth.1923
- PubMed
- Google Scholar
1. Larkin MA
2. Blackshields G
3. Brown NP
4. Chenna R
5. McGettigan PA
6. McWilliam H
7. Valentin F
8. Wallace IM
9. Wilm A
10. Lopez R
11. Thompson JD
12. Gibson TJ
13. Higgins DG
(2007) Clustal W and clustal X version 2.0
Bioinformatics 23:2947–2948.

https://doi.org/10.1093/bioinformatics/btm404
- PubMed
- Google Scholar
1. Lee D
2. Redfern O
3. Orengo C
(2007) Predicting protein function from sequence and structure
Nature Reviews Molecular Cell Biology 8:995–1005.

https://doi.org/10.1038/nrm2281
- PubMed
- Google Scholar
Book
(2013)
Lehninger Principles of Biochemistry (6th Edition)

New York: W.H. Freeman.
- Google Scholar
(2010) Pervasive cryptic epistasis in molecular evolution
PLOS Genetics 6:e1001162.

https://doi.org/10.1371/journal.pgen.1001162
- PubMed
- Google Scholar
(1986) Evidence against use of bacterial amino acid sequence data for construction of all-inclusive phylogenetic trees
PNAS 83:217–220.

https://doi.org/10.1073/pnas.83.2.217
- Google Scholar
(1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures
Journal of Molecular Biology 247:536–540.

https://doi.org/10.1016/S0022-2836(05)80134-2
- PubMed
- Google Scholar
(2014) Swansong biospheres II: the final signs of life on terrestrial planets near the end of their habitable lifetimes
International Journal of Astrobiology 13:229–243.

https://doi.org/10.1017/S1473550413000426
- Google Scholar
(2010) Non-homologous isofunctional enzymes: A systematic analysis of alternative solutions in enzyme evolution
Biology Direct 5:31.

https://doi.org/10.1186/1745-6150-5-31
- Google Scholar
1. Ota T
2. Nei M
(1994) Estimation of the number of amino acid substitutions per site when the substitution rate varies among sites
Journal of Molecular Evolution 38:642–643.

https://doi.org/10.1007/BF00175885
- Google Scholar
1. Parfrey LW
2. Lahr DJ
3. Knoll AH
4. Katz LA
(2011) Estimating the timing of early eukaryotic diversification with multigene molecular clocks
PNAS 108:13624–13629.

https://doi.org/10.1073/pnas.1110633108
- PubMed
- Google Scholar
(2014) Type II restriction endonucleases--a historical perspective and more
Nucleic Acids Research 42:7489–7527.

https://doi.org/10.1093/nar/gku447
- PubMed
- Google Scholar
Software
1. Plata G
(2018) Count-MAGE-Seq, version 4f827f6
GitHub.

https://github.com/platyias/count-MAGE-seq
1. Porter CT
(2004) The catalytic site atlas: a resource of catalytic sites and residues identified in enzymes using structural data
Nucleic Acids Research 32:129D–133.

https://doi.org/10.1093/nar/gkh028
- Google Scholar
1. Povolotskaya IS
2. Kondrashov FA
(2010) Sequence space and the ongoing expansion of the protein universe
Nature 465:922–926.

https://doi.org/10.1038/nature09105
- PubMed
- Google Scholar
1. Rost B
(1997) Protein structures sustain evolutionary drift
Folding and Design 2:S19–S24.

https://doi.org/10.1016/S1359-0278(97)00059-X
- PubMed
- Google Scholar
1. Rost B
(2002) Enzyme function less conserved than anticipated
Journal of Molecular Biology 318:595–608.

https://doi.org/10.1016/S0022-2836(02)00016-5
- PubMed
- Google Scholar
1. Shendure J
2. Akey JM
(2015) The origins, determinants, and consequences of human mutations
Science 349:1478–1483.

https://doi.org/10.1126/science.aaa9119
- PubMed
- Google Scholar
(2003) Estimated minimal divergence times of the major bacterial and archaeal phyla
Geomicrobiology Journal 20:1–14.

https://doi.org/10.1080/01490450303891
- Google Scholar
1. Spang A
2. Saw JH
3. Jørgensen SL
4. Zaremba-Niedzwiedzka K
5. Martijn J
6. Lind AE
7. van Eijk R
8. Schleper C
9. Guy L
10. Ettema TJG
(2015) Complex archaea that bridge the gap between prokaryotes and eukaryotes
Nature 521:173–179.

https://doi.org/10.1038/nature14447
- PubMed
- Google Scholar
1. Stamatakis A
(2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies
Bioinformatics 30:1312–1313.

https://doi.org/10.1093/bioinformatics/btu033
- PubMed
- Google Scholar
1. Tajima F
2. Nei M
(1984) Estimation of evolutionary distance between nucleotide sequences
Molecular Biology and Evolution 1:269–285.

https://doi.org/10.1093/oxfordjournals.molbev.a040317
- PubMed
- Google Scholar
(1997) A genomic perspective on protein families
Science 278:631–637.

https://doi.org/10.1126/science.278.5338.631
- PubMed
- Google Scholar
1. Thiergart T
2. Landan G
3. Schenk M
4. Dagan T
5. Martin WF
(2012) An evolutionary network of genes present in the eukaryote common ancestor polls genomes on eukaryotic and mitochondrial origin
Genome Biology and Evolution 4:466–485.

https://doi.org/10.1093/gbe/evs018
- PubMed
- Google Scholar
1. Tian W
2. Skolnick J
(2003) How well is enzyme function conserved as a function of pairwise sequence identity?
Journal of Molecular Biology 333:863–882.

https://doi.org/10.1016/j.jmb.2003.08.057
- PubMed
- Google Scholar
1. UniProt Consortium
(2015) UniProt: a hub for protein information
Nucleic Acids Research 43:D204–D212.

https://doi.org/10.1093/nar/gku989
- PubMed
- Google Scholar
1. Wang HH
2. Isaacs FJ
3. Carr PA
4. Sun ZZ
5. Xu G
6. Forest CR
7. Church GM
(2009) Programming cells by multiplex genome engineering and accelerated evolution
Nature 460:894–898.

https://doi.org/10.1038/nature08187
- PubMed
- Google Scholar
(2005) Predicting protein function from sequence and structural data
Current Opinion in Structural Biology 15:275–284.

https://doi.org/10.1016/j.sbi.2005.04.003
- PubMed
- Google Scholar
(2000) Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores
Journal of Molecular Biology 297:233–249.

https://doi.org/10.1006/jmbi.2000.3550
- PubMed
- Google Scholar
(2009) Structural and functional constraints in the evolution of protein families
Nature Reviews Molecular Cell Biology 10:709–720.

https://doi.org/10.1038/nrm2762
- PubMed
- Google Scholar
1. Wu M
2. Scott AJ
(2012) Phylogenomic analysis of bacterial and archaeal sequences with AMPHORA2
Bioinformatics 28:1033–1034.

https://doi.org/10.1093/bioinformatics/bts079
- Google Scholar
(2000)
Codon-substitution models for heterogeneous selection pressure at amino acid sites

Genetics 155:431–449.
- PubMed
- Google Scholar
Book
1. Yang Z
(2006)
Computational Molecular Evolution

Oxford: Oxford University Press.
- Google Scholar
1. Ye Y
2. Godzik A
(2003) Flexible structure alignment by chaining aligned fragment pairs allowing twists
Bioinformatics 19 Suppl 2:ii246–ii255.

https://doi.org/10.1093/bioinformatics/btg1086
- PubMed
- Google Scholar
1. Zhang Y
2. Skolnick J
(2005) TM-align: a protein structure alignment algorithm based on the TM-score
Nucleic Acids Research 33:2302–2309.

https://doi.org/10.1093/nar/gki524
- PubMed
- Google Scholar
Book
1. Zuckerkandl E
2. Pauling L
(1965)
Evolutionary Divergence and Convergence in Proteins

In: Bryson V, Vogel H. J, editors. Evolving Genes and Proteins. Academic Press. pp. 97–166.
- Google Scholar

Article and author information

Author details

Mariam M Konaté
1. Department of Systems Biology, Columbia University, New York, United States
2. Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, United States
Contribution
Data curation, Software, Formal analysis, Investigation, Methodology, Writing—original draft

Contributed equally with
Germán Plata

Competing interests
No competing interests declared
Germán Plata

Department of Systems Biology, Columbia University, New York, United States

Contribution
Data curation, Software, Formal analysis, Investigation, Methodology, Writing—original draft, Writing—review and editing

Contributed equally with
Mariam M Konaté

For correspondence
gap2118@cumc.columbia.edu

Competing interests
No competing interests declared

"This ORCID iD identifies the author of this article:" 0000-0002-6470-7748
Jimin Park
1. Department of Systems Biology, Columbia University, New York, United States
2. Department of Pathology and Cell Biology, Columbia University, New York, United States
Contribution
Investigation, Methodology, Writing—review and editing

Competing interests
No competing interests declared
Dinara R Usmanova

Department of Systems Biology, Columbia University, New York, United States

Contribution
Formal analysis, Investigation, Methodology, Writing—review and editing

Competing interests
No competing interests declared

"This ORCID iD identifies the author of this article:" 0000-0001-5031-0013
Harris Wang
1. Department of Systems Biology, Columbia University, New York, United States
2. Department of Pathology and Cell Biology, Columbia University, New York, United States
Contribution
Supervision, Methodology, Project administration, Writing—review and editing

Competing interests
No competing interests declared

"This ORCID iD identifies the author of this article:" 0000-0003-2164-4318
Dennis Vitkup
1. Department of Systems Biology, Columbia University, New York, United States
2. Department of Biomedical Informatics, Columbia University, New York, United States
Contribution
Conceptualization, Supervision, Funding acquisition, Methodology, Writing—original draft, Project administration, Writing—review and editing

For correspondence
dv2121@cumc.columbia.edu

Competing interests
No competing interests declared

"This ORCID iD identifies the author of this article:" 0000-0003-4259-8162

Funding

National Institute of General Medical Sciences (R01)

Dennis Vitkup

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We sincerely thank Dan Tawfik, Eugene Koonin, and Fyodor Kondrashov for very helpful discussions. This work was supported in part by the National Institute of General Medical Sciences grants GM079759 and R35GM13188 to DV.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.