The seemingly limitless diversity of proteins in nature arose from only a few thousand domain prototypes, but the origin of these themselves has remained unclear. We are pursuing the hypothesis that they arose by fusion and accretion from an ancestral set of peptides active as co-factors in RNA-dependent replication and catalysis. Should this be true, contemporary domains may still contain vestiges of such peptides, which could be reconstructed by a comparative approach in the same way in which ancient vocabularies have been reconstructed by the comparative study of modern languages. To test this, we compared domains representative of known folds and identified 40 fragments whose similarity is indicative of common descent, yet which occur in domains currently not thought to be homologous. These fragments are widespread in the most ancient folds and enriched for iron-sulfur- and nucleic acid-binding. We propose that they represent the observable remnants of a primordial RNA-peptide world.https://doi.org/10.7554/eLife.09410.001
Life as we know it today is largely the result of the chemical activity of proteins. Much research suggests that the ancestors for most modern proteins were already present in the ‘Last Universal Common Ancestor’, a theoretical ancient organism from which all life on earth descended and which lived around 3.5 billion years ago.
Today, related versions of these ancestral proteins are found in organisms as different as bacteria, humans and plants. While they seem highly diverse, these proteins were all assembled from only a few thousand modular units, termed domains. However, it is not clear how the first domains emerged.
Previously, in 2001 and 2003, researchers hypothesized that the first protein domains arose by joining and swapping short lengths of proteins called peptides that had emerged before there were living cells on earth – a time that is often called the “RNA world”. Now, Alva et al. – including the researchers involved in the 2003 work – have attempted to detect remnants of these ancient peptides in modern proteins.
Alva et al. first compared modern proteins in a way that is similar to how linguists have compared modern languages to reconstruct ancient vocabularies. This revealed 40 fragments that occur in seemingly unrelated proteins, but are very similar in their sequence and structure. These fragments are commonly found in what are likely the oldest observable proteins, and are involved in the activities that are most fundamental to life (for example, binding to DNA and RNA). This led Alva et al. to propose that these fragments represent the observable remnants of a primordial “RNA-peptide world”.
The hypothesis that proteins evolved from peptides provides a number of predictions that can be tested in experiments. These fragments open avenues to explore in the laboratory the origin of modern proteins and to build new proteins not seen in nature.https://doi.org/10.7554/eLife.09410.002
The origin of most present-day proteins can be attributed to combinatorial shuffling and differentiation events involving a basic set of domain prototypes, which act as the unit of protein evolution (Anantharaman et al., 2001; Apic et al., 2001; Ponting and Russell, 2002; Orengo and Thornton, 2005). Many of these domains can be traced back to the time of the Last Universal Common Ancestor (LUCA) (Kyrpides et al., 1999; Koonin, 2003; Ranea et al., 2006), a hypothetical primordial organism from which all life on earth is thought to have descended approximately 3.5 billion years ago (Glansdorff et al., 2008). The origin of domains themselves, however, is poorly understood. An origin de novo, by random concatenation of amino acids, appears impossible due to the high sequence complexity and low folding yield of polypeptides, as well as to the absence of abiotic processes that could produce peptide chains of more than 5–10 residues. The abiotic scenario would also leave open the fundamental question as to how the information contained in successful polypeptides could have been passed on.
Many lines of evidence, including the identification of local sequence and structure similarity within domains of different fold (Brennan and Matthews, 1989; Doherty et al., 1996; Copley et al., 2001; Grishin, 2001b; Friedberg and Godzik, 2005; Alva et al., 2007; Andreeva et al., 2007), or the frequent construction of domains by repetition of subdomain-sized fragments (McLachlan, 1987; Andrade et al., 2001; Hocker et al., 2002; Chaudhuri et al., 2008; Remmert et al., 2010), show that domains might not constitute the only evolutionary unit of protein structure. These observations led to the proposal that the first folded domains arose by repetition, fusion, recombination, and accretion from an ancestral set of peptides (Fetrow and Godzik, 1998; Lupas et al., 2001; Soding and Lupas, 2003) that emerged in the RNA world (Gilbert, 1986), in which RNA served both as carrier of genetic information and catalyst of metabolic reactions (Jeffares et al., 1998). According to this model, the local similarities found in modern proteins represent the observable remnants of such peptides. In the RNA world, which is widely thought to have been an important intermediate stage in the origin of cellular life, simple peptides may have been recruited by RNA to expand its functional repertoire. The catalytic range of RNA molecules is restricted (Joyce, 2002) and peptides are good chelators of metals and small molecules. Peptides are also beneficial for RNA thermostability and folding specificity, and for the formation of oligomeric complexes. For these reasons, peptides of initially abiotic origin may have been co-opted as cofactors. In time, selective pressures on availability, interaction specificity, and functional effectiveness would have driven the emergence of longer, RNA-encoded and -produced peptides, optimized for the formation of secondary structure by exclusion of water with the RNA scaffold. It is known that many peptides formed of the 20 proteinogenic amino acids have an intrinsic affinity for RNA and form secondary structures upon binding (Patel, 1999; Das and Frankel, 2003). By repetition, fusion, recombination, and accretion, these preoptimized peptides would have reached a level of complexity enabling them to exclude water by hydrophobic contacts, making them independent of the RNA scaffold. In this theory, protein folding would have been an emergent property of peptide-RNA coevolution.
Given the striking conservation of many proteins in sequence and structure across evolutionary time, we conjectured that if this hypothesis is true, we might be able to see remnants of these primordial peptides in modern proteins. To this end, we decided to take a comparative approach based on the systematic analysis of present-day domains, similar to the approach taken by linguists to reconstruct ancient vocabularies by comparing modern languages. The comparative studies of languages and of proteins, in fact, exhibit many parallels because of their shared linear nature of information storage, the high conservation of evolutionary modules, and the similarity of evolutionary constraints acting upon them (Gray and Atkinson, 2003; Pagel et al., 2007; Searls, 2013; List et al., 2014) (Figure 1A). Today, for instance, it is known that hundreds of words in European languages contain the conserved Semitic root qnw (*qanaw-) meaning ‘reed’ (Huehnergard, 2011), despite it having diversified into a wide range of functional forms by the same processes as already familiar from biological evolution (e.g., orthology, paralogy, horizontal transfer).
To reconstruct the ‘vocabulary’ of ancient peptides, we aimed at finding local similarities in sequence and structure within globally different folds, which are presently thought to have arisen independently, by convergent evolution. Since the events that led to the emergence of domains took place before the time of LUCA, modern domains might only retain weak signals thereof in their sequences. As protein structures diverge much more slowly than their sequences, structural similarity is often used for identifying such distant events. However, similar structures may have arisen convergently, owing to the limited number of conformations available to a folded polypeptide chain, particularly at the supersecondary structure level (Salem et al., 1999; Kim et al., 2009; Fernandez-Fuentes et al., 2010). Consequently, structure similarity alone does not provide conclusive evidence of common ancestry. In contrast, the combinatorial sequence space is enormous and many sequences are compatible with a particular local structure, so that sequence convergence is rare. Thus sequence similarity is a more reliable marker for common ancestry (Soding, 2005; Kim et al., 2009). We have therefore employed sequence similarity, as evaluated by the comparison of profile hidden Markov models (HMMs) (Soding et al., 2005), as a first criterion for inferring common ancestry of domains in this study. Due to the large evolutionary divergence of sequences, we used structural similarity as a second criterion to confirm the potential homology relationships.
To implement this comparative approach between domains of different fold, we needed a reference database for the assignment of domains to fold types. For this we used the SCOPe (Murzin et al., 1995; Fox et al., 2014) database (release 2.03), which is a point of reference in the classification of protein folds. In this database, the first two classification levels (family and superfamily) capture homologous relationships, while the grouping of structurally similar superfamilies into the same fold reflects convergent evolution, i.e. analogy. In order to reduce the very large background of obvious homologous matches, we filtered SCOPe to a maximum of 30% sequence identity (SCOPe30). At this level, many relationships considered homologous by SCOPe are removed, whereas representatives for all families and superfamilies are still retained. As detailed in the ‘Materials and methods’, we then compared the resulting domain set in sequence space using HHsearch with stringent settings, and subsequently in structure space using TM-align (Zhang and Skolnick, 2005). We plotted the obtained scores separately for comparisons within families (presumed homologous relationships; Figure 2A) and between folds (presumed analogous relationships; Figure 2B). The expected score distributions would have been at the top, right (high HHsearch probabilities, high TM-scores) for homologous relationships and bottom, left (low HHsearch probabilities, low TM-scores) for analogous ones. In fact, the distributions we obtained were bimodal, with a higher incidence of scores at both the top, right and bottom, left of the plots. The presence of some low-scoring relationships in the homologous set (Figure 2A) did not appear too surprising, given the considerable evolutionary distance of some relationships captured in SCOPe, but the presence of some high-scoring relationships in the analogous set (Figure 2B) did, as it suggested the presence of hitherto unrecognized homologous relationships. We decided to explore these further, using cut-offs at HHsearch probabilities of 70% (corresponding to P-values < 5e-05) and TM-scores of 0.5. At these threshold values, about a fourth of presumed homologous relationships in SCOPe30 and >99.95% of matches between folds, which are presumed to be analogous, are omitted. This provided a substantial margin of safety in evaluating the high-scoring relationships between domains presumed to be analogous (Figure 2B).
Since the comparisons made in the analogous set were between domains of different fold, the high-scoring matches always involved subdomain-sized fragments. Those fragments that satisfied the cut-offs and a number of additional criteria, as detailed in the ‘Materials and methods’, were clustered together automatically if they overlapped by at least 80% of their length. Following upon this step, clusters were inspected individually and merged further where appropriate, as described in the ‘Materials and methods’. We obtained 65 clusters, of which 20 relied on a global similarity of the folds and were omitted from further consideration. Such cases of globally similar folds that are nevertheless classified as different are mainly due to homologous fold change events, such as strand invasion, hairpin swapping, or circular permutation (Grishin, 2001a; Andreeva and Murzin, 2006; Alva et al., 2008). Thus, for instance, SCOPe superfamilies d.12.1 and d.58.7 are clearly related globally by circular permutation, despite being classified into different folds. A further 5 groups were due to artifacts in domain boundaries assigned by SCOPe and were also omitted. For example, the d.79.4.1 family makes connections to the a.5.10.1 family, as its sequence in SCOPe encompasses the latter.
This yielded a final set of 40 clusters, within which all fragments were inspected individually, superimposed, and trimmed to a consensus length where appropriate (Figure 3, Figure 3—source data 1; Table 1). As examples, we show the different embodiments of the β-hammerhead motif, found in 4 folds comprising 8 superfamilies (Figure 1B, Figure 3: fragment 12), and the nucleic acid-binding helix-hairpin-helix motif (Doherty et al., 1996; Shao and Grishin, 2000), found in 8 folds comprising 15 superfamilies (Figure 3: fragment 2, Figure 4). The median length of the 40 consensus fragments is 24 residues, with the shortest fragment comprising 9 and the longest 38 residues. This is in accord with the expectation that ancient peptides were simple and subdomain-sized. Of these fragments, only about half correspond to typical supersecondary structure elements, such as α-hairpins, β-hairpins, β-meanders, and βαβ-elements (Salem et al., 1999), whereas the others are unusual fragments with odd shapes, which frequently do not form compact structures. This supports the notion that they predate the emergence of hydrophobic contacts as a driving force for protein folding and that their open structures reflect the association with an RNA scaffold, as still seen in ribosomal proteins today (Soding and Lupas, 2003).
To evaluate whether the sequence similarity exhibited by these fragments could be the result of biophysical constraints, rather than common descent, we searched SCOPe30 with each fragment for structurally similar matches: 36 of the 40 had at least one match to another superfamily with a TM-score ≥ 0.5, but undetectable sequence similarity (HHsearch probability <10%). Indeed, of these 36 fragments, 34 had more than half of their structure matches to fragments with undetectable sequence similarity. This shows that essentially every one of the structures we detected can be formed by fundamentally different sequences. To verify this anecdotal observation in a systematic way, we analyzed the relationship between sequence and structure similarity in our fragments by comparison with a reference set of 40 of the most frequent supersecondary structure elements from the Smotifs library (Fernandez-Fuentes et al., 2010) (assembled as described in the ‘Materials and methods’). We reasoned that a correlation should be detectable in homologs, as these would have started ancestrally with identical sequences and structures, before gradually diverging towards a baseline of similarity. There should be no correlation however in analogs, unless structural constraints had limited the number of residues allowed at specific positions, causing a convergence of the sequences (Remmert et al., 2010; Kopec and Lupas, 2013). We therefore computed sequence similarity scores using structure-based sequence alignments for each fragment in the two sets against all domains in SCOPe30. For fragments in our set, we considered matches to other superfamilies in which we had detected the respective fragment (listed in the Figure 3—source data 1) as homologous and to all other folds as analogous (Figure 5A). For the Smotifs fragments, we considered matches to the same superfamily as homologous and to all other folds as analogous (Figure 5B). For both fragment sets, the presumed homologous matches show a strong correlation between sequence and structure similarity. The correlation for the Smotifs reference set is expected, as the homologs there follow the generally accepted criteria laid out in the SCOPe classification. Observing a nearly equivalent correlation for our fragment set therefore underscores our inference of homology. Note that, although the correlation is slightly weaker in our fragments than in the Smotifs, this is likely due to the fact that comparisons in our set are made between superfamilies, whereas in the Smotifs they are made within superfamilies, i.e. across smaller evolutionary distances. In contrast, the correlation between sequence and structure similarity in the presumed analogous matches is very weak for both our fragments and the Smotifs. This shows that structurally induced sequence convergence is very low and the fragments can be formed by a broad range of different sequences. Biophysical constraints can therefore not explain the sequence similarity exhibited by our fragments.
We are of course aware that molecular homology cannot be proven rigorously by scientific standards and that the boundaries of what constitutes evidence of common descent evolve continuously as extrapolation from increasingly distant connections are found to yield useful structural and functional predictions. Due to the conservative nature of our sequence comparisons and the range of additional criteria we applied in order to eliminate potentially spurious matches, we expect that we did not extend these boundaries substantially beyond what is already established in the field. Given the depth of analysis in many structural studies, we therefore surmised that at least some of our fragments should have been noted previously. In fact, 40% (16) have been described individually before and connected to deep evolutionary events (Figure 3, indicated by a dot; Table 1), showing that we are moving within the boundaries of established sequence-structure analyses. To our knowledge, our fragments comprise all but one previously reported case. We failed to detect the ASP-box (Figure 3—figure supplement 1: B1), found in six different folds (Copley et al., 2001), owing to the stringency of our search criteria (if the HHsearch probability cut-off is lowered to 60%, this fragment is included, along with four others). Examples of previously described fragments include the dinucleotide-binding β-α-β motif (Buehner et al., 1973; Rossmann et al., 1974; Wierenga et al., 1986; Dym and Eisenberg, 2001) (Figure 3: 8), the KH motif of type I and type II KH domains (Grishin, 2001b) (Figure 3: 6), the helix-hairpin-helix motif of nucleic acid-binding domains (Doherty et al., 1996; Shao and Grishin, 2000) (Figure 3: 2, Figure 4), the EF-Tu-binding α-hairpin of elongation factor EF-Ts and ribosomal protein L7/12 (Wieden et al., 2001) (Figure 3: 24), and the P-loop of PEP carboxykinases and P-loop NTPases (Walker et al., 1982; Matte et al., 1996) (Figure 3: 16). For many of these fragments, our systematic approach, the growth of structure databases, and improved sequence comparison methods allowed us to identify new instances of occurrence. For example, another occurrence of the P-loop is in the catalytic domain of MurD-like peptide ligases (SCOP c.72.2), where it is involved in binding the phosphate group of a mononucleotide as well.
If our starting assumption of ancestral peptides in the context of an RNA world is correct, some of the most basic functions would be nucleic-acid binding and catalysis. We would therefore expect these properties to be enriched in our set. Indeed, we find that about a third of our fragments make contact with nucleic acids (13), based on evidence from known structures (fragments with a yellow background in Figure 3; Table 1). For comparison, the Smotifs reference set yielded only 2 nucleic-acid binders, even though it covers about three times as many superfamilies and folds as our fragment set (see ‘Materials and methods’). Our nucleic acid-binding fragments include four of the six most highly represented fragments in our dataset, particularly the helix-turn-helix motif (Sauer et al., 1982; Pabo and Sauer, 1984; Brennan and Matthews, 1989; Suzuki and Brenner, 1995; Aravind et al., 2005) (Figure 3: 1- found in 14 folds and 20 superfamilies, abbreviated in the following as 14, 20) and the helix-hairpin-helix motif (Figure 3: 2- 8, 15; Figure 4). A special case of nucleic-acid interaction is provided by ribosomal proteins, which contain 8 of our 13 nucleic acid-binding fragments (indicated by red font color in Figure 3; Table 1). Ribosomal proteins are likely to represent the oldest proteins observable today and, for the most part, still require the RNA scaffold to become structured, providing a window into the time when protein domains were being established (Hsiao et al., 2009). For comparison, of the nine folds proposed to be the most ancient (Caetano-Anolles et al., 2007), based on the comparative analysis of proteomes from diverse branches of life, six encompass at least one of our 40 fragments (Table 1).
The second basic function peptides would plausibly have had in the RNA world would have been catalytic, as coordinators of metals, iron-sulfur clusters, nucleotides, and nucleotide-derived cofactors (coenzyme A, NAD(P), FAD), all of which are thought to already have played an essential role in prebiotic metabolism (White, 1976; Wachtershauser, 1992). Here, however, an enrichment relative to the Smotifs reference set is less clearly apparent, being primarily seen for iron-sulfur clusters which are absent in the reference set. Seven of our fragments coordinate metal ions and iron-sulfur clusters (red background in Figure 3; Table 1), e.g., the cytochrome-heme-attachment motif (Figure 3: 17) (Mathews et al., 1985) and the 4Fe-4S coordinating peptide (Figure 3: 18) (Lupas et al., 2001), and five bind nucleotides and nucleotide-derived cofactors (blue background in Figure 3; Table 1). Particularly two of these fragments, the P-loop motif (Figure 3: 16) and the dinucleotide-binding β-α-β motif (Figure 3: 8), are at the core of some of the largest enzyme superfamilies in nature and play a central role throughout metabolic processes. We note that in all these fragments, the contribution of the peptides is the coordination of catalytic cofactors and not the provision of catalytic residues per se, in accordance with a primordial role as cofactors of RNA-driven catalysis. Thus, for example, the role of the P-loop in nucleotidases is to coordinate the nucleotide, the mechanism for its hydrolysis having evolved independently in different lineages of this superfamily.
When we initiated this study, we believed that assembly from non-identical fragments may have been one of the primary forces in the evolution of domains (Lupas et al., 2001; Soding and Lupas, 2003), and we expected to find many examples demonstrating it. However, we did not find even one domain that contained two or more different fragments from our set. Our fragments either form their folds by repetition or in single copy, decorated by heterologous structural elements. At present, we find the reasons for the lack of fragment combinations unclear, but we note that many of our fragments might represent dominant cores that guide the folding of the remainder of the polypeptide chain (Religa et al., 2007) and would, as such, not be generally compatible with each other. In the few cases where they would be sufficiently compatible to produce an initial fold capable of entering biological selection, they would be under considerable pressure to adapt to the new structural environment. This, in most cases, might lead to the retention of only one dominant fragment, the other(s) adjusting by rapid divergence, making them undetectable by our methods. For an initial exploration of this possibility, we asked whether any fold in SCOPe30 could be found to contain two or more of our fragments if we relaxed the sequence similarity requirement for all but one of them. We analyzed in detail the largest family from each superfamily containing one of our fragments (188 in all) for such combinations of 'significant' and 'non-significant' fragments. As long as we retained any sequence similarity cutoff for the 'non-significant' fragments, even as low as HHsearch probabilities of 10%, no combinations could be found. If we however removed the sequence similarity requirement entirely for the 'non-significant' fragments, asking only for structural similarity (TM-score ≥ 0.5), over 50% of the families showed fragment combinations. In Figure 6, we have collected examples, in which the fold is substantially formed by the combination of a 'significant' with a 'non-significant' fragment, showing that the incompatibility we observe for our fragments is not of a geometrical nature.
While we were unable to detect fragment combinations, repetition is wide-spread, seen for 14 of our 40 fragments (35%) (indicated by a dotted box in Figure 3; Table 1). Nine of these are also capable of forming folds in single copy, with additional decoration. For example, the TPR element (Figure 3: 28) is found in multiple copies in domains belonging to the TPR-like superfamily (a.118.8) (D'Andrea and Regan, 2003), but only in single copy in other folds that contain it (Figure 7A). Also, the β-hammerhead motif (Figure 1B, Figure 3: 12), which resembles a hammerhead, is duplicated in the barrel-sandwich hybrid fold (b.84), but occurs in single copy in the α/β-hammerhead fold (d.41) and in two other folds (e.29 and f.46). Two further fragments also occur in variable numbers per fold, but never in single copy. These are the outer membrane β-hairpin (Figure 3: 25) (Remmert et al., 2010), which forms β-barrels of between four and twelve hairpins (Figure 7B), and the four-stranded β-meander (Figure 3: 13) (Chaudhuri et al., 2008; Kopec and Lupas, 2013), which forms propeller-like toroids containing between four and twelve copies. The remaining two fragments always form folds by duplication (or homo-oligomerization). For example, the cradle-loop barrels, whose evolution we have studied in detail (Coles et al., 1999; 2005; 2006; 2007; 2008), encompass a broad range of topological variants, which are however all built from two copies of a β-α-β fragment (Figure 3: 7, Figure 7C).
Our findings thus indicate that repetition and accretion must have been the key forces in the emergence of domains. The importance of repetition has been pointed out by many earlier studies. Starting in the 1970's Andrew McLachlan charted out in more than 20 publications the origin of many proteins by repetition (McLachlan, 1972; 1987), in some cases clearly of subdomain-sized fragments (Blundell et al., 1979; McLachlan, 1980; 1980). Aided by computational tools to detect repeats in sequence (Heger and Holm, 2000; Szklarczyk and Heringa, 2004; Biegert and Soding, 2008) and structure (Kim et al., 2010; Myers-Turnbull et al., 2014), analyses of fold space have shown the high incidence of repetitive folds, amounting to as many as one fifth of all folds (Andrade et al., 2001; Balaji, 2015; Forrest, 2015). These include some of the most frequent folds, such as ferredoxins (Eck and Dayhoff, 1966), immunoglobulins (Huang and Xiao, 2007), β-propellers (Murzin, 1992) and TIM-barrels (Soding et al., 2006); indeed, of the 10 most populated folds in SCOPe, 6 (including the top 5), have repetitive structures and all of them have members in which the repetition is also detectable at the sequence level. In most cases, the repetitive structure of these folds has been interpreted as evidence for their origin by amplification of a subdomain-sized fragment and in some cases, such as for β-propellers (Yadid et al., 2010; Voet et al., 2014), β-trefoils (Lee and Blaber, 2011; Broom et al., 2012), and TIM-barrels (Bharat et al., 2008; Richter et al., 2010), this process has also been explored by protein engineering. Only a subset of these fragments that form folds by repetition are present in our set, since we require them to occur in at least two folds to consider them antecedent to folded proteins, but many are currently only detectable in one fold.
Our fragments are spread across a total of 130 folds and 188 superfamilies in SCOPe. Given that the current version of SCOPe comprises 1194 folds and 1961 superfamilies, they could be considered to cover only a small of fraction. However, of the 25 most populated folds in SCOPe, which comprise about 25% of all superfamilies, 14 contain one of the fragments and 7 of the top 10 do. We conclude that, far from being anecdotal, our fragments are indeed widespread in today’s domains.
The 40 fragments we describe clearly represent a lower bound, given the stringent significance cut-offs in our study, the as yet incomplete fold assignment for known structures, and the fact that some ancestral fragments may have survived only in a single fold and are thus invisible to our method. Indeed, our goal in this study was not completeness, given the extensive overlap in statistical scores that can be expected between the least likely false negatives and the most likely true positives. Rather, it was to assemble a comprehensive set of confident positives in order to show that a signal of homology that predates folding as a widespread property is still detectable in proteins today.
How many fragments may we be missing? With respect to the significance cut-offs we find, for example, that relaxing the sequence similarity requirement to a probability of 60% leads to the inclusion of 5 further fragments, some with substantial arguments in their favor, such as the aforementioned ASP-box (Copley et al., 2001) (Figure 3—figure supplement 1: B1, Figure 3—source data 1). Also, the availability of more sequences and structures, particularly for proteins from poorly sampled branches of life, should allow to bridge more widely separated folds by increasing the significance of their matches. For the fragments that occur in a single fold, we note that because of the importance of repetition, it might be possible to reconstruct further fragments by specifically analyzing repetitive folds. Certainly some of the chemical activities provided by these folds appear sufficiently ancient to justify the expectation that they were covered early in the genesis of folded proteins (Farias-Rico et al., 2014; Lupas et al., 2015).
For these reasons we think that the number of ancestral fragments that could be reconstructed at a satisfactory level of confidence may well approach 100 in the next decades. Nevertheless, the set of 40 fragments we have described here should cover both the most frequent and best conserved ancestral peptides, and support a viable theory for the emergence of folded proteins.
To assemble domains representative of all known fold types, we chose the SCOPe database (release 2.03) (Fox et al., 2014) and filtered it to a maximum of 30% sequence identity, obtaining 9452 domains. Multiple alignments were built for each of these domains using the buildali.pl script (with default parameters) from the HHsearch package (Soding, 2005). This script uses PSI-BLAST (Altschul et al., 1997) and contains heuristics to reduce the inclusion of nonhomologous sequence segments at the ends of PSI-BLAST sequence matches, the leading cause of high-scoring false positive matches. We used PSI-BLAST, rather than the more sensitive HHblits (Remmert et al., 2012), because in our experience the sensitivity of HHblits leads it to occasionally assign elevated probabilities to analogous matches. Profile HMMs were calculated from the alignments using hhmake, also from the HHsearch package, and subjected to pairwise comparisons with HHsearch. Comparisons were thus always made with the full domains, never with fragments thereof. We used default settings, but switched off secondary structure scoring (option ssm 0) in order to reduce the likelihood that matches were scored highly because of a chance similarity of their (predicted) secondary structures. The HHsearch probabilities we obtained are therefore conservative with respect to the ones obtained for example from the HHpred server in default settings (Soding et al., 2005), which is the most frequent source of HHsearch-based deep homology analyses in publications today. We only considered reciprocal matches (of the form domain A matches domain B and domain B matches domain A), and assigned these the average of the two obtained HHsearch probabilities. The structures of the aligned segments of SCOPe30 domains were subsequently compared using TM-align (Zhang and Skolnick, 2005). We filtered out all matches in which the aligned segment involved only a single secondary structure element.
To establish cut-offs for the comparison of domains of different fold, we plotted comparisons of domains within families (presumed homologs) (Figure 2A). The plot shows a bimodal distribution, with the highest representations at the top and bottom. At HHsearch probabilities of ≥ 70% and TM-scores of ≥ 0.5, about a fourth of all homologous relationships in SCOPe30 are filtered out. We used these cut-offs for the comparison of domains of different fold, as it gives us a substantial margin of safety in the interpretation of relationships presumed analogous by SCOPe.
We gathered all matches between domains of different fold with an HHsearch probability of ≥ 70% and a TM-score of ≥ 0.5. Next, we eliminated all matches between larger superfamilies that nevertheless only made a single connection. This was done in order to minimize the number of potential false positives. We then applied single-linkage clustering to this resulting set of matches. Fragments from matches between folds were pooled together if they overlapped by at least 80% of their length, i.e. domains A, B, and C were combined together if domain A matched domain B and domain B matched domain C such that the boundaries in B overlapped by at least 80%. The resulting clusters were analyzed interactively and merged together if their similarity was based on the presence of a shared sequence- and structure-similar fragment. This yielded a final set of 64 potentially interesting clusters. We generated sequence and structure alignments for the fragments that formed the basis of these clusters and assigned boundaries by manual inspection.
To obtain a reference set of fragments for our study, we assembled the 40 most frequent supersecondary structure motifs seen in proteins, according to the Smotifs database (Fernandez-Fuentes et al., 2010). This is an exhaustive library of geometrically defined local supersecondary structure motifs, composed of two consecutive secondary structures connected by a loop. In this library, motifs of varying lengths, but with similar geometry and secondary structures, are grouped together into clusters, 2296 in total. The Smotifs library is available as a MySQL database; however, the frequency of occurrence of the motif clusters cannot be directly calculated from this database. We were therefore kindly provided with the frequency table by Andras Fiser (Albert Einstein College of Medicine, New York). Using this table, we randomly selected one representative each for the 50 most frequent motif clusters; we required the representative to comprise 24 residues, the median length of our fragments, with at least 6 residues in each secondary structure element and a connecting loop of at most 6 residues. Next, we inspected these representatives and picked the 40 most frequent fragments after eliminating non-compact ones. Finally, for each of these 40 fragments, we picked a prototype in the SCOPe30 database by performing structure searches using TM-align and by selecting the match with the best TM-score. These 40 fragments formed our reference set.
To evaluate whether the sequence similarity shown by our fragments could be the result of structural convergence, rather than origin from a common ancestor, we searched the SCOPe30 database with each fragment from our set and the Smotifs reference set using TM-align at a cut-off of 0.5. Since our 40 fragments are structurally very similar in all their respective embodiments, we randomly picked one representative of each type. To calculate a sequence-versus-structure plot for our set and the Smotifs set, we followed the methodology described by us in previous studies (Remmert et al., 2010; Kopec and Lupas, 2013). We used TM-align to compare the Smotifs set with domains from within the respective superfamily (presumed homologous set) and with the background set that comprised all other folds (presumed analogous set) to search for structurally similar fragments. For our fragments, we compared each representative fragment with domains of all other superfamilies in which we had detected it (homologous) and with all other folds (analogous). In both cases, the TM-score was normalized based on the length of the query motifs. For each structure match, we calculated the profile-profile sequence similarity score with HHalign from the HHsearch package (Soding, 2005), however based on the fixed structural alignment obtained from TM-align. The HHalign score was normalized based on the number of aligned residues. Next, each pair of structure and sequence scores was plotted in a scatter plot. To calculate the correlation between TM- and HHalign-scores, we assumed a linear dependency between them and performed linear regression using SciPy (Jones et al., 2001). We performed a t-test to determine whether the slope of the regression line differs significantly from zero; we chose a significance level of 1e-10.
First, we searched the SCOPe30 database for occurrences of each of the 40 reference Smotifs using TM-align at a TM-score cut-off of 0.5 and length coverage of 100%. This yielded occurrences in 323 folds and 447 superfamilies. For comparison, our 40 fragments occur in 130 folds and 188 superfamilies. Following this step, for all occurrences of our fragments and the Smotifs fragments in SCOPe30, we searched for potential interactions with nucleic acids, metals, iron-sulfur clusters, nucleotides, and nucleotide-derived cofactors. A fragment was deemed to interact with these molecules if it made at least three inter-atomic contacts to them within a distance cut-off of 3Å. In addition, we included one of our fragments, the β-α-β motif seen in cradle-loop barrels (Alva et al., 2008) (Figure 3: 7) into the nucleic acid binding set, even though only a high-resolution model of its interaction with DNA is currently available, based on NMR data (Zorzini et al., 2015). Even discounting this addition, the 12 other nucleic acid-binding fragments of our set substantially exceed the two nucleic acid binders found in the Smotifs set, even though these cover about three times as many superfamilies and folds.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programsNucleic Acids Research 25:3389–3402.https://doi.org/10.1093/nar/25.17.3389
Cradle-loop barrels and the concept of metafolds in protein classification by natural descentCurrent Opinion in Structural Biology 18:358–365.https://doi.org/10.1016/j.sbi.2008.02.006
Regulatory potential, phyletic distribution and evolution of ancient, intracellular small-molecule-binding domainsJournal of Molecular Biology 307:1271–1292.https://doi.org/10.1006/jmbi.2001.4508
Protein repeats: structures, functions, and evolutionJournal of Structural Biology 134:117–131.https://doi.org/10.1006/jsbi.2001.4392
Evolution of protein fold in the presence of functional constraintsCurrent Opinion in Structural Biology 16:399–408.https://doi.org/10.1016/j.sbi.2006.04.003
SISYPHUS--structural alignments for proteins with non-trivial relationshipsNucleic Acids Research 35:D253–D259.https://doi.org/10.1093/nar/gkl746
Domain combinations in archaeal, eubacterial and eukaryotic proteomesJournal of Molecular Biology 310:311–325.https://doi.org/10.1006/jmbi.2001.4776
The many faces of the helix-turn-helix domain: transcription regulation and beyondFEMS Microbiology Reviews 29:231–262.https://doi.org/10.1016/j.fmrre.2004.12.008
Internal symmetry in protein structures: prevalence, functional relevance and evolutionCurrent Opinion in Structural Biology 32:156–166.https://doi.org/10.1016/j.sbi.2015.05.004
A beta alpha-barrel built by the combination of fragments from different foldsProceedings of the National Academy of Sciences of the United States of America 105:9942–9947.https://doi.org/10.1073/pnas.0802202105
The helix-turn-helix DNA binding motifThe Journal of Biological Chemistry 264:1903–1906.
D-glyceraldehyde-3-phosphate dehydrogenase: three-dimensional structure and evolutionary significanceProceedings of the National Academy of Sciences of the United States of America 70:3052–3054.https://doi.org/10.1073/pnas.70.11.3052
The origin of modern metabolic networks inferred from phylogenomic analysis of protein architectureProceedings of the National Academy of Sciences of the United States of America 104:9358–9363.https://doi.org/10.1073/pnas.0701214104
Sequence-structure analysis of FAD-containing proteinsProtein Science 10:1712–1728.https://doi.org/10.1110/ps.12801
Evolutionary relationship of two ancient protein superfoldsNature Chemical Biology 10:710–715.https://doi.org/10.1038/nchembio.1579
Function driven protein evolution. a possible proto-protein for the RNA-binding proteinsPacific Symposium on Biocomputing pp. 485–496.
Structural symmetry in membrane proteins*Annual Review of Biophysics 44:311–337.https://doi.org/10.1146/annurev-biophys-051013-023008
Peeling the onion: ribosomes are ancient molecular fossilsMolecular Biology and Evolution 26:2415–2425.https://doi.org/10.1093/molbev/msp163
American Heritage Dictionary of the English Language2066–2069, Proto-Semitic Language and Culture, American Heritage Dictionary of the English Language, Fifth edition, Houghton Mifflin Harcourt, Boston/New York.
SciPy, Open Source Scientific Tools for PythonSciPy, Open Source Scientific Tools for Python.
HorA web server to infer homology between proteins using sequence and structural similarityNucleic Acids Research 37:W532–W538.https://doi.org/10.1093/nar/gkp328
Comparative genomics, minimal gene-sets and the last universal common ancestorNature Reviews Microbiology 1:127–136.https://doi.org/10.1038/nrmicro751
Universal protein families and the functional content of the last universal common ancestorJournal of Molecular Evolution 49:413–423.https://doi.org/10.1007/PL00006564
Experimental support for the evolution of symmetric protein architecture from a simple peptide motifProceedings of the National Academy of Sciences of the United States of America 108:126–130.https://doi.org/10.1073/pnas.1015032108
Inactivation of rabbit pulmonary cytochrome p-450 in microsomes and isolated perfused lungs by the suicide substrate 1-aminobenzotriazoleThe Journal of Pharmacology and Experimental Therapeutics 235:186–190.
Structural repeats and evolution of tobacco mosaic virus coat protein and RNAJournal of Molecular Biology 136:203–224.https://doi.org/10.1016/0022-2836(80)90372-1
Repeating sequences and gene duplication in proteinsJournal of Molecular Biology 64:417–437.https://doi.org/10.1016/0022-2836(72)90508-6
Gene duplication and the origin of repetitive protein structuresCold Spring Harbor Symposia on Quantitative Biology 52:411–420.https://doi.org/10.1101/SQB.1987.052.01.048
SCOP: a structural classification of proteins database for the investigation of sequences and structuresJournal of Molecular Biology 247:536–540.https://doi.org/10.1016/S0022-2836(05)80134-2
Systematic detection of internal symmetry in proteins using CE-symmJournal of Molecular Biology 426:2255–2268.https://doi.org/10.1016/j.jmb.2014.03.010
Protein families and their evolution—a structural perspectiveAnnual Review of Biochemistry 74:867–900.https://doi.org/10.1146/annurev.biochem.74.082803.133029
Protein-DNA recognitionAnnual Review of Biochemistry 53:293–321.https://doi.org/10.1146/annurev.bi.53.070184.001453
Adaptive recognition in RNA complexes with peptides and protein modulesCurrent Opinion in Structural Biology 9:74–87.https://doi.org/10.1016/S0959-440X(99)80010-4
The natural history of protein domainsAnnual Review of Biophysics and Biomolecular Structure 31:45–71.https://doi.org/10.1146/annurev.biophys.31.082901.134314
Protein superfamily evolution and the last universal common ancestor (lUCA)Journal of Molecular Evolution 63:513–525.https://doi.org/10.1007/s00239-005-0289-7
The helix turn helix motif as an ultrafast independently folding domain: the pathway of folding of engrailed homeodomainProceedings of the National Academy of Sciences of the United States of America 104:9272–9277.https://doi.org/10.1073/pnas.0703434104
Evolution of outer membrane beta-barrels from an ancestral beta beta hairpinMolecular Biology and Evolution 27:1348–1358.https://doi.org/10.1093/molbev/msq017
Correlation of observed fold frequency with the occurrence of local structural motifsJournal of Molecular Biology 287:969–981.https://doi.org/10.1006/jmbi.1999.2642
The HHpred interactive server for protein homology detection and structure predictionNucleic Acids Research 33:W244–W248.https://doi.org/10.1093/nar/gki408
HHrep: de novo protein repeat detection and the origin of TIM barrelsNucleic Acids Research 34:W137–W142.https://doi.org/10.1093/nar/gkl130
Classification of multi-helical DNA-binding domains and application to predict the DBD structures of sigma factor, LysR, OmpR/PhoB, CENP-b, rapl, and Xy1S/Ada/AraCFEBS Letters 372:215–221.
Tracking repeats using significance and transitivityBioinformatics 20:i311–i317.https://doi.org/10.1093/bioinformatics/bth911
Computational design of a self-assembling symmetrical β-propeller proteinProceedings of the National Academy of Sciences of the United States of America 111:15102–15107.https://doi.org/10.1073/pnas.1412768111
Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding foldThe EMBO Journal 1:945–951.
Coenzymes as fossils of an earlier metabolic stateJournal of Molecular Evolution 7:101–104.https://doi.org/10.1007/BF01732468
A common structural motif in elongation factor ts and ribosomal protein L7/12 may be involved in the interaction with elongation factor tuJournal of Molecular Evolution 52:129–136.
Prediction of the occurrence of the ADP-binding beta alpha beta-fold in proteins, using an amino acid sequence fingerprintJournal of Molecular Biology 187:101–107.
Groundworks for an evolutionary biochemistry: the iron-sulphur worldProgress in Biophysics and Molecular Biology 58:85–201.https://doi.org/10.1016/0079-6107(92)90022-X
Metamorphic proteins mediate evolutionary transitions of structureProceedings of the National Academy of Sciences of the United States of America 107:7287–7292.https://doi.org/10.1073/pnas.0912616107
TM-align: a protein structure alignment algorithm based on the TM-scoreNucleic Acids Research 33:2302–2309.https://doi.org/10.1093/nar/gki524
Escherichia coli antitoxin MazE as transcription factor: insights into MazE-DNA bindingNucleic Acids Research 43:1241–1256.https://doi.org/10.1093/nar/gku1352
John KuriyanReviewing Editor; Howard Hughes Medical Institute, University of California, Berkeley, United States
eLife posts the editorial decision letter and author response on a selection of the published articles (subject to the approval of the authors). An edited version of the letter sent to the authors after peer review is shown, indicating the substantive concerns or comments; minor concerns are not usually shown. Reviewers have the opportunity to discuss the decision before the letter is sent (see review process). Similarly, the author response typically shows only responses to the major concerns raised by the reviewers.
Thank you for submitting your work entitled "A vocabulary of ancient peptides at the origin of folded proteins" for peer review at eLife. Your submission has been favorably evaluated by John Kuriyan (Senior editor) and three reviewers.
The following individual responsible for the peer review of your submission has agreed to reveal his identity: Rob Russell (peer reviewer).
The reviewers have discussed the reviews with one another and the Reviewing editor has drafted this decision to help you prepare a revised submission.
In this most interesting study, Lupas and colleagues hypothesize that modern protein domains arose from ancestral peptides. This is not a new idea; however, the current work is a systematic study of many proteins, while previous work focused on individual fold families. The authors have systematically revisited the original "antecedent domain segment" notion for protein structure, published some 15 years ago, by now using a systematic interrogation of structure and sequence. The authors identify 40 short 3D-structural motifs (they call them "fragments") that are most likely to be some of the ancestral primordial peptides that gave rise to the present-day protein world. These are subdomain-size segments found by sequence profile similarity and validated by 3D structure similarity. They typically lack a well-defined hydrophobic core and are complemented to form complete domains in several different and likely evolutionarily independent ways.
This work stems from a hypothesis that proteins originated by fusion of ancestral peptides. These peptides likely functioned in complex with RNA. The crux of the authors' argument is that subdomain sequence similarity between different domains indicates deep homology for the subdomain. Because the containing domains are not homologous, they argue that the homology at the subdomain level indicates an ancestral peptide shared by the domains. They further argue that this sequence similarity does not arise from structural constraints on sequence, as similar subdomain structures with apparently unrelated sequences can be found. The authors' arguments are mostly clear. They also do not overstate their conclusion that these peptides are ancestral, which necessarily remains somewhat speculative.
This study is made possible by the development of the HHsearch algorithm to find statistically supported similarity between sequence profiles. While some of the HHsearch hits represent false positives, most are predictive of 3D structure, as has been validated in many CASP experiments. It is indeed most likely that if the HHsearch sequence-based alignment is nearly the same as the 3D structure-based alignment then the two segments are homologous. This is the main assumption of this work.
Some of these fragments have been detected before and studied, e.g. in the SISYPHUS database and various publications (e.g. about HHH motifs), but this manuscript provides the first principled and comprehensive approach to find them all and therefore deserves attention. Interestingly, no example of multiple different peptides in any fold was found, which goes slightly against the original idea, though multiplication/repetition of fragments seems definitely to have been a theme in the evolution of the folds.
In summary, the elaboration presented in this paper of the idea that short sub-structure motifs might have arisen through primordial interactions with RNA will be of general interest to the readership of eLife, and the paper is potentially suitable for publication provided that the authors can address the following issues raised by the referees.
Important issues to address:
1) There is concern about the argument that the sequence correlation between these motifs must arise from history rather than convergence. This argument – the core argument in the paper – hinges on the idea that the sequence space compatible with the subdomain motif is too vast to lead to chance convergence. The authors do not, however, provide any estimate for the number of sequences compatible with a given subdomain motif. Their median fragment length is 24 residues. How many different amino acids can be tolerated at each site in the fragment? With what probability? If only a few amino acids can be tolerated at each position, the space associated with a 24 residue fragment could be quite small, potentially leading to a spurious signal of homology.
The reviewers wonder whether it would be feasible to compute more of a significance of the overall observation (i.e. considering whether such a distribution of sequence similarities might be expected by chance) than just what seems to be a rather lenient sequence similarity measure. The reviewers are concerned that the authors support the lack of convergence with anecdotal rather than statistical analysis.
As an aside, the reviewers recognize that a compelling point is that these ancient peptide candidates are associated with ancient functions, and perhaps this isn't emphasized enough in the paper.
2) One thing that is missing is more discussion of the fragments that were common, but not significant in terms of sequence. For instance, the point about there being no single fold containing two fragments leads to the question of just how many there were that satisfied only stringent structural criteria. The reviewers realise that this is considerable work, but wonder if the authors have these data lying around anyway. For example, of the most popular folds, how many have two common fragments or more, even if HMMsearch doesn't give a significance? This would lend some insight into the really ancient, no-longer-detectable, but still related relationships.
3) Some discussion of some old favorites is also missing. The two halves of the Immunoglobulin fold or Rossmann fold, etc. – that is the cases that most structurally-obsessed readers will have in their minds, could also, if fitting, do with some discussion in the same light.
Other issues to address:
1) The authors can reference more relevant publications (e.g. SISYPHUS and those dedicated to individual motifs, like HHH).
2) The description of your methods to establish the statistical cutoffs for sequence and structural similarity took multiple reads to understand (both in the text, Results and discussion, and in the graphic and legend of Figure 2). Highlighting the fragments on Figure 2 might help.
3) "domains might not constitute the evolutionary unit of protein structure" might better be stated as "domains might not constitute the only evolutionary unit of protein structure"https://doi.org/10.7554/eLife.09410.013
- Vikram Alva
- Johannes Söding
- Andrei N Lupas
- Johannes Söding
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
This work would not have been possible without Rob Russell (University of Heidelberg) and Chris Ponting (University of Oxford), with whom we framed the original hypothesis for the origin of folded proteins from subdomain-sized peptides. We thank Dek Woolfson (University of Bristol), Nick Grishin (University of Texas Southwestern Medical Center), Kris Brown (GlaxoSmithKline), and members of the Department of Protein Evolution (MPI for Developmental Biology, Tübingen) for many discussions over the years. We are grateful to Andras Fiser (Albert Einstein College of Medicine) for sharing his current Smotifs database prior to publication. Nir Ben-Tal, Philip Bourne, Janusz Bujinicki, Stanislaw Dunin-Horkawicz, Nick Grishin, Rachel Kolodny, Chris Ponting, Rob Russell, Ceslovas Venclovas, Dek Woolfson, and the reviewers of this article are gratefully acknowledged for their comments and suggestions. This work was supported by institutional funds of the Max Planck Society. JS acknowledges support by the German Federal Ministry of Education and Research (BMBF) within the framework of e:Med (grant e:AtheroSysMed, 01ZX1313A-2014) and e:bio (grant SysCore).
- John Kuriyan, Reviewing Editor, Howard Hughes Medical Institute, University of California, Berkeley, United States
© 2015, Alva et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.