Abstract
Evolutionary studies in population genetics and ecology were mainly focused on predicting and understanding past evolutionary events. Recently, however, a growing trend explores the prediction of evolutionary trajectories toward the future promoted by its wide variety of applications. In this context, we introduce a forecasting protein evolution method that integrates birth-death population models with substitution models that consider selection on protein folding stability. In contrast to traditional population genetics methods that usually make the unrealistic assumption of simulating molecular evolution separately from the evolutionary history, the present method combines both processes to simultaneously model forward-in-time birth-death evolutionary trajectories and protein evolution under structurally constrained substitution models that outperformed traditional empirical substitution models. We implemented the method into a freely available computer framework. We evaluated the accuracy of the predictions with several monitored viral proteins of broad interest. Overall, the method showed acceptable errors in predicting the folding stability of the forecasted protein variants but, expectedly, the errors grew up in the prediction of the corresponding sequences. We conclude that forecasting protein evolution is feasible in certain evolutionary scenarios and provide suggestions to enhance its accuracy by improving the underlying models of evolution.
Introduction
Molecular evolution is traditionally investigated through inferences about past evolutionary events, such as phylogenetic tree and ancestral sequence reconstructions, and predictions about the future were considered as inaccessible for a long time because they can be affected by complex processes such as environmental change. Nevertheless, a variety of biological systems display a Darwinian evolutionary process where selection operates toward a limited set of adapted variants. These variants, and in extension the evolutionary trajectories to reach them, would be positively selected and could present a certain degree of predictability (Lässig et al, 2017; Wortel et al, 2023). The progress made in developing more accurate models of evolution (Arenas, 2015b) and the benefits from predicting the outcome of evolution (i.e., to understand the course of evolution or to prepare for the future (Wortel et al., 2023)) motivated a variety of investigations on forecasting evolution in diverse fields including medicine, agriculture, biotechnology, and conservation biology, among others (e.g., Barton et al, 2016; Bull & Molineux, 2008; de Visser et al, 2018; Diaz-Uriarte & Vasallo, 2019; Fischer et al, 2015; Gerrish & Sniegowski, 2012; Lassig & Luksza, 2014; Lind et al, 2019; Luksza & Lassig, 2014; Morris et al, 2018; Munck et al, 2014; Neher et al, 2014; Wortel et al., 2023).
Unfortunately, forecasting evolution not always is achievable. Under neutral evolution, all the molecular variants are equally likely to be present in the population, showing lack of repeatability and disallowing accurate prediction of future variants. Thus, forecasting evolution requires a system with measurable selection pressures, and where certain positively selected variants could produce more descendants than other variants and expand in the population (Desai & Fisher, 2007; Goyal et al, 2012; Neher & Hallatschek, 2013; Neher et al., 2014). Actually, a rougher fitness landscape resulting from selection can lead to greater accuracy in evolutionary predictions (Papkou et al, 2023; Rubin et al, 2023; Van Cleve & Weissman, 2015). Overall, an evolutionary process could be predictable to some extent (prediction errors are inevitable with any method and in any evolutionary scenario) depending on the strength of selection driving evolution and the heterogeneity in fitness among different variants.
Here, we focus on forecasting protein evolution because it involves molecular evolutionary processes driven by selection pressures and where the fitness of each variant can be parameterized and predicted (Carneiro & Hartl, 2010; Gilson et al, 2017). Traditionally, evolutionary histories and ancestral sequences of proteins are inferred using probabilistic methods based on advanced substitution models of protein evolution (e.g., Arenas, 2015b; Arenas & Bastolla, 2020; Ferreiro et al, 2022; Malcom et al, 1990; Moreira et al, 2023; Stackhouse et al, 1990; Thornton et al, 2003; Ugalde et al, 2004). The accuracy of these inferences is affected by the accuracy of the applied substitution model, where substitution models that better fit with the study data usually produce more accurate phylogenetic trees and ancestral sequences (Arenas & Bastolla, 2020; Del Amparo & Arenas, 2022a, 2023; Lemmon & Moriarty, 2004). These findings suggest that accurate substitution models of evolution are also convenient for forecasting protein evolution. In this regard, a variety of studies showed that structurally constrained substitution (SCS) models of protein evolution provide more accurate evolutionary inferences than the traditional empirical substitution models of protein evolution, in terms of phylogenetic likelihood, distribution of amino acid frequencies among protein sites, rates of molecular evolution and folding stability of reconstructed proteins, among other aspects (e.g., Arenas & Bastolla, 2020; Arenas et al, 2015; Bastolla et al, 2006; Bordner & Mittelmann, 2013; Del Amparo et al, 2023; Echave & Wilke, 2017; Ferreiro et al, 2024a; Ferreiro et al, 2024b; Fornasari et al, 2002; Parisi & Echave, 2001; Pascual-Garcia et al, 2019; Rodrigue et al, 2005), although SCS models usually demand more computational resources than substitution models that only include information from the protein sequence. Notice that the protein structure provides information about the location and molecular interactions of amino acids at different protein sites, which could be far from each other in the sequence but close in the three-dimensional structure and interact affecting their evolution (Morcos et al, 2011; Ruiz-Gonzalez & Fares, 2013). Indeed, selection from the protein folding stability is relevant in the evolution of multiple proteins, including those in microbial and viral systems (e.g., Ferreiro et al., 2022; Gong et al, 2013; Jacquier et al, 2013; Rodrigues et al, 2016; Wylie & Shakhnovich, 2011; Zeldovich et al, 2007). Therefore, we believe that it should be taken into account for forecasting protein evolution in such systems.
Predictions about future evolutionary events can be performed with simulation-based methods (e.g., Eccleston et al, 2023; Neher et al., 2014; Yoshida et al, 2023). In order to simulate molecular evolution, traditional population genetics methods apply two separate steps (Arenas, 2012; Hoban et al, 2012). First, the simulation of the evolutionary history (i.e., a phylogenetic tree) using approaches such as the coalescent and birth-death population processes (Gernhard, 2008; Hudson, 1990; Kingman, 1982; Stadler, 2010). Afterward, the forward-in-time simulation of molecular evolution is performed, from the root node to the tip nodes, upon the previously simulated evolutionary history (Yang, 2006). This methodology was implemented into a variety of population genetics frameworks that simulate molecular evolution (Arenas, 2012; Hoban et al., 2012). However, for technical and computational simplicity, it assumes that the simulation of the evolutionary history is independent from the simulation of molecular evolution, which can produce biological incoherences (i.e., the evolutionary history is usually simulated under neutral evolution while molecular evolution is usually simulated with substitution models that consider selection). To enhance the realism of this modeling, here we merged both processes into a single one where evolutionary history influences molecular evolution and vice versa. In particular, we adopted a birth-death population genetics method to simulate the forward-in-time evolutionary history (already used for forecasting evolution (Lassig & Luksza, 2014; Neher et al., 2014)), taking into account the fitness of the molecular variant (through evolutionary constraints from the protein folding stability (Bastolla & Demetrius, 2005; Gong et al., 2013; Liberles et al, 2012; Zeldovich et al., 2007)) at the corresponding node, to determine its subsequent birth or death event, and we integrated this process with SCS models to model protein evolution along the derived phylogenetic branches. The method is detailed below, and we implemented it into a new version of our computer framework ProteinEvolver (Arenas et al, 2013), which is freely available from https://github.com/MiguelArenas/proteinevolver. ProteinEvolver2 includes detailed documentation and a variety of ready-to-use examples. Next, considering the potential applications of forecasting evolution to design vaccines and therapies against pathogens, we evaluated and applied the method to forecasting protein evolution in several real protein data of viruses monitored over time.
Materials and methods
A method for forecasting protein evolution by combining birth-death population genetics with structurally constrained substitution models of protein evolution
Following previous methods for forecasting evolution based on simulations (Lassig & Luksza, 2014; Neher et al., 2014), we developed a method to simulate the forward-in-time evolutionary history of a protein sample with a birth-death process that considers the fitness of the protein variant (based on folding stability) at every temporal node. The method derives the birth and death rates for a protein variant based on the fitness of that variant, where a high fitness results in a high birth rate and a low death rate, which can lead to a large number of descendants, and the opposite leading to a few or none (extinction) descendants. Thus, the fitness of the molecular variant at every node drives its corresponding forward in time birth-death evolutionary history. The details of this simulation process are outlined below.
First, similarly to common simulators of molecular evolution (Arenas, 2012; Hoban et al., 2012; Yang, 2006), a given protein sequence and structure (hereafter, protein variant) is assigned to the root node. The fitness (f) of the protein variant (A) is calculated from its folding stability (free energy, ΔG) following the Boltzmann distribution (Goldstein, 2013) (I, which takes values from 0 to 1),
The user can alternatively choose whether the fitness of the protein variant is based only on its folding stability or whether it is based on similarity with the stability of a real protein variant (i.e., a protein structure from the Protein Data Bank, PDB). We believe that the latter can be convenient since, in nature, a high folding stability does not necessarily indicate a high fitness but a stability close to real stability may suggest a proper fitness because the real stability is the result of a selection process. If the fitness is derived from only the folding stability of the protein variant, the birth rate (b) is considered equal to the fitness. Alternatively, if the fitness is determined based on the similarity in folding stability between the modeled variant and a real variant, the birth rate is assumed to be 1 minus the root mean square deviation (RMSD) in folding stability. Notice that the smaller this difference, the higher the birth rate. In both cases, the death rate (d) is considered as 1-b to allow a constant global (birth-death) rate. The birth-death process is simulated forward in time deciding whether every next event is a birth or a death event (Harmon, 2019; Stadler, 2010; Stadler, 2011) according to the fitness of the corresponding molecular variant, and it ends when a user-specified criterion, such as a particular sample size (considering or ignoring extinction nodes) or a certain evolutionary time (te), is reached. Starting from an “active” node (i.e., the root node) at current time tc, the time to the next event (birth to produce two descendants or death to produce the extinction of the node) can be calculated (details below). In contrast to standard birth-death processes where birth and death rates are constant over time and among lineages, the present method considers heterogeneity where each protein variant at a node has specific birth and death rates according to its corresponding fitness. The method is described below and summarized in Figure 1.
The process starts at the root node, assigning a user-specified protein sequence and corresponding protein structure (i.e., obtained from the PDB) to that node. In general, for every protein variant assigned to an active node, the corresponding birth and death rates are calculated following the indications presented above.
Calculation of the time to the next birth or death event. Following common birth-death methods, the time to the next event tn is calculated through an exponential distribution with rate based on the number of active nodes (s) and the sum of the birth and death rates (Harmon, 2019) (II),

Illustrative example of forward in time simulation of protein evolution integrating a birth-death population evolutionary process with fitness from the protein folding stability and the modeling of protein evolution with a structurally constrained substitution model.
Given a protein variant assigned to a node at time t (blue node), its fitness is calculated considering its protein folding stability. Then, the fitness is used to determine the birth and death rates for that variant, which provide the time to the next birth or death event (horizontal dashed line) that corresponds to the forward-in-time branch length. Next, the variant is evolved forward in time toward each descendant, upon the previously determined branch length, under a SCS model of protein evolution. The process is repeated, forward in time, starting at each new variant. If a death event occurs, the variant of the extinct node (pink node) is obtained but it does have descendants. The process finishes when a particular sample size or simulation time is reached (i.e., t+n).
As a consequence, since b+d=1 at each node, tn is consistent across all nodes, according to (Harmon, 2019).
Evaluate whether the simulation concludes before the next event occurs (tc + tn).
If it does conclude (i.e., tc + tn higher than te), the simulation of the evolutionary history finishes.
If it does not conclude, a random active node is selected, and its protein variant is analyzed to determine its type of next event (birth or death). The probability of a birth event is Pb = b/(b + d) and the probability of a death event is Pd = d/(b + d). A random sample from those probabilities is taken to determine the type of evolutionary event.
If a birth event is selected. Two descendant active nodes from the study node are incorporated with branch lengths tn, and the study node is then considered inactive. Next, molecular evolution is simulated from the original node to each descendant node based on the specified SCS model of protein evolution (Arenas et al., 2013). Thus, the process results in a protein variant for every descendant node. Finally, the folding stability and subsequent fitness of these descendant protein variants are calculated.
If a death event is selected. Then, the study node is considered inactive.
Return to step 1 while the user-specified criterion for ending the birth-death process is not satisfied and at least one active node exists in the evolutionary history. Otherwise, the simulation ends.
This process simultaneously simulates, forward in time, evolutionary history and protein evolution, with protein evolution influencing the evolutionary history through selection from the folding stability. Indeed, selection can vary among protein variants at their corresponding nodes of the evolutionary history. The process produces a forward in time birth-death phylogenetic history, that encompasses nodes that reached the ending time, internal nodes and nodes that were extinct at some time, along with the protein variant associated with each node.
The method includes several optional capabilities listed in Table S1 (Supplementary Material) and in the software documentation, with some summarized below.
The user can fix the birth and death rates along the evolutionary history or specify that they are based on the fitness of the corresponding protein variant as described before. For the latter, the fitness can be based on the folding stability of the analyzed protein variant or on the similarity in folding stability between the analyzed and real protein variants.
The birth-death simulation can finish any of the following criteria are met: (a) A specified sample size, including extinction nodes derived from death events. (b) A specified sample size, excluding extinction nodes. (c) A specified evolutionary time, measured from the root to a tip node.
Extinction nodes can either be preserved or removed from the evolutionary history.
Evaluation of the simulated birth-death evolutionary history in terms of tree balance using the Colless index (Colless, 1982; Lemant et al, 2022).
Possible variation of the site-specific substitution rate according to user specifications.
Customizable substitution model featuring site-specific exchangeability matrices (relative rates of change among amino acids and their frequencies at the equilibrium).
Implementation of several SCS models and a variety of empirical substitution models of protein evolution. The implemented SCS models were evaluated in our previous work (Arenas et al., 2013) and, the implemented empirical substitution models were properly identified using simulated data with ProtTest3 (Darriba et al, 2011).
The implemented SCS models consider molecular energy functions based on amino acid contact matrices and configurational entropies per residue in unfolded and misfolded proteins (Arenas et al., 2013; Bastolla et al, 2007; Mendez et al, 2010). Technical details about these SCS models are presented in our previous study (Arenas et al., 2013). The method also implements common empirical substitution models of protein evolution (i.e., Blosum62, CpRev, Dayhoff, DayhoffDCMUT, FLU, HIVb, HIVw, JTT, JonesDCMUT, LG, Mtart, Mtmam, Mtrev24, RtRev, VT and WAG; Table S1) and the user can specify any particular exchangeability matrix for all sites or for each site, allowing for heterogeneity of the substitution process among sites (details in Table S1 and in the software documentation). In addition, the framework implements heterogeneous substitution rates among sites by the traditional Gamma distribution (+G) (Yang, 1996) and proportion of invariable sites (+I) (Fitch & Margoliash, 1967), and also the user can directly alter the substitution rate at each site for any empirical or SCS model (Table S1). Regarding the evolutionary history, in addition to the birth-death process presented before, the user can specify a particular phylogenetic tree or simulate a coalescent evolutionary history (Hudson, 1983; Kingman, 1982) (Table S1). In this regard, we maintained the capabilities of the previous version, including the coalescent with recombination (Hudson, 1983) [which can be homogeneous or heterogeneous along the sequence (according to Wiuf & Posada, 2003)], variable population size over time (growth rate or demographic periods), several migration models [i.e., island (Hudson, 1998), stepping-stone (Kimura & Weiss, 1964) and continent-island (Wright, 1931)] with temporal variation of migration rates and convergence of demes or subpopulations, simulation of haploid or diploid data, and longitudinal sampling (Navascues et al, 2010) (details in Table S1 and in the software documentation). The framework outputs a simulated multiple sequence alignment with the protein sequences of the internal and tip nodes, as well as their folding stabilities, and the evolutionary history, among other information (Table S1 and software documentation).
Study data for evaluating the method for forecasting protein evolution
We evaluated the accuracy of the method for forecasting protein evolution using viral proteins sampled over time (longitudinal sampling). Specifically, we used protein sequences from previous experiments of virus evolution monitored over time, which contain consensus molecular data (avoiding rare variants) that belong to different evolutionary time points.
The matrix (MA) protein of HIV-1, with data obtained from an in vitro cell culture experiment where samples were collected at different times (Arenas et al, 2016). This data included 21 MA protein consensus sequences collected at times (population passages, T) T1 (initial time and includes an initial sequence) and T31 (time after 31 passages and includes 20 sequences), which accumulated 48 amino acid substitutions (sequence identity 0.973; Figure S1, Supplementary Material).
The main protease (Mpro) and papain-like protease (PLpro) proteins of SARS-CoV-2. For each protein, the data includes the first sequenced variant (Wuhan) and a sequence built with all the substitutions observed in a dataset of 384 genomes of the Omicron variant of concern collected from the GISAID database. The resulting Mpro and PLpro sequences presented 10 and 22 substitutions (sequence identity 0.967 and 0.930), respectively.
The protease (PR) of HIV-1 with data sampled from patients monitored over time, from 2008 to 2017, available from the Specialized Assistance Services in Sexually Transmissible Diseases and HIV/AIDS in Brazil (Ferreiro et al., 2022; Santos-Pereira et al, 2021; Souto et al, 2021). These evolutionary scenarios are complex due to the diverse antiretroviral therapies administrated to the patients (Table S2; Supplementary Material), which could vary during the studied time periods and that could promote the fixation of specific mutations (i.e., associated with resistance) (Ferreiro et al., 2022; Santos-Pereira et al., 2021; Souto et al., 2021). For each viral population (patient), the data included a consensus sequence for each of the four or five samples collected at different time points. These consensus sequences exhibited between one and 22 amino acid substitutions respect to the consensus sequence of the corresponding first sample (Table S2).
We identified a representative protein structure for each data, which was used to predict the folding stability and to inform the SCS model. In particular, we obtained the protein structures of the sequences at the initial time (T1) by homology modelling. We used SWISS-MODEL (Arnold et al, 2006) to identify the best-fitting templates of PDB structures (Table S3, Supplementary Material). Next, we predicted the protein structures by homology modelling with Modeller (Sali & Blundell, 1993) using the protein sequences and corresponding structural templates.
Forward in time prediction of viral protein variants
The evaluation was performed with the previously presented real data, which includes a real protein variant present at the initial time (T1) and subsequent protein variants present at a later time (Tn). We applied the method for forecasting protein evolution to predict the most likely protein variants at Tn derived from the real protein variant observed at the initial time (T1).
The prediction error was determined by measuring the distance between the real protein variants and the predicted variants corresponding to time Tn.
Thus, we assigned to the initial node (T1) the corresponding real protein variant (including its sequence and structure), and its evolution was simulated forward in time until it reached the number of substitutions observed in the real data at time Tn. Thus, we considered the number of observed substitutions as a measure of evolutionary time to allow proper comparisons between real and predicted variants. We simulated 100 alignments of protein sequences, each containing the same number of sequences as the real data at Tn. This included 20 HIV-1 MA protein sequences, a consensus sequence for the SARS-CoV-2 Mpro, a consensus sequence for the SARS-CoV-2 PLpro and, a consensus sequence of HIV-1 PR for every viral population. To avoid rare variants, each sequence of the simulated multiple sequence alignments was obtained as the consensus of 100 linked simulated sequences.
To investigate the effect of selection on the predictions, we compared the accuracy of forecasting protein evolution when selection from the protein structure is considered and, when it is ignored (neutral evolution). If selection is considered, as previously presented, the probability of birth and death events was based on the fitness of the protein variants, and protein evolution was modeled using an SCS model (Arenas et al., 2013). For the case of neutral evolution, all protein variants equally fit and are allowed, leading to only birth events, and we modeled protein evolution using an exchangeability matrix with the same relative rates of change for any pair of amino acids.
Accuracy of the predicted protein variants
We assessed the accuracy of the method for forecasting protein evolution by comparing the predicted and real protein variants present at time Tn, both at the protein sequence and structure levels.
For data containing multiple sequences at time Tn, we calculated the Kullback-Leibler (KL) divergence, which provides a distance between two multiple sequence alignments (the real and predicted data) based on the distribution of amino acid frequencies along the sequences (Equation III, where factors P and Q denote the distribution of amino acid frequencies in the real and predicted protein sequences at time Tn, respectively, i refers to protein site) (Kullback & Leibler, 1951). This distance was only calculated for data with a set of sequences at time Tn (HIV-1 MA) because a single sequence does not provide site-specific variability.
We also compared the real and predicted evolutionary trajectories of protein variants using the Grantham distance, which measures the differences between amino acids based on their physicochemical properties (Grantham, 1974). In particular, for both real and predicted protein variants, we calculated the Grantham distance at each protein site that differs between the two datasets, considering its evolution from T1 to the subsequent multiple sequence alignment at Tn. We examined sites that varied over time, thus the general site-specific Grantham distance Gi was calculated as the frequency of each amino acid f at site i multiplied by the specific Grantham distance between amino acid m at time Tn and amino acid n at time T1, normalized with the largest Grantham distance Gmax to obtain values between 0 and 1 (Equation IV). Next, to compare the real and predicted data, we calculated the site-specific difference of Grantham distance Gbi between the real P and predicted Q protein variants (Equation V),
In addition, we obtained and compared the protein folding stability (ΔG) of the predicted and real protein variants observed at time Tn, using their corresponding protein structures, with DeltaGREM (Arenas et al., 2015; Minning et al, 2013).
Results
Implementation of the forecasting protein evolution method
We extended the previous version of our framework ProteinEvolver (Arenas et al., 2013), maintaining its previous capabilities (i.e., simulation of protein evolution upon user-specified phylogenetic trees and upon phylogenetic trees simulated with the coalescent with or without recombination, migration, demographics and longitudinal sampling, empirical and SCS models, among others; Table S1), by adding, among others (Table S1), the forward in time modeling of protein evolution that combines a birth-death process based on the fitness of every protein variant (folding stability) at each node to determine its birth and death rates, as well as SCS models of protein evolution. The framework ProteinEvolver2 is written in C and distributed with a detailed documentation and a variety of illustrative practical examples. The framework is freely available from https://github.com/MiguelArenas/proteinevolver.
Evaluation of predictions of HIV-1 MA evolution
Regarding the evolution of the HIV-1 MA protein, the Grantham distance and the KL divergence between the real variants at time Tn and the corresponding predicted variants were low (around 5% and 6%, respectively; Table 1), and they did not differ comparing predictions that consider selection on the folding stability and predictions that ignore it (Table 1). On the other hand, we found that the folding stability of the protein variants predicted considering selection on the folding stability was closer to the folding stability of the real protein variants than that of the protein variants predicted under neutral evolution (Table 1). In particular, the protein variants predicted ignoring selection were less stable than those predicted considering selection and also less stable than the real protein variants.

Comparison of real and predicted sequences of the HIV-1 MA protein considering predictions based on the SCS and neutral models.
For the data simulated under the SCS and neutral models, the table shows the Grantham distance between the amino acids that changed during the real and predicted evolutionary trajectories and, the Kullback-Leibler (KL) divergence between the real and predicted multiple sequence alignments. Next, it shows the folding stability (ΔG) of the real protein variants at times T1 and T31, and the folding stability of the predicted protein variants at time T31. The error corresponds to the 95% confidence interval from the mean of predictions of folding stability.
Evaluation of predictions of SARS-CoV-2 Mpro and PLpro evolution
The analyses of the SARS-CoV-2 Mpro and PLpro data the Grantham distance between the real and predicted sequences was around 25% and 36%, respectively (Figure 2A). Again, this distance was similar when comparing predictions based on a model that considers selection on the protein folding stability and a model of neutral evolution. Regarding comparisons based on the protein folding stability, we found again that the model that considers selection from the folding stability produces variants closer to the stability of the real protein variants than the model that ignores selection (Figure 2B). Indeed, the protein sequences derived from the model that considers selection were more stable than those derived from the model of neutral evolution.

Prediction error of SARS-CoV-2 Mpro and PLpro evolution regarding physicochemical properties of the amino acid changes accumulated during the evolutionary trajectories and protein folding stability.
(A) Grantham distance calculated from the amino acid changes that occurred during the real and predicted evolutionary trajectories based on SCS and neutral models of protein evolution. (B) Variation of protein folding stability (ΔΔG) between real and predicted protein variants based on SCS and neutral models of protein evolution. Notice that positive ΔΔG indicates that the real protein variants are more stable than the predicted protein variants and vice versa. Error bars correspond to the 95% confidence interval of the mean of prediction error from 100 multiple sequence alignments simulated for the corresponding population and time.
Evaluation of predictions of HIV-1 PR evolution
In general, the Grantham distance, which compared the evolutionary trajectories of the real and predicted protein variants from time T1 to later times, varied among viral populations (patients) (Figure 3A). However, for the majority of these populations, the distance remained below 30% and exhibited minor fluctuations over time. One particular population exhibited a notable trend, with the distance increasing from 10% to nearly 60% over time. Considering that the length of the evolutionary trajectories of the protein can differ among the studied populations, we explored whether the accumulated number of amino acid substitutions could affect the accuracy of the predictions, and we found that the number of substitutions varied among populations and this variability did not correlate with the Grantham distance between the real and predicted data (R2=0.0001, Figure 3B). In general, the folding stability of the predicted protein variants was similar or slightly less stable than that of the real protein variants [with a difference ranging from 0 to 9 kcal/mol and a mean of 3.1 ± 0.9 (95% CI) kcal/mol; Figure 3C]. Indeed, the folding stability exhibited small fluctuations, increasing and decreasing, over time.

Prediction error of HIV-1 PR evolution at diverse populations regarding physicochemical properties of the amino acid changes accumulated during the evolutionary trajectories and protein folding stability.
(A) For each viral population (patient, represented with a particular color) and time, Grantham distance calculated from the amino acid changes that occurred during the real and predicted evolutionary trajectories. For each population, the mean of distances obtained over time is shown on the right. (B) Relationship between Grantham distances and accumulated number of substitution events (R2 = 0.0001, which indicates a lack of correlation between these parameters). (C) Variation of protein folding stability (ΔΔG) between real and predicted protein variants at each viral population and time. For each population, the mean of distances obtained over time is shown on the right. Notice that positive ΔΔG indicates that the real protein variants are more stable than the predicted protein variants and vice versa. Error bars correspond to the 95% confidence interval of the mean of prediction error from 100 multiple sequence alignments simulated for the corresponding viral population and time.
Discussion
While reconstructing evolutionary histories and ancestral sequences, among other inferences about the past, was popular in the field, predictions about evolution toward the future were traditionally ignored due to potential high prediction errors. However, in the last decade, forecasting evolution has gained attention because of its wide variety of applications and advancements in models of evolution (see the reviews Lässig et al., 2017; Wortel et al., 2023). Several studies showed that forecasting evolution can be feasible in some systems, including virus evolution (Luksza & Lassig, 2014; Thadani et al, 2023). Here, we also investigated forecasting evolution in viruses but at the molecular level, considering the relevance of the substitution process to produce molecular variants that affect the viral fitness (Arenas, 2015a; Arenas et al., 2016; Bloom & Neher, 2023; Poon et al, 2007; Watabe & Kishino, 2010). We believe that forecasting genome evolution remains challenging due to the complexity of its evolutionary processes [i.e., including epistatic interactions, chromosomal rearrangements and heterogeneous substitution patterns among genomic regions (Arbiza et al, 2011), among others] that complicate the parameterization and prediction of accurate fitness landscapes. However, we believe that it could be more feasible in structural proteins because of their relatively simpler evolutionary processes that usually include selection on folding stability (Bloom et al, 2006; Goldstein, 2011).
To make evolutionary predictions over time, either toward the past or toward the future, considering a substitution model of molecular evolution can be convenient. Actually, a variety of traditional studies showed that the substitution model influences the phylogenetic likelihood generated by probabilistic approaches used for evolutionary inference (Lemmon & Moriarty, 2004; Yang et al, 1994; Zhang & Nei, 1997). To study protein evolution, empirical substitution models of protein evolution are routinely used because they allow rapid calculations based on the assumption of site-independent evolution and typically apply the same exchangeability matrix for all the protein sites, which is highly unrealistic (Echave et al, 2016). Besides, a small set of empirical substitution models was developed for modeling the evolution of the diverse range of viral proteins (Dang et al, 2010; Del Amparo & Arenas, 2022b; Dimmic et al, 2002; Le & Vinh, 2020; Nickle et al, 2007). As an alternative, some substitution models that consider evolutionary constraints on the protein structure incorporated site-dependent evolution, which allows a more accurate modeling of protein evolution compared to the empirical substitution models (Arenas et al., 2013; Arenas et al., 2015; Bordner & Mittelmann, 2013; Ferreiro et al., 2024a; Parisi & Echave, 2005).
According to previous methods for forecasting evolution based on computer simulations (Lassig & Luksza, 2014; Neher et al., 2014), we also adopted the birth-death population genetics process to allow a forward-in-time evolutionary process where the birth and death rates can differ among nodes (Stadler, 2010; Stadler, 2011). Traditional population genetics methods to simulate the evolution of molecular data implement a two-step simulation process, where first the evolutionary history is simulated [i.e., with the coalescent theory or a forward-in-time simulation approach (Arenas, 2012; Hoban et al., 2012), often assuming neutral evolution] and, in a subsequent step, molecular evolution is simulated upon the previously obtained evolutionary history (Yang, 2006). However, this methodology is unrealistic because the fitness of the data can affect its evolutionary history (i.e., a variant with high fitness is likely to have more descendants than a variant with low fitness). Thus, we designed and implemented a method for forecasting protein evolution that integrates a birth-death population genetics process with a SCS model of protein evolution, where the folding stability of each protein variant is evaluated to predict its future trajectory in terms of both evolutionary history and molecular evolution. We implemented this forward-in-time simulation of protein evolution in a new version of our previous framework ProteinEvolver (Arenas et al., 2013), while maintaining its previous capabilities and extending some of them (i.e., incorporation of site-specific exchangeability matrices and additional substitution models of protein evolution, among others; see Table S1 and software documentation).
Next, we evaluated the forward-in-time evolutionary predictions with real data of HIV-1 and SARS-CoV-2 proteins. We determined the prediction errors between the real and the predicted protein variants by examining dissimilarity in evolutionary trajectory (Grantham distance based on physicochemical properties among the differing amino acids during the evolutionary trajectories), sequence divergence (distribution of amino acid frequencies among sites using the KL divergence), and protein structure (protein folding stability). Additionally, we analyzed the influence of accounting for selection, based on the protein folding stability, on the predictions.
In general, the sequence and evolutionary trajectory dissimilarities between the real and predicted protein variants were relatively small. For the HIV-1 MA protein, the SARS-CoV-2 Mpro and the SARS-CoV-2 PLpro, the sequence dissimilarity was below 10%, 26% and 36%, respectively. The low prediction errors for the HIV-1 MA protein were expected because the data was derived from an in vitro cell culture experiment that is not influenced by a variety of external factors that could affect the predictions in in vivo experiments. As a rough reference, in traditional ancestral sequence reconstruction (ASR) of protein data with high sequence identity the error was approximately 2% (Arenas & Posada, 2010). Notice that ASR methods can exhibit lower prediction errors due to their statistical evaluation based on a set of original sequences, rather than relying on a single original sequence used in the case of forecasting evolution. The contrasting scenario regarding evolutionary complexity was the HIV-1 PR data, which was sampled from populations (patients) undergoing various antiretroviral treatments (Ferreiro et al., 2022). There, the sequence dissimilarity varied among viral populations, although most were below 30%. A viral population exhibited higher dissimilarities (near 60%) that we believe could be caused by molecular adaptations promoted by the therapies, although this needs formal investigation.
An unexpected result was that the model of neutral evolution produced sequence dissimilarities between the real and predicted protein variants that were similar to those obtained with the SCS model (Table 1 and Figure 2). A few studies indicated that the substitution model has negligible effects on the reconstructed phylogenetic trees (Abadi et al, 2019; Spielman, 2020). Subsequent studies found that the influence of the substitution model on phylogenetic reconstructions is dependent on the diversity of the data, where data with high diversity is more sensible to the applied substitution model due to containing more evolutionary information to be modeled (Del Amparo & Arenas, 2022a, 2023). Considering that the data studied here has overall low diversity (i.e., sequence identity of 0.973, 0.967 and 0.930 for the HIV MA data, SARS-CoV-2 Mpro data and SARS-CoV-2 PLpro data, respectively; it is important to note that, in general, longitudinal data derived from monitored evolutionary processes usually show a low diversity because involve relatively short evolutionary histories, among other factors), we believe that the influence of the applied substitution models on the prediction of the sequences was small because of the small number of modeled substitution events, as found in phylogenetic reconstructions (Del Amparo & Arenas, 2022a, 2023). In all, the prediction error related to sequence dissimilarity varied among the studied evolutionary scenarios; as expected, it was higher for the more complex scenarios.
We also evaluated the prediction error between the real and predicted protein variants regarding their folding stability, again comparing the predictions made under a model that considers structural constraints and a model of neutral evolution. In general, the protein variants predicted under the SCS model presented a folding stability close to the folding stability of the respective real protein variants, with differences below 1, 2 and 7 kcal/mol for the HIV MA data, SARS-CoV-2 Mpro data and SARS-CoV-2 PLpro data, respectively. The higher differences (around 9 kcal/mol) were again observed for the HIV-1 PR data. In contrast to the prediction error based on sequence dissimilarity, the prediction error based on folding stability varies between predictions obtained under the SCS model and those obtained under the neutral model. In the studied evolutionary scenarios, the protein variants predicted under the neutral model were less stable and farther from the stability of the real protein variants compared to those predicted under the SCS model. These results were expected because, under SCS models, protein stability can be modeled with greater accuracy than sequence similarity due to selection for maintaining stability in the protein structure despite amino acid changes (Arenas & Bastolla, 2020; Illergard et al, 2009; Pascual-Garcia et al, 2010). Indeed, previous studies showed that models that ignore structural constraints often produce proteins with unrealistic folding instability (Arenas & Bastolla, 2020; Del Amparo et al., 2023), which suggests that accounting for protein folding stability in the modeling of protein evolution is recommended for predicting protein variants with appropriate structural properties.
We present a method to simulate forward-in-time protein evolution accounting for evolutionary constraints from the protein structure and a birth-death population process, and where the evolutionary history is influenced by the protein evolution and vice versa. The method is implemented in the computer framework ProteinEvolver2, which is freely distributed with several practical examples and detailed documentation. We believe that implementing methods into freely available phylogenetic frameworks is important to facilitate practical applications, as well as future improvements and evaluations. We applied the method to forecast protein evolution in some viral proteins. We found that the method provides acceptable approximations to the real evolution, especially in terms of protein folding stability, suggesting that combining structural constraints with birth-death population processes in the modeling of protein evolution is convenient. Still, to advance in methods for forecasting protein evolution, we believe that further efforts should be made in the field to improve the modeling of protein evolution, such as the incorporation of site-dependent evolutionary constraints from the protein activity.
Data Availability
The computer framework ProteinEvolver2 is freely available from https://github.com/MiguelArenas/proteinevolver. The SARS-CoV-2 data is available from GISAID database with DOI https://doi.org/10.55876/gis8.250206gt. The real and predicted protein variants are available from Zenodo repository at the URL https://doi.org/10.5281/zenodo.14767475.
Acknowledgements
This work was supported by the Project PID2023-151032NB-C22 funded by MCIU/AEI/10.13039/501100011033 and by FEDER, UE. D.F. was funded by a fellowship from Xunta de Galicia [ED481A-2020/192]. We thank the “Centro de Supercomputación de Galicia (CESGA)” for the computer resources. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Additional files
References
- Model selection may not be a mandatory step for phylogeny reconstructionNat Commun 10:934
- Genome-wide heterogeneity of nucleotide substitution model fitGenome Biol Evol 3:896–908
- Simulation of Molecular Data under Diverse Evolutionary ScenariosPLoS Comput Biol 8:e1002495
- Genetic Consequences of Antiviral Therapy on HIV-1Comput Math Methods Med 2015:9
- Trends in substitution models of molecular evolutionFront Genet 6:319
- ProtASR2: Ancestral reconstruction of protein sequences accounting for folding stabilityMethods Ecol Evol 11:248–257
- Protein evolution along phylogenetic histories under structurally constrained substitution modelsBioinformatics 29:3020–3028
- Influence of mutation and recombination on HIV-1 in vitro fitness recoveryMol Phylogenet Evol 94:264–270
- The effect of recombination on the reconstruction of ancestral sequencesGenetics 184:1133–1139
- Maximum likelihood phylogenetic inference with selection on protein folding stabilityMol Biol Evol 32:2195–2207
- The SWISS-MODEL workspace: a web-based environment for protein structure homology modellingBioinformatics 22:195–201
- Relative rate and location of intra-host HIV evolution to evade cellular immunity are predictableNature Communications 7:11660
- Stability constraints and protein evolution: the role of chain length, composition and disulfide bondsProtein Eng Des Sel 18:405–415
- A protein evolution model with independent sites that reproduces site-specific amino acid distributions from the Protein Data BankBMC Evol Biol 6:43
- Structural approaches to sequence evolution. Springer, BerlinHeidelberg
- Protein stability promotes evolvabilityProc Natl Acad Sci U S A 103:5869–5874
- Fitness effects of mutations to SARS-CoV-2 proteinsVirus Evol 9:vead055
- A new formulation of protein evolutionary models that account for structural constraintsMol Biol Evol 31:736–749
- Predicting evolution from genomics: experimental evolution of bacteriophage T7Heredity 100:453–463
- Colloquium papers: Adaptive landscapes and protein evolutionProc Natl Acad Sci U S A 107:1747–1751
- Review of Phylogenetics: The Theory and Practice of Phylogenetic SystematicsSystematic Zoology 31:100–104
- FLU, an amino acid substitution model for influenza proteinsBMC Evol Biol 10:99
- ProtTest 3: fast selection of best-fit models of protein evolutionBioinformatics 27:1164–1165
- The utility of fitness landscapes and big data for predicting evolutionHeredity (Edinb) 121:401–405
- Consequences of Substitution Model Selection on Protein Ancestral Sequence ReconstructionMolecular Biology and Evolution 39:msac144
- HIV Protease and Integrase Empirical Substitution Models of Evolution: Protein-Specific Models Outperform Generalist ModelsGenes :13
- Influence of substitution model selection on protein phylogenetic tree reconstructionGene 865:147336
- Consequences of Genetic Recombination on Protein Folding StabilityJournal of Molecular Evolution 91:33–45
- Beneficial Mutation–Selection Balance and the Effect of Linkage on Positive SelectionGenetics 176:1759–1798
- Every which way? On predicting tumor evolution using cancer progression modelsPLoS Comput Biol 15:e1007246
- rtREV: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogenyJ Mol Evol 55:65–73
- A computational method for predicting the most likely evolutionary trajectories in the stepwise accumulation of resistance mutationseLife 12:e84756https://doi.org/10.7554/eLife.84756
- Causes of evolutionary rate variation among protein sitesNat Rev Genet 17:109–121
- Biophysical Models of Protein Evolution: Understanding the Patterns of Evolutionary Sequence DivergenceAnnu Rev Biophys 46:85–103
- Selection among site-dependent structurally constrained substitution models of protein evolution by approximate Bayesian computationBioinformatics 40:btae096
- The evolution of the HIV-1 protease folding stabilityVirus Evol 8:veac115
- Substitution Models of Protein Evolution with Selection on Enzymatic ActivityMolecular Biology and Evolution 41:msae026
- The value of monitoring to control evolving populationsProc Natl Acad Sci U S A 112:1007–1012
- A method for estimating the number of invariant amino acid coding positions in a gene using cytochrome c as a model caseBiochem Genet 1:65–71
- Site-specific amino acid replacement matrices from structurally constrained protein evolution simulationsMol Biol Evol 19:352–356
- The conditioned reconstructed processJ Theor Biol 253:769–778
- Real time forecasting of near-future evolutionJ R Soc Interface 9
- The Role of Evolutionary Selection in the Dynamics of Protein Structure EvolutionBiophys J 112:1350–1365
- The evolution and evolutionary consequences of marginal thermostability in proteinsProteins 79:1396–1407
- Population Size Dependence of Fitness Effect Distribution and Substitution Rate Probed by Biophysical Model of Protein ThermostabilityGenome Biol Evol 5:1584–1593
- Stability-mediated epistasis constrains the evolution of an influenza proteineLife 2:e00631https://doi.org/10.7554/eLife.00631
- Dynamic Mutation–Selection Balance as an Evolutionary AttractorGenetics 191:1309–1319
- Amino acid difference formula to help explain protein evolutionScience 185:862–864
- Introduction to birth-death modelsIn: Phylogenetic Comparative Methods https://lukejharmon.github.io/pcm/chapter10_birthdeath/
- Computer simulations: tools for population and evolutionary geneticsNat Rev Genet 13:110–122
- Properties of a neutral allele model with intragenic recombinationTheoretical Population Biology 23:183–201
- Gene genealogies and the coalescent processOxford Surveys in Evolutionary Biology 7:1–44
- Island models and the coalescent processMol Ecol 7:413–418
- Structure is three to ten times more conserved than sequence--a study of structural response in protein coresProteins 77:499–508
- Capturing the mutational landscape of the beta-lactamase TEM-1Proceedings of the National Academy of Sciences 110:13067–13072
- The Stepping Stone Model of Population Structure and the Decrease of Genetic Correlation with DistanceGenetics 49:561–576
- The coalescentStochastic Processes and their Applications 13:235–248
- On information and sufficiencyAnn Math Stat 22:79–86
- Can we read the future from a treeeLife 3
- Predicting evolutionNature Ecology & Evolution 1:0077
- FLAVI: An amino acid substitution model for flavivirusesJ Mol Evol
- Robust, Universal Tree Balance IndicesSystematic Biology 71:1210–1224
- The importance of proper model assumption in bayesian phylogeneticsSyst Biol 53:265–277
- The interface of protein structure, protein biophysics, and molecular evolutionProtein Sci 21:769–785
- Predicting mutational routes to new adaptive phenotypeseLife :8
- A predictive fitness model for influenzaNature 507:57–61
- Ancestral lysozymes reconstructed, neutrality tested, and thermostability linked to hydrocarbon packingNature 345:86–89
- Mutation bias favors protein folding stability in the evolution of small populationsPLoS Comput Biol 6:e1000767
- Detecting selection for negative design in proteins through an improved model of the misfolded stateProteins 81:1102–1112
- Direct-coupling analysis of residue coevolution captures native contacts across many protein familiesProc Natl Acad Sci U S A 108:E1293–1301
- Evolution of TOP1 and TOP1MT Topoisomerases in ChordataJournal of Molecular Evolution 91:192–203
- Predictive Modeling of Influenza Shows the Promise of Applied Evolutionary BiologyTrends Microbiol 26:102–118
- Prediction of resistance development against drug combinations by collateral responses to component drugsScience translational medicine 6:262ra156
- Combining contemporary and ancient DNA in population genetic and phylogeographical studiesMol Ecol Resour 10:760–772
- Genealogies of rapidly adapting populationsProceedings of the National Academy of Sciences 110:437–442
- Predicting evolution from the shape of genealogical treeseLife :3
- HIV-specific probabilistic models of protein evolutionPLoS One 2:e503
- A rugged yet easily navigable fitness landscapeScience 382:eadh3860
- Structural constraints and emergence of sequence patterns in protein evolutionMol Biol Evol 18:750–756
- Generality of the structurally constrained protein evolution model: assessment on representatives of the four main fold classesGene 345:45–53
- Quantifying the evolutionary divergence of protein structures: the role of function change and function conservationProteins 78:181–196
- The Molecular Clock in the Evolution of Protein StructuresSyst Biol 68:987–1002
- Mapping protease inhibitor resistance to human immunodeficiency virus type 1 sequence polymorphisms within patientsJ Virol 81:13598–13607
- Site interdependence attributed to tertiary structure in amino acid sequence evolutionGene 347:207–217
- Biophysical principles predict fitness landscapes of drug resistanceProceedings of the National Academy of Sciences 113:E1470–E1478
- Adaptive diversification and niche packing on rugged fitness landscapesJ Theor Biol 562:111421
- Coevolution analyses illuminate the dependencies between amino acid sites in the chaperonin system GroES-LBMC Evol Biol 13:156
- Comparative protein modelling by satisfaction of spatial restraintsJ Mol Biol 234:779–815
- Nationwide Study of Drug Resistance Mutations in HIV-1 Infected Individuals under Antiretroviral Therapy in BrazilInt J Mol Sci :22
- Evolutionary dynamics of HIV-1 subtype C in BrazilSci Rep 11:23060
- Relative Model Fit Does Not Predict Topological Accuracy in Single-Gene Protein PhylogeneticsMolecular Biology and Evolution 37:2110–2123
- The ribonuclease from an extinct bovid ruminantFEBS Lett 262:104–106
- Sampling-through-time in birth–death treesJournal of Theoretical Biology 267:396–404
- Simulating trees with a fixed number of extant speciesSyst Biol 60:676–684
- Learning from prepandemic data to forecast viral escapeNature 622:818–825
- Resurrecting the ancestral steroid receptor: ancient origin of estrogen signalingScience 301:1714–1717
- Evolution of Coral Pigments RecreatedScience 305:1433–1433
- Measuring ruggedness in fitness landscapesProceedings of the National Academy of Sciences 112:7345–7346
- Structural considerations in the fitness landscape of a virusMol Biol Evol 27:1782–1791
- A coalescent model of recombination hotspotsGenetics 164:407–417
- Towards evolutionary predictions: Current promises and challengesEvolutionary Applications 16:3–21
- Evolution in Mendelian populationsGenetics 16:97–159
- A biophysical protein folding model accounts for most mutational fitness effects in virusesProceedings of the National Academy of Sciences 108:9916–9921
- Among-site rate variation and its impact on phylogenetic analysisTrends Ecol Evol 11:367–372
- Computational Molecular EvolutionOxford, England: Oxford University Press
- Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimationMolecular Biology and Evolution 11:316–324
- Predicting ecosystem changes by a new model of ecosystem evolutionScientific Reports 13:15353
- Protein stability imposes limits on organism complexity and speed of molecular evolutionProceedings of the National Academy of Sciences 104:16152–16157
- Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methodsJ Mol Evol 44:S139–146
Article and author information
Author information
Version history
- Sent for peer review:
- Preprint posted:
- Reviewed Preprint version 1:
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.106365. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2025, Ferreiro et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.