Forecasting protein evolution by integrating birth-death population models with structurally constrained substitution models

eLife Assessment

This manuscript introduces a useful protein-stability-based fitness model for simulating protein evolution and unifying non-neutral models of molecular evolution with phylogenetic models. The model is applied to five viral proteins that are of structural and functional importance. While the general modelling approach is solid, and effectively preserves folding stability, the evidence for the model's predictive power remains limited, since it shows little improvement over neutral models in predicting protein evolution. The work should be of interest to researchers developing theoretical models of molecular evolution.

https://doi.org/10.7554/eLife.106365.3.sa0

Significance of the findings:

Useful: Findings that have focused importance and scope

Landmark
Fundamental
Important
Valuable
Useful

Strength of evidence:

Solid: Methods, data and analyses broadly support the claims with only minor weaknesses

Exceptional
Compelling
Convincing
Solid
Incomplete
Inadequate

During the peer-review process the editor and reviewers write an eLife Assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate). Learn more about eLife Assessments

Abstract
Introduction
Methods
Results
Discussion
Data availability
References
Article and author information
Metrics

Abstract

Evolutionary studies in population genetics and ecology were mainly focused on predicting and understanding past evolutionary events. Recently, however, a growing trend explores the prediction of evolutionary trajectories toward the future promoted by its wide variety of applications. In this context, we introduce a forecasting protein evolution method that integrates birth-death population models with substitution models that consider selection on protein folding stability. In contrast to traditional population genetics methods that usually make the unrealistic assumption of simulating molecular evolution separately from the evolutionary history, the present method combines both processes to simultaneously model forward-in-time birth-death evolutionary trajectories and protein evolution under structurally constrained substitution models that outperformed traditional empirical substitution models. We implemented the method into a freely available computer framework. We evaluated the accuracy of the predictions with several monitored viral proteins of broad interest. Overall, the method showed acceptable errors in predicting the folding stability of the forecasted protein variants, but, expectedly, the errors were larger in the prediction of the corresponding sequences. We conclude that forecasting protein evolution is feasible in certain evolutionary scenarios and provide suggestions to enhance its accuracy by improving the underlying models of evolution.

Introduction

Molecular evolution is traditionally investigated through inferences about past evolutionary events, such as phylogenetic tree and ancestral sequence reconstructions, and predictions about the future were considered as inaccessible for a long time because they can be affected by complex processes such as environmental change. Nevertheless, a variety of biological systems display a Darwinian evolutionary process where selection operates toward a limited set of adapted variants. These variants, and in extension the evolutionary trajectories to reach them, would be positively selected and could present a certain degree of predictability (Lässig et al., 2017; Wortel et al., 2023). The progress made in developing more accurate models of evolution (Arenas, 2015b) and the benefits from predicting the outcome of evolution (i.e. to understand the course of evolution or to prepare for the future Wortel et al., 2023) motivated a variety of investigations on forecasting evolution in diverse fields including medicine, agriculture, biotechnology, and conservation biology, among others (e.g. Barton et al., 2016; Bull and Molineux, 2008; de Visser et al., 2018; Diaz-Uriarte and Vasallo, 2019; Fischer et al., 2015; Gerrish and Sniegowski, 2012; Lässig and Łuksza, 2014; Lind et al., 2019; Luksza and Lässig, 2014; Morris et al., 2018; Munck et al., 2014; Neher et al., 2014; Wortel et al., 2023). Unfortunately, forecasting evolution is not always achievable. Under neutral evolution, all the molecular variants are equally likely to be present in the population, showing lack of repeatability and disallowing accurate prediction of future variants. Thus, forecasting evolution requires a system with measurable selection pressures, and where certain positively selected variants could produce more descendants than other variants and expand in the population (Desai and Fisher, 2007; Goyal et al., 2012; Neher and Hallatschek, 2013; Neher et al., 2014). Actually, a rougher fitness landscape resulting from selection can lead to greater accuracy in evolutionary predictions (Papkou et al., 2023; Rubin et al., 2023; Van Cleve and Weissman, 2015). Overall, an evolutionary process could be predictable to some extent (prediction errors are inevitable with any method and in any evolutionary scenario) depending on the strength of selection driving evolution and the heterogeneity in fitness among different variants.

Here, we focus on forecasting protein evolution because it involves molecular evolutionary processes driven by selection pressures and where the fitness of each variant can be parameterized and predicted (Carneiro and Hartl, 2010; Gilson et al., 2017). Traditionally, evolutionary histories and ancestral sequences of proteins are inferred using probabilistic methods based on advanced substitution models of protein evolution (e.g. Arenas, 2015b; Arenas and Bastolla, 2020; Ferreiro et al., 2022; Malcolm et al., 1990; Moreira et al., 2023; Stackhouse et al., 1990; Thornton et al., 2003; Ugalde et al., 2004). The accuracy of these inferences is affected by the accuracy of the applied substitution model, where substitution models that better fit with the study data usually produce more accurate phylogenetic trees and ancestral sequences (Arenas and Bastolla, 2020; Del Amparo and Arenas, 2022a; Lemmon and Moriarty, 2004). These findings suggest that accurate substitution models of evolution are also convenient for forecasting protein evolution. In this regard, a variety of studies showed that structurally constrained substitution (SCS) models of protein evolution provide more accurate evolutionary inferences than the traditional empirical substitution models of protein evolution, in terms of phylogenetic likelihood, distribution of amino acid frequencies among protein sites, rates of molecular evolution and folding stability of reconstructed proteins, among other aspects (e.g. Arenas and Bastolla, 2020; Arenas et al., 2016a; Bastolla et al., 2006; Bordner and Mittelmann, 2014; Del Amparo et al., 2023; Echave and Wilke, 2017; Ferreiro et al., 2024a; Ferreiro et al., 2024b; Fornasari et al., 2002; Parisi and Echave, 2001; Pascual-García et al., 2019; Rodrigue et al., 2005), although SCS models usually demand more computational resources than substitution models that only include information from the protein sequence. Notice that the protein structure provides information about the location and molecular interactions of amino acids at different protein sites, which could be far from each other in the sequence but close in the three-dimensional structure and interact affecting their evolution (Morcos et al., 2011; Ruiz-González and Fares, 2013). Indeed, selection from the protein folding stability is relevant in the evolution of multiple proteins, including those in microbial and viral systems (e.g. Ferreiro et al., 2022; Gong et al., 2013; Jacquier et al., 2013; Rodrigues et al., 2016; Wylie and Shakhnovich, 2011; Zeldovich et al., 2007). Therefore, we believe that it should be taken into account for forecasting protein evolution in such systems.

Predictions about future evolutionary events can be performed with simulation-based methods (e.g. Eccleston et al., 2023; Neher et al., 2014; Yoshida et al., 2023). In order to simulate molecular evolution, traditional population genetics methods apply two separate steps (Arenas, 2012; Hoban et al., 2012). First, the simulation of the evolutionary history (i.e. a phylogenetic tree) using approaches such as the coalescent and birth-death population processes (Gernhard, 2008; Hudson, 1990; Kingman, 1982; Stadler, 2010). Afterward, the forward-in-time simulation of molecular evolution is performed, from the root node to the tip nodes, upon the previously simulated evolutionary history (Yang, 2006). This methodology was implemented into a variety of population genetics frameworks that simulate molecular evolution (Arenas, 2012; Hoban et al., 2012). However, for technical and computational simplicity, it assumes that the simulation of the evolutionary history is independent from the simulation of molecular evolution, which can produce biological incoherences (i.e. the evolutionary history is usually simulated under neutral evolution while molecular evolution is usually simulated with substitution models that consider selection). To enhance the realism of this modeling, here we merged both processes into a single one where evolutionary history influences molecular evolution and vice versa. In particular, we adopted a birth-death population genetics method to simulate the forward-in-time evolutionary history already used for forecasting evolution (Lässig and Łuksza, 2014; Neher et al., 2014), taking into account the fitness of the molecular variant (through evolutionary constraints from the protein folding stability Bastolla and Demetrius, 2005; Gong et al., 2013; Liberles et al., 2012; Zeldovich et al., 2007) at the corresponding node, to determine its subsequent birth or death event, and we integrated this process with SCS models to model protein evolution along the derived phylogenetic branches. The method is detailed below, and we implemented it into a new version of our computer framework ProteinEvolver (Arenas et al., 2013), which is freely available from https://github.com/MiguelArenas/proteinevolver (Arenas, 2025) ProteinEvolver2 includes detailed documentation and a variety of ready-to-use examples. Next, considering the potential applications of forecasting evolution to design vaccines and therapies against pathogens, we evaluated and applied the method to forecasting protein evolution in several real protein data of viruses monitored over time.

Methods

A method for forecasting protein evolution by combining birth-death population genetics with structurally constrained substitution models of protein evolution

Following previous methods for forecasting evolution based on simulations (Lässig and Łuksza, 2014; Neher et al., 2014), we developed a method to simulate the forward-in-time evolutionary history of a protein sample with a birth-death process that considers the fitness of the protein variant (based on folding stability) at every temporal node. The method derives the birth and death rates for a protein variant based on the fitness of that variant, where a high fitness results in a high birth rate and a low death rate, which can lead to a large number of descendants, and the opposite leading to a few or none (extinction) descendants. Thus, the fitness of the molecular variant at every node drives its corresponding forward in time birth-death evolutionary history. The details of this simulation process are outlined below.

First, similarly to common simulators of molecular evolution (Arenas, 2012; Hoban et al., 2012; Yang, 2006), a given protein sequence and structure (hereafter, protein variant) is assigned to the root node. The fitness (f) of the protein variant (A) is calculated from its folding stability (free energy, ΔG) following the Boltzmann distribution (Goldstein, 2013) (Equation 1, which takes values from 0 to 1),

f (A) = \frac{1}{1 + e^{Δ G / kT}}

Protein folding stability constrains protein evolution and is commonly used to obtain protein fitness (Bastolla et al., 2007; Goldstein, 2013; Liberles et al., 2012; Lobkovsky et al., 2010; Mendez et al., 2010; Sella and Hirsh, 2005; Zeldovich et al., 2007). The user can alternatively choose whether the fitness of the modeled protein variant is determined solely by its folding stability or by its similarity to the stability of a real protein variant (i.e., a protein structure from the Protein Data Bank, PDB). We believe the latter can be more realistic, as in nature, high folding stability does not necessarily indicate high fitness, but a stability that closely resembles that of a real protein may suggest high fitness since the real stability is the result of a selection process (which also incorporates negative design). If the fitness is derived from only the folding stability of the protein variant, the birth rate (b) is considered equal to the fitness. Alternatively, if the fitness is determined based on the similarity in folding stability between the modeled variant and a real variant, the birth rate is assumed to be 1 minus the root mean square deviation (RMSD, which offers advantages such as minimizing the influence of small deviations while amplifying larger differences, thereby enhancing the detection of remarkable molecular changes) in folding stability. Notice that the smaller this difference, the higher the birth rate. In both cases, the death rate (d) is considered as 1-b to allow a constant global (birth-death) rate. In this model, the fitness influences reproductive success, where protein variants with higher fitness have higher birth rates leading to more birth events, while those with lower fitness have higher death rates leading to more extinction events. This parameterization is meaningful in the context of protein evolution because the fitness of a protein variant can affect its survival (birth or extinction) without necessarily altering its rate of evolution. Although a higher growth rate can sometimes correlate with higher fitness, a variant with high fitness does not necessarily accumulate substitutions more rapidly.

Additionally, we incorporated another birth-death model that follows the proposal by Neher et al., 2014, in which the death rate is fixed at 1 and the birth rate is modeled as 1+fitness. In this model, fitness not only affects reproductive success but also influences the global birth-death rate, which can vary among lineages.

The birth-death process is simulated forward in time deciding whether every next event is a birth or a death event (Harmon, 2019; Stadler, 2010; Stadler, 2011) according to the fitness of the corresponding molecular variant, and it ends when a user-specified criterion, such as a particular sample size (considering or ignoring extinction nodes) or a certain evolutionary time (t_e), is reached. Starting from an ‘active’ node (i.e. the root node) at current time t_c, the time to the next event (birth to produce two descendants or death to produce the extinction of the node) can be calculated (details below). In contrast to standard birth-death processes where birth and death rates are constant over time and among lineages, the present method considers heterogeneity where each protein variant at a node has specific birth and death rates according to its corresponding fitness. The method is described below and summarized in Figure 1.

Figure 1

Download asset Open asset

Illustrative example of forward in time simulation of protein evolution integrating a birth-death population evolutionary process with fitness from the protein folding stability and the modeling of protein evolution with a structurally constrained substitution model.

Given a protein variant assigned to a node at time t (blue node), its fitness is calculated considering its protein folding stability. Then, the fitness is used to determine the birth and death rates for that variant, which provide the time to the next birth or death event (horizontal dashed line) that corresponds to the forward-in-time branch length. Next, the variant is evolved forward in time toward each descendant, upon the previously determined branch length, under an SCS model of protein evolution. The process is repeated, forward in time, starting at each new variant. If a death event occurs, the variant of the extinct node (pink node) is obtained, but it does have descendants. The process finishes when a particular sample size or simulation time is reached (i.e. *t+n*).

(1) The process starts at the root node, assigning a user-specified protein sequence and corresponding protein structure (i.e. obtained from the PDB) to that node. In general, for every protein variant assigned to an active node, the corresponding birth and death rates are calculated following the indications presented above.
(2) Calculation of the time to the next birth or death event. Following common birth-death methods, the time to the next event t_n is calculated through an exponential distribution with rate based on the number of active nodes (s) and the sum of the birth and death rates (Harmon, 2019; Equation 2),

t_{n} = e^{(s (b + d))}

One of the birth-death models that we implemented considers that b+d = 1 at each node, allowing variation of the reproductive success among nodes while keeping t_n consistent among them, according to Harmon, 2019. In contrast, in the other birth-death model we implemented, b+d can vary among nodes (Neher et al., 2014), thereby allowing variation of both reproductive success and t_n among nodes.
(3) Evaluate whether the simulation concludes before the next event occurs (t_c +t_n).
(4) If it does conclude (i.e. t_c +t_n higher than t_e), the simulation of the evolutionary history finishes.
(5) If it does not conclude, a random active node is selected, and its protein variant is analyzed to determine its type of next event (birth or death). The probability of a birth event is P_b = b/(b+d) and the probability of a death event is P_d = d/(b+d). A random sample from those probabilities is taken to determine the type of evolutionary event.
(6) If a birth event is selected. Two descendant active nodes from the study node are incorporated with branch lengths t_n, and the study node is then considered inactive. Next, molecular evolution is simulated from the original node to each descendant node based on the specified SCS model of protein evolution (Arenas et al., 2013). The integration of SCS models to evolve protein variants along given branch lengths followed standards approaches of molecular evolution in population genetics, in which the branch length and the substitution model inform the number and type of substitution events, respectively (Arenas, 2012; Carvajal-Rodríguez, 2010; Hoban et al., 2012; Yang, 2006). Thus, the process results in a protein variant for every descendant node. Finally, the folding stability and subsequent fitness of these descendant protein variants are calculated.
(7) If a death event is selected. Then, the study node is considered inactive.
(8) Return to step 1 while the user-specified criterion for ending the birth-death process is not satisfied and at least one active node exists in the evolutionary history. Otherwise, the simulation ends.

This process simultaneously simulates, forward in time, evolutionary history and protein evolution, with protein evolution influencing the evolutionary history through selection from the folding stability. Indeed, selection can vary among protein variants at their corresponding nodes of the evolutionary history. The process produces a forward in time birth-death phylogenetic history that encompasses nodes that reached the ending time, internal nodes, and nodes that were extinct at some time, along with the protein variant associated with each node.

The method includes several optional capabilities listed in Supplementary file 1A and in the software documentation, with some summarized below.
The user can fix the birth and death rates along the evolutionary history or specify that they are based on the fitness of the corresponding protein variant as described before. For the latter, the fitness can be based on the folding stability of the analyzed protein variant or on the similarity in folding stability between the analyzed and real protein variants. In addition, the global birth-death rate can be constant or vary among lineages depending on the specified birth-death model.
The birth-death simulation can finish when any of the following criteria is met: (a) A specified sample size, including extinction nodes derived from death events. (b) A specified sample size, excluding extinction nodes. (c) A specified evolutionary time, measured from the root to a tip node.
Extinction nodes can either be preserved or removed from the evolutionary history.
Evaluation of the simulated birth-death evolutionary history in terms of tree balance using the Colless index (Colless and Wiley, 1982; Lemant et al., 2022).
Possible variation of the site-specific substitution rate according to user specifications.
Customizable substitution model featuring site-specific exchangeability matrices (relative rates of change among amino acids and their frequencies at the equilibrium).
Implementation of several SCS models and a variety of empirical substitution models of protein evolution. The implemented SCS models were evaluated in our previous work (Arenas et al., 2013), and the implemented empirical substitution models were properly identified using simulated data with ProtTest3 (Darriba et al., 2011).

The implemented SCS models consider molecular energy functions based on amino acid contact matrices and configurational entropies per residue in unfolded and misfolded proteins (Arenas et al., 2013; Bastolla et al., 2007; Mendez et al., 2010). These models incorporate both positive and negative design strategies. In particular, the evaluation of the target protein structure while taking into account a database of residue contacts from alternative protein structures in the PDB, thus considering background genetic information that helps reduce prediction biases (Minning et al., 2013). Technical details about these SCS models are presented in our previous study (Arenas et al., 2013). Next, SCS models outperformed models that ignore structural evolutionary constraints in terms of phylogenetic likelihood, among other properties (Arenas et al., 2013; Arenas et al., 2016a; Bordner and Mittelmann, 2014). The method also implements common empirical substitution models of protein evolution (i.e. Blosum62, CpRev, Dayhoff, DayhoffDCMUT, FLU, HIVb, HIVw, JTT, JonesDCMUT, LG, Mtart, Mtmam, Mtrev24, RtRev, VT, and WAG; Supplementary file 1A) and the user can specify any particular exchangeability matrix for all sites or for each site, allowing for heterogeneity of the substitution process among sites (details in Supplementary file 1A and in the software documentation). In addition, the framework implements heterogeneous substitution rates among sites by the traditional Gamma distribution (+G; Yang, 1996) and proportion of invariable sites (+I; Fitch and Margoliash, 1967), and also the user can directly alter the substitution rate at each site for any empirical or SCS model (Table S1). Regarding the evolutionary history, in addition to the birth-death process presented before, the user can specify a particular phylogenetic tree or simulate a coalescent evolutionary history (Hudson, 1983; Kingman, 1982; Supplementary file 1A). In this regard, we maintained the capabilities of the previous version, including the coalescent with recombination (Hudson, 1983) which can be homogeneous or heterogeneous along the sequence according to Wiuf and Posada, 2003, variable population size over time (growth rate or demographic periods), several migration models that is island (Hudson, 1998), stepping-stone (Kimura and Weiss, 1964), and continent-island Wright, 1931 with temporal variation of migration rates and convergence of demes or subpopulations, simulation of haploid or diploid data, and longitudinal sampling (Navascués et al., 2010; details in Supplementary file 1A and in the software documentation). The framework outputs a simulated multiple sequence alignment with the protein sequences of the internal and tip nodes, as well as their folding stabilities and the evolutionary history, among other information (Supplementary file 1A and software documentation).

Study data for evaluating the method for forecasting protein evolution

We evaluated the accuracy of the method for forecasting protein evolution using viral proteins sampled over time (longitudinal sampling). Specifically, we used protein sequences from previous experiments of virus evolution monitored over time, which contain consensus molecular data (avoiding rare variants) that belong to different evolutionary time points.

The matrix (MA) protein of HIV-1, with data obtained from an in vitro cell culture experiment where samples were collected at different times (Arenas et al., 2016b). This data included 21 MA protein consensus sequences collected at times (population passages, T) T1 (initial time and includes an initial sequence) and T31 (time after 31 passages and includes 20 sequences), which accumulated 48 amino acid substitutions (sequence identity 0.973; Figure 2).
The main protease (Mpro) and papain-like protease (PLpro) proteins of SARS-CoV-2. For each protein, the data includes the first sequenced variant (Wuhan) and a sequence built with all the substitutions observed in a dataset of 384 genomes of the Omicron variant of concern collected from the GISAID database. The resulting Mpro and PLpro sequences presented 10 and 22 substitutions (sequence identity 0.967 and 0.930), respectively.
The non-structural protein 1 (NS1) of the influenza virus. We retrieved the NS1 sequences from the years 2005, 2015 and 2020 from the Influenza Virus Resource (Bao et al., 2008). Next, we obtained the consensus sequence of the 2005 dataset as initial time point (T1), and the consensus sequences from the 2015 (T2) and 2020 (T3) datasets as subsequent time points. The resulting consensus sequences for 2015 and 2020 showed 40 and 37 substitutions (sequence identity 0.802 and 0.817), respectively.
The protease (PR) of HIV-1 with data sampled from patients monitored over time, from 2008 to 2017, available from the Specialized Assistance Services in Sexually Transmissible Diseases and HIV/AIDS in Brazil (Ferreiro et al., 2022; Santos-Pereira et al., 2021; Souto et al., 2021). These evolutionary scenarios are complex due to the diverse antiretroviral therapies administered to the patients (Supplementary file 1B), which could vary during the studied time periods and that could promote the fixation of specific mutations (i.e. associated with resistance; Ferreiro et al., 2022; Santos-Pereira et al., 2021; Souto et al., 2021). For each viral population (patient), the data included a consensus sequence for each of the four or five samples collected at different time points. These consensus sequences exhibited between one and 22 amino acid substitutions with respect to the consensus sequence of the corresponding first sample (Supplementary file 1B).

Figure 2

Download asset Open asset

Distribution of amino acid substitutions observed along the HIV-1 MA sequences at time *T31*.

Left: Distribution of the observed amino acid substitutions along the HIV-1 matrix (MA) protein sequences at time *T31*. Right: Distribution of the indicated amino acid substitutions (shown in blue) along the protein structure.

We identified a representative protein structure for each dataset, which was used to predict the folding stability and to inform the SCS model. In particular, we obtained the protein structures of the sequences at the initial time (T1) from the PDB (Supplementary file 1C). For the case of the HIV-1 protease, we obtained the protein structure through homology modeling. We used SWISS-MODEL (Arnold et al., 2006) to identify the best-fitting templates of PDB structures (Supplementary file 1C). Next, we predicted the protein structures by homology modeling with Modeller (Sali and Blundell, 1993) using the protein sequences and corresponding best-fitting structural templates.

Forward in time prediction of viral protein variants

The evaluation was performed with the previously presented real data, which includes a real protein variant present at the initial time (T1) and subsequent protein variants present at a later time (Tn). We applied the method for forecasting protein evolution to predict the most likely protein variants at Tn derived from the real protein variant observed at the initial time (T1). The prediction error was determined by measuring the distance between the real protein variants and the predicted variants corresponding to time Tn.

Thus, we assigned to the initial node (T1) the corresponding real protein variant (including its sequence and structure), and its evolution was simulated forward in time until it reached the number of substitutions observed in the real data at time Tn. Thus, we considered the number of observed substitutions as a measure of evolutionary time to allow proper comparisons between real and predicted variants. We simulated 100 alignments of protein sequences, each containing the same number of sequences as the real data at Tn. This included 20 HIV-1 MA protein sequences, a consensus sequence for the SARS-CoV-2 Mpro, a consensus sequence for the SARS-CoV-2 PLpro, a consensus sequence of the influenza NS1 protein for each time point, and a consensus sequence of HIV-1 PR for each viral population. To avoid rare variants, each sequence of the simulated multiple sequence alignments was obtained as the consensus of 100 linked simulated sequences.

To investigate the effect of selection on the predictions, we compared the accuracy of forecasting protein evolution when selection from the protein structure is considered and when it is ignored (neutral evolution). If selection is considered, as previously presented, the probability of birth and death events was based on the fitness of the protein variants, and protein evolution was modeled using an SCS model (Arenas et al., 2013). In the case of neutral evolution, all protein variants equally fit and are allowed. Since variants are observed, we allowed birth events. However, it assumed the absence of death events as no information independent of fitness is available to support their inclusion, thereby avoiding the imposition of arbitrary death events based on an arbitrary death rate. Also, to model neutral evolution, we used an exchangeability matrix with the same relative rates of change to all amino acid pairs.

Accuracy of the predicted protein variants

We assessed the accuracy of the method for forecasting protein evolution by comparing the predicted and real protein variants present at time Tn, both at the protein sequence and structure levels.

For data containing multiple sequences at time Tn (i.e. HIV-1 MA dataset), we calculated the Kullback-Leibler (KL) divergence, which provides a distance between two multiple sequence alignments (the real and predicted data) based on the distribution of amino acid frequencies along the sequences (Equation III, where factors P and Q denote the distribution of amino acid frequencies in the real and predicted protein sequences at time Tn, respectively, i refers to protein site) (Kullback and Leibler, 1951). This distance was only calculated for data with a set of sequences at time Tn (HIV-1 MA) because a single sequence does not provide site-specific variability.

K L (P ∥ Q) = \sum_{i} P_{i} \times l o g (\frac{P_{i}}{Q_{i}})

We also compared the real and predicted evolutionary trajectories of protein variants using the Grantham distance, which measures the differences between amino acids based on their physicochemical properties (Grantham, 1974). In particular, for both real and predicted protein variants, we calculated the Grantham distance at each protein site that differs between the two datasets, considering its evolution from T1 to the subsequent multiple sequence alignment at Tn. We examined sites that varied over time, thus the general site-specific Grantham distance Gi was calculated as the frequency of each amino acid f at site i multiplied by the specific Grantham distance between amino acid m at time Tn and amino acid n at time T1, normalized with the largest Grantham distance G_max to obtain values between 0 and 1 (Equation IV). Next, to compare the real and predicted data, we calculated the site-specific difference of Grantham distance Gb_i between the real P and predicted Q protein variants (Equation V),

G_{i} = \sum_{m = 1}^{20} f_{m} \times \frac{G_{(m, n)}}{G_{m a x}}

G b_{i} =∣ G_{P, i} - G_{Q, i} ∣

In addition, we obtained and compared the protein folding stability (ΔG) of the predicted and real protein variants observed at time Tn, using their corresponding protein structures, with DeltaGREM (Arenas et al., 2016a; Minning et al., 2013).

Results

Implementation of the forecasting protein evolution method

We extended the previous version of our framework ProteinEvolver (Arenas et al., 2013), maintaining its previous capabilities (i.e. simulation of protein evolution upon user-specified phylogenetic trees and upon phylogenetic trees simulated with the coalescent with or without recombination, migration, demographics and longitudinal sampling, empirical and SCS models, among others; Supplementary file 1A), by adding, among others (Supplementary file 1A), the forward in time modeling of protein evolution that combines a birth-death process based on the fitness of every protein variant (folding stability) at each node to determine its birth and death rates, as well as SCS models of protein evolution. The framework ProteinEvolver2 is written in C and distributed with a detailed documentation and a variety of illustrative practical examples. The framework is freely available from https://github.com/MiguelArenas/proteinevolver (Arenas, 2025).

Evaluation of predictions of HIV-1 MA evolution

Regarding the evolution of the HIV-1 MA protein, the Grantham distance and the KL divergence between the real variants at time Tn and the corresponding predicted variants were low (around 5% and 6%, respectively; Table 1), and they did not differ comparing predictions that consider selection on the folding stability (including birth-death models with constant and variable global birth-death rate among lineages) and predictions that ignore it (Table 1). On the other hand, we found that the folding stability of the protein variants predicted considering selection on the folding stability (again, including birth-death models with constant or variable global birth-death rate among lineages) was closer to the folding stability of the real protein variants than that of the protein variants predicted under neutral evolution (Table 1). In particular, the protein variants predicted ignoring selection were less stable than those predicted considering selection and also less stable than the real protein variants.

Table 1

Comparison of real and predicted sequences of the HIV-1 MA protein considering predictions based on the SCS and neutral models.

For the data simulated under the SCS [including birth-death models with constant (SCS) and variable global birth-death rate among lineages (GlobalBDvar)] and neutral models, the table shows the Grantham distance between the amino acids that changed during the real and predicted evolutionary trajectories and the Kullback-Leibler (KL) divergence between the real and predicted multiple sequence alignments. Next, it shows the folding stability (ΔG) of the real protein variants at times T1 and T31 and the folding stability of the predicted protein variants at time T31. The error corresponds to the 95% confidence interval from the mean (100 samples) of predictions of folding stability.

	Grantham distance	KL divergence	ΔG of the real variant at T1 (kcal/mol)	ΔG of the real variants at T31 (kcal/mol)	ΔG of the predicted variants at T31 (kcal/mol)	ΔΔG (kcal/mol) at T31 (predicted – real variants)
SCS model	5%	6%	–9.72	–10.34±0.14	–9.96±0.02	0.38
SCS GlobalBDvar model	5%	6%	–9.72	–10.34±0.14	–10.03±0.03	0.31
Neutral model	5%	6%	–9.72	–10.34±0.14	–9.21±0.07	1.14

Evaluation of predictions of SARS-CoV-2 Mpro and PLpro evolution

The analyses of the SARS-CoV-2 Mpro and PLpro data showed Grantham distances between the real and predicted sequences around 25% and 36%, respectively (Figure 3A). Again, this distance was similar when comparing predictions based on models that consider selection on the protein folding stability (including birth-death models with constant or variable global birth-death rate among lineages) and a model of neutral evolution. Regarding comparisons based on the protein folding stability, we found again that the models that consider selection from the folding stability produce variants closer to the stability of the real protein variants than the model that ignores selection (Figure 3B). Indeed, protein variants derived from the models that consider selection were more stable than those derived from the model of neutral evolution. Next, we did not find statistically significant differences in sequence similarity or folding stability between variants predicted under birth-death models with constant or variable global birth-death rate among lineages (Figure 3).

Figure 3

Download asset Open asset

Prediction error of SARS-CoV-2 Mpro and PLpro evolution under SCS and neutral models regarding physicochemical properties of the amino acid changes accumulated during the evolutionary trajectories and protein folding stability.

Predictions based on data simulated under the SCS [including birth-death models with constant (SCS) and variable global birth-death rate among lineages (GlobalBDvar)] and neutral models. (A) Grantham distance calculated from the amino acid changes that occurred during the real and predicted evolutionary trajectories based on SCS and neutral models of protein evolution. (B) Variation of protein folding stability (ΔΔG) between real and predicted protein variants based on SCS and neutral models of protein evolution. Notice that positive ΔΔG indicates that the real protein variants are more stable than the predicted protein variants and *vice versa*. Error bars correspond to the 95% confidence interval of the mean of prediction error from 100 multiple sequence alignments simulated for the corresponding population and time.

Evaluation of predictions of influenza NS1 protein evolution

The evolutionary predictions for the influenza NS1 protein varied depending on the model used. Specifically, at the sequence level and for the two prediction time points studied, Grantham distances between the real and predicted protein sequences were around 23.5% for the models that incorporated structural evolutionary constraints, compared to about 25.5% for the neutral model (Figure 4A). These differences became more pronounced when examining predictions based on protein folding stability. For both time points, models that included selection consistently generated protein variants with stability more similar to that of the observed variants than those predicted by the neutral model (Figure 4B). Indeed, sequences predicted by the model that accounts for selection were generally more stable than those predicted under neutral evolution. Again, we found no statistically significant differences in sequence similarity or folding stability between variants predicted under birth-death models with constant or variable global birth-death rate among lineages (Figure 4).

Figure 4

Download asset Open asset

Prediction error of influenza NS1 protein evolution under SCS and neutral models regarding physicochemical properties of the amino acid changes accumulated during the evolutionary trajectories and protein folding stability.

Predictions were performed for two time points (longitudinal samples T2 and T3). Predictions based on data simulated under the SCS [including birth-death models with constant (SCS) and variable global birth-death rate among lineages (GlobalBDvar)] and neutral models. (A) Grantham distance calculated from the amino acid changes that occurred during the real and predicted evolutionary trajectories based on SCS and neutral models of protein evolution. (B) Variation of protein folding stability (ΔΔG) between real and predicted protein variants based on SCS and neutral models of protein evolution. Notice that positive ΔΔG indicates that the real protein variants are more stable than the predicted protein variants and vice versa. Error bars correspond to the 95% confidence interval of the mean of prediction error from 100 multiple sequence alignments simulated for the corresponding population and time.

Evaluation of predictions of HIV-1 PR evolution

In general, the Grantham distance, which compared the evolutionary trajectories of the real and predicted protein variants from time T1 to later times, varied among viral populations (patients; Figure 5A). However, for the majority of these populations, the distance remained below 30% and exhibited minor fluctuations over time. One particular population exhibited a notable trend, with the distance increasing from 10% to nearly 60% over time. Considering that the length of the evolutionary trajectories of the protein can differ among the studied populations, we explored whether the accumulated number of amino acid substitutions could affect the accuracy of the predictions, and we found that the number of substitutions varied among populations and this variability did not correlate with the Grantham distance between the real and predicted data (R²=0.0001, Figure 5B). In general, the folding stability of the predicted protein variants was similar or slightly less stable than that of the real protein variants [with a difference ranging from 0 to 9 kcal/mol and a mean of 3.1±0.9 (95% CI) kcal/mol; Figure 5C]. Indeed, the folding stability exhibited small fluctuations, increasing and decreasing, over time.

Figure 5

Download asset Open asset

Prediction error of HIV-1 PR evolution at diverse populations regarding physicochemical properties of the amino acid changes accumulated during the evolutionary trajectories and protein folding stability.

(A) For each viral population (patient, represented with a particular color) and time, Grantham distance calculated from the amino acid changes that occurred during the real and predicted evolutionary trajectories. For each population, the mean of distances obtained over time is shown on the right. (B) Relationship between Grantham distances and accumulated number of substitution events (R²=0.0001, which indicates a lack of correlation between these parameters). (C) Variation of protein folding stability (ΔΔG) between real and predicted protein variants at each viral population and time. For each population, the mean of distances obtained over time is shown on the right. Notice that positive ΔΔG indicates that the real protein variants are more stable than the predicted protein variants and *vice versa*. Error bars correspond to the 95% confidence interval of the mean of prediction error from 100 multiple sequence alignments simulated for the corresponding viral population and time.

Discussion

While reconstructing evolutionary histories and ancestral sequences, among other inferences about the past, was popular in the field, predictions about evolution toward the future were traditionally ignored due to potential high prediction errors. However, in the last decade, forecasting evolution has gained attention because of its variety of applications and advancements in models of evolution (see the reviews Lässig et al., 2017; Wortel et al., 2023). Several studies showed that forecasting evolution can be feasible in some systems, including virus evolution (Luksza and Lässig, 2014; Thadani et al., 2023). Here, we also investigated forecasting evolution in viruses but at the molecular level, considering the relevance of the substitution process to produce molecular variants that affect the viral fitness (Arenas, 2015a; Arenas et al., 2016b; Bloom and Neher, 2023; Poon et al., 2007; Watabe and Kishino, 2010). We believe that forecasting genome evolution remains challenging due to the complexity of its evolutionary processes [i.e. including epistatic interactions, chromosomal rearrangements, and heterogeneous substitution patterns among genomic regions (Arbiza et al., 2011), among others] that complicate the parameterization and prediction of accurate fitness landscapes. However, we believe that it could be more feasible in structural proteins because of their relatively simpler evolutionary processes that usually include selection on folding stability (Bloom et al., 2006; Goldstein, 2011).

To make evolutionary predictions over time, either toward the past or toward the future, considering a substitution model of molecular evolution can be convenient. Actually, a variety of traditional studies showed that the substitution model influences the phylogenetic likelihood generated by probabilistic approaches used for evolutionary inference (Lemmon and Moriarty, 2004; Yang et al., 1994; Zhang and Nei, 1997). To study protein evolution, empirical substitution models of protein evolution are routinely used because they allow rapid calculations based on the assumption of site-independent evolution and typically apply the same exchangeability matrix for all the protein sites, which is highly unrealistic (Echave et al., 2016). Besides, a small set of empirical substitution models was developed for modeling the evolution of the diverse range of viral proteins (Dang et al., 2010; Del Amparo and Arenas, 2022b; Dimmic et al., 2002; Le and Vinh, 2020; Nickle et al., 2007). As an alternative, some substitution models that consider evolutionary constraints on the protein structure incorporated site-dependent evolution, which allows a more accurate modeling of protein evolution compared to the empirical substitution models (Arenas et al., 2013; Arenas et al., 2016a; Bordner and Mittelmann, 2014; Ferreiro et al., 2024a; Parisi and Echave, 2005).

According to previous methods for forecasting evolution based on computer simulations (Lässig and Łuksza, 2014; Neher et al., 2014), we also adopted the birth-death population genetics process to allow a forward-in-time evolutionary process where the birth and death rates can differ among nodes (Stadler, 2010; Stadler, 2011). Traditional population genetics methods to simulate the evolution of molecular data implement a two-step simulation process, where, first, the evolutionary history is simulated [i.e. with the coalescent theory or a forward-in-time simulation approach (Arenas, 2012; Hoban et al., 2012), often assuming neutral evolution] and, in a subsequent step, molecular evolution is simulated upon the previously obtained evolutionary history (Yang, 2006). However, this methodology is unrealistic because the fitness of the data can affect its evolutionary history (i.e., a variant with high fitness is likely to have more descendants than a variant with low fitness). Thus, we designed and implemented a method for forecasting protein evolution that integrates a birth-death population genetics process (including the modeling of constant and variable global birth-death rate among lineages) with a SCS model of protein evolution, where the folding stability of each protein variant is evaluated to predict its future trajectory in terms of both evolutionary history and molecular evolution. We implemented this forward-in-time simulation of protein evolution in a new version of our previous framework ProteinEvolver (Arenas et al., 2013), while maintaining its previous capabilities and extending some of them (i.e. incorporation of site-specific exchangeability matrices and additional substitution models of protein evolution, among others; see Supplementary file 1A and software documentation). As any other method for forecasting evolution, the present method ignores possible environmental shifts that are inherently unpredictable and that could affect the accuracy of the predictions. Next, we evaluated the forward-in-time evolutionary predictions with real data of HIV-1, SARS-CoV-2, and influenza virus proteins. We determined the prediction errors between the real and the predicted protein variants by examining dissimilarity in evolutionary trajectory (Grantham distance based on physicochemical properties among the differing amino acids during the evolutionary trajectories), sequence divergence (distribution of amino acid frequencies among sites using the KL divergence), and protein structure (protein folding stability). Additionally, we analyzed the influence of accounting for selection, based on the protein folding stability, on the predictions. We also evaluated birth-death models incorporating either constant or variable global birth-death rates among lineages. Notably, in the variable-rate model, fitness can influence both reproductive success and the rate of molecular evolution.

In general, the sequence and evolutionary trajectory dissimilarities between the real and predicted protein variants were relatively small, with some variations among the study proteins. For the HIV-1 MA protein, the SARS-CoV-2 Mpro and PLpro, and influenza NS1 protein, the sequence dissimilarity was below 10%, 26%, 36%, and 26%, respectively. The low prediction errors for the HIV-1 MA protein were expected because this dataset was derived from an in vitro cell culture experiment that is not influenced by a variety of external factors that could affect the predictions in in vivo experiments. As a rough reference, in traditional ancestral sequence reconstruction (ASR) of protein data with high sequence identity, the error was approximately 2% (Arenas and Posada, 2010). Notice that ASR methods can exhibit low prediction errors due to their statistical evaluation based on a set of original sequences, rather than relying on a single original sequence used in the case of forecasting evolution. The contrasting scenario regarding evolutionary complexity was the HIV-1 PR data, which was sampled from populations (patients) undergoing various antiretroviral treatments (Ferreiro et al., 2022). There, the sequence dissimilarity varied among viral populations, although most were below 30%. A viral population exhibited higher dissimilarities (near 60%) that we believe could be caused by molecular adaptations promoted by the therapies, although this needs formal investigation.

An unexpected result was that the model of neutral evolution produced sequence dissimilarities between the real and predicted protein variants that were quite similar to those obtained with the SCS model (Table 1 and Figures 3 and 4). A few studies indicated that the substitution model has negligible effects on the reconstructed phylogenetic trees (Abadi et al., 2019; Spielman, 2020). Subsequent studies found that the influence of the substitution model on phylogenetic reconstructions is dependent on the diversity of the data, where data with high diversity is more sensitive to the applied substitution model due to containing more evolutionary information to be modeled (Del Amparo and Arenas, 2023). Considering that the studied data present overall low diversity [i.e. sequence identity of 0.973, 0.967, 0.930, 0.802, and 0.817 for the HIV MA data, SARS-CoV-2 Mpro data, SARS-CoV-2 PLpro data, and influenza NS1 protein data (for each prediction time point, T2 and T3), respectively; it is important to note that, in general, longitudinal data derived from monitored evolutionary processes usually show a low diversity because they involve relatively short evolutionary histories, among other factors.], we believe that the influence of the applied substitution models on the prediction of the sequences was small because of the small number of modeled substitution events, as found in phylogenetic reconstructions (; Del Amparo and Arenas, 2023). Actually, in the case of the influenza NS1 protein dataset, which had the highest sequence diversity among the study datasets, the sequences predicted under the SCS models were more similar to the real sequences than those derived from predictions under neutral evolution. Overall, the prediction accuracy varied among the studied evolutionary scenarios; as expected, it was lower in the more complex scenarios. Indeed, datasets with higher sequence diversity contain more evolutionary signals, which can improve prediction quality.

We also evaluated the prediction error between the real and predicted protein variants regarding their folding stability, again comparing the predictions made under a model that considers structural constraints and a model of neutral evolution. In general, the protein variants predicted under the SCS model presented a folding stability close to the folding stability of the respective real protein variants, with differences below 1, 2, 7, and 1 kcal/mol for the HIV MA data, SARS-CoV-2 Mpro data, SARS-CoV-2 PLpro data, and influenza NS1 protein data, respectively. The higher differences (around 9 kcal/mol) were again observed for the HIV-1 PR data. In contrast to the prediction error based on sequence dissimilarity, the prediction error based on folding stability varies between predictions obtained under the SCS model and those obtained under the neutral model. In the studied evolutionary scenarios, the protein variants predicted under the neutral model were less stable and farther from the stability of the real protein variants compared to those predicted under the SCS model. These results were expected because, under SCS models, protein stability can be modeled with greater accuracy than sequence similarity due to selection for maintaining stability in the protein structure despite amino acid changes (Arenas and Bastolla, 2020; Illergård et al., 2009; Pascual-García et al., 2010). Indeed, previous studies showed that models that ignore structural constraints often produce proteins with unrealistic folding instability (Arenas and Bastolla, 2020; Del Amparo et al., 2023), which suggests that accounting for protein folding stability in the modeling of protein evolution is recommended for predicting protein variants with appropriate structural properties.

Therefore, we found a good accuracy in predicting the real folding stability of forecasted protein variants, while predicting the exact sequences was more challenging, which was not surprising considering previous studies. In particular, inferring specific sequences is considerably challenging even for ancestral molecular reconstruction (Arenas and Bastolla, 2020; Arenas et al., 2017). Indeed, observed sequence diversity is much greater than observed structural diversity (Illergård et al., 2009; Pascual-García et al., 2010), and substitutions between amino acids with similar physicochemical properties can yield modeled protein variants with more accurate folding stability, even when the exact amino acid sequences differ. Further work is demanded in the field of substitution models of molecular evolution. We also found that datasets with relatively high sequence diversity can improve the accuracy of the predictions due to containing more evolutionary information. Apart from that, forecasting the folding stability of future real proteins is an important advance in forecasting protein evolution, given the essential role of folding stability in protein function (Bloom et al., 2006; Scheiblhofer et al., 2017) and its variety of applications.

Variation in the global birth-death rate among lineages showed minor effects on prediction accuracy, suggesting a limited role in protein evolution modeling. Molecular evolution parameters, particularly the substitution model, appear to be more critical in this regard.

In the context of protein evolution, substitution models are a critical factor (Arenas et al., 2017; Bordner and Mittelmann, 2014; Echave et al., 2016; Echave and Wilke, 2017; Liberles et al., 2012; Wilke, 2012), and the presented combination with a birth-death model constitutes a first approximation upon which next studies can build to better understand this evolutionary system. Next, the present method assumes that the protein structure is maintained over the studied evolutionary time, which can be generally reasonable for short timescales where the structure is conserved (Illergård et al., 2009; Pascual-García et al., 2010). Over longer evolutionary timescales, structural changes may occur and, in such cases, modeling the evolution of the protein structure would be necessary. To our knowledge, modeling the evolution of the protein structure remains a challenging task that requires substantial methodological developments. Recent advances in artificial intelligence, particularly in protein structure prediction from sequence (Abramson et al., 2024; Jumper et al., 2021), may offer promising tools for addressing this challenge. However, we believe that evaluating such approaches in the context of structural evolution would be difficult, especially given the limited availability of real data with known evolutionary trajectories involving structural change. In any case, this is probably an important direction for future research.

We present a method to simulate forward-in-time protein evolution accounting for evolutionary constraints from the protein structure and a birth-death population process, and where the evolutionary history is influenced by the protein evolution and vice versa. The method is implemented in the computer framework ProteinEvolver2, which is freely distributed with several practical examples and detailed documentation. We believe that implementing methods into freely available phylogenetic frameworks is important to facilitate practical applications, as well as future improvements and evaluations. We applied the method to forecast protein evolution in some viral proteins. We found that the method provides acceptable approximations to the real evolution, especially in terms of protein folding stability, suggesting that combining structural constraints with birth-death population processes in the modeling of protein evolution is convenient. Still, to advance in methods for forecasting protein evolution, we believe that further efforts should be made in the field to improve the modeling of protein evolution, such as the incorporation of site-dependent evolutionary constraints from the protein activity.

Data availability

The computer framework ProteinEvolver2 is freely available from https://github.com/MiguelArenas/proteinevolver (Arenas, 2025). The SARS-CoV-2 data is available from GISAID database with https://doi.org/10.55876/gis8.250206gt. The real and predicted protein variants are available from Zenodo repository at the URL https://doi.org/10.5281/zenodo.15548146.

The following data sets were generated

(2025) Zenodo
Forecasting protein evolution by combining birth-death population models with structurally constrained substitution models.

https://doi.org/10.5281/zenodo.15548146

The following previously published data sets were used

1. GISAID
(2025) EpiCoV
SARS-CoV-2 protein sequences.

https://doi.org/10.55876/gis8.250206gt

References

1. Abadi S
2. Azouri D
3. Pupko T
4. Mayrose I
(2019) Model selection may not be a mandatory step for phylogeny reconstruction
Nature Communications 10:934.

https://doi.org/10.1038/s41467-019-08822-w
- PubMed
- Google Scholar
1. Abramson J
2. Adler J
3. Dunger J
4. Evans R
5. Green T
6. Pritzel A
7. Ronneberger O
8. Willmore L
9. Ballard AJ
10. Bambrick J
11. Bodenstein SW
12. Evans DA
13. Hung C-C
14. O’Neill M
15. Reiman D
16. Tunyasuvunakool K
17. Wu Z
18. Žemgulytė A
19. Arvaniti E
20. Beattie C
21. Bertolli O
22. Bridgland A
23. Cherepanov A
24. Congreve M
25. Cowen-Rivers AI
26. Cowie A
27. Figurnov M
28. Fuchs FB
29. Gladman H
30. Jain R
31. Khan YA
32. Low CMR
33. Perlin K
34. Potapenko A
35. Savy P
36. Singh S
37. Stecula A
38. Thillaisundaram A
39. Tong C
40. Yakneen S
41. Zhong ED
42. Zielinski M
43. Žídek A
44. Bapst V
45. Kohli P
46. Jaderberg M
47. Hassabis D
48. Jumper JM
(2024) Accurate structure prediction of biomolecular interactions with AlphaFold 3
Nature 630:493–500.

https://doi.org/10.1038/s41586-024-07487-w
- PubMed
- Google Scholar
1. Arbiza L
2. Patricio M
3. Dopazo H
4. Posada D
(2011) Genome-wide heterogeneity of nucleotide substitution model fit
Genome Biology and Evolution 3:896–908.

https://doi.org/10.1093/gbe/evr080
- PubMed
- Google Scholar
1. Arenas M
2. Posada D
(2010) The effect of recombination on the reconstruction of ancestral sequences
Genetics 184:1133–1139.

https://doi.org/10.1534/genetics.109.113423
- PubMed
- Google Scholar
1. Arenas M
(2012) Simulation of molecular data under diverse evolutionary scenarios
PLOS Computational Biology 8:e1002495.

https://doi.org/10.1371/journal.pcbi.1002495
- PubMed
- Google Scholar
(2013) Protein evolution along phylogenetic histories under structurally constrained substitution models
Bioinformatics 29:3020–3028.

https://doi.org/10.1093/bioinformatics/btt530
- PubMed
- Google Scholar
1. Arenas M
(2015a) Genetic consequences of antiviral therapy on HIV-1
Computational and Mathematical Methods in Medicine 2015:395826.

https://doi.org/10.1155/2015/395826
- PubMed
- Google Scholar
1. Arenas M
(2015b) Trends in substitution models of molecular evolution
Frontiers in Genetics 6:319.

https://doi.org/10.3389/fgene.2015.00319
- PubMed
- Google Scholar
(2016a) Maximum-likelihood phylogenetic inference with selection on protein folding stability
Molecular Biology and Evolution 32:2195–2207.

https://doi.org/10.1093/molbev/msv085
- PubMed
- Google Scholar
(2016b) Influence of mutation and recombination on HIV-1 in vitro fitness recovery
Molecular Phylogenetics and Evolution 94:264–270.

https://doi.org/10.1016/j.ympev.2015.09.001
- PubMed
- Google Scholar
(2017) ProtASR: an evolutionary framework for ancestral protein reconstruction with selection on folding stability
Systematic Biology 66:1054–1064.

https://doi.org/10.1093/sysbio/syw121
- PubMed
- Google Scholar
1. Arenas M
2. Bastolla U
(2020) ProtASR2: Ancestral reconstruction of protein sequences accounting for folding stability
Methods in Ecology and Evolution 11:248–257.

https://doi.org/10.1111/2041-210X.13341
- Google Scholar
Software
1. Arenas M
(2025) Proteinevolver, version db71444
GitHub.

https://github.com/MiguelArenas/proteinevolver
1. Arnold K
2. Bordoli L
3. Kopp J
4. Schwede T
(2006) The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling
Bioinformatics 22:195–201.

https://doi.org/10.1093/bioinformatics/bti770
- PubMed
- Google Scholar
1. Bao Y
2. Bolotov P
3. Dernovoy D
4. Kiryutin B
5. Zaslavsky L
6. Tatusova T
7. Ostell J
8. Lipman D
(2008) The influenza virus resource at the National Center for Biotechnology Information
Journal of Virology 82:596–601.

https://doi.org/10.1128/JVI.02005-07
- PubMed
- Google Scholar
(2016) Relative rate and location of intra-host HIV evolution to evade cellular immunity are predictable
Nature Communications 7:11660.

https://doi.org/10.1038/ncomms11660
- PubMed
- Google Scholar
1. Bastolla U
2. Demetrius L
(2005) Stability constraints and protein evolution: the role of chain length, composition and disulfide bonds
Protein Engineering, Design & Selection 18:405–415.

https://doi.org/10.1093/protein/gzi045
- PubMed
- Google Scholar
(2006) A protein evolution model with independent sites that reproduces site-specific amino acid distributions from the Protein Data Bank
BMC Evolutionary Biology 6:43.

https://doi.org/10.1186/1471-2148-6-43
- PubMed
- Google Scholar
Book
(2007) Structural Approaches to Sequence Evolution
Springer.

https://doi.org/10.1007/978-3-540-35306-5
- Google Scholar
(2006) Protein stability promotes evolvability
PNAS 103:5869–5874.

https://doi.org/10.1073/pnas.0510098103
- PubMed
- Google Scholar
1. Bloom JD
2. Neher RA
(2023) Fitness effects of mutations to SARS-CoV-2 proteins
Virus Evolution 9:vead055.

https://doi.org/10.1093/ve/vead055
- PubMed
- Google Scholar
1. Bordner AJ
2. Mittelmann HD
(2014) A new formulation of protein evolutionary models that account for structural constraints
Molecular Biology and Evolution 31:736–749.

https://doi.org/10.1093/molbev/mst240
- PubMed
- Google Scholar
1. Bull JJ
2. Molineux IJ
(2008) Predicting evolution from genomics: experimental evolution of bacteriophage T7
Heredity 100:453–463.

https://doi.org/10.1038/sj.hdy.6801087
- PubMed
- Google Scholar
1. Carneiro M
2. Hartl DL
(2010) Colloquium papers: Adaptive landscapes and protein evolution
PNAS 107 Suppl 1:1747–1751.

https://doi.org/10.1073/pnas.0906192106
- PubMed
- Google Scholar
1. Carvajal-Rodríguez A
(2010) Simulation of genes and genomes forward in time
Current Genomics 11:58–61.

https://doi.org/10.2174/138920210790218007
- PubMed
- Google Scholar
1. Colless DH
2. Wiley EO
(1982) Phylogenetics: the theory and practice of phylogenetic systematics
Systematic Zoology 31:13420.

https://doi.org/10.2307/2413420
- Google Scholar
1. Dang CC
2. Le QS
3. Gascuel O
4. Le VS
(2010) FLU, an amino acid substitution model for influenza proteins
BMC Evolutionary Biology 10:99.

https://doi.org/10.1186/1471-2148-10-99
- PubMed
- Google Scholar
(2011) ProtTest 3: fast selection of best-fit models of protein evolution
Bioinformatics 27:1164–1165.

https://doi.org/10.1093/bioinformatics/btr088
- PubMed
- Google Scholar
1. Del Amparo R
2. Arenas M
(2022a) Consequences of substitution model selection on protein ancestral sequence reconstruction
Molecular Biology and Evolution 39:msac144.

https://doi.org/10.1093/molbev/msac144
- PubMed
- Google Scholar
1. Del Amparo R
2. Arenas M
(2022b) HIV protease and integrase empirical substitution models of evolution: protein-specific models outperform generalist models
Genes 13:61.

https://doi.org/10.3390/genes13010061
- Google Scholar
1. Del Amparo R
2. Arenas M
(2023) Influence of substitution model selection on protein phylogenetic tree reconstruction
Gene 865:147336.

https://doi.org/10.1016/j.gene.2023.147336
- PubMed
- Google Scholar
(2023) Consequences of genetic recombination on protein folding stability
Journal of Molecular Evolution 91:33–45.

https://doi.org/10.1007/s00239-022-10080-2
- PubMed
- Google Scholar
1. Desai MM
2. Fisher DS
(2007) Beneficial mutation selection balance and the effect of linkage on positive selection
Genetics 176:1759–1798.

https://doi.org/10.1534/genetics.106.067678
- PubMed
- Google Scholar
(2018) The utility of fitness landscapes and big data for predicting evolution
Heredity 121:401–405.

https://doi.org/10.1038/s41437-018-0128-4
- PubMed
- Google Scholar
1. Diaz-Uriarte R
2. Vasallo C
(2019) Every which way? On predicting tumor evolution using cancer progression models
PLOS Computational Biology 15:e1007246.

https://doi.org/10.1371/journal.pcbi.1007246
- PubMed
- Google Scholar
(2002) rtREV: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny
Journal of Molecular Evolution 55:65–73.

https://doi.org/10.1007/s00239-001-2304-y
- PubMed
- Google Scholar
1. Eccleston RC
2. Manko E
3. Campino S
4. Clark TG
5. Furnham N
(2023) A computational method for predicting the most likely evolutionary trajectories in the stepwise accumulation of resistance mutations
eLife 12:e84756.

https://doi.org/10.7554/eLife.84756
- PubMed
- Google Scholar
(2016) Causes of evolutionary rate variation among protein sites
Nature Reviews. Genetics 17:109–121.

https://doi.org/10.1038/nrg.2015.18
- PubMed
- Google Scholar
1. Echave J
2. Wilke CO
(2017) Biophysical models of protein evolution: understanding the patterns of evolutionary sequence divergence
Annual Review of Biophysics 46:85–103.

https://doi.org/10.1146/annurev-biophys-070816-033819
- PubMed
- Google Scholar
1. Ferreiro D
2. Khalil R
3. Gallego MJ
4. Osorio NS
5. Arenas M
(2022) The evolution of the HIV-1 protease folding stability
Virus Evolution 8:veac115.

https://doi.org/10.1093/ve/veac115
- PubMed
- Google Scholar
(2024a) Selection among site-dependent structurally constrained substitution models of protein evolution by approximate Bayesian computation
Bioinformatics 40:btae096.

https://doi.org/10.1093/bioinformatics/btae096
- PubMed
- Google Scholar
1. Ferreiro D
2. Khalil R
3. Sousa SF
4. Arenas M
(2024b) Substitution models of protein evolution with selection on enzymatic activity
Molecular Biology and Evolution 41:msae026.

https://doi.org/10.1093/molbev/msae026
- PubMed
- Google Scholar
(2015) The value of monitoring to control evolving populations
PNAS 112:1007–1012.

https://doi.org/10.1073/pnas.1409403112
- PubMed
- Google Scholar
1. Fitch WM
2. Margoliash E
(1967) A method for estimating the number of invariant amino acid coding positions in a gene using cytochrome c as a model case
Biochemical Genetics 1:65–71.

https://doi.org/10.1007/BF00487738
- PubMed
- Google Scholar
(2002) Site-specific amino acid replacement matrices from structurally constrained protein evolution simulations
Molecular Biology and Evolution 19:352–356.

https://doi.org/10.1093/oxfordjournals.molbev.a004089
- PubMed
- Google Scholar
1. Gernhard T
(2008) The conditioned reconstructed process
Journal of Theoretical Biology 253:769–778.

https://doi.org/10.1016/j.jtbi.2008.04.005
- PubMed
- Google Scholar
1. Gerrish PJ
2. Sniegowski PD
(2012) Real time forecasting of near-future evolution
Journal of the Royal Society, Interface 9:2268–2278.

https://doi.org/10.1098/rsif.2012.0119
- PubMed
- Google Scholar
(2017) The role of evolutionary selection in the dynamics of protein structure evolution
Biophysical Journal 112:1350–1365.

https://doi.org/10.1016/j.bpj.2017.02.029
- PubMed
- Google Scholar
1. Goldstein RA
(2011) The evolution and evolutionary consequences of marginal thermostability in proteins
Proteins 79:1396–1407.

https://doi.org/10.1002/prot.22964
- PubMed
- Google Scholar
1. Goldstein RA
(2013) Population size dependence of fitness effect distribution and substitution rate probed by biophysical model of protein thermostability
Genome Biology and Evolution 5:1584–1593.

https://doi.org/10.1093/gbe/evt110
- PubMed
- Google Scholar
(2013) Stability-mediated epistasis constrains the evolution of an influenza protein
eLife 2:e00631.

https://doi.org/10.7554/eLife.00631
- PubMed
- Google Scholar
1. Goyal S
2. Balick DJ
3. Jerison ER
4. Neher RA
5. Shraiman BI
6. Desai MM
(2012) Dynamic mutation-selection balance as an evolutionary attractor
Genetics 191:1309–1319.

https://doi.org/10.1534/genetics.112.141291
- PubMed
- Google Scholar
1. Grantham R
(1974) Amino acid difference formula to help explain protein evolution
Science 185:862–864.

https://doi.org/10.1126/science.185.4154.862
- PubMed
- Google Scholar
1. Harmon LJ
(2019)
Phylogenetic Comparative Methods

Introduction to birth-death models, Phylogenetic Comparative Methods, Princeton University Press.
- Google Scholar
(2012) Computer simulations: tools for population and evolutionary genetics
Nature Reviews. Genetics 13:110–122.

https://doi.org/10.1038/nrg3130
- PubMed
- Google Scholar
1. Hudson RR
(1983) Properties of a neutral allele model with intragenic recombination
Theoretical Population Biology 23:183–201.

https://doi.org/10.1016/0040-5809(83)90013-8
- PubMed
- Google Scholar
1. Hudson RR
(1990)
Gene genealogies and the coalescent process

Oxford Surveys in Evolutionary Biology 7:1–44.
- Google Scholar
1. Hudson RR
(1998) Island models and the coalescent process
Molecular Ecology 7:413–418.

https://doi.org/10.1046/j.1365-294x.1998.00344.x
- Google Scholar
(2009) Structure is three to ten times more conserved than sequence--a study of structural response in protein cores
Proteins 77:499–508.

https://doi.org/10.1002/prot.22458
- PubMed
- Google Scholar
1. Jacquier H
2. Birgy A
3. Le Nagard H
4. Mechulam Y
5. Schmitt E
6. Glodt J
7. Bercot B
8. Petit E
9. Poulain J
10. Barnaud G
11. Gros PA
12. Tenaillon O
(2013) Capturing the mutational landscape of the beta-lactamase TEM-1
PNAS 110:13067–13072.

https://doi.org/10.1073/pnas.1215206110
- Google Scholar
1. Jumper J
2. Evans R
3. Pritzel A
4. Green T
5. Figurnov M
6. Ronneberger O
7. Tunyasuvunakool K
8. Bates R
9. Žídek A
10. Potapenko A
11. Bridgland A
12. Meyer C
13. Kohl SAA
14. Ballard AJ
15. Cowie A
16. Romera-Paredes B
17. Nikolov S
18. Jain R
19. Adler J
20. Back T
21. Petersen S
22. Reiman D
23. Clancy E
24. Zielinski M
25. Steinegger M
26. Pacholska M
27. Berghammer T
28. Bodenstein S
29. Silver D
30. Vinyals O
31. Senior AW
32. Kavukcuoglu K
33. Kohli P
34. Hassabis D
(2021) Highly accurate protein structure prediction with AlphaFold
Nature 596:583–589.

https://doi.org/10.1038/s41586-021-03819-2
- PubMed
- Google Scholar
1. Kimura M
2. Weiss GH
(1964) The stepping stone model of population structure and the decrease of genetic correlation with distance
Genetics 49:561–576.

https://doi.org/10.1093/genetics/49.4.561
- PubMed
- Google Scholar
1. Kingman JFC
(1982) The coalescent
Stochastic Processes and Their Applications 13:235–248.

https://doi.org/10.1016/0304-4149(82)90011-4
- Google Scholar
1. Kullback S
2. Leibler RA
(1951) On information and sufficiency
The Annals of Mathematical Statistics 22:79–86.

https://doi.org/10.1214/aoms/1177729694
- Google Scholar
1. Lässig M
2. Łuksza M
(2014) Can we read the future from a tree?
eLife 3:e05060.

https://doi.org/10.7554/eLife.05060
- PubMed
- Google Scholar
(2017) Predicting evolution
Nature Ecology & Evolution 1:0077.

https://doi.org/10.1038/s41559-017-0077
- Google Scholar
1. Le TK
2. Vinh LS
(2020) FLAVI: An amino acid substitution model for flaviviruses
Journal of Molecular Evolution 88:445–452.

https://doi.org/10.1007/s00239-020-09943-3
- PubMed
- Google Scholar
(2022) Robust, universal tree balance indices
Systematic Biology 71:1210–1224.

https://doi.org/10.1093/sysbio/syac027
- PubMed
- Google Scholar
1. Lemmon AR
2. Moriarty EC
(2004) The importance of proper model assumption in bayesian Phylogenetics
Systematic Biology 53:265–277.

https://doi.org/10.1080/10635150490423520
- Google Scholar
1. Liberles DA
2. Teichmann SA
3. Bahar I
4. Bastolla U
5. Bloom J
6. Bornberg-Bauer E
7. Colwell LJ
8. de Koning APJ
9. Dokholyan NV
10. Echave J
11. Elofsson A
12. Gerloff DL
13. Goldstein RA
14. Grahnen JA
15. Holder MT
16. Lakner C
17. Lartillot N
18. Lovell SC
19. Naylor G
20. Perica T
21. Pollock DD
22. Pupko T
23. Regan L
24. Roger A
25. Rubinstein N
26. Shakhnovich E
27. Sjölander K
28. Sunyaev S
29. Teufel AI
30. Thorne JL
31. Thornton JW
32. Weinreich DM
33. Whelan S
(2012) The interface of protein structure, protein biophysics, and molecular evolution
Protein Science 21:769–785.

https://doi.org/10.1002/pro.2071
- PubMed
- Google Scholar
1. Lind PA
2. Libby E
3. Herzog J
4. Rainey PB
(2019) Predicting mutational routes to new adaptive phenotypes
eLife 8:e38822.

https://doi.org/10.7554/eLife.38822
- PubMed
- Google Scholar
(2010) Universal distribution of protein evolution rates as a consequence of protein folding physics
PNAS 107:2983–2988.

https://doi.org/10.1073/pnas.0910445107
- PubMed
- Google Scholar
1. Luksza M
2. Lässig M
(2014) A predictive fitness model for influenza
Nature 507:57–61.

https://doi.org/10.1038/nature13087
- PubMed
- Google Scholar
(1990) Ancestral lysozymes reconstructed, neutrality tested, and thermostability linked to hydrocarbon packing
Nature 345:86–89.

https://doi.org/10.1038/345086a0
- PubMed
- Google Scholar
(2010) Mutation bias favors protein folding stability in the evolution of small populations
PLOS Computational Biology 6:e1000767.

https://doi.org/10.1371/journal.pcbi.1000767
- PubMed
- Google Scholar
(2013) Detecting selection for negative design in proteins through an improved model of the misfolded state
Proteins 81:1102–1112.

https://doi.org/10.1002/prot.24244
- PubMed
- Google Scholar
1. Morcos F
2. Pagnani A
3. Lunt B
4. Bertolino A
5. Marks DS
6. Sander C
7. Zecchina R
8. Onuchic JN
9. Hwa T
10. Weigt M
(2011) Direct-coupling analysis of residue coevolution captures native contacts across many protein families
PNAS 108:E1293–E1301.

https://doi.org/10.1073/pnas.1111471108
- PubMed
- Google Scholar
(2023) Evolution of TOP1 and TOP1MT topoisomerases in chordata
Journal of Molecular Evolution 91:192–203.

https://doi.org/10.1007/s00239-022-10091-z
- PubMed
- Google Scholar
1. Morris DH
2. Gostic KM
3. Pompei S
4. Bedford T
5. Łuksza M
6. Neher RA
7. Grenfell BT
8. Lässig M
9. McCauley JW
(2018) Predictive modeling of influenza shows the promise of applied evolutionary biology
Trends in Microbiology 26:102–118.

https://doi.org/10.1016/j.tim.2017.09.004
- PubMed
- Google Scholar
1. Munck C
2. Gumpert HK
3. Wallin AIN
4. Wang HH
5. Sommer MOA
(2014) Prediction of resistance development against drug combinations by collateral responses to component drugs
Science Translational Medicine 6:262ra156.

https://doi.org/10.1126/scitranslmed.3009940
- PubMed
- Google Scholar
(2010) Combining contemporary and ancient DNA in population genetic and phylogeographical studies
Molecular Ecology Resources 10:760–772.

https://doi.org/10.1111/j.1755-0998.2010.02895.x
- PubMed
- Google Scholar
1. Neher RA
2. Hallatschek O
(2013) Genealogies of rapidly adapting populations
PNAS 110:437–442.

https://doi.org/10.1073/pnas.1213113110
- Google Scholar
(2014) Predicting evolution from the shape of genealogical trees
eLife 3:e03568.

https://doi.org/10.7554/eLife.03568
- PubMed
- Google Scholar
(2007) HIV-specific probabilistic models of protein evolution
PLOS ONE 2:e503.

https://doi.org/10.1371/journal.pone.0000503
- PubMed
- Google Scholar
(2023) A rugged yet easily navigable fitness landscape
Science 382:eadh3860.

https://doi.org/10.1126/science.adh3860
- PubMed
- Google Scholar
1. Parisi G
2. Echave J
(2001) Structural constraints and emergence of sequence patterns in protein evolution
Molecular Biology and Evolution 18:750–756.

https://doi.org/10.1093/oxfordjournals.molbev.a003857
- PubMed
- Google Scholar
1. Parisi G
2. Echave J
(2005) Generality of the structurally constrained protein evolution model: assessment on representatives of the four main fold classes
Gene 345:45–53.

https://doi.org/10.1016/j.gene.2004.11.025
- PubMed
- Google Scholar
(2010) Quantifying the evolutionary divergence of protein structures: the role of function change and function conservation
Proteins 78:181–196.

https://doi.org/10.1002/prot.22616
- PubMed
- Google Scholar
(2019) The molecular clock in the evolution of protein structures
Systematic Biology 68:987–1002.

https://doi.org/10.1093/sysbio/syz022
- PubMed
- Google Scholar
(2007) Mapping protease inhibitor resistance to human immunodeficiency virus type 1 sequence polymorphisms within patients
Journal of Virology 81:13598–13607.

https://doi.org/10.1128/JVI.01570-07
- PubMed
- Google Scholar
(2005) Site interdependence attributed to tertiary structure in amino acid sequence evolution
Gene 347:207–217.

https://doi.org/10.1016/j.gene.2004.12.011
- PubMed
- Google Scholar
(2016) Biophysical principles predict fitness landscapes of drug resistance
PNAS 113:E1470–E1478.

https://doi.org/10.1073/pnas.1601441113
- Google Scholar
(2023) Adaptive diversification and niche packing on rugged fitness landscapes
Journal of Theoretical Biology 562:111421.

https://doi.org/10.1016/j.jtbi.2023.111421
- PubMed
- Google Scholar
1. Ruiz-González MX
2. Fares MA
(2013) Coevolution analyses illuminate the dependencies between amino acid sites in the chaperonin system GroES-L
BMC Evolutionary Biology 13:156.

https://doi.org/10.1186/1471-2148-13-156
- PubMed
- Google Scholar
1. Sali A
2. Blundell TL
(1993) Comparative protein modelling by satisfaction of spatial restraints
Journal of Molecular Biology 234:779–815.

https://doi.org/10.1006/jmbi.1993.1626
- PubMed
- Google Scholar
(2021) Nationwide study of drug resistance mutations in HIV-1 infected individuals under antiretroviral therapy in Brazil
International Journal of Molecular Sciences 22:5304.

https://doi.org/10.3390/ijms22105304
- PubMed
- Google Scholar
(2017) Influence of protein fold stability on immunogenicity and its implications for vaccine design
Expert Review of Vaccines 16:479–489.

https://doi.org/10.1080/14760584.2017.1306441
- PubMed
- Google Scholar
1. Sella G
2. Hirsh AE
(2005) The application of statistical physics to evolutionary biology
PNAS 102:9541–9546.

https://doi.org/10.1073/pnas.0501865102
- PubMed
- Google Scholar
(2021) Evolutionary dynamics of HIV-1 subtype C in Brazil
Scientific Reports 11:23060.

https://doi.org/10.1038/s41598-021-02428-3
- PubMed
- Google Scholar
1. Spielman SJ
(2020) Relative model fit does not predict topological accuracy in single-gene protein Phylogenetics
Molecular Biology and Evolution 37:2110–2123.

https://doi.org/10.1093/molbev/msaa075
- PubMed
- Google Scholar
(1990) The ribonuclease from an extinct bovid ruminant
FEBS Letters 262:104–106.

https://doi.org/10.1016/0014-5793(90)80164-e
- PubMed
- Google Scholar
1. Stadler T
(2010) Sampling-through-time in birth-death trees
Journal of Theoretical Biology 267:396–404.

https://doi.org/10.1016/j.jtbi.2010.09.010
- PubMed
- Google Scholar
1. Stadler T
(2011) Simulating trees with a fixed number of extant species
Systematic Biology 60:676–684.

https://doi.org/10.1093/sysbio/syr029
- PubMed
- Google Scholar
1. Thadani NN
2. Gurev S
3. Notin P
4. Youssef N
5. Rollins NJ
6. Ritter D
7. Sander C
8. Gal Y
9. Marks DS
(2023) Learning from prepandemic data to forecast viral escape
Nature 622:818–825.

https://doi.org/10.1038/s41586-023-06617-0
- PubMed
- Google Scholar
(2003) Resurrecting the ancestral steroid receptor: ancient origin of estrogen signaling
Science 301:1714–1717.

https://doi.org/10.1126/science.1086185
- PubMed
- Google Scholar
(2004) Evolution of coral pigments recreated
Science 305:1433.

https://doi.org/10.1126/science.1099597
- PubMed
- Google Scholar
1. Van Cleve J
2. Weissman DB
(2015) Measuring ruggedness in fitness landscapes
PNAS 112:7345–7346.

https://doi.org/10.1073/pnas.1507916112
- PubMed
- Google Scholar
1. Watabe T
2. Kishino H
(2010) Structural considerations in the fitness landscape of a virus
Molecular Biology and Evolution 27:1782–1791.

https://doi.org/10.1093/molbev/msq056
- PubMed
- Google Scholar
1. Wilke CO
(2012) Bringing molecules back into molecular evolution
PLOS Computational Biology 8:e1002572.

https://doi.org/10.1371/journal.pcbi.1002572
- PubMed
- Google Scholar
1. Wiuf C
2. Posada D
(2003) A coalescent model of recombination hotspots
Genetics 164:407–417.

https://doi.org/10.1093/genetics/164.1.407
- PubMed
- Google Scholar
1. Wortel MT
2. Agashe D
3. Bailey SF
4. Bank C
5. Bisschop K
6. Blankers T
7. Cairns J
8. Colizzi ES
9. Cusseddu D
10. Desai MM
11. van Dijk B
12. Egas M
13. Ellers J
14. Groot AT
15. Heckel DG
16. Johnson ML
17. Kraaijeveld K
18. Krug J
19. Laan L
20. Lässig M
21. Lind PA
22. Meijer J
23. Noble LM
24. Okasha S
25. Rainey PB
26. Rozen DE
27. Shitut S
28. Tans SJ
29. Tenaillon O
30. Teotónio H
31. de Visser JAGM
32. Visser ME
33. Vroomans RMA
34. Werner GDA
35. Wertheim B
36. Pennings PS
(2023) Towards evolutionary predictions: Current promises and challenges
Evolutionary Applications 16:3–21.

https://doi.org/10.1111/eva.13513
- PubMed
- Google Scholar
1. Wright S
(1931) Evolution in mendelian populations
Genetics 16:97–159.

https://doi.org/10.1093/genetics/16.2.97
- PubMed
- Google Scholar
1. Wylie CS
2. Shakhnovich EI
(2011) A biophysical protein folding model accounts for most mutational fitness effects in viruses
PNAS 108:9916–9921.

https://doi.org/10.1073/pnas.1017572108
- Google Scholar
(1994) Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation
Molecular Biology and Evolution 11:316–324.

https://doi.org/10.1093/oxfordjournals.molbev.a040112
- PubMed
- Google Scholar
1. Yang Z
(1996) Among-site rate variation and its impact on phylogenetic analyses
Trends in Ecology & Evolution 11:367–372.

https://doi.org/10.1016/0169-5347(96)10041-0
- PubMed
- Google Scholar
Book
1. Yang Z
(2006) Computational Molecular Evolution
Oxford, England: Oxford University Press.

https://doi.org/10.1093/acprof:oso/9780198567028.001.0001
- Google Scholar
1. Yoshida K
2. Hata K
3. Kawakami K
4. Hiradate S
5. Osawa T
6. Kachi N
(2023) Predicting ecosystem changes by a new model of ecosystem evolution
Scientific Reports 13:15353.

https://doi.org/10.1038/s41598-023-42529-9
- PubMed
- Google Scholar
(2007) Protein stability imposes limits on organism complexity and speed of molecular evolution
PNAS 104:16152–16157.

https://doi.org/10.1073/pnas.0705366104
- Google Scholar
1. Zhang J
2. Nei M
(1997) Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods
Journal of Molecular Evolution 44 Suppl 1:S139–S146.

https://doi.org/10.1007/pl00000067
- PubMed
- Google Scholar

Article and author information

Author details

David Ferreiro
1. CINBIO, Universidade de Vigo, Vigo, Spain
2. Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, Vigo, Spain
Contribution
Resources, Data curation, Software, Formal analysis, Supervision, Validation, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing

Competing interests
No competing interests declared

"This ORCID iD identifies the author of this article:" 0000-0003-0757-7702
Luis Daniel González-Vázquez
1. CINBIO, Universidade de Vigo, Vigo, Spain
2. Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, Vigo, Spain
Contribution
Resources, Data curation, Formal analysis, Investigation, Visualization, Methodology, Writing – review and editing

Competing interests
No competing interests declared

"This ORCID iD identifies the author of this article:" 0000-0002-6973-6660
Ana Prado-Comesaña

CINBIO, Universidade de Vigo, Vigo, Spain

Contribution
Resources, Data curation, Formal analysis, Investigation, Visualization, Writing – review and editing

Competing interests
No competing interests declared
Miguel Arenas
1. CINBIO, Universidade de Vigo, Vigo, Spain
2. Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, Vigo, Spain
Contribution
Conceptualization, Resources, Software, Funding acquisition, Investigation, Visualization, Methodology, Writing – original draft, Project administration, Writing – review and editing

For correspondence
marenas@uvigo.es

Competing interests
No competing interests declared

"This ORCID iD identifies the author of this article:" 0000-0002-0516-2717

Funding

Ministerio de Ciencia, Innovación y Universidades (PID2023-151032NB-C22)

Luis Daniel González-Vázquez
Miguel Arenas

Agencia Estatal de Investigación (PID2023-151032NB-C22)

Luis Daniel González-Vázquez
Miguel Arenas

Federación Española de Enfermedades Raras

Luis Daniel González-Vázquez
Miguel Arenas

Xunta de Galicia (ED481A-2020/192)

David Ferreiro

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was supported by the Project PID2023-151032NB-C22 funded by MCIU/AEI/10.13039/501100011033 and by FEDER, UE. DF was funded by a fellowship from Xunta de Galicia [ED481A-2020/192]. We thank the “Centro de Supercomputación de Galicia (CESGA)” for the computer resources. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Version history

Sent for peer review: February 21, 2025
Preprint posted: April 2, 2025
Reviewed Preprint version 1: May 22, 2025
Reviewed Preprint version 2: September 8, 2025
Version of Record published: September 24, 2025

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.106365. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

822

views
49

downloads
3

citations

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Citations by DOI

1

citation for umbrella DOI https://doi.org/10.7554/eLife.106365

2

citations for Version of Record https://doi.org/10.7554/eLife.106365.3

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Article PDF

Open citations (links to open the citations from this article in various online reference manager services)

Mendeley

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

David Ferreiro
Luis Daniel González-Vázquez
Ana Prado-Comesaña
Miguel Arenas

(2025)

Forecasting protein evolution by integrating birth-death population models with structurally constrained substitution models

eLife 14:RP106365.

https://doi.org/10.7554/eLife.106365.3

Categories and tags

Research organism

Viruses

Share this article

Cite this article

Illustrative example of forward in time simulation of protein evolution integrating a birth-death population evolutionary process with fitness from the protein folding stability and the modeling of protein evolution with a structurally constrained substitution model.

Distribution of amino acid substitutions observed along the HIV-1 MA sequences at time T31.

Comparison of real and predicted sequences of the HIV-1 MA protein considering predictions based on the SCS and neutral models.

Prediction error of SARS-CoV-2 Mpro and PLpro evolution under SCS and neutral models regarding physicochemical properties of the amino acid changes accumulated during the evolutionary trajectories and protein folding stability.

Prediction error of influenza NS1 protein evolution under SCS and neutral models regarding physicochemical properties of the amino acid changes accumulated during the evolutionary trajectories and protein folding stability.

Prediction error of HIV-1 PR evolution at diverse populations regarding physicochemical properties of the amino acid changes accumulated during the evolutionary trajectories and protein folding stability.

Author details

David Ferreiro

Contribution

Competing interests

Luis Daniel González-Vázquez

Contribution

Competing interests

Ana Prado-Comesaña

Contribution

Competing interests

Miguel Arenas

Contribution

For correspondence

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Research organism