Forecasting protein evolution by integrating birth-death population models with structurally constrained substitution models
Figures

Illustrative example of forward-in-time simulation of protein evolution, integrating a birth-death population process, with fitness derived from protein folding stability, and a structurally constrained substitution (SCS) model of protein evolution.
Given a protein variant assigned to a node at time t (blue node), its fitness is calculated from its protein folding stability. The fitness then determines the birth and death rates for that variant, which give the time to the next birth or death event (horizontal dashed line), corresponding to the forward-in-time branch length. Next, the variant is evolved forward in time toward each descendant, along the previously determined branch length, under an SCS model of protein evolution. The process is repeated, forward in time, starting at each new variant. If a death event occurs, the variant of the extinct node (pink node) is obtained, but it does not have descendants. The process finishes when a particular sample size or simulation time is reached (i.e. t+n).
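The steps described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_fitness` is a hypothetical stand-in for a stability-based fitness, the rule `birth = fitness`, `death = 1 - fitness` is assumed for illustration only, and `evolve` applies a simple site-independent substitution process rather than an SCS model. Only the simulation-time stopping rule is implemented, with `max_events` as a safety guard.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_fitness(seq, target):
    """Hypothetical fitness in [0, 1]: fraction of sites matching a target sequence.
    The paper derives fitness from folding stability; this is only a placeholder."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)

def evolve(seq, branch_length, rate, rng):
    """Placeholder substitution process (NOT an SCS model): each site changes
    with probability 1 - exp(-rate * branch_length)."""
    p = 1.0 - math.exp(-rate * branch_length)
    out = []
    for aa in seq:
        if rng.random() < p:
            aa = rng.choice([x for x in AMINO_ACIDS if x != aa])
        out.append(aa)
    return "".join(out)

def simulate(root, target, max_time, rate=0.05, seed=0, max_events=10000):
    """Return terminal nodes as (sequence, time, status) tuples,
    where status is 'sampled' (reached max_time) or 'extinct'."""
    rng = random.Random(seed)
    active = [(root, 0.0)]  # lineages still to be processed: (variant, node time)
    leaves = []
    events = 0
    while active and events < max_events:
        seq, t = active.pop()
        events += 1
        f = toy_fitness(seq, target)
        birth, death = f, 1.0 - f              # assumed mapping from fitness to rates
        total = birth + death
        wait = rng.expovariate(total)          # time to next event = branch length
        t_next = t + wait
        if t_next > max_time:
            # No event before the end of the simulation: evolve to max_time and sample.
            leaves.append((evolve(seq, max_time - t, rate, rng), max_time, "sampled"))
        elif rng.random() < birth / total:
            # Birth: two descendants, each evolved along the same branch length.
            for _ in range(2):
                active.append((evolve(seq, wait, rate, rng), t_next))
        else:
            # Death: the extinct variant is obtained but leaves no descendants.
            leaves.append((evolve(seq, wait, rate, rng), t_next, "extinct"))
    return leaves

leaves = simulate("ACDEFGHIKL", "ACDEFGHIKL", max_time=2.0, seed=1)
for seq, t, status in leaves:
    print(f"{seq}  t={t:.3f}  {status}")
```

Lineages are processed depth-first here for brevity; stopping at a fixed sample size would require handling events in chronological order across lineages.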

Distribution of amino acid substitutions observed along the HIV-1 MA sequences at time T31.
Left: Distribution of the observed amino acid substitutions along the HIV-1 matrix (MA) protein sequences at time T31. Right: Distribution of the indicated amino acid substitutions (shown in blue) along the protein structure.

Prediction error of SARS-CoV-2 Mpro and PLpro evolution under SCS and neutral models regarding physicochemical properties of the amino acid changes accumulated during the evolutionary trajectories and protein folding stability.
Predictions based on data simulated under the SCS [including birth-death models with constant (SCS) and variable global birth-death rate among lineages (GlobalBDvar)] and neutral models. (A) Grantham distance calculated from the amino acid changes that occurred during the real and predicted evolutionary trajectories based on SCS and neutral models of protein evolution. (B) Variation of protein folding stability (ΔΔG) between real and predicted protein variants based on SCS and neutral models of protein evolution. Notice that positive ΔΔG indicates that the real protein variants are more stable than the predicted protein variants and vice versa. Error bars correspond to the 95% confidence interval of the mean of prediction error from 100 multiple sequence alignments simulated for the corresponding population and time.
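As an illustration of how error bars of this kind can be obtained, the sketch below computes the mean and a 95% confidence interval half-width from a set of replicate prediction errors. It assumes a normal approximation (mean ± 1.96 standard errors); the caption does not state which procedure was used, and the input values are hypothetical.

```python
import math
import statistics

def mean_ci95(values):
    """Return (mean, half_width) of a 95% confidence interval of the mean,
    using the normal approximation: half_width = 1.96 * sd / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    return mean, 1.96 * sem

# Hypothetical prediction errors from a handful of simulated alignments.
errors = [0.41, 0.35, 0.38, 0.44, 0.36]
mean, half = mean_ci95(errors)
print(f"{mean:.3f} +/- {half:.3f}")
```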

Prediction error of influenza NS1 protein evolution under SCS and neutral models regarding physicochemical properties of the amino acid changes accumulated during the evolutionary trajectories and protein folding stability.
Predictions were performed for two time points (longitudinal samples T2 and T3). Predictions based on data simulated under the SCS [including birth-death models with constant (SCS) and variable global birth-death rate among lineages (GlobalBDvar)] and neutral models. (A) Grantham distance calculated from the amino acid changes that occurred during the real and predicted evolutionary trajectories based on SCS and neutral models of protein evolution. (B) Variation of protein folding stability (ΔΔG) between real and predicted protein variants based on SCS and neutral models of protein evolution. Notice that positive ΔΔG indicates that the real protein variants are more stable than the predicted protein variants and vice versa. Error bars correspond to the 95% confidence interval of the mean of prediction error from 100 multiple sequence alignments simulated for the corresponding population and time.

Prediction error of HIV-1 PR evolution at diverse populations regarding physicochemical properties of the amino acid changes accumulated during the evolutionary trajectories and protein folding stability.
(A) For each viral population (patient, represented with a particular color) and time, Grantham distance calculated from the amino acid changes that occurred during the real and predicted evolutionary trajectories. For each population, the mean of distances obtained over time is shown on the right. (B) Relationship between Grantham distances and accumulated number of substitution events (R² = 0.0001, which indicates a lack of correlation between these parameters). (C) Variation of protein folding stability (ΔΔG) between real and predicted protein variants at each viral population and time. For each population, the mean of distances obtained over time is shown on the right. Notice that positive ΔΔG indicates that the real protein variants are more stable than the predicted protein variants and vice versa. Error bars correspond to the 95% confidence interval of the mean of prediction error from 100 multiple sequence alignments simulated for the corresponding viral population and time.
Tables
Comparison of real and predicted sequences of the HIV-1 MA protein considering predictions based on the SCS and neutral models.
For the data simulated under the SCS [including birth-death models with constant (SCS) and variable global birth-death rate among lineages (GlobalBDvar)] and neutral models, the table shows the Grantham distance between the amino acids that changed during the real and predicted evolutionary trajectories and the Kullback-Leibler (KL) divergence between the real and predicted multiple sequence alignments. Next, it shows the folding stability (ΔG) of the real protein variants at times T1 and T31 and the folding stability of the predicted protein variants at time T31. The error corresponds to the 95% confidence interval from the mean (100 samples) of predictions of folding stability.
Model | Grantham distance | KL divergence | ΔG of the real variant at T1 (kcal/mol) | ΔG of the real variants at T31 (kcal/mol) | ΔG of the predicted variants at T31 (kcal/mol) | ΔΔG (kcal/mol) at T31 (predicted – real variants) |
---|---|---|---|---|---|---|
SCS model | 5% | 6% | –9.72 | –10.34±0.14 | –9.96±0.02 | 0.38 |
SCS GlobalBDvar model | 5% | 6% | –9.72 | –10.34±0.14 | –10.03±0.03 | 0.31 |
Neutral model | 5% | 6% | –9.72 | –10.34±0.14 | –9.21±0.07 | 1.14 |
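The ΔΔG column follows directly from the two T31 columns (predicted minus real). The short sketch below reproduces that arithmetic from the rounded means reported in the table; the variable and function names are illustrative.

```python
# Mean ΔG values (kcal/mol) at T31, copied from the table above.
REAL_T31 = -10.34
PREDICTED_T31 = {
    "SCS model": -9.96,
    "SCS GlobalBDvar model": -10.03,
    "Neutral model": -9.21,
}

def ddg(predicted, real):
    """ΔΔG = ΔG(predicted) - ΔG(real); positive means the real variants are more stable."""
    return predicted - real

results = {model: round(ddg(dg, REAL_T31), 2) for model, dg in PREDICTED_T31.items()}
for model, value in results.items():
    print(f"{model}: {value:.2f} kcal/mol")
```

Computed from the rounded means, the neutral model gives 1.13 rather than the tabulated 1.14, a difference that presumably reflects rounding of the underlying values.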
Additional files
- Supplementary file 1: Supplementary tables A–C and references cited in the supplementary file.
  https://cdn.elifesciences.org/articles/106365/elife-106365-supp1-v1.pdf
- MDAR checklist
  https://cdn.elifesciences.org/articles/106365/elife-106365-mdarchecklist1-v1.pdf