1. Evolutionary Biology

Coupling adaptive molecular evolution to phylodynamics using fitness-dependent birth-death models

  1. David A Rasmussen (corresponding author)
  2. Tanja Stadler

  1. North Carolina State University, United States
  2. ETH Zürich, Switzerland
  3. Swiss Institute of Bioinformatics, Switzerland
Research Article
Cite this article as: eLife 2019;8:e45562 doi: 10.7554/eLife.45562
9 figures and 1 table

Figures

Schematic overview of birth-death models.

(A) Standard phylogenetic models assume that there is an underlying process by which individuals replicate and give rise to a phylogeny. Mutations occur along the lineages of the tree, generating the sequence data observed at the tips. The mutation process is assumed to be independent of the tree-generating process, such that mutations do not affect the branching structure of the tree. (B) The marginal fitness birth-death (MFBD) model relaxes this assumption, such that mutations at multiple sites feed back and shape both the tree and the sequence data. (C) Under the original multi-type birth-death (MTBD) model we track D_{n,i}(t), the probability density that a lineage n at time t in state i produces the subtree descending from n and the observed tip states. We also track E_i, the probability that a lineage in state i produces no sampled descendants and is therefore unobserved. (D) In the MFBD model we instead track D_{n,k,i}(t), the probability that a lineage n in state i at site k produces the subtree and the observed tip states at site k. Because the fitness f_n of a lineage depends on its genotype at all sites, we use the marginal site probabilities ω to compute the probability that a lineage has a certain genotype, such as ACT (Approximation 1). We can then marginalize over the fitness of each genotype, weighted by its approximate genotype probability, to compute the fitness f_n of a lineage (Approximation 2). Finally, we need to know the probability E_n that a lineage left no other sampled descendants, which we approximate using the probability E_u that a lineage with the same expected fitness u leaves no sampled descendants (Approximation 3). The schematic in A was reproduced from the original figure by Louis du Plessis (https://github.com/Taming-the-BEAST/TechnicalLectureSources/tree/master/BeastIntro2018) with permission under a Creative Commons license.
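Approximations 1 and 2 in panel D can be illustrated with a minimal Python sketch. It assumes that sites are treated as independent, so that a genotype probability is the product of the marginal site probabilities ω, and it uses a hypothetical two-site binary model with a made-up multiplicative fitness function. The names `genotype_probs`, `marginal_fitness` and `multiplicative_fitness` are illustrative and do not come from the authors' code.

```python
from itertools import product


def genotype_probs(site_probs):
    """Approximation 1 (sketch): genotype probability as the product of
    marginal site probabilities. `site_probs` holds one dict per site,
    mapping each state at that site to its marginal probability."""
    probs = {}
    for genotype in product(*(sp.keys() for sp in site_probs)):
        p = 1.0
        for state, sp in zip(genotype, site_probs):
            p *= sp[state]
        probs[genotype] = p
    return probs


def marginal_fitness(site_probs, fitness):
    """Approximation 2 (sketch): expected lineage fitness, averaging the
    fitness of every genotype weighted by its approximate probability."""
    return sum(p * fitness(g) for g, p in genotype_probs(site_probs).items())


def multiplicative_fitness(genotype, effect=1.1):
    """Hypothetical fitness model: each mutant state (1) multiplies
    fitness by `effect`; the wild type (all 0) has fitness 1."""
    return effect ** sum(genotype)


# Hypothetical marginal site probabilities for a two-site binary model.
site_probs = [{0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}]
probs = genotype_probs(site_probs)
f_n = marginal_fitness(site_probs, multiplicative_fitness)
```

The enumeration grows exponentially with the number of sites, so this form is only practical for a handful of sites; it is meant to show the weighting, not an efficient implementation.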

https://doi.org/10.7554/eLife.45562.002

Performance of the MFBD approximation under the four-genotype model.

(A) Simulated phylogeny showing the genotype of each lineage through time. (B) Joint likelihood of the phylogeny and tip genotypes under different values of σ using the approximate MFBD (solid line) or the exact MTBD model (dotted line). The vertical blue line marks the true parameter value. (C) The normalized probability of a single hypothetical lineage being in each genotype back through time based on the MFBD approximation (solid line) versus the exact MTBD model (dotted line) with σ=0.5. Note that the probabilities for genotypes 10 and 01 are identical. All parameters besides σ were fixed at λ=0.25, d=0.05, s=0.05. The mutation rate γ was symmetric between forward and backward mutations and fixed at 0.05.

https://doi.org/10.7554/eLife.45562.003

The error introduced by approximating genotype probabilities under the MFBD model.

(A–C) The error introduced by approximating the genotype probability densities D_{n,g} from the marginal site probabilities under the MFBD model for different mutation rates (A), strengths of selection (B), and epistatic fitness effects (C). The solid line represents the MFBD approximation with fitness effects coupled across sites, whereas the dashed line represents a more naive approximation that ignores the fitness effects of other sites entirely. The mean error represents the time-integrated average over all genotypes. (D–F) Normalized D_{n,g} probabilities for a single hypothetical lineage being in each genotype back through time based on the MFBD approximation (solid line) versus the exact MTBD model (dotted line). Each plot shows the dynamics of D_{n,g} for the parameter values marked by asterisks in the plots immediately above. Other parameters are fixed at λ=0.25, d=0.05, s=0.05.

https://doi.org/10.7554/eLife.45562.004

The error introduced by approximating the probability of no sampled descendants.

(A–C) The error introduced by approximating E_n in a discretized fitness space under the MFBD model for different mutation rates (A), strengths of selection (B), and epistatic fitness effects (C). The solid line represents the approximation where lineages are allowed to transition between fitness classes, whereas the dashed line represents the assumption that fitness does not change along unobserved lineages. To obtain a single E_n value comparable across both models, we summed E_n over all genotypes weighted by the exact probability of the lineage being in each genotype and then took the time-integrated average to compute the mean error. (D–F) The dynamics of E_n for a single hypothetical lineage back through time based on the MFBD approximation (solid line) versus the exact MTBD model (dotted line). Each plot shows the dynamics of E_n for the parameter values marked by asterisks in the plots immediately above.

https://doi.org/10.7554/eLife.45562.005

Estimating site-specific fitness effects.

(A) A phylogeny simulated under a model with five evolving sites, each with a random fitness effect. The lineages are colored according to the number of mutations they carry (blue = 0; yellow = 5). The distribution of fitness effects was assumed to be LogNormal with a mean of 0.85 and a standard deviation of 0.32. (B) Site-specific fitness effects estimated using the marginal fitness birth-death (MFBD) model. Red lines indicate the posterior median and 95% credible intervals. Blue lines mark the true fitness effect at each site.
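As a rough illustration of how random site effects like those in panel A could be drawn, the sketch below samples five values from a LogNormal distribution. The caption does not say whether 0.85 and 0.32 are moments of the fitness effects themselves or parameters of the underlying normal distribution; this sketch assumes the former and moment-matches accordingly. The helper `lognormal_params` and the seed are illustrative choices, not part of the original analysis.

```python
import math
import random


def lognormal_params(mean, sd):
    """Convert a real-space mean and standard deviation into the (mu, sigma)
    of the underlying normal distribution by moment matching.
    Assumes `mean` and `sd` describe the LogNormal variable itself."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)


mu, sigma = lognormal_params(0.85, 0.32)

# Draw one fitness effect per site for a five-site model (arbitrary seed).
rng = random.Random(1)
site_effects = [rng.lognormvariate(mu, sigma) for _ in range(5)]
```

If the reported values are instead the parameters of the underlying normal distribution, they would be passed to `lognormvariate` directly and the conversion step dropped.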

https://doi.org/10.7554/eLife.45562.006

Inference of site-specific fitness effects from simulated phylogenies.

(A–C) Correlation between the true and estimated posterior median fitness effects for phylogenies simulated with 2, 5 or 10 evolving sites. Results are aggregated over 100 simulated phylogenies, with each point representing an estimate for a single site and phylogeny. The points are colored according to the frequency of the mutant allele among sampled individuals in the phylogeny. (D) Fitness effects estimated under the exact MTBD model tracking all four possible genotypes for the same two-site simulations as in A. (E) Correlation between the site-specific fitness effects estimated under the approximate MFBD and exact MTBD models for the two-site simulations. (F) Error and uncertainty in estimated site-specific fitness effects across all 2, 5, and 10 site simulations. Error was calculated as the posterior median estimate minus the true fitness effect. Uncertainty was calculated as the standard deviation of the posterior values sampled via MCMC. In all simulations, sites where the effective sample size (ESS) of the MCMC samples was below 100 (less than 5% of all sites across simulations) were discarded. The death rate was fixed at d=0.05, but the birth, mutation and sampling rates were randomly drawn for each simulation from a prior distribution: λ ~ Uniform(0.1, 0.2); γ ~ Exponential(0.01); s ~ Uniform(0, 1). Only the birth rate was jointly inferred with the site-specific fitness effects.
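The error and uncertainty summaries in panel F can be sketched as follows. This assumes the posterior samples for one site are available as a plain list and that the ESS has already been computed elsewhere (for example by the MCMC software). The function name `summarize_site` and the toy numbers are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the caption does not specify which estimator was used.

```python
import statistics


def summarize_site(posterior_samples, true_effect, ess, min_ess=100):
    """Return error and uncertainty for one site, or None if the site
    would be discarded because its effective sample size is too low."""
    if ess < min_ess:
        return None
    median = statistics.median(posterior_samples)
    return {
        # Error: posterior median estimate minus the true fitness effect.
        "error": median - true_effect,
        # Uncertainty: standard deviation of the sampled posterior values.
        "uncertainty": statistics.stdev(posterior_samples),
    }


# Toy posterior samples for a single site (not real MCMC output).
samples = [0.9, 1.0, 1.1, 1.2, 1.3]
summary = summarize_site(samples, true_effect=1.0, ess=250)
discarded = summarize_site(samples, true_effect=1.0, ess=40)
```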

https://doi.org/10.7554/eLife.45562.007

Relative fitness of Ebola virus genotypes circulating during the 2013–16 epidemic in Western Africa.

Ancestral fitness values were reconstructed by first finding the probability of a lineage being in each possible genotype based on the marginal site probabilities computed using Equation 20. Ancestral fitness values were then computed by averaging the posterior median fitness of each genotype, weighted by the probability that the lineage was in each genotype. Fitness values are given relative to the Makona genotype isolated at the start of the epidemic. Clades are labeled according to their most probable genotype.
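The weighted average described here can be sketched in a few lines of Python. The fitness values are the base-model posterior medians reported in Table 1 for three of the genotypes; the genotype probabilities for the lineage are invented for illustration, and the function name `ancestral_fitness` is hypothetical.

```python
# Base-model posterior median fitness relative to Makona (from Table 1).
median_fitness = {
    "Makona": 1.00,
    "A82V": 1.05,
    "A82V+R410S": 1.09,
}

# Hypothetical genotype probabilities for one ancestral lineage.
genotype_prob = {
    "Makona": 0.2,
    "A82V": 0.7,
    "A82V+R410S": 0.1,
}


def ancestral_fitness(genotype_prob, median_fitness):
    """Average the posterior median fitness of each genotype, weighted by
    the probability that the lineage is in that genotype."""
    total = sum(genotype_prob.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("genotype probabilities must sum to 1")
    return sum(p * median_fitness[g] for g, p in genotype_prob.items())


fitness = ancestral_fitness(genotype_prob, median_fitness)

# Label used for the clade: the most probable genotype of the lineage.
label = max(genotype_prob, key=genotype_prob.get)
```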

https://doi.org/10.7554/eLife.45562.009

Influenza H3N2 mutational fitness effects.

(A) The fitness effects of mutations estimated in vitro using deep mutational scanning versus their estimated population-level effects. In vitro fitness effects were quantified as the relative preference for the mutant versus the consensus amino acid residue in the deep mutational scanning experiments, given on a log2 scale. Population-level fitness effects were estimated using the MFBD model assuming multiplicative effects across sites. Error bars show the 95% credible intervals on the estimated population-level fitness effects. (B) Coancestry matrix showing the fraction of ancestry shared between each pair of mutations in the H3N2 phylogeny. The coancestry value represents the fraction of branches in the phylogeny that share both mutations based on a maximum parsimony reconstruction. The diagonal gives the fraction of all branches in the phylogeny with each individual mutation.
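The coancestry calculation in panel B can be illustrated with a small sketch. It assumes the parsimony reconstruction has already been done, so that each branch is represented by the set of mutations it carries. The mutation names `m1` and `m2`, the four-branch toy tree and the function name `coancestry_matrix` are all hypothetical.

```python
def coancestry_matrix(branch_mutations, mutations):
    """Fraction of branches carrying both mutations for every pair.
    The diagonal is the fraction of branches carrying each single mutation."""
    n_branches = len(branch_mutations)
    return [
        [
            sum(1 for carried in branch_mutations if a in carried and b in carried)
            / n_branches
            for b in mutations
        ]
        for a in mutations
    ]


# Toy reconstruction: one set of carried mutations per branch.
branch_mutations = [{"m1"}, {"m1", "m2"}, {"m2"}, set()]
mutations = ["m1", "m2"]
matrix = coancestry_matrix(branch_mutations, mutations)
```

In this toy case each mutation sits on half of the branches and the two co-occur on a quarter of them, so the matrix is symmetric with 0.5 on the diagonal and 0.25 off it.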

https://doi.org/10.7554/eLife.45562.010

Relative fitness of influenza H3N2 lineages circulating in the United States between 2009 and 2012.

Fitness values were reconstructed using a fitness model that maps mutational fitness effects predicted from deep mutational scanning experiments to population-level fitness. The inset shows this fitness mapping for the model parameters with the highest posterior probability: α=0.0098 and κ=0.964. Uncertainty in ancestral amino acid sequences was taken into account by first computing the marginal site probability at each site. Ancestral fitness values were then reconstructed by marginalizing over all possible ancestral sequences using the marginal site probabilities.

https://doi.org/10.7554/eLife.45562.011

Tables

Table 1
Estimated posterior median fitness and 95% CI for the Ebola GP mutants relative to the Makona genotype
https://doi.org/10.7554/eLife.45562.008

Genotype           Sample freq  Base model        Model + geo effects  Effect in cell culture
Makona             0.036        1.00              1.00                 Reference genotype
A82V               0.720        1.05 (1.04–1.07)  1.26 (1.19–1.35)     Increases infectivity 2X
P330S              0.002        0.98 (0.82–1.14)  1.11 (0.96–1.24)     Decreases infectivity
P330S+N107D+G480D  0.037        1.04 (0.98–1.12)  1.27 (1.16–1.39)     Increases infectivity > 2X
A82V+R410S         0.044        1.09 (1.00–1.18)  1.31 (1.17–1.45)     No or small effect
A82V+R410S+K439E   0.035        1.14 (1.01–1.26)  1.36 (1.20–1.54)     Increases infectivity 2-3X
A82V+R29K          0.019        1.06 (0.93–1.19)  1.27 (1.10–1.45)     Increases infectivity 2-3X
A82V+T230A         0.026        1.03 (0.93–1.11)  1.23 (1.10–1.37)     Increases infectivity 2-3X
A82V+I371V         0.067        1.03 (0.98–1.09)  1.24 (1.14–1.35)     Increases infectivity 2-3X

Data availability

All data and code required to reproduce our Ebola analysis in its entirety are available at https://github.com/davidrasm/Lumiere/tree/master/ebola (copy archived at https://github.com/elifesciences-publications/Lumiere). The sequence alignment and the time-calibrated molecular phylogeny we used for our analysis were downloaded from https://github.com/ebov/space-time/tree/master/Data. Dataset S3 of Lee et al. 2018 was downloaded from https://www.pnas.org/highwire/filestream/822898/field_highwire_adjunct_files/3/pnas.1806133115.sd03.xlsx.
