Measuring the distribution of fitness effects in somatic evolution by combining clonal dynamics with dN/dS ratios
Abstract
The distribution of fitness effects (DFE) defines how new mutations spread through an evolving population. The ratio of nonsynonymous to synonymous mutations (dN/dS) has become a popular method to detect selection in somatic cells. However the link, in somatic evolution, between dN/dS values and fitness coefficients is missing. Here we present a quantitative model of somatic evolutionary dynamics that determines the selective coefficients of individual driver mutations from dN/dS estimates. We then measure the DFE for somatic mutant clones in ostensibly normal oesophagus and skin. We reveal a broad distribution of fitness effects, with the largest fitness increases found for TP53 and NOTCH1 mutants (proliferative bias 1–5%). This study provides the theoretical link between dN/dS values and selective coefficients in somatic evolution, and measures the DFE of mutations in human tissues.
Introduction
One of the principal goals of largescale genome sequencing of somatic tissues is to uncover genetic loci under positive selection, socalled ‘driver’ genes, that lead to clonal expansions. Measurement of the selective advantage of each driver mutation enables prediction of future evolutionary dynamics (Williams et al., 2019), provided the selective regime remains constant. In evolutionary biology, the distribution of fitness effects (DFE) is a fundamental entity that describes the selective consequences of a (large) number of individual mutations of an ancestral genome (EyreWalker and Keightley, 2007). In somatic evolution, particularly in cancer genomes, we have an extensive knowledge of the catalogue of recurrent, and likely positively selected, somatic mutations (Martincorena et al., 2017), but the fitness changes associated with each mutation remain largely unquantified.
Extensive experimental effort is ongoing to determine the fitness effects of mutations. Most prominently is lineage tracing of mutations in mouse models (Vermeulen et al., 2013; Rogers et al., 2018), but these methods are not sufficiently highthroughput to produce the DFE for all somatic mutations. Other studies have estimated the selective coefficient of somatic mutations by measuring their frequency over time in the same individual using longitudinal sampling (Körber et al., 2019), however this method is broadly limited to somatic evolution in the blood (Gibson and Steensma, 2018) (where it is feasible to take samples from healthy individuals over time) and in rare cases of patients under active surveillance.
An alternative approach is to infer selective coefficients directly from genome sequencing data. Methods to identify positivelyselected (driver) mutations rely on finding genes that have significantly more mutational ‘hits’ (typically hits are nonsynonymous mutations) than would be expected by chance, after correction for factors known to influence the mutation rate across the genome (Bailey et al., 2018). Conversely, negatively selected genes are expected to show a paucity of mutations (Weghorn and Sunyaev, 2017; Zapata et al., 2018). This idea is formalised in the calculation of the dN/dS ratio – a method originally developed in molecular species evolution – that has recently been adapted to study somatic evolution (both cancer and normal tissue) (Martincorena et al., 2017; Weghorn and Sunyaev, 2017; Zapata et al., 2018; Wu et al., 2016; Greenman et al., 2006; Yang et al., 2003; Martincorena et al., 2018; LeeSix et al., 2018). The intuitive idea behind dN/dS is to measure the rate of nonsynonymous (dN) mutations (possibly under selection) and compare that to the rate of synonymous (dS) mutations (presumed neutral). The ratio of these two numbers, each normalised for the local sequencespecific biases in the mutation rate, putatively identifies a signature of selection: dN/dS >1 indicating positive selection, dN/dS = 1 indicating neutral evolution and dN/dS <1 indicating negative selection.
Transforming dN/dS values to selective coefficients in somatic evolution is an unaddressed problem. dN/dS was originally developed in the context of species evolution using the WrightFisher process, a classical population genetics model that assumes that evolution occurs over very long timescales, which permits new mutations to fix within lineages. The WrightFisher model assumes constant population sizes, nonoverlapping generations and that all individuals have equal potency. Under the WrightFisher model, the dN/dS of a locus is related to its selective coefficient by the relation (Nielsen and Yang, 2003):
Where $N$ is the effective population size and $s$ the selection coefficient.
However, in somatic evolution the assumptions of the FisherWright model are violated. Somatic evolution is rapid and new mutations are infrequently fixed in the population (McGranahan and Swanton, 2017), clonal dynamics are complex (Williams et al., 2019), and population sizes unlikely to be constant (Sottoriva et al., 2015). Further, the lack of recombination in somatic evolution can result in strong hitchhiking effects (Tilk et al., 2019). In addition, since in somatic evolution, the ancestral genome is known, the need to measure dN/dS across a phylogeny is circumvented (a necessary step for dN/dS analysis in species evolution). Violations of some of these assumptions was previously recognised to make the interpretation of dN/dS problematic (Kryazhimskiy and Plotkin, 2008; Mugal et al., 2014), and consequently the relationship between selective coefficients and dN/dS values is uncertain.
The size distribution of clones (called the site frequency spectrum in population genetics nomenclature) also contains information on the selective coefficients of newly arising mutations. Mathematical descriptions of the dynamics of populations of cells can make predictions on the shape of the clone size distribution under different demographic and evolutionary models (Simons, 2016a; Durrett, 2013). This approach has been used to quantify the dynamics and cell fate properties of stem cells across many tissues in model systems (Klein et al., 2010; LopezGarcia et al., 2010; Vermeulen et al., 2013). We and others have also used similar approaches applied to deep sequencing data to infer the evolutionary dynamics of tumours (Williams et al., 2016; Williams et al., 2018; Bozic et al., 2016; Ling et al., 2015) and of clonal haematopoiesis in the blood (Watson et al., 2019).
To date, dN/dS analysis and the analysis of the clone size distribution have been performed independently, with conflictual results (Simons, 2016b; Martincorena et al., 2016). Here we develop the mathematical population genetics theory necessary to combine these approaches and explore how the interindividual measure of selection at a locus as provided by dN/dS values, is related to the underlying cell population dynamics that generate intraindividual clone size distributions. This approach naturally accounts for the nuances in somatic evolution that can make the interpretation of dN/dS difficult. We show how this unified approach allows for greater insight into patterns of selection than either method in isolation, and importantly reveal the precise mathematical relationship between dN/dS values and selective coefficients in somatic evolution. We use this approach to infer the selective advantage of mutations in normal tissue.
Results
A general approach to integrate dN/dS and clone size distributions
We present a general mathematical framework for the interpretation of frequencydependent dN/dS values in somatic evolution. First, we construct null models of the evolutionary dynamics in the absence of selection, and then augment these models to incorporate the consequences of selection. Evolutionary dynamics differ between normal tissues and cancer cells: in normal tissues maintained by stem cells, the longterm population dynamics is controlled by an approximately fixedsize set of equipotent stem cells undergoing a process of neutral competition (Klein and Simons, 2011), whereas in tumour growth the overall population increases over time. We develop a null model to predict the expected genetic diversity in the population in the absence of selection. Positive selection causes selected variants to rise to higher frequency than expected under neutral evolution (Figure 1a), and negative selection has the opposite effect. This insight guides how we model the effects of selection, namely the diversity of nonsynonymous mutations.
Specifically, we defined the function $g\left(\theta ,\mu ,s,f\right)$ as the expected number of mutations with selective (dis)advantage $s$ found at a frequency $f$, for a given evolutionary dynamics scenario, where mutations accumulate at a rate µ per division. For the remainder of the paper we use passenger mutations to refer to those mutations that have no functional effect (s=0) and driver mutations those that have s>0. When comparing to data, driver mutations are taken as equivalent to nonsynonymous mutations and passengers equivalent to synonymous mutations.
The functional form of $g\left(\theta ,\mu ,s,f\right)$ encapsulates the population dynamics of the system with parameter vector $\theta $, which may, for example, include the growth rate of a tumour, or loss replacement rate of stem cells in normal tissue. The direct interpretation of $s$ depends on the system under question. Following the logic of the effect of selection above, for ${s}^{\prime}>s$ we have that:
Since dN/dS measures the excess or deficiency of mutations due to selection, taking the ratio of $g\left(\theta ,\mu ,s,f\right)$ when $s\ne 0$ to $s=0$ and normalizing for the mutation rates, which may differ for passenger (${\mu}_{p}$) and driver (${\mu}_{d}$) mutations respectively, informs how dN/dS is expected to change as a function of the frequency $f$ of mutations in the population (Equation 1).
We discuss the general properties of this model. Firstly, when $s=0$ (neutral evolution), and provided that the mutation rates are correctly normalised, the numerator and denominator are equal resulting in $\frac{dN}{dS}=1$, as expected. Secondly, dN/dS increases as a function of frequency $f$ (clone size) for positive selection, and decreases as a function of $f$ for negative selection (Figure 1b), for all $g\left(\theta ,\mu ,s,f\right)$ that we explored. Thirdly, the shape of the curves predicted by the underlying population model encodes the value of the selection coefficient; for example the steepness of the increase is proportional to the selection coefficient $s$ (Figure 1c). These observations are a natural consequence of positive selection driving selected mutations to higher frequency (Figure 1a).
Unfortunately, directly using Equation (1) to measure selective coefficients from the slope of the dN/dS curve as function of frequency is often impractical. Real sequencing data often suffers from a limited number of mutations detected at any particular frequency and measurement uncertainties in these frequencies. To circumvent these issues, we introduce ‘interval dN/dS’ (idN/dS) that aggregates over a frequency range to reduce the influence of these sources of noise. Interval dN/dS is defined as:
Fixing the integration range ${[f}_{min}{,f}_{max}]$ allows for robust inference of $s$ in potentially sparse and noisy sequencing data using maximum likelihood methods (see Materials and methods).
Frequencydependent dN/dS values in stem cell populations
In healthy tissue, only mutations that are acquired in the stem cells will persist over long times, and so we restrict our attention to these cells. Quantitative analysis of lineage tracing data has shown that the stem cell dynamics of many tissues conform to a process of population asymmetry (Klein and Simons, 2011). In this paradigm, under homeostasis, the loss of stem cells through differentiation is compensated by the replication of a neighbouring stem cell, thus maintaining an approximately constant number of stem cells. These dynamics are represented by the rate equations:
where SC refers to a single stem cell which divides symmetrically to produce either two stem cells or two differentiated cells (denoted as D above), $\lambda $ is the rate of cell division per unit time, and $r$ is the probability of a symmetric divisions. The product $r\lambda $ is referred to as the loss/replacement rate. Differentiated cells will ultimately be lost from the population over long time scales. Under homeostasis, loss and replacement should be exactly balanced, so Δ = 0. With Δ ≠ 0, the fate of a stem cell is ‘biased’, introducing positive or negative selection into the model. Previous mathematical analysis shows that this model is a good description of the clonal dynamics in the oesophagus and skin (Klein et al., 2010; Doupé et al., 2012; Alcolea et al., 2014). Using previous analytical results describing the temporal evolution of the clone size distribution (see methods for detailed discussion) we derive the frequency distribution $g\left(\theta ,\mu ,s,A\right)$ for oesophagus and skin as (Simons, 2016a; Klein et al., 2010; Nicholson and Antal, 2016):
Where A is the area of the clone, $\rho $ is density of stem cells per mm^{2}, ${n}_{0}$ is the starting population size and μ the mutation rate, which may be different for drivers $(s>0)$ and passenger mutations $(s=0)$, ie drivers and passengers may accumulate at different rates. $N\left(t\right)$ is a scaling factor that depends on Δ, the bias toward selfrenewal, which we interpret as our selection coefficient in this system. Specifically:
$N\left(t\right)$ can be interpreted as the average size of a labelled clone after time $t$, which even under homeostasis grows over time and compensates for some clones being lost due to drift. From these expressions, we can then write down an expression for idN/dS as a function of clone frequency (see Materials and methods) that allows for maximum likelihood estimation of parameter values (Δ). We confirmed the accuracy of our derivation using simulations (Figure 2a), and performed power calculations to determine the minimum number of mutations required to correctly infer the underlying population dynamics: we determined that 8 mutations per gene was sufficient to accurately recover Δ (Figure 2b) with accuracy increasing for higher mutation burdens (Figure 2c). We also performed simulations where Δ was itself a random variable, simulating the effect of different sites within the gene having different fitness effects. We assumed Δ was exponentially distributed and generated 500 simulated cohorts. Fitting idN/dS demonstrated that on average we infer the mean value of the exponential distribution, Figure 2—figure supplement 1.
Selection advantages in histopathologicallynormal human oesophagus
We inferred the selective advantage of driver mutations in human oesophagus using published deep sequencing data from Martincorena and colleagues (Martincorena et al., 2018) that documents the clonal expansion of a panel of putative driver mutations in histopathologicallynormal oesophageal biopsies.
We used the dndscv bioinformatics tool (Martincorena et al., 2017) to calculate frequencydependent dN/dS values from these data (clone size measured in fraction of mutant reads multiplied by 2 mm^{2} – the area of the biopsy – and assuming 5000 stem cells per mm^{2} (EyreWalker and Keightley, 2007) tissue). dN/dS values varied considerably as a function of mutation area (Figure 1—figure supplement 1).
We considered the average frequencydependent dN/dS values simultaneously across all genes in the panel, on a patientbypatient basis. Our theoretical model of idN/dS calculated from these data fitted strikingly well (Figure 2—figure supplement 2). Estimates of the loss/replacement rate $r\lambda $ of the stem cell population were in the range 1.25.0 per year (Figure 2—figure supplements 2 and 3). Inference of the selective advantage $s$ (measured in terms of the bias towards self renewal Δ) revealed an average bias of 0.004 (0.002 – 0.005 95% CI) per missense mutation (Figure 2—figure supplement 3). Nonsense mutations caused a fivefold greater bias towards selfrenewal of 0.021 (0.008 – 0.032 95% CI) (Figure 2—figure supplement 3). After removal of all genes that are strongly selected, global dN/dS values on the remaining 48 genes show dN/dS of approximately 1 across the frequency range (Figure 2d), and idN/dS analysis revealed that these somatic mutations do not associate with a proliferative bias (Δ=0).
We then fitted the data on a genebygene and patientbypatient basis for cases where sufficient mutations were available to perform the fit (Figure 2e–g; Figure 2—figure supplement 4). A broad range of selective advantages were inferred (Figure 3 and Figure 3—figure supplement 1). Mutations in TP53 showed large biases across all patients for both missense, Δ = 0.057 (0.05–0.068 95% CI) and nonsense mutations, Δ = 0.094 (0.091–0.097 95% CI) (Figure 3a–b). This was also true for mutations in NOTCH1 with Δ = 0.029 (0.019–0.036 95% CI) for missense and Δ = 0.072 (0.034–0.089 95% CI) for nonsense mutations. NOTCH2, PIK3CA, CREBBP and FAT1 also showed a bias toward selfproliferation in multiple patients (Figure 3a–b), though most had a small effect on fitness (range 0.003–0.029 for missense mutations and 0.030–0.041 for nonsense mutations). Together these data suggest a distribution of fitness effects (DFE) characterized by many small effect mutations with few large effect mutations (Figure 3c–d), as is seen in organismal evolution (EyreWalker and Keightley, 2007). We recognize that there may be intragene variation of selection coefficients, that is, some sites within genes may have stronger fitness effects than others. This is supported by clustering of mutations within particular domains and hotspots of mutations as documented in the original study (Martincorena et al., 2018). In future, larger cohorts and methods to estimate site level dN/dS values would allow this approach to be extended to the site level.
As our model assumes that clones emerge and expand independently we checked that the data is not overly influenced by hitchhiking mutations,which would violate these assumptions. For this, we leveraged the spatial sampling of tissue pieces. Approximately 90 patches were sampled from each patient. We reasoned that patches with selected clones might be expected to have more hitchhiking mutations, and for those mutations to be at a higher frequency when compared to patches without selected clones. To test this hypothesis, we counted the number of nonsynonymous NOTCH1 and TP53 mutations and the number of synonymous mutations in each patch. If the synonymous mutations we observe in the data are largely due to hitchhiking effects we would expect the number and size of synonymous variants to correlate with the number of driver mutations per patch. While there =was a statistically significant correlation for both NOTCH1 (linear regression, p<0.001) and TP53 mutations (p=0.031), the effect was small (Figure 5—figure supplement 1): for each additional driver there was on average 0.05 additional synonymous variants (0.047 for NOTCH1 and 0.056 for TP53). We note too that the correlation was very noisy (R^{2} < 0.02) and we observed no statistically significant relationship between VAF of synonymous mutations and the number of TP53 or NOTCH1 mutations (linear regression, p>0.1). This analysis suggests that the majority of synonymous mutations are not hitchhikers, and consequently that assuming the independence of clones isreasonable.
Driver mutation selective advantage in normal skin
Martincorena and colleagues had also published data on the expansion of driver mutations in ostensibly normal human skin (Martincorena et al., 2015). Analyses of these data with interval dN/dS revealed a perpatient average selective advantage per mutation (again measured in terms of the bias towards self renewal) of Δ = 0.001 for missense mutations and fourfold higher (Δ = 0.004) for nonsense mutations (Figure 4ac). Performing the analysis on a genebygene basis was limited by the lower number of detected mutations, and the limited frequency range (clone size range) compared to the oesophagus dataset. Good fits to the data were obtainable for NOTCH1 missense mutations in patient PD18003 with fitness estimated to be Δ = 0.0149 (0.01480.0150 95% CI), and TP53 missense mutations also in patient PD18003, Δ = 0.0054 (0.00510.0058 95% CI) Figure 4. These fitness coefficients were similar to the oesophagus data. For missense mutations we were also able to produce the distribution of fitness effects across the skin cohort, which showed similar characteristics to the oesophagus data of a small number of high effect mutations and a larger number of smaller effect mutations, Figure 4f.
Site frequency spectra
We next sought to challenge our model by directly fitting the site frequency spectra across ages, taking a similar approach to studies of the blood (Watson et al., 2019), colon (LopezGarcia et al., 2010), skin (Simons, 2016a) and other tissues (Klein and Simons, 2011). Our model of stem cell dynamics makes predictions on the properties of these distributions as a function of the age of donors. In particular, Equations (5) and (6) predict that the characteristic frequency N(t) increases exponentially for nonneutral mutations and linearly for neutral mutations as a function of time. Plotting the distribution of clone size areas showed a widening of the distribution as a function of age, which was particularly striking for mutations in the NOTCH1 and TP53 genes, consistent with these genes conferring large selective advantages (Figure 5a).
To quantitively test the predictions of the model and to infer parameters of interest we implemented a Bayesian nonlinear fitting method (see methods) to fit the following model:
With the parameters $\alpha =\mu {n}_{0}/r\lambda \rho $, and $\beta =\mathrm{l}\mathrm{o}\mathrm{g}(N\left(t\right)/\rho )$ to be estimated. We first validated that the approach could correctly infer known parameters from synthetic data generated from our simulation framework, Figure 5—figure supplement 2. Next, we fitted the model to the oesophagus dataset, separately for nonsynonymous and synonymous mutations and across ages, Figure 5e. Posterior estimates of the compound parameter $\mu {n}_{0}/r\lambda \rho $ showed consistent estimates for both nonsynonymous and synonymous mutations (taking into account the approximate 3:1 ratio of mutable sites), Figure 5c. Unfortunately decoupling $\mu {n}_{0}/r\lambda \rho $ directly from the data is not possible and requires independent estimates of either the mutation rate or number of stem cells. Posterior estimates of the characteristic frequency N(t) showed an increase as a function of age for nonsynonymous mutations and a more modest increase for synonymous as would be predicted from theory, Figure 5d.
As another challenge to the proposed model, we also fitted the following two models which are simpler subsets of the full model we derive theoretically, thus testing two alternate decay functions.
Here, $\alpha $ and $\beta $ are parameters to be estimated. Comparison of the fits using the widely applicable information criteria, WAIC (a generalized version of information criteria such as AIC and BIC) found that our theoretical model with an exponential term and a power law term (Equation 7) provided the best predictive accuracy (lowest WAIC), Figure 5b. Bayes factors for the proposed theoretical model also strongly supported this as the best model (BF > 10^{3} for both models).
We also attempted to fit the model on a gene by gene basis. However due to the limited data points at the gene level we found that our posterior estimates for the parameters were very wide, precluding further insight from the site frequency spectrum at the gene level, Figure 5—figure supplement 3. This reveals one of the strengths of our idN/dS approach, which by leveraging information across genes to infer the background mutation model and integrating over clone sizes we can perform the inference with a limited number of data points. As an alternative to fitting the full clone size distribution, we performed a Bayesian multilevel regression to assess which mutations showed the largest increase in clone size as a function of age. This analysis revealed that TP53 and NOTCH1 had the largest regression coefficients, consistent with these genes having the largest selection coefficients, Figure 5—figure supplement 4, providing qualitative support for our approach. Our inferred selection coefficients from idN/dS were also correlated with the regression coefficients from the statistical model (linear regression, p=0.004, R^{2} = 0.47). Taken together, analysis of the site frequency spectrum on a patient by patient basis and a gene level statistical model are consistent with our inferences from idN/dS. We find that the proposed model provides the best fit to data when compared to other similar models, that the characteristic frequency of nonsynonymous mutations increases rapidly with age and that NOTCH1 and TP53 exhibit the largest increase in clone size as a function of age.
Discussion
Here we have shown that the combination of dN/dS values with mutation frequencybased information provides additional quantitative insight into dynamics of somatic evolution than either method alone. Specifically, the combined approach enables direct inference of the selection coefficients of mutations in somatic tissues. We note that our study also shows that the magnitude of the selection coefficient is not necessarily represented by simply calculating a ‘point estimate’ of dN/dS that neglects mutation subclonal frequency information.
Using this methodology, we have begun the construction of the distribution of fitness effects (DFE) in somatic evolution, observing a distribution where most mutations that we analysed were nearneutral, with a tail of highly selected variants. In both skin and oesophagus, the most highly selected mutant genes were NOTCH1 and TP53 (increased proliferative bias of >1% and>5% respectively). We observed that values of selective coefficients of individual genes varies between patients, likely because of interpatient difference in the precise location of point mutations, but potentially also because of interpatient variation in selective pressure from the microenvironment. Nevertheless, the comparative rank of pergene fitness coefficients was broadly consistent across patients (e.g. for missense mutations across patients, TP53 mutations always had the highest fitness, followed by NOTCH1 mutations). This consistency in selective coefficients is in agreement with the observation of highly recurrent gene mutations in cancer (Lawrence et al., 2013) and evidence of repeatability in cancer evolution (Caravagna et al., 2018).
Nonsynonymous NOTCH1 mutations were observed approximately 3fold more frequently that nonsynonymous TP53 mutations in oesophagus, and approximately 5fold more frequently in skin, suggesting that the mutation rate of NOTCH1 is greater than for TP53. Coupling these data with our quantitative measurements of the fitness coefficients leads to the prediction that the oesophagus will become transiently repopulated by NOTCH1mutant cells during ageing, before subsequent replacement by fitter mutantTP53 clones.
On a cautionary note, our theoretical work shows that the clonality of mutations strongly determine the observed value of dN/dS, and so a misleading picture of the selective forces will be produced if dN/dS frequencydependent effects are not corrected for. The accuracy of any estimate of evolutionary dynamics from dN/dS values is of course dependent on the underlying accuracy of the dN/dS measure itself, which is compromised by uncharacterised variability in the mutation rate across the genome (Van den Eynden and Larsson, 2017) and in the uncertain pathogenicity of individual single nucleotide variants (extensions to estimate site level selection coefficients may circumvent some of these issues [Cannataro et al., 2018; Temko et al., 2018]). We also recognize that our model assumes a wellmixed population, while the data used in our study is from spatially structured epithelia. Spatial structure influences the distribution of clone sizes. Effects include: the influence of the boundary that enable rapid growth of mutants on the expanding front and conversely ‘encapsulation’ within a growing mass that slows clone growth (Fusco et al., 2016; Chkhaidze et al., 2019), and clonal interference that slows the growth of two similarly fit competing clones (Martens et al., 2011; Hall et al., 2019).
Combining population genetics methods with comparative genomics is a powerful way to infer selection pressures in human somatic evolution, giving new insight into the fundamental parameters that determine evolutionary dynamics in health and disease.
Materials and methods
Oesophagus and skin data
Request a detailed protocolFor the oesophagus and skin data we used mutation calls provided by the original studies. In the oesophagus data when a mutation was present in multiple adjacent biopsies we used the sum of the mutation frequency times the area of the biopsies (2 mm^{2}) as our readout of clone size and performed the dN/dS analysis on a patient by patient basis.
dN/dS calculations
Request a detailed protocolFor calculating dN/dS ratios the dndscv R package was used, which calculates both global dN/dS ratios across the whole exome or a panel of genes as well as per gene dN/dS ratios using a covariate based model to infer dN/dS values with a limited number of mutations (Martincorena et al., 2017). We used the default settings of dndscv, using the default hg19 transcript reference provided by the package. dndscv can also take into account small insertions and deletions, which we included in our analysis. Therefore, where we refer to mutation this includes indels in addition to SNVs.
To calculate the interval dN/dS measure we took the clone size measurements and determined a low cutoff ${f}_{min}$ based on the minimum clone size. We then created a vector of clone sizes that covered the total range and calculated dN/dS between ${f}_{min}$ and all values of ${f}_{max}$. This allowed us to plot dN/dS as a function of ${f}_{max}$ and fit our interval dN/dS models. In our data, clone size is measured in units of area (mm^{2}).
Accurately estimating dN/dS from sequencing data of somatic tissues can be challenging due to the strong sequence context dependence of mutations and variability of mutation rates across the genome. To confirm our inferences were not dependent on the choice of dN/dS methodology we calculated dN/dS values using SSBdNdS and then fitted our model (Zapata et al., 2018). As SSBdNdS only uses SNVs, we reran dndscv after removing indels. Inferences on a patient by patient basis were highly consistent between the methods, see Figure 5—figure supplement 4. There was more variability at the gene level, perhaps due to differences in the approaches used to control for variability in mutation rates across the genome.
Model fitting
Request a detailed protocolWe used a maximum likelihood approach to fit our models to the data. Defining the observed interval dN/dS as $y$ and the model dN/dS as $\hat{y}\left(\theta \right)=\frac{{\mu}_{p}}{{\mu}_{d}}\frac{{\int}_{{f}_{min}}^{{f}_{max}}g\left(\theta ,{\mu}_{d},s,f\right)df}{{\int}_{{f}_{min}}^{{f}_{max}}g\left(\theta ,{\mu}_{p},s=0,f\right)df}$. First of all we define the residuals between the data and the model as $R=y\hat{y}\left(\theta \right)$. Assuming that the residuals are normally distributed with mean 0 we can write down the negative log likelihood (NLL) as
where N denotes the normal probability density function. We can then find the parameters $\theta $ that minimize the NLL and calculate confidence intervals on these estimates using the Fisher information matrix. When fitting to data we ensured that there were a minimum of 8 mutations and only included model fits with R^{2} > 0.6 for downstream analysis. We used a maximum likelihood approach over a Bayesian approach to fit the model because the integral of the clone size distribution does not have a closed form solution, making it unfeasible to use readily available MCMC samplers which we adopt later.
Interval dN/dS model
Request a detailed protocolFor the stem cell model, using Equations 2, 3, 4, 5, 6 in the main text, interval dN/dS is given by:
Where ${E}_{i}$ is the exponential integral ${E}_{i}\left(x\right)={\int}_{x}^{\infty}\frac{{e}^{n}}{n}dn$. $\rho $ is density of stem cells per mm^{2}, which we set to 5,000 cells /mm^{2} for fitting (Hall et al., 2019).
Simulations
Request a detailed protocolTo confirm the accuracy of our analytical model and investigate the influence of uncertainty in mutation frequencies due to sequencing noise and to challenge some of the underlying assumptions of our theoretical approach, we developed a simulation based model.
We seed a population of ${N}_{s}$ stem cells that then undergo loss/replacement as described by the following rate equations
As only the stem cells are long lived the differentiated cells are not explicitly modelled such that when a stem cell 'differentiates' it is effectively lost from the population. During division, daughter cells acquire mutations with a fitness effect at rate ${\mu}_{d}$ and passenger mutations at rate ${\mu}_{p}$. Fitness increases the bias toward selfproliferation $\mathrm{\Delta}$ of a stem cell lineage. Additional driver mutations do not further increase the fitness of stem cells.
To calculate dN/dS across a cohort of tissue biopsies we count the number of driver mutations ${N}_{d}$ and the number of passenger mutations, ${N}_{p}$ and then normalize by their respective mutation rates. In our model drivers = nonsynonymous and thus every driver has an effect on fitness. Then the ratio of these two numbers gives us the excess or deficit of mutations due to selection – ie the dN/dS ratio.
For the interval dN/dS we simply calculate the ${N}_{x}$ between ${f}_{min}$ and ${f}_{max}$.
To introduce uncertainty into mutation frequencies we perform a process of empirically motivated sampling to the true underlying frequency $f$. Firstly, we specify the average depth of sequencing D, then the depth of sequencing for mutation i is given by
The sampled number of read counts is then
And the sampled variant frequency is then ${f}_{s}={n}_{s}/{D}_{i}$.
The simulation framework was written in Julia (Bezanson et al., 2017) and is available at https://github.com/marcjwilliams1/StemCellModels.jl.
Fitting site frequency spectra
Request a detailed protocolTo fit the site frequency spectra we first removed mutations with clone size area < 0.008. This cutoff was determined by inspection of the point of highest density of the clone size area histograms, reasoning that below this frequency the data was limited due to the resolution of the sequencing assay. We then binned the data using a bin size of 0.005 and counted the number of mutations in each bin. We did this for each donor and separately for nonsynonymous and synonymous mutations. We then used the brms (Bayesian Regression Modeling with Stan) R package (Bürkner, 2017) to fit the following nonlinear model jointly across all patients:
With the parameters $\alpha =\frac{\mu {n}_{0}}{r\lambda \rho}$, and $\alpha =\mu {n}_{0}$ to be estimated and C being the number of mutations and A the area. We used the following priors for the parameters:
We ran 4 chains with 5000 iterations with the first 2500 used as warmup. To assess convergence we ensured that the scale reduction factor $\hat{R}$ (a measure of mixing of chains) was < 1.01 as recommended.
We also fitted the following two models:
We used the same prior distributions as above. We then compared the predictive accuracy of the different models using the widely applicable information criteria (WAIC) (Vehtari et al., 2017). This found that the functional form derived from theory provided the best fit, Figure 5b. Bayes factors to compare models were calculated using the bayestestR package (Makowski et al., 2019). The log10(BF) of the proposed theoretical model vs the power law was 52 and 44 when compared against the exponential model.
Bayesian multilevel regression
Request a detailed protocolWe performed a Bayesian multilevel regression of clone size ~ age with the gene as a random effect. This allowed us to determine which genes cause the largest increase in frequency as a function of age. This was done using brms in R with default priors, 4 chains and 5000 iterations. Using R’s statistical modelling syntax the model is given by clone size ~age + (1 + agegene). We fit the model assuming a normal distribution as well as a lognormal distribution, finding the latter to provide the best predictive accuracy (lowest WAIC) and superior posterior predictive check.
Site frequency spectra for a model of stem cell proliferation
Request a detailed protocolOur mathematical model of stem cell proliferation drew on results from a range of studies analysing the clone size distribution from lineage tracing experiments. Particularly useful are the results from Klein et al. (2010), which we took as a starting point for our model. We follow the notation used in this study to a large extent. Other studies that are also illuminating and relevant are the theoretical work of Nicholson and Antal (2016), while similar theoretical models have also been applied to the oesophagus (Doupé et al., 2012), airway epithelia (Teixeira et al., 2013) and the blood (Watson et al., 2019) amongst others. In this section we outline the key results relevant to our approach.
We begin with the set of rate equations presented in the main text:
From this we are interested in the clone size distribution, that is, the probability of observing a clone of size n after time t. In other fields such as population genetics, this distribution is equivalent to the site frequency spectrum. Given the above rate equations, we can express the dynamics as a birthdeath process with birth rate $b=r\lambda \left(1+\mathrm{\Delta}\right)$ and death rate $d=r\lambda \left(1\mathrm{\Delta}\right)$ which allows us to write down the master equation for the probability of observing a clone of size n at time t, $b=r\lambda (1\mathrm{\Delta})$.
This has the following solution (see Bailey, 1990 p92)
With ${p}_{0}\left(t\right)=\alpha \left(t\right)$ and $\alpha \left(t\right)$ defined as follows:
From these results we can obtain the average size of surviving clones, $\beta \left(t\right)=\frac{b\left({e}^{\left(bd\right)t}1\right)}{b{e}^{\left(bd\right)t}d}$:
Which, as will become apparent, gives a characteristic scale for the distribution.
Thus far we have only considered a single mutant at time 0, while we are interested in the case when mutants continually enter the population at a rate $N\left(t\right)=\sum _{n=0}^{\infty}n\times \frac{{p}_{n}}{1{p}_{0}}=\frac{1}{1\beta \left(t\right)}=\frac{\left(1+\mathrm{\Delta}\right){e}^{2r\lambda \mathrm{\Delta}t}(1\mathrm{\Delta})}{2\mathrm{\Delta}}$ where µ is the mutation rate per cell division and $n}_{0$ is the number of stem cells. To derive the clone size distribution in this case, we take the integral over time multiplied by the mutation rate.
We can approximate ${g}_{n}=\frac{\mu {n}_{0}}{r\lambda}{\int}_{0}^{t}{p}_{n}\left(\tau \right)d\tau =\frac{\mathrm{}\mu {n}_{0}}{r\lambda n}\mathrm{}{\beta \left(t\right)}^{n}$ which gives
The data we make use of doesn’t provide integer clone sizes but rather area of clones, so we can make the transformation ${g}_{n}\approx \frac{\mathrm{}\mu {n}_{0}}{r\lambda n}{e}^{\frac{n}{N\left(t\right)}}$:
This is the result presented in the main text.
Code and data availability
Request a detailed protocolCode to reproduce all the figures in the manuscript (using a snakemake [Köster and Rahmann, 2012] workflow) is available at github.com/marcjwilliams1/dndsclonesize (Williams, 2020; copy archived at https://elifesciencespublications/dndsclonesize). We also created a singularity image with all software dependencies which is available at shub://marcjwilliams1/dndsclonesizeRcontainer. Julia (Bezanson et al., 2017) was used for the simulations and R (R Development Core Team, 2019) was used to analyse the data and generate the figures. Some of the analysis rely in bespoke packages written for this study which are freely available under an open source licence.
Data availability
No new data was generated in this; only previously published data is reanalysed. Computer code implementing the new mathematical theory we developed is available here: https://github.com/marcjwilliams1/dndsclonesize (copy archived at https://github.com/elifesciencespublications/dndsclonesize).

European Genomephenome ArchiveID EGAD00001004158. EGAD00001004158.

European Genomephenome ArchiveID EGAD00001004159. EGAD00001004159.

European Genomephenome ArchiveID EGAS00001000860. EGAS00001000860.

European Genomephenome ArchiveID EGAS00001000515. EGAS00001000515.

European Genomephenome ArchiveID EGAS00001000603. EGAS00001000603.
References

Julia: a fresh approach to numerical computingSIAM Review 59:65–98.https://doi.org/10.1137/141000671

Quantifying clonal and subclonal passenger mutations in Cancer evolutionPLOS Computational Biology 12:e1004731.https://doi.org/10.1371/journal.pcbi.1004731

Brms: an RPackage for bayesian multilevel models using stanJournal of Statistical Software 80:1–28.https://doi.org/10.18637/jss.v080.i01

Effect sizes of somatic mutations in CancerJNCI: Journal of the National Cancer Institute 110:1171–1177.https://doi.org/10.1093/jnci/djy168

Population genetics of neutral mutations in exponentially growing Cancer cell populationsThe Annals of Applied Probability 23:230–250.https://doi.org/10.1214/11AAP824

The distribution of fitness effects of new mutationsNature Reviews Genetics 8:610–618.https://doi.org/10.1038/nrg2146

New insights from studies of clonal hematopoiesisClinical Cancer Research 24:4633–4642.https://doi.org/10.1158/10780432.CCR173044

Relating evolutionary selection and mutant clonal dynamics in normal epitheliaJournal of the Royal Society 16:20190230.https://doi.org/10.1098/rsif.2019.0230

Universal patterns of stem cell fate in cycling adult tissuesDevelopment 138:3103–3111.https://doi.org/10.1242/dev.060103

Snakemakea scalable bioinformatics workflow engineBioinformatics 28:2520–2522.https://doi.org/10.1093/bioinformatics/bts480

The population genetics of dN/dSPLOS Genetics 4:e1000304.https://doi.org/10.1371/journal.pgen.1000304

bayestestR: describing effects and their uncertainty, existence and significance within the bayesian frameworkJournal of Open Source Software 4:1541–1548.https://doi.org/10.21105/joss.01541

Spatial structure increases the waiting time for CancerNew Journal of Physics 13:115014.https://doi.org/10.1088/13672630/13/11/115014

Why time matters: codon evolution and the temporal dynamics of dN/dSMolecular Biology and Evolution 31:212–231.https://doi.org/10.1093/molbev/mst192

Universal asymptotic clone size distribution for general population growthBulletin of Mathematical Biology 78:2243–2276.https://doi.org/10.1007/s115380160221x

Estimating the distribution of selection coefficients from phylogenetic data with applications to mitochondrial and viral DNAMolecular Biology and Evolution 20:1231–1239.https://doi.org/10.1093/molbev/msg147

R: A language and environment for statistical computingR: A language and environment for statistical computing, Vienna, Austria, https://www.Rproject.org.

A big bang model of human colorectal tumor growthNature Genetics 47:209–216.https://doi.org/10.1038/ng.3214

Practical bayesian model evaluation using leaveoneout crossvalidation and WAICStatistics and Computing 27:1413–1432.https://doi.org/10.1007/s1122201696964

Bayesian inference of negative and positive selection in human cancersNature Genetics 49:1785–1788.https://doi.org/10.1038/ng.3987

Identification of neutral tumor evolution across Cancer typesNature Genetics 48:238–244.https://doi.org/10.1038/ng.3489

Measuring clonal evolution in Cancer with genomicsAnnual Review of Genomics and Human Genetics 20:309–329.https://doi.org/10.1146/annurevgenom083117021712

The ecology and evolution of Cancer: the UltraMicroevolutionary processAnnual Review of Genetics 50:347–369.https://doi.org/10.1146/annurevgenet112414054842

Likelihood models of somatic mutation and Codon substitution in Cancer genesGenetics 165:695–705.
Article and author information
Author details
Funding
Wellcome (202778/B/16/Z)
 Andrea Sottoriva
Wellcome (202778/Z/16/Z)
 Trevor A Graham
Wellcome (209409/Z/17/Z)
 Chris P Barnes
Wellcome (105104/Z/14/Z)
 Andrea Sottoriva
Cancer Research UK (A22909)
 Andrea Sottoriva
Cancer Research UK (A19771)
 Trevor A Graham
H2020 Marie SkłodowskaCurie Actions (846614)
 Luis Zapata
National Institutes of Health (CA217376)
 Andrea Sottoriva
 Trevor A Graham
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
AS is supported by the Wellcome Trust (202778/B/16/Z) and Cancer Research UK (A22909). TG is supported by the Wellcome Trust (202778/Z/16/Z) and Cancer Research UK (A19771). We acknowledge funding from the National Institute of Health (NCI U54 CA217376) to AS and TG This work was also supported a Wellcome Trust award to the Centre for Evolution and Cancer (105104/Z/14/Z). CPB is supported by the Wellcome Trust (209409/Z/17/Z).
Version history
 Received: May 23, 2019
 Accepted: March 9, 2020
 Version of Record published: March 30, 2020 (version 1)
Copyright
© 2020, Williams et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 6,013
 views

 467
 downloads

 30
 citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Computational and Systems Biology
Transcriptomic profiling became a standard approach to quantify a cell state, which led to accumulation of huge amount of public gene expression datasets. However, both reuse of these datasets or analysis of newly generated ones requires significant technical expertise. Here we present Phantasus  a userfriendly webapplication for interactive gene expression analysis which provides a streamlined access to more than 96000 public gene expression datasets, as well as allows analysis of useruploaded datasets. Phantasus integrates an intuitive and highly interactive JavaScriptbased heatmap interface with an ability to run sophisticated Rbased analysis methods. Overall Phantasus allows users to go all the way from loading, normalizing and filtering data to doing differential gene expression and downstream analysis. Phantasus can be accessed online at https://alserglab.wustl.edu/phantasus or can be installed locally from Bioconductor (https://bioconductor.org/packages/phantasus). Phantasus source code is available at https://github.com/ctlab/phantasus under MIT license.

 Computational and Systems Biology
 Evolutionary Biology
A comprehensive census of McrBC systems, among the most common forms of prokaryotic Type IV restriction systems, followed by phylogenetic analysis, reveals their enormous abundance in diverse prokaryotes and a plethora of genomic associations. We focus on a previously uncharacterized branch, which we denote coiledcoil nuclease tandems (CoCoNuTs) for their salient features: the presence of extensive coiledcoil structures and tandem nucleases. The CoCoNuTs alone show extraordinary variety, with three distinct types and multiple subtypes. All CoCoNuTs contain domains predicted to interact with translation system components, such as OBfolds resembling the SmpB protein that binds bacterial transfermessenger RNA (tmRNA), YTHlike domains that might recognize methylated tmRNA, tRNA, or rRNA, and RNAbinding Hsp70 chaperone homologs, along with RNases, such as HEPN domains, all suggesting that the CoCoNuTs target RNA. Many CoCoNuTs might additionally target DNA, via McrC nuclease homologs. Additional restriction systems, such as Type I RM, BREX, and Druantia Type III, are frequently encoded in the same predicted superoperons. In many of these superoperons, CoCoNuTs are likely regulated by cyclic nucleotides, possibly, RNA fragments with cyclic termini, that bind associated CARF (CRISPRAssociated Rossmann Fold) domains. We hypothesize that the CoCoNuTs, together with the ancillary restriction factors, employ an echeloned defense strategy analogous to that of Type III CRISPRCas systems, in which an immune response eliminating virus DNA and/or RNA is launched first, but then, if it fails, an abortive infection response leading to PCD/dormancy via host RNA cleavage takes over.