Measures of genetic diversification in somatic tissues at bulk and single-cell resolution

  1. Marius E Moeller
  2. Nathaniel V Mon Père
  3. Benjamin Werner  Is a corresponding author
  4. Weini Huang  Is a corresponding author
  1. Department of Mathematics, Queen Mary University of London, United Kingdom
  2. Evolutionary Dynamics Group, Centre for Cancer Genomics and Computational Biology, Barts Cancer Centre, Queen Mary University of London, United Kingdom
  3. Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles, Belgium
  4. Group of Theoretical Biology, The State Key Laboratory of Biocontrol, School of Life Science, Sun Yat-sen University, China

Abstract

Intra-tissue genetic heterogeneity is universal to both healthy and cancerous tissues. It emerges from the stochastic accumulation of somatic mutations throughout development and homeostasis. By combining population genetics theory and genomic information, genetic heterogeneity can be exploited to infer tissue organization and dynamics in vivo. However, many basic quantities, for example the dynamics of tissue-specific stem cells remain difficult to quantify precisely. Here, we show that single-cell and bulk sequencing data inform on different aspects of the underlying stochastic processes. Bulk-derived variant allele frequency spectra (VAF) show transitions from growing to constant stem cell populations with age in samples of healthy esophagus epithelium. Single-cell mutational burden distributions allow a sample size independent measure of mutation and proliferation rates. Mutation rates in adult hematopietic stem cells are higher compared to inferences during development, suggesting additional proliferation-independent effects. Furthermore, single-cell derived VAF spectra contain information on the number of tissue-specific stem cells. In hematopiesis, we find approximately 2 × 105 HSCs, if all stem cells divide symmetrically. However, the single-cell mutational burden distribution is over-dispersed compared to a model of Poisson distributed random mutations. A time-associated model of mutation accumulation with a constant rate alone cannot generate such a pattern. At least one additional source of stochasticity would be needed. Possible candidates for these processes may be occasional bursts of stem cell divisions, potentially in response to injury, or non-constant mutation rates either through environmental exposures or cell-intrinsic variation.

eLife assessment

In this paper, the authors introduce fundamental work on mathematical methods for inferring evolutionary parameters of interest from RNA data in healthy tissue and during hematopoiesis. By combining single cell and bulk sequencing analyses, the authors use a stochastic process to inform different aspects of genetic heterogeneity; the strength of evidence in support of the authors' claim is exceptional. The work will be of broad interest to cell biologists and theoretical biologists.

https://doi.org/10.7554/eLife.89780.3.sa0

Introduction

Intra-tissue genetic heterogeneity emerges from a multitude of dynamical processes (Turajlic et al., 2019; Black and McGranahan, 2021). Cells randomly accumulate mutations, they self-renew, differentiate, or die and clones of varying fitness may compete and expand (Watson et al., 2020; Rulands et al., 2018; Werner et al., 2020; Gunnarsson et al., 2021). These core evolutionary principles form the basis for describing the aging of somatic tissues, tumor initiation, and tumor progression (Martincorena, 2019; Greaves and Maley, 2012; Cagan et al., 2022; Martincorena et al., 2018; Mitchell et al., 2022; Abascal et al., 2021). Naturally, there is great interest in understanding precisely how these dynamical processes act. This typically involves hypothesizing models of cell behavior, and deriving quantitative estimates of their underlying physical parameters (Watson et al., 2020; Werner et al., 2020; Poon et al., 2021; Martincorena et al., 2017; Chatzeli and Simons, 2020; Williams et al., 2020; Durrett, 2013). These could by quantities such as mutation rates per cell division, active stem cell numbers, symmetric and asymmetric division rates, etc.,. Some such parameters are already well characterized in certain tissues. We know for example that during early development healthy somatic cells accumulate on average between 1.2 and 1.3 mutations per genome per division (Werner et al., 2020; Lee-Six et al., 2018). Other parameters, such as the number of stem cells in a tissue type, are well known in some tissues but harder to quantify in others.

In general, a lack of time-resolved information complicates evolutionary inferences, and we must often rely on indirect information, for example, the patterns of tissue-specific genetic heterogeneity within and across individuals (Bailey et al., 2021). These observed patterns of genetic heterogeneity emerge from the underlying stochastic processes and can strongly depend on the measurement technique (Caravagna et al., 2020). Both enforce specific limitations on the inferability of the underlying dynamics. Different types of genomic data – such as bulk and single-cell whole genome sequencing – may contain different information on the system, and each comes with their own limitations. When bulk sequencing data is used for inference, some evolutionary parameters such as the population size and proliferation rates of stem cells are entangled and cannot be obtained in isolation (Watson et al., 2020; Williams et al., 2020). If on the other hand, single-cell data is employed, some additional information becomes available, for example the individual cell mutational burdens and the co-occurrence of mutations. However, often the sampling size is sparse compared to the presumed number of long-term proliferating cells in many tissues (Lee-Six et al., 2018; Salehi et al., 2021; Lim et al., 2020), which in turn introduces other constraints. Here, we show that combining such data obtained from different resolutions (bulk or single-cell) can help to overcome limitations and further narrow possible ranges of inferred parameters. Here, we develop concrete mathematical and computational models for extracting evolutionary parameters from bulk and single-cell information and apply these methods to whole genome bulk sequencing data in healthy esophagus (Martincorena et al., 2018) and whole genome single-cell sequencing data in healthy hematopoiesis (Lee-Six et al., 2018).

Results

A stochastic model of mutation accumulation in healthy somatic tissue

We model the stem cell dynamics of healthy tissue as a collection of individual cells that divide, differentiate, and die stochastically at predefined rates (see Figure 1). Novel mutations can occur with every cell division, each daughter acquiring a random number drawn from a Poisson distribution with rate μ. We explicitly include symmetric and asymmetric stem cell divisions at potentially different rates, the former resulting in two stem or two differentiated cells, and the latter in one stem and one differentiated cell. Since differentiated cells are lost from the stem cell pool they are functionally equivalent to cell death. While only symmetric divisions allow variants to change in frequency, both division types introduce novel variants into the population and thus both contribute to the evolving heterogeneity.

The distribution of variant allele frequencies changes with the growth phases and by sampling.

(a) In the current population, cells divide symmetrically into two daughter cells or asymmetrically with only one daughter cell kept in the focused population. All other events are mathematically equivalent to and are treated as a part of cell death. (b) The rates of symmetric and asymmetric division change during the population growth and lead to a dynamic distribution of variant allele frequencies. (c) The observed VAF distribution is shifted again during sampling compared to the VAF of the whole population – a fact should be considered when inferring population properties through genetic data.

Mutation accumulation occurs through all developmental stages from conception to death (Spencer Chapman et al., 2021). We, therefore, consider three phases of demographic changes of the stem cell population. In the first early developmental phase the stem cell population rapidly expands from a single-cell exclusively by symmetric divisions at rate γ. This is followed by a growth and maintenance phase wherein the population continues to expand but also undergoes turnover through asymmetric divisions at rate ϕ, as well as cell removal (due to differentiation or death) and replacement at rate ρ. In the final mature phase, cell turnover continues but the size of the total stem population N remains constant.

We employ two independent implementations of our theoretical model. One is a direct stochastic simulation utilizing a Gillespie algorithm. This approach creates a simulated dataset containing all single-cell or bulk sequencing properties, however, it can be computationally expensive. In addition, we compute the time-dynamical expected value of the distribution of variant allele frequencies (the VAF spectrum) directly, which we find obeys the partial differential equation

(1) tV(κ,t)=κA(κ,t)V(κ,t)+κ2B(κ,t)V(κ,t)+C(t)δ(κ1),

where κ=fN(t) denotes the number of cells sharing a variant (the variant frequency f times the total population size N), δ(x) is the Dirac impulse function, t, and κ are the partial derivatives with respect to time and variant size, and

A(κ,t)=γκ,B(κ,t)=κ(1κ/N(t))ρ+γκ/2,C(t)=2μN(t)(ρ+γ+ϕ/2),

This set of equations allows for computationally efficient numerical solutions (SI 1.3).

Transition of developmental growth and constant population size signatures in bulk whole genome sequencing data of healthy adult esophagus

We first discuss established properties of the VAF spectrum relating to somatic tissues. In certain model systems it is given by a power law fα with a critical exponent α. The value of the exponent depends on the demographic dynamics of the population. For a well-mixed exponentially growing population without cell death the VAF spectrum v(f) is given by 2μ/(f+f2) (a f-2 power law) and is independent of time (Gunnarsson et al., 2021). In contrast, for a population of constant size – i.e., where birth and death rates are equal – the spectrum obeys v(f)2μ/f(Durrett, 2015) (a f-1 power law; see also SI 1.5.1), though this solution is only valid at sufficiently long times. In the following, we focus on the dynamics of genetic diversification in healthy tissues and show that the expected VAF spectrum contains both contributions of growing and constant population dynamics.

Healthy adult somatic tissues are thought to initially expand rapidly during ontogenic growth, continue to expand during infancy and childhood, and reach homeostasis by adulthood. It is thus natural to ask what the expected VAF spectrum would look like in tissues experiencing homeostasis after a period of growth. We first investigated the underlying VAF dynamics in a minimal theoretical model where the population switches from pure exponential growth to constant size once a maximal cell number is reached. Numerical solutions of Equation 1 show that the expected VAF distribution exhibits a gradual transition from the f-2 (growing population) to the f-1 (constant population) power law (Figure 2a). These transitional states themselves do not adhere to some intermediate power law (e.g. f-α for 1<α<2), but instead present a sigmoidal shape, with the low-frequency portion following f-1 and the high frequencies f-2. Over time the shape changes as a wavelike front traveling from low to high frequency, with the constant-size equilibrium establishing earliest at the lowest frequencies and moving to higher frequency overtime. Interestingly, the convergence towards equilibrium slows down over time – for evenly-spaced observation times the solutions lie increasingly closer together – further decreasing the speed at which the high frequency portion of the spectrum approaches equilibrium. More complex forms of such demographic changes, for example, the inclusion of a mixed growth-and-maintenance phase or a logistically growing population, result in qualitatively similar behavior (see Figure 2—figure supplement 1).

Figure 2 with 1 supplement see all
Bulk sequencing based variant allele frequency (VAF) and mutation rate inferences in healthy esophagus.

(a) Expected VAF distributions from evolving Equation 1 to different time points for a population with an initial exponential growth phase and subsequent constant population phase (mature size N=103). Once the population reaches the maximum carrying capacity, the distribution moves from a 1/f2 growing population shape (purple) to a 1/f constant population shape (green). Note that the shift slows considerably at older age. (b) VAF from healthy tissue in the esophagus of nine individuals sorted into age brackets. The youngest bracket, 20–39, is closer to the developmental 1/f2 scaling. The older age brackets are both close to the constant population 1/f scaling, resembling the theoretical expectations. (c) Inferred mutation rates increase linearly with age. (d) Simulations of slowly growing stem cell populations reveal that mutation rates appear to increase with age, although the true underlying per division mutation rate remaining constant (see Figure 2—figure supplement 1 as well).

To test whether adult human tissues show such transitional signatures, we grouped the VAFs of healthy adult esophagus samples from Martincorena and colleagues (Martincorena et al., 2018) into three age groups (young, middle. and old, Figure 2b). In accordance with our theoretical expectations, the averaged VAF spectrum of the young donor group is closest to the expected f-2 distribution of ontogenic growth, with low frequency variants only starting to approach the f-1 homeostatic scaling. The averaged spectra of the middle and old age groups on the other hand notably shift towards the expected f-1 homeostatic line. Interestingly, while there is a clear separation between the spectra of the youngest and middle age groups, those of the middle and older groups are much closer. This agrees with our prediction that the speed of convergence of the spectra towards homeostatic scaling slows down with age.

VAF-based mutation rate inferences in healthy tissues

The theoretical prediction for the VAF spectrum of an exponentially growing population, which is recovered in many cancers given sufficient sequencing depth (Caravagna et al., 2020), contains information on the effective mutation rate μ (Williams et al., 2016). Similarly, the VAF spectrum of a constant population includes the same mutation rate μ (SI). Thus we can in principle infer μ from the previously described esophagus data, for example by applying a regression approach. However, since the scaling of the VAF spectrum of healthy tissues can be shown to be age-dependent (Figure 2a–b), we use two reference shapes for a growing and a constant population, respectively. The fitting is performed by a linear regression least-squares approach. The general solution is computed from the VAF spectrum of the data, Vd(μ), and the reference shape Vr, which assumes μ=1, as follows:

(2) μ=Cov(Vd(μ),Vr)Var(Vr)

If Vd(μ) has the shape of Vr, then E(Vd(μ))=μE(Vr), which can be used to prove Equation 2:

(3) Cov(Vd(μ),Vr)=E(Vd(μ),Vr)E(Vd(μ))E(Vr)=μ(E(Vr2)E(Vr)2)=μVar(Vr)

The VAF shapes for the growing and constant population give close estimates, as can be seen in Figure 2, which compares the age of all individuals with corresponding inferred effective mutation rates. The estimates are in the range of 1–2 mutations per genome per cell division and agree with the previous observations based on early developmental stem cell divisions (Lee-Six et al., 2018). Estimates based on a growing population are slightly higher compared to estimates based on a constant population, but differences are small.

Surprisingly, we observe a clear trend of a linearly increasing effective mutation rate with age. This trend cannot be explained by the transition of variants from a growing into a constant population alone. An actual increase in the true mutation rate per cell division with age would be a direct explanation. However, this seems unlikely: It is well established that the mutational burden across individuals and tissues increases linearly with age (Cagan et al., 2022; Mitchell et al., 2022; Williams et al., 2022), and an increasing mutation rate would in contrast result in an accelerated increase of mutational burden with age, which is not observed experimentally.

An alternative explanation is a continued slow linear increase of the stem cell population with age (Figure 2d). In such a scenario, the effective mutation rate would appear to increase linearly with age as well, although the true mutation rate per cell division remains unchanged. At first glance, a linearly increasing stem cell population appears unnatural. However, such a linear increase would emerge from a small bias of stem cells towards self-renewal (growth) and concomitantly a slowdown of the stem cell proliferation rate proportional to the change in population size (Werner et al., 2015). The latter is a natural feedback mechanism ensuring a constant output of differentiated somatic cells. Direct in vivo imaging in multiple human tissues including esaphagus seem to support an increased density and slower proliferation of stem cells with age (Tomasetti et al., 2019). This explanation is attractive for another reason: A feedback of decreased cell proliferation with increasing population size would also function as a tumor suppressor mechanism, and maybe another reason cancer driver mutations in healthy tissues are abundant but rarely progress to cancer (Martincorena et al., 2018; Martincorena et al., 2015). Progression would require additional genomic or environmental changes to overcome this regulatory feedback.

Single-cell mutational burden allows sample size independent inferences of stem cell proliferation and mutation rates

With the increasing availability of single-cell data, it becomes possible to investigate distributions of certain quantities for which bulk sequencing only provides average measurements. One such is the number of mutations present in individual cells. Recent work by Lee-Six and colleagues (Lee-Six et al., 2018) showed that in a 59-year-old donor hematopoietic stem cells had accumulated on average around 1000 mutations per cell; however, there was significant variation between individual genomes, some carrying 900 mutations, while others up to 1200. This variation is the result of the stochastic nature of mutation accumulation and cell division, and can be exploited to estimate per division mutation and stem cell proliferation rates. To this end, we look at the distribution of the number of mutations per cell, which we will from here on refer to as the mutational burden distribution. Formally, we describe the mutational burden mj in a cell j by a stochastic sum over its past divisions mj=iyjuij, in which both the total number of divisions yj in the cell’s past and the number of mutations per division uij are random variables. Even without knowing the actual distributions of y and u, general expressions for the expectation and variance of mj that only depend on the expectation and variance of y and u exist and are given by E(m)=E(y)E(u) and Var(m)=E(y)Var(u)+E(u)2Var(y) (SI 1.7). These expressions allow direct estimates of the effective division and mutation rate. Assuming the mutation rate per daughter cell is Poisson distributed with expectation μ, the average number of divisions per lineage — expressed in terms of an effective division rate λ — becomes

(4) λ(t)dt=E(y)=E(m)/μ

In addition, if the mutation rate μ is unknown, it can similarly be obtained from

(5) μ=(Var(m)E(m)1)E(y)Var(y).

We first show that these estimators recover per division mutation and proliferation rates in stochastic simulations (Figure 3a and b). More importantly, they only require information on comparably few single cells. Approximately, 100 cells are sufficient to reliably reconstruct the mutational burden distribution, which for example would constitute a small sample of the hematopoietic stem cell pool. In fact, the inference does not significantly improve if we increase data resolution beyond a few hundred single cells (Figure 3a and b). This is in stark contrast to inferences based on single-cell phylogeny: Although they contain more information in principle, important aspects of a single-cell phylogeny, e.g., the number of observable branchings are strongly affected by sampling and stem cell population size.

Figure 3 with 3 supplements see all
Inference of evolutionary parameters on simulated stem cell populations.

Simulated populations were run up to age 59, growing exponentially from a single-cell to constant size NM=10000 at age tM=5, with mutation rate μ=1.2 and division rates λ=5 and p=0.4. Where sampling is mentioned, the sample size 89 was taken. (a) The single-cell mutational burden distribution. The compound Poisson distribution (dashed line) matches the burden distribution when averaging over multiple independently evolved populations (filled curve). (b) Distribution of estimated mutation rates from 10’000 individual simulations, obtained from burden distributions of the complete populations (blue) as well as sampled sets of cells (orange). Because the expected mutational burden distribution is unaltered by sampling, the expected estimate of the mutation rate from Equation 5 remains unchanged: E(μ~pop)=E(μ~sample). However, sampling increases the noise on the observed burden distribution, which results in a higher error margin of the estimate: σ(μ~pop)<σ(μ~sample). (c) VAF spectra measured in the complete population (blue) and a sampled set of cells (orange). In contrast with the mutational burden distribution, strong sampling changes the shape of the expected distribution. A single simulation result is shown (diamonds) alongside the theoretically predicted expected values for both the total and sampled populations (Equations 1 and 12) (dashed line) and the average across 100 simulations (solid line). (d) Distribution of NM and p inference results for 100 simulated and sampled populations, through estimation of μ~ and λ~ from the single-cell burden distribution and fitting the number of lowest frequency (1/S) mutations to the theoretical prediction in Equation 1 (see Figure 3—figure supplements 13 as well).

We apply our estimators to single-cell mutational data in blood obtained from the study by Lee-Six et al., 2018, in which 89 hematopoietic stem cells (HSCs) were extracted from a bone marrow aspirate of a 59-year-old individual (Figure 4a). Using Equation 4 to estimate the expected number of divisions per lineage from the sample burden (SI 1.8.1), we find a total division rate (including both symmetric and asymmetric divisions) in the constant population phase of λ=10.6 (7.6-15.3) divisions per HSC per year, which is within ranges previously suggested (Watson et al., 2020; Dingli and Pacheco, 2006). Notably, the single-cell mutational burden alone cannot disentangle symmetric and asymmetric division rates. Applying Equation 5 to the sample burden distribution and taking Var(y)/E(y)=1 (akin to assuming exponentially distributed division times, a common but simplified null model) we obtain an estimated mutation rate of μ~=4.3 per cell per division (Figure 4a). This is significantly higher compared to the 1.2 mutations per division suggested previously (Werner et al., 2020; Lee-Six et al., 2018). We posit different possible explanations for this discrepancy. Stem cell divisions and mutation accumulation are stochastic processes, therefore variation between individuals is expected. However, simulations of our model for a wide range of parameter values suggest the distribution of these inferred mutation rates is unlikely to present a wide enough variance to explain this difference (Figure 3—figure supplement 2). Recently, Mitchell and colleagues (Abascal et al., 2021) suggested that many mutations in somatic tissue are acquired independently of cell divisions. In fact, the previous estimates of 1.2–1.3 mutations per genome per division were derived from early developmental cell divisions where the effects of aging would be negligible. The increased mutation rate inferred here could in part be explained by proliferation-independent effects. However, it is important to note that the true distribution of divisions per lineage may also be over-dispersed compared to the Poisson model (i.e. Var(y)>E(y)), which would lead to an overestimation of μ in Equation 5. Furthermore, the observed burden distribution is incompatible with a simple division-independent mutation model, which would lead to a Poisson distribution of the burdens that has a variance much smaller than what we observed (Figure 4a). A combination of division-independent mutations together with non-Poissonian cell divisions could possibly reproduce the result.

Figure 4 with 1 supplement see all
Evolutionary inferences in single-cell hematopoietic stem cell (HSC) data.

(a) The single-cell mutational burden distribution of the data (bars) and the compound Poisson distribution obtained from its mean and variance, used to obtain the estimated per division mutation rate μ~. (b) Distribution of mutation frequencies of the data and theoretically predicted average fitted to only the lowest frequency (1/S) data point. (c) Difference Δvf between the measured value of the VAF spectrum at the lowest frequency (1/S) and its prediction from Equation 1, for varying total population size N and asymmetric division proportion p, with fixed maturation time tM=5 and operational hematopoietic population size NH=50. The solid line denotes the plane of best fit where this difference is 0. (d) Maximally inferred population size N (taking p=0 in (c)) for variation of the maturation time tN and the operational hematopoietic population size NH (see Figure 4—figure supplement 1 as well).

Sparse sampling, single-cell derived VAF spectra, and evolutionary inferences

Current methods limit us to whole genome information of at most a few thousand single cells per tissue or tumor (Lim et al., 2020). In many situations, this will only constitute a small fraction of the underlying long-term proliferating cell population. For example, although the exact number of HSCs remains unknown, most recent estimates suggest a population size of 105 cells (Watson et al., 2020; Lee-Six et al., 2018). In tumors, this number can be as large as 1010 cells (Werner et al., 2016). As we have shown in the previous paragraph, the single-cell mutational burden distribution allows certain inferences even if sampling is sparse. If we, however, construct the VAF spectrum from such a small sample, the distribution of variants will be significantly transformed with respect to that of the total population (Figure 3c). Since Equation 1 gives the expected curve measured from the total population, one must include a correction, which is given by the expectation of hypergeometric sampling (i.e. without replacement). This can be obtained through the transformation (see Sampling affects the observed VAF distribution).

(6) V~(i)=j=iNV(j)(ji)(NjSi)/(NS)

with V~ the VAFs observed in the sample and S the sample size. Furthermore, from stochastic simulations we note that the variance in the distribution increases with variant frequency f, making the lowest frequency state (i.e. 1/S) the best candidate for comparative fitting and evolutionary inferences (see Figure 4—figure supplement 1).

Taking the mutation rate from Werner et al., 2020 and the total division rate obtained from the burden distribution as constant, we explored solutions of Equation 1 (sampled by Equation 12) for a wide range of realistic values in the remaining parameter space {N,p,tM,NH}. Comparing the lowest sampled frequency state (f=1/89) with the data (Figure 4b) we identify a curve in the space of the numbers of HSCs N and p (fraction of divisions that are asymmetric) where theoretically predicted VAF distributions are identical (Figure 4c). Interestingly, this curve in N-p space naturally identifies a maximal stem cell population size capable of producing the data, which occurs when stem cells divide exclusively symmetrically (p=0). Variations of tM (the time until maturity of the HSC population) and NH (the HSC pool size at which maintenance begins to occur) had only a small effect on this inference (Figure 4d). This is partly because mutations arising during these phases are more likely to be found at higher frequencies upon measurement (Poon et al., 2021; Werner, 2021). We find the data to be congruent with an adult HSC pool of at most 200’000–300’000 cells, depending on the exact timings of the maturation phase (Figure 4d). This estimate agrees with the original study and other independent inferences based on population data (Watson et al., 2020; Lee-Six et al., 2018). However, this upper bound corresponds to the case of exclusively symmetric stem cell divisions. Accounting for the possibility of mutating asymmetric divisions reduces the estimated stem cell pool. For example, the data would be consistent with 50’000 stem cells if 90% (p=0.9) of all stem cell divisions are asymmetric. In principle, even smaller stem cell pool sizes are possible, though decreasing an order of magnitude would imply an extremely large portion (gt99%) of asymmetric HSC divisions. We note that in the data non-developmental branchings are observed, which immediately implies that not all HSC divisions can be asymmetric (p<1) and thus a scenario of an extremely small population of only a handful of HSCs maintaining hematopoisis can be rejected. With an orthogonal population-based method, Watson and colleagues estimated 25’000 as a reasonable lower bound for the number of HSCs. Based on our inference, this would imply 0.95 as an upper bound for the proportion of all stem cell divisions that are asymmetric.

Discussion

Here, we have shown that single-cell and bulk whole genome sequencing can inform on different aspects of somatic evolution. Single-cell information does support previous bulk-based estimates of possible ranges of HSC numbers. Surprisingly, the same single-cell data leads to a mutation rate inference during homeostasis is approximately four times higher compared to previous developmental estimates.

Another open question is the role of selection and how it shapes intra-tissue genetic heterogeneity. Evidence is emerging that positively selected variants in blood are almost universally present in individuals above 60, while the effective observable dynamics in younger individuals is well described by neutral dynamics. How results presented here generalize or modify will critically depend on the model of selection realized in human hematopoiesis, e.g., a model of rare or frequent driver events. Details of the underlying biology are currently unknown.

An important recent study has suggested an amended model of mutation accumulation in somatic tissues, wherein the majority of mutations are acquired continuously over time throughout a cell’s lifespan rather than during DNA replication (Abascal et al., 2021). We note that for a time-dependent rate of mutation accumulation that is constant across all cells at a given point in time, this model predicts Poisson distributed single-cell mutational burdens. In contrast, we find the single-cell mutational burden distribution in human HSCs to be highly over-dispersed compared to a Poisson model. A time-associated model of mutation accumulation with a constant rate alone cannot generate such a pattern. At least one additional source of stochasticity would be needed. This could be fluctuations in the time dependent mutation rates, non-constant cell proliferation rates, e.g., bursts of stem cell divisions potentially in reaction to injury or disease, or a combination of both effects. Either way, our observations suggest that our current theoretical models of somatic evolution in healthy tissues are incomplete. There is evidence for unknown processes contributing to intra-tissue genetic heterogeneity. The precise nature of these processes remains an open question.

Materials and methods

Model system

Request a detailed protocol

We model a dynamical population of N(t) cells, each able to carry multiple mutations. It evolves by the occurrence of divisions — which replace cells in the pool with their (potentially mutated) daughter cells — and removals — which can correspond to differentiation or any form of cell inactivation. As outlined in Figure 1, we model different types of divisions affecting the system:

  1. A symmetric (non-differentiating) division replaces a cell in the population with two daughter cells, each of which maintains the mutational profile of its parent and acquires a Poisson distributed number of new mutations.

  2. A symmetric differentiation removes a cell from the population, as both of its daughter cells have differentiated.

  3. an asymmetric division replaces a cell with a single daughter cell — as the other daughter has differentiated — and also acquires a Poisson distributed number of mutations.

The rates at which these divisions occur depends on what stage of development the population is in. A simplest approach is to consider two broad phases of evolution: an initial growth phase where N(t) increases according to some growth function, and a mature phase at a time tM where N(t) is constant and stem cell divisions occur only to maintain homeostasis of the population. While the earliest stage of population expansion can reasonably be approximated as purely developmental — i.e., where little to no turnover occurs in the population — maintenance of the functional population (e.g. in the blood system) will at some point be required even as the stem cell pool continues to grow. Thus we also include a mixed phase occurring before maturity but only after the stem cell pool has reached a size NH, where stem cells can differentiate and homeostasis must be maintained. Note that this size is expected to be relatively small, as the earliest hematopoietic cells appear 16 days post-conception (Ivanovs et al., 2017), and peripheral blood is already driven mostly by functional HSC’s by 14 weeks post conception (Spencer Chapman et al., 2021). Constancy of the population size in the mature phase is ensured by applying a Moran model of divisions (Ewens, 2004a), where each non-differentiating symmetric division — which results in two daughter cells with the HSC phenotype — is simultaneously accompanied by a removal, either through a symmetric differentiation or some form of cell inactivation. Thus in our model, we distinguish three types of events which occur independently with exponentially distributed waiting times:

  1. Growth divisions, which are non-differentiating symmetric divisions (i) occurring at rate γ, and cause the population to increase in size by one cell;

  2. Moran events, which consist of the simultaneous occurrence of a non-differentiating symmetric division (i) (which causes the population to increase in size by 1) and a cell removal (ii) (which causes the population to decrease in size by 1) either through symmetric differentiation or inactivation. These events occur at rate ρ;

  3. Asymmetric divisions (iii), in which one of the daughter cells is differentiated and the other is not, occurring at rate ϕ.

In the early developmental phase there are only growth divisions; during the mixed phase, growth divisions occur alongside Moran events and asymmetric divisions; and in the mature phase, only the Moran and asymmetric division events occur. Note that under these dynamics the rates of the different types of divisions changes across phases: the rate of non-differentiating symmetric divisions (sometimes referred to as self-renewals) is given by γ during early development, γ+ρ during mixed growth, and finally ρ at maturity; while the rate of asymmetric divisions respectively jumps from 0 to Φ in the mixed phase, and remains at ϕ in the mature phase. This can be summarized as:

(7) γ(t)=γΘ(t<tM)ρ(t)=ρΘ(N(t)>NH)ϕ(t)=ϕΘ(N(t)>NH)

We assume exponential growth (see section 1.4 for discussion of other possible growth curves) from the stem cell population’s most recent common ancestor (the zygote, or any later occurring stem cell), and make the simplifying assumption that all division rates remain constant in their respective range of operation. The population size then depends on the division rates as N(t)=exp(γ(t)t) with Θ(x) the Heaviside step function.

We introduce μ as the rate parameter of the Poisson distribution from which the number of mutations arising in a daughter cell is drawn, i.e., the mutation rate per daughter per division. Each new mutation is presumed to be distinct from all previously existing mutations in the system, an assumption known as the infinite sites model.

A summary of all parameters governing this model is given in Table 1.

Table 1
Evolutionary parameters appearing in the model system.
SymbolDescriptionUnits
NMCarrying capacity of the mature population
tMAge when the cell population reaches mature sizeyears
NHPopulation size at homeostatic divisions(start of the mixed-growth phase)
γSymmetric division rates in early developmental phase/year
ρSymmetric division rate in homeostatic state/year
ΦAsymmetric division rate in homeostatic state/year
μMutation rate/division/daughter

Stochastic simulation of the dynamical model

Request a detailed protocol

In direct simulations of the system we model individually distinct cells, each carrying a particular set of mutations. These are evolved according to the model described in Section 1.1 by means of a Gillespie algorithm (Gillespie, 1977): Events are performed successively with exponentially distributed waiting times, and the acting cell in each event is selected uniformly randomly from the population. Daughter cells arising from a division inherit all mutations from their parent cell, and acquire new mutations which are distinct from any existing mutations.

Simulations are initiated with a single-cell carrying no mutations. In the early and mixed growth phases, taking the rate of growth divisions γ as fixed ensures exponential growth of the complete population. To investigate the effect of differing growth curves (see section 1.4), we also implemented a logistic growth phase, where cell division gradually decreases and cell death gradually increases until the population reaches its mature size. The average population growth then follows

(8) N(t)=NM1+e-γt.

The expected VAF distribution of the complete population

To obtain the dynamics of the VAF distribution, we first look at the probabilistic time evolution of a single variant’s size (i.e. the number of cells sharing that mutation) Pk(t), which is given by a Fokker-Plank equation. We first sketch the procedure for obtaining the result for a Moran model, where the population size is constant. Next, we show how to obtain the expression for a deterministically growing population. Finally, we use these results to obtain the distribution of variants for a population that changes in size.

Probability density distribution of a single variant in a Moran process

Request a detailed protocol

The discrete probability distribution of the variant’s size is governed by the transition probabilities {l|k}, which denotes the likelihood of the number of cells carrying this mutation changing from k to l. In a Moran model, a constant population size N is maintained by the simultaneous occurrence of one cell division and one cell death/removal. Thus, the possible transitions following the occurrence of such an event are given by the probabilities

(9) {{k+1|k}=kN(1-kN)=Tk+1,k=Tk,{k-1|k}=kN(1-kN)=Tk-1,k=Tk, and {k|k}=1-(Tk+1,k+Tk-1,k).

Assuming events occur with exponentially distributed waiting times at rate ρ, Pk can be described through the master equation

(10) 1ρdPkdt=Tk1Pk1+2TkPk+Tk+1Pk+1

When the population size N is large, this leads to a system of N differential equations, which is computational taxing to solve. Instead, a diffusion approximation can be made (on the assumption that N is large, i.e., that there are many states k=0,1,,N), to describe p(κ), the probability density of variant’s size on the continuous support κ[0,N]. To this end, a Fokker-Planck equation is constructed of the form

(11) tp(κ,t)=κA(κ,t)p(κ,t)+κ2B(κ,t)p(κ,t)/2

where the coefficients are given through the infinitesimal propagator t(κ+Δκ,t+Δt|κ,t):

(12) A(κ,t)=limΔt01Δtd(Δκ)Δκt(κ+Δκ,t+Δt|κ,t)B(κ,t)=limΔt01Δtd(Δκ)(Δκ)2t(κ+Δκ,t+Δt|κ,t)

The integrals are the first and second moments of a displacement Δκ, and can be obtained using the heuristic (Ewens, 2004b):

(13) ΔκΔtΔκ1/ρΔt+ϑ(Δt2)=ΔkΔkTk+Δk,kρΔt+ϑ(Δt2)Δκ2ΔtΔκ21/ρΔt+ϑ(Δt2)=Δk(Δk)2Tk+Δk,kρΔt+ϑ(Δt2)

with Δt the expectation taken over a time interval Δt. Typically a transformation to frequency space is first performed, f=ρ/N, where f is the frequency of cells carrying this mutation in the population. Thus, we obtain

(14) A(f,t)=0,B(f,t)=2νN2f(1-f)

and finally the well-known result

(15) tp(f,t)=(ρ/N2)f2f(1-f)p(f,t)

Probability density distribution of a single variant in a deterministic growing population

Request a detailed protocol

We now consider a growing population due to additional divisions occurring—at rate γ per cell—alongside the previously described Moran events. For simplicity, we assume the population size N(t) increases deterministically according to the growth rate. The transition probabilities for the variant sizes are then given by

(16) {{k+1,t+Δt|k,t}=kN(t)(1-kN(t))ρN(t)Δt+kN(t)γN(t)Δt,{k-1,t+Δt|k,t}=kN(t)(1-kN(t))ρN(t)Δt, and {k,t+Δt|k,t}=1-{k+1,t+Δt|k,t}-{k-1,t+Δt|k,t}.

Using the previously described heuristic (19) the coefficients of the Fokker-Planck equation become

(17) A(κ,t)=κγB(κ,t)=2κ[1κ/N]ρ+κγ

And the full equation

(18) tp(κ,t)=-κγκp(κ,t)+κ2[κ(1-κ/N(t))ρ+κγ/2]p(κ,t).

The expected VAF distribution in a varying population

Request a detailed protocol

For independently evolving variants, we denote the number of variants at size k (i.e. shared by k cells) as Vk(t). We now note that in the limit of infinite variants, the fluxes between adjacent k-states are identical to the above-derived transition probabilities. Thus, the above-derived equations can similarly describe the expected dynamics of an ensemble of variant sizes V(κ,t), where we once again introduce the parameter κ[0,N(t)] to represent the continuous space analog of the variant size k. In our system of interest we, however, also wish to include an injected flux of newly arising mutations, which is limited to the state κ=1 and has amplitude

(19) C=2μ(ρ+γ+ϕ/2)

Note that because asymmetric divisions result in just a single daughter remaining in the system, they contribute only half as many novel mutations to the system as symmetric divisions. In the continuous domain we introduce this flux as a Dirac delta function Cδ(κ1). Note also that, while these diffusion equations are more commonly applied in the frequency domain f[0,1], for the purpose of numerical simulation it will be convenient to remain in the space of absolute sizes κ=fN(t). Then finally, the time evolution of the VAF distribution is given by

(20) tV(κ,t)=κγκV(κ,t)+κ2[κ(1κ/N(t))ρ+γκ/2]V(κ,t)+2μN(t)(ρ+γ+ϕ/2)δ(κ1),

which is the result stated in the main text.

Numerical approximation of the VAF distribution dynamics

Request a detailed protocol

While a solution to the fixed population Moran system (21) is known (Kimura, 1955), it is unwieldy, and does not readily generalize to the more complex expressions (24) and (26). We, therefore, resorted to numerical approximations of their solution. We applied a method of lines approach with a finite difference discretization in the size coordinate κ. The incoming flux of variants at κ=1 introduces a discontinuity in the derivative, which leads to stiffness in the system. This is the reason we eschewed the transformation to f=κ/N(t), since when working in frequency space the point of incoming flux would change position as the population grows. To maximize performance near the discontinuity, we implemented a variable step size smallest near κ=1, with the distances in space given by

(21) Δκi=1+(i1)αi

where i1,,n-1, with n the number of discretized points, and

(22) αi=2N-(n-1)(n-1)(n-2)

In this formalism the delta function in (1) was approximated as a step function with height 2/(Δκ1+Δκ2).

Changing the growth function during the maturation phase

Request a detailed protocol

In the main text we have shown that, for a constant population, the past occurrence of a growth phase influences the shape of the observed VAF spectrum. Here, we investigate to what extent the precise form of this growth function alters the observation. To this end, we compare the expected VAF spectrum, calculated from Equation 1, for two different growth models: One where the population grows exponentially until the maximal capacity NM is reached, at which point it immediately stops expanding; and one where the population grows according to the logistic function, which is a common model for population growth in ecological systems. Since in the latter case the function only asymptotically approaches capacity, we take maturity to be the point where the population reaches 0.99NM. The resulting growth functions and expected VAF spectra are shown in Figure 2—figure supplement 1 for two time points measured after maturity. We observe that there is indeed a quantitative difference between the spectra of the two models. However, we note the same qualitative characteristics as described in the Results section of the main text: a pronounced move towards the fixed size shape in the lower frequencies, and a delayed response at higher frequency. The difference between the two models furthermore diminishes as the measurement occurs farther away from the end of the growth phase.

Analytical solutions for the expected VAF distribution in select cases

While in section 1.3 it was shown how the expected VAF distribution of our stochastic model can be calculated numerically, under certain (stricter) conditions some solutions can also be obtained analytically. We present here derivations for analytical solutions of the distribution in a constant population adhering to a Moran process, and a strictly growing population subject to the pure-birth process. Although the results derived in this section are themselves not novel (see e.g. Durrett, 2015; Williams et al., 2016, and Gunnarsson et al., 2021), we present them here for completeness, with new derivations that facilitate a unified analysis and do not require readings of the previous literature. For convenience, a list of the parameters used in this section is given in Table 2.

Table 2
Evolutionary parameters appearing in the analytical derivations of the expected VAF distribution in the Moran and pure-birth models.
SymbolDescription
NTotal number of cells
kAbundance or number of cells < N
Vk(t)Number of mutations with abundance k
PkLikelihood for a mutation with abundance k to increase or decrease
Ck(t)The average number of mutations with abundance k that increase or decrease
dVk(t)dtAverage change of Vk(t) per time step
μMutation rate per daughter cell

The equilibrium VAF distribution in a Moran process

Request a detailed protocol

In a birth-death Moran process, one cell dies and one is born every time step. Then, let every birth coincide with an average of μ new unique neutral mutations. Every cell can hold any number of mutations, and since these are neutral, the cells that are dying/born are chosen completely at random. If two cells are chosen for death and birth, any particular mutation has a chance Pk to be in one of the cells but not in the other:

(23) Pk=kN(1-kN)

This was previously also shown in Equation 15. If we want to know how the number of mutations changes, we need to know how many mutations increase or decrease in abundance on average. This simply means that we multiply Pk by the number of mutations with the given abundance k:

(24) Ck(t)=kN(1-kN)Vk(t)

Note that because of the symmetry of the process, Ck(t) is both the average increase and decrease. Lastly, if we look at a certain abundance k, there are two ways to lose mutations and two ways to gain mutations: Mk increases if mutations of abundance k+1 lose one cell or if mutations of abundance k+1 gain one cell. Mk decreases if mutations of abundance k lose or gain a cell. Note that all of these things can happen at the same time. If we want to know the average change dVk(t)dt, we, therefore, need to incorporate all of these four possibilities. After we add the rate division rate ρ, we finally get the following equation for the average change per unit of time:

(25) dVk(t)dt=ρ(Ck1,t)+Ck+1(t)2Ck(t)

This is a variation of Equation 16 applied to all mutations, as opposed to just a single one. For k=1 the first term C0 is simply replaced by a fixed mutation rate μ, for k=N, the middle term has to be removed (and the last term CN(t) is always 0). Once we apply this to an entire distribution of mutations, it gives us an estimate of how, on average, this distribution changes every time step. Notably, it should approach some equilibrium state if we exclude the last state, which is absorbing. Let us consider the distribution from k=1 to k=N-1. The only way for this distribution to lose mutations is by either a mutation of abundance 1 to die out, or a mutation of abundance N-1 to fixate. This implies an equilibrium at:

(26) k=1N1dVk(t)dt=ρ(μCN1(t)C1(t))=0μ=C1(t)+CN1(t)C1(t)=μCN1(t)

We can now apply this iteratively to the other Ck(t) (for convenience, we ignore the ρ, since it has no bearing on the equilibrium, merely on how fast we approach it):

(27) dV1(t)dt=μ+C2(t)2C1(t)=0C2(t)=2(μCN1(t))μ=μ2CN1(t)0=C1(t)+C3(t)2C2(t)C3(t)=2(μ2CN1(t))μ+CN1(t)=μ3CN1(t)Ck(t)=μkCN1(t)

Which can then be applied to CN-1(t) itself:

(28) CN1(t)=μ(N1)CN1(t)CN1(t)=μN

This can now be used to solve for C1(t), C1(t)CN-2(t) iteratively, resulting in an equilibrium solution Ck*:

(29) Ck*=μN-kN

Finally, this can be transformed back into the Vk(t) to get the equilibrium solution V*(k):

(30) V*(k)=Ck*Nk11-kN=μN-kNNkNN-k=μNk

The VAF distribution of a growing population under a pure-birth process

Request a detailed protocol

In a pure-birth process only births, and no deaths, occur in the population. Suppose then a strictly growing population starting from a single-cell increasing by 1 every time step, with a mutation rate of μ. We denote the number of mutations with abundance k at time t as Vk(t). Then we can set up a system for the average change per time step:

(31) Vk(t+1)=(1kt+1)Vk(t)+k1t+1Vk1(t)V1(t+1)=(11t+1)V1(t)+2μ

For simplicity we view the process in event time, so that every time step is exactly one division event. However, the process can be projected so that the amount of time δt that passes between every event is drawn at random from a given distribution. Since this projection is always possible in both directions now matter the chosen distribution, this proof is valid for any process, as long as births are not allowed to be simultaneous. The Vk(t) themselves are all strictly increasing every time step and hence approach infinity, but we can make a statement about the average shape of the distribution and the value of each specific points in time:

(32) Vk(t)=(2k-2k+1)(t+1)μ=2k(k+1)(t+1)μ

To prove that 2/(k(k+1)) is indeed the expected distribution for this system, we will first show that the first several time steps follow this shape. Then, we prove that every time a new cell and thus a new possible abundance k=t is added to the system, the new distribution at the new abundance will follow the expected distribution as well. Finally, we will show that if an abundance and the neighboring abundance follow the distribution already, it will continue to follow the distribution.

Let’s show now that the first few time steps follow the claimed distribution:

(33) V1(1)=2μV1(2)=(112)2μ+2μ=3μV2(2)=(12)2μ=1μV2(2)V1(2)=13=2223V1(3)=(113)3μ+2μ=4μV2(3)=(123)3+131μ=43μV3(3)=(23)1=23μV2(3)V1(3)=13=2223V3(3)V1(3)=16=2324

Now, since the first steps seem to follow the distribution as expected, we can move on to showing that generally at time step t, the newly generated abundance k=t will always fit the distribution:

(34) Vt(t)=0+t1t2(t1)ttμ=2tμ=2t(t+1)(t+1)μ=2k(k+1)(t+1)μ

The only part remaining is proving that if the preceding time steps fit the distribution, the following time step also follows the distribution:

(35) Vk(t+1)V1(t)=(1kt+1)Vk(t)V1(t)+k1t+1Vk1(t)V1(t)=(1kt+1)(2k(k+1))+k1t+1(2(k1)(k))=(1kt+1)(2k(k+1))+k+1t+1(2(k+1)(k))=(1kt+1+k+1t+1)(2k(k+1))=(1+1t+1)(2k(k+1))V1(t+1)=(11t+1)(t+1)μ+2μ=(t+11+2)μ=(t+2)μV1(t+1)V1(t)=t+2t+1=1+1t+1Vk(t+1)V1(t+1)=2k(k+1)Vk(t+1)=2k(k+1)(t+2)μ

Note that unlike for the constant population case, the expected distribution does not need time to asymptotically approach the steady-state solution, as it is already at equilibrium from t=1.

Sampling affects the observed VAF distribution

Request a detailed protocol

Equation 1 describes the evolution of mutation frequencies with respect to the whole population, however, these are typically not known from sequencing data. Instead, an observation provides the frequencies of mutations as measured in the sequenced sample of the population. This distinction should be recognized when interpreting experimental data, as the statistical distribution for the sample’s VAF distribution differs from that of the true population (Figure 3c). To obtain the expected distribution v~(f) from a sample of S cells, we must consider how each mutation’s frequency in the true population can be altered in the sample, so that the resulting curve is given by the sum over all original mutation frequencies j/N weighted by their probability of being sampled to a different frequency i/S. If sampling is performed randomly this probability is given by the hypergeometric distribution (sampling without replacement), leading to the expression

(36) v~(i/S)=j=1Nv(j/N)(ji)(NjSi)/(NS)

Besides the anticipated loss of resolution, the effect of this operation is to shift the expected mutation frequencies upward with respect to their true frequency in the population (see Figure 3c in the main text), which can lead to improper inferences if not correctly accounted for.

The mutational burden distribution

The expected mutational burden distribution

Request a detailed protocol

The expected distribution of mutational burdens can be obtained from the argument that the burden mj of an individual cell at the time of observation must equal the sum over all mutations that occurred during each of its past cell divisions:

(37) mj=iyjuji

Without knowledge of the distributions of y and u other than the requirement that the uji are identically distributed and independently drawn, it is straightforward to show that the expectation and variance of mj are given by E(m)=E(y)E(u) and Var(m)=E(y)Var(u)+E(u)2Var(y) (Supplement). With the number of novel mutations per division as Poisson distributed, the average number of divisions per lineage — expressed in terms of an effective division rate λ — can thus be estimated from a sample as

(38) λ(t)dt=E(y)=E(m)/μ

If the mutation rate μ is not known, it can similarly be obtained from

(39) μ=(Var(m)E(m)1)E(y)Var(y)

Note that while a common modeling practice is to take the number of divisions as Poisson distributed — i.e., assuming exponentially distributed times between divisions — the estimates (Equation 4) and (Equation 5) do not require this. In fact, the factor E(y)/Var(y) in (Equation 5) can be interpreted as a measure for the under- or over-dispersion of divisions per lineage with respect to the Poisson model where E(y)=Var(y). In other words, if the Poisson model is assumed (then (Equation 37) describes the compound Poisson distribution (CPD)) the mutation rate can be estimated directly from the data m, whereas if the mutation rate is known the accuracy of the Poisson model can be assessed. Finally, note that in the true system there are nonzero correlations between lineages – the early uji are shared amongst many cells – meaning that for a single population the mj are not truly drawn from (Equation 37). While this does not affect the expected division rate (Equation 4), an inferred mutation rate will be influenced by a shared history, as it effectively reduces Var(y) and Var(u) (Figure 3b).

Fluctuations of the mutational burden distribution

Request a detailed protocol

To investigate expected within- and between-patient variation of the mutational burden distribution we implemented individual-based stochastic simulations of single-cell mutation acquisition. We start simulations with N cells and 0 mutations. Cells are picked randomly for division and acquire a number of mutations drawn from a Poisson distribution with rate parameter μ. In addition, cells may undergo symmetric or asymmetric divisions with a given probability p. If a cell divides asymmetrically, the novel mutations are added to that cell in the simulation. If a cell divides symmetrically, one cell in the population replaces another and novel mutations are added to the replaced and the dividing cell.

These simulations allow us to track the mutational burden distribution and its variance for arbitrary p, μ and large N over time. Realizations of these dynamics for μ=1.3 per cell division are shown in Figure 3—figure supplement 2 and are compared to the experimental observation of the mutational burden distribution in Durrett, 2013. We observe significant stochastic variations, especially pronounced for small-stem cell population sizes N and low asymmetric division probabilities p. However, the true biological parameters are more likely to be a combination of large N and large p and, therefore, a mutation rate of μ=1.3 per cell division is unlikely to explain the observations in Lee-Six et al., 2018.

In Figure 3—figure supplement 2 and Figure 3—figure supplement 3 we show the likelihood that the ratio of variance/mean is within the experimentally observed interval 3<θ<5. Again we can see that for μ=1.3 the ability to explain the experimental observations is limited for any combination of parameters. Increasing the mutation rate to μ=3 in contrast can explain the experimental observations for a wide range of parameter combinations.

Data fitting pipeline

To estimate the values of the model parameters from the data, we combine both predictions for the single-cell mutational burden as well as the mutation allele frequency distribution. The unknown parameters are those listed in Table 1, whereby the growth rate is implicitly defined from the mature population size and the maturation time as γ=log(NM)/tM.

Estimating division rates from the single-cell mutational burden distribution

Request a detailed protocol

While the single-cell mutational burden distribution allows for direct estimation of the average number of divisions that occurred per lineage through Equation 4, this does not directly provide the division rates, as these combine in the integrand on the lhs. To perform the integration we must first determine the expected contributions of each type of division (growth, symmetric, or asymmetric) to the mutational burden of a single lineage, as they — perhaps counter-intuitively — do not influence the expected burden equally.

Consider the system of N cells which obey the dynamics described in Section 1.1. At any point in time each cell describes a single lineage, and we are interested in the expected number of mutations added per lineage over time. Since divisions occur as independent Poisson events, we may treat the distinct types individually and simply add their contributions together to obtain the total burden increase over time. When an asymmetric division occurs the situation is simple: The single stem-type daughter cell acquires on average μ new mutations, which increase the average burden per cell by μ/N. Since these dynamics are Markovian, and events occur at rate Nϕ in the population, we obtain the average number of mutations acquired per lineage in a time Δt as

(40) Δmϕ=Nγ×(μ/N)×Δt=ϕμΔt

When a symmetric division occurs (as performed in the Moran process described previously) the situation becomes slightly more complicated. Following a single division we discern two disjoint outcomes: either two different cells are selected for replication and removal, or the same cell is selected for both. The former case occurs with probability 1-1/N and results in the average burden per lineage increasing by 2μ/N (two daughter cells obtained on average μ new mutations), while the latter occurs with probability 1/N and causes the average burden to increase by μ/N (one of the daughter cells was removed).With these events occurring at rate Nρ the expected increase per lineage after a time Δt is:

(41) Δmρ=Nρμ[(11N)2N+1N1N]Δt=ρμ(21/N)Δt2ρμΔt

with the final approximation obtained in the limit of large N. For the pure growth divisions occurring at rate Nγ we take a similar approach, though we now consider an infinitesimal time increment dt to account for the varying population size. After a single division, the average burden increase is 2μ/(N(t)+1), so that

(42) dmγ=N(t)γμ2N(t)+1dt=2γμN(t)N(t)+1dt2γμdt

Thus, we find that the growth and Moran divisions on average add twice as many mutations to a lineage as the asymmetric divisions. Note that we obtain the average increase in lineage (the number of divisions in a cell’s past) by setting μ=1 in the equations above. Plugging these results into the integral of (Equation 4), we find for the expected number of divisions per lineage at time.

(43) y(t)=0t[2ρ(112N(τ))+ϕ+2γ(11N(τ)+1)]dτ

Taking into account the different phases of evolution we can separate the integral into three parts.

(44) y(t)=0logNH/γ2γ(11N(τ)+1)dτ+logNH/γtM[2ρ(112N(τ))+ϕ+2γ(11N(τ)+1)]dτ+tmt[2ρ(112N(τ))+ϕ]dτ

Denoting λ=ρ+ϕ the total rate of homeostasis divisions within the stem cell compartment, performing the integration we solve this for,

(45) λ=R2log(NM+1)+log(4)(1p)(ttM)/NM+(2p)(ttMlog(NH)log(NM))(1p)tM(1NH1NM)/log(NM)

with R=E(m)/μ the average number of divisions per lineage estimated from the mutational burden distribution. With μ estimated from Werner et al., 2020, the remaining unknown parameters are tM, NH, NM, and p. Finding the standard error on E(m) as σE(m)=σm/S (with S as the sample size) and sweeping the unknown parameters within reasonable limits (tM[0,10], NH[0,1000], NM[20000,500000]) gives an error interval for the expected λ.

Fitting to the lowest frequency of the VAF distribution

Request a detailed protocol

We must be cautious about attempting to directly fit realizations of Equation 1 to an observed distribution in its entirety. Without knowledge of the variance across individual realizations for each possible allele frequency, there is no rigorous way to weigh the errors from different frequencies to obtain a single best fit. Indeed, simulations suggest that the variance in the number of mutations per VAF grows with the frequency (Figure 4—figure supplement 1), implying that a non-weighted fit would overemphasize fluctuations at one end of the spectrum. We, therefore, selected only the lowest frequency available in the sample – which is anticipated to be the best described by the expected value (Figure 4—figure supplement 1b) – as the reference for parameter fitting.

For select values of tM and NH, we calculated the VAF distribution — by numerically solving Equation 1 and applying the sampling operator Equation 12 — at lattice points in the parameter space of NM[5103,5105]×p[0,1]. For each point (NMi,pi) we compared the lowest frequency of the theoretical expected distribution v1/S with the value measured in the data, resulting in a 2D lattice of errors. From this lattice a continuous error function was obtained through spline interpolation, wherein we identified a zero-error line, which is the line of optimal fit shown in Figure 4c.

Code availability

Request a detailed protocol

The code for the simulations was written in the Julia computing language Julia 1.7.3 or C++ (gnu++ 14), and can be accessed at https://github.com/natevmp/hsc-vaf-dynamics.

Reused data

Request a detailed protocol

We have reused the publicly available data in Lee-Six et al., 2018, where the reused data in our Figure 4 can be accessed at https://data.mendeley.com/datasets/yzjw2stk7f/1, and in Martincorena et al., 2018, where the reused data in our Figure 2 can be accessed in the Resources section aau3879 tables2.xlsx at https://www.science.org/doi/10.1126/science.aau3879.

Data availability

The current manuscript is a computational study, so no data have been generated for this manuscript. We have reused the publicly available data in Lee-Six et al., 2018, where the reused data in our Figure 4 can be accessed at https://data.mendeley.com/datasets/yzjw2stk7f/1, and in Martincorena et al., 2018, where the reused data in our Figure 2 can be accessed in the Resources section aau3879 tables2.xlsx at https://www.science.org/doi/10.1126/science.aau3879. The code for the simulations was written in the Julia computing language Julia 1.7.3 or C++ (gnu++ 14), and can be accessed at https://github.com/natevmp/hsc-vaf-dynamics (copy archived at Mon Père and Moeller, 2023).

The following previously published data sets were used
    1. Martincorena et al.
    (2018) European Genome-Phenome Archive
    ID EGAD00001004158. The evolving mutational landscape of normal human esophagus.
    1. Martincorena et al.
    (2018) European Genome-Phenome Archive
    ID EGAD00001004159. The evolving mutational landscape of normal human esophagus.
    1. Lee-Six et al.
    (2018) European Genome-Phenome Archive
    ID EGAD00001004086. The Causes of Clonal Blood Cell Disorders Study - SCOR (2018-04-19).
    1. Lee-Six H
    2. Campbell P
    (2018) Mendeley Data
    Population dynamics of normal human blood inferred from spontaneous somatic mutations - substitution calls.
    https://doi.org/10.17632/yzjw2stk7f.1

References

    1. Ewens WJ
    (2004)
    Mathematical Population Genetics: I. Theoretical Introduction
    92–135, Discrete stochastic models, Mathematical Population Genetics: I. Theoretical Introduction, Springer, 10.1007/978-0-387-21822-9_3.
    1. Ewens WJ
    (2004)
    Mathematical Population Genetics: I. Theoretical Introduction
    136–155, Diffusion theory, Mathematical Population Genetics: I. Theoretical Introduction, Springer New York, 10.1007/978-0-387-21822-9_4.

Peer review

Reviewer #1 (Public Review):

Authors propose mathematical methods for inferring evolutionary parameters of interest from bulk/single cell sequencing data in healthy tissue and hematopoiesis. Authors attempt to go beyond previous models by including three phases of human development: early development, growth and maintenance, and mature phase. Introductory figures (1 and 2) provide the connection to previous analytical results (based on power laws), while figure 3 denotes the role of sampling effects, and figure 4 provides a real-world example.

This approach dovetails nicely with previous literature, providing clear insight into when previous theoretical results are valid and when they break down. Much of the previous literature is devoted to bulk sequencing, leading the authors to investigate the role of (sub)-sampling due to single cell data, where mutation burden and mutation rate distributions are easily recapitulated. Although not strongly emphasized in the manuscript, sub-sampling does increase noise leading to differences between population and sample distributions. From my view, these results provide an important contribution to the literature and are able to nicely describe and make inferences in a single cell HSC data set.

https://doi.org/10.7554/eLife.89780.3.sa1

Reviewer #2 (Public Review):

Summary: The authors provide a nice summary on the possibility to study genetic heterogeneity and how to measure the dynamics of stem cells. By combining single cell and bulk sequencing analyses, they aim to use a stochastic process and inform on different aspects of genetic heterogeneity.

Strengths: Well designed study and strong methods.

https://doi.org/10.7554/eLife.89780.3.sa2

Author response

The following is the authors’ response to the original reviews.

We are very grateful for your time and efforts spent on our manuscript. Your feedback has been very valuable. Please see below a point-by-point response to each suggestion and actions taken to address each point in the manuscript.

eLife assessment

In this fundamental study, the authors propose analytical methods for inferring evolutionary parameters of interest from sequencing data in healthy tissue relevant to hematopoiesis. By combining analyses of single cell and bulk sequencing data, the authors can use a stochastic process to inform different aspects of genetic heterogeneity. The strength of evidence in support of the authors' claim is thus compelling. The work will be of broad interest to cell biologists and theoretical biologists.

Public Reviews:

Reviewer #1 (Public Review):

Summary:

Authors propose mathematical methods for inferring evolutionary parameters of interest from bulk/single cell sequencing data in healthy tissue and hematopoiesis. In general, the introduction is well-written and adequately references the relevant and important previous literature and findings in this field (e.g. the power laws for well-mixed exponentially growing populations). The authors consider 3 phases of human development: early development, growth and maintenance, and mature phase. In particular, time-dependent mutation rates in Figure 2d is an intriguing and strong result, and the process underlying Figures 3 and 4 are generally wellexplained and convincing.

Thank you for your positive comments.

Notes & suggestions:

1. The explanation of Figure 2 in Lines 101 - 111 should be expanded for clarity. First, is Figure 2a derived from stochastic simulation (line 101 suggests) or some theoretical analysis? Second, the gradual transition from f-2 to f-1 is appreciated, but the shape of the intermediates is not addressed in detail. The power laws are straight lines, and the simulations provide curved lines -- please expand in what range (low or high frequency variants) the power law approximations apply.

Figure 2a was obtained from a numerical solution of equation 1, which describes the time dynamics of the expected VAF distribution. This is indeed unclear from the text, and we thank the reviewer for pointing out this discrepancy.

We thank the reviewer for this suggestion and have now adjusted this in the text (102-110):

“Numerical solutions of Eq.(1) show that the expected VAF distribution exhibits a gradual transition from the f-2 (growing population) to the f-1 (constant population) power law (Fig.2). These transitional states themselves do not adhere to some intermediate power-law (e.g. f for 1<α<2), but instead present a sigmoidal shape, with the low frequency portion following f-1 and the high frequencies f-2 . Over time the shape changes as a wavelike front traveling from low to high frequency, with the constant-size equilibrium establishing earliest at the lowest frequencies and moving to higher frequency over time. Interestingly, the convergence towards equilibrium slows down over time -- for evenly-spaced observation times the solutions lie increasingly closer together -- further decreasing the speed at which the high frequency portion of the spectrum approaches equilibrium.”

We also changed the caption of Figure 2 to make this clearer as

“(a) Expected VAF distributions from evolving Eq1 to different time points for a population with an initial exponential growth phase and subsequent constant population phase (mature size N=103). Once the population reaches the maximum carrying capacity, the distribution moves from a 1/f2 growing population shape (purple) to a 1/f constant population shape (green). Note that the shift slows considerably at older age.”

In addition, we have also added annotations to Figure 2a and 2b to further clarify which line(green or purple) is f-1 and f-2.

Additionally, I do not understand the claim in line 108, that the transition is fast for low frequency variants, as the low frequency (on the left of the graph) lines are all close together, whereas the high frequency lines are far apart.

The lines are closer together in the low frequency portion (left of the plot) because they are already very close to the constant-size equilibrium (f-1/green line) and these frequencies approached equilibrium very fast. On the contrary, in the high frequency portion (right side of plot) they are still very far from equilibrium and approached equilibrium much slower.

It would be helpful to reiterate in this paragraph that these power laws are derived based on exponentially growing populations and are expected to break down under homeostatic conditions.

We have adjusted the relevant paragraph in the text to make the validity of the power laws clearer (90-94):

“For a well-mixed exponentially growing population without cell death the VAF spectrum v(f) is given by 2μ/(f+f2) (a f2 power law) and is independent of time. In contrast, for a population of constant size – i.e. where birth and death rates are equal – the spectrum obeys v(f)2μ/f (a f1 power law; see also SI), though this solution is only valid at sufficiently long times.”

1. The sample vs population (blue vs orange) in Figure 3 is under-explained. How is it that the mutational burden and inferred mutation rate in A and B roughly match, but the VAF distributions in C are so different? How was the sampled set chosen? Perhaps this is an unimportant distinction based on the particular sample set, but the divergence of the two in C may serve as a distraction, here.

This is an important question, and the answer was perhaps underemphasized in the caption. The sampling was performed as a uniform random sampling with replacement, and the same sample set was used for both the mutational burden and the VAF distribution. The reason for this stark contrast is that while the expectation of the burden distribution is not affected by sampling (i.e. sampling only affects the resolution/amount of stochasticity), the expectation of the VAF distribution changes due to sampling. While this was discussed in the section "Sparse sampling, single cell derived VAF spectra and evolutionary inferences", we have added note of this (indeed surprising) effect in the caption as well:

“(b) Distribution of estimated mutation rates from 10'000 individual simulations, obtained from burden distributions of the complete populations (blue) as well as sampled sets of cells (orange). Because the expected mutational burden distribution is unaltered by sampling, the expected estimate of the mutation rate from (5) remains unchanged: E(μ~pop)=E(μ~sample). However, sampling increases the noise on the observed burden distribution, which results in a higher errormargin of the estimate: σ(μ~pop)<σ(μ~sample).”

“(c) VAF spectra measured in the complete population (blue) and a sampled set of cells (orange). In contrast with the mutational burden distribution, strong sampling changes the shape of the expected distribution. A single simulation result is shown (diamonds) alongside the theoretically predicted expected values for both the total and sampled populations (Eqs. (1) and (6))(dashed line) and the average across 100 simulations (solid line).”

1. The comparison of results herein to claims by Mitchell (ref. 12) are quite important results within the paper. I appreciate the note in the final paragraph of the discussion, and I suggest adding a sentence referencing the result noted in line 248-249 to the abstract, as well.

We agree with the reviewer. We have extended the abstract now to reference the result in more detail:

“However, the single cell mutational burden distribution is over-dispersed compared to a model of Poisson distributed random mutations suggesting. A time-associated model of mutation accumulation with a constant rate alone cannot generate such a pattern. At least one additional source of stochasticity would be needed. Possible candidates for these processes may be occasional bursts of stem cell divisions, potentially in response to injury, or non-constant mutation rates either through environmental exposures or cell intrinsic variation.”

Reviewer #2 (Public Review):

Summary: The authors provide a nice summary on the possibility to study genetic heterogeneity and how to measure the dynamics of stem cells. By combining single cell and bulk sequencing analyses, they aim to use a stochastic process and inform on different aspects of genetic heterogeneity.

Strengths: Well designed study and strong methods

Thank you for your positive comments.

Weaknesses: Minor

Further clarification to Figure 3 legend would be good to explain the 'no association' of number of samples and mutational burden estimate as per line 180-182 p.8.

We have added a note to the caption of Figure 3b to explain more clearly how sampling affects the burden distribution and the mutation rate inferred from it (see also previous response to Reviewer 1):

“Because the expected mutational burden distribution is unaltered by sampling, the expected estimate of the mutation rate from (5) remains unchanged: E(μ~pop)=E(μ~sample). However, sampling increases the noise on the observed burden distribution, which results in a higher errormargin of the estimate: σ(μ~pop)<σ(μ~sample).”

Reviewer #1 (Recommendations For The Authors):

Minor/editorial suggestions:

1. Equation 1, please define \partial_t and \partial_K, for clarity.

These have now been defined in the text (between line 85-86): “where κ=fN(t) denotes the number of cells sharing a variant (the variant frequency f times the total population size N), δ(x) is the Dirac impulse function, t and κ are the partial derivatives with respect to time and variant size.”

1. Figure 2: It would be helpful to label the green and purple lines with the corresponding 1/f and 1/f^2 rule, in addition to the growing/fixed label, for clarity.

We agree and have now added the corresponding labels to each line.

Reviewer #2 (Recommendations For The Authors):

Minor suggestions are given below:

It would be nice for the authors to comment on whether the results could be extended/modified to account for possible fitness advantage of mutations which would be clinically relevant, for instance in the case of CHIP mutations and difference in time to myeloid malignancies transformation between CHIP/No CHIP individuals.

This is an important point. We agree with the reviewer that CHIP mutations play an important role in shaping mutational diversity especially in older individuals. Evidence is now emerging that CHIP mutations are almost universally present in individuals 60+. Interestingly, in individuals younger than 60, a neutral model (as presented here), does capture the observed effective dynamics well. For the purpose of the analysis underlying this manuscript, a neutral model seems reasonable.

The techniques we use here can be adjusted to include selection. How the results extend or modify will critically depend on the actual model of selection (rare or frequent CHIP mutations, strong vs weak selection etc.) that is realized in human hematopoiesis. Here we would say, the underlying biology currently is mostly unknown and is subject to (by others and in part by us) ongoing investigations, which extend beyond the scope of this manuscript.

We now make note of this point in the manuscript and added a small paragraph in page 11 to the discussion:

“Another open question is the role of selection and how it shapes intra-tissue genetic heterogeneity. Evidence is emerging that positively selected variants in blood are almost universally present in individuals above 60, while the effective observable dynamics in younger individuals is well described by neutral dynamics. How results presented here generalize or modify will critically depend on the model of selection realized in human hematopoiesis, e.g. a models of rare or frequent driver events. Details of the underlying biology are currently unknown.”

It would be nice to see if any significant differences in parameter estimates occur between loci with/without linkage disequilibrium, for instance HLA region. Could the number of single-cell samples be 'more' relevant when studying the VAF distribution in HLA region?

This is a good suggestion. We might be wrong or missing an important point, but somatic evolution as we use it in our modeling here is solely driven by asexual reproduction of cells. As such the entire genome of the cell is in linkage disequilibrium, independent of the precise genomic region (somatic evolution is in first approximation blind to germline mutations, as they are present in every single cell of the organism and therefore do not carry any information on the somatic evolutionary dynamics).

We thank all editors and reviewers again for your constructive comments.

https://doi.org/10.7554/eLife.89780.3.sa3

Article and author information

Author details

  1. Marius E Moeller

    Department of Mathematics, Queen Mary University of London, London, United Kingdom
    Contribution
    Conceptualization, Formal analysis, Validation, Methodology, Writing – original draft, Writing – review and editing
    Contributed equally with
    Nathaniel V Mon Père
    Competing interests
    No competing interests declared
  2. Nathaniel V Mon Père

    1. Evolutionary Dynamics Group, Centre for Cancer Genomics and Computational Biology, Barts Cancer Centre, Queen Mary University of London, London, United Kingdom
    2. Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles, Ixelles, Belgium
    Contribution
    Conceptualization, Formal analysis, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing
    Contributed equally with
    Marius E Moeller
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-3561-2050
  3. Benjamin Werner

    Evolutionary Dynamics Group, Centre for Cancer Genomics and Computational Biology, Barts Cancer Centre, Queen Mary University of London, London, United Kingdom
    Contribution
    Conceptualization, Supervision, Funding acquisition, Investigation, Methodology, Writing – original draft, Writing – review and editing
    For correspondence
    b.werner@qmul.ac.uk
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-6857-8699
  4. Weini Huang

    1. Department of Mathematics, Queen Mary University of London, London, United Kingdom
    2. Group of Theoretical Biology, The State Key Laboratory of Biocontrol, School of Life Science, Sun Yat-sen University, Guangzhou, China
    Contribution
    Conceptualization, Formal analysis, Supervision, Funding acquisition, Methodology, Writing – original draft, Writing – review and editing
    For correspondence
    weini.huang@qmul.ac.uk
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-9016-2665

Funding

Barts Charity (MGU045)

  • Nathaniel V Mon Père

UK Research and Innovation (MR/V02342X/1)

  • Benjamin Werner

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The authors are grateful to Marc Williams and Christo Morison for discussions at different stages of the manuscript. NVMP acknowledges support by Télévie and the Barts Charity. BW is supported by a Barts Charity Lectureship (grant no. MGU045) and a UKRI Future Leaders Fellowship (grant no. MR/V02342X/1). This project has been facilitated by an Alan Turing Institute Data Science for Science Program grant TU/ASG/R-SPES-121.

Senior Editor

  1. Detlef Weigel, Max Planck Institute for Biology Tübingen, Germany

Reviewing Editor

  1. Jennifer Flegg, The University of Melbourne, Australia

Version history

  1. Preprint posted: March 5, 2023 (view preprint)
  2. Sent for peer review: June 8, 2023
  3. Preprint posted: September 18, 2023 (view preprint)
  4. Preprint posted: November 28, 2023 (view preprint)
  5. Version of Record published: January 24, 2024 (version 1)

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.89780. This DOI represents all versions, and will always resolve to the latest one.

Copyright

© 2023, Moeller, Mon Père et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 157
    Page views
  • 28
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Marius E Moeller
  2. Nathaniel V Mon Père
  3. Benjamin Werner
  4. Weini Huang
(2024)
Measures of genetic diversification in somatic tissues at bulk and single-cell resolution
eLife 12:RP89780.
https://doi.org/10.7554/eLife.89780.3

Share this article

https://doi.org/10.7554/eLife.89780

Further reading

    1. Evolutionary Biology
    2. Genetics and Genomics
    Thomas A Sasani, Aaron R Quinlan, Kelley Harris
    Research Article

    Maintaining germline genome integrity is essential and enormously complex. Although many proteins are involved in DNA replication, proofreading, and repair, mutator alleles have largely eluded detection in mammals. DNA replication and repair proteins often recognize sequence motifs or excise lesions at specific nucleotides. Thus, we might expect that the spectrum of de novo mutations – the frequencies of C>T, A>G, etc. – will differ between genomes that harbor either a mutator or wild-type allele. Previously, we used quantitative trait locus mapping to discover candidate mutator alleles in the DNA repair gene Mutyh that increased the C>A germline mutation rate in a family of inbred mice known as the BXDs (Sasani et al., 2022, Ashbrook et al., 2021). In this study we developed a new method to detect alleles associated with mutation spectrum variation and applied it to mutation data from the BXDs. We discovered an additional C>A mutator locus on chromosome 6 that overlaps Ogg1, a DNA glycosylase involved in the same base-excision repair network as Mutyh (David et al., 2007). Its effect depends on the presence of a mutator allele near Mutyh, and BXDs with mutator alleles at both loci have greater numbers of C>A mutations than those with mutator alleles at either locus alone. Our new methods for analyzing mutation spectra reveal evidence of epistasis between germline mutator alleles and may be applicable to mutation data from humans and other model organisms.

    1. Chromosomes and Gene Expression
    2. Evolutionary Biology
    Katherine Rickelton, Trisha M Zintel ... Courtney C Babbitt
    Research Article Updated

    Primate evolution has led to a remarkable diversity of behavioral specializations and pronounced brain size variation among species (Barton, 2012; DeCasien and Higham, 2019; Powell et al., 2017). Gene expression provides a promising opportunity for studying the molecular basis of brain evolution, but it has been explored in very few primate species to date (e.g. Khaitovich et al., 2005; Khrameeva et al., 2020; Ma et al., 2022; Somel et al., 2009). To understand the landscape of gene expression evolution across the primate lineage, we generated and analyzed RNA-seq data from four brain regions in an unprecedented eighteen species. Here, we show a remarkable level of variation in gene expression among hominid species, including humans and chimpanzees, despite their relatively recent divergence time from other primates. We found that individual genes display a wide range of expression dynamics across evolutionary time reflective of the diverse selection pressures acting on genes within primate brain tissue. Using our samples that represent a 190-fold difference in primate brain size, we identified genes with variation in expression most correlated with brain size. Our study extensively broadens the phylogenetic context of what is known about the molecular evolution of the brain across primates and identifies novel candidate genes for the study of genetic regulation of brain evolution.