Measures of genetic diversification in somatic tissues at bulk and single-cell resolution

eLife assessment

In this paper, the authors introduce fundamental work on mathematical methods for inferring evolutionary parameters of interest from RNA data in healthy tissue and during hematopoiesis. By combining single cell and bulk sequencing analyses, the authors use a stochastic process to inform different aspects of genetic heterogeneity; the strength of evidence in support of the authors' claim is exceptional. The work will be of broad interest to cell biologists and theoretical biologists.

https://doi.org/10.7554/eLife.89780.3.sa0


Abstract

Intra-tissue genetic heterogeneity is universal to both healthy and cancerous tissues. It emerges from the stochastic accumulation of somatic mutations throughout development and homeostasis. By combining population genetics theory and genomic information, genetic heterogeneity can be exploited to infer tissue organization and dynamics in vivo. However, many basic quantities, for example the dynamics of tissue-specific stem cells, remain difficult to quantify precisely. Here, we show that single-cell and bulk sequencing data inform on different aspects of the underlying stochastic processes. Bulk-derived variant allele frequency (VAF) spectra show transitions from growing to constant stem cell populations with age in samples of healthy esophagus epithelium. Single-cell mutational burden distributions allow a sample-size-independent measure of mutation and proliferation rates. Mutation rates in adult hematopoietic stem cells (HSCs) are higher compared to inferences during development, suggesting additional proliferation-independent effects. Furthermore, single-cell derived VAF spectra contain information on the number of tissue-specific stem cells. In hematopoiesis, we find approximately 2 × 10⁵ HSCs, if all stem cells divide symmetrically. However, the single-cell mutational burden distribution is over-dispersed compared to a model of Poisson distributed random mutations. A time-associated model of mutation accumulation with a constant rate alone cannot generate such a pattern. At least one additional source of stochasticity would be needed. Possible candidates for these processes may be occasional bursts of stem cell divisions, potentially in response to injury, or non-constant mutation rates either through environmental exposures or cell-intrinsic variation.

Introduction

Intra-tissue genetic heterogeneity emerges from a multitude of dynamical processes (Turajlic et al., 2019; Black and McGranahan, 2021). Cells randomly accumulate mutations, they self-renew, differentiate, or die, and clones of varying fitness may compete and expand (Watson et al., 2020; Rulands et al., 2018; Werner et al., 2020; Gunnarsson et al., 2021). These core evolutionary principles form the basis for describing the aging of somatic tissues, tumor initiation, and tumor progression (Martincorena, 2019; Greaves and Maley, 2012; Cagan et al., 2022; Martincorena et al., 2018; Mitchell et al., 2022; Abascal et al., 2021). Naturally, there is great interest in understanding precisely how these dynamical processes act. This typically involves hypothesizing models of cell behavior, and deriving quantitative estimates of their underlying physical parameters (Watson et al., 2020; Werner et al., 2020; Poon et al., 2021; Martincorena et al., 2017; Chatzeli and Simons, 2020; Williams et al., 2020; Durrett, 2013). These could be quantities such as mutation rates per cell division, active stem cell numbers, symmetric and asymmetric division rates, etc. Some such parameters are already well characterized in certain tissues. We know for example that during early development healthy somatic cells accumulate on average between 1.2 and 1.3 mutations per genome per division (Werner et al., 2020; Lee-Six et al., 2018). Other parameters, such as the number of stem cells in a tissue type, are well known in some tissues but harder to quantify in others.

In general, a lack of time-resolved information complicates evolutionary inferences, and we must often rely on indirect information, for example, the patterns of tissue-specific genetic heterogeneity within and across individuals (Bailey et al., 2021). These observed patterns of genetic heterogeneity emerge from the underlying stochastic processes and can strongly depend on the measurement technique (Caravagna et al., 2020). Both enforce specific limitations on the inferability of the underlying dynamics. Different types of genomic data – such as bulk and single-cell whole genome sequencing – may contain different information on the system, and each comes with its own limitations. When bulk sequencing data is used for inference, some evolutionary parameters such as the population size and proliferation rates of stem cells are entangled and cannot be obtained in isolation (Watson et al., 2020; Williams et al., 2020). If, on the other hand, single-cell data is employed, some additional information becomes available, for example the individual cell mutational burdens and the co-occurrence of mutations. However, the sample size is often small compared to the presumed number of long-term proliferating cells in many tissues (Lee-Six et al., 2018; Salehi et al., 2021; Lim et al., 2020), which in turn introduces other constraints. Here, we show that combining data obtained at different resolutions (bulk or single-cell) can help to overcome these limitations and further narrow possible ranges of inferred parameters. To this end, we develop concrete mathematical and computational models for extracting evolutionary parameters from bulk and single-cell information and apply these methods to whole genome bulk sequencing data in healthy esophagus (Martincorena et al., 2018) and whole genome single-cell sequencing data in healthy hematopoiesis (Lee-Six et al., 2018).

Results

A stochastic model of mutation accumulation in healthy somatic tissue

We model the stem cell dynamics of healthy tissue as a collection of individual cells that divide, differentiate, and die stochastically at predefined rates (see Figure 1). Novel mutations can occur with every cell division, each daughter acquiring a random number drawn from a Poisson distribution with rate $μ$ . We explicitly include symmetric and asymmetric stem cell divisions at potentially different rates, the former resulting in two stem or two differentiated cells, and the latter in one stem and one differentiated cell. Since differentiated cells are lost from the stem cell pool they are functionally equivalent to cell death. While only symmetric divisions allow variants to change in frequency, both division types introduce novel variants into the population and thus both contribute to the evolving heterogeneity.

Figure 1


The distribution of variant allele frequencies changes with the growth phases and by sampling.

(a) In the current population, cells divide symmetrically into two daughter cells or asymmetrically with only one daughter cell kept in the focal population. All other events are mathematically equivalent to cell death and are treated as such. (b) The rates of symmetric and asymmetric division change during the population growth and lead to a dynamic distribution of variant allele frequencies. (c) The observed VAF distribution is shifted again during sampling compared to the VAF of the whole population, a fact that should be considered when inferring population properties from genetic data.

Mutation accumulation occurs through all developmental stages from conception to death (Spencer Chapman et al., 2021). We, therefore, consider three phases of demographic changes of the stem cell population. In the first early developmental phase the stem cell population rapidly expands from a single cell exclusively by symmetric divisions at rate $γ$. This is followed by a growth and maintenance phase wherein the population continues to expand but also undergoes turnover through asymmetric divisions at rate $ϕ$, as well as cell removal (due to differentiation or death) and replacement at rate $ρ$. In the final mature phase, cell turnover continues but the size of the total stem population $N$ remains constant.

We employ two independent implementations of our theoretical model. One is a direct stochastic simulation utilizing a Gillespie algorithm. This approach creates a simulated dataset containing all single-cell or bulk sequencing properties; however, it can be computationally expensive. In addition, we compute the time-dynamical expected value of the distribution of variant allele frequencies (the VAF spectrum) directly, which we find obeys the partial differential equation

$$\partial_{t} V (κ, t) = - \partial_{κ} A (κ, t) V (κ, t) + \partial_{κ}^{2} B (κ, t) V (κ, t) + C (t) δ (κ - 1), \tag{1}$$

where $κ = f N (t)$ denotes the number of cells sharing a variant (the variant frequency $f$ times the total population size $N$), $δ (x)$ is the Dirac delta function, $\partial_{t}$ and $\partial_{κ}$ are the partial derivatives with respect to time and variant size, and

$$A (κ, t) = γ κ, \quad B (κ, t) = κ \left( 1 - \frac{κ}{N (t)} \right) ρ + \frac{γ κ}{2}, \quad C (t) = 2 μ N (t) \left( ρ + γ + \frac{ϕ}{2} \right).$$

This set of equations allows for computationally efficient numerical solutions (SI 1.3).
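As an illustration of the stochastic side of the model, the following is a minimal Gillespie-style sketch, not the authors' implementation. It makes several simplifying assumptions: the three phases are collapsed into two (pure growth up to a hypothetical maximum size `n_max`, then constant size), only the per-cell mutational burden is tracked rather than variant identities, and the function name `simulate_burdens` and all parameter values are illustrative.

```python
import numpy as np

def simulate_burdens(n_max, t_end, gamma, rho, phi, mu, seed=0):
    """Return the mutational burden of every stem cell at time t_end.

    Assumed two-phase simplification of the model in the text:
      growth:   symmetric divisions at rate gamma per cell until n_max cells
      constant: replacement events at rate rho per cell (one cell divides
                symmetrically, one random cell is removed) and asymmetric
                divisions at rate phi per cell
    Every surviving daughter gains Poisson(mu) new mutations per division.
    """
    rng = np.random.default_rng(seed)
    burdens = [0]          # start from a single unmutated cell
    t = 0.0
    while True:
        n = len(burdens)
        growing = n < n_max
        rate_sym = (gamma if growing else rho) * n
        rate_asym = (0.0 if growing else phi) * n
        total = rate_sym + rate_asym
        if total == 0.0:
            break          # nothing can happen any more
        t += rng.exponential(1.0 / total)   # waiting time to next event
        if t > t_end:
            break
        i = int(rng.integers(n))            # cell that divides
        if rng.random() < rate_sym / total:
            parent = burdens[i]
            burdens[i] = parent + int(rng.poisson(mu))        # first daughter
            burdens.append(parent + int(rng.poisson(mu)))     # second daughter
            if not growing:
                # keep the population size fixed: remove one random cell
                burdens.pop(int(rng.integers(n + 1)))
        else:
            # asymmetric division: only one daughter stays in the pool
            burdens[i] += int(rng.poisson(mu))
    return np.array(burdens)

burdens = simulate_burdens(n_max=50, t_end=20.0, gamma=2.0,
                           rho=1.0, phi=1.0, mu=1.2, seed=1)
```

Tracking variant identities (for example, a set of mutation labels per cell) would be needed to build a VAF spectrum from such a simulation; the burden-only version is kept here for brevity.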

Transition of developmental growth and constant population size signatures in bulk whole genome sequencing data of healthy adult esophagus

We first discuss established properties of the VAF spectrum relating to somatic tissues. In certain model systems it is given by a power law $\propto f^{- α}$ with a critical exponent $α$ . The value of the exponent depends on the demographic dynamics of the population. For a well-mixed exponentially growing population without cell death the VAF spectrum $v (f)$ is given by $2 μ / (f + f^{2})$ (a $f^{- 2}$ power law) and is independent of time (Gunnarsson et al., 2021). In contrast, for a population of constant size – i.e., where birth and death rates are equal – the spectrum obeys $v (f) \propto 2 μ / f$ (Durrett, 2015) (a $f^{- 1}$ power law; see also SI 1.5.1), though this solution is only valid at sufficiently long times. In the following, we focus on the dynamics of genetic diversification in healthy tissues and show that the expected VAF spectrum contains both contributions of growing and constant population dynamics.

Healthy adult somatic tissues are thought to initially expand rapidly during ontogenic growth, continue to expand during infancy and childhood, and reach homeostasis by adulthood. It is thus natural to ask what the expected VAF spectrum would look like in tissues experiencing homeostasis after a period of growth. We first investigated the underlying VAF dynamics in a minimal theoretical model where the population switches from pure exponential growth to constant size once a maximal cell number is reached. Numerical solutions of Equation 1 show that the expected VAF distribution exhibits a gradual transition from the $f^{- 2}$ (growing population) to the $f^{- 1}$ (constant population) power law (Figure 2a). These transitional states themselves do not adhere to some intermediate power law (e.g. $f^{- α}$ for $1 < α < 2$), but instead present a sigmoidal shape, with the low-frequency portion following $f^{- 1}$ and the high frequencies $f^{- 2}$. Over time the shape changes as a wavelike front traveling from low to high frequency, with the constant-size equilibrium establishing earliest at the lowest frequencies and moving to higher frequencies over time. Interestingly, the convergence towards equilibrium slows down over time – for evenly-spaced observation times the solutions lie increasingly closer together – further decreasing the speed at which the high frequency portion of the spectrum approaches equilibrium. More complex forms of such demographic changes, for example, the inclusion of a mixed growth-and-maintenance phase or a logistically growing population, result in qualitatively similar behavior (see Figure 2—figure supplement 1).
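One simple way to check which scaling a given part of a spectrum follows is to fit the log-log slope within a frequency window. The sketch below is a generic diagnostic under our own assumptions (the helper name `local_exponent` and the window bounds are hypothetical), not a procedure taken from the study. For a transitional spectrum as described above, one would expect an exponent near 1 in a low-frequency window and near 2 in a high-frequency window.

```python
import numpy as np

def local_exponent(f, v, f_min, f_max):
    """Least-squares estimate of alpha in v ~ f**(-alpha) for f_min <= f <= f_max."""
    f = np.asarray(f, dtype=float)
    v = np.asarray(v, dtype=float)
    # Use only points inside the window with a positive spectrum value
    window = (f >= f_min) & (f <= f_max) & (v > 0)
    slope, _intercept = np.polyfit(np.log(f[window]), np.log(v[window]), 1)
    return -slope

# Pure power laws recover their own exponents
f = np.logspace(-3, -0.3, 50)
alpha_growing = local_exponent(f, f ** -2.0, 1e-3, 0.5)
alpha_constant = local_exponent(f, f ** -1.0, 1e-3, 0.5)
```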

Figure 2

Bulk sequencing based variant allele frequency (VAF) and mutation rate inferences in healthy esophagus.

(a) Expected VAF distributions from evolving Equation 1 to different time points for a population with an initial exponential growth phase and subsequent constant population phase (mature size $N = 10^{3}$). Once the population reaches the maximum carrying capacity, the distribution moves from a $1 / f^{2}$ growing population shape (purple) to a $1 / f$ constant population shape (green). Note that the shift slows considerably at older age. (b) VAF from healthy tissue in the esophagus of nine individuals sorted into age brackets. The youngest bracket, 20–39, is closer to the developmental $1 / f^{2}$ scaling. The older age brackets are both close to the constant population $1 / f$ scaling, resembling the theoretical expectations. (c) Inferred mutation rates increase linearly with age. (d) Simulations of slowly growing stem cell populations reveal that mutation rates appear to increase with age, although the true underlying per division mutation rate remains constant (see Figure 2—figure supplement 1 as well).

To test whether adult human tissues show such transitional signatures, we grouped the VAFs of healthy adult esophagus samples from Martincorena and colleagues (Martincorena et al., 2018) into three age groups (young, middle, and old; Figure 2b). In accordance with our theoretical expectations, the averaged VAF spectrum of the young donor group is closest to the expected $f^{- 2}$ distribution of ontogenic growth, with low frequency variants only starting to approach the $f^{- 1}$ homeostatic scaling. The averaged spectra of the middle and old age groups on the other hand notably shift towards the expected $f^{- 1}$ homeostatic line. Interestingly, while there is a clear separation between the spectra of the youngest and middle age groups, those of the middle and older groups are much closer. This agrees with our prediction that the speed of convergence of the spectra towards homeostatic scaling slows down with age.

VAF-based mutation rate inferences in healthy tissues

The theoretical prediction for the VAF spectrum of an exponentially growing population, which is recovered in many cancers given sufficient sequencing depth (Caravagna et al., 2020), contains information on the effective mutation rate $μ$ (Williams et al., 2016). Similarly, the VAF spectrum of a constant population includes the same mutation rate $μ$ (SI). Thus we can in principle infer $μ$ from the previously described esophagus data, for example by applying a regression approach. However, since the scaling of the VAF spectrum of healthy tissues can be shown to be age-dependent (Figure 2a–b), we use two reference shapes for a growing and a constant population, respectively. The fitting is performed by a linear regression least-squares approach. The general solution is computed from the VAF spectrum of the data, $V_{d} (μ)$ , and the reference shape $V_{r}$ , which assumes $μ = 1$ , as follows:

$$μ = \frac{\mathrm{Cov} (V_{d} (μ), V_{r})}{\mathrm{Var} (V_{r})} \tag{2}$$

If $V_{d} (μ)$ has the shape of $V_{r}$, i.e. $V_{d} (μ) = μ V_{r}$, then $E (V_{d} (μ)) = μ E (V_{r})$ and $E (V_{d} (μ) V_{r}) = μ E (V_{r}^{2})$, which can be used to prove Equation 2:

$$\mathrm{Cov} (V_{d} (μ), V_{r}) = E (V_{d} (μ) V_{r}) - E (V_{d} (μ)) E (V_{r}) = μ \left( E (V_{r}^{2}) - E (V_{r})^{2} \right) = μ \cdot \mathrm{Var} (V_{r}) \tag{3}$$

The VAF shapes for the growing and constant population give close estimates, as can be seen in Figure 2, which compares the age of all individuals with corresponding inferred effective mutation rates. The estimates are in the range of 1–2 mutations per genome per cell division and agree with the previous observations based on early developmental stem cell divisions (Lee-Six et al., 2018). Estimates based on a growing population are slightly higher compared to estimates based on a constant population, but differences are small.
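The regression estimator of Equation 2 can be written in a few lines. The sketch below is illustrative only: the function name `infer_mu`, the frequency grid, and the use of $2 / f$ as the $μ = 1$ reference shape for a constant population are our assumptions, and the "data" are synthetic rather than taken from the esophagus samples.

```python
import numpy as np

def infer_mu(v_data, v_ref):
    """Least-squares estimate mu = Cov(V_d, V_r) / Var(V_r).

    v_data: observed VAF spectrum on a frequency grid
    v_ref:  reference shape evaluated on the same grid, assuming mu = 1
    """
    v_data = np.asarray(v_data, dtype=float)
    v_ref = np.asarray(v_ref, dtype=float)
    cov = np.mean(v_data * v_ref) - v_data.mean() * v_ref.mean()
    return cov / v_ref.var()   # population variance, consistent with cov

# Synthetic example: a spectrum with exactly the reference shape and mu = 1.5
f = np.linspace(0.05, 0.5, 20)
v_ref = 2.0 / f                # assumed constant-population shape for mu = 1
mu_hat = infer_mu(1.5 * v_ref, v_ref)
```

When the data only approximately follow the reference shape, the same expression gives the least-squares slope, which is why two reference shapes (growing and constant) can yield slightly different estimates.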

Surprisingly, we observe a clear trend of a linearly increasing effective mutation rate with age. This trend cannot be explained by the transition of variants from a growing into a constant population alone. An actual increase in the true mutation rate per cell division with age would be a direct explanation. However, this seems unlikely: It is well established that the mutational burden across individuals and tissues increases linearly with age (Cagan et al., 2022; Mitchell et al., 2022; Williams et al., 2022), and an increasing mutation rate would in contrast result in an accelerated increase of mutational burden with age, which is not observed experimentally.

An alternative explanation is a continued slow linear increase of the stem cell population with age (Figure 2d). In such a scenario, the effective mutation rate would appear to increase linearly with age as well, although the true mutation rate per cell division remains unchanged. At first glance, a linearly increasing stem cell population appears unnatural. However, such a linear increase would emerge from a small bias of stem cells towards self-renewal (growth) and concomitantly a slowdown of the stem cell proliferation rate proportional to the change in population size (Werner et al., 2015). The latter is a natural feedback mechanism ensuring a constant output of differentiated somatic cells. Direct in vivo imaging in multiple human tissues including esophagus seems to support an increased density and slower proliferation of stem cells with age (Tomasetti et al., 2019). This explanation is attractive for another reason: A feedback of decreased cell proliferation with increasing population size would also function as a tumor suppressor mechanism, and may be another reason cancer driver mutations in healthy tissues are abundant but rarely progress to cancer (Martincorena et al., 2018; Martincorena et al., 2015). Progression would require additional genomic or environmental changes to overcome this regulatory feedback.

Single-cell mutational burden allows sample size independent inferences of stem cell proliferation and mutation rates

With the increasing availability of single-cell data, it becomes possible to investigate distributions of certain quantities for which bulk sequencing only provides average measurements. One such quantity is the number of mutations present in individual cells. Recent work by Lee-Six and colleagues (Lee-Six et al., 2018) showed that in a 59-year-old donor hematopoietic stem cells had accumulated on average around 1000 mutations per cell; however, there was significant variation between individual genomes, some carrying 900 mutations, while others carried up to 1200. This variation is the result of the stochastic nature of mutation accumulation and cell division, and can be exploited to estimate per division mutation and stem cell proliferation rates. To this end, we look at the distribution of the number of mutations per cell, which we will from here on refer to as the mutational burden distribution. Formally, we describe the mutational burden $m_{j}$ in a cell $j$ by a stochastic sum over its past divisions $m_{j} = \sum_{i}^{y_{j}} u_{i j}$, in which both the total number of divisions $y_{j}$ in the cell’s past and the number of mutations per division $u_{i j}$ are random variables. Even without knowing the actual distributions of $y$ and $u$, general expressions for the expectation and variance of $m_{j}$ that only depend on the expectation and variance of $y$ and $u$ exist and are given by $E (m) = E (y) E (u)$ and $Var (m) = E (y) Var (u) + E (u)^{2} Var (y)$ (SI 1.7). These expressions allow direct estimates of the effective division and mutation rate. Assuming the number of mutations per daughter cell is Poisson distributed with expectation $μ$, the average number of divisions per lineage (expressed in terms of an effective division rate $λ$) becomes

$$\int λ (t) \, d t = E (y) = \frac{E (m)}{μ} . \tag{4}$$

In addition, if the mutation rate $μ$ is unknown, it can similarly be obtained from

$$μ = \left( \frac{\mathrm{Var} (m)}{E (m)} - 1 \right) \frac{E (y)}{\mathrm{Var} (y)} . \tag{5}$$
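Equation 5 follows from the moment expressions above: for Poisson distributed $u$ with $E (u) = Var (u) = μ$, one has $Var (m) / E (m) = 1 + μ Var (y) / E (y)$. The sketch below applies both estimators to synthetic data. The function names, the parameter `fano_divisions` (standing for $Var (y) / E (y)$), and the choice of Poisson distributed division counts are assumptions made for illustration, not values or code from the study.

```python
import numpy as np

def expected_divisions(burdens, mu):
    """Equation 4: E(y) = E(m) / mu, the mean number of divisions per lineage."""
    return float(np.mean(burdens)) / mu

def mutation_rate_from_burden(burdens, fano_divisions=1.0):
    """Equation 5: mu = (Var(m) / E(m) - 1) * E(y) / Var(y).

    fano_divisions is Var(y) / E(y). The default 1.0 is an assumption
    (Poisson-like division counts), not a measured quantity.
    """
    m = np.asarray(burdens, dtype=float)
    return (m.var() / m.mean() - 1.0) / fano_divisions

# Synthetic compound-Poisson burdens with known mu = 1.2 and E(y) = 100.
# A sum of y independent Poisson(mu) draws is distributed as Poisson(mu * y).
rng = np.random.default_rng(0)
y = rng.poisson(100, size=200_000)
m = rng.poisson(1.2 * y)
mu_hat = mutation_rate_from_burden(m)
div_hat = expected_divisions(m, mu_hat)
```

If the true division counts were over-dispersed (`fano_divisions` greater than 1) but the estimator were run with the default of 1, the returned mutation rate would be too large by exactly that factor.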

We first show that these estimators recover per division mutation and proliferation rates in stochastic simulations (Figure 3a and b). More importantly, they only require information on comparably few single cells. Approximately 100 cells are sufficient to reliably reconstruct the mutational burden distribution, which for example would constitute a small sample of the hematopoietic stem cell pool. In fact, the inference does not significantly improve if we increase data resolution beyond a few hundred single cells (Figure 3a and b). This is in stark contrast to inferences based on single-cell phylogeny: Although they contain more information in principle, important aspects of a single-cell phylogeny, e.g., the number of observable branchings, are strongly affected by sampling and stem cell population size.

Figure 3

Inference of evolutionary parameters on simulated stem cell populations.

Simulated populations were run up to age 59, growing exponentially from a single cell to constant size $N_{M} = 10’000$ at age $t_{M} = 5$, with mutation rate $μ = 1.2$, division rate $λ = 5$, and asymmetric division proportion $p = 0.4$. Where sampling is mentioned, a sample size of 89 was used. (a) The single-cell mutational burden distribution. The compound Poisson distribution (dashed line) matches the burden distribution when averaging over multiple independently evolved populations (filled curve). (b) Distribution of estimated mutation rates from 10’000 individual simulations, obtained from burden distributions of the complete populations (blue) as well as sampled sets of cells (orange). Because the expected mutational burden distribution is unaltered by sampling, the expected estimate of the mutation rate from Equation 5 remains unchanged: $E ({\tilde{μ}}_{pop}) = E ({\tilde{μ}}_{sample})$. However, sampling increases the noise on the observed burden distribution, which results in a higher error margin of the estimate: $σ ({\tilde{μ}}_{pop}) < σ ({\tilde{μ}}_{sample})$. (c) VAF spectra measured in the complete population (blue) and a sampled set of cells (orange). In contrast with the mutational burden distribution, strong sampling changes the shape of the expected distribution. A single simulation result is shown (diamonds) alongside the theoretically predicted expected values for both the total and sampled populations (Equations 1 and 12) (dashed line) and the average across 100 simulations (solid line). (d) Distribution of $N_{M}$ and $p$ inference results for 100 simulated and sampled populations, through estimation of $\tilde{μ}$ and $\tilde{λ}$ from the single-cell burden distribution and fitting the number of lowest frequency ($1 / S$) mutations to the theoretical prediction in Equation 1 (see Figure 3—figure supplements 1–3 as well).

We apply our estimators to single-cell mutational data in blood obtained from the study by Lee-Six et al., 2018, in which 89 hematopoietic stem cells (HSCs) were extracted from a bone marrow aspirate of a 59-year-old individual (Figure 4a). Using Equation 4 to estimate the expected number of divisions per lineage from the sample burden (SI 1.8.1), we find a total division rate (including both symmetric and asymmetric divisions) in the constant population phase of $λ = 10.6$ (7.6–15.3) divisions per HSC per year, which is within ranges previously suggested (Watson et al., 2020; Dingli and Pacheco, 2006). Notably, the single-cell mutational burden alone cannot disentangle symmetric and asymmetric division rates. Applying Equation 5 to the sample burden distribution and taking $Var (y) / E (y) = 1$ (akin to assuming exponentially distributed division times, a common but simplified null model) we obtain an estimated mutation rate of $\tilde{μ} = 4.3$ per cell per division (Figure 4a). This is significantly higher compared to the 1.2 mutations per division suggested previously (Werner et al., 2020; Lee-Six et al., 2018). We posit different possible explanations for this discrepancy. Stem cell divisions and mutation accumulation are stochastic processes, therefore variation between individuals is expected. However, simulations of our model for a wide range of parameter values suggest the distribution of these inferred mutation rates is unlikely to present a wide enough variance to explain this difference (Figure 3—figure supplement 2). Recently, Abascal and colleagues (Abascal et al., 2021) suggested that many mutations in somatic tissue are acquired independently of cell divisions. In fact, the previous estimates of 1.2–1.3 mutations per genome per division were derived from early developmental cell divisions where the effects of aging would be negligible. The increased mutation rate inferred here could in part be explained by proliferation-independent effects.
However, it is important to note that the true distribution of divisions per lineage may also be over-dispersed compared to the Poisson model (i.e. $Var (y) > E (y)$ ), which would lead to an overestimation of $μ$ in Equation 5. Furthermore, the observed burden distribution is incompatible with a simple division-independent mutation model, which would lead to a Poisson distribution of the burdens that has a variance much smaller than what we observed (Figure 4a). A combination of division-independent mutations together with non-Poissonian cell divisions could possibly reproduce the result.

Figure 4

Evolutionary inferences in single-cell hematopoietic stem cell (HSC) data.

(a) The single-cell mutational burden distribution of the data (bars) and the compound Poisson distribution obtained from its mean and variance, used to obtain the estimated per division mutation rate $\tilde{μ}$. (b) Distribution of mutation frequencies of the data and theoretically predicted average fitted to only the lowest frequency ($1 / S$) data point. (c) Difference $Δ v_{f}$ between the measured value of the VAF spectrum at the lowest frequency ($1 / S$) and its prediction from Equation 1, for varying total population size $N$ and asymmetric division proportion $p$, with fixed maturation time $t_{M} = 5$ and operational hematopoietic population size $N_{H} = 50$. The solid line denotes the curve of best fit where this difference is 0. (d) Maximally inferred population size $N$ (taking $p = 0$ in (c)) for variation of the maturation time $t_{M}$ and the operational hematopoietic population size $N_{H}$ (see Figure 4—figure supplement 1 as well).

Sparse sampling, single-cell derived VAF spectra, and evolutionary inferences

Current methods limit us to whole genome information of at most a few thousand single cells per tissue or tumor (Lim et al., 2020). In many situations, this will only constitute a small fraction of the underlying long-term proliferating cell population. For example, although the exact number of HSCs remains unknown, most recent estimates suggest a population size of 10⁵ cells (Watson et al., 2020; Lee-Six et al., 2018). In tumors, this number can be as large as 10¹⁰ cells (Werner et al., 2016). As we have shown in the previous section, the single-cell mutational burden distribution allows certain inferences even if sampling is sparse. If we, however, construct the VAF spectrum from such a small sample, the distribution of variants will be significantly transformed with respect to that of the total population (Figure 3c). Since Equation 1 gives the expected curve measured from the total population, one must include a correction, which is given by the expectation of hypergeometric sampling (i.e. without replacement). This can be obtained through the transformation (see Sampling affects the observed VAF distribution):

$$\tilde{V} (i) = \sum_{j = i}^{N} V (j) \cdot \frac{\binom{j}{i} \binom{N - j}{S - i}}{\binom{N}{S}}$$

with $\tilde{V}$ the VAFs observed in the sample and $S$ the sample size. Furthermore, from stochastic simulations we note that the variance in the distribution increases with variant frequency $f$ , making the lowest frequency state (i.e. $1 / S$ ) the best candidate for comparative fitting and evolutionary inferences (see Figure 4—figure supplement 1).
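The sampling transformation above can be evaluated directly with exact integer binomials. The following is a minimal sketch under our own conventions: the function name `sample_spectrum` and the list layout (`v_pop[j]` holds the expected number of variants present in exactly `j` of the `n_pop` cells) are assumptions for illustration, and the double loop is only practical for small populations.

```python
from math import comb

def sample_spectrum(v_pop, n_pop, s):
    """Expected spectrum in a sample of s cells drawn without replacement.

    v_pop[j]: expected number of variants carried by exactly j cells (j = 0..n_pop)
    returns:  list v_samp where v_samp[i] is the expected number of variants
              seen in exactly i of the s sampled cells (i = 1..s; index 0 unused)
    """
    denom = comb(n_pop, s)
    v_samp = [0.0] * (s + 1)
    for i in range(1, s + 1):
        total = 0.0
        for j in range(i, n_pop + 1):
            if v_pop[j]:
                # hypergeometric probability of drawing i carriers out of j
                total += v_pop[j] * comb(j, i) * comb(n_pop - j, s - i) / denom
        v_samp[i] = total
    return v_samp

# Toy population of 10 cells: 10 private variants and 3 clonal variants
v_pop = [0.0] * 11
v_pop[1] = 10.0
v_pop[10] = 3.0
v_samp = sample_spectrum(v_pop, n_pop=10, s=4)
```

In this toy case the clonal variants appear in all four sampled cells, while each private variant is seen only if its single carrier is among the four sampled cells, which happens with probability 4/10.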

Taking the mutation rate from Werner et al., 2020 and the total division rate obtained from the burden distribution as constant, we explored solutions of Equation 1 (sampled by Equation 12) for a wide range of realistic values in the remaining parameter space $\{N, p, t_{M}, N_{H}\}$. Comparing the lowest sampled frequency state ($f = 1 / 89$) with the data (Figure 4b) we identify a curve in the space of the number of HSCs $N$ and $p$ (fraction of divisions that are asymmetric) where theoretically predicted VAF distributions are identical (Figure 4c). Interestingly, this curve in $N$-$p$ space naturally identifies a maximal stem cell population size capable of producing the data, which occurs when stem cells divide exclusively symmetrically ($p = 0$). Variations of $t_{M}$ (the time until maturity of the HSC population) and $N_{H}$ (the HSC pool size at which maintenance begins to occur) had only a small effect on this inference (Figure 4d). This is partly because mutations arising during these phases are more likely to be found at higher frequencies upon measurement (Poon et al., 2021; Werner, 2021). We find the data to be congruent with an adult HSC pool of at most 200’000–300’000 cells, depending on the exact timings of the maturation phase (Figure 4d). This estimate agrees with the original study and other independent inferences based on population data (Watson et al., 2020; Lee-Six et al., 2018). However, this upper bound corresponds to the case of exclusively symmetric stem cell divisions. Accounting for the possibility of mutating asymmetric divisions reduces the estimated stem cell pool. For example, the data would be consistent with 50’000 stem cells if 90% ($p = 0.9$) of all stem cell divisions are asymmetric. In principle, even smaller stem cell pool sizes are possible, though decreasing by an order of magnitude would imply an extremely large proportion (>99%) of asymmetric HSC divisions.
We note that in the data non-developmental branchings are observed, which immediately implies that not all HSC divisions can be asymmetric ($p < 1$) and thus a scenario of an extremely small population of only a handful of HSCs maintaining hematopoiesis can be rejected. With an orthogonal population-based method, Watson and colleagues estimated 25’000 as a reasonable lower bound for the number of HSCs. Based on our inference, this would imply 0.95 as an upper bound for the proportion of all stem cell divisions that are asymmetric.

Discussion

Here, we have shown that single-cell and bulk whole genome sequencing can inform on different aspects of somatic evolution. Single-cell information supports previous bulk-based estimates of possible ranges of HSC numbers. Surprisingly, the same single-cell data leads to a mutation rate inference during homeostasis that is approximately four times higher than previous developmental estimates.

Another open question is the role of selection and how it shapes intra-tissue genetic heterogeneity. Evidence is emerging that positively selected variants in blood are almost universally present in individuals above 60, while the effective observable dynamics in younger individuals is well described by neutral dynamics. How results presented here generalize or modify will critically depend on the model of selection realized in human hematopoiesis, e.g., a model of rare or frequent driver events. Details of the underlying biology are currently unknown.

An important recent study has suggested an amended model of mutation accumulation in somatic tissues, wherein the majority of mutations are acquired continuously over time throughout a cell’s lifespan rather than during DNA replication (Abascal et al., 2021). We note that for a time-dependent rate of mutation accumulation that is constant across all cells at a given point in time, this model predicts Poisson distributed single-cell mutational burdens. In contrast, we find the single-cell mutational burden distribution in human HSCs to be highly over-dispersed compared to a Poisson model. A time-associated model of mutation accumulation with a constant rate alone cannot generate such a pattern. At least one additional source of stochasticity would be needed. This could be fluctuations in the time-dependent mutation rates, non-constant cell proliferation rates (e.g., bursts of stem cell divisions potentially in reaction to injury or disease), or a combination of both effects. Either way, our observations suggest that our current theoretical models of somatic evolution in healthy tissues are incomplete. There is evidence for unknown processes contributing to intra-tissue genetic heterogeneity. The precise nature of these processes remains an open question.

Author details

Marius E Moeller

Nathaniel V Mon Père

Benjamin Werner

Weini Huang
