The success of artificial selection for collective composition hinges on initial and target values

eLife Assessment

This important study of artificial selection in microbial communities shows that the possibility of selecting a desired fraction of slow and fast-growing types is impacted by their initial fractions. The evidence, which relies on mathematical analysis and simulations of a stochastic model, is compelling. It highlights the tension between selection at the strain and the community level. This study should be of interest to researchers interested in ecology, both theoretical and experimental.

https://doi.org/10.7554/eLife.97461.3.sa0

Significance of the findings:

Important: Findings that have theoretical or practical implications beyond a single subfield

Strength of evidence:

Compelling: Evidence that features methods, data and analyses more rigorous than the current state-of-the-art

During the peer-review process the editor and reviewers write an eLife Assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate).

Abstract

Microbial collectives can perform functions beyond the capability of individual members. Enhancing collective functions through artificial selection is, however, challenging. Here, we explore the ‘rafting-a-waterfall’ metaphor where achieving a target population composition depends on both target and initial compositions. Specifically, collectives comprising fast-growing (F) and slow-growing (S) individuals are grown for a ‘maturation’ time, and the collective with S frequency closest to the target value is chosen to ‘reproduce’ by inoculating offspring collectives. During collective maturation, intra-collective selection acts like a waterfall, relentlessly driving the S frequency to lower values, while during collective reproduction, inter-collective selection resembles a rafter striving to reach the target frequency. Using simulations and analytical calculations, we show that intermediate target S frequencies are the most challenging, akin to a target within the vertical drop of a waterfall, rather than above or below it. This arises because intra-collective selection is the strongest at intermediate S frequencies, where it can overpower inter-collective selection. While achieving low target S frequencies is consistently feasible, attaining high target S frequencies requires an initially high S frequency, much like a raft that can descend but not ascend a waterfall. As Newborn size increases, the region of achievable target frequencies shrinks until no frequency is achievable. In contrast, the number of collectives under selection plays a less critical role. In scenarios involving more than two populations, the evolutionary trajectory must navigate entirely away from the metaphorical ‘waterfall drop.’ Our findings illustrate that the strength of intra-collective evolution is frequency-dependent, with implications for experimental planning.

Introduction

Microbial collectives can carry out functions that arise from interactions among member species. These functions, such as waste degradation (Woo et al., 2020; Sun et al., 2022), probiotics (Bober et al., 2018), and vitamin production (Wang et al., 2016), can be useful for human health and biotechnology. To improve collective functions, one can perform artificial selection (directed evolution) on collectives (see Figure 1): Low-density ‘Newborn’ collectives are allowed to ‘mature’, during which cells proliferate and possibly mutate, and community function develops. ‘Adult’ collectives with high functions are then chosen to reproduce, each seeding multiple offspring Newborns. Artificial selection of collectives has been attempted both in experiments (Goodnight, 1990; Swenson et al., 2000b; Swenson et al., 2000a; Blouin et al., 2015; Panke-Buisse et al., 2015; Panke-Buisse et al., 2017; Jochum et al., 2019; Wright et al., 2019; Raynaud et al., 2019; Arora et al., 2020; Chang et al., 2020; Mueller et al., 2021; Jacquiod et al., 2022; Raynaud et al., 2022; Arias-Sánchez et al., 2024) and in simulations (Penn, 2003; Penn and Harvey, 2004; Williams and Lenton, 2007; Xie et al., 2019; Doulcier et al., 2020; Xie and Shou, 2021; Chang et al., 2021; Fraboul et al., 2023; Lalejini et al., 2022; Zaccaria et al., 2023; Vessman et al., 2023), often with unimpressive outcomes.

Figure 1

Schematic for artificial selection on collectives.

Each selection cycle begins with a total of $g$ Newborn collectives, each with $N_{0}$ total cells of slow-growing S population (light gray dots) and fast-growing F population (dark gray dots). During maturation (over time $τ$ ), S and F cells divide at rates $r_{S}$ and $r_{S} + ω$ ( $ω > 0$ ), respectively, and S mutates to F at rate $μ$ . During inter-collective selection, the Adult collective with F frequency $f$ closest to the target composition $\hat{f}$ is chosen to reproduce $g$ Newborns for the next cycle. Newborns are sampled from the chosen Adult (yellow star) with $N_{0}$ cells per Newborn. The selection cycle is then repeated until the F frequency reaches a steady state, which may or may not be the target composition. To denote a variable $x$ of the $i$ -th collective in cycle $k$ at time $t$ ( $0 \leq t \leq τ$ ), we use the notation $x_{k, t}^{(i)}$ where $x \in \{S, F, s, f\}$ . Note that time $t = 0$ is for Newborns and $t = τ$ is for Adults.

One of the major challenges in selecting collectives is to ensure the inheritance of a collective function (Xie et al., 2023; Thomas et al., 2024). Inheritance from a parent collective to offspring collectives can be compromised by changes in genotype and species compositions. During maturation of a collective, genotype compositions within each species can change due to intra-collective selection favoring fast-growing individuals (Figure 1, ‘intra-collective’ selection), while species compositions can change due to ecological interactions. Furthermore, during the reproduction of a collective, genotype and species compositions of offspring can vary stochastically from those of the parent (Figure 1, ‘genetic drift’).

Here, we consider the selection of collectives comprising two or three populations with different growth rates, and our goal is to achieve a target composition in the Adult collective. This is a common quest: whenever a collective function depends on both populations, the collective function is maximized, by definition, at an intermediate frequency (e.g. too little of either population will hamper function; Xie et al., 2019). Earlier work has demonstrated that nearly any target species composition can be achieved when selecting communities of two competing species with unequal growth rates (Doulcier et al., 2020; Rainey, 2023), so long as the shared resource is depleted during collective maturation (Doulcier et al., 2020). In this case, initially, both species evolved to grow faster, and the slower-growing species was preserved due to stochastic fluctuations in species composition during collective reproduction. Eventually, both species evolved to grow sufficiently fast to deplete the shared resource during collective maturation, and evolution in competition coefficients then acted to stabilize the species ratio to the target value (Doulcier et al., 2020). Regardless, earlier studies are often limited to numerical explorations, with prohibitive costs for a full characterization of the parameter space for such nested populations (population of collectives, and populations of variants within a collective).

We mathematically examine the selection of composition in collectives consisting of populations growing at different rates. We made simplifying assumptions so that we can analytically examine the evolutionary tipping point between intra-collective and inter-collective selection. We show that this tipping point creates a ‘waterfall’ effect which restricts not only which target compositions are achievable, but also the initial composition required to achieve the target. We also investigate how the range of achievable target composition is affected by the total population size in Newborns and the total number of collectives under selection. Finally, we show that the waterfall phenomenon extends to systems with more than 2 populations.

Results and discussion

To enable the derivation of an analytical expression, we have made the following simplifying assumptions. First, growth is always exponential, without complications such as resource limitation, ecological interactions between the two populations, or density-dependent growth. Thus, the exponential growth equation can be used. Second, we initially consider only two populations (genotypes or species): the fast-growing F population with size $F$ and the slow-growing S population with size $S$ . We do not consider a spectrum of mutants or species, since with more than two populations, an analytical solution becomes very difficult. Finally, the single top-functioning community is chosen to reproduce, which allows us to employ the simplest version of the extreme value theory (see section below for further justification).

Our goal is to select for collective composition in terms of F frequency $f = F / (S + F)$ , or equivalently, S frequency $s = 1 - f$ . More precisely, we want collectives such that after maturation time $τ$ , $f (τ)$ is as close to the target value $\hat{f}$ as possible (Figure 1). Note that even if the target frequency has been achieved, since F frequency will always increase during maturation, inter-collective selection is required in each cycle to maintain the target frequency.

We will start with a complete model where S mutates to F at a nonzero mutation rate $μ$ . We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations. This scenario is encountered in biotechnology: an engineered pathway will slow down cell growth, and breaking the pathway (and thus faster growth) is much easier than the other way around. When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates. We show that intermediate F frequencies or equivalently, intermediate S frequencies, are the hardest targets to achieve. We then show using simulations that similar conclusions hold when selecting for a target composition in collectives of three populations.

Model structure

A selection cycle (Figure 1; Table 1) starts with a total of $g$ Newborn collectives. At the beginning of cycle $k$ ( $t = 0$ ), each Newborn collective has a fixed total cell number $N_{0} = S_{k, 0}^{(i)} + F_{k, 0}^{(i)}$ where $S_{k, t}^{(i)}$ and $F_{k, t}^{(i)}$ denote the numbers of S and F cells in collective $i$ ( $1 \leq i \leq g$ ) at time $t$ ( $0 \leq t \leq τ$ ) of cycle $k$ . The average F frequency among the $g$ Newborn collectives in cycle $k$ is ${\bar{f}}_{k, 0}$ , such that the initial F cell number in each Newborn is drawn from the binomial distribution $\mathrm{Binom} (N_{0}, {\bar{f}}_{k, 0})$ .

Table 1

Nomenclature.

Variables	Representing
$S$	Number of slower-growing (S) cells
$F$	Number of faster-growing (F) cells
$N$	Total cell numbers in a collective, $N = S + F$
$s$	Frequency of S cells, $s = S / (S + F)$
$f$	Frequency of F cells, $f = F / (S + F) = 1 - s$
$f^{*}$	F frequency of the selected collective in a cycle
Parameters	Representing
$r_{S}$	Growth rate of S
$ω > 0$	Growth rate advantage of F over S
$μ$	Mutation rate from S to F
$g$	Total number of collectives
$τ$	Maturation time
$N_{0}$	Total number of cells in Newborn, or Newborn size
$\hat{s}, \hat{f}$	Target frequency in $s$ or $f$
$f^{L}, f^{H}$	Low and High thresholds of inaccessible $\hat{f}$
$R_{τ}$	Fold-growth of S cells over time $τ$ , $R_{τ} = e^{r_{S} τ}$
$W_{τ}$	Fold ratio change of F cells over S cells over time $τ$ , $W_{τ} = e^{ω τ}$

Collectives are allowed to grow for time $τ$ (‘Maturation’ in Figure 1). During maturation, S and F grow at rates $r_{S}$ and $r_{S} + ω$ ( $ω > 0$ ), respectively. If maturation time $τ$ is too short, a matured collective (‘Adult’) does not have enough cells to reproduce $g$ Newborn collectives with $N_{0}$ cells each. On the other hand, if maturation time $τ$ is too long, fast-growing F will take over. Hence, we set the maturation time $τ = \ln (g + 1) / r_{S}$ , which guarantees sufficient cells to produce $g$ Newborn collectives from a single Adult collective. At the end of a cycle, a single Adult with the highest function (with F frequency $f$ closest to the target frequency $\hat{f}$ ) is chosen to reproduce $g$ Newborn collectives, each with $N_{0}$ cells (‘Selection’ and ‘Reproduction’ in Figure 1). Note that even though S and F do not compete for nutrients, they compete for space: because the total number of cells transferred to the next cycle is fixed, an overabundance of one population will reduce the likelihood of the other being propagated.
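As a worked illustration of this choice of maturation time (a sketch that uses the parameter values quoted in the Figure 2 legend and ignores mutation and stochasticity), the fold-growth of S and the fold change of F relative to S follow directly from exponential growth:

$$R_{τ} = e^{r_{S} τ} = g + 1, \qquad W_{τ} = e^{ω τ} = (g + 1)^{ω / r_{S}} .$$

$$S_{0} R_{τ} + F_{0} R_{τ} W_{τ} \geq (g + 1) (S_{0} + F_{0}) = (g + 1) N_{0} > g N_{0} .$$

For $g = 10$ , $r_{S} = 0.5$ and $ω = 0.03$ , this gives $τ = \ln (11) / 0.5 \approx 4.8$ , $R_{τ} = 11$ and $W_{τ} = 11^{0.06} \approx 1.155$ , so an average Adult holds at least $11 N_{0}$ cells, more than the $10 N_{0}$ needed to seed ten Newborns.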

Collective function is dictated by the Adult’s F frequency $f$ . Among all Adult collectives, the selected Adult is the one whose F frequency is closest to the target value, $\hat{f}$ . In contrast with findings from an earlier study (Xie et al., 2019), choosing top 1 is more effective than the less stringent ‘choosing top 5%.’ In the earlier study, variation in the collective trait is partly due to nonheritable factors such as random fluctuations in Newborn biomass. In that context, a less stringent selection criterion proved more effective, as it helped retain collectives with favorable genotypes that might have exhibited suboptimal collective traits due to unfavorable non-heritable factors. However, since this study excludes non-heritable variations in collective traits, selecting the top 1 collective is more effective than selecting the top 5% (see Appendix 7—figure 1).

The selected Adult, with F frequency denoted as $f^{*}$ , is then used to reproduce $g$ offspring collectives, each with $N_{0}$ total cells. The number of F cells in a Newborn follows a binomial distribution $\mathrm{Binom} (N_{0}, f^{*})$ . By repeating the selection cycle, we aim to achieve and maintain the target composition $\hat{f}$ .

Overall, our model considers mutational stochasticity, as well as demographic stochasticity in terms of stochastic birth and stochastic sampling of a parent collective by offspring collectives. Other types of stochasticity, such as environmental stochasticity and measurement noise, are not considered and require future research.
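The selection cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the simulation code used in the study (which is described in Appendix 1): it assumes zero mutation rate ( $μ = 0$ , the two-species case), treats growth as a pure-birth process in which each founder cell leaves a geometrically distributed number of descendants, and uses hypothetical function names (`grow`, `one_cycle`, `run`) chosen only for this example.

```python
import math
import random


def sample_binomial(n, p, rng):
    """Number of successes in n Bernoulli(p) trials (simple O(n) sampler)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def grow(n0, rate, tau, rng):
    """Stochastic pure-birth growth of n0 founder cells over time tau.

    Assumption of this sketch: each founder independently leaves a
    geometrically distributed lineage with success probability exp(-rate*tau).
    """
    if n0 == 0:
        return 0
    p = math.exp(-rate * tau)
    log_q = math.log1p(-p)             # log(1 - p), a negative number
    total = 0
    for _ in range(n0):
        u = 1.0 - rng.random()         # uniform on (0, 1]
        total += 1 + int(math.log(u) / log_q)
    return total


def one_cycle(f_parent, f_target, n0, g, r_s, omega, rng):
    """Seed g Newborns, mature them, and select the Adult closest to target."""
    tau = math.log(g + 1) / r_s        # maturation time used in the text
    adult_freqs = []
    for _ in range(g):
        f0 = sample_binomial(n0, f_parent, rng)      # F cells in the Newborn
        s_adult = grow(n0 - f0, r_s, tau, rng)       # S grows at rate r_S
        f_adult = grow(f0, r_s + omega, tau, rng)    # F grows at r_S + omega
        adult_freqs.append(f_adult / (s_adult + f_adult))
    f_star = min(adult_freqs, key=lambda f: abs(f - f_target))
    return f_star, adult_freqs


def run(f_init, f_target, cycles, n0=1000, g=10, r_s=0.5, omega=0.03, seed=0):
    """Return the selected F frequency f* after each cycle."""
    rng = random.Random(seed)
    f_star, history = f_init, []
    for _ in range(cycles):
        f_star, _ = one_cycle(f_star, f_target, n0, g, r_s, omega, rng)
        history.append(f_star)
    return history


history = run(f_init=0.5, f_target=0.9, cycles=20)
print(f"selected F frequency after 20 cycles: {history[-1]:.3f}")
```

Because mutation is omitted, a collective that starts without F cells stays at $f = 0$ in this sketch, which is not the case in the full model where S mutates to F.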

The success of collective selection is constrained by the target composition, and sometimes also by the initial composition

Since intra-collective selection favors F, we expect that a higher target $\hat{f}$ (a lower target $\hat{s}$ ) is easier to achieve. By ‘achieve,’ we mean that the absolute error $d$ between the target frequency $\hat{f}$ and the selected frequency averaged among independent simulations $⟨ f^{*} ⟩$ is smaller than 0.05 (i.e. $d = | ⟨ f^{*} ⟩ - \hat{f} | \leq 0.05$ ).

We fixed the Newborn size $N_{0}$ to 1000 and obtained selection dynamics for various initial and target F frequencies by implementing stochastic simulations (Appendix 1). If the target $\hat{f}$ is high (e.g. 0.9, Figure 2a magenta), selection is successful (see Appendix 1—figure 4 for the computed absolute errors): regardless of the initial frequency, $f^{*}$ of the chosen collective eventually converges to the target $\hat{f}$ and stays around it. In contrast, without collective-level selection (e.g. choosing a random collective to reproduce), F frequency increases until F reaches fixation (Appendix 1—figure 3b).

Figure 2

Initial and target compositions determine the success of artificial selection on collectives.

(a–c) F frequency of the selected Adult collective ( $f^{*}$ ) over cycles at different target $\hat{f}$ values (long dashed lines). $\hat{f}$ between $f^{L}$ and $f^{H}$ (orange dotted and solid line segments) is inaccessible, and selection will fail there. (a) A high target F frequency (e.g. $\hat{f} = 0.9 > f^{H}$ ; magenta) can be achieved from any initial frequency (black dots). (b) An intermediate target frequency (e.g. $f^{L} < \hat{f} = 0.5 < f^{H}$ ; green) is never achievable, as all initial conditions converge to $f^{H}$ . (c) A low target frequency (e.g. $\hat{f} = 0.1 < f^{L}$ ; dark blue) is achievable, but only from initial frequencies below $f^{L}$ . For initial frequencies at $f^{L}$ , stochastic outcomes (gray curves) are observed: while some replicates reached the target frequency, others reached $f^{H}$ . For parameters, we used S growth rate $r_{S} = 0.5$ , F growth advantage $ω = 0.03$ , mutation rate $μ = 0.0001$ , maturation time $τ \approx 4.8$ , and $N_{0} = 1000$ . The number of collectives $g = 10$ . Each black line is averaged from 300 independent realizations. (d) Inter-collective selection opposes intra-collective selection. We plot probability density distributions of F frequency $f$ during two consecutive cycles when selection is successful. Data correspond to cycles 31 and 32 from the second lowest initial point in c. $Δ f$ is the selection progress within a cycle (see Box 1). Black triangle: median. (e) Two accessible regions (gold). Either high $\hat{f}$ ( $\hat{f} > f^{H}$ ; region 2) or low $\hat{f}$ starting from low initial $f$ ( $\hat{f} < f^{L}$ and ${\bar{f}}_{1, 0} < f^{L}$ ; region 1) can be achieved. We theoretically predict (by numerically integrating Equation 1) $f^{H}$ (orange solid line) and $f^{L}$ (orange dotted line), which agree with simulation results (gold regions). (f) Example trajectories from initial compositions (black dots) to the target compositions (dashed lines). The gold areas indicate the region of initial frequencies where the target frequency can be achieved. (g) The tension between intra-collective selection and inter-collective selection creates a ‘waterfall’ phenomenon. See the main text for details.

Box 1

Changes in the distribution of F frequency $f$ after one cycle

We consider the case where $f_{k}^{*}$ , the F frequency of the selected Adult at cycle $k$ , is above the target value ( $f_{k}^{*} > \hat{f}$ ). This case is particularly challenging because intra-collective evolution favors fast-growing F and thus will further increase $f$ away from the target. From $f_{k}^{*}$ , Newborns of cycle $k + 1$ will have $f$ fluctuating around $f_{k}^{*}$ , and after they mature, the minimum $f$ is selected ( $f_{k + 1}^{*} = \min [f_{k + 1, τ}^{(1)}, f_{k + 1, τ}^{(2)}, \dots, f_{k + 1, τ}^{(g)}]$ ). If the selected composition at cycle $k + 1$ can be reduced compared to that of cycle $k$ (i.e. $f_{k + 1}^{*} < f_{k}^{*}$ ), the system can evolve to the lower target value.

To find $f_{k}^{*}$ values such that $f_{k + 1}^{*} < f_{k}^{*}$ , we used the median value of the conditional probability distribution $Ψ$ of $f_{k + 1}^{*}$ given the selected $f_{k}^{*}$ at cycle $k$ (mathematical details in Appendix 2). If the median value ( $\mathrm{Median} [Ψ (f_{k + 1}^{*} | f_{k}^{*})]$ ) is smaller than $f_{k}^{*}$ , then selection will likely be successful since the selected Adult in cycle $k + 1$ has more than 50% chance to have a reduced F frequency compared to cycle $k$ .

There are two points where the median values are the same as $f_{k}^{*}$ (Figure 3a), which are assigned as lower-threshold ( $f^{L}$ ) and higher-threshold ( $f^{H}$ ).

Following the extreme value theory, the conditional probability density function $Ψ (f_{k + 1}^{*} = f | f_{k}^{*})$ is

$$Ψ (f_{k + 1}^{*} = f | f_{k}^{*}) = g \, P_{f_{k + 1, τ}} (f | f_{k}^{*}) {\left[1 - \int_{0}^{f} d f' \, P_{f_{k + 1, τ}} (f' | f_{k}^{*})\right]}^{g - 1} . \tag{1}$$

Equation 1 can be described as the product of two probability terms: (i) $g P_{f_{k + 1, τ}} (f | f_{k}^{*})$ describes the probability density that any one of the $g$ Adult collectives achieves $f$ given $f_{k}^{*}$ , and (ii) ${[1 - \int_{0}^{f} d f' P_{f_{k + 1, τ}} (f' | f_{k}^{*})]}^{g - 1}$ describes the probability that all other $g - 1$ collectives achieve frequencies above $f$ and are thus not selected.

Since computing the exact formula of Adults’ $f$ distribution in cycle $k + 1$ is hard, we approximate it as Gaussian with mean $\bar{f} (τ)$ and variance $σ_{f}^{2} (τ)$ . The Gaussian approximation on Equation 1 requires sharp Gaussian distributions of $S (τ)$ and $F (τ)$ (i.e. $\bar{S} (τ) ≫ σ_{S} (τ)$ and $\bar{F} (τ) ≫ σ_{F} (τ)$ ). Compared to Gaussian, the exact $S (τ)$ (negative binomial) distribution and $F (τ)$ (Luria-Delbrück) distribution are right-skewed and heavy-tailed. However, these problems are alleviated when the initial numbers of $S$ and $F$ cells are not small (on the order of 100). Indeed, sufficiently sharp distributions can be achieved (see Appendix 1—figure 1).

To obtain an analytical solution of the change in $f$ over one cycle, we first assume that in a Newborn collective, the number of S cells is distributed as Gaussian with mean ${\bar{S}}_{0} = N_{0} (1 - f_{k}^{*})$ and variance $σ_{S, 0}^{2} = N_{0} f_{k}^{*} (1 - f_{k}^{*})$ . Then, the number of F cells, $F_{0} = N_{0} - S_{0}$ , is distributed as Gaussian with mean ${\bar{F}}_{0} = N_{0} f_{k}^{*}$ and variance $σ_{F, 0}^{2} = N_{0} f_{k}^{*} (1 - f_{k}^{*})$ . From these, we can calculate for Adult collectives the mean and variance of population sizes $F (τ)$ (i.e. $\bar{F} (τ)$ , $σ_{F}^{2} (τ)$ ) and $S (τ)$ (i.e. $\bar{S} (τ)$ , $σ_{S}^{2} (τ)$ ) (mathematical details in Appendix 1). This task is simplified by the exponential growth of S and F: $R_{τ} = e^{r_{S} τ}$ describes the fold growth of S over maturation time $τ$ , and since $ω$ is the fitness advantage of F over S, $W_{τ} = e^{ω τ}$ describes the fold change of F/S over time $τ$ . From $R_{τ}$ , $W_{τ}$ , $\frac{μ}{ω}$ (mutation rate scaled with the fitness difference), $f_{k}^{*}$ (F frequency in the selected collective at cycle $k$ ), $N_{0}$ (Newborn size), and $\frac{ω}{r_{S}}$ (relative fitness advantage), we can calculate the mean and variance of F frequency among the Adults of cycle $k + 1$ ( $\bar{f} (τ)$ and $σ_{f}^{2} (τ)$ ; detailed formulas in Equations 48 and 49).

Selection progress, defined as the difference between the median of the conditional probability distribution $Ψ (f_{k + 1}^{*} | f_{k}^{*})$ and the selected frequency $f_{k}^{*}$ (Appendix 2), can be expressed as:

$$△ f = \mathrm{Median} [Ψ (f_{k + 1}^{*} | f_{k}^{*})] - f_{k}^{*} = \bar{f} (τ) + \left[Φ^{- 1} \left(\frac{\ln 2}{g}\right)\right] σ_{f} (τ) - f_{k}^{*}, \tag{2}$$

where $Φ^{- 1} (\dots)$ is the inverse cumulative function of standard normal distribution (see main text for an example). We chose the median because compared to the mean, it is easier to get an analytical expression since $Φ^{- 1} (\dots)$ is known in a closed form. Regardless, using median generated results similar to simulations (Appendix 2—figure 3). As expected, selection progress $△ f$ is governed by both the mean ( $\bar{f} (τ)$ ) and the variation ( $σ_{f} (τ)$ ) in $f$ among Adults.
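The argument $\frac{\ln 2}{g}$ can be motivated by a short calculation. The following is a sketch under the assumptions that the $g$ Adults are independent and share the same cumulative distribution $C (f)$ , and it may differ in detail from the derivation in Appendix 2. The median $m$ of the minimum of $g$ draws satisfies

$$1 - {[1 - C (m)]}^{g} = \frac{1}{2} \quad \Rightarrow \quad C (m) = 1 - 2^{- 1 / g} \approx \frac{\ln 2}{g} \quad (g ≫ 1) .$$

With the Gaussian approximation $C (m) = Φ \left(\frac{m - \bar{f} (τ)}{σ_{f} (τ)}\right)$ , this gives $m \approx \bar{f} (τ) + Φ^{- 1} (\frac{\ln 2}{g}) σ_{f} (τ)$ , the median appearing in Equation 2. For $g = 10$ , the two expressions are close: $1 - 2^{- 1 / 10} \approx 0.067$ and $\frac{\ln 2}{10} \approx 0.069$ .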

When the mutation rate $μ = 0$ , $\bar{f} (τ)$ and $σ_{f} (τ)$ can be simplified to:

$$\bar{f} (τ) = \frac{f_{k}^{*}}{\frac{1 - f_{k}^{*}}{W_{τ}} + f_{k}^{*}}, \tag{3}$$

and

$$σ_{f}^{2} (τ) = \frac{1}{N_{0} W_{τ}^{2}} \frac{f_{k}^{*} (1 - f_{k}^{*}) \left(2 - 2 f_{k}^{*} + 2 f_{k}^{* 2} - \frac{1 - f_{k}^{*}}{R_{τ} W_{τ}} - \frac{f_{k}^{*}}{R_{τ}}\right)}{{\left(\frac{1 - f_{k}^{*}}{W_{τ}} + f_{k}^{*}\right)}^{4}} . \tag{4}$$

In the limit of small $f_{k}^{*}$ , Equation 3 becomes $\bar{f} (τ) |_{f_{k}^{*} ≪ 1} \approx f_{k}^{*} W_{τ}$ while Equation 4 becomes $σ_{f}^{2} (τ) |_{f_{k}^{*} ≪ 1} = (2 - \frac{1}{R_{τ} W_{τ}}) f_{k}^{*} W_{τ}^{2} / N_{0}$ . Thus, both Newborn size ( $N_{0}$ ) and fold-change in F/S during maturation ( $W_{τ}$ ) are important determinants of selection progress.
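Equations 2 to 4 are simple enough to evaluate directly. The sketch below is an illustration rather than the authors' code: it assumes zero mutation rate, takes the default parameter values from the Figure 2 legend, and uses hypothetical helper names (`adult_mean_sd`, `delta_f`, `bisect_root`). It locates the two frequencies at which selection progress changes sign, which play the role of $f^{L}$ and $f^{H}$ in this simplified setting. Because mutation is ignored, the numbers it prints need not match the thresholds reported for $μ = 0.0001$ .

```python
import math
from statistics import NormalDist


def adult_mean_sd(f, n0, r_tau, w_tau):
    """Mean and standard deviation of Adult F frequency (Equations 3 and 4)."""
    denom = (1.0 - f) / w_tau + f
    mean = f / denom
    bracket = (2.0 - 2.0 * f + 2.0 * f * f
               - (1.0 - f) / (r_tau * w_tau) - f / r_tau)
    var = f * (1.0 - f) * bracket / (n0 * w_tau ** 2 * denom ** 4)
    return mean, math.sqrt(var)


def delta_f(f, n0=1000, g=10, r_s=0.5, omega=0.03):
    """Selection progress over one cycle (Equation 2), mutation rate zero."""
    tau = math.log(g + 1) / r_s
    r_tau, w_tau = math.exp(r_s * tau), math.exp(omega * tau)
    mean, sd = adult_mean_sd(f, n0, r_tau, w_tau)
    z = NormalDist().inv_cdf(math.log(2) / g)   # negative for g >= 2
    return mean + z * sd - f


def bisect_root(func, lo, hi, tol=1e-10):
    """Bisection on an interval where func changes sign."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Progress is negative at f = 0.1 and 0.9 but positive at f = 0.5,
# so one sign change lies in each bracket.
f_low = bisect_root(delta_f, 0.1, 0.5)
f_high = bisect_root(delta_f, 0.5, 0.9)
print(f"f_L ~ {f_low:.3f}, f_H ~ {f_high:.3f} (zero-mutation approximation)")
```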

In contrast, an intermediate target frequency (e.g. $\hat{f} = 0.5$ ; Figure 2b green) is never achievable. High initial F frequencies (e.g. 0.95) decline toward the target but stabilize at the ‘high-threshold’ $f^{H}$ (∼ 0.7, solid orange line segment in Figure 2a-c) above the target. Low initial F frequencies (e.g. 0) increase toward the target, but then overshoot and stabilize at the $f^{H}$ value.

If the target frequency is low (e.g. $\hat{f} = 0.1$ ; Figure 2c dark blue), artificial selection succeeds when the initial frequency is below the ‘lower-threshold’ $f^{L}$ (dotted orange line segment in Figure 2a-c). Initial F frequencies above $f^{L}$ (e.g. 0.45 and 0.95) converge to $f^{H}$ instead. Initial F frequencies near $f^{L}$ display stochastic trajectories, converging to either $f^{H}$ or $\hat{f}$ .

To achieve target $\hat{f}$ , inter-collective selection must overcome intra-collective selection. We can visualize the distributions of $f$ over two consecutive cycles (bottom to top, Figure 2d) where $f$ started above target $\hat{f}$ . When Newborns matured into Adults, the distribution of $f$ up-shifted due to intra-collective selection. The distribution of $f$ was then down-shifted toward the target due to inter-collective selection. If the magnitude of the down-shift exceeded that of the up-shift, progress toward the target was made. During reproduction of collectives, the distribution of $f$ retained the same mean but became broader due to stochastic sampling by the Newborns from their parent.

In summary, two regions of target frequencies are ‘accessible’ (gold in Figure 2e, f; Box 1): (1) target frequencies above $f^{H}$ ( $\hat{f} > f^{H}$ ) or (2) target frequencies below $f^{L}$ ( $\hat{f} < f^{L}$ ) and starting at an average frequency below $f^{L}$ ( ${\bar{f}}_{1, 0} < f^{L}$ ).

Intra-collective evolution is the fastest at intermediate F frequencies, creating the ‘waterfall’ phenomenon

To understand what gives rise to the two accessible regions, we calculated $△ f$ , the selection progress in F frequency over two consecutive cycles (Box 1, Equation 2). The solution (Figure 3a, green) has the same shape as results from numerically integrating Equation 1 (Figure 3a, orange) and from stochastic simulations (Figure 3a, blue).

Figure 3

Intra-collective selection and inter-collective selection jointly set the boundaries for selection success.

(a) The change in F frequency over one cycle. When $f_{k}^{*}$ is sufficiently low or high, inter-collective selection can lower the F frequency to below $f_{k}^{*}$ ( $Δ f < 0$ ). The points where $Δ f = 0$ (in the orange line) are denoted as $f^{L}$ and $f^{H}$ , corresponding to the boundaries in Figure 2. (b) The distributions of frequency differences obtained by 1000 numerical simulations. The cyan, purple, and black box plots respectively indicate the changes in F frequency after intra-collective selection (the mean frequency among the 100 Adults minus the mean frequency among the 100 Newborns during maturation), after inter-collective selection (the frequency of the 1 selected Adult minus the mean frequency among the 100 Adults), and over one selection cycle (the frequency of the selected Adult of one cycle minus that of the previous cycle). The box ranges from 25% to 75% of the distribution, and the median is indicated by a line across the box. The upper and lower whiskers indicate maximum and minimum values of the distribution. ***p<0.001 in an unpaired $t$ -test.

If $△ f$ is negative, then inter-collective selection will succeed in countering intra-collective selection and reducing $f$ toward the target. $△ f$ is negative if the selected $f_{k}^{*}$ is low or high, but not if it is intermediate between $f^{L}$ and $f^{H}$ (Figure 3a). This is because the increase in $f$ during maturation is the most drastic when Newborn $f$ is intermediate (Figure 3b), for intuitive reasons: when Newborn $f$ is low, the increase in $f$ will be minor; when Newborn $f$ is high, the fitness advantage of F over the population average is small and hence the increase is also minor. Thus, when Newborn F frequency is intermediate, intra-collective selection is the strongest and may overwhelm inter-collective selection (Figure 3b and Appendix 2—figure 2a). Not surprisingly, similar conclusions are derived where S and F are slow-growing and fast-growing species which cannot be converted through mutations (Appendix 4 and Appendix 4—figure 1).
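The intuition that intra-collective selection peaks at intermediate frequencies can be made explicit from Equation 3. This is a sketch that assumes zero mutation rate and considers only the mean change during maturation, ignoring fluctuations:

$$\bar{f} (τ) - f_{k}^{*} = \frac{f_{k}^{*} (1 - f_{k}^{*}) (W_{τ} - 1)}{1 + f_{k}^{*} (W_{τ} - 1)} .$$

This expression vanishes at $f_{k}^{*} = 0$ and $f_{k}^{*} = 1$ and is maximal at $f_{k}^{*} = \frac{1}{1 + \sqrt{W_{τ}}}$ , where it equals $\frac{\sqrt{W_{τ}} - 1}{\sqrt{W_{τ}} + 1}$ . For the parameters of Figure 2 ( $W_{τ} \approx 1.155$ ), the peak lies near $f_{k}^{*} \approx 0.48$ with a mean up-shift of about $0.036$ per cycle.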

Thus, inter-collective selection is akin to a rafter rowing the raft to a target, while intra-collective selection is akin to a waterfall. This metaphor is best understood in terms of S frequency $s = 1 - f$ . The lower threshold $f^{L}$ corresponds to the higher threshold in $s$ , $s^{H} = 1 - f^{L}$ , and the higher threshold $f^{H}$ corresponds to the lower threshold $s^{L} = 1 - f^{H}$ . Like a waterfall, intra-collective selection drives the S frequency $s$ from high to low (Figure 2g). It acts the strongest when $s$ is intermediate ( $s^{L} < s < s^{H}$ ), similar to the vertical drop of the fall. It acts weakly at high ( $> s^{H}$ ) or low ( $< s^{L}$ ) $s$ , similar to the gently sloped upper and lower pools of the fall (regions 1 and 2 of Figure 2e and g). Thus, an intermediate target frequency can be impossible to achieve: a raft starting from the upper pool will be flushed down to $s^{L}$ ( $f^{H}$ ), while a raft starting from the lower pool cannot go beyond $s^{L}$ ( $f^{H}$ ). In contrast, a low target S frequency (in the lower pool) is always achievable. Finally, a high target S frequency (in the upper pool) can only be achieved if starting from the upper pool (as the raft cannot jump to the upper pool if starting from below).

Manipulating experimental setups to expand the achievable target region

In Equation 2 (Box 1), selection progress $△ f$ depends on the total number of collectives under selection ( $g$ ). $△ f$ also depends on the mean and the standard deviation of Adult F frequency, $\bar{f} (τ)$ and $σ_{f} (τ)$ . Equations 3 and 4 of Box 1 provide simplified expressions of $\bar{f} (τ)$ and $σ_{f} (τ)$ when mutation rate $μ$ has been set to 0. When the mutation rate $μ$ is not zero (Equations 48 and 49 in Appendix 2), selection progress is additionally influenced by $\frac{μ}{ω}$ (mutation rate $μ$ scaled with fitness difference $ω$ ).

Our goal is to make $△ f$ as negative as possible so that any increase in $f$ during collective maturation may be reduced. From Equation 2 in Box 1, a small $\bar{f} (τ)$ will facilitate collective-level selection. Additionally, a large $σ_{f} (τ)$ will also facilitate collective-level selection due to negative $Φ^{- 1} (\frac{\ln 2}{g})$ . Note that since $\frac{\ln 2}{g}$ <0.5 for $g \geq 2$ , $Φ^{- 1} (\frac{\ln 2}{g})$ — corresponding to the number $y$ such that the probability of a standard normal random variable being less than or equal to $y$ is $\frac{\ln 2}{g}$ — is negative.

From Equation 4 in Box 1, $σ_{f} (τ)$ will be large if Newborn size $N_{0}$ is small. Indeed, as Newborn size $N_{0}$ declines, the region of achievable target frequency expands (gold area in Figure 4a). If the Newborn size $N_{0}$ is sufficiently small (e.g. ≤ 700 in our parameter regime), any target frequency can be reached. An analytical approximation of the maximal Newborn size permissible for all target frequencies is given in Appendix 3.
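The dependence on Newborn size can be explored with a small scan. The sketch below is an assumption-laden illustration: it uses the zero-mutation forms of Equations 2 to 4, the parameter values of Figure 2, a finite grid of frequencies, and a hypothetical helper `all_targets_accessible`. It asks whether selection progress is negative at every grid frequency for a given $N_{0}$ . The critical size it suggests need not coincide with the value reported for the full model with mutation.

```python
import math
from statistics import NormalDist


def delta_f(f, n0, g=10, r_s=0.5, omega=0.03):
    """Equation 2 with the zero-mutation mean and variance (Equations 3, 4)."""
    tau = math.log(g + 1) / r_s
    r_tau, w_tau = math.exp(r_s * tau), math.exp(omega * tau)
    denom = (1 - f) / w_tau + f
    bracket = 2 - 2 * f + 2 * f * f - (1 - f) / (r_tau * w_tau) - f / r_tau
    sd = math.sqrt(f * (1 - f) * bracket / (n0 * w_tau ** 2 * denom ** 4))
    return f / denom + NormalDist().inv_cdf(math.log(2) / g) * sd - f


def all_targets_accessible(n0, grid=199):
    """True if selection progress is negative at every interior grid point."""
    freqs = [(i + 1) / (grid + 1) for i in range(grid)]
    return all(delta_f(f, n0) < 0 for f in freqs)


for n0 in (200, 500, 1000, 2000):
    print(n0, all_targets_accessible(n0))
```

Since $σ_{f} (τ)$ scales as $1 / \sqrt{N_{0}}$ in Equation 4 while $\bar{f} (τ)$ does not depend on $N_{0}$ , shrinking the Newborn size strengthens only the variance term that inter-collective selection exploits.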

Figure 4

Expanding the region of success for artificial collective selection.

(a) Reducing the population size in Newborn $N_{0}$ expands the region of success. In the gold area, the probability that $f_{k + 1}^{*}$ becomes smaller than $f_{k}^{*}$ in a cycle is more than 50%. We used $g = 10$ and $τ \approx 4.8$ . Figures 2–3 correspond to $N_{0} = 1000$ in this graph. Black dotted line indicates the critical Newborn size below which all target frequencies can be achieved. (b) Increasing the total number of collectives $g$ also expands the region of success, although only slightly. We used a fixed Newborn size $N_{0} = 1000$ . The maturation time $τ = \log (100) / r_{S} \approx 9.2$ is set to be long enough so that an Adult can generate at least 100 Newborns. (c) Increasing the maturation time shrinks the region of success. We used a fixed Newborn size $N_{0} = 1000$ and number of collectives $g = 10$ .

From Equations 3 and 4 in Box 1, maturation time $τ$ affects $\bar{f} (τ)$ and $σ_{f} (τ)$ through $W_{τ} = e^{ω τ}$ (the fold change in F/S over $τ$ ), and affects $σ_{f} (τ)$ additionally through $R_{τ} = e^{r_{S} τ}$ (fold-growth of S over $τ$ ). Longer $τ$ increases $\bar{f} (τ)$ and is thus detrimental to selection progress. The relationship between $σ_{f} (τ)$ and $τ$ is not monotonic (Appendix 2—figure 2c), meaning that an intermediate value of $τ$ is the best for achieving large $σ_{f} (τ)$ . However, the effect of $\bar{f} (τ)$ dominates that of $σ_{f} (τ)$ and therefore, the region of success monotonically reduces with longer maturation time (Figure 4c). Similarly, $\bar{f} (τ)$ will be small if $ω$ (fitness advantage of F over S) is small. Indeed, as $ω$ becomes larger, the region of success becomes smaller (Appendix 5—figure 1).

$g$ , the number of collectives under selection, also affects selection outcomes. As $g$ increases, the value of $Φ^{- 1} (\frac{\ln 2}{g})$ becomes more negative, and so does $Δ f$ , meaning that collective-level selection becomes more effective. Intuitively, with more collectives, it is more likely that at least one of them has an $f$ close to the target. Thus, a larger number of collectives broadens the region of success (Figure 4b). However, the effect of $g$ is not dramatic. To see why, we note that $g$ appears in Equation 2 only through $Φ^{- 1} (\frac{\ln 2}{g})$ . When $g$ becomes large, $Φ^{- 1} (\frac{1}{g})$ is asymptotically expressed as $Φ^{- 1} (\frac{1}{g}) \approx - \sqrt{2 \ln g - \ln [\ln g] + \dots}$ (Appendix 2) (Phllip, 1960); the factor $\ln 2$ only replaces $\ln g$ by $\ln g - \ln [\ln 2]$ . The quantile therefore grows roughly as $\sqrt{\ln g}$ and does not change dramatically as $g$ varies.
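The weak dependence on $g$ can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `selection_quantile` and the listed values of $g$ are illustrative assumptions, and the quantile is evaluated with Python's standard-library `statistics.NormalDist`.

```python
from math import log
from statistics import NormalDist


def selection_quantile(g: int) -> float:
    """Return Phi^{-1}(ln 2 / g), the standard-normal quantile appearing in Equation 2.

    `selection_quantile` is a hypothetical helper name used only for illustration.
    """
    if g < 2:
        raise ValueError("the quantile is only guaranteed to be negative for g >= 2")
    return NormalDist().inv_cdf(log(2) / g)


if __name__ == "__main__":
    # Illustrative values of g; each tenfold increase shifts the quantile only modestly.
    for g in (2, 10, 100, 1000, 10000):
        print(f"g = {g:>5d}   quantile = {selection_quantile(g):.3f}")
```

Because the quantile multiplies $σ_{f} (τ)$ in Equation 2, a tenfold increase in the number of collectives buys much less than a comparable reduction in Newborn size.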

The waterfall phenomenon in a higher dimension

To examine the waterfall effect in a higher dimension, we investigate a three-population system where a faster-growing population (FF) grows faster than the fast-growing population (F) which grows faster than the slow-growing population (S) (Figure 5a and Appendix 8—figure 1). In the three-population case, the evolutionary trajectory travels in a two-dimensional plane. A target population composition can be achieved if inter-collective selection can sufficiently reduce the frequencies of F as well as FF (accessible regions, gold in Figure 5b).

Figure 5

In higher dimensions, the success of artificial selection requires the entire evolutionary trajectory to remain in the accessible region.

(a) During collective maturation, a slow-growing population (S) (with growth rate $r_{S}$ ; light gray) can mutate to a fast-growing population (F) (with growth rate $r_{S} + ω$ ; medium gray), which can mutate further into a faster-growing population (FF) (with growth rate $r_{S} + 2 ω$ ; dark gray). Here, the rates of both mutational steps are $μ$ , and $ω > 0$ . (b) Evolutionary trajectories from various initial compositions (open circles) to various targets (filled triangles). Intra-collective evolution favors FF over F (vertical blue arrow) over S (horizontal blue arrow). The accessible regions are marked gold (see Appendix 1). We obtain final compositions starting from several initial compositions while aiming for different target compositions in i, ii, and iii. The evolutionary trajectories are shown in dots with color gradients from initial time (light grey) to final time (dark grey). (i) A target composition with a high FF frequency is always achievable. (ii) A target composition with intermediate FF frequency is never achievable. (iii) A target composition with low FF frequency is achievable only if starting from an appropriate initial composition such that the entire trajectory never meanders away from the accessible region. The figures are drawn using the mpltern package (Ikeda et al., 2019). (c) The accessible region in the three-population problem is interpreted as an extension of the two-population problem. First, the accessible region between FF and S+F is given, and then the S+F region is stretched into S and F.

From numerical simulations, we identified two accessible regions: a small region near FF and a band region spanning from S to F (gold in Figure 5b i). Intuitively, the rate at which FF grows faster than S+F is greater than the rate at which F grows faster than S (see Appendix 8). Thus, the problem can initially be reduced to a two-population problem (i.e. FF versus F+S; Figure 5c left), and then expanded to a three-population problem (Figure 5c right).

Similar to the two-population case, targets in the inaccessible region are never achievable (Figure 5b ii), while those in the FF region are always achievable (Figure 5b i). Strikingly, a target composition in an accessible region may not be achievable even when the initial composition is within the same region: once the composition escapes the accessible region, the trajectory cannot return to it (Figure 5b iii, the leftmost initial condition). However, if the initial position is closer to the target in the accessible region, the target becomes achievable (Figure 5b iii, initial condition near the bottom). Note that here, the selection outcome is path-dependent in the sense of being sensitive to initial conditions. This phenomenon is distinct from hysteresis, where path-dependence results from whether a tuning parameter is increased or decreased.

In conclusion, we have investigated the evolutionary trajectories of population compositions in collectives under selection, which are governed by intra-collective selection (which favors fast-growing populations) and inter-collective selection (which, in our case, strives to counter fast-growing populations). Intra-collective selection has the strongest effect at intermediate frequencies of faster-growing populations, potentially creating an inaccessible region of target frequency analogous to the vertical drop of a waterfall. High and low target frequencies are both accessible, analogous to the lower and the upper pools of a waterfall, respectively. A less challenging target (high $\hat{f}$ ; low $\hat{s}$ ) is achievable from any initial position. In contrast, a more challenging target (low $\hat{f}$ ; high $\hat{s}$ ) is achievable only if the entire trajectory stays within the accessible region, just as a raft striving to reach a point in the upper pool must start in, and remain in, the upper pool. Our work suggests that the strength of intra-collective selection is not constant, and that strategically choosing an appropriate starting point can be essential for successful collective selection.

Materials and methods

Stochastic simulations

A selection cycle is composed of three steps: maturation, selection, and reproduction. At the beginning of cycle $k$ , collective $i$ has $S_{k, 0}^{(i)}$ slow-growing cells and $F_{k, 0}^{(i)}$ fast-growing cells. In the first cycle, the mean F frequency of collectives is set to ${\bar{f}}_{1, 0}$ . $F_{1, 0}^{(i)}$ is sampled from the binomial distribution with mean $N_{0} {\bar{f}}_{1, 0}$ , and collective $i$ then contains $S_{1, 0}^{(i)} = N_{0} - F_{1, 0}^{(i)}$ S cells. In the maturation step, we calculate $S_{k, τ}^{(i)}$ and $F_{k, τ}^{(i)}$ by using stochastic simulation. We can simulate the division and mutation of each individual cell stochastically by using the tau-leaping algorithm (Gillespie, 2001; Cao et al., 2006; see Appendix 1—figure 3). However, individual-based simulations require a long computing time. Instead, we randomly sample $S_{k, τ}^{(i)}$ and $F_{k, τ}^{(i)}$ from the joint probability density distribution $P (S_{k, τ}^{(i)}, F_{k, τ}^{(i)})$ . To obtain $P (S_{k, τ}^{(i)}, F_{k, τ}^{(i)})$ , we solve the master equation which describes the time evolution of the probability distribution $P (S_{k, t}^{(i)}, F_{k, t}^{(i)})$ under the random processes (see Appendix 1). We assume that $S_{k, τ}^{(i)}$ and $F_{k, τ}^{(i)}$ are independent (as S and F populations grow independently without ecological interactions), and thus $P (S_{k, τ}^{(i)}, F_{k, τ}^{(i)})$ is the product of two probability density functions, $P (S_{k, τ}^{(i)})$ and $P (F_{k, τ}^{(i)})$ . Each follows a Gaussian distribution, with the mean and variance numerically obtained from ordinary differential equations derived from the master equation (see Appendix 1). In the selection step, we choose the collective whose frequency is closest to the target $\hat{f}$ ; its frequency is denoted by $f_{k}^{*}$ . In the reproduction step, the chosen collective generates $g$ Newborns. The number of F cells in each Newborn is sampled from the binomial distribution with mean $N_{0} f_{k}^{*}$ , and the number of S cells in collective $i$ is then $S_{k + 1, 0}^{(i)} = N_{0} - F_{k + 1, 0}^{(i)}$ . We start a new cycle with those Newborn collectives.

Analytical approach to the conditional probability

The conditional probability distribution $Ψ (f_{k + 1}^{*} | f_{k}^{*})$ of observing $f_{k + 1}^{*}$ at a given $f_{k}^{*}$ is calculated by the following procedure. Given the selected collective in cycle $k$ with $f_{k}^{*}$ , the collective-level reproduction proceeds by sampling $g$ Newborn collectives with $N_{0}$ cells in cycle $k + 1$ . Each Newborn collective contains a certain number of F cells, $F_{k + 1, 0}^{(1)}, \dots, F_{k + 1, 0}^{(g)}$ , at the beginning of cycle $k + 1$ , which can be mapped into frequencies $f_{k + 1, 0}^{(1)}, \dots, f_{k + 1, 0}^{(g)}$ with the constraint of $N_{0}$ cells. If the number of cells in the selected collective is large enough, the joint conditional distribution function $P (f_{k + 1, 0}^{(1)}, \dots, f_{k + 1, 0}^{(g)} | f_{k}^{*})$ is well described by the product of $g$ independent and identical Gaussian distributions $N (m, σ^{2})$ . So we consider the frequencies of $g$ Newborn collectives as $g$ identical copies of the Gaussian random variable $f_{k + 1, 0}$ . The mean and variance of $f_{k + 1, 0}$ are given by $m = f_{k}^{*}$ and $σ^{2} = f_{k}^{*} (1 - f_{k}^{*}) / N_{0}$ . Then, the conditional probability distribution function of $f_{k + 1, 0}$ being $ζ$ is given by

P_{f_{k + 1, 0}} (ζ | f_{k}^{*}) = \frac{1}{\sqrt{2 π σ^{2}}} \exp (- \frac{(ζ - m)^{2}}{2 σ^{2}}) .

After the reproduction step, the Newborn collectives grow for time $τ$ . The frequency is changed from the given frequency $ζ$ to $f$ by division and mutation processes. We assume that the frequency $f$ of an Adult is also approximated by a Gaussian random variable $N (\bar{f} (τ), σ_{f}^{2} (τ))$ . The mean $\bar{f} (τ)$ and variance $σ_{f}^{2} (τ)$ are calculated by using means and variances of $S$ and $F$ (see Appendix 2). Since $\bar{f} (τ)$ and $σ_{f}^{2} (τ)$ also depend on $ζ$ , the conditional probability distribution function of $f_{k + 1, τ}$ being $f$ is given by

P_{f_{k + 1, τ}} (f | ζ) = \frac{1}{\sqrt{2 π σ_{f}^{2} (τ)}} \exp (- \frac{(f - \bar{f} (τ))^{2}}{2 σ_{f}^{2} (τ)}) .

The conditional probability distribution for an Adult collective in cycle $k + 1$ ( $f_{k + 1, τ}$ ) to have frequency $f$ at a given $f_{k}^{*}$ is calculated by multiplying the two Gaussian distribution functions and integrating over all $ζ$ values, which gives

P_{f_{k + 1, τ}} (f | f_{k}^{*}) = \int_{0}^{1} d ζ P_{f_{k + 1, τ}} (f | ζ) P_{f_{k + 1, 0}} (ζ | f_{k}^{*}) .

Since we select the minimum frequency $f_{k + 1}^{min}$ among $g$ identical copies of $f_{k + 1, τ}$ , the conditional probability distribution function of $f_{k + 1}^{min}$ follows a minimum value distribution, which is given in Equation 1. Here, for the case of $\hat{f} < f_{k}^{*}$ , the selected frequency $f_{k + 1}^{*}$ is the minimum frequency $f_{k + 1}^{min}$ . So we have $Ψ (f_{k + 1}^{*} | f_{k}^{*})$ by replacing $f_{k + 1}^{min}$ with $f_{k + 1}^{*}$ .

We assume that the conditional probability distribution in Equation 7 follows a normal distribution, whose mean and variance are described by Equation 48 and Equation 49. Then, the extreme value theory (Gumbel, 1958) estimates the median of the selected Adult by

Median (f_{k + 1}^{*}) = \bar{f} (τ) + [Φ^{- 1} (\frac{\ln 2}{g})] σ_{f} (τ) .

The selection progress $Δ f$ in Equation 2 is obtained by subtracting $f_{k}^{*}$ from Equation 8.
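As a worked illustration of Equation 8 and the resulting selection progress, here is a minimal sketch. The function names are hypothetical, and $\bar{f} (τ)$ and $σ_{f} (τ)$ are passed in as plain numbers rather than computed from Box 1; the numerical inputs below are placeholders, not values from the paper.

```python
from math import log
from statistics import NormalDist


def median_selected_frequency(f_bar: float, sigma_f: float, g: int) -> float:
    """Equation 8: median F frequency of the selected Adult among g collectives."""
    return f_bar + NormalDist().inv_cdf(log(2) / g) * sigma_f


def selection_progress(f_star: float, f_bar: float, sigma_f: float, g: int) -> float:
    """Delta f: the median selected frequency minus the parental frequency f_star."""
    return median_selected_frequency(f_bar, sigma_f, g) - f_star


if __name__ == "__main__":
    # Placeholder inputs (assumed for illustration only).
    f_star, f_bar, g = 0.30, 0.31, 10
    for sigma_f in (0.001, 0.005, 0.015):
        progress = selection_progress(f_star, f_bar, sigma_f, g)
        print(f"sigma_f = {sigma_f:.3f}   Delta f = {progress:+.4f}")
```

With these placeholder inputs, the sign of the progress flips from positive to negative as $σ_{f}$ grows, which is the sense in which inter-collective variation helps selection counter the rise of F.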

Appendix 1

Stochastic simulation of the selection cycle

In the main text, we design a simple model of artificial selection on collectives. The selection cycle starts with $g$ ‘Newborn’ collectives, each consisting of two populations: a slow-growing population (S) and a fast-growing population (F). S mutates to F at a rate $μ$ . The Newborns mature for a fixed time $τ$ . The matured collective (‘Adult’) with the highest function (with F frequency $f$ closest to the target $\hat{f}$ ) is chosen to reproduce $g$ Newborn collectives, each with $N_{0}$ cells.

In our selection cycle, variation among collectives results mainly from demographic noise during cell birth, cell mutation, and collective reproduction. In this section, we provide details of the simulation.

Maturation

Here, we calculate the cell numbers during maturation. Each collective $i$ ( $i = 1, \dots, g$ ) has $S_{k, t}^{(i)}$ S cells and $F_{k, t}^{(i)}$ F cells where $k$ is the cycle number and $t$ indicates time ( $0 \leq t \leq τ$ ). At the beginning of cycle $k$ ( $t = 0$ ), each Newborn collective has a total of $N_{0} = S_{k, 0}^{(i)} + F_{k, 0}^{(i)}$ cells. The collectives are allowed to ‘mature’ for a time $τ$ , during which S and F grow at rates $r_{S}$ and $r_{S} + ω$ ( $ω > 0$ ), respectively. In this subsection, we omit the cycle number index $k$ and the collective index $i$ for convenience. That is, we denote $S_{k, t}^{(i)}$ and $F_{k, t}^{(i)}$ as $S (t)$ and $F (t)$ , respectively.

We describe cell divisions of S and F cells and mutation from S to F with the following chemical reaction rules:

S \overset{r_{S}}{\to} S + S,

F \overset{r_{S} + ω}{\to} F + F \quad (ω > 0),

S \overset{μ}{\to} F .

One can run an individual-based simulation by counting the number of events occurring during collective maturation via the tau-leaping algorithm (Gillespie, 2001; Cao et al., 2006) to generate a sample trajectory of $S (t)$ and $F (t)$ for each collective. However, the individual-based simulation requires long computing times because a large number of random events must be counted. Hence, we used a ‘sampling method’, sampling the numbers of S and F cells in collectives from a joint probability density distribution (jpdf) $P (S, F, t)$ , which denotes the probability density of having $S$ S cells and $F$ F cells at time $t$ in the cycle. To do so, we require an analytical expression of $P (S, F, t)$ .
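For orientation, the individual-based alternative can be sketched as a fixed-step tau-leaping loop over the three reactions in Equations 9–11. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name, the fixed step `dt`, and the cap on mutation events are choices made here, and the adaptive step-size selection of Cao et al. (2006) is not reproduced.

```python
import numpy as np


def tau_leap_maturation(S0, F0, r_S, omega, mu, tau, dt=0.01, rng=None):
    """Simulate one collective for maturation time tau with fixed-step tau-leaping.

    Reactions: S -> S + S (rate r_S), F -> F + F (rate r_S + omega), S -> F (rate mu).
    The fixed step dt is an assumption of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    S, F = int(S0), int(F0)
    n_steps = int(round(tau / dt))
    for _ in range(n_steps):
        # Numbers of events in [t, t + dt) are Poisson with mean (propensity * dt).
        births_S = rng.poisson(r_S * S * dt)
        births_F = rng.poisson((r_S + omega) * F * dt)
        # Cap mutations at the current S count so S never becomes negative.
        mutations = min(rng.poisson(mu * S * dt), S)
        S += births_S - mutations
        F += births_F + mutations
    return S, F


if __name__ == "__main__":
    # Parameter values taken from the Appendix 1 figure captions.
    S, F = tau_leap_maturation(990, 10, r_S=0.5, omega=0.03, mu=1e-4, tau=4.8,
                               rng=np.random.default_rng(0))
    print("Adult composition:", S, F, " F frequency:", F / (S + F))
```

Each call costs one Poisson draw per reaction per step, which is why repeating it for every collective over many cycles becomes slow compared with drawing two Gaussian numbers per collective.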

First, we assume that the chemical reactions in Equations 9–11 occur independently, and never occur simultaneously within a short time interval $[t, t + d t)$ . Then, the differential of $P (S, F, t)$ with respect to time is given by

\begin{aligned} \frac{d P (S, F, t)}{d t} = & \; r_{S} (S - 1) P (S - 1, F, t) \\ & + (r_{S} + ω) (F - 1) P (S, F - 1, t) \\ & + μ (S + 1) P (S + 1, F - 1, t) \\ & - (r_{S} S + (r_{S} + ω) F + μ S) P (S, F, t) . \end{aligned}

This master equation describes a probability density ‘flux’ at the state $(S, F)$ . The first term describes the scenario where a single birth event of a S cell happens during time interval $[t, t + d t)$ , which changes the collective’s composition from $(S - 1, F)$ to $(S, F)$ . Similarly, the second term comes from a birth event of an F cell. The third term indicates the mutation event from $(S + 1, F - 1)$ to $(S, F)$ . The last term corresponds to the outflow of probability density by birth and mutation processes, which describes the changes from $(S, F)$ to any other states.

Calculating the exact form of $P (S, F, t)$ is not simple. Instead, we assume that the mutation rate is much smaller than the growth rates, and hence the correlation between $S$ and $F$ is sufficiently small. Additionally, S and F do not interact ecologically. Then, we can express $P (S, F, t)$ as a product of two probability density functions (pdf) of $S (t)$ and $F (t)$ , $P (S, F, t) = P (S, t) P (F, t)$ . We assume that each pdf of $S$ and $F$ can be approximated as Gaussian ( $N$ ), which is supported by the Central Limit Theorem and Appendix 1—figure 1. In more detail, the cell numbers $S$ and $F$ are mainly determined by growth (Equations 9 and 10), and also by mutations (Equation 11). Even though the number of events differs among realizations, each cell number is the sum of contributions from many initial cells, so its distribution is expected to be close to Gaussian. This assumption requires that the distributions have insignificant skewness and no heavy tails, which we check numerically below. The pdfs of $S (t)$ and $F (t)$ are given by

P (S, t) = \frac{1}{\sqrt{2 π σ_{S}^{2} (t)}} e^{- \frac{(S (t) - \bar{S} (t))^{2}}{2 σ_{S}^{2} (t)}}

and

P (F, t) = \frac{1}{\sqrt{2 π σ_{F}^{2} (t)}} e^{- \frac{(F (t) - \bar{F} (t))^{2}}{2 σ_{F}^{2} (t)}} .

That is, $P (S, F, t)$ is written as

P (S, F, t) = \frac{1}{\sqrt{2 π σ_{S}^{2} (t)}} \frac{1}{\sqrt{2 π σ_{F}^{2} (t)}} e^{- \frac{(S (t) - \bar{S} (t))^{2}}{2 σ_{S}^{2} (t)} - \frac{(F (t) - \bar{F} (t))^{2}}{2 σ_{F}^{2} (t)}} .

Now we need the means ( $\bar{S} (t)$ and $\bar{F} (t)$ ) and variances ( $σ_{S}^{2} (t)$ and $σ_{F}^{2} (t)$ ) of S and F cell numbers to express the distribution analytically.

The means are defined by $\bar{S} (t) = \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} S P (S, F, t)$ and $\bar{F} (t) = \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} F P (S, F, t)$ . The differential equations for means are obtained by applying the definition to the master equation in Equation 12, as

\frac{d \bar{S} (t)}{d t} = r_{S} \bar{S} (t) - μ \bar{S} (t),

\frac{d \bar{F} (t)}{d t} = (r_{S} + ω) \bar{F} (t) + μ \bar{S} (t) .

We assume that the mutation rate $μ$ is much smaller than $r_{S}$ and $ω$ . By solving Equation 16 and Equation 17, the means $\bar{S} (t)$ and $\bar{F} (t)$ are given by

\bar{S} (t) = {\bar{S}}_{o} e^{(r_{S} - μ) t} \approx {\bar{S}}_{o} e^{r_{S} t},

\bar{F} (t) = {\bar{F}}_{o} e^{(r_{S} + ω) t} + \frac{μ {\bar{S}}_{o}}{ω + μ} (e^{(r_{S} + ω) t} - e^{(r_{S} - μ) t}) \approx {\bar{F}}_{o} e^{(r_{S} + ω) t} + \frac{μ {\bar{S}}_{o}}{ω} (e^{(r_{S} + ω) t} - e^{r_{S} t}),

where ${\bar{S}}_{o} \equiv \bar{S} (0)$ and ${\bar{F}}_{o} \equiv \bar{F} (0)$ are the mean numbers of S and F cells at the beginning of cycle, $t = 0$ . Note that the second term of Equation 19 is consistent with previous studies (Zheng, 1999). Now we introduce factors $R_{t} = e^{r_{S} t}$ and $W_{t} = e^{ω t}$ in Equations 18; 19 in order to simplify the formula. $R_{t}$ is the multiplying factor by which the S cell number increases after time $t$ . $W_{t}$ is the fold change in $F / S$ . Then, we can rewrite

\bar{S} (t) = {\bar{S}}_{o} R_{t},

\bar{F} (t) = {\bar{F}}_{o} R_{t} W_{t} + \frac{μ {\bar{S}}_{o}}{ω} R_{t} (W_{t} - 1) .
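The size of the error introduced by dropping $μ$ from the exponents can be checked directly. The sketch below evaluates the exact solutions of Equations 16 and 17 alongside the small-$μ$ forms in Equations 20 and 21; the function names are hypothetical, and the parameter values are those quoted in the Appendix 1 figure captions.

```python
from math import exp


def mean_counts_exact(S0, F0, r_S, omega, mu, t):
    """Exact solutions of the mean equations (first expressions in Equations 18 and 19)."""
    mean_S = S0 * exp((r_S - mu) * t)
    mean_F = F0 * exp((r_S + omega) * t) + mu * S0 / (omega + mu) * (
        exp((r_S + omega) * t) - exp((r_S - mu) * t))
    return mean_S, mean_F


def mean_counts_small_mu(S0, F0, r_S, omega, mu, t):
    """Small-mu approximation written with R_t and W_t (Equations 20 and 21)."""
    R, W = exp(r_S * t), exp(omega * t)
    return S0 * R, F0 * R * W + mu * S0 / omega * R * (W - 1)


if __name__ == "__main__":
    args = (990, 10, 0.5, 0.03, 1e-4, 4.8)
    exact = mean_counts_exact(*args)
    approx = mean_counts_small_mu(*args)
    for name, e, a in zip(("S", "F"), exact, approx):
        print(f"mean {name}: exact = {e:.2f}, approx = {a:.2f}, "
              f"relative difference = {abs(e - a) / e:.1e}")
```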

We define the second moments of $S$ and $F$ as

\bar{S^{2}} = \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} S^{2} P (S, F, t),

\bar{F^{2}} = \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} F^{2} P (S, F, t) .

Then, the corresponding differential equations are given by

\frac{d \bar{S^{2}}}{d t} = 2 (r_{S} - μ) \bar{S^{2}} + (r_{S} + μ) \bar{S},

\frac{d \bar{F^{2}}}{d t} = 2 (r_{S} + ω) \bar{F^{2}} + (r_{S} + ω) \bar{F} + 2 μ \bar{S F} + μ \bar{S} .

The solution of Equation 24 is

\bar{S^{2}} (t) = \frac{r_{S} + μ}{μ - r_{S}} {\bar{S}}_{o} [e^{(μ - r_{S}) t} - 1] e^{2 (r_{S} - μ) t} + {\bar{S^{2}}}_{o} e^{2 (r_{S} - μ) t}

\approx \bar{S_{o}^{2}} e^{2 r_{S} t} + {\bar{S}}_{o} e^{r_{S} t} (e^{r_{S} t} - 1),

where $\bar{S_{o}^{2}} \equiv \bar{S^{2}} (0)$ is the second moment of initial values. Thus, the variance $σ_{S}^{2} (t) = \bar{S^{2}} (t) - [\bar{S} (t)]^{2}$ is

σ_{S}^{2} (t) = σ_{S, 0}^{2} e^{2 r_{S} t} + {\bar{S}}_{o} e^{r_{S} t} (e^{r_{S} t} - 1)

= σ_{S, 0}^{2} R_{t}^{2} + {\bar{S}}_{o} R_{t} (R_{t} - 1)

where $σ_{S, 0}^{2} \equiv σ_{S}^{2} (0)$ is a variance of S cell numbers at $t = 0$ . In Equation 25, we require $\bar{S F} (t) = \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} S F P (S, F, t)$ to calculate $\bar{F^{2}} (t)$ . Equation 12 provides a differential equation for $\bar{S F} (t)$ as

\frac{d \bar{S F}}{d t} = (2 r_{S} + ω - μ) \bar{S F} + μ (\bar{S^{2}} - \bar{S}) .

The solution of Equation 30 is given by

\begin{aligned} \bar{S F} (t) = & {\bar{S F}}_{o} e^{(2 r_{S} + ω) t} \\ + \frac{μ}{ω} ({\bar{S^{2}}}_{0} + {\bar{S}}_{0}) (e^{(2 r_{S} + ω) t} - e^{2 r_{S} t}) \\ - \frac{2 μ {\bar{S}}_{0}}{r_{S} + ω} (e^{(2 r_{S} + ω) t} - e^{r_{S} t}) . \end{aligned}

By using Equation 31, the solution of Equation 25 is given by

\begin{aligned} \bar{F^{2}} (t) = & \; {\bar{F^{2}}}_{0} e^{2 (r_{S} + ω) t} + {\bar{F}}_{0} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - 1) \\ & + \frac{μ {\bar{S}}_{o}}{ω} (\frac{2 ω}{r_{S} + 2 ω} e^{2 (r_{S} + ω) t} - e^{(r_{S} + ω) t} + \frac{r_{S}}{r_{S} + 2 ω} e^{r_{S} t}) \\ & + \frac{2 μ {\bar{S F}}_{0}}{ω} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - e^{r_{S} t}) \\ & + O (μ^{2}), \end{aligned}

where $\bar{F_{o}^{2}} \equiv \bar{F^{2}} (0)$ is the second moment of initial values. Thus, the variance $σ_{F}^{2} (t) = \bar{F^{2}} (t) - [\bar{F} (t)]^{2}$ is given, up to the order of $μ$ , by

\begin{aligned} σ_{F}^{2} (t) = & σ_{F, 0}^{2} e^{2 (r_{S} + ω) t} + {\bar{F}}_{0} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - 1) \\ + \frac{μ {\bar{S}}_{o}}{ω} (\frac{2 ω}{r_{S} + 2 ω} e^{2 (r_{S} + ω) t} - e^{(r_{S} + ω) t} + \frac{r_{S}}{r_{S} + 2 ω} e^{r_{S} t}) \\ + \frac{2 μ ({\bar{S F}}_{0} - {\bar{S}}_{0} {\bar{F}}_{0})}{ω} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - e^{r_{S} t}) \\ + O (μ^{2}) . \end{aligned}

Using $R_{t} = e^{r_{S} t}$ and $W_{t} = e^{ω t}$ , we rewrite

\begin{aligned} σ_{F}^{2} (t) = & σ_{F, 0}^{2} R_{t}^{2} W_{t}^{2} + {\bar{F}}_{0} R_{t} W_{t} (R_{t} W_{t} - 1) \\ + \frac{μ {\bar{S}}_{o} R_{t}}{ω} (\frac{2 ω}{r_{S} + 2 ω} R_{t} W_{t}^{2} - W_{t} + \frac{r_{S}}{r_{S} + 2 ω}) \\ + \frac{2 μ ({\bar{S F}}_{0} - {\bar{S}}_{0} {\bar{F}}_{0})}{ω} R_{t}^{2} W_{t} (W_{t} - 1) \\ + O (μ^{2}) . \end{aligned}

Using Equations 18; 19; 28; 33, we construct pdfs for $S (t)$ and $F (t)$ at the end of cycle $t = τ$ . Then, we randomly sample a number from $P (S, τ)$ for $S (τ)$ and another number from $P (F, τ)$ for $F (τ)$ . Those two numbers are cell numbers in a single Adult. We repeat this process for each Newborn to get cell numbers of all Adults. Note that the initial values for the Newborn $i$ are ${\bar{S}}_{0} = S_{k, 0}^{(i)}, {\bar{F}}_{0} = F_{k, 0}^{(i)}, σ_{S, 0}^{2} = 0,$ and $σ_{F, 0}^{2} = 0$ . This process only requires two random numbers per collective, while the result is consistent with the individual-based simulation.
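The sampling step can be sketched as follows, using only the Python standard library. This is an illustration rather than the authors' code: the function names are hypothetical, the moments are the small-$μ$ expressions above with zero initial variances and zero initial covariance (as stated for a Newborn), and clipping negative draws to zero is an added safeguard that the text does not mention.

```python
import random
from math import exp, sqrt


def adult_moments(S0, F0, r_S, omega, mu, tau):
    """Means and variances of S and F at time tau for a Newborn with exactly S0 and F0 cells."""
    R, W = exp(r_S * tau), exp(omega * tau)
    mean_S = S0 * R
    mean_F = F0 * R * W + mu * S0 / omega * R * (W - 1)
    var_S = S0 * R * (R - 1)
    var_F = F0 * R * W * (R * W - 1) + mu * S0 * R / omega * (
        2 * omega / (r_S + 2 * omega) * R * W ** 2 - W + r_S / (r_S + 2 * omega))
    return mean_S, mean_F, var_S, var_F


def sample_adult(S0, F0, r_S, omega, mu, tau, rng=random):
    """Draw one Adult composition (S, F) from two independent Gaussians."""
    mean_S, mean_F, var_S, var_F = adult_moments(S0, F0, r_S, omega, mu, tau)
    # Clipping at zero is an assumption of this sketch, not part of the described method.
    S = max(rng.gauss(mean_S, sqrt(max(var_S, 0.0))), 0.0)
    F = max(rng.gauss(mean_F, sqrt(max(var_F, 0.0))), 0.0)
    return S, F


if __name__ == "__main__":
    rng = random.Random(0)
    S, F = sample_adult(990, 10, r_S=0.5, omega=0.03, mu=1e-4, tau=4.8, rng=rng)
    print(f"S = {S:.0f}, F = {F:.0f}, F frequency = {F / (S + F):.4f}")
```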

Now, we check the validity of the Gaussian approximation for probability density functions of S and F populations. If we consider mutation from S to F as death in the S population, then the process in S corresponds to a branching process with death. Also, the birth process in F, including mutation, results in a Luria-Delbrück distribution (Zheng, 1999). Thus, the distributions of Adults’ S and F numbers are more skewed and heavy-tailed than Gaussian. But this problem is alleviated by larger initial S and F numbers and when the maturation time $τ$ is not very long (see Appendix 1—figure 1). Since we usually consider larger initial cell numbers, we use the Gaussian approximation on S and F populations in further calculations.

Appendix 1—figure 1

Comparison between the calculated Gaussian distribution (‘Gauss,’ with the mean and variances computed from Equations 18; 19; 28; 33) and simulations using tau-leaping (‘tau’).

The simulations were run 3000 times. The initial numbers of cells are $(S_{0}, F_{0}) = (990, 10), (500, 500)$ , and $(10, 990)$ , one pair per column. The parameters $r_{S} = 0.5$ , $ω = 0.03$ , $μ = 0.0001$ , and $τ = 4.8$ are used.

Selection

After sampling cell numbers of each Adult in the maturation step, we compute the F frequencies of the collectives, $\{f_{k, τ}^{(1)}, \dots, f_{k, τ}^{(g)}\}$ . We denote the F frequency of collective $i$ at time $t = τ$ in cycle $k$ as $f_{k, τ}^{(i)}$ . Among the $g$ Adults, we select the collective whose F frequency is closest to the target frequency $\hat{f}$ . The selected Adult’s F frequency is denoted by $f_{k}^{*}$ . Mathematically, the selected frequency is defined by

f_{k}^{*} = \underset{f_{k, τ}^{(i)}, \, i \in \{1, \dots, g\}}{\operatorname{arg\,min}} | f_{k, τ}^{(i)} - \hat{f} | .
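In code, this selection rule is a one-line search. The sketch below is illustrative: `select_adult` is a hypothetical name, and breaking ties in favor of the first collective is a choice made here, since the text does not specify how ties are handled.

```python
def select_adult(frequencies, target):
    """Return (index, frequency) of the Adult whose F frequency is closest to the target.

    Ties are broken in favor of the lowest index (an assumption of this sketch).
    """
    index = min(range(len(frequencies)), key=lambda i: abs(frequencies[i] - target))
    return index, frequencies[index]


if __name__ == "__main__":
    adult_frequencies = [0.31, 0.28, 0.35]  # illustrative values
    print(select_adult(adult_frequencies, target=0.25))
```

When every Adult lies above the target, this rule reduces to picking the minimum frequency, which is the case analyzed in Appendix 2.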

Reproduction

Using the chosen Adult, we generate $g$ Newborn collectives for the next cycle $k + 1$ . The most natural way is to consecutively sample $N_{0}$ cells at random from the selected Adult without replacement. Mathematically, we first randomly sample $F_{k + 1, 0}^{(1)}$ F cells and $S_{k + 1, 0}^{(1)} = N_{0} - F_{k + 1, 0}^{(1)}$ S cells from the selected Adult. Next, we sample $F_{k + 1, 0}^{(2)}$ F cells and $S_{k + 1, 0}^{(2)} = N_{0} - F_{k + 1, 0}^{(2)}$ S cells from the remaining cells in the Adult. We repeat the process $g$ times. Then the jpdf of choosing $F_{k + 1, 0}^{(1)}, \dots, F_{k + 1, 0}^{(g)}$ F cells, $P (F_{k + 1, 0}^{(1)}, \dots, F_{k + 1, 0}^{(g)})$ , follows a multivariate hypergeometric distribution.

If we assume that the selected Adult size $N_{k}^{*} = S_{k}^{*} + F_{k}^{*}$ is large enough compared to Newborn size $N_{0}$ , the consecutive sampling is well approximated by independent binomial sampling (see Appendix 1—figure 2). Thus, we independently sample $g$ numbers of F cells, $\{F_{k + 1, 0}^{(1)}, \dots, F_{k + 1, 0}^{(g)}\}$ , from the binomial distribution. The probability mass function of each $F_{k + 1, 0}^{(i)}$ is given by

P_{F_{k + 1, 0}^{(i)}} (F) = \frac{N_{0}!}{(N_{0} - F)! F!} (f_{k}^{*})^{F} (1 - f_{k}^{*})^{N_{0} - F} .

After sampling, the numbers of S cells are set to be $S_{k + 1, 0}^{(i)} = N_{0} - F_{k + 1, 0}^{(i)}$ for each collective. We can now start cycle $k + 1$ with these Newborn collectives. By repeating the above three steps (maturation, selection, and reproduction), we run the simulation until the F frequency reaches a stationary state.
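The two reproduction schemes can be sketched side by side with NumPy. This is a minimal illustration under stated assumptions: the function names are hypothetical, and consecutive sampling is implemented as a loop of univariate hypergeometric draws from the shrinking pool, which is one way of realizing sampling without replacement.

```python
import numpy as np


def reproduce_binomial(f_star, N0, g, rng):
    """Independent binomial sampling (Equation 36); returns arrays (S, F) of length g."""
    F = rng.binomial(N0, f_star, size=g)
    return N0 - F, F


def reproduce_consecutive(S_adult, F_adult, N0, g, rng):
    """Consecutive sampling of g Newborns of size N0 without replacement from one Adult."""
    if g * N0 > S_adult + F_adult:
        raise ValueError("the Adult is too small to seed g Newborns of size N0")
    S_left, F_left = int(S_adult), int(F_adult)
    S_new, F_new = [], []
    for _ in range(g):
        # Number of F cells among N0 cells drawn from the cells still in the pool.
        F_i = int(rng.hypergeometric(F_left, S_left, N0))
        S_i = N0 - F_i
        F_left -= F_i
        S_left -= S_i
        S_new.append(S_i)
        F_new.append(F_i)
    return np.array(S_new), np.array(F_new)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative Adult with F frequency 0.2, much larger than g * N0.
    print(reproduce_binomial(0.2, N0=1000, g=10, rng=rng)[1])
    print(reproduce_consecutive(80000, 20000, N0=1000, g=10, rng=rng)[1])
```

For an Adult much larger than $g N_{0}$ the two schemes give similar spreads of Newborn F counts, consistent with the approximation stated above.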

Appendix 1—figure 2

Congruence between consecutive sampling (MHG for multivariate hypergeometric distribution) and independent binomial (BN) sampling.

The initial numbers of cells are $S = 8000$ and $F = 2000$ for the left panel, and $S = 20$ and $F = 5$ for the right panel. 10,000 samples are drawn for each distribution. Here, a parent collective is divided into 10 collectives.

Simulation result

Appendix 1—figure 3 presents the composition trajectories of all collectives using the tau-leaping algorithm in the maturation step. In each cycle, the selected Adult is the one whose composition is closest to the target composition $\hat{f}$ . The selected Adult can have a smaller F frequency than its parent Adult, so the F frequency can be lowered over cycles.

Appendix 1—figure 3

Trajectories of F frequency for 10 collectives ( $g = 10$ ) over time.

(a) The collective whose frequency is closest to the target value is selected in every cycle (black lines). The gray lines denote the other collectives. For parameters, we used S growth rate $r_{S} = 0.5$ , F growth advantage $ω = 0.03$ , mutation rate $μ = 0.0001$ , maturation time $τ \approx 4.8$ , and $N_{0} = 1000$ . (b) Comparison between frequency trajectories with selection (the chosen one Adult producing all offspring; black) and without selection (each Adult producing one offspring; blue) clearly shows the effect of artificial selection. The black line indicates F frequency of the selected collective $f_{k}^{*}$ at each cycle in (a). The blue line indicates the average trajectory without selection $f_{k + 1}^{*}$ (the average of $g = 10$ individual lineages without inter-collective selection at the end of each cycle).

In Appendix 1—figure 4, we plot the absolute error $d$ between the target frequency $\hat{f}$ and $⟨ f^{*} ⟩$ (i.e. $d = | ⟨ f^{*} ⟩ - \hat{f} |$ ) at the end of simulations (1000 cycles). Since the computing time for the tau-leaping algorithm (individual-based simulation) to reach 1000 cycles is very long, we used the sampling scheme described in the Maturation subsection above. In the colormap, errors higher than 0.15 are marked with gray, which indicates selection failure. The dashed lines indicate the same boundary as in Figure 2e in the main text.

Appendix 1—figure 4

Color map of the absolute error $d = | ⟨ f^{*} ⟩ - \hat{f} |$ between the average frequency of the selected collectives at the end of simulations ( $k = 1000$ ) and the target frequency $\hat{f}$ .

The solid and dashed lines are drawn by the arguments in the main text. For parameters, we used $r_{S} = 0.5$ , $ω = 0.03$ , $μ = 0.0001, N_{0} = 1000, g = 10$ and $τ \approx 4.8$ . The result is the average of 300 independent simulations. Compared to Figure 2e, this figure has a higher resolution.

Appendix 2

Conditional probability distribution of the selected collective frequency $f^{*}$ and selection progress $Δ f$

In the main text, we identify the region of success by using selection progress $Δ f = Median [Ψ (f_{k + 1}^{*} | f_{k}^{*})] - f_{k}^{*}$ , which is obtained from the conditional pdf of $f_{k + 1}^{*}$ (the F frequency of the selected Adult at cycle $k + 1$ ) given the selected $f_{k}^{*}$ at cycle $k$ , written as $Ψ (f_{k + 1}^{*} = f | f_{k}^{*})$ . We consider the challenging case where $f_{k}^{*}$ is above the target value ( $f_{k}^{*} > \hat{f}$ ), and therefore the Adult with minimum F frequency will be selected. To get an analytical expression of $Ψ (f_{k + 1}^{*} = f | f_{k}^{*})$ , we first find the conditional pdf of $f$ of Adults in cycle $k + 1$ given $f_{k}^{*}$ at cycle $k$ . Then, we find $Ψ (f_{k + 1}^{*} | f_{k}^{*})$ from the minimum value distribution of F frequencies among $g$ Adults. Below, we describe the mathematical details of this process.

Let us start from the reproduction step from the selected Adult in cycle $k$ . We reproduce $g$ Newborns in the next cycle $k + 1$ . Then the probability distribution of the F cell numbers in Newborn collectives is given in Equation 36. If the total number of cells in a Newborn collective $N_{0}$ is large enough, Equation 36 is approximated by the Gaussian distribution $F_{k + 1, 0}^{(i)} \sim N (N_{0} f_{k}^{*}, N_{0} f_{k}^{*} (1 - f_{k}^{*}))$ . Then, the probability density of $f_{k + 1, 0}^{(i)}$ being $ζ$ in Newborn collective $i$ is

P_{f_{k + 1, 0}^{(i)}} (ζ | f_{k}^{*}) = \frac{\sqrt{N_{0}}}{\sqrt{2 π f_{k}^{*} (1 - f_{k}^{*})}} e^{- \frac{N_{0} (ζ - f_{k}^{*})^{2}}{2 f_{k}^{*} (1 - f_{k}^{*})}} .

The Newborn collective $i$ has initial cell numbers $S_{k + 1, 0}^{(i)} = N_{0} (1 - ζ)$ and $F_{k + 1, 0}^{(i)} = N_{0} ζ$ . From here, we ignore cycle index $k + 1$ in subscript and $i$ superscript for convenience.

Next, we write the conditional pdf of Adults’ F frequency with given Newborn F frequency $ζ$ . We assume that cell numbers in Adult $S (τ)$ and $F (τ)$ follow Gaussian distributions as in Equations 13; 14. Based on Equations 18; 19; 28; 33, we have

S (τ) \approx \bar{S} (τ) + σ_{S} (τ) G_{S} (τ),

F (τ) \approx \bar{F} (τ) + σ_{F} (τ) G_{F} (τ),

where $G_{S} (τ)$ and $G_{F} (τ)$ are random variables following the standard normal distribution $N (0, 1)$ . Note that each Gaussian is sharp if Newborn size $N_{0}$ is sufficiently large ( $\bar{S} (τ) ≫ σ_{S} (τ)$ and $\bar{F} (τ) ≫ σ_{F} (τ)$ ). Then, we can approximately write $f (τ)$ as

\begin{aligned} f (τ) & \approx \frac{\bar{F} (τ) + σ_{F} (τ) G_{F} (τ)}{\bar{S} (τ) + σ_{S} (τ) G_{S} (τ) + \bar{F} (τ) + σ_{F} (τ) G_{F} (τ)} \\ & \approx \bar{f} (τ) + σ_{f} (τ) G_{f} (τ) . \end{aligned}

The mean of $f$ is given by

\bar{f} (τ) = \frac{\bar{F} (τ)}{\bar{S} (τ) + \bar{F} (τ)} = \frac{ζ W_{τ} + \frac{μ}{ω} (1 - ζ) (W_{τ} - 1)}{1 - ζ + ζ W_{τ} + \frac{μ}{ω} (1 - ζ) (W_{τ} - 1)},

and the variance is

\begin{aligned} σ_{f}^{2} (τ) = & \frac{(1 - \bar{f} (τ))^{2} σ_{F}^{2} (τ) + \bar{f} (τ)^{2} σ_{S}^{2} (τ)}{\bar{N} (τ)^{2}} \\ = & \frac{1}{N_{0} R_{τ} {((1 - ζ) + ζ W_{τ} + \frac{μ}{ω} (1 - ζ) (W_{τ} - 1))}^{4}} \times \\ [{(1 - ζ)}^{2} (ζ W_{τ} (R_{τ} W_{τ} - 1) + \frac{μ}{ω} (1 - ζ) (\frac{2 ω / r_{S}}{1 + 2 ω / r_{S}} R_{τ} W_{τ}^{2} - W_{τ} + \frac{1}{1 + 2 ω / r_{S}})) \\ + {(ζ W_{τ} + \frac{μ}{ω} (1 - ζ) (W_{τ} - 1))}^{2} (1 - ζ) (R_{τ} - 1)] . \end{aligned}

where $R_{τ} = e^{r_{S} τ}$ and $W_{τ} = e^{ω τ}$ . The average Adult size is $\bar{N} (τ) = \bar{S} (τ) + \bar{F} (τ)$ . Thus, the Adult’s F frequency $f (τ) = f_{τ}$ follows the Gaussian distribution $N ({\bar{f}}_{τ}, σ_{f_{τ}}^{2})$ whose pdf is given by

P_{f_{τ}} (f | ζ) = \frac{1}{\sqrt{2 π} σ_{f_{τ}}} e^{- \frac{(f - \bar{f_{τ}})^{2}}{2 σ_{f_{τ}}^{2}}} .
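The first line of the variance formula above, $σ_{f}^{2} = [(1 - \bar{f})^{2} σ_{F}^{2} + {\bar{f}}^{2} σ_{S}^{2}] / {\bar{N}}^{2}$, is a first-order (delta-method) propagation of the fluctuations in $S$ and $F$ into the ratio $F / (S + F)$. The sketch below compares it with a Monte Carlo estimate, assuming independent Gaussian $S$ and $F$; the function names and the numerical values are illustrative assumptions.

```python
import random
import statistics


def frequency_variance_delta(mean_s, mean_f, var_s, var_f):
    """First-order variance of f = F / (S + F) for independent S and F."""
    n_bar = mean_s + mean_f
    f_bar = mean_f / n_bar
    return ((1.0 - f_bar) ** 2 * var_f + f_bar ** 2 * var_s) / n_bar ** 2


def frequency_variance_monte_carlo(mean_s, mean_f, var_s, var_f,
                                   n_samples=20000, seed=1):
    """Empirical variance of F / (S + F) from sampled Gaussian cell numbers."""
    rng = random.Random(seed)
    sd_s, sd_f = var_s ** 0.5, var_f ** 0.5
    freqs = []
    for _ in range(n_samples):
        s = rng.gauss(mean_s, sd_s)
        f = rng.gauss(mean_f, sd_f)
        freqs.append(f / (s + f))
    return statistics.pvariance(freqs)
```

The two estimates should agree when the Gaussians are sharp (standard deviations much smaller than the means), which is the regime assumed in the text.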

Next, we get the conditional pdf of $f_{k + 1, τ}$ (offspring Adult’s F frequency in cycle $k + 1$ ) given $f_{k}^{*}$ . We multiply Equations 37 and 43 and take the integral over $ζ$ :

P_{f_{k + 1, τ}} (f | f_{k}^{*}) = \int_{0}^{1} d ζ P_{f_{k + 1, τ}} (f | ζ) P_{f_{k + 1, 0}} (ζ | f_{k}^{*}) .

After maturation in cycle $k + 1$, the Adult with the smallest frequency among the $g$ Adult collectives is selected; we denote its frequency by $f_{k + 1}^{min}$. The pdf of $f_{k + 1}^{min}$ is obtained from the theory of extreme value statistics (Gumbel, 1958). The cumulative distribution function (cdf) of the minimum value $f_{k + 1}^{min}$ is given by

\begin{aligned} C_{f_{k + 1}^{min}} (f | f_{k}^{*}) & = \mathrm{Prob} [f_{k + 1}^{min} \leq f | f_{k}^{*}] \\ & = 1 - \mathrm{Prob} [f_{k + 1}^{min} \geq f | f_{k}^{*}] \\ & = 1 - \mathrm{Prob} [(f_{k + 1, τ}^{(1)} \geq f) \land (f_{k + 1, τ}^{(2)} \geq f) \land \dots \land (f_{k + 1, τ}^{(g)} \geq f) | f_{k}^{*}] . \end{aligned}

Since the frequencies are independent and identically distributed, $C_{f_{k + 1}^{min}} (f | f_{k}^{*}) = 1 - {(\mathrm{Prob} [f_{k + 1, τ} \geq f | f_{k}^{*}])}^{g}$. Note that $\mathrm{Prob} [f_{k + 1, τ} \geq f | f_{k}^{*}]$ $= \int_{f}^{1} d f^{'} P_{f_{k + 1, τ}} (f^{'} | f_{k}^{*})$ $= 1 - \int_{0}^{f} d f^{'} P_{f_{k + 1, τ}} (f^{'} | f_{k}^{*})$ $= 1 - C_{f_{k + 1, τ}} (f | f_{k}^{*})$, and Equation 45 becomes

C_{f_{k + 1}^{min}} (f | f_{k}^{*}) = 1 - {(1 - C_{f_{k + 1, τ}} (f | f_{k}^{*}))}^{g} .

Then, the probability density function $Ψ (f_{k + 1}^{*} | f_{k}^{*})$ is obtained by differentiating Equation 46 with respect to $f$ and replacing $f ⟶ f_{k + 1}^{*}$ ,

Ψ (f_{k + 1}^{*} | f_{k}^{*}) = g {(1 - C_{f_{k + 1, τ}} (f_{k + 1}^{*} | f_{k}^{*}))}^{g - 1} P_{f_{k + 1, τ}} (f_{k + 1}^{*} | f_{k}^{*}) .
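The density of the minimum can be evaluated directly once the Adult distribution is specified. The sketch below assumes, for illustration only, that the Adult F frequency is Gaussian with a given mean and standard deviation (rather than the full integral over $ζ$); the helper names and the trapezoid integrator are hypothetical.

```python
from statistics import NormalDist


def min_pdf(x, mean, sigma, g):
    """Density of the minimum of g i.i.d. Gaussian frequencies: g (1 - C)^(g-1) P."""
    dist = NormalDist(mean, sigma)
    return g * (1.0 - dist.cdf(x)) ** (g - 1) * dist.pdf(x)


def integrate(func, lo, hi, n=4000):
    """Plain trapezoid rule on [lo, hi] with n sub-intervals."""
    h = (hi - lo) / n
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, n):
        total += func(lo + i * h)
    return total * h


def min_mean(mean, sigma, g):
    """Mean of the minimum, integrated over mean +/- 10 sigma."""
    lo, hi = mean - 10.0 * sigma, mean + 10.0 * sigma
    return integrate(lambda x: x * min_pdf(x, mean, sigma, g), lo, hi)
```

As a sanity check, the density integrates to one, and for $g = 1$ it reduces to the Adult density itself.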

We compute the probability density function Equation 47 by using numerical integration and compare it with the stochastic simulation results in Appendix 2—figure 1. The two distributions are similar.

To get the analytic approximation of the median of Equation 47, we assume that the Adult’s F frequency distribution is Gaussian. Then we only need to calculate the mean $\bar{f} (τ)$ and variance $σ_{f}^{2} (τ)$ of Adult’s F frequency. Instead of calculating the integral with respect to $ζ$ in Equation 44, we put a set of initial values from Newborn’s F frequency distribution $N (f_{k}^{*}, f_{k}^{*} (1 - f_{k}^{*}) / N_{0})$ in Equations 20, 21, 28 and 34: ${\bar{S}}_{0} = N_{0} (1 - f_{k}^{*})$ , ${\bar{F}}_{0} = N_{0} f_{k}^{*}$ , $σ_{S, 0}^{2} = N_{0} f_{k}^{*} (1 - f_{k}^{*})$ , $σ_{F, 0}^{2} = N_{0} f_{k}^{*} (1 - f_{k}^{*})$ , and ${\bar{S F}}_{0} - {\bar{S}}_{0} {\bar{F}}_{0} = - N_{0} f_{k}^{*} (1 - f_{k}^{*})$ . Then we have

\bar{f} (τ) = \frac{\bar{F} (τ)}{\bar{S} (τ) + \bar{F} (τ)} = \frac{f_{k}^{*} W_{τ} + \frac{μ}{ω} (1 - f_{k}^{*}) (W_{τ} - 1)}{1 - f_{k}^{*} + f_{k}^{*} W_{τ} + \frac{μ}{ω} (1 - f_{k}^{*}) (W_{τ} - 1)},

\begin{aligned} σ_{f}^{2} (τ) = & \frac{(1 - f_{k}^{*})}{N_{0} R_{τ} {(1 - f_{k}^{*} + f_{k}^{*} W_{τ} + \frac{μ}{ω} (1 - f_{k}^{*}) (W_{τ} - 1))}^{4}} \times \\ & [f_{k}^{*} W_{τ} \{(2 - 2 f_{k}^{*} + 2 {f_{k}^{*}}^{2}) R_{τ} W_{τ} - (1 - f_{k}^{*}) - f_{k}^{*} W_{τ}\} \\ & + \frac{μ}{ω} (1 - f_{k}^{*}) \{(1 - f_{k}^{*}) ((\frac{2 ω}{r_{S} + 2 ω} - 2 f_{k}^{*}) R_{τ} W_{τ}^{2} - W_{τ} + \frac{r_{S}}{r_{S} + 2 ω} + 2 f_{k}^{*} R_{τ} W_{τ}) \\ & + 2 f_{k}^{*} ((1 + f_{k}^{*}) R_{τ} - 1) W_{τ} (W_{τ} - 1)\} + O (μ^{2})], \end{aligned}

which give rise to Equation 3 and Equation 4 in the main text, respectively. The functional forms of Equations 48; 49 are plotted in Appendix 2—figure 2a.

The median ($Median [Ψ (f_{k + 1}^{*} | f_{k}^{*})] \equiv \tilde{f}$) of Equation 47 satisfies $C_{f_{k + 1}^{min}} (\tilde{f} | f_{k}^{*}) = \frac{1}{2}$, that is, $C_{f_{k + 1, τ}} (\tilde{f} | f_{k}^{*}) = 1 - 2^{- 1 / g}$. For large $g$, $1 - 2^{- 1 / g} \approx \frac{\ln 2}{g}$, which means $\tilde{f} \approx C_{f_{k + 1, τ}}^{- 1} (\frac{\ln 2}{g})$. If we assume, as above, that the Adult's F frequency distribution is Gaussian, then the inverse function $C_{f_{k + 1, τ}}^{- 1} (\frac{\ln 2}{g})$ can be written as

Median [Ψ (f_{k + 1}^{*} | f_{k}^{*})] = C_{f_{k + 1, τ}}^{- 1} (\frac{\ln 2}{g}) = \bar{f} (τ) + [Φ^{- 1} (\frac{\ln 2}{g})] σ_{f} (τ),

where $Φ^{- 1} (y)$ is the inverse cumulative distribution function (CDF) of the standard normal distribution, $\bar{f} (τ)$ is the mean in Equation 48, and $σ_{f} (τ)$ is the standard deviation, the square root of Equation 49. Subtracting $f_{k}^{*}$ from Equation 50 gives the selection progress

Δ f = Median [Ψ (f_{k + 1}^{*} | f_{k}^{*})] - f_{k}^{*} = \bar{f} (τ) + [Φ^{- 1} (\frac{\ln 2}{g})] σ_{f} (τ) - f_{k}^{*}

which is Equation 2 in the main text.
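The median of the selected frequency, and hence the selection progress, only needs the Adult mean, the Adult standard deviation, and $g$. The sketch below is an illustration under the Gaussian assumption; it offers both the exact quantile $1 - 2^{- 1 / g}$ and the large-$g$ approximation $\ln 2 / g$ used in the text. Function names and the `exact` flag are assumptions made for this example.

```python
import math
from statistics import NormalDist


def median_of_minimum(mean, sigma, g, exact=True):
    """Median of the minimum of g i.i.d. Gaussian draws N(mean, sigma^2)."""
    if exact:
        quantile = 1.0 - 0.5 ** (1.0 / g)   # solves 1 - (1 - C)^g = 1/2
    else:
        quantile = math.log(2.0) / g        # large-g approximation from the text
    return mean + NormalDist().inv_cdf(quantile) * sigma


def selection_progress(f_star, mean, sigma, g, exact=False):
    """Delta f = median of the selected frequency minus the parent frequency."""
    return median_of_minimum(mean, sigma, g, exact) - f_star
```

A negative return value of `selection_progress` means that inter-collective selection can, on median, compensate the upward drift of the F frequency during maturation.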

Furthermore, we get an asymptotic expression of $Φ^{- 1} (\ln 2 / g)$ when $g$ is large (or $Φ^{- 1} (y)$ with small $y$). Here, we introduce a method from Philip, 1960. We start from the CDF of the standard normal distribution, $Φ (x) = erfc (- x / \sqrt{2}) / 2$, where $erfc$ is the complementary error function. To get the expression of $Φ^{- 1}$, we need an asymptotic expression of the inverse of the function $y = erfc (x)$ ($x = {erfc}^{- 1} (y)$), since the inverse CDF is $Φ^{- 1} (y) = - \sqrt{2} {erfc}^{- 1} (2 y)$. The known asymptotic expansion of $y = erfc (x)$ for large $x$ is $erfc (x) \approx e^{- x^{2}} / (x \sqrt{π})$. By taking the logarithm of both sides, we have

x^{2} \approx - \ln y - \frac{1}{2} \ln (π x^{2}) .

Substituting this expression for $x^{2}$ repeatedly into the right-hand side of Equation 52, we get the continued logarithmic form

x^{2} \approx - \ln y - \frac{1}{2} \ln (π (- \ln y - \frac{1}{2} \ln (π (- \ln y - \dots)))) .

Inserting $x = {erfc}^{- 1} (y)$ (square root of Equation 53) into the inverse CDF $Φ^{- 1} (y) = - \sqrt{2} {erfc}^{- 1} (2 y)$, we have $Φ^{- 1} (y) \approx - \sqrt{- 2 \ln (2 y) - \ln (π (- \ln (2 y) - \dots))}$. So, the asymptotic expression of $Φ^{- 1} (\ln 2 / g)$ is given by

Φ^{- 1} (\frac{\ln 2}{g}) \approx - \sqrt{2 \ln g - \ln [\ln g] + \dots} .
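The quality of this asymptotic form can be gauged numerically. The sketch below iterates the continued logarithm as a fixed point and compares it with the leading-order expression and with the inverse CDF from Python's standard library; the function names and the iteration count are illustrative assumptions.

```python
import math
from statistics import NormalDist


def inv_cdf_iterated(y, n_iter=50):
    """Phi^{-1}(y) for small y, iterating x^2 = -ln(2y) - 0.5 ln(pi x^2)."""
    a = -math.log(2.0 * y)
    x2 = a
    for _ in range(n_iter):
        x2 = a - 0.5 * math.log(math.pi * x2)
    return -math.sqrt(2.0 * x2)


def inv_cdf_leading_order(g):
    """Leading terms only: -sqrt(2 ln g - ln ln g)."""
    return -math.sqrt(2.0 * math.log(g) - math.log(math.log(g)))


def inv_cdf_reference(g):
    """Reference value of Phi^{-1}(ln 2 / g) from the standard library."""
    return NormalDist().inv_cdf(math.log(2.0) / g)
```

For moderate $g$ the leading-order expression is only a rough guide, because the dropped terms are constants of order one inside the square root; the iterated form is noticeably closer to the reference value.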

Appendix 2—figure 1


The probability density functions of the selected Adult’s F frequency $f_{k + 1}^{*}$ subtracted by $f_{k}^{*}$ .

For simulations (blue), at each $f_{k}^{*}$ , we performed 1000 stochastic simulations. The orange distribution represents Equation 47 computed by numerical integration. The median values of the distributions are shown in Figure 3a in the main text.

Appendix 2—figure 2


Effect of experimental parameters on the distribution of Adult's F frequency.

(a) Mean (Equation 41) and variance (Equation 42) of $f$ values of Adult collectives with respect to the Newborn frequency $f_{0}$ . (b) Scaling relation of F frequency variance (Equation 49) with Newborn collective size $N_{0}$ . The initial F frequency is 0.5. The parameters are $r_{S} = 0.5$ , $ω = 0.03$ , $μ = 0.0001$ , and $τ \approx 4.8$ . (c) Relation of F frequency variance (Equation 49) with maturation time $τ$ . Other parameters are the same as b.

Appendix 2—figure 3


Median (orange) and mean (violet) have similar distributions.

We performed 1000 simulations to get probability density. (a) $g = 10$ , (b) $g = 100$ , and (c) $g = 1000$ . Initial F frequency is $f_{k}^{*} = 0.5$ . The parameters are $r_{S} = 0.5$ , $ω = 0.03$ , $μ = 0.0001$ and $τ = \ln [1000] / r_{S}$ .

Appendix 3

Critical newborn size ${\overset{˘}{N}}_{0}$ to allow all target frequencies

First, we note that $σ_{f}^{2}$ in Equation 49 is proportional to $N_{0}^{- 1}$ for the following reasons. Variance $σ_{S}^{2} (t)$ in Equation 28 scales linearly with $N_{0}$ since both ${\bar{S}}_{0}$ and $σ_{S, 0}^{2}$ scale linearly with $N_{0}$. Variance $σ_{F}^{2} (t)$ in Equation 33 also scales linearly with $N_{0}$ because $σ_{F, 0}^{2}$, ${\bar{F}}_{0}$, ${\bar{S}}_{0}$, and covariance ${\bar{S F}}_{0} - {\bar{S}}_{0} {\bar{F}}_{0}$ all scale linearly with $N_{0}$. The mean adult size $\bar{N} (τ) = \bar{S} (τ) + \bar{F} (τ)$ is also proportional to $N_{0}$ because the average cell numbers in Equations 18; 19 are linear with respect to $N_{0}$. Thus, the scaling relation of Equation 49 is given by $σ_{f}^{2} (τ)$ $= [(1 - \bar{f} (τ))^{2} σ_{F}^{2} (τ) + \bar{f} (τ)^{2} σ_{S}^{2} (τ)] / \bar{N} (τ)^{2}$ $\sim N_{0} / N_{0}^{2} \sim 1 / N_{0}$.

Small $N_{0}$ makes all target frequencies achievable, as shown in Figure 4a in the main text. This is because small $N_{0}$ induces large $σ_{f}$, so that $N_{0}$ smaller than a certain critical value ${\overset{˘}{N}}_{0}$ makes the selection progress $Δ f$ negative regardless of the value of $f_{k}^{*}$ (i.e. $Δ f$ $= \bar{f} (τ)$ $+ [Φ^{- 1} (\frac{\ln 2}{g})] σ_{f} (τ)$ $- f_{k}^{*} < 0$). In other words, inter-collective selection overcomes intra-collective selection at every target frequency. To get an analytical approximation of the critical newborn size ${\overset{˘}{N}}_{0}$, we simply assume that the selection progress $Δ f$ is maximal at $f_{k}^{*} = 1 - f_{k}^{*} = \frac{1}{2}$, where the changes in $\bar{f}$ and $σ_{f}^{2}$ are fastest. If the maximum value of $Δ f$ is zero, $Δ f$ is negative at all other values of $f_{k}^{*}$, which means that all targets are achievable. Putting $f_{k}^{*} = \frac{1}{2}$, Equations 48; 49 become

{\bar{f} (τ) |}_{f_{k}^{*} = \frac{1}{2}} = \frac{\bar{F} (τ)}{\bar{S} (τ) + \bar{F} (τ)} = \frac{W_{τ} + \frac{μ}{ω} (W_{τ} - 1)}{1 + W_{τ} + \frac{μ}{ω} (W_{τ} - 1)},

and

\begin{aligned} {σ_{f}^{2} (τ) |}_{f_{k}^{*} = \frac{1}{2}} \approx & \frac{2}{N_{0} R_{τ} {(1 + W_{τ} + \frac{μ}{ω} (W_{τ} - 1))}^{4}} \times \\ & [W_{τ} (3 R_{τ} W_{τ} - 1 - W_{τ}) \\ & + \frac{μ}{ω} \{((\frac{2 ω}{r_{S} + 2 ω} - 1) R_{τ} W_{τ}^{2} - W_{τ} + \frac{r_{S}}{r_{S} + 2 ω} + R_{τ} W_{τ}) + (3 R_{τ} - 2) W_{τ} (W_{τ} - 1)\}] . \end{aligned}

So, by setting $Δ f |_{f_{k}^{*} = \frac{1}{2}} = \bar{f} (τ) |_{f_{k}^{*} = \frac{1}{2}} + [Φ^{- 1} (\frac{\ln 2}{g})] σ_{f} (τ) |_{f_{k}^{*} = \frac{1}{2}} - \frac{1}{2} = 0$ with $N_{0} = {\overset{˘}{N}}_{0}$ , we get a solution of

\begin{aligned} {\overset{˘}{N}}_{0} = & {[Φ^{- 1} (\frac{\ln 2}{g})]}^{2} \frac{8}{R_{τ} {[1 + W_{τ} + \frac{μ}{ω} (W_{τ} - 1)]}^{2} {[1 - W_{τ} - \frac{μ}{ω} (W_{τ} - 1)]}^{2}} \times \\ & [W_{τ} (3 R_{τ} W_{τ} - 1 - W_{τ}) \\ & + \frac{μ}{ω} \{((\frac{2 ω}{r_{S} + 2 ω} - 1) R_{τ} W_{τ}^{2} - W_{τ} + \frac{r_{S}}{r_{S} + 2 ω} + R_{τ} W_{τ}) + (3 R_{τ} - 2) W_{τ} (W_{τ} - 1)\}] . \end{aligned}

Thus, all target frequencies are successfully selected with Newborn size $N_{0}$ smaller than ${\overset{˘}{N}}_{0}$ . If the mutation rate is zero, the critical value becomes

{\overset{˘}{N}}_{0} = {[Φ^{- 1} (\frac{\ln 2}{g})]}^{2} \frac{8 W_{τ} (3 R_{τ} W_{τ} - W_{τ} - 1)}{R_{τ} {(W_{τ}^{2} - 1)}^{2}} .
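For a feel of the magnitude, the zero-mutation expression can be evaluated directly. The sketch below is a minimal illustration that implements the $μ = 0$ formula and checks that the selection progress at $f_{k}^{*} = \frac{1}{2}$ vanishes at the critical size; the function names are hypothetical, and the parameter values in the usage note are the ones quoted in the figure captions ($r_{S} = 0.5$, $ω = 0.03$, $g = 10$, $τ \approx 4.8$).

```python
import math
from statistics import NormalDist


def critical_newborn_size(g, r_s, omega, tau):
    """Critical Newborn size for mu = 0 (requires g > 1 and omega > 0)."""
    big_r = math.exp(r_s * tau)      # R_tau
    big_w = math.exp(omega * tau)    # W_tau
    phi_inv = NormalDist().inv_cdf(math.log(2.0) / g)
    numerator = 8.0 * big_w * (3.0 * big_r * big_w - big_w - 1.0)
    return phi_inv ** 2 * numerator / (big_r * (big_w ** 2 - 1.0) ** 2)


def progress_at_half(n0, g, r_s, omega, tau):
    """Selection progress Delta f at f* = 1/2 for mu = 0."""
    big_r = math.exp(r_s * tau)
    big_w = math.exp(omega * tau)
    mean = big_w / (1.0 + big_w)
    var = 2.0 * big_w * (3.0 * big_r * big_w - 1.0 - big_w) / (
        n0 * big_r * (1.0 + big_w) ** 4)
    phi_inv = NormalDist().inv_cdf(math.log(2.0) / g)
    return mean + phi_inv * math.sqrt(var) - 0.5
```

With the caption parameters, `critical_newborn_size(10, 0.5, 0.03, 4.8)` comes out at several hundred cells, so a Newborn size of 1000 lies above the critical value and the progress at one half is positive there.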

Appendix 4

Selection without mutation $μ = 0$

When the mutation rate is zero, the two genotypes behave as two distinct species. The compositional change is given by Equation 50 with $μ = 0$. The corresponding $\bar{f}$ in Equation 48 and $σ_{f}^{2}$ in Equation 49 become

\bar{f} (τ) = \frac{f_{k}^{*} W_{τ}}{1 - f_{k}^{*} + f_{k}^{*} W_{τ}},

σ_{f}^{2} (τ) = \frac{f_{k}^{*} (1 - f_{k}^{*}) W_{τ} [(2 - 2 f_{k}^{*} + 2 f_{k}^{* 2}) R_{τ} W_{τ} - (1 - f_{k}^{*}) - f_{k}^{*} W_{τ}]}{N_{0} R_{τ} {(1 - f_{k}^{*} + f_{k}^{*} W_{τ})}^{4}} .

Equations 59; 60 suggest that when a community consists of two competing species, we obtain similar conclusions on the accessible region for target composition. The stochastic simulation results are presented in Appendix 4—figure 1.
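The accessible region for two non-mutating species can be located numerically from the two expressions above. The sketch below combines them with the large-$g$ median approximation and finds a boundary frequency by bisection; the function names and the bisection helper are illustrative assumptions.

```python
import math
from statistics import NormalDist


def adult_mean_var(f, n0, r_s, omega, tau):
    """Mean and variance of the Adult F frequency for mu = 0."""
    big_r = math.exp(r_s * tau)
    big_w = math.exp(omega * tau)
    denom = 1.0 - f + f * big_w
    mean = f * big_w / denom
    bracket = (2.0 - 2.0 * f + 2.0 * f * f) * big_r * big_w - (1.0 - f) - f * big_w
    var = f * (1.0 - f) * big_w * bracket / (n0 * big_r * denom ** 4)
    return mean, var


def selection_progress(f, n0, g, r_s, omega, tau):
    """Delta f = mean + Phi^{-1}(ln 2 / g) * sigma - f."""
    mean, var = adult_mean_var(f, n0, r_s, omega, tau)
    return mean + NormalDist().inv_cdf(math.log(2.0) / g) * math.sqrt(var) - f


def find_boundary(lo, hi, n0, g, r_s, omega, tau, n_iter=100):
    """Bisection for Delta f = 0; Delta f must change sign between lo and hi."""
    sign_lo = selection_progress(lo, n0, g, r_s, omega, tau) < 0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if (selection_progress(mid, n0, g, r_s, omega, tau) < 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With $N_{0} = 1000$, $g = 10$, $r_{S} = 0.5$, $ω = 0.03$ and $τ = 4.8$, the progress is negative near both ends of the frequency range and positive around one half, so `find_boundary(0.05, 0.5, ...)` and `find_boundary(0.5, 0.95, ...)` bracket the inaccessible interval.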

Appendix 4—figure 1


Simulation with zero mutation rate.

Color map of the absolute error $d = | ⟨ f^{*} ⟩ - \hat{f} |$ between frequency $⟨ f^{*} ⟩$ of the averaged selected collectives at the end of simulations ( $k = 1000$ ) and the target frequency $\hat{f}$ . For parameters, we used $r_{S} = 0.5$ , $ω = 0.03$ , $μ = 0, N_{0} = 1000, g = 10$ , and $τ \approx 4.8$ .

Appendix 5

Stronger or weaker advantages $ω$

Solving Equation (2) in the main text for different values of $ω$, the fitness advantage of F over S, gives the boundary values of the success region. We calculate the solutions numerically and plot them in Appendix 5—figure 1.

Appendix 5—figure 1


Change of success region in varying selective advantage $ω$. $r_{S}$, $ω = 0.03$, $μ = 0.0001$, $N_{0} = 1000$, $g = 10$, and $τ \approx 4.8$.

Appendix 6

Deleterious mutation $ω < 0$

In the main text, we show that the target composition can be achieved in some ranges of initial and target values when the mutation is beneficial to growth. The same reasoning applies when the mutation is deleterious. Since the F cells grow slower than the S cells ($ω < 0$), the F frequency naturally decreases in the maturation step. The challenging case is then selecting a larger F frequency against intra-collective selection. So the conditional probability distribution $Ψ (f_{k + 1}^{*} | f_{k}^{*})$ that we consider now is the maximum value distribution of Equation 44. Thus, instead of Equation 45, we look for the cumulative distribution function of the maximum value $f_{k + 1}^{max}$ such that

\begin{aligned} C_{f_{k + 1}^{max}} (f | f_{k}^{*}) & = \mathrm{Prob} [f_{k + 1}^{max} \leq f | f_{k}^{*}] \\ & = \mathrm{Prob} [(f_{k + 1}^{(1)} (τ) \leq f) \land (f_{k + 1}^{(2)} (τ) \leq f) \land \dots \land (f_{k + 1}^{(g)} (τ) \leq f) | f_{k}^{*}] . \end{aligned}

If all frequencies are independent and identically distributed random variables, the cumulative distribution function becomes

C_{f_{k + 1}^{max}} (f | f_{k}^{*}) = {(C_{f_{k + 1, τ}} (f | f_{k}^{*}))}^{g} .

As before, we get the conditional probability density function by differentiating Equation 62 with respect to $f$ and replacing $f \to f_{k + 1}^{*}$:

Ψ (f_{k + 1}^{*} | f_{k}^{*}) = g {(C_{f_{k + 1, τ}} (f_{k + 1}^{*} | f_{k}^{*}))}^{g - 1} P_{f_{k + 1, τ}} (f_{k + 1}^{*} | f_{k}^{*}) .
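The product form of the maximum-value CDF is easy to confirm by simulation. The sketch below assumes, for illustration only, Gaussian Adult frequencies with a given mean and standard deviation, and compares $C^{g}$ with the empirical fraction of trials whose largest value falls below a threshold; the function names and the trial count are assumptions.

```python
import random
from statistics import NormalDist


def max_cdf(x, mean, sigma, g):
    """CDF of the maximum of g i.i.d. Gaussian draws: C(x)^g."""
    return NormalDist(mean, sigma).cdf(x) ** g


def empirical_max_cdf(x, mean, sigma, g, n_trials=5000, seed=2):
    """Fraction of trials in which the largest of g draws is at most x."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        largest = max(rng.gauss(mean, sigma) for _ in range(g))
        if largest <= x:
            hits += 1
    return hits / n_trials
```

The same comparison works for the minimum after replacing $C^{g}$ by $1 - (1 - C)^{g}$ and `max` by `min`.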

The distribution in Equation 63 is evaluated for various $f_{k}^{*}$ in Appendix 6—figure 1a together with numerical simulations, and the median values of the distributions are presented in Appendix 6—figure 1b. In the case of $ω = - 0.03$, target frequencies lower than around 0.3 or larger than around 0.7 can be selected. Since the sign of $ω$ is opposite to that in the main text, the diagram is reversed from Figure 2e in the main text.

Appendix 6—figure 1


Artificial selection also works for deleterious mutation.

(a) Conditional probability density functions of $f_{k + 1}^{*} - f_{k}^{*}$ for various $f_{k}^{*}$ values. The left-hand side distribution is obtained from simulations and the right-hand side distribution is numerically obtained by evaluating Equation 63. Small triangles inside indicate the median values of the distributions. (b) The median value of distributions at a given $f_{k}^{*}$. The points where the shifted median becomes zero, $Median [Ψ (f_{k + 1}^{*} - f_{k}^{*} | f_{k}^{*})] = 0$, are denoted as $f^{L}$ and $f^{U}$, respectively. (c) The relative error between the target frequency $\hat{f}$ and the ensemble averaged selected frequency $⟨ f_{k}^{*} ⟩$ is measured after 1000 cycles starting from the initial frequency ${\bar{f}}_{1, 0}$. Either the lower target frequencies or the higher target frequencies starting from the high initial frequencies can be achieved. The black dashed lines indicate the predicted boundary values $f^{U}$ and $f^{L}$ in a.

Appendix 7

Selecting more than one collective

In the main text, we choose the one collective whose frequency is closest to the target among $g$ collectives. Such a ‘top 1’ strategy allows us to apply extreme value theory. However, ‘top 1’ may be too restrictive (Xie et al., 2019). Thus, we test the ‘top-tier’ strategy by choosing the top five among 100 Adults (Appendix 7—figure 1). The top-tier strategy turns out to be less effective than top-1 in our system. The difference from Xie et al., 2019 arises because, in that study, nonheritable variations (such as stochastic fluctuations in species composition introduced by pipetting) caused nonheritable variations in collective function. Nonheritable variations could potentially mask desired mutations if these mutations happened to occur in an ‘unlucky’ environment that yielded lower collective functions. Hence, lenient selection would allow the preservation of these mutations. In contrast, here stochastic fluctuations in genotype composition are heritable: a parent Newborn with lower F frequency $f$ will tend to have offspring Newborns with lower $f$ values. Hence, top-1 is more effective in this study.

Appendix 7—figure 1


Selecting top 1 outperforms selecting top 5%.

We bred 100 collectives and chose either top-1 collective (solid line) or top-5 collectives (dashed line) with $f$ closest to the target value $\hat{f}$ (black dotted line).

Appendix 8

Extension to three-population system

We assume that collectives consist of three genotypes: slow-growing (S), fast-growing (F), and faster-growing (FF). The growth rate of S is $r_{S}$. Each mutation adds $ω$ to the growth rate. Thus, the F and FF types have growth rates $r_{S} + ω$ and $r_{S} + 2 ω$, respectively. The mutation rate is $μ$. So, the birth and mutation events are written as the chemical reactions:

S \overset{r_{S}}{\to} S + S,

F \xrightarrow[ω > 0]{r_{S} + ω} F + F,

FF \xrightarrow[ω > 0]{r_{S} + 2 ω} FF + FF,

S \overset{μ}{\to} F,

F \overset{μ}{\to} FF

We write a master equation of the processes for $P (S, F, F F, t)$ which is the probability to have $S$ , $F$ , and $F F$ numbers of S, F, and FF cells at time $t$ , respectively.

\begin{aligned} \frac{d P (S, F, F F, t)}{d t} = & r_{S} (S - 1) P (S - 1, F, F F, t) - r_{S} S P (S, F, F F, t) \\ + (r_{S} + ω) (F - 1) P (S, F - 1, F F, t) - (r_{S} + ω) F P (S, F, F F, t) \\ + (r_{S} + 2 ω) (F F - 1) P (S, F, F F - 1, t) - (r_{S} + 2 ω) F F P (S, F, F F, t) \\ + μ (S + 1) P (S + 1, F - 1, F F, t) - μ S P (S, F, F F, t) \\ + μ (F + 1) P (S, F + 1, F F - 1, t) - μ F P (S, F, F F, t) . \end{aligned}
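One common way to sample trajectories of a master equation of this type is the Gillespie algorithm. The sketch below is a minimal illustration of that technique for the five reactions listed above and is not the simulation code used in the article; the function name, the argument order, and the returned birth counter are assumptions made for this example.

```python
import random


def gillespie_three_types(s, f, ff, r_s, omega, mu, tau, seed=0):
    """Simulate S, F, FF cell numbers up to time tau; returns (s, f, ff, births)."""
    rng = random.Random(seed)
    t = 0.0
    births = 0
    while True:
        rates = [
            r_s * s,                   # S -> S + S
            (r_s + omega) * f,         # F -> F + F
            (r_s + 2.0 * omega) * ff,  # FF -> FF + FF
            mu * s,                    # S -> F
            mu * f,                    # F -> FF
        ]
        total = sum(rates)
        if total == 0.0:
            break
        t += rng.expovariate(total)    # waiting time to the next event
        if t > tau:
            break
        u = rng.random() * total       # pick the event proportionally to its rate
        if u < rates[0]:
            s += 1
            births += 1
        elif u < rates[0] + rates[1]:
            f += 1
            births += 1
        elif u < rates[0] + rates[1] + rates[2]:
            ff += 1
            births += 1
        elif u < rates[0] + rates[1] + rates[2] + rates[3]:
            s -= 1
            f += 1
        elif f > 0:
            f -= 1
            ff += 1
    return s, f, ff, births
```

Because mutations only convert one type into another, the total cell number changes only through births, which gives a simple consistency check on any run.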

The composition of collective $i$ in cycle $k$ is now represented with two frequencies $(f_{k}^{(i)} (t), h_{k}^{(i)} (t)) \equiv p_{k}^{(i)} (t)$ where the F frequency is $f_{k}^{(i)} (t) = F_{k}^{(i)} (t) / (S_{k}^{(i)} (t) + F_{k}^{(i)} (t) + F F_{k}^{(i)} (t))$ and the FF frequency is $h_{k}^{(i)} (t) = F F_{k}^{(i)} (t) / (S_{k}^{(i)} (t) + F_{k}^{(i)} (t) + F F_{k}^{(i)} (t))$ . Then, the target composition is set to be $(\hat{f}, \hat{h})$ . The composition of the selected Adult in cycle $k$ is $(f_{k}^{*}, h_{k}^{*}) \equiv p_{k}^{*}$ . We apply the processes used in the above Appendix 2 to obtain the conditional probability $Ψ (p_{k + 1}^{*} | p_{k}^{*})$ by using the master Equation 69.

At the reproduction step in cycle $k$ , we choose $N_{0}$ cells from the selected Adult whose composition is ( $f_{k}^{*}$ , $h_{k}^{*}$ ) $\equiv p_{k}^{*}$ . Then, newborn collectives are independently sampled from a multinomial distribution. For convenience, we drop the collective index $(i)$ . Then, the conditional joint probability mass function of $F_{k + 1, 0}, F F_{k + 1, 0}$ cells is represented by

\begin{aligned} P (F_{k + 1, 0}, F F_{k + 1, 0} | f_{k}^{*}, h_{k}^{*}) = & \frac{N_{0}!}{(N_{0} - F_{k + 1, 0} - F F_{k + 1, 0})! F_{k + 1, 0}! F F_{k + 1, 0}!} \times \\ & (1 - f_{k}^{*} - h_{k}^{*})^{N_{0} - F_{k + 1, 0} - F F_{k + 1, 0}} (f_{k}^{*})^{F_{k + 1, 0}} (h_{k}^{*})^{F F_{k + 1, 0}}, \end{aligned}

where the number of S cells, $S_{k + 1, 0}$, is automatically set to be $S_{k + 1, 0} = N_{0} - F_{k + 1, 0} - F F_{k + 1, 0}$. The multinomial distribution of the cell numbers is then approximated by the multivariate normal distribution $N (N_{0} p_{k}^{*}, N_{0}^{2} M_{k + 1})$, where the mean composition is $p_{k}^{*} = (f_{k}^{*}, h_{k}^{*})$ and $M_{k + 1}$ is the covariance matrix of the Newborn frequencies. The diagonal terms of $M_{k + 1}$ are variances $σ_{X}^{2} = \bar{X^{2}} - [\bar{X}]^{2}$ and the off-diagonal terms are covariances $σ_{X Y} = \bar{X Y} - \bar{X} \bar{Y}$. The matrix is given by

M_{k + 1} = [\begin{matrix} \frac{f_{k}^{*} (1 - f_{k}^{*})}{N_{0}} & - \frac{f_{k}^{*} h_{k}^{*}}{N_{0}} \\ - \frac{f_{k}^{*} h_{k}^{*}}{N_{0}} & \frac{h_{k}^{*} (1 - h_{k}^{*})}{N_{0}} \end{matrix}] \equiv [\begin{matrix} σ_{f_{k + 1, 0}}^{2} & σ_{f_{k + 1, 0} h_{k + 1, 0}} \\ σ_{f_{k + 1, 0} h_{k + 1, 0}} & σ_{h_{k + 1, 0}}^{2} \end{matrix}] .

Then a Newborn’s composition $ρ = (ζ, η)$ follows the multivariate Gaussian distribution $(ζ, η) \sim N (p_{k}^{*}, M_{k + 1})$ whose joint probability distribution is given by

P_{ρ_{k + 1, 0}} (ρ | p_{k}^{*}) = \frac{1}{{\sqrt{2 π}}^{2} \sqrt{det M_{k + 1}}} e^{- \frac{1}{2} (ρ - p_{k}^{*}) M_{k + 1}^{- 1} (ρ - p_{k}^{*})^{T}} .

At the beginning of cycle $k$, a newborn collective starts from $(S_{0}, F_{0}, F F_{0})$ cells (for convenience, the cycle index $k$ is dropped). In terms of $(ζ, η)$, the initial cell numbers are $S_{0} = N_{0} (1 - ζ - η), F_{0} = N_{0} ζ$, and $F F_{0} = N_{0} η$. Their initial covariance matrix is $N_{0}^{2} M_{k + 1}$. By using Equation 69, we can write ordinary differential equations for the moments up to second order.

\frac{d \bar{S} (t)}{d t} = r_{S} \bar{S} (t) - μ \bar{S} (t),

\frac{d \bar{F} (t)}{d t} = (r_{S} + ω) \bar{F} (t) + μ (\bar{S} (t) - \bar{F} (t)),

\frac{d \bar{F F} (t)}{d t} = (r_{S} + 2 ω) \bar{F F} (t) + μ \bar{F} (t),

\frac{d \bar{S^{2}}}{d t} = 2 (r_{S} - μ) \bar{S^{2}} + (r_{S} + μ) \bar{S},

\frac{d \bar{F^{2}}}{d t} = 2 (r_{S} + ω - μ) \bar{F^{2}} + (r_{S} + ω + μ) \bar{F} + μ (2 \bar{S F} + \bar{S}),

\frac{d \bar{F F^{2}}}{d t} = 2 (r_{S} + 2 ω) \bar{F F^{2}} + (r_{S} + 2 ω) \bar{F F} + μ (2 \bar{F F F} + \bar{F}),

\frac{d \bar{S F}}{d t} = (2 r_{S} + ω - 2 μ) \bar{S F} + μ (\bar{S^{2}} - \bar{S}),

\frac{d \bar{F F F}}{d t} = (2 r_{S} + 3 ω - μ) \bar{F F F} + μ (\bar{S F F} + \bar{F^{2}} - \bar{F}),

\frac{d \bar{S F F}}{d t} = (2 r_{S} + 2 ω - μ) \bar{S F F} + μ \bar{S F} .
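As an illustration of the numerical solution, the three mean equations form a closed linear system that can be integrated on its own. The sketch below uses a hand-written fourth-order Runge-Kutta step; the function names, the step count, and the choice of integrator are assumptions, and the second-moment equations can be added to the state vector in the same way.

```python
import math


def mean_rhs(state, r_s, omega, mu):
    """Right-hand side of the mean equations for (S, F, FF)."""
    s, f, ff = state
    return (
        (r_s - mu) * s,
        (r_s + omega) * f + mu * (s - f),
        (r_s + 2.0 * omega) * ff + mu * f,
    )


def integrate_means(s0, f0, ff0, r_s, omega, mu, tau, n_steps=1000):
    """Integrate the mean cell numbers from t = 0 to t = tau with RK4."""
    h = tau / n_steps
    state = (float(s0), float(f0), float(ff0))
    for _ in range(n_steps):
        k1 = mean_rhs(state, r_s, omega, mu)
        k2 = mean_rhs(tuple(x + 0.5 * h * k for x, k in zip(state, k1)), r_s, omega, mu)
        k3 = mean_rhs(tuple(x + 0.5 * h * k for x, k in zip(state, k2)), r_s, omega, mu)
        k4 = mean_rhs(tuple(x + h * k for x, k in zip(state, k3)), r_s, omega, mu)
        state = tuple(
            x + h * (a + 2.0 * b + 2.0 * c + d) / 6.0
            for x, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
    return state
```

The S equation has the closed-form solution $\bar{S} (t) = {\bar{S}}_{0} e^{(r_{S} - μ) t}$, which provides a direct accuracy check on the integrator.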

The initial conditions of the system in coupled Equations 73–81 are obtained by the mean and (co)variances of Equation 70. By solving equations numerically, we obtain a set of mean cell numbers $(\bar{S}, \bar{F}, \bar{F F})$ and a set of variances $(σ_{S}^{2}, σ_{F}^{2}, σ_{F F}^{2})$ as well as covariances $(σ_{S F}, σ_{F F F}, σ_{S F F})$ . We assume that the covariances are smaller than the variances. We consider $S, F,$ and $F F$ as Gaussian random variables

S (t) \approx \bar{S} (t) + σ_{S} (t) G_{S} (t),

F (t) \approx \bar{F} (t) + σ_{F} (t) G_{F} (t),

F F (t) \approx \bar{F F} (t) + σ_{F F} (t) G_{F F} (t) .

Then, the F frequency becomes

\begin{aligned} f (t) & \approx \frac{\bar{F} (t) + σ_{F} (t) G_{F} (t)}{\bar{S} (t) + σ_{S} (t) G_{S} (t) + \bar{F} (t) + σ_{F} (t) G_{F} (t) + \bar{F F} (t) + σ_{F F} (t) G_{F F} (t)} \\ \approx \bar{f} (t) + σ_{f} (t) G_{f} (t), \end{aligned}

where $\bar{f} = \bar{F} / (\bar{S} + \bar{F} + \bar{F F})$ and $σ_{f}^{2} = ({\bar{f}}^{2} (σ_{S}^{2} + σ_{F F}^{2}) + {(1 - \bar{f})}^{2} σ_{F}^{2}) / {(\bar{S} + \bar{F} + \bar{F F})}^{2}$. Similarly, the FF frequency is

\begin{aligned} h (t) & \approx \frac{\bar{F F} (t) + σ_{F F} (t) G_{F F} (t)}{\bar{S} (t) + σ_{S} (t) G_{S} (t) + \bar{F} (t) + σ_{F} (t) G_{F} (t) + \bar{F F} (t) + σ_{F F} (t) G_{F F} (t)} \\ \approx \bar{h} (t) + σ_{h} (t) G_{h} (t), \end{aligned}

where $\bar{h} = \bar{F F} / (\bar{S} + \bar{F} + \bar{F F})$ and $σ_{h}^{2} = ({\bar{h}}^{2} (σ_{S}^{2} + σ_{F}^{2}) + {(1 - \bar{h})}^{2} σ_{F F}^{2}) / {(\bar{S} + \bar{F} + \bar{F F})}^{2}$. The dynamic flow of F and FF frequencies during maturation is shown in Appendix 8—figure 1a. If the covariances are small enough, we can approximate the joint probability distribution of Adult’s composition $(f_{τ}, h_{τ}) = p_{τ}$ as

P_{p_{τ}} (p | ρ) = \frac{1}{\sqrt{2 π} σ_{f_{τ}}} e^{- \frac{(f - {\bar{f}}_{τ})^{2}}{2 σ_{f_{τ}}^{2}}} \frac{1}{\sqrt{2 π} σ_{h_{τ}}} e^{- \frac{(h - {\bar{h}}_{τ})^{2}}{2 σ_{h_{τ}}^{2}}} .

With cycle index $k$ , we get the conditional probability of matured collectives $P_{p_{k + 1, τ}} (p | p_{k}^{*})$ by

P_{p_{k + 1, τ}} (p | p_{k}^{*}) = \int_{0}^{1} d ζ \int_{0}^{1} d η P_{p_{k + 1, τ}} (p | ζ, η) P_{p_{k + 1, 0}} (ζ, η | p_{k}^{*}) .

We select the Adult collective among $g$ Adult collectives such that the change in frequencies during maturation can be compensated. During maturation, a frequency distribution moves in different directions in $(f, h)$ space depending on the initial composition $(f_{k}^{*}, h_{k}^{*})$. So, we take different directions to obtain the extreme value distributions. Considering only the sign of the frequency changes in $f$ and $h$, we take either the maximum or the minimum. The mean change in $h$ is always positive in the whole $(f, h)$ space since $d \bar{F F} / d t$ is always positive in Equation 75. Thus, we choose the minimum value $h_{k + 1}^{min}$ in every selection step.

If the mean $\bar{f_{k + 1, τ}^{(i)}} = \int_{0}^{1} d f^{'} \int_{0}^{1} d h^{'} f^{'} P_{p_{k + 1, τ}} (f^{'}, h^{'} | p_{k}^{*})$ is larger than $f_{k}^{*}$, the minimum value among $f_{k + 1, τ}^{(1)}, f_{k + 1, τ}^{(2)}, \dots, f_{k + 1, τ}^{(g)}$ will be chosen in the selection step to compensate for the frequency change in the maturation step. Let us denote the selected values of $f$ and $h$ as $f_{k + 1}^{*} = min (f_{k + 1, τ}^{(1)}, f_{k + 1, τ}^{(2)}, \dots, f_{k + 1, τ}^{(g)})$ and $h_{k + 1}^{*} = min (h_{k + 1, τ}^{(1)}, h_{k + 1, τ}^{(2)}, \dots, h_{k + 1, τ}^{(g)})$. We temporarily drop the time index $τ$ for simplicity. Then, the joint cumulative distribution function $C_{p_{k + 1}^{*}} (f, h | p_{k}^{*}) = \Pr [f_{k + 1}^{*} < f \land h_{k + 1}^{*} < h | p_{k}^{*}]$ is

\begin{aligned} C_{p_{k + 1}^{*}} (f, h | p_{k}^{*}) = & \int_{0}^{f} d f^{'} \int_{0}^{h} d h^{'} P_{p_{k + 1}^{*}} (f^{'}, h^{'} | p_{k}^{*}) \\ = & (\int_{0}^{1} - \int_{f}^{1}) d f^{'} (\int_{0}^{1} - \int_{h}^{1}) d h^{'} P_{p_{k + 1}^{*}} (f^{'}, h^{'} | p_{k}^{*}) \\ = & \int_{0}^{1} d f^{'} \int_{0}^{1} d h^{'} P_{p_{k + 1}^{*}} (f^{'}, h^{'} | p_{k}^{*}) - \int_{f}^{1} d f^{'} \int_{0}^{1} d h^{'} P_{p_{k + 1}^{*}} (f^{'}, h^{'} | p_{k}^{*}) \\ & - \int_{0}^{1} d f^{'} \int_{h}^{1} d h^{'} P_{p_{k + 1}^{*}} (f^{'}, h^{'} | p_{k}^{*}) + \int_{f}^{1} d f^{'} \int_{h}^{1} d h^{'} P_{p_{k + 1}^{*}} (f^{'}, h^{'} | p_{k}^{*}) \\ = & 1 - \Pr [f_{k + 1}^{*} \geq f | p_{k}^{*}] - \Pr [h_{k + 1}^{*} \geq h | p_{k}^{*}] + \Pr [f_{k + 1}^{*} \geq f \land h_{k + 1}^{*} \geq h | p_{k}^{*}] . \end{aligned}

The probability $\Pr [f_{k + 1}^{*} \geq f \land h_{k + 1}^{*} \geq h | p_{k}^{*}]$ can be converted as

\begin{aligned} \Pr [f_{k + 1}^{*} \geq f \land h_{k + 1}^{*} \geq h | p_{k}^{*}] = & \Pr [f_{k + 1}^{(1)} \geq f \land h_{k + 1}^{(1)} \geq h \land f_{k + 1}^{(2)} \geq f \land h_{k + 1}^{(2)} \geq h \land \dots \land f_{k + 1}^{(g)} \geq f \land h_{k + 1}^{(g)} \geq h | p_{k}^{*}] \\ = & {[\Pr [f_{k + 1} \geq f \land h_{k + 1} \geq h | p_{k}^{*}]]}^{g} \\ = & {[1 - \Pr [f_{k + 1} < f | p_{k}^{*}] - \Pr [h_{k + 1} < h | p_{k}^{*}] + \Pr [f_{k + 1} < f \land h_{k + 1} < h | p_{k}^{*}]]}^{g} \\ = & {[1 - C_{f_{k + 1}} (f | p_{k}^{*}) - C_{h_{k + 1}} (h | p_{k}^{*}) + C_{p_{k + 1}} (f, h | p_{k}^{*})]}^{g}, \end{aligned}

where $C_{p_{k + 1}} (f, h | p_{k}^{*}) = \int_{0}^{f} d f^{'} \int_{0}^{h} d h^{'} P_{p_{k + 1}} (f^{'}, h^{'} | p_{k}^{*})$ is a conditional joint cumulative distribution function of $(f^{(i)}, h^{(i)})$ . The marginal cumulative distribution functions are

C_{f_{k + 1}} (f | p_{k}^{*}) = \int_{0}^{f} d f^{'} \int_{0}^{1} d h^{'} P_{p_{k + 1}} (f^{'}, h^{'} | p_{k}^{*}),

C_{h_{k + 1}} (h | p_{k}^{*}) = \int_{0}^{1} d f^{'} \int_{0}^{h} d h^{'} P_{p_{k + 1}} (f^{'}, h^{'} | p_{k}^{*}) .

Similarly, the probabilities $\Pr [f_{k + 1}^{*} \geq f | p_{k}^{*}]$ and $\Pr [h_{k + 1}^{*} \geq h | p_{k}^{*}]$ are converted into $\Pr [f_{k + 1}^{*} \geq f | p_{k}^{*}] = {[1 - C_{f_{k + 1}} (f | p_{k}^{*})]}^{g}$ and $\Pr [h_{k + 1}^{*} \geq h | p_{k}^{*}] = {[1 - C_{h_{k + 1}} (h | p_{k}^{*})]}^{g}$. Thus, the joint cumulative distribution function is

\begin{aligned} C_{p_{k + 1}^{*}} (f, h | p_{k}^{*}) = & 1 - {[1 - C_{f_{k + 1}} (f | p_{k}^{*})]}^{g} - {[1 - C_{h_{k + 1}} (h | p_{k}^{*})]}^{g} \\ + {[1 - C_{f_{k + 1}} (f | p_{k}^{*}) - C_{h_{k + 1}} (h | p_{k}^{*}) + C_{p_{k + 1}} (f, h | p_{k}^{*})]}^{g} . \end{aligned}

Then, the conditional probability of the selected collective is given by

\begin{aligned} P_{p_{k + 1}^{*}} (f, h | p_{k}^{*}) = & \frac{\partial^{2}}{\partial f \partial h} C_{p_{k + 1}^{*}} (f, h | p_{k}^{*}) \\ = & g (g - 1) {[1 - C_{f_{k + 1}} (f | p_{k}^{*}) - C_{h_{k + 1}} (h | p_{k}^{*}) + C_{p_{k + 1}} (f, h | p_{k}^{*})]}^{g - 2} \\ & \times (P_{f_{k + 1}} (f | p_{k}^{*}) - \partial_{f} C_{p_{k + 1}} (f, h | p_{k}^{*})) (P_{h_{k + 1}} (h | p_{k}^{*}) - \partial_{h} C_{p_{k + 1}} (f, h | p_{k}^{*})) \\ & + g {[1 - C_{f_{k + 1}} (f | p_{k}^{*}) - C_{h_{k + 1}} (h | p_{k}^{*}) + C_{p_{k + 1}} (f, h | p_{k}^{*})]}^{g - 1} P_{p_{k + 1}} (f, h | p_{k}^{*}), \end{aligned}

where $\partial_{f} C_{p_{k + 1}} (f, h | p_{k}^{*})$ $= \frac{\partial}{\partial f} C_{p_{k + 1}} (f, h | p_{k}^{*})$ $= \int_{0}^{h} d h^{'} P_{p_{k + 1}} (f, h^{'} | p_{k}^{*})$ and $\partial_{h} C_{p_{k + 1}} (f, h | p_{k}^{*})$ $= \frac{\partial}{\partial h} C_{p_{k + 1}} (f, h | p_{k}^{*})$ $= \int_{0}^{f} d f^{'} P_{p_{k + 1}} (f^{'}, h | p_{k}^{*})$.

If the mean $\bar{f_{k + 1, τ}}$ is smaller than $f_{k}^{*}$, the chosen collective is likely to have the maximum $f$ value among the $g$ matured collectives. Then, $f^{*}$ is defined as $f_{k + 1}^{*} = max (f_{k + 1}^{(1)}, f_{k + 1}^{(2)}, \dots, f_{k + 1}^{(g)})$. The joint cumulative distribution function $C_{p_{k + 1}^{*}} (f, h | p_{k}^{*})$ is written slightly differently from Equation 89, because we now have to use the condition $f^{*} < f$ instead of $f^{*} \geq f$,

\begin{aligned} C_{p_{k + 1}^{*}} (f, h | p_{k}^{*}) = & \int_{0}^{f} d f^{'} \int_{0}^{h} d h^{'} P_{p_{k + 1}^{*}} (f^{'}, h^{'} | p_{k}^{*}) \\ = & \int_{0}^{f} d f^{'} (\int_{0}^{1} - \int_{h}^{1}) d h^{'} P_{p_{k + 1}^{*}} (f^{'}, h^{'} | p_{k}^{*}) \\ = & \int_{0}^{f} d f^{'} \int_{0}^{1} d h^{'} P_{p_{k + 1}^{*}} (f^{'}, h^{'} | p_{k}^{*}) - \int_{0}^{f} d f^{'} \int_{h}^{1} d h^{'} P_{p_{k + 1}^{*}} (f^{'}, h^{'} | p_{k}^{*}) \\ = & \Pr [f_{k + 1}^{*} < f | p_{k}^{*}] - \Pr [f_{k + 1}^{*} < f \land h_{k + 1}^{*} \geq h | p_{k}^{*}] . \end{aligned}

The probability $\Pr [f_{k + 1}^{*} < f \land h_{k + 1}^{*} \geq h | p_{k}^{*}]$ is converted as

\begin{aligned} \Pr [f_{k + 1}^{*} < f \land h_{k + 1}^{*} \geq h | p_{k}^{*}] = & \Pr [f_{k + 1}^{(1)} < f \land h_{k + 1}^{(1)} \geq h \land f_{k + 1}^{(2)} < f \land h_{k + 1}^{(2)} \geq h \land \dots \land f_{k + 1}^{(g)} < f \land h_{k + 1}^{(g)} \geq h | p_{k}^{*}] \\ = & {[\Pr [f_{k + 1} < f \land h_{k + 1} \geq h | p_{k}^{*}]]}^{g} \\ = & {[\Pr [f_{k + 1} < f | p_{k}^{*}] - \Pr [f_{k + 1} < f \land h_{k + 1} < h | p_{k}^{*}]]}^{g} \end{aligned}

= {[C_{f_{k + 1}} (f | p_{k}^{*}) - C_{p_{k + 1}} (f, h | p_{k}^{*})]}^{g} .

Thus, the joint cumulative distribution function is given by

C_{p_{k + 1}^{*}} (f, h | p_{k}^{*}) = {[C_{f_{k + 1}} (f | p_{k}^{*})]}^{g} - {[C_{f_{k + 1}} (f | p_{k}^{*}) - C_{p_{k + 1}} (f, h | p_{k}^{*})]}^{g} .

In this case, the conditional probability distribution function is given by

\begin{aligned} P_{p_{k + 1}^{*}} (f, h | p_{k}^{*}) = & \frac{\partial^{2}}{\partial f \partial h} C_{p_{k + 1}^{*}} (f, h | p_{k}^{*}) \\ = & g (g - 1) {[C_{f_{k + 1}} (f | p_{k}^{*}) - C_{p_{k + 1}} (f, h | p_{k}^{*})]}^{g - 2} \\ \times (P_{f_{k + 1}} (f | p_{k}^{*}) - \partial_{f} C_{p_{k + 1}} (f, h | p_{k}^{*})) \partial_{h} C_{p_{k + 1}} (f, h | p_{k}^{*}) \\ + g {[C_{f_{k + 1}} (f | p_{k}^{*}) - C_{p_{k + 1}} (f, h | p_{k}^{*})]}^{g - 1} P_{p_{k + 1}} (f, h | p_{k}^{*}) . \end{aligned}

Replacing $(f, h)$ with $p_{k + 1}^{*}$ , we finally obtain the conditional probability distribution $Ψ (p_{k + 1}^{*} | p_{k}^{*})$ ,

Ψ (p_{k + 1}^{*} | p_{k}^{*}) = {\begin{cases} g (g - 1) {[1 - C_{f_{k + 1}} (f_{k + 1}^{*} | p_{k}^{*}) - C_{h_{k + 1}} (h_{k + 1}^{*} | p_{k}^{*}) + C_{p_{k + 1}} (p_{k + 1}^{*} | p_{k}^{*})]}^{g - 2} \\ \times (P_{f_{k + 1}} (f_{k + 1}^{*} | p_{k}^{*}) - \partial_{f} C_{p_{k + 1}} (p_{k + 1}^{*} | p_{k}^{*})) (P_{h_{k + 1}} (h_{k + 1}^{*} | p_{k}^{*}) - \partial_{h} C_{p_{k + 1}} (p_{k + 1}^{*} | p_{k}^{*})) \\ + g {[1 - C_{f_{k + 1}} (f_{k + 1}^{*} | p_{k}^{*}) - C_{h_{k + 1}} (h_{k + 1}^{*} | p_{k}^{*}) + C_{p_{k + 1}} (p_{k + 1}^{*} | p_{k}^{*})]}^{g - 1} P_{p_{k + 1}} (p_{k + 1}^{*} | p_{k}^{*}) \\ \text{for } \bar{f_{k + 1, τ}} - f_{k}^{*} \geq 0 \text{ and } \bar{h_{k + 1, τ}} - h_{k}^{*} \geq 0, \\ g (g - 1) {[C_{f_{k + 1}} (f_{k + 1}^{*} | p_{k}^{*}) - C_{p_{k + 1}} (p_{k + 1}^{*} | p_{k}^{*})]}^{g - 2} \\ \times (P_{f_{k + 1}} (f_{k + 1}^{*} | p_{k}^{*}) - \partial_{f} C_{p_{k + 1}} (p_{k + 1}^{*} | p_{k}^{*})) \partial_{h} C_{p_{k + 1}} (p_{k + 1}^{*} | p_{k}^{*}) \\ + g {[C_{f_{k + 1}} (f_{k + 1}^{*} | p_{k}^{*}) - C_{p_{k + 1}} (p_{k + 1}^{*} | p_{k}^{*})]}^{g - 1} P_{p_{k + 1}} (p_{k + 1}^{*} | p_{k}^{*}) \\ \text{for } \bar{f_{k + 1, τ}} - f_{k}^{*} < 0 \text{ and } \bar{h_{k + 1, τ}} - h_{k}^{*} \geq 0 \end{cases}} .

Using Equation 100, we get the mean values of $f^{*}$ and $h^{*}$ as

\bar{f_{k + 1}^{*}} = \int_{0}^{1} d f^{'} \int_{0}^{1} d h^{'} f^{'} \times P_{f_{k + 1}^{*}, h_{k + 1}^{*}} (f^{'}, h^{'} | f_{k}^{*}, h_{k}^{*}),

\bar{h_{k + 1}^{*}} = \int_{0}^{1} d f^{'} \int_{0}^{1} d h^{'} h^{'} \times P_{f_{k + 1}^{*}, h_{k + 1}^{*}} (f^{'}, h^{'} | f_{k}^{*}, h_{k}^{*}) .

We define the accessible region as the region of frequency space where, for both the F frequency and the FF frequency, the sign of the mean change after a full cycle is opposite to the sign of the mean change during maturation (see Appendix 8—figure 1),

\mathrm{sign} (\bar{f_{k + 1}^{*}} - f_{k}^{*}) \times \mathrm{sign} (\bar{f_{k + 1, τ}} - f_{k}^{*}) \leq 0 \quad \text{and} \quad \mathrm{sign} (\bar{h_{k + 1}^{*}} - h_{k}^{*}) \times \mathrm{sign} (\bar{h_{k + 1, τ}} - h_{k}^{*}) \leq 0,

where $\bar{f_{k + 1, τ}}$ and $\bar{f_{k + 1}^{*}}$ are the mean F frequencies after the maturation step in cycle $k + 1$ , before and after selection, respectively, and the $h$ values are defined similarly for FF. If the condition is not met, the composition of the selected collective may diverge from the target composition after several cycles. The accessible regions are shown as the gold-colored area in Appendix 8—figure 1b. As in the two-population case, the accessible region is shaped by the flow velocity of the composition during the maturation step, as depicted in the flow diagram in Appendix 8—figure 1a. Both F and FF frequencies tend to increase during maturation. Inter-collective selection can compensate for these changes where the composition changes slowly, that is, when the F and FF frequencies are small. However, where the composition changes too rapidly, as it does at intermediate FF frequency, the frequency cannot be stabilized. The accessible region is therefore limited to the regions where the composition changes slowly.
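The sign criterion can be sketched as a small Python predicate. This is only an illustration: the function and argument names are hypothetical (not taken from the authors' code), and the six inputs are assumed to have been computed beforehand, for instance from the mean-value integrals above.

```python
def sign(x: float) -> int:
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def is_accessible(f_k: float, f_mat: float, f_sel: float,
                  h_k: float, h_mat: float, h_sel: float) -> bool:
    """Check the sign criterion at one composition (f_k, h_k).

    f_mat, h_mat: mean frequencies after maturation, before selection.
    f_sel, h_sel: mean frequencies of the selected collective.
    (All names are illustrative assumptions.)
    """
    # Selection must not move the frequency in the same direction as maturation.
    f_ok = sign(f_sel - f_k) * sign(f_mat - f_k) <= 0
    h_ok = sign(h_sel - h_k) * sign(h_mat - h_k) <= 0
    return f_ok and h_ok
```

A composition at which selection merely cancels the maturation drift (zero net change) counts as accessible, because the product of signs is then zero.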

The shape of the accessible region can be explained by projecting the three-population problem onto a two-population problem. The selective advantage of FF relative to the rest of the collective mainly determines the accessible region. The growth rate of the rest varies from $r_{S}$ to $r_{S} + ω$ depending on the F frequency, so the mean growth rate of the rest is $\bar{r_{S}} = r_{S} + f^{'} ω$ , where $f^{'}$ is the F frequency within S+F. The corresponding selective advantage of FF is $\bar{ω} = (2 - f^{'}) ω$ , which varies between $ω$ and $2 ω$ . Using $\bar{r_{S}}$ and $\bar{ω}$ in the same way as in Appendix 2, we obtain bounds on the accessible region (see dashed lines in Appendix 8—figure 1b). The boundary from the projected problem agrees well with that of the original three-population problem.
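The projection amounts to two lines of arithmetic, sketched below in Python. The function name and the numerical values in the usage note are hypothetical; only the two formulas come from the text.

```python
def projected_parameters(r_s: float, omega: float, f_prime: float):
    """Effective two-population parameters for FF vs. the rest (S+F).

    r_s: growth rate of S; omega: growth-rate step, so that F grows at
    r_s + omega and FF at r_s + 2 * omega.
    f_prime: F frequency within the S+F subpopulation (0 <= f_prime <= 1).
    """
    if not 0.0 <= f_prime <= 1.0:
        raise ValueError("f_prime must lie in [0, 1]")
    r_rest = r_s + f_prime * omega        # mean growth rate of the rest
    omega_eff = (2.0 - f_prime) * omega   # advantage of FF over the rest
    return r_rest, omega_eff
```

For example, with the illustrative values `r_s = 0.5` and `omega = 0.03`, the effective advantage falls from `0.06` when the rest is pure S to `0.03` when the rest is pure F, while `r_rest + omega_eff` always equals the FF growth rate `r_s + 2 * omega`.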

Appendix 8—figure 1


Accessible regions in the three-population system.

(a) The flow of composition change in fast-growing (F) and faster-growing (FF) frequencies at each composition $(f, h)$ . The top corner corresponds to collectives in which FF cells have fixed. The bottom-right corner corresponds to collectives containing only F cells, and the bottom-left corner to collectives containing only S cells. Arrow length indicates the speed of change. (b) The accessible regions are marked by the gold area. If the signs of the changes in both F frequency and FF frequency after inter-collective selection are opposite to those during maturation, then the given composition is accessible. Otherwise, the composition is not accessible and will change over subsequent cycles. Dashed lines mark the boundary of the accessible region obtained by projecting the collective onto a two-population problem (FF vs. S+F). The figures are drawn using the *mpltern* package (Ikeda et al., 2019).

Appendix 9

Derivation of equations

In this section, we go over the derivation of Equations 18–42 for readers not equipped with advanced mathematics training.

Assumptions: $μ ≪ ω, r_{S}$

Equations 18 and 19

Equation 18 follows directly from integrating Equation 16. Equation 19 is obtained from Equation 17 using the integrating factor $e^{- (r_{S} + ω) t}$ :

[\frac{d \bar{F} (t)}{d t} - (r_{S} + ω) \bar{F} (t)] e^{- (r_{S} + ω) t} = μ {\bar{S}}_{o} e^{(r_{S} - μ) t} e^{- (r_{S} + ω) t},

\frac{d [\bar{F} (t) e^{- (r_{S} + ω) t}]}{d t} = μ {\bar{S}}_{o} e^{- (ω + μ) t} .

Integrating both sides, we get $\bar{F} (t) e^{- (r_{S} + ω) t} - \bar{F} (0) = \frac{μ {\bar{S}}_{o}}{ω + μ} (1 - e^{- (ω + μ) t})$ . Thus,

\bar{F} (t) = {\bar{F}}_{o} e^{(r_{S} + ω) t} + \frac{μ {\bar{S}}_{o}}{ω + μ} (e^{(r_{S} + ω) t} - e^{(r_{S} - μ) t}) \approx {\bar{F}}_{o} e^{(r_{S} + ω) t} + \frac{μ {\bar{S}}_{o}}{ω} (e^{(r_{S} + ω) t} - e^{r_{S} t}) .
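As a sanity check, the closed form can be compared with a direct numerical integration. The sketch below assumes that Equation 17 reads $d \bar{F} / d t = (r_{S} + ω) \bar{F} + μ {\bar{S}}_{o} e^{(r_{S} - μ) t}$ , as implied by the integrating-factor step above; the function names and parameter values are illustrative only.

```python
import math


def mean_F_closed_form(t, S0, F0, r_s, omega, mu):
    """Closed-form mean F(t), before the mu << omega approximation."""
    growth = math.exp((r_s + omega) * t)
    return F0 * growth + mu * S0 / (omega + mu) * (growth - math.exp((r_s - mu) * t))


def mean_F_rk4(t_end, S0, F0, r_s, omega, mu, steps=2000):
    """Integrate dF/dt = (r_s + omega) F + mu S0 exp((r_s - mu) t) with RK4."""
    def rhs(t, F):
        return (r_s + omega) * F + mu * S0 * math.exp((r_s - mu) * t)

    dt = t_end / steps
    t, F = 0.0, float(F0)
    for _ in range(steps):
        k1 = rhs(t, F)
        k2 = rhs(t + dt / 2, F + dt * k1 / 2)
        k3 = rhs(t + dt / 2, F + dt * k2 / 2)
        k4 = rhs(t + dt, F + dt * k3)
        F += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
    return F
```

Calling both functions with the same arguments, for example `(4.8, 900, 100, 0.5, 0.03, 1e-4)`, should give values that agree to many significant digits.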

Equations 24 and 25

Applying Equation 12, we have

\begin{aligned} \frac{d \bar{S^{2}}}{d t} & = \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} S^{2} \frac{d P (S, F, t)}{d t} \\ & = \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} [r_{S} S^{2} (S - 1) P (S - 1, F, t) + S^{2} (r_{S} + ω) (F - 1) P (S, F - 1, t) \\ & \quad + μ (S + 1) S^{2} P (S + 1, F - 1, t) - (r_{S} S + (r_{S} + ω) F + μ S) S^{2} P (S, F, t)] . \end{aligned}

We first collect the two terms that describe S-cell division (shown in purple in the typeset article), namely $r_{S} S^{2} (S - 1) P (S - 1, F, t)$ and $- r_{S} S \cdot S^{2} P (S, F, t)$ , and change the order of summation. Note that the first of these terms does not change regardless of whether $S$ starts from 0 or 1, because the term is zero for $S = 0$ . Thus, the first term is equivalent to $\sum_{F = 0}^{\infty} \sum_{S = 1}^{\infty} [r_{S} S^{2} (S - 1) P (S - 1, F, t)]$ . Let $α = S - 1$ , and this becomes $\sum_{F = 0}^{\infty} \sum_{α = 0}^{\infty} [r_{S} (α + 1)^{2} α P (α, F, t)]$ . We rename α as $S$ , and obtain:

\begin{aligned} \sum_{F = 0}^{\infty} \sum_{S = 0}^{\infty} [r_{S} (S + 1)^{2} S P (S, F, t)] - \sum_{F = 0}^{\infty} \sum_{S = 0}^{\infty} r_{S} S^{3} P (S, F, t) \\ = r_{S} \sum_{F = 0}^{\infty} \sum_{S = 0}^{\infty} [(2 S + 1) S] P (S, F, t) \\ = r_{S} (2 \bar{S^{2}} + \bar{S}) . \end{aligned}

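Next, we collect the two terms that describe F-cell division, with prefactor $(r_{S} + ω)$ (shown in blue in the typeset article), and similarly obtain: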

\begin{aligned} \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} [S^{2} (r_{S} + ω) (F - 1) P (S, F - 1, t) - (r_{S} + ω) F S^{2} P (S, F, t)] \\ = (r_{S} + ω) \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} [S^{2} F P (S, F, t) - F S^{2} P (S, F, t)] \\ = 0. \end{aligned}

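Finally, we collect the two mutation terms, with prefactor $μ$ (shown in red in the typeset article). For the first of these, the sum is the same regardless of whether we start from $S = 0$ or $S = - 1$ , because the factor $(S + 1)$ vanishes at $S = - 1$ . Letting $S$ start from -1, we have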

\sum_{F = 0}^{\infty} \sum_{S = - 1}^{\infty} [μ (S + 1) S^{2} P (S + 1, F - 1, t)] .

Let $α = S + 1$ ; then the term becomes $\sum_{F = 0}^{\infty} \sum_{α = 0}^{\infty} [μ α (α - 1)^{2} P (α, F - 1, t)]$ . We rename α as $S$ , shift the index $F - 1 \to F$ in the same way, and combine the result with the second mutation term:

\begin{aligned} μ \sum_{F = 0}^{\infty} \sum_{S = 0}^{\infty} [S (S - 1)^{2} P (S, F, t) - S^{3} P (S, F, t)] \\ = μ \sum_{F = 0}^{\infty} \sum_{S = 0}^{\infty} [(- 2 S^{2} + S) P (S, F, t)] \\ = μ (- 2 \bar{S^{2}} + \bar{S}) . \end{aligned}

Now, add the three parts together, and we have

\begin{aligned} \frac{d \bar{S^{2}}}{d t} & = μ (- 2 \bar{S^{2}} + \bar{S}) + r_{S} (2 \bar{S^{2}} + \bar{S}) \\ = 2 \bar{S^{2}} (r_{S} - μ) + \bar{S} (r_{S} + μ), \end{aligned}

which is Equation 24. Likewise,

\begin{aligned} \frac{d \bar{F^{2}}}{d t} & = \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} F^{2} \frac{d P (S, F, t)}{d t} \\ & = \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} [r_{S} F^{2} (S - 1) P (S - 1, F, t) + F^{2} (r_{S} + ω) (F - 1) P (S, F - 1, t) \\ & \quad + μ (S + 1) F^{2} P (S + 1, F - 1, t) - (r_{S} S + (r_{S} + ω) F + μ S) F^{2} P (S, F, t)] \\ & = \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} [r_{S} F^{2} S + (F + 1)^{2} (r_{S} + ω) F + μ S (F + 1)^{2} - (r_{S} S + (r_{S} + ω) F + μ S) F^{2}] P (S, F, t) \\ & = \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} [(F + 1)^{2} (r_{S} + ω) F + μ S (F + 1)^{2} - ((r_{S} + ω) F + μ S) F^{2}] P (S, F, t) \\ & = \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} [(r_{S} + ω) F (2 F + 1) + μ S (2 F + 1)] P (S, F, t) \\ & = 2 (r_{S} + ω) \bar{F^{2}} + (r_{S} + ω) \bar{F} + 2 μ \bar{S F} + μ \bar{S}, \end{aligned}

which is Equation 25.
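The moment equations can also be checked against a direct stochastic simulation. The sketch below is a plain Gillespie simulation of the three reactions read off from the master equation (S division at rate $r_{S} S$ , F division at rate $(r_{S} + ω) F$ , and S-to-F mutation at rate $μ S$ ); it is not the authors' simulation code, and all function names and parameter values are hypothetical. The reference value is the integrating-factor solution of Equation 24 for a fixed initial count.

```python
import math
import random


def gillespie_final_state(S0, F0, r_s, omega, mu, t_end, rng):
    """Run one trajectory of the birth-mutation process up to t_end."""
    S, F, t = S0, F0, 0.0
    while True:
        a_s = r_s * S              # S divides:  S -> S + 1
        a_f = (r_s + omega) * F    # F divides:  F -> F + 1
        a_m = mu * S               # S mutates:  S -> S - 1, F -> F + 1
        total = a_s + a_f + a_m
        if total == 0.0:
            return S, F
        t += rng.expovariate(total)    # waiting time to the next event
        if t > t_end:
            return S, F
        u = rng.random() * total       # pick the event proportionally to rates
        if u < a_s:
            S += 1
        elif u < a_s + a_f:
            F += 1
        else:
            S -= 1
            F += 1


def second_moment_S_exact(t, S0, r_s, mu):
    """Solution of Equation 24 when S(0) = S0 exactly."""
    decay = math.exp((mu - r_s) * t)
    prefactor = (r_s + mu) / (mu - r_s)
    return (prefactor * S0 * (decay - 1.0) + S0 ** 2) * math.exp(2 * (r_s - mu) * t)


def second_moment_S_simulated(n_runs, S0, r_s, omega, mu, t_end, seed=0):
    """Monte Carlo estimate of the mean of S^2 at t_end."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_runs):
        S, _ = gillespie_final_state(S0, 0, r_s, omega, mu, t_end, rng)
        acc += S * S
    return acc / n_runs
```

With a few thousand trajectories, the simulated second moment should fall within a few percent of the analytical value; the agreement is statistical, so the tolerance has to reflect the sampling error.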

Equations 26 and 28

Using the integrating factor $e^{- 2 (r_{S} - μ) t}$ and Equation 24, we have:

[\frac{d \bar{S^{2}}}{d t} - 2 (r_{S} - μ) \bar{S^{2}}] e^{- 2 (r_{S} - μ) t} = (r_{S} + μ) {\bar{S}}_{o} e^{(r_{S} - μ) t} e^{- 2 (r_{S} - μ) t},

\frac{d [\bar{S^{2}} e^{- 2 (r_{S} - μ) t}]}{d t} = (r_{S} + μ) {\bar{S}}_{o} e^{(μ - r_{S}) t},

\bar{S^{2}} e^{- 2 (r_{S} - μ) t} = \frac{r_{S} + μ}{μ - r_{S}} {\bar{S}}_{o} [e^{(μ - r_{S}) t} - 1] + {\bar{S^{2}}}_{o} .

Since $μ ≪ r_{S}$ , we have

\begin{aligned} \bar{S^{2}} & = \frac{r_{S} + μ}{μ - r_{S}} {\bar{S}}_{o} [e^{(μ - r_{S}) t} - 1] e^{2 (r_{S} - μ) t} + {\bar{S^{2}}}_{o} e^{2 (r_{S} - μ) t} \\ \approx {\bar{S}}_{o} e^{2 r_{S} t} [1 - e^{- r_{S} t}] + {\bar{S^{2}}}_{o} e^{2 r_{S} t}, \end{aligned}

where ${\bar{S^{2}}}_{o}$ is the expected $S^{2}$ at time 0. For Equation 28,

\begin{aligned} σ_{S}^{2} (t) & = \bar{S^{2}} (t) - [\bar{S} (t)]^{2} \\ \approx {\bar{S}}_{o} e^{2 r_{S} t} [1 - e^{- r_{S} t}] + {\bar{S^{2}}}_{o} e^{2 r_{S} t} - {({\bar{S}}_{o} e^{r_{S} t})}^{2} \\ = {\bar{S}}_{o} e^{2 r_{S} t} [1 - e^{- r_{S} t}] + e^{2 r_{S} t} ({\bar{S^{2}}}_{o} - {({\bar{S}}_{o})}^{2}) = {\bar{S}}_{o} e^{2 r_{S} t} [1 - e^{- r_{S} t}] + e^{2 r_{S} t} σ_{S}^{2} (0) . \end{aligned}
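To see how much the approximation $μ ≪ r_{S}$ costs, the approximate variance can be compared with the variance built from the unapproximated second moment. The sketch below is illustrative; the function names and the parameter values in the usage note are assumptions, not values from the paper.

```python
import math


def variance_S_exact(t, S0_mean, S0_var, r_s, mu):
    """Variance of S(t) from the unapproximated second moment and mean."""
    S0_sq = S0_var + S0_mean ** 2                       # initial second moment
    bracket = (r_s + mu) / (mu - r_s) * S0_mean * (math.exp((mu - r_s) * t) - 1.0)
    second = (bracket + S0_sq) * math.exp(2 * (r_s - mu) * t)
    mean = S0_mean * math.exp((r_s - mu) * t)
    return second - mean ** 2


def variance_S_approx(t, S0_mean, S0_var, r_s):
    """Approximate variance of S(t), valid for mu << r_s."""
    g2 = math.exp(2 * r_s * t)
    return S0_mean * g2 * (1.0 - math.exp(-r_s * t)) + g2 * S0_var
```

For instance, with `r_s = 0.5`, `mu = 1e-4`, `t = 4`, an initial mean of 50 and an initial variance of 25, the two expressions differ by well under one percent.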

Equations 30 and 31

Since $\bar{S F} = \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} S F P (S, F, t)$ ,

\begin{aligned} \frac{d \bar{S F}}{d t} = & \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} [S F r_{S} (S - 1) P (S - 1, F, t) + S F (r_{S} + ω) (F - 1) P (S, F - 1, t) \\ & + S F μ (S + 1) P (S + 1, F - 1, t) \\ & - S F (r_{S} S + (r_{S} + ω) F + μ S) P (S, F, t)] \\ = & \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} [(S + 1) F r_{S} S + S (F + 1) (r_{S} + ω) F + (S - 1) (F + 1) μ S \\ & - S F (r_{S} S + (r_{S} + ω) F + μ S)] P (S, F, t) \\ = & \sum_{S = 0}^{\infty} \sum_{F = 0}^{\infty} [2 r_{S} S F + ω S F - μ S F + μ S^{2} - μ S] P (S, F, t) \\ = & (2 r_{S} + ω - μ) \bar{S F} + μ (\bar{S^{2}} - \bar{S}) . \end{aligned}

We can solve this, again using the integration factor technique above:

\frac{d [\bar{S F} e^{- (2 r_{S} + ω - μ) t}]}{d t} = μ (\bar{S^{2}} - \bar{S}) e^{- (2 r_{S} + ω - μ) t} .

Thus, we have

\begin{aligned} \bar{S F} e^{- (2 r_{S} + ω - μ) t} - {\bar{S F}}_{o} = & μ \int (\bar{S^{2}} - \bar{S}) e^{- (2 r_{S} + ω - μ) t} d t \\ = & μ \int (\frac{r_{S} + μ}{μ - r_{S}} {\bar{S}}_{o} [e^{(μ - r_{S}) t} - 1] e^{2 (r_{S} - μ) t} \\ & + {\bar{S^{2}}}_{o} e^{2 (r_{S} - μ) t} - {\bar{S}}_{o} e^{(r_{S} - μ) t}) e^{- (2 r_{S} + ω - μ) t} d t \\ = & μ \int (\frac{r_{S} + μ}{μ - r_{S}} {\bar{S}}_{o} [e^{- (μ - r_{S}) t} - e^{2 (r_{S} - μ) t}] e^{- (2 r_{S} + ω - μ) t} \\ & + {\bar{S^{2}}}_{o} e^{2 (r_{S} - μ) t} e^{- (2 r_{S} + ω - μ) t} - {\bar{S}}_{o} e^{(r_{S} - μ) t - (2 r_{S} + ω - μ) t}) d t \\ = & μ \int (\frac{r_{S} + μ}{μ - r_{S}} {\bar{S}}_{o} e^{- (r_{S} + ω) t} - \frac{r_{S} + μ}{μ - r_{S}} {\bar{S}}_{o} e^{- (μ + ω) t} \\ & + {\bar{S^{2}}}_{o} e^{- (μ + ω) t} - {\bar{S}}_{o} e^{- (r_{S} + ω) t}) d t \\ = & μ \{\frac{2 r_{S}}{μ - r_{S}} {\bar{S}}_{o} \frac{e^{- (r_{S} + ω) t} - 1}{- (r_{S} + ω)} \\ & + ({\bar{S^{2}}}_{o} - \frac{r_{S} + μ}{μ - r_{S}} {\bar{S}}_{o}) \frac{e^{- (μ + ω) t} - 1}{- (μ + ω)}\}, \end{aligned}

which results in

\begin{aligned} \bar{S F} (t) = & {\bar{S F}}_{o} e^{(2 r_{S} + ω - μ) t} + μ \{\frac{2 r_{S}}{μ - r_{S}} {\bar{S}}_{o} \frac{e^{- (r_{S} + ω) t} - 1}{- (r_{S} + ω)} + ({\bar{S^{2}}}_{o} - \frac{r_{S} + μ}{μ - r_{S}} {\bar{S}}_{o}) \frac{e^{- (μ + ω) t} - 1}{- (μ + ω)}\} e^{(2 r_{S} + ω - μ) t} \\ = & {\bar{S F}}_{o} e^{(2 r_{S} + ω - μ) t} + \frac{2 r_{S} μ}{μ - r_{S}} {\bar{S}}_{o} \frac{e^{(r_{S} - μ) t} - e^{(2 r_{S} + ω - μ) t}}{- (r_{S} + ω)} \\ & + μ ({\bar{S^{2}}}_{o} - \frac{r_{S} + μ}{μ - r_{S}} {\bar{S}}_{o}) \frac{e^{2 (r_{S} - μ) t} - e^{(2 r_{S} + ω - μ) t}}{- (μ + ω)} \\ \approx & {\bar{S F}}_{o} e^{(2 r_{S} + ω) t} - \frac{2 μ {\bar{S}}_{o}}{r_{S} + ω} (e^{(2 r_{S} + ω) t} - e^{r_{S} t}) + \frac{μ}{ω} ({\bar{S^{2}}}_{o} + {\bar{S}}_{o}) (e^{(2 r_{S} + ω) t} - e^{2 r_{S} t}) . \end{aligned}

Equation 32

From Equation 25, we have

(\frac{d \bar{F^{2}}}{d t} - 2 (r_{S} + ω) \bar{F^{2}}) e^{- 2 (r_{S} + ω) t} = \frac{d (\bar{F^{2}} e^{- 2 (r_{S} + ω) t})}{d t} = ((r_{S} + ω) \bar{F} + 2 μ \bar{S F} + μ \bar{S}) e^{- 2 (r_{S} + ω) t} .

The right-hand side becomes

\begin{aligned} \approx & (r_{S} + ω) ({\bar{F}}_{o} e^{(r_{S} + ω) t} + \frac{μ {\bar{S}}_{o}}{ω + μ} (e^{(r_{S} + ω) t} - e^{(r_{S} - μ) t})) e^{- 2 (r_{S} + ω) t} \\ + 2 μ {\bar{S F}}_{o} e^{- (ω + μ) t} + μ {\bar{S}}_{o} e^{- (r_{S} + 2 ω + μ) t} + o (μ^{2}) \\ = & (r_{S} + ω) ({\bar{F}}_{o} e^{- (r_{S} + ω) t} + \frac{μ {\bar{S}}_{o}}{ω + μ} (e^{- (r_{S} + ω) t} - e^{- (r_{S} + 2 ω + μ) t})) \\ + 2 μ {\bar{S F}}_{o} e^{- (ω + μ) t} + μ {\bar{S}}_{o} e^{- (r_{S} + 2 ω + μ) t} + o (μ^{2}) \\ = & (r_{S} + ω) (\frac{{\bar{F}}_{o} ω + N_{0} μ}{ω + μ} e^{- (r_{S} + ω) t} - \frac{μ {\bar{S}}_{o}}{ω + μ} e^{- (r_{S} + 2 ω + μ) t}) \\ + 2 μ {\bar{S F}}_{o} e^{- (ω + μ) t} + μ {\bar{S}}_{o} e^{- (r_{S} + 2 ω + μ) t} + o (μ^{2}) . \end{aligned}

Here we have kept only the first term of $\bar{S F}$ ; comparing the full calculation with this simpler version, we checked that the second and third terms of $\bar{S F}$ can be ignored. Integrating both sides gives

\begin{aligned} \bar{F^{2}} e^{- 2 (r_{S} + ω) t} - {\bar{F^{2}}}_{o} \approx & (r_{S} + ω) (\frac{{\bar{F}}_{o} ω + N_{0} μ}{(ω + μ) (r_{S} + ω)} (1 - e^{- (r_{S} + ω) t}) \\ + \frac{μ {\bar{S}}_{o}}{(ω + μ) (r_{S} + 2 ω + μ)} (e^{- (r_{S} + 2 ω + μ) t} - 1)) \\ + \frac{2 μ {\bar{S F}}_{o}}{ω + μ} (1 - e^{- (ω + μ) t}) + \frac{μ {\bar{S}}_{o}}{r_{S} + 2 ω + μ} (1 - e^{- (r_{S} + 2 ω + μ) t}) + o (μ^{2}) \end{aligned}

Then, we have

\begin{aligned} \bar{F^{2}} (t) \approx & {\bar{F^{2}}}_{o} e^{2 (r_{S} + ω) t} + (r_{S} + ω) (\frac{{\bar{F}}_{o} ω + N_{0} μ}{(ω + μ) (r_{S} + ω)} (e^{2 (r_{S} + ω) t} - e^{(r_{S} + ω) t}) \\ + \frac{μ {\bar{S}}_{o}}{(ω + μ) (r_{S} + 2 ω + μ)} (e^{(r_{S} - μ) t} - e^{2 (r_{S} + ω) t})) \\ + \frac{2 μ {\bar{S F}}_{o}}{ω + μ} (e^{2 (r_{S} + ω) t} - e^{(2 r_{S} + ω - μ) t}) + \frac{μ {\bar{S}}_{o}}{r_{S} + 2 ω + μ} (e^{2 (r_{S} + ω) t} - e^{(r_{S} - μ) t}) + o (μ^{2}) \\ = & {\bar{F^{2}}}_{o} e^{2 (r_{S} + ω) t} + (\frac{{\bar{F}}_{o} ω + N_{0} μ}{ω} (e^{2 (r_{S} + ω) t} - e^{(r_{S} + ω) t}) + (\frac{1}{ω} - \frac{1}{r_{S} + 2 ω}) μ {\bar{S}}_{o} (e^{r_{S} t} - e^{2 (r_{S} + ω) t})) \\ + \frac{2 μ {\bar{S F}}_{o}}{ω} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - e^{r_{S} t}) + \frac{μ {\bar{S}}_{o}}{r_{S} + 2 ω} (e^{2 (r_{S} + ω) t} - e^{r_{S} t}) + o (μ^{2}) \\ = {\bar{F^{2}}}_{o} e^{2 (r_{S} + ω) t} + \frac{2 μ {\bar{S F}}_{o}}{ω} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - e^{r_{S} t}) + {\bar{F}}_{o} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - 1) \\ + \frac{μ {\bar{S}}_{o}}{ω} [(- e^{(r_{S} + ω) t}) + \frac{{\bar{F}}_{o}}{{\bar{S}}_{o}} (e^{2 (r_{S} + ω) t} - e^{(r_{S} + ω) t}) + \frac{2 ω}{r_{S} + 2 ω} e^{2 (r_{S} + ω) t} + \frac{r_{S}}{r_{S} + 2 ω} e^{r_{S} t}] + o (μ^{2}) \\ = & {\bar{F^{2}}}_{o} e^{2 (r_{S} + ω) t} + \frac{2 μ {\bar{S F}}_{o}}{ω} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - e^{r_{S} t}) + {\bar{F}}_{o} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - 1) \\ + \frac{μ {\bar{S}}_{o}}{ω} [(- e^{(r_{S} + ω) t}) + \frac{2 ω}{r_{S} + 2 ω} e^{2 (r_{S} + ω) t} + \frac{r_{S}}{r_{S} + 2 ω} e^{r_{S} t}] + \frac{μ {\bar{F}}_{o}}{ω} (e^{2 (r_{S} + ω) t} - e^{(r_{S} + ω) t}) + o (μ^{2}) \\ = & {\bar{F^{2}}}_{o} e^{2 (r_{S} + ω) t} + \frac{2 μ {\bar{S F}}_{o}}{ω} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - e^{r_{S} t}) + {\bar{F}}_{o} (1 + \frac{μ}{ω}) e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - 1) \\ + \frac{μ {\bar{S}}_{o}}{ω} [(- e^{(r_{S} + ω) t}) + \frac{2 ω}{r_{S} + 2 ω} e^{2 (r_{S} + ω) t} + 
\frac{r_{S}}{r_{S} + 2 ω} e^{r_{S} t}] + o (μ^{2}) \\ = & {\bar{F^{2}}}_{o} e^{2 (r_{S} + ω) t} + \frac{2 μ {\bar{S F}}_{o}}{ω} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - e^{r_{S} t}) + {\bar{F}}_{o} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - 1) \\ + \frac{μ {\bar{S}}_{o}}{ω} [(- e^{(r_{S} + ω) t}) + \frac{2 ω}{r_{S} + 2 ω} e^{2 (r_{S} + ω) t} + \frac{r_{S}}{r_{S} + 2 ω} e^{r_{S} t}] + o (μ^{2}) . \end{aligned}

Equation 33

\begin{aligned} σ_{F}^{2} (t) = & \bar{F^{2}} (t) - [\bar{F} (t)]^{2} \\ \approx & {\bar{F^{2}}}_{o} e^{2 (r_{S} + ω) t} + \frac{2 μ {\bar{S F}}_{o}}{ω} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - e^{r_{S} t}) + {\bar{F}}_{o} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - 1) \\ & + \frac{μ {\bar{S}}_{o}}{ω} [(- e^{(r_{S} + ω) t}) + \frac{2 ω}{r_{S} + 2 ω} e^{2 (r_{S} + ω) t} + \frac{r_{S}}{r_{S} + 2 ω} e^{r_{S} t}] + o (μ^{2}) \\ & - {[{\bar{F}}_{o} e^{(r_{S} + ω) t} + \frac{μ {\bar{S}}_{o}}{ω} (e^{(r_{S} + ω) t} - e^{r_{S} t})]}^{2} \\ = & {\bar{F^{2}}}_{o} e^{2 (r_{S} + ω) t} + \frac{2 μ {\bar{S F}}_{o}}{ω} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - e^{r_{S} t}) + {\bar{F}}_{o} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - 1) \\ & + \frac{μ {\bar{S}}_{o}}{ω} [(- e^{(r_{S} + ω) t}) + \frac{2 ω}{r_{S} + 2 ω} e^{2 (r_{S} + ω) t} + \frac{r_{S}}{r_{S} + 2 ω} e^{r_{S} t}] + o (μ^{2}) \\ & - {\bar{F}}_{o}^{2} e^{2 (r_{S} + ω) t} - 2 {\bar{F}}_{o} e^{(r_{S} + ω) t} \frac{μ {\bar{S}}_{o}}{ω} (e^{(r_{S} + ω) t} - e^{r_{S} t}) - {[\frac{μ {\bar{S}}_{o}}{ω} (e^{(r_{S} + ω) t} - e^{r_{S} t})]}^{2} \\ = & σ_{F, 0}^{2} e^{2 (r_{S} + ω) t} + \frac{2 μ ({\bar{S F}}_{o} - {\bar{S}}_{o} {\bar{F}}_{o})}{ω} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - e^{r_{S} t}) \\ & + {\bar{F}}_{o} e^{(r_{S} + ω) t} (e^{(r_{S} + ω) t} - 1) + \frac{μ {\bar{S}}_{o}}{ω} [- e^{(r_{S} + ω) t} + \frac{2 ω}{r_{S} + 2 ω} e^{2 (r_{S} + ω) t} + \frac{r_{S}}{r_{S} + 2 ω} e^{r_{S} t}] + o (μ^{2}) . \end{aligned}
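The final expression for $σ_{F}^{2} (t)$ translates directly into code. The sketch below is one possible transcription, with the $o (μ^{2})$ remainder dropped; the function and argument names are hypothetical, and `cov_SF0` stands for the initial covariance ${\bar{S F}}_{o} - {\bar{S}}_{o} {\bar{F}}_{o}$ .

```python
import math


def variance_F_approx(t, S0_mean, F0_mean, var_F0, cov_SF0, r_s, omega, mu):
    """Approximate variance of F(t) to first order in mu."""
    g_f = math.exp((r_s + omega) * t)   # growth factor of F cells
    g_s = math.exp(r_s * t)             # growth factor of S cells (mu neglected)
    inherited = var_F0 * g_f ** 2                        # initial variance, amplified
    covariance = 2.0 * mu * cov_SF0 / omega * g_f * (g_f - g_s)
    demographic = F0_mean * g_f * (g_f - 1.0)            # noise from F divisions
    mutational = mu * S0_mean / omega * (
        -g_f
        + 2.0 * omega / (r_s + 2.0 * omega) * g_f ** 2
        + r_s / (r_s + 2.0 * omega) * g_s
    )
    return inherited + covariance + demographic + mutational
```

Two quick consistency checks are that the function returns the initial variance at `t = 0` (the bracketed mutation term vanishes there), and that for `mu = 0` it reduces to the amplified initial variance plus the division-noise term.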

Equations 40-42

To derive these equations, we use the fact that $1 / (1 + x) \approx 1 - x$ for small $x$ . We will omit $(t)$ for simplicity. Also, note that we are considering relatively large populations so that the standard deviation is much smaller than the mean.

\begin{aligned} f & \approx \frac{\bar{F} + σ_{F} G_{F}}{\bar{S} + σ_{S} G_{S} + \bar{F} + σ_{F} G_{F}} \\ & = \frac{\bar{F} (1 + \frac{σ_{F}}{\bar{F}} G_{F})}{(\bar{S} + \bar{F}) (1 + \frac{σ_{S} G_{S} + σ_{F} G_{F}}{\bar{S} + \bar{F}})} \\ & \approx \frac{\bar{F}}{\bar{S} + \bar{F}} (1 + \frac{σ_{F}}{\bar{F}} G_{F}) (1 - \frac{σ_{S} G_{S} + σ_{F} G_{F}}{\bar{S} + \bar{F}}) \\ & \approx \bar{f} (1 + \frac{σ_{F}}{\bar{F}} G_{F} - \frac{σ_{S} G_{S} + σ_{F} G_{F}}{\bar{S} + \bar{F}}) \\ & = \bar{f} (1 + (\frac{σ_{F}}{\bar{F}} - \frac{σ_{F}}{\bar{S} + \bar{F}}) G_{F} - \frac{σ_{S}}{\bar{S} + \bar{F}} G_{S}) . \end{aligned}

Recall that if $A \sim N (μ_{A}, σ_{A})$ and $B \sim N (μ_{B}, σ_{B})$ are independent, then $A - B \sim N (μ_{A} - μ_{B}, \sqrt{σ_{A}^{2} + σ_{B}^{2}})$ . Thus, $f$ is distributed as a Gaussian with the mean of

\begin{aligned} \bar{f} (t) & = \frac{\bar{F} (t)}{\bar{S} (t) + \bar{F} (t)} \approx \frac{{\bar{F}}_{o} e^{(r_{S} + ω) t} + \frac{μ {\bar{S}}_{o}}{ω} (e^{(r_{S} + ω) t} - e^{r_{S} t})}{{\bar{S}}_{o} e^{r_{S} t} + {\bar{F}}_{o} e^{(r_{S} + ω) t} + \frac{μ {\bar{S}}_{o}}{ω} (e^{(r_{S} + ω) t} - e^{r_{S} t})} \\ = \frac{{\bar{F}}_{o} e^{ω t} + \frac{μ {\bar{S}}_{o}}{ω} (e^{ω t} - 1)}{{\bar{S}}_{o} + {\bar{F}}_{o} e^{ω t} + \frac{μ {\bar{S}}_{o}}{ω} (e^{ω t} - 1)} \\ = \frac{{\bar{f}}_{o} e^{ω t} + \frac{μ}{ω} (1 - {\bar{f}}_{o}) (e^{ω t} - 1)}{(1 - {\bar{f}}_{o}) + {\bar{f}}_{o} e^{ω t} + \frac{μ}{ω} (1 - {\bar{f}}_{o}) (e^{ω t} - 1)} . \end{aligned}
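The last line of this expression is easy to evaluate numerically. The following sketch (hypothetical function names, illustrative parameter values) computes the mean F frequency after maturation and the shift relative to the Newborn frequency.

```python
import math


def mean_f_after_maturation(f0, omega, mu, t):
    """Mean F frequency after maturing for time t, starting from frequency f0."""
    growth = math.exp(omega * t)
    new_mutants = (mu / omega) * (1.0 - f0) * (growth - 1.0)  # F arising from S
    f_part = f0 * growth + new_mutants
    return f_part / ((1.0 - f0) + f_part)


def maturation_shift(f0, omega, mu, t):
    """Change in mean F frequency caused by maturation alone."""
    return mean_f_after_maturation(f0, omega, mu, t) - f0
```

With, say, `omega = 0.03`, `mu = 1e-4` and `t = 4.8`, the shift is larger at `f0 = 0.5` than at `f0 = 0.05` or `f0 = 0.95`, which is the frequency dependence of intra-collective selection that the main text describes.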

Note that the initial value of the mean, $\bar{f} (0)$ , is equal to the mean of the binomial distribution (Equation 37), $f_{k}^{*}$ . The variance is

\begin{aligned} σ_{f}^{2} (t) & = {\bar{f}}^{2} ({(\frac{\bar{S}}{(\bar{S} + \bar{F}) \bar{F}})}^{2} σ_{F}^{2} (t) + {(\frac{1}{\bar{S} + \bar{F}})}^{2} σ_{S}^{2} (t)) \\ & = {\bar{f}}^{2} (\frac{(1 - \bar{f} (t))^{2}}{\bar{N} (t)^{2} (\bar{f} (t))^{2}} σ_{F}^{2} (t) + \frac{1}{\bar{N} (t)^{2}} σ_{S}^{2} (t)) \\ & = \frac{(1 - \bar{f} (t))^{2} σ_{F}^{2} (t) + \bar{f} (t)^{2} σ_{S}^{2} (t)}{\bar{N} (t)^{2}} . \end{aligned}
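The variance formula can be sketched as a short Python function that takes the means and variances of $S$ and $F$ as inputs. The name `variance_f` and the numbers in the usage note are illustrative assumptions.

```python
def variance_f(S_mean: float, F_mean: float, var_S: float, var_F: float) -> float:
    """Variance of the F frequency from the last line of the expression above."""
    N_mean = S_mean + F_mean
    f_mean = F_mean / N_mean
    return ((1.0 - f_mean) ** 2 * var_F + f_mean ** 2 * var_S) / N_mean ** 2
```

For example, `variance_f(600.0, 400.0, 900.0, 700.0)` evaluates to about `3.96e-4`, and the same number is obtained from the first line of the expression, which confirms that the two forms agree.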

Data availability

Data and source code of the stochastic simulations are available at https://github.com/schwarzg/artificial_selection_collective_composition (copy archived at Lee, 2025).

References

(2024) Artificial selection improves pollutant degradation by bacterial communities
Nature Communications 15:7836.

https://doi.org/10.1038/s41467-024-52190-z
(2020) Effects of microbial evolution dominate those of experimental host-mediated indirect selection
PeerJ 8:e9350.

https://doi.org/10.7717/peerj.9350
1. Blouin M
2. Karimi B
3. Mathieu J
4. Lerch TZ
(2015) Levels and limits in artificial selection of communities
Ecology Letters 18:1040–1048.

https://doi.org/10.1111/ele.12486
(2018) Synthetic biology approaches to engineer probiotics and members of the human microbiota for biomedical applications
Annual Review of Biomedical Engineering 20:277–300.

https://doi.org/10.1146/annurev-bioeng-062117-121019
(2006) Efficient step size selection for the tau-leaping simulation method
The Journal of Chemical Physics 124:044109.

https://doi.org/10.1063/1.2159468
(2020) Artificially selecting bacterial communities using propagule strategies
Evolution; International Journal of Organic Evolution 74:2392–2403.

https://doi.org/10.1111/evo.14092
1. Chang C-Y
2. Vila JCC
3. Bender M
4. Li R
5. Mankowski MC
6. Bassette M
7. Borden J
8. Golfier S
9. Sanchez PGL
10. Waymack R
11. Zhu X
12. Diaz-Colunga J
13. Estrela S
14. Rebolleda-Gomez M
15. Sanchez A
(2021) Engineering complex communities by directed evolution
Nature Ecology & Evolution 5:1011–1023.

https://doi.org/10.1038/s41559-021-01457-5
(2020) Eco-evolutionary dynamics of nested Darwinian populations and the emergence of community-level heredity
eLife 9:e53433.

https://doi.org/10.7554/eLife.53433
(2023) Artificial selection of communities drives the emergence of structured interactions
Journal of Theoretical Biology 571:111557.

https://doi.org/10.1016/j.jtbi.2023.111557
1. Gillespie DT
(2001) Approximate accelerated stochastic simulation of chemically reacting systems
The Journal of Chemical Physics 115:1716–1733.

https://doi.org/10.1063/1.1378322
1. Goodnight CJ
(1990) Experimental studies of community evolution i: the response to selection at the community level
Evolution; International Journal of Organic Evolution 44:1614–1624.

https://doi.org/10.1111/j.1558-5646.1990.tb03850.x
Book
1. Gumbel EJ
(1958) Statistics of Extremes
Columbia university press.

https://doi.org/10.7312/gumb92958
Software
(2019) Mpltern 0.3.0: ternary plots as projections of matplotlib, version 0.3.0
Zenodo.

https://doi.org/10.5281/zenodo.3528355
1. Jacquiod S
2. Spor A
3. Wei S
4. Munkager V
5. Bru D
6. Sørensen SJ
7. Salon C
8. Philippot L
9. Blouin M
(2022) Artificial selection of stable rhizosphere microbiota leads to heritable plant phenotype changes
Ecology Letters 25:189–201.

https://doi.org/10.1111/ele.13916
(2019) Host-mediated microbiome engineering (HMME) of drought tolerance in the wheat rhizosphere
PLOS ONE 14:e0225933.

https://doi.org/10.1371/journal.pone.0225933
(2022) Artificial selection methods from evolutionary computing show promise for directed evolution of microbes
eLife 11:e79665.

https://doi.org/10.7554/eLife.79665
Software
1. Lee J
(2025) Artificial_selection_collective_composition, version swh:1:rev:b87606f2397f1b9f97fb39dd7d103245ddcf4a6e
Software Heritage.

https://archive.softwareheritage.org/swh:1:dir:e85ec419537ab184c51bcf79f84aef29a7d7aa61;origin=https://github.com/schwarzg/artificial_selection_collective_composition;visit=swh:1:snp:573eac2af3980280312f8213089b880a76bb387d;anchor=swh:1:rev:b87606f2397f1b9f97fb39dd7d103245ddcf4a6e
1. Mueller UG
2. Juenger TE
3. Kardish MR
4. Carlson AL
5. Burns KM
6. Edwards JA
7. Smith CC
8. Fang C-C
9. Des Marais DL
(2021) Artificial selection on microbiomes to breed microbiomes that confer salt tolerance to plants
mSystems 6:e0112521.

https://doi.org/10.1128/mSystems.01125-21
(2015) Selection on soil microbiomes reveals reproducible impacts on plant function
The ISME Journal 9:980–989.

https://doi.org/10.1038/ismej.2014.196
(2017) Cultivated sub-populations of soil microbiomes retain early flowering plant trait
Microbial Ecology 73:394–403.

https://doi.org/10.1007/s00248-016-0846-1
Book
1. Penn A
(2003) Modelling artificial ecosystem selection: a preliminary investigation
In: Banzhaf W, Ziegler J, Christaller T, Dittrich P, Kim JT, editors. Advances in Artificial Life. Berlin, Heidelberg: Springer. pp. 659–666.

https://doi.org/10.1007/978-3-540-39432-7_71
Conference
1. Penn A
2. Harvey I
(2004) The role of non-genetic change in the heritability, variation, and response to selection of artificially selected ecosystems
Artificial Life IX: Proceedings of the Ninth International Conference on the Simulation and Synthesis of Artificial Life.

https://doi.org/10.7551/mitpress/1429.003.0059
1. Philip JR
(1960) The function inverfc θ
Australian Journal of Physics 13:PH600013.

https://doi.org/10.1071/PH600013
1. Rainey PB
(2023) Major evolutionary transitions in individuality between humans and AI
Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 378:20210408.

https://doi.org/10.1098/rstb.2021.0408
1. Raynaud T
2. Devers M
3. Spor A
4. Blouin M
(2019) Effect of the reproduction method in an artificial selection experiment at the community level
Frontiers in Ecology and Evolution 7:416.

https://doi.org/10.3389/fevo.2019.00416
(2022) Community diversity determines the evolution of synthetic bacterial communities under artificial selection
Evolution; International Journal of Organic Evolution 76:1883–1895.

https://doi.org/10.1111/evo.14558
1. Sun J
2. Prabhu A
3. Aroney STN
4. Rinke C
(2022) Corrigendum: Insights into plastic biodegradation: community composition and functional capabilities of the superworm (Zophobas morio) microbiome in styrofoam feeding trials
Microbial Genomics 8:12.

https://doi.org/10.1099/mgen.0.000916
(2000a) Artificial selection of microbial ecosystems for 3-chloroaniline biodegradation
Environmental Microbiology 2:564–571.

https://doi.org/10.1046/j.1462-2920.2000.00140.x
(2000b) Artificial ecosystem selection
PNAS 97:9110–9114.

https://doi.org/10.1073/pnas.150237597
(2024) Artificial selection of microbial communities: what have we learnt and how can we improve?
Current Opinion in Microbiology 77:102400.

https://doi.org/10.1016/j.mib.2023.102400
Preprint
(2023) Novel artificial selection method improves function of simulated microbial communities
bioRxiv.

https://doi.org/10.1101/2023.01.08.523165
1. Wang EX
2. Ding MZ
3. Ma Q
4. Dong XT
5. Yuan YJ
(2016) Reorganization of a synthetic microbial consortium for one-step vitamin C fermentation
Microbial Cell Factories 15:21.

https://doi.org/10.1186/s12934-016-0418-6
1. Williams HTP
2. Lenton TM
(2007) Artificial selection of simulated microbial ecosystems
PNAS 104:8918–8923.

https://doi.org/10.1073/pnas.0610038104
1. Woo S
2. Song I
3. Cha HJ
(2020) Fast and facile biodegradation of polystyrene by the gut microbial flora of Plesiophthalmus davidis larvae
Applied and Environmental Microbiology 86:e01361-20.

https://doi.org/10.1128/AEM.01361-20
(2019) Understanding microbial community dynamics to improve optimal microbiome selection
Microbiome 7:85.

https://doi.org/10.1186/s40168-019-0702-x
1. Xie L
2. Yuan AE
3. Shou W
(2019) Simulations reveal challenges to artificial community selection and possible strategies for success
PLOS Biology 17:e3000295.

https://doi.org/10.1371/journal.pbio.3000295
1. Xie L
2. Shou W
(2021) Steering ecological-evolutionary dynamics to improve artificial selection of microbial communities
Nature Communications 12:6799.

https://doi.org/10.1038/s41467-021-26647-4
Preprint
1. Xie L
2. Yuan AE
3. Shou W
(2023) A quantitative genetics framework for understanding the selection response of microbial communities
bioRxiv.

https://doi.org/10.1101/2023.10.24.563725
1. Zaccaria M
2. Sandlin N
3. Soen Y
4. Momeni B
(2023) Partner-assisted artificial selection of a secondary function for efficient bioremediation
iScience 26:107632.

https://doi.org/10.1016/j.isci.2023.107632
1. Zheng Q
(1999) Progress of a half century in the study of the Luria-Delbrück distribution
Mathematical Biosciences 162:1–32.

https://doi.org/10.1016/s0025-5564(99)00045-0

Article and author information

Author details

Juhee Lee
1. Department of Physics, Inha University, Incheon, Republic of Korea
2. Asia Pacific Center for Theoretical Physics, Pohang, Republic of Korea
Present address
Integrated Science Lab, Department of Physics, Umeå University, Umeå, Sweden

Contribution
Conceptualization, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing

Competing interests
No competing interests declared

ORCID iD: 0000-0003-3318-6377
Wenying Shou

Centre for Life’s Origins and Evolution, Department of Genetics, Evolution and Environment, University College London, London, United Kingdom

Contribution
Conceptualization, Supervision, Writing – original draft, Writing – review and editing

For correspondence
wenying.shou@gmail.com

Competing interests
Reviewing editor, eLife

ORCID iD: 0000-0001-5693-381X
Hye Jin Park
1. Department of Physics, Inha University, Incheon, Republic of Korea
2. Asia Pacific Center for Theoretical Physics, Pohang, Republic of Korea
Contribution
Conceptualization, Supervision, Validation, Writing – original draft, Writing – review and editing

For correspondence
hyejin.park@inha.ac.kr

Competing interests
No competing interests declared

ORCID iD: 0000-0003-3552-6275

Funding

National Research Foundation of Korea (RS-2023-00214071)

Juhee Lee
Hye Jin Park

National Research Foundation of Korea (RS-2024-00460958)

Juhee Lee
Hye Jin Park

Asia Pacific Center for Theoretical Physics (JRG Program)

Juhee Lee
Hye Jin Park

Academy of Medical Sciences (Professorship)

Wenying Shou

Royal Society (Wolfson Fellowship)

Wenying Shou

Inha University (Research grant)

Hye Jin Park

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

J Lee and HJ Park were supported by National Research Foundation of Korea grants funded by the Korean government (MSIT) (Grant Nos. RS-2023-00214071 and RS-2024-00460958) and by an appointment to the JRG Program at the APCTP through the Science and Technology Promotion Fund and the Lottery Fund of the Korean Government. W Shou was supported by an Academy of Medical Sciences Professorship and a Royal Society Wolfson Fellowship. This work was also supported by the Korean Local Governments (Gyeongsangbuk-do Province and Pohang City) and by an Inha University Research Grant. We thank Su-Chan Park, Li Xie, Alex Yuan, and Botond Major for constructive comments and discussions.

Version history

Preprint posted: January 2, 2024
Sent for peer review: March 18, 2024
Reviewed Preprint version 1: May 29, 2024
Reviewed Preprint version 2: March 3, 2025
Version of Record published: September 15, 2025

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.97461. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

939 views, 78 downloads, 0 citations

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Cite this article

Juhee Lee, Wenying Shou, Hye Jin Park (2025) The success of artificial selection for collective composition hinges on initial and target values. eLife 13:RP97461.
https://doi.org/10.7554/eLife.97461.3

Schematic for artificial selection on collectives.

Nomenclature.

Initial and target compositions determine the success of artificial selection on collectives.

Changes in the distribution of F frequency $f$ after one cycle.

Intra-collective selection and inter-collective selection jointly set the boundaries for selection success.

Expanding the region of success for artificial collective selection.

In higher dimensions, the success of artificial selection requires the entire evolutionary trajectory remaining in the accessible region.

Comparison between the calculated Gaussian distribution (‘Gauss,’ with the mean and variances computed from Equations 18; 19; 28; 33) and simulations using tau-leaping (‘tau’).

Congruence between consecutive sampling (MHG for multivariate hypergeometric distribution) and independent binomial (BN) sampling.

Trajectories of F frequency for 10 collectives ($g = 10$) over time.

Color map of the absolute error $d = | \langle f^{*} \rangle - \hat{f} |$ between the average F frequency of the selected collectives at the end of simulations ($k = 1000$) and the target frequency $\hat{f}$.

The probability density functions of the change in the selected Adult’s F frequency between consecutive cycles, $f_{k+1}^{*} - f_{k}^{*}$.

Effect of experimental parameters on the distribution of Adult's F frequency.

Median (orange) and mean (violet) have similar distributions.

Simulation with zero mutation rate.

Change of success region in varying selective advantage ω.rs, ω=0.03, μ=0.0001, N0=1000, g=10, and τ≈4.8.

Artificial selection also works for deleterious mutation.

Selecting top 5% outperforms selecting top 1.

Accessible regions in the three-population system.
