Microbial collectives can carry out functions that arise from interactions among member species. These functions, such as waste degradation (1, 2), probiotics (3), and vitamin production (4), can be useful for human health and biotechnology. To improve collective functions, one can perform artificial selection (directed evolution) on collectives: Low-density “Newborn” collectives are allowed to “mature” during which cells proliferate and possibly mutate, and community function develops. “Adult” collectives with high functions are then chosen to reproduce, each seeding multiple offspring Newborns. Artificial selection of collectives have been attempted both in experiments (519) and in simulations (2030), often with unimpressive outcomes.

One of the major challenges in selecting collectives is to ensure the inheritance of a collective function (31, 32). Inheritance can be compromised when genotype and species compositions change from a parent collective to offspring collectives. During maturation of a collective, genotype compositions within each species can be altered by intra-collective evolution, while species compositions can be shifted by ecological interactions. Furthermore, during reproduction of a collective, genotype and species compositions of offspring can vary stochastically from those of the parent.

Here, we consider the selection of collectives of two types with different growth rates to achieve a taget composition in the Adult collective. Earlier work has demonstrated that nearly any target species composition can be achieved when selecting communities of two competing species with unequal growth rates (24, 33), so long as the shared resource is depleted during collective maturation (24). In this case, both species initially evolved to grow faster, and the slower-growing species was preserved due to stochastic fluctuations in species composition during collective reproduction. Eventually, both species evolved to grow sufficiently fast to deplete the shared resource during collective maturation, and evolution in competition coefficients then acted to stabilize species ratio to the target value (24).

Here, we mathematically examine the selection of target type compositions in two-type collectives when nutrient is always in excess. This allows us to analytically examine the evolutionary tipping point between intra-collective and inter-collective selection. We show that this tipping point creates a “waterfall” effect which restricts not only which target compositions are achievable, but also the initial composition required to achieve the target. We also investigate how the range of achievable target composition is affected by the population size in Newborns and the total number of collectives under selection. Finally, we show that the waterfall phenomenon extends to systems with more than 2 types.

Results and Discussions

We start with selecting for a target composition in collectives of fast-growing F and slow-growing S types. We show that intermediate F frequencies or equivalently, intermediate S frequencies, are the hardest targets to achieve. We then show that similar conclusions hold when selecting for a target composition in collectives of more than two types. We mainly describe in terms of F frequency f = F/(S + F), which is related to S frequency s = 1 − f .

A selection cycle

A selection cycle (Fig. 1) starts with a total of g Newborn collectives. We consider two genotypes - slow-growing type (S) and fast-growing type (F), with S mutating to F at a rate μ. At the beginning of cycle k (t = 0), a collective has a fixed total cell number where and denote the numbers of S and F cells in collective i (1 ≤ ig) at time t (0 ≤ tτ) of cycle k. The average F frequency among the g Newborn collectives is , such that the initial F cell number in each Newborn is drawn from the binomial distribution Binom .

Schematic for artificial selection on collectives.

Each selection cycle begins with a total of g Newborn collectives (black open circles), each with N0 total cells of slow-growing type (S, red-colored dots) and fast-growing type (F, blue-colored dots). During maturation (over time τ), S and F cells divide at rates r and r + ω (ω > 0), respectively. S mutates to F at rate μ. In the selection stage, the Adult collective with F frequency f closest to the target composition is chosen to reproduce g Newborns for the next cycle. Newborns are sampled from the chosen Adult (yellow star) with N0 cells each Newborn. The selection cycle is then repeated until the F frequency reaches a steady state, which may or may not be the target composition. To denote a quantity x of i-th collective in cycle k at time t (0 ≤ tτ), we use notation where x ∈ {S, F, s, f }. Note that time t = 0 is for Newborns and t = τ is for Adults.

Collectives are allowed to grow for time τ (‘Maturation’ in Fig. 1). During maturation, S and F grow at rates r and r + ω (ω > 0), respectively. If maturation time τ is too small, matured collectives (“Adults”) do not have enough cells to reproduce g Newborn collectives with N0 cells. On the other hand, if maturation time τ is too long, fast-growing F will take over. Hence we set the maturation time τ = ln(g + 1)/r, which guarantees sufficient cells to produce g Newborn collectives from a single Adult collective.

At the end of a cycle, the Adult with the highest function is chosen to reproduce g Newborn collectives each with N0 cells (‘Selection’ and ‘Reproduction’ in Fig. 1). This allows us to employ the simplest version of the extreme value theory, and in our case, choosing top 1 outperforms a less stringent “choosing top 5%” selection regime. Collective function is dictated by the F frequency f, and is maximized at the target value . Thus among all Adult collectives, the chosen Adult has F frequency f closest to the target value . For the next cycle, the number of F cells in Newborns follow a binomial distribution with the mean value N0f . By repeating the selection cycle, we aim to achieve the target composition .

The success of collective selection is constrained by the target composition, and sometimes also by the initial composition

Since intra-collective selection favors F, we expect that a higher target (a lower target ŝ) is easier to achieve. Note that once a target frequency is achieved, inter-collective selection is always required to maintain that frequency due to the fitness difference between the two types.

We fixed N0,the total population size of a Newborn, to 1000, and obtained selection dynamics for various initial and target F frequencies by implementing stochastic simulations (Sec. 1 in Supplementary Information). If the target value is high (e.g., Figure 2a magenta), selection is successful in the sense that the absolute error d between the target frequency and the selected frequency averaged among independent simulations ⟨f ⟩ is smaller than 0.05 (i.e computed absolute errors are shown in Supplementary Information Fig. S3). F frequency of the chosen collective eventually converges to the target value and stays around it, regardless of the initial frequency. In contrast, without collective-level selection (e.g. choosing a random collective to reproduce), F frequency increases until F reaches fixation (Supplementary information Fig. S4b).

Initial and target compositions determine the success of artificial selection on collectives.

(a-c) Mutant frequency of the selected Adult collective (f ) over cycles. The target frequency is marked as a dotted line, while f H (black solid line) and f L (black dashed line) delineate boundary values that define the region of successful selection. a A high target F frequency ( ; magenta dotted line) can be achieved from any initial frequency (black dots). b An intermediate target frequency ( ; green dotted line) is never achievable, as all initial conditions converge to near f H . c A low target freqeuncy is acheiveable, but only from initial frequencies below f L. For initial frequencies at f L, stochastic outcomes (grey curves) are observed: while some replicates reached the target frequency, other reached f H. For parameters, we used wildtype growth rate r = 0.5, F growth advantage ω = 0.03, mutation rate μ = 0.0001, maturation time τ ≈ 4.8, and N0 = 1000. The number of collective g = 10. Each black line is averaged from independent 300 realizations. d Two accessible regions (gold). Either high (region 2) or low starting from low initial f (region 1) can be achieved. We theoretically predict (by numerically integrating Eq. 1) the boundaries of success regions, f H (black solid line) and f L (black dashed lines), which agree with simulation results (gold regions). e Example trajectories from initial compositions (black dots) to the target compositions (dashed lines). The gold areas indicate the region of initial frequencies where the target frequency can be achieved. f The tension between intra-collective selection and inter-collective selection creates a “waterfall” phenomenon. See the main text for details.

In contrast, an intermediate target frequency (e.g. Fig. 2b green) is never achievable. High initial F frequencies (e.g. 0.6 and 0.95) decline toward the target, but stabilize at an “higher-threshold” (f H ∼ 0.7, solid line segment in Fig. 2a-c) above the target. Low initial F frequencies (e.g. 0) increase toward the target, but then overshoot and stabilize at the f H value.

If the target frequency is low (e.g. Fig. 2c red line), artificial selection succeeds when the initial frequency is below a “lower-threshold” (f L, dashed line segment in Fig. 2a-c). Initial F frequencies above f L (e.g. 0.45 and 0.95) converge to f H instead. Initial F frequencies near f L display stochastic trajectories, converging to either f H or the target.

In summary, two regions of target frequencies are “accessible” in the sense that a target value in the region can be maintained once reached; gold in Fig. 2d, e): (1) target frequencies above and (2) target frequencies below and starting at an average frequency below .

Intra-collective evolution is the fastest at intermediate F frequencies, creating the “waterfall” phenomenon

To understand what gives rise to the two accessible regions, we calculated how the distribution of F frequency f might change after one cycle. We consider the case where , the F frequency of the selected Adult at cycle k, is above the target value . This case is particularly challenging because intra-collective evolution favours fast-growing F and thus will further increase f away from the target. From , Newborns of cycle k + 1 will have f fluctuating around , and after they mature,in F frequency over one cycle, the minimum f is selected . If the selected composition at cycle k + 1 can be reduced compared to that of cycle k (i.e. ), the system can evolve to the lower target value.

Mathematically, the conditional probability distribution Ψ of (the F frequency of the selected Adult at cycle k + 1 being f) given the selected at cycle k is

Equation 1 can be broken down into two parts: (i) describes that any one of the g collectives must achieve f, where denotes the probability of an Adult (t = τ) in cycle k + 1 achieving f given . (ii) describes that the remaining g − 1 collectives must all achieve a frequency above f (and thus not selected). is calculated by integrating across ζ the term (the probability density of fk+1,0 of a Newborn in cycle k + 1 achieving ζ given ) multiplied by the term (the probability density of an Adult achieving f′ given Newborn’s frequency ζ, see Methods).

When we approximate the Adults’ f in cycle k + 1 as Guassian with mean and variance , we can obtain an analytical solution of the change in f over one cycle - the difference between the median value of the conditional probability distribution Ψ and the selected frequency of (Sec. 2 in Supplementary Information):



e is Euler’s number (2.71828…), and Φ−1(…) is an inverse cumulative function of standard normal distribution. For those unfamiliar with Φ−1(…), y = Φ−1(x) (where 0 ≤ x ≤ 1) is the number such that the probability of a standard normal random variable being less than or equal to y is x (e.g. 0 = Φ−1(0.5)). Eq. [2] contains all experimental parameters, including g (the total number of collectives under selection), τ (maturation time), μ (mutation rate), r (growth rate of the slow-growing S type), and ω (growth advantage of F over S). Equation [2] has the same shape as results from numerically integrating Eq. [1] and from stochastic simulations (Fig. 3a).

Intra-collective selection and inter-collective selection jointly set the boundaries for selection success.

a The change in F frequency over one cycle. When is sufficiently low or high, inter-collective selection can lower the F frequency to below . The points where Median are denoted as f L and f H, corresponding to the boundaries in Fig. 2. b The distributions of frequency differences obtained by 1000 numerical simulations. The cyan, purple, and black box plots indicate the changes in F frequency after intra-collective selection during maturation, after inter-collective selection, and over one selection cycle, respectively. The box ranges from 25% to 75% of the distribution, and the median is indicated by a line across the box. The upper and lower whiskers indicate maximum and minimum values of the distribution. ***: P < 0.001 in an unpaired t-test.

F frequency can be reduced if the selected is low or high, but not if it is intermediate between f L and f H (Fig. 3a). Increase in f during maturation is the most drastic when Newborn f is around 0.5 (Fig. 3b blue). This is intuitive: when Newborn f is low, the increase will be minor; when Newborn f is high, the fitness advantage of F over the population average is small and hence the increase is also minor. Thus, when Newborn F frequency is intermediate, inter-collective selection may not be able to counter intra-collective selection (Fig. 3b).

Not surprisingly, similar conclusions are derived where S and F are slow-growing and fast-growing species which cannot be converted through mutations (Sec. 3 and Fig. S7 in Supplementary Infomation).

Overall, inter-collective selection is akin to a raftman, rowing the raft to a target, while intra-collective selection is akin to a waterfall. This metaphore is best understood in terms of S frequency s = 1 − f . Then, the lower-threshold f L corresponds to higher-threshold in sH = 1 − f L, while higher-threshold f H corresponds to lower-threshold in sL = 1 − f H . Intra-collective selection is akin to a waterfall, driving the S frequency s from high to low (Fig. 2f). Intra-collective selection acts the strongest when s is intermediate (sL < s < sH), similar to the vertical drop of the fall. Intra-collective selection acts weakly at high (> sH) or low (< sL) s, similar to the gentle sloped upper and lower pools of the fall (regions 1 and 2 of Fig. 2d, f and Fig. 3a). Thus, an intermediate target frequency can be impossible to achieve - a raft starting from the upper pool will be flushed down to sL (f H), while a raft starting from the lower pool cannot go beyond sL (f H). In contrast, a low target S frequency (in the lower pool) is always achievable. Finally, a high target S frequency (in the upper pool) can only be achieved if starting from the upper pool (as the raft can not jump to the upper pool if starting from below).

Manipulating experimental setups to expand the achievable target region

Variation among collectives is a key element for successful selection. Here, variation among collectives originates from the stochastic nature of growth and mutation during collective maturation, and of sampling effects during collective reproduction. From Eq. [4], the variance of f at time τ scales with . Indeed, as Newborn size N0 declines, the region of acheivable of target frequency expands (larger gold area in Fig. 4a). If the Newborn size N0 is smaller than ∼700, any target frequency can be reached when other parameters are fixed.

Expanding the success region for artificial collective selection.

a Reducing the population size in Newborn N0 expands the region of success. In gold area, the probability that becomes smaller than in a cycle is more than 50%. We used g = 10 and τ ≈ 4.8. Figures 2-3 correspond to N0 = 1000. b Increasing the total number of collectives g also expands the region of success, although only slightly. We used a fixed Newborn size N0 = 1000. The maturation time τ (τ = log(100)/r) is set to be long enough so that an Adult can generate at least 100 Newborns.

The number of collectives g also affects the selection outcomes. With more collectives, the chance to find a f closer to the target is more likely. Thus, a larger number of collectives broadens the region of success (Fig. 4b). However, the effect of g is not dramatic. To see why, we note that the only place that g appears is Eq. [2] in the form of . When g becomes large, Φ−1(1/g) is asymptotically expressed as (Sec. 2 in Supplementary Information)(34). Thus, φ(g) does not change dramatically as g varies.

The waterfall phenomenon in a higher dimension

In contrast to the above two-type case where the evolutionary trajectory travels along a one-dimensional line from the initial frequency to the final frequency, in a three-type case the trajectory travels in a two-dimensional plane. To examine the waterfall effect in a higher dimension, we investigate a three-type system where a faster-growing type (FF) grows faster than the fast-growing type (F) which grows faster than the slow-growing type (S) (Fig. 5a and Sec. 6 in Supplementary Information). In this system, a target type composition can be achieved if inter-collective selection reduces the frequencies of F as well as FF (accessible region, gold in Fig. 5b). Strikingly, a target composition in an accessible region may not be achievable even when the initial composition is within the same region: once the composition escapes the accessible region, the trajectory cannot return back to the accessible region (Fig. 5biii, the leftmost initial condition). However, if the initial position is closer to the target in the accessible region, the target becomes achievable (Fig. 5biii, initial condition near the bottom).

In higher dimensions, the success of artificial selection requires the entire evolutionary trajectory remaining in the accessible region.

a During collective maturation, a slow-growing type (S) (with growth rate r; dark red) can mutate to a fast-growing type (F) (with growth rate r + ω; blue), which can mutate further into a faster-growing type (FF) (with growth rate r + 2ω; purple). Here, the rates of both mutational steps are μ, and ω > 0. b Evolutionary trajectories from various initial compositions (open circles) to various targets. Intra-collective evolution favors FF over F (vertical blue arrow) over S (horizontal blue arrow). The accessible regions are marked gold (see Sec. 1 in Supplementary Information). We obtain final compositions starting from several initial compositions while aiming for different target compositions in i, ii, and iii. The evolutionary trajectories are shown in dots with color gradients from the initial to final time. (i) A target composition with a high FF frequency is always achievable. (ii) A target composition with intermediate FF frequency is never achievable. (iii) A target composition with low FF frequency is achievable only if starting from an appropriate initial composition such that the entire trajectory never meanders away from the accessible region. The figures are drawn using mpltern package (35).

In conclusion, we have investigated the evolutionary trajectories of type composition in collectives under selection, which are governed by intra-collective selection (which favors fast-growing types) and inter-collective selection (which, in our case, strives to counter fast-growing types). Intra-collective selection has the strongest effect at intermediate frequencies of faster-growing types, potentially creating an inaccessible region analogous to the vertical drop of a waterfall. High and low target frequencies are both accessible, analogous to the lower- and the upper-pools of a waterfall, respectively. A less challenging target (low ŝ)is achievable from any initial position. In contrast, a more challenging target (high ŝ) is only achievable if the entire trajectory is contained within the region, similar to a raft striving to reach a point in the upper-pool must start at and remain in the upper pool. Our work suggests that the strength of intra-collective selection is not constant, and that strategically choosing an appropriate starting point can be essential for successful collective selection.

Materials and Methods

Stochastic simulations

A selection cycle is composed of three steps: maturation, selection, and reproduction. At the beginning of the cycle k, a collective i has slow-growing cells and fast-growing cells. At the first cycle, mean F frequency of collectives are set to be is sampled from the binomial distribution with mean . Then, S cells are in the collective i. In the maturation step, we calculate and by using stochastic simulation. We can simulate the division and mutation of each individual cell stochastically by using tau-leaping algorithm (36, 37)(see Fig. S4a and b in Supplementray Information). However, individual-based simulation requires a long computing time. Instead, we randomly sample and from the probability distribution . To obtain , we solve the master equation which describes a time evolution of probability distribution, under the random processes (see Sec. 1 in Supplementary Information). We assumed that follows the Gaussian distribution. The mean and variance of the distribution are numerically obtained from ordinary differential equations which is derived from the master equation (see Sec. 1 in Supplementary Information). We choose the collective with the closest frequency to the target . The selected collective generates g Newborns. The number of F cells is sampled from the binomial distribution with the mean of . We start a new cycle with those Newborn collectives. Then, the number of S cells in a collective i is

Analytical approach to the conditional probability

The conditional probability distribution Ψ that we observe at a given is calculated by the following procedure. Given the selected collective in cycle k with , the collective-level reproduction proceeds by sampling g Newborn collectives with N0 cells in cycle k +1. Each Newborn collective contains certain F numbers at the beginning of the cycle k + 1, which can be mapped into with the constraint of N0 cells. If the number of cells in the selected collective is large enough, the joint conditional distribution function is well described by the product of g independent and identical Gaussian distribution 𝒩(μ, σ2). So we consider the frequencies of g Newborn collectives as g identical copies of the Gaussian random variable fk+1,0. The mean and variance of fk+1,0 are given by and . Then, the conditional probability distribution function of fk+1,0 being ζ is given by

After the reproduction step, the Newborn collectives grow for time τ . The frequency is changed from the given frequency ζ to f by division and mutation processes. We assume that the frequency f of an Adult is also approximated by a Gaussian random variable . The mean and variance are calculated by using means and variances of S and F (see Sec. 2 in Supplementary Information.) Since and also depend on ζ, the conditional probability distribution function of fk+1,τ being f is given by

The conditional probability distribution of an Adult collective in cycle k + 1 (fk+1,τ) to have frequency f at a given is calculated by producting two Gaussian distribution functions and integrating overall ζ values, which is given by

Since we select the minimum frequency among g identical copies of fk+1,τ, the conditional probability distribution function of follows a minimum value distribution, which is given in Eq. [1]. Here, For the case of , the selected frequency is the minimum frequency So we have by replace to

We assume that the conditional probability distribution in Eq. [7] follows a normal distribution, whose mean and variances are describe by Eq. [3] and Eq. [4]. Then, the extreme value theory (38) estimates the median of the selected Adult by

The F frequency difference in Eq. [2] is obtained by substracting from Eq. [8].

Data and source code availability

Data and source code of stochastic simulations are available in


J. Lee and H.J. Park were supported by the National Research Foundation of Korea grant funded by the Korea government (MSIT), Grant No. RS-2023-00214071 (H.J.P.) and by an appointment to the JRG Program at the APCTP through the Science and Technology Promotion Fund and Lottery Fund of the Korean Government. W. Shou was supported by an Academy of Medical Sciences Professorship and a Royal Society Wolfson Fellowship. This was also supported by the Korean Local Governments– Gyeongsangbuk-do Province and Pohang City and INHA UNIVERSITY Research Grant. We thank Su-Chan Park, Li Xie, and Alex Yuan for constructive comments and discussions.

Supplementary Information

1. Stochastic simulation for the evolution of collective function

In the main text, we design a simple model of artificial selection on collectives. The collectives are composed of two genotypes, slow-growing (S) and fast-growing (F) genotypes. We start with g ‘Newborn’ collectives and each collective i (i = 1, …, g) has S cells and F cells. The index k is a cycle number and t is time indices where 0 ≤ tτ . At the beginning of cycle k (t = 0), any collective has a fixed total cell number . The collectives are allowed to “mature” for t = τ, where S and F populations grow at rates of r and r + ω, respectively. S cells also mutate to F cells at a rate of μ per cell, and the back mutation from F to S is ignored. After maturation, we choose the collective with F frequency the closest to the target value , and the chosen collectiv e“reproduce” g newborn collectives each with N0 cells for the next cycle.


In this subsection, we ignore the cycle number index k and collective index (i) in convenience. So we denote S(t) and F (t) as representative notations of and , respectively. The initial S and F cell numbers in a Newborn collective are and . We consider mutation that is beneficial to grow (ω > 0).

We consider the population growth of S(t) and F (t) as a stochastic process. We use P (S, F, t) to denote the probability to have S S cells and F F at time t from the beginning of a cycle. Birth and mutation events stochastically happen which are represented by the three chemical reaction rules:

where S and F denote individual cells of a S type and a F type, respectively. The reaction rules can be simulated via tau-leaping algorithm (1, 2), but the individual-based simulation requires long simulation times to get enough results. Hence we go over probability distribution P (S, F, t) to sample compositions instead of tracking all events that occurred in individual collectives. First, we calculate the distribution P (S, F, t) analytically, and then sample numbers of S and F cells from this distribution at the end of cycle t = τ . Below, we describe the derivation of P (S, F, t).

The master equation of P (S, F, t), which follows reaction rules Eq. [1-3], is given by

Here, dt is small such that at most one cell birth or mutation event could occur. The first term describes the scenario where a single birth event of a S cell happens during time interval [t, t + dt), which changes the collective’s composition from (S −1, F) to (S, F). Similarly, the second term comes from a birth event of a F cell. The third term indicates the mutation event. The last term corresponds to the outflow of probability density by birth and mutation processes, which induces the changes in composition from (S, F) to others.

Calculating the exact form of P (S, F, t) is not simple. Instead, we consider the probability distributions of S(t) and F (t) at time t. We assume that the mutation rate is much smaller than the growth rates, and hence the correlation between S and F is sufficiently small. Then, P (S, F, t) ∼ P (S, t)P (F, t). The distributions of S and F can be approximated as Gaussian (𝒩), which is supported by the Central Limit Theorem and Figure S1 (note the concordance between the blue and green distributions). That is, the probability distribution .

The means are defined by and The differential equations for means are obtained by applying the definition to the master equation in Eq. [4], as

We assume that the mutation rate μ is very smaller than r and ω. By solving Eq. [5] and Eq. [6], the means and are given by

where and are the mean number of S and F cells at the beginning of cycle t = 0. Note that the second term of Eq. [8] is consistent with the studies on the distribution of the number of F cells starting from S0 S cells (3).

We define the second momenta of S and F as

Then, corresponding differential equations are given by

The solution of Eq. [11] is

where is the second moment of initial values . Thus, the variance is

where is a variance of S cell numbers at the beginning of cycle t = 0. In Eq. [12], we require to calculate . The master Eq. [4] provides a differential equation for as

The solution of Eq. [16] is given by

By using Eq. [17], the solution of Eq. [12] is given by

where is the second moment of initial values . Thus, the variance is given, up to the order of μ, by

Using Eq. [7,8,15,19], we construct Gaussian distribution functions for S(t) and F (t), and . Then, numbers of S cells and F cells are randomly sampled from each distribution: S (τ) from and F( τ) from . This process requires only two sampling steps for each collective.

Comparison between the calculated Gaussian distribution (“Gauss”, with the mean and variances computed from Eq. [7,8,15,19]) and simulations using tau-leaping (“tau”) and sampling (“samp”) methods. The simulations run 500 times. The initial number of cells are S0 = 200 and F0 = 800. The parameters are r = 0.5, ω = 0.03, μ = 0.0001, and τ = 4.8.


After the maturation step is finished in cycle k, we compute the frequencies of the F cells in each collectives where we denote the frequency of collective i at time t = τ in cycle k as . We select one frequency which is the closest value to the target frequency . The frequency selected at the cycle k is denoted by . In mathematical expression, the selected frequency is


We divide the selected collective into g Newborn collectives. We sample F cells from the selected collective for the Newborns in the next cycle k + 1. The exact description of reproduction step is described by consecutive sampling without replacement from the selected collective. Then the joint probability distribution follows a multivariate hypergeometric distribution. The selected collective is assumed to have cells and F cells. If the size of selected collective is large enough, the consecutive sampling is approximated to the independent binomial sampling (see Fig. S2). Thus, we independently sample g numbers of F cells, , from the binomial distribution. The probability mass function of is given by

After the sampling, the numbers of S cells are set to be for each collective. These Newborn collectives start the maturation in the next cycle.

Simulation Result

In the main text, we simulate artificial selection by using stochastic simulation. In Fig. S3, we draw the absolute error d between the target frequency and at the end of simulations (1000 cycles). In the colormap, errors higher than 0.15 are marked with gray, which indicates the failure of selection. The dashed lines indicate the same boundary in Fig 2d in the main text.

Figure S4 presents the composition trajectories of all collectives using tau-leaping algorithm. The selected collectives have the closest compotisition to the target composition at the end of each cycle. The selected collective can have smaller frequency than its parent collective, so the composition can be lowered after cycles.

2. Conditional probability distribution of the selected collective frequency f *

Since maturation and reproduction are stochastic processes, the frequency of the selected collective at cycle k + 1 starting from is also described by the probability distribution. In this case, depends on , and the distribution of follows the conditional probability distribution . Below, we describe calculation of the conditional probability distribution .

We focus on the situation where the change of frequency in the selection step has the opposite direction of the maturation step. That means the target frequency is smaller than , and the collective with the smallest frequency will be selected among g Newborn collectives. We can calculate the probability distribution of the Newborn collectives’ frequencies at the end of cycle k + 1 by using Eq. [7,8,15,19] and Eq. [21]. Then, we obtain a distribution of the minimum value among g copies of the collectives. The minimum value distribution is the conditional probability distribution .

Let us start the calculation from the reproduction step at the end of cycle k. The probability distribution of the F cell numbers in Newborn collectives is given in Eq. [21]. If the total number of cells in a Newborn collective N0 is large enough, Eq. [21] is approximated by the Gaussian distribution . Then, the probability density function that to be ζ in Newborn collective i is

The Newborn collective i maturates in cycle k + 1 with the initial cell numbers and . Since all Newborn collectives are independent and identically distributed, we can drop the superscript (i) for simplicity. We can express S(t) and F (t) based on Eq. [7,8,15,19] as

where GS (t) and GF (t) are random variable following the standard distribution 𝒩 (0, 1). Here we ignore cycle index k + 1 in subscription for convenience. Then, we can approximately write f (t) as

Comparison between consecutive sampling and independent binomial sampling. A parent collective is divided into 10 collectives. The histogram labeled with ‘MHG’ is the probability mass function of F of the fifth collective sampled via multivariate hypergeometric distribution. The independent binomial sampling is labeled with ‘BN’. The initial numbers of cells are S = 8000 and F = 2000 for the left panel, and S = 20 and F = 5 for the right panel. 10000 samples are drawn for each distribution.

Color map of the absolute error between frequency ⟨ f *⟩ of the averaged selected collectives at the end of simulations (k = 1000) and the target frequency . The dashed lines are drawn by the arguments in the main text. For parameters, we used r = 0.5, ω = 0.03, μ = 0.0001, N0 = 1000, g = 10 and τ ≈ 4.8.

a Trajectories of F frequency for 10 collectives (g = 10) over time. The collective whose frequency is closest to the target value is selected in every cycle (black lines). The gray lines denote the other collectives. For parameters, we used S growth rate r = 0.5, F growth advantage ω = 0.03, mutation rate μ = 0.0001, maturation time τ ≈ 4.8, and N0 = 1000. b Comparison between frequency trajectories with (black) and without (blue) selection clearly shows the effect of artificial selection. The black line indicates F frequency of the selected collective at each cycle in a. The blue line indicates the average trajectory without selection (the average of g = 10 individual lineages without collective-level selection at the end of each cycle).

The mean of f is given by

and the variance is

The average size of collectives is . In terms of a probability distribution, a random variable that a Adult’s F frequency f (τ) = fτ follows the Gaussian distribution whose the probability density function is given by

With indices of cycle k and k + 1, the conditional probability that a offspring collective has frequency f after the maturation in cycle k + 1 from the parent frequency is

After maturation in cycle k + 1, the collective with the smallest frequency is selected among g Adult collectives. The probability distribution function of is obtained by the extreme value statistics (4). The cumulative distribution function of the minimum value is given by

The above Eq. [30] simply states that if every single of g collectives at the end of maturation (time τ) in cycle k + 1 have frequencies greater than or equal to f . Frequencies are independent and identically distributed, and thus . Note that Prob , and Eq. [30] becomes

Then, the probability density function is obtained by differentiating Eq. [31] with respect to f and replacing ,

We compute the probability density function [32] by using numerical integration and compare it with the stochastic simulation results in Fig. S5. Both distributions agree with each other.

To get the analytic approximation of the median of Eq. [32], we assume that the F frequency distribution of Adult collective is Gaussian form. Thus, instead of calculating marginal probability with ζ variable as in Eq. [29], we only calculate the mean and variance , from Newborn’s frequency distribution of :

which are Eq. [3] and Eq.[4] in the main text, respectively.

The probability density functions of the selected collective’s frequency with the offset of f *. The blue distribution is obtained using 300 stochastic simulation results. The orange distribution represents Eq. [32] computed by numerical integration. The median values of the distributions are shown in Fig. S3a in the main text.

We note that is proportional to . The variances in Eq. [15] and [19] are linearly scale with N0 by using and from Eq. [22]. The mean collective size is also proportional to N0 because the average cell numbers in Eq. [7] and [8] are linear to N0. Thus, the scaling relation of Eq. [27] is given .

The median (Median) of Eq. [32] satisfies , which means . If we assume the distribution Eq. [32] is the Gaussian form, the inverse function is writen as

where Φ−1(y) is an inverse cumulative density function (CDF) of the normal distribution with the mean in Eq.[34] and the standard deviation σf (τ), a square root of Eq.[34]. Substracting from Eq. [35] gives Eq. [2] in the main text.

Further, we get an asymptotic expression of Φ−1(ln 2/g) when g is large (or Φ−1(y) with small y). We start from the CDF of the standard normal distribution, where the function erfc(x) is an complementary error function. To get the expression of Φ−1, we need an asymptotic expression of inverse of y = erfc(x) function (x = erfc−1(y)) as the inverse . Here, we introduce a method in (5). The known asymptotic expansion of y = erfc(x) for large x is erfc . By taking logarithm in both side, we have

Replacing x2 in the righthand side in Eq. [36] into the expression itself, we get continued logarithmic form of

Inserting = erfc 1(y) (square root of Eq. [37]) into the inverse CDF , we have . So, the asymptotic expression of Φ−1(ln 2/g) is given by

3. Selection without mutation

When the mutation rate is zero, two genotypes behave as two distinct species. So the S corresponds to the slower-growing species and the F corresponds to the faster-growing species. Then, our problem is generalized to the selection of two-species collective. The compositional change is provided by Eq. [35] with setting μ = 0. Corresponding and become

Equations [39] and [40] suggest that when a community consists of two competing species, we obtain similar conclusions on the accessible region for target composition. The stochastic simulation results are present in Fig. S7.

4. Selecting more than one collective

In the main text, we choose one collective which have the closest frequency value to the target among g collectives. Such ‘Top 1’ stratedgy has an advantage on applying extreme value theory while it has potentially lose the ‘unlucky’ Adults (6). Such ‘unlucky’ Adults are further to the target value than the selected one, but have possiblity to be selected in next cycle. The ‘Top tier’ stratedgy may secure the ‘unlucky’ Adult by choosing more than one collective (6). So we test the ‘top-tier’ stratedgy by choosing 5 among 100 Adults with the distance from the target value in Fig. S8. The top-tier stratedgy is shown to be inefficient in the two-genotype system. This is because the system is too simple, so lower ranked Adult is lower again in next cycle.

Scaling relation of F frequency variance (Eq. [27]) with Newborn collective size N0. The initial F frequency is 0.3. The parameters are r = 0.5, ω = 0.03, μ = 0.0001, and τ ≈ 4.8.

Color map of the absolute error between frequency ⟨f *⟩ of the averaged selected collectives at the end of simulations (k = 1000) and the target frequency . For parameters, we used r = 0.5, ω = 0.03, μ = 0, N0 = 1000, g = 10 and τ ≈ 4.8.

Comparison of selecting Top-tier 5 with Top 1. We breed 100 collectives and choose 5 collectives with the closest to the target value.

5. Deleterious mutation

In the main text, we show the target composition can be achieved in some ranges of initial and target values when the mutation is beneficial to growth. The same analogy can be applied even when the mutation is deleterious. Since the F cells grow slower than the S cells (ω < 0), the F frequency naturally decreases in the maturation step. Then, the conditional probability distribution is a maximum value distribution of Eq. [29]. Thus, instead of Eq. [30], we look for the cumulative distribution function of the maximum value such that

If all frequencies are independent and identical distributed random variables, the cumulative distribution function becomes

Likewise in the previous section, we get the conditional probability density function by differentiating Eq. [42] with respect to f and replacing as

The distribution in Eq. [43] is evaluated for various in Fig. S9a with numerical simulations and the median values of distributions are presented in Fig. S9b. In the case of ω = − 0.03, the target frequency lower than around 0.3 and larger than around 0.7 can be selected. Since the sign of ω is opposite to the result in the main text, the diagram is reversed from Fig. 2d in the main text.

6. Extension to the system including double-mutation

In the main text, we categorize the target frequencies based on whether the frequency change in the maturation step can be compensated in the selection step. Here, we extend the idea into more complex systems in order to suggest a possible generalization. We assume that collectives consist of three genotypes with slow-growing(S), fast-growing(F), and faster-growing(FF) types. The growth rate of S is r. Each mutation adds the growth rate with ω. So the F and FF type have growth rates with r + ω and r + 2ω, respectively. The mutation rate is μ. So, the birth and mutation events are written by the chemical reactions:

Here, FF is an individual of a F type. We write a master equation of the processes for P (S, F, FF, t) which is the probability to have S S cells, F F cells, and FF FF cells at time t.

The composition of collective i in cycle k is now represented with two frequencies where the F frequency and the FF frequency . Then, the target composition is set to be . The composition of the selected Adult in cycle k is . We apply the processes used in the above section 2 to obtain the conditional probability by using the master Eq. [49].

At the reproduction step in cycle k, we choose N0 cells from the selected Adult whose composition is . Then, Newborn collectives are independently sampled from a multinomial distribution. For convenience, we drop the collective index (i). Then, the conditional probability mass function to choosing Fk,0, FFk,0 cells are represented by

where the number of S Sk+1,0 is automatically set to be Sk+1,0 = N0Fk+1,0FFk+1,0. Then, the approximated multivariate normal distribution is where the mean distribution is and covariance matrix is Mk+1. The diagonal terms of Mk+1 are variances and the offdiagonal terms are covariances . The matrix is given by

Artificial selection also works for deleterious mutation. a Conditional Probability density functions of for various values. The left-hand side distribution is obtained from simulations and the right-hand side distribution is numerically obtained by evaluating Eq. [43]. Small triangles inside indicate the median values of the distributions. b the median value of distributions at a given . The points where the shifted median becomes zero, Median are denoted as f L and f U, respectively. c The relative error between the target frequency and the ensemble averaged selected frequency is measured after 1000 cycles starting from the initial frequency . Either the lower target frequencies or the higher target frequencies starting from the high initial frequencies can be achieved. The black dashed lines indicate the predicted boundary values f U and f L in a.

Then a Newborn’s composition ρ(ζ, η) follows the multivariate Gaussian distribution whose the joint probability distribution function is given by

At the beginning of cycle k, a newborn collective starts to maturate from (S0, F0, FF0) cells (for convenience, cycle index k is dropped.) In terms of (ζ, η), each initial numbers are S0 = N0(1 − ζη), F0 = N0ζ, and FF0 = N0η. Their initial covariance matrix is . By using Eq. [49], we can write ordinary differential equations up to the second moment.

The initial conditions of the system in coupled Eqs. [53-61] are obtained by the mean and (co)variances of Eq. [50]. By solving equations numerically, we obtain a set of mean cell numbers and a set of variances as well as covariances (σSF, σFFF, σSFF). We assume that the covariances are smaller than the variances. We consider S, F, and FF as Gaussian random variables

Then, the F frequency becomes

where and . Similarly, the FF frequency is

where and . If the covariances are small enough, we can approximate the joint probability distribution of Adult’s composition (fτ, hτ) = pτ as

With cycle index k, we get conditional probability of matured collectives by

We select the Adult collective among g Adult collectives such that the change in frequencies during maturation could be compensate. During maturation, a freuquency distribution moves different direction in (f, h) space depending on the initial composition . So, we take different directions to obtain the extreme value distributions. Considering only the sign of the frequency changes in f and h, we take either maximum or minimum. The mean change in h is always positive in the whole (f, h) space since is always positive in Eq. [55]. Thus, we choose the minimum value in every selection step.

If the mean is larger than , the minimum value among will be chosen in selection step to compensate the frequency change in maturation step. Let us denote the selected valued of f and h and . We temporally drop time index τ for simplicity. Then, the joint cumulative distribution function is

The probability can be converted as

where is a conditional joint cumulative distribution function of (f (i), h(i)). The marginal cumulative distribution functions are

Similarly, the probabilities and are converted into and . Thus, the joint cumulative distribution function is

Then, the conditional probability of the selected collective is given by

where and .

If the mean is smaller than , the chosen collective is likely to have maximum f values among g matured collectives. Then, the definition of f * is written by . We rewrite the joint cumulative distribution function to be little different from Eq. [69] because now we have to utilize the condition f * < f instead of f * > f,

The probability is converted as

Thus, the joint cumulative distribution function is given by

In this case, the conditional probability distribution function is given by

By replacing (f, h) to , we finally obtain the conditional probability distribution ,

Using Eq. [79], we get the mean values of f * and h* as

We define the accessible region in frequency space where the signs of the changes in both F frequency and FF frequency after a cycle are opposite to that of maturation (see Fig. S10),

where and are the mean values after the maturation step in cycle k + 1. Figure S10a shows a concise explanation of the classification of regions where the initial and target compositions satisfy the above conditions in Eq. [82]. Or, if the condition is not met, the composition of the selected collective may diverge from the target composition after several cycles. The accessible regions are marked in the gray-colored area in Fig. S10b. Similar to the case of the two-genotype population, the accessible region is shaped by the flow velocity of the composition in the maturation step, as depicted in the flow diagram in Fig. S10b. Both F and FF frequencies tend to increase, and the inter-collective selection can compensate for these changes if the composition changes slowly when the F and FF frequencies are small. However, if the changes occur too rapidly when the FF frequency is intermediate, the frequency cannot be stabilized. So the accessible region is limited to the regions where the composition changes slowly.

a The schematic procedure to determine accessible regions by collective selection. From the given parent composition (blue dot), the probability distribution of offspring Adults (Eq. [68]) is computed (marked in the orange-colored area). From Adult compositions, the probability distribution of the selected collective (Eq.[79]) is computed (marked in the green area). If the signs of the changes in both F frequency and FF frequency after the selection (from blue dot to green dot) are opposite to that of maturation (from blue dot to orange dot), the given composition is accessible. Otherwise, the composition is not accessible and will change after cycles. b The accessible regions are marked by the gray area. The vector field is the flow of compositions during maturation. The length and color of the arrows indicate the speed of composition changes. The figures are drawn using mpltern package (7).

7. Derivation of equations

In this section, we go over the derivation of Eq. [7-27] for readers not equipped with advanced mathematics training.

Assumptions: μω, r

Equations [7] and [8]. Equation [7] is straightforwardly solved by integrating Eq. [5]. Equation [8] is obtained from Eq. [6] using integration factor e−(r+ω)t:

Integrating both sides, we get . Thus,

Equations [11] and [12]. Applying Eq. [4], we have

We collect the two violet-colored terms and change the order of summation. Note that the first violet-colored term does not change regardless of whether S starts from 0 or 1 because the term is zero for S = 0. Thus, the first violet-colored term is equivalent to . Let α = S − 1, and this becomes . We reassign α as S, and obtain:

We collect the two green terms, and similarly obtain:

Finally, we collect the two orange terms. For the first orange term, the sum is the same regardless of whether we start from S = 0 or −1. Let S start from −1, and we have ,then the term becomes . We reassign α as S, and additionally apply index change on F − 1:

Now, add the three parts together, and we have

which is Eq. [11]. Likewise,

which is Eq. [12].

Equation [13] and [15]. Using integration factor e−2(rμ)t and Eq. [11], we have:

Since μr,we have

where is the expected S2 at time 0. For Eq. [15],

Equation [16] and [17]. Since ,

We can solve this, again using the integration factor technique above:

Thus, we have

which results in

Equation [18]. From Eq. [12], we have

The right-hand side becomes

Note that we have checked that the second and third terms of can be ignored after we compare the full calculation with this simpler version. Integrate both sides:

Then, we have

Equation [19].

Equations [25]-[27]. To derive this equation, we use the fact that 1/(1 + x) ∼1 − x for small x. We will omit (t) for simplicity. Also note that we are considering relatively large populations so that the standard deviation is much smaller than the mean.

Recall that if A ∼ 𝒩 (μA, σA), B ∼ 𝒩 (μB, σB), then AB ∼ 𝒩 (μAμB, ). Thus, f is distributed as a Gaussian with the mean of

Note that the initial value of mean is equal to the mean of the binomial distribution Eq. [22], . The variance is