# Abstract

Microbial collectives, capable of functions beyond the reach of individual populations, can be enhanced through artificial selection. However, this process presents unique challenges. Here, we explore the ‘waterfall’ phenomenon, a metaphor describing how the success in achieving a desired genotype or species composition in microbial collectives can depend on both the target characteristics and initial conditions. We focus on collectives comprising fast-growing (F) and slow-growing (S) types, aiming to achieve specific S frequencies. Through simulations and analytical calculations, we show that intermediate target S frequencies might be elusive, akin to maintaining a raft’s position within a waterfall, rather than above or below it. This challenge arises because intra-collective selection, favoring F during growth, is the strongest at intermediate S frequencies, which can overpower counteracting inter-collective selection effects. Achieving low target S frequencies is consistently possible as expected, but high target S frequencies require an initially high S frequency — similar to a raft that can descend but not ascend a waterfall. The range of attainable target frequencies is significantly influenced by the initial population size of the collectives, while the number of collectives under selection plays a less critical role. In scenarios involving more than two types, the evolutionary trajectory must navigate entirely away from the metaphorical ‘waterfall drop.’ Our findings illustrate that the strength of intra-collective evolution is frequency-dependent, with implications in experimental planning.

**eLife assessment**

This **important** study of artificial selection in microbial communities shows that the possibility of selecting a desired fraction of slow and fast-growing types is impacted by their initial fractions. The evidence, which relies on mathematical analysis and simulations of a stochastic model, is **convincing**. It highlights the tension between selection at the strain and the community level. This study should be of interest to researchers interested in ecology, both theoretical and experimental.

# Introduction

**M**icrobial collectives can carry out functions that arise from interactions among member species. These functions, such as waste degradation (1, 2), probiotics (3), and vitamin production (4), can be useful for human health and biotechnology. To improve collective functions, one can perform artificial selection (directed evolution) on collectives: Low-density “Newborn” collectives are allowed to “mature” during which cells proliferate and possibly mutate, and community function develops. “Adult” collectives with high functions are then chosen to reproduce, each seeding multiple offspring Newborns. Artificial selection of collectives have been attempted both in experiments (5–19) and in simulations (20–30), often with unimpressive outcomes.

One of the major challenges in selecting collectives is to ensure the inheritance of a collective function (31, 32). Inheritance can be compromised when genotype and species compositions change from a parent collective to offspring collectives. During maturation of a collective, genotype compositions within each species can be altered by intra-collective evolution, while species compositions can be shifted by ecological interactions. Furthermore, during reproduction of a collective, genotype and species compositions of offspring can vary stochastically from those of the parent.

Here, we consider the selection of collectives of two types with different growth rates to achieve a taget composition in the Adult collective. Earlier work has demonstrated that nearly any target species composition can be achieved when selecting communities of two competing species with unequal growth rates (24, 33), so long as the shared resource is depleted during collective maturation (24). In this case, both species initially evolved to grow faster, and the slower-growing species was preserved due to stochastic fluctuations in species composition during collective reproduction. Eventually, both species evolved to grow sufficiently fast to deplete the shared resource during collective maturation, and evolution in competition coefficients then acted to stabilize species ratio to the target value (24).

Here, we mathematically examine the selection of target type compositions in two-type collectives when nutrient is always in excess. This allows us to analytically examine the evolutionary tipping point between intra-collective and inter-collective selection. We show that this tipping point creates a “waterfall” effect which restricts not only which target compositions are achievable, but also the initial composition required to achieve the target. We also investigate how the range of achievable target composition is affected by the population size in Newborns and the total number of collectives under selection. Finally, we show that the waterfall phenomenon extends to systems with more than 2 types.

# Results and Discussions

We start with selecting for a target composition in collectives of fast-growing F and slow-growing S types. We show that intermediate F frequencies or equivalently, intermediate S frequencies, are the hardest targets to achieve. We then show that similar conclusions hold when selecting for a target composition in collectives of more than two types. We mainly describe in terms of F frequency *f* = *F/*(*S* + *F*), which is related to S frequency *s* = 1 − *f* .

## A selection cycle

A selection cycle (Fig. 1) starts with a total of *g* Newborn collectives. We consider two genotypes - slow-growing type (S) and fast-growing type (F), with S mutating to F at a rate *μ*. At the beginning of cycle *k* (*t* = 0), a collective has a fixed total cell number where and denote the numbers of S and F cells in collective *i* (1 ≤ *i* ≤ *g*) at time *t* (0 ≤ *t* ≤ *τ*) of cycle *k*. The average F frequency among the *g* Newborn collectives is , such that the initial F cell number in each Newborn is drawn from the binomial distribution Binom .

Collectives are allowed to grow for time *τ* (‘Maturation’ in Fig. 1). During maturation, S and F grow at rates *r* and *r* + *ω* (*ω >* 0), respectively. If maturation time *τ* is too small, matured collectives (“Adults”) do not have enough cells to reproduce *g* Newborn collectives with *N*_{0} cells. On the other hand, if maturation time *τ* is too long, fast-growing F will take over. Hence we set the maturation time *τ* = ln(*g* + 1)*/r*, which guarantees sufficient cells to produce *g* Newborn collectives from a single Adult collective.

At the end of a cycle, the Adult with the highest function is chosen to reproduce *g* Newborn collectives each with *N*_{0} cells (‘Selection’ and ‘Reproduction’ in Fig. 1). This allows us to employ the simplest version of the extreme value theory, and in our case, choosing top 1 outperforms a less stringent “choosing top 5%” selection regime. Collective function is dictated by the F frequency *f*, and is maximized at the target value . Thus among all Adult collectives, the chosen Adult has F frequency *f* ^{∗} closest to the target value . For the next cycle, the number of F cells in Newborns follow a binomial distribution with the mean value *N*_{0}*f* ^{∗}. By repeating the selection cycle, we aim to achieve the target composition .

## The success of collective selection is constrained by the target composition, and sometimes also by the initial composition

Since intra-collective selection favors F, we expect that a higher target (a lower target *ŝ*) is easier to achieve. Note that once a target frequency is achieved, inter-collective selection is always required to maintain that frequency due to the fitness difference between the two types.

We fixed *N*_{0},the total population size of a Newborn, to 1000, and obtained selection dynamics for various initial and target F frequencies by implementing stochastic simulations (Sec. 1 in Supplementary Information). If the target value is high (e.g., Figure 2**a** magenta), selection is successful in the sense that the absolute error *d* between the target frequency and the selected frequency averaged among independent simulations ⟨*f* ^{∗}⟩ is smaller than 0.05 (i.e computed absolute errors are shown in Supplementary Information Fig. S3). F frequency of the chosen collective eventually converges to the target value and stays around it, regardless of the initial frequency. In contrast, without collective-level selection (e.g. choosing a random collective to reproduce), F frequency increases until F reaches fixation (Supplementary information Fig. S4**b**).

In contrast, an intermediate target frequency (e.g. Fig. 2**b** green) is never achievable. High initial F frequencies (e.g. 0.6 and 0.95) decline toward the target, but stabilize at an “higher-threshold” (*f* ^{H} ∼ 0.7, solid line segment in Fig. 2**a-c**) above the target. Low initial F frequencies (e.g. 0) increase toward the target, but then overshoot and stabilize at the *f* ^{H} value.

If the target frequency is low (e.g. Fig. 2**c** red line), artificial selection succeeds when the initial frequency is below a “lower-threshold” (*f* ^{L}, dashed line segment in Fig. 2**a-c**). Initial F frequencies above *f* ^{L} (e.g. 0.45 and 0.95) converge to *f* ^{H} instead. Initial F frequencies near *f* ^{L} display stochastic trajectories, converging to either *f* ^{H} or the target.

In summary, two regions of target frequencies are “accessible” in the sense that a target value in the region can be maintained once reached; gold in Fig. 2**d, e**): (1) target frequencies above and (2) target frequencies below and starting at an average frequency below .

## Intra-collective evolution is the fastest at intermediate F frequencies, creating the “waterfall” phenomenon

To understand what gives rise to the two accessible regions, we calculated how the distribution of F frequency *f* might change after one cycle. We consider the case where , the F frequency of the selected Adult at cycle *k*, is above the target value . This case is particularly challenging because intra-collective evolution favours fast-growing F and thus will further increase *f* away from the target. From , Newborns of cycle *k* + 1 will have *f* fluctuating around , and after they mature,in F frequency over one cycle, the minimum *f* is selected . If the selected composition at cycle *k* + 1 can be reduced compared to that of cycle *k* (i.e. ), the system can evolve to the lower target value.

Mathematically, the conditional probability distribution Ψ of (the F frequency of the selected Adult at cycle *k* + 1 being *f*) given the selected at cycle *k* is

Equation 1 can be broken down into two parts: (i) describes that any one of the *g* collectives must achieve *f*, where denotes the probability of an Adult (*t* = *τ*) in cycle *k* + 1 achieving *f* given . (ii) describes that the remaining *g* − 1 collectives must all achieve a frequency above *f* (and thus not selected). is calculated by integrating across *ζ* the term (the probability density of *f*_{k+1,0} of a Newborn in cycle *k* + 1 achieving *ζ* given ) multiplied by the term (the probability density of an Adult achieving *f*′ given Newborn’s frequency *ζ*, see Methods).

When we approximate the Adults’ *f* in cycle *k* + 1 as Guassian with mean and variance , we can obtain an analytical solution of the change in *f* over one cycle - the difference between the median value of the conditional probability distribution Ψ and the selected frequency of (Sec. 2 in Supplementary Information):

where

and

*e* is Euler’s number (2.71828…), and Φ^{−1}(…) is an inverse cumulative function of standard normal distribution. For those unfamiliar with Φ^{−1}(…), *y* = Φ^{−1}(*x*) (where 0 ≤ *x* ≤ 1) is the number such that the probability of a standard normal random variable being less than or equal to *y* is *x* (e.g. 0 = Φ^{−1}(0.5)). Eq. [2] contains all experimental parameters, including *g* (the total number of collectives under selection), *τ* (maturation time), *μ* (mutation rate), *r* (growth rate of the slow-growing S type), and *ω* (growth advantage of F over S). Equation [2] has the same shape as results from numerically integrating Eq. [1] and from stochastic simulations (Fig. 3**a**).

F frequency can be reduced if the selected is low or high, but not if it is intermediate between *f* ^{L} and *f* ^{H} (Fig. 3a). Increase in *f* during maturation is the most drastic when Newborn *f* is around 0.5 (Fig. 3**b** blue). This is intuitive: when Newborn *f* is low, the increase will be minor; when Newborn *f* is high, the fitness advantage of F over the population average is small and hence the increase is also minor. Thus, when Newborn F frequency is intermediate, inter-collective selection may not be able to counter intra-collective selection (Fig. 3**b**).

Not surprisingly, similar conclusions are derived where S and F are slow-growing and fast-growing species which cannot be converted through mutations (Sec. 3 and Fig. S7 in Supplementary Infomation).

Overall, inter-collective selection is akin to a raftman, rowing the raft to a target, while intra-collective selection is akin to a waterfall. This metaphore is best understood in terms of S frequency *s* = 1 − *f* . Then, the lower-threshold *f* ^{L} corresponds to higher-threshold in *s*^{H} = 1 − *f* ^{L}, while higher-threshold *f* ^{H} corresponds to lower-threshold in *s*^{L} = 1 − *f* ^{H} . Intra-collective selection is akin to a waterfall, driving the S frequency *s* from high to low (Fig. 2**f**). Intra-collective selection acts the strongest when *s* is intermediate (*s*^{L} *< s <* s^{H}), similar to the vertical drop of the fall. Intra-collective selection acts weakly at high (*> s*^{H}) or low (*< s*^{L}) *s*, similar to the gentle sloped upper and lower pools of the fall (regions 1 and 2 of Fig. 2**d, f** and Fig. 3**a**). Thus, an intermediate target frequency can be impossible to achieve - a raft starting from the upper pool will be flushed down to *s*^{L} (*f* ^{H}), while a raft starting from the lower pool cannot go beyond *s*^{L} (*f* ^{H}). In contrast, a low target S frequency (in the lower pool) is always achievable. Finally, a high target S frequency (in the upper pool) can only be achieved if starting from the upper pool (as the raft can not jump to the upper pool if starting from below).

## Manipulating experimental setups to expand the achievable target region

Variation among collectives is a key element for successful selection. Here, variation among collectives originates from the stochastic nature of growth and mutation during collective maturation, and of sampling effects during collective reproduction. From Eq. [4], the variance of *f* at time *τ* scales with . Indeed, as Newborn size *N*_{0} declines, the region of acheivable of target frequency expands (larger gold area in Fig. 4**a**). If the Newborn size *N*_{0} is smaller than ∼700, any target frequency can be reached when other parameters are fixed.

The number of collectives *g* also affects the selection outcomes. With more collectives, the chance to find a *f* closer to the target is more likely. Thus, a larger number of collectives broadens the region of success (Fig. 4b). However, the effect of *g* is not dramatic. To see why, we note that the only place that *g* appears is Eq. [2] in the form of . When *g* becomes large, Φ^{−1}(1*/g*) is asymptotically expressed as (Sec. 2 in Supplementary Information)(34). Thus, *φ*(*g*) does not change dramatically as *g* varies.

## The waterfall phenomenon in a higher dimension

In contrast to the above two-type case where the evolutionary trajectory travels along a one-dimensional line from the initial frequency to the final frequency, in a three-type case the trajectory travels in a two-dimensional plane. To examine the waterfall effect in a higher dimension, we investigate a three-type system where a faster-growing type (FF) grows faster than the fast-growing type (F) which grows faster than the slow-growing type (S) (Fig. 5**a** and Sec. 6 in Supplementary Information). In this system, a target type composition can be achieved if inter-collective selection reduces the frequencies of F as well as FF (accessible region, gold in Fig. 5**b**). Strikingly, a target composition in an accessible region may not be achievable even when the initial composition is within the same region: once the composition escapes the accessible region, the trajectory cannot return back to the accessible region (Fig. 5**b**iii, the leftmost initial condition). However, if the initial position is closer to the target in the accessible region, the target becomes achievable (Fig. 5**b**iii, initial condition near the bottom).

In conclusion, we have investigated the evolutionary trajectories of type composition in collectives under selection, which are governed by intra-collective selection (which favors fast-growing types) and inter-collective selection (which, in our case, strives to counter fast-growing types). Intra-collective selection has the strongest effect at intermediate frequencies of faster-growing types, potentially creating an inaccessible region analogous to the vertical drop of a waterfall. High and low target frequencies are both accessible, analogous to the lower- and the upper-pools of a waterfall, respectively. A less challenging target (low *ŝ*)is achievable from any initial position. In contrast, a more challenging target (high *ŝ*) is only achievable if the entire trajectory is contained within the region, similar to a raft striving to reach a point in the upper-pool must start at and remain in the upper pool. Our work suggests that the strength of intra-collective selection is not constant, and that strategically choosing an appropriate starting point can be essential for successful collective selection.

# Materials and Methods

## Stochastic simulations

A selection cycle is composed of three steps: maturation, selection, and reproduction. At the beginning of the cycle *k*, a collective *i* has slow-growing cells and fast-growing cells. At the first cycle, mean F frequency of collectives are set to be is sampled from the binomial distribution with mean . Then, S cells are in the collective *i*. In the maturation step, we calculate and by using stochastic simulation. We can simulate the division and mutation of each individual cell stochastically by using tau-leaping algorithm (36, 37)(see Fig. S4**a** and **b** in Supplementray Information). However, individual-based simulation requires a long computing time. Instead, we randomly sample and from the probability distribution . To obtain , we solve the master equation which describes a time evolution of probability distribution, under the random processes (see Sec. 1 in Supplementary Information). We assumed that follows the Gaussian distribution. The mean and variance of the distribution are numerically obtained from ordinary differential equations which is derived from the master equation (see Sec. 1 in Supplementary Information). We choose the collective with the closest frequency to the target . The selected collective generates *g* Newborns. The number of F cells is sampled from the binomial distribution with the mean of . We start a new cycle with those Newborn collectives. Then, the number of S cells in a collective *i* is

## Analytical approach to the conditional probability

The conditional probability distribution Ψ that we observe at a given is calculated by the following procedure. Given the selected collective in cycle *k* with , the collective-level reproduction proceeds by sampling *g* Newborn collectives with *N*_{0} cells in cycle *k* +1. Each Newborn collective contains certain F numbers at the beginning of the cycle *k* + 1, which can be mapped into with the constraint of *N*_{0} cells. If the number of cells in the selected collective is large enough, the joint conditional distribution function is well described by the product of *g* independent and identical Gaussian distribution 𝒩(*μ, σ*^{2}). So we consider the frequencies of *g* Newborn collectives as *g* identical copies of the Gaussian random variable *f*_{k+1,0}. The mean and variance of *f*_{k+1,0} are given by and . Then, the conditional probability distribution function of *f*_{k+1,0} being *ζ* is given by

After the reproduction step, the Newborn collectives grow for time *τ* . The frequency is changed from the given frequency *ζ* to *f* by division and mutation processes. We assume that the frequency *f* of an Adult is also approximated by a Gaussian random variable . The mean and variance are calculated by using means and variances of *S* and *F* (see Sec. 2 in Supplementary Information.) Since and also depend on *ζ*, the conditional probability distribution function of *f*_{k+1,τ} being *f* is given by

The conditional probability distribution of an Adult collective in cycle *k* + 1 (*f*_{k+1,τ}) to have frequency *f* at a given is calculated by producting two Gaussian distribution functions and integrating overall *ζ* values, which is given by

Since we select the minimum frequency among *g* identical copies of *f*_{k+1,τ}, the conditional probability distribution function of follows a minimum value distribution, which is given in Eq. [1]. Here, For the case of , the selected frequency is the minimum frequency So we have by replace to

We assume that the conditional probability distribution in Eq. [7] follows a normal distribution, whose mean and variances are describe by Eq. [3] and Eq. [4]. Then, the extreme value theory (38) estimates the median of the selected Adult by

The F frequency difference in Eq. [2] is obtained by substracting from Eq. [8].

# Data and source code availability

Data and source code of stochastic simulations are available in https://github.com/schwarzg/artificial_selection_collective_composition

# Acknowledgements

J. Lee and H.J. Park were supported by the National Research Foundation of Korea grant funded by the Korea government (MSIT), Grant No. RS-2023-00214071 (H.J.P.) and by an appointment to the JRG Program at the APCTP through the Science and Technology Promotion Fund and Lottery Fund of the Korean Government. W. Shou was supported by an Academy of Medical Sciences Professorship and a Royal Society Wolfson Fellowship. This was also supported by the Korean Local Governments– Gyeongsangbuk-do Province and Pohang City and INHA UNIVERSITY Research Grant. We thank Su-Chan Park, Li Xie, and Alex Yuan for constructive comments and discussions.

# Supplementary Information

## 1. Stochastic simulation for the evolution of collective function

In the main text, we design a simple model of artificial selection on collectives. The collectives are composed of two genotypes, slow-growing (S) and fast-growing (F) genotypes. We start with *g* ‘Newborn’ collectives and each collective *i* (*i* = 1, …, *g*) has S cells and F cells. The index *k* is a cycle number and *t* is time indices where 0 ≤ *t* ≤ *τ* . At the beginning of cycle *k* (*t* = 0), any collective has a fixed total cell number . The collectives are allowed to “mature” for *t* = *τ*, where S and F populations grow at rates of *r* and *r* + *ω*, respectively. S cells also mutate to F cells at a rate of *μ* per cell, and the back mutation from F to S is ignored. After maturation, we choose the collective with F frequency the closest to the target value , and the chosen collectiv e“reproduce” *g* newborn collectives each with *N*_{0} cells for the next cycle.

### Maturation

In this subsection, we ignore the cycle number index *k* and collective index (*i*) in convenience. So we denote *S*(*t*) and *F* (*t*) as representative notations of and , respectively. The initial S and F cell numbers in a Newborn collective are and . We consider mutation that is beneficial to grow (*ω >* 0).

We consider the population growth of *S*(*t*) and *F* (*t*) as a stochastic process. We use *P* (*S, F, t*) to denote the probability to have *S* S cells and *F* F at time *t* from the beginning of a cycle. Birth and mutation events stochastically happen which are represented by the three chemical reaction rules:

where *S* and *F* denote individual cells of a S type and a F type, respectively. The reaction rules can be simulated via tau-leaping algorithm (1, 2), but the individual-based simulation requires long simulation times to get enough results. Hence we go over probability distribution *P* (*S, F, t*) to sample compositions instead of tracking all events that occurred in individual collectives. First, we calculate the distribution *P* (*S, F, t*) analytically, and then sample numbers of S and F cells from this distribution at the end of cycle *t* = *τ* . Below, we describe the derivation of *P* (*S, F, t*).

The master equation of *P* (*S, F, t*), which follows reaction rules Eq. [1-3], is given by

Here, *dt* is small such that at most one cell birth or mutation event could occur. The first term describes the scenario where a single birth event of a S cell happens during time interval [*t, t* + *dt*), which changes the collective’s composition from (*S* −1, *F*) to (*S, F*). Similarly, the second term comes from a birth event of a F cell. The third term indicates the mutation event. The last term corresponds to the outflow of probability density by birth and mutation processes, which induces the changes in composition from (*S, F*) to others.

Calculating the exact form of *P* (*S, F, t*) is not simple. Instead, we consider the probability distributions of *S*(*t*) and *F* (*t*) at time *t*. We assume that the mutation rate is much smaller than the growth rates, and hence the correlation between *S* and *F* is sufficiently small. Then, *P* (*S, F, t*) ∼ *P* (*S, t*)*P* (*F, t*). The distributions of *S* and *F* can be approximated as Gaussian (𝒩), which is supported by the Central Limit Theorem and Figure S1 (note the concordance between the blue and green distributions). That is, the probability distribution .

The means are defined by and The differential equations for means are obtained by applying the definition to the master equation in Eq. [4], as

We assume that the mutation rate *μ* is very smaller than *r* and *ω*. By solving Eq. [5] and Eq. [6], the means and are given by

where and are the mean number of S and F cells at the beginning of cycle *t* = 0. Note that the second term of Eq. [8] is consistent with the studies on the distribution of the number of F cells starting from *S*_{0} S cells (3).

We define the second momenta of *S* and *F* as

Then, corresponding differential equations are given by

The solution of Eq. [11] is

where is the second moment of initial values . Thus, the variance is

where is a variance of S cell numbers at the beginning of cycle *t* = 0. In Eq. [12], we require to calculate . The master Eq. [4] provides a differential equation for as

The solution of Eq. [16] is given by

By using Eq. [17], the solution of Eq. [12] is given by

where is the second moment of initial values . Thus, the variance is given, up to the order of *μ*, by

Using Eq. [7,8,15,19], we construct Gaussian distribution functions for *S*(*t*) and *F* (*t*), and . Then, numbers of S cells and F cells are randomly sampled from each distribution: *S* (*τ*) from and *F*( *τ*) from . This process requires only two sampling steps for each collective.

### Selection

After the maturation step is finished in cycle *k*, we compute the frequencies of the F cells in each collectives where we denote the frequency of collective *i* at time *t* = *τ* in cycle *k* as . We select one frequency which is the closest value to the target frequency . The frequency selected at the cycle *k* is denoted by . In mathematical expression, the selected frequency is

### Reproduction

We divide the selected collective into *g* Newborn collectives. We sample F cells from the selected collective for the Newborns in the next cycle *k* + 1. The exact description of reproduction step is described by consecutive sampling without replacement from the selected collective. Then the joint probability distribution follows a multivariate hypergeometric distribution. The selected collective is assumed to have cells and F cells. If the size of selected collective is large enough, the consecutive sampling is approximated to the independent binomial sampling (see Fig. S2). Thus, we independently sample *g* numbers of F cells, , from the binomial distribution. The probability mass function of is given by

After the sampling, the numbers of S cells are set to be for each collective. These Newborn collectives start the maturation in the next cycle.

### Simulation Result

In the main text, we simulate artificial selection by using stochastic simulation. In Fig. S3, we draw the absolute error *d* between the target frequency and at the end of simulations (1000 cycles). In the colormap, errors higher than 0.15 are marked with gray, which indicates the failure of selection. The dashed lines indicate the same boundary in Fig 2**d** in the main text.

Figure S4 presents the composition trajectories of all collectives using tau-leaping algorithm. The selected collectives have the closest compotisition to the target composition at the end of each cycle. The selected collective can have smaller frequency than its parent collective, so the composition can be lowered after cycles.

## 2. Conditional probability distribution of the selected collective frequency *f* ^{*}

Since maturation and reproduction are stochastic processes, the frequency of the selected collective at cycle *k* + 1 starting from is also described by the probability distribution. In this case, depends on , and the distribution of follows the conditional probability distribution . Below, we describe calculation of the conditional probability distribution .

We focus on the situation where the change of frequency in the selection step has the opposite direction of the maturation step. That means the target frequency is smaller than , and the collective with the smallest frequency will be selected among *g* Newborn collectives. We can calculate the probability distribution of the Newborn collectives’ frequencies at the end of cycle *k* + 1 by using Eq. [7,8,15,19] and Eq. [21]. Then, we obtain a distribution of the minimum value among *g* copies of the collectives. The minimum value distribution is the conditional probability distribution .

Let us start the calculation from the reproduction step at the end of cycle *k*. The probability distribution of the F cell numbers in Newborn collectives is given in Eq. [21]. If the total number of cells in a Newborn collective *N*_{0} is large enough, Eq. [21] is approximated by the Gaussian distribution . Then, the probability density function that to be *ζ* in Newborn collective *i* is

The Newborn collective *i* maturates in cycle *k* + 1 with the initial cell numbers and . Since all Newborn collectives are independent and identically distributed, we can drop the superscript (*i*) for simplicity. We can express *S*(*t*) and *F* (*t*) based on Eq. [7,8,15,19] as

where *G*_{S} (*t*) and *G*_{F} (*t*) are random variable following the standard distribution 𝒩 (0, 1). Here we ignore cycle index *k* + 1 in subscription for convenience. Then, we can approximately write *f* (*t*) as

The mean of *f* is given by

and the variance is

The average size of collectives is . In terms of a probability distribution, a random variable that a Adult’s F frequency *f* (*τ*) = *f*_{τ} follows the Gaussian distribution whose the probability density function is given by

With indices of cycle *k* and *k* + 1, the conditional probability that a offspring collective has frequency *f* after the maturation in cycle *k* + 1 from the parent frequency is

After maturation in cycle *k* + 1, the collective with the smallest frequency is selected among *g* Adult collectives. The probability distribution function of is obtained by the extreme value statistics (4). The cumulative distribution function of the minimum value is given by

The above Eq. [30] simply states that if every single of *g* collectives at the end of maturation (time *τ*) in cycle *k* + 1 have frequencies greater than or equal to *f* . Frequencies are independent and identically distributed, and thus . Note that Prob , and Eq. [30] becomes

Then, the probability density function is obtained by differentiating Eq. [31] with respect to *f* and replacing ,

We compute the probability density function [32] by using numerical integration and compare it with the stochastic simulation results in Fig. S5. Both distributions agree with each other.

To get the analytic approximation of the median of Eq. [32], we assume that the F frequency distribution of Adult collective is Gaussian form. Thus, instead of calculating marginal probability with *ζ* variable as in Eq. [29], we only calculate the mean and variance , from Newborn’s frequency distribution of :

which are Eq. [3] and Eq.[4] in the main text, respectively.

We note that is proportional to . The variances in Eq. [15] and [19] are linearly scale with *N*_{0} by using and from Eq. [22]. The mean collective size is also proportional to *N*_{0} because the average cell numbers in Eq. [7] and [8] are linear to *N*_{0}. Thus, the scaling relation of Eq. [27] is given .

The median (Median) of Eq. [32] satisfies , which means . If we assume the distribution Eq. [32] is the Gaussian form, the inverse function is writen as

where Φ^{−1}(*y*) is an inverse cumulative density function (CDF) of the normal distribution with the mean in Eq.[34] and the standard deviation *σ*_{f} (*τ*), a square root of Eq.[34]. Substracting from Eq. [35] gives Eq. [2] in the main text.

Further, we get an asymptotic expression of Φ^{−1}(ln 2*/g*) when *g* is large (or Φ^{−1}(*y*) with small *y*). We start from the CDF of the standard normal distribution, where the function erfc(*x*) is an complementary error function. To get the expression of Φ^{−1}, we need an asymptotic expression of inverse of *y* = erfc(x) function (*x* = erfc^{−1}(*y*)) as the inverse . Here, we introduce a method in (5). The known asymptotic expansion of *y* = erfc(*x*) for large *x* is erfc . By taking logarithm in both side, we have

Replacing *x*^{2} in the righthand side in Eq. [36] into the expression itself, we get continued logarithmic form of

Inserting = erfc ^{1}(y) (square root of Eq. [37]) into the inverse CDF , we have . So, the asymptotic expression of Φ^{−1}(ln 2*/g*) is given by

## 3. Selection without mutation

When the mutation rate is zero, two genotypes behave as two distinct species. So the S corresponds to the slower-growing species and the F corresponds to the faster-growing species. Then, our problem is generalized to the selection of two-species collective. The compositional change is provided by Eq. [35] with setting *μ* = 0. Corresponding and become

Equations [39] and [40] suggest that when a community consists of two competing species, we obtain similar conclusions on the accessible region for target composition. The stochastic simulation results are present in Fig. S7.

## 4. Selecting more than one collective

In the main text, we choose one collective which have the closest frequency value to the target among *g* collectives. Such ‘Top 1’ stratedgy has an advantage on applying extreme value theory while it has potentially lose the ‘unlucky’ Adults (6). Such ‘unlucky’ Adults are further to the target value than the selected one, but have possiblity to be selected in next cycle. The ‘Top tier’ stratedgy may secure the ‘unlucky’ Adult by choosing more than one collective (6). So we test the ‘top-tier’ stratedgy by choosing 5 among 100 Adults with the distance from the target value in Fig. S8. The top-tier stratedgy is shown to be inefficient in the two-genotype system. This is because the system is too simple, so lower ranked Adult is lower again in next cycle.

## 5. Deleterious mutation

In the main text, we show the target composition can be achieved in some ranges of initial and target values when the mutation is beneficial to growth. The same analogy can be applied even when the mutation is deleterious. Since the F cells grow slower than the S cells (*ω <* 0), the F frequency naturally decreases in the maturation step. Then, the conditional probability distribution is a maximum value distribution of Eq. [29]. Thus, instead of Eq. [30], we look for the cumulative distribution function of the maximum value such that

If all frequencies are independent and identical distributed random variables, the cumulative distribution function becomes

Likewise in the previous section, we get the conditional probability density function by differentiating Eq. [42] with respect to *f* and replacing as

The distribution in Eq. [43] is evaluated for various in Fig. S9**a** with numerical simulations and the median values of distributions are presented in Fig. S9**b**. In the case of *ω* = − 0.03, the target frequency lower than around 0.3 and larger than around 0.7 can be selected. Since the sign of *ω* is opposite to the result in the main text, the diagram is reversed from Fig. 2**d** in the main text.

## 6. Extension to the system including double-mutation

In the main text, we categorize the target frequencies based on whether the frequency change in the maturation step can be compensated in the selection step. Here, we extend the idea into more complex systems in order to suggest a possible generalization. We assume that collectives consist of three genotypes with slow-growing(S), fast-growing(F), and faster-growing(FF) types. The growth rate of S is *r*. Each mutation adds the growth rate with *ω*. So the F and FF type have growth rates with *r* + *ω* and *r* + 2*ω*, respectively. The mutation rate is *μ*. So, the birth and mutation events are written by the chemical reactions:

Here, *FF* is an individual of a F type. We write a master equation of the processes for *P* (*S, F, FF, t*) which is the probability to have *S* S cells, *F* F cells, and *FF* FF cells at time *t*.

The composition of collective *i* in cycle *k* is now represented with two frequencies where the F frequency and the FF frequency . Then, the target composition is set to be . The composition of the selected Adult in cycle *k* is . We apply the processes used in the above section 2 to obtain the conditional probability by using the master Eq. [49].

At the reproduction step in cycle *k*, we choose *N*_{0} cells from the selected Adult whose composition is . Then, Newborn collectives are independently sampled from a multinomial distribution. For convenience, we drop the collective index (*i*). Then, the conditional probability mass function to choosing *F*_{k,0}, *FF*_{k,0} cells are represented by

where the number of S *S*_{k+1,0} is automatically set to be *S*_{k+1,0} = *N*_{0} − *F*_{k+1,0} − *FF*_{k+1,0}. Then, the approximated multivariate normal distribution is where the mean distribution is and covariance matrix is **M**_{k+1}. The diagonal terms of **M**_{k+1} are variances and the offdiagonal terms are covariances . The matrix is given by

Then a Newborn’s composition ** ρ**(

*ζ, η*) follows the multivariate Gaussian distribution whose the joint probability distribution function is given by

At the beginning of cycle *k*, a newborn collective starts to maturate from (*S*_{0}, *F*_{0}, *FF*_{0}) cells (for convenience, cycle index *k* is dropped.) In terms of (*ζ, η*), each initial numbers are *S*_{0} = *N*_{0}(1 − *ζ*− *η*), *F*_{0} = *N*_{0}*ζ*, and *FF*_{0} = *N*_{0}*η*. Their initial covariance matrix is . By using Eq. [49], we can write ordinary differential equations up to the second moment.

The initial conditions of the system in coupled Eqs. [53-61] are obtained by the mean and (co)variances of Eq. [50]. By solving equations numerically, we obtain a set of mean cell numbers and a set of variances as well as covariances (*σ*_{SF}, *σ*_{FFF}, *σ*_{SFF}). We assume that the covariances are smaller than the variances. We consider *S, F*, and *FF* as Gaussian random variables

Then, the F frequency becomes

where and . Similarly, the FF frequency is

where and . If the covariances are small enough, we can approximate the joint probability distribution of Adult’s composition (*f*_{τ}, *h*_{τ}) = **p**_{τ} as

With cycle index *k*, we get conditional probability of matured collectives by

We select the Adult collective among *g* Adult collectives such that the change in frequencies during maturation could be compensate. During maturation, a freuquency distribution moves different direction in (*f, h*) space depending on the initial composition . So, we take different directions to obtain the extreme value distributions. Considering only the sign of the frequency changes in *f* and *h*, we take either maximum or minimum. The mean change in *h* is always positive in the whole (*f, h*) space since is always positive in Eq. [55]. Thus, we choose the minimum value in every selection step.

If the mean is larger than , the minimum value among will be chosen in selection step to compensate the frequency change in maturation step. Let us denote the selected valued of *f* and *h* and . We temporally drop time index *τ* for simplicity. Then, the joint cumulative distribution function is

The probability can be converted as

where is a conditional joint cumulative distribution function of (*f* ^{(i)}, *h*^{(i)}). The marginal cumulative distribution functions are

Similarly, the probabilities and are converted into and . Thus, the joint cumulative distribution function is

Then, the conditional probability of the selected collective is given by

where and .

If the mean is smaller than , the chosen collective is likely to have maximum *f* values among *g* matured collectives. Then, the definition of *f* ^{*} is written by . We rewrite the joint cumulative distribution function to be little different from Eq. [69] because now we have to utilize the condition *f* ^{*} *< f* instead of *f* ^{*} *> f*,

The probability is converted as

Thus, the joint cumulative distribution function is given by

In this case, the conditional probability distribution function is given by

By replacing (*f, h*) to , we finally obtain the conditional probability distribution ,

Using Eq. [79], we get the mean values of *f* ^{*} and *h*^{*} as

We define the accessible region in frequency space where the signs of the changes in both F frequency and FF frequency after a cycle are opposite to that of maturation (see Fig. S10),

where and are the mean values after the maturation step in cycle *k* + 1. Figure S10**a** shows a concise explanation of the classification of regions where the initial and target compositions satisfy the above conditions in Eq. [82]. Or, if the condition is not met, the composition of the selected collective may diverge from the target composition after several cycles. The accessible regions are marked in the gray-colored area in Fig. S10**b**. Similar to the case of the two-genotype population, the accessible region is shaped by the flow velocity of the composition in the maturation step, as depicted in the flow diagram in Fig. S10**b**. Both F and FF frequencies tend to increase, and the inter-collective selection can compensate for these changes if the composition changes slowly when the F and FF frequencies are small. However, if the changes occur too rapidly when the FF frequency is intermediate, the frequency cannot be stabilized. So the accessible region is limited to the regions where the composition changes slowly.

## 7. Derivation of equations

In this section, we go over the derivation of Eq. [7-27] for readers not equipped with advanced mathematics training.

Assumptions: *μ* ≪ *ω, r*

**Equations [7] and [8]**. Equation [7] is straightforwardly solved by integrating Eq. [5]. Equation [8] is obtained from Eq. [6] using integration factor *e*^{−(r+ω)t}:

Integrating both sides, we get . Thus,

**Equations [11] and [12]**. Applying Eq. [4], we have

We collect the two violet-colored terms and change the order of summation. Note that the first violet-colored term does not change regardless of whether *S* starts from 0 or 1 because the term is zero for *S* = 0. Thus, the first violet-colored term is equivalent to . Let *α* = *S* − 1, and this becomes . We reassign *α* as *S*, and obtain:

We collect the two green terms, and similarly obtain:

Finally, we collect the two orange terms. For the first orange term, the sum is the same regardless of whether we start from *S* = 0 or −1. Let *S* start from −1, and we have ,then the term becomes . We reassign *α* as *S*, and additionally apply index change on *F* − 1:

Now, add the three parts together, and we have

which is Eq. [11]. Likewise,

which is Eq. [12].

**Equation [13] and [15]**. Using integration factor *e*^{−2(r−μ)t} and Eq. [11], we have:

Since *μ* ≪ *r*,we have

where is the expected *S*^{2} at time 0. For Eq. [15],

**Equation [16] and [17]**. Since ,

We can solve this, again using the integration factor technique above:

Thus, we have

which results in

**Equation [18]**. From Eq. [12], we have

The right-hand side becomes

Note that we have checked that the second and third terms of can be ignored after we compare the full calculation with this simpler version. Integrate both sides:

Then, we have

**Equations [25]-[27]**. To derive this equation, we use the fact that 1*/*(1 + *x*) ∼1 − *x* for small *x*. We will omit (*t*) for simplicity. Also note that we are considering relatively large populations so that the standard deviation is much smaller than the mean.

Recall that if *A* ∼ 𝒩 (*μ*_{A}, *σ*_{A}), *B* ∼ 𝒩 (*μ*_{B}, *σ*_{B}), then *A* − *B* ∼ 𝒩 (*μ*_{A} − *μ*_{B}, ). Thus, *f* is distributed as a Gaussian with the mean of

Note that the initial value of mean is equal to the mean of the binomial distribution Eq. [22], . The variance is

# References

- 1.Fast and facile biodegradation of polystyrene by the gut microbial flora of Plesiophthalmus davidis larvae
*Appl. Environ. Microbiol***86**:e01361–20 - 2.Corrigendum: Insights into plastic biodegradation: community composition and functional capabilities of the superworm (Zophobas morio) microbiome in styrofoam feeding trials
*Microb. Genomics***8** - 3.Synthetic biology approaches to engineer probiotics and members of the human microbiota for biomedical applications
*Annu. review biomedical engineering***20**:277–300 - 4.Reorganization of a synthetic microbial consortium for one-step vitamin c fermentation
*Microb. Cell Factories***15** - 5.Experimental studies of community evolution i: The response to selection at the community level
*Evolution***44**:1614–1624 - 6.Experimental studies of community evolution ii: The ecological basis of the response to community selection
*Evolution***44**:1625–1636 - 7.Artificial ecosystem selection
*Proc. Natl. Acad. Sci***97**:9110–9114 - 8.Artificial selection of microbial ecosystems for 3-chloroaniline biodegradation
*Environ. Microbiol***2**:564–571 - 9.Selection on soil microbiomes reveals reproducible impacts on plant function
*The ISME journal***9**:980–989 - 10.Levels and limits in artificial selection of communities
*Ecol. Lett***18**:1040–1048 - 11.Cultivated sub-populations of soil microbiomes retain early flowering plant trait
*Microb. Ecol***73**:394–403 - 12.Understanding microbial community dynamics to improve optimal microbiome selection
*Microbiome***7**:1–14 - 13.Host-mediated microbiome engineering (hmme) of drought tolerance in the wheat rhizosphere
*PLoS One***14** - 14.Effect of the reproduction method in an artificial selection experiment at the community level
*Front. Ecol. Evol***7** - 15.Artificially selecting bacterial communities using propagule strategies
*Evolution***74**:2392–2403 - 16.Effects of microbial evolution dominate those of experimental host-mediated indirect selection
*PeerJ***8** - 17.Artificial selection on microbiomes to breed microbiomes that confer salt tolerance to plants
*mSystems***6**:e01125–21 - 18.Community diversity determines the evolution of synthetic bacterial communities under artificial selection
*Evolution***76**:1883–1895 - 19.Artificial selection of stable rhizosphere microbiota leads to heritable plant phenotype changes
*Ecol. Lett***25**:189–201 - 20.Modelling artificial ecosystem selection: A preliminary investigation
*Advances in Artificial Life*:659–666 - 21.The role of non-genetic change in the heritability, variation, and response to selection of artificially selected ecosystems
*Artificial Life IX: Proceedings of the Ninth International Conference on the Simulation and Synthesis of Artificial Life* - 22.Artificial selection of simulated microbial ecosystems
*Proc. Natl. Acad. Sci***104** - 23.Simulations reveal challenges to artificial community selection and possible strategies for success
*PLoS Biol***17** - 24.Eco-evolutionary dynamics of nested darwinian populations and the emergence of community-level heredity
*Elife***9** - 25.Steering ecological-evolutionary dynamics to improve artificial selection of microbial communities
*Nat. Commun***12** - 26.Engineering complex communities by directed evolution
*Nat. Ecol. & Evol***5** - 27.Artificial selection of communities drives the emergence of structured interactions
*J. Theor. Biol***571** - 28.Artificial selection methods from evolutionary computing show promise for directed evolution of microbes
*Elife***11** - 29.Partner-assisted artificial selection of a secondary function for efficient bioremediation
*Iscience***26** - 30.Novel artificial selection method improves function of simulated microbial communities
*bioRxiv [preprint]*https://doi.org/10.1101/2023.01.08.523165 - 31.A quantitative genetics framework for understanding the selection response of microbial communities
*bioRxiv [preprint]*https://doi.org/10.1101/2023.10.24.563725 - 32.Artificial selection of microbial communities: what have we learnt and how can we improve?
*Curr. Opin. Microbiol***77** - 33.Major evolutionary transitions in individuality between humans and ai
*Philos. Transactions Royal Soc. B***378** - 34.The function inverfc theta
*Aust. J. Phys***13** - 35.mpltern 0.3.0: ternary plots as projections of Matplotlib
- 36.Approximate accelerated stochastic simulation of chemically reacting systems
*The J. Chem. Phys***115**:1716–1733 - 37.Efficient step size selection for the tau-leaping simulation method
*The J. Chem. Phys***124** - 38.Statistics of extremes
- 1.Approximate accelerated stochastic simulation of chemically reacting systems
*The J. Chem. Phys***115**:1716–1733 - 2.Efficient step size selection for the tau-leaping simulation method
*The J. Chem. Phys***124** - 3.Progress of a half century in the study of the luria–delbrück distribution
*Math. Biosci***162**:1–32 - 4.Statistics of extremes
- 5.The function inverfc theta
*Aust. J. Phys***13** - 6.Simulations reveal challenges to artificial community selection and possible strategies for success
*PLoS Biol***17** - 7.mpltern 0.3.0: ternary plots as projections of Matplotlib

# Article and author information

## Version history

- Preprint posted:
- Sent for peer review:
- Reviewed Preprint version 1:

## Copyright

© 2024, Lee et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

# Metrics

- views
- 51
- downloads
- 0
- citations
- 0

Views, downloads and citations are aggregated across all versions of this paper published by eLife.