Abstract
Collectives, such as microbial communities, can perform functions beyond the capability of individual members. Enhancing these collective functions through artificial selection, however, presents significant challenges. Here, we explore the ‘rafting-a-waterfall’ phenomenon, a metaphor illustrating how the success in achieving a target population composition in microbial collectives depends on both the target characteristics and initial conditions. Specifically, collectives comprising fast-growing (F) and slow-growing (S) individuals were grown for a period of “maturation” time, and the collective with S-frequency closest to the target value is chosen to “reproduce” (inoculate) offspring collectives. Such collective selection is dictated by two opposing forces: during collective maturation, intra-collective selection acts like a waterfall, relentlessly driving the S-frequency to lower values, while during collective reproduction, inter-collective selection resembles a rafter striving to reach the target frequency. Due to this model structure, maintaining a target frequency requires the continued action of inter-collective selection. Using simulations and analytical calculations, we show that intermediate target S frequencies are the most challenging, akin to a target within the vertical drop of a waterfall, rather than above or below it. This arises because intra-collective selection is the strongest at intermediate S-frequencies, which can overpower inter-collective selection. While achieving low target S frequencies is consistently feasible, attaining high target S-frequencies requires an initially high S-frequency — much like a raft that can descend but not ascend a waterfall. The range of attainable target frequencies depends on the initial population size of the collectives: as the population size in Newborn collectives increases, the region of achievable target frequency is reduced until no frequency is achievable. In contrast, the number of collectives under selection plays a less critical role. In scenarios involving more than two populations, the evolutionary trajectory must navigate entirely away from the metaphorical ‘waterfall drop.’ Our findings illustrate that the strength of intra-collective evolution is frequency-dependent, with implications in experimental planning.
Introduction
Microbial collectives can carry out functions that arise from interactions among member species. These functions, such as waste degradation [1, 2], probiotics [3], and vitamin production [4], can be useful for human health and biotechnology. To improve collective functions, one can perform artificial selection (directed evolution) on collectives: Low-density “Newborn” collectives are allowed to “mature” during which cells proliferate and possibly mutate, and community function develops. “Adult” collectives with high functions are then chosen to reproduce, each seeding multiple offspring Newborns. Artificial selection of collectives have been attempted both in experiments [5–19] and in simulations [20–31], often with unimpressive outcomes.
One of the major challenges in selecting collectives is to ensure the inheritance of a collective function [32, 33]. Inheritance from a parent collective to offspring collectives can be compromised by changes in genotype and species compositions. During maturation of a collective, genotype compositions within each species can change due to intra-collective selection favoring fast-growing individuals, while species compositions can change due to ecological interactions. Furthermore, during reproduction of a collective, genotype and species compositions of offspring can vary stochastically from those of the parent.
Here, we consider the selection of collectives comprising two or three populations with different growth rates, and our goal is to achieve a target composition in the Adult collective. This is a common quest: whenever a collective function depends on both populations, collective function is maximised, by definition, at an intermediate frequency (e.g. too little of either population will hamper function [23]). Earlier work has demonstrated that nearly any target species composition can be achieved when selecting communities of two competing species with unequal growth rates [24, 34], so long as the shared resource is depleted during collective maturation [24]. In this case initially, both species evolved to grow faster, and the slower-growing species was preserved due to stochastic fluctuations in species composition during collective reproduction. Eventually, both species evolved to grow sufficiently fast to deplete the shared resource during collective maturation, and evolution in competition coefficients then acted to stabilize species ratio to the target value [24]. Regardless, earlier studies are often limited to numerical explorations, with prohibitive costs for a full characterization of the parameter space for such nested populations (population of collectives, and populations of mutants within a collective).
We mathematically examine the selection of composition in collectives consisting of populations growing at different rates. A selection cycle consists of three stages (Fig. 1). During collective maturation, intra-collective selection favors fast-growing individuals within a collective. At the end of maturation, inter-collective selection acts on collectives and favors those achieving the target composition. Finally during collective reproduction, offspring collectives sample stochastically from the parents, a process dominated by genetic drift. We made simplifying assumptions so that we can analytically examine the evolutionary tipping point between intra-collective and inter-collective selection. We show that this tipping point creates a “waterfall” effect which restricts not only which target compositions are achievable, but also the initial composition required to achieve the target. We also investigate how the range of achievable target composition is affected by the total population size in Newborns and the total number of collectives under selection. Finally, we show that the waterfall phenomenon extends to systems with more than 2 populations.

Schematic for artificial selection on collectives.
Each selection cycle begins with a total of g Newborn collectives, each with N0 total cells of slow-growing S population (light gray dots) and fast-growing F population (dark gray dots). During maturation (over time τ), S and F cells divide at rates rS and rS + ω (ω > 0), respectively, and S mutates to F at rate µ. In the selection stage, the Adult collective with F frequency f closest to the target composition
Results and discussions
To enable the derivation of an analytical expression, we have made the following simplifications. First, growth is always exponential, without complications such as resource limitation, ecological interactions between the two populations, or density-dependent growth. Thus, the exponential growth equation can be used. Second, we consider only two populations (genotypes or species): the fast-growing F population with size F and the slow-growing S population with size S. We do not consider a spectrum of mutants or species, since with more than two populations, an analytical solution becomes very difficult. Finally, the single top-functioning community is chosen to reproduce, which allows us to employ the simplest version of the extreme value theory (see section below for further justification.)
Our goal is to select for collective composition in terms of F frequency f = F/(S + F), or equivalently, S frequency s = 1 − f. More precisely, we want collectives such that after maturation time τ, f (τ) is as close to the target value
We will start with a complete model where S mutates to F at a nonzero mutation rate µ. We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations. This scenario is encountered in biotechnology: an engineered pathway will slow down growth, and breaking the pathway (and thus faster growth) is much easier than the other way around. When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates. We show that intermediate F frequencies or equivalently, intermediate S frequencies, are the hardest targets to achieve. We then show that similar conclusions hold when selecting for a target composition in collectives of more than two populations.
Model structure
A selection cycle (Fig. 1) starts with a total of g Newborn collectives. At the beginning of cycle k (t = 0), each Newborn collective has a fixed total cell number
Collectives are allowed to grow for time τ (‘Maturation’ in Fig. 1). During maturation, S and F grow at rates rS and rS + ω (ω > 0), respectively. If maturation time τ is too small, a matured collective (“Adult”) does not have enough cells to reproduce g Newborn collectives with N0 cells. On the other hand, if maturation time τ is too long, fast-growing F will take over. Hence we set the maturation time τ = ln(g +1)/rS, which guarantees sufficient cells to produce g Newborn collectives from a single Adult collective. At the end of a cycle, a single Adult with the highest function (with F frequency f closest to the target frequency
Collective function is dictated by the Adult’s F frequency f. Among all Adult collectives, the selected Adult is the one whose F frequency is closest to the target value,
The selected Adult, with F frequency denoted as f *, is then used to reproduce g offspring collectives, each with N0 total cells. The number of F cells in a Newborn follows a binomial distribution B(N0, f *). By repeating the selection cycle, we aim to achieve and maintain the target composition

Nomenclature
Overall our model considers mutational stochasticity, as well as demographic stochasticity in terms of stochastic birth and stochastic sampling of a parent collective by offspring collectives. Other types of stochasticity, such as environmental stochasticity and measurement noise, are not considered and require future research.
The success of collective selection is constrained by the target composition, and sometimes also by the initial composition
Since intra-collective selection favors F, we expect that a higher target
We fixed N0, the total population size of a Newborn, to 1000, and obtained selection dynamics for various initial and target F frequencies by implementing stochastic simulations (Sec. I in Supplementary Information). If the target

Initial and target compositions determine the success of artificial selection on collectives.
(a-c) F frequency of the selected Adult collective (f *) over cycles at different target

Intra-collective selection and inter-collective selection jointly set the boundaries for selection success.
a The change in F frequency over one cycle. When

Expanding the success region for artificial collective selection.
a Reducing the population size in Newborn N0 expands the region of success. In gold area, the probability that
In contrast, an intermediate target frequency (e.g.
If the target frequency is low (e.g.
To achieve target
In summary, two regions of target frequencies are “accessible” in the sense that a target value in the region can be reached (gold in Fig. 2e, f): (1) target frequencies above
Intra-collective evolution is the fastest at intermediate F frequencies, creating the “waterfall” phenomenon
To understand what gives rise to the two accessible regions, we calculated Δf, the selection progress in F frequency over two consecutive cycles (Box 1, Eq. (2)). The solution (Fig. 3a, green) has the same shape as results from numerically integrating Eq. (1) (Fig. 3a, orange) and from stochastic simulations (Fig. 3a, blue).
If Δf is negative, then inter-collective selection will succeed in countering intra-collective selection and reducing f toward the target. Δf is negative if the selected
Thus, inter-collective selection is akin to a raftman rowing the raft to a target, while intra-collective selection is akin to a waterfall. This metaphor is best understood in terms of S frequency s = 1 − f. The lower-threshold f L corresponds to higher-threshold in sH = 1 − f L, while higher-threshold f H corresponds to lower-threshold in sL = 1 − f H. Intra-collective selection is akin to a waterfall, driving the S frequency s from high to low (Fig. 2g). Intra-collective selection acts the strongest when s is intermediate (sL < s < sH), similar to the vertical drop of the fall. Intracollective selection acts weakly at high (> sH) or low (< sL) s, similar to the gentle sloped upper and lower pools of the fall (regions 1 and 2 of Fig. 2e, g). Thus, an intermediate target frequency can be impossible to achieve − a raft starting from the upper pool will be flushed down to sL (f H), while a raft starting from the lower pool cannot go beyond sL (f H). In contrast, a low target S frequency (in the lower pool) is always achievable. Finally, a high target S frequency (in the upper pool) can only be achieved if starting from the upper pool (as the raft can not jump to the upper pool if starting from below).
Box 1:
Changes in the distribution of F frequency f after one cycle
We consider the case where
To find
There are two points where the median values are the same as
Following the extreme value theory, the conditional probability density function
Equation (1) can be described as the product between two terms related to probability: (i)
Since computing the exact formula of Adults’ f distribution in cycle k + 1 is hard, we approximate it as Gaussian with mean
To obtain an analytical solution of the change in f over one cycle, we first assume that in a Newborn collective, the number of S cells is distributed as Gaussian with mean
Selection progress - the difference between the median value of the conditional probability distribution
where Φ−1(…) is the inverse cumulative function of standard normal distribution (see main text for an example). We chose median because compared to mean, it is easier to get analytical expression since Φ−1(…) is known in a closed form. Regardless, using median generated results similar to simulations (Fig. 7 in Supplementary Information). As expected, selection progress Δf is governed by both the mean
When
and
In the limit of small
Manipulating experimental setups to expand the achievable target region
In Eq. (2) (Box 1), selection progress Δf depends on the total number of collectives under selection (g). Δf also depends on the mean and the standard deviation of Adult F frequency
Our goal is to make Δf as negative as possible so that any increase in f during collective maturation may be reduced. From Eq. (2) (Box 1), a small f (τ) will facilitate collective-level selection. Additionally, a large σf (τ) will also facilitate collective-level selection due to negative
From Eq. (4) (Box 1), σf (τ) will be large if Newborn size N0 is small. Indeed, as Newborn size N0 declines, the region of achievable target frequency expands (larger gold area in Fig. 4a). If the Newborn size N0 is sufficiently small (e.g. ≤ 700 in our parameter regime), any target frequency can be reached. An analytical approximation of the maximal Newborn size permissible for all target frequencies is given in section III in Supplementary Information.
From Eqs. (3) and (4) (Box 1), maturation time τ affects
g, the number of collectives under selection, also affects selection outcomes. As g increases, the value of
The waterfall phenomenon in a higher dimension
To examine the waterfall effect in a higher dimension, we investigate a three-population system where a faster-growing population (FF) grows faster than the fast-growing population (F) which grows faster than the slow-growing population (S) (Fig. 5a and Sec. VIII in Supplementary Information). In the three-population case, the evolutionary trajectory travels in a two-dimensional plane. A target population composition can be achieved if inter-collective selection can sufficiently reduce the frequencies of F as well as FF (accessible region, gold in Fig. 5b).

In higher dimensions, the success of artificial selection requires the entire evolutionary trajectory remaining in the accessible region.
a During collective maturation, a slow-growing population (S) (with growth rate rS ; light gray) can mutate to a fast-growing population (F) (with growth rate rS + ω; medium gray), which can mutate further into a faster-growing population (FF) (with growth rate rS + 2ω; dark gray). Here, the rates of both mutational steps are µ, and ω > 0. b Evolutionary trajectories from various initial compositions (open circles) to various targets. Intra-collective evolution favors FF over F (vertical blue arrow) over S (horizontal blue arrow). The accessible regions are marked gold (see Sec. I in Supplementary Information). We obtain final compositions starting from several initial compositions while aiming for different target compositions in i, ii, and iii. The evolutionary trajectories are shown in dots with color gradients from the initial to final time. (i) A target composition with a high FF frequency is always achievable. (ii) A target composition with intermediate FF frequency is never achievable. (iii) A target composition with low FF frequency is achievable only if starting from an appropriate initial composition such that the entire trajectory never meanders away from the accessible region. The figures are drawn using mpltern package [36]. c The accessible region in the three-population problem is interpreted as an extension of the two-population problem. First the accessible region between FF and S+F is given, and then the S+F region is stretched into S and F.
From numerical simulations, we identified two accessible regions: a small region near FF and a band region spanning from S to F (gold in Fig. 5b i). Intuitively, the rate at which FF grows faster than S+F is greater than the rate at which F grows faster than S (see section VIII in Supplementary Information). Thus, the problem can initially be reduced to a two-population problem (i.e. FF versus F+S; Fig. 5c left), and then expanded to a three-population problem (Fig. 5c right).
Similar to the two-population case, targets in the inaccessible region are never achievable (Fig. 5b ii), while those in the FF region are always achievable (Fig. 5b i). Strikingly, a target composition in an accessible region may not be achievable even when the initial composition is within the same region: once the composition escapes the accessible region, the trajectory cannot return back to the accessible region (Fig. 5biii, the leftmost initial condition). However, if the initial position is closer to the target in the accessible region, the target becomes achievable (Fig. 5b iii, initial condition near the bottom). Note that here, selection outcome is path-dependent in the sense of being sensitive to initial conditions. This phenomenon is distinct from hysteresis where path-dependence results from whether a tuning parameter is increased or decreased.
In conclusion, we have investigated the evolutionary trajectories of population compositions in collectives under selection, which are governed by intra-collective selection (which favors fast-growing populations) and inter-collective selection (which, in our case, strives to counter fast-growing populations). Intra-collective selection has the strongest effect at intermediate frequencies of faster-growing populations, potentially creating an inaccessible region analogous to the vertical drop of a waterfall. High and low target frequencies are both accessible, analogous to the lower- and the upper-pools of a waterfall, respectively. A less challenging target (high
Methods
Stochastic simulations
A selection cycle is composed of three steps: maturation, selection, and reproduction. At the beginning of the cycle k, a collective i has
Analytical approach to the conditional probability
The conditional probability distribution
After the reproduction step, the Newborn collectives grow for time τ. The frequency is changed from the given frequency ζ to f by division and mutation processes. We assume that the frequency f of an Adult is also approximated by a Gaussian random variable
The conditional probability distribution of an Adult collective in cycle k + 1 (fk+1,τ) to have frequency f at a given
Since we select the minimum frequency
We assume that the conditional probability distribution in Eq. (7) follows a normal distribution, whose mean and variances are describe by Eq. (40) and Eq. (41). Then, the extreme value theory [39] estimates the median of the selected Adult by
The selection progress Δf in Eq. (2) is obtained by subtracting
Data and source code availability
Data and source code of stochastic simulations are available in https://github.com/schwarzg/artificial_selection_collective_composition.
Acknowledgements
J. Lee and H.J. Park were supported by the National Research Foundation of Korea grant funded by the Korea government (MSIT), Grant No. RS-2023-00214071 and RS-2024-00460958 and by an appointment to the JRG Program at the APCTP through the Science and Technology Promotion Fund and Lottery Fund of the Korean Government. W. Shou was supported by an Academy of Medical Sciences Professorship and a Royal Society Wolfson Fellowship. This was also supported by the Korean Local Governments–Gyeongsangbuk-do Province and Pohang City and INHA UNIVERSITY Research Grant. We thank Su-Chan Park, Li Xie, Alex Yuan, and Botond Major for constructive comments and discussions.
Supplementary information
In this supplementary information (SI), we present the mathematical details of all calculations in the main text, as well as additional figures. The SI is organized as follows: In section I, we provide the mathematical details of stochastic simulation of the selection cycle. In section II, we show the mathematical process to obtain the analytical expression of the selection progress in Eq.(2) in the main text. In section III, we provide the analytical approximation of the critical Newborn size. In section IV, we discuss a limiting case when mutation rate is zero. Section V provides the boundary of the region of success with different selective advantages. In section VI, we discuss specific scenarios under which mutation converts F to S. In section VII, we present an example in which more than one collective is selected for reproduction. In section VIII, we provide the mathematical details of the selection cycle for a three-population system. Finally, in section IX, we provide detailed derivations of equations from section I to section II.
I. Stochastic simulation of the selection cycle
In the main text, we design a simple model of artificial selection on collectives. The selection cycle starts with g “Newborn” collectives which consist of two populations - slow-growing population (S) and fast-growing population (F). S mutate to F at a rate µ. The Newborns mature for a fixed time τ. The matured collective (“Adult”) with the highest function (with F frequency f closest to the target
In our selection cycle, variation among collectives mainly resulted from demographic noises during cell birth, cell mutation, and collective reproduction. In this section, we provide details of the simulation.
Maturation
Here, we calculate the cell numbers during maturation. Each collective i (i = 1, …, g) has
We describe cell divisions of S and F cells and mutation from S to F with following chemical reaction rules:
One can run an individual-based simulation by counting the number of events occurring during collective maturation via the tau-leaping algorithm [1, 2] to generate a sample trajectory of S(t) and F (t) for each collective. However, the individual-based simulation requires long computing times due to a large number of random events to be counted. Hence we used a ‘sampling method’ by sampling the numbers of S and F cells in collectives from a joint probability density distribution (jpdf) P (S, F, t) which denotes the probability density to have S number of S cells and F number of F cells at time t in the cycle. To do so, we require an analytical expression of P (S, F, t).
First, we assume that the chemical reactions in Eqs. (1)-(3) occur independently, and never occur simultaneously within a short time interval [t, t + dt). Then, the differential of P (S, F, t) with respect to time is given by
This master equation describes a probability density ‘flux’ at the state (S, F). The first term describes the scenario where a single birth event of a S cell happens during time interval [t, t + dt), which changes the collective’s composition from (S− 1, F) to (S, F). Similarly, the second term comes from a birth event of a F cell. The third term indicates the mutation event from (S +1, F− 1) to (S, F). The last term corresponds to the outflow of probability density by birth and mutation processes, which describes the changes from (S, F) to any other states.
Calculating the exact form of P (S, F, t) is not simple. Instead, we assume that the mutation rate is much smaller than the growth rates, and hence the correlation between S and F is sufficiently small. Additionally, S and F do not interact ecologically. Then, we can express P (S, F, t) as a product of two probability density functions (pdf) of S(t) and F (t), P (S, F, t) = P (S, t)P (F, t). We assume that each pdf of S and F can be approximated as Gaussian (𝒩), which is supported by the Central Limit Theorem and Fig. 1. In more detail, the cell numbers S and F are mainly determined by growth (Eqs. (1) and (2)), and also mutations (Eq. (3)). Even though the number of events would be different among different realizations, the mean numbers of events will follow Gaussian distributions. So, we can simply assume that the distributions of cell numbers also follow Gaussian distributions. This assumption requires that the distributions have insignificant skewness and no heavy tails, which we will numerically check afterwards. The pdfs of S(t) and F (t) are given by
and
That is, P (S, F, t) is written as
Now we need means (
The means are defined by
We assume that the mutation rate µ is much smaller than rS and ω. By solving Eq. (8) and Eq. (9), the means
where
We define the second momenta of S and F as
Then, the corresponding differential equations are given by
The solution of Eq. (16) is
where
where
The solution of Eq. (22) is given by
By using Eq. (23), the solution of Eq. (17) is given by
where
Using
Using Eqs. (10),(11),(20),(25), we construct pdfs for S(t) and F (t) at the end of cycle t = τ. Then, we randomly sample a number from P (S, τ) for S(τ) and another number from P (F, τ) for F (τ). Those two numbers are cell numbers in a single Adult. We repeat this process for each Newborn to get cell numbers of all Adults. Note that the initial values for the Newborn i are

Comparison between the calculated Gaussian distribution (“Gauss”, with the mean and variances computed from Eqs. (10),(11),(20),(25)) and simulations using tau-leaping (“tau”). The simulations run 3000 times. The initial number of cells are (S0, F0) = (990, 10), (500, 500), and (10, 990) for each column. The parameters r = 0.5, ω = 0.03, µ = 0.0001, and τ = 4.8 are used.
Now, we check validity of Gaussian approximation for probability density functions of S and F populations. If we consider mutation from S to F as death in S population, then the process in S corresponds to a branching process with death. Also, the birth process in F including mutation results in a Luria-Delbrück distribution[3]. Thus the distributions of Adults’ S and F numbers are more skewed and heavy-tailed than Gaussian. But this problem is alleviated by larger initial S and F numbers and when the maturation time τ is not very long (see Fig. 1). Since we usually consider larger initial cell numbers, we use the Gaussian approximation on S and F populations in further calculations.
Selection
After sampling cell numbers of each Adult in the maturation step, we compute the F frequencies in each collectives

Congruence between consecutive sampling (MHG for multivariate hypergeometric distribution) and independent binomial (BN) sampling. The initial numbers of cells are S = 8000 and F = 2000 for the left panel, and S = 20 and F = 5 for the right panel. 10000 samples are drawn for each distribution. Here, a parent collective is divided into 10 collectives.
In mathematical expression, the selected frequency is defined by
Reproduction
Using the chosen Adult, we generate g Newborn collectives for the next cycle k + 1. The most natural way is consecutive random sampling N0 cells from the selected Adult without replacement. In mathematical expression, we first randomly sample
If we assume that the selected Adult size
After sampling, the numbers of S cells are set to be
We can now start cycle k + 1 with these Newborn collectives. By repeating the above three steps (maturation, selection, and reproduction), we run the simulation until F frequency reaches a stationary state.

a Trajectories of F frequency for 10 collectives (g = 10) over time. The collective whose frequency is closest to the target value is selected in every cycle (black lines). The gray lines denote the other collectives. For parameters, we used S growth rate rS = 0.5, F growth advantage ω = 0.03, mutation rate µ = 0.0001, maturation time τ ≈ 4.8, and N0 = 1000. b Comparison between frequency trajectories with selection (the chosen one Adult producing all offspring; black) and without selection (each Adult producing one offspring; blue) clearly shows the effect of artificial selection. The black line indicates F frequency of the selected collective
Simulation Result
Figure 3 presents the composition trajectories of all collectives using tau-leaping algorithm in the maturation step. The selected Adults have the closest composition to the target composition
In Fig. 4, we plot the absolute error d between the target frequency
II. Conditional probability distribution of the selected collective frequency f * and selection progress Δf
In the main text, we identify the region of success by using selection progress

Color map of the absolute error
Let us start from the reproduction step from the selected Adult in cycle k. We reproduce g Newborns in next cycle k + 1. Then the probability distribution of the F cell numbers in Newborn collectives is given in Eq. (28). If the total number of cells in a Newborn collective N0 is large enough, Eq. (28) is approximated by the Gaussian distribution
The Newborn collective i has initial cell numbers
Next, we write conditional pdf of Adults’ F frequency with given Newborn F frequency ζ. We assume that cell numbers in Adult S(τ) and F (τ) follow Gaussian distributions as in Eqs. (5) and (6). Based on Eqs. (10), (11), (20), and (25), we have
where GS(τ) and GF (τ) are random variable following the standard distribution 𝒩 (0, 1). Note that each Gaussian is sharp if Newborn size N0 is sufficiently large (
The mean of f is given by
and the variance is
where
Next, we get the conditional pdf of
After maturation in cycle k + 1, the Adult with the smallest frequency is selected among g Adult collectives, denoted as

The probability density functions of the selected Adult’s F frequency
Since frequencies are independent and identically distributed,
Then, the probability density function
We compute the probability density function(39) by using numerical integration and compare it with the stochastic simulation results in Fig. 5. The two distributions are similar.
To get the analytic approximation of the median of Eq. (39), we assume that the Adult’s F frequency distribution is Gaussian. Then we only need to calculate the mean
which give rise to Eq. (3) and Eq. (4) in the main text, respectively. The functional form of Eqs.(40) and (41) are plotted in Fig. 6a.

a Mean (Eq. (33)) and variance (Eq. (34)) of f values of Adult collectives with respect to the Newborn frequency f0. b Scaling relation of F frequency variance (Eq. (41)) with Newborn collective size N0. The initial F frequency is 0.5. The parameters are rS = 0.5, ω = 0.03, µ = 0.0001, and τ ≈ 4.8. c Relation of F frequency variance (Eq. (41)) with maturation time τ. Other parameters are the same as b.
The median (Median
where Φ−1(y) is an inverse cumulative density function (CDF) of the normal distribution with mean
which is Eq. (2) in the main text.

Median (orange) and mean (violet) have similar distributions. We performed 1000 simulations to get probability density. a g = 10, b g = 100, and c g = 1000. Initial F frequency is
Further, we get an asymptotic expression of Φ−1(ln 2/g) when g is large (or Φ−1(y) with small y).
Here, we introduce a method from [5]. We start from the CDF of the standard normal distribution,
Replacing x2 on the righthand side in Eq. (44) into the expression itself, we get continued logarithmic form of
Inserting x = erfc−1(y) (square root of Eq. (45)) into the inverse CDF
III. Critical newborn size N̆0 to allow all target frequencies
First we note that
Small N0 makes all target frequencies achievable shown in Fig. 4a in the main text. That is because small N0 induces large σf, and thus N0 smaller than a certain critical value
and
So, by setting
Thus, all target frequencies are successfully selected with Newborn size N0 smaller than
IV. Selection without mutation µ = 0
When the mutation rate is zero, two genotypes behave as two distinct species. The compositional change is provided by Eq. (42) with setting µ = 0. Corresponding

Simulation with zero mutation rate. Color map of the absolute error
Equations (51) and (52) suggest that when a community consists of two competing species, we obtain similar conclusions on the accessible region for target composition. The stochastic simulation results are present in Fig. 8.
V. STronger or weaker advantages ω
The solution of equation (2) in main text provides the boundary values with varying the ω, the fitness advantage of F over S. We numerically calculate the solutions and plot in Fig. 9.
VI. Deleterious mutation ω < 0
In the main text, we show the target composition can be achieved in some ranges of initial and target values when the mutation is beneficial to growth. The same analogy can be applied when the mutation is deleterious. Since the F cells grow slower than the S cells (ω < 0), the F frequency naturally decreases in the maturation step. Then, the challenging case is selecting larger F frequency against to the intra-collective selection. So the conditional probability distribution

Change of success region in varying selective advantage ω. rS = 0.5, ω = 0.03, µ = 0.0001, N0 = 1000, g = 10 and τ ≈ 4.8.
If all frequencies are independent and identical distributed random variables, the cumulative distribution function becomes
Likewise in the previous section, we get the conditional probability density function by differentiating Eq. (54) with respect to f and replacing
The distribution in Eq. (55) is evaluated for various
VII. Selecting more than one collective
In the main text, we choose one collective which has the closest frequency to the target among g collectives. Such ‘top 1’ strategy allows us to apply extreme value theory. However, ‘top 1’ may be too restrictive [6]. Thus, we test the ‘top-tier’ strategy by choosing top 5 among 100 Adults (Fig. 11). The top-tier strategy is shown to be inefficient in our system. This is because in [6], non-heritable variations−such as stochastic fluctuations in species composition introduced by pipetting − caused nonheritable variations in collective function. Nonheritable variations could potentially mask desired mutations if these mutations happened to occur in an ‘unlucky’ environment that yielded lower collective functions. Hence, lenient selection would allow the preservation of these mutations. In contrast here, stochastic fluctuations in genotype composition is heritable: a parent Newborn with lower F frequency f will tend to have offspring Newborns with lower f values. Hence, top-1 is more effective in this study.

Artificial selection also works for deleterious mutation. a Conditional probability density functions of
VIII. Extension to three-population system
We assume that collectives consist of three genotypes with slow-growing(S), fast-growing(F), and faster-growing(FF) types. The growth rate of S is rS. Each mutation adds ω to the growth rate. Thus, the F and FF types have growth rates rS + ω and rS + 2ω, respectively. The mutation rate is µ. So, the birth and mutation events are written by the chemical reactions:

Selecting Top-5% outperforms selecting Top 1. We bred 100 collectives and chose either top-1 collective (solid line) or top-5 collectives (dashed line) with f closest to the target value
We write a master equation of the processes for P (S, F, FF, t) which is the probability to have S, F and FF numbers of S, F, and FF cells at time t, respectively.
The composition of collective i in cycle k is now represented with two frequencies
where the number of S Sk+1,0 is automatically set to be Sk+1,0 = N0 − Fk+1,0 − FFk+1,0. Then, the approximated multivariate normal distribution is
Then a Newborn’s composition ρ(ζ, η) follows the multivariate Gaussian distribution
At the beginning of cycle k, a newborn collective starts from (S0, F0, FF0) cells (for convenience, cycle index k is dropped.) In terms of (ζ, η), each initial numbers are S0 = N0(1− ζ − η), F0 = N0ζ, and FF0 = N0η. Their initial covariance matrix is
The initial conditions of the system in coupled Eqs. (65)-(73) are obtained by the mean and (co)variances of Eq. (62). By solving equations numerically, we obtain a set of mean cell numbers
Then, the F frequency becomes
where
where
With cycle index k, we get conditional probability of matured collectives
We select the Adult collective among g Adult collectives such that the change in frequencies during maturation could be compensated. During maturation, a frequency distribution moves in different directions in (f, h) space depending on the initial composition
If the mean
Then, the joint cumulative distribution function
The probability
where
Similarly, the probabilities
Then, the conditional probability of the selected collective is given by
where
If the mean
We rewrite the joint cumulative distribution function
The probability
Thus, the joint cumulative distribution function is given by
In this case, the conditional probability distribution function is given by
By replacing (f, h) to
Using Eq. (92), we get the mean values of f * and h* as
We define the accessible region in frequency space where the signs of the changes in both F frequency and FF frequency after a cycle are opposite to that of maturation (see Fig. 12),
where
This is explainable by projecting the three-population problem into the two-population problem. The selective advantage of FF relative to the rest of collective mainly determines the accessible region. The growth rate of the rest varies from rS to rS + w according to F frequency, so the mean growth rate of the rest is written by
IX. Derivation of equations
In this section, we go over the derivation of Eq. (10)-(34) for readers not equipped with advanced mathematics training.
Assumptions: µ ≪ ω, rS

a The flow of composition change in F and FF frequencies at each composition (f, h). Top corner indicates that FF cells fix in the collective. Right bottom corner means collectives with only F cells while collectives contain S cells only at left bottom corner. Arrow length means the speed of change. b The accessible regions are marked by the gold area. If the signs of changes in both F frequency and FF frequency after inter-collective selection are opposite to those during maturation, then the given composition is accessible. Otherwise, the composition is not accessible and will change after cycles. Dashed lines are the boundary of accessible region by projecting the collective into a two-population problem (FF vs. S+F). The figures are drawn using mpltern package [7].
Equations (10) and (11)
Equation (10) is straightforwardly solved by integrating Eq. (8). Equation (11) is obtained from Eq. (9) using integration factor
Integrating both sides, we get
Equations (16) and (17)
Applying Eq. (4), we have
We collect the two violet-colored terms and change the order of summation. Note that the first violet-colored term does not change regardless of whether S starts from 0 or 1 because the term is zero for S = 0. Thus, the first violet-colored term is equivalent to
We collect the two green terms, and similarly obtain:
Finally, we collect the two orange terms. For the first orange term, the sum is the same regardless of whether we start from S = 0 or −1. Let S start from −1, and we have
Let α = S + 1,then the term becomes
Now, add the three parts together, and we have
which is Eq. (16). Likewise,
which is Eq. (17).
Equation (18) and (20)
Using integration factor
Since µ ≪ rS, we have
where
Equation (22) and (23)
Since
We can solve this, again using the integration factor technique above:
Thus, we have
which results in
Equation (24)
From Eq. (17), we have
The right-hand side becomes
Note that we have checked that the second and third terms of
Then, we have
Equation (25)
Equations (32)-(34)
To derive this equation, we use the fact that 1/(1 + x) ∼1 − x for small x. We will omit (t) for simplicity. Also note that we are considering relatively large populations so that the standard deviation is much smaller than the mean.
Recall that if A ∼ 𝒩 (µA, σA), B ∼ 𝒩 (µB, σB), then
Note that the initial value of mean
References
- [1]Fast and facile biodegradation of polystyrene by the gut microbial flora of Plesiophthalmus davidis larvaeApplied and Environmental Microbiology 86:e01361–20
- [2]Corrigendum: Insights into plastic biodegradation: community composition and functional capabilities of the superworm (Zophobas morio) microbiome in styrofoam feeding trialsMicrobial Genomics 8
- [3]Synthetic biology approaches to engineer probiotics and members of the human microbiota for biomedical applicationsAnnual Review of Biomedical Engineering 20:277–300
- [4]Reorganization of a synthetic microbial consortium for one-step vitamin c fermentationMicrobial Cell Factories 15:21
- [5]Experimental studies of community evolution i: The response to selection at the community levelEvolution 44:1614–1624
- [6]Experimental studies of community evolution ii: The ecological basis of the response to community selectionEvolution 44:1625–1636
- [7]Artificial ecosystem selectionProceedings of the National Academy of Sciences 97:9110–9114
- [8]Artificial selection of microbial ecosystems for 3-chloroaniline biodegradationEnvironmental Microbiology 2:564–571
- [9]Levels and limits in artificial selection of communitiesEcology Letters 18:1040–1048
- [10]Selection on soil microbiomes reveals reproducible impacts on plant functionThe ISME Journal 9:980–989
- [11]Cultivated sub-populations of soil microbiomes retain early flowering plant traitMicrobial Ecology 73:394–403
- [12]Host-mediated microbiome engineering (hmme) of drought tolerance in the wheat rhizospherePLoS One 14:e0225933
- [13]Understanding microbial community dynamics to improve optimal microbiome selectionMicrobiome 7:1–14
- [14]Effect of the reproduction method in an artificial selection experiment at the community levelFrontiers in Ecology and Evolution 7:416
- [15]Mars Brisbin. Effects of microbial evolution dominate those of experimental host-mediated indirect selectionPeerJ 8:e9350
- [16]Artificially selecting bacterial communities using propagule strategiesEvolution 74:2392–2403
- [17]Artificial selection on microbiomes to breed microbiomes that confer salt tolerance to plantsmSystems 6:e01125–21
- [18]Artificial selection of stable rhizosphere microbiota leads to heritable plant phenotype changesEcology Letters 25:189–201
- [19]Community diversity determines the evolution of synthetic bacterial communities under artificial selectionEvolution 76:1883–1895
- [20]Modelling artificial ecosystem selection: A preliminary investigationAdvances in Artificial Life Berlin, Heidelberg: Springer Berlin Heidelberg :659–666
- [21]The role of non-genetic change in the heritability, variation, and response to selection of artificially selected ecosystemsArtificial Life IX: Proceedings of the Ninth International Conference on the Simulation and Synthesis of Artificial Life MIT Press :352
- [22]Artificial selection of simulated microbial ecosystemsProceedings of the National Academy of Sciences 104:8918
- [23]Simulations reveal challenges to artificial community selection and possible strategies for successPLoS Biology 17:e3000295
- [24]Eco-evolutionary dynamics of nested darwinian populations and the emergence of community-level heredityeLife 9:e53433
- [25]Steering ecological-evolutionary dynamics to improve artificial selection of microbial communitiesNature Communications 12:6799
- [26]Engineering complex communities by directed evolutionNature Ecology & Evolution 5:1011
- [27]Artificial selection of communities drives the emergence of structured interactionsJournal of Theoretical Biology 571:111557
- [28]Artificial selection methods from evolutionary computing show promise for directed evolution of microbeseLife 11:e79665
- [29]Partner-assisted artificial selection of a secondary function for efficient bioremediationiScience 26:107632
- [30]Novel artificial selection method improves function of simulated microbial communitiesbioRxiv https://doi.org/10.1101/2023.01.08.523165
- [31]Artificial selection improves pollutant degradation by bacterial communitiesNature Communications 15:7836
- [32]A quantitative genetics framework for understanding the selection response of microbial communitiesbioRxiv https://doi.org/10.1101/2023.10.24.563725
- [33]Artificial selection of microbial communities: what have we learnt and how can we improve?Current Opinion in Microbiology 77:102400
- [34]Major evolutionary transitions in individuality between humans and aiPhilosophical Transactions of the Royal Society B 378:20210408
- [35]The function inverfc thetaAustralian Journal of Physics 13:13
- [36]mpltern 0.3.0: ternary plots as projections of MatplotlibZenodo https://doi.org/10.5281/zenodo.3528355
- [37]Approximate accelerated stochastic simulation of chemically reacting systemsThe Journal of Chemical Physics 115:1716–1733
- [38]Efficient step size selection for the tau-leaping simulation methodThe Journal of Chemical Physics 124:044109
- [39]Statistics of extremesColumbia university press
- [1]Approximate accelerated stochastic simulation of chemically reacting systemsThe Journal of Chemical Physics 115:1716–1733
- [2]Efficient step size selection for the tau-leaping simulation methodThe Journal of Chemical Physics 124:044109
- [3]Progress of a half century in the study of the luria–delbrück distributionMathematical Biosciences 162:1–32
- [4]Statistics of extremesColumbia university press
- [5]The function inverfc thetaAustralian Journal of Physics 13:13
- [6]Simulations reveal challenges to artificial community selection and possible strategies for successPLoS Biology 17:e3000295
- [7]mpltern 0.3.0: ternary plots as projections of MatplotlibZenodo https://doi.org/10.5281/zenodo.3528355
Article and author information
Author information
Version history
- Preprint posted:
- Sent for peer review:
- Reviewed Preprint version 1:
- Reviewed Preprint version 2:
Copyright
© 2024, Lee et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
- views
- 177
- downloads
- 3
- citation
- 1
Views, downloads and citations are aggregated across all versions of this paper published by eLife.