Abstract
Bacterial counts from native environments, such as soil or the animal gut, often show substantial variability across replicate samples. This heterogeneity is typically attributed to genetic or environmental factors. A common approach to estimating bacterial populations involves successive dilution and plating, followed by multiplying colony counts by dilution factors. This method, however, overestimates the heterogeneity in bacterial population because it conflates the inherent uncertainty in drawing a subsample from the total population with the uncertainty in the sample arising from biological origins. In other words, this approach may obscure features that may otherwise be present in the data hinting at the presence of genuine subpopulations. For example, in plate counting applied to C. elegans gut microbiota, observed multimodality is often interpreted as large host-to-host variance, while the randomness introduced by measurement is frequently ignored. To explicitly account for the uncertainty introduced by dilution and plating randomness, we introduce REPOP, a PyTorch-based library to REconstruct POpulations from Plates within a Bayesian framework. Beyond simple cases, REPOP addresses more complex scenarios, including multimodal populations and correcting the mathematically subtle, but experimentally relevant, bias introduced by excluding plates deemed too crowded to distinguish individual colonies. We demonstrate REPOP’s ability to resolve distinct population peaks otherwise obscured by standard multiplication methods. Applications to both simulated and experimental datasets, including bacterial samples of different concentrations and ones from the gut microbiota of C. elegans, show that REPOP accurately recovers the underlying multimodality by properly accounting for error propagation, where naive multiplication fails. REPOP is available on GitHub: https://github.com/PessoaP/REPOP.
1 Introduction
Plate counting is fundamental to microbiology in estimating the number of living microor-ganisms in a sample [1, 2]. For example, it is essential in food safety testing [3], water quality assessments [4, 5], amongst many other applications [6–9] that require flexible quantification of microbial loads.
Briefly, plate counting involves serially diluting an original sample to ensure an appropriate colony density, allowing for the accurate enumeration of colonies for each sample (as summarized in Fig. 1a). Subsequently, the diluted samples are spread over nutrient-rich agar plates and incubated. After incubation, each viable organism (typically a bacterium) spread over the plate forms a colony, or in some cases a plaque [10], which are then enumerated as colony forming units (CFUs) or plaque forming units (PFUs), respectively. The degree of dilution (represented by the dilution factor) is typically set in such a way as to collect a statistically precise number of colonies but simultaneously avoiding colony overlap – usually aiming for 30 to 300 CFUs in a plate with a diameter of 10 cm. However, when studying a poorly characterized or highly variable bacterial population, collecting plates that are overcrowded is inevitable. Challenges related to counting crowded plates were only recently mitigated by AI-inspired tools [11–14].

Visual description of the dilution and plate counting process and the need for its analysis.
a) A sample of volume V is mixed with sterile medium in a larger vial (e.g., 19V, for a total dilution of 20). A portion of volume V of this diluted sample is then spread on a plate and incubated. Further tenfold dilutions are performed by transferring V into a vial with 9V of medium before plating. The process continues until a reasonable number of counts is obtained. b) The histogram on the left shows simulated counts across 1000 plates, obtained by diluting samples from the ground truth distribution by a factor of 200. The usual method of multiplying the counts by the total dilution factor produces the orange histogram (in the right) which appears significantly broader than the ground truth (black line, right) due to additional stochasticity introduced by dilution and plating. The ground truth distribution is a Gaussian with a mean of 8000 and a standard deviation of 500. REPOP’s reconstruction (blue) corrects the broadening. c) A similar simulation to (b) where the ground truth population distribution is trimodal (parameters in Fig. 2). Here the stochasticity introduced dilution and plating obscure the three peaks, making them difficult to distinguish in the observed counts. Nevertheless, REPOP successfully reconstructs the trimodal distribution.
Beyond simple quantification, plate counting is often used to assess population variation within a single microbial species, the degree of phenotypic variability [15–18] and environmental resource availability [19,20]. This stems from interpreting multimodality – several samples exhibiting high and several others low counts – or general heterogeneity in plate counts as reflecting population differences in colony formation, growth rates, or stochastic expression of traits [21, 22]. This heterogeneity can then be used to understand ecological dynamics of bacterial populations [19, 23–26] such as the competition of populations inside the gut of C. elegans [27].
On account of this, learning the variability from sample to sample is critical in accurately capturing a population’s heterogeneity. Thus, in order to draw biological conclusions, one would have to disentangle the effects of the stochasticity involved in the measurement (e.g., the randomness involved in the process of dilution and plating) from the heterogeneity in the original sample’s population arising from the underlying biological or ecological variation.
Naively, in dilution and plating, the total number of viable bacteria in the original sample is usually estimated by multiplying the number of colonies by the total dilution factor for each plate, resulting in a histogram of the population in the original samples. This process significantly overestimates the true variability in the bacterial count within the samples and can even obscure the multimodality in reconstructed population count histograms, as demonstrated in Fig. 1b and c.
While all methods – including those based on flow cytometry or DNA quantification [28–35] – have their own limitations, plate counting is considered the gold standard in several fields of microbiology [4, 36]. It has, nevertheless, the critical limitation of propagating error introduced via dilution.
To overcome the limitations of traditional plate count estimation, we developed REPOP (REconstruct POpulations from Plates), a software package that employs a Bayesian approach that rigorously accounts for the stochasticity introduced by dilution and plating. By modeling these statistical processes, REPOP infers both the full probability distribution of the original samples, preserving its inherent heterogeneity, as demonstrated in Fig. 1b. Notably, REPOP can detect the multimodal structure of the data even when such multimodality is not readily apparent from dilution count histograms, as shown in Fig. 1c. REPOP is available on GitHub [37]
When presenting our method, we address a mathematically subtle point that may qualitatively affect the inferred counts: the cutoff for “uncountable” on a plate. That is to say, when multiple dilutions are performed in experiments to obtain plates with a large quantity of CFUs, plates that exceed a preset colony cutoff – a number beyond which colonies are considered too numerous to count – are excluded. While this practice is common, failing to properly account the exclusion of such plates can introduce bias in the inferred population distribution.
Within the present paper, we will explain how REPOP is able integrate the previously ignored over-populated plates exceeding the cutoff in a more sophisticated albeit requisite iteration of our simpler method. Why requisite? No matter how low we set our dilution factor, a plate may probabilistically always exceed the countable cutoff, though increasingly rarely. Thus, if we can deal with over-populated plates, then we may collect better statistics on colony counts by avoiding too large a dilution.
To demonstrate REPOP’s applicability to real-world datasets, we analyzed two distinct experimental systems. First, we applied REPOP to plate counting data obtained from controlled mixtures of E. coli populations at different optical densities. Without prior knowledge of vial identity, REPOP successfully reconstructed the underlying distribution. Second, we employed REPOP to study bacterial colonization dynamics in individual C. elegans hosts. By analyzing plate counts of gut bacterial populations over several days post-feeding, REPOP revealed the progressive increase in bacterial load and captured the multimodal structure associated with host heterogeneity. These results highlight REPOP’s ability to disentangle true biological variability from stochastic sampling noise in complex, real experimental contexts.
2 Methods
2.1 Statistical description of plate counting
We first describe the process of serial dilutions in mathematical terms. From Fig. 1a we observe that, on each plate, a fraction of the volume in the initial sample, equal to the inverse of their respective dilution factor (ϕ), is spread over the plate. As such, supposing the population in the original sample to be well-stirred, each individual bacterium has probability
We denote the total number of bacteria in the original sample as n. Assuming each bacterium is independently and equally likely to be transferred, the number of bacteria transferred and forming colonies, denoted as k, follows a binomial distribution with parameters n (the number of trials) and
With this description of the dilution and plating process, we now describe how we use Bayesian inference to reconstruct the population distribution.
2.2 Bayesian inference
For now we assume that each sample (drawn from the sample at the previous dilution) gives rise to one plate. The dataset obtained consists of the sequence of plate counts, ki and their respective dilution factors ϕi for each replicate i. We denote the complete dataset as
Our overall goal is to learn the the probability distribution for the number of bacteria in the original samples, n, from the dataset D. To achieve this, we assume a priori that the distribution over n follows a model of population distribution with a set of parameters θ, written as p(n|θ).
The following subsection (2.3) is aimed at considering both unimodal and multimodal population distributions, p(n|θ). First, assuming that the population distribution is unimodal, we represent p(n|θ) as a simple Gaussian distribution, where θ reflects the mean and standard deviation. Second, in order to model a multimodal distribution, we invoke non-parametric models which provide a means to describe multimodal probability distributions for which the number of modes is unknown a priori [38–43]. More specifically, we assume a non-parametric Gaussian mixture, where θ represents a (potentially infinite) set of means, standard deviations, and mixture weights. Importantly, in large datasets, non-parametric models can flexibly adapt to approximate probabilities that do not strictly follow a predefined functional form [43–45].
The goal of reconstructing the distribution of bacteria in the original samples then becomes one of learning θ from D. In other words, we want to learn the probability of θ conditioned in D, p(θ|D). This probability is called a posterior. As we will see later, calculating the posterior will require the use of p(n|θ), which is unior multimodal, and the use of p(ϕi|n) when cutoffs on countability of plates are used in order to eliminate plates containing too many colonies.
REPOP is applied here to three cases of increasing complexity: the first (Model 1) is the simplest, assumes unimodal p(n|θ) without cutoffs. In Model 2 we keep no cutoff in counts, but generalize p(n|θ) to a multimodal (which simplifies to the unimodal Model when all but one component is considered). In Model 3 we keep the most general multimodal distribution for p(n|θ), but expand the counting method to account for the cutoff resulting in some plates not being counted and simply exceeding the cutoff. We summarize the differences among the three mechanisms in Table 1.

For ease of reference, we summarize the differences among the three models taken into account in the present article.
For all models highlighted above, in order to compute p(θ|D), we need the likelihood, p(D|θ), then related to p(θ|D) through Bayes’ theorem
Here p(θ) is the prior. Further details on the prior are given in the supplemental information A. Once we calculate probability p(θ | D), we may maximize this probability to obtain the maximum a posteriori (MAP) value, θMAP. More details are provided later in Sec. 2.5.
2.3 Models for the population distribution
2.3.1 Unimodal distribution (Model 1)
For Model 1, we can mathematically represent the probability of the true population count n given the distribution parameters θ as:
Here, θ includes the mean, µ, and standard deviation, σ, of the integer Gaussian p(n|θ), θ = {µ, σ}. The denominator above is necessary because we are dealing with (positive) integer Gaussians and will approach the usual
2.3 Multimodal distribution non-parametric inference (Models 2 and 3)
Naturally, the choice of modeling n through a single Gaussian component, Model 1, already commits us to assuming a unimodal distribution. However, if bacteria are present across replicates in a multimodal distribution — such as observed in some studies of C.elegans gut microbiota [8, 27, 46], where some replicates have few bacteria and others have more based on rare colonization events [27, 46] — how many modes should we consider?
This consideration immediately forces upon us an important motivation for considering a much more general choice for p(n|θ), namely a non-parametric Gaussian mixture. Loosely speaking, non-parametric Gaussian mixture models do not assume a fixed number of parameters (or components) beforehand. Instead, they adapt and determine the appropriate number of components as warranted by the data. Knowledge of non-parametric mixture models, while subtle and detailed in Ref. [43] and references therein, are not required for running REPOP. In such a model, θ encompasses the means
When performing non-parametric inference, the choice of prior p(θ) is key. A well-chosen prior can add computational efficiency and avoid runaway pathologies associated with assigning every datapoint a different mixture component. A common choice of prior for the infinite mixture model satisfying both of these criteria is the Dirichlet process prior [43]; see supplemental information A for more details. While a finite truncation in the number of components can be avoided [47, 48], for computational reasons, here we use a finite truncation (defaulted to the smallest value among 25 and the square root of the total number of samples, though modifiable, in REPOP) for the sum over the components, α, in (4).
2.4 Calculating the likelihood
Here we explain how to calculate the likelihood, p(D|θ), appearing in (2) and start with Models 1 and 2 and move to Model 3 where the cutoff is taken into consideration.
2.4.1 Likelihood for Models 1 and 2
In the Models where cutoffs are not taken into consideration (i.e., where it is assumed that the colonies on a plate can always be counted and plates are never rejected for having too many colonies), the calculation of the total likelihood, for all counted plates from the original number of bacteria in one sample, n, follows the scheme described in Sec. 2.1. However, ni is itself a realization of the population distribution p(n|θ) described in the previous section (2.3). Thus the probability that a plate with ki counts were generated from a dilution ϕi is given by marginalizing over ni as
With each factor p(ki|ϕi, ni) within the summation given by (1) and p(ni|θ) given by the respective population model as presented in Sec. 2.3.
This allows us to write Bayes’ theorem (2) as
where I is defined as the total number of replicates. Note that while this summation is truncated to
2.4.2 Likelihood for Model 3
For Model 3, we assume that the number of colonies may exceed a certain cutoff, kCO, and that such plates are not counted. Typically, for each sample, only one of the plates generated in the process described in Fig. 1a is reported. In other words, for each collected sample only the counts (and dilution) of one of the plates at the first dilution resulting in a countable number of colonies (below a pre-defined cutoff) is recorded as a datapoint. This is illustrated in Fig. 1a.
This process is different from the previous models. If a dilution different from the first dilution in the dilution schedule is reported, this implies that all lower dilutions therefore resulted in plates whose colony count (a stochastic variable) already exceeded the cutoff kCO. To obtain the posterior over θ, it is necessary to know the dilution schedule from the plating experiment. We refer to the full set of these as Φ = {ϕ1, ϕ2, …, ϕM} ordered from lowest to highest dilution factor. In the example given in Fig. 1a we have ϕ1 = 20, ϕ2 = 200, and ϕ3 = 2000. While, mathematically, we could extend this indefinitely by sequentially plating dilutions that are factors of 10 larger, in practice only a finite number of plates can be produced, thus justifying the upper limit M. It is expected that the dilution schedule leads to dilution values sufficiently large that the approximate number of colonies recovered approaches zero. Expanding upon the previous models, it is now important to report not only the counts but also the associated plate index. The datapoint is now reported as (ki, mi), which means that for the i-th sample, the plate diluted by a total factor of
With the set of all possible dilutions Φ explicitly considered, we incorporate this dependence into the calculation of all relevant probabilities. Thus, the total posterior for the full dataset is given by
Here, we used the fact that the population distribution and its parameters, θ, are independent of the dilutions selected to plate, Φ. Therefore, p(θ|Φ) = p(θ) is the original prior and p(ni|θ, Φ) = p(ni|θ) is given by the respective population model of Sec. 2.3. The remainder of this subsection focuses on how to calculate the likelihood of each datapoint, p(ki, mi|ni, Φ), within the summation above. To obtain this likelihood, we revisit the dilution and plating process. As described in Fig. 1a, a plate is generated for each dilution within the set of dilutions Φ. For each sample, labeled i, the number of colonies on the plate with dilution ϕm is denoted as
Since all plates originate from the same original sample (with a finite number of bacteria, ni), the values of
Under this assumption, each count
As previously described, the datapoint (ki, mi) corresponds to the counts and the index of the first value within the sequence
from which the observed datapoint (ki, mi) is obtained with:
With this in mind, it becomes possible to rewrite the likelihood, p(ki, mi|ni, Φ), as
Once more, we proceed by showing how to calculate each factor within the product above. Starting from the last factor, p(mi|ni, Φ), we note that if the plate with index mi is reported, we must have that both
Here we use the capitalized P to represent that we are calculating the probability that a variable is within a set, rather than the probability of a single possible value. In this specific model, what we are asking is the probability that
so (12) becomes
with each
Then, for the remaining factor in (11), p(ki|mi, ni, Φ), since counts above the cutoff are not reported as data, the likelihood of finding ki colonies in a plate diluted by a total factor
for ki ≤ kCO and zero otherwise.
Thus, in order to obtain the posterior (7), we need to substitute the likelihood per datapoint, p(ki, mi|ni, Φ) in (11), which is computed using (15) and (14).
2.5 Bayesian MAP
In Bayesian inference, we compute the posterior distribution p(θ|D) as it captures the updated probability over the model parameters after observing the data. However, posteriors are often complex and high-dimensional (technically infinite in non-parametric models), making direct computation and visualization challenging. To address this, we often resort to sampling methods to approximate the posterior. However, these samplers can also require long computation.
Since our objective here is to create a tool that is quick and easy to run, we resort to maximum a posteriori (MAP) estimation. This reduces the problem of knowing the distribution over θ to identifying the optimal θ. In this model, we maximize p(θ | D) and denote the maximum value as θMAP,
with p(θ | D) given by (6) for Models 1 and 2, or (7) for Model 3. For our purposes here, we say that p(n | θMAP) is the distribution of n we learned from the data. In REPOP, the maximum value θMAP is found using optimization libraries within PyTorch [49].
3 Results
In the previous section we have constructed the posterior p(θ|D) across unimodal and multimodal population distributions, and by using a likelihood that either assumes all plates are countable (Model 1-2) or considers plates exceeding the cutoff (Model 3).
3.1 When all plates are considered countable
For the first model, we generate simulated data drawn from a single Gaussian with equal dilution factors across all samples. Meaning, this assumes that the dilution factor used (200) is able to generate a countable number of colonies in all samples. We also assume a unimodal Gaussian, as in Model 1. The reconstruction obtained by REPOP was shown earlier in Fig. 1b where we see that REPOP corrects for the overestimation of variance introduced by the stochasticity in dilution and plating.
In the more complex multimodal Gaussian case (Model 2), we further observe that RE-POP uncovers the true underlying distribution’s peaks with greater resolution and sensitivity to multimodality than by naive multiplication. As an example, here we generate simulated data for a mixture of Gaussians with three modes. In Fig. 2a, we show a relatively simple dataset, where the three peaks are already discernible in the histogram obtained via naive multiplication. Nonetheless, REPOP improves the resolution of these peaks, particularly as the number of samples increases. In Fig. 2b, we present a more challenging case where the modes are less distinct under the naive multiplication approach; the additional stochasticity introduced by dilution causes the resulting histogram to appear as a single broad peak. Nonetheless, REPOP is able to accurately recover the underlying multimodality. In both cases, the performance of REPOP improves with larger datasets, as evidenced by the progressively sharper resolution of each peak.

REPOP can reconstruct multimodal populations even when they are not directly observable by naively multiplying counts by dilution.
Similar to Fig. 1b and c, this figure shows the observed counts, the naïve estimate obtained by multiplying counts by dilution, and the reconstruction produced by REPOP, compared to the ground truth. In both cases, the population n is drawn from a ground truth mixture of three Gaussians, shown as the black curves. In a), the mixture has means (4000, 8000, 14,000), standard deviations (200, 1500, 1000), and weights (0.25, 0.4, 0.35), while in b), it has means (8000, 16,000, 24,000), standard deviations (1000, 1000, 1000), and weights (0.3, 0.2, 0.5). Observed counts are sampled from (1), diluted by a factor of 200, and shown as histograms on the left. In a), a trimodal structure is visible in the observed counts histogram. However, in (b), the trimodality is heavily obscured by the stochasticity introduced by dilution and plating, making it indiscernible even in large datasets. Despite this, when applying REPOP to datasets of increasing size, we see that while a small dataset of 25 plates may result in missed modes, larger datasets allow REPOP to accurately recover the underlying multimodal structure in both cases. This demonstrates REPOP’s ability to infer the true population despite the stochastic noise introduced by dilution and plating.
3.2 When the cutoff is taken into account
For our final demonstration with simulated data, we show results when the data follows a cutoff in reporting, as described in Sec. 2.4.2. We have generated simulated data according to the following scheme: starting from the lowest dilution (20, as shown in Fig.1a). If the counts are larger than the cutoff, we used kCO, we sample again at a dilution factor 10 times higher until we find a plate that has fewer colonies than kCO. We then record that number of counts and that dilution factor as the i-th datapoint (ki, ϕi), as would be done by rigorously following the cutoff. If we know the dilution schedule, this is equivalent to having the data as the counts and the index of the reported dilution (ki, mi) as discussed in Sec. 2.4.2, with each reported dilution matched to the corresponding index mi, meaning the dilution reported is the mi-th in the schedule
We visualize the data generated this way and show the reconstruction obtained for Model 3 in Fig. 3. In the rightmost column, we demonstrate how failing to account for the cutoff leads to an artificial peak (e.g., an extra mode near the leftmost peak in the ground truth, Fig. 3b). This occurs because counts near the cutoff kCO are ambiguously assigned to incorrect dilutions.

Not accounting for the cutoff can lead to incorrectly attributed multimodality.
This figure compares population reconstructions with and without accounting for the cutoff effect, (using Model 3 and Model 2 respectively). a) simulated data generated with a cutoff kCO = 50. The ground truth population consists of three Gaussian components with means
These errors become particularly pronounced when colony counts approach kCO, as the cutoff procedure forces certain samples to be reported at different dilutions. For example, consider a cutoff kCO = 300 and two possible dilution factors, ϕ1 = 20 and ϕ2 = 200. A sample with an original bacterial count ni near
3.3 Results for experimental data: Discerning different true concentrations of bacteria
While we have demonstrated in the previous subsections how to resolve overlapping distributions in simulated datasets, here we move on to obtaining a similar case built with real-world plate counting. In particular, we construct a case that emulates multiple peaks at different dilutions, as shown in Fig. 3, by generating three different sample sources of bacteria.
We prepared vials with four different concentrations of E.coli measured by optical density, and consider samples of a volume equivalent to 0.2nL from these vials. These samples are further diluted according to the dilution schedule Φ = 20, 200, 2000. However, in practice, we do not directly dilute 0.2nL. Instead, we take 1µL and dilute it by factors of 105, 106, and 107 – further details in Supplemental Information C.1. This approach ensures that while the original vials can be assumed to be have neglegible variability, the small 0.2nL subsamples may no longer be, resulting in a unimodal but with a larger relative variance population for each vial.
The colony counts obtained are processed without identifying their original vial, aiming to reconstruct the underlying tetramodal structure. However, the dataset remains relatively small, consisting of only 97 total samples (see details in Supplemental Information C.1). This dataset and the reconstructed distribution are presented in Fig. 4a.

Reconstructing, from real experimental plate counting, the population from vials with different optical densities.
a) Distribution of observed colony counts obtained as described in Sec. 3.3. The data is passed to the system without vial identification, aiming to reconstruct the underlying tetramodal structure. b) Results from a simulated dataset with 500 datapoints (plates). The increased sample size improves the estimation of the four main components, yielding a more accurate reconstruction.
Although the limited sample size results in an imperfect match, the method successfully captures the four main components – corresponding to the estimated values of µα and σα associated with the four highest values of ρα. To further validate the approach, we supplement the dataset by generating 800 synthetic datapoints from the mixture distribution made of the four main components, leading to a dataset considerably larger than what is typical of plate counting. The results, shown in Fig. 4b, demonstrate a significant improvement in capturing the underlying structure of the data.
3.4 Application: Capturing host differences through examining gut microbiota population
Here, we use REPOP to studying heterogeneity in more complex biological systems, such as microbial ecology within a host organism. Thus, we apply our framework to analyze gut bacterial counts from individual C. elegans.
Heterogeneity in bacterial counts among individual C. elegans has been reported and is typically attributed to rare colonization events and size variability [27, 46, 50]. However, to derive meaningful insights into gut ecology, it is essential to distinguish true population heterogeneity from the stochasticity introduced by dilution and plating.
To minimize confounding factors, we restrict our analysis to the OP50 strain of E. coli, which is well-characterized in laboratory settings and, under idealized conditions, is expected to produce a unimodal population distribution. Any deviations from this expected pattern may reflect individual C. elegans variability, either due to differences in the gut microenvironment or the fact that gut colonization is a rare event [27].
Due to this, we conducted an experiment in which C. elegans were synchronized, cleared of gut bacteria at three days post-hatching and subsequently fed live E. coli for controlled durations (1, 3, 5, 7, or 9 days). At each studied time point, 25 individual worms were picked, cleaned of all non-gut adhered bacteria, and crushed. The worm homogenate was then serially diluted and plated with a dilution schedule, Φ = {22, 222, 2222}. The plates were counted after a 24 hour incubation. More detailed experimental methods may be found in the Supplemental Information C.2.
The results presented in Fig. 5 demonstrate a clear trend in bacterial population dynamics within the C. elegans gut. The population distribution shifts consistently to higher values with each successive day, aligning with our expectation that bacterial numbers increase over time due to colonization and replication inside the gut.

Results for the population of E.coli-fed C. elegans’ gut at days 3, 5, 7, and 9 of adulthood.
Here we are able to observe the shift in the population distribution obtained with plate counting shifts to higher population for longer-living C. elegans. To better visualize the distribution and model fits, we use a split-log x-axis: bacterial counts from 0 to 100 are shown on a linear scale, and counts above 100 on a logarithmic scale. The population distribution shifts toward higher values each day, as expected from colonization and replication within the gut. If biological significance, such as different C. elegans types, as proposed in e.g. [51] is to be inferred from this multimodality, a rigorous separation of population heterogeneity and plating stochasticity, as implemented by REPOP, is essential.
The multimodality exhibited in Fig. 5 can be explained by the heterogineity of the host. Recently, this has been collaborated by in vivo experiments [51]. This phenomenon could be related to the well-established stratification in heat-shock protein (HSP) expression, which is also correlated with aging and lifespan in nematodes [52–54]. Thus the variation in the population could be due to the fact that ingestion and egestion exhibit an age-related change [55, 56] or due to state switching [51] in feeding and digestion, akin to genetic state switching [57, 58]. The results in Fig. 5 support the conclusion that biologically meaningful differences in population counts are being captured, rather than the data being dominated by instrumentation error. REPOP provides a robust method for distinguishing multiple peaks in the population distribution, indicating that intrinsic variability among individual C. elegans can be deconvolved from measurement noise.
4 Conclusion
In accurately estimating bacterial counts, simply multiplying counts by the dilution factor provides a starting point. However, this method alone often fails to capture the original individual counts across samples, as straightforward multiplication of colony counts by the total dilution factor typically leads to an overestimation of the spread (Fig. 1b) that may obscure multimodal behavior (Fig. 1c). Conversaly, inferring population growth dynamics [8, 15,19,46,51] and learning features driving population heterogeneity across samples [19,25,26], demands a deeper analysis of the statistics involved in dilution and counting. Indeed, such a treatment avoids attributing ecological (or other) significance to the randomness introduced by plating.
The Bayesian reconstruction presented here and implemented through REPOP provides rigorous method to obtain the distribution of populations from plate counts, including observing the correctly multimodality of the population, as seen in Figs. 1 – 3.
The study of the multimodal case accounting for cutoff (Model 3) also demonstrates that retaining all data, regardless of their individual importance in reconstructing the final population size distribution, remains crucial. As shown in Fig. 3, ignoring the cutoff can lead to misidentifying multimodality, especially for plates with colony counts near the cutoff. Consequently, REPOP’s results emphasize that explicitly reporting both the dilution schedule and the cutoff values used should be a fundamental part of data reporting, as they impact the form in which likelihoods are calculated.
As a clear extension of the current work, it is possible to combine the two modular steps involved in counting bacteria in plates with the current work of establishing the original population size based on plated samples. For instance, a notion of possible error rate, such as false positive or negative rate in counting that may even appear at high dilution (where there is always a chance for bacterial colony overlap), could help in further reducing the breadth in the original counts over bacteria. For instance, intuitively, we may imagine the probability of miscounting two or more colonies as a single one to grow as we reach colony counts approaching the cutoff. Simple simulations, for instance, considering typical colony and plate size, may be used to gain initial insight on potential false negative rates, and thus its effect on the original sample size.
Finally, REPOP provides a means by which to test hypothesis [8, 51] relating observed multimodality in gut microbial content of C. elegans, as shown in Fig. 5, to distinct categories of hosts. What is more, REPOP can also be used to study other organisms, exhibiting similar multimodatlity in their microbiota [59–61]. More than just a large dataset, identifying these multiple modes requires a theoretical framework maximizing the information extracted from each plate while remaining flexible enough to accommodate different numbers of modes. REPOP fulfills both of these requirements.
Supplemental Information
A Prior selection across the uni-modal and non-parametric model
In Sec. 2.3.2 we allude to a selection of a prior, p(θ), capable of handling multi-modal data. This will be explained in detail here.
As mentioned in the main text, this prior encompasses all of the parameters needed to construct the probability distribution for n, given by Eq.(3) in the main text. Therefore, θ must include the means
Let us separate each piece of the prior
where
With the prior for
For the prior in
Thus, the prior used to generate the figures in the main text is defined as:
With
B Relative error between the means
When analyzing the error when ignoring the cutoff in Sec. 3.2, we used a metric that is the relative error between the means. When comparing two distributions, p(n|θ) and p(n|θ′), in terms of how the means match, we use the metric
given as percentages where α are indexes to the components of θ, while α′ are the indexes of the components of θ′. The term within the absolute value is equivalent to the relative error between two means µα and µα′. We take the minimum to account for cases where the labels for different components do not match across the distributions. Throughout the text, we have presented the metric M in percentages. In the main text, Sec. 3.2, θ represents the reconstruction obtained by the method while θ′ represents the ground truth.
For the dataset shown in Fig.3a, the relative error is 0.82% when the cutoff is accounted for (model 3). In contrast, when the cutoff is ignored (model 2), the error increases significantly to 7.50%. Similarly, for the dataset shown in Fig.3b, the relative error is 2.10% when the cutoff is considered, compared to 11.1% when it is not. Full details of these calculations are available on our GitHub repository [37].
C Experimental Methodology
C.1 E. coli culture for simulating different expected population peaks
An overnight culture of the OP50 strain of Escherichia coli was grown in lysogeny broth (LB) media, and its optical density at 600 nm (OD600) was measured against an LB blank. The culture was concentrated to an expected 5X of OD600=1.0 by centrifugation and resuspension in phosphate-buffered saline (PBS). The 5X suspension was then diluted to an expected OD600 of 0.5, 0.1,, and two at 0.01, with empirical OD600 values recorded against a PBS blank, which were 0.804, 0.188., 0.025, and 0.02. From the first three vials, we obtained 25 samples (each with 3 dilutions), while the last vial yielded 22 samples.
For each OD600 condition, serial dilutions were prepared to 1:100,000, 1:1,000,000, and 1:10,000,000 in PBS. A total of 90 µL from each dilution was plated onto LB agar, resulting in 75 plates per OD condition. Plates were incubated at 37°C for 18–24 hours before counting colony-forming units (CFUs) to estimate bacterial concentration.
C.2 C. elegans culture
The temperature-sensitive, reproductively sterile strain AU37 of Caenorhabditis elegans (C. elegans) was cultivated and synchronized according to standard protocols [50]. Following the protocol described by Gore and Vega [27], synchronized eggs were cultivated at the non-permissive temperature of 25°C to produce sterile adult worms. These worms were then washed and transferred to a heat-killed E. coli OP50 suspension in S media supplemented with gentamicin for 24 hours to clear gut bacteria.
Once cleared of gut bacteria, worms were washed once with 10 mL M9 buffer containing 0.1% Triton X and twice with 10 mL M9 buffer. They were then transferred to a live bacterial suspension (E. coli OP50 at OD600 = 1.0 in 1 mL S media) for a duration of 3, 5, 7, or 9 days. Fresh bacterial suspension was provided at 24, 72, 120, and 168 hours (approximately every 48 hours after the initial 24-hour feeding).
On the designated sampling days for gut bacterial enumeration, worms were washed to remove non-adhered surface bacteria and manually disrupted following the protocol outlined in the section titled Disruption of Worms and Plating of Intestinal Populations of Ref. [27]. The resulting homogenate was serially diluted using a 10-fold dilution series, with final expected dilution factors of 1:22, 1:222, and 1:2222 before plating. Colony-forming units were enumerated after incubating the plates for 24 hours.
Acknowledgements
SP acknowledges support from the NIH (R35GM148237), ARO (W911NF-23-1-0304), and NSF (Grant No. 2310610). We also thank Ayush Saurabh, Max Schweiger, Lance (W. Q.) Xu, and Ioannis Sgouralis for insightful discussions in the development of this work.
References
- [1]Handbook of Culture Media for Food and Water MicrobiologyThe Royal Society of Chemistry https://doi.org/10.1039/9781847551450Google Scholar
- [2]Benson’s microbiological applications laboratory manualMcGraw-Hill Education Google Scholar
- [3]Microbiological testing results of boneless and ground beef purchased for the U.S. national school lunch program, school years 2015 to 2018Journal of Food Protection 82:1761–1768https://doi.org/10.4315/0362-028x.jfp-19-241Google Scholar
- [4]Heterotropic plate counts and drinking-water safetyIn:
- World Health Organization
- [5]Microbial pollution of water with special reference to coliform bacteria and their nexus with environmentEnergy Nexus 1:100008https://doi.org/10.1016/j.nexus.2021.100008Google Scholar
- [6]Bioinspired oral delivery of gut microbiota by self-coating with biofilmsScience Advances 6:1472–1476https://doi.org/10.1126/sciadv.abb1952Google Scholar
- [7]PET microplastics affect human gut microbiota communities during simulated gastrointestinal digestion, first evidence of plausible polymer biodegradation during human digestionScientific Reports 12:528https://doi.org/10.1038/s41598-021-04489-wGoogle Scholar
- [8]Maximum likelihood estimators for colony forming unitsMicrobiology Spectrum 12:e03946–23https://doi.org/10.1128/spectrum.03946-23Google Scholar
- [9]Using single-worm data to quantify heterogeneity in Caenorhabditis elegans-bacterial interactionsJournal of Visualized Ex-periments 185:e64027https://doi.org/10.3791/64027Google Scholar
- [10]A novel method to determine antibiotic sensitivity in Bdellovibrio bacteriovorus reveals a DHFR-dependent natural trimethoprim resistanceScientific Reports 10:5315https://doi.org/10.1038/s41598-020-62014-xGoogle Scholar
- [11]OpenCFU, a new free and open-source software to count cell colonies and other circular objectsPLOS One 8:1–10https://doi.org/10.1371/journal.pone.0054072Google Scholar
- [12]AutoCellSeg: robust automatic colony forming unit (CFU)/cell analysis using adaptive image segmentation and easy-to-use post-editing techniquesScientific Reports 8:7302https://doi.org/10.1038/s41598-018-24916-9Google Scholar
- [13]Automated counting of colony forming units using deep transfer learning from a model for congested scenes analysisIEEE Access 8:164340–164346https://doi.org/10.1109/ACCESS.2020.3021656Google Scholar
- [14]A comprehensive review of image analysis methods for microorganism counting: from classical image processing to deep learning approachesArtificial Intelligence Review 55:2875–2944https://doi.org/10.1007/s10462-021-10082-4Google Scholar
- [15]Growth against entropy in bacterial metabolism: the phenotypic trade-off behind empirical growth rate distributions ine.coliPhysical Biology 13:036005https://doi.org/10.1088/1478-3975/13/3/036005Google Scholar
- [16]Maximum entropy modeling of metabolic networks by constraining growth-rate moments predicts coexistence of phenotypesPhysical Review E 96:060401https://doi.org/10.1103/physreve.96.060401Google Scholar
- [17]Modeling population heterogeneity from microbial communities to immune response in cellsCellular and Molecular Life Sciences 77:415–432https://doi.org/10.1007/s00018-019-03378-wGoogle Scholar
- [18]Relationship between fitness and heterogeneity in exponentially growing microbial populationsBiophysical Journal 121:1919–1930https://doi.org/10.1016/j.bpj.2022.04.012Google Scholar
- [19]How sample heterogeneity can obscure the signal of microbial interactionsThe ISME Journal 13:2639–2646https://doi.org/10.1038/s41396-019-0463-3Google Scholar
- [20]Co-occurrence is not evidence of ecological interactionsEcology Letters 23:1050–1063https://doi.org/10.1111/ele.13525Google Scholar
- [21]Bistabilityepigenetics, and bet-hedging in bacteria, Annual Review of Microbiology 62:193–210https://doi.org/10.1146/annurev.micro.62.081307.163002Google Scholar
- [22]Population dynamics of metastable growth-rate phenotypesPLoS ONE 8:e81671https://doi.org/10.1371/journal.pone.0081671Google Scholar
- [23]Population-dynamic modeling of bacterial horizontal gene transfer by natural transformationBiophysical Journal 110:258–268https://doi.org/10.1016/j.bpj.2015.11.033Google Scholar
- [24]Intrinsic nonlinear dynamics drive single-species systemsProceedings of the National Academy of Sciences 119:e2209601119https://doi.org/10.1073/pnas.2209601119Google Scholar
- [25]Cautionary notes on the use of co-occurrence networks in soil ecologySoil Biology and Biochemistry 166:108534https://doi.org/10.1016/j.soilbio.2021.108534Google Scholar
- [26]Species abundance correlations carry limited information about microbial network interactionsPLoS Computational Biology 18:e1010491https://doi.org/10.1371/journal.pcbi.1010491Google Scholar
- [27]Stochastic assembly produces heterogeneous communi-ties in the caenorhabditis elegans intestinePLOS Biology 15:e2000633https://doi.org/10.1371/journal.pbio.2000633Google Scholar
- [28]Flow cytometric bacterial cell counts challenge conventional heterotrophic plate counts for routine microbiological drinking water monitoringWater Research 113:191–206https://doi.org/10.1016/j.watres.2017.01.065Google Scholar
- [29]Flow cytometry as a potential method of measuring bacterial viability in probiotic products: A reviewTrends in Food Science and Technology 78:1–10https://doi.org/10.1016/j.tifs.2018.05.006Google Scholar
- [30]Rapid identification of bacterial viability, culturable and non-culturable stages by using flow cytometryprotocols.io https://doi.org/10.21203/rs.3.pex-1153/v1Google Scholar
- [31]Estimation of microbial viability using flow cytometryCurrent Protocols in Cytometry 93:e72https://doi.org/10.1002/cpcy.72Google Scholar
- [32]Droplet digital pcr is an improved alternative method for high-quality enumeration of viable probiotic strainsFrontiers in Microbiology 10:3025https://doi.org/10.3389/fmicb.2019.03025Google Scholar
- [33]Improving and comparing probiotic plate count methods by analytical procedure lifecycle managementFrontiers in Microbiology 12:693066https://doi.org/10.3389/fmicb.2021.693066Google Scholar
- [34]Insights into the enumeration of mixtures of probiotic bacteria by flow cytometryBMC Microbiology 23:48https://doi.org/10.1186/s12866-023-02792-2Google Scholar
- [35]Beyond the colony-forming-unit: Rapid bacterial evaluation in osteomyelitiseLife 13:RP93698https://doi.org/10.7554/elife.93698.2Google Scholar
- [36]Microbiology of the food chain -— Horizontal method for the enumeration of microorganisms —- Part 2: Colony count at 30 degrees C by the surface plating technique, StandardGoogle Scholar
- [37]REPOP - Reconstructing population from plateshttps://github.com/PessoaP/platereconstruction
- [38]Dynamical fluctuations in biochemical reactions and cyclesPhysical Review E 82:031905https://doi.org/10.1103/PhysRevE.82.031905Google Scholar
- [39]A method for single molecule tracking using a conventional single-focus confocal setupThe Journal of Chemical Physics 150:114108https://doi.org/10.1063/1.5083869Google Scholar
- [40]Direct photon-by-photon analysis of time-resolved pulsed excitation data using bayesian nonparametricsCell Reports Physical Science 1:100234https://doi.org/10.1016/j.xcrp.2020.100234Google Scholar
- [41]Generalizing HMMs to Continuous Time for Fast Kinetics: Hidden Markov Jump ProcessesBiophysical Journal 120:409–423https://doi.org/10.1016/j.bpj.2020.12.022Google Scholar
- [42]BNPTrack: a framework for superresolved trackingNature Methods 21:1716–1724https://doi.org/10.1038/s41592-024-02349-9Google Scholar
- [43]Data Modeling for the Sciences: Applications, Basics, ComputationsCambridge University Press Google Scholar
- [44]Inferring effective forces for Langevin dynamics using gaussian processesThe Journal of Chemical Physics 152:124106https://doi.org/10.1063/1.5144523Google Scholar
- [45]Avoiding subtraction and division of stochastic signals using normalizing flows: NFdeconvolvearXiv https://doi.org/10.48550/arXiv.2501.08288Google Scholar
- [46]Sample pooling inflates error rates in between-sample comparisons: an empirical investigation of the statistical properties of count-based databioRxiv https://doi.org/10.1101/2022.07.25.501406Google Scholar
- [47]Sampling the dirichlet mixture model with slicesCommunications in Statistics - Simulation and Computation 36:45–54https://doi.org/10.1080/03610910601096262Google Scholar
- [48]Slice sampling mixture modelsStatistics and Computing 21:93–105https://doi.org/10.1007/s11222-009-9150-yGoogle Scholar
- [49]Pytorch: An imperative style, high-performance deep learning libraryIn: Advances in Neural Information Processing Systems 32 Curran Associates, Inc pp. 8024–8035Google Scholar
- [50]Wormbookhttp://www.wormbook.org/Google Scholar
- [51]Variance in c. elegans gut bacterial load suggests complex host-microbe dynamicsbioRxiv https://doi.org/10.1101/2024.12.09.627557Google Scholar
- [52]Heat shock changes the heterogeneity distribution in populations of caenorhabditis elegans: does it tell us anything about the biological mechanism of stress response?The Journals of Gerontology Series A: Biological Sciences and Medical Sciences 57:B83–B92https://doi.org/10.1093/gerona/57.3.b83Google Scholar
- [53]Visualizing hidden heterogeneity in isogenic populations of c. elegansExperimental gerontology 41:261–270https://doi.org/10.1016/j.exger.2006.01.003Google Scholar
- [54]Stochastic and age-dependent proteostasis decline underlies heterogeneity in heat-shock response dynamicsSmall 17:2102145https://doi.org/10.1002/smll.202102145Google Scholar
- [55]Quantitative measures of aging in the nematode caenorhabditis elegans. i. population and longitudinal studies of two behavioral parametersMechanisms of ageing and development 15:279–295https://doi.org/10.1016/0047-6374(81)90136-6Google Scholar
- [56]Measurements of age-related changes of physiological processes that predict lifespan of caenorhabditis elegansProceedings of the National Academy of Sciences 101:8084–8089https://doi.org/10.1073/pnas.0400848101Google Scholar
- [57]Avoiding matrix exponentials for large transition rate matricesThe Journal of Chemical Physics 160:094109https://doi.org/10.1063/5.0190527Google Scholar
- [58]Gene expression model inference from snapshot rna data using bayesian non-parametricsNature Computational Science 3:174–183https://doi.org/10.1038/s43588-022-00392-0Google Scholar
- [59]Host resistance to bacterial infection varies over time, but is not affected by a previous exposure to the same pathogenFrontiers in Physiology 13:860875https://doi.org/10.3389/fphys.2022.860875Google Scholar
- [60]An effector peptide family required for drosophila toll-mediated immunityPLoS Pathogens 11:e1004876https://doi.org/10.1371/journal.ppat.1004876Google Scholar
- [61]Ecological drift during colonization drives within-host and between-host heterogeneity in an animal-associated symbiontPlos Biology 22:e3002304https://doi.org/10.1371/journal.pbio.3002304Google Scholar
Article and author information
Author information
Version history
- Preprint posted:
- Sent for peer review:
- Reviewed Preprint version 1:
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.107122. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2025, Pessoa et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
- views
- 145
- downloads
- 0
- citations
- 0
Views, downloads and citations are aggregated across all versions of this paper published by eLife.