Figures and data

Overview of the workflow for this paper.
After specifying a birth-death model with a sigmoid affinity-fitness response function, we simulate many trees and the sequences at each of their nodes, with parameters roughly consistent with our real data samples. The simulation is cell-based and implements a carrying-capacity limit on population size. The results of the simulation are then encoded and used to train a neural network that infers the sigmoid response parameters on real data. In addition to encoded trees, the network also takes as input the assumed values of several non-sigmoid parameters (carrying capacity, initial population size, and death rate). Inference on real data is performed many times, with many different combinations of non-sigmoid parameter values, and an additional “data mimic” simulation is generated using each of the resulting inferred parameter value combinations. The summary statistics of each data mimic sample are compared to real data, with the best match selected as the “central data mimic” sample, which provides the final values for both the sigmoid parameters (inferred with the neural network) and the non-sigmoid parameters (inferred by matching summary statistics). Plots from elsewhere in the manuscript are rendered in schematic form: those in “infer on data” refer to Figure 4—figure Supplement 1, and those in “simulate with inferred parameters” to Figure 5.
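The scan over non-sigmoid parameter values described in this caption can be sketched as a simple loop. This is a minimal illustration of the control flow only, not the authors' code: `infer_sigmoid`, `simulate`, and `mismatch` are hypothetical callables standing in for the neural network inference, the data mimic simulation, and the summary statistic comparison, and all values in the toy usage are invented.

```python
from itertools import product


def select_central_data_mimic(grid, infer_sigmoid, simulate, mismatch):
    """Scan non-sigmoid parameter combinations and keep the best-matching one.

    All three callables are hypothetical stand-ins (assumptions):
      infer_sigmoid(non_sigmoid)     -> sigmoid parameters inferred on real data
      simulate(sigmoid, non_sigmoid) -> a "data mimic" sample
      mismatch(sample)               -> number; smaller means closer to real data
    """
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        non_sigmoid = dict(zip(names, values))
        sigmoid = infer_sigmoid(non_sigmoid)
        sample = simulate(sigmoid, non_sigmoid)
        score = mismatch(sample)
        # Keep the combination whose data mimic sample best matches real data.
        if best is None or score < best["score"]:
            best = {"score": score, "sigmoid": sigmoid, "non_sigmoid": non_sigmoid}
    return best


# Toy usage with invented values, only to show the control flow.
toy_grid = {"carrying_capacity": [500, 1000], "death_rate": [0.1, 0.2]}
toy_best = select_central_data_mimic(
    toy_grid,
    infer_sigmoid=lambda ns: {"xscale": 1.0, "xshift": 0.0, "yscale": 1.0},
    simulate=lambda sig, ns: ns,  # pretend the sample is just its parameters
    mismatch=lambda s: abs(s["carrying_capacity"] - 1000) + abs(s["death_rate"] - 0.2),
)
```

In this sketch the returned dictionary pairs the sigmoid parameters with the non-sigmoid combination that produced the lowest mismatch, mirroring how the central data mimic sample carries both sets of final values.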

Simulation parameter values for the sample used for training (left column, in which each GC has different parameters), and the “central data mimic” sample used to evaluate final results (right column, in which all GCs have the same, data-inferred parameters).
Square brackets indicate a range, from which values are chosen either as detailed in the text (for sigmoid parameters) or uniformly at random. The mutability multiplier is an empirical factor that modulates the intensity of mutation to more closely match the speed of evolution to that observed in data.

Neural network architecture.

Clip function bounds for neural network training.

Training and testing results for the sigmoid model on simulation.
We show curve difference loss distributions on several subsets of the training sample (where each GC has different parameter values): training and validation (left) and testing (right). For computational efficiency when plotting, the curve difference distributions display only the first 1000 values. See Figure 3 for the per-bin model.

Training and testing results for the per-bin model on simulation.
See Figure 2 for details.

Inferred response functions on real data
for sigmoid (left) and per-bin (right) models, corresponding to the non-sigmoid parameter values yielding simulation with the best-matching summary statistics. The medoid curve is shown in orange with 68% and 95% confidence intervals in blue, with observed affinity values in grey.
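For readers unfamiliar with the term, a medoid is the member of a set that has the smallest total distance to all other members. The sketch below picks a medoid from curves sampled on a shared affinity grid. The distance used here (mean absolute difference between curves) is an assumption made for illustration; the caption does not state which distance is used.

```python
import numpy as np


def medoid_index(curves):
    """Return the index of the medoid among curves sampled on a common grid.

    curves: array-like of shape (n_curves, n_grid_points).
    Distance between two curves is their mean absolute difference
    (an assumed choice, not taken from the source).
    """
    curves = np.asarray(curves, dtype=float)
    # pairwise[i, j] = mean |curve_i - curve_j| over the grid
    pairwise = np.abs(curves[:, None, :] - curves[None, :, :]).mean(axis=2)
    # The medoid minimizes the summed distance to all other curves.
    return int(np.argmin(pairwise.sum(axis=1)))
```

Unlike a pointwise mean or median, the medoid is always one of the actual inferred curves.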

Summary statistics on data vs simulation
for the central data mimic simulation sample that most closely mimics inferred data parameters. Simulation truth (dashed green) is unobservable and shown only for completeness; the important comparison is between purple and green solid lines, where both data and simulation have been run through IQ-TREE.

Inferred sigmoid curves for the central data mimic simulation sample.
The medoid curve is shown in orange, and the true curve in green. This sample consists of 120 GCs all simulated with the same sigmoid parameters (from the central data prediction) and non-sigmoid parameters (from the best-matched summary statistics).

Effective birth rates (left column) calculated for three GCs simulated with the central data mimic parameters.
The effective birth rate for cell i is defined as mλi − μi for the carrying capacity modulating factor m, intrinsic birth rate λi, and death rate μi. It is shown at several different time values for all living cells, except that for plotting clarity, cells closer to each other than 0.1 in affinity are not shown. Note that we extend time here to 50 days in order to aid comparison to the bulk data used by the traveling wave model (DeWitt et al., 2025, Fig. S6D), which assumes a steady state. (Our extracted GC data is sampled at 15 and 20 days.) Vertical lines (left and middle columns) mark the point at which the slope has dropped to half its maximum value (if absent, the slope never falls below this threshold). We also show the intrinsic birth rate λi (middle column, solid red curve), and include histograms of sampled affinities at the final time point. The right column shows time traces of the number of living cells (i.e. nodes, in blue) and the mean affinity of all living cells (red).
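The two operations named in this caption can be sketched as follows. The first function evaluates mλi − μi directly, taking the modulating factor m as a given input (how m depends on population size is not specified here). The second is one possible reading of the plotting rule, a greedy pass in affinity order that keeps a cell only if it lies at least 0.1 from the last kept cell; that thinning rule and both function names are assumptions.

```python
import numpy as np


def effective_birth_rate(m, birth_rates, death_rates):
    """Effective birth rate m * lambda_i - mu_i for each living cell."""
    birth_rates = np.asarray(birth_rates, dtype=float)
    death_rates = np.asarray(death_rates, dtype=float)
    return m * birth_rates - death_rates


def thin_for_plotting(affinities, min_gap=0.1):
    """Indices of cells to plot, dropping cells closer than min_gap in affinity.

    Assumed rule: walk cells in increasing affinity and keep a cell only if
    it is at least min_gap above the most recently kept cell.
    """
    affinities = np.asarray(affinities, dtype=float)
    kept = []
    for idx in np.argsort(affinities):
        if not kept or affinities[idx] - affinities[kept[-1]] >= min_gap:
            kept.append(int(idx))
    return kept
```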

The net growth rate from DeWitt et al. (2025, Fig. S6D) (left) compared to our effective birth rate (right)
for three different “target” mean affinity values. We selected three well-spaced timepoints from Fig. S6D, and matched the corresponding mean affinity values from the bulk data to particular time slices in 25 GCs simulated with our central data mimic parameters (inferred on the extracted GC data) out to time 70. The net growth rate describes the net population growth at affinity x and time t in a traveling wave fitness model (see text). In our model, the effective birth rate is defined and plotted as in Figure 7. We also show stacked histograms of observed affinity values from the bulk data (left) and the 25 simulated GCs (not from the extracted GC data) (right). The measured affinity values from the extracted GC data (which extends only to time 20) are shown in Figure 4.

Example of simulation response function (red, left axis) and resulting node affinity values (histogram, right axis).
The response function describes the relationship between affinity and fitness, while the node values show the actual affinity values of nodes in the resulting simulation for both internal (blue) and leaf (orange) nodes. Note that the count is not sampling the response function, so we do not expect the histogram and the function to match.

Curve difference loss function calculation.
We divide the “difference” area (red bars) between the true (green) and inferred (red) curves by the area (light red) under the true curve within the bounds [-2.5, 3]. The loss value is thus the fraction of the area under the true curve represented by the area between the true and inferred curves.
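The calculation in this caption can be written as a ratio of two numerical integrals over the affinity range [-2.5, 3]. The sketch below is a minimal version under stated assumptions: the area between the curves is taken as the integral of their absolute difference, integration uses the trapezoid rule on a uniform grid, and the function names and grid size are illustrative rather than taken from the source.

```python
import numpy as np


def trapezoid_area(y, x):
    """Trapezoid-rule integral of samples y over grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def curve_difference_loss(true_curve, inferred_curve, lo=-2.5, hi=3.0, n_points=1001):
    """Area between the two curves divided by the area under the true curve.

    true_curve and inferred_curve are callables mapping an array of affinity
    values to an array of response values.
    """
    x = np.linspace(lo, hi, n_points)
    y_true = np.asarray(true_curve(x), dtype=float)
    y_inferred = np.asarray(inferred_curve(x), dtype=float)
    difference_area = trapezoid_area(np.abs(y_true - y_inferred), x)
    true_area = trapezoid_area(y_true, x)
    return difference_area / true_area
```

A loss of 0 means the curves coincide on the interval, and a loss of 0.5 means the area between them is half the area under the true curve.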

Example of approximate sigmoid parameter degeneracy.
A potential pair of true (green) and inferred (blue) sigmoid curves, with true (inferred) parameter values. In the range shown, the inferred curve has compensated for a too-small xscale (transition steepness) by increasing xshift (shifting to the right) and yscale (upper asymptote). This results in a curve difference loss value of only 9%, which is much smaller than our expected inference accuracy.
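The compensation described here can be reproduced numerically with a toy pair of curves. The sketch below assumes a three-parameter sigmoid of the form yscale / (1 + exp(-xscale (x - xshift))), which is consistent with the parameter descriptions in this caption but is not confirmed by it. The parameter values are invented for illustration and are not the values shown in the figure.

```python
import numpy as np


def sigmoid(x, xscale, xshift, yscale):
    # Assumed parameterization: xscale sets steepness, xshift moves the
    # transition to the right, yscale is the upper asymptote.
    return yscale / (1.0 + np.exp(-xscale * (x - xshift)))


# Uniform affinity grid over the range used for the curve difference loss.
x = np.linspace(-2.5, 3.0, 1001)

# Invented example: the "inferred" curve has a smaller xscale, offset by a
# larger xshift and a larger yscale.
true_y = sigmoid(x, xscale=1.0, xshift=0.0, yscale=1.0)
inferred_y = sigmoid(x, xscale=0.8, xshift=0.3, yscale=1.15)

# On a uniform grid the ratio of means approximates the ratio of areas.
loss = np.mean(np.abs(true_y - inferred_y)) / np.mean(true_y)
max_gap = np.max(np.abs(true_y - inferred_y))
print(f"curve difference loss: {loss:.3f}, largest pointwise gap: {max_gap:.3f}")
```

Although all three parameters differ noticeably between the two curves, they stay close to each other everywhere in the plotted range, which is the sense in which the parameters are approximately degenerate.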

Example true and inferred response functions on the training sample.
The “diff” entry in the per-plot table gives the curve difference loss.

Example inferred sigmoid curves on data
for four representative GCs.

Example inferred per-bin response functions on data
for four representative GCs.

Additional summary statistics distributions for the same central data mimic sample as the main figure.

Summary statistic distributions for the simulation sample used for training.
In order to allow easy comparison of abundance distributions, these plots include only 120 of the simulated trees. This sample is designed to have a wide range of parameters, encompassing all plausible true data values, and is thus not designed to closely match data summary statistics.