Overview of the workflow for this paper.

After specifying a birth-death model with a sigmoid affinity-fitness response function, we simulate many trees and their sequences at each node, with parameters roughly consistent with our real data samples. The simulation is cell-based and implements a carrying-capacity population size limit. The results of the simulation are then encoded and used to train a neural network that infers the sigmoid response parameters on real data. In addition to encoded trees, the network also takes as input the assumed values of several non-sigmoid parameters (carrying capacity, initial population size, and death rate). Inference on real data is performed many times, with many different combinations of non-sigmoid parameter values, and an additional “data mimic” simulation is generated using each of the resulting inferred parameter value combinations. The summary statistics of each data mimic sample are compared to real data, with the best match selected as the “central data mimic” sample with final parameter values for both sigmoid (inferred with the neural network) and non-sigmoid (inferred by matching summary statistics) parameters. Figure 1—figure supplement 1. Simulation response function example with sampled affinity values. Figure 1—figure supplement 2. Diagram of curve difference loss function calculation. Figure 1—figure supplement 3. Example of approximate sigmoid parameter degeneracy.

Simulation parameter values for training sample (left column, in which each GC has different parameters), and the central data mimic sample (right column, in which all GCs have the same, data-inferred parameters).

Square brackets indicate a range, from which values are chosen either as detailed below (for sigmoid parameters) or uniformly at random.

Neural network architecture.

Clip function bounds for neural network training.

Training and testing results for the sigmoid model on simulation.

We show curve difference loss distributions on several subsets of the training sample (where each GC has different parameter values): training and validation (left) and testing (right). For computational efficiency when plotting, the curve difference distributions display only the first 1000 values. See Figure 3 for the per-bin model.

Training and testing results for the per-bin model on simulation.

See Figure 2 for details.

Inferred response functions on real data

for sigmoid (left) and per-bin (right) models, corresponding to the non-sigmoid parameter values yielding simulation with the best-matching summary statistics. The medoid curve is shown in orange with 68% and 95% confidence intervals in blue, with observed affinity values in grey. Figure 4—figure supplement 1. Example inferred sigmoid curves on data for four representative GCs. Figure 4—figure supplement 2. Example inferred per-bin response functions on data for four representative GCs.
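As an illustration of what a medoid curve is, the following minimal sketch picks, from a set of curves sampled on a shared affinity grid, the one with the smallest summed distance to all the others. The distance used here (mean absolute difference) and the names `mean_abs_distance` and `medoid_index` are assumptions for illustration; the caption does not specify the metric used in the paper.

```python
def mean_abs_distance(curve_a, curve_b):
    """Mean absolute difference between two curves sampled on the same grid."""
    return sum(abs(a - b) for a, b in zip(curve_a, curve_b)) / len(curve_a)


def medoid_index(curves):
    """Index of the curve with the smallest summed distance to all other curves."""
    totals = [sum(mean_abs_distance(c, other) for other in curves) for c in curves]
    return min(range(len(curves)), key=totals.__getitem__)


# Three toy curves on a shared three-point grid (illustrative values only).
toy_curves = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]
medoid = toy_curves[medoid_index(toy_curves)]
```

Unlike a pointwise mean or median, the medoid is always one of the actual inferred curves, so it retains a valid functional shape.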

Summary statistics on data vs simulation

for the central data mimic simulation sample that most closely mimics inferred data parameters. Simulation truth (dashed green) is unobservable and shown only for completeness; the important comparison is between purple and green solid lines, where both data and simulation have been run through IQ-TREE. Figure 5—figure supplement 1. Additional summary statistics distributions for the same central data mimic sample as the main figure. Figure 5—figure supplement 2. Summary statistic distributions for the simulation sample used for training.

Inferred sigmoid curves for the central data mimic simulation sample.

The medoid curve is shown in orange, and true curve in green. This sample consists of 120 GCs all simulated with the same sigmoid parameters (from the central data prediction) and non-sigmoid parameters (from the best-matched summary statistics).

Effective birth rates (left column) calculated for three randomly selected GCs from the central data mimic simulation sample.

The effective birth rate for cell i is defined as mλ_i − μ_i, where m is the carrying capacity modulating factor, λ_i is the intrinsic birth rate, and μ_i is the death rate. It is shown at several different time values for all living cells, except that, for plotting clarity, cells closer to each other than 0.1 in affinity are not shown. Note that we extend time here to 50 days to aid comparison to the bulk data used by the traveling wave model (DeWitt et al., 2025, Fig. S6D), which assumes a steady state. (Our extracted GC data is sampled at 15 and 20 days.) Vertical dashed lines (left and middle columns) mark the point at which the slope has dropped to half its maximum value (if absent, the slope never falls below this threshold). We also show the intrinsic birth rate λ (middle column, solid red curve), and include histograms of sampled affinities at the final time point. The right column shows time traces of the number of living cells (i.e. nodes, in blue) and the mean affinity of all living cells (red).
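The quantities in this caption can be sketched in a few lines. This is a hedged illustration only: it assumes the effective rate takes the form m·λ_i − μ_i, and it locates the dashed-line position numerically as the first grid point past the steepest part of a sampled curve where the finite-difference slope has fallen to half its maximum. The function names, the grid spacing, and the unit logistic test curve are all hypothetical choices, not taken from the paper.

```python
import math


def effective_birth_rate(m, intrinsic_birth_rate, death_rate):
    """Assumed form m * lambda_i - mu_i for a single cell."""
    return m * intrinsic_birth_rate - death_rate


def half_max_slope_point(xs, ys):
    """x past the steepest point where the slope first falls to half its maximum.

    Returns None when the slope never falls to that threshold on the grid.
    """
    slopes = [(ys[k + 1] - ys[k]) / (xs[k + 1] - xs[k]) for k in range(len(xs) - 1)]
    mids = [0.5 * (xs[k] + xs[k + 1]) for k in range(len(xs) - 1)]
    k_max = max(range(len(slopes)), key=slopes.__getitem__)
    threshold = 0.5 * slopes[k_max]
    for k in range(k_max + 1, len(slopes)):
        if slopes[k] <= threshold:
            return mids[k]
    return None


# Toy check on a unit logistic curve, sampled every 0.01 in affinity.
grid = [-5.0 + 0.01 * k for k in range(1001)]
logistic = [1.0 / (1.0 + math.exp(-x)) for x in grid]
marker = half_max_slope_point(grid, logistic)
```

For the unit logistic curve the marker lands near 1.76, close to the analytic value ln(3 + 2√2). A curve whose slope never drops to half its maximum returns `None`, which corresponds to the case in which no dashed line is drawn.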

Example of simulation response function (red, left axis) and resulting node affinity values (histogram, right axis).

The response function describes the relationship between affinity and fitness, while the node values show the actual affinity values of nodes in the resulting simulation for both internal (blue) and leaf (orange) nodes. Note that the histogram counts are not samples of the response function, so we do not expect the histogram and the function to match.

Curve difference loss function calculation.

We divide the “difference” area (red bars) between the true (green) and inferred (red) curves by the area (light red) under the true curve within the bounds [−2.5, 3].
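The ratio described above can be sketched numerically as follows. This is a minimal illustration, assuming both curves are available as functions of affinity, that the true curve is positive over the bounds, and that the areas are approximated with midpoint bars; the name `curve_difference_loss` and the number of bars are hypothetical.

```python
def curve_difference_loss(true_fn, inferred_fn, lo=-2.5, hi=3.0, n_bars=1000):
    """Area between two curves divided by the area under the true curve."""
    width = (hi - lo) / n_bars
    diff_area = 0.0
    true_area = 0.0
    for k in range(n_bars):
        x = lo + (k + 0.5) * width  # midpoint of bar k
        true_val = true_fn(x)
        diff_area += abs(true_val - inferred_fn(x)) * width
        true_area += true_val * width  # assumes a positive true curve
    return diff_area / true_area


# Toy example: a flat true curve at 2.0 and an inferred curve 10% too high.
toy_loss = curve_difference_loss(lambda x: 2.0, lambda x: 2.2)
```

Because the difference is normalized by the area under the true curve, the loss reads as a fractional error: the toy example gives a value of 0.1, i.e. 10%.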

Example of approximate sigmoid parameter degeneracy.

A potential pair of true (green) and inferred (blue) sigmoid curves, with true (inferred) parameter values. In the range shown, the inferred curve has compensated for a too-small xscale (transition steepness) by increasing xshift (shifting to the right) and yscale (upper asymptote). This results in a curve difference loss value of only 9%, which is much smaller than our expected inference accuracy.
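To make the roles of the three parameters concrete, here is a minimal sketch of a sigmoid with the parameter meanings given above. The exact functional form is an assumption (a standard logistic curve with lower asymptote zero); the paper's own definition may differ, and the parameter values below are purely illustrative.

```python
import math


def sigmoid_response(affinity, xscale, xshift, yscale):
    """Logistic curve: xscale sets steepness, xshift the midpoint, yscale the upper asymptote."""
    return yscale / (1.0 + math.exp(-xscale * (affinity - xshift)))


# Hypothetical parameter sets: the second is shallower, shifted right, and taller.
params_a = dict(xscale=2.0, xshift=1.0, yscale=4.0)
params_b = dict(xscale=1.5, xshift=1.3, yscale=5.0)
curve_a = [sigmoid_response(0.5 * k, **params_a) for k in range(-5, 7)]
curve_b = [sigmoid_response(0.5 * k, **params_b) for k in range(-5, 7)]
```

In this form the curve passes through yscale/2 at affinity = xshift and has slope xscale·yscale/4 there, so over a limited affinity range a shallower transition can be partly offset by a later midpoint and a higher asymptote.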

Example inferred sigmoid curves on data

for four representative GCs.

Example inferred per-bin response functions on data

for four representative GCs.

Additional summary statistics distributions for the same central data mimic sample as the main figure.

Summary statistic distributions for the simulation sample used for training.

To allow easy comparison of abundance distributions, these plots include only 120 of the simulated trees. This sample is designed to have a wide range of parameters, encompassing all plausible true data values, and is thus not designed to closely match data summary statistics.