Figures and data

(a) Overview of data processing and objective. Out-of-frame sequences are clustered into clonal families. Trees are built on the clonal families, and ancestral sequences are then reconstructed using a simple sequence model. The prediction task is to predict the location and identity of mutations in child sequences given their parent sequences. (b) Strategy for “thrifty” CNNs with relatively few parameters. We map each 3-mer to a trainable embedding; downstream convolutions operate on sequences of these embeddings. The “width” of the k-mer model is determined by the size of the convolutional kernel, which in this cartoon is 3. This effectively gives a 5-mer model, because the 3-mer embedding adds one base on either side of a convolution of length 3. For simplicity, the probability distribution of the new base conditioned on there being a substitution (which we call the conditional substitution probability, or CSP) is not shown. The CSP output can be produced in several ways (Figure 1—figure supplement 1).
Figure 1—figure supplement 1. Strategies for estimating both per-site rate and CSPs.
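As an illustration of the prediction targets in panel (a), the following minimal sketch extracts the location and identity of each substitution from one parent-child pair. The function name `pcp_targets`, the assumption that the two sequences are already aligned to equal length, and the choice to skip ambiguous `N` bases are all hypothetical conventions for this example, not a description of the released code.

```python
def pcp_targets(parent, child):
    """Return (position, parent_base, child_base) for each substitution.

    Assumes parent and child are aligned nucleotide strings of equal length
    (an assumption of this sketch).
    """
    if len(parent) != len(child):
        raise ValueError("parent and child must be aligned to the same length")
    return [
        (i, p, c)
        for i, (p, c) in enumerate(zip(parent, child))
        # Skip ambiguous bases; this is an assumed convention, not the paper's.
        if p != c and p != "N" and c != "N"
    ]


if __name__ == "__main__":
    for position, old, new in pcp_targets("ACGTAC", "ACTTAA"):
        print(f"site {position}: {old} -> {new}")
```

The positions correspond to what a per-site rate is meant to predict, and the new bases to what the CSP is meant to predict.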

Selected model shapes and dropout probabilities.
The release name of the model is the name of the trained model released in the GitHub repository. The paper name is the name of the model used in this manuscript, which describes more about its architecture. “Kernel”: the size of the convolutional kernel used in the model. “Embed”: the size of the embedding used for each 3-mer. Because there is one additional base on either side of a 3-mer, a model with kernel size 9 is effectively an 11-mer model, and a model with kernel size 11 is effectively a 13-mer model. The “Medium” and “Large” labels in the paper name designate the settings for Kernel, Embed, Filters, and Dropout.
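The kernel-size arithmetic above can be written as a one-line helper. This is a sketch only; the name `effective_kmer_width` and the `token_width` parameter are introduced here for illustration and do not come from the released code.

```python
def effective_kmer_width(kernel_size, token_width=3):
    """Number of bases seen by one convolution output.

    A kernel spanning `kernel_size` adjacent 3-mer embeddings covers
    `kernel_size` sites plus one extra base on either side.
    """
    if kernel_size < 1 or token_width < 1:
        raise ValueError("kernel_size and token_width must be positive")
    return kernel_size + token_width - 1


if __name__ == "__main__":
    for kernel in (3, 9, 11):
        print(kernel, "->", effective_kmer_width(kernel))
```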

Model predictive performance on held-out samples: held-out individuals from the briney data (upper row) and on a separate sequencing experiment (tang data, lower row).
Note that the two rows use different x-axis scales. The integer in parentheses indicates the number of parameters in the model. Each model has multiple points, each corresponding to an independent training run. Only a subset of the models is shown for clarity; see Figure 2—figure supplement 1 for all models.
Figure 2—figure supplement 1. Performance results for all the models.
Figure 2—figure supplement 2. Agreement between the original “shmoof” model of Spisak et al. (2020) and our “reshmoof” reimplementation.
Figure 2—figure supplement 3. Performance comparison including S5F model with original coefficients.

Performance evaluation on held-out briney data for S5F, DeepSHM, and thrifty models.
The * on S5F indicates that this model was trained using synonymous mutations on a data set distinct from those considered here. The † on tang† signifies that this is the tang data with a different preprocessing scheme (Tang et al., 2022).

Data used in this paper.
The briney data is from Briney et al. (2019), after processing by Spisak et al. (2020). The tang data is from Vergani et al. (2017) and Tang et al. (2020), and was sequenced using the methods of Vergani et al. (2017). Out-of-frame sequences from briney and tang are used. For productive sequences from tang, only 4-fold synonymous sites are used. The jaffe data is from Jaffe et al. (2022), sequenced using 10X; only 4-fold synonymous sites of productive sequences are used. The “read/cell” column is the sequencing depth, listed as the average number of reads per cell. “Samples” is the number of individual samples in the dataset; in these datasets, each sample is from a distinct individual. “CFs” is the number of clonal families in the dataset. “PCPs” is the number of parent-child pairs in the dataset. “Median mutations” is the median number of mutations per PCP in the dataset.
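To make “4-fold synonymous sites” concrete, the sketch below lists the third codon positions at which any nucleotide change leaves the amino acid unchanged under the standard genetic code. It assumes the sequence starts at the first base of a codon; the function name and this framing convention are illustrative assumptions, and the preprocessing used for the datasets above may differ in detail.

```python
# Two-base codon prefixes whose third position is 4-fold degenerate in the
# standard genetic code (Leu CTN, Val GTN, Ser TCN, Pro CCN, Thr ACN,
# Ala GCN, Arg CGN, Gly GGN).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_sites(seq):
    """Return 0-based indices of 4-fold synonymous sites.

    Assumes `seq` is in frame, starting at codon position 1 (an assumption
    of this sketch). A trailing partial codon is ignored.
    """
    sites = []
    for start in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[start:start + 3].upper()
        if codon[:2] in FOURFOLD_PREFIXES:
            sites.append(start + 2)
    return sites


if __name__ == "__main__":
    # Codons: ATG, GCT, TTA, CGA
    print(fourfold_sites("ATGGCTTTACGA"))
```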

Model fit on held-out samples: held-out individuals from the briney data (upper row) and on a separate sequencing experiment (tang data, lower row).
Observed mutations in held-out data are placed in bins according to their probability of mutation. For every bin, the points show the observed number of mutations in that bin, while the bars show the expected number of mutations in that bin. The overlap metric is the area of the intersection of the observed and expected distributions divided by the average of their two areas.
Figure 3—figure supplement 1. Comparison of held-out log likelihoods between the models.
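One possible reading of the overlap metric described in this caption is sketched below. It assumes that every site is assigned to an equal-width bin by its predicted mutation probability, that the expected count in a bin is the sum of those probabilities, and that “area” means the sum of per-bin counts. The binning scheme and function names are assumptions of this sketch, not taken from the paper's code.

```python
def binned_counts(probs, mutated, n_bins=10):
    """Observed and expected mutation counts per probability bin.

    probs:   predicted per-site mutation probabilities in [0, 1]
    mutated: booleans, True where the site was observed to mutate
    Equal-width bins are an assumption of this sketch.
    """
    observed = [0.0] * n_bins
    expected = [0.0] * n_bins
    for p, m in zip(probs, mutated):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        expected[b] += p
        observed[b] += 1.0 if m else 0.0
    return observed, expected


def overlap(observed, expected):
    """Intersection area divided by the average of the two areas."""
    intersection = sum(min(o, e) for o, e in zip(observed, expected))
    average_area = (sum(observed) + sum(expected)) / 2
    return intersection / average_area if average_area > 0 else 0.0


if __name__ == "__main__":
    obs, exp = binned_counts([0.05, 0.15, 0.95], [False, True, True])
    print(overlap(obs, exp))
```

Under this definition the metric equals 1 when observed and expected counts agree in every bin and 0 when they never fall in the same bin.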

The relative change in performance for each statistic: the statistic for the model trained with out-of-frame (OOF) and synonymous data, minus the statistic for the model trained with OOF data only, divided by the statistic for the model trained with OOF data only.
Adding synonymous mutations to the training set does not improve prediction on held-out out-of-frame data. Results are shown for briney data (upper row) and tang data (lower row).
Figure 4—figure supplement 1. S5F performs better for mutation position prediction on synonymous mutations in the jaffe data than any model trained on out-of-frame data.
Figure 4—figure supplement 2. S5F performs better for mutation position prediction on synonymous mutations in the tang data than any model trained on out-of-frame data.
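The relative change defined in this caption amounts to the following arithmetic. The function and argument names are placeholders chosen for this sketch.

```python
def relative_change(stat_oof_plus_syn, stat_oof_only):
    """(statistic with OOF + synonymous data - statistic with OOF only),
    divided by the statistic with OOF only."""
    if stat_oof_only == 0:
        raise ValueError("relative change is undefined when the baseline is 0")
    return (stat_oof_plus_syn - stat_oof_only) / stat_oof_only


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(relative_change(110.0, 100.0))
```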

Our three strategies for estimating the per-site rate and CSPs.
In the joined version, the two outputs come directly out of the convolutional layer. In the hybrid version, the two outputs share the embedding layer. In the independent version, the two outputs are estimated separately.

Performance results for all the models: held-out individuals from the briney data (upper row) and the tang data (lower row).
The “Small” thrifty models have settings (7, 6, 14, 0.1) for hyperparameters (Kernel, Embed, Filters, Dropout), as described in Table 1.

There is good agreement between the originally inferred shmoof coefficients and our re-implementation, both in the motif mutability terms and the per-position mutabilities.
The primary exception is that we infer more reasonable values where sequencing coverage is weak or absent, and we avoid an extreme value at site 67. These differences are due to slight regularization of the per-position mutabilities. This analysis can be reproduced using the reshmoof.ipynb notebook.

Performance plot including the original S5F model (vertical black lines) for held-out individuals from the briney data (upper row) and on a separate sequencing experiment (tang data, lower row).

Comparison of held-out log likelihoods between the models.

S5F performs better for mutation position prediction on synonymous mutations in the jaffe data than any model trained on out-of-frame data.
The performance plot is as before, but the black vertical bar indicates the performance of the S5F model, and all models are evaluated on synonymous mutations in the jaffe data.

S5F performs better for mutation position prediction on synonymous mutations in the tang data than any model trained on out-of-frame data.
The performance plot is as before, but the black vertical bar indicates the performance of the S5F model, and all models are evaluated on synonymous mutations in the tang data.