Figures and data

Selected model shapes and dropout probabilities.
The release name is the name under which the trained model is released in the GitHub repository; upon publication of this paper we will release V1.0. The paper name is the name used for the model in this manuscript, and describes more about its architecture. “Kernel”: the size of the convolutional kernel used in the model. “Embed”: the size of the embedding used for each 3-mer. Because the 3-mer embedding contributes one additional base on either side of the convolutional window, a model with kernel size 9 is effectively an 11-mer model, and a model with kernel size 11 is effectively a 13-mer model.
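The kernel-to-k-mer arithmetic above can be written down directly. The following is an illustrative sketch (the function name is ours, not taken from the released code):

```python
def effective_kmer(kernel_size: int) -> int:
    """Effective k-mer width of a thrifty model.

    Each position is embedded as a 3-mer, which contributes one extra
    base on either side of the convolutional window, so a kernel of
    width k sees k + 2 bases.
    """
    return kernel_size + 2

# The kernel sizes from the table above:
print(effective_kmer(9))   # kernel 9  -> effectively an 11-mer model; prints 11
print(effective_kmer(11))  # kernel 11 -> effectively a 13-mer model; prints 13
```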

Performance evaluation on held-out briney data for S5F, DeepSHM, and thrifty models.
The ∗ on S5F indicates that this model was trained using synonymous mutations on a data set distinct from those considered here. The † on tang indicates that this is the tang data but with a different preprocessing scheme (Tang et al., 2022).

Data used in this paper.
briney data is from Briney et al., 2019, after processing by Spisak et al., 2020. tang data is from Vergani et al., 2017; Tang et al., 2020, and was sequenced using the methods of Vergani et al., 2017. Out-of-frame sequences from briney and tang are used. jaffe data is from Jaffe et al., 2022, sequenced using 10X, where only 4-fold synonymous sites of productive sequences are used. The “samples” column is the number of individual samples in the dataset; in these datasets, each sample is from a distinct individual. “Clonal families” is the number of clonal families in the dataset. “PCPs” is the number of parent-child pairs in the dataset. “Median mutations” is the median number of mutations per PCP in the dataset.

(a) Overview of data processing and objective. Out-of-frame sequences are clustered into clonal families. Trees are built on clonal families, and then ancestral sequences are reconstructed using a simple sequence model. The prediction task is to predict the location and identity of mutations in child sequences given parent sequences. (b) Strategy for “thrifty” CNNs with relatively few parameters. We use a trainable embedding of each 3-mer into a vector space; downstream convolutions operate on sequential collections of these embeddings. The “width” of the k-mer model is determined by the size of the convolutional kernel, which in this cartoon is 3. Because each 3-mer contributes one base on either side of the convolutional window, a kernel of width 3 spans 5 bases, giving effectively a 5-mer model. For the sake of simplicity, the probability distribution of the new base conditioned on there being a substitution (which we call the conditional substitution probability or CSP) is not shown. The CSP output can emerge in several ways (Figure 1—figure supplement 1).
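The pipeline in panel (b) can be sketched in NumPy. This is a minimal illustration under our own assumptions (layer sizes, names, and the exponential link are ours, not the released implementation): embed each overlapping 3-mer, then convolve across the embeddings to produce a positive per-site mutation rate.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, KERNEL = 7, 3            # kernel 3 -> effectively a 5-mer model
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
embedding = rng.normal(size=(4 ** 3, EMBED_DIM))     # one row per 3-mer
conv_weights = rng.normal(size=(KERNEL, EMBED_DIM))  # single output channel

def kmer_index(seq, i):
    """Index (0..63) of the 3-mer centered at position i."""
    return sum(BASES[seq[i + o]] * 4 ** (o + 1) for o in (-1, 0, 1))

def per_site_rate(seq):
    """Per-site mutation rates for the interior sites of seq."""
    # One embedding vector per interior position (a 3-mer needs a flanking base).
    embedded = np.array([embedding[kmer_index(seq, i)]
                         for i in range(1, len(seq) - 1)])
    half = KERNEL // 2
    rates = []
    for i in range(half, len(embedded) - half):
        window = embedded[i - half:i + half + 1]   # KERNEL consecutive 3-mers
        # Each output therefore depends on KERNEL + 2 = 5 bases.
        rates.append(np.exp(np.sum(window * conv_weights)))
    return np.array(rates)

rates = per_site_rate("ACGTACGTAC")  # 10 bases -> 8 3-mers -> 6 rates
```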
Figure 1—figure supplement 1. Strategies for estimating both per-site rate and CSPs.

Model predictive performance on held-out samples: held-out individuals from the briney data (upper row) and a separate sequencing experiment (tang data, lower row). The integer in parentheses indicates the number of parameters of the model. Each model has multiple points, each corresponding to an independent model training. For clarity, this is a subset of the models; see Figure 2—figure supplement 1 for all models.
Figure 2—figure supplement 1. Performance results for all the models.
Figure 2—figure supplement 2. Agreement between the original “shmoof” model of Spisak et al., 2020 and our “reshmoof” reimplementation.
Figure 2—figure supplement 3. Performance comparison including S5F model with original coefficients.

Model fit on held-out samples: held-out individuals from the briney data (upper row) and a separate sequencing experiment (tang data, lower row). Observed mutations in held-out data are placed in bins according to their predicted probability of mutation. For every bin, the points show the observed number of mutations, while the bars show the expected number of mutations. The overlap metric is the area of the intersection of the observed and expected divided by the average of the two areas.
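Under one plausible reading of the overlap metric (a sketch, not the authors' code): treat the observed and expected bin counts as areas, take the bin-wise minimum as the intersection, and normalize by the average of the two totals.

```python
import numpy as np

def overlap(observed, expected):
    """Intersection area divided by the average of the two areas."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    intersection = np.minimum(observed, expected).sum()
    average_area = (observed.sum() + expected.sum()) / 2
    return intersection / average_area

# Identical histograms overlap perfectly:
print(overlap([5, 10, 3], [5, 10, 3]))  # prints 1.0
```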
Comparison of held-out log likelihoods between the models.

The relative change in performance for each statistic: the statistic for the model trained with out-of-frame (OOF) data and synonymous data, minus the statistic for the model trained with OOF data only, divided by the statistic for the model trained with OOF data only. Thus, adding synonymous mutations to the training set does not help prediction on held-out out-of-frame data.
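The statistic described above reduces to a one-line relative change (the function and argument names are illustrative, not from the released code):

```python
def relative_change(stat_oof_plus_syn: float, stat_oof_only: float) -> float:
    """Relative change from adding synonymous mutations to the training set."""
    return (stat_oof_plus_syn - stat_oof_only) / stat_oof_only

# A statistic that drops from 0.50 to 0.48 after adding synonymous
# data corresponds to a relative change of about -4%:
change = relative_change(0.48, 0.50)
```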
Figure 4—figure supplement 1. S5F performs better for mutation position prediction on synonymous mutations in the jaffe data than any model trained on out-of-frame data.

Our three strategies for estimating the per-site rate and CSPs. In the joined version, the two outputs come directly out of the convolutional layer. In the hybrid version, the two outputs share the embedding layer. In the independent version, the two outputs are estimated separately.

Performance results for all the models.

There is good agreement between the originally inferred shmoof coefficients and our reimplementation, both in the motif mutability terms and the per-position mutabilities. The primary exception is that we infer more reasonable values when sequencing coverage is weak or absent, and avoid an extreme value at site 67. These differences are due to a slight regularization of the per-position mutabilities. This analysis can be reproduced using the reshmoof.ipynb notebook.

Performance plot including the original S5F model (vertical black lines) for held-out individuals from the briney data (upper row) and a separate sequencing experiment (tang data, lower row).

Comparison of held-out log likelihoods between the models.

S5F performs better for mutation position prediction on synonymous mutations in the jaffe data than any model trained on out-of-frame data. The performance plot is as before, but the black vertical bar indicates the performance of the S5F model, and all models are evaluated on synonymous mutations in the jaffe data.