Thrifty wide-context models of B cell receptor somatic hypermutation
Figures

Overview of data processing and objective.
(a) Out-of-frame sequences are clustered into clonal families. Trees are built on clonal families and then ancestral sequences are reconstructed using a simple sequence model. The prediction task is to predict the location and identity of mutations of child sequences given parent sequences. (b) Strategy for ‘thrifty’ convolutional neural networks with relatively few parameters. We use a trainable embedding of each 3-mer into a space; downstream convolutions happen on sequential collections of these embeddings. The ‘width’ of the k-mer model is determined by the size of the convolutional kernel, which in this cartoon is 3. This would give us effectively a 5-mer model because the 3-mer model adds one base on either side of a convolution of length 3. For the sake of simplicity, the probability distribution of the new base conditioned on there being a substitution (which we call the conditional substitution probability [CSP]) is not shown. The CSP output can emerge in several ways (Figure 1—figure supplement 1).
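The context-width arithmetic in panel (b) can be checked with a minimal sketch. The function names, the `N` padding character, and the decision to pad sequence ends are illustrative assumptions, not details taken from the released code.

```python
def overlapping_kmers(seq, k=3, pad="N"):
    """Return the k-mer centered on each site, padding the ends with `pad`."""
    half = k // 2
    padded = pad * half + seq + pad * half
    return [padded[i:i + k] for i in range(len(seq))]


def effective_context_width(kernel_size, k=3):
    """Bases seen per output site by one convolution over centered k-mers."""
    # Each k-mer extends (k - 1) / 2 bases beyond its center on either side,
    # so the outermost k-mers under the kernel add k - 1 bases in total.
    return kernel_size + (k - 1)


kmers = overlapping_kmers("ACGTA")
widths = {ks: effective_context_width(ks) for ks in (3, 9, 11)}
```

With 3-mer embeddings, a kernel of size 3 covers 5 bases, which is the 5-mer case drawn in the cartoon.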

Strategies for estimating both per-site rate and conditional substitution probability (CSP).
Our three strategies for estimating the per-site rate and CSPs. In the joined version, the two outputs come directly out of the convolutional layer. In the hybrid version, the two outputs share the embedding layer. In the independent version, the two outputs are estimated separately.

Model predictive performance on held-out samples: held-out individuals from the briney data (upper row) and on a separate sequencing experiment (tang data, lower row).
Note that the two rows use different x-axis scales. The integer in parentheses indicates the number of parameters of the model. Each model has multiple points, each corresponding to an independent model training. Only a subset of the models is shown for clarity; see Figure 2—figure supplement 1 for all models.

Performance results for all the models.
Results are shown for held-out individuals from the briney data (upper row) and for the tang data (lower row). The ‘Small’ thrifty models have settings (7, 6, 14, 0.1) for hyperparameters (Kernel, Embed, Filters, Dropout), as described in Table 1.

Agreement between the original ‘shmoof’ model of Spisak et al., 2020 and our ‘reshmoof’ reimplementation.
There is good agreement between the originally inferred shmoof coefficients and our re-implementation, both in the motif mutability terms and the per-position mutabilities. The primary exception is that we infer more reasonable values when sequencing coverage is weak or absent and avoid an extreme value at site 67. These differences are due to slight regularization of the per-position mutabilities. This analysis can be reproduced using the reshmoof.ipynb notebook.

Performance comparison including S5F model with original coefficients.
Performance plot including the original S5F model (vertical black lines) for held-out individuals from the briney data (upper row) and on a separate sequencing experiment (tang data, lower row).

Model fit on held-out samples: held-out individuals from the briney data (upper row) and on a separate sequencing experiment (tang data, lower row).
Observed mutations in held-out data are placed in bins according to their predicted probability of mutation. For every bin, the points show the observed number of mutations in that bin, while the bars show the expected number of mutations in that bin. The overlap metric is the area of the intersection of the observed and expected histograms divided by the average of their two areas.
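One reading of the overlap metric can be sketched as follows. The equal-width bins, the bin count, and the function names are assumptions made for illustration; the binning used for the figure may differ.

```python
def binned_counts(probs, mutated, n_bins=10):
    """Observed and expected mutation counts per equal-width probability bin."""
    observed = [0.0] * n_bins
    expected = [0.0] * n_bins
    for p, is_mut in zip(probs, mutated):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        observed[b] += 1.0 if is_mut else 0.0
        expected[b] += p  # expected mutations = sum of predicted probabilities
    return observed, expected


def overlap(observed, expected):
    """Intersection of the two histograms divided by the mean of their areas."""
    intersection = sum(min(o, e) for o, e in zip(observed, expected))
    mean_area = (sum(observed) + sum(expected)) / 2
    return intersection / mean_area if mean_area > 0 else 0.0
```

Under this definition, identical histograms give an overlap of 1 and histograms with no shared mass give 0.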

Comparison of held-out log likelihoods between the models.

The relative change in performance for each statistic: the statistic for the model trained with out-of-frame (OOF) and synonymous data, minus the statistic for the model trained with OOF data only, divided by the statistic for the model trained with OOF data only.
Adding synonymous mutations to the training set does not improve prediction on held-out out-of-frame data. Results are shown for briney data (upper row) and tang data (lower row).
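The relative change described in this legend can be written compactly. The symbols are introduced here for illustration only: $s_{\mathrm{OOF}}$ is a statistic for the model trained on OOF data only, and $s_{\mathrm{OOF+syn}}$ is the same statistic for the model trained on OOF and synonymous data.

```latex
\Delta_{\mathrm{rel}}(s) = \frac{s_{\mathrm{OOF+syn}} - s_{\mathrm{OOF}}}{s_{\mathrm{OOF}}}
```

For a statistic where larger values are better, a value of $\Delta_{\mathrm{rel}}(s)$ at or below zero means the added synonymous data gave no improvement.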

S5F performs better for mutation position prediction on synonymous mutations in the jaffe data than any model trained on out-of-frame data.
The performance plot is as before, but the black vertical bar indicates the performance of the S5F model, and all models are evaluated on synonymous mutations in the jaffe data.

S5F performs better for mutation position prediction on synonymous mutations in the tang data than any model trained on out-of-frame data.
The performance plot is as before, but the black vertical bar indicates the performance of the S5F model, and all models are evaluated on synonymous mutations in the tang data.
Tables
Selected model shapes and dropout probabilities.
The release name of the model is the name of the trained model released in the GitHub repository. The paper name is the name of the model used in this article, which describes more about its architecture. ‘Kernel’: the size of the convolutional kernel used in the model. ‘Embed’: the size of the embedding used for each 3-mer. Because there is one additional base on either side of a 3-mer, a model with kernel size 9 is effectively an 11-mer model, and a model with kernel size 11 is effectively a 13-mer model. The ‘Medium’ and ‘Large’ labels in the paper name designate the settings for Kernel, Embed, Filters, and Dropout.
Release name: paper name | Kernel | Embed | Filters | Dropout | Params |
---|---|---|---|---|---|
ThriftyHumV0.2-20: CNN Joined Large | 11 | 7 | 19 | 0.3 | 2057 |
5mer | - | - | - | - | 3077 |
Spisak | - | - | - | - | 3576 |
ThriftyHumV0.2-45: CNN Indep Medium | 9 | 7 | 16 | 0.2 | 4539 |
ThriftyHumV0.2-59: CNN Indep Large | 11 | 7 | 19 | 0.3 | 5931 |
Performance evaluation on held-out briney data for S5F, DeepSHM, and thrifty models.
The * on S5F indicates that this model was trained using synonymous mutations on a data set distinct from those considered here. The † on tang† signifies that this is the tang data with a different preprocessing scheme (Tang et al., 2022).
Model | Training data | AUROC | AUPRC | R-prec | Sub. acc. |
---|---|---|---|---|---|
S5F | * | 0.775 | 0.0698 | 0.0290 | 0.514 |
CNN Joined Large | tang | 0.787 | 0.0850 | 0.0406 | 0.510 |
CNN Indep Medium | tang | 0.786 | 0.0848 | 0.0407 | 0.528 |
CNN Indep Large | tang | 0.786 | 0.0852 | 0.0406 | 0.524 |
DeepSHM | tang† | 0.786 | 0.0876 | 0.0421 | 0.537 |
CNN Joined Large | tang+briney | 0.793 | 0.0923 | 0.0439 | 0.551 |
CNN Indep Medium | tang+briney | 0.793 | 0.0919 | 0.0441 | 0.560 |
CNN Indep Large | tang+briney | 0.794 | 0.0926 | 0.0450 | 0.562 |
AUPRC, area under the precision-recall curve; AUROC, area under the ROC curve; R-prec, R-precision; sub. acc., substitution accuracy.
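Two of the less familiar columns can be illustrated with a minimal sketch using the standard definitions. The function names and input layout are hypothetical, and the way scores are aggregated across parent–child pairs in the table is not specified here, so this shows only the per-pair computation.

```python
def r_precision(site_probs, mutated_sites):
    """Fraction of the top-R ranked sites that are mutated, with R = number of mutations."""
    r = len(mutated_sites)
    if r == 0:
        return None  # undefined when there are no mutations to recover
    ranked = sorted(range(len(site_probs)), key=lambda i: site_probs[i], reverse=True)
    top_r = set(ranked[:r])
    return len(top_r & set(mutated_sites)) / r


def substitution_accuracy(csps, observed_bases):
    """Fraction of substitutions whose observed base has the highest CSP."""
    # Each element of `csps` maps candidate bases to conditional probabilities.
    hits = sum(max(csp, key=csp.get) == base for csp, base in zip(csps, observed_bases))
    return hits / len(observed_bases)


rp = r_precision([0.1, 0.9, 0.3, 0.8], mutated_sites=[1, 2])
acc = substitution_accuracy(
    [{"C": 0.6, "G": 0.3, "T": 0.1}, {"A": 0.2, "C": 0.5, "T": 0.3}],
    ["C", "T"],
)
```

In the example, one of the two top-ranked sites is a true mutation and one of the two substitutions is predicted correctly, so both values are 0.5.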
Data used in this article.
briney data is from Briney et al., 2019 after processing done by Spisak et al., 2020. tang data is from Vergani et al., 2017; Tang et al., 2020 and was sequenced using the methods of Vergani et al., 2017. Out-of-frame sequences from briney and tang are used. For productive sequences from tang, only fourfold synonymous sites are used. jaffe data is from Jaffe et al., 2022 sequenced using 10X, where only fourfold synonymous sites of productive sequences are used. The ‘read/cell’ column is the sequencing depth listed as average number of reads per cell. ‘Samples’ is the number of individual samples in the data set; in these data sets, each sample is from a distinct individual. ‘CFs’ is the number of clonal families in the data set. ‘PCPs’ is the number of parent–child pairs in the data set. ‘Median mutations’ is the median number of mutations per PCP in the data set.
Name | Type | Sequencing methods | Read/cell | Samples | CFs | PCPs | Median mutations |
---|---|---|---|---|---|---|---|
briney | Out-of-frame | Bulk | ∼1 | 9 | 3903 | 52,915 | 2 |
tang | Out-of-frame | Single cell | ∼40 | 21 | 3209 | 9984 | 7 |
tang | Productive | Single cell | ∼40 | 21 | 157,863 | 820,837 | 1 |
jaffe | Productive | Single cell | ∼5000 | 4 | 67,304 | 282,732 | 1 |
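The restriction to fourfold synonymous sites in productive sequences can be illustrated with a short sketch based on the standard genetic code. The function name and the `frame_start` parameter are hypothetical, and the actual filtering pipeline may apply further conditions, for example on the other bases of the codon in parent and child.

```python
# Codon prefixes whose third position is fourfold degenerate in the standard
# genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_sites(seq, frame_start=0):
    """Return 0-based indices of third codon positions at fourfold degenerate sites."""
    sites = []
    for codon_start in range(frame_start, len(seq) - 2, 3):
        if seq[codon_start:codon_start + 2] in FOURFOLD_PREFIXES:
            sites.append(codon_start + 2)
    return sites


sites = fourfold_sites("ATGGCTCTGTAA")  # codons: ATG GCT CTG TAA
```

Any substitution at one of these sites leaves the encoded amino acid unchanged, which is why such sites are treated as informative about mutation rather than selection.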