Thrifty wide-context models of B cell receptor somatic hypermutation

  1. Kevin Sung
  2. Mackenzie M Johnson
  3. Will Dumm
  4. Noah Simon
  5. Hugh Haddox
  6. Julia Fukuyama
  7. Frederick A Matsen (corresponding author)
  1. Computational Biology Program, Fred Hutchinson Cancer Center, United States
  2. Department of Biostatistics, University of Washington, United States
  3. Department of Statistics, Indiana University, United States
  4. Howard Hughes Medical Institute, United States
  5. Department of Genome Sciences, University of Washington, United States
  6. Department of Statistics, University of Washington, United States
4 figures, 3 tables and 1 additional file

Figures

Figure 1 with 1 supplement
Overview of data processing and objective.

(a) Out-of-frame sequences are clustered into clonal families. Trees are built on clonal families and then ancestral sequences are reconstructed using a simple sequence model. The prediction task is to predict the location and identity of mutations of child sequences given parent sequences. (b) Strategy for ‘thrifty’ convolutional neural networks with relatively few parameters. We use a trainable embedding of each 3-mer into a space; downstream convolutions happen on sequential collections of these embeddings. The ‘width’ of the k-mer model is determined by the size of the convolutional kernel, which in this cartoon is 3. This would give us effectively a 5-mer model because the 3-mer model adds one base on either side of a convolution of length 3. For the sake of simplicity, the probability distribution of the new base conditioned on there being a substitution (which we call the conditional substitution probability [CSP]) is not shown. The CSP output can emerge in several ways (Figure 1—figure supplement 1).
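As a sketch of the width arithmetic described in the caption (our own illustration, not the authors' code), a convolution of kernel size k over the sequence of overlapping 3-mer embeddings sees k + 2 nucleotides of context:

```python
def effective_kmer(kernel_size: int) -> int:
    """Effective k-mer width of a convolution over overlapping 3-mer
    embeddings: the kernel spans `kernel_size` consecutive 3-mers, each
    shifted by one base, so together they cover kernel_size + 2 bases."""
    return kernel_size + 2

# The cartoon's kernel of length 3 gives an effective 5-mer model.
assert effective_kmer(3) == 5
```

The same arithmetic gives the effective widths quoted in Table 1: kernel size 9 yields an 11-mer model and kernel size 11 yields a 13-mer model.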

Figure 1—figure supplement 1
Strategies for estimating both per-site rate and conditional substitution probability (CSP).

Our three strategies for estimating the per-site rate and CSPs. In the joined version, the two outputs come directly out of the convolutional layer. In the hybrid version, the two outputs share the embedding layer. In the independent version, the two outputs are estimated separately.

Figure 2 with 3 supplements
Model predictive performance on held-out samples: held-out individuals from the briney data (upper row) and a separate sequencing experiment (tang data, lower row).

Note that the two rows use different x-axis scales. The integer in parentheses indicates the number of parameters of the model. Each model has multiple points, each corresponding to an independent model training. This is a subset of the models for clarity; see Figure 2—figure supplement 1 for all models.

Figure 2—figure supplement 1
Performance results for all the models.

Performance results for all the models: held-out individuals from the briney data (upper row) and the tang data (lower row). The ‘Small’ thrifty models have settings (7, 6, 14, 0.1) for hyperparameters (Kernel, Embed, Filters, Dropout), as described in Table 1.

Figure 2—figure supplement 2
Agreement between the original ‘shmoof’ model of Spisak et al., 2020 and our ‘reshmoof’ reimplementation.

There is good agreement between the originally inferred shmoof coefficients and our re-implementation, both in the motif mutability terms and the per-position mutabilities. The primary exception is that we infer more reasonable values when sequencing coverage is weak or absent and avoid an extreme value at site 67. These are due to a slight regularization to the per-position mutabilities. This analysis can be reproduced using the reshmoof.ipynb notebook.

Figure 2—figure supplement 3
Performance comparison including S5F model with original coefficients.

Performance plot including the original S5F model (vertical black lines) for held-out individuals from the briney data (upper row) and on a separate sequencing experiment (tang data, lower row).

Figure 3 with 1 supplement
Model fit on held-out samples: held-out individuals from the briney data (upper row) and a separate sequencing experiment (tang data, lower row).

Observed mutations in held-out data are placed in bins according to their probability of mutation. For every bin, the points show the observed number of mutations in that bin, while the bars show the expected number of mutations in that bin. The overlap metric is the area of the intersection of the observed and expected divided by the average of the two areas.
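One way to compute such an overlap metric from binned counts (a minimal sketch based on our reading of the caption, not the authors' implementation):

```python
def overlap(observed, expected):
    """Histogram overlap: the area of the intersection of the observed
    and expected binned counts, divided by the average of the two areas.
    Equals 1.0 when the histograms are identical, 0.0 when disjoint."""
    intersection = sum(min(o, e) for o, e in zip(observed, expected))
    average_area = (sum(observed) + sum(expected)) / 2
    return intersection / average_area
```

Under this reading, a perfectly calibrated model scores 1.0 and the score falls as observed and expected counts diverge bin by bin.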

Figure 3—figure supplement 1
Comparison of held-out log likelihoods between the models.

Comparison of held-out log likelihoods between the models.

Figure 4 with 2 supplements
The relative change in performance for each statistic: the statistic for the model trained with both out-of-frame (OOF) and synonymous data, minus the statistic for the model trained with OOF data only, divided by the statistic for the model trained with OOF data only.

Thus, adding in synonymous mutations to the training set does not help predict on held-out out-of-frame data. Results shown for briney data (upper row) and tang data (lower row).
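The relative-change statistic in the caption can be written as a small helper (the function and variable names are ours):

```python
def relative_change(stat_oof_plus_syn, stat_oof_only):
    """Relative change in a performance statistic when synonymous
    mutations are added to the out-of-frame (OOF) training data:
    (combined - OOF-only) / OOF-only."""
    return (stat_oof_plus_syn - stat_oof_only) / stat_oof_only

# e.g. an AUROC moving from 0.80 to 0.76 is a -5% relative change
assert abs(relative_change(0.76, 0.80) - (-0.05)) < 1e-9
```

A value near zero, as in the figure, means the added synonymous data neither helps nor hurts prediction on held-out out-of-frame data.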

Figure 4—figure supplement 1
S5F performs better for mutation position prediction on synonymous mutations in the jaffe data than any model trained on out-of-frame data.

S5F performs better for mutation position prediction on synonymous mutations in the jaffe data than any model trained on out-of-frame data. The performance plot is as before, but the black vertical bar indicates the performance of the S5F model, and all models evaluated on synonymous mutations in the jaffe data.

Figure 4—figure supplement 2
S5F performs better for mutation position prediction on synonymous mutations in the tang data than any model trained on out-of-frame data.

S5F performs better for mutation position prediction on synonymous mutations in the tang data than any model trained on out-of-frame data. Performance plot is as before, but the black vertical bar indicates the performance of the S5F model, and all models evaluated on synonymous mutations in the tang data.

Tables

Table 1
Selected model shapes and dropout probabilities.

The release name of the model is the name of the trained model released in the GitHub repository. The paper name is the name of the model used in this article, which describes more about its architecture. ‘Kernel’: the size of the convolutional kernel used in the model. ‘Embed’: the size of the embedding used for each 3-mer. Because there is one additional base on either side of a 3-mer, a model with kernel size 9 is effectively an 11-mer model, and a model with kernel size 11 is effectively a 13-mer model. The ‘Medium’ and ‘Large’ labels in the paper name designate the settings for Kernel, Embed, Filters, and Dropout.

| Release name | Paper name | Kernel | Embed | Filters | Dropout | Params |
| --- | --- | --- | --- | --- | --- | --- |
| ThriftyHumV0.2-20 | CNN Joined Large | 11 | 7 | 19 | 0.3 | 2057 |
| – | 5mer | – | – | – | – | 3077 |
| – | Spisak | – | – | – | – | 3576 |
| ThriftyHumV0.2-45 | CNN Indep Medium | 9 | 7 | 16 | 0.2 | 4539 |
| ThriftyHumV0.2-59 | CNN Indep Large | 11 | 7 | 19 | 0.3 | 5931 |

Table 2
Performance evaluation on held-out briney data for S5F, DeepSHM, and thrifty models.

The * on S5F indicates that this model was trained using synonymous mutations on a distinct data set from those considered here. The † on tang signifies that this is the tang data but with a different preprocessing scheme (Tang et al., 2022).

| Model | Training data | AUROC | AUPRC | R-prec | Sub. acc. |
| --- | --- | --- | --- | --- | --- |
| S5F* | – | 0.775 | 0.0698 | 0.0290 | 0.514 |
| CNN Joined Large | tang | 0.787 | 0.0850 | 0.0406 | 0.510 |
| CNN Indep Medium | tang | 0.786 | 0.0848 | 0.0407 | 0.528 |
| CNN Indep Large | tang | 0.786 | 0.0852 | 0.0406 | 0.524 |
| DeepSHM | tang† | 0.786 | 0.0876 | 0.0421 | 0.537 |
| CNN Joined Large | tang+briney | 0.793 | 0.0923 | 0.0439 | 0.551 |
| CNN Indep Medium | tang+briney | 0.793 | 0.0919 | 0.0441 | 0.560 |
| CNN Indep Large | tang+briney | 0.794 | 0.0926 | 0.0450 | 0.562 |
  1. AUPRC, area under the precision-recall curve; AUROC, area under the ROC curve; R-prec, R-precision; sub. acc., substitution accuracy.

Table 3
Data used in this article.

briney data is from Briney et al., 2019 after processing done by Spisak et al., 2020. tang data is from Vergani et al., 2017; Tang et al., 2020 and was sequenced using the methods of Vergani et al., 2017. Out-of-frame sequences from briney and tang are used. For productive sequences from tang, only fourfold synonymous sites are used. jaffe data is from Jaffe et al., 2022 sequenced using 10X, where only fourfold synonymous sites of productive sequences are used. The ‘read/cell’ column is the sequencing depth listed as average number of reads per cell. ‘Samples’ is the number of individual samples in the data set; in these data sets, each sample is from a distinct individual. ‘CFs’ is the number of clonal families in the data set. ‘PCPs’ is the number of parent–child pairs in the data set. ‘Median mutations’ is the median number of mutations per PCP in the data set.

| Name | Type | Sequencing methods | Read/cell | Samples | CFs | PCPs | Median mutations |
| --- | --- | --- | --- | --- | --- | --- | --- |
| briney | Out-of-frame | Bulk | ∼1 | 9 | 390 | 352,915 | 2 |
| tang | Out-of-frame | Single cell | ∼402 | 13 | 209 | 9984 | 7 |
| tang | Productive | Single cell | ∼402 | 11 | 57,863 | 820,837 | 1 |
| jaffe | Productive | Single cell | ∼5000 | 4 | 67,304 | 282,732 | 1 |

Additional files

Cite this article

  1. Kevin Sung
  2. Mackenzie M Johnson
  3. Will Dumm
  4. Noah Simon
  5. Hugh Haddox
  6. Julia Fukuyama
  7. Frederick A Matsen
(2025)
Thrifty wide-context models of B cell receptor somatic hypermutation
eLife 14:RP105471.
https://doi.org/10.7554/eLife.105471.3