Thrifty wide-context models of B cell receptor somatic hypermutation

  1. Kevin Sung
  2. Mackenzie M Johnson
  3. Will Dumm
  4. Noah Simon
  5. Hugh Haddox
  6. Julia Fukuyama
  7. Frederick A Matsen (corresponding author)
  1. Computational Biology Program, Fred Hutchinson Cancer Center, United States
  2. Department of Biostatistics, University of Washington, United States
  3. Department of Statistics, Indiana University, United States
  4. Howard Hughes Medical Institute, United States
  5. Department of Genome Sciences, University of Washington, United States
  6. Department of Statistics, University of Washington, United States
4 figures, 3 tables and 1 additional file

Figures

Figure 1 with 1 supplement
Overview of data processing and objective.

(a) Out-of-frame sequences are clustered into clonal families. Trees are built on clonal families and then ancestral sequences are reconstructed using a simple sequence model. The prediction task is to predict the location and identity of mutations of child sequences given parent sequences. (b) Strategy for ‘thrifty’ convolutional neural networks with relatively few parameters. We use a trainable embedding of each 3-mer into a space; downstream convolutions happen on sequential collections of these embeddings. The ‘width’ of the k-mer model is determined by the size of the convolutional kernel, which in this cartoon is 3. This would give us effectively a 5-mer model because the 3-mer model adds one base on either side of a convolution of length 3. For the sake of simplicity, the probability distribution of the new base conditioned on there being a substitution (which we call the conditional substitution probability [CSP]) is not shown. The CSP output can emerge in several ways (Figure 1—figure supplement 1).
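As a sketch of the width arithmetic described in the caption (our own illustration, not the authors' code), a convolution of kernel size k over the sequence of overlapping 3-mer embeddings sees k + 2 nucleotides of context:

```python
def effective_kmer(kernel_size: int) -> int:
    """Effective k-mer width of a convolution over overlapping 3-mer
    embeddings: the kernel spans `kernel_size` consecutive 3-mers, each
    shifted by one base, so together they cover kernel_size + 2 bases."""
    return kernel_size + 2

# The cartoon's kernel of length 3 gives an effective 5-mer model.
assert effective_kmer(3) == 5
```

The same arithmetic gives the effective widths quoted in Table 1: kernel size 9 yields an 11-mer model and kernel size 11 yields a 13-mer model.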

Figure 1—figure supplement 1
Strategies for estimating both per-site rate and conditional substitution probability (CSP).

Our three strategies for estimating the per-site rate and CSPs. In the joined version, the two outputs come directly out of the convolutional layer. In the hybrid version, the two outputs share the embedding layer. In the independent version, the two outputs are estimated separately.

Figure 2 with 3 supplements
Model predictive performance on held-out samples: held-out individuals from the briney data (upper row) and a separate sequencing experiment (tang data, lower row).

Note that the two rows use different x-axis scales. The integer in parentheses indicates the number of parameters of the model. Each model has multiple points, each corresponding to an independent model training. This is a subset of the models for clarity; see Figure 2—figure supplement 1 for all models.

Figure 2—figure supplement 1
Performance results for all the models.

Performance results for all the models: held-out individuals from the briney data (upper row) and the tang data (lower row). The ‘Small’ thrifty models have settings (7, 6, 14, 0.1) for hyperparameters (Kernel, Embed, Filters, Dropout), as described in Table 1.

Figure 2—figure supplement 2
Agreement between the original ‘shmoof’ model of Spisak et al., 2020 and our ‘reshmoof’ reimplementation.

There is good agreement between the originally inferred shmoof coefficients and our re-implementation, both in the motif mutability terms and the per-position mutabilities. The primary exception is that we infer more reasonable values when sequencing coverage is weak or absent and avoid an extreme value at site 67. These are due to a slight regularization to the per-position mutabilities. This analysis can be reproduced using the reshmoof.ipynb notebook.

Figure 2—figure supplement 3
Performance comparison including S5F model with original coefficients.

Performance plot including the original S5F model (vertical black lines) for held-out individuals from the briney data (upper row) and on a separate sequencing experiment (tang data, lower row).

Figure 3 with 1 supplement
Model fit on held-out samples: held-out individuals from the briney data (upper row) and a separate sequencing experiment (tang data, lower row).

Observed mutations in held-out data are placed in bins according to their probability of mutation. For every bin, the points show the observed number of mutations in that bin, while the bars show the expected number of mutations in that bin. The overlap metric is the area of the intersection of the observed and expected divided by the average of the two areas.
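One way to compute such an overlap metric from binned counts (a minimal sketch based on our reading of the caption, not the authors' implementation):

```python
def overlap(observed, expected):
    """Histogram overlap: the area of the intersection of the observed
    and expected binned counts, divided by the average of the two areas.
    Equals 1.0 when the histograms are identical, 0.0 when disjoint."""
    intersection = sum(min(o, e) for o, e in zip(observed, expected))
    average_area = (sum(observed) + sum(expected)) / 2
    return intersection / average_area
```

Under this reading, a perfectly calibrated model scores 1.0 and the score falls as observed and expected counts diverge bin by bin.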

Figure 3—figure supplement 1
Comparison of held-out log likelihoods between the models.

Comparison of held-out log likelihoods between the models.

Figure 4 with 2 supplements
The relative change in performance for each statistic: the statistic for the model trained with both out-of-frame (OOF) and synonymous data, minus the statistic for the model trained with OOF data only, divided by the statistic for the model trained with OOF data only.

Thus, adding in synonymous mutations to the training set does not help predict on held-out out-of-frame data. Results shown for briney data (upper row) and tang data (lower row).
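The relative-change statistic in the caption can be written as a small helper (the function and variable names are ours):

```python
def relative_change(stat_oof_plus_syn, stat_oof_only):
    """Relative change in a performance statistic when synonymous
    mutations are added to the out-of-frame (OOF) training data:
    (combined - OOF-only) / OOF-only."""
    return (stat_oof_plus_syn - stat_oof_only) / stat_oof_only

# e.g. an AUROC moving from 0.80 to 0.76 is a -5% relative change
assert abs(relative_change(0.76, 0.80) - (-0.05)) < 1e-9
```

A value near zero, as in the figure, means the added synonymous data neither helps nor hurts prediction on held-out out-of-frame data.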

Figure 4—figure supplement 1
S5F performs better for mutation position prediction on synonymous mutations in the jaffe data than any model trained on out-of-frame data.

S5F performs better for mutation position prediction on synonymous mutations in the jaffe data than any model trained on out-of-frame data. The performance plot is as before, but the black vertical bar indicates the performance of the S5F model, and all models evaluated on synonymous mutations in the jaffe data.

Figure 4—figure supplement 2
S5F performs better for mutation position prediction on synonymous mutations in the tang data than any model trained on out-of-frame data.

S5F performs better for mutation position prediction on synonymous mutations in the tang data than any model trained on out-of-frame data. Performance plot is as before, but the black vertical bar indicates the performance of the S5F model, and all models evaluated on synonymous mutations in the tang data.

Tables

Table 1
Selected model shapes and dropout probabilities.

The release name of the model is the name of the trained model released in the GitHub repository. The paper name is the name of the model used in this article, which describes more about its architecture. ‘Kernel’: the size of the convolutional kernel used in the model. ‘Embed’: the size of the embedding used for each 3-mer. Because there is one additional base on either side of a 3-mer, a model with kernel size 9 is effectively an 11-mer model, and a model with kernel size 11 is effectively a 13-mer model. The ‘Medium’ and ‘Large’ labels in the paper name designate the settings for Kernel, Embed, Filters, and Dropout.

| Release name | Paper name | Kernel | Embed | Filters | Dropout | Params |
| --- | --- | --- | --- | --- | --- | --- |
| ThriftyHumV0.2-20 | CNN Joined Large | 11 | 7 | 19 | 0.3 | 2057 |
| – | 5mer | – | – | – | – | 3077 |
| – | Spisak | – | – | – | – | 3576 |
| ThriftyHumV0.2-45 | CNN Indep Medium | 9 | 7 | 16 | 0.2 | 4539 |
| ThriftyHumV0.2-59 | CNN Indep Large | 11 | 7 | 19 | 0.3 | 5931 |

Table 2
Performance evaluation on held-out briney data for S5F, DeepSHM, and thrifty models.

The * on S5F indicates that this model was trained using synonymous mutations on a distinct data set from those considered here. The † on tang signifies that this is the tang data but with a different preprocessing scheme (Tang et al., 2022).

| Model | Training data | AUROC | AUPRC | R-prec | Sub. acc. |
| --- | --- | --- | --- | --- | --- |
| S5F* | – | 0.775 | 0.0698 | 0.0290 | 0.514 |
| CNN Joined Large | tang | 0.787 | 0.0850 | 0.0406 | 0.510 |
| CNN Indep Medium | tang | 0.786 | 0.0848 | 0.0407 | 0.528 |
| CNN Indep Large | tang | 0.786 | 0.0852 | 0.0406 | 0.524 |
| DeepSHM | tang† | 0.786 | 0.0876 | 0.0421 | 0.537 |
| CNN Joined Large | tang+briney | 0.793 | 0.0923 | 0.0439 | 0.551 |
| CNN Indep Medium | tang+briney | 0.793 | 0.0919 | 0.0441 | 0.560 |
| CNN Indep Large | tang+briney | 0.794 | 0.0926 | 0.0450 | 0.562 |
  1. AUPRC, area under the precision-recall curve; AUROC, area under the ROC curve; R-prec, R-precision; sub. acc., substitution accuracy.

Table 3
Data used in this article.

briney data is from Briney et al., 2019 after processing done by Spisak et al., 2020. tang data is from Vergani et al., 2017; Tang et al., 2020 and was sequenced using the methods of Vergani et al., 2017. Out-of-frame sequences from briney and tang are used. For productive sequences from tang, only fourfold synonymous sites are used. jaffe data is from Jaffe et al., 2022 sequenced using 10X, where only fourfold synonymous sites of productive sequences are used. The ‘read/cell’ column is the sequencing depth listed as average number of reads per cell. ‘Samples’ is the number of individual samples in the data set; in these data sets, each sample is from a distinct individual. ‘CFs’ is the number of clonal families in the data set. ‘PCPs’ is the number of parent–child pairs in the data set. ‘Median mutations’ is the median number of mutations per PCP in the data set.

| Name | Type | Sequencing methods | Read/cell | Samples | CFs | PCPs | Median mutations |
| --- | --- | --- | --- | --- | --- | --- | --- |
| briney | Out-of-frame | Bulk | ∼1 | 9 | 390 | 352,915 | 2 |
| tang | Out-of-frame | Single cell | ∼402 | 13 | 209 | 9984 | 7 |
| tang | Productive | Single cell | ∼402 | 11 | 57,863 | 820,837 | 1 |
| jaffe | Productive | Single cell | ∼5000 | 4 | 67,304 | 282,732 | 1 |

Additional files

Cite this article

  1. Kevin Sung
  2. Mackenzie M Johnson
  3. Will Dumm
  4. Noah Simon
  5. Hugh Haddox
  6. Julia Fukuyama
  7. Frederick A Matsen
(2025)
Thrifty wide-context models of B cell receptor somatic hypermutation
eLife 14:RP105471.
https://doi.org/10.7554/eLife.105471.3