Thrifty wide-context models of B cell receptor somatic hypermutation

Kevin Sung; Mackenzie M Johnson; Will Dumm; Noah Simon; Hugh Haddox; Julia Fukuyama; Frederick A Matsen IV

doi:10.7554/eLife.105471.1

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

Reviewing Editor
Thierry Mora
École Normale Supérieure - PSL, Paris, France
Senior Editor
Aleksandra Walczak
CNRS, Paris, France

Reviewer #1 (Public review):

Summary:

This paper introduces a new class of machine learning models for capturing how likely a specific nucleotide in a rearranged IG gene is to undergo somatic hypermutation. These models modestly outperform existing state-of-the-art efforts, despite having fewer free parameters. A surprising finding is that models trained on all mutations from non-functional rearrangements give divergent results from those trained on only silent mutations from functional rearrangements.

Strengths:

(1) The new model structure is quite clever and will provide a powerful way to explore larger models.

(2) Careful attention is paid to curating and processing large existing data sets.

(3) The authors are to be commended for their efforts to communicate with the developers of previous models and use the strongest possible versions of those in their current evaluation.

Weaknesses:

(1) 10x/single-cell data has a fairly different error profile from bulk data. A synonymous model should be built from the same `briney` dataset as the base model to validate the difference between the two types of training data.

(3) The decision to test only kernel sizes of 7, 9, and 11 is not explained. The selection/optimization of embedding size is not explained. The filters listed in Table 1 are not defined.

https://doi.org/10.7554/eLife.105471.1.sa2

Reviewer #2 (Public review):

This work offers an insightful contribution for researchers in computational biology, immunology, and machine learning. By employing a 3-mer embedding and CNN architecture, the authors demonstrate that it is possible to extend sequence context without exponentially increasing the model's complexity.
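The scaling argument can be made concrete with a small parameter count. The sketch below is illustrative only: the layer layout (a 3-mer embedding, one 1-D convolution, and a linear readout) and the hyperparameter values are assumptions chosen for the arithmetic, not the authors' actual architecture or settings. It contrasts a lookup table with one parameter per k-mer against an embedding-plus-convolution model whose kernel slides over overlapping 3-mers.

```python
def kmer_table_params(k: int) -> int:
    """One free parameter per possible k-mer over the alphabet {A, C, G, T}."""
    return 4 ** k


def embed_conv_params(embedding_dim: int, n_filters: int, kernel_size: int) -> int:
    """Parameter count for a hypothetical 3-mer embedding + 1-D conv + linear readout.

    This layout is an assumption made for illustration, not the paper's model.
    """
    embedding = (4 ** 3) * embedding_dim                         # one vector per 3-mer
    conv = n_filters * embedding_dim * kernel_size + n_filters   # weights + biases
    readout = n_filters + 1                                      # linear map to one value per site
    return embedding + conv + readout


def effective_context(kernel_size: int) -> int:
    """Nucleotides covered by a kernel over overlapping 3-mers at stride 1."""
    return kernel_size + 2


if __name__ == "__main__":
    # Hypothetical hyperparameters, chosen only to illustrate the scaling.
    dim, filters, kernel = 7, 16, 11
    width = effective_context(kernel)
    print(f"context width: {width} nt")
    print(f"embedding + conv parameters: {embed_conv_params(dim, filters, kernel):,}")
    print(f"full {width}-mer table parameters: {kmer_table_params(width):,}")
```

Under these assumed settings the convolutional model's parameter count grows linearly with kernel size, whereas a full lookup table over the same context width grows as a power of four.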

Key findings include:

(1) Efficiency and Performance: Thrifty CNNs outperform traditional 5-mer models and match the performance of significantly larger models like DeepSHM.

(2) Neutral Mutation Data: A distinction is made between using synonymous mutations and out-of-frame sequences for model training, with evidence suggesting these methods capture different aspects of SHM, or different biases in the type of data.

(3) Open Source Contributions: The release of a Python package and pre-trained models adds practical value for the community.

However, readers should be aware of the limitations. The improvements over existing models are modest, and the work is constrained by the availability of high-quality out-of-frame sequence data. The study also highlights that more complex modeling techniques, like transformers, did not enhance predictive performance, which underscores the role of data availability in such studies.

https://doi.org/10.7554/eLife.105471.1.sa1

Reviewer #3 (Public review):

Summary:

Modeling and estimating sequence context biases during B cell somatic hypermutation is important for accurately modeling B cell evolution to better understand responses to infection and vaccination. Sung et al. introduce new statistical models that capture a wider sequence context of somatic hypermutation with a comparatively small number of additional parameters. They demonstrate their model's performance with rigorous testing across multiple subjects and datasets. Prior work has captured the mutation biases of fixed 3-, 5-, and 7-mers, but each of these expansions has significantly more parameters. The authors developed a machine-learning-based approach to learn these biases using wider contexts with comparatively few parameters.

Strengths:

Well-motivated and defined problem. Clever solution to expand nucleotide context. Complete separation of training and test data by using different subjects for training vs testing. Release of open-source tools and scripts for reproducibility.

Weaknesses:

This study could be improved with better descriptions of dataset sequencing technology, sequencing depth, etc., but this is a minor weakness.

https://doi.org/10.7554/eLife.105471.1.sa0
