Separating selection from mutation in antibody language models
Figures
Nucleotide-level mutation processes distort protein language model predictions.
(a) AbLang2 assigns almost 100× lower probabilities to amino acids requiring multiple nucleotide mutations than to single-mutation variants. Each point represents a possible amino acid substitution at a single site in the amino acid sequence. (b) AbLang2 probabilities correlate with neutral somatic hypermutation probabilities across the V-encoded portion of nine naive sequences, demonstrating that the model is strongly influenced by mutation bias. Each point represents a site in the sequence. Triangles are outliers that have been clipped into the plotted y range. (c) AbLang2 functional prediction accuracy drops substantially for amino acids that are multiple (2 or 3) nucleotide mutations away from the wild-type codon. Data are from Koenig et al., 2017.
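The single- versus multi-nucleotide distinction in panel (a) can be made concrete: for each candidate amino acid, count the minimum number of nucleotide substitutions needed to turn the wild-type codon into any codon encoding that amino acid. A minimal sketch using the standard genetic code (the helper name is ours, not from the paper):

```python
from itertools import product

# Standard genetic code; codons enumerated in TCAG order, matching the
# row/column layout of the usual codon table.
BASES = "TCAG"
AMINOS = ("FFLLSSSSYY**CC*W"
          "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR"
          "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINOS)}

def min_nt_changes(codon: str, target_aa: str) -> int:
    """Minimum number of nucleotide substitutions needed to turn `codon`
    into any codon encoding `target_aa`."""
    return min(
        sum(a != b for a, b in zip(codon, alt))
        for alt, aa in CODON_TABLE.items()
        if aa == target_aa
    )
```

For example, reaching asparagine (N) from AAA (lysine) needs one substitution, while reaching phenylalanine (F) from TGG (tryptophan) needs two; the latter is the "multiple" category in panels (a) and (c).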
Conflating mutation and selection hinders functional prediction.
This plot is analogous to the rightmost panel of Figure 1c, but colored by mutability according to the Thrifty (Sung et al., 2025a) model.
Our model separates mutation from selection to predict functional effects without nucleotide-level biases.
(a) Our model combines a fixed mutation component (trained on non-functional data) with a learned selection component (deep amino acid selection model [DASM] transformer). Training uses inferred parent-child sequence pairs from reconstructed B cell phylogenies to predict natural affinity maturation after a jointly inferred time t. (b) The DASM directly predicts selection factors for all amino acid substitutions at every position in a single forward pass. Positive factors indicate beneficial changes, and negative factors indicate deleterious changes.
Overview of full detailed methods.
The model is trained to predict the probability of a child sequence given the parent sequence, and is divided into mutation and selection components. Mutation: given a parent sequence X, the probability of mutation to alternate codons after time t is calculated using a model of SHM (Sung et al., 2025a) and a ‘multihit model’ (see Methods), and then aggregated into codons as in Matsen et al., 2025b to obtain per-codon mutation probabilities. Selection: given the amino acid translation of the parent, the transformer encoder gives per-site selection factors. These are then multiplied (Equation 1) and summed to give the probability of the observed child sequence at every site, yielding a likelihood for each parent-child pair. The algorithm maximizes this likelihood across branch lengths t for each parent-child pair as well as across the parameters of the transformer model for all parent-child pairs in the dataset (dashed lines).
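The combination of the two components can be sketched at the amino acid level. This is a deliberate simplification: the paper works at the codon level, and the exact normalization of the selection-scaled probabilities is our assumption; `pcp_log_likelihood` is an illustrative helper, not the paper's code.

```python
import numpy as np

def pcp_log_likelihood(p_neutral: np.ndarray,
                       sel_factors: np.ndarray,
                       child_idx: np.ndarray) -> float:
    """Log-likelihood of a child sequence given its parent.

    p_neutral:   (sites x alphabet) neutral substitution probabilities
                 from the SHM/mutation component after branch length t.
    sel_factors: (sites x alphabet) multiplicative selection factors
                 from the selection component (DASM).
    child_idx:   observed child residue index at each site.
    """
    scaled = p_neutral * sel_factors             # mutation x selection, Equation 1 style
    scaled /= scaled.sum(axis=1, keepdims=True)  # renormalize per site (our choice)
    site_probs = scaled[np.arange(len(child_idx)), child_idx]
    return float(np.log(site_probs).sum())
```

Training then amounts to maximizing this quantity over the branch length t (which enters through `p_neutral`) per pair, and over the transformer parameters (which produce `sel_factors`) across all pairs.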
Multihit correction improves model fit.
Models that allow for multiple hits per codon (right column) show a better fit than models without this correction (left column) on the out-of-frame data from Tang et al., 2022 as prepared in Sung et al., 2025a. These observed-versus-expected plots compare the observed number of mutations to the computationally predicted number across bins of mutation probability. For each bin, the expected number of mutations is calculated as the sum of mutation probabilities for all sites in that bin, while the observed count is the actual number of mutations found at those sites. The overlap metric quantifies the area of overlap between the observed and expected distributions divided by their average area, while the residual metric measures the normalized root-mean-square difference between observed and expected counts. The plots are faceted by the number of mutations per codon.
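The two summary metrics described above can be sketched directly from the binned counts. The normalization of the residual by the mean expected count is our assumption; the paper's exact scaling may differ, and `oe_metrics` is an illustrative helper.

```python
import numpy as np

def oe_metrics(observed: np.ndarray, expected: np.ndarray):
    """Summary metrics for an observed-versus-expected plot.

    overlap:  area shared by the two binned distributions divided by the
              average of their areas (1.0 means identical histograms).
    residual: root-mean-square of per-bin differences, normalized by the
              mean expected count (one plausible normalization).
    """
    shared = np.minimum(observed, expected).sum()
    avg_area = 0.5 * (observed.sum() + expected.sum())
    overlap = shared / avg_area
    residual = np.sqrt(np.mean((observed - expected) ** 2)) / expected.mean()
    return overlap, residual
```

Identical observed and expected histograms give overlap 1.0 and residual 0.0; any disagreement pushes the overlap down and the residual up.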
Comparing model predictions with experimentally measured effects of mutations on antibody expression from Koenig et al., 2017.
(a) The deep amino acid selection model (DASM) maintains high predictive accuracy on functional effects of mutations regardless of codon accessibility. The correlation is equally high for amino acid mutations that only require a single-nucleotide mutation (left plot) vs. amino acid mutations that require multi-nucleotide mutations (center plot), demonstrating successful separation of mutation bias from functional effects. Compare Figure 1c. (b) DASM predictions mimic patterns in the expression data. For additional heatmap comparisons, see Figure 3—figure supplement 3.
Light chain version of main figure.
DASM removes codon bias in light chain functional predictions. Same plot as Figure 3, but for the light chain.
Comparison of selection factors for codon neighbors in the deep amino acid selection model (DASM).
DASM selection factors match the codon-table pattern seen in experimental measurements, while masked language models show artifacts from the codon table. The experimental data (left two panels) show a slight decrease in median scores for amino acids requiring multiple nucleotide mutations (‘multiple’) versus single mutations (‘single’). DASM captures this pattern, showing similar distributions for both categories. In contrast, AbLang and ESM assign radically lower scores to multi-nucleotide amino acid substitutions, consistent with the masked language modeling objective learning codon-level mutation probabilities as described in the main text (Figure 1a). The comparison is done on the heavy chain of the Koenig et al., 2017 wildtype sequence.
Heavy chain heatmap.
Heatmap showing the heavy-chain data of Koenig et al., 2017 along with model predictions.
Parent-child pair (PCP) predictions.
The DASM with the Thrifty neutral model accurately assesses probabilities of natural affinity maturation paths. (a, b) The DASM is better than AbLang2 at predicting the location of nonsynonymous mutations observed in PCPs withheld from model training. (c) The DASM achieves lower conditional perplexity (median 4.88 vs 7.39) when predicting amino acid identity at mutated sites, with fewer extreme outliers. The conditional perplexity is the perplexity of the child amino acid, conditioned on there being a mutation at that site. Note that due to the inherent stochasticity of affinity maturation, there is a lower limit to this conditional perplexity that is substantially greater than 1.
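The conditional perplexity in panel (c) can be sketched as follows, under our reading of the caption: at each mutated site, remove the parent residue's probability, renormalize over the alternatives, and exponentiate the mean negative log-probability of the observed child residues (`conditional_perplexity` is an illustrative helper, not the paper's code).

```python
import numpy as np

def conditional_perplexity(probs: np.ndarray,
                           parent_idx: np.ndarray,
                           child_idx: np.ndarray) -> float:
    """Perplexity of the child residue at mutated sites, conditioned on a
    mutation having occurred there. probs is (sites x alphabet) model output."""
    mutated = parent_idx != child_idx
    logps = []
    for i in np.flatnonzero(mutated):
        p = probs[i].copy()
        p[parent_idx[i]] = 0.0   # condition on "not the parent residue"
        p /= p.sum()             # renormalize over the alternatives
        logps.append(np.log(p[child_idx[i]]))
    return float(np.exp(-np.mean(logps)))
```

A model that always splits its conditional probability evenly over two plausible substitutions would score 2.0; the caption's point is that affinity maturation is stochastic enough that no model can approach 1.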
Scatterplots for Shanehsazzadeh data.
Scatterplots for the zero-shot data set of Shanehsazzadeh et al., 2023, partitioned by sequence length.
MAGMA scatter plots.
Model predictions versus experimentally measured binding affinity for five antibodies using the MAGMA-seq protocol. ‘Petersen’ data are from an experiment probing the rules of recognition for influenza neutralizing antibodies (Petersen et al., 2024) and ‘Kirby’ data are from combinatorial libraries applying combinations of mutations on the path from naive to mature (Kirby et al., 2025). Each row shows a different model (DASM, ESM2, AbLang2, ProGen2) and each column shows a different antibody lineage.
Tables
Correlation of models with predicting effects of single mutations on antibody expression and antigen binding, as measured in Koenig et al., 2017.
Masked language models were scored using per-sequence pseudo-perplexity following the FLAb protocol (Chungyoun et al., 2024).
| Model | Binding (Heavy) | Binding (Light) | Expression (Heavy) | Expression (Light) |
|---|---|---|---|---|
| AbLang2 | −0.114 | −0.108 | 0.153 | −0.109 |
| DASM | 0.335 | 0.316 | 0.688 | 0.674 |
| ESM2 | 0.009 | 0.243 | 0.384 | 0.416 |
| ProGen2 | 0.156 | 0.276 | 0.559 | 0.568 |
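The pseudo-perplexity scoring referenced above can be sketched as follows: mask each position in turn, score the true residue under the model, and exponentiate the negative mean log-probability. The `masked_logprob` hook is a hypothetical wrapper around the actual model's forward pass (AbLang2, ESM2, ...), not a real library API.

```python
import math

def pseudo_perplexity(sequence: str, masked_logprob) -> float:
    """Per-sequence pseudo-perplexity for a masked language model.

    masked_logprob(seq, i) should return the log-probability the model
    assigns to the true residue seq[i] when position i is masked
    (caller-supplied; name and signature are ours).
    """
    logps = [masked_logprob(sequence, i) for i in range(len(sequence))]
    return math.exp(-sum(logps) / len(logps))
```

Lower is better: a model that assigns probability 0.5 to every masked residue scores exactly 2.0.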
Correlation of models with binding measurements on the data of Shanehsazzadeh et al., 2023, which typically involves multi-mutant variants.
See Figure 3—figure supplement 5 for scatterplots.
| Seq. length | AbLang2 | DASM | ESM2 | ProGen2 |
|---|---|---|---|---|
| 119 | 0.263 | 0.458 | 0.191 | 0.074 |
| 120 | 0.166 | 0.518 | 0.308 | 0.052 |
Correlations between model predictions and binding affinity.
The ‘Petersen’ data are from an experiment probing the rules of recognition for influenza neutralizing antibodies (Petersen et al., 2024) and the ‘Kirby’ data are from combinatorial libraries applying combinations of mutations on the path from naive to mature SARS-CoV-2 antibodies (Kirby et al., 2025). See Figure 3—figure supplement 6 for corresponding scatter plots.
| Source | Antibody | AbLang2 | DASM | ESM2 | ProGen2 |
|---|---|---|---|---|---|
| Petersen (Petersen et al., 2024) | 222-1C06 | 0.041 | 0.248 | −0.039 | −0.030 |
| Petersen (Petersen et al., 2024) | 319-345 | 0.079 | 0.279 | 0.199 | 0.222 |
| Kirby (Kirby et al., 2025) | 002-S21F2 | −0.246 | −0.165 | −0.283 | −0.353 |
| Kirby (Kirby et al., 2025) | Ab_2-15 | −0.325 | 0.094 | −0.265 | −0.069 |
| Kirby (Kirby et al., 2025) | C118 | −0.624 | 0.276 | −0.293 | −0.159 |
Computational efficiency comparison on sequences from the MAGMA-seq experiments.
10 sequences were run on CPU, and 100 on the GPU server.
| Model | CPU (s/seq) | GPU (s/seq) |
|---|---|---|
| DASM | 0.0097 | 0.0053 |
| AbLang2 | 11.0529 | 0.5196 |
| ESM2 | 112.5989 | 7.6090 |
Data used in this paper.
CC means that PCPs with mutated cysteines are excluded from the data. JaffePairedCC is paired data from Jaffe et al., 2022, sequenced using the 10x platform. TangCC is heavy chain data from Vergani et al., 2017; Tang et al., 2022, sequenced using the methods of Vergani et al., 2017. VanwinkleheavyTrainCC is a subset of the heavy chain data from Engelbrecht et al., 2025, sequenced using the Takara 5’RACE BCR kit. VanwinklelightTrainCC1m is a 1M subset of the light chain data from Engelbrecht et al., 2025, also sequenced using the Takara 5’RACE BCR kit. RodriguezCC is the RACE heavy chain data from Rodriguez et al., 2023, and is used only for testing. The ‘Samples’ column is the number of individual samples in the dataset; in these datasets, each sample is from a distinct individual. ‘Families’ is the number of clonal families, ‘PCPs’ the number of parent-child pairs, and ‘Med. mutns’ the median number of mutations per PCP.
| Purpose | Name | Samples | Families | PCPs | Med. mutns |
|---|---|---|---|---|---|
| Train | JaffePairedCC | 4 | 50,776 | 209,599 | 7 |
| Train | TangCC | 21 | 45,267 | 651,899 | 2 |
| Train | VanwinkleheavyTrainCC | 149 | 21,269 | 124,985 | 4 |
| Train | VanwinklelightTrainCC1m | 330 | 2658 | 1,000,000 | 2 |
| Test | RodriguezCC | 51 | 3592 | 38,050 | 5 |
Comparison of ESM2 model sizes on the Koenig benchmark using the masked-marginals approach.
The larger ESM2-3B model (2.8B parameters) performs slightly worse than the 650M variant, particularly on light chain predictions.
| Model | Binding (Heavy) | Binding (Light) | Expression (Heavy) | Expression (Light) |
|---|---|---|---|---|
| ESM2-650M | −0.001 | 0.308 | 0.418 | 0.524 |
| ESM2-3B | −0.025 | 0.283 | 0.418 | 0.469 |