Separating selection from mutation in antibody language models

  1. Frederick A Matsen IV (corresponding author)
  2. Will Dumm
  3. Kevin Sung
  4. Mackenzie M Johnson
  5. David H Rich
  6. Tyler N Starr
  7. Yun S Song
  8. Julia Fukuyama
  9. Hugh K Haddox
  1. Computational Biology Program, Fred Hutchinson Cancer Center, United States
  2. Department of Genome Sciences, University of Washington, United States
  3. Department of Statistics, University of Washington, United States
  4. Howard Hughes Medical Institute, United States
  5. Department of Biochemistry, University of Utah School of Medicine, United States
  6. Computer Science Division and Department of Statistics, University of California, Berkeley, United States
  7. Department of Statistics, Indiana University, United States
3 figures, 6 tables and 1 additional file

Figures

Figure 1 with 1 supplement
Nucleotide-level mutation processes distort protein language model predictions.

(a) AbLang2 assigns almost 100× lower probabilities to amino acids requiring multiple nucleotide mutations compared to single-mutation variants. Each point represents a possible amino acid substitution at a single site in the amino acid sequence. (b) AbLang2 probabilities correlate with neutral somatic hypermutation probabilities across the V-encoded portion of nine naive sequences, demonstrating how the model is strongly impacted by mutation bias. Each point represents a site in the sequence. Triangles are outliers that have been brought into the y range. (c) AbLang2 functional prediction accuracy drops substantially for amino acids that are multiple (2 or 3) nucleotide mutations away from the wild-type codon. Data are from Koenig et al., 2017.
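The single- versus multi-nucleotide distinction in panels (a) and (c) comes down to counting the minimum number of nucleotide changes needed to reach any codon of a target amino acid from the wild-type codon. A minimal illustrative sketch (not from the paper):

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered
# TTT, TTC, TTA, TTG, TCT, ... with bases in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def min_nt_changes(wt_codon: str, target_aa: str) -> int:
    """Minimum number of nucleotide substitutions needed to turn wt_codon
    into any codon encoding target_aa."""
    return min(
        sum(x != y for x, y in zip(wt_codon, codon))
        for codon, aa in CODON_TABLE.items()
        if aa == target_aa
    )

# From Ala (GCT): Val (GTT) is one nucleotide away, but Lys (AAA/AAG)
# requires changing all three positions.
print(min_nt_changes("GCT", "V"))  # 1
print(min_nt_changes("GCT", "K"))  # 3
```

Amino acid substitutions in the "multiple" category (distance 2 or 3) are exactly the ones to which AbLang2 assigns sharply lower probabilities in panel (a).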

Figure 1—figure supplement 1
Conflating mutation and selection hinders functional prediction.

This plot is analogous to the rightmost panel of Figure 1c, but colored by mutability according to the Thrifty (Sung et al., 2025a) model.

Figure 2 with 2 supplements
Our model separates mutation from selection to predict functional effects without nucleotide-level biases.

(a) Our model combines a fixed mutation component (trained on non-functional data) with a learned selection component (deep amino acid selection model [DASM] transformer). Training uses inferred parent-child sequence pairs from reconstructed B cell phylogenies to predict natural affinity maturation after a jointly inferred time t. (b) The DASM directly predicts selection factors for all amino acid substitutions at every position in a single forward pass. Positive factors indicate beneficial changes, and negative factors indicate deleterious changes.

Figure 2—figure supplement 1
Overview of full detailed methods.

The model is trained to predict the probability of a child sequence given the parent sequence. It is divided into mutation and selection components. Mutation: given a parent sequence X, the probability of mutation to alternate codons after time t is calculated using a model of SHM (Sung et al., 2025a) and a ‘multihit model’ (see Methods), then aggregated into codons as in Matsen et al., 2025b to obtain p_{j,c}(t, X). Selection: given the amino acid translation X̄ of the parent, the transformer encoder gives per-site selection factors f_{j,c̄}(X̄). These are then multiplied (Equation 1) and summed to give the probability of the observed child sequence at every site. This yields a likelihood for each parent-child pair. The algorithm maximizes the likelihood over the branch length t for each parent-child pair, as well as over the parameters of the transformer model across all parent-child pairs in the dataset (dashed lines).
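The multiplicative combination in Equation 1 can be sketched as follows. This is a minimal illustration assuming the selection factors are on a log scale (positive = beneficial, negative = deleterious) and that the per-site result is renormalized; the paper's exact normalization may differ:

```python
import numpy as np

def child_codon_probs(neutral_probs, selection_factors):
    """Combine per-codon neutral mutation probabilities p_{j,c}(t, X) with
    selection factors f_{j,c̄}(X̄) (assumed log-scale here) at one site,
    then renormalize so the site's probabilities sum to one."""
    unnorm = np.asarray(neutral_probs) * np.exp(np.asarray(selection_factors))
    return unnorm / unnorm.sum()

# Toy site with three candidate codons: the neutral SHM model favors the
# first, but selection favors the third.
p_neutral = np.array([0.90, 0.07, 0.03])
f = np.array([0.0, 0.0, 2.0])  # log selection factors (illustrative values)
p_child = child_codon_probs(p_neutral, f)
```

The key property is that a codon that is hard to reach mutationally but strongly selected can still end up with substantial probability, which is exactly the behavior that a masked language model conflating the two processes cannot express.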

Figure 2—figure supplement 2
Multihit correction improves model fit.

Models that allow for multiple hits per codon (right column) show better model fit than without (left column) on the out-of-frame data from Tang et al., 2022 as prepared in Sung et al., 2025a. These observed-versus-expected plots compare the observed number of mutations to the computationally predicted number across bins of mutation probability. For each bin, the expected number of mutations is calculated as the sum of mutation probabilities for all sites in that bin, while the observed count represents the actual mutations found in those sites. The overlap metric quantifies the area of overlap between observed and expected distributions divided by their average area, while the residual metric measures the normalized root-mean-square difference between observed and expected counts. These plots are faceted by the number of mutations per codon.
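As an illustration of the caption's definitions (not the authors' exact implementation, whose binning scheme may differ), the binned observed-versus-expected counts and the overlap and residual metrics could be computed as:

```python
import numpy as np

def binned_obs_exp(site_probs, site_mutated, n_bins=10):
    """Bin sites by predicted mutation probability (quantile bins here, as an
    assumption). Per bin, the expected count is the sum of mutation
    probabilities and the observed count is the number of mutated sites."""
    site_probs = np.asarray(site_probs, dtype=float)
    site_mutated = np.asarray(site_mutated, dtype=bool)
    edges = np.quantile(site_probs, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, site_probs, side="right") - 1,
                  0, n_bins - 1)
    expected = np.array([site_probs[idx == b].sum() for b in range(n_bins)])
    observed = np.array([site_mutated[idx == b].sum() for b in range(n_bins)])
    return observed, expected

def overlap_metric(observed, expected):
    """Area of overlap between the two histograms, divided by their average area."""
    overlap = np.minimum(observed, expected).sum()
    return overlap / ((observed.sum() + expected.sum()) / 2.0)

def residual_metric(observed, expected):
    """Normalized root-mean-square difference between observed and expected counts."""
    return np.sqrt(np.mean((observed - expected) ** 2)) / np.mean(expected)
```

Under this reading, a perfectly calibrated mutation model gives an overlap of 1 and a residual of 0.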

Figure 3 with 6 supplements
Comparing model predictions with experimentally measured effects of mutations on antibody expression from Koenig et al., 2017.

(a) The deep amino acid selection model (DASM) maintains high predictive accuracy on functional effects of mutations regardless of codon accessibility. The correlation is equally high for amino acid mutations that only require a single-nucleotide mutation (left plot) vs. amino acid mutations that require multi-nucleotide mutations (center plot), demonstrating successful separation of mutation bias from functional effects. Compare Figure 1c. (b) DASM predictions mimic patterns in the expression data. For additional heatmap comparisons, see Figure 3—figure supplement 3.

Figure 3—figure supplement 1
Light chain version of main figure.

The DASM removes codon bias in light-chain functional predictions. This plot is equivalent to Figure 3, but for the light chain.

Figure 3—figure supplement 2
Comparison of selection factors for codon neighbors in the deep amino acid selection model (DASM).

DASM selection factors match the codon-table pattern seen in experimental measurements, while masked language models show artifacts from the codon table. The experimental data (left two panels) show a slight decrease in median scores for amino acids requiring multiple nucleotide mutations (‘multiple’) versus single mutations (‘single’). DASM captures this pattern, showing similar distributions for both categories. In contrast, AbLang and ESM assign radically lower scores to multi-nucleotide amino acid substitutions, consistent with the masked language modeling objective learning codon-level mutation probabilities as described in the main text (Figure 1a). The comparison was done on the heavy chain of the Koenig et al., 2017 wild-type sequence.

Figure 3—figure supplement 3
Heavy chain heatmap.

Heatmap showing the heavy-chain data of Koenig et al., 2017 along with model predictions.

Figure 3—figure supplement 4
Parent-child pair (PCP) predictions.

The DASM with the Thrifty neutral model accurately assesses probabilities of natural affinity maturation paths. (a, b) The DASM is better than AbLang2 at predicting the location of nonsynonymous mutations observed in PCPs withheld from model training. (c) The DASM achieves lower conditional perplexity (median 4.88 vs 7.39) when predicting amino acid identity at mutated sites, with fewer extreme outliers. The conditional perplexity is the perplexity of the child amino acid, conditioned on there being a mutation at that site. Note that due to the inherent stochasticity of affinity maturation, there is a lower limit to this conditional perplexity that is substantially greater than 1.
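The conditional perplexity in (c) can be sketched as follows. The conditioning step (zeroing out the parent amino acid and renormalizing) and the per-site probability inputs are illustrative assumptions about the setup, not the paper's exact code:

```python
import numpy as np

def condition_on_mutation(site_probs, parent_idx):
    """Condition a per-site amino acid distribution on a mutation occurring:
    zero out the parent amino acid's probability and renormalize the rest."""
    q = np.array(site_probs, dtype=float)
    q[parent_idx] = 0.0
    return q / q.sum()

def conditional_perplexity(child_probs):
    """Perplexity of the observed child amino acids at mutated sites:
    exp of the mean negative log conditional probability."""
    p = np.asarray(child_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(p))))
```

A model that always spread probability uniformly over the 19 non-parent amino acids would score 19; the reported medians (4.88 for the DASM vs. 7.39) sit well below that, but the stochasticity of affinity maturation keeps the achievable floor substantially above 1.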

Figure 3—figure supplement 5
Scatterplots for Shanehsazzadeh data.

Scatterplots for the zero-shot data set of Shanehsazzadeh et al., 2023, partitioned by sequence length.

Figure 3—figure supplement 6
MAGMA scatter plots.

Model predictions versus experimentally measured binding affinity for five antibodies using the MAGMA-seq protocol. ‘Petersen’ data are from an experiment probing the rules of recognition for influenza neutralizing antibodies (Petersen et al., 2024) and ‘Kirby’ data are from combinatorial libraries applying combinations of mutations on the path from naive to mature (Kirby et al., 2025). Each row shows a different model (DASM, ESM2, AbLang2, ProGen2) and each column shows a different antibody lineage.

Tables

Table 1
Correlation of model predictions with the effects of single mutations on antibody expression and antigen binding, as measured in Koenig et al., 2017.

Masked language models were scored using per-sequence pseudo-perplexity following the FLAb protocol (Chungyoun et al., 2024).

                 Binding            Expression
Model            Heavy     Light    Heavy     Light
AbLang2         −0.114    −0.108    0.153    −0.109
DASM             0.335     0.316    0.688     0.674
ESM2             0.009     0.243    0.384     0.416
ProGen2          0.156     0.276    0.559     0.568
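The pseudo-perplexity scoring used for the masked language models above can be sketched as follows. Here `masked_logprob` is a hypothetical callable standing in for a real masked language model (mask position i, return the log probability of the true residue there); the FLAb protocol's exact details may differ:

```python
import math

def pseudo_perplexity(sequence, masked_logprob):
    """Per-sequence pseudo-perplexity for a masked language model: mask each
    position in turn, score the true residue, average the log probabilities,
    and exponentiate the negative mean."""
    logps = [masked_logprob(sequence, i) for i in range(len(sequence))]
    return math.exp(-sum(logps) / len(logps))

# Illustrative stand-in model that assigns probability 0.5 to the true
# residue at every masked position; the pseudo-perplexity is then 2.
uniform_scorer = lambda seq, i: math.log(0.5)
score = pseudo_perplexity("EVQL", uniform_scorer)
```

Lower pseudo-perplexity means the model finds the sequence more plausible; correlating these scores with measured expression or binding gives the table entries above.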
Table 2
Correlation of models with binding measurements on the data of Shanehsazzadeh et al., 2023, which typically involves multi-mutant variants.

See Figure 3—figure supplement 5 for scatterplots.

Seq. length    AbLang2    DASM     ESM2     ProGen2
119            0.263      0.458    0.191    0.074
120            0.166      0.518    0.308    0.052
Table 3
Correlations between model predictions and binding affinity.

The ‘Petersen’ data are from an experiment probing the rules of recognition for influenza neutralizing antibodies (Petersen et al., 2024) and the ‘Kirby’ data are from combinatorial libraries applying combinations of mutations on the path from naive to mature SARS-CoV-2 antibodies (Kirby et al., 2025). See Figure 3—figure supplement 6 for corresponding scatter plots.

Source                             Antibody     AbLang2    DASM      ESM2      ProGen2
Petersen (Petersen et al., 2024)   222-1C06      0.041      0.248    −0.039    −0.030
                                   319-345       0.079      0.279     0.199     0.222
Kirby (Kirby et al., 2025)         002-S21F2    −0.246     −0.165    −0.283    −0.353
                                   Ab_2-15      −0.325      0.094    −0.265    −0.069
                                   C118         −0.624      0.276    −0.293    −0.159
Table 4
Computational efficiency comparison on sequences from the MAGMA-seq experiments.

10 sequences were run on CPU, and 100 on the GPU server.

Model      CPU (s/seq)    GPU (s/seq)
DASM           0.0097         0.0053
AbLang2       11.0529         0.5196
ESM2         112.5989         7.6090
Appendix 1—table 1
Data used in this paper.

CC means that PCPs with mutated cysteines are excluded from the data. JaffePairedCC is paired data from Jaffe et al., 2022, sequenced using the 10x Genomics platform. TangCC is heavy chain data from Vergani et al., 2017; Tang et al., 2022, sequenced using the methods of Vergani et al., 2017. VanwinkleheavyTrainCC1m is a subset of the heavy chain data from Engelbrecht et al., 2025, sequenced using the Takara 5′ RACE BCR kit. VanwinklelightTrainCC1m is a 1M subset of the light chain data from Engelbrecht et al., 2025, sequenced using the Takara 5′ RACE BCR kit. RodriguezCC is the 5′ RACE heavy chain data from Rodriguez et al., 2023, and is used only for testing. The ‘Samples’ column is the number of individual samples in the dataset; in these datasets, each sample is from a distinct individual. ‘Families’ is the number of clonal families in the dataset. ‘PCPs’ is the number of parent-child pairs in the dataset. ‘Med. mutns’ is the median number of mutations per PCP in the dataset.

Purpose    Name                       Samples    Families    PCPs         Med. mutns
Train      JaffePairedCC                    4      50,776      209,599    7
Train      TangCC                           2     145,267      651,899    2
Train      VanwinkleheavyTrainCC          149      21,269      124,985    4
Train      VanwinklelightTrainCC1m        330       2,658    1,000,000    2
Test       RodriguezCC                      5       1,359      238,050    5
Appendix 1—table 2
Comparison of ESM2 model sizes on the Koenig benchmark using the masked-marginals approach.

The larger ESM2-3B model (2.8B parameters) performs slightly worse than the 650M variant, particularly on light chain predictions.

                 Binding            Expression
Model            Heavy     Light    Heavy     Light
ESM2-650M       −0.001     0.308    0.418     0.524
ESM2-3B         −0.025     0.283    0.418     0.469


Frederick A Matsen IV, Will Dumm, Kevin Sung, Mackenzie M Johnson, David H Rich, Tyler N Starr, Yun S Song, Julia Fukuyama, Hugh K Haddox (2026) Separating selection from mutation in antibody language models. eLife 15:RP109644. https://doi.org/10.7554/eLife.109644.3