Separating selection from mutation in antibody language models
Figures
Nucleotide-level mutation processes distort protein language model predictions.
(a) AbLang2 assigns almost 100× lower probabilities to amino acids requiring multiple nucleotide mutations than to single-mutation variants. Each point represents a possible amino acid substitution at a single site in the amino acid sequence. (b) AbLang2 probabilities correlate with neutral somatic hypermutation probabilities across the V-encoded portion of nine naive sequences, demonstrating that the model is strongly influenced by mutation bias. Each point represents a site in the sequence. Triangles are outliers that have been clipped into the plotted y range. (c) AbLang2 functional prediction accuracy drops substantially for amino acids that are multiple (2 or 3) nucleotide mutations away from the wild-type codon. Data are from Koenig et al., 2017.
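The single- versus multi-nucleotide distinction in panel (a) can be made concrete: for each candidate amino acid, count the minimum number of nucleotide substitutions needed to turn the wild-type codon into any codon encoding that amino acid. A minimal sketch using the standard genetic code (the helper name is ours, not from the paper):

```python
from itertools import product

# Standard genetic code; codons enumerated in TCAG order, matching the
# row/column layout of the usual codon table.
BASES = "TCAG"
AMINOS = ("FFLLSSSSYY**CC*W"
          "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR"
          "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINOS)}

def min_nt_changes(codon: str, target_aa: str) -> int:
    """Minimum number of nucleotide substitutions needed to turn `codon`
    into any codon encoding `target_aa`."""
    return min(
        sum(a != b for a, b in zip(codon, alt))
        for alt, aa in CODON_TABLE.items()
        if aa == target_aa
    )
```

For example, reaching asparagine (N) from AAA (lysine) needs one substitution, while reaching phenylalanine (F) from TGG (tryptophan) needs two; the latter is the "multiple" category in panels (a) and (c).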
Conflating mutation and selection hinders functional prediction.
This plot is analogous to the rightmost panel of Figure 1c, but colored by mutability according to the Thrifty (Sung et al., 2025a) model.
Our model separates mutation from selection to predict functional effects without nucleotide-level biases.
(a) Our model combines a fixed mutation component (trained on non-functional data) with a learned selection component (deep amino acid selection model [DASM] transformer). Training uses inferred parent-child sequence pairs from reconstructed B cell phylogenies to predict natural affinity maturation after a jointly inferred time t. (b) The DASM directly predicts selection factors for all amino acid substitutions at every position in a single forward pass. Positive factors indicate beneficial changes, and negative factors indicate deleterious changes.
Overview of full detailed methods.
The model is trained to predict the probability of a child sequence given the parent sequence, and is divided into mutation and selection components. Mutation: given a parent sequence X, the probability of mutation to alternate codons after time t is calculated using a model of SHM (Sung et al., 2025a) and a ‘multihit model’ (see Methods), and then aggregated into codons as in Matsen et al., 2025b to obtain per-codon mutation probabilities. Selection: given the amino acid translation of the parent, the transformer encoder gives per-site selection factors. These are then multiplied (Equation 1) and summed to give the probability of the observed child sequence at every site, yielding a likelihood for each parent-child pair. The algorithm maximizes this likelihood across branch lengths t for each parent-child pair as well as across the parameters of the transformer model for all parent-child pairs in the dataset (dashed lines).
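The combination of the two components can be sketched at the amino acid level. This is a deliberate simplification: the paper works at the codon level, and the exact normalization of the selection-scaled probabilities is our assumption; `pcp_log_likelihood` is an illustrative helper, not the paper's code.

```python
import numpy as np

def pcp_log_likelihood(p_neutral: np.ndarray,
                       sel_factors: np.ndarray,
                       child_idx: np.ndarray) -> float:
    """Log-likelihood of a child sequence given its parent.

    p_neutral:   (sites x alphabet) neutral substitution probabilities
                 from the SHM/mutation component after branch length t.
    sel_factors: (sites x alphabet) multiplicative selection factors
                 from the selection component (DASM).
    child_idx:   observed child residue index at each site.
    """
    scaled = p_neutral * sel_factors             # mutation x selection, Equation 1 style
    scaled /= scaled.sum(axis=1, keepdims=True)  # renormalize per site (our choice)
    site_probs = scaled[np.arange(len(child_idx)), child_idx]
    return float(np.log(site_probs).sum())
```

Training then amounts to maximizing this quantity over the branch length t (which enters through `p_neutral`) per pair, and over the transformer parameters (which produce `sel_factors`) across all pairs.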
Multihit correction improves model fit.
Models that allow for multiple hits per codon (right column) show a better fit than models without this correction (left column) on the out-of-frame data from Tang et al., 2022 as prepared in Sung et al., 2025a. These observed-versus-expected plots compare the observed number of mutations to the computationally predicted number across bins of mutation probability. For each bin, the expected number of mutations is calculated as the sum of mutation probabilities for all sites in that bin, while the observed count is the actual number of mutations found at those sites. The overlap metric quantifies the area of overlap between the observed and expected distributions divided by their average area, while the residual metric measures the normalized root-mean-square difference between observed and expected counts. The plots are faceted by the number of mutations per codon.
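The two summary metrics described above can be sketched directly from the binned counts. The normalization of the residual by the mean expected count is our assumption; the paper's exact scaling may differ, and `oe_metrics` is an illustrative helper.

```python
import numpy as np

def oe_metrics(observed: np.ndarray, expected: np.ndarray):
    """Summary metrics for an observed-versus-expected plot.

    overlap:  area shared by the two binned distributions divided by the
              average of their areas (1.0 means identical histograms).
    residual: root-mean-square of per-bin differences, normalized by the
              mean expected count (one plausible normalization).
    """
    shared = np.minimum(observed, expected).sum()
    avg_area = 0.5 * (observed.sum() + expected.sum())
    overlap = shared / avg_area
    residual = np.sqrt(np.mean((observed - expected) ** 2)) / expected.mean()
    return overlap, residual
```

Identical observed and expected histograms give overlap 1.0 and residual 0.0; any disagreement pushes the overlap down and the residual up.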
Comparing model predictions with experimentally measured effects of mutations on antibody expression from Koenig et al., 2017.
(a) The deep amino acid selection model (DASM) maintains high predictive accuracy on functional effects of mutations regardless of codon accessibility. The correlation is equally high for amino acid mutations that only require a single-nucleotide mutation (left plot) vs. amino acid mutations that require multi-nucleotide mutations (center plot), demonstrating successful separation of mutation bias from functional effects. Compare Figure 1c. (b) DASM predictions mimic patterns in the expression data. For additional heatmap comparisons, see Figure 3—figure supplement 3.
Light chain version of main figure.
DASM removes codon bias in light chain functional predictions. Same plot as Figure 3, but for the light chain.
Comparison of selection factors for codon neighbors in the deep amino acid selection model (DASM).
DASM selection factors match the codon-table pattern seen in experimental measurements, while masked language models show artifacts from the codon table. The experimental data (left two panels) show a slight decrease in median scores for amino acids requiring multiple nucleotide mutations (‘multiple’) versus single mutations (‘single’). DASM captures this pattern, showing similar distributions for both categories. In contrast, AbLang and ESM assign radically lower scores to multi-nucleotide amino acid substitutions, consistent with the masked language modeling objective learning codon-level mutation probabilities as described in the main text (Figure 1a). The comparison is done on the heavy chain of the Koenig et al., 2017 wildtype sequence.
Heavy chain heatmap.
Heatmap showing the heavy-chain data of Koenig et al., 2017 along with model predictions.
Parent-child pair (PCP) predictions.
The DASM with the Thrifty neutral model accurately assesses probabilities of natural affinity maturation paths. (a, b) The DASM is better than AbLang2 at predicting the location of nonsynonymous mutations observed in PCPs withheld from model training. (c) The DASM achieves lower conditional perplexity (median 4.88 vs 7.39) when predicting amino acid identity at mutated sites, with fewer extreme outliers. The conditional perplexity is the perplexity of the child amino acid, conditioned on there being a mutation at that site. Note that due to the inherent stochasticity of affinity maturation, there is a lower limit to this conditional perplexity that is substantially greater than 1.
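The conditional perplexity in panel (c) can be sketched as follows, under our reading of the caption: at each mutated site, remove the parent residue's probability, renormalize over the alternatives, and exponentiate the mean negative log-probability of the observed child residues (`conditional_perplexity` is an illustrative helper, not the paper's code).

```python
import numpy as np

def conditional_perplexity(probs: np.ndarray,
                           parent_idx: np.ndarray,
                           child_idx: np.ndarray) -> float:
    """Perplexity of the child residue at mutated sites, conditioned on a
    mutation having occurred there. probs is (sites x alphabet) model output."""
    mutated = parent_idx != child_idx
    logps = []
    for i in np.flatnonzero(mutated):
        p = probs[i].copy()
        p[parent_idx[i]] = 0.0   # condition on "not the parent residue"
        p /= p.sum()             # renormalize over the alternatives
        logps.append(np.log(p[child_idx[i]]))
    return float(np.exp(-np.mean(logps)))
```

A model that always splits its conditional probability evenly over two plausible substitutions would score 2.0; the caption's point is that affinity maturation is stochastic enough that no model can approach 1.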
Scatterplots for Shanehsazzadeh data.
Scatterplots for the zero-shot data set of Shanehsazzadeh et al., 2023, partitioned by sequence length.
MAGMA scatter plots.
Model predictions versus experimentally measured binding affinity for five antibodies using the MAGMA-seq protocol. ‘Petersen’ data are from an experiment probing the rules of recognition for influenza neutralizing antibodies (Petersen et al., 2024) and ‘Kirby’ data are from combinatorial libraries applying combinations of mutations on the path from naive to mature (Kirby et al., 2025). Each row shows a different model (DASM, ESM2, AbLang2, ProGen2) and each column shows a different antibody lineage.
Tables
Correlation of models with predicting effects of single mutations on antibody expression and antigen binding, as measured in Koenig et al., 2017.
Masked language models were scored using per-sequence pseudo-perplexity following the FLAb protocol (Chungyoun et al., 2024).
| Model | Binding (Heavy) | Binding (Light) | Expression (Heavy) | Expression (Light) |
|---|---|---|---|---|
| AbLang2 | −0.114 | −0.108 | 0.153 | −0.109 |
| DASM | 0.335 | 0.316 | 0.688 | 0.674 |
| ESM2 | 0.009 | 0.243 | 0.384 | 0.416 |
| ProGen2 | 0.156 | 0.276 | 0.559 | 0.568 |
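The pseudo-perplexity scoring referenced above can be sketched as follows: mask each position in turn, score the true residue under the model, and exponentiate the negative mean log-probability. The `masked_logprob` hook is a hypothetical wrapper around the actual model's forward pass (AbLang2, ESM2, ...), not a real library API.

```python
import math

def pseudo_perplexity(sequence: str, masked_logprob) -> float:
    """Per-sequence pseudo-perplexity for a masked language model.

    masked_logprob(seq, i) should return the log-probability the model
    assigns to the true residue seq[i] when position i is masked
    (caller-supplied; name and signature are ours).
    """
    logps = [masked_logprob(sequence, i) for i in range(len(sequence))]
    return math.exp(-sum(logps) / len(logps))
```

Lower is better: a model that assigns probability 0.5 to every masked residue scores exactly 2.0.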
Correlation of models with binding measurements on the data of Shanehsazzadeh et al., 2023, which typically involves multi-mutant variants.
See Figure 3—figure supplement 5 for scatterplots.
| Seq. length | AbLang2 | DASM | ESM2 | ProGen2 |
|---|---|---|---|---|
| 119 | 0.263 | 0.458 | 0.191 | 0.074 |
| 120 | 0.166 | 0.518 | 0.308 | 0.052 |
Correlations between model predictions and binding affinity.
The ‘Petersen’ data are from an experiment probing the rules of recognition for influenza neutralizing antibodies (Petersen et al., 2024) and the ‘Kirby’ data are from combinatorial libraries applying combinations of mutations on the path from naive to mature SARS-CoV-2 antibodies (Kirby et al., 2025). See Figure 3—figure supplement 6 for corresponding scatter plots.
| Source | Antibody | AbLang2 | DASM | ESM2 | ProGen2 |
|---|---|---|---|---|---|
| Petersen (Petersen et al., 2024) | 222-1C06 | 0.041 | 0.248 | −0.039 | −0.030 |
| Petersen (Petersen et al., 2024) | 319-345 | 0.079 | 0.279 | 0.199 | 0.222 |
| Kirby (Kirby et al., 2025) | 002-S21F2 | −0.246 | −0.165 | −0.283 | −0.353 |
| Kirby (Kirby et al., 2025) | Ab_2-15 | −0.325 | 0.094 | −0.265 | −0.069 |
| Kirby (Kirby et al., 2025) | C118 | −0.624 | 0.276 | −0.293 | −0.159 |
Computational efficiency comparison on sequences from the MAGMA-seq experiments.
10 sequences were run on CPU, and 100 on the GPU server.
| Model | CPU (s/seq) | GPU (s/seq) |
|---|---|---|
| DASM | 0.0097 | 0.0053 |
| AbLang2 | 11.0529 | 0.5196 |
| ESM2 | 112.5989 | 7.6090 |
Data used in this paper.
CC means that PCPs with mutated cysteines are excluded from the data. JaffePairedCC is paired data from Jaffe et al., 2022, sequenced using the 10x platform. TangCC is heavy chain data from Vergani et al., 2017; Tang et al., 2022, sequenced using the methods of Vergani et al., 2017. VanwinkleheavyTrainCC is a subset of the heavy chain data from Engelbrecht et al., 2025, sequenced using the Takara 5’RACE BCR kit. VanwinklelightTrainCC1m is a 1M subset of the light chain data from Engelbrecht et al., 2025, also sequenced using the Takara 5’RACE BCR kit. RodriguezCC is the RACE heavy chain data from Rodriguez et al., 2023, and is used only for testing. The ‘Samples’ column is the number of individual samples in the dataset; in these datasets, each sample is from a distinct individual. ‘Families’ is the number of clonal families, ‘PCPs’ the number of parent-child pairs, and ‘Med. mutns’ the median number of mutations per PCP.
| Purpose | Name | Samples | Families | PCPs | Med. mutns |
|---|---|---|---|---|---|
| Train | JaffePairedCC | 4 | 50,776 | 209,599 | 7 |
| Train | TangCC | 21 | 45,267 | 651,899 | 2 |
| Train | VanwinkleheavyTrainCC | 149 | 21,269 | 124,985 | 4 |
| Train | VanwinklelightTrainCC1m | 330 | 2658 | 1,000,000 | 2 |
| Test | RodriguezCC | 51 | 3592 | 38,050 | 5 |
Comparison of ESM2 model sizes on the Koenig benchmark using the masked-marginals approach.
The larger ESM2-3B model (2.8B parameters) performs slightly worse than the 650M variant, particularly on light chain predictions.
| Model | Binding (Heavy) | Binding (Light) | Expression (Heavy) | Expression (Light) |
|---|---|---|---|---|
| ESM2-650M | −0.001 | 0.308 | 0.418 | 0.524 |
| ESM2-3B | −0.025 | 0.283 | 0.418 | 0.469 |