Nucleotide-level mutation processes distort protein language model predictions.

a: AbLang2 assigns almost 100-fold lower probabilities to amino acids requiring multiple nucleotide mutations compared to single-mutation variants. Each point represents a possible amino acid substitution at a single site in the amino acid sequence. b: AbLang2 probabilities correlate with neutral somatic hypermutation probabilities across the V-encoded portion of nine naive sequences, demonstrating how strongly the model is affected by mutation bias. Each point represents a site in the sequence. Triangles are outliers that have been brought into the y range. c: AbLang2 functional prediction accuracy drops substantially for amino acids that are multiple (2 or 3) nucleotide mutations away from the wildtype codon. Data from [11].
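The notion of an amino acid being one, two, or three nucleotide mutations away from the wildtype codon can be made concrete with a short sketch. The function name `min_nt_mutations` is hypothetical and is not taken from the paper's code; the only assumption is the standard genetic code.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def min_nt_mutations(wildtype_codon, target_aa):
    """Fewest nucleotide changes that turn wildtype_codon into any codon for target_aa."""
    return min(
        sum(a != b for a, b in zip(wildtype_codon, codon))
        for codon, aa in CODON_TABLE.items()
        if aa == target_aa
    )


# From ATG (Met): Ile is one mutation away, Trp two, Tyr three.
for aa in "IWY":
    print(aa, min_nt_mutations("ATG", aa))
```

Grouping substitutions by this distance gives the single-mutation versus multi-mutation split used in the panels.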

Our model separates mutation from selection to predict functional effects without nucleotide-level biases.

a: Our model combines a fixed mutation component (trained on non-functional data) with a learned selection component (DASM transformer). Training uses inferred parent-child sequence pairs from reconstructed B cell phylogenies to predict natural affinity maturation after a jointly-inferred time t. b: The DASM directly predicts selection factors for all amino acid substitutions at every position in a single forward pass. Positive factors indicate beneficial changes, and negative factors indicate deleterious changes.

Comparing model predictions with experimentally measured effects of mutations on antibody expression from [11].

a: The DASM maintains high predictive accuracy on functional effects of mutations regardless of codon accessibility. The correlation is equally high for amino-acid mutations that only require a single-nucleotide mutation (left plot) vs. amino-acid mutations that require multi-nucleotide mutations (center plot), demonstrating successful separation of mutation bias from functional effects. Compare Figure 1c. b: DASM predictions mimic patterns in the expression data. For additional heatmap comparisons see Figure S4.

Correlation between model predictions and the effects of single mutations on antibody expression and antigen binding, as measured in [11].

Correlation of models with binding measurements on the data of [23], which typically involves multi-mutant variants.

See Figure S6 for scatterplots.

Correlations between model predictions and binding affinity.

The “Petersen” data is from an experiment probing the rules of recognition for influenza neutralizing antibodies [24] and “Kirby” data is from combinatorial libraries applying combinations of mutations on the path from naive to mature SARS-CoV-2 antibodies [26]. See Figure S7 for corresponding scatter plots.

Computational efficiency comparison on sequences from the MAGMA-seq experiments.

Ten sequences were run on the CPU, and 100 on the GPU server.

Data used in this paper.

CC means that PCPs with mutated cysteines are excluded from the data. JaffePairedCC is paired data from [56] sequenced using 10X. TangCC data is heavy chain data from [55, 57] and was sequenced using the methods of [57]. VanwinkleheavyTrainCC1m is a subset of the heavy chain data from [58] sequenced using the Takara 5’ RACE BCR kit. VanwinklelightTrainCC1m is a 1M subset of the light chain data from [58] sequenced using the Takara 5’ RACE BCR kit. RodriguezCC data is the 5’ RACE heavy chain data from [12] and is used only for testing. The “samples” column is the number of individual samples in the dataset; in these datasets, each sample is from a distinct individual. “Clonal families” is the number of clonal families in the dataset. “PCPs” is the number of parent-child pairs in the dataset. “Median mutations” is the median number of mutations per PCP in the dataset.

Conflating mutation and selection hinders functional prediction.

This plot is analogous to the rightmost panel of Figure 1c, but colored by mutability according to the Thrifty [7] model.

DASM removes codon bias in light chain functional predictions.

Equivalent to Figure 3, but for the light chain.

DASM selection factors are similar between codon neighbor amino acids and non-neighbors (compare Figure 1a), in contrast to other models.

Comparison done on the heavy chain of the Koenig [11] wildtype sequence.

Heatmap showing the heavy-chain data of [11] along with model predictions.

The DASM with the Thrifty neutral model accurately assesses probabilities of natural affinity maturation paths.

a, b: The DASM is better than AbLang2 at predicting the location of nonsynonymous mutations observed in PCPs withheld from model training. c: The DASM achieves lower conditional perplexity (median 4.88 vs 7.39) when predicting amino acid identity at mutated sites, with fewer extreme outliers. The conditional perplexity is the perplexity of the child amino acid, conditioned on there being a mutation at that site. Note that due to the inherent stochasticity of affinity maturation, there is a lower limit to this conditional perplexity that is substantially greater than 1.
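As a rough illustration of the conditional perplexity described above, the sketch below drops the parent amino acid at each mutated site, renormalizes the remaining probabilities, and exponentiates the mean negative log probability of the observed child amino acid. The function name, the input layout, and the choice to average over sites are assumptions made for illustration; the paper's exact aggregation (for example, how the median is taken) is not specified in this caption.

```python
import math


def conditional_perplexity(site_probs, parents, children):
    """Perplexity of the child amino acid, conditioned on a mutation at each site.

    site_probs: one dict per mutated site, mapping amino acid -> model probability.
    parents, children: parent and child amino acids at those same sites.
    """
    log_probs = []
    for probs, parent, child in zip(site_probs, parents, children):
        # Conditioning on a mutation: remove the parent and renormalize the rest.
        non_parent_mass = sum(p for aa, p in probs.items() if aa != parent)
        log_probs.append(math.log(probs[child] / non_parent_mass))
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that is uniform over the 19 non-parent amino acids has perplexity 19.
uniform = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}
print(conditional_perplexity([uniform], ["A"], ["C"]))
```

Under this definition a value of 1 would require the model to predict the child amino acid with certainty at every mutated site.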

Scatterplots for the zero-shot data set of [23], partitioned by sequence length.

Model predictions versus experimentally measured binding affinity for five antibodies using the MAGMA-seq protocol.

“Petersen” data are from an experiment probing the rules of recognition for influenza neutralizing antibodies [24] and “Kirby” data are from combinatorial libraries applying combinations of mutations on the path from naive to mature SARS-CoV-2 antibodies [26]. Each row shows a different model (DASM, ESM2, AbLang2, ProGen2) and each column shows a different antibody lineage.

The model is trained to predict the probability of a child sequence given the parent sequence.

It is divided into mutation and selection components. Mutation: given a parent sequence X, the probability of mutation to alternate codons after time t is calculated using a model of SHM [7] and a “multihit model” (see Methods), and then aggregated into codons as in [15] to obtain $p_{j,c}(t, X)$. Selection: given an amino acid translation of the parent, the transformer-encoder gives per-site selection factors. These are then multiplied (1) and summed to give the probability of the observed child sequence at every site. This gives a likelihood for a parent-child pair. The algorithm maximizes the likelihood across branch lengths t for each parent-child pair as well as across the parameters of the transformer model for all parent-child pairs in the dataset (dashed lines).
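The multiply-and-sum step can be sketched for a single site as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the codon probabilities stand in for the output of the mutation model, and assigning the leftover probability mass to the parent amino acid is an assumption made here so that the per-site probabilities sum to one.

```python
import math


def site_child_probs(codon_probs, codon_to_aa, selection, parent_aa):
    """Combine neutral codon probabilities with selection factors at one site."""
    # Sum the neutral codon probabilities within each encoded amino acid.
    neutral = {}
    for codon, p in codon_probs.items():
        aa = codon_to_aa[codon]
        neutral[aa] = neutral.get(aa, 0.0) + p
    # Multiply each amino acid substitution by its selection factor.
    probs = {aa: p * selection[aa] for aa, p in neutral.items() if aa != parent_aa}
    # Assumption: the parent amino acid receives the remaining probability mass.
    probs[parent_aa] = 1.0 - sum(probs.values())
    return probs


def pcp_log_likelihood(per_site_probs, child_aas):
    """Log likelihood of the observed child amino acids across sites."""
    return sum(math.log(probs[aa]) for probs, aa in zip(per_site_probs, child_aas))


# Toy site with parent codon ATG (Met) and three reachable alternatives.
codon_probs = {"ATG": 0.90, "ATT": 0.04, "ATC": 0.02, "TTG": 0.04}
codon_to_aa = {"ATG": "M", "ATT": "I", "ATC": "I", "TTG": "L"}
selection = {"M": 1.0, "I": 0.5, "L": 2.0}  # hypothetical selection factors
site = site_child_probs(codon_probs, codon_to_aa, selection, parent_aa="M")
print(site, pcp_log_likelihood([site], ["L"]))
```

In training, a quantity of this kind would be maximized jointly over the branch length t of each parent-child pair and over the transformer parameters that produce the selection factors.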

Models that allow for multiple hits per codon (right column) fit better than models that do not (left column) on the out-of-frame data from [55] as prepared in [7].

These observed-versus-expected plots compare the observed number of mutations to the computationally predicted number across bins of mutation probability. For each bin, the expected number of mutations is calculated as the sum of mutation probabilities for all sites in that bin, while the observed count represents the actual mutations found in those sites. The overlap metric quantifies the area of overlap between observed and expected distributions divided by their average area, while the residual metric measures the normalized root-mean-square difference between observed and expected counts. These plots are faceted by the number of mutations per codon.
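The binning and the two summary metrics can be sketched as below. The function names, the equal-width bins on [0, 1], and the normalization of the residual by the total expected count are all assumptions made for illustration; the caption does not state the bin spacing or the normalization actually used.

```python
import math


def observed_vs_expected(probs, mutated, n_bins=10):
    """Per-bin observed mutation counts and expected counts (summed probabilities)."""
    expected = [0.0] * n_bins
    observed = [0] * n_bins
    for p, m in zip(probs, mutated):
        # Equal-width bins; a probability of exactly 1.0 falls in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        expected[b] += p
        observed[b] += int(bool(m))
    return observed, expected


def overlap(observed, expected):
    """Area shared by the two histograms divided by their average area."""
    shared = sum(min(o, e) for o, e in zip(observed, expected))
    return shared / (0.5 * (sum(observed) + sum(expected)))


def residual(observed, expected):
    """RMS difference between counts, normalized by the total expected count (assumed)."""
    mean_sq = sum((o - e) ** 2 for o, e in zip(observed, expected)) / len(observed)
    return math.sqrt(mean_sq) / sum(expected)


obs, exp = observed_vs_expected([0.05, 0.05, 0.95, 0.95], [0, 0, 1, 1], n_bins=2)
print(obs, exp, overlap(obs, exp), residual(obs, exp))
```

A well-calibrated mutation model gives an overlap near 1 and a residual near 0.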