Combining mutation and recombination statistics to infer clonal families in antibody repertoires

  1. Natanael Spisak
  2. Gabriel Athènes
  3. Thomas Dupic
  4. Thierry Mora  Is a corresponding author
  5. Aleksandra M Walczak  Is a corresponding author
  1. Laboratoire de physique de l’École normale supérieure, CNRS, PSL University, Sorbonne Université and Université de Paris, France
  2. Saber Bio SAS, Institut du Cerveau, iPEPS The Healthtech Hub, France
  3. Department of Organismic and Evolutionary Biology, Harvard University, United States
9 figures, 1 table and 1 additional file

Figures

Clonal families and VJl classes.

(A) Variable region of the immunoglobulin heavy chain (IgH)-coding gene. (B) A clonal family is a lineage of related B cells stemming from the same VDJ recombination event. The partition of the B-cell receptor (BCR) repertoire into clonal families is a refinement of the partition into VJl classes, defined by sequences with the same V and J usage and the same complementarity determining region 3 (CDR3) length l. (C–D) Properties of VJl classes in donor 326651 from Briney et al., 2019. (C) Distribution of VJl class sizes exhibits power-law scaling. The total number of pairwise comparisons in the largest VJl classes is 1052=1010. (D) Distribution of the CDR3 length l. The distribution is in yellow for in-frame CDR3 sequences (l multiple of 3), and in gray for out-of-frame sequences.

Figure 2 with 7 supplements
Complementarity determining region 3 (CDR3)-based inference method (HILARy-CDR3).

(A) Example distribution of normalized Hamming distances, x=n/l, for one VJl class with CDR3 length l=21, V gene IGHV3-9 and J gene IGHJ4 (black). We fit the distribution by a mixture of positive pairs (belonging to the same family, in blue) and negative pairs (belonging to different families, in red). See Figure 2—figure supplement 5 for example fit results across different CDR3 lengths. Inset: the prevalence is defined as a fraction of positive pairs and was estimated to ρ^=3.1%. Data from donor 326651 of Briney et al., 2019. (B) Distribution of the maximum likelihood estimates of prevalence ρ^ across VJl classes in donor 326651. (C–F) The choice of threshold t on the normalized Hamming distance x translates to the following a priori characteristics of inference (illustrated here for arbitrarily chosen ρ and μ). (C) Fallout rate p^(t)=FP^/(FP^+TN^). The null distribution of all negatives (N=FP + TN) is estimated using the soNNia sequence generation software. (D) Sensitivity s^(t)=TP^/(TP^+FN^). (E–F) Precision π^=TP^/(TP^+FP^). For the same choice of threshold t, a low prevalence of ρ^=103 (E) leads to lower precision than high prevalence of ρ^=101 (F). (G) Model distribution PT(x|μ) of distances between unrelated sequences, for l=15,30,45,60, computed by the soNNia software. (H) Precision π^, computed a priori (i.e. before doing the inference) from the model with μ^=0.04, ρ^=0.1, and l=15,...,81 (colors as in G), as a function of the threshold t. For each VJl class and its own inferred ρ^ and μ^, the threshold t is chosen to achieve a desired π. (I) High-precision threshold t ensuring π^(t)=π=99% a priori, as a function of CDR3 length l for different values of the prevalence ρ^, and μ^=0.04, as predicted by the model. (J) Sensitivity s^(t) at the high-precision threshold t, as a function of CDR3 length l for different values of the prevalence ρ^ (colors as in I). Solid lines denote a priori prediction for intermediate mean distance μ=4%, dashed lines denote actual performance of HILARy-CDR3 in a synthetic dataset.

Figure 2—figure supplement 1
Mean intra-family distances.

Distribution of the maximum likelihood estimates of mean intra-family distance μ across VJl classes.

Figure 2—figure supplement 2
Null distribution PN(x|l) of CDR3 distances between unrelated sequences for l[15,81], computed by soNNia software.

White line denotes a growing threshold ensuring a fallout rate p<104 as determined by this distribution.

Figure 2—figure supplement 3
Distribution of normalized Hamming distances.

x=n/l, for largest VJl class for each CDR3 length l=81 (black). We fit the distribution by a mixture of positive pairs (PT(x|μ) in blue), and negative pairs (PN(x), in red). For l=18 the estimate μ^ is too large results in large fitting error and for sensitivity computation we used global μ=4% (in green).

Figure 2—figure supplement 4
Distribution of post-selection probabilities.

Ppost of CDR3 nucleotide sequences computed using soNNia across CDR3 lengths. Short junctions are on average more likely to be generated in VDJ recombination and pass subsequent selection (Isacchini et al., 2021). This makes inference in low-l classes more difficult, a feature reflected by synthetic dataset constructed by sampling unmutated lineage progenitors from the soNNia model.

Figure 2—figure supplement 5
Site frequency spectra estimated for families identifed using high-precision CDR3-based inference method (HILARy-CDR3) in the subset of the data where this approach is highly reliable (large-l and large-ρ^ regime).

The distributions are shown for families of varying family size, z[10,100] and averaged over all families of a given size. Together with the exact configuration of sequences carrying a given substitution, synthetic datasets of the sames ignatures of mutations and clonal expansions can be generated.

Figure 2—figure supplement 6
Distribution of normalized Hamming distances x=n/l, for l classes, averaging over all VJl classes.

We fit the distribution by a mixture of positive pairs using a geometric distribution (PT(x|μ) in blue), and negative pairs (PN(x), in red). The corresponding prevalence estimates p^ are used for small VJl classes for which this parameter cannot be reliably estimated independently.

Figure 2—figure supplement 7
Prevalence and VJl class size.

Dependence of prevalence estimates on VJl class size N for largest classes in donor 326651 from Briney et al., 2019. 28% of variation in prevalence estimates can be explained by variation in VJl class sizes.

Figure 3 with 2 supplements
Full inference method with mutational information.

(A) For a pair of sequences, n1,n2 denote the numbers of mutations along the templated region (V and J), and n0 is the number of shared mutations. For related sequences, n0 corresponds to mutations on the initial branch of the tree, and is expected to be larger than for unrelated sequences, where n0 corresponds to coincidental mutations. (B) Positive and negative pairs are called mutated if both sequences have mutations n1,n2>0. Among positive pairs in the synthetic datasets, more than 99% are mutated. (C, D) Distributions of the rescaled variables x and y (Equation 4), for pairs of synthetic sequences belonging to the same lineage (positive pairs) and sequences belonging to different lineages (negative pairs). The separatrix xy=t marks a high-precision (99%) threshold choice. (E) To limit the number of pairwise comparisons we make use of high-precision and high-sensitivity complementarity determining region 3 (CDR3)-based partitions. High precision corresponds to the choice t=t. High sensitivity corresponds to a coarser partition where t is set to achieve 90% sensitivity. When the two partitions disagree, mutational information can be used to break the coarse, high-sensitivity partition into smaller clonal families. (F, G) Mutations-based methods achieve high sensitivity across all CDR3 lengths l in the synthetic dataset (G), extending the range of applicability with respect to the CDR3-based method (F).

Figure 3—figure supplement 1
Merging partitions.

Red circles represent clusters from the coarse (high-sensitivity) partition, while green clusters represent the fine (high-precision) partition. When the two partitions differ, HILARy-full merges precise clusters inside each sensitive cluster whenever there exists of pair of positive sequences linking them.

Figure 3—figure supplement 2
Error vs VJl class size.

We plot the fitting error of P(x) by the mixture model, for each VJl class in the synthetic dataset, as a function of their sizes. The error is computed as the squared difference between the model and data distributions of distances.

Figure 4 with 2 supplements
Benchmark of the alternative methods on synthetic heavy-chain repertoires.

(A) Comparison of inference time using subsamples from the largest VJl class found in donor 326651 from Briney et al., 2019. Comparisons were done on a computer with 14 double-threaded 2.60 GHz CPUs (28 threads in total) and 62.7 Gb of RAM. (B) Clustering precision πpost (post single-linkage clustering of positive pairs), (C) sensitivity spost, and (D) variation of information v as a function of complementarity determining region 3 (CDR3) length l in the realistic synthetic dataset generated for this study. Solid lines represent the mean value averaged over five synthetic datasets. (E–G) Same as (B–D) but for the synthetic dataset from Ralph and Matsen, 2022, designed for the development and testing of the partis software. The solid lines represent the mean over the three datasets.

Figure 4—figure supplement 1
Performance of spectral SCOPer using V gene mutations.

(A) Comparison of inference time using subsamples from the largest VJl class found in donor 326651 from Briney et al., 2019. Comparisons were done on a computer with 14 double-threaded 2.60GHz CPUs (28 threads in total) and 62.7Gb of RAM. (B) Clustering precision πpost (post single linkage clustering of positive pairs), (C) Sensitivity spost, and (D) variation of information υ vs CDR3 length l in the realistic synthetic dataset generated for this study. Solid lines represent the mean value averaged over 5 synthetic datasets.

Figure 4—figure supplement 2
Performance of single-linkage clustering with fixed threshold.

We call this method VJCDR3-sim, where sim is the threshold on the normalized similarity between two CDR3s, equal to 1x, where x is our normalized Hamming distance. (A) Clustering precision πpost (post single linkage clustering of positive pairs), (B) sensitivity spost, and (C) variation of information υ as a function of CDR3 length l in the realistic synthetic dataset generated for this study. Solid lines represent the mean value averaged over 5 synthetic datasets. (D-F): Same than (A-C) using the synthetic data from Ralph and Matsen, 2022 and across mutation rates.

Performance of HILARy as a function of mutation rate for heavy chains, on synthetic data from Ralph and Matsen, 2022, designed for the development and testing of the partis software.

(A) Clustering precision πpost (post single-linkage clustering of positive pairs), (B) sensitivity spost, and (C) variation of information v as a function of mutation rate, using the heavy chain only. Solid lines represent the mean value averaged over the three datasets.

Benchmark of HILARy with paired light and heavy chains.

(A) Clustering precision πpost (post single-linkage clustering of positive pairs), (B) sensitivity spost, and (C) variation of information v as a function of complementarity determining region 3 (CDR3) length l, on the synthetic datasets from Ralph and Matsen, 2022, designed for the development and testing of the partis software. (D–F): Same as (A–C) but as a function of mutation rate.

Inference results across complementarity determining region 3 (CDR3) lengths.

Inference results for donor 326651 of Briney et al., 2019, are presented for nine quantiles of the CDR3 distribution, each containing between 8% and 12% of the total number of sequences (corresponding to nine colors in the inset of A). (A) Distributions of family size z. All CDR3 length quantiles exhibit universal power-law scaling with exponent −2.3. (B) Site frequency spectra estimated for families of sizes z=100. Families of larger sizes were subsampled to z=100 to subtract the influence of varying family sizes. (C) Distribution of lineage dN/dS ratios computed for polymorphisms in CDR3 regions over all lineages within each nine quantile.

Author response image 1
Author response image 2

Tables

Table 1
Summary of notations used throughout the paper.

Hats ˆ denote estimates from the fit of the mixture model. Stars ∗ denote estimates after imposing 99% precision. The ‘post’ subscript denotes quantities after applying single-linkage clustering to obtain a partition from positive pairs.

Definition
ρPrevalence/fraction of positive pairs
πPrecision = TP/(TP+FP)
sSensitivity = TP/(TP+FN)
pFallout = FP/(FP+TN)
tThreshold on CDR3 distance
lCDR3 length
nCDR3 Hamming distance of a pair
xNormalized CDR3 Hamming distance=l/n
xCDR3 Hamming distance, centered and scaled
yShared mutations on V segment, centered and scaled
μMean x between positive pairs
PTModel distribution for positive pairs
PFModel distribution for negative pairs

Additional files

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Natanael Spisak
  2. Gabriel Athènes
  3. Thomas Dupic
  4. Thierry Mora
  5. Aleksandra M Walczak
(2024)
Combining mutation and recombination statistics to infer clonal families in antibody repertoires
eLife 13:e86181.
https://doi.org/10.7554/eLife.86181