Biochemical patterns of antibody polyreactivity revealed through a bioinformatics-based analysis of CDR loops

  1. Christopher T Boughter
  2. Marta T Borowska
  3. Jenna J Guthmiller
  4. Albert Bendelac
  5. Patrick C Wilson
  6. Benoit Roux
  7. Erin J Adams  Is a corresponding author
  1. Graduate Program in Biophysical Sciences, University of Chicago, United States
  2. Department of Biochemistry and Molecular Biology, University of Chicago, United States
  3. Department of Medicine, Section of Rheumatology, University of Chicago, United States
  4. Committee on Immunology, University of Chicago, United States
  5. Department of Pathology, University of Chicago, United States

Abstract

Antibodies are critical components of adaptive immunity, binding with high affinity to pathogenic epitopes. Antibodies undergo rigorous selection to achieve this high affinity, yet some maintain an additional basal level of low affinity, broad reactivity to diverse epitopes, a phenomenon termed ‘polyreactivity’. While polyreactivity has been observed in antibodies isolated from various immunological niches, the biophysical properties that allow for promiscuity in a protein selected for high-affinity binding to a single target remain unclear. Using a database of over 1000 polyreactive and non-polyreactive antibody sequences, we created a bioinformatic pipeline to isolate key determinants of polyreactivity. These determinants, which include an increase in inter-loop crosstalk and a propensity for a neutral binding surface, are sufficient to generate a classifier able to identify polyreactive antibodies with over 75% accuracy. The framework from which this classifier was built is generalizable, and represents a powerful, automated pipeline for future immune repertoire analysis.

eLife digest

To defend itself against bacteria and viruses, the body depends on a group of proteins known as antibodies. Each subset of antibodies undergoes a rigorous training regimen to ensure it recognizes a single epitope well – that is, one specific region on the surface of foreign, harmful organisms.

Most antibodies stick extremely tightly to their one unique epitope, but some can also weakly bind to molecules that are vastly different from their main trained targets. This feature – known as polyreactivity – can in some cases help the immune system fight against multiple strains of viruses. On the other hand, when antibodies are designed in the laboratory to treat diseases, this characteristic can sometimes lead to the failure of pre-clinical trials. Yet it is currently unclear why some antibodies are polyreactive when others are not.

To investigate this question, Boughter et al. compared over 1,000 polyreactive and non-polyreactive antibody sequences from a large database, revealing differences in the physical properties of the region of the antibodies that attaches to epitopes. Using these defining features, Boughter et al. went on to design a new piece of freely available, automated software that could predict which antibodies would be polyreactive more than 75% of the time.

Such software could ultimately help to guide the design of antibody-based treatments, while bypassing the need for costly laboratory tests.

Introduction

Antibodies are immunogenic proteins expressed by B cells that play a major role in the adaptive immune response against non-self. Upon recognition of target epitopes, these antibodies undergo multiple rounds of somatic hypermutation and affinity maturation inside a germinal center, whereby the amino acid sequence of the epitope-binding surface is selected for optimal binding to the target (Victora and Nussenzweig, 2012; Eisen and Siskind, 1964; McKean et al., 1984). The longer this affinity maturation process extends, the higher the affinity and specificity of the antibodies toward their target antigen, primarily through mutagenesis of the six complementarity determining region (CDR) loops of the antibody (Victora and Nussenzweig, 2012). Using a combination of affinity matured CDR loops, these antibodies bind strongly to the target and aid in invader neutralization. While the process of affinity maturation and somatic hypermutation of antibodies results in high-affinity and incredibly specific binders to a particular epitope, some antibodies have been shown to display signs of reactivity toward diverse off-target epitopes. This broad but low-affinity binding has been termed ‘polyreactivity’.

Antibody polyreactivity has been hypothesized to be beneficial in the early stages of antibody maturation, acting as a pool of diverse binders ready to recognize novel antigens and initiate the more stringent selection process (Dimitrov et al., 2013). To this end, a majority of B cell receptors and antibodies which have not undergone somatic hypermutation, including those on immature B cells and early ‘natural’ antibodies, have been found to be polyreactive to some extent and are suggested to have an innate-like response to pathogens (Ochsenbein et al., 1999; Wardemann et al., 2003). While these mostly unmutated polyreactive antibodies remain at low frequency in antigen-experienced individuals, a distinct population of polyreactive antibodies that have undergone selection are still expressed by mature B cells that circulate in blood (Tiller et al., 2007). In fact, some studies have found that the polyreactivity status of an antibody is mostly independent of the number of somatic hypermutations in the antibody sequence (Mouquet et al., 2010; Prigent et al., 2016). In line with this finding, only 5–10% of the repertoire of naive B cells circulating in the periphery are polyreactive, but this increases to 20–30% in the memory B cell compartment, showing a distinct capability of polyreactivity to survive selection (Tiller et al., 2007; Koelsch et al., 2007). These results suggest that polyreactivity can persist, or perhaps even be selected for during the selection process within the germinal center.

In a few notable cases, polyreactivity may in fact augment the efficacy of a given immune response. Polyreactive IgA antibodies have been shown to have an inherent reactivity to microbiota in the mouse gut, with a predicted role in host homeostasis (Bunker et al., 2017). These previously identified antibodies so far have no known primary ligands, yet play a key role in facilitating the gut immune response to the plethora of exogenous antigens encountered in the dynamic dietary and microbial environment of the gut. This implies the existence of antibodies whose primary function is to act as polyreactive sentries in the gut, yet the downstream effects of polyreactive antibodies coating commensal bacteria is so far unclear. Similar polyreactive IgA and IgG mucosal antibodies were found in the gut of human immunodeficiency virus (HIV)-infected patients, but these antibodies either had low affinity to the virus or lacked neutralization capabilities (Planchais et al., 2019). The benefit of singular antibody sequences with the ability to sample large portions of the commensal population may represent an improvement in efficiency of the homeostatic machinery of the gut.

While the precise role of these primarily polyreactive gut antibodies is still a topic of debate, polyreactivity has been suggested to augment the immune response in other immunological niches. Broadly neutralizing antibodies (bnAbs), which bind robustly to conserved epitopes on the surface glycoproteins of influenza viruses or HIV are more likely to be polyreactive (Haynes et al., 2005; Mouquet et al., 2011; Andrews et al., 2015). In one study of HIV-binding antibodies, over half of all tested bnAbs were found to be polyreactive (Prigent et al., 2018). These bnAbs have been the subject of intense study for their potential as the central components of an HIV treatment or as the byproduct of an immune response to a universal Influenza vaccine (Andrews et al., 2015; Haynes et al., 2019; Crowell et al., 2019; Li et al., 2012). One hypothesized mechanism for the capability of polyreactive antibodies to confer this broad neutralization in the face of a changing viral epitope is heteroligation, the ability of a single antibody to bind the primary target with one binding domain and use the other binding domain to bind in a polyreactive manner (Mouquet et al., 2010). This heteroligation allows the antibody to take advantage of the significant avidity increase afforded by bivalent binding, despite the low envelope protein density of HIV or a geometry which does not readily lend itself to bivalent binding on the surface of influenza viruses (Klein and Bjorkman, 2010).

Although polyreactivity may play a positive role in natural immune responses, oftentimes this same property is considered undesirable from the point of view of generating therapeutic antibodies with high specificity. Antibody-based treatments, which generally take the form of an intravenous transfusion, are sensitive to the accelerated systemic clearance of polyreactive antibodies (Hötzel et al., 2012; Kelly et al., 2015; Kelly et al., 2018; Datta-Mannan et al., 2015). In general, much work has focused on attempting to answer the question of optimizing ‘developability’ of a given antibody. These efforts have been dedicated to determining the most critical components of developability through a large array of experimental assays, in silico structural prediction-based methods, sequence-based analysis and their correlations with clearance, sequence-based SASA predictions, and sequence-based aggregation propensity predictors (Jain et al., 2017b; Raybould et al., 2019; Sharma et al., 2014; Jain et al., 2017a; Obrezanova et al., 2015). In many of these studies, polyreactivity or non-specificity in general was seen to be a negative indicator of the developability of a drug, suggesting that therapeutic antibodies should strive toward a drug-like specificity (Starr and Tessier, 2019).

In line with this goal of understanding the predominant factors involved in the specificity of therapeutic antibodies, many researchers have worked to identify the biophysical underpinnings of polyreactivity in natural immune responses. The most popular hypotheses for the primary biophysical predictors of polyreactivity have included CDR3 length (Prigent et al., 2016), CDR3 flexibility (Prigent et al., 2018), net hydrophobicity (Lecerf et al., 2019), and net charge (Rabia et al., 2018). More observational studies have found an increased prevalence of arginine and tyrosine in polyreactive antibodies (Kelly et al., 2018; Birtalan et al., 2008). While these previous studies represent substantial advances in the study of polyreactivity, they have often been limited in scope, focusing on a singular antibody source and primarily focused on CDR3H. Comparing across these individual antibody sources highlights discrepancies between the proposed predictors of polyreactivity. The aforementioned properties determined to be key to polyreactivity in previous studies were found to be statistically insignificant in studies of HIV-binding and mouse gut polyreactive antibodies (Bunker et al., 2017; Mouquet et al., 2010).

Clearly, a computational framework that would enable us to predict the polyreactivity of a given antibody a priori, whether evaluating the efficacy of a natural immune response or the potential fate of a therapeutic antibody, would be tremendously useful. Such a framework, for example, could be used to assist in the isolation of broadly neutralizing antiviral antibodies, or speed up the process of therapeutic antibody screening. To achieve this goal, a thorough understanding of the molecular features behind polyreactive binding interactions is critical. Experimental approaches utilizing next-generation sequencing and ELISA allow for the identification of hundreds of polyreactive antibody sequences. However, the systematic characterization of these antibodies is difficult. Issues immediately arise when defining the conditions by which we determine an antibody to be polyreactive. While polyreactivity may exist on some continuous spectrum, we are inclined to frame the problem as binary. This binary discretization is useful for the identification of meaningful differences, yet must be recognized as an imperfect assumption. In addition to this more philosophical challenge, experimental efforts must also overcome significant hurdles. Detailed biochemical studies of polyreactive antibodies via protein crystallography, quantitative binding experiments, and mutagenesis provide exceptional insight but are inherently low throughput. Structural modeling of these polyreactive antibodies represent a high-throughput approach, but models of flexible loops are relatively unreliable, and are unlikely to capture nuances in side-chain placement (Karami et al., 2019). A bioinformatics-based approach, centered around high-throughput analysis that minimizes structural assumptions while maintaining positional context of amino acid sequences would provide a thorough, unbiased analysis of existing data and create a powerful pipeline for future studies.

In this study, we show that using just the amino acid sequences of antibodies from a database of over 1000 sequences tested for polyreactivity, unifying biophysical properties that distinguish polyreactive antibodies from non-polyreactive antibodies can be identified. We find that, while charge and hydrophobicity are in fact important determinants of polyreactivity, the characteristic feature of polyreactive antibodies appears to be a shift toward neutrality of the binding interface. In addition, loop crosstalk is more prevalent in the heavy chain of polyreactive antibodies than non-polyreactive antibodies. From these properties, a machine learning-based classification software was developed with the capability to determine the polyreactivity status of a given sequence. This software is generalizable and can be retrained on any binary classification problem and identify the key differences between two distinct populations of antibodies, T cell receptors, or MHC-like molecules at the amino acid level.

Results

Database

Our aggregate database of over 1000 antibody sequences is compiled from our own previously published and new data, in addition to data from published studies by the Mouquet and Nussenzweig labs (Table 1Mouquet et al., 2010; Bunker et al., 2017; Planchais et al., 2019; Mouquet et al., 2011; Prigent et al., 2018). Using an ELISA-based assay, the reactivity of each antibody is tested against a panel of 4–7 biochemically diverse target antigens: DNA, insulin, lipopolysaccharide (LPS), flagellin, albumin, cardiolipin, and keyhole limpet hemocyanin (KLH). This panel has become increasingly prevalent in the literature for experimental measures of polyreactivity in antibodies (Mouquet et al., 2010; Prigent et al., 2016Bunker et al., 2017; Planchais et al., 2019; Mouquet et al., 2011; Andrews et al., 2015Prigent et al., 2018Jain et al., 2017b; Neu et al., 2019; Wrammert et al., 2011). The ligands represent a diverse sampling of biophysical and biochemical properties: for example, enrichment in negative charge (DNA, insulin, LPS, albumin), amphipathic in nature (LPS, cardiolipin), exceptionally polar (KLH), or large in size (KLH, flagellin). From this panel, a general rating of ‘polyreactive’ or ‘non-polyreactive’ is given to 529 and 524 antibodies, respectively. For the purposes of this study, antibodies are determined to be polyreactive if the authors of the original studies determined a particular clone binds to two or more ligands in the panel. Those that bind to one or none of the ligands in the panel are deemed non-polyreactive.

Table 1
A quantification of the antibodies used in this study.
Dataset# Polyreactive# Non-PolyreactiveTotal
Mouse IgA205240445
HIV reactive172124296
Influenza reactive152160312
Complete dataset5295241053

A limitation of this full polyreactivity dataset is that there exists an intermediate between the two classes. As discussed in the introduction, it is not immediately obvious where the line for polyreactivity should be drawn. An antibody that binds to 2–3 ligands may not necessarily achieve broad reactivity through the same mechanism as an antibody that binds four or more ligands from a panel of 6 or 7. To remove these ambiguities, a so-called 'parsed' dataset is generated whereby antibodies that bind 4–7 ligands are labeled polyreactive, antibodies that bind 0 panel ligands are labeled non-polyreactive, and those that bind 1–3 are removed from the analysis. The results presented below utilize the full dataset of 1053 antibody sequences, unless otherwise noted. Analysis of the parsed dataset can be found in the supplementary figures.

A surface-level analysis of polyreactive antibody sequences

As a first pass at the given dataset, we focus on the most simplistic of the possible explanations for differences between polyreactive and non-polyreactive antibodies, specifically the J- and V-gene usage of each group. Figure 1A and B, rendered with code adapted from the Dash et al. derived program TCRdist (Dash et al., 2017), represents each antibody V-gene as a line connecting a single heavy and light chain gene for the full human-derived antibody dataset (685 sequences). We repeat this analysis in Figure 1—figure supplement 1 using the parsed human-derived antibody dataset (472 sequences). Direct comparisons between mouse and human derived antibodies is difficult at the gene usage level. A similar analysis highlighting differences between mouse polyreactive and non-polyreactive antibodies can be found in the supplement (Figure 1—figure supplement 2).

Figure 1 with 5 supplements see all
A comparative genetic analysis of human-derived polyreactive and non-polyreactive antibody sequences uncovers population level differences.

Gene usage diagrams comparing (A) human polyreactive and (B) non-polyreactive sequences show a qualitative difference in the VH gene usage. Shared colors indicate identical genes, gray indicates genes that are not seen in the other population at a level over 2%. Unlabeled genes are colored randomly to highlight genetic variation in the populations. (C) Sequence alignment of the most prevalent genes in the polyreactive and non-polyreactive populations compared to a reference gene common to each population. Hydrophobic amino acids are colored white, hydrophilic amino acids are colored gray, and positively or negatively charged amino acids are colored blue or red, respectively. (D) Percentage and raw count of observed gene usage for the polyreactive and non-polyreactive sequences.

Genes are identified from nucleotide sequences using NCBI’s IgBLAST command line tool (Ye et al., 2013). Heavy and light chain genes that are shared between polyreactive and non-polyreactive sequences are colored for the top labeled instances. Genes that are labeled but not found above a 2% threshold in the opposite population are colored gray, whereas those that do not have a visible name are colored randomly to highlight variation in gene usage. From this comparison, it is clear that the variable gene usage is skewed between polyreactive and non-polyreactive sequences, with an enrichment of VH1-69, VH1-46, and VH4-59 in the polyreactive population, a trend that persists in the parsed dataset (Figure 1—figure supplement 1). In contrast, no qualitative differences in the J-gene usage are readily discernible between these two groups (Figure 1—figure supplement 3).

While the full alignment of these most used heavy chain variable genes shows a high degree of sequence similarity (Figure 1—figure supplement 4), Figure 1C highlights the regions of highest dissimilarity between the biophysical properties of amino acids in prevalent genes within each population. VH3-23, the most prevalent gene in the non-polyreactive human dataset and the second most prevalent gene in the polyreactive human dataset, can be used as a reference for comparisons between genes enriched in each individual population. This reference gene shares a high degree of sequence similarity with the second and third most frequently occurring genes in the non-polyreactive dataset, VH3-7 and VH3-9, save for a lysine and aspartic acid pair in framework 2 of VH3-7. The genes enriched in the polyreactive dataset, however, are quite different from this reference. All three of the polyreactive enriched genes have charged residues where the non-polyreactive enriched genes have hydrophilic residues (or vice versa) at IMGT positions 1, 13, and 90. These initial results hint at some systematic differences between the polyreactive and non-polyreactive antibody populations.

Figure 1D quantifies the extent of the difference in gene usage in each population by comparing these most prominent genes from our accumulated dataset of HIV- and influenza virus-reactive antibodies. While the two most common genes in the polyreactive dataset account for 27% of the human polyreactive antibodies in this study, the top three most common genes in the non-polyreactive dataset account for just over 17% of the total population. In addition to being the most prevalent gene in the polyreactive dataset, VH16901 has also been found historically to be more prevalent in broadly neutralizing antibodies against influenza viruses, in line with the previously mentioned overlap between bnAbs and polyreactivity (Andrews et al., 2015; Wrammert et al., 2011).

Overall, there is a noticeable difference between the gene usage frequency of polyreactive and non-polyreactive antibodies, but the overlap in the usage of the two populations suggests that gene usage alone is not sufficient to distinguish the two groups. While there exist qualitative differences between framework sequences enriched in the polyreactive dataset compared to the non-polyreactive population, a look at the amino acid usage of the CDR loops of each group shows no significant differences (Figure 1—figure supplement 5). This implies that the positional context of a given amino acid is critical to tease out differences in antibody binding properties.

A position-sensitive matrix representation of sequences provides further insights into polyreactivity

To identify deeper trends in the biophysical properties of polyreactive antibodies, we utilize a new methodology to analyze and represent a range of different properties inherent to these sequences. Although the framework regions of antibodies are highly conserved, the CDR loops vary significantly in length and show very low conservation between populations. This makes alignment of CDR loops difficult without creating subgroups for loops of identical length. To overcome this, the sequence data is reorganized into a matrix representation (Figure 2A). Each sequence is aligned by the center of each CDR loop, with spaces between the loops set to zero and each amino acid encoded as a number from 1 to 21. While this alignment method excludes the framework regions of the antibodies and slightly averages out some of the properties at the edge of the CDR loops, we reason that most of these differences are evident in the gene usage analysis of the previous section. From this simple alignment, no obvious patterns emerge separating polyreactive and non-polyreactive antibodies; however, we can clearly see that mouse gut-derived IgA antibodies have generally shorter CDR3H loops, and more conserved CDR3L sequences when compared to the human-derived antibody sequences. All subsequent analysis is derived from this matrix representation of the sequences.

Figure 2 with 1 supplement see all
A new representation of CDR loop sequences improves the position-sensitivity of quantitative antibody analysis.

(A) Matrix representation of the amino acid sequences used in this study provides a framework for further analysis. Each amino acid is encoded as a number from 1 to 21, represented by a distinct color in the matrix. A 0-value is used as a buffer between loops and is represented by the dark blue regions. The red line separates polyreactive and non-polyreactive sequences. (B) Amino acid frequency difference between polyreactive and non-polyreactive sequences for all six CDR loops. Residues more common in polyreactive sequences are shown in green, while those more common in non-polyreactive sequences are shown in pink. Loop positions correspond to the numerical position within the matrix of panel (A). (C) An in-depth representation highlighting the amino acid frequencies used to create panel (B). Only frequency changes greater than 10% are shown for clarity.

With this new positionally sensitive and quantitative alignment method, we are able to further dissect the differences in amino acid sequences presented in Figure 1. Figure 2B uses this positional sequence encoding to determine the amino acid frequency difference between polyreactive and non-polyreactive sequences. For example, phenylalanine is found at position 93 in roughly 40% of polyreactive sequences and nearly 60% of non-polyreactive sequences. Therefore position 93, amino acid F has an intensity of −0.2 in Figure 2B.

From this panel it is evident that most of the major differences are in the germline encoded regions CDR1H and CDR2H, in line with the observations from Figure 1 that suggest polyreactive antibodies have a distinct gene usage when compared to non-polyreactive antibodies. Figure 2C further expands on these differences, showing the largest changes in amino acid frequencies between the two populations. We can see that there is a slight decrease of phenylalanine frequency in CDR1H of polyreactive antibodies, in favor of isoleucine. Additionally, there is a general shift toward hydrophobicity in CDR2H, as the hydrophilic residue serine at matrix positions 78 and 82 is less prevalent in polyreactive antibodies, instead replaced by the more hydrophobic residues isoleucine and glycine. In the parsed dataset, these differences become larger in magnitude, particularly in CDR1L, where phenylalanine is again found less frequently in polyreactive sequences (Figure 2—figure supplement 1).

This increased prevalence in loop hydrophobicity of polyreactive antibodies has been suggested before in the literature (Prigent et al., 2018) along with a net increase in positive charge (Rabia et al., 2018), so we next aimed to analyze this matrix systematically using biophysical properties inherent to the loops. A simple analysis of the full human and mouse-derived dataset investigating classical parameters explored previously by other groups (CDR loop length, net charge, net hydrophobicity, and gene usage) and some new properties (side chain flexibility, side chain bulk, and the Kidera Factors from Kidera et al., 1985) show some significant differences between polyreactive and non-polyreactive antibodies (Figure 3A,B). The versatility of the positionally sensitive amino acid matrix allows for the application of multiple 'property masks' to tease out the specific regions of each CDR loop that contributes most to these significant differences. Given a property, amino acid charge for example, we can replace each simple 1–21 representation with a distinct representation based upon amino acid properties.

Figure 3 with 2 supplements see all
Position-sensitive quantification of CDR loop properties of mouse and human antibody sequences highlights differences between polyreactive and non-polyreactive populations.

Plotting the average CDR loop lengths (A) and net antibody biophysical properties (B) show small but significant differences when analyzed in bulk. Basic properties 1–5 are hydrophobicity1, charge, hydrophobicity2, side chain flexibility, and side chain bulk. Plotting the average net charge (C) and hydrophobicity (D) as a function of position of polyreactive and non-polyreactive sequences highlights significant differences in CDR3H. Light shadow around lines represent bootstrap standard errors. All uncertainties obtained via bootstrapping. Stars indicate p-value ≤ 0.05 calculated via nonparametric Studentized bootstrap test. Bars with a single star above represent contiguous regions of significance. p-values in panels (A) and (B) corrected for multiple tests using the Bonferroni correction.

In the matrix of Figure 2A leucine, histidine, and arginine are represented by the integers 3, 16, and 17. As an example, when the charge property mask is applied, the matrix representations of these three amino acids in all sequences is changed to 0.00, 0.091, and 1.00, respectively. We apply 62 such masks to this matrix, including simple metrics like charge, hydrophobicity, side chain flexibility, and side chain bulkiness to go along with more carefully curated metrics from the works of Kidera et al., 1985; Liu et al., 2018. A complete description of these properties can be found in the 'Key resources table' and in Appendix 1—table 1. The application of these masks gives an entirely new matrix describing the localization of amino acids with a given property.

By averaging across all sequences in the polyreactive or non-polyreactive dataset when these masks are applied, we can readily see differences in charge patterning and hydrophobicity when comparing polyreactive and non-polyreactive sequences (Figure 3C,D).

Including errors obtained via bootstrapping, we see that these differences are most pronounced in the center of CDR3H, with some differences also apparent in the remaining five loops. This analysis shows an overall bias toward neutrality in these regions; that is neither positively nor negatively charged, neither strongly hydrophilic nor hydrophobic. These results also contextualize the findings of Figure 2C. The trend toward hydrophobic residues in CDR2H of polyreactive antibodies importantly does not make these regions net hydrophobic, but instead make these regions slightly less hydrophilic on average. This effect is yet again more pronounced in the parsed dataset (Figure 3—figure supplement 1), with a strong trend toward interface neutrality. Conversely, when comparing bootstrap samples drawn from the null distribution, that is the 'polyreactive' or 'non-polyreactive' labels are given to antibody sequences at random, we see no difference between the biophysical properties of the two populations (Figure 3—figure supplement 2).

Systematic determination of the key contributions to polyreactivity

Along with simple property averaging, these masks also give a high dimensional space from which we can determine, in an unbiased way, the primary factors that discriminate polyreactive and non-polyreactive antibodies. As a first pass, we apply a principal component analysis (PCA) to the matrix of all antibody sequences in an attempt to separate the polyreactive or non-polyreactive populations along the axes of highest variation in the dataset. Unfortunately, the principal components of these data do not effectively distinguish between the two populations (Figure 4—figure supplement 1).

To further investigate the physical and sequence-based properties of polyreactivity in antibodies in a more targeted manner, we employ linear discriminant analysis (LDA), a common algorithm often applied in classification problems (Barker and Rayens, 2003; Cordeiro et al., 2009; Ma et al., 2013). LDA works in a manner conceptually similar to PCA, reducing the dimensionality of a given dataset via a linear combination of the original dimensions. However, LDA takes one additional input, the label or class of each sequence. Whereas the objective of PCA is to identify the axes which maximize the variance in the dataset, LDA has the dual objective of maximizing the projected distance between two classes while minimizing the variance within a given class. While LDA is well adapted for classifying two distinct populations, it is susceptible to overfitting, unlike PCA (Qiao et al., 2009). Here, we have labeled our two classes in the matrix with either a ‘1’ for polyreactive, or ‘0’ for non-polyreactive. In our application of LDA, we parse down the large number of input vectors using either PCA or an algorithm which selects the vectors with the largest average differences between the two populations. This reduction in dimensionality ensures the data are not being overfit, and the tunable number of input vectors allows us to control for overfitting in each individual application.

Figure 4A shows the results of LDA when applied to the parsed dataset comprised of 311 polyreactive antibodies and 362 non-polyreactive antibodies. As discussed in the introduction, the framing of polyreactivity as a binary problem is not a perfect assumption. The inclusion of intermediate levels of polyreactivity further confounds this issue. Indeed, the application of LDA to the full dataset shows a reduced ability to split polyreactive and non-polyreactive antibodies (Figure 4—figure supplement 2), likely due to this spectrum of polyreactivity. By considering only the parsed dataset for these classification analyses, we can improve confidence that the differences identified are those that separate strongly polyreactive and strongly non-polyreactive antibodies.

Figure 4 with 3 supplements see all
Linear discriminant analysis (LDA) can meaningfully separate the two populations and these meaningful differences can be used to generate a polyreactivity classifier.

(A) LDA applied individually to the complete parsed, Influenza, HIV, and mouse datasets. Percentages indicate the accuracy of the linear discriminant in labeling polyreactive and non-polyreactive antibodies. For these data, the plotted linear discriminants are comprised of different linear weights. (B) Accuracies of a polyreactivity classifier with a separate test and training dataset. Groupings in this figure are the same as those in panel (A). A support vector machine is generated for each individual population, and the reported values are accuracies calculated through leave one out cross validation. Shown are test data and a scrambled dataset where the labels of ‘polyreactive’ or ‘non-polyreactive’ are applied randomly (gray bars). The dotted line indicates 50% accuracy threshold. (C) Property matrices highlighting the top 10 weights of the linear discriminants in panel A for the parsed dataset with 75 vectors (C) and the HIV dataset with 75 vectors (D). Color bar represents the normalized weight of each property, where pink rectangles represent properties correlated with increased polyreactivity, and green rectangles represent properties correlated with decreased polyreactivity.

LDA analysis is versatile in its applications, and in this work, we utilize the method in two distinct modes. In the first mode, all available data is used as input with the output vector representing the features that best distinguish between the two complete populations. Plots of the data projected onto this vector (as in Figure 4A) represent the maximum achievable separation between the two populations for a defined number of input components from the given biophysical property matrix. In the second mode, we utilize LDA as a more canonical classification algorithm separating the data randomly into training and test groups. In this classification mode of operation, a combination of correlation analysis coupled with maximal average differences is used to parse input features, and a support vector machine (SVM) is used to generate the final classifier from these features. Accuracy of the resultant classifiers is assessed via leave one out cross validation, these accuracies are shown in Figure 4B.

In the first mode, we find that the data can be split more effectively when the parsed dataset is broken up into the distinct ‘reactivity’ groups, that is those antibodies specific for influenza viruses, HIV, or found in the mouse gut (Figure 4A). This suggests there may be some bias due to antigen specificity, or lack thereof, whereby influenza virus-specific antibodies take a slightly different path toward polyreactivity compared to HIV reactive or mouse gut IgA antibodies. However, when using the classification mode, the classification accuracy is roughly equivalent across all tested datasets (Figure 4B). Testing this classifier with a scrambled dataset, where the labels are randomly assigned, shows the expected decrease in classification accuracy for each individual dataset for all ranges of input features.

When applying LDA in the first mode (Figure 4A), we can directly pull the linear weights of each component comprising linear discriminant one and reveal which biophysical properties at each CDR position best distinguish between the two populations. The differences in the linear weights from the heavy chain CDR loops comprising each discriminant show clear differences when comparing the complete parsed dataset (Figure 4C) to the HIV only dataset (Figure 4D). In the parsed dataset, the discriminating weights are heavily concentrated in CDR2H. Whereas in the HIV dataset, these weights are centered around the CDR3H loop. Only the top 10 linear weights are shown in Figure 4C,D. The full matrix of linear weights can be found in Figure 4—figure supplement 3. The predominant discriminating factors between datasets might be due to the significant difference in CDR3H length between the mouse (IgA) and the human datasets, which confounds the analysis in this region. However, when examining each individual subset of the complete dataset we do find that there are common properties that seem to be the primary discriminators (i.e. largest linear weights). These are hydrophobicity 1, hydrophobicity 2, and hotspot variable 6 (a structural parameter related to α-helix propensity).

An information theoretic approach

While analysis of the biophysical property differences between polyreactive and non-polyreactive sequences provides some insight into the molecular basis for the polyreactivity phenomenon, a broad unifying pattern which could discern the biophysical mechanism behind polyreactivity was not readily evident across all types of antibodies. To probe these polyreactive sequences in a quantitative yet more coarse manner, we applied the formalism of information theory to our dataset of antibody sequences. Information theory, a theory classically applied to communication across noisy channels, is incredibly versatile in its applications, with high potential for further applications in immunology (Shannon, 1948; Román-Roldán et al., 1996; Cheong et al., 2011; Vinga, 2014; Mora et al., 2010; Murugan et al., 2012). In this work, we utilize two powerful concepts from information theory, namely Shannon entropy and mutual information.

Shannon entropy, in its simplest form, can be used as a proxy for the diversity in a given input population. This entropy, denoted as H, has the general form shown in Equation 1:

(1) H(X)=-Xp(x)log2p(x)

where p(x) is the occurrence probability of a given event, and X is the set of all events. We can then calculate this entropy at every position along the CDR loops, where X is the set of all amino acids, and p(x) is the probability of seeing a specific amino acid at the given position. In other words, we want to determine, for a given site in a CDR loop, how much diversity (or entropy) is present. Figure 5A shows this Shannon entropy distribution for the full dataset of polyreactive and non-polyreactive antibodies. Given there are only 20 amino acids used in naturally derived antibodies, we can calculate a theoretical maximum entropy of 4.2 bits, which assumes that every amino acid occurs at a given position with equal probability. Although the observed entropy of the CDR3H loop approaches this theoretical maximum, it hovers below it (3.5 Bits) due to the relative absence of the amino acids cysteine and proline in the center of this loop. The difference in the entropy distributions in CDR1H are consistent with the bias in amino acid usage in this region, shown previously in Figure 2.

Figure 5 with 2 supplements see all
An information theoretic analysis of antibody sequences shows an increase in polyreactive antibody loop crosstalk.

(A) The sequence diversity of the polyreactive and non-polyreactive datasets, quantified using Shannon Entropy, highlight similar diversities between the two groups. (B) A crystal structure (PDB: 5UGY) provides a visual representation of the lateral organization of the CDR loops on the antibody binding surface. (C) The difference in mutual information between polyreactive and non-polyreactive sequences shows that CDR loops of the heavy chain have more crosstalk in polyreactive antibodies. Each individual row represents the given condition, whereas each column gives the location the mutual information is calculated. (D) Singular slices of the mutual information show the data in (C), projected from the matrix onto a line, highlighting the significance of the differences at these particular locations. The positions of the ‘given’ amino acid, that is the particular Y in H(X|Y), are highlighted by gray boxes in panel C. Solid black lines indicate where on the X-axis this ‘given’ amino acid is located. Stars indicate statistical significance (p ≤ 0.05) calculated through a nonparametric permutation test. Bars with a single star above represent contiguous regions of significance.

Importantly, from this entropy we can calculate an equally interesting property of the dataset, namely the mutual information. Mutual information is similar, but not identical to, correlation. Whereas correlations are required to be linear, if two amino acids vary in any linked way, this will be reflected as an increase in mutual information. In addition, due to some of the highly conserved residues in the non-CDR3H loops, high covariance can be achieved for residues that have not been specifically selected for in the germinal center. Using this information theory framework, these conserved residues have a mutual information of 0. Overall, the mutual information can be used to identify patterns in antibody sequences that were not readily evident through the previous analysis in this or other studies. If there is some coevolution or crosstalk between residues undergoing some selection pressure in the antibody maturation process, it will be reflected as an increase in the mutual information. In this work, mutual information I(X;Y) is calculated by subtracting the Shannon entropy described above by the conditional Shannon entropy H(X|Y) at each given position as seen in Equations 2 and 3:

(2) H(X|Y)=-yYp(y)xXp(x|y)log2p(x|y)
(3) I(X;Y)=H(X)-H(X|Y)

To orient ourselves in physical space, Figure 5B gives an example crystal structure (PDB: 5UGY)(Whittle et al., 2011; Ziegler et al., 2014) highlighting the lateral arrangements of the CDR loops. The matrix in Figure 5C shows that the mutual information between CDR loops on this binding surface is increased in the heavy chains of polyreactive antibodies over non-polyreactive ones, suggesting an increase in loop crosstalk in antibodies that exhibit polyreactivity. Interestingly, it appears that there is a corresponding decrease of loop crosstalk in the light chains of polyreactive antibodies. Importantly, this crosstalk is increased across and within all loops when analyzing the parsed dataset (Figure 5—figure supplement 1). This observed crosstalk persists across all polyreactive antibodies within all subsets of our tested dataset and is evident both in intra-loop and inter-loop interactions. Figure 5D highlights some examples of the interesting significant differences of this crosstalk at distinct given positions within CDR1L, CDR1H, and CDR3H. A complete plot of the statistically significant differences (p0.05) of Figure 5C shows that a large portion of these differences are in fact significant (Figure 5—figure supplement 2).

The ordering of these entropy and information plots was chosen to reflect the spatial arrangement of the loops on the antibody surface; as such they show also that mutual information between loops drops off with physical distance between these loops. In other words, loops (and residues) that are located close to each other will have more of an effect on their direct neighbors as opposed to those that are more physically distant. This increased mutual information suggests that in the heavy chains of polyreactive antibodies, there is enhanced cooperativity or co-evolution of the amino acids of intra- and inter-CDR loop pairs.

Extension of the analysis to MHC and MHC-like molecules

Given the ability of this analysis pipeline to find nuanced differences between polyreactive and non-polyreactive antibodies, we next sought to expand the range of applications of our approach. Extension of the pipeline to the analysis of TCR sequences is trivial, due to the similar arrangement of CDR loops on the binding surface and the capability of IgBLAST to annotate TCR sequences (Ye et al., 2013). Instead, we sought to significantly expand the scope of this software by applying a similar approach to the analysis of MHC and MHC-like molecules. MHC molecules are encoded by a large superfamily of genes that are spread throughout the genome (Adams and Parham, 2001; Piertney and Oliver, 2006). MHC and MHC-like genes are found across a wide range of divergent species, and these genes have diversified extensively over time, making the distinction between orthologs and instances of convergent evolution difficult. In some cases, the divergence is extreme enough that phylogenetics cannot provide predictions of function. Given that these MHC molecules have evolved to present different antigen subtypes, such as lipid molecules in the case of CD1 proteins (Borg et al., 2007; Luoma et al., 2013; Adams, 2014), we explored the use of our pipeline as a classifier based on biophysical properties rather than phylogeny. In achieving this new functionality, the critical step lies in the transformation of the MHC sequences into a numeric form as in Figure 2A.

To accomplish this, we split the sequences by their most prominent structural features. For MHC and MHC-like molecules, these features are the two β-strands and α-helices of the so-called platform domain. As a test case, we use two example training classes; a representative list of human MHC Class I molecules, and the output from a BLAST query on CD1d (Sayers et al., 2020). We can then assess our ability to separate these two training classes, while also introducing a test class derived from the data of Almeida et al. (Almeida et al., 2020). Each sequence within a given class is globally aligned, and one representative sequence from each class is sent through the Phyre2 structural prediction server (Kelley et al., 2015). We then use these structural predictions to identify the start and end points of each major structural feature in the alignment space. These start and end points then define the boundaries that are numerically encoded into our position-sensitive matrix, as seen in Figure 6A.

The analysis pipeline is flexible and sufficiently identifies differences between MHC Class I and CD1-like molecules, and has the potential to be used as a classifier moving forward.

(A) Similar to antibody sequences, the MHC and CD1 sequences can be encoded into a matrix. Here, we focus on human Class I MHC molecules and CD1 molecules from various organisms, and use these sequences as training data to classify sequences collected in Almeida et al., 2020.(B) Comparisons of simple biophysical properties across these molecular species highlight differences between classes. (C) Projection of the biophysical properties of each class on to the first two principal components can be used to classify MHC- and CD1-like molecules. Molecules present in Almeida et al., 2020 but absent from the training data are labeled.

Once the data are in this form, all downstream analysis outlined previously can be applied in a similar manner. In this example case, we find that average biophysical properties across these sequences reveal expected differences in hydrophobicity, with the lipid binding CD1 molecules displaying increased hydrophobicity when compared to the peptide binding MHC class I molecules (Figure 6B). Interestingly, unlike in the case of the antibody analysis, a simple PCA can effectively discriminate between the two training classes in this example case. Figure 6C shows the projection of the biophysical property data of each class onto the first two principal components. Here, we see that the peptide binding molecules (HLA-E, HLA-A2, H2-D) and the lipid binding molecules (Human CD1b, Chicken CD1) of the test dataset cluster with the respective peptide and lipid binding training data. The majority of the data of Almeida et al., comprised of non-classical MHC class I molecules from cartilaginous fish, clusters as its own distinct group, likely due to evolutionary distance between these molecules and those derived from mammalian and avian immune systems.

Discussion

Previous research has highlighted the importance of hydrophobicity, charge, and CDR loop flexibility on antibody specificity. In this work, we expand upon these previous results with a new bioinformatic and biophysical characterization of polyreactive antibodies. The software generated for this study provides a powerful computational tool which can be utilized by researchers interested in discerning differences between populations of adaptive immune molecules in broad contexts. Building off of the efforts of our own work and that of experimental collaborators, we were able to aggregate to date one of the largest publicly available datasets of antibodies tested for polyreactivity. Differences in the germline gene frequency and amino acid frequencies show there exists some underlying differences between polyreactive and non-polyreactive antibodies. A surface level analysis of this dataset is able to discriminate certain features of polyreactive and non-polyreactive antibodies, namely that on average, polyreactive antibodies are less strongly negatively charged, less hydrophilic, and have a higher prevalence of antibodies with longer CDR loops of the heavy chain. Importantly, however, these binding surfaces do not have a net positive charge nor are they net hydrophobic.

Our results highlight an increase in VH1-69 gene usage in polyreactive antibodies, an interesting finding given the substantial literature outlining its importance in diverse immune environments. In addition to the aforementioned role of VH1-69 in broadly neutralizing anti-influenza and anti-HIV antibodies (Haynes et al., 2005; Mouquet et al., 2011; Andrews et al., 2015; Prigent et al., 2018), autoreactive chronic lymphocytic leukemic B cells commonly express receptors bearing VH1-69 (Sasso et al., 1993; Forconi et al., 2010), and anti-HIV antibodies which target the membrane-proximal external region of HIV-1 envelope glycoproteins frequently utilize VH1-69 (IAVI Protocol G Investigators et al., 2019). While previous reports suggest that the key feature permitting these auto-reactive or polyreactive interactions of VH1-69 is an exceptionally hydrophobic CDR2H loop (Chen et al., 2019) our results suggest this does not explain the over-representation of this antibody in the polyreactive dataset, as on average the CDR2H of polyreactive antibodies is strongly hydrophilic. Instead, certain structural or dynamic features of the antibody may contribute to its out-sized role in critical biological contexts.

To dig deeper into the biophysical differences between polyreactive and non-polyreactive antibodies, we created an adaptable software for the automated analysis of large antibody datasets and the application of a new analysis pipeline for the study of polyreactive antibodies. Overall, the improvements of this software to the current state of antibody sequence analysis are sufficient to highlight key differences in the two populations with improved spatial resolution. The position sensitive sequence alignment is able to further parse through the genetic differences and show that in general, polyreactive antibodies have a tendency to have more hydrophobic residues in CDR2H, and a decreased preference for phenylalanine in CDR1H. While these observational differences provided some initial insight, a more rigorous biophysical treatment was necessary. With the addition of 62 biophysical properties analyzed using the position sensitive alignment, significant differences between the CDR3H loops in polyreactive and non-polyreactive antibodies become immediately evident, providing a more detailed depiction of the antigen binding surface of polyreactive antibodies.

These data suggest a movement toward neutrality or ‘inoffensive’ residues in the CDR loops of polyreactive antibodies: amino acids that are neither exceptionally hydrophobic nor hydrophilic and with a net charge close to 0. Previous studies have suggested that polyreactive antibodies tend to have more hydrophobic CDR loops, such that low-affinity Van der Waals interactions might be the primary means of polyreactive interactions (Prigent et al., 2018; Starr and Tessier, 2019). However, these studies counted the number of hydrophobic residues per sequence or averaged the hydrophobicity of all six CDR loops. While our results partially agree with these previous findings, our analysis extends much further into defining the biophysical basis of this phenomenon. For example, while our position sensitive representation of the sequences shows that CDR3H does become more hydrophobic in polyreactive sequences, it is still net hydrophilic on average. A highly hydrophobic binding surface would provide an avenue for non-specific interactions with other hydrophobic proteins, but it would occlude binding to highly hydrophilic ligands like DNA. A slightly hydrophilic, neutral-charged binding surface would permit weak interactions with a wide range of ligands.

Using these and other biophysical properties as input feature vectors, we were able to generate a generalizable protocol for binary comparisons between two distinct populations of Ig-domain sequences. This framework is able to successfully split all tested polyreactive and non-polyreactive antibody datasets. Care was taken to not overfit these data and a preliminary classifier built from this algorithm was able to identify the proper number of input vectors for each LDA application. While there are general features which best split the polyreactive and non-polyreactive antibodies in these datasets, including charge, hydrophobicity, and α-helix propensity, these features alone are not sufficient to discriminate between the two populations. Instead, 75 vectors taken from the position-sensitive biophysical property matrix are necessary to properly split the groups, including both simple properties like charge, hydrophobicity, flexibility, and bulkiness and more carefully curated properties like the often used Kidera factors and the hotspot detecting variables of Liu et al. (Kidera et al., 1985; Liu et al., 2018; Vihinen et al., 1994). The inability to arrive at a core few biophysical properties that could effectively distinguish polyreactive and non-polyreactive antibodies necessitated the application of further approaches, namely information theory.

The tools provided by information theory proved to be effective in the present study. The classic approach to information theory considers some input, communication of this input across a noisy channel, and then reception of a meaningful message from the resultant output. We can think of the analogous case for these antibodies, whereby the sequence and structure of the antibodies can be seen as our input, the thermal noise inherent to biological systems can complicate biochemical interactions, and the necessary output is antigen recognition, i.e. binding between the antibody and the ligand. Focusing just on the antibody side of this communication channel, we determined the underlying loop diversity through the Shannon entropy of the polyreactive and non-polyreactive datasets. This diversity was found to be nearly equivalent while the mutual information, a metric of ‘crosstalk’ across populations, between and within CDR loops was found to be increased in the heavy chain and decreased in the light chain of polyreactive antibodies. What this loop crosstalk entails physically is not immediately clear from these measurements.

The mutual information increase could come from gene usage being somehow coupled, amino acid usage coupling with the cognate ligand, or the amino acids directly interacting physically with each other. In some way, this crosstalk appears to be selected for in the polyreactive population. If this increase in mutual information manifests as an increase of charge-charge interactions, this could explain why there is a minimal change in net charge of antibodies between the two groups, yet a significant move toward neutrality in the CDR loops of polyreactive antibodies. The pairing of two charged groups would help move the binding surface of polyreactive antibodies toward a more ‘inoffensive’ binding surface. A binding surface that is neither exceptionally hydrophobic nor hydrophilic, and lacks a significant positive or negative charge, would represent a relatively appealing binding interface for a low-affinity interaction with a large array of diverse ligands. A patchwork of hydrophobic and hydrophilic non-charged residues exposed to potential ligands would provide an ideal candidate polyreactive surface. The corresponding decrease in the mutual information between the light chain CDR loops of polyreactive antibodies could be caused by a de-emphasis in the involvement of these loops due to differential binding configurations of polyreactive ligands, as has been previously hypothesized (Dimitrov et al., 2013; Sethi et al., 2006).

While further crystallographic, biochemical, and dynamic studies are necessary to identify the true source of this increase in mutual information across polyreactive antibodies, we can speculate what these results may mean in the context of the results obtained using linear discriminant analysis. In addition to standard side chain properties, many of the most important features for splitting polyreactive and non-polyreactive antibodies were structural in nature. Specifically, hotspot variables 6, 24, 25, and 41 all correspond to the structural tendencies of a given amino acid. Coupled with the increase in side chain interactions that may be implied by the increased mutual information across the loops of polyreactive antibodies, this potential for increased loop structure may suggest more rigid CDR loops in polyreactive antibodies.

In addition to the insights into polyreactivity, the computational tools developed for this study are broadly applicable to future studies of large antibody or T cell receptor repertoires. One of the strengths of this approach is a decreased emphasis on structural information when crystal structures are unavailable. Computational prediction of loop conformation is difficult, and drawing inferences from incorrect models regarding side-chain interactions and positioning could be misleading. Reliable structural information on these polyreactive antibodies will be critical to a further understanding of the mechanisms of polyreactivity, including complex structures of antibodies bound to various ligands. In the high-throughput analysis of antibody sequences, our approach strikes a careful balance of the structural assumptions that should apply consistently across antibody populations.

Further experimental assays will be necessary to more comprehensively identify the underlying mechanisms of polyreactivity, including further sequencing and biochemical analysis of polyreactive and non-polyreactive antibodies. Antibodies specific to other pathogens or those from other organisms tested for polyreactivity will help form a more complete picture and improve the generality of the results. As with any machine learning based approach, the classification algorithm is only as good as the data it is trained on. Adding further data in the training set, including more mutations and germline reversions that turn a polyreactive antibody non-polyreactive or vice-versa, will be critical for a comprehensive analysis of polyreactivity. Additionally, a more robust assay for determining polyreactivity such as a chip based screen to test for binding to many diverse targets, would greatly broaden our perspective and help understand just how broadly reactive these polyreactive antibodies are. Lastly, a more complete understanding of the germinal center and the selection processes inherent to the affinity maturation process will assist in the determination of whether polyreactivity is a byproduct or a purposeful feature of the affinity maturation process.

The software generated for this study is publicly available as a python application (see Materials and methods). The unique aspect of this software is its hybrid approach to position-sensitive amino acid sequence analysis. Structural information is implicitly encoded by the alignment strategy employed, yet these assumptions are weaker than those imposed by explicit structural prediction. Downstream analysis from this positional encoder is streamlined and can be generalized to analyze any binary or higher order classification problems. This streamlined analysis allows for the generation of each figure in this study to be applied to thousands of sequences in a matter of minutes. The classification capabilities of the software could prove particularly useful when comparing binary classes, such as T cell receptors or antibody sequences derived from healthy and diseased tissue samples. Acceptable inputs are not restricted to CDR loops of immunoglobulins, and we have shown that the software can be adapted for the analysis of MHC-like molecules. Moving forward, this MHC analysis has the potential to classify the antigen binding properties of highly-divergent MHC sequences from a broad range of species, providing insights where phylogenetic approaches prove difficult. This software represents a strong addition to the existing toolkit for repertoire analysis of diverse molecular species.

Materials and methods

Key resources table
Reagent type
(species) or resource
DesignationSource or referenceIdentifiersAdditional information
Software, algorithmJupyter notebookDOI:10.3233/978-1-61499-649-1-87RRID:SCR_018413https://pypi.org/project/jupyter-client/5.2.3/
Software, algorithmMatPlotLibDOI:10.1109/MCSE.2007.55RRID:SCR_008624http://matplotlib.sourceforge.net
Software, algorithmSeabornDOI:10.5281/zenodo.12710RRID:SCR_018132https://seaborn.pydata.org/
Software, algorithmPandashttps://github.com/pandas-dev/pandasRRID:SCR_018214https://pandas.pydata.org
Software, algorithmSciPyDOI:10.1038/s41592-019-0686-2RRID:SCR_008058http://www.scipy.org/
Software, algorithmScikit-LearnDOI:10.5555/1953048.2078195RRID:SCR_002577http://scikit-learn.org/
Software, algorithmAIMSThis PaperBoughter et al. 2020https://github.com/ctboughter/AIMS
Software, algorithmTCRdistDash et al., 2017 DOI:10.1038/nature22383https://github.com/phbradley/tcr-dist
Software, algorithmIgBLASTDOI:10.1093/nar/gkt382RRID:SCR_002873http://www.ncbi.nlm.nih.gov/igblast/
Software, algorithmBLASTDOI:10.1093/nar/gkv1290RRID:SCR_004870http://blast.ncbi.nlm.nih.gov/Blast.cgi

Software

All analyses were performed in python, with code tested and finalized using Jupyter Notebooks (Kluyver et al., 2016). Figures were generated with MatPlotLib (Hunter, 2007) or seaborn (Ziegler et al., 2014), while the majority of data analysis was carried out using Pandas (McKinney, 2015), SciPy (Virtanen et al., 2020), and SciKit-learn (Pedregosa et al., 2011). All code and data is available at https://github.com/ctboughter/AIMS, including the original Jupyter Notebooks used to generate the data in this manuscript as well as generalized Notebooks and a python-based GUI application for analysis of novel datasets (Boughter, 2020; copy archived at swh:1:rev:f6c855ef4a7ce63f72dba6b34e9d0e9edd9200ce).

Statistical analysis

Request a detailed protocol

Error bars in all plots are provided by the standard deviation of 1000 bootstrap iterations. Statistical significance is calculated using either a two-sided nonparametric Studentized bootstrap or a two-sided nonparametric permutation test as outlined in ‘Bootstrap Methods and Their Application’ (Davison and Hinkley, 2011). For the Studentized bootstrap, the bootstrapped data are drawn from a resampling of the null distribution of the data, with replacement. Practically, this entails combining the polyreactive and non-polyreactive antibodies into a single matrix, without labels, and using the Scikit-learn resample module to randomly separate this matrix into two classes, preserving the number of sequences in each population. To calculate bootstrapped averages, we draw from the empirical rather than null distribution. Statistical significance is estimated by calculating the p-value using the relation:

(4) p=1+(z2z02)R+1

Here, we calculate the p-value by counting the number of bootstrap iterations where z2 is greater than or equal to z02.z2 and z02 are Studentized test statistics taken from the null and empirical and distributions, respectively. R is the number of times this bootstrapping process is repeated. The general form of z is given by:

(5) z=Y¯2Y¯1(σ22n2σ11n1)1/2

where Y¯ represents the bootstrapped sample mean of each population, σ is the bootstrapped sample standard deviation, and n is the number of samples. Populations 1 and 2 in this case correspond to polyreactive and non-polyreactive antibodies. To calculate z for the empirical distribution (z0), all values correspond to the empirical rather than bootstrapped values.

To calculate p-values for differences in mutual information, the permutation test was used rather than the Studentized bootstrap. Here, the test statistic t is set to a simple difference of means, and rather than sampling with replacement from the empirical or null distributions with replacement, we randomly permute the data into ‘polyreactive’ or ‘non-polyreactive’ bins. We then count the number of permutations where the randomly permuted test statistic is greater than or equal to the empirical test statistic. This count then replaces the count (#) in the above equation for p.

Appendix 1

Appendix 1—table 1
List of all biophysical properties used for this study.

For hotspot detecting variables (HS) a simplified form of the description is used. For more in-depth descriptions, the original reference should be used.

Property shorthandDescription
Phob1Hydrophobicity scale [−1,1]
ChargeCharge [ec]
Phob2Octanol-interface hydrophobicity scale
BulkSide-chain bulkiness
FlexSide-chain flexibility
KD1Helix/bend preference
KD2Side-chain size
KD3Extended structure preference
KD4Hydrophobicity
KD5Double-bend preference
KD6Flat extended preference
KD7Partial specific volume
KD8Occurrence in alpha-region
KD9pK-C
KD10Surrounding hydrophobicity
HS1Normalized positional residue Freq at helix C-term
HS2Normalized positional residue Freq at helix C4-term
HS3Spin-spin coupling constants
HS4Random parameter
HS5pK-N
HS6Alpha-helix indices for beta-proteins
HS7Linker propensity from 2-linker dataset
HS8Linker propensity from long dataset
HS9Normalized relative Freq of helix end
HS10Normalized relative Freq of double bend
HS11pK-COOH
HS12Relative mutability
HS13Kerr-constant increments
HS14Net charge
HS15Norm Freq Zeta-R
HS16Hydropathy scale
HS17Ratio of average computed composition
HS18Intercept in regression analysis
HS19Correlation coefficient in Reg Anal
HS20Weights for alpha-helix at window pos
HS21Weights for beta-sheet at window pos −3
HS22Weights for beta-sheet at window pos 3
HS23Weights for coil at win pos −5
HS24Weights coil win pos −4
HS25Weights coil win pos 6
HS26Avg Rel Frac occur in AL
HS27Avg Rel Frac occur in EL
HS28Avg Rel Frac occur in A0
HS29Rel Pref at N
HS30Rel Pref at N1
HS31Rel Pref at N2
HS32Rel Pref at C1
HS33Rel Pref at C
HS34Information measure for extended without H-bond
HS35Information measure for C-term turn
HS36Loss of SC hydropathy by helix formation
HS37Principal component 4 (Sneath, 1966)
HS38Zimm-Bragg parameter
HS39Normalized Freq of ZetaR
HS40Rel Pop conformational state A
HS41Rel Pop conformational state C
HS42Electron-ion interaction potential
HS43Free energy change of epsI to epsEx
HS44Free energy change of alphaRI to alphaRH
HS45Hydrophobicity coeff
HS46Principal property value z3 (Hellberg et al., 1987)

Data availability

All data generated and all code used for analysis in this study has been published on GitHub at http://github.com/ctboughter/AIMS (copy archived at https://archive.softwareheritage.org/swh:1:rev:f6c855ef4a7ce63f72dba6b34e9d0e9edd9200ce/).

References

  1. Conference
    1. Kluyver T
    2. Ragan-kelley B
    3. Fernando P
    4. Granger B
    5. Bussonnier M
    6. Frederic J
    7. Kelley K
    8. Hamrick J
    9. Grout J
    10. Corlay S
    11. Ivanov P
    12. Avila D
    13. Abdalla S
    14. Willing C
    (2016)
    Jupyter Notebooks—a publishing format for reproducible computational workflows
    Positioning and Power in Academic Publishing: Players, Agents and Agendas.
  2. Software
    1. McKinney W
    (2015)
    Pandas - Powerful Python Data Analysis Toolkit
    Pandas - Powerful Python Data Analysis Toolkit.
    1. Qiao Z
    2. Zhou L
    3. Huang JZ
    (2009)
    Sparse linear discriminant analysis with applications to high dimensional low sample size data
    IAENG International Journal of Applied Mathematics 39:2.

Decision letter

  1. Pamela J Bjorkman
    Reviewing Editor; California Institute of Technology, United States
  2. Satyajit Rath
    Senior Editor; Indian Institute of Science Education and Research (IISER), India
  3. Brian Baker
    Reviewer
  4. Bart Haynes
    Reviewer; Duke University School of Medicine, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Thank you for submitting your article "Biochemical Patterns of Antibody Polyreactivity Revealed Through a Bioinformatics-Based Analysis of CDR Loops" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Satyajit Rath as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Brian Baker (Reviewer #1); Bart Haynes (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

As the editors have judged that your manuscript is of interest, but as described below, we suggest substantial revisions. We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). First, because many researchers have temporarily lost access to the labs, we will give authors as much time as they need to submit revised manuscripts. We are also offering, if you choose, to post the manuscript to bioRxiv (if it is not already there) along with this decision letter and a formal designation that the manuscript is "in revision at eLife". Please let us know if you would like to pursue this option. (If your work is more suitable for medRxiv, you will need to post the preprint yourself, as the mechanisms for us to do so are still in development.)

Summary:

The authors present a bioinformatics pipeline to analyze a large aggregate data set of nearly 1,500 antibody sequences, seeking to uncover the biophysical underpinning of antibody polyreactivity – i.e., broad low-affinity binding to diverse epitopes. This is an important topic, relevant not only to Ab therapeutics, but to other assessments of immune repertoires and considerations of protein-protein interaction hubs in general. The unbiased informatics approach that is taken, largely separated from structural considerations, is a distinctive feature of the work, and an attractive aspect of the study. The authors identify two key determinants of polyreactivity: First, the binding interface tends to be neutral, i.e., being neither strongly hydrophobic nor hydrophilic, and lacking significant positive or negative charge. Second, crosstalk (measured by mutual information) between and within CDR loops increases in the heavy chain of polyreactive antibodies. The method of sequence alignment leaves out most structural information (except aligning by the center of each CDR loop) yet retaining the positional context.

The research goal of identifying biophysical properties that allow for promiscuity of a protein selected for strong binding to a particular target is of general interest and biomedical importance, and the approach of combining statistical analyses and information metrics is appealing. The lack of a clear smoking gun is an important finding, as it clearly illustrates the subtleties at play. A classifier is described at the end of the paper, showing promise for sequence-based predictions and applications to therapeutic Abs.

We believe the paper may be suitable for eLife with revisions to address the following concerns.

Essential revisions:

1) The reviewers have concerns about the binary distinction between polyreactive vs. non-polyreactive. The authors essentially sum up this concern in the third paragraph of the subsection “Systematic Determination of the Key Contributions to Polyreactivity”, as well as in the ninth paragraph of the Discussion. The fact that pharma has nonetheless articulated a test for polyreactivity does not make the problem binary, which is clearly recognized by the authors. One way around this would be to repeat the analyses in Figures 1-3 on the more stringently separated (albeit now smaller) dataset used for Figure 4. How do things change? Are there any stronger signals?

Also, this limitation, while articulated well in the Discussion, could be brought up in the Introduction. Forcing a non-binary problem to be binary is by no means unique to this study, but it is a limitation best addressed upfront. As mentioned above, a strength of the paper is the demonstration of the scope of the challenge.

2) A major concern is that, with the biased removal of antibodies that are harder to classify, the true statistics are distorted. This occurs in two places. First, to avoid overfitting of LDA, only input vectors with the largest average differences between the two populations are kept. But this necessarily introduces bias which simplifies the classification problem and distorts the statistical structure in the original data set. Second, upon observing that an intermediate exists between the two classes when applying LDA to a subsampled data set, the authors removed the antibodies that bind a moderate number of select ligands from the analysis. But this continuum of polyreactivity is a real, and likely important, data feature. Since all the subsequent analyses act on the input data, it is necessary to address, with a clear description and supporting evidence, that the main findings remain valid when using the full data set. After this is addressed, direct comparisons – LDA vs. PCA, and with other methods – should be provided to justify the choice of LDA over PCA, and the performance of this newly developed pipeline relative to existing methods.

3) There seems to be a jump from the LDA to mutual information, then back to LDA in the application to therapeutic antibodies. While the physical hints the mutual information gives as mentioned in the Discussion is appreciated, and this can be built on, is this incorporated at all into the analysis of the therapeutic antibodies?

4) Related to point 3, it is not clear why mutual information is a good choice for detecting correlated changes between residues, and there is no comparison to other methods and previous studies. Moreover, mutual information won't be able to distinguish between coevolution and crosstalk between residues. For instance, Direct Coupling Analysis (DCA) indicates that coevolving residues are often in physical proximity. This points to a weakness of lacking the structural information in this work. Thus, it is unclear whether increased mutual information really indicates crosstalk, or instead signals coevolution between residues. This ambiguity undermines the second determinant of polyreactivity.

5) The analysis is finally applied to therapeutic antibodies, which is considered a logical next step. But this is not well motivated – why, in the first place, shall one expect correlations between polyreactivity status of naturally-derived antibodies and the acceptance or discontinuation of a therapeutic antibody? Hence, this section didn't lend validation support to the approach. Perhaps the manuscript could be shortened by deletion of the therapeutic dataset. These antibodies are not natural and it is not clear what approved versus discontinued antibody means.

6) There are other major omissions. First, no performance comparison between this new software and other available methods of antibody sequence analysis is presented (all is said in words, with no supporting figures). Second, it is stated that the approach has been adapted and applied to analyze MHC-like molecules, but no data or references are provided.

7) While in general the manuscript is quite clear, some sections are less so. Is there nothing to be discerned from the identity of the top 10 weights shown in Figure 4? We are only given the identity of three (subsection “Systematic Determination of the Key Contributions to Polyreactivity”, last paragraph), and even then these are only mentioned in a cursory manner. How do these and the other heavily weighted terms relate to the biophysical clues gleaned from the earlier analyses?

8) In the Abstract, the word "offensive" is not defined and would best be "neutral charge" for reader understanding.

9) Introduction, last paragraph, it is not clear in the first sentence if there are 1500 polyreactive abs or a total of 1500 abs studied. It is clarified later but not clear here.

10) The authors use a relatively small panel of antigens to define polyreactivity, whereas a chip type of assay with more proteins may be more precise. This might be mentioned as a way to make the discrimination between PR and non-PR abs more precise.

11) Figure 1 is terrific and very interesting. The fact that VH1-69 in over-represented in the polyreactive group of antibodies is important, both because of the propensity of polyreactive/rheumatoid factor B cells in fetal liver to be VH1-69 and because the polar region (PR)-distal HIV gp41 MPER Abs frequently use VH1-69. Perhaps these associations could be referenced and discussed.

12) In the fifth paragraph of the subsection “A Surface-Level Analysis of Polyreactive Antibody Sequences”, can statistics be applied to determine if any difference between 27% and 17%?

13) Figure 3, were these p values corrected for multiple tests?

https://doi.org/10.7554/eLife.61393.sa1

Author response

Essential revisions:

1) The reviewers have concerns about the binary distinction between polyreactive vs. non-polyreactive. The authors essentially sum up this concern in the third paragraph of the subsection “Systematic Determination of the Key Contributions to Polyreactivity”, as well as in the ninth paragraph of the Discussion. The fact that pharma has nonetheless articulated a test for polyreactivity does not make the problem binary, which is clearly recognized by the authors. One way around this would be to repeat the analyses in Figure 1-3 on the more stringently separated (albeit now smaller) dataset used for Figure 4. How do things change? Are there any stronger signals?

We appreciate the reviewers’ cognizance of the difficult nuances inherent to the current definitions of polyreactivity available to us and the broader field. We agree that including the analysis of Figures 1-3 applied to the parsed dataset would help contextualize our findings, and these data have been added to the supplement (Figure 1—figure supplement 1, Figure 2—figure supplement 1, and Figure 3—figure supplement 1). In these new figures, we see stronger signals in the same direction as the general trends found in the analysis of the full dataset. These stronger signals are discussed across multiple sections (subsections “A Surface-Level Analysis of Polyreactive Antibody Sequences” and “A Position Sensitive Matrix Representation of Sequences Provides Further Insights into Polyreactivity”). An interesting difference can be seen in the mutual information analysis, where loop crosstalk is increased in all loops, not just those of the heavy chain. This is highlighted in newly added text in the subsection “An Information Theoretic Approach” and Figure 5—figure supplement 1. Now that the parsed dataset is discussed earlier in the manuscript, a description of the parsed dataset has been relocated to the subsection “Database”.

Also, this limitation, while articulated well in the Discussion, could be brought up in the Introduction. Forcing a non-binary problem to be binary is by no means unique to this study, but it is a limitation best addressed upfront. As mentioned above, a strength of the paper is the demonstration of the scope of the challenge.

This difficulty of reframing a non-binary problem as binary is now discussed in the Introduction, in addition to the pre-existing statements found in the Results section and Discussion section. This updated section of the Introduction can be found in the seventh paragraph, when discussing the many difficulties of the systemic characterization of polyreactive antibodies

2) A major concern is that, with the biased removal of antibodies that are harder to classify, the true statistics are distorted. This occurs in two places. First, to avoid overfitting of LDA, only input vectors with the largest average differences between the two populations are kept. But this necessarily introduces bias which simplifies the classification problem and distorts the statistical structure in the original data set.

First, to hopefully alleviate some concerns for the reviewers, we have included Figure 4—figure supplement 2, using LDA to classify the full dataset, removing the identified bias. While performance clearly suffers, we are still able to successfully identify antibody polyreactivity with 71% accuracy, or a 65% accuracy when using LDA as a canonical classifier.

Second, when discussing LDA in this manuscript, it is important to differentiate between the two distinct modes used. In mode 1, our goal is explicitly to find the vectors which best discriminate between polyreactive and non-polyreactive antibodies. So here the “bias” is simply a search method. However, we can show that using PCA to reduce dimensionality rather than the maximum difference algorithm results in only a slight decrease in the reported accuracies (shown in Author response image 1).

In mode 2, the more canonical classification mode, this concern of simplifying the classification problem would be valid if the selection of the vectors happened before the splitting of data into “test” and “training” sets. However, this step occurs once the data has already been split. In other words, for each replica of the classification testing, the vectors selected with the largest average differences are changing, albeit slightly. Conceptually, this is no different from using other approaches to reduce the dimensionality of the system. Rather than using PCA, which identifies the axes of maximal variance using a linear combination of the input vectors, we are identifying axes of maximal difference.

The strength of this approach is that the resultant classifier is a linear combination of initial input vectors, rather than a linear combination of vectors that are linear combinations of the input vectors. This increased clarity makes the classifier more interpretable.

Author response image 1
To address reviewer comment #2, we regenerated the analysis from Figure 4A using PCA rather than the maximum difference algorithm to reduce dimensionality.

While performance is modestly decreased across all datasets, we still see strong separation between polyreactive and non-polyreactive antibodies.

Second, upon observing that an intermediate exists between the two classes when applying LDA to a subsampled data set, the authors removed the antibodies that bind a moderate number of select ligands from the analysis. But this continuum of polyreactivity is a real, and likely important, data feature. Since all the subsequent analyses act on the input data, it is necessary to address, with a clear description and supporting evidence, that the main findings remain valid when using the full data set. After this is addressed, direct comparisons – LDA vs. PCA, and with other methods – should be provided to justify the choice of LDA over PCA, and the performance of this newly developed pipeline relative to existing methods.

It should be pointed out, the full dataset is used for the majority of the analysis in the paper (Figures 1-3, Figure 5). Please see the subsection “Database”, for this clarification. As was addressed in reviewer comment 1, these analyses have now been applied to the parsed datasets and show stronger effects.

Likewise, the use of PCA was shown to be ineffective in separating the parsed data in Figure 4—figure supplement 1. For completeness, the PCA analysis of the full dataset is now also included in Figure 4—figure supplement 2. PCA, whether applied to the parsed or full dataset, is incapable of separating the two classes.

Lastly, to our knowledge, no methods to discriminate polyreactive and non-polyreactive antibodies in a systematic way currently exist.

3) There seems to be a jump from the LDA to mutual information, then back to LDA in the application to therapeutic antibodies. While the physical hints the mutual information gives as mentioned in the Discussion is appreciated, and this can be built on, is this incorporated at all into the analysis of the therapeutic antibodies?

As addressed in comment 5, we have removed the therapeutic antibody section, helping improve the flow of the manuscript.

4) Related to point 3, it is not clear why mutual information is a good choice for detecting correlated changes between residues, and there is no comparison to other methods and previous studies. Moreover, mutual information won't be able to distinguish between coevolution and crosstalk between residues. For instance, Direct Coupling Analysis (DCA) indicates that coevolving residues are often in physical proximity. This points to a weakness of lacking the structural information in this work. Thus, it is unclear whether increased mutual information really indicates crosstalk, or instead signals coevolution between residues. This ambiguity undermines the second determinant of polyreactivity.

One of the key novel approaches in our manuscript is the quantification of antibody sequences using position sensitive Shannon Entropy and Mutual Information. While Shannon Entropy has been used previously in repertoire analysis, notably in the generation of “sequence logos” [Schneider and Stephens, Nucleic Acids Research 1990], analysis using Mutual Information has been less common. However, this lack of previous use does not preclude mutual information from being a useful tool. The novelty here is part of the strength of our manuscript.

The reviewers point regarding mutual information’s inability to determine whether the signal is originating from coevolved residues and direct crosstalk is correct, and in fact this is mentioned in the sixth paragraph of the Discussion. While direct coupling analysis may find that coevolving residues are often in physical proximity, that does not explain the relative increase in the mutual information for polyreactive antibodies. There is no reason to believe, a priori, that mutual information should be increased across and within the loops of polyreactive antibodies compared to those of non-polyreactive antibodies. Regardless of whether this increase in signal is due to coevolution throughout the process of VDJ recombination and somatic hypermutation or direct contact between side chains, the signal is stronger in polyreactive antibodies. Importantly, this signal becomes even more pronounced when analyzing the parsed dataset (Figure 5—figure supplement 1). Hopefully, future studies can address this using structural and biochemical approaches, but these are outside of the scope of the present manuscript.

5) The analysis is finally applied to therapeutic antibodies, which is considered a logical next step. But this is not well motivated – why, in the first place, shall one expect correlations between polyreactivity status of naturally-derived antibodies and the acceptance or discontinuation of a therapeutic antibody? Hence, this section didn't lend validation support to the approach. Perhaps the manuscript could be shortened by deletion of the therapeutic dataset. These antibodies are not natural and it is not clear what approved versus discontinued antibody means.

As pointed out by the reviewers, while polyreactivity can play an important role in the success of an antibody therapeutic, the precise link between natural polyreactivity outlined in this manuscript and polyreactivity in engineered antibodies is not necessarily clear. Indeed, the data from the original Supplementary Figure 8 (now removed) seems to suggest there is little to no correlation between the two.

The original goal of this section was to give an example of the flexibility of our approach and its utility as a general tool. However, tackling a problem as complex as the success of or failure of an antibody therapeutic in clinical trials was likely overly ambitious. As suggested by the reviewers, we have removed this section. We now exhibit the flexibility of our approach with the new MHC and MHC-like analysis section.

6) There are other major omissions. First, no performance comparison between this new software and other available methods of antibody sequence analysis is presented (all is said in words, with no supporting figures). Second, it is stated that the approach has been adapted and applied to analyze MHC-like molecules, but no data or references are provided.

To our knowledge, no other software or available methods for antibody sequence analysis exists that was either designed specifically for, or is flexible enough to include, the classification of polyreactive and non-polyreactive antibodies. While previous researchers have attempted to find patterns in polyreactivity (Introduction, sixth paragraph) none of these studies claim to be able to predict polyreactivity in antibodies from their studies. Other potential approaches (Introduction, seventh paragraph) have been attempted at some level, but likewise have no claims of prediction.

Those algorithms that have been generated for prediction primarily focus on structural considerations of antibody-antigen interactions; i.e. epitope prediction, binding interface prediction, and ligand identification [Graves et al. Antibodies 2020]. While in theory these approaches could be used to classify polyreactive antibodies, they were not designed with this application in mind.

Last, a manuscript that utilizes the application of the software to analyze MHC and MHC-like molecules is currently in preparation. However, to demonstrate the potential application, a simple test case has been outlined in a new section of the manuscript (subsection “Extension of the Analysis to MHC and MHC-Like Molecules”, Figure 6). Here we are tasked with the very simple analysis of separating peptide-binding and lipid-binding MHC/MHC-like molecules, and then classifying molecules identified in a recent publication. In addition, the GitHub page hosting this software has an example test case for the easy application of this analysis mode.

7) While in general the manuscript is quite clear, some sections are less so. Is there nothing to be discerned from the identity of the top 10 weights shown in Figure 4? We are only given the identity of three (subsection “Systematic Determination of the Key Contributions to Polyreactivity”, last paragraph), and even then these are only mentioned in a cursory manner. How do these and the other heavily weighted terms relate to the biophysical clues gleaned from the earlier analyses?

We appreciate the reviewer’s suggestion to more closely consider the conclusions from our analysis. These weights were not so strongly discussed in the original manuscript due to the speculative nature of some of our conclusions. Without a crystal structure of a polyreactive antibody, our conclusions can only be strung together using the bioinformatic results we have. Our revised discussion of these data can now be found in the eighth paragraph of the Discussion section.

8) In the Abstract, the word "offensive" is not defined and would best be "neutral charge" for reader understanding.

We thank the reviewer for this suggestion. This wording has been changed in the Abstract.

9) Introduction, last paragraph, it is not clear in the first sentence if there are 1500 polyreactive abs or a total of 1500 abs studied. It is clarified later but not clear here.

This point has now been clarified. Additionally, since the therapeutic antibodies are no longer considered in this manuscript, references to the number of antibodies in the study have been changed to “over 1000”.

10) The authors use a relatively small panel of antigens to define polyreactivity, whereas a chip type of assay with more proteins may be more precise. This might be mentioned as a way to make the discrimination between PR and non-PR abs more precise.

One major drawback to the currently available experimental data is the relatively limited scope of the antigens tested to determine polyreactivity. While this panel of 7 antigens is the standard in the field, a protein microarray of some sort, as is often used when testing for polyreactivity in therapeutic antibodies, would provide a more detailed picture of polyreactivity. A reference to this proposed improvement in the analysis of polyreactivity has been added to the tenth paragraph of the Discussion.

11) Figure 1 is terrific and very interesting. The fact that VH1-69 in over-represented in the polyreactive group of antibodies is important, both because of the propensity of polyreactive/rheumatoid factor B cells in fetal liver to be VH1-69 and because the polar region (PR)-distal HIV gp41 MPER Abs frequently use VH1-69. Perhaps these associations could be referenced and discussed.

We thank the reviewer for their appreciation of Figure 1, and for their additional important context on VH1-69, which we have now included. These and other conclusions regarding the results of Figure 1 can now be found in the second paragraph of the Discussion.

12) In the fifth paragraph of the subsection “A Surface-Level Analysis of Polyreactive Antibody Sequences”, can statistics be applied to determine if any difference between 27% and 17%?

While we agree with the reviewer that the application of statistics to these percentages would strengthen the findings, it is unclear how statistics could be applied here. Since Figure 1 and the data associated with it are direct reports of the “count” of some property, here the count of a specific gene, it is difficult to acquire a standard error for this metric. Ideally, future experiments may quantify the gene usage of polyreactive antibodies across individual mice or humans, allowing for a more precise understanding of the deviation in these values.

13) Figure 3, were these p values corrected for multiple tests?

Originally, the p-values of Figure 3 were not corrected for multiple tests. Figure 3A and B now contain p-values at a level determined by the Bonferroni correction. It should be noted, when applying the Bonferroni correction to the data of Figure 3C and D, all differences fail to satisfy the corrected p-value level. However, a known shortcoming of this correction is a tendency towards an overly conservative analysis of the data. In this case where the number of tested points is 104, we believe that the multiple test correction is not a proper correction to our statistics. For increased data transparency, we have included the data profiles of the null distributions used for statistical comparisons in Figure 3—figure supplement 2.

https://doi.org/10.7554/eLife.61393.sa2

Article and author information

Author details

  1. Christopher T Boughter

    Graduate Program in Biophysical Sciences, University of Chicago, Chicago, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-7106-4699
  2. Marta T Borowska

    Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, United States
    Contribution
    Conceptualization, Data curation, Writing - review and editing
    Competing interests
    No competing interests declared
  3. Jenna J Guthmiller

    Department of Medicine, Section of Rheumatology, University of Chicago, Chicago, United States
    Contribution
    Resources, Data curation, Writing - review and editing
    Competing interests
    No competing interests declared
  4. Albert Bendelac

    1. Committee on Immunology, University of Chicago, Chicago, United States
    2. Department of Pathology, University of Chicago, Chicago, United States
    Contribution
    Conceptualization, Resources, Data curation, Writing - review and editing
    Competing interests
    No competing interests declared
  5. Patrick C Wilson

    1. Department of Medicine, Section of Rheumatology, University of Chicago, Chicago, United States
    2. Committee on Immunology, University of Chicago, Chicago, United States
    Contribution
    Resources, Data curation, Writing - review and editing
    Competing interests
    No competing interests declared
  6. Benoit Roux

    Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, United States
    Contribution
    Conceptualization, Resources, Data curation, Supervision, Funding acquisition, Project administration, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-5254-2712
  7. Erin J Adams

    1. Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, United States
    2. Committee on Immunology, University of Chicago, Chicago, United States
    Contribution
    Conceptualization, Resources, Data curation, Supervision, Funding acquisition, Project administration, Writing - review and editing
    For correspondence
    ejadams@uchicago.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-6271-8574

Funding

National Institute of Biomedical Imaging and Bioengineering (EB009412)

  • Christopher T Boughter

National Institute of Allergy and Infectious Diseases (AI147954)

  • Christopher T Boughter
  • Marta T Borowska
  • Erin J Adams

National Institute of Allergy and Infectious Diseases (AI115471)

  • Christopher T Boughter
  • Marta T Borowska
  • Erin J Adams

National Science Foundation (MCB-1517221)

  • Christopher T Boughter
  • Benoit Roux

National Institute of Allergy and Infectious Diseases (AI125250)

  • Albert Bendelac

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was supported by the National Science Foundation through grant MCB-1517221 (BR and CTB), NIH NIBIB Training Grant T32 EB009412 (CTB), and NIH NIAID R01 AI115471, R01 AI147954 (EJA and CTB). We especially want to thank Cyril Planchais, Hugo Mouquet, and Michel Nussenzweig, for their assistance in procuring many of the antibody sequences used in this work. CTB acknowledges the University of Chicago’s Research Computing Center Midway resource, where some of these calculations were carried out, as well as Kristof Nolan, Caitlin Castro, Ryan Duncombe, and Nabil Faruk for their insightful discussions of the research.

Senior Editor

  1. Satyajit Rath, Indian Institute of Science Education and Research (IISER), India

Reviewing Editor

  1. Pamela J Bjorkman, California Institute of Technology, United States

Reviewers

  1. Brian Baker
  2. Bart Haynes, Duke University School of Medicine, United States

Version history

  1. Received: July 23, 2020
  2. Accepted: November 9, 2020
  3. Accepted Manuscript published: November 10, 2020 (version 1)
  4. Version of Record published: December 22, 2020 (version 2)

Copyright

© 2020, Boughter et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 4,022
    Page views
  • 574
    Downloads
  • 18
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Christopher T Boughter
  2. Marta T Borowska
  3. Jenna J Guthmiller
  4. Albert Bendelac
  5. Patrick C Wilson
  6. Benoit Roux
  7. Erin J Adams
(2020)
Biochemical patterns of antibody polyreactivity revealed through a bioinformatics-based analysis of CDR loops
eLife 9:e61393.
https://doi.org/10.7554/eLife.61393

Further reading

    1. Cell Biology
    2. Immunology and Inflammation
    Yijun Zhang, Tao Wu ... Li Wu
    Research Article

    Dendritic cells (DCs), the key antigen-presenting cells, are primary regulators of immune responses. Transcriptional regulation of DC development had been one of the major research interests in DC biology, however, the epigenetic regulatory mechanisms during DC development remains unclear. Here, we report that Histone deacetylase 3 (Hdac3), an important epigenetic regulator, is highly expressed in pDCs, and its deficiency profoundly impaired the development of pDCs. Significant disturbance of homeostasis of hematopoietic progenitors was also observed in HDAC3-deficient mice, manifested by altered cell numbers of these progenitors and defective differentiation potentials for pDCs. Using the in vitro Flt3L supplemented DC culture system, we further demonstrated that HDAC3 was required for the differentiation of pDCs from progenitors at all developmental stages. Mechanistically, HDAC3 deficiency resulted in enhanced expression of cDC1-associated genes, owing to markedly elevated H3K27 acetylation (H3K27ac) at these gene sites in BM pDCs. In contrast, the expression of pDC-associated genes was significantly downregulated, leading to defective pDC differentiation.

    1. Computational and Systems Biology
    2. Immunology and Inflammation
    David J Torres, Paulus Mrass ... Judy L Cannon
    Research Article Updated

    T cells are required to clear infection, and T cell motion plays a role in how quickly a T cell finds its target, from initial naive T cell activation by a dendritic cell to interaction with target cells in infected tissue. To better understand how different tissue environments affect T cell motility, we compared multiple features of T cell motion including speed, persistence, turning angle, directionality, and confinement of T cells moving in multiple murine tissues using microscopy. We quantitatively analyzed naive T cell motility within the lymph node and compared motility parameters with activated CD8 T cells moving within the villi of small intestine and lung under different activation conditions. Our motility analysis found that while the speeds and the overall displacement of T cells vary within all tissues analyzed, T cells in all tissues tended to persist at the same speed. Interestingly, we found that T cells in the lung show a marked population of T cells turning at close to 180o, while T cells in lymph nodes and villi do not exhibit this “reversing” movement. T cells in the lung also showed significantly decreased meandering ratios and increased confinement compared to T cells in lymph nodes and villi. These differences in motility patterns led to a decrease in the total volume scanned by T cells in lung compared to T cells in lymph node and villi. These results suggest that the tissue environment in which T cells move can impact the type of motility and ultimately, the efficiency of T cell search for target cells within specialized tissues such as the lung.