1. Introduction

Antibodies are not only an essential part of the vertebrate immune system but also have wide applications in the biomedical field [1, 2]. In clinical practice, the emergence of monoclonal antibodies (mAbs) has revolutionized the treatment of various cancers, autoimmune diseases, and many other diseases. Owing to their large binding interfaces, high affinity and specificity, antibody drugs have successfully targeted numerous proteins that were previously intractable to small-molecule therapies. The expansion of antibody drug formats, such as antibody-drug conjugates (ADCs), domain antibodies, bispecific antibodies and antibody fusion proteins, has broadened their indications to a wider range of diseases, including blood disorders, infections, neurological conditions, ocular diseases, and metabolic disorders [3-5]. The process of antibody drug development can be broadly divided into three stages: preclinical research, clinical trials and post-marketing surveillance [6]. Preclinical research on antibodies mainly refers to the series of in vitro experiments and animal model studies conducted before antibodies enter clinical trials. An important task at this stage is to determine the binding interfaces and interaction sites between antibodies and their drug targets. Understanding how antibodies specifically interact with their targets or antigens can help further optimize antibody drug leads and provide insights into the mechanisms of interaction [7].

Machine learning and deep learning-based methods for predicting antigen-antibody binding interfaces have gained significant attention in recent years [8-11]. Progress in this field relies heavily on a reliable foundation of data. While the Protein Data Bank (PDB) [12, 13] is a well-known repository for protein complexes, identifying antigen-antibody complexes within the vast pool of general protein structures remains challenging. In 2018, AbDb [14] was introduced as a specialized database for antigen-antibody complexes. However, AbDb has not been updated since then. Our statistical analysis reveals a marked increase in the number of antigen-antibody complex entries in the PDB between 2021 and 2023; these entries account for approximately 45% of all antigen-antibody complex entries. AbDb therefore lacks organized coverage of this valuable data. In comparison, the SAbDab series of databases [15-17] provides a comprehensive source of antibody structures and is updated in a timely manner. However, this rapid updating may leave a considerable amount of information without thorough verification, increasing the likelihood that erroneous annotations are incorporated directly from the PDB.

Another issue in the field of interface prediction is the definition of interacting amino acids. Some datasets are based on the difference in solvent accessible surface area (ΔSASA) of each residue upon binding [9, 18]; others are based on distances between atoms [8]. Different studies employ different datasets and definitions, resulting in a lack of comparability. A standard benchmark dataset is urgently needed to enable objective comparison of these methods. None of the databases mentioned above provides details of interacting residues. Motivated by these issues, we developed the Antigen-Antibody Complex Database (AACDB). It has a user-friendly interface for convenient querying, manipulation, browsing, and visualization of comprehensive information about antigen-antibody complexes. Notably, AACDB also provides detailed information on interacting residues defined by two methods (ΔSASA and atom distance), facilitating benchmarking studies of predictive methods for paratopes and epitopes.

2. Materials and Methods

We searched the PDB using the keywords “antibody”, “antigen”, “immunoglobulin” and “complex”. Over 32,000 experimental structures were retrieved (up to December 2023). Each entry was processed following the steps in Figure 1.

Figure 1. Data processing pipeline of AACDB.

2.1. Entry screening

We aimed to retrieve all antibody-related complexes from the PDB using this search strategy. However, the results contained a notable number of false hits, prompting us to apply a stringent structural filter. Antibody complexes were defined as structures containing at least one antibody molecule and another large protein (longer than 50 amino acids). Structures that lacked antibodies or consisted solely of one type of antibody were excluded from further analysis. Importantly, not all proteins that bind to antibodies qualify as antigens. For instance, immunoglobulin-binding proteins such as Protein A and Protein G, expressed by Staphylococcus aureus and Streptococcal species, respectively, are commonly employed in antibody purification [19]. Although Protein A may bind to the Fab region of antibodies, the interacting amino acids may be located in the framework region (FR) rather than the complementarity-determining regions (CDRs) [20]. Such proteins are often incorrectly identified as antigens. Consequently, structures with these characteristics were carefully verified and excluded from our dataset.
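The structural filter described above can be illustrated with a minimal sketch. It assumes that each chain has already been annotated with the antibody it belongs to (the `Chain` class and its `antibody_name` field are hypothetical and not part of the AACDB pipeline), and it covers only the length and composition criteria. Immunoglobulin-binding proteins such as Protein A still require manual verification.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_PARTNER_LENGTH = 50  # from the text: the partner protein must be longer than 50 residues


@dataclass
class Chain:
    chain_id: str
    sequence: str                        # one-letter amino acid sequence
    antibody_name: Optional[str] = None  # hypothetical annotation; None means "not an antibody chain"


def is_candidate_complex(chains: List[Chain]) -> bool:
    """Return True if the structure holds an antibody plus a distinct large partner.

    A partner is any chain that does not belong to the same antibody and is
    longer than MIN_PARTNER_LENGTH residues. Structures without antibodies, or
    with only one type of antibody and no large partner, are rejected.
    """
    antibody_names = {c.antibody_name for c in chains if c.antibody_name}
    for name in antibody_names:
        for chain in chains:
            if chain.antibody_name != name and len(chain.sequence) > MIN_PARTNER_LENGTH:
                return True
    return False
```

Because a second, different antibody counts as a possible partner, anti-idiotype complexes pass this filter. Whether the chains actually interact is a separate check that this sketch does not attempt.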

2.2. PDB splitting

We directly incorporated PDB entries containing only one antigen and one antibody into AACDB. However, quite a few antigen-antibody complexes contain several antigens and antibodies. In such cases, before splitting the structures, we determined the correct pairing of light and heavy chains by examining the information on equivalent chain interactions in the PDBsum database [21]. Next, we assigned the correct antigen chains to each antibody by checking whether the remaining chains in the structure interact with the CDRs of the paired antibody. For example, 1AHW contains two copies of the antigen-antibody complex: chains BA and DE are identified as two antibody pairs, and chains C and F bind to BA and DE, respectively. Consequently, 1AHW is split into two files, BAC and DEF (Figure 2A). In 6OGE, Pertuzumab (chains C and B) and Trastuzumab (chains E and D) bind to different epitopes of the receptor tyrosine-protein kinase erbB-2 (chain A). This generates two records in AACDB, CBA and EDA (Figure 2B). For anti-idiotype antibodies, it is additionally necessary to determine whether the anti-idiotype antibody binds to both the heavy and light chains of its partner antibody. For instance, in 3BQU, the anti-idiotype 3H6 Fab (chains DC) interacts only with the heavy chain (chain B) of the 2F5 Fab. In AACDB, the 2F5 Fab is therefore split and its light chain (chain A) is discarded (Figure 2C).

Figure 2. Examples of PDB file splitting under different situations. (A) 1AHW contains two copies of the same antigen and the same antibody. (B) In 6OGE, two different antibodies bind to distinct epitopes of the same antigen. (C) In 3BQU, an anti-idiotypic antibody binds exclusively to a single chain of the antibody.

For each PDB entry, we used the corresponding split-chain information to divide the downloaded .pdb or .cif files based on the ATOM records. Because the Naccess software (see Section 2.4, Interacting residues definition) does not accept .cif files as input, we converted the .cif files into PDB format during the splitting process. Furthermore, we adjusted the annotation of the split .fasta files to ensure consistency with the AACDB records.
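As an illustration of chain-based splitting on ATOM records, the following is a minimal sketch for PDB-format text only. The function name `split_pdb_by_chains` is hypothetical, single-character chain IDs are assumed, and the .cif conversion step is not covered. It relies on the fixed-column PDB layout, in which the record name occupies columns 1-6 and the chain identifier is in column 22.

```python
from typing import Iterable


def split_pdb_by_chains(pdb_text: str, chain_ids: Iterable[str]) -> str:
    """Return PDB-format text holding only the ATOM records of the given chains.

    HETATM, TER and header records are dropped in this sketch; atoms keep the
    order in which they appear in the input file.
    """
    wanted = set(chain_ids)
    kept = [
        line
        for line in pdb_text.splitlines()
        # Record name in columns 1-6, chain identifier in column 22 (index 21).
        if line[:6] == "ATOM  " and len(line) > 21 and line[21] in wanted
    ]
    kept.append("END")
    return "\n".join(kept) + "\n"
```

For the 1AHW example, calling `split_pdb_by_chains(text, "BAC")` and `split_pdb_by_chains(text, "DEF")` on the full file text would yield the two split structures, since iterating over a string yields its individual chain letters.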

2.3. Metadata

AACDB provides detailed metadata for each entry, including chain IDs, antibody name, antigen name, method, resolution, organism, and more. To ensure data accuracy, we verified the metadata comprehensively by consulting the original literature. In doing so, AACDB corrected many annotation errors found in the corresponding PDB entries. These errors include, but are not limited to: (1) mislabeling of species (e.g. in entry 7WRL, the organism of the BD55-1239H antibody was erroneously labeled as “SARS coronavirus B012”); (2) resolution annotation errors (e.g. for 1NSN, the resolution should be 2.9 Å but was labeled as 2.8 Å); (3) mislabeling of antibody chains as other proteins (e.g. in 3KS0, the light chain of the B2B4 antibody was misnamed as the heme domain of flavocytochrome b2); (4) misidentification of heavy chains as light chains (e.g. both antibody chains were labeled as light chains in 5EBW); (5) mutation status annotation errors, where PDB entries indicate “NO” for mutations although mutations exist (e.g. bevacizumab (Avastin) in 6BFT was labeled as having no mutations; when aligned with the bevacizumab sequence, however, the mutations T8D/T30D in the heavy chain and S52D/S53D in the light chain were observed); and (6) incomplete annotations, where entries provide only the name of the mutant without specifying the precise mutation site (e.g. in 7SU1, the antibody was described as Ipilimumab variant Ipi.106, but the PDB does not provide the mutated amino acids or positions, which can be found in the reference literature). We checked each entry manually to identify all possible annotation problems. In AACDB, we supplemented the available information or corrected the annotations through comprehensive literature review and sequence alignment with wild-type proteins.
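The alignment-based check for mutation status can be sketched as follows. This is a minimal illustration, not the AACDB procedure: it assumes the two sequences have already been aligned (gaps written as "-"), numbers positions sequentially along the ungapped wild-type sequence (the numbering scheme actually used is not stated in the text), and reports substitutions in the "T8D" style used above. The function name `list_substitutions` is hypothetical.

```python
from typing import List


def list_substitutions(wild_type_aligned: str, variant_aligned: str) -> List[str]:
    """List point substitutions between two pre-aligned sequences, e.g. 'T8D'.

    Positions are 1-based along the ungapped wild-type sequence (an assumption).
    Insertions and deletions are skipped; only substitutions are reported.
    """
    if len(wild_type_aligned) != len(variant_aligned):
        raise ValueError("aligned sequences must have equal length")
    substitutions = []
    position = 0
    for wt_res, var_res in zip(wild_type_aligned, variant_aligned):
        if wt_res != "-":
            position += 1  # advance only on real wild-type residues
        if wt_res != "-" and var_res != "-" and wt_res != var_res:
            substitutions.append(f"{wt_res}{position}{var_res}")
    return substitutions
```

An entry annotated as “NO” for mutations would be flagged for manual review whenever this list is non-empty for one of its chains.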

Antibody nomenclature follows the title of the corresponding entry in the RCSB PDB database, verified against the original literature. Where the name in the original literature differs from that in the RCSB PDB, we used the name in the published literature as the standard. For antibody fragments without a name in both the RCSB PDB and the original literature, we adopted the naming convention “PDBID” + “antibody fragment” (e.g., 4WEB Fab).

Biological and physicochemical properties play a crucial role in drug discovery pipelines for therapeutic antibodies. These properties include solubility, immunogenicity, aggregation tendency, expression level, stability, and hydrophobicity. We provide the International Nonproprietary Name (INN) and clinical trial information for each therapeutic antibody entry, linking them to the DOTAD database [22]. Numerous antigens have been successfully identified as targets for antibodies or small-molecule drugs. We therefore compared the antigen sequences with the drug targets listed in the DrugBank database [23], applying a percent identity threshold of > 90% to assign the corresponding drug targets.
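The identity threshold can be illustrated with a short sketch. The text does not specify the alignment tool or how percent identity is normalized, so the definition below (identical columns divided by all alignment columns in which at least one sequence has a residue) is an assumption, as are the function names. The input is a pair of pre-aligned sequences with gaps written as "-".

```python
IDENTITY_THRESHOLD = 90.0  # percent; the text applies a strict "> 90%" cutoff


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns with identical residues.

    Columns where both sequences have a gap are ignored; a residue paired
    with a gap counts as a mismatch (assumed normalization).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def is_drug_target_match(aligned_antigen: str, aligned_target: str) -> bool:
    """Apply the strict > 90% identity rule to one antigen/target pair."""
    return percent_identity(aligned_antigen, aligned_target) > IDENTITY_THRESHOLD
```

With this definition, a pair at exactly 90% identity is not assigned, because the comparison is strict.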

2.4. Interacting residues definition

We labeled interacting residues based on both SASA and atom distance. Naccess V2.1.1 and the Bio.PDB module were employed to calculate the SASA value of each residue in the antibody and antigen. Residues with a SASA loss upon binding (ΔSASA) of more than 1 Å² were classified as interacting residues. In addition, we defined a second set of interacting paratope-epitope residues using a distance cutoff of 5 Å: two amino acids, one from the antibody and one from the antigen, are considered interacting if at least one atom of one residue lies within 5 Å of any atom of the other.
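Both definitions can be sketched in a few lines. This is an illustrative example rather than the AACDB implementation: per-residue SASA values are assumed to have been computed beforehand (for instance by Naccess) for the unbound chain and for the complex, the residue key format and function names are hypothetical, and "within 5 Å" is interpreted as strictly less than 5 Å.

```python
import math
from typing import Dict, List, Set, Tuple

ResidueId = Tuple[str, int]                   # (chain ID, residue number); hypothetical key format
Atom = Tuple[ResidueId, float, float, float]  # residue key plus x, y, z coordinates in angstroms


def delta_sasa_residues(
    sasa_unbound: Dict[ResidueId, float],
    sasa_complex: Dict[ResidueId, float],
    threshold: float = 1.0,
) -> Set[ResidueId]:
    """Residues whose SASA loss upon binding exceeds the threshold (in square angstroms)."""
    return {
        res
        for res, free_area in sasa_unbound.items()
        if free_area - sasa_complex[res] > threshold
    }


def distance_contacts(
    antibody_atoms: List[Atom],
    antigen_atoms: List[Atom],
    cutoff: float = 5.0,
) -> Set[Tuple[ResidueId, ResidueId]]:
    """Paratope-epitope residue pairs with at least one atom pair closer than the cutoff."""
    contacts = set()
    for ab_res, ax, ay, az in antibody_atoms:
        for ag_res, bx, by, bz in antigen_atoms:
            if math.dist((ax, ay, az), (bx, by, bz)) < cutoff:
                contacts.add((ab_res, ag_res))
    return contacts
```

The all-against-all loop is quadratic in the number of atoms, which is acceptable for a single complex; a spatial index would be the usual choice for processing many structures.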

2.5. Data integration and website implementation

The main data processing pipeline is implemented in Python. The front-end web interface of AACDB was built with HTML and enhanced with JavaScript, CSS and Bootstrap. We developed a dynamic 3D structure visualization window based on PV, a WebGL-based protein viewer, inspired by Dunbar et al. [15]. All data are managed in a MySQL database. For the back end, PHP is used to enable data browsing, searching, and downloading.

3. Results

3.1. Statistics

Out of more than 32,000 experimental structures in the PDB, 7,498 antigen-antibody entries were manually curated in the current version of AACDB, covering 16 antibody fragment types across 14 species (Figure 3). Fab fragments and human antibodies made up the largest proportions of the data, accounting for 71.98% and 60.95%, respectively. Our statistical analysis reveals a marked increase in the number of antigen-antibody complex entries in the PDB between 2021 and 2023; these entries account for approximately 45% of all antigen-antibody complex entries. The data shown for 2024 in Figure 3B reflect the removal of entry 7SIX (initially released on November 16, 2022) from the PDB on January 17, 2024 and its replacement by entry 8TM1. Furthermore, the developability properties of the antibodies in 325 entries can be queried in the DOTAD database, and 3,733 antigen records have been identified as drug targets in DrugBank (data not shown).

Figure 3. AACDB statistics. (A) Distribution of antibody fragment types in the database. (B) Number of antibody-antigen complexes released per year (unique PDB IDs). (C) Organism distribution of antibody entries.

3.2. Database browse and search

All the data can be browsed directly by clicking the “Datasets” item on the top menu (Figure 4). The summary table includes nine columns, as follows:

  1. AACDB_ID: The unique ID in the AACDB database, linking to the “Detail” page;

  2. PDBID: The identifier in the RCSB PDB database (https://www.rcsb.org/), linking to the PDB;

  3. Chains: The chain IDs contained in this entry. Antibody and antigen chains are separated by “_”. The heavy chain ID precedes the light chain ID if the antibody is complete;

  4. Antibody: This column is given as “antibody name + fragment”;

  5. Protein: The corresponding antigen;

  6. Organism: The source organisms of the antibody and antigen;

  7. Method: The experimental method used to solve the structure;

  8. Resolution: The resolution of the experimental structure, in angstroms (Å);

  9. Reference: The DOI link to the original publication reporting this structure.

Figure 4. The AACDB browse and search page. This example demonstrates the search functionality of AACDB using the keyword ‘Lysozyme’ in the ‘Protein’ column. The search results display antigens related to lysozyme.

The table can be searched through the search panel at the top of the “Datasets” page. Users can perform a quick search by specifying one or more fields and entering relevant keywords. The search results can be downloaded in either txt or csv format. Figure 4 shows the search results for the keyword “Lysozyme” in the “Protein” column, which returns 117 hits.
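A downloaded csv file can also be filtered offline in the same spirit. The sketch below assumes that the file has a header row whose column names match the summary table (for example “Protein”) and that matching is a case-insensitive substring test. Neither detail is confirmed by the text, and the function name `filter_rows` is hypothetical.

```python
import csv
import io
from typing import Dict, List


def filter_rows(csv_text: str, column: str, keyword: str) -> List[Dict[str, str]]:
    """Return the rows whose value in `column` contains `keyword` (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    needle = keyword.lower()
    # A missing column yields an empty string, so such rows never match.
    return [row for row in reader if needle in (row.get(column) or "").lower()]
```

For a file on disk, the text can be read first with `open(path, newline="").read()` and passed to `filter_rows(text, "Protein", "Lysozyme")`.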

An individual structure can be accessed using its AACDB_ID accession code. Clicking the AACDB_ID hyperlink brings the user to the details page shown in Figure 5. There, the complex structure can be visualized with different colors and styles (Figure 5A). Beside the visualization window, we provide further details of the entry, including the mutations, INN and clinical trial information of the antibody and the DrugBank ID of the antigen, if available (Figure 5B). Under the structure information tab, further details about each chain can be found, including the sequence, the type and position of mutated amino acids (Figure 5C), the interacting residues in each chain based on the ΔSASA method, and an interaction plot of paratope-epitope residues defined by a distance cutoff of < 5 Å (Figure 5D).

Figure 5. The details page of an entry, taking ‘1BJ1’ as an example. (A) Structure visualization window. (B) Entry metadata. (C) Sequence and mutation information. (D) Interacting residue details based on the SASA and atom distance methods. (E) Download hyperlinks for a single entry.

3.3. Data download

We provide two ways for downloading the data:

  1. Download data of the single entry.

    When viewing the details page of an entry via its AACDB_ID, users can click the hyperlink in the “Download” section at the bottom of the page to download the data for that entry (Figure 5E).

  2. Download all the data of AACDB.

    AACDB provides a download page. All sequence and structure files, together with the interaction data based on the different methods, are packaged into separate .zip files that can be downloaded.

Moreover, the website provides a user-friendly ‘Help’ page that presents a step-by-step tutorial to assist users in manipulating, querying, browsing and downloading the AACDB database.

4. Discussion and conclusion

Research on antigen-antibody interactions contributes to the advancement of the antibody-related industry. Databases such as the PDB and SAbDab provide the foundational data for this purpose. However, many issues remain unresolved. In this work, we developed AACDB with the aim of providing a clean and reliable dataset of antigen-antibody interactions. During data collection and organization, we identified numerous annotation errors in the PDB, some of which had been carried over directly into SAbDab. For example, the species of the antibody in 7WRL was incorrectly labeled as “SARS coronavirus B012” in both the PDB and SAbDab. We also invested significant time and effort in manually cross-referencing the original literature to rectify these errors and to exclude antibody-binding proteins that were erroneously annotated as antigens by SAbDab. Apart from the curation and reannotation of structural data, AACDB offers features not provided by other antigen-antibody complex databases: (1) AACDB’s data processing pipeline supports mmCIF files, and (2) we provide the amino acids in the interaction interface defined by two methods, enabling unified standards for epitopes and paratopes. This provides a more accurate and comprehensive benchmark dataset for interaction interface prediction tools and improves the comparability of these tools.

However, AACDB still has some limitations. Despite our best efforts, the limits of our team’s resources and knowledge mean that the database may not capture all antigen-antibody complex structures. Because the data are manually curated, completely eliminating errors during processing is a challenge. While we strive to fill any gaps, we also hope that experts and users in the community will provide timely feedback to help us address these issues. In addition, AACDB currently includes only protein antigens longer than 50 amino acids. In future work, we will expand the database to include complex structures with other antigen types, including peptides, nucleic acids, and haptens.

In summary, AACDB is a novel database of antigen-antibody complexes that provides information on antibody developability, antigen-drug target relationships, and detailed antigen-antibody interaction interfaces. It is fully accessible at http://i.uestc.edu.cn/AACDB. We are committed to regular data updates, ensuring that researchers in immunoinformatics have access to timely and valuable resources.

Acknowledgements

This work was supported by the National Natural Science Foundation of China [grant numbers 62071099, 62371112] and Sichuan Province Science and Technology Support Program [2024NSFSC0636].

Additional information

CRediT author statement

Yuwei Zhou: Writing - Original Draft, Software, Conceptualization; Wenwen Liu: Validation, Visualization; Ziru Huang: Validation; Yushu Gou: Data Curation; Siqi Liu: Data Curation; Lixu Jiang: Data Curation; Yue Yang: Data Curation; Jian Huang: Supervision, Funding acquisition