Abstract
Antibodies play a crucial role in the vertebrate immune system and have diverse biomedical applications. Understanding the intricate interactions between antibodies and antigens is an important step in drug development. However, the volume and complexity of the available structural data make it difficult to identify and interpret these interactions accurately. To address these challenges and improve our understanding of the antibody-antigen interface, we established the Antigen-Antibody Complex Database (AACDB). The current version provides a comprehensive collection of 7,498 manually processed antigen-antibody complexes. The database provides rich metadata and also corrects annotation errors observed in the PDB database. Furthermore, it integrates data on antibody developability and antigen-drug target relationships, making it valuable for the development of new antibody therapies. Notably, the database includes comprehensive paratope and epitope annotations, and thereby serves as a valuable benchmark for immunoinformatics research. The AACDB interface is designed to be user-friendly, providing researchers with search and visualization tools for querying, manipulating, and visualizing complex data. AACDB is fully accessible online at http://i.uestc.edu.cn/AACDB. We are committed to regular updates so that the database remains current.
Graphical abstract
1. Introduction
Antibodies are not only an essential part of the vertebrate immune system, but also have wide applications in the biomedical field [1, 2]. In clinical practice, the emergence of monoclonal antibodies (mAbs) has revolutionized the treatment of various cancers, autoimmune diseases, and many other diseases. Owing to their large binding interfaces, high affinity, and specificity, antibody drugs have successfully targeted numerous proteins that were previously impervious to small molecule therapies. The expansion of antibody drug formats, such as antibody-drug conjugates (ADCs), domain antibodies, bispecific antibodies, and antibody fusion proteins, has broadened their indications to a wider range of diseases, including blood disorders, infections, neurological conditions, ocular diseases, and metabolic disorders [3-5]. The process of antibody drug development can be broadly divided into three stages: preclinical research, clinical trials, and post-marketing surveillance [6]. Preclinical research on antibodies mainly refers to the series of in vitro experiments and animal model studies conducted before antibodies enter clinical trials. An important task at this stage is to determine the binding interfaces and interaction sites between antibodies and their drug targets. Understanding how antibodies specifically interact with their targets or antigens can help optimize antibody drug leads further and provide insights into the mechanisms of interaction [7].
Machine learning and deep learning-based methods for predicting antigen and antibody binding interfaces have gained significant attention in recent years [8-11]. The advancement of this field relies heavily on a reliable foundation of data. While the Protein Data Bank (PDB) [12, 13] is a well-known repository for protein complexes, identifying antigen-antibody complexes within the vast pool of general protein structures remains challenging. In 2018, AbDb [14] was introduced as a specialized database for antigen-antibody complexes. However, AbDb has not been updated since then. Our statistical analysis shows that the number of antigen-antibody complex entries in the PDB increased markedly from 2021 to 2023; these entries account for approximately 45% of all antigen-antibody complex entries. AbDb therefore does not organize or manage this valuable data. In comparison, the SAbDab series of databases [15-17] provides a comprehensive source of antibody structures and is updated in a timely manner. However, this rapid updating may leave a substantial amount of information unverified, increasing the likelihood that erroneous annotations are carried over directly from the PDB database.
Another issue in the field of interface prediction is the definition of interacting amino acids. Some datasets are based on the difference in solvent accessible surface area (ΔSASA) of each residue upon binding [9, 18]; others are based on distances between atoms [8]. Different studies employ different datasets and definitions, resulting in a lack of comparability. There is an urgent need for a standard benchmark dataset to enable objective comparisons of these methods. None of the databases mentioned above provides details of interacting residues. Motivated by these issues and challenges, we developed the Antigen-Antibody Complex Database (AACDB). It has a user-friendly interface for convenient querying, manipulation, browsing, and visualization of comprehensive information about antibody-antigen complexes. Notably, AACDB also provides detailed information on interacting residues defined by two methods (ΔSASA and atom distance), facilitating benchmarking studies of predictive methods for paratopes and epitopes.
2. Materials and Methods
We searched the PDB database using the keywords “antibody”, “antigen”, “immunoglobulin” and “complex”. More than 32,000 experimental structures were retrieved (up to December 2023). Each entry was processed following the steps in Figure 1.
2.1. Entry screening
We aimed to retrieve all antibody-related complexes from the PDB using the chosen search approach. However, the results contained a notable number of false hits, prompting us to implement a stringent structural filtering process. Antibody complexes were defined as structures containing at least one antibody molecule and another large protein (with a length greater than 50 amino acids). Structures that lacked antibodies or consisted solely of one type of antibody were excluded from further analysis. It is important to note that not all proteins that bind to antibodies qualify as antigens. For instance, immunoglobulin-binding proteins such as Protein A and Protein G, expressed by Staphylococcus aureus and Streptococcal species, are commonly employed in antibody purification procedures [19]. Although Protein A may bind to the Fab region of antibodies, the interacting amino acids might be located in the framework region (FR) rather than the complementarity-determining regions (CDRs) [20]. Such proteins are often incorrectly identified as antigens. Consequently, structures exhibiting such characteristics were meticulously verified and excluded from our dataset.
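The screening rule described above can be sketched as a simple predicate. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Chain` record, its `role` labels, the `molecule` field, and the function name are all hypothetical, and in practice assigning a role to each chain required the manual verification described in this section.

```python
from dataclasses import dataclass

MIN_ANTIGEN_LENGTH = 50  # residues; a partner protein must be longer than this


@dataclass
class Chain:
    chain_id: str
    molecule: str  # hypothetical: name of the molecule this chain belongs to
    role: str      # hypothetical labels: "antibody", "ig_binding", or "other"
    length: int    # number of amino acid residues


def is_candidate_complex(chains):
    """Return True if an antibody is paired with a large, distinct partner protein.

    Immunoglobulin-binding proteins (e.g. Protein A or Protein G) are not
    accepted as partners, and a structure containing a single type of
    antibody and nothing else is rejected.
    """
    antibodies = [c for c in chains if c.role == "antibody"]
    for antibody in antibodies:
        for partner in chains:
            if (
                partner.molecule != antibody.molecule
                and partner.role != "ig_binding"
                and partner.length > MIN_ANTIGEN_LENGTH
            ):
                return True
    return False
```

Allowing a second, different antibody as the partner keeps anti-idiotype complexes in the candidate set, while short peptides are rejected by the length threshold.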
2.2. PDB splitting
PDB entries of antigen-antibody complexes containing only one antigen and one antibody were incorporated directly into the AACDB database. However, quite a few antigen-antibody complexes contain several antigens and antibodies. In such cases, before splitting the structures, we determined the correct pairing of light and heavy chains by examining the information on equivalent chain interactions in the PDBsum database [21]. Next, we assigned the correct antigen chains to the antibodies by checking whether the remaining chains in the structure interact with the CDR regions of the identified antibodies. For example, in the case of 1AHW, which contains two copies of the antigen-antibody complex, chains BA and DE are identified as two antibody pairs, and chains C and F bind to BA and DE, respectively. Consequently, 1AHW is split into two files, BAC and DEF (Figure 2A). In 6OGE, Pertuzumab (chains C and B) and Trastuzumab (chains E and D) bind to different epitopes of the receptor tyrosine-protein kinase erbB-2 (chain A). This generates two records in AACDB (CBA and EDA) (Figure 2B). For anti-idiotype antibodies in particular, it is necessary to determine whether the anti-idiotype antibody binds to both the heavy and light chains of its partner antibody. For instance, in 3BQU, the anti-idiotype 3H6 Fab (chains DC) interacts only with the heavy chain (chain B) of the 2F5 Fab. In AACDB, the 2F5 Fab is therefore split and its light chain (chain A) is discarded (Figure 2C).
For each PDB entry, we used the corresponding chain assignments to split the downloaded .pdb or .cif files based on their ATOM records. Because the Naccess software does not accept .cif files as input, we converted .cif files to PDB format during the splitting process (see Section 2.4, Interacting residues definition). Furthermore, we adjusted the annotation of the split .fasta files to ensure consistency with the AACDB records.
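The splitting step for PDB-format files can be illustrated with a short sketch. It assumes fixed-column PDB text in which the chain identifier occupies column 22 of each ATOM record and chain identifiers are single characters. The function name is hypothetical, and the .cif conversion mentioned above is not covered.

```python
def split_pdb_by_chains(pdb_text, chain_ids):
    """Keep only the ATOM records belonging to the requested chains.

    pdb_text  -- contents of a PDB-format file as a single string
    chain_ids -- iterable of single-character chain IDs, e.g. "BAC"
    """
    wanted = set(chain_ids)
    kept = [
        line
        for line in pdb_text.splitlines()
        # Column 22 (index 21) of an ATOM record holds the chain identifier.
        if line.startswith("ATOM") and len(line) > 21 and line[21] in wanted
    ]
    kept.append("END")
    return "\n".join(kept) + "\n"
```

For a structure such as 1AHW, calling the function once with `"BAC"` and once with `"DEF"` would yield the two per-complex files described above. Records are kept in their original file order, not in the order of the requested chains.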
2.3. Metadata
AACDB provides detailed metadata for each entry, including chain IDs, antibody name, antigen name, method, resolution, organism, and more. To ensure data accuracy, we verified the metadata comprehensively against the original literature. AACDB corrects many annotation errors identified in the corresponding PDB entries. These errors include, but are not limited to: (1) mislabeling of species (e.g. in 7WRL, the organism of the BD55-1239H antibody was erroneously labeled as “SARS coronavirus B012”); (2) resolution annotation errors (e.g. for 1NSN, the resolution should be 2.9 but was incorrectly labeled as 2.8); (3) mislabeling of antibody chains as other proteins (e.g. in 3KS0, the light chain of the B2B4 antibody was misnamed as the heme domain of flavocytochrome b2); (4) misidentification of heavy chains as light chains (e.g. both antibody chains were labeled as light chains in 5EBW); (5) mutation status annotation errors, in which PDB entries indicate “NO” for mutations although mutations exist (e.g. bevacizumab (Avastin) in 6BFT was labeled as having no mutations, but alignment with the bevacizumab sequence revealed the mutations T8D/T30D in the heavy chain and S52D/S53D in the light chain); and (6) incomplete annotations, in which entries provide only the name of the mutant without specifying the mutation sites (e.g. in 7SU1, the antibody is described as Ipilimumab variant Ipi.106, but the PDB does not give the mutated amino acids or positions, which can be found in the reference literature). We manually checked each entry to identify all possible annotation problems. In AACDB, we supplemented the available information or corrected the annotations through comprehensive literature review and sequence alignment with wild-type proteins.
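The sequence-comparison step used to recover mutations can be illustrated with a toy example. The sketch below assumes the wild-type and observed sequences have already been aligned without gaps and have equal length. The function name and the example sequences are made up for illustration and are not taken from any AACDB entry.

```python
def list_substitutions(wild_type, observed):
    """Return substitutions such as 'T8D' between two gap-free, equal-length sequences.

    Each label is the wild-type residue, the 1-based position, and the
    observed residue.
    """
    if len(wild_type) != len(observed):
        raise ValueError("sequences must be pre-aligned to the same length")
    return [
        f"{wt}{position}{obs}"
        for position, (wt, obs) in enumerate(zip(wild_type, observed), start=1)
        if wt != obs
    ]
```

For the made-up sequences `"ACDEFGHT"` and `"ACDEFGHD"`, the function reports a single substitution at position 8, written `T8D` in the same notation used for the 6BFT example above.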
The antibody nomenclature follows the title of the corresponding entry in the RCSB PDB database, verified against the original literature. Where the name in the original literature differs from that in the RCSB PDB, we used the name in the published literature as the standard. Furthermore, for antibody fragments lacking names in both the RCSB PDB database and the original literature, we adopted the naming convention “PDBID” + “antibody fragment” (e.g., 4WEB Fab).
Biological and physicochemical properties play a crucial role in drug discovery pipelines for therapeutic antibodies. These properties include solubility, immunogenicity, aggregation tendency, expression level, stability, and hydrophobicity. We provide the International Non-proprietary Name (INN) and clinical trial information for each therapeutic antibody entry, with links to the DOTAD database [22]. Numerous antigens have been successfully identified as targets for antibodies or small molecule drugs. We compared antigen sequences with the drug targets listed in the DrugBank database [23]. A percent identity threshold of > 90% was applied to determine the corresponding drug targets.
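The identity threshold can be made concrete with a small sketch. The text does not state which alignment tool or which definition of percent identity was used, so the version below is an assumption: it takes two sequences that have already been aligned (with `-` for gaps) and divides the number of identical columns by the number of aligned columns. All names are hypothetical.

```python
IDENTITY_THRESHOLD = 90.0  # percent; a match requires identity strictly above this


def percent_identity(aligned_a, aligned_b):
    """Percent of aligned columns in which both sequences carry the same residue.

    Assumes both inputs come from one pairwise alignment, so they have the
    same length and use '-' for gaps. Columns that are gaps in both
    sequences are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if not (a == "-" and b == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def is_drug_target_match(aligned_antigen, aligned_target):
    """Apply the > 90% identity rule to one antigen/target alignment."""
    return percent_identity(aligned_antigen, aligned_target) > IDENTITY_THRESHOLD
```

Other conventions (for example, dividing by the length of the shorter sequence) give different values near the threshold, so the choice of denominator matters when reproducing such a comparison.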
2.4. Interacting residues definition
We labeled interacting residues based on both SASA and atom distance. Naccess V2.1.1 and the Bio.PDB module were employed to calculate the SASA value of each residue in the antibody and antigen, respectively. Residues with a SASA loss upon binding (ΔSASA) of more than 1 Å² were classified as interacting residues. In addition, we defined a second set of interacting paratope-epitope residues using a distance cutoff of 5 Å: an antibody residue and an antigen residue are considered to interact if at least one atom of one residue lies within 5 Å of any atom of the other.
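Both definitions can be sketched in a few lines of plain Python. This is an illustration under stated assumptions and not the authors' implementation: SASA values and atomic coordinates are passed in as ordinary dictionaries (in the paper they come from Naccess and Bio.PDB), the function names are hypothetical, and the distance test is taken to be strictly less than 5 Å.

```python
import math

SASA_LOSS_CUTOFF = 1.0  # Å^2; residues losing more than this are interface residues
DISTANCE_CUTOFF = 5.0   # Å; atom pairs closer than this define a contact


def residues_by_delta_sasa(sasa_free, sasa_complex, cutoff=SASA_LOSS_CUTOFF):
    """Return residues whose SASA drops by more than `cutoff` upon binding.

    Both arguments map a residue ID, e.g. ("H", 52), to its SASA in Å^2,
    computed for the isolated molecule and for the complex, respectively.
    """
    return sorted(
        residue
        for residue, free_area in sasa_free.items()
        if free_area - sasa_complex[residue] > cutoff
    )


def residue_pairs_by_distance(antibody_atoms, antigen_atoms, cutoff=DISTANCE_CUTOFF):
    """Return (antibody residue, antigen residue) pairs with any atom pair closer than `cutoff`.

    Both arguments map a residue ID to a list of (x, y, z) atom coordinates.
    This brute-force loop is for clarity only and is slow for large structures.
    """
    pairs = set()
    for ab_residue, ab_coords in antibody_atoms.items():
        for ag_residue, ag_coords in antigen_atoms.items():
            if any(math.dist(p, q) < cutoff for p in ab_coords for q in ag_coords):
                pairs.add((ab_residue, ag_residue))
    return pairs
```

The paratope and epitope under the distance definition are then simply the sets of first and second elements of the returned pairs.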
2.5. Data integration and website implementation
The main data processing pipeline is implemented in Python. The front-end web interface of AACDB was built with HTML and enhanced with JavaScript, CSS, and Bootstrap. We developed a dynamic 3D structure visualization window based on PV, a WebGL-based protein viewer, inspired by Dunbar et al. [15]. All data are managed in a MySQL database. On the back end, PHP is used to provide the data browsing, searching, and downloading features.
3. Results
3.1. Statistics
Out of more than 32,000 experimental structures in the PDB database, 7,498 antigen-antibody entries were manually curated in the current version of AACDB, covering 16 antibody fragment types across 14 species (Figure 3). Fab fragments and human antibodies make up the largest shares of the data, accounting for 71.98% and 60.95%, respectively. Our statistical analysis reveals a significant increase in the number of antigen-antibody complex entries in the PDB from 2021 to 2023; these entries account for approximately 45% of all antigen-antibody complex entries. The 2024 data in Figure 2B reflect the removal of the 7SIX entry (initially released on November 16, 2022) from the released PDB entries on January 17, 2024 and its replacement by the entry 8TM1. Furthermore, the developability properties of antibodies in 325 entries can be queried in the DOTAD database, and 3,733 antigen records have been identified as drug targets in DrugBank (data not shown).
3.2. Database browse and search
All data can be browsed directly by clicking the “Datasets” item in the top menu (Figure 4). The summary table includes the following nine columns:
AACDB_ID: The unique ID in the AACDB database, linking to the “Detail” page;
PDBID: The identifier in the RCSB PDB database (https://www.rcsb.org/), linking to the PDB;
Chains: The chain IDs contained in this entry. Antibody and antigen chains are separated by “_”. The heavy chain ID precedes the light chain ID if the antibody is complete;
Antibody: The antibody, given as “antibody name + fragment”;
Protein: The corresponding antigen;
Organism: The source organisms of the antibody and antigen;
Method: The experimental method used to solve the structure;
Resolution: The resolution of the structure, in angstroms (Å);
Reference: The DOI link to the original publication reporting this structure.
The table can be searched through the search panel at the top of the “Datasets” page. Users can perform a quick search by specifying one or more fields and entering relevant keywords. The search results can be downloaded in either txt or csv format. Figure 4 shows the search results for the keyword “Lysozyme” in the “Protein” column, which returns 117 hits.
An individual structure can be accessed using its AACDB_ID accession code. Clicking the AACDB_ID hyperlink takes the user to the details page shown in Figure 5. There, the complex structure can be visualized with different colors and styles (Figure 5A). Beside the visualization window, we provide more details of the entry, including the mutations, INN, and clinical trial information of the antibody, and the DrugBank ID of the antigen, if any (Figure 5B). Under the structure information tab, further details about each chain can be found. These include the sequence, the type and position of mutated amino acids, the interacting residues in each chain based on the ΔSASA method, and an interaction plot of paratope-epitope residues defined by a distance cutoff of < 5 Å (Figure 5D).
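The same kind of keyword search can be reproduced offline on a downloaded csv file. The sketch below assumes the file has a header row whose column names match the summary table (for example `Protein`); the actual layout of the downloaded file is not specified in the text, and the function name is hypothetical.

```python
import csv
import io


def search_rows(csv_text, column, keyword):
    """Return the rows whose `column` contains `keyword`, ignoring case.

    csv_text -- contents of a csv file with a header row
    column   -- header name to search in, e.g. "Protein"
    keyword  -- substring to look for, e.g. "Lysozyme"
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    needle = keyword.lower()
    return [row for row in reader if needle in row[column].lower()]
```

Each returned row is a dictionary keyed by column name, so a field such as `Chains` can be split on `_` to separate antibody chains from antigen chains.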
3.3. Data download
We provide two ways to download the data:
Download the data for a single entry.
After opening the details page of an entry via its AACDB_ID, users can click the hyperlink in the “Download” section at the bottom of the page to download the data for that entry (Figure 5E).
Download all AACDB data.
AACDB provides a download page where all sequence files, structure files, and interaction data based on the different methods are packaged into separate .zip files.
Moreover, the website provides a user-friendly ‘Help’ page that presents a step-by-step tutorial to assist users in manipulating, querying, browsing and downloading the AACDB database.
4. Discussion and conclusion
Research on antigen-antibody interactions contributes to the advancement of the antibody-related industry. Databases such as PDB and SAbDab provide the foundational data for this purpose. However, many issues remain unresolved. In this work, we developed the AACDB database with the aim of providing a clean and reliable dataset of antigen-antibody interactions. During data collection and organization, we identified numerous annotation errors in the PDB database. Some of these errors had been carried over directly into SAbDab. For example, the species of the antibody in 7WRL was incorrectly labeled as “SARS coronavirus B012” in both PDB and SAbDab. We invested significant effort and time in manually cross-referencing the original literature to rectify these errors and to exclude antibody-binding proteins that were erroneously annotated as antigens by SAbDab. Apart from the curation and reannotation of structural data, AACDB offers features not provided by other antigen-antibody complex databases: 1) AACDB’s data processing pipeline supports mmCIF files, and 2) we provide the amino acids at the interaction interface as defined by two methods, enabling unified standards for defining epitopes and paratopes. This provides a more accurate and comprehensive benchmark dataset for interaction interface prediction tools, improving the comparability of different tools.
However, AACDB still has some limitations. Despite our best efforts, the limits of our team’s resources and knowledge mean that the database may not capture all antigen-antibody complex structures. Because the data are manually curated, eliminating errors completely during processing is a challenge. While we strive to fill any gaps, we also hope that experts and users in the community will provide timely feedback to help us address these issues. In addition, AACDB currently includes only antigen proteins longer than 50 amino acids. In future work, we will expand the database to include complex structures with other antigen types, including peptides, nucleic acids, and haptens.
In summary, AACDB is a novel database of antigen-antibody complexes that provides information on antibody developability, antigen-drug target relationships, and detailed antigen-antibody interaction interfaces. It is fully accessible at http://i.uestc.edu.cn/AACDB. We are committed to regular data updates, ensuring that researchers in immunoinformatics have access to timely and valuable resources.
Acknowledgements
This work was supported by the National Natural Science Foundation of China [grant numbers 62071099, 62371112] and Sichuan Province Science and Technology Support Program [2024NSFSC0636].
Additional information
CRediT author statement
Yuwei Zhou: Writing - Original Draft, Software, Conceptualization; Wenwen Liu: Validation, Visualization; Ziru Huang: Validation; Yushu Gou: Data Curation; Siqi Liu: Data Curation; Lixu Jiang: Data Curation; Yue Yang: Data Curation; Jian Huang: Supervision, Funding acquisition
References
- [1] The global landscape of approved antibody therapies. Antib Ther 5:233–257. https://doi.org/10.1093/abt/tbac021
- [2] Macromolecules and Antibody-Based Drugs. Adv Exp Med Biol 1248:485–530. https://doi.org/10.1007/978-981-15-3266-5_20
- [3] Designing antibodies as therapeutics. Cell 185:2789–2805. https://doi.org/10.1016/j.cell.2022.05.029
- [4] Exploring the next generation of antibody-drug conjugates. Nat Rev Clin Oncol 21:203–223. https://doi.org/10.1038/s41571-023-00850-2
- [5] Combining nanotechnology with monoclonal antibody drugs for rheumatoid arthritis treatments. J Nanobiotechnology 21. https://doi.org/10.1186/s12951-023-01857-8
- [6] Deep learning in preclinical antibody drug discovery and development. Methods 218:57–71. https://doi.org/10.1016/j.ymeth.2023.07.003
- [7] Understanding the complementarity and plasticity of antibody-antigen interfaces. Bioinformatics 39. https://doi.org/10.1093/bioinformatics/btad392
- [8] Learning context-aware structural representations to predict antigen and antibody binding interfaces. Bioinformatics 36:3996–4003. https://doi.org/10.1093/bioinformatics/btaa263
- [9] SEPPA-mAb: spatial epitope prediction of protein antigens for mAbs. Nucleic Acids Res 51:W528–W534. https://doi.org/10.1093/nar/gkad427
- [10] SEPPA 3.0-enhanced spatial epitope prediction enabling glycoprotein antigens. Nucleic Acids Res 47:W388–W394. https://doi.org/10.1093/nar/gkz413
- [11] SEPPA 2.0--more refined server to predict spatial epitope considering species of immune host and subcellular localization of protein antigen. Nucleic Acids Res 42:W59–63. https://doi.org/10.1093/nar/gku395
- [12] RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Res 51:D488–D508. https://doi.org/10.1093/nar/gkac1077
- [13] RCSB Protein Data Bank: supporting research and education worldwide through explorations of experimentally determined and computationally predicted atomic level 3D biostructures. IUCrJ 11:279–286. https://doi.org/10.1107/S2052252524002604
- [14] AbDb: antibody structure database-a database of PDB-derived antibody structures. Database (Oxford). https://doi.org/10.1093/database/bay040
- [15] SAbDab: the structural antibody database. Nucleic Acids Res 42:D1140–6. https://doi.org/10.1093/nar/gkt1043
- [16] Thera-SAbDab: the Therapeutic Structural Antibody Database. Nucleic Acids Res 48:D383–D388. https://doi.org/10.1093/nar/gkz827
- [17] SAbDab in the age of biotherapeutics: updates including SAbDab-nano, the nanobody structure tracker. Nucleic Acids Res 50:D1368–D1372. https://doi.org/10.1093/nar/gkab1050
- [18] SEPPA: a computational server for spatial epitope prediction of protein antigens. Nucleic Acids Res 37:W612–6. https://doi.org/10.1093/nar/gkp417
- [19] Protein A and Protein G Purification of Antibodies. Cold Spring Harb Protoc. https://doi.org/10.1101/pdb.prot099143
- [20] Crystal structure of a Staphylococcus aureus protein A domain complexed with the Fab fragment of a human IgM antibody: structural basis for recognition of B-cell receptors and superantigen activity. Proc Natl Acad Sci U S A 97:5399–404. https://doi.org/10.1073/pnas.97.10.5399
- [21] PDBsum: Structural summaries of PDB entries. Protein Sci 27:129–134. https://doi.org/10.1002/pro.3289
- [22] DOTAD: A Database of Therapeutic Antibody Developability. Interdiscip Sci. https://doi.org/10.1007/s12539-024-00613-2
- [23] DrugBank 6.0: the DrugBank Knowledgebase for 2024. Nucleic Acids Res 52:D1265–D1275. https://doi.org/10.1093/nar/gkad976
Copyright
© 2025, Zhou et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
Views, downloads and citations are aggregated across all versions of this paper published by eLife.