Establishing comprehensive quaternary structural proteomes from genome sequence

Edward Alexander Catoiu; Nathan Mih; Maxwell Lu; Bernhard Palsson

doi:10.7554/eLife.100485.1

eLife assessment

This study presents an important platform for mapping mutation effects onto higher-level protein structural information, addressing a significant gap in current research. While the work is ambitious and incorporates often-overlooked aspects of higher-order structure, the strength of the evidence supporting some results seems incomplete. The quaternary structure modeling appears to underestimate oligomeric proteins compared to previous studies, and the mutation analysis lacks crucial baseline information. Despite these limitations, the method has potential for broader applications and generalization to additional organisms, warranting further development and refinement.

https://doi.org/10.7554/eLife.100485.1.sa2

Significance of findings

important: Findings that have theoretical or practical implications beyond a single subfield


Strength of evidence

incomplete: Main claims are only partially supported



Abstract

A critical body of knowledge has developed through advances in protein microscopy, protein-fold modeling, structural biology software, availability of sequenced bacterial genomes, large-scale mutation databases, and genome-scale models. Based on these recent advances, we develop a computational framework that: i) identifies the oligomeric structural proteome encoded by an organism’s genome from available structural resources; ii) maps multi-strain alleleomic variation, resulting in the structural proteome for a species; and iii) calculates the 3D orientation of proteins across subcellular compartments with residue-level precision. Using the platform, we: iv) compute the quaternary E. coli K-12 MG1655 structural proteome; v) use a dataset of 12,000 mutations to build Random Forest classifiers that can predict the severity of mutations; and, in combination with a genome-scale model that computes proteome allocation, vi) obtain the spatial allocation of the E. coli proteome. Thus, in conjunction with relevant datasets and increasingly accurate computational models, we can now annotate quaternary structural proteomes, at genome-scale, to obtain a molecular-level understanding of whole-cell functions.

Significance

Advancements in experimental and computational methods have revealed the shapes of multi-subunit proteins. The absence of a unified platform that maps actionable datatypes onto these increasingly accurate structures creates a barrier to structural analyses, especially at the genome-scale. Here, we describe QSPACE, a computational annotation platform that evaluates existing resources to identify the best-available structure for each protein in a user’s query, maps the 3D location of actionable datatypes (e.g., active sites, published mutations) onto the selected structures, and uses third-party APIs to determine the subcellular compartment of all amino acids of a protein. As proof-of-concept, we deployed QSPACE to generate the quaternary structural proteome of E. coli MG1655 and demonstrate two use-cases involving large-scale mutant analysis and genome-scale modelling.

Introduction

The proteome of the cell is responsible for metabolite uptake and secretion, genetic information processing and replication, energy production, and all other processes required for maintaining cellular homeostasis. Before becoming a functional unit of this multi-scale system, a protein must fold properly into its native three-dimensional shape. This folding process is also multi-scale. A peptide sequence (primary sequence) associates locally to form small recognizable patterns (e.g., alpha-helices and beta-sheets). Often stabilized by disulfide bridges and physicochemical attraction, these secondary structures fold onto each other to form larger recognizable domains, resulting in the three-dimensional structure of the protein monomer (tertiary structure). These protein monomers often oligomerize and form multi-subunit enzymes (quaternary structures) that carry out the functions in the cell.

Structural biology, the study of protein shape and function, has advanced rapidly in recent years. For proteins that form large multi-subunit complexes and for those spanning the cell membrane, the three-dimensional shape was particularly difficult to study with classical crystallographic techniques. The development of cryogenic electron microscopy, a method that images thin slices of a protein frozen in its native state (much like a biopsy), has drastically increased the speed and ease by which these previously unknowable protein structures can be resolved^1–5. Concurrently, computational methods have also experienced increasing success in accurately predicting protein structures^6–10. Most recently, deep learning algorithms^11–12 (e.g., AlphaFold) have utilized multiple sequence alignments and incorporated biophysical knowledge about protein structure to predict the shape of proteins without homologous structures in the Protein Data Bank¹³. Even more promising, these algorithms can be “hacked” to predict the structures of oligomeric assemblies of protein complexes¹⁴. Recent benchmarking efforts have confirmed the accuracy of homodimer AF-multimer models¹⁵, and AF-multimer has subsequently been used for homo-oligomeric predictions¹⁶.

Although structural biology can offer molecular insights into a protein’s shape and function, mutations in key domains can change the enzymatic properties and modulate a protein’s function. Changes in protein function can be either beneficial (e.g., an increase in stability of the active form) or detrimental (e.g., a loss of substrate binding efficiency). Protein engineering employs a variety of techniques to find mutations that produce a desired phenotype. One such technique, multiplex automated genomic engineering (MAGE)¹⁷, can introduce many mutations with unknown effects at specific sites in the genome. Mutations that result in the desired phenotype can then be selected for. Laboratory evolution, an experimental approach involving the serial passage of a cell population under increasingly stringent selection pressure, can speed up the evolutionary process, and beneficial mutations can be identified by sequencing the endpoint strain^18–20. The dramatic decrease in sequencing costs in the last ten years has allowed many of the mutations identified by these experimental techniques to be collected in databases.

Concurrent with advancements in structural biology and the formation of large-scale mutation databases, systems biology (the study of systems-level cellular behavior) was driven by the development of genome-scale models (GEMs) of cellular metabolism that predict gene essentiality, growth phenotypes, and proteome allocation in a few organisms^21–26. Software (e.g., ssbio²⁷) was developed to map available structural information to these modeled proteomes. Structural systems biology (the study of structural biology at the systems-level) has incorporated protein structures into genome-scale models (GEM-PROs) to study protein-fold evolution and investigate structural differences between organisms^28–32. Notwithstanding the incorporation of protein information at the monomer-level, the use of GEM-PROs is the most recent step towards building genome-scale models that reflect the physical nature of the cellular proteome.

Given the availability of high-quality protein structures and structural models that capture the shape of multi-subunit complexes, the deposition of mutation-phenotype information into large-scale databases, and the development of genome-scale models of cellular proteome allocation, the creation of a genome annotation platform with interoperability between structural, functional, mutational, and systems-level information is now possible.

In this study, we present the Quaternary Structural Proteome Atlas of a CEll (QSPACE), a computational annotation platform that 1) utilizes state-of-the-art modeling software (e.g., AlphaFold¹¹ & AlphaFold Multimer¹⁴) and the latest crystallographic depositions to identify a three-dimensional structural representation that accounts for the multi-subunit assembly of the cellular proteome; 2) calculates structural properties of the proteome; 3) provides a three-dimensional context to map functional information including enzymatic domains, binding sites, and protein interfaces; 4) draws mutational information from large-scale databases of laboratory-acquired mutations^20,33–35 and of the wild-type natural sequence diversity (alleleome) of E. coli³⁶; and 5) calculates the subcellular compartmentalization of the proteome with residue-level resolution.

The QSPACE platform allows users to rapidly interact with protein structural data for biological inquiries ranging from the single-protein to the genome-scale (GS). Using E. coli as an example, we present two separate genome-scale applications of the QSPACE platform to demonstrate its broad applicability. First, we exploit QSPACE’s superimposition of mutant datasets and annotated functional domains on the protein structure to calculate 100 residue-level features for over 12,000 published E. coli mutations in UniProt³⁷, allowing us to build RF-classifiers capable of predicting the severity of amino acid substitutions. Second, we showcase how QSPACE’s subcellular compartmentalization of the protein structures advances genome-scale modelling efforts. By calculating the size (volume, and, when applicable, the cross-sectional area of membrane proteins) of E. coli protein structures and incorporating them into iJL1678b²⁶, a genome-scale model that predicts the macromolecular expression (80%, by mass) of E. coli MG1655, we are able to predict the physical space (across multiple subcellular compartments) required by the computed proteome of Escherichia coli K-12 MG1655 at optimal growth rate. To our knowledge, this QSPACE/GEM-PRO is the most comprehensive whole-cell approach that captures the 3D nature of the E. coli structural proteome. As structural, mutational, and functional knowledge is discovered, and GEMs are developed with increasing specificity, QSPACE can provide a method to rapidly integrate all information related to the structural proteome for an increasing number of organisms. QSPACE can be deployed for any organism following the tutorial Python notebook available at https://github.com/EdwardCatoiu/QSPACE/.

Results

Overview of the QSPACE platform

The Quaternary Structural Proteome Atlas of a Cell (QSPACE) is an annotation platform that compiles available structural data from the latest structural biology efforts to obtain a 3D representation of all codon positions in a genome, complete with residue-level biophysical, chemical, and mutational data (see Table S1 for details). The QSPACE of E. coli is presented as a CSV file in Dataset S1. The two user-defined inputs to the QSPACE platform (Fig. 1a) are i) a list of gene IDs and ii) a dictionary of protein complexes and the associated stoichiometric ratio of the genes that make up each complex (Dataset S2A). QSPACE automatically downloads all protein structures (and homology models) from RCSB-PDB¹³, I-TASSER⁸, SWISS-MODEL¹⁰ & AlphaFold¹² that correspond to any of the genes in the user-defined inputs. QSPACE then finds the 3D coordinate file (i.e., “structure”) that best reflects the user-defined (input #2) multi-subunit protein assembly (Fig. 1b, details in Fig. 2). When no available structures can accurately reflect the gene-stoichiometry of a protein complex, QSPACE will attempt to generate models for the protein structure using an external Google Colab notebook running AlphaFold Multimer¹⁴ (v2.0 via ColabFold³⁸).
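As a rough illustration of the two inputs described above, the following Python sketch shows one plausible in-memory shape for them, together with a naive way of choosing the candidate structure that best reproduces a target gene-stoichiometry. The gene IDs, complex names, structure IDs, and the helpers `coverage` and `best_match` are hypothetical placeholders chosen for this example; they are not the QSPACE data model or API, and the actual selection logic (Fig. 2, Methods) also weighs structure quality.

```python
# Input 1: a list of gene IDs (placeholder identifiers, not real loci).
gene_ids = ["geneA", "geneB", "geneC"]

# Input 2: protein complexes mapped to the stoichiometry of their genes.
complexes = {
    "complexAB": {"geneA": 2, "geneB": 2},  # hetero-tetramer A2B2
    "monomerC": {"geneC": 1},
}

# Hypothetical candidate structures: structure ID -> number of chains per gene.
candidates = {
    "struct1": {"geneA": 2, "geneB": 2},
    "struct2": {"geneA": 1, "geneB": 1},
    "struct3": {"geneC": 1},
}


def coverage(target, structure):
    """Fraction of the target's subunits that the structure reproduces."""
    matched = sum(min(n, structure.get(gene, 0)) for gene, n in target.items())
    return matched / sum(target.values())


def best_match(target, candidates):
    """Pick the structure with the highest subunit coverage.

    Ties are broken in favor of the structure with the fewest chains
    that are not part of the target (an assumption of this sketch).
    """
    def key(item):
        _, struct = item
        extra = sum(max(n - target.get(gene, 0), 0) for gene, n in struct.items())
        return (coverage(target, struct), -extra)

    structure_id, struct = max(candidates.items(), key=key)
    return structure_id, coverage(target, struct)


for name, stoichiometry in complexes.items():
    print(name, best_match(stoichiometry, candidates))
```

In this toy setting a coverage below 1.0 for every candidate would correspond to the "missing subunits" case, where the text describes falling back to AlphaFold Multimer.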

The Quaternary Structural Proteome Atlas of a CEll is a genome-scale annotation platform that was applied to the *E. coli* proteome.
**(a)** The QSPACE platform requires two user-defined inputs: a list of gene(s) and a dictionary of proteins and their associated gene-stoichiometric ratios. User-defined proteins can be oligomeric or monomeric. QSPACE accommodates residue-level sequence variation (the alleleome³⁶ was used in this study). **(b)** QSPACE identifies (or generates with AF-multimer) the protein structure/model that best reflects the gene-stoichiometry of each user-defined protein complex (details in Figure 2a). The resulting structures **(c)** are analyzed using various software packages to calculate physicochemical properties and to identify evolutionarily variable regions and functional domains (details in Figure 3a). **(d)** The structures are localized to their subcellular compartments and membrane-embedded structures are oriented across the membrane, resulting in a three-dimensional representation of **(e)** the structural proteome of an organism. **(f)** The amino acids of protein complexes are mapped to protein structures with varying levels of coverage (mean = 0.94) in *E. coli*. **(g)** Genome-scale counts of the unique codon positions belonging to various computed categories are shown.

QSPACE yields 3D structures that reflect the oligomeric nature of multi-subunit proteins.
(a) There are three ways that QSPACE’s protein-structure module finds the 3D structural representation for the user-defined gene-stoichiometry of a protein (target). (i) For each protein target (*left*), all structures that share genes with the protein are identified (*center left*) and combined (if applicable) to recreate the user-defined gene-stoichiometry (*center right*). If multiple matches are found, the structure that most accurately reflects the complete protein is selected (*right*). (ii) Sometimes, only higher-order structures of the protein-target are identified (*center left*). After QCQA of the relevant quality metrics associated with each structure (see Fig. S9, Methods 3.1), the highest confidence oligomeric structure is used to redefine the gene stoichiometry for the protein (‘new complex’, *right*). (iii) If identified structures are unable to recreate the protein in its entirety (‘missing subunits’, *center left*), AlphaFold Multimer is used to predict the structure (<2000 AAs) and the quality of the resulting oligomeric models is assessed (see Fig. S11, Dataset S9). (b) To achieve a multi-subunit representation of the *E. coli* proteome, structures or models from various sources are used. Unlike previous GEM-PROs, QSPACE accommodates monomeric and k-meric AlphaFold models. (c) The protein-to-structure module yields 3D structures that represent the oligomeric nature of the *E. coli* proteome. This representation is further improved using structural data to correct existing protein-gene stoichiometry and with AlphaFold Multimer to calculate novel oligomeric structures, (d) allowing for a truer accounting of higher-order oligomeric proteins. (e) The QSPACE platform generated 50 novel structures for ABC-transporters in *E. coli*. The pLDDT scores are used to color the AF-Multimer models. The associated AlphaFold Model Score (see Fig. S11, Dataset S3) is displayed in the bottom right. Asterisks are used to denote incomplete models resulting from incorrectly defined gene-stoichiometries or from protein size limitations of ColabFold. Existing PDB structures are shaded.

The thresholds used by QSPACE to assess the accuracy of selected protein structures are described in the text accompanying Figure 2 and in the methods section. All quality metrics related to protein structures for the E. coli QSPACE are provided in Dataset S3. By exploiting previously published repositories of protein structures, QSPACE reduces the time required to begin working with genome-scale structural data to the order of days. Depending on user preferences, QSPACE can function with all or some of the structural repositories used in this manuscript and can easily accommodate structural data from new sources as they become available.

Once the structure file representing the quaternary assembly of each protein is determined, multiple software packages and databases (see Table S1) are used to map physicochemical, evolutionary, and functional information to the protein structures (Fig. 1c). The 3D overlay of multiple data types (details in Fig. 3) creates potential for many analysis tools (e.g., Fig. 4). The amino acids in each protein are then assigned to one of twelve subcellular compartments, and those representing the membrane fraction of the proteome are oriented across one of the E. coli membranes (Fig. 1d, details in Fig. 5). These structures can be integrated with genome-scale systems models to add a 3D understanding of the biophysical/spatial allocation of the proteome in a functioning cell (see Fig. 6). Users of QSPACE can bypass the mapping of any of these datasets if they are not relevant to their research, or not available for their organism.

Multi-dimensional QSPACE features to predict mutant phenotypes.
**(a)** QSPACE calculates 100 properties (residue-level, sequence-level, and structure-level) for all amino acids. (right) The mapping of the three-dimensional location of functional domains on the protein structure enables the calculation of a mutation’s proximity to important protein regions. **(b)** (left) UniProt mutations and their annotated phenotypes are mapped to QSPACE and (right) can be classified into general categories that reflect mutant severity. **(c)** Annotated phenotypes and multi-dimensional properties for 12,000 UniProt mutations mapped to the *E. coli* QSPACE (Dataset S5) can be used to train RF-classifiers that can predict the effect of a mutation (see Fig. 4). We suggest the application of these classifiers to novel mutant datasets (e.g., mutations from adaptive laboratory evolution (ALE) experiments, the long-term evolution experiment (LTEE), and the alleleome, which are already mapped to the *E. coli* QSPACE in Dataset S1).

Random-forest prediction of mutant phenotypes in UniProtKB.
**(a)** QSPACE finds 4,299 mutants with known phenotypes in proteins containing *active sites*. The mutational properties (‘features’) of the amino acid residue (grey), the 5 amino acid long sequence centered at the mutation (yellow), and the local 3D protein structure (green) are calculated (see Fig. 3). The numbers inside the boxes describe the number of features used. Random Forest Classifiers are trained on (top, “0D”) residue-level parameters; (middle, “1D”) residue and sequence-level parameters; and (bottom, “3D”) residue, sequence, and structure-level parameters. For RF-classifiers initially trained on more than 30 parameters, the least predictive parameter is removed until the 30 most-predictive parameters are identified. **(b)** “One vs Rest” receiver operating characteristic curves and precision-recall curves are calculated from the averages of 100 RF-classifiers trained on the 3 different sets of parameters. The shaded region represents 1 standard deviation. **(c)** The importance of individual features for “3D” RF-classifiers is determined. MUT- and MUT+ reflect pre- and post-mutation sequence properties, respectively. The radius of the 3D environment is described where applicable. The interoperability of multiple datatypes in QSPACE allows for the calculation of a mutation’s proximity to the nearest active site, the third most important feature in determining mutant severity. **(d)** QSPACE can be used to analyze mutations found in proteins containing various functional domains. **(e)** The weighted (by phenotype class) area under the curve (AUC) is calculated from the ROC and P-R curves in Panel B for RF-classifiers (3 parameter sets x 100 RF-classifiers) for each functional domain. **(f)** The cumulative importance of residue, sequence, and structure features in “3D” RF-classifiers of mutations in proteins containing each domain.

The membrane module orients proteins across the membrane to identify residue-level subcellular compartments of the *E. coli* proteome.
**(a)** In the metadata provided by EcoCyc (cellular compartment), Gene Ontology Terms (pathway, function, compartment), iML1515 (metabolic subsystem), and UniProt (topological & transmembrane domains) databases, there are 1,777 protein structures mapped to at least one gene that is associated with the *E. coli* membrane. **(b)** Membrane-crossing residues are identified by the amino acid sequence information provided by UniProt, predicted by DeepTMHMM, and calculated by OPM. From these residues, a plane of best fit is calculated. **(c)** Structures with two calculated membrane planes pass the QCQA analysis if i) the angle between the planes is less than 35°, ii) the thickness of the membrane embedded region is between 12 and 45 Angstroms, and iii) the cross-sectional area of the membrane embedded region is less than 10,000 Å². **(d)** Membrane proteins are oriented using the topological information provided by UniProt (if available) or manually using common protein motifs (see Dataset S6-S7) such that **(e)** the subcellular compartment of every amino acid of the *E. coli* proteome can be determined.
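The three QCQA criteria listed in panel (c) can be expressed as a short check. The Python sketch below is an illustration only: it assumes the two membrane planes are already available as normal vectors, treats the thickness bounds as inclusive, and uses the hypothetical function names `angle_between_planes` and `passes_membrane_qcqa`. It does not reproduce the plane-fitting step or the QSPACE implementation.

```python
import math


def angle_between_planes(normal_1, normal_2):
    """Acute angle (degrees) between two planes given their normal vectors."""
    dot = sum(a * b for a, b in zip(normal_1, normal_2))
    norm = math.sqrt(sum(a * a for a in normal_1)) * math.sqrt(
        sum(b * b for b in normal_2)
    )
    # abs() makes the result independent of which way each normal points;
    # min() guards against rounding slightly above 1.0.
    cos_theta = min(1.0, abs(dot) / norm)
    return math.degrees(math.acos(cos_theta))


def passes_membrane_qcqa(normal_1, normal_2, thickness, area,
                         max_angle=35.0, min_thickness=12.0,
                         max_thickness=45.0, max_area=10_000.0):
    """Apply the three caption thresholds (angstroms and square angstroms).

    Inclusive thickness bounds are an assumption of this sketch.
    """
    return (
        angle_between_planes(normal_1, normal_2) < max_angle
        and min_thickness <= thickness <= max_thickness
        and area < max_area
    )


# Nearly parallel planes, 30 A thick, 5,000 A^2 cross-section.
print(passes_membrane_qcqa((0, 0, 1), (0, 0.1, 1), thickness=30.0, area=5000.0))
```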

QSPACE integrates with genome-scale models to predict the physical space required by the *E. coli* proteome at optimal growth rate.
**(a)** The compartmentalization of each amino acid of the *E. coli* proteome allows for the calculation of geometric properties of all proteins. **(b)** The volume of ATP-synthase and **(c)** its cross-sectional membrane area are shown. **(d)** The integration of QSPACE with genome-scale models (iJL1678b-ME, in this case) of metabolism (M-matrix) and macromolecular expression (E-matrix) (ME-models) can be used to calculate **(e)** the proteome allocation, **(f)** the volumetric allocation, and **(g)** the membrane composition of *E. coli* at optimal growth rate. The expression, volume, and membrane area allocated to ATP-synthase are shown (Panels E-G, cyan). The calculated spatial allocation (Panels F-G) of the macromolecular expression predicted by existing ME-models (in Panel E) is a fundamental advancement towards building genome-scale biophysical whole-cell models.

As an example, we apply QSPACE to the genome of E. coli K-12 MG1655 (Fig. 1e, Dataset S1) and identify the quaternary structural representation of its oligomeric proteome, as defined by the multi-decade bibliomic curation available in EcoCyc³⁹ and in the E. coli genome-scale model iJL1678b²⁶. These gene-stoichiometric inputs to the E. coli QSPACE are provided in Dataset S2A. Selecting from both experimentally resolved structures deposited in the Protein Data Bank (RCSB-PDB)¹³ and from structural models calculated using protein modeling methods (I-TASSER⁸, SWISS-MODEL¹⁰, AlphaFold¹² & AlphaFold Multimer¹⁴) (details in Fig. 2), QSPACE can map the 3D position of 94% (on average) of the amino acids belonging to 3,985 annotated E. coli proteins (Fig. 1f, Dataset S3A). The set of structures that QSPACE maps to the E. coli structural proteome can be used as 3D scaffolds to map multiple structural, functional, mutational, and spatial data types (Fig. 1g).

Structural representation of multi-subunit proteins

Proteins often require oligomerization to function properly. The fundamental advancement of the E. coli QSPACE over existing genome-scale models with protein structures (e.g., iML1515-GP⁴⁰; see Fig. S1) is that it can be used to identify structures that represent the quaternary shapes of multi-subunit proteins.

To ensure that the user-defined multi-subunit proteins are accurately reflected in the structural data, we designed a pipeline to identify the best available protein structure for a target oligomeric protein, to suggest changes to the user-defined gene stoichiometry when the existing structural data suggests oligomerization, and to generate de novo structural models for oligomeric enzymes whose subunits cannot be fully represented by the structures in the PDB. A simplified representation is shown in Figure 2a.

The input to the QSPACE pipeline is a user-defined dictionary of protein complexes and their associated gene-stoichiometries. For E. coli, this information is the result of multi-decade bibliomic evidence that has been annotated in the EcoCyc database³⁹ and in the genome-scale model iJL1678b-ME²⁶ (Dataset S2A). Across these resources of annotated protein-complexes, 31% (1,334/4,309) of E. coli genes participate in 1,047 oligomeric complexes, 667 genes are annotated as monomers, and 2,308 genes are not included (i.e., assumed to be monomers) (Fig. S9A-B). In the set of annotated or assumed monomers, QSPACE identified structures (in the PDB or SWISS-MODEL repository) containing one or more oligomeric conformations for 983 of these genes (Fig. 2a.ii & Fig. S9C). QSPACE uses a semi-automated pipeline that relies on various structure-derived quality metrics to assess the accuracy of PDB structures and SWISS-MODELs before redefining the existing monomeric annotation for these genes (see Methods 3.1).

The accuracy of quaternary structures (experimental and modelled) has been the focus of many community-wide structural biology efforts. Previous studies have estimated that the accuracy of the quaternary structures in the PDB (‘biological assemblies’) is in the range of 80-90% ^41–44, and the accuracy of PISA-generated homo-oligomers to be 85%⁴⁵. QSPACE uses PDB biological assemblies that are author-defined, software-defined (by PISA), or both (see Dataset S8). In cases where the PDB structures suggested oligomerization (contrary to the existing monomeric annotation), we reviewed the publication(s) associated with each PDB structure to confirm the oligomeric structure is believed (by the authors) to be biologically relevant (case IV-V in Fig. S9C-D).

Since 2017, QSQE-scores have been used to assess the quality of oligomeric SWISS-MODELs⁴⁶. Recently, the SWISS-MODEL QSQE-score was shown to distinguish between biologically relevant and non-relevant homodimer structures at a rate of 0.79¹⁵. Although other modelling platforms perform slightly better¹⁵, SWISS-MODELs are precomputed and readily available, making them a convenient choice for rapid integration into the QSPACE annotation platform. Thus, in cases where SWISS-MODELs provided structural evidence of oligomerization (cases I-III in Fig. S9C-D), QSPACE relies on the established metrics and thresholds (QSQE⁴⁶ > 0.5, GMQE⁴⁷ > 0.5, and QMN4⁴⁷ > -4) to assess the accuracy of each oligomeric SWISS-MODEL. SWISS-MODELs with scores exceeding these thresholds are used to redefine the oligomerization state of the user-defined monomers.
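The three SWISS-MODEL thresholds quoted above (QSQE > 0.5, GMQE > 0.5, QMN4 > -4) amount to a simple conjunctive filter. The Python sketch below illustrates this; the function name `passes_swiss_model_thresholds`, the dictionary layout, and the example scores are invented for illustration and are not taken from Dataset S2B or the QSPACE code.

```python
def passes_swiss_model_thresholds(qsqe, gmqe, qmn4):
    """True if an oligomeric SWISS-MODEL exceeds all three thresholds."""
    return qsqe > 0.5 and gmqe > 0.5 and qmn4 > -4


# Made-up scores for three hypothetical oligomeric models.
models = {
    "model_1": {"qsqe": 0.72, "gmqe": 0.81, "qmn4": -1.3},
    "model_2": {"qsqe": 0.41, "gmqe": 0.90, "qmn4": -0.5},  # QSQE too low
    "model_3": {"qsqe": 0.66, "gmqe": 0.75, "qmn4": -5.2},  # QMN4 too low
}

accepted = [name for name, s in models.items()
            if passes_swiss_model_thresholds(s["qsqe"], s["gmqe"], s["qmn4"])]
print(accepted)
```

Only models that pass all three checks would be candidates for redefining a monomeric annotation as oligomeric.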

All oligomeric structures that were considered for changing the annotated E. coli monomers are provided in Dataset S2B. The relevant quality metrics associated with each structure ultimately selected in the E. coli QSPACE are provided in Dataset S3B.

When structures (in the PDB and/or SWISS-MODEL) are unable to fully reflect the gene-stoichiometry of a user-defined oligomer, the QSPACE platform relies on AlphaFold Multimer¹⁴ (v2.0, via ColabFold³⁸) to generate de novo structures for desired protein oligomers (Fig. 2a.iii). AlphaFold Multimer (v2.0) was shown to outperform existing methods in modelling physiological homodimers¹⁵ and has been reported to generate high-confidence homo-oligomeric structures for various organisms, including E. coli¹⁶. QSPACE assigns confidence to AF-Multimer models using an established scoring metric¹⁴ (0.8*iPTM + 0.2*PTM ≥ 0.8) (Fig. S11). It is important to note that the iPTM thresholds were shown to correlate with biologically relevant homo-dimer models¹⁵.
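The confidence criterion quoted above is a weighted sum of two scores compared against a cutoff. A minimal Python sketch, with the hypothetical function names `af_multimer_score` and `is_high_confidence` and made-up input values, could look as follows.

```python
def af_multimer_score(iptm, ptm):
    """Model confidence as quoted in the text: 0.8*iPTM + 0.2*PTM."""
    return 0.8 * iptm + 0.2 * ptm


def is_high_confidence(iptm, ptm, threshold=0.8):
    """True if the weighted score meets the 0.8 cutoff used in the text."""
    return af_multimer_score(iptm, ptm) >= threshold


# Made-up scores for two hypothetical models.
print(is_high_confidence(iptm=0.90, ptm=0.85))  # interface well predicted
print(is_high_confidence(iptm=0.60, ptm=0.90))  # weak interface dominates
```

Because the interface term carries four times the weight of the overall term, a model with a well-predicted fold but a poorly predicted interface fails the cutoff, as in the second call.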

Furthermore, we confirmed the physiological relevance of 86% (841/973) of the homo-oligomeric structures that QSPACE ultimately selects to represent the E. coli proteome (Fig. S10) using QSalignWeb⁴⁵, a webserver that uses superposition of structures to infer the physiological relevance of a quaternary structure. We provide all relevant quality metrics associated with each structure, and the QSalign inferred relevance (when applicable) for all proteins in the E. coli QSPACE in Dataset S3B.

The final structural representation of the E. coli proteome is a collection of experimental structures (deposited in the PDB) and models (generated by SWISS-MODEL, I-TASSER, AlphaFold, and AlphaFold Multimer) (Fig. 2b). The collection of structures identified by QSPACE captures the multi-subunit assembly of 1,473 oligomeric proteins (Fig. 2c). Proteins that are not known to oligomerize and that have no structural evidence of oligomerization are mapped to their respective monomeric structures as in previous GEM-PRO formulations^27–31. We show that QSPACE identifies the structures of higher-order oligomeric enzymes (Fig. 2d). Among these oligomers, the QSPACE platform identifies high-confidence structures for 51/54 ATP-binding cassette (ABC) transporters in E. coli (defined in EcoCyc³⁹ and/or iJL1678b²⁶). Only 4 of these transporters have experimentally resolved structures in the PDB (2QI9, 3RLF, 7CGE & 6MHU). Figure 2e presents the high-confidence novel structures that QSPACE generated with AF-multimer for 47 of the 50 remaining ABC-transporters. Incomplete AF-multimer models (3/50, asterisks in Fig. 2e) suggest corrections to the gene-stoichiometry of ABC-transporters that were incorrectly annotated at the time of publication (e.g., the putative ABC-55 transporter is missing an ATP-binding subunit).

When compared to the latest E. coli genome-scale model with protein structures (iML1515-GP⁴⁰), QSPACE improves the oligomeric structural annotation for 70% of genes in iML1515, while offering a 2.86-fold increase in gene coverage and higher quality structures (Fig. S1). To our knowledge this result is the most advanced genome-scale structural representation of the E. coli proteome and de facto represents a major advancement in genome annotation.

Interoperable data types form the basis for predicting mutant phenotypes

An accurate 3D structural representation of the proteome can serve as a scaffold for mapping multiple data types, thus providing a structured approach to data integration. The interoperability of multiple datatypes can accelerate our understanding of structure-function relationships and mechanisms. To this end, QSPACE uses third-party software (Table S1) to map residue-level, sequence-level, and protein-level properties (columns in Dataset S1) to all amino acids of the E. coli proteome. To illustrate the extensive functional content contained in the E. coli QSPACE, we provide a global accounting of all functionally important regions of the E. coli proteome (Fig. S4 and Dataset S10).

Non-synonymous mutation, the swapping of one amino acid residue for another, provides an opportunity for QSPACE to be used for mutant analysis. Residue-level properties (e.g., the Grantham score⁴⁸) of each mutation are calculated. The physicochemical properties (e.g., the hydrophobicity) of the local sequence (i.e., 5 amino acids centered at the mutation) of each mutation are also determined. Using the protein structure, QSPACE can calculate the properties of the local 3D environment (all amino acids within a fixed radius) of a mutation. The interoperable mapping of multiple datatypes onto the protein structure also allows for the calculation of unique properties (e.g., the distance between a mutation and the nearest protein active site). A graphical summary of mutant-specific properties can be found in Figure 3a.
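To make the three levels of context concrete, the Python sketch below computes a 5-residue sequence window, the residues within a fixed radius of a mutated position, and the distance to the nearest annotated site. It assumes one representative coordinate per residue (for example a C-alpha atom) and 0-based residue indices; the function names, the toy sequence, and the coordinates are hypothetical and do not reflect how QSPACE stores or computes these features.

```python
import math


def local_window(sequence, index, width=5):
    """Residues in a window of `width` centered on `index` (clipped at termini)."""
    half = width // 2
    return sequence[max(0, index - half): index + half + 1]


def neighbors_within(coords, index, radius):
    """Indices of other residues whose coordinate lies within `radius`."""
    center = coords[index]
    return [i for i, c in enumerate(coords)
            if i != index and math.dist(center, c) <= radius]


def distance_to_nearest_site(coords, index, site_indices):
    """Distance from residue `index` to the closest annotated site residue."""
    return min(math.dist(coords[index], coords[s]) for s in site_indices)


# Toy 7-residue protein with made-up coordinates (angstroms).
sequence = "ACDEFGH"
coords = [(0, 0, 0), (3, 0, 0), (0, 4, 0), (10, 0, 0),
          (0, 0, 6), (8, 8, 0), (1, 1, 1)]

print(local_window(sequence, 3))                     # sequence context
print(neighbors_within(coords, 0, radius=5.0))       # 3D environment
print(distance_to_nearest_site(coords, 0, [2, 3]))   # proximity to a site
```

Residues that are far apart in sequence can be close in space (residue 6 in the toy example), which is why the 3D environment and the sequence window can carry different information.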

The UniProt knowledgebase³⁷ contains annotated phenotypes for over 12,000 non-synonymous E. coli mutations. We use keyword phrases (Dataset S4) to assign each mutation’s annotated phenotype to one of eight phenotype classes of varying severity (Fig. 3b). Combined with the 100 mutant-specific properties calculated by QSPACE (Dataset S5), the mutant-phenotype UniProt dataset can be used to train Random Forest (RF) classifiers that can predict the severity of mutations in novel mutational databases (e.g., from adaptive laboratory evolutions, the long-term evolution experiment, or the natural sequence variants) (Fig. 3c).

Random-forest classification of mutant phenotypes

We investigated the accuracy with which Random Forest (RF) classifiers predicted mutant phenotypes and the relative importance of higher-dimensional features. To this end, we selected all UniProt mutations found in proteins containing annotated active sites (e.g., Fig. 4a) and calculated 100 mutant-specific properties for each mutation (see Fig. 3, Dataset S5). To quantify the importance of higher-dimensional properties, we trained three sets of RF-classifiers on varying combinations of residue (“0D”, gray), sequence (“1D”, yellow), and structure (“3D”, green) features, and iteratively removed the least-predictive feature after 100 train/test cycles until the 30 most-predictive features were identified (Fig. 4a). For each set of RF-classifiers, we quantified model performance (accuracy, precision, and recall) using “One vs Rest” validation for each phenotype class (Fig. 4b). The importance of individual features used to train the “3D” RF-classifiers is shown (Fig. 4c).
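To make the "One vs Rest" performance metrics concrete, the sketch below computes accuracy, precision, and recall for a single phenotype class by treating that class as positive and all other classes as negative. It is a plain-Python illustration under that assumption, not the evaluation code used in the study.

```python
def one_vs_rest_metrics(y_true, y_pred, positive_class):
    """Accuracy, precision and recall with `positive_class` vs all other classes."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        is_pos = truth == positive_class
        called_pos = pred == positive_class
        if is_pos and called_pos:
            tp += 1
        elif not is_pos and called_pos:
            fp += 1
        elif is_pos and not called_pos:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Precision and recall are reported as 0.0 when undefined.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Calling the function once per phenotype class gives the per-class values that a figure such as Fig. 4b would summarize.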

Mutations in the UniProt dataset are not limited to proteins containing active sites (Fig. 4d). Thus, we followed the procedure described in Figure 4a-c to obtain a global assessment of our ability to predict mutant phenotypes found in proteins containing various functionally important annotations (UniProt “Feature”). The Area Under the Curve (AUC) for the Receiver Operating Characteristic (ROC) and Precision-Recall (P-R) curves were weighted by the relative occurrence of each phenotype class (i.e., horizontal line in Figure 4b, right) and plotted for RF-classifiers trained for each functional class (Fig. 4e). For each functional annotation, the cumulative importance of residue-level, sequence-level, and structural features in “3D” RF-classifiers supports the use of protein structures as a context to study interoperable datatypes and mutations (Fig. 4f).
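One plausible reading of the weighting described above is an average of the per-class AUC values, each weighted by how often its phenotype class occurs in the dataset. The sketch below implements that reading; the function name and input layout are assumptions, and the study's exact procedure may differ.

```python
from collections import Counter

def prevalence_weighted_mean(per_class_scores, labels):
    """Average per-class scores (e.g. AUC), weighted by class frequency in `labels`."""
    counts = Counter(labels)
    total = sum(counts[c] for c in per_class_scores)
    if total == 0:
        raise ValueError("none of the scored classes occur in labels")
    return sum(score * counts[c] for c, score in per_class_scores.items()) / total
```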

The membrane module yields angstrom-level subcellular compartmentalization of the E. coli proteome

While mapping data types to individual protein complex structures can prove useful, understanding the location and space that these protein complexes occupy in the cell is important for building a genome-scale representation that reflects the physical embodiment of a proteome. To date, genomic databases of E. coli (e.g., EcoCyc) assign the entire gene to a subcellular compartment. UniProt sometimes offers sequence annotation of transmembrane and topological (‘bulb’) domains; however, these annotations may be inaccurate (see Fig. 5b) or missing entirely. Since the structure is not used to determine a protein’s subcellular compartment, assigned cellular compartments can often be incomplete (e.g., there is no distinction between membrane proteins that contain and those that do not contain membrane-spanning regions) and the residue-level orientation of a protein across the cell membrane cannot be achieved. Likewise, sequence-based prediction software (e.g., DeepTMHMM⁴⁹) and structure-based prediction software (e.g., OPM⁵⁰) are agnostic to membrane orientation and can also generate erroneous results.

To achieve a residue-level representation of the E. coli proteome, we use a structure-guided approach that combines and assesses all available annotations and predictions (from UniProt, DeepTMHMM, and OPM) to better identify the integration and orientation of the membrane-embedded proteome.

QSPACE queries the available gene-level subcellular compartment information provided by EcoCyc³⁹, UniProt³⁷, Gene Ontology⁵¹, and genome-scale model iML1515⁴⁰ to identify all potential membrane-embedded protein structures (Fig. 5a). For each identified structure, QSPACE determines the membrane-spanning residues for each subunit using the sequence annotations provided in UniProt, the sequence-based predictions generated by DeepTMHMM⁴⁹, and the structure-based calculation of the membrane planes predicted by OPM⁵⁰. For each of the three sources of residue information (when available), QSPACE calculates the normal vectors of the corresponding membrane planes (Fig. 5b). For each pair of membrane planes, the angle between the planes, and the thickness and area of the membrane-embedded region are used to determine whether the calculated membranes are viable (Fig. 5c).
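The geometric quantities named above can be illustrated with a short sketch that derives a unit normal from three points on a plane and measures the angle between two planes. This is an illustrative assumption about how such values could be computed, not the QSPACE code; in practice a plane would more likely be fitted to many membrane-boundary residues.

```python
import math

def plane_normal(p1, p2, p3):
    """Unit normal of the plane through three non-collinear points."""
    u = [b - a for a, b in zip(p1, p2)]
    v = [b - a for a, b in zip(p1, p3)]
    # Cross product u x v is perpendicular to the plane.
    n = (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )
    length = math.sqrt(sum(c * c for c in n))
    if length == 0:
        raise ValueError("points are collinear; the plane is undefined")
    return tuple(c / length for c in n)

def angle_between_planes(n1, n2):
    """Angle in degrees (0 to 90) between two planes given their unit normals.

    The absolute value makes the result independent of which way each
    normal happens to point.
    """
    cos_theta = abs(sum(a * b for a, b in zip(n1, n2)))
    return math.degrees(math.acos(min(1.0, cos_theta)))
```

Two membrane planes derived from different sources (e.g., UniProt and OPM) that describe the same membrane should give an angle near zero under this measure.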

QSPACE segregates each viable membrane protein into three sections: a membrane-embedded region and two ‘bulbous’ regions. Each bulb is automatically assigned (Dataset S6) to either the cytoplasmic, periplasmic, or extracellular side of either the inner or outer membrane, using the annotated topological domains in UniProt or manually assigned (Dataset S7) using common 3D motifs in the protein structures (Fig. 5d). Proteins annotated to the cell membrane (Fig. 5a) that do not contain a membrane-embedded region are considered ‘membrane-associated’ and tagged to their respective membrane while those tagged to the cytoplasm or periplasm are left unchanged. The gene ontology (GO) terms of genes mapped to non-membrane proteins were used to assign proteins to the cytoplasm, periplasm, or extracellular space.
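The three-way segregation can be sketched by projecting each residue onto the membrane normal and comparing the signed offset with the membrane half-thickness. The function name, region labels, and the example half-thickness are hypothetical; mapping each bulb to a named compartment (cytoplasm, periplasm, or extracellular space) is the separate step described above.

```python
def assign_region(residue_xyz, membrane_center, unit_normal, half_thickness):
    """Label a residue as membrane-embedded or as belonging to one of two bulbs.

    The signed offset is the residue's displacement from the membrane
    mid-plane, measured along the unit normal.
    """
    offset = sum(
        (r - c) * n for r, c, n in zip(residue_xyz, membrane_center, unit_normal)
    )
    if abs(offset) <= half_thickness:
        return "membrane"
    return "bulb_positive" if offset > 0 else "bulb_negative"
```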

In E. coli, QSPACE was able to assign 86% of proteins (89% of AAs) to one of twelve subcellular compartments (Fig. 5e), resulting in a residue-level annotation of cellular compartmentalization of the E. coli proteome across both cellular membranes. The membrane integration for an additional 5% of proteins (2% of AAs) is known (Fig. 5e, compartments #13-17), however there is insufficient information to properly orient these proteins across the membrane (e.g. short, single-pass transmembrane helix proteins, see Dataset S7). Incorporated into genome-scale models that compute protein expression (or proteomic datasets), the residue-level compartmentalization of each protein structure provides a first-principles approach to compute the location and size taken up by a cell’s proteome.

Computing the physical space required by the E. coli proteome

The multi-subunit protein complexes carry out metabolic reactions, transport nutrients across the cell membrane, maintain cellular homeostasis, replicate the cellular genome, and even synthesize other proteins. Considering all these functions simultaneously calls for the use of computational models. As genome-scale models (GEMs) have increased in scope and mechanistic detail²²–²⁶,⁵², they require the biosynthesis and proper assembly of multi-subunit complexes to drive the reactions in their reconstructed metabolic networks.

While genome-scale models using protein structures (GEM-PROs) have been used for a variety of applications⁵³ (e.g., contextualization of disease-associated human mutations³², identification of protein-fold conservation in similar metabolic reactions³¹, prediction of thermosensitivity in a metabolic network³⁰, comparative structural analyses of multiple organisms²⁸), the promise of a complete physical representation of a functioning cellular proteome has yet to be delivered. QSPACE moves us closer to this goal by calculating the subcellular compartment of every amino acid across the proteome.

The successful annotation and 3D orientation of proteins across the subcellular compartments is crucial for building genome-scale models that can predict the physical distribution of the cellular proteome. A geometric analysis of the compartmentalized proteome (Fig. 6a) allows us to calculate the volume occupied (Fig. 6b) as well as the membrane area required (if applicable) (Fig. 6c) by each protein. Genome-scale models of metabolism and macromolecular expression (ME-models) (Fig. 6d) predict the proteome allocation required to sustain growth in optimally growing bacterial cells (Fig. 6e). In calculating the physical space required by each protein, the spatial requirements of model-predicted proteomes can also be determined (Fig. 6f-g). Thus, it is now possible to compute the composition and location of the structural proteome. A more detailed supra-protein-complex-level 3D arrangement requires additional considerations⁵⁴–⁵⁵.
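Once a volume and (where applicable) a membrane area are available for each protein, the spatial requirement of a model-predicted proteome reduces to a copy-number-weighted sum. The sketch below shows that arithmetic; the dictionary layout and function name are assumptions, and units are whatever the per-protein values are expressed in.

```python
def proteome_space(copy_numbers, volumes, membrane_areas):
    """Total volume and total membrane area required by a predicted proteome.

    copy_numbers   : protein id -> predicted copies per cell
    volumes        : protein id -> volume of one copy
    membrane_areas : protein id -> membrane area of one copy
                     (proteins without a membrane-embedded region are omitted)
    """
    total_volume = 0.0
    total_area = 0.0
    for protein, copies in copy_numbers.items():
        total_volume += copies * volumes[protein]
        total_area += copies * membrane_areas.get(protein, 0.0)
    return total_volume, total_area
```

Grouping the same sums by subcellular compartment would give a per-compartment breakdown.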

Discussion

The 3D visualization and modeling of the structural proteome of a functioning cell has been an implicit goal of genome-scale annotations and computational biology methods. QSPACE, introduced here, rapidly identifies and annotates multi-subunit protein structures (including de novo annotations of protein-complex assemblies and de novo structural models) at the genome-scale for computational modeling and structural analyses. In conjunction with mutational databases, functional annotations, and other data types, the oligomeric structures identified through QSPACE can be used to obtain a deeper understanding of whole-cell functions.

To achieve a physical representation of the cellular proteome, the structure of each individual protein complex in its native state is needed. To this end, QSPACE allows for multi-gene mapping to oligomeric crystallographic depositions (e.g., PDB bioassemblies), existing homo-oligomeric structural models (e.g., high-quality⁴⁷ SWISS-MODEL models¹⁰ with high QS-scores⁴⁶), and de novo high-quality oligomeric models (from Alphafold Multimer¹⁴/ColabFold³⁸). Unlike a purely annotative workflow, QSPACE uses a structure-guided assessment to identify previously unannotated oligomeric assemblies and generate de novo structural models when the existing structural data for a protein complex is incomplete. As an example, we present the novel structures of 50/54 ABC-transporters in E. coli, and show that even incomplete models (3/50) can provide clues to the correct oligomerization of a protein (Fig. 2e). In the E. coli QSPACE, we confirmed the physiological relevance for 86% of the homo-oligomeric structures with QSalignWeb⁴⁵ (Fig. S10). Thus, QSPACE achieves the structural representation of multi-subunit protein complexes, a significant advancement over existing genome-scale models using structural biology software (ssbio²⁷, see Fig. S1).

The protein structures identified by QSPACE are a well-suited 3D scaffold on which to calculate protein properties, identify enzymatic domains, and analyze impactful mutations. QSPACE’s interoperability of various data types (columns in Dataset S1) can drive biological discovery. In this study, we showed how the residue-level, sequence-level, and protein-level properties calculated by QSPACE (Fig. 3a) for the mutations annotated in the UniProt knowledgebase (Fig. 3b) can be used to accurately predict mutant phenotypes (Fig. 4b & 4e). Interestingly, when we iteratively removed the least predictive properties from the RF-classifiers during the training phase (Fig. 4a), we found that the predictive power of RF models was overwhelmingly the result of structure-level features (Fig. 4c & 4f). Thus, QSPACE provides users with a rapid way to interact with relevant structures and interoperable datatypes to elucidate structure-function relationships across multiple scales.

In addition to the annotated mutations in UniProt, QSPACE can also be used to analyze novel mutational data sets from adaptive laboratory evolutions (ALEdb²⁰, Fig. S5), the long-term evolution experiment (LTEE³³, Fig. S6), and the natural sequence variation³⁶ of E. coli in three dimensions (Fig. 3c). To our knowledge, the E. coli QSPACE provides the first 3D representation of the natural sequence variation of an organism at the genome-scale, and it moves the description and scale of the structural proteome to the species level.

QSPACE advances whole cell modeling efforts⁵⁴,⁵⁶–⁵⁷ by establishing structural annotations relevant for molecular processes. Advancements with computational genome-scale models (GEMs) over the past decade have allowed for the prediction of proteome allocation for cells at optimal growth rate²²–²⁶,⁵². Increasingly detailed, GEMs include reactions for protein assembly and translocation across subcellular compartments (e.g., membranes); however, previous GEM formulations with monomeric protein structures (GEM-PROs) have yet to reflect the biophysical embodiment of these in silico processes.

Using a structure-based approach that combines and assesses all available annotations and predictions for membrane-spanning proteins, QSPACE determines the membrane integration and orientation of proteins across both the inner and outer membrane of E. coli. In fact, QSPACE even calculates membrane integration for proteins that span both membranes (e.g., AcrAB-TolC efflux pump, PDB:5v5s, see Fig. 5e). In this study, QSPACE determined the subcellular compartment for 89% of amino acids in E. coli. As a proof-of-concept, we combine the protein-level information in QSPACE with a genome-scale model of macromolecular expression (iJL1678b-ME²⁶) to calculate the physical size occupied by the predicted proteome of E. coli at optimal growth rate. To our knowledge, this first-principles approach resulted in the first GEM-PRO that embodies the spatial allocation of the E. coli proteome.

Taken together, the QSPACE genome annotation platform provides users with a rapid method to interact with the best available quaternary structures for any list of proteins (e.g., a strain), can accommodate natural sequence variations described by the alleleome³⁶ to generate species-level structural proteomes, and enables a physical embodiment of the structural proteome against the 3D morphology of the bacterial cell. The analysis of mutant phenotypes and the size calculation for the E. coli proteome demonstrate that QSPACE is amenable to diverse applications. As structures are resolved for large protein complexes, as the scope of genome-scale models expands to include an increasing number of niche cellular mechanisms (e.g., stress responses), and as new mutations of functional importance are annotated in publicly available databases, the QSPACE platform will provide an interoperable pipeline for the structural proteomes for a growing list of organisms.

Limitations

We emphasize that QSPACE is a large-scale annotation platform that interfaces with numerous third-party software to quickly map multiple interoperable datasets onto relevant protein structures. QSPACE applications range from the single-protein to the genome-scale. As such, the structures identified by QSPACE reflect the gene-stoichiometry of the protein complexes defined by the user. For extensively studied organisms (e.g., E. coli), these protein complexes have been defined over decades of published work. For less-studied organisms, a structural proteome assigned by the workflow presented in this study may be incomplete. In such cases, QSPACE can still provide insights into the structural proteome. For instance, QSPACE can generate the entirety of all homo-oligomerization states of an organism’s genome. By modifying the user-defined protein complexes to reflect the gene-stoichiometry of any theoretical homo-oligomerization state of a gene(s), QSPACE will identify (or use AF-multimer to generate) structures for these oligomers, provide the user with a quality assessment of each oligomeric structure, and thus reveal homo-oligomerization states backed by structural evidence.

As QSPACE relies on third-party software and repositories for the generation of novel structures, the mapping of datasets, and the calculation of structural properties, it is limited by the maintenance, capabilities and accuracy of such resources. For example, the use of AF-multimer via ColabFold allows for modelling protein complexes up to 2000 amino acids, and the quality assessment of such models is currently based on their iPTM and PTM scores, rendering QSPACE incapable of generating higher-order structures that have not been published in repositories. As new modelling platforms, better scoring methods, and larger repositories of pre-computed structures are disseminated by the structural biology community, we see potential for their incorporation into the QSPACE workflow to identify increasingly accurate structures for user-defined proteins. Thus, the maintenance of the QSPACE codebase is vital to ensure that QSPACE can provide users with the most-accurate protein structures for future applications.
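The size limit mentioned above can be checked before any modelling is attempted, since the total length of a complex follows directly from its gene-stoichiometry. The sketch below is illustrative only: the function names and input dictionaries are assumptions, and the 2000-residue value is simply the figure quoted in the text.

```python
# Limit quoted in the text for AF-multimer via ColabFold.
MAX_COMPLEX_RESIDUES = 2000

def complex_length(stoichiometry, sequence_lengths):
    """Total residues in a complex: sum of copies x sequence length per gene."""
    return sum(
        copies * sequence_lengths[gene] for gene, copies in stoichiometry.items()
    )

def within_modelling_limit(stoichiometry, sequence_lengths,
                           max_residues=MAX_COMPLEX_RESIDUES):
    """True if the complex is small enough to be modelled under the limit."""
    return complex_length(stoichiometry, sequence_lengths) <= max_residues
```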

Acknowledgements

We would like to thank Marc Abrams for assistance with manuscript editing. This work was funded by Novo Nordisk Foundation (Grant Number NNF20CC0035580) (E.A.C. and B.O.P.) and NIH (Grant R01 GM057089) (B.O.P.).

Author Contributions

E.A.C. and B.O.P. designed and performed the research; E.A.C., N.M. and M.L. contributed analytical tools and code; E.A.C. and B.O.P. analyzed data; E.A.C. and B.O.P. wrote the manuscript; E.A.C., N.M., M.L. and B.O.P. edited the manuscript; and B.O.P. supervised the research.

Competing Interests

The authors declare no competing interest.

Materials and Methods

Detailed methods are provided in the Supplementary Appendix. Additional information can be found at github.com/EdwardCatoiu/QSPACE.

Overview of the QSPACE workflow

We encourage the reader to familiarize themselves with the ‘demo_QSPACE.ipynb’ tutorial notebook available at (https://github.com/EdwardCatoiu/QSPACE). The QSPACE platform offers the user flexibility in the use of some or all structural repositories (for identifying/generating structures), third-party software (for calculating structural properties), and mutant datasets. This section will describe the generation of the E. coli QSPACE (Dataset S1) and the two applications presented in this study.

The overall QSPACE workflow can be summarized as follows: 1) A list of 4,309 genes identified across 2,661 E. coli strains³⁶ and the gene-stoichiometries of E. coli proteins annotated in EcoCyc³⁹ and iJL1678b²⁶ serve as the user-defined input to the E. coli QSPACE; 2) UniProt IDs are identified and the corresponding .fasta and .txt files are downloaded; 3) Homology models corresponding to the UniProt IDs are downloaded from various repositories; 4) PDB structures (and associated PDB bioassemblies) corresponding to the UniProt IDs and sequences are downloaded using PDB APIs; 5) For proteins that are annotated to be monomers and for those not included in EcoCyc and/or iJL1678b (assumed monomers), a semi-automated module is used to assess whether the oligomeric structures (from PDB/SWISS-MODEL) provide overwhelming evidence of oligomerization; 6) For annotated oligomers (user-defined in #1) whose gene-stoichiometry is not reflected in PDB or SWISS-MODEL, AF-multimer v2.0 (via ColabFold) is used to generate oligomeric models; 7) iPTM and PTM scores are used to assess the quality of AF-multimer models; 8) The highest sequence identity structure (after structural QCQA relevant to its source) is selected for each unique gene stoichiometry (when multiple structures provide the same sequence identity for the same gene stoichiometry, preference is given to PDB, AF-Multimer, AlphaFoldDB, SWISS, and ITASSER, in that order); 9) A structure (or combination of structures) is selected as the representation of each user-defined (or re-defined in #5) protein complex; 10) A CSV file (the backbone of Dataset S1, ‘the QSPACE’) is generated where each row provides a mapping between each amino acid across 4,309 E. coli genes (user-defined in #1) and its 3D position (chain and residue number) on the protein structure that reflects the protein complex (user-defined in #1 and/or redefined in #5); 11) When possible, third-party software is used to identify membrane-embedded amino acids in proteins that are believed to be in (or associated with) the membrane; 12) Proteins with membrane-embedded regions are oriented across the membrane using available topological information (automatically) and/or by manual inspection of common motifs with known orientation; 13) Multiple third-party software packages are used to calculate protein properties; 14) External mutation databases and functional annotations (in UniProt) are mapped to the QSPACE CSV file (Dataset S1).
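The selection rule in step 8 (highest sequence identity first, then a fixed source preference to break ties) can be expressed as a single sort key. The sketch below is a hypothetical illustration: the candidate record layout and field names are assumptions, and it presumes the candidates have already passed the source-specific quality checks.

```python
# Tie-break order stated in step 8, most preferred first.
SOURCE_PREFERENCE = ["PDB", "AF-Multimer", "AlphaFoldDB", "SWISS", "ITASSER"]

def select_structure(candidates):
    """Pick one structure for a given gene stoichiometry.

    Each candidate is a dict with the (hypothetical) keys 'id', 'source'
    and 'seq_identity'. Higher sequence identity wins; ties go to the
    source that appears earlier in SOURCE_PREFERENCE.
    """
    return max(
        candidates,
        key=lambda c: (c["seq_identity"], -SOURCE_PREFERENCE.index(c["source"])),
    )
```

For example, a SWISS model at 100% identity would be chosen over a PDB entry at 95%, while two candidates at equal identity would resolve in favor of the PDB entry.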

Notably, in steps 1-8, the highest quality structure for each unique structure-gene stoichiometry is selected from each resource of homology and experimental structures. In step 9, QSPACE selects the specific structure that best represents each user-defined gene-stoichiometry. All relevant quality metrics associated with all structures in the structure pool are defined in Dataset S9. The structures selected by QSPACE (in step 9) from this pool (and all associated quality metrics) are used to generate the residue-level CSV file described in step 10. The associated quality metrics for all selected structures in the final CSV file are provided in Dataset S8.

The QSPACE CSV (Dataset S1, AA-to-structure mapping in step 10, additional data mapped in steps 11-14) contains information that can be utilized for various applications, for example: 15) The severity of UniProt mutants mapped to the QSPACE was determined using a keyword search of their annotated phenotypes; 16) The AA-level mapping in QSPACE was used to calculate the aggregate properties of the local environment (i.e., the amino acids directly adjacent in sequence and in 3D space) of each mutation; 17) Annotated mutant phenotypes and calculated properties were used to train RF-classifiers to predict mutant severity; 18) (Separately) the area taken up by each protein in the membrane was calculated from the membrane-embedded regions and inferred membrane planes generated in step 11; 19) The volume of each protein was calculated; 20) The geometries of all proteins were incorporated into a genome-scale model of macromolecular expression, iJL1678b-ME, to compute the physical space taken up by the model-predicted proteome of E. coli.

Data Availability

All data is freely available from public sources.

Structures selected from the PDB were last downloaded March 5th, 2023. We show experimental structures from the PDB with accession numbers 2GRX, 5V5S, 7NYU, 1NEK, 6OQS, 6C53, 1PFK, 6V0C. Structures selected from the SWISS-MODEL E. coli repository were last downloaded December 20th, 2022. We show SWISS-MODELs with UniProt IDs P33232 and P39099. Structures selected from the ITASSER E. coli database repository were last downloaded November 3rd, 2022. Structures selected from the Alphafold database were last downloaded January 15th, 2023. We show Alphafold models with UniProt IDs P33924 and P30143. Structures were last modelled using ColabFold on April 6th, 2023. We show Alphafold Multimer/ColabFold models for protein complexes with EcoCyc IDs CYT-D-UBIOX-CPLX and ABC-13-CPLX.

Protein complex gene stoichiometry data for E. coli is provided by the Public SmartTable in EcoCyc at https://ecocyc.org/group?id=Biocyc12-4862-3584200844 and by genome-scale model iJL1678b-ME at https://github.com/SBRG/ecolime.

ALE mutation data is available at https://aledb.org/. LTEE mutation data is available at https://barricklab.org/shiny/LTEE-Ecoli/. Both mutation datasets were mapped to the E. coli genome by Catoiu et al. 2023³⁶.

Data generated in this study is provided in the Supplementary Material and/or at https://github.com/EdwardCatoiu/QSPACE/.

Select data generated in this study that exceeds the size limits of GitHub is available at https://drive.google.com/drive/folders/1OkXnPK2YP3WAk62Mmu1p00z2dKiS1HQN?usp=sharing.

Code Availability

All source code for QSPACE is provided at https://github.com/EdwardCatoiu/QSPACE/. The best way to build a QSPACE is to follow the detailed instructions in the iPython tutorial notebook (“demo_QSPACE.ipynb”).

QSPACE could not be possible without the following:

References

1. Bai X., McMullan G., Scheres S.H.W. (2015). How cryo-EM is revolutionizing structural biology. Trends in Biochem. Sci 40:49–57. https://doi.org/10.1016/j.tibs.2014.10.005
2. Cheng Y. (2018). Single-particle cryo-EM - How did it get here and where will it go. Science 361:876–880. https://doi.org/10.1126/science.aat4346
3. Renaud J.P., Chari A., Liu W., Remigy H.W., Stark H., Wiesmann C. (2018). Cryo-EM in drug discovery: achievements, limitations and prospects. Nat. Rev. Drug Discov 17:471–492. https://doi.org/10.1038/nrd.2018.77
4. Nakane T., Kotecha A., Sente A., McMullan G., Masiulis S., Brown P.M.G.E., Grigoras I.T., Malinauskaite L., Malinauskas T., Miehling J., Uchański T., Yu L., Karia D., Pechnikova E.V., de Jong E., Keizer J., Bischoff M., McCormack J., Tiemeijer P., Hardwick S.W., Chirgadze D.Y., Murshudov G., Aricescu A.R., Scheres S.H.W. (2020). Single-particle cryo-EM at atomic resolution. Nature 587:152–156. https://doi.org/10.1038/s41586-020-2829-0
5. Cheng Y. (2018). Membrane protein structural biology in the era of single particle cryo-EM. Curr. Opin. Struct. Biol 52:58–63. https://doi.org/10.1016/j.sbi.2018.08.008
6. Zheng W., Zhang C., Li Y., Pearce R., Bell E.W., Zhang Y. (2021). Folding non-homology proteins by coupling deep-learning contact maps with I-TASSER assembly simulations. Cell Rep. Methods 1. https://doi.org/10.1016/j.crmeth.2021.100014
7. Yang J., Yan R., Roy A., Xu D., Poisson J., Zhang Y. (2015). The I-TASSER Suite: Protein structure and function prediction. Nat. Methods 12:7–8. https://doi.org/10.1038/nmeth.3213
8. Yang J., Zhang Y. (2015). I-TASSER server: new development for protein structure and function predictions. Nucleic Acids Res 43:W174–W181. https://doi.org/10.1093/nar/gkv342
9. Waterhouse A., Bertoni M., Bienert S., Studer G., Tauriello G., Gumienny R., Heer F.T., de Beer T.A.P., Rempfer C., Bordoli L., Lepore R., Schwede T. (2018). SWISS-MODEL: homology modeling of protein structures and complexes. Nucleic Acids Res 46:W296–W303. https://doi.org/10.1093/nar/gky427
10. Bienert S., Waterhouse A., de Beer T.A.P., Tauriello G., Studer G., Bordoli L., Schwede T. (2017). The SWISS-MODEL Repository - new features and functionality. Nucleic Acids Res 45:D313–D319. https://doi.org/10.1093/nar/gkw1132
11. Jumper J., Evans R., Pritzel A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature. https://doi.org/10.1038/s41586-021-03819-2
12. Varadi M., Anyango S., Deshpande M., et al. (2022). AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 50:D439–D444. https://doi.org/10.1093/nar/gkab1061
13. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. (2000). The Protein Data Bank. Nucleic Acids Res 28:235–242. https://doi.org/10.1093/nar/28.1.235
14. Evans R., O’Neill M., Pritzel A., et al. (2021). Protein complex prediction with AlphaFold-Multimer. bioRxiv. https://doi.org/10.1101/2021.10.04.463034
15. Schweke H., Xu Q., Tauriello G., Pantolini L., Schwede T., Cazals F., Lhéritier A., Fernandez-Recio J., Rodríguez-Lumbreras L.A., Schueler-Furman O., Varga J.K., Jiménez-García B., Réau M.F., Bonvin A.M.J.J., Savojardo C., Martelli P.L., Casadio R., Tubiana J., Wolfson H.J., Oliva R., Barradas-Bautista D., Ricciardelli T., Cavallo L., Venclovas C., Olechnovič K., Guerois R., Andreani J., Martin J., Wang X., Terashi G., Sarkar D., Christoffer C., Aderinwale T., Verburgt J., Kihara D., Marchand A., Correia B.E., Duan R., Qiu L., Xu X., Zhang S., Zou X., Dey S., Dunbrack R.L., Levy E.D., Wodak S.J. (2023). Discriminating physiological from non-physiological interfaces in structures of protein complexes: A community-wide study. Proteomics 23. https://doi.org/10.1002/pmic.202200323
16. Schweke H., Pacesa M., Levin T., Goverde C.A., Kumar P., Duhoo Y., Dornfeld L.J., Dubreuil B., Georgeon S., Ovchinnikov S., Woolfson D.N., Correia B.E., Dey S., Levy E.D. (2024). An atlas of protein homo-oligomerization across domains of life. Cell.
17. Wang H.H., Isaacs F.J., Carr P.A., Sun Z.Z., Xu G., Forest C.R., Church G.M. (2009). Programming cells by multiplex genome engineering and accelerated evolution. Nature 460:894–898. https://doi.org/10.1038/nature08187
18. Sandberg T.E., Salazar M.J., Weng L.L., Palsson B.O., Feist A.M. (2019). The emergence of adaptive laboratory evolution as an efficient tool for biological discovery and industrial biotechnology. Metab. Eng 56:1–16. https://doi.org/10.1016/j.ymben.2019.08.004
19. Kim K., Kang M., Cho S.H., Yoo E., Kim U., Cho S., Palsson B., Cho B.K. (2023). Minireview: Engineering evolution to reconfigure phenotypic traits in microbes for biotechnological applications. Comput. Struct. Biotechnol. J 21:563–573. https://doi.org/10.1016/j.csbj.2022.12.042
20. Phaneuf P.V., Gosting D., Palsson B.O., Feist A.M. (2019). ALEdb 1.0: a database of mutations from adaptive laboratory evolution experimentation. Nucleic Acids Res 47:D1164–D1171. https://doi.org/10.1093/nar/gky983
21. Tibocha-Bonilla J.D., Zuñiga C., Lekbua A., Lloyd C., Rychel K., Short K., Zengler K. (2022). Predicting stress response and improved protein overproduction in Bacillus subtilis. NPJ Syst. Biol. Appl 8. https://doi.org/10.1038/s41540-022-00259-0
22. O’Brien E.J., Lerman J.A., Chang R.L., Hyduke D.R., Palsson B.Ø. (2013). Genome-scale models of metabolism and gene expression extend and refine growth phenotype prediction. Mol. Syst. Biol 9. https://www.embopress.org/doi/pdf/10.1038/msb.2013.52#sec-18
23. Du B., Yang L., Lloyd C.J., Fang X., Palsson B.O. (2019). Genome-scale model of metabolism and gene expression provides a multi-scale description of acid stress responses in Escherichia coli. PLOS Comp. Bio 15. https://doi.org/10.1371/journal.pcbi.1007525
24. Chen K., Gao Y., Mih N., Palsson B.O. (2017). Thermosensitivity of growth is determined by chaperone-mediated proteome reallocation. PNAS 114:11548–11553. https://doi.org/10.1073/pnas.1705524114
25. Yang L., Mih N., Anand A., Park J.H., Tan J., Yurkovich J., Monk J.M., Lloyd C.J., Sandberg T.E., Seo S.W., Kim D., Sastry A.V., Phaneuf P., Gao Y., Broddrick J.R., Chen K., Heckmann D., Szubin R., Hefner Y., Feist A.M., Palsson B.O. (2019). Cellular responses to reactive oxygen species are predicted from molecular mechanisms. PNAS 116:14368–14373. https://doi.org/10.1073/pnas.1905039116
26. Lloyd C.J., Ebrahim A., Yang L., King Z.A., Catoiu E., O’Brien E.J., Liu J.K., Palsson B.O. (2018). COBRAme: A computational framework for genome-scale models of metabolism and gene expression. PLOS Comp. Bio 14. https://doi.org/10.1371/journal.pcbi.1006302
27. Mih N., Brunk E., Chen K., Catoiu E., Sastry A., Kavvas E., Monk J.M., Zhang Z., Palsson B.O. (2018). ssbio: a Python framework for structural systems biology. Bioinformatics 34:2155–2157. https://doi.org/10.1093/bioinformatics/bty077
28. Brunk E., Mih N., Monk J.M., Zhang Z., O’Brien E.J., Bliven S.E., Chen K., Chang R.L., Bourne P.E., Palsson B.O. (2016). Systems biology of the structural proteome. BMC Syst. Biol 10. https://doi.org/10.1186/s12918-016-0271-6
29. Chang R.L., Xie L., Bourne P.E., Palsson B.O. (2010). Drug off-target effects predicted using structural analysis in the context of a metabolic network model. PLOS Comp. Biol 6. https://doi.org/10.1371/journal.pcbi.1000938
30. Chang R.L., Andrews K., Kim D., Li Z., Godzik A., Palsson B.O. (2013). Structural systems biology evaluation of metabolic thermotolerance in Escherichia coli. Science 340:1220–1223. https://doi.org/10.1126/science.1234012
31.
1. Zhang Y.
2. Theile I.
3. Weekes D.
4. Li Z.
5. Jaroszewski L.
6. Ginalski K.
7. Deacon A.
8. Wooley J.
9. Lesley S.
10. Wilson I.A.
11. Palsson B.O.
12. Osterman A.
13. Godzik A.
2009. Three-dimensional structural view of the central metabolic network of Thermotoga maritima. Science 325:1544–1549. https://doi.org/10.1126/science.1174671
32.
1. Brunk E.
2. Sahoo S.
3. Zielinski D.C.
4. Altunkaya A.
5. Dräger A.
6. Mih N.
7. Gatto F.
8. Nilsson A.
9. Preciat-Gonzalez G.A.
10. Aurich M.K.
11. Prlić A.
12. Sastry A.
13. Danielsdottir A.D.
14. Heinken A.
15. Noronha A.
16. Rose P.W.
17. Burley S.K.
18. Fleming R.M.T.
19. Nielsen J.
20. Thiele I.
21. Palsson B.O.
2018. Recon3D enables a three-dimensional view of gene variation in human metabolism. Nat. Biotechnol 36:272–281. https://doi.org/10.1038/nbt.4072
33.
1. Barrick Lab
2022. LTEE-Ecoli. Barrick Lab. https://barricklab.org/shiny/LTEE-Ecoli/
34.
1. Lenski R.E.
2. Rose M.R.
3. Simpson S.C.
4. Tadler S.C.
1991. Long-term experimental evolution in Escherichia coli. I. Adaptation and divergence during 2,000 generations. Am. Nat 138:1315–1341.
35.
1. Tenaillon O.
2. Barrick J.E.
3. Ribeck N.
4. Deatherage D.E.
5. Blanchard J.L.
6. Dasgupta A.
7. Wu G.C.
8. Wielgoss S.
9. Cruveiller S.
10. Médigue C.
11. Schneider D.
12. Lenski R.E.
2016. Tempo and mode of genome evolution in a 50,000-generation experiment. Nature 536:165–170.
36.
1. Catoiu E.A.
2. Phaneuf P.
3. Monk J.M.
4. Palsson B.O.
2023. Whole genome sequences from wild-type and laboratory evolved strains define the alleleome and establish its hallmarks. PNAS 120. https://doi.org/10.1073/pnas.221883512
37.
1. Apweiler R.
2. Bairoch A.
3. Wu C.H.
4. Barker W.C.
5. Boeckmann B.
6. Ferro S.
7. Gasteiger E.
8. Huang H.
9. Lopez R.
10. Magrane M.
11. Martin M.J.
12. Natale D.A.
13. O’Donovan C.
14. Redaschi N.
15. Yeh L.S.
2004. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 32:D115–D119. https://doi.org/10.1093/nar/gkh131
38.
1. Mirdita M.
2. Schütze K.
3. Moriwaki Y.
4. et al.
2022. ColabFold: making protein folding accessible to all. Nat Methods 19:679–682. https://doi.org/10.1038/s41592-022-01488-1
39.
1. Keseler I.M.
2. Collado-Vides J.
3. Santos-Zavaleta A.
4. Peralta-Gil M.
5. Gama-Castro S.
6. Muñiz-Rascado L.
7. Bonavides-Martinez C.
8. Paley S.
9. Krummenacker M.
10. Altman T.
11. Kaipa P.
12. Spaulding A.
13. Pacheco J.
14. Latendresse M.
15. Fulcher C.
16. Sarker M.
17. Shearer A.G.
18. Mackie A.
19. Paulsen I.
20. Gunsalus R.P.
21. Karp P.D.
2011. EcoCyc: a comprehensive database of Escherichia coli biology. Nucleic Acids Res 39:D583–D590. https://doi.org/10.1093/nar/gkq1143
40.
1. Monk J.
2. Lloyd C.J.
3. Brunk E.
4. et al.
2017. iML1515, a knowledgebase that computes Escherichia coli traits. Nat Biotechnol 35:904–908. https://doi.org/10.1038/nbt.3956
41.
1. Krissinel E.
2. Henrick K.
2007. Inference of macromolecular assemblies from crystalline state. J. Mol. Biol 372:774–797. https://doi.org/10.1016/j.jmb.2007.05.022
42.
1. Xu Q.
2. Canutescu A.A.
3. Wang G.
4. Shapovalov M.
5. Obradovic Z.
6. Dunbrack R.L.
2008. Statistical analysis of interface similarity in crystals of homologous proteins. J. Mol. Biol 381:487–507. https://doi.org/10.1016/j.jmb.2008.06.002
43.
1. Baskaran K.
2. Duarte J.M.
3. Biyani N.
4. Bliven S.
5. Capitani G.
2014. A PDB-wide, evolution-based assessment of protein–protein interfaces. BMC Struct. Biol 14. https://doi.org/10.1186/s12900-014-0022-0
44.
1. Levy E.D.
2007. PiQSi: protein quaternary structure investigation. Structure 15:1364–1367. https://doi.org/10.1016/j.str.2007.09.019
45.
1. Dey S.
2. Prilusky J.
3. Levy E.D.
2022. QSalignWeb: A server to predict and analyze protein quaternary structure. Front Mol Biosci. https://doi.org/10.3389/fmolb.2021.787510
46.
1. Bertoni M.
2. Kiefer F.
3. Biasini M.
4. Bordoli L.
5. Schwede T.
2017. Modeling protein quaternary structure of homo- and hetero-oligomers beyond binary interactions by homology. Sci. Rep 7. https://doi.org/10.1038/s41598-017-09654-8
47.
1. Benkert P.
2. Biasini M.
3. Schwede T.
2011. Toward the estimation of the absolute quality of individual protein structure models. Bioinformatics 27:343–350. https://doi.org/10.1093/bioinformatics/btq662
48.
1. Grantham R.
1974. Amino acid difference formula to explain protein evolution. Science 185:862–864. https://doi.org/10.1126/science.185.4154.862
49.
1. Hallgren J.
2. Tsirigos K.D.
3. Pedersen M.D.
4. Armenteros J.J.A.
5. Marcatili P.
6. Nielsen H.
7. Krogh A.
8. Winther O.
2022. DeepTMHMM predicts alpha and beta transmembrane proteins using deep neural networks. bioRxiv. https://doi.org/10.1101/2022.04.08.487609
50.
1. Lomize M.A.
2. Pogozheva I.D.
3. Joo H.
4. Mosberg H.I.
5. Lomize A.L.
2012. OPM database and PPM web server: resources for positioning of proteins in membranes. Nucleic Acids Res 40:D370–D376. https://doi.org/10.1093/nar/gkr703
51.
1. Ashburner M.
2. Ball C.A.
3. Blake J.A.
4. et al.
2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet 25:25–29. https://doi.org/10.1038/75556
52.
1. Liu J.K.
2. O’Brien E.J.
3. Lerman J.A.
4. Zengler K.
5. Palsson B.O.
6. Feist A.M.
2014. Reconstruction and modeling protein translocation and compartmentalization in Escherichia coli at the genome-scale. BMC Syst. Biol 8. https://doi.org/10.1186/s12918-014-0110-6
53.
1. Mih N.
2. Palsson B.O.
2019. Expanding the uses of genome-scale models with protein structures. Mol. Syst. Biol 15. https://doi.org/10.15252/msb.20188601
54.
1. Thornburg Z.R.
2. Bianchi D.M.
3. Brier T.A.
4. Gilbert B.R.
5. Earnest T.M.
6. Melo M.C.R.
7. Safronova N.
8. Sáenz J.P.
9. Cook A.T.
10. Wise K.S.
11. Hutchison C.A.
12. Smith H.O.
13. Glass J.I.
14. Luthey-Schulten Z.
2022. Fundamental behaviors emerge from simulations of a living minimal cell. Cell 185:345–360. https://doi.org/10.1016/j.cell.2021.12.025
55.
1. Maritan M.
2. Autin L.
3. Karr J.
4. Covert M.W.
5. Olson A.J.
6. Goodsell D.S.
2022. Building Structural Models of a Whole Mycoplasma Cell. J. Mol. Biol 434. https://doi.org/10.1016/j.jmb.2021.167351
56.
1. O’Brien E.J.
2. Monk J.M.
3. Palsson B.O.
2015. Using Genome-scale Models to Predict Biological Capabilities. Cell 161:971–987. https://doi.org/10.1016/j.cell.2015.05.019
57.
1. Karr J.R.
2. Sanghvi J.C.
3. Macklin D.N.
4. Gutschow M.V.
5. Jacobs J.M.
6. Bolival Jr B.
7. Assad-Garcia N.
8. Glass J.I.
9. Covert M.W.
2012. A whole-cell computational model predicts phenotype from genotype. Cell 150:389–401. https://doi.org/10.1016/j.cell.2012.05.044
58.
1. Rose Y.
2. Duarte J.M.
3. Lowe R.
4. Segura J.
5. Bi C.
6. Bhikadiya C.
7. Chen L.
8. Rose A.S.
9. Bittrich S.
10. Burley S.K.
11. Westbrook J.D.
2021. RCSB Protein Data Bank: Architectural Advances Towards Integrated Searching and Efficient Access to Macromolecular Structure Data from the PDB Archive. J. Mol. Biol 433. https://doi.org/10.1016/j.jmb.2020.11.003
59.
1. Linding R.
2. Jensen L.J.
3. Diella F.
4. Bork P.
5. Gibson T.J.
6. Russell R.B.
2003. Protein disorder prediction: implications for structural proteomics. Structure 11:1453–1459. https://doi.org/10.1016/j.str.2003.10.002
60.
1. Tubiana J.
2. Schneidman-Duhovny D.
3. Wolfson H.J.
2022. ScanNet: an interpretable geometric deep learning model for structure-based protein binding site prediction. Nat Methods 19:730–739. https://doi.org/10.1038/s41592-022-01490-7
61.
1. Cheng J.
2. Randall A.Z.
3. Sweredoski M.J.
4. Baldi P.
2005. SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res 33:W72–W76. https://doi.org/10.1093/nar/gki396
62.
1. Kabsch W.
2. Sander C.
1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637. https://doi.org/10.1002/bip.360221211
63.
1. Sanner M.F.
2. Olson A.J.
3. Spehner J.C.
1996. Reduced Surface: An Efficient Way to Compute Molecular Surfaces. Biopolymers 38:305–320.
64.
1. Liebermeister W.
2. Noor E.
3. Flamholz A.
4. Davidi D.
5. Bernhardt J.
6. Milo R.
2014. Visual account of protein investment in cellular functions. PNAS 111:8488–8493. https://doi.org/10.1073/pnas.131481011

Article and author information

Author information

Edward Alexander Catoiu
Department of Bioengineering, University of California, San Diego, La Jolla, CA 92101
Nathan Mih
Department of Bioengineering, University of California, San Diego, La Jolla, CA 92101
Maxwell Lu
Omnicorp Inc. (Pilot AI), San Francisco, CA 94129
Bernhard Palsson
Department of Bioengineering, University of California, San Diego, La Jolla, CA 92101; The Novo Nordisk Foundation (NNF) Center for Biosustainability, The Technical University of Denmark, Kongens Lyngby 2800, Denmark
ORCID iD: 0000-0003-2357-6785
- For correspondence: bpalsson@ucsd.edu

Version history

Preprint posted: April 28, 2024
Sent for peer review: June 13, 2024
Reviewed Preprint version 1: September 25, 2024

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.100485. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
