The Quaternary Structural Proteome Atlas of a CEll (QSPACE) is a genome-scale annotation platform, applied here to the E. coli proteome.

(a) The QSPACE platform requires two user-defined inputs: a list of genes and a dictionary of proteins with their associated gene-stoichiometric ratios. User-defined proteins can be oligomeric or monomeric. QSPACE accommodates residue-level sequence variation (the alleleome (ref. 36) was used in this study). (b) QSPACE identifies (or generates with AlphaFold-Multimer) the protein structure or model that best reflects the gene stoichiometry of each user-defined protein complex (details in Figure 2a). (c) The resulting structures are analyzed using various software packages to calculate physicochemical properties and to identify evolutionarily variable regions and functional domains (details in Figure 3a). (d) The structures are localized to their subcellular compartments, and membrane-embedded structures are oriented across the membrane, resulting in a three-dimensional representation of (e) the structural proteome of an organism. (f) The amino acids of protein complexes are mapped to protein structures with varying levels of coverage (mean = 0.94) in E. coli. (g) Genome-scale counts of the unique codon positions belonging to various computed categories are shown.

QSPACE yields 3D structures that reflect the oligomeric nature of multi-subunit proteins.

(a) There are three ways that QSPACE’s protein-structure module finds the 3D structural representation for the user-defined gene stoichiometry of a protein (target). (i) For each protein target (left), all structures that share genes with the protein are identified (center left) and combined (if applicable) to recreate the user-defined gene stoichiometry (center right). If multiple matches are found, the structure that most accurately reflects the complete protein is selected (right). (ii) Sometimes, only higher-order structures of the protein target are identified (center left). After QCQA of the relevant quality metrics associated with each structure (see Fig. S9, Methods 3.1), the highest-confidence oligomeric structure is used to redefine the gene stoichiometry for the protein (‘new complex’, right). (iii) If the identified structures are unable to recreate the protein in its entirety (‘missing subunits’, center left), AlphaFold-Multimer is used to predict the structure (<2000 AAs), and the quality of the resulting oligomeric models is assessed (see Fig. S11, Dataset S9). (b) To achieve a multi-subunit representation of the E. coli proteome, structures or models from various sources are used. Unlike previous GEM-PROs, QSPACE accommodates monomeric and k-meric AlphaFold models. (c) The protein-to-structure module yields 3D structures that represent the oligomeric nature of the E. coli proteome. This representation is further improved by using structural data to correct existing protein-gene stoichiometries and by using AlphaFold-Multimer to calculate novel oligomeric structures, (d) allowing for a truer accounting of higher-order oligomeric proteins. (e) The QSPACE platform generated 50 novel structures for ABC transporters in E. coli. The pLDDT scores are used to color the AF-Multimer models. The associated AlphaFold Model Score (see Fig. S11, Dataset S3) is displayed in the bottom right. Asterisks denote incomplete models resulting from incorrectly defined gene stoichiometries or from the protein size limitations of ColabFold. Existing PDB structures are shaded.
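
The three cases in panel (a) can be read as a comparison of gene copy numbers between a protein target and a candidate structure. The following is a minimal sketch of that comparison, not the QSPACE implementation: the function name `classify_structure`, the category labels, and the gene names are hypothetical, and genes present in the structure but absent from the target are simply ignored.

```python
def classify_structure(target, structure):
    """Compare a candidate structure's gene counts with a target stoichiometry.

    Both arguments are dicts mapping gene name -> number of copies.
    The returned labels are illustrative and loosely follow cases (i)-(iii).
    """
    # Case (iii): at least one target gene has no chain in the structure.
    if any(structure.get(gene, 0) == 0 for gene in target):
        return "missing_subunits"

    # Restrict the structure to the target's genes (assumption: extras are ignored).
    counts = {gene: structure[gene] for gene in target}

    # Case (i): the structure recreates the user-defined stoichiometry.
    if counts == dict(target):
        return "exact"

    # Case (ii): every gene is present in at least as many copies as requested.
    if all(counts[gene] >= target[gene] for gene in target):
        return "higher_order"

    # Every gene is present, but at least one has too few copies.
    return "partial"


# Hypothetical target: two copies of geneA and one copy of geneB.
target = {"geneA": 2, "geneB": 1}
print(classify_structure(target, {"geneA": 2, "geneB": 1}))
print(classify_structure(target, {"geneA": 4, "geneB": 2}))
print(classify_structure(target, {"geneA": 2}))
```

In this sketch a "higher_order" result would be the point at which quality metrics are checked before the stoichiometry is redefined, and a "missing_subunits" result would be the point at which a predicted model is requested.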

Multi-dimensional QSPACE features to predict mutant phenotypes.

(a) QSPACE calculates 100 properties (residue-level, sequence-level, and structure-level) for all amino acids. (right) Mapping the three-dimensional location of functional domains onto the protein structure enables the calculation of a mutation’s proximity to important protein regions. (b) (left) UniProt mutations and their annotated phenotypes are mapped to QSPACE and (right) can be classified into general categories that reflect mutant severity. (c) Annotated phenotypes and multi-dimensional properties for 12,000 UniProt mutations mapped to the E. coli QSPACE (Dataset S5) can be used to train RF-classifiers that predict the effect of a mutation (see Fig. 4). We suggest applying these classifiers to novel mutant datasets (e.g., mutations from adaptive laboratory evolution (ALE) experiments, the long-term evolution experiment (LTEE), and the alleleome, which are already mapped to the E. coli QSPACE in Dataset S1).

Random-forest prediction of mutant phenotypes in UniProtKB.

(a) QSPACE finds 4,299 mutants with known phenotypes in proteins containing active sites. The mutational properties (‘features’) of the amino acid residue (grey), the 5-amino-acid-long sequence centered at the mutation (yellow), and the local 3D protein structure (green) are calculated (see Fig. 3). The numbers inside the boxes give the number of features used. Random Forest Classifiers are trained on (top, “0D”) residue-level parameters; (middle, “1D”) residue- and sequence-level parameters; and (bottom, “3D”) residue-, sequence-, and structure-level parameters. For RF-classifiers initially trained on more than 30 parameters, the least predictive parameter is removed until the 30 most predictive parameters are identified. (b) “One vs Rest” receiver operating characteristic curves and precision-recall curves are calculated from the averages of 100 RF-classifiers trained on the 3 different sets of parameters. The shaded region represents 1 standard deviation. (c) The importance of individual features for “3D” RF-classifiers is determined. MUT- and MUT+ reflect pre- and post-mutation sequence properties, respectively. The radius of the 3D environment is described where applicable. The interoperability of multiple datatypes in QSPACE allows for the calculation of a mutation’s proximity to the nearest active site, the third most important feature in determining mutant severity. (d) QSPACE can be used to analyze mutations found in proteins containing various functional domains. (e) The weighted (by phenotype class) area under the curve (AUC) is calculated from the ROC and P-R curves in Panel B for RF-classifiers (3 parameter sets × 100 RF-classifiers) for each functional domain. (f) The cumulative importance of residue, sequence, and structure features in “3D” RF-classifiers of mutations in proteins containing each domain is shown.
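
The parameter-reduction step in panel (a), in which the least predictive parameter is removed until 30 remain, is a form of backward elimination. The sketch below shows only that loop under stated assumptions: `backward_eliminate` and `score_features` are hypothetical names, and the scoring function is a fixed toy lookup standing in for whatever importance measure a retrained classifier would provide (for example, the `feature_importances_` attribute of a scikit-learn `RandomForestClassifier`). The caption does not state which library or importance measure was used.

```python
def backward_eliminate(features, score_features, keep=30):
    """Remove the lowest-scoring feature one at a time until `keep` remain.

    `score_features` takes the current feature list and returns a dict of
    feature -> importance. It is called every round, so a real scorer could
    retrain a classifier on the surviving features before each removal.
    """
    selected = list(features)
    while len(selected) > keep:
        scores = score_features(selected)
        weakest = min(selected, key=lambda name: scores[name])
        selected.remove(weakest)
    return selected


# Toy stand-in for a model-based importance: feature "f7" has importance 7.
toy_importance = {f"f{i}": float(i) for i in range(40)}

def toy_scorer(selected):
    return {name: toy_importance[name] for name in selected}

kept = backward_eliminate(list(toy_importance), toy_scorer, keep=30)
print(len(kept), kept[0], kept[-1])
```

Recomputing the scores inside the loop matters when features are correlated, because removing one feature can change the importance of the others.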

The membrane module orients proteins across the membrane to identify residue-level subcellular compartments of the E. coli proteome.

(a) In the metadata provided by the EcoCyc (cellular compartment), Gene Ontology (pathway, function, compartment), iML1515 (metabolic subsystem), and UniProt (topological and transmembrane domains) databases, there are 1,777 protein structures mapped to at least one gene that is associated with the E. coli membrane. (b) Membrane-crossing residues are identified from the amino acid sequence information provided by UniProt, predicted by DeepTMHMM, and calculated by OPM. From these residues, a plane of best fit is calculated. (c) Structures with two calculated membrane planes pass the QCQA analysis if (i) the angle between the planes is less than 35°, (ii) the thickness of the membrane-embedded region is between 12 and 45 Å, and (iii) the cross-sectional area of the membrane-embedded region is less than 10,000 Å². (d) Membrane proteins are oriented using the topological information provided by UniProt (if available) or manually using common protein motifs (see Datasets S6-S7) such that (e) the subcellular compartment of every amino acid of the E. coli proteome can be determined.
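
The three QCQA criteria in panel (c) can be written as a single check. This is a minimal sketch rather than the QSPACE code: the function names are hypothetical, each membrane plane is assumed to be represented by its normal vector, the thickness bounds are assumed to be inclusive, and normals pointing in opposite directions are treated as describing parallel planes.

```python
import math


def angle_between_planes(normal_1, normal_2):
    """Return the acute angle, in degrees, between two planes given their normals."""
    dot = sum(a * b for a, b in zip(normal_1, normal_2))
    norm = math.sqrt(sum(a * a for a in normal_1)) * math.sqrt(sum(b * b for b in normal_2))
    # abs() makes the sign of each normal irrelevant; min() guards against
    # rounding that would push the cosine slightly above 1.
    cos_theta = min(1.0, abs(dot) / norm)
    return math.degrees(math.acos(cos_theta))


def passes_membrane_qcqa(normal_1, normal_2, thickness, area):
    """Apply the three thresholds quoted in the caption.

    thickness is in Angstroms and area is in square Angstroms.
    """
    return (
        angle_between_planes(normal_1, normal_2) < 35.0
        and 12.0 <= thickness <= 45.0  # inclusive bounds are an assumption
        and area < 10_000.0
    )


# Illustrative values only: two nearly parallel leaflet planes.
print(passes_membrane_qcqa((0, 0, 1), (0, 0.1, -1), thickness=30.0, area=2500.0))
```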

QSPACE integrates with genome-scale models to predict the physical space required by the E. coli proteome at optimal growth rate.

(a) The compartmentalization of each amino acid of the E. coli proteome allows for the calculation of geometric properties of all proteins. (b) The volume of ATP synthase and (c) its cross-sectional membrane area are shown. (d) The integration of QSPACE with genome-scale models (iJL1678b-ME, in this case) of metabolism (M-matrix) and macromolecular expression (E-matrix) (ME-models) can be used to calculate (e) the proteome allocation, (f) the volumetric allocation, and (g) the membrane composition of E. coli at optimal growth rate. The expression, volume, and membrane area allocated to ATP synthase are shown (Panels E-G, cyan). The calculated spatial allocation (Panels F-G) of the macromolecular expression predicted by existing ME-models (in Panel E) is a fundamental advancement towards building genome-scale biophysical whole-cell models.
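
One way to read panels (e) to (g) is that a per-complex geometric property (volume or membrane cross-sectional area) is weighted by the predicted abundance of each complex and then normalized. The sketch below illustrates that arithmetic only. The function name `allocation_fractions`, the complex names, and all numbers are hypothetical placeholders, and the caption does not specify the exact calculation used with iJL1678b-ME.

```python
def allocation_fractions(abundance, per_copy_value):
    """Return each complex's share of a total abundance-weighted property.

    abundance:      dict of complex -> predicted copy number (or relative expression)
    per_copy_value: dict of complex -> volume or membrane area of one copy
    Complexes missing from either dict are skipped.
    """
    totals = {
        name: abundance[name] * per_copy_value[name]
        for name in abundance
        if name in per_copy_value
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("total allocation is zero; nothing to normalize")
    return {name: value / grand_total for name, value in totals.items()}


# Placeholder inputs: copy numbers and per-copy volumes in arbitrary units.
abundance = {"complexA": 100, "complexB": 300}
volume = {"complexA": 3.0, "complexB": 1.0}
print(allocation_fractions(abundance, volume))
```

The same function applies to membrane composition if `per_copy_value` holds cross-sectional membrane areas and only membrane-embedded complexes are included.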