Introduction

Biological cells can change their gene expression and metabolic profiles globally to adapt to thier biological contexts and external conditions, while maintaining the homeostasis of their core physiological states. The simultaneous realization of adaptability and homeostasis is a hallmark of biological systems and is assumed to be a system-level property of gene expression profiles in cells (1, 2). However, understanding the underlying organizational principles in comprehensive gene expression profiles remains to be a fundamental problem in biology.

Vibrational spectroscopy such as Raman spectroscopy might help us investigate such principles in gene expression profiles. Raman spectroscopy is a light scattering technique that measures energy shifts of light caused by interaction with sample molecules. Raman spectra are obtainable non-destructively even from biological samples such as individual cells. In principle, cellular Raman spectra are optical signatures conveying comprehensive molecular composition of targeted cells (36). Furthermore, no prior treatments, such as staining and tagging, are necessary to obtain cellular Raman spectra. However, although some biomolecules have separable and intense Raman signal peaks, Raman spectra of most biomolecules overlap and are masked by signals of other molecules due to the diversity and complexity of molecular compositions of cells. Therefore, it is impractical to comprehensively determine the amounts of biomolecules by spectral decomposition.

Despite the intractability of spectral decomposition, reconstruction of comprehensive molecular profiles may be achievable by analyzing detectable global spectral patterns (Fig. 1A) thanks to effective low dimensionality of changes in molecular profile of targeted cells (717)(Fig. 1B and Fig. S1). Indeed, it has been demonstrated that condition-dependent global transcriptome profiles of cells can be inferred from cellular Raman spectra based on their statistical correspondence (18, 19). Importantly, this Raman-spectroscopic transcriptome inference was possible from dimension-reduced Raman spectra. Therefore, dominant changes in global Raman spectral patterns may contain vital information about the constraints on the molecular profiles in cells; an inspection of their correspondence might give us insights into architectural principles of omics profiles and biological foundation for global omics inference from spectral patterns (Fig. S1).

Cellular physiological state differences detected by Raman spectral global patterns and gene expression profiles.

(A) Condition-dependent cellular Raman spectral patterns. Raman spectra obtained from cells reflect their molecular profiles. Therefore, systematic differences in global spectral patterns may indicate their physiological states. A Raman spectrum from each cell can be represented as a vector and a point in a high-dimensional Raman space. If condition-dependent differences exist in the spectral patterns, appropriate dimensional reduction methods allow us to classify the spectra and detect cellular physiological states in a low-dimensional space. (B) Condition-dependent gene expression profiles. Global gene expression profiles (proteomes and transcriptomes) are also dependent on conditions. For each gene, we can consider a high-dimensional vector whose elements represent expression levels under different conditions. It has been suggested that these expression-level vectors are constrained to some low-dimensional manifolds (717). This study characterizes the statistical correspondence between dimension-reduced Raman spectral patterns and gene expression profiles. Analyzing the correspondence, we reveal a stoichiometry conservation principle that constrains gene expression profiles to low-dimensional manifolds.

In this report, we first reveal that, in addition to transcriptomes, condition-dependent proteome profiles of Escherichia coli are predictable from cellular Raman spectra. Next, we scrutinize the correspondence between Raman and proteome data, identifying several stoichiometrically conserved groups (SCGs) whose expression tightly correlates with the major changes in cellular Raman spectra. Finally, we reveal that the stoichiometry conservation centrality of each gene correlates with its essentiality, evolutionary conservation, and condition-specificity of gene expression levels, which turns out general across different omics layers and organisms.

Results

Statistical correspondence between Raman spectra and proteomes

To examine the correspondence between Raman spectra and proteomes in E. coli, we reproduced 15 environmental conditions for which absolute quantitative proteome data are already available (20) and measured Raman spectra of E. coli cells under those conditions (Fig. 2A and Fig. 2B). The culture conditions we adopted include (i) exponential growth phase in minimal media with various carbon sources, (ii) exponential growth phase in rich media, (iii) exponential growth phase with various stressors, and (iv) stationary phases (Table S1). We measured Raman spectra of single cells sampled from each condition and focused on the fingerprint region of biological samples, where the signals from various biomolecules concentrate (spectral range of 700–1800 cm−1, Fig. 2B and Fig. S2). The Raman spectra were classified on the basis of the environmental conditions using a simple linear classifier, linear discriminant analysis (LDA) (3, 4, 21) (Fig. 2C–2E). This classifier calculates the most discriminatory axes by maximizing the ratio of between-condition variance to within-condition variance and reduces the dimensions of Raman data to m − 1, where m is the number of conditions (see sections 1.1 and 2.1 in (22)).

Estimation of proteomes from Raman spectra.

(A) The experimental design. We cultured E. coli cells under 15 different conditions and measured single cells’ Raman spectra. We then examined the correspondence between the measured Raman spectra and the absolute quantitative proteome data reported by Schmidt et al. (20). (B) Representative Raman spectra from single cells, one from the “Glucose” condition, and the other from the “LB” condition. The fingerprint region and representative peaks are annotated. (C to E) Cellular Raman spectra in LDA space. The dimensionality of the spectra is reduced to 14 (= 15 − 1). Each point represents a spectrum from a single cell, and each ellipse shows the 95% concentration ellipse for each condition. Their projection to the LDA1-LDA2 plane (C), the LDA1-LDA3 plane (D), and the LDA1-LDA4 plane (E) are shown. (F) Visualization of the 14 dimensional LDA space embedded in two dimensional space with t-SNE. (G) The scheme of leave-one-out cross-validation. The Raman and proteome data of one condition (here j) are excluded, and the matrix B is estimated using the data of the rest of the conditions as . The proteome data under the condition j is estimated from the Raman data with and compared with the actual data to calculate estimation errors. (H) Comparison of measured and estimated proteome data. The plot for the “Glucose” condition is shown as an example. Each dot corresponds to one protein species. The straight line indicates x = y. Proteins with negative estimated values are not shown.

The result shows that Raman spectral points from different environmental conditions are distinguishable in the LDA space (Fig. 2C–2E). For example, the first and second LDA axes clearly distinguish the conditions “LB” and “stationary3days” (Fig. 2C), and the third axis distinguishes “Glucose42C” and “GlycerolAA” (Fig. 2D). Notably, the first principal axis LDA1 correlated with growth rate significantly (Pearson correlation r = 0.81 ± 0.09, Fig. S2). Visualizing the Raman LDA data by embedding them on a 2D plane using t-distributed Stochastic Neighbor Embedding (t-SNE) (23) confirms that the points for each condition form a distinctive cluster (Fig. 2F). These results imply that positions in the Raman LDA space reflect condition-dependent differences in cellular physiological states.

We next asked whether these Raman spectral differences in the LDA space could be linked to the different proteome profiles. To examine this, we hypothesized linear correspondence between the proteome vector and the low-dimensional Raman vector in condition j,

B is a matrix that connects and . We calculated as the average of the low-dimensional LDA Raman data of single cells in condition j since the proteomes were measured for cell populations.

We conducted leave-one-out cross-validation (LOOCV) to verify this linear correspondence (Fig. 2G). We excluded one condition (here, j) as a test condition and estimated B as by simple ordinary least squares (OLS) regression using the data of the rest of the conditions. We thereby estimated the proteome in condition j as .

The proteome profile estimated using the first four major LDA axes (LDA1–LDA4) agreed well with the actual proteome data under most conditions (Fig. 2H and Fig. S3; see section 1.2 in (22) for the estimation with all the LDA axes). Changing the condition to exclude, we estimated the proteomes for all the 15 conditions and calculated the overall estimation error by the Euclidean distance . The result shows that the overall estimation error is significantly small (p = 0.00005 by permutation test (2426)). Adopting other distance measures does not change the conclusion (Table S2 and Table S3). These results therefore validate the assumption of linear correspondence between cellular Raman spectra and proteomes and confirm that condition-dependent changes in proteomes can be inferred from the corresponding low-dimensional Raman spectra.

Stoichiometry conservation of proteins in the ISP COG class

The regression matrix B considered above determines how the proteomes relate to the Raman LDA axes. Therefore, analyzing B should provide some insights into constraints on condition-dependent changes in the proteomes.

The matrix B is represented as B = [b0 b1 bm−1], where bk = (b1k b2k … bnk)T is the (k + 1)-th column of B (0 ≤ km − 1) and n = 2,058 is the number of protein species in the proteome data. We first asked whether any shared features might exist in the coefficients of B depending on biological functions of corresponding proteins. We then classified the proteins according to functional annotations of Clusters of Orthologous Group (COG) classes (2729) and found that, for many proteins belonging to the “information storage and processing” (ISP) COG class, the coefficients corresponding to different LDA axes are approximately proportional to the constant terms; i.e., blkckbl0, where ck is a constant common to the ISP COG class proteins (Fig. 3A). The ISP COG class contains various proteins involved in processing genetic information such as translation, transcription, DNA replication, and DNA repair (20). Simple calculations show that these proportionality relationships imply that proteins in the ISP COG class conserves their mutual abundance ratios, i.e., stoichiometry, irrespective of environmental conditions (see section 1.3 in (22)).

A stoichiometrically conserved protein group identified by an analysis of the Raman-proteome coefficient matrix.

(A) Scatterplots of Raman-proteome transformation coefficients. The horizontal axes are constant terms (b0) in all the plots. The vertical axis is coefficients for LDA1 (b1), LDA2 (b2), LDA3 (b3), or LDA4 (b4) in each plot. The proteins in the ISP COG class are indicated in yellow. Yellow solid straight lines are least squares regression lines passing through the origins for the ISP proteins. Insets are enlarged views of area around the origins. In this figure, we used the average of as an estimate of B. (B) Similarity of expression patterns between culture conditions for each COG class. We divided the proteome into COG classes (28, 29), and calculated Pearson’s correlation coefficient of expression patterns for all the combinations of culture conditions. Since the data are from 15 conditions, there are 105 (= 15·14/(2·1)) points for each COG class in the graph. The box-and-whisker plots summarize the distributions of the points. The lines inside the boxes denote the medians, the top and bottom edges of the boxes do the 25th percentiles and 75th percentiles, respectively. The numbers of protein species are 376 for the Cellular Processes and Siginaling COG class, 354 for the ISP COG class, and 840 for the Metabolism COG class. See Fig. S4 for the evaluation with Pearson correlation coefficient of log abundances and with cosine similarity. Fig. S4 also contains figures directly showing expression-level changes of different protein species across conditions for each COG class. (C) Examples of stoichiometry-conserving proteins in the ISP COG class. The horizontal axis represents the abundance of RplF under 15 conditions and the vertical axis represents those of several ISP COG class proteins. These proteins are also contained in the homeostatic core defined later (see Fig. 4). The solid straight lines are linear regression lines with intercept of zero. (D) Examples of abundance ratios of non-ISP COG class proteins. The horizontal axis represents the abundance of RplF under 15 conditions and the vertical axis represents those of compared non-ISP COG class proteins. Crp belongs to the Cellular Processes and Signaling COG class; the other proteins belong to the Metabolism COG class. In both (C) and (D), we selected the proteins expressed from distant loci on the chromosome. All sigma factors participating in the regulation of the proteins examined in (C) and (D) are listed on the right of the gene name legends. All transcription factors known to regulate multiple genes listed here are shown in the right diagrams. Arrows show activation; bars represent inhibition; and squares indicate that a transcription factor activates or inhibits depending on other factors. The information on gene regulation and functions was obtained from EcoCyc (49) in August 2022. The error bars are standard errors calculated by using the data of (20). The inset shows the positions of the genes on the E. coli chromosome determined based on ASM75055v1.46 (50). No genes are in the same operon.

Since this is an implication from the Raman-proteome correspondence, we next examined the stoichiometry conservation only with the proteome data, evaluating the expression levels with Pearson correlation coefficients for all the pairs of the conditions for each COG class (Fig. 3B and Fig. S4). For the ISP COG class, the correlation coefficients were close to 1, whereas those for the other COG classes were significantly weaker depending on condition pairs. Therefore, stoichiometry conservation is stronger in the ISP COG class than in the other COG classes. Remarkably, neither shared transcription factors nor chromosome locations can account for the observed stoichiometry conservation of many protein pairs (Fig. 3C and Fig. 3D), implying multi-level regulation and coordination of their abundance.

We consulted other public quantitative proteome data of Mycobacterium tuberculosis (30), Mycobacterium bovis (30), and Saccharomyces cerevisiae (31) under environmental perturbations and consistently found strong stoichiometry conservation of the ISP COG class (Fig. S4). Furthermore, the same trend was observed for the genotype-dependent expression changes in

E. coli proteomes (20) (Fig. S4).

Identifying stoichiometrically conserved groups

Inspired by the existence of a large class of proteins that conserves their stoichiometry, we considered a systematic way to extract stoichiometrically conserved groups (SCGs) without relying on artificial functional classification of COG. Focusing only on the proteome data, we evaluated stoichiometry conservation by the cosine similarity of expression levels across conditions for all the pairs of proteins in the proteome (Fig. 4A) and extracted groups in each of which the component proteins exhibit coherent expression change patterns by setting a high threshold of cosine similarity (≥ 0.995, Fig. 4B).

Extracting SCGs from proteome data.

(A) Quantifying stoichiometry conservation by cosine similarity. We consider an expression vector for each protein species whose elements represent its abundance under different conditions. The cosine similarity between the expres-sion vectors of two protein species becomes nearly 1 when they conserve mutual stoichiometry strongly across conditions, whereas lower than 1 when their expression patterns are incoherent. (B) Extracted SCGs. We extracted proteins with high cosine similarity relationships. Each node represents a protein species. An edge connecting two nodes represents that the expression patterns of the two connected protein species have high cosine similarity exceeding a threshold of 0.995. Proteins that have no edge with the other proteins are not shown. The largest and the second largest protein groups, which we refer to as SCG 1 and SCG 2, respectively, are indicated by shaded polygons. (C) Expression patterns of the extracted SCGs. The horizontal and vertical axes represent growth rate and protein abundance, respectively. Line-connected points represent expression-level changes of different protein species across conditions. The inset for SCG 2 shows the total abundances of SCG 2 proteins with a log-scaled vertical axis. Error bars are standard errors. (D) The gene loci of the homeostatic core (SCG 1) proteins on the chromosome. Magenta dots are nodes (genes), and gray lines are edges (high cosine similarity relationships). We determined the gene loci based on ASM75055v1.46 (50).

The largest SCG (SCG 1) included many proteins in the ISP COG class (91 out of 191 SCG 1 members), such as ribosomal proteins and RNA polymerase, and also proteins in the other COG classes (Fig. 4B, Table S4). We call this largest SCG homeostatic core, as it constitutes the largest stoichiometry-conserving unit in cells. We found that the abundance of each protein in the homeostatic core (SCG 1) increased approximately linearly with the growth rate in each condition (Fig. 4C). This relationship is reminiscent of the growth law: The total ribosomal contents for translation increase linearly with growth rate (3234). The linear increase in the abundance of each protein in Fig. 4C indicates that the growth law is valid even at the single-gene level for a large class of ribosomal and non-ribosomal proteins in the homeostatic core (Fig. S5) (see section 3.1 in (22)).

Though not evenly distributed, the gene loci of the proteins in the homeostatic core are scattered throughout the chromosome (Fig. 4D). Therefore, localization of gene loci to a single or a small number of operons is not likely a cause of the observed stoichiometry conservation.

The proteins in the second largest SCG (SCG 2) are expressed at high levels in the fast growth conditions, especially in the “LB” condition (Fig. 4C). The SCG 2 includes many proteins in the metabolism COG class (21 out of 26 SCG 2 members) (Table S5), and their abundance increases approximately exponentially with growth rate (Fig. 4C). We also identified other condition-specific small SCGs, such as a group most expressed in the “GlycerolAA” condition (SCG 3) (Table S6), a group mainly expressed in the “Fructose” condition (SCG 4) (Table S7), and a group most expressed in the stationary phase conditions (SCG 5) (Table S8) (Fig. 4C).

Biological relevance of stoichiometry conservation

To understand the overall strength of stoichiometry conservation of the proteins in the different SCGs, we calculated the sum of cosine similarity, , for each protein species i, where cos is cosine similarity between the expression level vectors of protein i and protein j, and the sum is taken over all the protein species (see section 1.5 in (22)). We refer to di as “stoichiometry conservation centrality.”

The proteins in the homeostatic core had high centrality scores (Fig. 5A). Therefore, these proteins tend to have more connections with other proteins in terms of stoichiometry conservation. On the other hand, the proteins in the condition-specific SCGs tend to have low centrality scores among all the proteins (Fig. 5A), which suggests that their stoichiometry conservation is localized within each SCG.

A proteome structure characterized by global stoichiometry conservation relationships.

(A) Distributions of stoichiometry conservation centrality values for all the proteins (gray), the homeostatic core (SCG 1) proteins (magenta), and the proteins belonging to the other SCGs (cyan). (B) Correlation between stoichiometry conservation centrality and gene essentiality. The proportion of essential genes within each class of stoichiometry conservation ranking is shown. The list of essential genes was downloaded from EcoCyc (49). (C) Correlation between stoichiometry conservation and evolutionary conservation. The strength of evolutionary conservation of each protein species was estimated by the number of orthologs found in the OrthoMCL species (35). The genes with more orthologs tend to have higher stoichiometry conservation centrality (p = 6.24 × 10−15 by one-sided Brunner-Munzel test between the top 25% and the bottom 25% fractions of ortholog number ranking). Likewise, the genes with higher stoichiometry conservation centrality scores tend to have more orthologs (p = 4.04 × 10−12 by one-sided Brunner-Munzel test, Top 25%–Bottom 25% comparison; p-values in the captions for (F) to (I) were evaluated with the same statistical test scheme). (D) to (G) Stoichiometry conservation analyses of human cell atlas transcriptome data of fetal 15 organs (36). The top gray histogram in (D) shows the distribution of stoichiometry conservation centrality values for all genes. The bottom histograms in (D) show the distribution for coding genes (yellow) and that for the other genes (cyan). (E) shows a correlation between the ratio of coding genes and stoichiometry conservation centrality calculated from the human cell atlas data. (F) shows a correlation between gene essentiality and stoichiometry conservation centrality calculated from the human cell atlas data. The essentiality of each human gene was quantified by CRISPR score, which is the fitness cost imposed by CRISPR-based inactivation of the gene in KBM7 chronic myelogenous leukemia cells (35). Genes with lower CRISPR score are regarded as more essential. The fraction with low CRISPR score (i.e. high essentiality fraction) tends to have higher stoichiometry conservation centrality (p < 10−15). The fraction with high centrality score tends to be more essential (p < 10−15). (G) shows a correlation between evolutionary conservation and stoichiometry conservation centrality based on the human cell atlas data. The gene fraction with many orthologs tends to have higher stoichiometry conservation centrality (p < 10−15). The gene fraction with high centrality score tends to have more orthologs (p< 10−15). (H) and (I) Stoichiometry conservation analyses of genome-wide Perturb-seq data (37). (H) shows a correlation between stoichiometry conservation centrality calculated from the Perturb-seq data and gene essentiality. The essentiality of each gene was quantified by the CRISPR score as in (F). The gene fraction with low CRISPR score (i.e. high essentiality fraction) tends to have higher stoichiometry conservation centrality (p< 10−15). The gene fraction with high centrality score tends to be more essential (p < 10−15). (I) shows a correlation between stoichiometry conser-vation based on the Perturb-seq data and evolutionary conservation of genes. The gene fraction with many orthologs tends to have higher stoichiometry conservation centrality (p < 10−15). The gene fraction with high centrality score tends to have more orthologs (p < 10−15). (J) Representation of the proteomes as a graph. A node corresponds to a protein species, and the weight of an edge is taken as the cosine similarity between the expression vectors of the two connected protein species. The matrix A can specify the whole graph. Note that the diagonal elements of A are ones, which were introduced just for simplicity. (K) Cosine similarity LE (csLE) structure in a three-dimensional space. Each dot represents a different protein species and is color-coded on the basis of its stoichiometry conservation centrality value. We selected the axes considering the structural similarity to the Raman-based proteome structure in ΩB (see Fig. 6). (L) The csLE structure in a three-dimensional space. The views from two different angles are shown. Each gray dot represents a different protein species. The proteins belonging to each SCG are indicated with distinct markers.

The stoichiometry conservation centrality is biologically relevant because it correlates with gene essentiality. Fractions of essential genes almost monotonically decrease with the ranks of centrality score (Fig. 5B and Fig. S6). We also noted that genes with high centrality scores have more orthologs determined by OrthoMCL-DB (35) across the three domains of life (Fig. 5C and Fig. S6). Likewise, genes with many orthologs tend to have higher centrality scores (Fig. 5C and Fig. S6). Therefore, the stoichiometry conservation in cells correlates with the evolutionary conservation of proteins.

Comparable correlations of stoichiometry conservation centrality with gene essentiality and evolutionary conservation were also found for the S. pombe transcriptome data (Fig. S6). In addition, fractions of coding genes almost monotonically decreased with ranks of centrality score in the S. pombe data (Fig. S6).

We further analyzed two kinds of Homo sapiens transcriptome data. One is a human cell atlas, in which expression of both coding and noncoding genes in 15 fetal organs was quantified (36), and the other is genome-wide Perturb-seq data (37), in which genetically perturbed transcriptomes were measured mainly for coding genes. Our analysis of the human cell atlas data revealed that, while the overall distribution of stoichiometry conservation centrality was broad (Fig. 5D top), the centrality distribution of coding genes was skewed to higher values (Fig. 5D bottom) as observed for the E. coli proteome. Fractions of coding genes almost mono-tonically decreased with ranks of centrality (Fig. 5E) as seen in the S. pombe data (Fig. S6). Essentiality of each gene in human cells was quantified by an index called CRISPR score, which measures the fitness cost imposed by CRISPR-based inactivation of the gene (38). Genes with lower CRISPR scores are considered more essential. Our analysis revealed that genes with higher stoichiometry conservation centrality scores tend to have lower CRISPR scores, thus more essential (Fig. 5F). Similarly, genes with lower CRISPR scores tend to have higher stoichiometry conservation centrality scores. Furthermore, genes with higher centrality scores have more orthologs across the three domains of life, and vice versa (Fig. 5G). Comparable correlations of stoichiometry conservation with essentiality and evolutionary conservation were also found in the genome-wide Perturb-seq data (Fig. 5H and Fig. 5I). Together, these results suggest that correlations of stoichiometry conservation centrality with gene essentiality and evolutionary conservation are general and preserved from E. coli to human cells regardless of the type of perturbation.

Revealing global stoichiometry conservation architecture of the proteomes with csLE

To gain further insights into the linkage between stoichiometry conservation constraints and cellular Raman spectra, we next analyzed the proteomes using a method similar to Laplacian eigenmaps (LE) (39). We consider a symmetric matrix A with its (i, j) entry is cos (Fig. 5J). The entire proteome structure can be represented using the eigenvectors of normalized A. Major differences of this method from the ordinary LE are that we consider an edge for all node pairs and that we adopt cosine similarity for weighting edges. This method places the proteins with higher cosine similarity closer in the resulting space (see sections 1.5 and 2.1 in (22)); we call this method cosine similarity LE (csLE).

In this csLE space ΩLE, the stoichiometry conservation centrality of the proteins decreased from center to periphery (Fig. 5K), which confirms that it indeed measures the extent to which each protein is close to the center in the entire stoichiometry conservation architecture. Furthermore, the proteins formed polyhedral distributions with the cluster of the proteins in the homeostatic core at the center and the clusters of the proteins in the other condition-specific SCGs at distinct vertices (Fig. 5L). This distribution is consistent with the fact that the condition-specific SCGs are the components whose expression patterns are distant from the homeostatic core and also between each other.

Representing the proteomes using the Raman LDA axes

Given that the analysis of the Raman-proteome regression coefficients B initially led us to identify the SCGs and the stoichiometry conservation architecture in the proteome data, major changes in cellular Raman spectra characterized by LDA might reflect the expression changes of these coherently-expressed components.

The coefficients in the regression matrix B must satisfy for all k (1 ≤ km − 1) for the pair of protein i and protein j that conserve their stoichiometry (see sections 1.3 and 2.1 in (22)). Noting this property, we constructed another proteome space ΩB, assigning each protein species i a coordinate . As in ΩLE, a pair of proteins with strong stoichiometry conservation are expected to position closely in this proteome space ΩB.

We then found that the distribution of the proteins in ΩB closely resembled the one in ΩLE when visualized using the first few major axes (Fig. 5L and Fig. 6A). This similarity is nontrivial because ΩLE is constructed only from the proteome data, whereas ΩB depends on the Raman LDA space.

Raman-based proteome structure and its similarity to stoichiometry-based proteome structure.

(A) Proteome structure determined by Raman-proteome coefficients visualized in a three-dimensional space. The views from two different angles are shown. Each gray dot represents a protein species. The proteins belonging to each SCG are indicated with distinct markers. We note that SCGs are defined without referring to Raman data (Fig. 4). (B–D) Similarity among the distribution of LDA Raman spectra (B), the proteome structure determined by Raman-proteome coefficients (C), and the proteome structure determined by stoichiometry conservation (D). (E) Mathematical relation between the coordinates of the proteins in ΩB (C) and ΩLE (D). The two conditions, one between b0 and (cyan), and the other with Θ (magenta), must hold for the similarity between the two proteome structures (yellow), as described in the gray box. denotes column-wise proportionality.

We remark that each axis of ΩB is directly linked to the corresponding Raman LDA axis. Consequently, the orthants in ΩB where the condition-specific protein species reside agree with those in the Raman LDA space where the cellular Raman spectra under corresponding condi-tions reside (Fig. S10) (see sections 1.7 and 2.1 in (22)). Indeed, we find such orthant agreement between the proteins in the condition-specific SCGs (SCG 2–SCG 5) and the cellular Raman spectra under the corresponding conditions (Fig. 6B and Fig. 6C). This straightforward correspondence between ΩB and the Raman LDA space allows us to examine the relationship between changes in cellular Raman spectra and omics components’ stoichiometry conservation architecture by comparing the two proteome structures in ΩB and ΩLE.

Omics-level interpretation of cellular Raman spectra and a quantitative constraint between expression generality and stoichiometry conservation centrality

To understand rigorously what the similarity of the proteome structures in ΩB and ΩLE signifies (Fig. 6C and Fig. 6D), we clarified the mathematical relation between the coordinates of the proteins in these two spaces (Fig. 6E; see sections 2.1 and 2.2 in (22) for detail). We then characterized the two mathematical conditions that must be satisfied simultaneously (Fig. 6E).

The first condition is that major axes of the Raman LDA space and those of the proteome csLE space correspond (Fig. 6E). Consequently, cellular Raman spectra under a condition accompanying the expression of a condition-specific SCG must be significantly different from those under conditions with the expression of other condition-specific SCGs in a manner distinguishable by LDA. We verified this first condition with the data (Fig. S9).

The second condition is that the stoichiometry conservation centrality of each protein species di must be proportional to gi := ∥pi1/pi2, where ∥pi1 and ∥pi2 are the L1 and L2 norms of the expression level vector of protein i across conditions (Fig. 6E).

gi can be interpreted as the expression generality score. When gi is large, the protein i is expressed generally across conditions; when gi is small, this is expressed only under specific conditions (Fig. S8) (see section 1.9 in (22)). Therefore, the proportionality between di and gi indicates that the proteins with high stoichiometry conservation centrality must be expressed non-specifically to conditions. We also verified this condition with the data, confirming that it is indeed satisfied (Fig. 7A and Fig. S9).

Proportionality between stoichiometry conservation centrality and expression generality.

(A) Relationships between stoichiometry conservation centrality (di) and expression generality (gi). Each gray dot represent a protein species. The proteins belonging to each SCG are indicated with distinct markers. The dashed lines are ). The solid lines represent (see section 2.2 in (22)). The devia-tion of a point from the solid line is related to the growth rate under the condition where each protein is expressed the most. (B) The same plot as (A) in black and white. Overlaid red circles indicate proteins featured in (C). (C) Expression patterns of the proteins indicated by red circles in (B) across conditions. The condition differences are shown by the growth rate differences on the horizontal axes. The arrangement of the plots for the proteins corresponds to their relative positions in (B).

The spread of the points from the proportionality diagonal line of the E. coli proteome data in Fig. 7A was found related to the growth rate under the condition where each protein is expressed the most (see section 2.2 in (22) for a detailed analysis on the origin of the deviation). Consequently, one can envisage a growth-rate-dependent expression pattern of each protein on the basis of its relative position in this gi-di plot (Fig. 7B and Fig. 7C). For example, both BamB and YqjD are expressed non-specifically to the conditions with nearly identical expression generality scores. However, BamB is expressed at higher levels under fast growth conditions, whereas YqjP is expressed at higher levels under slow growth conditions due to their relative positions to the proportionality line. Similar growth rate dependence is observed for PaaE and DgoA, but with more prominent condition-specificity because these proteins are characterized by their low expression generality scores. These growth-rate-dependent deviation patterns might hint at a new growth law that governs the total relative expression changes of the proteome components (see section 2.2 in (22) for detailed discussion).

Generality

We also examined the generality of the aforementioned two conditions using the Raman and proteome data of E. coli strains with different genotypes (BW25113, MG1655, and NCM3722) under two culture conditions (20) and the Raman and transcriptome data of S. pombe under 10 culture conditions (18). Applying csLE to the omics data, we again found similar omics structures between ΩLE and ΩB when visualized using the first few major axes, with homeostatic cores at the centers and condition-specific SCGs at the vertices (Fig. S11 and Fig. S12).

Proportionality between stoichiometry conservation centrality and expression generality score was also confirmed in both additional datasets (Fig. S7). We further used publicly available quantitative proteome data of M. tuberculosis, M. bovis, and S. cerevisiae (30, 31) to examine this relation and confirmed that the proportionality universally holds (Fig. S7 and Fig. S13). Almost no deviation from the proportionality line existed in the S. cerevisiae proteome data measured for the cells in different media but cultured in chemostats with an identical dilution rate (thus, identical growth rate), which is consistent with the result of E. coli in which the deviations were related to the growth rate differences.

Discussion

A Raman spectrum obtained from a single cell is a superposition of the spectra of all of its constituent biomolecules. Therefore, cellular Raman spectra potentially contain rich information on essential state differences in targeted cells. The fact that both transcriptomes and proteomes are inferable from cellular Raman spectra, as demonstrated in this and previous (18) studies, endorses this speculation. The detailed analyses of the relationship between Raman and omics data have identified functionally relevant constraints on omics changes and provided an interpretation of cellular Raman spectra (Fig. S1). Specifically, it has been revealed that major changes in cellular Raman spectra distinguishable by LDA reflect the changes in omics profiles under the constraints of stoichiometry conservation. This correspondence would help us interpret global changes in cellular Raman spectra by translating them into the differences in omics profiles.

We remark that linearity in our formulation enabled us to find the rigorous connection between the two omics spaces ΩB and ΩLE (Fig. 6E). Unlike the original LE, we adopted cosine similarity as weights of edges between all node pairs to measure expression stoichiometry conservation of proteins. This modification was indispensable in terms of interpretation; relative proximity of positions in ΩLE reflects the strength of stoichiometry conservation. We also remark that simple principal component analysis (PCA) applied to the normalized E. coli proteome data also finds a similar low-dimensional proteome structure (Fig. S6) (see section 1.10 in (22)). Therefore, besides interpretability, omics structures in ΩLE might reflect dominant relationships among omics components commonly characterized by several methods of omics representation.

Stoichiometry conservation is plausibly crucial for cellular functions and physiology. For example, the enzymes involved in evolutionarily conserved metabolic pathways conserve their stoichiometry across microorganism species despite their diverse transcriptional and translational rates (40). It is suggested that stoichiometry conservation is achieved by optimizing the metabolic flux for fast growth (41). Furthermore, a ribosome-targeting antibiotic causes an imbalance of ribosomal proteins and growth arrest in E. coli, but the balance is restored alongside growth recovery through physiological adaptation (42). These results suggest that breakage of stoichiometric balance among core components could impose significant fitness cost.

It is intriguing to ask how cells conserve stoichiometry among the components in each SCG. Especially, the homeostatic core (SCG 1) contains many components whose gene loci are scattered throughout the genome. It is known that both transcriptional and translational negative auto-regulation contributes to controlling the stoichiometry of many ribosomal proteins (4348). The genes for the ribosomal proteins are scattered in multiple operons and co-regulated with many other non-ribosomal proteins, such as RNA polymerase subunits, translation initiation/elongation factors, and transmembrane transporters (49). Therefore, the stoichiometry-conserving mechanisms established for ribosomes might be partially exploited for the stoichiometry conservation within the homeostatic core.

The proportionality between stoichiometry conservation centrality and expression generality score suggests that proteins with high stoichiometry conservation centrality govern basal cellular functions required under any conditions. In fact, both essential genes and evolutionarily conserved genes are enriched in the omics fractions with high centrality scores. On the contrary, proteins of low centrality scores might have been acquired in later stages of the evolution and exploited to survive or increase fitness under specific conditions. Such hierarchy in the stoichiometry conservation centrality among core and peripheral processes might promote the adaptability of cells since cells can respond to diverse environments without restructuring a large body of the functional homeostatic core. This architectural principle in omics might underlie the robustness and adaptability of biological cells.

Acknowledgements

We thank Matthias Heinemann and Silke Bonsing-Vedelaar for detailed information on the E. coli culture conditions; Doeke R. Hekstra, Tetsuya J. Kobayashi, Takafumi Miyamoto, John Russell, and Ian Hunt-Isaak for reading the manuscript and providing critical comments; Kunihiko Kaneko, Chikara Furusawa, Yasushi Okada, and members of the Wakamoto Lab and the Universal Biology Institute for discussion and encouragement. This work was supported by JST CREST Grant Number JPMJCR1927 (Y.W.); JST ERATO Grant Number JPMJER1902 (Y.W.); JSPS KAKENHI Grant Numbers 19J22448 (K.F.K.) and 21K20672 (T.N.).

Additional files

Supplemental text and figures.

Table S1.