Introduction

Biological cells can change their gene expression and metabolic profiles globally to adapt to thier biological contexts and external conditions, while maintaining the homeostasis of their core physiological states. The simultaneous realization of adaptability and homeostasis is a hallmark of biological systems and is assumed to be a system-level property of gene expression profiles in cells (1, 2). However, understanding the underlying organizational principles in comprehensive gene expression profiles remains to be a fundamental problem in biology.

Vibrational spectroscopy such as Raman spectroscopy might help us investigate such principles in gene expression profiles. Raman spectroscopy is a light scattering technique that measures energy shifts of light caused by interaction with sample molecules. Raman spectra are obtainable non-destructively even from biological samples such as individual cells. In principle, cellular Raman spectra are optical signatures conveying comprehensive molecular composition of targeted cells (36). Furthermore, no prior treatments, such as staining and tagging, are necessary to obtain cellular Raman spectra. However, although some biomolecules have separable and intense Raman signal peaks, Raman spectra of most biomolecules overlap and are masked by signals of other molecules due to the diversity and complexity of molecular compositions of cells. Therefore, it is impractical to comprehensively determine the amounts of biomolecules by spectral decomposition.

Despite the intractability of spectral decomposition, reconstruction of comprehensive molecular profiles may be achievable by analyzing detectable global spectral patterns (Fig. 1A) thanks to effective low dimensionality of changes in molecular profile of targeted cells (717) (Fig. 1B and Fig. S1). Indeed, it has been demonstrated that condition-dependent global transcriptome profiles of cells can be inferred from cellular Raman spectra based on their statistical correspondence (18, 19). Importantly, this Raman-spectroscopic transcriptome inference was possible from dimension-reduced Raman spectra. Therefore, dominant changes in global Raman spectral patterns may contain vital information about the constraints on the molecular profiles in cells; an inspection of their correspondence might give us insights into architectural principles of omics profiles and biological foundation for global omics inference from spectral patterns (Fig. S1).

Cellular physiological state differences detected by Raman spectral global patterns and gene expression profiles.

(A) Condition-dependent cellular Raman spectral patterns. Raman spectra obtained from cells reflect their molecular profiles. Therefore, systematic differences in global spectral patterns may indicate their physiological states. A Raman spectrum from each cell can be represented as a vector and a point in a high-dimensional Raman space. If condition-dependent differences exist in the spectral patterns, appropriate dimensional reduction methods allow us to classify the spectra and detect cellular physiological states in a low-dimensional space. (B) Condition-dependent gene expression profiles. Global gene expression profiles (proteomes and transcriptomes) are also dependent on conditions. For each gene, we can consider a high-dimensional vector whose elements represent expression levels under different conditions. It has been suggested that these expression-level vectors are constrained to some low-dimensional manifolds (717). This study characterizes the statistical correspondence between dimension-reduced Raman spectral patterns and gene expression profiles. Analyzing the correspondence, we reveal a stoichiometry conservation principle that constrains gene expression profiles to low-dimensional manifolds.

In this report, we first reveal that, in addition to transcriptomes, condition-dependent proteome profiles of Escherichia coli are predictable from cellular Raman spectra. Next, we scrutinize the correspondence between Raman and proteome data, identifying several stoichiometrically conserved groups (SCGs) whose expression tightly correlates with the major changes in cellular Raman spectra. Finally, we reveal that the stoichiometry conservation centrality of each gene correlates with its essentiality, evolutionary conservation, and condition-specificity of gene expression levels, which turns out general across different omics layers and organisms.

Results

Statistical correspondence between Raman spectra and proteomes

To examine the correspondence between Raman spectra and proteomes in E. coli, we reproduced 15 environmental conditions for which absolute quantitative proteome data are already available (20) and measured Raman spectra of E. coli cells under those conditions (Fig. 2A and Fig. 2B). The culture conditions we adopted include (i) exponential growth phase in minimal media with various carbon sources, (ii) exponential growth phase in rich media, (iii) exponential growth phase with various stressors, and (iv) stationary phases (Table S1). We measured Raman spectra of single cells sampled from each condition and focused on the fingerprint region of biological samples, where the signals from various biomolecules concentrate (spectral range of 700–1800 cm−1, Fig. 2B and Fig. S2). The Raman spectra were classified on the basis of the environmental conditions using a simple linear classifier, linear discriminant analysis (LDA) (3, 4, 21) (Fig. 2C–2E and S1). This classifier calculates the most discriminatory axes by maximizing the ratio of between-condition variance to within-condition variance and reduces the dimensions of Raman data to m − 1, where m = 15 is the number of conditions (see sections 1.1 and 2.1 in (22)).

Estimation of proteomes from Raman spectra.

(A) The experimental design. We cultured E. coli cells under 15 different conditions and measured single cells’ Raman spectra. We then examined the correspondence between the measured Raman spectra and the absolute quantitative proteome data reported by Schmidt et al. (20). (B) Representative Raman spectra from single cells, one from the “Glucose” condition, and the other from the “LB” condition. The fingerprint region and representative peaks are annotated. (C to E) Cellular Raman spectra in LDA space. The dimensionality of the spectra is reduced to 14 (= 15 − 1). Each point represents a spectrum from a single cell, and each ellipse shows the 95% concentration ellipse for each condition. Their projection to the LDA1-LDA2 plane (C), the LDA1-LDA3 plane (D), and the LDA1-LDA4 plane (E) are shown. (F) Visualization of the 14 dimensional LDA space embedded in two dimensional space with t-SNE. (G) The scheme of leave-one-out cross-validation. The Raman and proteome data of one condition (here j) are excluded, and the matrix B is estimated using the data of the rest of the conditions as . The proteome data under the condition j is estimated from the Raman data with and compared with the actual data to calculate estimation errors. (H) Comparison of measured and estimated proteome data. The plot for the “Glucose” condition is shown as an example. Each dot corresponds to one protein species. The straight line indicates x = y. Proteins with negative estimated values are not shown.

The result shows that Raman spectral points from different environmental conditions are distinguishable in the (m − 1)-dimensional LDA space (Fig. 2C–2E). For example, the first and second LDA axes clearly distinguish the conditions “LB” and “stationary3days” (Fig. 2C), and the third axis distinguishes “Glucose42C” and “GlycerolAA” (Fig. 2D). Notably, the first principal axis LDA1 correlated with growth rate significantly (Pearson correlation r = 0.81 ± 0.09, Fig. S2). Visualizing the Raman LDA data by embedding them on a 2D plane using t-distributed Stochastic Neighbor Embedding (t-SNE) (23) confirms that the points for each condition form a distinctive cluster (Fig. 2F). These results imply that positions in the Raman LDA space reflect condition-dependent differences in cellular physiological states.

We next asked whether these Raman spectral differences in the LDA space could be linked to the different proteome profiles (Fig. S1). To examine this, we hypothesized linear correspondence between the n-dimensional proteome column vector , where n = 2058 is the number of protein species in the proteome data, and the low-dimensional ((m − 1)-dimensional) Raman column vector in condition j,

B is an n × m matrix that connects and . We calculated as the average of the low-dimensional LDA Raman data of single cells in condition j since the proteomes were measured for cell populations (Table 1).

List of scalars, vectors and matrices in the main text.

Scalars, vectors, and matrices in the main text are listed with their sizes and descriptions. m is the number of conditions, and n is the number of protein species. (m = 15 and n = 2058 in the main text.)

We conducted leave-one-out cross-validation (LOOCV) to verify this linear correspondence (Fig. 2G). We excluded one condition (here, j) as a test condition and estimated B as by simple ordinary least squares (OLS) regression using the data of the rest of the conditions. We thereby estimated the proteome in condition j as .

The proteome profile estimated using the first four major LDA axes (LDA1–LDA4) agreed well with the actual proteome data under most conditions (Fig. 2H and Fig. S3; see section 1.2 in (22) for the estimation with all the LDA axes). Changing the condition to exclude, we estimated the proteomes for all the 15 conditions and calculated the overall estimation error by the Euclidean distance . The result shows that the overall estimation error is significantly small (p = 0.00005 by permutation test (2426)). Adopting other distance measures does not change the conclusion (Table S2 and Table S3). These results therefore validate the assumption of linear correspondence between cellular Raman spectra and proteomes and confirm that condition-dependent changes in proteomes can be inferred from the corresponding low-dimensional Raman spectra.

Stoichiometry conservation of proteins in the ISP COG class

Since the dimensionality of the proteome data is significantly higher than that of the Raman data, the result above suggests that changes in proteome profiles are constrained in low-dimensional space. The regression matrix B considered above determines how the proteomes relate to the Raman LDA axes. Therefore, analyzing B should provide some insights into constraints on condition-dependent changes in the proteomes (Fig. S1).

The n × m matrix B is represented as B = [b0 b1bm−1], where the (k + 1)-th column bk = (b1k b2kbnk) (0 ≤ km − 1) is the collection of coefficients of all n proteins for the k-th LDA axis (Table 1). In the case of k = 0, the coefficients are constant terms. We first asked whether any shared features might exist in the coefficients of B depending on biological functions of corresponding proteins. We then classified the proteins according to functional annotations of Clusters of Orthologous Group (COG) classes (2729) and found that, for many proteins belonging to the “information storage and processing” (ISP) COG class, the coefficients corresponding to different LDA axes are approximately proportional to the constant terms; i.e., blkckbl0, where l is the index of an ISP COG class protein species and ck is the proportionality constant common to many ISP COG class protein species for the k-th LDA axis (Fig. 3A). The ISP COG class contains various proteins involved in processing genetic information such as translation, transcription, DNA replication, and DNA repair (20). Simple calculations show that these proportionality relationships imply that proteins in the ISP COG class conserves their mutual abundance ratios, i.e., stoichiometry, irrespective of environmental conditions (see section 1.3 in (22)).

A stoichiometrically conserved protein group identified by an analysis of the Raman-proteome coefficient matrix.

(A) Scatterplots of Raman-proteome transformation coefficients. The horizontal axes are constant terms (b0) in all the plots. The vertical axis is coefficients for LDA1 (b1), LDA2 (b2), LDA3 (b3), or LDA4 (b4) in each plot. The proteins in the ISP COG class are indicated in yellow. Yellow solid straight lines are least squares regression lines passing through the origins for the ISP proteins. Insets are enlarged views of area around the origins. In this figure, we used the average of as an estimate of B. (B) Similarity of expression patterns between culture conditions for each COG class. We divided the proteome into COG classes (28, 29), and calculated Pearson’s correlation coefficient of expression patterns for all the combinations of culture conditions. Since the data are from 15 conditions, there are 105 (= 15·14/(2·1)) points for each COG class in the graph. The box-and-whisker plots summarize the distributions of the points. The lines inside the boxes denote the medians, the top and bottom edges of the boxes do the 25th percentiles and 75th percentiles, respectively. The numbers of protein species are 376 for the Cellular Processes and Siginaling COG class, 354 for the ISP COG class, and 840 for the Metabolism COG class. See Fig. S4 for the evaluation with Pearson correlation coefficient of log abundances and with cosine similarity. Fig. S4 also contains figures directly showing expression-level changes of different protein species across conditions for each COG class. (C) Examples of stoichiometry-conserving proteins in the ISP COG class. The horizontal axis represents the abundance of RplF under 15 conditions and the vertical axis represents those of several ISP COG class proteins. These proteins are also contained in the homeostatic core defined later (see Fig. 4). The solid straight lines are linear regression lines with intercept of zero. (D) Examples of abundance ratios of non-ISP COG class proteins. The horizontal axis represents the abundance of RplF under 15 conditions and the vertical axis represents those of compared non-ISP COG class proteins. Crp belongs to the Cellular Processes and Signaling COG class; the other proteins belong to the Metabolism COG class. In both (C) and (D), we selected the proteins expressed from distant loci on the chromosome. All sigma factors participating in the regulation of the proteins examined in (C) and (D) are listed on the right of the gene name legends. All transcription factors known to regulate multiple genes listed here are shown in the right diagrams. Arrows show activation; bars represent inhibition; and squares indicate that a transcription factor activates or inhibits depending on other factors. The information on gene regulation and functions was obtained from EcoCyc (55) in August 2022. The error bars are standard errors calculated by using the data of (20). The inset shows the positions of the genes on the E. coli chromosome determined based on ASM75055v1.46 (56). No genes are in the same operon.

Since this is an implication from the Raman-proteome correspondence, we next examined the stoichiometry conservation only with the proteome data, evaluating the expression levels with Pearson correlation coefficients for all the pairs of the conditions for each COG class (Fig. 3B). For the ISP COG class, the correlation coefficients were close to 1, whereas those for the other COG classes were significantly smaller depending on condition pairs. We also evaluated the coordination of gene expression patterns within each COG class using cosine similarity and obtained consistent results (Fig. S4). Therefore, stoichiometry conservation is stronger in the ISP COG class than in the other COG classes. Remarkably, neither shared transcription factors nor chromosome locations can account for the observed stoichiometry conservation of many protein pairs. Indeed, although the ISP COG class shows highly coordinated expression patterns (Fig. 3C) compared to the non-ISP COG class (Fig. 3D), the gene loci are not chromosomally clustered in either example. Additionally, the similarity/dissimilarity of expression patterns cannot easily be inferred from transcription factor regulation patterns. These results imply multi-level regulation of their abundance.

We consulted other public quantitative proteome data of Mycobacterium tuberculosis (30), Mycobacterium bovis (30), and Saccharomyces cerevisiae (31) under environmental perturbations and consistently found strong stoichiometry conservation of the ISP COG class (Fig. S4). Furthermore, the same trend was observed for the genotype-dependent expression changes in E. coli proteomes (20) (Fig. S4).

Identifying stoichiometrically conserved groups

Inspired by the existence of a large class of proteins that conserves their stoichiometry, we considered a systematic way to extract stoichiometrically conserved groups (SCGs) without relying on artificial functional classification of COG (Fig. S1). Focusing only on the proteome data, we evaluated stoichiometry conservation for all the pairs of proteins in the proteome by calculating the cosine similarity of expression patterns (i.e., all in Fig. 4A and Table 1, where each element of the m-dimensional vector pi denotes the expression level of protein species i under one of the m conditions), and extracted groups in each of which the component proteins exhibit coherent expression change patterns by setting a high threshold of cosine similarity (≥ 0.995, Fig. 4B; see section 1.4 in (22) for detail).

Extracting SCGs from proteome data.

(A) Quantifying stoichiometry conservation by cosine similarity. We consider an m-dimensional expression vector for each protein species whose elements represent its abundance under different conditions. The cosine similarity between the m-dimensional expression vectors of two protein species becomes nearly 1 when they conserve mutual stoichiometry strongly across conditions, whereas lower than 1 when their expression patterns are incoherent. (B) Extracted SCGs. We extracted proteins with high cosine similarity relationships. Each node represents a protein species. An edge connecting two nodes represents that the expression patterns of the two connected protein species have high cosine similarity exceeding a threshold of 0.995. Proteins that have no edge with the other proteins are not shown. The largest and the second largest protein groups, which we refer to as SCG 1 and SCG 2, respectively, are indicated by shaded polygons. (C) Expression patterns of the extracted SCGs. The horizontal and vertical axes represent growth rate and protein abundance, respectively. Line-connected points represent expression-level changes of different protein species across conditions. SCG 1 (homeostatic core) is shown in two ways: the left panel with a linear-scaled vertical axis and the right panel with a log-scaled vertical axis. The inset for SCG 2 shows the total abundances of SCG 2 proteins with a log-scaled vertical axis. Error bars are standard errors. (D) The gene loci of the homeostatic core (SCG 1) proteins on the chromosome. Magenta dots are nodes (genes), and gray lines are edges (high cosine similarity relationships). We determined the gene loci based on ASM75055v1.46 (56).

The largest SCG (SCG 1) included many proteins in the ISP COG class (91 out of 191 SCG 1 members), such as ribosomal proteins and RNA polymerase, and also proteins in the other COG classes (Fig. 4B, Table S4). We call this largest SCG homeostatic core, as it constitutes the largest stoichiometry-conserving unit in cells. We found that the abundance of each protein in the homeostatic core (SCG 1) increased approximately linearly with the growth rate in each condition (Fig. 4C). This relationship is reminiscent of the growth law: The total ribosomal contents for translation increase linearly with growth rate (3234). The linear increase in the abundance of each protein in Fig. 4C indicates that the growth law is valid even at the single-gene level for a large class of ribosomal and non-ribosomal proteins in the homeostatic core (Fig. S5) (see section 3.1 in (22)).

Though not evenly distributed, the gene loci of the proteins in the homeostatic core are scattered throughout the chromosome (Fig. 4D). Therefore, localization of gene loci to a single or a small number of operons is not likely a cause of the observed stoichiometry conservation.

The proteins in the second largest SCG (SCG 2) are expressed at high levels in the fast growth conditions, especially in the “LB” condition (Fig. 4C). The SCG 2 includes many proteins in the metabolism COG class (21 out of 26 SCG 2 members) (Table S5), and their abundance increases approximately exponentially with growth rate (Fig. 4C). We also identified other condition-specific small SCGs, such as a group most expressed in the “GlycerolAA” condition (SCG 3) (Table S6), a group mainly expressed in the “Fructose” condition (SCG 4) (Table S7), and a group most expressed in the stationary phase conditions (SCG 5) (Table S8) (Fig. 4C).

Biological relevance of stoichiometry conservation

To understand the overall strength of stoichiometry conservation of the proteins in the different SCGs, we calculated the sum of co-sine similarity, , for each protein species i, where is cosine similarity between the m-dimensional expression level vectors of protein i and protein j (Fig. 4A), and the sum is taken over all the protein species (see section 1.5 in (22)). We refer to di as “stoichiometry conservation centrality” (Table 1).

The proteins in the homeostatic core had high centrality scores (Fig. 5A). Therefore, these proteins tend to have more connections with other proteins in terms of stoichiometry conservation. On the other hand, the proteins in the condition-specific SCGs tend to have low centrality scores among all the proteins (Fig. 5A), which suggests that their stoichiometry conservation is localized within each SCG.

A proteome structure characterized by global stoichiometry conservation relationships.

(A) Distributions of stoichiometry conservation centrality values for all the proteins (gray), the homeostatic core (SCG 1) proteins (magenta), and the proteins belonging to the other SCGs (cyan). (B) Correlation between stoichiometry conservation centrality and gene essentiality. The proportion of essential genes within each class of stoichiometry conservation ranking is shown. The list of essential genes was downloaded from EcoCyc (55). (C) Correlation between stoichiometry conservation and evolutionary conservation. The strength of evolutionary conservation of each protein species was estimated by the number of orthologs found in the OrthoMCL species (35). The genes with more orthologs tend to have higher stoichiometry conservation centrality (p = 3.42 × 10−14 by one-sided Brunner–Munzel test between the top 25% and the bottom 25% fractions of ortholog number ranking). Likewise, the genes with higher stoichiometry conservation centrality scores tend to have more orthologs (p = 8.44 × 10−12 by one-sided Brunner–Munzel test, Top 25%–Bottom 25% comparison; p-values in the captions for (F) to (I) were evaluated with the same statistical test scheme). (D) to (G) Stoichiometry conservation analyses of human cell atlas transcriptome data of fetal 15 organs (36). The top gray histogram in (D) shows the distribution of stoichiometry conservation centrality values for all genes. The bottom histograms in (D) show the distribution for coding genes (yellow) and that for the other genes (cyan). (E) shows a correlation between the ratio of coding genes and stoichiometry conservation centrality calculated from the human cell atlas data. (F) shows a correlation between gene essentiality and stoichiometry conservation centrality calculated from the human cell atlas data. The essentiality of each human gene was quantified by CRISPR score, which is the fitness cost imposed by CRISPR-based inactivation of the gene in KBM7 chronic myelogenous leukemia cells (35). Genes with lower CRISPR score are regarded as more essential. The fraction with low CRISPR score (i.e. high essentiality fraction) tends to have higher stoichiometry conservation centrality (p < 10−15). The fraction with high centrality score tends to be more essential (p < 10−15). (G) shows a correlation between evolutionary conservation and stoichiometry conservation centrality based on the human cell atlas data. The gene fraction with many orthologs tends to have higher stoichiometry conservation centrality (p < 10−15). The gene fraction with high centrality score tends to have more orthologs (p< 10−15). (H) and (I) Stoichiometry conservation analyses of genome-wide Perturb-seq data (37). (H) shows a correlation between stoichiometry conservation centrality calculated from the Perturb-seq data and gene essentiality. The essentiality of each gene was quantified by the CRISPR score as in (F). The gene fraction with low CRISPR score (i.e. high essentiality fraction) tends to have higher stoichiometry conservation centrality (p< 10−15). The gene fraction with high centrality score tends to be more essential (p < 10−15). (I) shows a correlation between stoichiometry conservation based on the Perturb-seq data and evolutionary conservation of genes. The gene fraction with many orthologs tends to have higher stoichiometry conservation centrality (p < 10−15). The gene fraction with high centrality score tends to have more orthologs (p< 10−15). (J) Representation of the proteomes as a graph. A node corresponds to a protein species, and the weight of an edge is taken as the cosine similarity between the m-dimensional expression vectors of the two connected protein species. The n × n matrix A can specify the whole graph. Note that the diagonal elements of A are ones, which were introduced just for simplicity. (K) Cosine similarity LE (csLE) structure in a three-dimensional space. Each dot represents a different protein species and is color-coded on the basis of its stoichiometry conservation centrality value. We selected the axes considering the structural similarity to the Raman-based proteome structure in ΩB (see Fig. 6). (L) The csLE structure in a three-dimensional space. The views from two different angles are shown. Each gray dot represents a different protein species. The proteins belonging to each SCG are indicated with distinct markers. Colors of the two-dimensional histograms in (C), (F), (G), (H), and (I) represent the height of each bar.

The stoichiometry conservation centrality is biologically relevant because it correlates with gene essentiality. Fractions of essential genes almost monotonically decrease with the ranks of centrality score (Fig. 5B and Fig. S6). We also noted that genes with high centrality scores have more orthologs determined by OrthoMCL-DB (35) across the three domains of life (Fig. 5C and Fig. S6). Likewise, genes with many orthologs tend to have higher centrality scores (Fig. 5C and Fig. S6). Therefore, the stoichiometry conservation in cells correlates with the evolutionary conservation of proteins.

To determine if the correlation of stoichiometry conservation centrality with gene essentiality and evolutionary conservation is general, we analyzed the transcriptome data from other organisms and found comparable correlations in S. pombe (Fig. S6). In addition, we found that fractions of coding genes almost monotonically decreased with ranks of centrality score in the S. pombe data (Fig. S6).

We further analyzed two kinds of Homo sapiens transcriptome data. One is a human cell atlas, in which expression of both coding and noncoding genes in 15 fetal organs was quantified (36), and the other is genome-wide Perturb-seq data (37), in which genetically perturbed transcriptomes were measured mainly for coding genes. Our analysis of the human cell atlas data revealed that, while the overall distribution of stoichiometry conservation centrality was broad (Fig. 5D top), the centrality distribution of coding genes was skewed to higher values (Fig. 5D bottom) as observed for the E. coli proteome. Fractions of coding genes almost monotonically decreased with ranks of centrality (Fig. 5E) as seen in the S. pombe data (Fig. S6). Essentiality of each gene in human cells was quantified by an index called CRISPR score, which measures the fitness cost imposed by CRISPR-based inactivation of the gene (38). Genes with lower CRISPR scores are considered more essential. Our analysis revealed that genes with higher stoichiometry conservation centrality scores tend to have lower CRISPR scores, thus more essential (Fig. 5F). Similarly, genes with lower CRISPR scores tend to have higher stoichiometry conservation centrality scores. Furthermore, genes with higher centrality scores have more orthologs across the three domains of life, and vice versa (Fig. 5G). Comparable correlations of stoichiometry conservation with essentiality and evolutionary conservation were also found in the genome-wide Perturb-seq data (Fig. 5H and Fig. 5I). Together, these results suggest that correlations of stoichiometry conservation centrality with gene essentiality and evolutionary conservation are general and preserved from E. coli to human cells regardless of the type of perturbation (see section 1.6 in (22) for detail).

Revealing global stoichiometry conservation architecture of the proteomes with csLE

Although the previous analysis revealed the biological relevance of stoichiometry conservation centrality, it is a one-dimensional quantity and cannot capture the global architecture of omics profiles. To gain further insights into genome-wide stoichiometry-conserving relationships among genes, we next analyzed the proteomes using a method similar to Laplacian eigenmaps (LE) (Fig. S1) (39). We consider a symmetric n × n matrix A whose (i, j) entry is (Fig. 5J, Table 1). The entire proteome structure can be represented using the eigenvectors of normalized A. Major differences of this method from the ordinary LE are that we consider an edge for all node pairs and that we adopt cosine similarity for weighting edges. This method places the proteins with higher cosine similarity closer in the resulting (m − 1)-dimensional space (see sections 1.5 and 2.1 in (22)); we call this linear method cosine similarity LE (csLE).

In this (m − 1)-dimensional csLE space ΩLE, the stoichiometry conservation centrality of the proteins decreased from center to periphery (Fig. 5K), which confirms that it indeed measures the extent to which each protein is close to the center in the entire stoichiometry conservation architecture. Furthermore, the proteins formed polyhedral distributions with the cluster of the proteins in the homeostatic core at the center and the clusters of the proteins in the other condition-specific SCGs at distinct vertices (Fig. 5L). This distribution is consistent with the fact that the condition-specific SCGs are the components whose expression patterns are distant from the homeostatic core and also between each other.

Representing the proteomes using the Raman LDA axes

Given that the analysis of the LDA Raman-proteome regression coefficients B (Fig. 3A) eventually led us to identify the stoichiometry conservation architecture in the proteome data (Fig. 5), the low-dimensional proteome structure in ΩLE might be related to major changes in cellular Raman spectra in the LDA space and provide insight into the Raman-proteome correspondence. To investigate this, we considered representing the proteomes on the basis of the Raman LDA axes (Fig. S1).

The coefficients in the n × m regression matrix B must satisfy the proportionality for all k-th LDA axes (1 ≤ km − 1) for the pair of protein i and protein j that perfectly conserve their stoichiometry, as previously mentioned in the analysis of the ISP COG class (Fig. 3A; see sections 1.3 and 2.1 in (22)). Noting this property, we constructed another (m − 1)-dimensional proteome space ΩB, assigning each protein species i a coordinate , where is the normalized coefficient of gene i corresponding to the k-th LDA axis. As in (m − 1)-dimensional ΩLE, a pair of proteins with strong stoichiometry conservation are expected to position closely in this (m − 1)-dimensional proteome space ΩB. Note that the proximity of the coordinates of different proteins i in ΩB is equivalent to the approximate proportionality of different proteins i in Fig. 3A, demonstrated for the ISP COG class using the proportionality constants (normalized coefficients) ck common to different proteins.

We then found that the distribution of the proteins in ΩB closely resembled the one in ΩLE when visualized using the first few major axes (Fig. 5L and Fig. 6A). This similarity is non-trivial because ΩLE is constructed only from the proteome data, whereas ΩB depends on the (m − 1)-dimensional Raman LDA space (Fig. 2C–2E).

Raman-based proteome structure and its similarity to stoichiometry-based proteome structure.

(A) Proteome structure determined by Raman-proteome coefficients visualized in a three-dimensional space. The views from two different angles are shown. Each gray dot represents a protein species. The proteins belonging to each SCG are indicated with distinct markers. We note that SCGs are defined without referring to Raman data (Fig. 4). (B–D) Similarity among the distribution of LDA Raman spectra (B), the proteome structure determined by Raman-proteome coefficients (C), and the proteome structure determined by stoichiometry conservation (D). (E) Mathematical relation between the coordinates of the proteins in ΩB (C) and ΩLE (D). The two conditions, one with Θ (magenta), and the other between b0 and (cyan), must hold for the similarity between the two proteome structures (yellow), as described in the gray box. denotes column-wise proportionality.

We remark that each axis of ΩB is directly linked to the corresponding Raman LDA axis. Consequently, the orthants in ΩB where the condition-specific protein species reside agree with those in the Raman LDA space where the cellular Raman spectra under corresponding conditions reside (Fig. S10) (see sections 1.7 and 2.1 in (22)). Indeed, we find such orthant agreement between the proteins in the condition-specific SCGs (SCG 2–SCG 5) and the cellular Raman spectra under the corresponding conditions (Fig. 6B and Fig. 6C). This straightforward correspondence between ΩB and the Raman LDA space allows us to examine the relationship between changes in cellular Raman spectra and omics components’ stoichiometry conservation architecture by comparing the two proteome structures in ΩB and ΩLE.

Omics-level interpretation of cellular Raman spectra and a quantitative constraint between expression generality and stoichiometry conservation centrality

To understand rigorously what the similarity of the proteome structures in ΩB and ΩLE signifies (Fig. 6C and Fig. 6D), we clarified the mathematical relation between the coordinates of the proteins in these two spaces (Fig. 6E and Fig. S1; see sections 2.1 and 2.2 in (22) for detail). We then characterized the two mathematical conditions that must be satisfied simultaneously (Fig. 6E).

The first condition is that major axes of the Raman LDA space and those of the proteome csLE space correspond (Fig. 6E). Consequently, cellular Raman spectra under a condition accompanying the expression of a condition-specific SCG must be significantly different from those under conditions with the expression of other condition-specific SCGs in a manner distin-guishable by LDA. Mathematically, this condition is related to the m × m orthogonal matrix Θ that appears in the equation in Fig. 6E. For the distributions of the proteome components to be similar in the low-dimensional subspaces of ΩLE and ΩB, Θ must be close to the identity matrix with small off-diagonal elements (Fig. 6E). We verified this first condition with the data (Fig. S9; see section 1.8 in (22) for detail).

The second condition relates to the proportionality of the n-dimensional vectors b0 and in Fig. 6E. This proportionality relation can be transformed into another relation that di is proportional to gi := ∥pi1/∥pi2, where ∥pi1 and ∥pi2 are the L1 and L2 norms of the expression level m-dimensional vector of protein i across conditions (Fig. 4A and 6E, Table 1). gi can be interpreted as the expression generality score. When gi is large, the protein i is expressed generally across conditions; when gi is small, this is expressed only under specific conditions (Fig. S8) (see section 1.9 in (22)). Therefore, the proportionality between di and gi indicates that the proteins with high stoichiometry conservation centrality must be expressed non-specifically to conditions. We also verified this condition with the data, confirming that it is indeed satisfied (Fig. 7A and Fig. S9).

Proportionality between stoichiometry conservation centrality and expression generality.

(A) Relationships between stoichiometry conservation centrality (di) and expression generality (gi). Each gray dot represent a protein species. The proteins belonging to each SCG are indicated with distinct markers. The dashed lines are . The solid lines represent (see section 2.2 in (22)). The deviation of a point from the solid line is related to the growth rate under the condition where each protein is expressed the most. (B) The same plot as (A) in black and white. Overlaid red circles indicate proteins featured in (C). (C) Expression patterns of the proteins indicated by red circles in (B) across conditions. The condition differences are shown by the growth rate differences on the horizontal axes. The arrangement of the plots for the proteins corresponds to their relative positions in (B).

The spread of the points from the proportionality diagonal line of the E. coli proteome data in Fig. 7A was found related to the growth rate under the condition where each protein is expressed the most (see section 2.2 in (22) for a detailed analysis on the origin of the deviation). Consequently, one can envisage a growth-rate-dependent expression pattern of each protein on the basis of its relative position in this gi-di plot (Fig. 7B and Fig. 7C). For example, both BamB and YqjD are expressed non-specifically to the conditions with nearly identical expression generality scores. However, BamB is expressed at higher levels under fast growth conditions, whereas YqjP is expressed at higher levels under slow growth conditions due to their relative positions to the proportionality line. Similar growth rate dependence is observed for PaaE and DgoA, but with more prominent condition-specificity because these proteins are characterized by their low expression generality scores. These growth-rate-dependent deviation patterns might hint at a new growth law that governs the total relative expression changes of the proteome components (see section 2.2 in (22) for detailed discussion).

Generality

We also examined the generality of the aforementioned two conditions using the Raman and proteome data of E. coli strains with different genotypes (BW25113, MG1655, and NCM3722) under two culture conditions (20) and the Raman and transcriptome data of S. pombe under 10 culture conditions (18). Applying csLE to the omics data, we again found similar omics structures between ΩLE and ΩB when visualized using the first few major axes, with homeostatic cores at the centers and condition-specific SCGs at the vertices (Fig. S11 and Fig. S12).

Proportionality between stoichiometry conservation centrality and expression generality score was also confirmed in both additional datasets (Fig. S7). We further used publicly available quantitative proteome data of M. tuberculosis, M. bovis, and S. cerevisiae (30, 31) to examine this relation and confirmed that the proportionality universally holds (Fig. S7 and Fig. S13). Almost no deviation from the proportionality line existed in the S. cerevisiae proteome data measured for the cells in different media but cultured in chemostats with an identical dilution rate (thus, identical growth rate), which is consistent with the result of E. coli in which the deviations were related to the growth rate differences.

Discussion

A Raman spectrum obtained from a single cell is a superposition of the spectra of all of its constituent biomolecules. Therefore, cellular Raman spectra potentially contain rich information on essential state differences in targeted cells. The fact that both transcriptomes and proteomes are inferable from cellular Raman spectra, as demonstrated in this and previous (18) studies, endorses this speculation. The detailed analyses of the relationship between Raman and omics data have identified functionally relevant constraints on omics changes and provided an interpretation of cellular Raman spectra (Fig. S1). Specifically, it has been revealed that major changes in cellular Raman spectra distinguishable by LDA reflect the changes in omics profiles under the constraints of stoichiometry conservation. This correspondence would help us interpret global changes in cellular Raman spectra by translating them into the differences in omics profiles.

We remark that linearity in our formulation enabled us to find the rigorous connection between the two omics spaces ΩB and ΩLE (Fig. 6E). Unlike the original LE, we adopted cosine similarity as weights of edges between all node pairs to measure expression stoichiometry conservation of proteins. This modification was indispensable in terms of interpretation; relative proximity of positions in ΩLE reflects the strength of stoichiometry conservation. We also remark that simple principal component analysis (PCA) applied to the normalized E. coli proteome data also finds a similar low-dimensional proteome structure (Fig. S6) (see section 1.10 in (22)). Therefore, besides interpretability, omics structures in ΩLE might reflect dominant relationships among omics components commonly characterized by several methods of omics representation.

It should be noted that the quantitative analysis of Raman-omics correspondence resulted in the characterization of stoichiometry-conserving architecture in cells (Fig. S1). This shows that besides distinguishing different cellular states or quantifying specific biomolecular species by focusing on spectral peaks, Raman spectra can also characterize the system-level constraints behind changes in global gene expression profiles. While the identified features, such as stoichiometry conservation centrality, expression generality score, and csLE space, can be calculated without Raman data, it is difficult to reach them directly without scrutinizing the Ramanomics correspondence. Furthermore, the definition of expression generality and its relation to stoichiometry conservation centrality were directly derived from the Raman-omics correspondence analysis (Fig. 6E and Fig. 7). Therefore, as a signal reflecting comprehensive molecular profiles in cells, Raman spectra are an important modality for dissecting system-level properties and constraints in cells.

In this study, we mainly analyzed the Raman and proteome data of E. coli under 15 different environmental conditions. However, the resulting low-dimensional structures and correspondence of ΩLE and ΩB can change depending on what and how many conditions are included in the analysis. Thus, an intriguing question is how the Raman-proteome correspondence is affected by the conditions used in the analysis. A subsampling analysis focusing on the orthogonal matrix Θ, which represents low-dimensional correspondence precision of ΩLE and ΩB (Fig. 6E), reveals that correspondence precision tends to increase with an increasing number of conditions (Fig. S14). This result suggests that increasing the number of conditions generally improves the low-dimensional correspondence rather than disrupting it.

Since the proteome data that we referenced (20) represent the averaged expression profile of the cells in each condition, we likewise averaged the single-cell Raman data in each condition in the LDA space to determine their correspondence. Once this correspondence is established, it becomes technically feasible to infer the proteomes of individual cells from their Raman spectra. However, verifying the accuracy of the inferred proteome profiles requires quantitative ground truth of single-cell proteomes, which are not yet readily obtainable, especially for bacterial cells. Despite this limitation, future studies may clarify the correspondence at the single-cell level as omics technology advances.

Stoichiometry conservation is plausibly crucial for cellular functions and physiology. For example, the enzymes involved in evolutionarily conserved metabolic pathways conserve their stoichiometry across microorganism species despite their diverse transcriptional and translational rates (40). It is suggested that stoichiometry conservation is achieved by optimizing the metabolic flux for fast growth (41). Furthermore, a ribosome-targeting antibiotic causes an imbalance of ribosomal proteins and growth arrest in E. coli, but the balance is restored alongside growth recovery through physiological adaptation (42). These results suggest that disruption of stoichiometric balance among core components could impose significant fitness cost.

It is known that functions, essentiality, and evolutionary conservation of genes can be linked to the topologies of gene networks (4348). However, networks that have been previously analyzed, such as protein-protein interaction networks, depend on known interactions. Therefore, as our understanding of the molecular interactions evolves with new findings, the conclusions may change. Furthermore, analysis of a particular interaction network cannot account for effects of different types of interactions or multilayered regulations affecting each protein species, thus highlighting only one aspect of the inherently global coordination of molecular compositions in cells. In contrast, the stoichiometry conservation network in this study focuses solely on expression patterns as the net result of interactions and regulations among all types of molecules in cells. Consequently, the stoichiometry conservation networks are not affected by the detailed knowledge of molecular interactions, and naturally reflect the global effects of multilayered interactions behind cellular physiological state changes. Additionally, stoichiometry conservation networks can easily be obtained for non-model organisms, for which detailed molecular interaction information is usually unavailable. Therefore, analysis with the stoichiometry conservation network has several advantages over existing methods from both biological and technical perspectives.

It is intriguing to ask how cells conserve stoichiometry among the components in each SCG. Especially, the homeostatic core (SCG 1) contains many components whose gene loci are scattered throughout the genome. It is known that both transcriptional and translational negative auto-regulation contributes to controlling the stoichiometry of many ribosomal proteins (4954). The genes for the ribosomal proteins are scattered in multiple operons and co-regulated with many other non-ribosomal proteins, such as RNA polymerase subunits, translation initiation/elongation factors, and transmembrane transporters (55). Therefore, the stoichiometry-conserving mechanisms established for ribosomes might be partially exploited for the stoichiometry conservation within the homeostatic core.

The existence of condition-specific SCGs and genes with similar expression patterns confirms that adaptation to specific conditions is not necessarily achieved by a small number of functionally relevant genes, but is often accompanied by changes in the expression of many seemingly unrelated genes. Indeed, condition-specific SCGs contain genes with unclear roles in adaptation, including some that are functionally uncharacterized (Tables S5S8). Therefore, it would be important to investigate whether the coexpression of multiple genes is crucial for cellular adaptation to a wide range of perturbations while maintaining homeostasis.

The proportionality between stoichiometry conservation centrality and expression generality score suggests that proteins with high stoichiometry conservation centrality govern basal cellular functions required under any conditions. In fact, both essential genes and evolutionarily conserved genes are enriched in the omics fractions with high centrality scores. On the contrary, proteins of low centrality scores might have been acquired in later stages of the evolution and exploited to survive or increase fitness under specific conditions. Such hierarchy in the stoichiometry conservation centrality among core and peripheral processes might promote the adaptability of cells since cells can respond to diverse environments without restructuring a large body of the functional homeostatic core. This architectural principle in omics might underlie the robustness and adaptability of biological cells.

Supplementary Materials

1 Materials and Methods

1.1 Experimental methods, data acquisition, and data analyses

1.1.1 Absolute quantitative proteome data

We utilized high-quality absolute quantitative proteome data reported by Schmidt et al. [1]. In these data, expression levels of more than 55 % of genes of E. coli BW25113 strain (more than 95 % of total proteome mass) were quantified under various environmental conditions.

We also used additional absolute quantitative proteome data [1, 2, 3] for checking the generality of our findings (see 3.2). In addition to the proteome data across environmental conditions, Schmidt et al. also reported proteomes of E. coli strains with different genotype backgrounds (BW25113, MG1655, and NCM3722) cultured in a rich medium or a minimal medium supplemented with glucose. Schubert et al. [2] quantified proteomes of M. tuberculosis H37Rv strain and M. bovis BCG strain under time-course environmental change conditions starting from exponential growth conditions, followed by dormant states induced by decreasing oxygen levels, and finally regrowth conditions with re-aeration. Lahtvee et al. [3] quantified proteomes of S. cerevisiae under a reference condition and three stressed conditions (ethanol, osmotic pressure, and high temperature, with three stress intensity steps for each type of stress) using chemostat.

For checking the generality of our findings across omics classes, we also used the transcriptome data reported by our previous paper [4]. The data include the transcriptomes of S. pombe in rich and minimal media, in nutrient-depleted media, and under various stress conditions.

1.1.2 E. coli strains and culture conditions

To quantitatively analyze a linkage between the absolute proteome data generated by Schmidt et al. [1] and Raman data, we reproduced the culture conditions used in Schmidt et al. [1] as closely as possible in our lab. We obtained three biological replicates.

E. coli strains

We used BW25113, MG1655, and NCM3722 as in Schmidt et al. [1]. In particular, BW25113 [5] was used for the main data in this study. The genotype of BW25113 is F Δ(araD -araB)567 ΔlacZ4787 (::rrnB-3) λ rph-1 Δ(rhaD -rhaB)568 hsdR514, that of MG1655 is F λ rph-1, and that of NCM3722 is F+, respectively [6, 7, 8].

Culture conditions

We prepared 15 batch culture conditions listed in Table S1. We excluded three culture conditions among the 18 conditions reported in [1] because we could not obtain sufficiently strong cellular Raman signals under those excluded conditions. See [1] for the detail of medium compositions. For “GlucosepH6” medium, 37 % HCl was titrated to the “Glucose” medium. Medium for “stationary1day” and “stationary3days” was the same as “Glucose” medium. LB agar plates were prepared by adding 15 g/L agar to “LB” medium.

List of culture conditions.
Cultivation

Culturing E. coli cells proceeded in four steps: Step 1 Growth on LB agar plates.

Step 1 Growth on LB agar plates.

Cells were taken from a −80 °C glycerol stock and streaked on LB agar plates. The plates were incubated at 37 °C overnight and stored at 4 °C. All subsequent experiments were conducted using colonies on the LB agar plates. Picking colonies from the plates for cultivation was done within four days of storage at 4 °C.

Step 2 Liquid culture under “Glucose” condition.

Several colonies picked from LB agar plates were inoculated into “Glucose” liquid culture medium and grown for about 16 hours. Cells for “Glucose42C” condition were cultured at 42 °C and those for the other conditions were grown at 37 °C.

Step 3 Liquid culture under each condition.

Cells from Step 2 were passaged into each type of medium and grown to exponential phase. Cells for “Glucose42C” condition were grown at 42 °C and those for the other conditions were cultured at 37 °C.

Step 4 Liquid culture under each condition.

Cells from Step 3 were passaged into the respective fresh medium and grown to almost the same level of turbidity as that at the end of Step 3. Cells for “Glucose42C” condition were cultured at 42 °C and those for the other conditions were grown at 37 °C.

For the exponential conditions, cell cultivation was conducted as described above. For the stationary conditions, cultivation of cells at Step 3 was continued instead of proceeding to Step 4 and ended one or three days after they reached the stationary phase.

The medium volume was 2 mL for all the liquid cultures in our experiments. Borosilicate glass test tubes with a diameter of 16.5 mm and a length of 165 mm were used. A fresh medium was pre-warmed before passage so that its temperature was the same as that of cultivation. All the liquid cultures were under reciprocal shaking at 200 r/min and at an inclination of 45°. Liquid cultures were diluted to an OD600 of around 0.01 for passage.

Main differences between our cultivation conditions and those of Schmidt et al. [1] are the periods of storage at 4 °C at Step 1 (a maximum of three weeks in [1]), the number of colonies inoculated from plates to liquid medium at the second step (one colony per inoculation in [1]), and medium volumes and shaking conditions of liquid cultures (50 mL liquid culture in 500 mL unbaffled wide-neck Erlenmeyer flasks under orbital shaking at 300 r/min in [1]).

Growth rate measurements

Growth curves were obtained by continuing the Step 3 in cultivation. Cultivation of cells for growth measurements was conducted with 5 mL culture media, not 2 mL, due to a requirement of the device used for continuous turbidity recording (ODBox-C, TAITEC CORPORATION). In addition, cells were washed with each type of fresh medium before inoculation at the beginning of Step 3, and cultivation for growth recording started from an OD600 of around 0.001. Growth rates were calculated from the growth curves using the fitting algorithm based on Gaussian processes [9].

1.1.3 Raman measurements and preprocessing of spectra

Cells were washed three times with 0.9 % aqueous solution of NaCl, and 5 µL of the suspension was placed on a synthetic quartz slide glass (TOSHIN RIKO CO., LTD.) and dried. Raman spectra of cells were measured with a Raman microscope (Figure S2), where a custom-built Raman system (STR-Raman, AIRIX) was integrated into a microscope (Ti-E, Nikon). Excitation light was generated by a 532 nm continuous-wave diode-pumped solid-state (DPSS) laser (Gem 532, Laser Quantum). We altered the first version of this Raman microscope [4], and light from the laser oscillator was transmitted by mirrors in this research. A 100× and NA = 0.9 air objective lens (MPLN100X, Olympus) was used. Raman scattered light was collected by an optical fiber and transmitted to a spectrometer (Acton SP2300i, Princeton Instruments). Dispersed light by a 300 gr/mm grating was projected onto an image sensor of an sCMOS camera (OrcaFlash 4.0 v2, Hamamatsu Photonics). The sCMOS camera was water-cooled at 15 °C to reduce dark noise. The exposure time for each cell was 10 s. Randomly selected 15 cells were measured per condition per replicate. Raman spectrum of background was measured for each cell with 10 s exposure in an area close to a targeted cell where neither cells nor NaCl crystals existed.

In our setup, the laser power at the sample stage was 21 mW. The measurement system and processes were controlled by using Micro-Manager 1.4 [10] and a plugin we made.

Readout noise of sCMOS image sensors is pixel-dependent. A noise reduction filter developed in [4] on the basis of [11] was applied to measured spectral images by using 10000 blank images obtained with the same sCMOS sensor with exposure time of 10 s. See [4] for detail.

After noise reduction with the filter, pixel counts were summed up along the direction perpendicular to wavenumber. A background spectrum was subtracted from a cellular Raman spectrum. A pixel region corresponding to the range from 632 cm−1 to 1862 cm−1 was cropped. The cropped spectrum was smoothed with Savitzky–Golay filter [12]. To minimize the effect of laser excitation variations, each spectrum was normalized by subtracting the average and dividing it by the standard deviation.

1.1.4 Data analysis

We wrote scripts and analyzed data using Matlab (R2019a and R2023b), except for Brunner–Munzel test, for which we used R (version 4.0.3) (See 1.6.2).

Related to Figure 2 in the main text, we first performed linear discriminant analysis (LDA) against the Raman data. LDA is a linear classifier; it finds the most discriminatory bases by maximizing the ratio of the between-class variance to the within-class variance, and reduces the dimensions of the data to m − 1, where m is the number of classes [13][14][15]. In the case of our main data, classes are culture conditions. In the verification step of the correspondence between the LDA Raman and omics data, we conducted leave-one-out cross-validation (LOOCV). In LOOCV, one condition is used as test data and the remaining conditions are used as training data. This is repeated by changing the condition to exclude.

The detail of the data analyses will follow in the sections below.

1.2 Raman-proteome statistical correspondence

1.2.1 Notation

We write the population-averaged 14-dimensional LDA Raman spectrum vector of each condition as a row vector and the 2058-dimensional absolute proteome vector of each condition as a row vector . Note that we regarded and as column vectors in the main text for simple expression of equations.

Our hypothesis of Raman-proteome linear correspondence (Eq. 1 in the main text) is expressed as

where B is a 2058 × 15 matrix and ⊤ denotes transpose. In LOOCV, one condition is excluded (let i be the excluded conditoin) and the remaining 14 conditions are used to estimate B. We write the estimated B as , which is also a 2058 × 15 matrix. Let be the estimated proteome of the excluded condition in LOOCV (Figure 2G).

1.2.2 OLS in LOOCV scheme

In the case of LOOCV, 14 (= 15 − 1) conditions are included in a training data. Thus, if all the 14 LDA axes of the low-dimensional Raman data are considered, OLS becomes underdetermined. We excluded higher dimensions of the Raman space to conduct OLS in LOOCV unless otherwise noted. The results described in the main text were obtained by using the first four axes (LDA1 to LDA4). In this case, is a 2058 × 5 matrix.

1.2.3 Permutation test

Let a permutation of all the 15 conditions be σ. In our permutation test, we calculated overall estimation errors as , where is one of the distance measures between and listed in Table S2. There exist 15! sets of σ, and calculating all of them is computationally intensive. Thus, we randomly generated 105 permutation sets.

The result presented in the main text is the case where Euclidean metric (PRESS) was used as a distance measure. Likewise, we also obtained small p-values with the other metrics (Table S2).

Evaluation of the overall estimation error with various distance measures (The case where LDA1 to LDA4 axes were used).

The sum of estimation errors was calculated and a permutation test (105 permutations) was conducted. In this table, LDA1 to LDA4 axes were used. represents a vector whose all elements are the mean of all elements of x. xj is the j-th element of x. medianj xj represents the median of scalers xj.

We could also estimate the proteomes with high accuracy using all the 14 dimensions of the LDA space (Table S3). As noted in 1.2.2, the regression is underdetermined in this case. Thus, we simply adopted the minimum-norm solution from among all least-squares solutions.

Evaluation of the overall estimation error with various distance measures (The case where all the 14 LDA axes were used).

The results obtained by using all the 14 LDA axes are presented. See Table S2 for notations. Note that the system is underdetermined in this case, thus we adopted the minimum-norm solution from among all least-squares solutions.

1.3 Characterizing a stoichiometrically-conserved group by analyzing the Raman-proteome correspondence matrix

1.3.1 Notation

The component representation of (1.1) is

where n is the number of proteins, and m is the number of culture conditions. n = 2058 and m = 15 in our case. Let bh−1 be the h-th column of B. For example, b0 = (b10bn0) denotes the constant term for each protein, and b1 = (b11bn1) the coefficient of LDA1 for each protein. The expression level of protein j in the condition i is

1.3.2 Stoichiometry conservation of ISP COG class

In the main text, we revealed that many proteins belonging to “information storage and processing” COG class were aligned on a straight line passing through the origin when the relations between the columns of B were shown in scatter plots (Figure 3A). Consider hypothetical proteins that align perfectly on a straight line through the origin. Let e1,…, ek be the indices of such perfectly aligning protein species. Extracting only these rows for the proteins from (1.2), we obtain

where ci (i = 1, 2,…, m − 1) are constants and . For our data,Figures S4A and S4B correspond to (1.6). In these plots, the y-axis represents , and the x-axis . Many “Information storage and processing” proteins indeed align on a straight line through the origin with different slopes for different conditions (Figure S4A). In contrast, many proteins in other COG classes do not align on a straight line (Figure S4B).

Importantly, for a pair of proteins eα, eβ that align on the straight line,

holds from (1.6). The right-hand side of (1.7) does not contain condition index i, which means that the abundance ratio of the proteins remains constant independently of the conditions.

1.3.3 On the evaluation of stoichiometry conservation by Pearson’s correlation coefficient

In the main text, we used Pearson’s correlation coefficients to confirm the stoichiometry conservation of many ISP COG class members (Figure 3B). However, strictly speaking, cosine similarity is a more appropriate measure to evaluate stoichiometry conservation. In this analysis, cosine similarity can be written as

where and are the vectors representing the protein abundance for the proteome subgroups (“Information storage and processing” COG class, “Cellular processes and signaling” COG class, and “Metabolism” COG class) for conditions i and j, respectively (1 ≤ i, jm). Cosine similarity version of Figure 3B is Figure S4F. The cosine similarity takes the maximum value 1 only when abundance ratios between all considered proteins are perfectly the same between compared two conditions.

In addition, we also examined differences between COG classes by calculating Pearson’s correlation coefficients of log abundances (Figure S4E).

1.4 Direct characterization of stoichiometrically-conserved groups in omics data

1.4.1 Notation

Let pi be a column vector representing the abundances of protein i. Each component of this vector indicates the abundance of protein i under each condition. Therefore,

where m = 15 in our case. Note that pi defined here is a 15-dimensional column vector and different from introduced previously, which was a 2058-dimensional row vector.

1.4.2 Identifying SCGs in omics data

As explained in the main text, we extracted stoichiometrically-conserved groups (SCGs) directly from the omics data without referring to Raman data nor COG classification. We evaluated similarity of expression patterns for all the combinations of proteins using cosine similarity. Specifically, cosine similarity between proteins i and j is calculated as

This is the inner product of normalized pi and pj. Note that as protein abundances of any proteins are non-negative. takes the maximum value 1 if and only if pi and pj point in identical direction, i.e., the abundance ratios of proteins i and j are constant under all the conditions. Therefore, if we extract only the proteins connected with high cosine similarity from all protein pairs, they would constitute proteome fractions in each of which the abundance ratios of the proteins remain almost constant across all the m conditions. We hence extracted only the protein pairs whose cosine similarity was above a high threshold of 0.995. As a result, we obtained several SCGs, in each of which the protein species are linked to each other with high cosine similarity (Figure 4B and 4C).

The genes in each SCG are listed in Tables S4, S5, S7, S8 and S6. Note that there are many other minor components (Figure 4B), some of which may have an expression pattern similar to another component but are separated due to the high threshold.

The positions of members of the SCGs on the chromosome are shown in Figure 4D (SCG 1(homeostatic core)) and Figure S5E (SCGs 2–5).

Gene list of SCG 1 (homeostatic core).

Members of homeostatic core (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from [1].

Gene list of SCG 2.

Members in SCG 2 (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from [1],

Gene list of SCG 3.

Members in SCG 3 (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from [1].

Gene list of SCG 4.

Members in SCG 4 (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from [1].

Gene list of SCG 5.

Members in SCG 5 (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from [1].

1.5 Global proteome structures based on stoichiometric balance

In the previous section, we identified SCGs by setting a threshold of cosine similarity for extracting protein pairs. We next removed the threshold and considered the “distance” with respect to cosine similarity for all the protein pairs to capture the global proteome structure including SCGs.

The cosine similarity for all the pairs of proteins can be summarized in one matrix as

where (i, j) component represents cosine similarity between proteins i and j. Assuming that this matrix is an adjacency matrix in graph theory and network theory, the entire proteomes are considered as a weighted undirected complete graph (with loops), where nodes correspond to protein types and any protein pair is connected by an undirected edge. Each edge is weighted by the cosine similarity between the two protein species at both ends. Note that all the diagonal elements of A are one, which represents that the each node has a loop with weight of one. These were introduced just for simplicity.

To ask whether the SCGs identified in the previous section have any unique features in this network, we evaluated the degree to which each node is central in the network structure. In graph theory, “centrality” is known as an index to measure how “important” or “influential” each node is. In particular, we employed a measure called “degree centrality” (for weighted graphs) [16][17]. Degree centrality, which is also called “degree”, simply measures “influence” of a node on a network on the basis of links with its direct neighborhood. One can obtain a degree centrality value by calculating the sum of the weights of all the edges connected to each node (see also the definition of the degree matrix D (1.13) below). We note that in our graph, degree centrality vector A1n(= D1n), where 1n is an n-dimensional column vector of which all elements are one, is equal to the eigenvector corresponding to the largest eigenvalue of a “normalized” adjacency matrix (D−1A) = AD−1 up to multiplication by a constant. From this perspective, the central-ity index we adopted measures “influence” of a node in a recursive manner depending on “influence” of its neighboring nodes. A well-known example of this centrality indicator is Google’s PageRank [18] used for ranking web pages on the World Wide Web. It can be regarded as a variant of “eigenvector centrality” (the eigenvector corresponding to the largest eigenvalue of the adjacency matrix A) [19][17]. As explained in the main text, the protein species in the homeostatic core (the largest SCG) had high centrality scores, while those in the other condition-specific SCGs had low centrality scores (Figure 5A).

We directly observed the global stoichiometry conservation structure of this proteome graph using Laplacian eigenmaps (Figures 5K, 5L, and 6D). In general, a graph can be uniquely specified not only by the adjacency matrix A, but also by the Laplacian matrix L defined as

where D = (dij) is the degree matrix with the components of

where 1n is an n-dimensional column vector of which all elements are one. The (i, i)-element of D represents the sum of the weights of all the edges connected to node i. In our case, it represents the sum of cosine similarity values between protein i and the other proteins. To see the entire proteome graph structure, we specifically employed the normalized Laplacian,

We remark that there are two types of often-used normalized Laplacian matrices, Lrw and Lsym = D−1/2LD−1/2 = ID−1/2AD−1/2, in the field of machine learning [20], and our mathematical analysis can provide a clear interpretation to each of them in the context of the Raman-proteome linear correspondence as described in 2.1.5.

There exist m−1 non-trivial eigenvalues of Lrw that are greater than zero and less than one. We write these m − 1 eigenvalues as λLE1,. .., λLE(m−1) from the smallest and the corresponding eigenvectors as vrw,1,…, vrw,(m−1). Additionally, we denote the eigenvector corresponding to the eigenvalue zero as vrw,0. Using these eigenvectors, one can construct a matrix and visualize a proteome, assigning protein j with a coordinate specified by the elements after the second column in the j-th row of , i.e. by the j-th row of [vrw,1vrw,(m−1)]. The cosine similarity structure we illustrate in the main figures was produced by using these eigenvectors. For example, csLE1-csLE2 figure in the main text (Figure 6D) is a scatter plot between vrw,1 and vrw,2. Note that the closer to one the cosine similarity of a protein pair is (the more similar their expression patterns are), the “closer” the two protein species are placed (See 2.1.5 for details).

This method of obtaining low-dimensional representation of data using eigenvectors of a graph Laplacian is known as Laplacian eigenmaps [21][22]. Thus, what we explained above is the Laplacian eigenmaps of a graph with edges weighted with cosine similarity of expression patterns of nodes (protein species). It differs from the original and common usages of Laplacian eigenmaps in that the graph we considered is a complete graph (with loops) and that the weight of edges (pairwise similarity of nodes) is cosine similarity. It has made all the mathematical formulations linear, which allowed us to biologically interpret the results with mathematically rigorous analyses. We also remark that our graph representation of proteome does not rely on existing knowledge on the underlying interaction and regulatory networks of proteins and is based only on final expression levels of the proteins. Therefore, the results are robust against the uncertainty of underlying molecular detail.

1.6 Relevance of centrality of cosine similarity LE structure to biological functions

1.6.1 Centrality-essentiality correlation

As mentioned in the main text, centrality of protein species with regard to stoichiometry conservation correlates with gene essentiality (Figure 5B). We analyzed the proteome data from all the 22 conditions reported by [1] in Figure 5B. Interestingly, the centrality-essentiality correlation becomes weaker when the analysis was conducted with the data from fewer conditions (Figure S6A).

We obtained the list of essential genes of E. coli from EcoCyc [23] on September 23rd, 2020. The list contained 318 essential genes in total. The essentiality of the genes in this list was determined on the basis of whether single-gene knockouts of BW25113 (Keio collection) could grow under LB condition at 37 °C [6].

We also confirmed centrality-essentiality correlation for S. pombe transcriptome data [4] (Figure S6B, see 3.2). For this analysis, we downloaded the list of essential genes of S. pombe from PomBase [24] on May 13th, 2022. The list contained 1221 essential genes in total. Here, the essentiality data by PomBase was based on the Fission Yeast Phenotype Ontology terms “inviable vegetative cell population” (FYPO:0002061) and “viable vegetative cell population” (FYPO:0002060) [25]. Note that in our S. pombe essentiality analysis, we focused only on coding genes, whereas the cosine similarity LE structure was calculated using both coding and non-coding genes. See 1.6.3 and Figure S6C for the proportion of coding genes in each bin in Figure S6B. Eleven coding genes in the S. pombe transcriptome data were not found in current PomBase. Thus, some bins do not show 100% in total in Figure S6B.

Stoichiometry conservation centrality in human cells was evaluated using two kinds of H. sapiens transcriptome data: (i) human cell atlas data reported in [26] (Figure 5F) and (ii) genome-wide Perturb-seq data reported in [27] (Figure 5H).

The human cell atlas data [26] contain gene expression profiles in cells from 15 fetal organs. To calculate stoichiometry conservation centrality from the human cell atlas, we analyzed the pseudobulk data (GSE156793_S4_gene_expression_tissue.txt provided at the Gene Expression Omnibus). We calculated stoichiometry conservation centrality value of each gene using expression level data of 53,908 genes that are expressed at least in one organ.

The Perturb-seq data we used are gene expression profiles in a chronic myeloid leukemia cell line K562 [27]. This dataset contains single-cell RNA sequencing data of genetically perturbed cells in which expression of targeted genes is inhibited by CRISPR interference. We analyzed the pseudobulk data (K562 gwps raw bulk 01.h5ad provided at Figshare) to calculate stoichiometry conservation centrality. We evaluated stoichiometry conservation centrality value of each gene using the expression data of all the 8,248 genes in the Perturb-seq data. We remark that this dataset did not contain genes that showed no expression under all the reported genetic perturbation conditions.

Human gene essentiality was determined by referring to another dataset reported in [28], in which fitness cost imposed by gene inactivation was evaluated by a CRISPR-based method [28]. The fitness cost was quantified by an index called CRISPR score; genes with lower CRISPR scores are considered more essential [28]. We used the CRISPR scores calculated with a human chronic myelogenous leukemia cell line KBM7.

The CRISPR scores of 16,996 genes and 7,462 genes were found in [28] among the genes whose stoichiometry conservation centrality was evaluated using the human cell atlas data [26] and the Perturb-seq data [27], respectively. We evaluated the correlations between stoichiometry conservation centrality and gene essentiality (CRISPR scores) for these common genes in Figure 5F and 5H. The correlations were examined with Brunner– Munzel test [29] using R (version 4.0.3) and “brunnermunzel” package (version 2.0) [30].

1.6.2 Centrality-evolutionary conservation correlation

As mentioned in the main text, centrality of proteins with regard to expression stoichiometry conservation weakly correlates with evolutionary conservation represented by the number of orthologs based on protein sequences (Figure 5C). In Figure 5C, we analyzed the proteome data from all the 22 conditions reported in [1]. We also confirmed the relation for the E. coli proteome data from fewer conditions which we had used for our Raman-proteome correspondence analyses (Figure S6D).

We obtained the ortholog data from OrthoMCL-DB [31] (release 6.12). We used the number of orthologs in all of the “Core species” and the “Peripheral species” of OrthoMCL, which are across the three domains (Bacteria, Archaea, and Eukaryota), as a proxy for evolutionary conservation of each protein. To examine the correlation, we performed Brunner–Munzel test [29] using R (version 4.0.3) and “brunnermunzel” package (version 2.0) [30]. The E. coli proteome data contain 15 proteins with IDs which were not found in OrthoMCL-DB for technical reasons such as changes in IDs in the past and thus, we manually processed these 15 proteins.

We also examined S. pombe transcriptome data [4] (Figures S6E–G, see 3.2). We obtained ortholog data from OrthoMCL-DB [31] (release 6.12). The S. pombe transcriptome data have 11 coding genes which were not found in both current PomBase and OrthoMCL-DB, and two coding genes which were found in PomBase but not in OrthoMCL-DB. TheS. pombe transcriptome data contain not only coding genes but also non-coding genes, and we obtained the cosine similarity LE structure using both.

We also evaluated stoichiometry conservation-evolutionary conservation correlation using the human cell atlas data [26] (Figure 5G) and the genome-wide Perturb-seq data [27] (Figure 5I). Ortholog data for these analyses were obtained from OrthoMCL-DB (release 6.20). We found the ortholog data in OrthoMCL-DB for 18,959 genes among the 53,908 genes with stoichiometry conservation centrality evaluated with the human cell atlas data. We remark that 98.7% of the 18,959 genes were classified as coding genes in the human cell atlas data. We also found the ortholog data for 7,957 genes among the 8,248 genes with stoichiometry conservation centrality evaluated with the Perturb-seq data. The correlations were examined with Brunner–Munzel test [29] using R (version 4.0.3) and “brunnermunzel” package (version 2.0) [30].

1.6.3 Centrality-coding/non-coding correlation

As mentioned in the main text and 1.6.1, centrality of genes with regard to stoichiometry conservation clearly correlates with coding/non-coding classification of genes in S. pombe. We observed this trend using S. pombe transcriptome data [4] (Figure S6C). The coding/non-coding assignment of each gene is based on PomBase [24] data downloaded on October 11th, 2022.

We observed a comparable correlation even in the human cell atlas data (Figure 5E). The gene type assignment is based on the human cell atlas data. Note that almost all the genes in the Perturb-seq data were coding genes.

1.7 Global omics structures characterized by Raman-omics correspondences

1.7.1 Notation

Let denote the j-th row in B (see Equtation (1.2)). It is an m-dimensional row vector whose components represent coefficients of protein j. The first component is the constant term, and the i-th component is the coefficient for LDA(i− 1) Raman. Below we consider the coefficients normalized with the constant terms,

1.7.2 Raman-proteome correspondence matrix as a low-dimensional representation of proteome changes

We asked whether the stoichiometry conservation structure of the proteomes revealed by LE (Figures 5K, 5L and 6D) is relevant to the low-dimensional Raman LDA space. To address this, we focused on a proteome low-dimensional structure specified by the Raman-proteome coefficients, motivated by the fact that the analysis of B led to the discovery of a proteome fraction that conserves mutual stoichiometry (Figure 3). We considered a space where represents the coordinate of each protein. From (1.6) or (1.7), protein species whose abundance ratios remain constant have an identical coordinate in this normalized coefficient space. The proteome in this normalized coefficient space is shown in Figures 6A and 6C.

This structure (Figures 6A and 6C) is constructed by using the Raman LDA axes (dual basis) and is different from the cosine similarity LE structure (Figures 5K, 5L and 6D), which is independent of Raman information. Therefore, it is nontrivial that these two structures are similar. This similarity suggests that differences in cellular Raman spectra captured by LDA might be quantitatively related to the omics structure deduced from stoichiometry-conserving relations. We will mathematically analyze this similarity in the section 2.

1.8 Evaluating similarity between orthogonal matrix Θ and identity matrix

As we see in 2.1.5, an orthogonal matrix Θ that appears in the relation connecting the two types of proteome structure must be close to an identity matrix to guarantee the structural similarity. To evaluate to what extent Θ is close to an identity matrix, we generated many random orthogonal matrices [32] and compared Θ and the identity matrix with them.

We first multiplied each orthogonal matrix by itself in the sense of Hadamard product (element-wise product). Then, we regarded the resultant matrix as a scatter plot (Figure S9B) and calculated its Pearson correlation coefficient, assuming that (i, j) element was the frequency of “data points” at the coordinate (i, j). The obtained Pearson correlation coefficient can be regarded as a measure of closeness to the identity matrix. In the case of identity matrix, the correlation coefficient takes the maximum value, one, because non-zero values are concentrated on the diagonal part.

We calculated the square of each matrix in the sense of Hadamard product for two reasons. First, since all elements of the resultant matrix are non-negative, one can ensure that the number of “points” is non-negative at any coordinate. Note that it is not necessarily an integer here. Second, the sum of all the elements of the resultant matrix is necessarily m. Thus, the total number of “points” is equally m for any m× m orthogonal matrices compared.

In addition to this method, we also evaluated the closeness of Θ to the identity matrix (i) by comparing the magnitudes of off-diagonal elements among Θ, the identity matrix, and random orthogonal matrices, and (ii) by comparing the magnitudes of elements of leading principal submatrices among Θ, the identity matrix, and random orthogonal matrices. In (i), from a part consisting of (m− 1)- and −(m− 1)-diagonals ((1, m) and (m, 1) elements) to the whole matrix, we expand step by step the area to consider by including i- and −i-diagonals (m − 1 ≥ i ≥ 0, the final step is inclusion of the main diagonal), and calculated the sum of the square of the elements in the area at each step. In (ii), from the smallest leading principal submatrix ((1, 1) element) to the whole matrix, we expand the area to consider step by step, and calculated the sum of the square of the elements in the area at each step. See also schematic diagrams in the supplementary figures (e.g. Figures S9D and S9E).

1.9 Interpretation of L1 norm/L2 norm ratio of an expression vector as a quantitative measure of expression generality

In 2.1.5, we will also see that even if Θ is close to the identity matrix, there is another condition which must be met to guarantee the similarity of the two types of proteome structure. By considering the mathematics behind the condition, we will reveal that the two indices, stoichiometry conservation centrality (degree) and expression generality score gj = ‖ pj1/ ‖pj2 must be mutually proportional. Here, we explain why gj is a quantitative measure of the generality (or constancy) of expression levels.

First, we note that ratio ‖pj1 /‖pj2 is independent of the magnitude of the expression vector pj. In other words, normalization does not affect the ratio;

On the basis of this, we only consider normalized expression vectors ‖ pj ‖ / ‖ pj2 without loss of generality.

By definition, L2 norm of a normalized expression vector (the denominator of the most left-hand side of (1.16)) equals to one. Thus, the ratio we are considering equals to the L1 norm of the normalized expression vector;

Here, we write

Where . Then,

Note that holds because of normalization. Therefore, any normalized expression vector corresponds to a point on the first orthant division of the unit (m − 1)- sphere .

Next we consider a hyperplane which passes through the point (1.18). Since all the coefficients of this hyperplane are equal, all the m intercepts are also equal. The intercept value is , thus equals to the ratio of (1.19). In other words, the ratio (1.19) appears as an intercept of the hyperplane passing through the point corresponding to the normalized vector pjpj /‖2 with all the coefficients equal to one.

By simple calculation, one can see that the two surfaces and intersect when . In other words, holds. The intercept value takes the maximum when the normalized expression vector points to the “center” of the first orthant division of the unit (m − 1)-sphere, i.e., when

This means that the expression level is even and constant across the conditions. When this evenness of expression level breaks, the intercept value k decreases, and it attains the minimum k = 1 when the normalized expression vector overlaps with an axis, i.e., when

which corresponds to a completely “condition-specific expression pattern” (µ is the condition’s index).

See Figures S8A and S8B for a graphical explanation of the argument for the two- and three-dimensional cases.

1.10 Proteome structure obtained with PCA

As mentioned in the main text, we confirmed that PCA could find a proteome structure (Figure S6H) similar to the cosine similarity LE structure (Figures 5L and 6D). Since cosine similarity of expression vectors is inner product of the L2-normalized expression vectors, we also performed L2 normalization of proteome data before applying PCA in this PCA analysis. In other words, we applied PCA to a normalized proteome data [p1/p12pn/pn2](See 2.1.5).

We remark that, despite the structural similarity between the PCA structure and the cosine similarity LE structure, cosine similarity LE has an advantage over PCA in that the relative proximity of positions reflects the strength of stoichiometry conservation between each element. In addition, as shown in the main text and the section 2, cosine similarity LE of omics data has a direct quantitative connection to cellular Raman spectra, which is not the case for PCA.

2 Mathematical analysis and details

To clarify what is non-trivial in the correspondence between LDA Raman and cosine similarity LE proteome, here we derive rigorous mathematical relations for the correspondence through linear algebraic calculation.

2.1 Mathematics behind the correspondence

2.1.1 Notations

We use following notations:

  • 1x denotes an x-dimensional column vector of ones, and 0x does an x-dimensional zero column vector.

  • A vector without a hat (e.g. x) is a column vector, and a vector with a hat (e.g. ) is a row vector.

  • I denotes an identity matrix, and O does a zero matrix.

  • For a square matrix X = (xij), diag(X) = (δijxij), and for a vector x = (xi), diag(x) = (δijxi), where δij is the Kronecker delta.

  • For a matrix X, X [i, :] denotes the i-th row of X, and X [:, j] denotes the j-th column of X.

2.1.2 Preparations

Let l be the number of cells in each condition, m be the number of conditions, n be the original dimension of proteome, i.e. the number of protein species in the proteome data, and s be the original dimension of Raman spectra after the application of the Savitzky-Golay filter. In our main data, l = 38,m = 15,n = 2058, and s = 599.

Raman Spectra
Original preprocessed Raman data

Let

be a preprocessed Raman spectrum from cell j under condition i (See 1.1.3). The prime denotes that the variable is the original preprocessed data. The are collected in an l × s matrix:

Combining from different conditions, one can define an (lm) × s matrix

which contains all the preprocessed Raman data.

PCA and centering

Before LDA, principal component analysis (PCA) was first applied to the preprocessed Raman data to reduce noise. The covariance matrix is

which is positive semi-definite. PCA is formulated as the following eigenvalue problem of :

where ΛPCA is a diagonal matrix with the eigenvalues of as its diagonal elements in decreasing order from the upper left, and VPCA is an orthogonal matrix consisting of the eigenvectors of as its columns. Using the first s (1 ≤ ss) columns of , i.e. the columns corresponding the first s largest eigenvalues, we obtain an (lm) × s matrix representing the post PCA data

Here, wk is the k-th PCA coefficient vector, and s is the reduced dimension of the Raman spectra. The subtraction of the is for centering the data. In our case, the top 218 principal components (i.e., s = 218) explaining 98% of the variance were used to reduce noise and dimensionality.

Let be the post PCA Raman spectrum from cell j under condition i, i.e.,

and Xi be the collection of ; i.e.,

Then, X is written as

From (2.6),

namely, for any k,

which means that the post PCA data is centered.

Population average in each condition

Let be the population average of the post PCA spectra of cells under condition i. Then,

Also,

where

We define an m × s matrix

Each row of corresponds to a condition. From (2.11) and (2.14), for any k,

Namely,

These relations mean that is also centered.

LDA

The within-class covariance matrix is

Here, (2.12) was used. The between-class covariance matrix is

Here, (2.17) was used. Assume rank(CI) = s and rank(CB) = m − 1 (the maximum possible values). In fact, rank(CI) = s and rank(CB) = m − 1 in our data. Note that s > m − 1. From the definitions of CI and CB above, both are positive semi-definite. LDA is formulated as the following generalized eigenvalue problem:

where is a diagonal matrix, and VLDA is an s × (m − 1) matrix that simultaneously diagonize CB and CI (to ΛLDA):

Here, the diagonal elements in were in decreasing order from the upper left. In our analysis, the columns of VLDA were normalized.

Using VLDA, we obtain an m × (m − 1) matrix representing the post LDA data

Each row of R represents a dimension-reduced Raman spectrum of each condition. Let us write the h-th (1 ≤ hm − 1) column of R as

Then,

Transforming RR gives

Therefore, RR is a diagonal matrix, and rh are orthogonal to each other. As all the diagonal elements of the diagonal matrix ΛLDA is positive, rank(R) = m−1. Furthermore,

Namely, for any h,

This means that, as the data is centered, all the columns of R is perpendicular to 1m.

Proteome

Let

be the absolute abundances of the j-th protein. pj are collected in an m × n matrix:

pij is the absolute abundance of the j-th protein in condition i.

We assume rank(P) = m, i.e., proteome vectors for different conditions are linear independent. Actually, rank(P) = m in our data. We also assume that proteins with zero expression in all the m conditions had been excluded from the proteome data.

2.1.3 Linear transformation between LDA Raman and proteome

We define an m × m matrix

We denote the first column of RE as r0. Hence,

From (2.29) and (2.31),

Therefore, (RE) RE is also a diagonal matrix, and RE has full rank. For convenience, we write

Here, we consider singular value decomposition (SVD) of RE; i.e.,

where is a diagonal matrix whose diagonal elements are the singular values of RE, and .

We can set VR = I in the following way. Let

Then,

Thus, SVD of RE can be written as

As is the eigenvalue matrix of (RE) RE,

and

Now, we consider linear transformation between P and RE. We introduce the n × m coefficient matrix BE = [b0bm−1] that connects P and RE as

RE has full rank and is therefore invertible. Thus, BE is obtained by

From the viewpoint of linear regression, b0 can be regarded as the constant terms and bh (1 ≤ hm − 1) is the coefficients for the h-th LDA dimension.

We can rewrite P using the row vectors in RE and BE. Writing the i-th (1 ≤ im) row of RE as ,

Likewise, writing the j-th (1 ≤ jn) row of BE as ,

Then, (2.46) can be written in another way;

The interpretation of each vector is summarized in Table S9.

Interpretations of , and

2.1.4 Relation between LDA Raman and ΩB

Here, we discuss the spatial correspondence between the Raman distribution in LDA space and the normalized Raman-proteome coefficient proteome structure (Figures 6B and 6C).

Connecting LDA Raman and Raman-omics transformation coefficients

Let us consider (RE) RE (BE) in two ways. In the first approach,

In the second approach,

Comparing the two calculations yields

This is equivalent to

for any protein j.

Normalization with constant terms

Next we consider the normalization of the matrices in (2.58) with constant terms. We first define the normalized coefficient matrix as

This is normalization of the coefficients by the constant terms. Furthermore, we normalize as

Thus, the right-hand side of (2.58) is

Since the first row of (RE) is (lm) , rewrite the left-hand side of (2.58) as

Here,

Therefore,

This relation indicates that the constant term for each protein is its average abundance. Consequently,

Therefore, we obtain

Equivalently,

for any protein j. This means that the normalized coefficients of each protein are mainly determined by the weighted averages of the Raman vectors, where the weight is the abundances of the protein.

As we already saw in (1.6) or (1.7), this equation also shows that protein pairs whose abundance ratio remains constant over all the conditions have identical normalized coefficients.

Special case — condition-specific protein

Consider an imaginary condition-specific protein γ whose abundance is c (> 0) under condition Γ and zero under the other conditions; i.e.,

From (2.59),

which indicates that the LDA Raman of condition , and Raman-proteome coefficients for Γ-specific protein , are in the same orthant. The normalized version is obtained by dividing the both sides by the first row (or from (2.72));

which shows that the LDA Raman of condition , and the normalized Raman-proteome coefficients of Γ-specific protein , are in the same orthant.

Application to main data

The LDA Raman distribution shown in Figure 6B corresponds to (scatter plots between different columns of RE). On the other hand, the normalized coefficient proteome structure in Figure 6C is the scatter plots between different columns of . The above linear algebra explains the correspondence between the two. In addition, from (2.72), one can understand that the homeostatic core distributes around the center of the structure because its member proteins are expressed in all the conditions.

The relation (2.76) was obtained by considering an imaginary protein whose expression levels were zero under all conditions except for one condition. To confirm this relation with actual data, we picked an almost-condition-specific protein (PaaE, highly expressed in LB condition) and a non-condition-specific protein (AcrR), and confirmed that the former approximately satisfied (2.76), while the relation did not hold for the latter (Figure S10).

2.1.5 Relation between ΩB and ΩLE

Here, we discuss the spatial correspondence between the normalized Raman-proteome coefficient proteome structure and the cosine similarity LE proteome structure (Figures 6C and 6D).

Cosine similarity LE proteome structure

Consider an undirected graph where each node corresponds to one type of protein. As previously explained, let the graph be a complete graph, namely every pair of nodes is connected by an edge. Each edge is weighted with cosine similarity between the two types of protein connected by the edge. Cosine similarity of protein i and protein j is given by

where pi · pj (= (pi) pj) is the inner product of pi and pj, and is the L2-norm (Euclidean norm) of pi. Cosine similarity evaluates how similar the expression patterns are between protein i and protein j. When the abundance ratio between protein i and protein j remains constant over all the m conditions, takes the maximum value 1.

The adjacency matrix of this graph is given by

and the degree matrix of this graph is

For simplicity, diagonal element (i, i) of A is cos for any protein i, i.e., each node has a loop. A is n × n and real symmetric and D is n × n and diagonal. Then, the Laplacian matrix is given by

which is an n× n symmetric matrix, and the symmetric normalized Laplacian is given by

which is an n × n symmetric matrix.

Here we define by normalizing the columns of P = [p1pn];

By using is rewritten as

Consider an eigenproblem

where Λ is an n × n diagonal matrix in which the eigenvalues of Lsym are arranged in increasing order from the upperleft, and columns of Vsym are the normalized eigenvectors of Lsym corresponding to the eigenvalues. Denote

Here, Lsym has the following four characteristics:

  • Lsym is positive semi-definite. See, for example, [20] for the proof.

  • In an undirected graph with non-negative weights, the number of separated graph components equals the multiplicity of the eigenvalue zero of Lsym. See, for example,

  • [20] for the proof. Since our proteome graph is connected, our Lsym has the single eigenvalue zero.

  • From (2.86), . Here, it is obvious that diag(PP)−1/2 has full rank, hence, . Therefore, rank(A) = m. Obviously, D has full rank by definition and thus, rank(D−1/2AD−1/2) = rank(A) = m. Therefore, D−1/2AD−1/2 has nm singular values of zero. Since D−1/2AD−1/2 is symmetric, its singular values and eigenvalues are the same. Thus, D−1/2AD−1/2 has nm singular values of zero. Therefore, Lsym(= ID−1/2AD−1/2) has nm singular values of 1(= 1 − 0).

  • For any n-dimensional vector . Therefore, D−1/2AD−1/2 is positive semi-definite, and all of the eigenvalues of Lsym(= ID−1/2AD−1/2) are less than or equal to one.

By these four points, we see that the eigenvalues of Lsym(= ID−1/2AD−1/2) satisfy

Now we define an m × m matrix as

Then, we can write

Let be the first m columns of Vsym;

The truncated version of the eigenproblem is

is the i-th row of and provides a new m-dimensional representation of the i-th protein. This representation of proteome reflects distance be-tween each protein pair in terms of cosine similarity.

Let

Then, this eigenproblem can also be regarded as the following generalized eigenproblem,

Remembering that Lrw = D−1L, we can further transform it into an eigneproblem

This is the form of eigenproblem that we explained previously in 1.5. In this section, we discuss it later.

The eigenproblem (2.87) can be transformed to

where

This means that the columns of Vsym are also the (normalized) eignenvectors of D−1/2AD−1/2, and M is the corresponding eigenvalue matrix. Defining an m × m matrix as

we can write

Note that

(2.97) is further transformed into

Comparing (2.102) and (2.85) leads to

Connecting Raman-proteome transformation coefficients and cosine similarity LE proteome

We consider PP in two ways. First, from (2.103),

Since

the diagonal elements of , i.e., the positive eigenvalues of D−1/2AD−1/2, are the square of the singular values of . Compact SVD of is expressed as

where ΣLE is an m×m diagonal matrix whose diagonal elements are the singular values in decreasing order from the upper left, and (ULE) ULE = (VLE) VLE = I. We then obtain

Thus,

On the other hand, from (2.44) and (2.46),

Therefore, comparing (2.109) and (2.112) yields

where Θ is an m × m orthogonal matrix. We define the estimate of BE as

By this notation, (2.113) can be written as

Here, the left-hand side represents Raman-proteome linear transformation, whereas the right-hand side except for Θ is derived only from proteome data. Note that in order to derive (2.113), LDA does not need to be applied to Raman data because PP can be written in the form of even if LDA is not applied to Raman data.

Normalization with constant terms

Now we consider normalizing both sides of (2.115) by the first columns.

With (2.60) and (2.61), the left-hand side of (2.115) can be rewritten as

Likewise, for the right-hand side, the first column of (the estimated constant term) is

The first column of is the normalized eigenvector corresponding to the eigenvalue zero of Lsym.

By the definition of L, L1n = D1nA1n = diag(D) − diag(D) = 0n. Hence, in general, L has an eigenvalue zero and a corresponding eigenvector 1n. Therefore, by the definition of Lrw, Lrw1n = D−1L1n = 0n; Lrw also has an eigenvalue zero and a corresponding eigenvector 1n. The eigenproblems and are equivalent because one can obtain the eigenproblem of Lsym by left multiplying both sides of the eigenproblem of Lrw by D1/2 and that of Lrw by left multiplying both sides of the eigenproblem of Lsym by D−1/2. Eigenvalues of Lrw and Lsym are identical, and holds for their eigenvectors. Therefore, Lsym has an eigenvalue zero and a corresponding eigenvector D1/21n. The multiplicity of the eigenvalue zero of our Lsym is one, and its corresponding eigenvector is limited to D1/21n.

By writing D as

Thus, (2.117) can be further transformed as

Remembering that both diag PP and D are diagonal, we obtain

Therefore, from (2.114) and (2.122), the “estimated coefficients” normalized with the “estimated constants” are

We remark that the eigenproblem of Lrw, i.e.,

is equivalent to solving a minimization problem

Thus, the closer pi and pj are in terms of cosine similarity, the closer and (the i-th and j-th rows of , respectively).

The relation between the minimization problem and the eigenproblem is the following. The objective function of the minimization problem is

Here,

Therefore, (2.130) can be transformed into

Thus, the minimization problem is

This can be transformed into the generalized eigenproblem by the method of Lagrange multipliers [33].

Remember that this property of and is analogous to that of and (the i-th and j-th rows of , respectively) as explained in 2.1.4 and 1.7.2.

By considering the first (upper-left) element of ΣLE is one, the right side of (2.113) can be transformed into

Therefore, from (2.115), (2.116), (2.136), and (2.137),

This is the equation that links the normalized Raman-proteome coefficient proteome structure and the cosine similarity LE proteome structure.

Mathematical interpretation of the obtained equation

From (2.139), if the distributions of and are similar, the diagonal matrix Θ must be similar to the identity matrix because large off-diagonal elements of Θ makes lower dimensions “mix” much with the higher dimensions. In addition, the directions of diagonal matrices diag (b) and diag PP 1/2 D must also be close to each other even if Θ is close to the identity matrix. Note that the first column of Θ also reflects the relation between b0 and .

The obtained relation between Raman-proteome normalized coefficient structure and cosine similarity LE structure is summarized in Table S10.

Note that the relation between and ΣLE can change depending on normalization of VLDA. However, the difference is not important for the spatial correspondence between the two structures because they only affect scaling of the axes. Rather, and , which determine the order of columns of VLDA and , are important.

Mathematical relation between Raman-proteome coefficients and cosine similarity LE proteomes.

The matrices in the left-hand side of (2.138) (a proteome structure based on Raman-proteome coefficients) and their counterparts in the right-hand side of (2.138) (a proteome structure obtained with cosine similarity LE) are listed.

Application to main data

The normalized coefficient proteome structure in Figures 6A and 6C are the scatter plots between different columns of . The cosine similarity proteome structures in Figures 5L and 6D are the scatter plots between different columns of . In our data analysis, we calculated as , where each column of was normalized. On the basis of the results of the mathematical analysis, we compared and in Figure S9G.

The similarity between the projections of the two distinct omics structures onto low-dimensional subspaces suggests that Θ is close to the identity matrix ((2.138) and Table S10). Figures S9A and S9C–S9E show that the actual Θ is indeed significantly close to the identity matrix. This suggests that the major changes in cellular Raman spectra detectable by LDA reflect the major changes in the proteome characterized by LE based on stoichiometry balance (cosine similarity).

The structural similarity also suggests that directions of diag (b) and diag PP 1/2 D are also similar ((2.138) and Table S10). In fact, Figure S9F confirmed good agreement between m1/2 diag (b0) and . Since these two quantities are calculated only from proteome data, this agreement is a characteristics of proteome data. See 2.2 for further analyses and discussion on this point.

2.2 Quantitative constraint on omics profiles

2.2.1 From agreement between and to proportionality between L1 norm/L2 norm ratio and degree

We observed above that constant terms b0 and the estimated constant terms were strongly correlated (Figure S9F). It is of note that both b0 and can be calculated only from omics data. Specifically, from (2.69) and (2.121),

Here, pj 1 and pj 2 are the L1 and L2 norms of pj and reflect only the expression property of protein j. On the other hand, the degree dj is a measure for the relationships of protein j with the other proteins because dj is the sum of cosine similarities, .

The observed relation

(Figure S9F) indicates that a proportionality relation

must hold approximately.

As mentioned in the main text, we refer the ratio of L1 norm to L2 norm in the left-hand side of (2.143) as expression generality score (gj) because it can be interpreted as a measure of constancy and generality of the expression levels of the protein (See 1.9 and Figures S8A and S8B). When a protein is perfectly condition-specific and expressed only in a particular condition, its L1 norm equals its L2 norm, and the ratio takes the minimum value one. When a protein is expressed equally across all the conditions, its L1 norm is greater than its L2 norm and the ratio takes the maximum value, the square root of the number of conditions (Figures S8A and S8B). On the other hand, the right-hand side of (2.143), which we refer as stoichiometry conservation centrality (dj) in the main text, measures to what extent protein j conserves its stoichiometry with the other proteins. Therefore, the proportionality relation (2.143) suggests a global quantitative constraint between condition-specificity of expression patterns and stoichiometry conservation strength. Positions of SCGs and density of genes in the cosine similarity LE structure (Figures 5A, 5K, and 5L) already suggested that genes with less condition-specific expression patterns have more genes with stoichiometrically similar expression patterns, and the proportionality here quantitatively captures this property of omics dynamics. We remark that the proportionality relation was confirmed in all the omics data we analyzed in this paper (See 3.2 and Figures S7I–S7N).

2.2.2 Mathematics behind proportionality between stoichiometry conservation centrality dj and expression generality score gj

The cosine similarity-based analyses involve normalization of expression vectors pj by its L2 norm pj 2 (2.77). Normalized expression vectors pjpj 2 represent points on the first orthant division of a unit (m − 1)-sphere in a m-dimensional space . L2 normalization allows us to compare expression patterns without considering expression magnitudes, and an expression pattern is represented by a position on the unit (m − 1)-sphere. Therefore, our cosine similarity-based quantification of stoichiometric balance in omics data is equivalent to evaluating distances (measured with angle) between points on the (m − 1)-sphere.

Since stoichiometry conservation centrality dj is the sum of cosine similarities,

Defining

we obtain

Note that . The last term is the normalized vector of the sum of all the normalized expression vectors, which we refer to as “expression-pattern norm vector.” Therefore, (2.147) means that stoichiometry conservation centrality is proportional to the cosine of the angle between the expression-pattern norm vector and protein j’s expression pattern pj / pj 2 . The more distant the expression pattern of a protein is from that specified by the expression-pattern norm vector, the smaller dj is.

On the other hand, expression generality score gj is

Therefore,

The last term is the normalized vector of 1m, corresponding to the “center” of the first orthant division of the unit (m − 1)-sphere. In other words, represents “perfectly-even expression pattern” across conditions. Therefore, (2.150) means that the expression generality score gj = pj1/ pj 2 is proportional to the cosine of the angle between the “perfectly-even expression pattern” and protein j’s expression pattern pj / pj2 . The more distant the expression pattern of a protein is from the “perfectly-even expression pattern”, the smaller the expression generality score is.

Comparing (2.147) and (2.150), we see that if the expression-pattern norm vector and the perfectly-even expression pattern are equal, a proportional relationship

holds. Note that this is equivalent to .

Conversely, if deviates from , the proportional relation between dj and gj breaks. We found that the proteome data by Schmidt et al.[1] showed . Instead, we found that the values of the elements of increased approximately linearly with the population growth rates under corresponding conditions (Figure S8D). Such a strong positive correlation between the elements of and the population growth rates is nontrivial and suggests a new growth law constraining the total of relative expression level changes of all the proteins.

Next, we consider the consequence of this positive correlation between the elements of and the population growth rates. Let us consider the proteins whose expression generality score gj = pj1/pj 2 takes the minimum value one. Namely, the expression of these proteins are completely condition-specific (Figures S8A and S8B). For such proteins, only one component of the expression pattern vector pj /pj2 is one, and the other components are zero. Thus, from (2.147), their stoichiometry conservation centrality dj becomes proportional to the elements of corresponding to the conditions under which they are expressed. Since the values of the elements of are positively correlated with the population growth rates, dj of completely condition-specific proteins also exhibits a positive correlation with the growth rates under the conditions accompanying their expression (Figure S8C). Such correlation can be confirmed for the proteins with nearly condition-specific expression patterns (PaaE, Asr, and DgoA in Figures 7B and 7C).

More generally, the deviation of dj from the perfect proportionality line can be understood by the relation

which can be derived from (2.147) and (2.150). Note that the values of the elements of the last term also increase with the population growth rates, being positive under fast growth conditions and negative under slow growth conditions (Figure S8D). Therefore, when protein j tends to be expressed higher under fast growth conditions, i.e., when the elements of pj / pj2 corresponding to the fast growth conditions are relatively larger than those corresponding to the slow growth conditions, the left-hand side of (2.152) becomes positive, and its dj resides above the perfect proportionality line. On the other hand, when protein j tends to be expressed higher under slow growth conditions, its dj resides below the perfect proportionality line.

In the gj-dj plot in Figures 7A and 7B, we find several stretches of protein clusters above and below the perfect proportionality line. As expected from the argument above, each cluster corresponds to a group of proteins with similar expression patterns, and their positions relative to the proportionality line characterize the condition under which they are expressed the most (Figure 7C).

In summary, visualizing omics data by using the stoichiometry conservation centrality dj and the expression generality score gj allows us to systematically characterize the condition-dependent expression pattern of each protein on the basis of its position in the plot. Interestingly, this systematic characterization of gene expression patterns (the relation between dj and gj) was derived from our mathematical analyses of the correspondences between Raman and omics as we explained above.

3 Extended data analysis

3.1 Growth laws

3.1.1 Single-gene-level growth law

Bacterial growth law states that the total abundances of ribosomal components increase linearly with growth rate [34][35][36]. The homeostatic core (the largest SCG) identified in our analysis contains many ribosomal proteins. Hence, it is plausible that the total abundance of homeostatic core proteins also increases linearly with growth rate, which we indeed found true (Figure S5A). Furthermore, the abundance ratios of homeostatic core proteins are conserved across conditions. Therefore, the intracellular abundance of each protein species in the homeostatic core is expected to increase linearly with growth rate.

Let pϵj be the abundance of protein j in the homeostatic core in environment ϵ. Since this protein conserves the stoichiometry with the other homeostatic core proteins across conditions,

in any environments ϵ (αij is the environment-independent abundance ratio of the homeostatic core protein i to protein j).

Let Mϵ =∑ i pϵi = pϵji αij be the total abundance of homeostatic core proteins in environment ϵ. The growth law for the homeostatic core is

where gϵ is the growth rate in environment ϵ; a is the y-intercept; and b is the slope of the linear relation. Therefore,

This shows that the abundance of homeostatic core proteins satisfies single-gene-level growth law.

3.1.2 Extended verification of stoichiometry conservation

When the abundance ratios between protein i and protein j are conserved,

where c and s specify the environments (s signifies the standard environment). Hence,

where γc is the common abundance ratio of stoichiometry-conserving proteins with respect to the condition c. Note that γc is common among all the proteins in a stoichiometry-conserving group.

From (3.5),

Therefore, plotting log pcj against log psj should find the stoichiometry conserving proteins aligned on a straight line with a slope of 1. We indeed find such plots for the homeostatic core proteins (Figures S5B and S5C). This result confirms their stoichiometry conservation from a different perspective.

3.1.3 Linear dependence of common abundance ratio on growth rate

Since the total amount of homeostatic core proteins increase linearly with growth rate (Figure S5A and (3.2)),

Since ∑j pcj = γcj psj,

Hence,

where Ms = ∑j psj. Therefore, the common abundance ratio γc also increases linearly with growth rate.

Estimating Γc := log10 γc as the y-intercepts of the regression lines with a slope of 1 (Figure S5B), we confirmed this linear dependence of common abundance ratio of homeostatic core proteins on growth rate (Figure S5D).

3.2 Generality of the results

3.2.1 Additional datasets with Raman data

Correspondence of three types of space

The correspondences among LDA Raman, Raman-omimcs normalized coefficient omics structure and cosine similarity LE omics structure were also observed in other datasets. The datasets include Raman (this paper) and proteome [1] data of E. coli with different genotypes (BW25113, MG1655, and NCM3722) cultured in the “LB” medium (Figures S11A–S11E) and in the “Glucose” medium (Figures S11F–S11J), and Raman and transcriptome data of S. pombe cultured under 10 different environmental conditions (Figures S11K–S11O) [4].

Comparison of matrices obtained by mathematical analyses

In addition to the comparison of three types of space, we also examined the matrices on the basis of results of the mathematical analyses (Table S10) using the aforementioned additional datasets; the closeness of and (Figures S12F for Raman-proteome of E. coli with different genotypes cultured in “LB”, S12L for Raman-proteome of E. coli with different genotypes cultured in “Glucose”, and S12R for Raman-transcriptome of S. pombe cultured under in 10 environment conditions), the closeness of Θ to the identity matrix (Figures S12A–S12D for Raman-proteome of E. coli with different genotypes cultured in “LB”, S12G–S12J for Raman-proteome of E. coli with different genotypes cultured in “Glucose”, and S12M–S12P for Raman-transcriptome of S. pombe cultured under in 10 environment conditions), and the correspondence between and (Figures S12E for Raman-proteome of E. coli with different genotypes cultured in “LB”, S12K for Raman-proteome of E. coli with different genotypes cultured in “Glucose”, and S12Q for Raman-transcriptome of S. pombe cultured under in 10 environment conditions). We confirmed that the same results hold for these additional datasets. Note that the correspondence between and does not involve Raman data. It is an intrinsic property of the omics data.

Proportionality between expression generality and stoichiometry conservation centrality

The correspondence between and suggested the proportionality between expression generality score gj and stoichiometry conservation centrality dj in these omics data. We confirmed that the same results hold for these additional datasets (Figures S7I–N).

Correlation between and growth rates

For the proteome data of E. coli with different genotypes (BW25113, MG1655, and NCM3722) cultured in “LB” and in “Glucose”, growth rates were also reported [1]. We confirmed a positive correlation between the elements of and growth rates for these datasets (Figure S8E). See also the deviation from the proportionality line in the gj-dj plot (Figures S7I and S7J).

Biological relevance of centrality of cosine similarity LE structure

We also confirmed centrality-essentiality correlation and centrality-evolutionary conservation correlation in the S. pombe transcriptome data (Figures S6B and S6E–G).

Degree distribution and its destruction by randomization

Degree (stoichiometry conservation centrality) distributions of cosine similarity LE structure of the additional data sets also showed a similar pattern as the main data, and randomization of the omics data breaks the strong correlation of expression patterns in the actual data (Figures S7C, S7D, and S7H).

3.2.2 Additional datasets without Raman data

Examining proteome structures with cosine similarity LE does not require Raman data. Therefore, we additionally analyzed publicly available proteome data of M. tuberculosis and M. bovis under the growth conditions with distinct oxygen levels [2], and the proteome data of S. cerevisiae under various environmental conditions [3].

We characterized cosine similarity LE structures of these datasets (Figure S13). Furthermore, we confirmed the proportionality between expression generality score gj and stoichiometry conservation centrality dj (Figures S7K–M). For the proteome data of S. cerevisiae, growth rates were also reported [3]. The S. cerevisiae cells were cultured in chemostat at the same dilution rate in any condition. In fact, we observed little variation of (Figure S8F), which leads to little deviation from the proportionality line (Figure S7M).

Degree (stoichiometry conservation centrality) distributions of cosine similarity LE structure of the additional data sets also showed a similar pattern as the main data, and randomization of the omics data breaks the strong correlation of expression patterns in the actual data (Figures S7E–G).

4 Supplementary figures

Schematic illustration of the approach in this study. Related to Figure 1.

Raman spectra and gene expression profiles are both high-dimensional vectors and can be represented as points in high-dimensional spaces. Coarse-graining Raman spectra by dimensional reduction finds condition-dependent differences in their global spectral patterns (see Fig. 2). The dimension-reduced spectra were linked to and used to predict condition-dependent global gene expression profiles (see Fig. 2), which implies that global changes in spectral patterns detect differences in cellular physiological states. The analysis of this linkage led us to discover a stoichiometry-conserving constraint on gene expression, which enabled us to represent gene expression profiles in a functionally-relevant low-dimensional space ((i); see also Fig. 35). Then, we find a nontrivial correspondence between these low-dimensional Raman and gene expression spaces ((ii); see also Fig. 6). This correspondence provides an omics-level interpretation of global Raman spectral patterns and a quantitative constraint between expression generality and stoichiometry conservation centrality ((ii); see also Fig. S9 and Fig. 7).

Custom-built Raman microscope and analyses of E. coli Raman spectra. Related to Figure 2.

(A) Schematic diagram of the Raman microscope used in this study. (B) Representative Raman spectra from single E. coli cells. The fingerprint region of one spectrum is shown for each condition. (C) Linear superposition of Raman shifts. Each LDA axis is linear superposition of Raman shifts. These figures show the coefficients for LDA1 (left) and LDA2 (right). (D) Relationship between Raman LDA1 axis and growth rates. The horizontal axis represents Raman LDA1 axis. The vertical axis represents growth rates measured in [1]. Each point corresponds to the data for one condition. Pearson’s correlation coefficient is 0.81±0.09.

Estimation of proteomes from Raman spectra. Related to Figure 2.

Comparing the measured proteomes with those estimated from Raman spectra. The horizontal and vertical axes represent the estimated and measured proteomes, respectively. Proteins with negative estimated abundance are not shown in these figures. The conditions with the largest and the second largest numbers of proteins with negative estimated abundance were “stationary3days” (666 proteins) and “LB” (359 proteins). The conditions with the fewest and the second fewest negatively estimated proteins were “GlucosepH6” (0 proteins) and “Xylose” (7 proteins).

Comparison of stoichiometry conservation among COG classes. Related to Figure 3.

(A and B) Relations between protein abundance and constant terms of Raman-proteome coefficients. The horizontal axes are b0 (constant terms), and the vertical axes are (protein abundance). Dashed lines are the least squares regression lines with intercept zero for ISP COG class members. The average of was used as an estimate of B here. In (A), only ISP COG class members are shown for three representative conditions: “Galactose”, “Glucose”, and “GlycerolAA”. In (B), all proteins are shown for a representative condition, “GlycerolAA”. (C) Relations between protein abundance and growth rates of E. coli under 15 environmental conditions. We analyzed the absolute quantitative proteome data, growth rate data, and COG annotation reported by [1]. Lines represent different protein species. Error bars are standard errors. The top panel is for the Cellular Processes and Signaling COG class; the middle is for the Information Storage and Processing COG class; and the bottom is for the Metabolism COG class. (D) Relations between protein abundance and growth rates of three E. coli strains (BW25113, MG1655, and NCM3722) under two culture conditions. We again analyzed the data by [1]. Lines represent different protein species. Error bars are standard errors. (E and F) COG class-dependent expression pattern similarity of E. coli proteomes between conditions. The E. coli proteome data under the 15 different environmental conditions were analyzed. The similarity is evaluated by Pearson correlation coefficients of log expression levels in (E) and by cosine similarity in (F). We consider all the combinations of the 15 conditions. Thus, there are 105 data points for each COG class. The box-and-whisker plots summarize the distributions of the points. The lines inside the boxes denote the medians. The top and bottom edges of the boxes denote the 25th percentiles and 75th percentiles, respectively. Note that (E) and (F) are evaluation of the same data used in Figure 3B in the main text with different similarity indices. (G) COG class-dependent expression pattern similarity between different strains of E. coli (BW25113, MG1655, and NCM3722). The absolute quantitative proteome data and COG annotation were taken from [1]. The similarity was evaluated by cosine similarity. The data contain three strains. Thus, there are thee points for each COG class. The top panel is for the “Glucose” condition; and the bottom is for the “LB” condition. (H–J) COG class-dependent expression pattern similarity in other organisms. (H) is for M. tuberculosis (data from [2]; six environmental conditions (time points)), (I) for M. bovis (data from [2]; six environmental conditions (time points)), and (J) for S. cerevisiae (data from [3]; 10 environmental conditions). The COG annotations were taken from the December 2014 release of 2003-2014 COGs [37] and the Release 3 of “Mycobrowser” [38] for (H) and (I) and from the Comprehensive Sake Yeast Genome Database (S288C strain) [39] for (J). The unit for protein abundance was fg/cell for (H) and (I) and fg in pg dry cell weight for (J).

Single-gene level growth law in the homeostatic core. Related to Figure 4.

(A) Relationship between population growth rates and total abundance of SCG 1 (homeostatic core) proteins. Here, we analyzed the E. coli proteome data [1], focusing on the 15 conditions for which we obtained Raman data. The dashed line is the least squares regression line. (B) Scatterplots of log abundance of SCG 1 (homeostatic core) proteins. Here, the proteomes under three representative conditions, “LB”, “Glucose”, and “Galactose”, are compared with that under the standard condition “Glycerol.” Each colored line is the linear regression line with slope one for the points with the same color. The vertical line is x = 0. (C) Relationship between population growth rate and coefficient of determination of linear regression in (B). The vertical line represents the growth rate under the standard condition (“Glycerol”). (D) Linear relationship between common abundance ratio and growth rates. The vertical axis represents , where Γc is the y-intercepts in (B) (see 3.1.2). The dashed line is the linear regression line. The horizontal line is y = 1, and the x coordinate of the vertical line is the growth rate under the standard condition (“Glycerol”). (E) The gene loci of the proteins belonging to the condition-specific SCGs on the chromosome (ASM75055v1.46 [40]). Yellow dots are nodes (genes), and gray lines are edges (high cosine similarity relationships). The edge in the map of SCG 5 cannot be seen because their gene loci are clustered in close proximity in the same operon.

Functional relevance of stoichiometry conservation centrality. Related to Figure 5.

(A) Relationship between gene essentiality and stoichiometry conservation centrality in E. coli. The proportion of essential genes is plotted for each stoichiometry conservation centrality rank range. In this plot, we calculated stoichiometry conservation centrality based on the E. coli proteome data [1] under the 15 conditions for which we obtained Raman data. The list of essential genes was downloaded from EcoCyc [23]. (B) Relationship between gene essentiality and stoichiometry conservation centrality in S. pombe. We calculated stoichiometry conservation centrality based on the S. pombe transcriptome data reported in [4]. Only coding genes are considered in this plot, though stoichiometry conservation centrality values were calculated using both coding and non-coding genes. Gene classification is based on PomBase [24]. Some bins do not reach 100% in sum because eleven coding genes in the S. pombe transcriptome data were not found in the current PomBase. (C) Relationship between ratio of coding genes and stoichiometry conservation centrality in the S. pombe transcriptome data. The coding/non-coding assignment is based on PomBase [24]. (D) Correlation between stoichiometry conservation and evolutionary conservation. In this plot, we calculated stoichiometry conservation centrality based on the E. coli proteome data [1] under the 15 conditions for which we obtained Raman data. Colors represent the height of each bar. The distributions of stoichiometry conservation centrality were compared between the top 25% and the bottom 25% fractions in the number of orthologs rankings. The fraction with many orthologs tends to have higher stoichiometry conservation centrality (one-sided Brunner–Munzel test, p = 7.84 × 10−15). The distributions of the number of orthologs were compared between the top 25% and the bottom 25% stoichiometry conservation centrality fractions. The high centrality fraction tends to have more orthologs (one-sided Brunner–Munzel test, p = 1.46 × 10−11). Ortholog data were taken from OrthoMCL-DB [31]. (E–G) Correlation between stoichiometry conservation and evolutionary conservation in S. pombe. We calculated stoichiometry conservation centrality based on the S. pombe transcriptome data reported in [4]. In (E), the result is shown by two-dimensional histogram. Colors represent the height of each bar. The distributions of the number of orthologs were compared between the top 25% and the bottom 25% stoichiometry conservation centrality fractions. The high centrality fraction tends to have more orthologs (one-sided Brunner–Munzel test, p = 0.00548). The direct comparison between the two fractions is shown in (F). The distributions of stoichiometry conservation centrality were compared between the top 25% and the bottom 25% fractions in the number of orthologs rankings. The fraction with many orthologs tends to have higher stoichiometry conservation centrality (one-sided Brunner–Munzel test, p = 0.00270). The direct comparison between the two fractions is shown in (G). Ortholog data were taken from OrthoMCL-DB [31]. (H) Applying PCA to L2-normalized proteomes. PCA (with mean-centering) was applied to L2-normalized proteome data [p /ppn/p]2. Here, we analyzed the E. coli proteome data under the 15 conditions for which we obtained Raman data. The left is a projection onto a two-dimensional space; and the right is a projection onto a three-dimensional space. The axes for visualization were selected by considering similarity to the cosine similarity LE structure.

Distributions and constraints with respect to stoichiometry conservation centrality (degree). Related to Figures 5 and 7.

(A) Comparison of degree (stoichiometry conservation centrality) distributions between original (yellow) and randomized (blue) E. coli proteome data. We created randomized proteome data by shuffling the expression levels across the protein species within each condition. We used the E. coli proteome data [1] under the 15 conditions for which we obtained Raman data. (B) Comparison of the gj-dj relationships between original (yellow) and randomized data (blue). The horizontal axis is expression generality score (gj = L1 norm/L2 norm) and the vertical axis is stoichiometry conservation centrality (dj: degree). Each dot represents a protein species. The dashed lines are . The solid lines are . (C–H) Degree (stoichiometry conservation centrality) distributions for additional datasets. Yellow histograms are for the original data, and blue histograms are for the randomized data. (C) for the proteomes of three E. coli strains (BW25113, MG1655, and NCM3722) in LB [1]; (D) for the proteomes of the three E. coli strains in M9 Glucose [1]; (E) for the proteomes of M. tuberculosis [2]; (F) for the proteomes of M. bovis [2]; (G) for the proteomes of S. cerevisiae [3]; and (H) for the transcriptomes of S. pombe [4]. (I–N) gj-dj relationships for additional datasets. Each gray dots represent a protein species. The proteins belonging to the homeostatic core in each dataset are shown in magenta; those belonging to condition-specific SCGs are indicated in different colors in each plot. See the caption of Figures S11 and S13 for the cosine similarity threshold to specify the homeostatic core and the condition-specific SCGs in each dataset. The dashed lines are . The solid lines through the origins are . (I) for the proteomes of the three E. coli strains in LB [1]; (J) for the proteomes of the three E. coli strains in M9 Glucose [1]; (K) for the proteomes of M. tuberculosis [2]; (L) for the proteomes of M. bovis [2]; (M) for the proteomes of S. cerevisiae [3]; and (N) for the transcriptomes of S. pombe [4].

Properties of normalized expression vectors. Related to Figure 7.

(A and B) Schematic explanation for the interpretation of the L1 norm/L2 norm ratio of expression vectors as an index of expression generality. (A) is a two-dimensional case, and (B) is a three-dimensional case. The inset in (A) schematically explains L1 norm and L2 norm of an expression vector. See section 1.9 for detail. (C) Schematic explanation for deviations of points from the proportionality line in the gj-dj plots. Here, we consider four condition-specific protein species a, b, c, and d labeled in the descending order of growth rates under the conditions accompanying their expression. Note that their L1 norm/L2 norm ratios are all one on the horizontal axis. One can show that the degree (stoichiometry conservation centrality) dj is proportional to the inner product of L2-normalized expression vector pj pj and the expression norm vec-tor (see (2.147) in section 2.2.2). Since the elements of increase approximately linearly with growth rates of the corresponding conditions (see (D)), the degrees (stoichiometry conservation centrality values) decrease from a to d in the order of growth rates. (D–F) Correlation between elements of and population growth rates. The vertical axis represents the elements of and the horizontal axis represents the popu-lation growth rates. The dashed lines are . (D) is the result from the analysis of the E. coli proteome data [1] under the 15 conditions for which we obtained Raman data (m = 15). (E) is the result from the analysis of the proteome data of three strains of E. coli (BW25113, MG1655, and NCM3722) under “LB” and “Glucose” conditions (m = 6) [1]. (F) is the result from the analysis of the proteome data of S. cerevisiae under 10 different conditions (m = 10) [3]. The cells were cultured in chemostat with the same dilution rate. The numbers of analyzable protein species and the numbers of conditions were different between (D) and (E). Thus, the values of the vertical axes cannot be compared directly between them.

Mathematical analyses of the main Raman-proteome data. Related to Figure 6.

(A) Proteomes of E. coli under 15 conditions [1] and corresponding Raman data we measured in this study were analyzed in this figure. (B) Visual comparison of the unit matrix I, the orthogonal matrix Θ obtained from the data, and a random orthogonal matrix. Height of each bar indicates the value of each element. Colors represent the height of each bar. For clarifying the position of each element, a component form of matrix Θ is shown in the middle (m = 15). For Θ (middle) and a random orthogonal matrix (right), the original matrices are displayed in the upper row, and matrices whose elements are the absolute values of the corresponding elements of the original matrices are displayed in the lower row. (In this figure, |Θ| represents a matrix of which the (i, j) element is the absolute value of the (i, j) element of Θ.) (B) Representation of matrices as scatterplots. See section 1.8 for detail. (C) Comparison of the unit matrix I, the orthogonal matrix Θ obtained from the data, and random orthogonal matrices Q by Pearson correlation coefficients. Pearson correlation coefficient of the element-wise squared matrix of each matrix can be regarded as a measure of closeness to the identity matrix (∘ represents element-wise multiplication). The probability of finding a random orthogonal matrix Q with Pearson correlation coefficient greater than the Pearson correlation coefficient of Θ was < 1 × 10−5 (No occurrence in 105 samplings). See section 1.8 for detail. (D) Comparison of magnitudes of off-diagonal elements among the unit matrix I, the orthogonal matrix Θ obtained from the data, and random orthogonal matrices Q. The lattice on the top explains the numbering of k-diagonals (−m < k < m, m = 15). In the lattices on the bottom, black color indicates areas in which the elements are squared and summed at the corresponding steps (i.e., areas represented by x in the graph). The sum of the squared values in each step is shown in the middle graph. Error bars of the random matrix line are standard errors of 100 samplings. See section 1.8 for detail. (E) Comparison of magnitudes of elements of leading principal submatrices among the unit matrix I, the orthogonal matrix Θ obtained from the data, and random orthogonal matrices Q. In the lattices on the bottom, black color indicates an area in which elements are squared and summed at the corresponding step (i.e., an area represented by x in the graphs). The sum of the squared values in each area is shown in the top graph. The results shown in the top graph are converted into ratios to the identity matrix I and are shown in the middle graph. Error bars of the random matrix line are standard errors of 100 samplings. See section 1.8 for detail. (F) Comparison of and . x axis represents and y axis represents . The dashed line indicates y = x. (G) Comparison between (left) and (right). Note that while figure (left) is the same as Figure 6C, the right figure shows , where is shown in Figure 6D.

Orthant correspondences between Raman spectra in LDA space and condition-specific proteins in Raman-proteome coefficient proteome space. Related to Figure 6.

Using the main Raman and proteome data of E. coli under the 15 conditions, we examine the orthant correspondence between Raman spectra in the LDA space and condition-specific proteins in the Raman-proteome coefficient proteome space ΩB. Here, we focus on two proteins PaaE and AcrR. (A) Expression patterns of PaaE (left) and AcrR (right) across conditions. Error bars are standard errors. PaaE is expressed under the “LB” condition in a condition-specific manner, whereas AcrR is expressed at high levels not only under “LB” condition but also under several other conditions. (B) Positions of PaaE and AcrR in the Raman-proteome coefficient-based proteome space ΩB. (C) Verification of orthant correspondence. We verified the orthant correspondence described by the relation (2.76). We multiplied both sides of (2.76) by , and the elements of the vectors of both sides were compared by scatterplots. The horizontal axes are related to the coordinates in the Raman LDA space; the vertical axes are related to the coordinates in the Raman-proteome coefficient proteome space. The dashed lines are y = x. The nearly perfect agreement of the elements confirms the orthant correspondence for the condition-specific protein PaaE (left). Deviations from the diagonal agreement line are found for AcrR (right).

Stoichiometry-based omics structures and their correspondences to Raman-based omics structures for additional datasets. Related to Figures 46.

This figure summarizes the results on omics structures characterized by stoichiometry conservation relations and their correspondences to those characterized by Raman-omics relations for additional datasets. (A–E) show the results from the analyses of the Raman and proteome data of three E. coli strains (BW25113, MG1655, and NCM3722) in LB; (F–J) from the analyses of the Raman and proteome data of the three E. coli strains in M9 Glucose; and (K–O) from the analyses of the Raman and transcriptome data of S. pombe under 10 conditions. We used the E. coli proteome data reported in [1] and the S. pombe transcriptome data reported in [4] in the analyses. (A), (F), and (K) show distributions of omics components in cosine similarity LE space. Stoichiometry conservation centrality of each component is indicated by color. (B), (G), and (L) show expression patterns of representative condition-specific omics components indicated in the previous figures of omics structures in the csLE spaces. Error bars are standard errors in (B) and (G), and maximum-minimum ranges (two replicates) in (L). (C), (H), and (M) show positions of averaged cellular Raman spectra under different conditions in the LDA spaces. (D), (I), and (N) show omics structures in the spaces specified by the Raman-omics coefficients with the homeostatic cores and condition-specific SCGs indicated by colored points. (E), (J), and (O) show the omics structures in the csLE omics spaces with the homeostatic cores and condition-specific SCGs indicated by colored points. Columns vrw,1 (the eigenvector corresponding to Lrw’s smallest eigenvalue except for zero) and vrw,2 (the eigenvector corresponding to Lrw’s second smallest eigenvalue except for zero) are shown. We used the cosine similarity thresholds of 0.99993 to specify SCGs both for the three E. coli strains under LB data ((D) and (E)) and for the three E. coli strains under M9 Glucose data ((I) and (J)), and 0.9967 for the S. pombe transcriptome data ((N) and (O)).

Analyses of the mathematical relation connecting two types of omics spaces. Related to Figure 6.

This figure shows the analyses of mathematical relation that connects coordinates of omics components in the two types of omics spaces (see Figure 6E and Section 2) using additional datasets. (A–F) show the results from the analyses of the Raman and proteome data of three E. coli strains (BW25113, MG1655, and NCM3722) in LB; (G–L) from the analyses of the Raman and proteome data of the three E. coli strains in M9 Glucose; and (M–R) from the analyses of the Raman and transcriptome data of S. pombe under 10 conditions. We used the E. coli proteome data reported in [1] and the S. pombe transcriptome data reported in [4] in the analyses. See the caption of Figure S9 for the explanation of each panel. The SCGs in (F), (L), and (R) are the same as in Figure S11. The probability of finding a random orthogonal matrix Q with Pearson correlation coefficient greater than the Pearson correlation coefficient of Θ was 0.022 in (B), 0.013 in (H), and < 1 × 10−5 (No occurrence in 105 samplings) in (N).

Stoichiometry-based proteome structures for additional datasets. Related to Figures 45.

This figure shows proteome structures in the csLE proteome spaces for additional datasets. (A–C) show the results from the analyses of the proteome data of M. tuberculosis H37Rv under gradual changes in oxygen levels [2]; (D–F) shows the results from the analyses of the proteome data of M. bovis BCG under gradual changes in oxygen levels [2]; and (G–I) show the results from the analyses of the proteome data of S. cerevisiae under 10 conditions in chemostat with the same dilution rate [3]. (A), (D), and (G) show the proteome structures in the csLE spaces. The thresholds used to specify the SCGs were 0.99965 for (A), 0.9997 for (D), and 0.9989 for (G). (B), (E), and (H) show the same proteome structures as in the previous panels, but with stoichiometry conservation centrality of each protein species indicated by the color. (C), (F), and (I) show expression patterns of representative proteins indicated by the red circles in the previous panels. Error bars in (C) are standard errors.

Dependence of low-dimensional correspondence between Raman spectra and proteomes on the number of conditions. Related to Figure 6.

The dependence of the low-dimensional correspondence between Raman spectra and proteomes on the number of analyzed conditions was systematically investigated by evaluating the similarity of the orthogonal matrix Θ to the identity matrix for all subsampled condition sets. Proteomes of E. coli under 15 conditions [1] and corresponding Raman data we measured in this study were analyzed in this figure. The relationship between the number of conditions and the probability of obtaining higher level of low-dimensional correspondence than that of experimental data by chance. This probability is calculated as the probability of finding a random orthogonal matrix with Pearson correlation coefficient greater than the Pearson correlation coefficient of Θ by creating 104 random orthogonal matrices. See section 1.8 and Figure S9 for details of the evaluation method. Each green square corresponds to one subsample, and each short horizontal black line represents the median of all the combinations of conditions (i.e., green squares) for each subsample size x. The blue dashed line indicates the detection limit (i.e., one over the number of generated random orthogonal matrices). The non-subsampled case (i.e., the case with all 15 conditions) in this figure corresponds to Figure S9C. Visual comparison of , and for six representative subsamples indi-cated in (A). As in Figure S9A, Θ is visualized using |Θ|, whose element is the absolute value of the corresponding element of Θ, and height of each bar in the figures of |Θ| indicates the value of each element of |Θ|. Colors reflect the height of each bar. Spaces created with columns of and are ΩB and ΩLE, respectively. As Θ deviates from the identity matrix from the cases α and β to the case of ϵ, the low-dimensional correspondence between ΩB and ΩLE collapses naturally. Since the case ζ is the non-subsampled case, the figure of |Θ| is the same as Figure S9A, and those of and are the same as Figure S9G. Note that the figure of ΩB of the case ζ is also exactly the same as Figure 6C, and that of ΩLE of the case ζ is equal to Figure 6D up to a factor of . The SCGs shown in this figure were defined in the analysis of the proteomes of all the 15 conditions (Figure 4C).

Acknowledgements

We thank Matthias Heinemann and Silke Bonsing-Vedelaar for detailed information on the E. coli culture conditions; Doeke R. Hekstra, Tetsuya J. Kobayashi, Takafumi Miyamoto, John Russell, and Ian Hunt-Isaak for reading the manuscript and providing critical comments; Kunihiko Kaneko, Chikara Furusawa, Yasushi Okada, and members of the Wakamoto Lab and the Universal Biology Institute for discussion and encouragement. This work was supported by JST CREST Grant Number JPMJCR1927 (Y.W.); JST ERATO Grant Number JPMJER1902 (Y.W.); JSPS KAKENHI Grant Numbers 19J22448 (K.F.K.) and 21K20672 (T.N.).

Additional information

Funding

Japan Science and Technology Agency

https://doi.org/10.52926/jpmjcr1927

Japan Science and Technology Agency

https://doi.org/10.52926/jpmjer1902

Japan Society for the Promotion of Science (19J22448)

Japan Society for the Promotion of Science (21K20672)