Cellular physiological state differences detected by Raman spectral global patterns and gene expression profiles.
(A) Condition-dependent cellular Raman spectral patterns. Raman spectra obtained from cells reflect their molecular profiles. Therefore, systematic differences in global spectral patterns may indicate their physiological states. A Raman spectrum from each cell can be represented as a vector and a point in a high-dimensional Raman space. If condition-dependent differences exist in the spectral patterns, appropriate dimensional reduction methods allow us to classify the spectra and detect cellular physiological states in a low-dimensional space. (B) Condition-dependent gene expression profiles. Global gene expression profiles (proteomes and transcriptomes) are also dependent on conditions. For each gene, we can consider a high-dimensional vector whose elements represent expression levels under different conditions. It has been suggested that these expression-level vectors are constrained to some low-dimensional manifolds (7–17). This study characterizes the statistical correspondence between dimension-reduced Raman spectral patterns and gene expression profiles. Analyzing the correspondence, we reveal a stoichiometry conservation principle that constrains gene expression profiles to low-dimensional manifolds.
Estimation of proteomes from Raman spectra.
(A) The experimental design. We cultured E. coli cells under 15 different conditions and measured single cells’ Raman spectra. We then examined the correspondence between the measured Raman spectra and the absolute quantitative proteome data reported by Schmidt et al. (20). (B) Representative Raman spectra from single cells, one from the “Glucose” condition and the other from the “LB” condition. The fingerprint region and representative peaks are annotated. (C to E) Cellular Raman spectra in LDA space. The dimensionality of the spectra is reduced to 14 (= 15 − 1). Each point represents a spectrum from a single cell, and each ellipse shows the 95% concentration ellipse for each condition. Projections onto the LDA1-LDA2 plane (C), the LDA1-LDA3 plane (D), and the LDA1-LDA4 plane (E) are shown. (F) Visualization of the 14-dimensional LDA space embedded in two-dimensional space with t-SNE. (G) The scheme of leave-one-out cross-validation. The Raman and proteome data of one condition (here j) are excluded, and the matrix B is estimated using the data of the remaining conditions. The proteome data under condition j are then estimated from the Raman data with this estimate of B and compared with the actual data to calculate estimation errors. (H) Comparison of measured and estimated proteome data. The plot for the “Glucose” condition is shown as an example. Each dot corresponds to one protein species. The straight line indicates x = y. Proteins with negative estimated values are not shown.
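The leave-one-out scheme in (G) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a linear model in which each proteome is a constant term plus a linear combination of condition-averaged LDA coordinates, fitted by ordinary least squares. The function name `loocv_estimates`, the array names, and the synthetic data are all hypothetical.

```python
import numpy as np

def loocv_estimates(R, P):
    """Leave-one-out proteome estimates under an assumed linear model.

    R: (m, k) condition-averaged LDA coordinates (hypothetical input).
    P: (n, m) protein abundances, one column per condition.
    Returns an (n, m) array whose column j is estimated without condition j.
    """
    m = R.shape[0]
    X = np.hstack([np.ones((m, 1)), R])      # prepend a constant term
    P_hat = np.empty(P.shape, dtype=float)
    for j in range(m):
        keep = np.arange(m) != j             # exclude condition j
        # Least-squares fit of X[keep] @ B.T ~ P[:, keep].T
        B_T, *_ = np.linalg.lstsq(X[keep], P[:, keep].T, rcond=None)
        P_hat[:, j] = X[j] @ B_T             # estimate the held-out condition
    return P_hat

# Synthetic, noise-free data (illustration only, not the measured data).
rng = np.random.default_rng(0)
m, k, n = 15, 4, 50
R = rng.normal(size=(m, k))
B_true = rng.normal(size=(n, k + 1))
P = B_true @ np.hstack([np.ones((m, 1)), R]).T
P_hat = loocv_estimates(R, P)
max_abs_error = float(np.max(np.abs(P_hat - P)))
```

With noise-free synthetic data the held-out columns are recovered almost exactly; with real data the difference between `P_hat` and `P` is the estimation error that the caption refers to.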
List of scalars, vectors and matrices in the main text.
Scalars, vectors, and matrices in the main text are listed with their sizes and descriptions. m is the number of conditions, and n is the number of protein species. (m = 15 and n = 2058 in the main text.)
A stoichiometrically conserved protein group identified by an analysis of the Raman-proteome coefficient matrix.
(A) Scatterplots of Raman-proteome transformation coefficients. The horizontal axes are constant terms (b0) in all the plots. The vertical axis is the coefficient for LDA1 (b1), LDA2 (b2), LDA3 (b3), or LDA4 (b4) in each plot. The proteins in the ISP COG class are indicated in yellow. Yellow solid straight lines are least squares regression lines passing through the origin for the ISP proteins. Insets are enlarged views of the area around the origin. In this figure, we used the average of the leave-one-out estimates (see Fig. 2G) as an estimate of B. (B) Similarity of expression patterns between culture conditions for each COG class. We divided the proteome into COG classes (28, 29), and calculated Pearson’s correlation coefficient of expression patterns for all the combinations of culture conditions. Since the data are from 15 conditions, there are 105 (= 15·14/(2·1)) points for each COG class in the graph. The box-and-whisker plots summarize the distributions of the points. The lines inside the boxes denote the medians, and the top and bottom edges of the boxes denote the 75th and 25th percentiles, respectively. The numbers of protein species are 376 for the Cellular Processes and Signaling COG class, 354 for the ISP COG class, and 840 for the Metabolism COG class. See Fig. S4 for the evaluation with Pearson correlation coefficient of log abundances and with cosine similarity. Fig. S4 also contains figures directly showing expression-level changes of different protein species across conditions for each COG class. (C) Examples of stoichiometry-conserving proteins in the ISP COG class. The horizontal axis represents the abundance of RplF under 15 conditions and the vertical axis represents those of several ISP COG class proteins. These proteins are also contained in the homeostatic core defined later (see Fig. 4). The solid straight lines are linear regression lines with intercept of zero. (D) Examples of abundance ratios of non-ISP COG class proteins. 
The horizontal axis represents the abundance of RplF under 15 conditions and the vertical axis represents those of compared non-ISP COG class proteins. Crp belongs to the Cellular Processes and Signaling COG class; the other proteins belong to the Metabolism COG class. In both (C) and (D), we selected the proteins expressed from distant loci on the chromosome. All sigma factors participating in the regulation of the proteins examined in (C) and (D) are listed on the right of the gene name legends. All transcription factors known to regulate multiple genes listed here are shown in the right diagrams. Arrows show activation; bars represent inhibition; and squares indicate that a transcription factor activates or inhibits depending on other factors. The information on gene regulation and functions was obtained from EcoCyc (55) in August 2022. The error bars are standard errors calculated by using the data of (20). The inset shows the positions of the genes on the E. coli chromosome determined based on ASM75055v1.46 (56). No genes are in the same operon.
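The count of 105 points per COG class in (B) follows from taking every unordered pair of the 15 conditions. A minimal sketch, assuming the proteome of one COG class is stored as an array with one column per condition; the function name and the random data are hypothetical.

```python
import numpy as np
from itertools import combinations

def pairwise_condition_correlations(P):
    """Pearson correlation of expression patterns for every pair of conditions.

    P: (n, m) abundances of the n proteins in one COG class under m conditions.
    Returns a list with m*(m-1)/2 correlation coefficients.
    """
    m = P.shape[1]
    C = np.corrcoef(P, rowvar=False)         # (m, m), columns are the variables
    return [C[a, b] for a, b in combinations(range(m), 2)]

# Random stand-in for a COG class with 354 proteins (illustration only).
rng = np.random.default_rng(1)
P = rng.lognormal(size=(354, 15))
r_values = pairwise_condition_correlations(P)
n_pairs = len(r_values)                      # 15 * 14 / 2 pairs
```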
Extracting SCGs from proteome data.
(A) Quantifying stoichiometry conservation by cosine similarity. We consider an m-dimensional expression vector for each protein species whose elements represent its abundance under different conditions. The cosine similarity between the m-dimensional expression vectors of two protein species is close to 1 when the two proteins strongly conserve their mutual stoichiometry across conditions, and falls below 1 when their expression patterns are incoherent. (B) Extracted SCGs. We extracted proteins with high cosine similarity relationships. Each node represents a protein species. An edge connecting two nodes indicates that the expression patterns of the two protein species have a cosine similarity exceeding the threshold of 0.995. Proteins with no edge to any other protein are not shown. The largest and the second largest protein groups, which we refer to as SCG 1 and SCG 2, respectively, are indicated by shaded polygons. (C) Expression patterns of the extracted SCGs. The horizontal and vertical axes represent growth rate and protein abundance, respectively. Line-connected points represent expression-level changes of different protein species across conditions. SCG 1 (homeostatic core) is shown in two ways: the left panel with a linear-scaled vertical axis and the right panel with a log-scaled vertical axis. The inset for SCG 2 shows the total abundances of SCG 2 proteins with a log-scaled vertical axis. Error bars are standard errors. (D) The gene loci of the homeostatic core (SCG 1) proteins on the chromosome. Magenta dots are nodes (genes), and gray lines are edges (high cosine similarity relationships). We determined the gene loci based on ASM75055v1.46 (56).
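The extraction in (A) and (B) can be illustrated with a short sketch. It assumes that a group is a connected component of the graph whose edges join protein pairs with cosine similarity above the threshold; the caption does not state the grouping rule explicitly, so this is an assumption. Function names and the toy data are hypothetical, and every expression vector is assumed to be nonzero.

```python
import numpy as np

def cosine_similarity_matrix(P):
    """P: (n, m) expression vectors (rows). Returns the (n, n) cosine similarity."""
    U = P / np.linalg.norm(P, axis=1, keepdims=True)
    return U @ U.T

def extract_groups(P, threshold=0.995):
    """Connected components of the thresholded graph, largest first.
    Proteins without any edge are omitted, as in the figure."""
    A = cosine_similarity_matrix(P)
    n = A.shape[0]
    adj = (A > threshold) & ~np.eye(n, dtype=bool)   # drop self-edges
    seen, groups = set(), []
    for start in range(n):
        if start in seen or not adj[start].any():
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:                                  # depth-first traversal
            u = stack.pop()
            comp.append(u)
            for v in np.flatnonzero(adj[u]):
                v = int(v)
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        groups.append(sorted(comp))
    return sorted(groups, key=len, reverse=True)

# Toy data: rows 0-2 are nearly proportional; rows 3 and 4 are condition-specific.
P = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, 9.01],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
groups = extract_groups(P)
```

In this toy example the three nearly proportional rows form one group, while the two condition-specific rows have no edges and are dropped.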
A proteome structure characterized by global stoichiometry conservation relationships.
(A) Distributions of stoichiometry conservation centrality values for all the proteins (gray), the homeostatic core (SCG 1) proteins (magenta), and the proteins belonging to the other SCGs (cyan). (B) Correlation between stoichiometry conservation centrality and gene essentiality. The proportion of essential genes within each class of stoichiometry conservation ranking is shown. The list of essential genes was downloaded from EcoCyc (55). (C) Correlation between stoichiometry conservation and evolutionary conservation. The strength of evolutionary conservation of each protein species was estimated by the number of orthologs found in the OrthoMCL species (35). The genes with more orthologs tend to have higher stoichiometry conservation centrality (p = 3.42 × 10⁻¹⁴ by one-sided Brunner–Munzel test between the top 25% and the bottom 25% fractions of ortholog number ranking). Likewise, the genes with higher stoichiometry conservation centrality scores tend to have more orthologs (p = 8.44 × 10⁻¹² by one-sided Brunner–Munzel test, Top 25%–Bottom 25% comparison; p-values in the captions for (F) to (I) were evaluated with the same statistical test scheme). (D) to (G) Stoichiometry conservation analyses of human cell atlas transcriptome data of 15 fetal organs (36). The top gray histogram in (D) shows the distribution of stoichiometry conservation centrality values for all genes. The bottom histograms in (D) show the distribution for coding genes (yellow) and that for the other genes (cyan). (E) shows a correlation between the ratio of coding genes and stoichiometry conservation centrality calculated from the human cell atlas data. (F) shows a correlation between gene essentiality and stoichiometry conservation centrality calculated from the human cell atlas data. The essentiality of each human gene was quantified by CRISPR score, which is the fitness cost imposed by CRISPR-based inactivation of the gene in KBM7 chronic myelogenous leukemia cells (35). 
Genes with lower CRISPR score are regarded as more essential. The fraction with low CRISPR score (i.e., the high-essentiality fraction) tends to have higher stoichiometry conservation centrality (p < 10⁻¹⁵). The fraction with high centrality score tends to be more essential (p < 10⁻¹⁵). (G) shows a correlation between evolutionary conservation and stoichiometry conservation centrality based on the human cell atlas data. The gene fraction with many orthologs tends to have higher stoichiometry conservation centrality (p < 10⁻¹⁵). The gene fraction with high centrality score tends to have more orthologs (p < 10⁻¹⁵). (H) and (I) Stoichiometry conservation analyses of genome-wide Perturb-seq data (37). (H) shows a correlation between stoichiometry conservation centrality calculated from the Perturb-seq data and gene essentiality. The essentiality of each gene was quantified by the CRISPR score as in (F). The gene fraction with low CRISPR score (i.e., the high-essentiality fraction) tends to have higher stoichiometry conservation centrality (p < 10⁻¹⁵). The gene fraction with high centrality score tends to be more essential (p < 10⁻¹⁵). (I) shows a correlation between stoichiometry conservation based on the Perturb-seq data and evolutionary conservation of genes. The gene fraction with many orthologs tends to have higher stoichiometry conservation centrality (p < 10⁻¹⁵). The gene fraction with high centrality score tends to have more orthologs (p < 10⁻¹⁵). (J) Representation of the proteomes as a graph. A node corresponds to a protein species, and the weight of an edge is taken as the cosine similarity between the m-dimensional expression vectors of the two connected protein species. The n × n matrix A can specify the whole graph. Note that the diagonal elements of A are ones, which were introduced just for simplicity. (K) Cosine similarity LE (csLE) structure in a three-dimensional space. 
Each dot represents a different protein species and is color-coded on the basis of its stoichiometry conservation centrality value. We selected the axes considering the structural similarity to the Raman-based proteome structure in ΩB (see Fig. 6). (L) The csLE structure in a three-dimensional space. The views from two different angles are shown. Each gray dot represents a different protein species. The proteins belonging to each SCG are indicated with distinct markers. Colors of the two-dimensional histograms in (C), (F), (G), (H), and (I) represent the height of each bar.
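The graph in (J) and the embedding in (K) and (L) can be sketched as below. Two assumptions are made that the caption does not spell out: the centrality of a protein is taken as its degree, that is, the row sum of A including the diagonal one, and "LE" is read as standard Laplacian eigenmaps solved through the symmetric normalized Laplacian. Function names and the random non-negative data are hypothetical.

```python
import numpy as np

def cosine_similarity_graph(P):
    """P: (n, m) non-negative expression vectors. Returns A with ones on the diagonal."""
    U = P / np.linalg.norm(P, axis=1, keepdims=True)
    return U @ U.T

def degree_centrality(A):
    """Assumed definition: centrality of protein i is the sum of row i of A."""
    return A.sum(axis=1)

def laplacian_eigenmap(A, dim=3):
    """Laplacian eigenmaps via the symmetric normalized Laplacian.
    Solves (D - A) y = lambda * D y and skips the trivial constant solution."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)          # ascending eigenvalues
    Y = d_inv_sqrt[:, None] * eigvecs[:, 1:dim + 1]   # map back to y = D^(-1/2) v
    return eigvals, Y

# Random non-negative stand-in data (illustration only).
rng = np.random.default_rng(3)
n, m = 20, 6
P = rng.random((n, m)) + 0.01
A = cosine_similarity_graph(P)
d = degree_centrality(A)
eigvals, Y = laplacian_eigenmap(A, dim=3)
```

Because expression levels are non-negative, all entries of A are non-negative and every degree lies between 1 and n, so the normalization is well defined.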
Raman-based proteome structure and its similarity to stoichiometry-based proteome structure.
(A) Proteome structure determined by Raman-proteome coefficients visualized in a three-dimensional space. The views from two different angles are shown. Each gray dot represents a protein species. The proteins belonging to each SCG are indicated with distinct markers. We note that SCGs are defined without referring to Raman data (Fig. 4). (B–D) Similarity among the distribution of LDA Raman spectra (B), the proteome structure determined by Raman-proteome coefficients (C), and the proteome structure determined by stoichiometry conservation (D). (E) Mathematical relation between the coordinates of the proteins in ΩB (C) and ΩLE (D). The two conditions, one with Θ (magenta), and the other between b0 and (cyan), must hold for the similarity between the two proteome structures (yellow), as described in the gray box. denotes column-wise proportionality.
Proportionality between stoichiometry conservation centrality and expression generality.
(A) Relationships between stoichiometry conservation centrality (di) and expression generality (gi). Each gray dot represents a protein species. The proteins belonging to each SCG are indicated with distinct markers. The dashed lines are . The solid lines represent (see section 2.2 in (22)). The deviation of a point from the solid line is related to the growth rate under the condition where each protein is expressed the most. (B) The same plot as (A) in black and white. Overlaid red circles indicate proteins featured in (C). (C) Expression patterns of the proteins indicated by red circles in (B) across conditions. The condition differences are shown by the growth rate differences on the horizontal axes. The arrangement of the plots for the proteins corresponds to their relative positions in (B).
List of culture conditions.
Evaluation of the overall estimation error with various distance measures (The case where LDA1 to LDA4 axes were used).
The sum of estimation errors was calculated and a permutation test (10⁵ permutations) was conducted. In this table, LDA1 to LDA4 axes were used. represents a vector all of whose elements equal the mean of all elements of x. xj is the j-th element of x. median_j xj represents the median of the scalars xj.
Evaluation of the overall estimation error with various distance measures (The case where all the 14 LDA axes were used).
The results obtained by using all 14 LDA axes are presented. See Table S2 for notations. Note that the system is underdetermined in this case, so we adopted the minimum-norm solution among all least-squares solutions.
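The minimum-norm choice can be illustrated with a small sketch. It assumes the underdetermined case arises because, with one condition held out, 14 training conditions must determine 15 coefficients (a constant term plus 14 LDA axes); the data below are random stand-ins. Both `np.linalg.pinv` and `np.linalg.lstsq` return the minimum-norm least-squares solution in this situation.

```python
import numpy as np

rng = np.random.default_rng(2)
m_train, k = 14, 14
# Design matrix: constant term plus k LDA coordinates (random stand-in values).
X = np.hstack([np.ones((m_train, 1)), rng.normal(size=(m_train, k))])  # (14, 15)
y = rng.normal(size=m_train)             # abundances of one protein (stand-in)

b_min_norm = np.linalg.pinv(X) @ y       # minimum-norm least-squares solution
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any null-space direction can be added without changing the fit,
# but it always increases the Euclidean norm of the coefficients.
_, _, Vt = np.linalg.svd(X)
null_vec = Vt[-1]                        # spans the null space when rank(X) = 14
b_other = b_min_norm + 0.5 * null_vec
```

Here `b_other` fits the training data exactly as well as `b_min_norm` but has a larger norm, which is why a selection rule is needed among the infinitely many solutions.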
Gene list of SCG 1 (homeostatic core).
Members of homeostatic core (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from [1].
Gene list of SCG 2.
Members in SCG 2 (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from [1].
Gene list of SCG 3.
Members in SCG 3 (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from [1].
Gene list of SCG 4.
Members in SCG 4 (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from [1].
Gene list of SCG 5.
Members in SCG 5 (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from [1].
Interpretations of , and
Mathematical relation between Raman-proteome coefficients and cosine similarity LE proteomes.
The matrices in the left-hand side of (2.138) (a proteome structure based on Raman-proteome coefficients) and their counterparts in the right-hand side of (2.138) (a proteome structure obtained with cosine similarity LE) are listed.
Schematic illustration of the approach in this study. Related to Figure 1.
Raman spectra and gene expression profiles are both high-dimensional vectors and can be represented as points in high-dimensional spaces. Coarse-graining Raman spectra by dimensional reduction finds condition-dependent differences in their global spectral patterns (see Fig. 2). The dimension-reduced spectra were linked to and used to predict condition-dependent global gene expression profiles (see Fig. 2), which implies that global changes in spectral patterns detect differences in cellular physiological states. The analysis of this linkage led us to discover a stoichiometry-conserving constraint on gene expression, which enabled us to represent gene expression profiles in a functionally-relevant low-dimensional space ((i); see also Fig. 3–5). Then, we find a nontrivial correspondence between these low-dimensional Raman and gene expression spaces ((ii); see also Fig. 6). This correspondence provides an omics-level interpretation of global Raman spectral patterns and a quantitative constraint between expression generality and stoichiometry conservation centrality ((ii); see also Fig. S9 and Fig. 7).
Custom-built Raman microscope and analyses of E. coli Raman spectra. Related to Figure 2.
(A) Schematic diagram of the Raman microscope used in this study. (B) Representative Raman spectra from single E. coli cells. The fingerprint region of one spectrum is shown for each condition. (C) Linear superposition of Raman shifts. Each LDA axis is a linear superposition of Raman shifts. These figures show the coefficients for LDA1 (left) and LDA2 (right). (D) Relationship between the Raman LDA1 axis and growth rates. The horizontal axis represents the Raman LDA1 axis. The vertical axis represents growth rates measured in [1]. Each point corresponds to the data for one condition. Pearson’s correlation coefficient is 0.81±0.09.
Estimation of proteomes from Raman spectra. Related to Figure 2.
Comparing the measured proteomes with those estimated from Raman spectra. The horizontal and vertical axes represent the estimated and measured proteomes, respectively. Proteins with negative estimated abundance are not shown in these figures. The conditions with the largest and the second largest numbers of proteins with negative estimated abundance were “stationary3days” (666 proteins) and “LB” (359 proteins). The conditions with the fewest and the second fewest negatively estimated proteins were “GlucosepH6” (0 proteins) and “Xylose” (7 proteins).
Comparison of stoichiometry conservation among COG classes. Related to Figure 3.
(A and B) Relations between protein abundance and constant terms of Raman-proteome coefficients. The horizontal axes are b0 (constant terms), and the vertical axes are protein abundances. Dashed lines are the least squares regression lines with intercept zero for ISP COG class members. The average of the leave-one-out estimates (see Figure 2G) was used as an estimate of B here. In (A), only ISP COG class members are shown for three representative conditions: “Galactose”, “Glucose”, and “GlycerolAA”. In (B), all proteins are shown for a representative condition, “GlycerolAA”. (C) Relations between protein abundance and growth rates of E. coli under 15 environmental conditions. We analyzed the absolute quantitative proteome data, growth rate data, and COG annotation reported by [1]. Lines represent different protein species. Error bars are standard errors. The top panel is for the Cellular Processes and Signaling COG class; the middle is for the Information Storage and Processing COG class; and the bottom is for the Metabolism COG class. (D) Relations between protein abundance and growth rates of three E. coli strains (BW25113, MG1655, and NCM3722) under two culture conditions. We again analyzed the data by [1]. Lines represent different protein species. Error bars are standard errors. (E and F) COG class-dependent expression pattern similarity of E. coli proteomes between conditions. The E. coli proteome data under the 15 different environmental conditions were analyzed. The similarity is evaluated by Pearson correlation coefficients of log expression levels in (E) and by cosine similarity in (F). We consider all the combinations of the 15 conditions. Thus, there are 105 data points for each COG class. The box-and-whisker plots summarize the distributions of the points. The lines inside the boxes denote the medians. The top and bottom edges of the boxes denote the 75th and 25th percentiles, respectively. 
Note that (E) and (F) are evaluations of the same data used in Figure 3B in the main text with different similarity indices. (G) COG class-dependent expression pattern similarity between different strains of E. coli (BW25113, MG1655, and NCM3722). The absolute quantitative proteome data and COG annotation were taken from [1]. The similarity was evaluated by cosine similarity. The data contain three strains. Thus, there are three points for each COG class. The top panel is for the “Glucose” condition; and the bottom is for the “LB” condition. (H–J) COG class-dependent expression pattern similarity in other organisms. (H) is for M. tuberculosis (data from [2]; six environmental conditions (time points)), (I) for M. bovis (data from [2]; six environmental conditions (time points)), and (J) for S. cerevisiae (data from [3]; 10 environmental conditions). The COG annotations were taken from the December 2014 release of 2003-2014 COGs [37] and the Release 3 of “Mycobrowser” [38] for (H) and (I) and from the Comprehensive Sake Yeast Genome Database (S288C strain) [39] for (J). The unit for protein abundance was fg/cell for (H) and (I) and fg in pg dry cell weight for (J).
Single-gene level growth law in the homeostatic core. Related to Figure 4.
(A) Relationship between population growth rates and total abundance of SCG 1 (homeostatic core) proteins. Here, we analyzed the E. coli proteome data [1], focusing on the 15 conditions for which we obtained Raman data. The dashed line is the least squares regression line. (B) Scatterplots of log abundance of SCG 1 (homeostatic core) proteins. Here, the proteomes under three representative conditions, “LB”, “Glucose”, and “Galactose”, are compared with that under the standard condition “Glycerol.” Each colored line is the linear regression line with slope one for the points with the same color. The vertical line is x = 0. (C) Relationship between population growth rate and coefficient of determination of linear regression in (B). The vertical line represents the growth rate under the standard condition (“Glycerol”). (D) Linear relationship between common abundance ratio and growth rates. The vertical axis represents , where Γc is the y-intercepts in (B) (see 3.1.2). The dashed line is the linear regression line. The horizontal line is y = 1, and the x coordinate of the vertical line is the growth rate under the standard condition (“Glycerol”). (E) The gene loci of the proteins belonging to the condition-specific SCGs on the chromosome (ASM75055v1.46 [40]). Yellow dots are nodes (genes), and gray lines are edges (high cosine similarity relationships). The edge in the map of SCG 5 cannot be seen because their gene loci are clustered in close proximity in the same operon.
Functional relevance of stoichiometry conservation centrality. Related to Figure 5.
(A) Relationship between gene essentiality and stoichiometry conservation centrality in E. coli. The proportion of essential genes is plotted for each stoichiometry conservation centrality rank range. In this plot, we calculated stoichiometry conservation centrality based on the E. coli proteome data [1] under the 15 conditions for which we obtained Raman data. The list of essential genes was downloaded from EcoCyc [23]. (B) Relationship between gene essentiality and stoichiometry conservation centrality in S. pombe. We calculated stoichiometry conservation centrality based on the S. pombe transcriptome data reported in [4]. Only coding genes are considered in this plot, though stoichiometry conservation centrality values were calculated using both coding and non-coding genes. Gene classification is based on PomBase [24]. Some bins do not reach 100% in sum because eleven coding genes in the S. pombe transcriptome data were not found in the current PomBase. (C) Relationship between ratio of coding genes and stoichiometry conservation centrality in the S. pombe transcriptome data. The coding/non-coding assignment is based on PomBase [24]. (D) Correlation between stoichiometry conservation and evolutionary conservation. In this plot, we calculated stoichiometry conservation centrality based on the E. coli proteome data [1] under the 15 conditions for which we obtained Raman data. Colors represent the height of each bar. The distributions of stoichiometry conservation centrality were compared between the top 25% and the bottom 25% fractions in the number of orthologs rankings. The fraction with many orthologs tends to have higher stoichiometry conservation centrality (one-sided Brunner–Munzel test, p = 7.84 × 10⁻¹⁵). The distributions of the number of orthologs were compared between the top 25% and the bottom 25% stoichiometry conservation centrality fractions. The high centrality fraction tends to have more orthologs (one-sided Brunner–Munzel test, p = 1.46 × 10⁻¹¹). 
Ortholog data were taken from OrthoMCL-DB [31]. (E–G) Correlation between stoichiometry conservation and evolutionary conservation in S. pombe. We calculated stoichiometry conservation centrality based on the S. pombe transcriptome data reported in [4]. In (E), the result is shown by two-dimensional histogram. Colors represent the height of each bar. The distributions of the number of orthologs were compared between the top 25% and the bottom 25% stoichiometry conservation centrality fractions. The high centrality fraction tends to have more orthologs (one-sided Brunner–Munzel test, p = 0.00548). The direct comparison between the two fractions is shown in (F). The distributions of stoichiometry conservation centrality were compared between the top 25% and the bottom 25% fractions in the number of orthologs rankings. The fraction with many orthologs tends to have higher stoichiometry conservation centrality (one-sided Brunner–Munzel test, p = 0.00270). The direct comparison between the two fractions is shown in (G). Ortholog data were taken from OrthoMCL-DB [31]. (H) Applying PCA to L2-normalized proteomes. PCA (with mean-centering) was applied to L2-normalized proteome data [p₁/∥p₁∥₂ … pₙ/∥pₙ∥₂]. Here, we analyzed the E. coli proteome data under the 15 conditions for which we obtained Raman data. The left is a projection onto a two-dimensional space; and the right is a projection onto a three-dimensional space. The axes for visualization were selected by considering similarity to the cosine similarity LE structure.
Distributions and constraints with respect to stoichiometry conservation centrality (degree). Related to Figures 5 and 7.
(A) Comparison of degree (stoichiometry conservation centrality) distributions between original (yellow) and randomized (blue) E. coli proteome data. We created randomized proteome data by shuffling the expression levels across the protein species within each condition. We used the E. coli proteome data [1] under the 15 conditions for which we obtained Raman data. (B) Comparison of the gj-dj relationships between original (yellow) and randomized data (blue). The horizontal axis is expression generality score (gj = L1 norm/L2 norm) and the vertical axis is stoichiometry conservation centrality (dj: degree). Each dot represents a protein species. The dashed lines are . The solid lines are . (C–H) Degree (stoichiometry conservation centrality) distributions for additional datasets. Yellow histograms are for the original data, and blue histograms are for the randomized data. (C) for the proteomes of three E. coli strains (BW25113, MG1655, and NCM3722) in LB [1]; (D) for the proteomes of the three E. coli strains in M9 Glucose [1]; (E) for the proteomes of M. tuberculosis [2]; (F) for the proteomes of M. bovis [2]; (G) for the proteomes of S. cerevisiae [3]; and (H) for the transcriptomes of S. pombe [4]. (I–N) gj-dj relationships for additional datasets. Each gray dot represents a protein species. The proteins belonging to the homeostatic core in each dataset are shown in magenta; those belonging to condition-specific SCGs are indicated in different colors in each plot. See the captions of Figures S11 and S13 for the cosine similarity threshold to specify the homeostatic core and the condition-specific SCGs in each dataset. The dashed lines are . The solid lines through the origins are . (I) for the proteomes of the three E. coli strains in LB [1]; (J) for the proteomes of the three E. coli strains in M9 Glucose [1]; (K) for the proteomes of M. tuberculosis [2]; (L) for the proteomes of M. bovis [2]; (M) for the proteomes of S. cerevisiae [3]; and (N) for the transcriptomes of S. pombe [4].
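The randomization described in (A), shuffling expression levels across protein species within each condition, can be sketched as follows. It assumes the proteome is stored with proteins as rows and conditions as columns, so that each column is permuted independently; the function name and the random data are hypothetical.

```python
import numpy as np

def randomize_within_conditions(P, rng):
    """Shuffle abundances across proteins independently for every condition.

    P: (n, m) abundances, rows are protein species, columns are conditions.
    Each column keeps its set of values; only the protein assignment changes.
    """
    # Generator.permuted shuffles every slice along the given axis independently.
    return rng.permuted(P, axis=0)

# Random stand-in proteome (illustration only).
rng = np.random.default_rng(4)
P = rng.lognormal(size=(200, 15))
P_rand = randomize_within_conditions(P, rng)
```

This null model preserves the abundance distribution of every condition while destroying the coherence of each protein's expression pattern across conditions.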
Properties of normalized expression vectors. Related to Figure 7.
(A and B) Schematic explanation for the interpretation of the L1 norm/L2 norm ratio of expression vectors as an index of expression generality. (A) is a two-dimensional case, and (B) is a three-dimensional case. The inset in (A) schematically explains L1 norm and L2 norm of an expression vector. See section 1.9 for detail. (C) Schematic explanation for deviations of points from the proportionality line in the gj-dj plots. Here, we consider four condition-specific protein species a, b, c, and d labeled in the descending order of growth rates under the conditions accompanying their expression. Note that their L1 norm/L2 norm ratios are all one on the horizontal axis. One can show that the degree (stoichiometry conservation centrality) dj is proportional to the inner product of the L2-normalized expression vector and the expression norm vector (see (2.147) in section 2.2.2). Since the elements of the expression norm vector increase approximately linearly with growth rates of the corresponding conditions (see (D)), the degrees (stoichiometry conservation centrality values) decrease from a to d in the order of growth rates. (D–F) Correlation between elements of the expression norm vector and population growth rates. The vertical axis represents the elements of the expression norm vector and the horizontal axis represents the population growth rates. The dashed lines are . (D) is the result from the analysis of the E. coli proteome data [1] under the 15 conditions for which we obtained Raman data (m = 15). (E) is the result from the analysis of the proteome data of three strains of E. coli (BW25113, MG1655, and NCM3722) under “LB” and “Glucose” conditions (m = 6) [1]. (F) is the result from the analysis of the proteome data of S. cerevisiae under 10 different conditions (m = 10) [3]. The cells were cultured in chemostat with the same dilution rate. The numbers of analyzable protein species and the numbers of conditions were different between (D) and (E). Thus, the values of the vertical axes cannot be compared directly between them.
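The L1 norm/L2 norm ratio used as the expression generality score can be checked on small examples. The sketch below uses hypothetical four-condition expression vectors: a protein expressed under a single condition gives a ratio of 1, and a protein expressed equally under all m conditions gives the maximum value, the square root of m.

```python
import numpy as np

def expression_generality(p):
    """L1 norm divided by L2 norm of an expression vector p (length m)."""
    p = np.asarray(p, dtype=float)
    return np.abs(p).sum() / np.sqrt((p ** 2).sum())

# Hypothetical expression vectors over m = 4 conditions.
g_specific = expression_generality([0.0, 0.0, 5.0, 0.0])  # one condition only
g_uniform = expression_generality([2.0, 2.0, 2.0, 2.0])   # equal in all conditions
g_mixed = expression_generality([4.0, 1.0, 0.5, 0.0])     # intermediate pattern
```

The ratio does not depend on the overall abundance of the protein, only on how evenly the abundance is spread across conditions.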
Mathematical analyses of the main Raman-proteome data. Related to Figure 6.
(A) Proteomes of E. coli under 15 conditions [1] and corresponding Raman data we measured in this study were analyzed in this figure. (B) Visual comparison of the unit matrix I, the orthogonal matrix Θ obtained from the data, and a random orthogonal matrix. Height of each bar indicates the value of each element. Colors represent the height of each bar. For clarifying the position of each element, a component form of matrix Θ is shown in the middle (m = 15). For Θ (middle) and a random orthogonal matrix (right), the original matrices are displayed in the upper row, and matrices whose elements are the absolute values of the corresponding elements of the original matrices are displayed in the lower row. (In this figure, |Θ| represents a matrix of which the (i, j) element is the absolute value of the (i, j) element of Θ.) (B) Representation of matrices as scatterplots. See section 1.8 for detail. (C) Comparison of the unit matrix I, the orthogonal matrix Θ obtained from the data, and random orthogonal matrices Q by Pearson correlation coefficients. Pearson correlation coefficient of the element-wise squared matrix of each matrix can be regarded as a measure of closeness to the identity matrix (∘ represents element-wise multiplication). The probability of finding a random orthogonal matrix Q with Pearson correlation coefficient greater than the Pearson correlation coefficient of Θ was < 1 × 10⁻⁵ (no occurrence in 10⁵ samplings). See section 1.8 for detail. (D) Comparison of magnitudes of off-diagonal elements among the unit matrix I, the orthogonal matrix Θ obtained from the data, and random orthogonal matrices Q. The lattice on the top explains the numbering of k-diagonals (−m < k < m, m = 15). In the lattices on the bottom, black color indicates areas in which the elements are squared and summed at the corresponding steps (i.e., areas represented by x in the graph). The sum of the squared values in each step is shown in the middle graph. 
Error bars of the random matrix line are standard errors of 100 samplings. See section 1.8 for details. (E) Comparison of magnitudes of elements of leading principal submatrices among the unit matrix I, the orthogonal matrix Θ obtained from the data, and random orthogonal matrices Q. In the lattices on the bottom, black indicates the area in which the elements are squared and summed at the corresponding step (i.e., the area represented by x in the graphs). The sum of the squared values in each area is shown in the top graph. These sums are converted into ratios to the corresponding values for the identity matrix I and shown in the middle graph. Error bars of the random matrix line are standard errors of 100 samplings. See section 1.8 for details. (F) Comparison of and . x axis represents and y axis represents . The dashed line indicates y = x. (G) Comparison between (left) and (right). Note that while the left figure is the same as Figure 6C, the right figure shows , where is shown in Figure 6D.
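The comparisons in panels (C) to (E) can be illustrated with a short numerical sketch. It is not the authors' code: the exact definitions are given in section 1.8, which is not reproduced here, so the sketch assumes that closeness to the identity is the Pearson correlation between the entries of I and those of Q∘Q, that random orthogonal matrices are drawn from the Haar measure, and that the k-diagonal sums are taken per diagonal rather than cumulatively. All function names are hypothetical.

```python
import numpy as np

def random_orthogonal(m, rng):
    """Draw an m x m orthogonal matrix from the Haar measure via QR."""
    g = rng.standard_normal((m, m))
    q, r = np.linalg.qr(g)
    # Rescale each column by the sign of the matching diagonal entry of R
    # so that the resulting distribution is uniform (Haar) over O(m).
    return q * np.sign(np.diag(r))

def closeness_to_identity(mat):
    """Assumed measure: Pearson correlation between entries of I and mat∘mat."""
    m = mat.shape[0]
    squared = mat * mat  # element-wise (Hadamard) square
    return float(np.corrcoef(np.eye(m).ravel(), squared.ravel())[0, 1])

def k_diagonal_square_sums(mat):
    """Sum of squared entries on each k-diagonal, for -m < k < m."""
    m = mat.shape[0]
    return {k: float(np.sum(np.diagonal(mat, offset=k) ** 2))
            for k in range(-(m - 1), m)}

def leading_submatrix_square_sums(mat):
    """Sum of squared entries of each n x n leading principal submatrix."""
    m = mat.shape[0]
    return [float(np.sum(mat[:n, :n] ** 2)) for n in range(1, m + 1)]

def empirical_p_value(theta, n_samples, rng):
    """Fraction of random orthogonal matrices closer to I than theta."""
    r_obs = closeness_to_identity(theta)
    m = theta.shape[0]
    hits = sum(closeness_to_identity(random_orthogonal(m, rng)) > r_obs
               for _ in range(n_samples))
    return float(hits) / n_samples
```

With a 15 × 15 matrix Θ and `n_samples = 10**5`, a return value of 0 from `empirical_p_value` would correspond to the "< 1 × 10⁻⁵" statement in panel (C). Dividing the output of `leading_submatrix_square_sums` for Θ by that for the identity matrix gives ratios of the kind plotted in the middle graph of panel (E).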
Orthant correspondences between Raman spectra in LDA space and condition-specific proteins in Raman-proteome coefficient proteome space. Related to Figure 6.
Using the main Raman and proteome data of E. coli under the 15 conditions, we examine the orthant correspondence between Raman spectra in the LDA space and condition-specific proteins in the Raman-proteome coefficient proteome space ΩB. Here, we focus on two proteins, PaaE and AcrR. (A) Expression patterns of PaaE (left) and AcrR (right) across conditions. Error bars are standard errors. PaaE is expressed specifically under the “LB” condition, whereas AcrR is expressed at high levels not only under the “LB” condition but also under several other conditions. (B) Positions of PaaE and AcrR in the Raman-proteome coefficient proteome space ΩB. (C) Verification of orthant correspondence. We verified the orthant correspondence described by the relation (2.76). We multiplied both sides of (2.76) by , and the elements of the vectors on both sides were compared in scatterplots. The horizontal axes are related to the coordinates in the Raman LDA space; the vertical axes are related to the coordinates in the Raman-proteome coefficient proteome space. The dashed lines are y = x. The nearly perfect agreement of the elements confirms the orthant correspondence for the condition-specific protein PaaE (left). Deviations from the y = x line are found for AcrR (right).
Stoichiometry-based omics structures and their correspondences to Raman-based omics structures for additional datasets. Related to Figures 4–6.
This figure summarizes the results on omics structures characterized by stoichiometry conservation relations and their correspondences to those characterized by Raman-omics relations for additional datasets. (A–E) show the results from the analyses of the Raman and proteome data of three E. coli strains (BW25113, MG1655, and NCM3722) in LB; (F–J) from the analyses of the Raman and proteome data of the three E. coli strains in M9 Glucose; and (K–O) from the analyses of the Raman and transcriptome data of S. pombe under 10 conditions. We used the E. coli proteome data reported in [1] and the S. pombe transcriptome data reported in [4] in the analyses. (A), (F), and (K) show distributions of omics components in cosine similarity LE space. Stoichiometry conservation centrality of each component is indicated by color. (B), (G), and (L) show expression patterns of representative condition-specific omics components indicated in the previous figures of omics structures in the csLE spaces. Error bars are standard errors in (B) and (G), and maximum-minimum ranges (two replicates) in (L). (C), (H), and (M) show positions of averaged cellular Raman spectra under different conditions in the LDA spaces. (D), (I), and (N) show omics structures in the spaces specified by the Raman-omics coefficients with the homeostatic cores and condition-specific SCGs indicated by colored points. (E), (J), and (O) show the omics structures in the csLE omics spaces with the homeostatic cores and condition-specific SCGs indicated by colored points. Columns vrw,1 (the eigenvector corresponding to Lrw’s smallest eigenvalue except for zero) and vrw,2 (the eigenvector corresponding to Lrw’s second smallest eigenvalue except for zero) are shown. We used the cosine similarity thresholds of 0.99993 to specify SCGs both for the three E. coli strains under LB data ((D) and (E)) and for the three E. coli strains under M9 Glucose data ((I) and (J)), and 0.9967 for the S. pombe transcriptome data ((N) and (O)).
Analyses of the mathematical relation connecting two types of omics spaces. Related to Figure 6.
This figure shows the analyses of the mathematical relation that connects the coordinates of omics components in the two types of omics spaces (see Figure 6E and Section 2) using additional datasets. (A–F) show the results from the analyses of the Raman and proteome data of three E. coli strains (BW25113, MG1655, and NCM3722) in LB; (G–L) from the analyses of the Raman and proteome data of the three E. coli strains in M9 Glucose; and (M–R) from the analyses of the Raman and transcriptome data of S. pombe under 10 conditions. We used the E. coli proteome data reported in [1] and the S. pombe transcriptome data reported in [4] in the analyses. See the caption of Figure S9 for the explanation of each panel. The SCGs in (F), (L), and (R) are the same as in Figure S11. The probability of finding a random orthogonal matrix Q with a Pearson correlation coefficient greater than that of Θ was 0.022 in (B), 0.013 in (H), and < 1 × 10⁻⁵ (no occurrence in 10⁵ samplings) in (N).
Stoichiometry-based proteome structures for additional datasets. Related to Figures 4–5.
This figure shows proteome structures in the csLE proteome spaces for additional datasets. (A–C) show the results from the analyses of the proteome data of M. tuberculosis H37Rv under gradual changes in oxygen levels [2]; (D–F) show the results from the analyses of the proteome data of M. bovis BCG under gradual changes in oxygen levels [2]; and (G–I) show the results from the analyses of the proteome data of S. cerevisiae under 10 conditions in chemostat with the same dilution rate [3]. (A), (D), and (G) show the proteome structures in the csLE spaces. The thresholds used to specify the SCGs were 0.99965 for (A), 0.9997 for (D), and 0.9989 for (G). (B), (E), and (H) show the same proteome structures as in the previous panels, but with the stoichiometry conservation centrality of each protein species indicated by color. (C), (F), and (I) show expression patterns of representative proteins indicated by the red circles in the previous panels. Error bars in (C) are standard errors.
Dependence of low-dimensional correspondence between Raman spectra and proteomes on the number of conditions. Related to Figure 6.
The dependence of the low-dimensional correspondence between Raman spectra and proteomes on the number of analyzed conditions was systematically investigated by evaluating the similarity of the orthogonal matrix Θ to the identity matrix for all subsampled condition sets. Proteomes of E. coli under 15 conditions [1] and the corresponding Raman data measured in this study were analyzed in this figure. (A) Relationship between the number of conditions and the probability of obtaining, by chance, a higher level of low-dimensional correspondence than that of the experimental data. This probability is calculated as the probability of finding a random orthogonal matrix with a Pearson correlation coefficient greater than that of Θ, estimated by generating 10⁴ random orthogonal matrices. See section 1.8 and Figure S9 for details of the evaluation method. Each green square corresponds to one subsample, and each short horizontal black line represents the median over all the combinations of conditions (i.e., green squares) for each subsample size x. The blue dashed line indicates the detection limit (i.e., one over the number of generated random orthogonal matrices). The non-subsampled case (i.e., the case with all 15 conditions) in this figure corresponds to Figure S9C. (B) Visual comparison of , and for six representative subsamples indicated in (A). As in Figure S9A, Θ is visualized using |Θ|, whose elements are the absolute values of the corresponding elements of Θ, and the height of each bar in the figures of |Θ| indicates the value of each element of |Θ|. Colors reflect the height of each bar. Spaces created with columns of and are ΩB and ΩLE, respectively. As Θ deviates from the identity matrix from the cases α and β to the case ϵ, the low-dimensional correspondence between ΩB and ΩLE collapses accordingly. Since the case ζ is the non-subsampled case, the figure of |Θ| is the same as Figure S9A, and those of and are the same as Figure S9G. 
Note that the figure of ΩB of the case ζ is also exactly the same as Figure 6C, and that of ΩLE of the case ζ is equal to Figure 6D up to a factor of . The SCGs shown in this figure were defined in the analysis of the proteomes of all the 15 conditions (Figure 4C).
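The subsampling procedure summarized in this caption can be sketched as follows. This is only an illustration under stated assumptions: `p_value_of_subset` is a hypothetical callable standing in for the full analysis of one condition subset (computing Θ and comparing it with random orthogonal matrices, as described in section 1.8), `min_size` is a hypothetical parameter, and flooring p-values at the detection limit is an assumed plotting convention rather than something the caption states.

```python
from itertools import combinations
from statistics import median

def subsample_p_values(conditions, p_value_of_subset, n_random, min_size=2):
    """Collect p-values for every condition subset, grouped by subset size.

    p_value_of_subset is a hypothetical callable that takes a tuple of
    condition names and returns the empirical p-value for that subset.
    """
    detection_limit = 1.0 / n_random  # one over the number of random matrices
    results = {}
    for size in range(min_size, len(conditions) + 1):
        p_values = []
        for subset in combinations(conditions, size):
            p = p_value_of_subset(subset)
            # Assumption: an empirical p-value of 0 only means "below the
            # detection limit", so it is floored at that limit.
            p_values.append(max(p, detection_limit))
        results[size] = p_values
    return results

def medians_by_size(results):
    """Median p-value over all subsets of each size."""
    return {size: median(ps) for size, ps in results.items()}
```

With 15 conditions and `n_random = 10**4`, the detection limit is 10⁻⁴, and the number of subsets peaks at C(15, 7) = C(15, 8) = 6435, so the exhaustive loop over all subsets is the costly part of the procedure.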