Cellular physiological state differences detected by Raman spectral global patterns and gene expression profiles.

(A) Condition-dependent cellular Raman spectral patterns. Raman spectra obtained from cells reflect their molecular profiles. Therefore, systematic differences in global spectral patterns may indicate differences in the cells’ physiological states. A Raman spectrum from each cell can be represented as a vector, that is, as a point in a high-dimensional Raman space. If condition-dependent differences exist in the spectral patterns, appropriate dimensionality reduction methods allow us to classify the spectra and detect cellular physiological states in a low-dimensional space. (B) Condition-dependent gene expression profiles. Global gene expression profiles (proteomes and transcriptomes) also depend on conditions. For each gene, we can consider a high-dimensional vector whose elements represent expression levels under different conditions. It has been suggested that these expression-level vectors are constrained to low-dimensional manifolds (7–17). This study characterizes the statistical correspondence between dimension-reduced Raman spectral patterns and gene expression profiles. By analyzing this correspondence, we reveal a stoichiometry conservation principle that constrains gene expression profiles to low-dimensional manifolds.
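As an illustration of the idea in (A), the following minimal sketch projects labeled spectra onto linear discriminant axes, the method named in the caption of the next figure. The data are synthetic, and the function name, array sizes, and noise level are assumptions made for this example; it is not the authors' analysis pipeline.

```python
import numpy as np

def lda_projection(X, labels):
    """Project the rows of X onto at most (n_classes - 1) LDA axes."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    grand_mean = X.mean(axis=0)
    n_features = X.shape[1]
    S_w = np.zeros((n_features, n_features))  # within-class scatter
    S_b = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        S_w += centered.T @ centered
        diff = (mean_c - grand_mean)[:, None]
        S_b += Xc.shape[0] * (diff @ diff.T)
    # Directions that maximize between-class relative to within-class scatter.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1][: len(classes) - 1]
    W = eigvecs[:, order].real
    return (X - grand_mean) @ W

# Synthetic stand-in for spectra: 3 conditions, 20 cells each, 6 channels.
rng = np.random.default_rng(0)
centers = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [5.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0, 0.0, 0.0, 0.0],
])
labels = np.repeat(np.arange(3), 20)
X = centers[labels] + 0.3 * rng.normal(size=(60, 6))

Z = lda_projection(X, labels)  # one row per cell, 3 - 1 = 2 LDA coordinates
```

With real spectra the number of spectral channels is typically much larger than in this toy case, so the within-class scatter matrix can be singular; the pseudo-inverse used here is only one simple way to handle that.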

Estimation of proteomes from Raman spectra.

(A) The experimental design. We cultured E. coli cells under 15 different conditions and measured Raman spectra of single cells. We then examined the correspondence between the measured Raman spectra and the absolute quantitative proteome data reported by Schmidt et al. (20). (B) Representative Raman spectra from single cells, one from the “Glucose” condition and the other from the “LB” condition. The fingerprint region and representative peaks are annotated. (C to E) Cellular Raman spectra in LDA space. The dimensionality of the spectra is reduced to 14 (= 15 − 1). Each point represents a spectrum from a single cell, and each ellipse shows the 95% concentration ellipse for each condition. Their projections onto the LDA1-LDA2 plane (C), the LDA1-LDA3 plane (D), and the LDA1-LDA4 plane (E) are shown. (F) Visualization of the 14-dimensional LDA space embedded in a two-dimensional space with t-SNE. (G) The scheme of leave-one-out cross-validation. The Raman and proteome data of one condition (here j) are excluded, and the matrix B is estimated using the data of the remaining conditions. The proteome data under condition j are then estimated from the Raman data with this estimate of B and compared with the actual data to calculate estimation errors. (H) Comparison of measured and estimated proteome data. The plot for the “Glucose” condition is shown as an example. Each dot corresponds to one protein species. The straight line indicates x = y. Proteins with negative estimated values are not shown.
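The leave-one-out scheme in (G) can be sketched as follows. This is a minimal illustration on synthetic data, assuming that each protein's abundance is modeled as a constant term plus a linear combination of condition-averaged LDA coordinates and that the coefficients are fitted by ordinary least squares. The caption does not state the estimator or the number of LDA axes used, so the choice of 4 axes, the function name, and the array shapes are assumptions.

```python
import numpy as np

def loocv_proteome_estimates(R, P):
    """Leave-one-condition-out estimates of the proteome.

    R: (n_conditions, m) condition-averaged LDA coordinates (assumed input).
    P: (n_conditions, n_proteins) measured protein abundances.
    Returns an array shaped like P holding the held-out estimates.
    """
    n_conditions = R.shape[0]
    # A leading column of ones carries the constant term of each protein.
    design = np.hstack([np.ones((n_conditions, 1)), R])
    P_hat = np.empty_like(P, dtype=float)
    for j in range(n_conditions):
        keep = np.arange(n_conditions) != j          # exclude condition j
        coef, *_ = np.linalg.lstsq(design[keep], P[keep], rcond=None)
        P_hat[j] = design[j] @ coef                  # estimate for condition j
    return P_hat

# Synthetic example: 15 conditions, 4 LDA axes, 50 proteins, no noise.
rng = np.random.default_rng(1)
R = rng.normal(size=(15, 4))
true_coef = rng.normal(size=(5, 50))
P = np.hstack([np.ones((15, 1)), R]) @ true_coef

P_hat = loocv_proteome_estimates(R, P)
errors = np.abs(P_hat - P)  # estimation error for each condition and protein
```

Because the synthetic proteome is exactly linear in the LDA coordinates, the held-out estimates reproduce it up to rounding; real data would leave nonzero errors like those compared in (H).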

A stoichiometrically conserved protein group identified by an analysis of the Raman-proteome coefficient matrix.

(A) Scatterplots of Raman-proteome transformation coefficients. The horizontal axis shows the constant term (b0) in all the plots. The vertical axis shows the coefficient for LDA1 (b1), LDA2 (b2), LDA3 (b3), or LDA4 (b4) in each plot. The proteins in the ISP COG class are indicated in yellow. Yellow solid straight lines are least-squares regression lines passing through the origin for the ISP proteins. Insets are enlarged views of the areas around the origins. In this figure, we used the average of the estimates obtained in the leave-one-out cross-validation as an estimate of B. (B) Similarity of expression patterns between culture conditions for each COG class. We divided the proteome into COG classes (28, 29) and calculated Pearson’s correlation coefficient of expression patterns for all the combinations of culture conditions. Since the data are from 15 conditions, there are 105 (= 15·14/(2·1)) points for each COG class in the graph. The box-and-whisker plots summarize the distributions of the points. The lines inside the boxes denote the medians, and the bottom and top edges of the boxes denote the 25th and 75th percentiles, respectively. The numbers of protein species are 376 for the Cellular Processes and Signaling COG class, 354 for the ISP COG class, and 840 for the Metabolism COG class. See Fig. S4 for the evaluation with the Pearson correlation coefficient of log abundances and with cosine similarity. Fig. S4 also contains figures directly showing expression-level changes of different protein species across conditions for each COG class. (C) Examples of stoichiometry-conserving proteins in the ISP COG class. The horizontal axis represents the abundance of RplF under 15 conditions, and the vertical axis represents those of several ISP COG class proteins. These proteins are also contained in the homeostatic core defined later (see Fig. 4). The solid straight lines are linear regression lines with an intercept of zero. (D) Examples of abundance ratios of non-ISP COG class proteins. 
The horizontal axis represents the abundance of RplF under 15 conditions, and the vertical axis represents those of the compared non-ISP COG class proteins. Crp belongs to the Cellular Processes and Signaling COG class; the other proteins belong to the Metabolism COG class. In both (C) and (D), we selected proteins expressed from distant loci on the chromosome. All sigma factors participating in the regulation of the proteins examined in (C) and (D) are listed to the right of the gene name legends. All transcription factors known to regulate multiple genes listed here are shown in the diagrams on the right. Arrows show activation; bars represent inhibition; and squares indicate that a transcription factor activates or inhibits depending on other factors. The information on gene regulation and functions was obtained from EcoCyc (49) in August 2022. The error bars are standard errors calculated by using the data of (20). The inset shows the positions of the genes on the E. coli chromosome determined based on ASM75055v1.46 (50). No genes are in the same operon.
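Two computations described in this caption can be sketched briefly: the Pearson correlation between every pair of culture conditions in (B), which gives 105 values for 15 conditions, and the regression line with zero intercept in (A) and (C). The abundance matrix below is random placeholder data, and the function names are hypothetical.

```python
from itertools import combinations

import numpy as np

def pairwise_condition_correlations(P):
    """Pearson r between the expression patterns of every pair of conditions.

    P: (n_proteins, n_conditions) abundances of the proteins in one class.
    """
    n_conditions = P.shape[1]
    return [
        float(np.corrcoef(P[:, a], P[:, b])[0, 1])
        for a, b in combinations(range(n_conditions), 2)
    ]

def slope_through_origin(x, y):
    """Least-squares slope of the line y = s * x (intercept fixed at zero)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (x @ x))

# Placeholder data: 354 proteins (the size of the ISP class) in 15 conditions.
rng = np.random.default_rng(2)
P = rng.random((354, 15))
correlations = pairwise_condition_correlations(P)  # one value per condition pair

# Two proteins with a perfectly conserved 2:1 abundance ratio.
slope = slope_through_origin([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Fixing the intercept at zero means the fitted slope is the abundance ratio of the two proteins, which is why points on such a line indicate conserved stoichiometry.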

Extracting SCGs from proteome data.

(A) Quantifying stoichiometry conservation by cosine similarity. We consider an expression vector for each protein species whose elements represent its abundance under different conditions. The cosine similarity between the expression vectors of two protein species is close to 1 when they strongly conserve their mutual stoichiometry across conditions, whereas it is lower than 1 when their expression patterns are incoherent. (B) Extracted SCGs. We extracted proteins with high cosine similarity relationships. Each node represents a protein species. An edge connecting two nodes indicates that the expression patterns of the two connected protein species have a cosine similarity exceeding a threshold of 0.995. Proteins that have no edge to any other protein are not shown. The largest and the second largest protein groups, which we refer to as SCG 1 and SCG 2, respectively, are indicated by shaded polygons. (C) Expression patterns of the extracted SCGs. The horizontal and vertical axes represent growth rate and protein abundance, respectively. Line-connected points represent expression-level changes of different protein species across conditions. The inset for SCG 2 shows the total abundances of SCG 2 proteins with a log-scaled vertical axis. Error bars are standard errors. (D) The gene loci of the homeostatic core (SCG 1) proteins on the chromosome. Magenta dots are nodes (genes), and gray lines are edges (high cosine similarity relationships). We determined the gene loci based on ASM75055v1.46 (50).
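The procedure in (A) and (B) can be sketched as below: compute the cosine similarity between expression vectors, connect pairs that exceed the 0.995 threshold, and collect the resulting groups. Treating each group as a connected component of the thresholded graph is an assumption of this sketch, as are the function names and the toy abundance values.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Cosine similarity between all rows of E (one expression vector per protein)."""
    E = np.asarray(E, dtype=float)
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    return unit @ unit.T

def extract_groups(A, threshold=0.995):
    """Connected components of the graph with edges where A exceeds the threshold.

    Proteins without any edge are omitted; groups are sorted by size, largest first.
    """
    n = A.shape[0]
    adjacency = (A > threshold) & ~np.eye(n, dtype=bool)
    seen, groups = set(), []
    for start in range(n):
        if start in seen or not adjacency[start].any():
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            group.append(node)
            for neighbor in np.flatnonzero(adjacency[node]):
                neighbor = int(neighbor)
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(sorted(group))
    return sorted(groups, key=len, reverse=True)

# Toy abundances of 6 proteins under 4 conditions.
E = np.array([
    [1.0, 2.0, 3.0, 4.0],   # proteins 0-2 keep fixed ratios to each other
    [2.0, 4.0, 6.0, 8.0],
    [3.0, 6.0, 9.0, 12.0],
    [4.0, 3.0, 2.0, 1.0],   # proteins 3-4 form a second group
    [8.0, 6.0, 4.0, 2.0],
    [1.0, 0.0, 0.0, 5.0],   # protein 5 is not coherent with any other
])
A = cosine_similarity_matrix(E)
groups = extract_groups(A)
```

In this toy case the largest group plays the role of SCG 1 and the second largest that of SCG 2.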

A proteome structure characterized by global stoichiometry conservation relationships.

(A) Distributions of stoichiometry conservation centrality values for all the proteins (gray), the homeostatic core (SCG 1) proteins (magenta), and the proteins belonging to the other SCGs (cyan). (B) Correlation between stoichiometry conservation centrality and gene essentiality. The proportion of essential genes within each class of stoichiometry conservation ranking is shown. The list of essential genes was downloaded from EcoCyc (49). (C) Correlation between stoichiometry conservation and evolutionary conservation. The strength of evolutionary conservation of each protein species was estimated by the number of orthologs found in the OrthoMCL species (35). The genes with more orthologs tend to have higher stoichiometry conservation centrality (p = 6.24 × 10⁻¹⁵ by one-sided Brunner-Munzel test between the top 25% and the bottom 25% fractions of ortholog number ranking). Likewise, the genes with higher stoichiometry conservation centrality scores tend to have more orthologs (p = 4.04 × 10⁻¹² by one-sided Brunner-Munzel test, Top 25%–Bottom 25% comparison; p-values in the captions for (F) to (I) were evaluated with the same statistical test scheme). (D) to (G) Stoichiometry conservation analyses of human cell atlas transcriptome data of 15 fetal organs (36). The top gray histogram in (D) shows the distribution of stoichiometry conservation centrality values for all genes. The bottom histograms in (D) show the distribution for coding genes (yellow) and that for the other genes (cyan). (E) shows a correlation between the ratio of coding genes and stoichiometry conservation centrality calculated from the human cell atlas data. (F) shows a correlation between gene essentiality and stoichiometry conservation centrality calculated from the human cell atlas data. The essentiality of each human gene was quantified by the CRISPR score, which is the fitness cost imposed by CRISPR-based inactivation of the gene in KBM7 chronic myelogenous leukemia cells (35). 
Genes with lower CRISPR scores are regarded as more essential. The fraction with low CRISPR scores (i.e., the highly essential fraction) tends to have higher stoichiometry conservation centrality (p < 10⁻¹⁵). The fraction with high centrality scores tends to be more essential (p < 10⁻¹⁵). (G) shows a correlation between evolutionary conservation and stoichiometry conservation centrality based on the human cell atlas data. The gene fraction with many orthologs tends to have higher stoichiometry conservation centrality (p < 10⁻¹⁵). The gene fraction with high centrality scores tends to have more orthologs (p < 10⁻¹⁵). (H) and (I) Stoichiometry conservation analyses of genome-wide Perturb-seq data (37). (H) shows a correlation between stoichiometry conservation centrality calculated from the Perturb-seq data and gene essentiality. The essentiality of each gene was quantified by the CRISPR score as in (F). The gene fraction with low CRISPR scores (i.e., the highly essential fraction) tends to have higher stoichiometry conservation centrality (p < 10⁻¹⁵). The gene fraction with high centrality scores tends to be more essential (p < 10⁻¹⁵). (I) shows a correlation between stoichiometry conservation based on the Perturb-seq data and evolutionary conservation of genes. The gene fraction with many orthologs tends to have higher stoichiometry conservation centrality (p < 10⁻¹⁵). The gene fraction with high centrality scores tends to have more orthologs (p < 10⁻¹⁵). (J) Representation of the proteomes as a graph. A node corresponds to a protein species, and the weight of an edge is taken as the cosine similarity between the expression vectors of the two connected protein species. The matrix A specifies the whole graph. Note that the diagonal elements of A are set to one for simplicity. (K) Cosine similarity LE (csLE) structure in a three-dimensional space. 
Each dot represents a different protein species and is color-coded on the basis of its stoichiometry conservation centrality value. We selected the axes considering the structural similarity to the Raman-based proteome structure in ΩB (see Fig. 6). (L) The csLE structure in a three-dimensional space. The views from two different angles are shown. Each gray dot represents a different protein species. The proteins belonging to each SCG are indicated with distinct markers.
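The graph representation in (J) and the embedding in (K) and (L) can be illustrated with a short sketch. It rests on two assumptions that the caption does not state: that the stoichiometry conservation centrality of a protein is its weighted degree in the graph (the row sum of A), and that csLE follows the standard Laplacian eigenmaps recipe, that is, the generalized eigenproblem (D − A) v = λ D v with D the diagonal degree matrix. The data are random placeholders and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Cosine similarity between all rows of E; the diagonal equals one."""
    E = np.asarray(E, dtype=float)
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    return unit @ unit.T

def centrality_and_embedding(A, n_dims=3):
    """Weighted degrees and a Laplacian-eigenmaps embedding of the graph A.

    Returns (d, eigenvalues, coordinates), where the coordinates solve
    (D - A) v = lambda * D v for the n_dims smallest non-trivial eigenvalues.
    """
    d = A.sum(axis=1)                      # weighted degree of each node
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^(-1/2) A D^(-1/2).
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    V = d_inv_sqrt[:, None] * eigvecs          # back to generalized eigenvectors
    # Skip the first (trivial, constant) solution with eigenvalue zero.
    return d, eigvals[1:n_dims + 1], V[:, 1:n_dims + 1]

# Placeholder abundances: 40 proteins under 15 conditions, all positive.
rng = np.random.default_rng(3)
E = rng.random((40, 15)) + 0.1
A = cosine_similarity_matrix(E)
d, eigvals, coords = centrality_and_embedding(A)
```

Each row of `coords` gives the position of one protein in a three-dimensional space, and `d` can be used to color the points as in (K).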

Raman-based proteome structure and its similarity to stoichiometry-based proteome structure.

(A) Proteome structure determined by Raman-proteome coefficients visualized in a three-dimensional space. The views from two different angles are shown. Each gray dot represents a protein species. The proteins belonging to each SCG are indicated with distinct markers. We note that SCGs are defined without referring to Raman data (Fig. 4). (B to D) Similarity among the distribution of LDA Raman spectra (B), the proteome structure determined by Raman-proteome coefficients (C), and the proteome structure determined by stoichiometry conservation (D). (E) Mathematical relation between the coordinates of the proteins in ΩB (C) and ΩLE (D). The two conditions, one between b0 and (cyan), and the other with Θ (magenta), must hold for the similarity between the two proteome structures (yellow), as described in the gray box. denotes column-wise proportionality.

Proportionality between stoichiometry conservation centrality and expression generality.

(A) Relationships between stoichiometry conservation centrality (di) and expression generality (gi). Each gray dot represents a protein species. The proteins belonging to each SCG are indicated with distinct markers. The dashed lines are ). The solid lines represent (see section 2.2 in (22)). The deviation of a point from the solid line is related to the growth rate under the condition where each protein is expressed the most. (B) The same plot as (A) in black and white. Overlaid red circles indicate proteins featured in (C). (C) Expression patterns across conditions of the proteins indicated by red circles in (B). The differences between conditions are represented by the growth rate on the horizontal axes. The arrangement of the plots for the proteins corresponds to their relative positions in (B).