Revealing global stoichiometry conservation architecture in cells from Raman spectral patterns
eLife Assessment
This paper reports the fundamental finding of how Raman spectral patterns correlate with proteome profiles, using Raman spectra of E. coli cells from different physiological conditions, and reveals global stoichiometric regulation of proteomes. The authors' findings provide compelling evidence that stoichiometric regulation of proteomes is general, through analysis of both bacterial and human cells. In the future, similar methodology can be applied to various tissue types and microbial species to study proteome composition with Raman spectral patterns.
https://doi.org/10.7554/eLife.101485.3.sa0
During the peer-review process the editor and reviewers write an eLife Assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate).
Abstract
Cells can adapt to various environments by changing their biomolecular profiles while maintaining physiological homeostasis. What organizational principles in cells enable the simultaneous realization of adaptability and homeostasis? To address this question, we measure Raman scattering light from Escherichia coli cells under diverse conditions, whose spectral patterns convey their comprehensive molecular composition. We reveal that dimension-reduced Raman spectra can predict condition-dependent proteome profiles. Quantitative analysis of the Raman-proteome correspondence characterizes a low-dimensional hierarchical stoichiometry-conserving proteome structure. The network centrality of each gene in the stoichiometry conservation relations correlates with its essentiality and evolutionary conservation, and these correlations are preserved from bacteria to human cells. Furthermore, stoichiometry-conserving core components obey growth law and ensure homeostasis across conditions, whereas peripheral stoichiometry-conserving components enable adaptation to specific conditions. Mathematical analysis reveals that the stoichiometrically constrained architecture is reflected in major changes in Raman spectral patterns. These results uncover coordination of global stoichiometric balance in cells and demonstrate that vibrational spectroscopy can decipher such biological constraints beyond statistical or machine-learning inference of cellular states.
Introduction
Biological cells can change their gene expression and metabolic profiles globally to adapt to their biological contexts and external conditions, while maintaining the homeostasis of their core physiological states. The simultaneous realization of adaptability and homeostasis is a hallmark of biological systems and is assumed to be a system-level property of gene expression profiles in cells (Waddington, 1957; Waddington, 1959). However, understanding the underlying organizational principles in comprehensive gene expression profiles remains a fundamental problem in biology.
Vibrational spectroscopy such as Raman spectroscopy might help us investigate such principles in gene expression profiles. Raman spectroscopy is a light scattering technique that measures energy shifts of light caused by interaction with sample molecules. Raman spectra are obtainable non-destructively even from biological samples such as individual cells. In principle, cellular Raman spectra are optical signatures conveying comprehensive molecular composition of targeted cells (Goodacre et al., 1998; Huang et al., 2004; Ichimura et al., 2014; Germond et al., 2018). Furthermore, no prior treatments, such as staining and tagging, are necessary to obtain cellular Raman spectra. However, although some biomolecules have separable and intense Raman signal peaks, Raman spectra of most biomolecules overlap and are masked by signals of other molecules due to the diversity and complexity of molecular compositions of cells. Therefore, it is impractical to comprehensively determine the amounts of biomolecules by spectral decomposition.
Despite the intractability of spectral decomposition, reconstruction of comprehensive molecular profiles may be achievable by analyzing detectable global spectral patterns (Figure 1A), thanks to effective low dimensionality of changes in molecular profile of targeted cells (Eisen et al., 1998; Segal et al., 2003; Bergmann et al., 2003; Keren et al., 2013; You et al., 2013; Kaneko et al., 2015; Hui et al., 2015; Heimberg et al., 2016; Biswas et al., 2017; Husain and Murugan, 2020; Sato and Kaneko, 2020; Figure 1B and Appendix 1—figure 1). Indeed, it has been demonstrated that condition-dependent global transcriptome profiles of cells can be inferred from cellular Raman spectra based on their statistical correspondence (Kobayashi-Kirschvink et al., 2018; Kobayashi-Kirschvink et al., 2024). Importantly, this Raman-spectroscopic transcriptome inference was possible from dimension-reduced Raman spectra. Therefore, dominant changes in global Raman spectral patterns may contain vital information about the constraints on the molecular profiles in cells; an inspection of their correspondence might give us insights into architectural principles of omics profiles and biological foundation for global omics inference from spectral patterns (Appendix 1—figure 1).
Figure 1. Cellular physiological state differences detected by Raman spectral global patterns and gene expression profiles.
(A) Condition-dependent cellular Raman spectral patterns. Raman spectra obtained from cells reflect their molecular profiles. Therefore, systematic differences in global spectral patterns may indicate their physiological states. A Raman spectrum from each cell can be represented as a vector and a point in a high-dimensional Raman space. If condition-dependent differences exist in the spectral patterns, appropriate dimensional reduction methods allow us to classify the spectra and detect cellular physiological states in a low-dimensional space. (B) Condition-dependent gene expression profiles. Global gene expression profiles (proteomes and transcriptomes) are also dependent on conditions. For each gene, we can consider a high-dimensional vector whose elements represent expression levels under different conditions. It has been suggested that these expression-level vectors are constrained to some low-dimensional manifolds (Eisen et al., 1998; Segal et al., 2003; Bergmann et al., 2003; Keren et al., 2013; You et al., 2013; Kaneko et al., 2015; Hui et al., 2015; Heimberg et al., 2016; Biswas et al., 2017; Husain and Murugan, 2020; Sato and Kaneko, 2020). This study characterizes the statistical correspondence between dimension-reduced Raman spectral patterns and gene expression profiles. Analyzing the correspondence, we reveal a stoichiometry conservation principle that constrains gene expression profiles to low-dimensional manifolds.
In this report, we first reveal that, in addition to transcriptomes, condition-dependent proteome profiles of Escherichia coli are predictable from cellular Raman spectra. Next, we scrutinize the correspondence between Raman and proteome data, identifying several stoichiometrically conserved groups (SCGs) whose expression tightly correlates with the major changes in cellular Raman spectra. Finally, we reveal that the stoichiometry conservation centrality of each gene correlates with its essentiality, evolutionary conservation, and condition specificity of gene expression levels, a relationship that turns out to be general across different omics layers and organisms.
Results
Statistical correspondence between Raman spectra and proteomes
To examine the correspondence between Raman spectra and proteomes in E. coli, we reproduced 15 environmental conditions for which absolute quantitative proteome data are already available (Schmidt et al., 2016) and measured Raman spectra of E. coli cells under those conditions (Figure 2A and B). The culture conditions we adopted include (i) exponential growth phase in minimal media with various carbon sources, (ii) exponential growth phase in rich media, (iii) exponential growth phase with various stressors, and (iv) stationary phases (Appendix 1—table 1). We measured Raman spectra of single cells sampled from each condition and focused on the fingerprint region of biological samples, where the signals from various biomolecules concentrate (spectral range of 700–1800 cm⁻¹, Figure 2B and Appendix 1—figure 2). The Raman spectra were classified on the basis of the environmental conditions using a simple linear classifier, linear discriminant analysis (LDA) (Goodacre et al., 1998; Huang et al., 2004; De Bie et al., 2005; Figure 2C–E and Appendix 1—figure 1). This classifier calculates the most discriminatory axes by maximizing the ratio of between-condition variance to within-condition variance and reduces the dimensions of Raman data to $m-1$, where $m$ is the number of conditions (see ‘Experimental methods, data acquisition, and data analyses’ in Materials and methods and Section 2.1 in Appendix).
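For reference, the variance-ratio criterion described above can be written in the standard Fisher form. The symbols below ($\boldsymbol{w}$ for a discriminant axis, $S_B$ and $S_W$ for the between- and within-condition scatter matrices, $\boldsymbol{\mu}_c$ and $N_c$ for the mean spectrum and number of spectra in condition $c$, $\boldsymbol{\mu}$ for the overall mean, $m$ for the number of conditions) are notation assumed here for illustration and are not taken from the paper's Materials and methods.

```latex
J(\boldsymbol{w}) = \frac{\boldsymbol{w}^{\top} S_B\, \boldsymbol{w}}{\boldsymbol{w}^{\top} S_W\, \boldsymbol{w}},
\qquad
S_B = \sum_{c=1}^{m} N_c\, (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{\top},
\qquad
S_W = \sum_{c=1}^{m} \sum_{\boldsymbol{s} \in c} (\boldsymbol{s} - \boldsymbol{\mu}_c)(\boldsymbol{s} - \boldsymbol{\mu}_c)^{\top}
```

Because the weighted deviations $N_c(\boldsymbol{\mu}_c - \boldsymbol{\mu})$ sum to zero, $S_B$ has rank at most $m-1$, so at most $m-1$ discriminant axes carry between-condition variance. With 15 conditions this gives the 14-dimensional LDA space used in Figure 2.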
Figure 2. Estimation of proteomes from Raman spectra.
(A) The experimental design. We cultured E. coli cells under 15 different conditions and measured single cells’ Raman spectra. We then examined the correspondence between the measured Raman spectra and the absolute quantitative proteome data reported by Schmidt et al., 2016. (B) Representative Raman spectra from single cells, one from the ‘Glucose’ condition, and the other from the ‘LB’ condition. The fingerprint region and representative peaks are annotated. (C–E) Cellular Raman spectra in linear discriminant analysis (LDA) space. The dimensionality of the spectra is reduced to $m-1 = 14$. Each point represents a spectrum from a single cell, and each ellipse shows the 95% concentration ellipse for each condition. Their projections to the LDA1-LDA2 plane (C), the LDA1-LDA3 plane (D), and the LDA1-LDA4 plane (E) are shown. (F) Visualization of the 14-dimensional LDA space embedded in two-dimensional space with t-distributed stochastic neighbor embedding (t-SNE). (G) The scheme of leave-one-out cross-validation. The Raman and proteome data of one condition (here, condition $i$) are excluded, and the matrix $B$ is estimated using the data of the rest of the conditions as $\hat{B}$. The proteome data under the excluded condition is estimated from the Raman data $\boldsymbol{r}_i$ with $\hat{B}$ and compared with the actual data to calculate estimation errors. (H) Comparison of measured and estimated proteome data. The plot for the ‘Glucose’ condition is shown as an example. Each dot corresponds to one protein species. The straight line indicates $y = x$. Proteins with negative estimated values are not shown.
The result shows that Raman spectral points from different environmental conditions are distinguishable in the $(m-1)$-dimensional LDA space (Figure 2C–E). For example, the first and second LDA axes clearly distinguish the conditions ‘LB’ and ‘stationary3days’ (Figure 2C), and the third axis distinguishes ‘Glucose42C’ and ‘GlycerolAA’ (Figure 2D). Notably, the first LDA axis (LDA1) correlated significantly with growth rate (Pearson correlation; Appendix 1—figure 2). Visualizing the Raman LDA data by embedding them on a two-dimensional plane using t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008) confirms that the points for each condition form a distinctive cluster (Figure 2F). These results imply that positions in the Raman LDA space reflect condition-dependent differences in cellular physiological states.
We next asked whether these Raman spectral differences in the LDA space could be linked to the different proteome profiles (Appendix 1—figure 1). To examine this, we hypothesized linear correspondence between the $n$-dimensional proteome column vector $\boldsymbol{p}_i$, where $n$ is the number of protein species in the proteome data, and the low-dimensional ($(m-1)$-dimensional) Raman column vector $\boldsymbol{r}_i$ in condition $i$,

$$\boldsymbol{p}_i = B \begin{pmatrix} 1 \\ \boldsymbol{r}_i \end{pmatrix}. \tag{1}$$

$B$ is an $n \times m$ matrix that connects $\boldsymbol{r}_i$ and $\boldsymbol{p}_i$. We calculated $\boldsymbol{r}_i$ as the average of the low-dimensional LDA Raman data of single cells in condition $i$ since the proteomes were measured for cell populations (Table 1).
Table 1. List of scalars, vectors, and matrices in the main text.
Scalars, vectors, and matrices in the main text are listed with their sizes and descriptions. $m$ is the number of conditions, and $n$ is the number of protein species ($m = 15$ in the main text). Note that the notation summarized in this table differs in some respects from that in Materials and methods and Appendix.
| Symbol | Size (#rows × #columns) | Description |
|---|---|---|
| $\boldsymbol{r}_i$ | $m-1$ (vector) | Mean LDA Raman profile of single cells under condition $i$ |
| $\boldsymbol{p}_i$ | $n$ (vector) | Proteome profile of cell population under condition $i$ |
| $B$ | $n \times m$ | Set of condition-independent coefficients that linearly connect $\boldsymbol{r}_i$ and $\boldsymbol{p}_i$ for all conditions (Equation 1) |
| $\boldsymbol{x}_j$ | $m$ (vector) | Expression levels of protein species $j$ across conditions |
| $C_{jk}$ | (scalar) | Stoichiometry (abundance ratio) conservation strength between two protein species $j$ and $k$ (Figure 4A) |
| $C$ | $n \times n$ | Set of stoichiometry conservation strengths between all pairs of protein species (Figure 5J) |
| $d_j$ | (scalar) | Stoichiometry conservation centrality of protein species $j$ |
| $g_j$ | (scalar) | Expression generality of protein species $j$ |
We conducted leave-one-out cross-validation (LOOCV) to verify this linear correspondence (Figure 2G). We excluded one condition (here, condition $i$) as a test condition and estimated $B$ as $\hat{B}$ by simple ordinary least squares (OLS) regression using the data of the rest of the conditions. We thereby estimated the proteome in condition $i$ as $\hat{\boldsymbol{p}}_i = \hat{B} \begin{pmatrix} 1 \\ \boldsymbol{r}_i \end{pmatrix}$.
The proteome profile estimated using the first four major LDA axes (LDA1–LDA4) agreed well with the actual proteome data under most conditions (Figure 2H and Appendix 1—figure 3; see ‘Raman-proteome statistical correspondence’ in Materials and methods for the estimation with all the LDA axes). Changing the condition to exclude, we estimated the proteomes for all 15 conditions and calculated the overall estimation error as the Euclidean distance between the estimated and measured proteomes. The result shows that the overall estimation error is significantly small (permutation test; Fisher, 1935; Pitman, 1937; Phipson and Smyth, 2010). Adopting other distance measures does not change the conclusion (Appendix 1—tables 2 and 3). These results, therefore, validate the assumption of linear correspondence between cellular Raman spectra and proteomes and confirm that condition-dependent changes in proteomes can be inferred from the corresponding low-dimensional Raman spectra.
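The leave-one-out procedure described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the authors' pipeline: the function names (`solve`, `fit_ols`, `predict`, `loocv_errors`), the toy matrix `B_true`, and the toy sizes (8 conditions, 2 Raman axes, 3 proteins) are all assumptions made for the example, and the permutation test is not reproduced because its scheme is not specified in this section.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_ols(R, P):
    """OLS estimate of B (one row per protein) for p = B (1, r)."""
    X = [[1.0] + list(r) for r in R]  # design matrix with a constant term
    k = len(X[0])
    XtX = [[sum(x[a] * x[b] for x in X) for b in range(k)] for a in range(k)]
    B_hat = []
    for j in range(len(P[0])):
        Xty = [sum(x[a] * p[j] for x, p in zip(X, P)) for a in range(k)]
        B_hat.append(solve(XtX, Xty))  # normal equations for protein j
    return B_hat

def predict(B_hat, r):
    """Estimated proteome for one Raman vector r."""
    x = [1.0] + list(r)
    return [sum(bk * xk for bk, xk in zip(row, x)) for row in B_hat]

def loocv_errors(R, P):
    """Euclidean estimation error for each held-out condition."""
    errors = []
    for i in range(len(R)):
        R_train, P_train = R[:i] + R[i + 1:], P[:i] + P[i + 1:]
        p_hat = predict(fit_ols(R_train, P_train), R[i])
        errors.append(sum((a - b) ** 2 for a, b in zip(p_hat, P[i])) ** 0.5)
    return errors

# Synthetic, exactly linear data (hypothetical values, for illustration only).
random.seed(0)
B_true = [[5.0, 1.0, -2.0], [3.0, 0.5, 0.0], [1.0, -1.0, 4.0]]
R = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(8)]
P = [predict(B_true, r) for r in R]
errors = loocv_errors(R, P)
```

Because the toy proteomes are generated from an exactly linear map, the held-out errors are at floating-point level; with real data the errors would be compared against a null distribution, as the text does with a permutation test.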
Stoichiometry conservation of proteins in the ISP COG class
Since the dimensionality of the proteome data is significantly higher than that of the Raman data, the result above suggests that changes in proteome profiles are constrained in low-dimensional space. The regression matrix $B$ considered above determines how the proteomes relate to the Raman LDA axes. Therefore, analyzing $B$ should provide some insights into constraints on condition-dependent changes in the proteomes (Appendix 1—figure 1).
The matrix $B$ is represented as $B = (\boldsymbol{b}_0\ \boldsymbol{b}_1\ \cdots\ \boldsymbol{b}_{m-1})$, where the $k$-th column $\boldsymbol{b}_k$ ($k = 0, 1, \ldots, m-1$) is the collection of coefficients of all proteins for the $k$-th LDA axis (Table 1). In the case of $k = 0$, the coefficients are constant terms. We first asked whether any shared features might exist in the coefficients of $B$ depending on biological functions of corresponding proteins. We then classified the proteins according to functional annotations of Clusters of Orthologous Group (COG) classes (Tatusov et al., 1997; Tatusov et al., 2003; Galperin et al., 2015) and found that, for many proteins belonging to the ‘information storage and processing’ (ISP) COG class, the coefficients corresponding to different LDA axes are approximately proportional to the constant terms, i.e., $b_{jk} \approx a_k b_{j0}$, where $j$ is the index of an ISP COG class protein species and $a_k$ is the proportionality constant common to many ISP COG class protein species for the $k$-th LDA axis (Figure 3A). The ISP COG class contains various proteins involved in processing genetic information such as translation, transcription, DNA replication, and DNA repair (Schmidt et al., 2016). Simple calculations show that these proportionality relationships imply that proteins in the ISP COG class conserve their mutual abundance ratios, i.e., stoichiometry, irrespective of environmental conditions (see ‘Characterizing an SCG by analyzing the Raman-proteome correspondence matrix’ in Materials and methods).
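The step from proportional coefficients to conserved abundance ratios can be sketched as follows, assuming the linear correspondence of Equation 1 holds. The element-wise notation is introduced here for illustration: $p_{ji}$ is the abundance of protein $j$ in condition $i$, $b_{jk}$ is the coefficient of protein $j$ for the $k$-th LDA axis (with $b_{j0}$ the constant term), $r_{ik}$ is the $k$-th LDA coordinate of the mean Raman profile in condition $i$, and $a_k$ is the proportionality constant shared within the group.

```latex
p_{ji} = b_{j0} + \sum_{k=1}^{m-1} b_{jk}\, r_{ik}
\;\approx\; b_{j0} \Bigl( 1 + \sum_{k=1}^{m-1} a_k\, r_{ik} \Bigr)
\quad\Longrightarrow\quad
\frac{p_{ji}}{p_{j'i}} \;\approx\; \frac{b_{j0}}{b_{j'0}}
```

The condition-dependent factor in parentheses is the same for every protein that shares the constants $a_k$, so it cancels in the ratio, which therefore does not depend on the condition $i$.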
Figure 3. A stoichiometrically conserved protein group identified by an analysis of the Raman-proteome coefficient matrix.
(A) Scatterplots of Raman-proteome transformation coefficients. The horizontal axes are constant terms ($\boldsymbol{b}_0$) in all the plots. The vertical axis is coefficients for LDA1 ($\boldsymbol{b}_1$), LDA2 ($\boldsymbol{b}_2$), LDA3 ($\boldsymbol{b}_3$), or LDA4 ($\boldsymbol{b}_4$) in each plot. The proteins in the information storage and processing (ISP) Clusters of Orthologous Group (COG) class are indicated in yellow. Yellow solid straight lines are least squares regression lines passing through the origins for the ISP proteins. Insets are enlarged views of the area around the origins. In this figure, we used the average of $\hat{B}$ as an estimate of $B$. (B) Similarity of expression patterns between culture conditions for each COG class. We divided the proteome into COG classes (Tatusov et al., 2003; Galperin et al., 2015) and calculated the Pearson correlation coefficient of expression patterns for all the combinations of culture conditions. Since the data are from 15 conditions, there are 105 (=15·14/(2·1)) points for each COG class in the graph. The box-and-whisker plots summarize the distributions of the points. The lines inside the boxes denote the medians, and the top and bottom edges of the boxes denote the 75th and 25th percentiles, respectively. The numbers of protein species are 376 for the Cellular Processes and Signaling COG class, 354 for the ISP COG class, and 840 for the Metabolism COG class. See Appendix 1—figure 4 for the evaluation with Pearson correlation coefficient of log abundances and with cosine similarity. Appendix 1—figure 4 also contains figures directly showing expression-level changes of different protein species across conditions for each COG class. (C) Examples of stoichiometry-conserving proteins in the ISP COG class. The horizontal axis represents the abundance of RplF under 15 conditions, and the vertical axis represents those of several ISP COG class proteins. These proteins are also contained in the homeostatic core defined later (see Figure 4). 
The solid straight lines are linear regression lines with an intercept of zero. (D) Examples of abundance ratios of non-ISP COG class proteins. The horizontal axis represents the abundance of RplF under 15 conditions, and the vertical axis represents those of compared non-ISP COG class proteins. Crp belongs to the Cellular Processes and Signaling COG class; the other proteins belong to the Metabolism COG class. In both (C) and (D), we selected the proteins expressed from distant loci on the chromosome. All sigma factors participating in the regulation of the proteins examined in (C) and (D) are listed on the right of the gene name legends. All transcription factors known to regulate multiple genes listed here are shown in the right diagrams. Arrows show activation; bars represent inhibition; and squares indicate that a transcription factor activates or inhibits depending on other factors. The information on gene regulation and functions was obtained from EcoCyc (Keseler et al., 2017) in August 2022. The error bars are standard errors calculated by using the data of Schmidt et al., 2016. The insets show the positions of the genes on the E. coli chromosome determined based on ASM75055v1.46 (Howe et al., 2020). No genes are in the same operon.
Since this is an implication from the Raman-proteome correspondence, we next examined the stoichiometry conservation only with the proteome data, evaluating the expression levels with Pearson correlation coefficients for all the pairs of the conditions for each COG class (Figure 3B). For the ISP COG class, the correlation coefficients were close to 1, whereas those for the other COG classes were significantly smaller depending on condition pairs. We also evaluated the coordination of gene expression patterns within each COG class using cosine similarity and obtained consistent results (Appendix 1—figure 4). Therefore, stoichiometry conservation is stronger in the ISP COG class than in the other COG classes. Remarkably, neither shared transcription factors nor chromosome locations can account for the observed stoichiometry conservation of many protein pairs. Indeed, although the ISP COG class shows highly coordinated expression patterns (Figure 3C) compared to the non-ISP COG class (Figure 3D), the gene loci are not chromosomally clustered in either example. Additionally, the similarity/dissimilarity of expression patterns cannot easily be inferred from transcription factor regulation patterns. These results imply multi-level regulation of their abundance.
We consulted other public quantitative proteome data of Mycobacterium tuberculosis (Schubert et al., 2015), Mycobacterium bovis (Schubert et al., 2015), and Saccharomyces cerevisiae (Lahtvee et al., 2017) under environmental perturbations and consistently found strong stoichiometry conservation of the ISP COG class (Appendix 1—figure 4). Furthermore, the same trend was observed for the genotype-dependent expression changes in E. coli proteomes (Schmidt et al., 2016; Appendix 1—figure 4).
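The between-condition comparison used in Figure 3B can be illustrated with a short sketch: for one protein class, compute the Pearson correlation of expression patterns for every pair of conditions. The function names and the toy abundance values below are hypothetical and chosen only to show the computation; they are not data from Schmidt et al., 2016.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def pairwise_condition_correlations(profiles):
    """profiles: one abundance list per condition, same protein order in each."""
    return [pearson(profiles[a], profiles[b])
            for a, b in combinations(range(len(profiles)), 2)]

# Toy class of 4 proteins whose abundances scale together across 15 conditions,
# i.e. a class that conserves stoichiometry perfectly.
base = [100.0, 40.0, 25.0, 5.0]
profiles = [[(1.0 + 0.1 * c) * v for v in base] for c in range(15)]
corrs = pairwise_condition_correlations(profiles)
```

With 15 conditions the list has 105 entries, matching the number of points per COG class in Figure 3B. A class with perfectly conserved stoichiometry gives correlations of 1 for every pair, which is the limiting case that the ISP COG class approaches.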
Identifying SCGs
Inspired by the existence of a large class of proteins that conserves their stoichiometry, we considered a systematic way to extract SCGs without relying on artificial functional classification of COG (Appendix 1—figure 1). Focusing only on the proteome data, we evaluated stoichiometry conservation for all the pairs of proteins in the proteome by calculating the cosine similarity of expression patterns (i.e. all $C_{jk}$ in Figure 4A and Table 1, computed between the $m$-dimensional vectors $\boldsymbol{x}_j$ whose elements denote the expression level of protein species $j$ under each of the conditions), and extracted groups in each of which the component proteins exhibit coherent expression change patterns by setting a high threshold of cosine similarity ($C_{jk} > 0.995$, Figure 4B; see ‘Direct characterization of SCGs in omics data’ in Materials and methods for details).
Figure 4. Extracting stoichiometrically conserved groups (SCGs) from proteome data.
(A) Quantifying stoichiometry conservation by cosine similarity. We consider an $m$-dimensional expression vector for each protein species whose elements represent its abundance under different conditions. The cosine similarity between the $m$-dimensional expression vectors of two protein species becomes nearly 1 when they conserve mutual stoichiometry strongly across conditions, whereas it is lower than 1 when their expression patterns are incoherent. (B) Extracted SCGs. We extracted proteins with high cosine similarity relationships. Each node represents a protein species. An edge connecting two nodes represents that the expression patterns of the two connected protein species have high cosine similarity exceeding a threshold of 0.995. Proteins that have no edge with the other proteins are not shown. The largest and the second largest protein groups, which we refer to as SCG 1 and SCG 2, respectively, are indicated by shaded polygons. (C) Expression patterns of the extracted SCGs. The horizontal and vertical axes represent growth rate and protein abundance, respectively. Line-connected points represent expression-level changes of different protein species across conditions. SCG 1 (homeostatic core) is shown in two ways: the left panel with a linear-scaled vertical axis and the right panel with a log-scaled vertical axis. The inset for SCG 2 shows the total abundances of SCG 2 proteins with a log-scaled vertical axis. Error bars are standard errors. (D) The gene loci of the homeostatic core (SCG 1) proteins on the chromosome. Magenta dots are nodes (genes), and gray lines are edges (high cosine similarity relationships). We determined the gene loci based on ASM75055v1.46 (Howe et al., 2020).
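The thresholding step can be sketched as a small graph computation: connect two proteins when the cosine similarity of their expression vectors exceeds 0.995, then collect the connected groups. Treating each SCG as a connected component of this graph is an assumption made for this sketch; the exact grouping procedure is given in the Materials and methods section cited above. The protein names `A` to `F` and their values are hypothetical.

```python
from math import sqrt

def cosine_similarity(x, y):
    """Cosine of the angle between two expression vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def extract_groups(expr, threshold=0.995):
    """expr: dict mapping protein name -> expression vector across conditions.
    Returns connected components (two or more proteins) of the thresholded
    similarity graph, largest first."""
    names = list(expr)
    neighbors = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_similarity(expr[a], expr[b]) > threshold:
                neighbors[a].add(b)
                neighbors[b].add(a)
    groups, seen = [], set()
    for start in names:
        if start in seen or not neighbors[start]:
            continue  # already grouped, or no edge to any other protein
        stack, comp = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(neighbors[node] - comp)
        seen |= comp
        groups.append(sorted(comp))
    return sorted(groups, key=len, reverse=True)

expr = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],    # proportional to A
    "C": [0.5, 1.0, 1.5, 2.0],    # proportional to A
    "D": [9.0, 1.0, 0.5, 0.1],    # condition-specific pattern
    "E": [18.0, 2.0, 1.0, 0.2],   # proportional to D
    "F": [1.0, 5.0, 1.0, 5.0],    # not coherent with any other protein
}
groups = extract_groups(expr)
```

In this toy example the largest group plays the role of SCG 1 and the second the role of a condition-specific SCG, while `F` has no edge and is dropped, as proteins without edges are in Figure 4B.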
The largest SCG (SCG 1) included many proteins in the ISP COG class (91 out of 191 SCG 1 members), such as ribosomal proteins and RNA polymerase, and also proteins in the other COG classes (Figure 4B, Appendix 1—table 4). We call this largest SCG the homeostatic core, as it constitutes the largest stoichiometry-conserving unit in cells. We found that the abundance of each protein in the homeostatic core (SCG 1) increased approximately linearly with the growth rate in each condition (Figure 4C). This relationship is reminiscent of the growth law: the total ribosomal content for translation increases linearly with growth rate (Neidhardt and Magasanik, 1960; Scott et al., 2010; Bremer and Dennis, 2008). The linear increase in the abundance of each protein in Figure 4C indicates that the growth law is valid even at the single-gene level for a large class of ribosomal and non-ribosomal proteins in the homeostatic core (Appendix 1—figure 5) (see Section 3.1 in Appendix).
Though not evenly distributed, the gene loci of the proteins in the homeostatic core are scattered throughout the chromosome (Figure 4D). Therefore, localization of gene loci to a single or a small number of operons is not likely a cause of the observed stoichiometry conservation.
The proteins in the second largest SCG (SCG 2) are expressed at high levels in the fast growth conditions, especially in the ‘LB’ condition (Figure 4C). The SCG 2 includes many proteins in the metabolism COG class (21 out of 26 SCG 2 members) (Appendix 1—table 5), and their abundance increases approximately exponentially with growth rate (Figure 4C). We also identified other condition-specific small SCGs, such as a group most expressed in the ‘GlycerolAA’ condition (SCG 3) (Appendix 1—table 6), a group mainly expressed in the ‘Fructose’ condition (SCG 4) (Appendix 1—table 7), and a group most expressed in the stationary phase conditions (SCG 5) (Appendix 1—table 8; Figure 4C).
Biological relevance of stoichiometry conservation
To understand the overall strength of stoichiometry conservation of the proteins in the different SCGs, we calculated the sum of cosine similarity, $d_j = \sum_k C_{jk}$, for each protein species $j$, where $C_{jk}$ is the cosine similarity between the $m$-dimensional expression level vectors of protein $j$ and protein $k$ (Figure 4A), and the sum is taken over all the protein species (see ‘Global proteome structures based on stoichiometric balance’ in Materials and methods). We refer to $d_j$ as ‘stoichiometry conservation centrality’ (Table 1).
The proteins in the homeostatic core had high centrality scores (Figure 5A). Therefore, these proteins tend to have more connections with other proteins in terms of stoichiometry conservation. On the other hand, the proteins in the condition-specific SCGs tend to have low centrality scores among all the proteins (Figure 5A), which suggests that their stoichiometry conservation is localized within each SCG.
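The centrality score can be illustrated directly from its definition as a sum of cosine similarities. In this sketch the sum includes the self-similarity term (equal to 1), which is an assumption; it shifts every score by the same constant and does not change the ranking. The protein names and values are hypothetical.

```python
from math import sqrt

def cosine_similarity(x, y):
    """Cosine of the angle between two expression vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def stoichiometry_centrality(expr):
    """Sum of cosine similarities of each protein with all proteins."""
    return {
        name: sum(cosine_similarity(vec, other) for other in expr.values())
        for name, vec in expr.items()
    }

expr = {
    "A": [1.0, 2.0, 3.0, 4.0],   # three proteins that scale together
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [0.5, 1.0, 1.5, 2.0],
    "D": [9.0, 1.0, 0.5, 0.1],   # one condition-specific protein
}
centrality = stoichiometry_centrality(expr)
```

The three co-varying proteins receive equal, high scores, whereas the condition-specific protein scores lower because it is similar only to itself, mirroring the contrast between the homeostatic core and the condition-specific SCGs in Figure 5A.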
Figure 5. A proteome structure characterized by global stoichiometry conservation relationships.
(A) Distributions of stoichiometry conservation centrality values for all the proteins (gray), the homeostatic core (SCG 1) proteins (magenta), and the proteins belonging to the other stoichiometrically conserved groups (SCGs) (cyan). (B) Correlation between stoichiometry conservation centrality and gene essentiality. The proportion of essential genes within each class of stoichiometry conservation ranking is shown. The list of essential genes was downloaded from EcoCyc (Keseler et al., 2017). (C) Correlation between stoichiometry conservation and evolutionary conservation. The strength of evolutionary conservation of each protein species was estimated by the number of orthologs found in the OrthoMCL species (Chen et al., 2006). The genes with more orthologs tend to have higher stoichiometry conservation centrality (one-sided Brunner-Munzel test between the top 25% and the bottom 25% fractions of ortholog number ranking). Likewise, the genes with higher stoichiometry conservation centrality scores tend to have more orthologs (one-sided Brunner-Munzel test, top 25%–bottom 25% comparison; the comparisons in the captions for (F–I) were evaluated with the same statistical test scheme). (D–G) Stoichiometry conservation analyses of human cell atlas transcriptome data of 15 fetal organs (Cao et al., 2020). The top gray histogram in (D) shows the distribution of stoichiometry conservation centrality values for all genes. The bottom histograms in (D) show the distribution for coding genes (yellow) and that for the other genes (cyan). (E) shows a correlation between the ratio of coding genes and stoichiometry conservation centrality calculated from the human cell atlas data. (F) shows a correlation between gene essentiality and stoichiometry conservation centrality calculated from the human cell atlas data. 
The essentiality of each human gene was quantified by CRISPR score, which is the fitness cost imposed by CRISPR-based inactivation of the gene in KBM7 chronic myelogenous leukemia cells (Wang et al., 2015). Genes with lower CRISPR scores are regarded as more essential. The fraction with low CRISPR scores (i.e. high essentiality fraction) tends to have higher stoichiometry conservation centrality. The fraction with high centrality scores tends to be more essential. (G) shows a correlation between evolutionary conservation and stoichiometry conservation centrality based on the human cell atlas data. The gene fraction with many orthologs tends to have higher stoichiometry conservation centrality. The gene fraction with high centrality scores tends to have more orthologs. (H) and (I) Stoichiometry conservation analyses of genome-wide Perturb-seq data (Replogle et al., 2022). (H) shows a correlation between stoichiometry conservation centrality calculated from the Perturb-seq data and gene essentiality. The essentiality of each gene was quantified by the CRISPR score as in (F). The gene fraction with low CRISPR scores (i.e. high essentiality fraction) tends to have higher stoichiometry conservation centrality. The gene fraction with high centrality scores tends to be more essential. (I) shows a correlation between stoichiometry conservation based on the Perturb-seq data and evolutionary conservation of genes. The gene fraction with many orthologs tends to have higher stoichiometry conservation centrality. The gene fraction with high centrality scores tends to have more orthologs. (J) Representation of the proteomes as a graph. A node corresponds to a protein species, and the weight of an edge is taken as the cosine similarity between the $m$-dimensional expression vectors of the two connected protein species. The matrix $C$ can specify the whole graph. Note that the diagonal elements of $C$ are ones, which were introduced just for simplicity. 
(K) Cosine similarity LE (csLE) structure in a three-dimensional space. Each dot represents a different protein species and is color-coded on the basis of its stoichiometry conservation centrality value. We selected the axes considering the structural similarity to the Raman-based proteome structure (see Figure 6). (L) The csLE structure in a three-dimensional space. The views from two different angles are shown. Each gray dot represents a different protein species. The proteins belonging to each SCG are indicated with distinct markers. Colors of the two-dimensional histograms in (C), (F), (G), (H), and (I) represent the height of each bar.
The stoichiometry conservation centrality is biologically relevant because it correlates with gene essentiality. Fractions of essential genes almost monotonically decrease with the ranks of centrality score (Figure 5B and Appendix 1—figure 6). We also noted that genes with high centrality scores have more orthologs determined by OrthoMCL-DB (Chen et al., 2006) across the three domains of life (Figure 5C and Appendix 1—figure 6). Likewise, genes with many orthologs tend to have higher centrality scores (Figure 5C and Appendix 1—figure 6). Therefore, the stoichiometry conservation in cells correlates with the evolutionary conservation of proteins.
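The rank-based comparison behind Figure 5B can be sketched as follows: sort genes by centrality, split the ranking into equal-sized classes, and compute the fraction of essential genes in each class. The number of classes (`n_bins`), the function name, and the toy gene names, scores, and essential set are all hypothetical; the binning actually used in the paper is not stated in this section.

```python
def essential_fraction_by_rank(centrality, essential, n_bins=4):
    """Fraction of essential genes in each centrality-rank class.

    centrality: dict gene -> centrality score
    essential:  set of essential gene names
    Classes are ordered from highest to lowest centrality.
    """
    ranked = sorted(centrality, key=centrality.get, reverse=True)
    size = len(ranked) / n_bins
    fractions = []
    for b in range(n_bins):
        members = ranked[round(b * size):round((b + 1) * size)]
        fractions.append(sum(g in essential for g in members) / len(members))
    return fractions

# Toy example: 20 genes, g0 has the highest centrality and g19 the lowest.
centrality = {f"g{i}": float(20 - i) for i in range(20)}
essential = {"g0", "g1", "g2", "g3", "g5", "g8", "g12"}
fractions = essential_fraction_by_rank(centrality, essential, n_bins=4)
```

In the toy data the essential genes are concentrated at high centrality, so the fraction decreases monotonically from the top class to the bottom class, which is the qualitative pattern the text reports for E. coli.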
To determine if the correlation of stoichiometry conservation centrality with gene essentiality and evolutionary conservation is general, we analyzed the transcriptome data from other organisms and found comparable correlations in Schizosaccharomyces pombe (Appendix 1—figure 6). In addition, we found that fractions of coding genes almost monotonically decreased with ranks of centrality score in the S. pombe data (Appendix 1—figure 6).
We further analyzed two kinds of Homo sapiens transcriptome data. One is a human cell atlas, in which expression of both coding and non-coding genes in 15 fetal organs was quantified (Cao et al., 2020), and the other is genome-wide Perturb-seq data (Replogle et al., 2022), in which genetically perturbed transcriptomes were measured mainly for coding genes. Our analysis of the human cell atlas data revealed that, while the overall distribution of stoichiometry conservation centrality was broad (Figure 5D, top), the centrality distribution of coding genes was skewed to higher values (Figure 5D, bottom) as observed for the E. coli proteome. Fractions of coding genes almost monotonically decreased with ranks of centrality (Figure 5E) as seen in the S. pombe data (Appendix 1—figure 6). Essentiality of each gene in human cells was quantified by an index called CRISPR score, which measures the fitness cost imposed by CRISPR-based inactivation of the gene (Wang et al., 2015). Genes with lower CRISPR scores are considered more essential. Our analysis revealed that genes with higher stoichiometry conservation centrality scores tend to have lower CRISPR scores and are thus more essential (Figure 5F). Similarly, genes with lower CRISPR scores tend to have higher stoichiometry conservation centrality scores. Furthermore, genes with higher centrality scores have more orthologs across the three domains of life and vice versa (Figure 5G). Comparable correlations of stoichiometry conservation with essentiality and evolutionary conservation were also found in the genome-wide Perturb-seq data (Figure 5H and I). Together, these results suggest that correlations of stoichiometry conservation centrality with gene essentiality and evolutionary conservation are general and preserved from E. coli to human cells regardless of the type of perturbation (see ‘Relevance of centrality of csLE structure to biological functions’ in Materials and methods for details).
Revealing global stoichiometry conservation architecture of the proteomes with csLE
Although the previous analysis revealed the biological relevance of stoichiometry conservation centrality, it is a one-dimensional quantity and cannot capture the global architecture of omics profiles. To gain further insights into genome-wide stoichiometry-conserving relationships among genes, we next analyzed the proteomes using a method similar to Laplacian eigenmaps (LE) (Appendix 1—figure 1; Belkin and Niyogi, 2003). We consider a symmetric matrix whose entry is (Figure 5J, Table 1). The entire proteome structure can be represented using the eigenvectors of normalized . Major differences of this method from the ordinary LE are that we consider an edge for all node pairs and that we adopt cosine similarity for weighting edges. This method places the proteins with higher cosine similarity closer in the resulting -dimensional space (see ‘Global proteome structures based on stoichiometric balance’ in Materials and methods and Section 2.1 in Appendix); we call this linear method cosine similarity LE (csLE).
In this -dimensional csLE space , the stoichiometry conservation centrality of the proteins decreased from center to periphery (Figure 5K), which confirms that it indeed measures the extent to which each protein is close to the center in the entire stoichiometry conservation architecture. Furthermore, the proteins formed polyhedral distributions with the cluster of the proteins in the homeostatic core at the center and the clusters of the proteins in the other condition-specific SCGs at distinct vertices (Figure 5L). This distribution is consistent with the fact that the condition-specific SCGs are the components whose expression patterns are distant from the homeostatic core and also between each other.
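The csLE construction described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the authors' implementation: the symmetric normalization D^{-1/2} A D^{-1/2}, the use of the node degree as a stand-in for stoichiometry conservation centrality, and all names (`csle_embedding`, `expr`, `n_axes`) are assumptions made for the example.

```python
import numpy as np

def csle_embedding(expr, n_axes=3):
    """Embed proteins using eigenvectors of a normalized cosine-similarity matrix.

    expr: (n_proteins, n_conditions) array of non-negative expression levels.
    Returns (coords, degree); coords has shape (n_proteins, n_axes).
    """
    # Scale each protein's expression vector to unit length across conditions.
    unit = expr / np.linalg.norm(expr, axis=1, keepdims=True)
    # Fully connected graph: edge weight = cosine similarity (diagonal is 1).
    A = unit @ unit.T
    # Node degree; assumed here to play the role of a centrality score.
    degree = A.sum(axis=1)
    # Symmetric normalization, one common Laplacian-eigenmaps convention.
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    A_norm = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(A_norm)
    order = np.argsort(eigvals)[::-1]
    eigvecs = eigvecs[:, order]
    # Drop the leading trivial eigenvector and rescale by D^{-1/2}.
    coords = eigvecs[:, 1:n_axes + 1] * d_inv_sqrt[:, None]
    return coords, degree

# Synthetic example: protein 5 keeps a fixed 3:1 ratio to protein 0.
rng = np.random.default_rng(0)
base = rng.random((5, 4)) + 0.1          # 5 proteins, 4 conditions
expr = np.vstack([base, 3.0 * base[0]])  # 6 proteins in total
coords, degree = csle_embedding(expr, n_axes=3)
```

In this toy case, the two proteins with perfectly conserved stoichiometry receive identical coordinates and identical degrees, which is the sense in which proximity in the embedding reflects stoichiometry conservation.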
Representing the proteomes using the Raman LDA axes
Given that the analysis of the LDA Raman-proteome regression coefficients (Figure 3A) eventually led us to identify the stoichiometry conservation architecture in the proteome data (Figure 5), the low-dimensional proteome structure in might be related to major changes in cellular Raman spectra in the LDA space and provide insight into the Raman-proteome correspondence. To investigate this, we considered representing the proteomes on the basis of the Raman LDA axes (Appendix 1—figure 1).
The coefficients in the regression matrix must satisfy the proportionality for all -th LDA axes () for the pair of protein and protein that perfectly conserve their stoichiometry, as previously mentioned in the analysis of the ISP COG class (Figure 3A; see ‘Characterizing an SCG by analyzing the Raman-proteome correspondence matrix’ in Materials and methods and Section 2.1 in Appendix). Noting this property, we constructed another -dimensional proteome space , assigning each protein species a coordinate , where is the normalized coefficient of gene corresponding to the -th LDA axis. As in -dimensional , a pair of proteins with strong stoichiometry conservation is expected to position closely in this -dimensional proteome space . Note that the proximity of the coordinates of different proteins in is equivalent to the approximate proportionality of different proteins in Figure 3A, demonstrated for the ISP COG class using the proportionality constants (normalized coefficients) common to different proteins.
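The proportionality of regression coefficients for a stoichiometry-conserving protein pair follows from the linearity of least-squares regression in the response variable. The sketch below checks this numerically on synthetic data; the number of LDA axes, the intercept column, the 3:1 ratio, and all variable names are hypothetical choices for illustration and are not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(1)
n_conditions, n_axes = 15, 4

# Hypothetical condition-averaged Raman LDA coordinates plus an intercept column.
X = np.hstack([rng.normal(size=(n_conditions, n_axes)),
               np.ones((n_conditions, 1))])

# Protein j has an arbitrary expression pattern across conditions;
# protein i keeps a fixed 3:1 stoichiometry with protein j.
p_j = rng.random(n_conditions) + 0.5
p_i = 3.0 * p_j

# Least-squares regression of both expression vectors on the LDA coordinates.
B, *_ = np.linalg.lstsq(X, np.column_stack([p_i, p_j]), rcond=None)
b_i, b_j = B[:, 0], B[:, 1]

# After normalization, the two coefficient vectors coincide.
b_i_normalized = b_i / np.linalg.norm(b_i)
b_j_normalized = b_j / np.linalg.norm(b_j)
```

Because the coefficient vectors differ only by the expression ratio, normalizing them places both proteins at the same coordinate, which is why strongly stoichiometry-conserving pairs are expected to sit close together in the coefficient-based space.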
We then found that the distribution of the proteins in closely resembled the one in when visualized using the first few major axes (Figure 5L and Figure 6A). This similarity is nontrivial because is constructed only from the proteome data, whereas depends on the -dimensional Raman LDA space (Figure 2C–E).
Raman-based proteome structure and its similarity to stoichiometry-based proteome structure.
(A) Proteome structure determined by Raman-proteome coefficients visualized in a three-dimensional space. The views from two different angles are shown. Each gray dot represents a protein species. The proteins belonging to each stoichiometrically conserved group (SCG) are indicated with distinct markers. We note that SCGs are defined without referring to Raman data (Figure 4). (B–D) Similarity among the distribution of linear discriminant analysis (LDA) Raman spectra (B), the proteome structure determined by Raman-proteome coefficients (C), and the proteome structure determined by stoichiometry conservation (D). (E) Mathematical relation between the coordinates of the proteins in (C) and (D). The two conditions, one with (magenta) and the other between and (cyan), must hold for the similarity between the two proteome structures (yellow), as described in the gray box. denotes column-wise proportionality.
We remark that each axis of is directly linked to the corresponding Raman LDA axis. Consequently, the orthants in where the condition-specific protein species reside agree with those in the Raman LDA space where the cellular Raman spectra under corresponding conditions reside (Appendix 1—figure 10) (see ‘Global omics structures characterized by Raman-omics correspondences’ in Materials and methods and Section 2.1 in Appendix). Indeed, we find such orthant agreement between the proteins in the condition-specific SCGs (SCG 2–SCG 5) and the cellular Raman spectra under the corresponding conditions (Figure 6B and C). This straightforward correspondence between and the Raman LDA space allows us to examine the relationship between changes in cellular Raman spectra and omics components’ stoichiometry conservation architecture by comparing the two proteome structures in and .
Omics-level interpretation of cellular Raman spectra and a quantitative constraint between expression generality and stoichiometry conservation centrality
To understand rigorously what the similarity of the proteome structures in and signifies (Figure 6C and D), we clarified the mathematical relation between the coordinates of the proteins in these two spaces (Figure 6E and Appendix 1—figure 1; see Sections 2.1 and 2.2 in Appendix for details). We then characterized the two mathematical conditions that must be satisfied simultaneously (Figure 6E).
The first condition is that major axes of the Raman LDA space and those of the proteome csLE space correspond (Figure 6E). Consequently, cellular Raman spectra under a condition accompanying the expression of a condition-specific SCG must be significantly different from those under conditions with the expression of other condition-specific SCGs in a manner distinguishable by LDA. Mathematically, this condition is related to the orthogonal matrix that appears in the equation in Figure 6E. For the distributions of the proteome components to be similar in the low-dimensional subspaces of and , must be close to the identity matrix with small off-diagonal elements (Figure 6E). We verified this first condition with the data (Appendix 1—figure 9; see ‘Evaluating similarity between orthogonal matrix and identity matrix’ in Materials and methods for details).
The second condition relates to the proportionality of the -dimensional vectors and in Figure 6E. This proportionality relation can be transformed into another relation that is proportional to , where and are the and norms of the expression-level -dimensional vector of protein across conditions (Figures 4A and 6E, Table 1).
can be interpreted as the expression generality score. When is large, the protein is expressed generally across conditions; when is small, it is expressed only under specific conditions (Appendix 1—figure 8) (see ‘Interpretation of norm/ norm ratio of an expression vector as a quantitative measure of expression generality’ in Materials and methods). Therefore, the proportionality between and indicates that the proteins with high stoichiometry conservation centrality must be expressed nonspecifically to conditions. We also verified this condition with the data, confirming that it is indeed satisfied (Figure 7A and Appendix 1—figure 9).
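As a small numerical illustration of the expression generality score, the sketch below assumes that the score is the ratio of the L1 norm to the L2 norm of a protein's expression vector across conditions. The function name and the example vectors are hypothetical.

```python
import numpy as np

def expression_generality(expr_vector):
    """Ratio of the L1 norm to the L2 norm of an expression vector (assumed definition)."""
    v = np.asarray(expr_vector, dtype=float)
    return np.abs(v).sum() / np.sqrt((v ** 2).sum())

# A protein expressed equally in all four conditions reaches the maximum, sqrt(4).
uniform = expression_generality([2.0, 2.0, 2.0, 2.0])
# A protein expressed in a single condition reaches the minimum, 1.
specific = expression_generality([0.0, 0.0, 5.0, 0.0])
```

Under this assumed definition, the score is independent of the overall expression level and ranges from 1 (expression in one condition only) to the square root of the number of conditions (equal expression everywhere).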
Proportionality between stoichiometry conservation centrality and expression generality.
(A) Relationships between stoichiometry conservation centrality () and expression generality (). Each gray dot represents a protein species. The proteins belonging to each stoichiometrically conserved group (SCG) are indicated with distinct markers. The dashed lines are . The solid lines represent (see Section 2.2 in Appendix). The deviation of a point from the solid line is related to the growth rate under the condition where each protein is expressed the most. (B) The same plot as (A) in black and white. Overlaid red circles indicate proteins featured in (C). (C) Expression patterns of the proteins indicated by red circles in (B) across conditions. The condition differences are shown by the growth rate differences on the horizontal axes. The arrangement of the plots for the proteins corresponds to their relative positions in (B).
The spread of the points from the proportionality diagonal line of the E. coli proteome data in Figure 7A was found to be related to the growth rate under the condition where each protein is expressed the most (see Section 2.2 in Appendix for a detailed analysis on the origin of the deviation). Consequently, one can envisage a growth-rate-dependent expression pattern of each protein on the basis of its relative position in this - plot (Figure 7B and C). For example, both BamB and YqjD are expressed nonspecifically to the conditions with nearly identical expression generality scores. However, BamB is expressed at higher levels under fast growth conditions, whereas YqjD is expressed at higher levels under slow growth conditions, as reflected in their positions relative to the proportionality line. A similar growth rate dependence is observed for PaaE and DgoA, but with more prominent condition specificity because these proteins are characterized by their low expression generality scores. These growth-rate-dependent deviation patterns might hint at a new growth law that governs the total relative expression changes of the proteome components (see Section 2.2 in Appendix for detailed discussion).
Generality
We also examined the generality of the aforementioned two conditions using the Raman and proteome data of E. coli strains with different genotypes (BW25113, MG1655, and NCM3722) under two culture conditions (Schmidt et al., 2016) and the Raman and transcriptome data of S. pombe under 10 culture conditions (Kobayashi-Kirschvink et al., 2018). Applying csLE to the omics data, we again found similar omics structures between and when visualized using the first few major axes, with homeostatic cores at the centers and condition-specific SCGs at the vertices (Appendix 1—figures 11 and 12).
Proportionality between stoichiometry conservation centrality and expression generality score was also confirmed in both additional datasets (Appendix 1—figure 7). We further used publicly available quantitative proteome data of M. tuberculosis, M. bovis, and S. cerevisiae (Schubert et al., 2015; Lahtvee et al., 2017) to examine this relation and confirmed that the proportionality universally holds (Appendix 1—figures 7 and 13). Almost no deviation from the proportionality line existed in the S. cerevisiae proteome data measured for the cells in different media but cultured in chemostats with an identical dilution rate (thus, identical growth rate), which is consistent with the result of E. coli in which the deviations were related to the growth rate differences.
Discussion
A Raman spectrum obtained from a single cell is a superposition of the spectra of all of its constituent biomolecules. Therefore, cellular Raman spectra potentially contain rich information on essential state differences in targeted cells. The fact that both transcriptomes and proteomes are inferable from cellular Raman spectra, as demonstrated in this and previous (Kobayashi-Kirschvink et al., 2018) studies, endorses this speculation. The detailed analyses of the relationship between Raman and omics data have identified functionally relevant constraints on omics changes and provided an interpretation of cellular Raman spectra (Appendix 1—figure 1). Specifically, it has been revealed that major changes in cellular Raman spectra distinguishable by LDA reflect the changes in omics profiles under the constraints of stoichiometry conservation. This correspondence would help us interpret global changes in cellular Raman spectra by translating them into the differences in omics profiles.
We remark that linearity in our formulation enabled us to find the rigorous connection between the two omics spaces and (Figure 6E). Unlike the original LE, we adopted cosine similarity as weights of edges between all node pairs to measure expression stoichiometry conservation of proteins. This modification was indispensable in terms of interpretation; relative proximity of positions in reflects the strength of stoichiometry conservation. We also remark that simple principal component analysis (PCA) applied to the normalized E. coli proteome data finds a similar low-dimensional proteome structure (Appendix 1—figure 6) (see ‘Proteome structure obtained with PCA’ in Materials and methods). Therefore, besides interpretability, omics structures in might reflect dominant relationships among omics components commonly characterized by several methods of omics representation.
It should be noted that the quantitative analysis of Raman-omics correspondence resulted in the characterization of stoichiometry-conserving architecture in cells (Appendix 1—figure 1). This shows that besides distinguishing different cellular states or quantifying specific biomolecular species by focusing on spectral peaks, Raman spectra can also characterize the system-level constraints behind changes in global gene expression profiles. While the identified features, such as stoichiometry conservation centrality, expression generality score, and csLE space, can be calculated without Raman data, it is difficult to reach them directly without scrutinizing the Raman-omics correspondence. Furthermore, the definition of expression generality and its relation to stoichiometry conservation centrality were directly derived from the Raman-omics correspondence analysis (Figures 6E and 7). Therefore, as a signal reflecting comprehensive molecular profiles in cells, Raman spectra are an important modality for dissecting system-level properties and constraints in cells.
In this study, we mainly analyzed the Raman and proteome data of E. coli under 15 different environmental conditions. However, the resulting low-dimensional structures and correspondence of and can change depending on what and how many conditions are included in the analysis. Thus, an intriguing question is how the Raman-proteome correspondence is affected by the conditions used in the analysis. A subsampling analysis focusing on the orthogonal matrix , which represents low-dimensional correspondence precision of and (Figure 6E), reveals that correspondence precision tends to increase with an increasing number of conditions (Appendix 1—figure 14). This result suggests that increasing the number of conditions generally improves the low-dimensional correspondence rather than disrupting it.
Since the proteome data that we referenced (Schmidt et al., 2016) represent the averaged expression profile of the cells in each condition, we likewise averaged the single-cell Raman data in each condition in the LDA space to determine their correspondence. Once this correspondence is established, it becomes technically feasible to infer the proteomes of individual cells from their Raman spectra. However, verifying the accuracy of the inferred proteome profiles requires quantitative ground truth of single-cell proteomes, which are not yet readily obtainable, especially for bacterial cells. Despite this limitation, future studies may clarify the correspondence at the single-cell level as omics technology advances.
Stoichiometry conservation is plausibly crucial for cellular functions and physiology. For example, the enzymes involved in evolutionarily conserved metabolic pathways conserve their stoichiometry across microorganism species despite their diverse transcriptional and translational rates (Lalanne et al., 2018). It is suggested that stoichiometry conservation is achieved by optimizing the metabolic flux for fast growth (Lalanne and Li, 2021). Furthermore, a ribosome-targeting antibiotic causes an imbalance of ribosomal proteins and growth arrest in E. coli, but the balance is restored alongside growth recovery through physiological adaptation (Koganezawa et al., 2022). These results suggest that disruption of stoichiometric balance among core components could impose a significant fitness cost.
It is known that functions, essentiality, and evolutionary conservation of genes can be linked to the topologies of gene networks (Jeong et al., 2001; He and Zhang, 2006; Yu et al., 2007; Fraser et al., 2002; Wuchty et al., 2003; Li et al., 2020). However, networks that have been previously analyzed, such as protein-protein interaction networks, depend on known interactions. Therefore, as our understanding of the molecular interactions evolves with new findings, the conclusions may change. Furthermore, analysis of a particular interaction network cannot account for effects of different types of interactions or multilayered regulations affecting each protein species, thus highlighting only one aspect of the inherently global coordination of molecular compositions in cells. In contrast, the stoichiometry conservation network in this study focuses solely on expression patterns as the net result of interactions and regulations among all types of molecules in cells. Consequently, the stoichiometry conservation networks are not affected by the detailed knowledge of molecular interactions and naturally reflect the global effects of multilayered interactions behind cellular physiological state changes. Additionally, stoichiometry conservation networks can easily be obtained for non-model organisms, for which detailed molecular interaction information is usually unavailable. Therefore, analysis with the stoichiometry conservation network has several advantages over existing methods from both biological and technical perspectives.
It is intriguing to ask how cells conserve stoichiometry among the components in each SCG. In particular, the homeostatic core (SCG 1) contains many components whose gene loci are scattered throughout the genome. It is known that both transcriptional and translational negative autoregulation contribute to controlling the stoichiometry of many ribosomal proteins (Nomura et al., 1980; Dean et al., 1981; Kaczanowska and Rydén-Aulin, 2007; Portier and Grunberg-Manago, 1993; Aseev et al., 2008; Roy et al., 2020). The genes for the ribosomal proteins are scattered in multiple operons and co-regulated with many other non-ribosomal proteins, such as RNA polymerase subunits, translation initiation/elongation factors, and transmembrane transporters (Keseler et al., 2017). Therefore, the stoichiometry-conserving mechanisms established for ribosomes might be partially exploited for the stoichiometry conservation within the homeostatic core.
The existence of condition-specific SCGs and genes with similar expression patterns confirms that adaptation to specific conditions is not necessarily achieved by a small number of functionally relevant genes, but is often accompanied by changes in the expression of many seemingly unrelated genes. Indeed, condition-specific SCGs contain genes with unclear roles in adaptation, including some that are functionally uncharacterized (Appendix 1—tables 5–8). Therefore, it would be important to investigate whether the coexpression of multiple genes is crucial for cellular adaptation to a wide range of perturbations while maintaining homeostasis.
The proportionality between stoichiometry conservation centrality and expression generality score suggests that proteins with high stoichiometry conservation centrality govern basal cellular functions required under any conditions. In fact, both essential genes and evolutionarily conserved genes are enriched in the omics fractions with high centrality scores. In contrast, proteins with low centrality scores might have been acquired in later stages of evolution and exploited to survive or increase fitness under specific conditions. Such a hierarchy in the stoichiometry conservation centrality among core and peripheral processes might promote the adaptability of cells, since cells can respond to diverse environments without restructuring a large body of the functional homeostatic core. This architectural principle in omics might underlie the robustness and adaptability of biological cells.
Materials and methods
| Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
|---|---|---|---|---|
| Chemical compound, drug | Difco LB Broth, Miller (Luria-Bertani) | Becton, Dickinson and Company | ||
| Chemical compound, drug | Bacto Yeast Extract | Becton, Dickinson and Company | ||
| Chemical compound, drug | Bacto Tryptone | Becton, Dickinson and Company | ||
| Chemical compound, drug | Sodium Chloride | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Disodium Hydrogenphosphate | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Potassium Dihydrogenphosphate | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Ammonium Sulfate | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Zinc Sulfate Heptahydrate | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Copper(II) Chloride Dihydrate | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Manganese(II) Sulfate Pentahydrate | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Cobalt(II) Chloride Hexahydrate | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Calcium Chloride Dihydrate | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Magnesium Sulfate Heptahydrate | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Thiamin Hydrochloride | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Iron(III) Chloride Hexahydrate | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Sodium Acetate | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Disodium Fumarate | FUJIFILM Wako Pure Chemical Corporation | ||
| Chemical compound, drug | D-Galactose | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | D-Glucose | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Glycerol | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | D-Fructose | FUJIFILM Wako Pure Chemical Corporation | ||
| Chemical compound, drug | D-Mannose | FUJIFILM Wako Pure Chemical Corporation | ||
| Chemical compound, drug | D-Xylose | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Alanine | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Asparagine Monohydrate | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Cysteine | FUJIFILM Wako Pure Chemical Corporation | ||
| Chemical compound, drug | L-Glutamic acid | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Glutamine | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Glycine | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Histidine | FUJIFILM Wako Pure Chemical Corporation | ||
| Chemical compound, drug | L-Isoleucine | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Phenylalanine | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Proline | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Serine | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Adenine | FUJIFILM Wako Pure Chemical Corporation | ||
| Chemical compound, drug | L-Arginine | FUJIFILM Wako Pure Chemical Corporation | ||
| Chemical compound, drug | L-Aspartic acid | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Leucine | FUJIFILM Wako Pure Chemical Corporation | ||
| Chemical compound, drug | L-Lysine | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Methionine | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Threonine | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Tryptophan | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Tyrosine | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | L-Valine | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Uracil | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Sodium Hydroxide Solution | Wako Pure Chemical Industries, Ltd., FUJIFILM Wako Pure Chemical Corporation | ||
| Chemical compound, drug | 35–37% (mass/mass) Hydrochloric Acid | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Hydrochloric Acid | Wako Pure Chemical Industries, Ltd. | ||
| Chemical compound, drug | Agar | Wako Pure Chemical Industries, Ltd., FUJIFILM Wako Pure Chemical Corporation | ||
| Strain, strain background (Escherichia coli) | BW25113 | Wakamoto Laboratory stock | ||
| Strain, strain background (Escherichia coli) | MG1655 | Wakamoto Laboratory stock | ||
| Strain, strain background (Escherichia coli) | NCM3722 | Coli Genetic Stock Center | ||
Note that mathematical notation in Materials and methods differs in some respects from that in the main text, Table 1, and main figures.
Experimental methods, data acquisition, and data analyses
Absolute quantitative proteome data
We utilized high-quality absolute quantitative proteome data reported by Schmidt et al., 2016. In these data, expression levels of more than 55% of genes of E. coli BW25113 strain (more than 95% of total proteome mass) were quantified under various environmental conditions.
We also used additional absolute quantitative proteome data (Schmidt et al., 2016; Schubert et al., 2015; Lahtvee et al., 2017) for checking the generality of our findings (see Appendix 3.2). In addition to the proteome data across environmental conditions, Schmidt et al. also reported proteomes of E. coli strains with different genotype backgrounds (BW25113, MG1655, and NCM3722) cultured in a rich medium or a minimal medium supplemented with glucose. Schubert et al., 2015 quantified proteomes of M. tuberculosis H37Rv strain and M. bovis BCG strain under time-course environmental change conditions starting from exponential growth conditions, followed by dormant states induced by decreasing oxygen levels, and finally regrowth conditions with re-aeration. Lahtvee et al., 2017 quantified proteomes of S. cerevisiae under a reference condition and three stressed conditions (ethanol, osmotic pressure, and high temperature, with three stress intensity steps for each type of stress) using chemostats.
For checking the generality of our findings across omics classes, we also used the transcriptome data reported in our previous paper (Kobayashi-Kirschvink et al., 2018). The data include the transcriptomes of S. pombe in rich and minimal media, in nutrient-depleted media, and under various stress conditions.
E. coli strains and culture conditions
To quantitatively analyze a linkage between the absolute proteome data generated by Schmidt et al., 2016 and Raman data, we reproduced the culture conditions used in Schmidt et al., 2016 as closely as possible in our lab. We obtained three biological replicates.
E. coli strains
We used BW25113, MG1655, and NCM3722 as in Schmidt et al., 2016. In particular, BW25113 (Datsenko and Wanner, 2000) was used for the main data in this study. The genotype of BW25113 is F− Δ(araD-araB)567 ΔlacZ4787 (::rrnB-3) λ− rph-1 Δ(rhaD-rhaB)568 hsdR514, that of MG1655 is F− λ− rph-1, and that of NCM3722 is F+ (Baba et al., 2006; Blattner et al., 1997; Soupene et al., 2003).
Culture conditions
We prepared the 15 batch culture conditions listed in Appendix 1—table 1. We excluded three of the 18 culture conditions reported in Schmidt et al., 2016 because we could not obtain sufficiently strong cellular Raman signals under those conditions. See Schmidt et al., 2016 for details of the medium compositions. For the ‘GlucosepH6’ medium, the ‘Glucose’ medium was titrated with 37% HCl. The medium for ‘stationary1day’ and ‘stationary3days’ was the same as the ‘Glucose’ medium. LB agar plates were prepared by adding agar to ‘LB’ medium.
Cultivation
Culturing E. coli cells proceeded in four steps:
Step 1: Growth on LB agar plates. Cells were taken from a −80°C glycerol stock and streaked on LB agar plates. The plates were incubated at 37°C overnight and stored at 4°C. All subsequent experiments were conducted using colonies on the LB agar plates. Picking colonies from the plates for cultivation was done within 4 days of storage at 4°C.
Step 2: Liquid culture under ‘Glucose’ condition. Several colonies picked from LB agar plates were inoculated into ‘Glucose’ liquid culture medium and grown for about 16 hr. Cells for the ‘Glucose42C’ condition were cultured at 42°C, and those for the other conditions were grown at 37°C.
Step 3: Liquid culture under each condition. Cells from Step 2 were passaged into each type of medium and grown to exponential phase. Cells for the ‘Glucose42C’ condition were grown at 42°C, and those for the other conditions were cultured at 37°C.
Step 4: Liquid culture under each condition. Cells from Step 3 were passaged into the respective fresh medium and grown to almost the same level of turbidity as that at the end of Step 3. Cells for the ‘Glucose42C’ condition were cultured at 42°C, and those for the other conditions were grown at 37°C.
For the exponential conditions, cell cultivation was conducted as described above. For the stationary conditions, cultivation of cells at Step 3 was continued instead of proceeding to Step 4 and ended 1 or 3 days after they reached the stationary phase.
The medium volume was for all the liquid cultures in our experiments. Borosilicate glass test tubes with a diameter of and a length of were used. A fresh medium was pre-warmed before passage so that its temperature was the same as that of cultivation. All the liquid cultures were under reciprocal shaking at and at an inclination of 45°. Liquid cultures were diluted to an OD600 of around 0.01 for passage.
Main differences between our cultivation conditions and those of Schmidt et al., 2016 are the periods of storage at 4°C at Step 1 (a maximum of 3 weeks in Schmidt et al., 2016), the number of colonies inoculated from plates to liquid medium at the second step (one colony per inoculation in Schmidt et al., 2016), and medium volumes and shaking conditions of liquid cultures ( liquid culture in unbaffled wide-neck Erlenmeyer flasks under orbital shaking at in Schmidt et al., 2016).
Growth rate measurements
Growth curves were obtained by continuing Step 3 of the cultivation. Cultivation of cells for growth measurements was conducted with culture media, not , due to a requirement of the device used for continuous turbidity recording (ODBox-C, TAITEC Corporation). In addition, cells were washed with each type of fresh medium before inoculation at the beginning of Step 3, and cultivation for growth recording started from an OD600 of around 0.001. Growth rates were calculated from the growth curves using the fitting algorithm based on Gaussian processes (Swain et al., 2016).
Raman measurements and preprocessing of spectra
Cells were washed three times with a 0.9% aqueous solution of NaCl, and 5 µL of the suspension was placed on a synthetic quartz slide glass (Toshin Riko Co., Ltd.) and dried. Raman spectra of cells were measured with a Raman microscope (Appendix 1—figure 2), where a custom-built Raman system (STR-Raman, AIRIX) was integrated into a microscope (Ti-E, Nikon). Excitation light was generated by a continuous-wave diode-pumped solid-state laser (Gem 532, Laser Quantum). We altered the first version of this Raman microscope (Kobayashi-Kirschvink et al., 2018), and light from the laser oscillator was transmitted by mirrors in this research. A and air objective lens (MPLN100X, Olympus) was used. Raman scattered light was collected by an optical fiber and transmitted to a spectrometer (Acton SP2300i, Princeton Instruments). Light dispersed by a grating was projected onto the image sensor of an sCMOS camera (OrcaFlash 4.0 v2, Hamamatsu Photonics). The sCMOS camera was water-cooled at 15°C to reduce dark noise. The exposure time for each cell was . Fifteen randomly selected cells were measured per condition per replicate. A background Raman spectrum was measured for each cell with exposure in an area close to the targeted cell where neither cells nor NaCl crystals existed.
In our setup, the laser power at the sample stage was . The measurement system and processes were controlled using Micro-Manager 1.4 (Edelstein et al., 2014) and a plugin we made.
Readout noise of sCMOS image sensors is pixel-dependent. A noise reduction filter developed in Kobayashi-Kirschvink et al., 2018, on the basis of Huang et al., 2013, was applied to measured spectral images by using 10,000 blank images obtained with the same sCMOS sensor with exposure time of . See Kobayashi-Kirschvink et al., 2018 for details.
After noise reduction with the filter, pixel counts were summed up along the direction perpendicular to wavenumber. A background spectrum was subtracted from a cellular Raman spectrum. A pixel region corresponding to the range from to was cropped. The cropped spectrum was smoothed with a Savitzky-Golay filter (Savitzky and Golay, 1964). To minimize the effect of laser excitation variations, each spectrum was normalized by subtracting the average and dividing it by the standard deviation.
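The final normalization step above can be sketched as follows. This is a minimal illustration and not the authors' MATLAB code: the function name `standardize_spectrum` and the toy intensity values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics

def standardize_spectrum(intensities):
    """Shift a spectrum to zero mean and scale it to unit standard deviation."""
    mean = statistics.fmean(intensities)
    # Assumption: population standard deviation; the text does not specify the estimator.
    std = statistics.pstdev(intensities)
    if std == 0:
        raise ValueError("a flat spectrum cannot be standardized")
    return [(x - mean) / std for x in intensities]

# Made-up pixel counts standing in for a cropped, smoothed spectrum.
spectrum = [120.0, 135.0, 150.0, 410.0, 160.0, 140.0]
normalized = standardize_spectrum(spectrum)
```

After this step, two spectra that differ only by an overall offset and gain, as might result from a change in excitation intensity, map to the same normalized vector.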
Data analysis
We wrote scripts and analyzed data using MATLAB (R2019a and R2023b), except for the Brunner-Munzel test, for which we used R (version 4.0.3) (see ‘Centrality-evolutionary conservation correlation’ in Materials and methods).
Related to Figure 2 in the main text, we first performed LDA on the Raman data. LDA is a linear classifier; it finds the most discriminatory bases by maximizing the ratio of the between-class variance to the within-class variance and reduces the dimensions of the data to the number of classes minus one (Huang et al., 2004; De Bie et al., 2005; Goodacre et al., 1998). In the case of our main data, classes are culture conditions. In the verification step of the correspondence between the LDA Raman and omics data, we conducted LOOCV. In LOOCV, one condition is used as test data and the remaining conditions are used as training data. This is repeated by changing the condition to exclude.
The details of the data analyses are provided in the sections below.
Raman-proteome statistical correspondence
Notation
We write the population-averaged 14-dimensional LDA Raman spectrum vector of each condition as a row vector and the 2058-dimensional absolute proteome vector of each condition as a row vector . Note that we regarded and as column vectors in the main text for simple expression of equations.
Our hypothesis of Raman-proteome linear correspondence (Equation 1 in the main text) is expressed as
where is a 2058 × 15 matrix and denotes transpose. In LOOCV, one condition is excluded (let be the excluded condition) and the remaining 14 conditions are used to estimate . We write the estimated as , which is also a 2058 × 15 matrix. Let be the estimated proteome of the excluded condition in LOOCV (Figure 2G).
OLS in LOOCV scheme
In the case of LOOCV, 14 (= 15 − 1) conditions are included in the training data. Thus, if all the 14 LDA axes of the low-dimensional Raman data are considered, OLS becomes underdetermined. We excluded higher dimensions of the Raman space to conduct OLS in LOOCV unless otherwise noted. The results described in the main text were obtained using the first four axes (LDA1 to LDA4). In this case, is a 2058 × 5 matrix.
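The LOOCV scheme with OLS can be sketched as below. To keep the example free of matrix libraries, it assumes a single Raman axis plus a constant term for each protein, whereas the analysis in the main text used four axes (LDA1 to LDA4). The function names `fit_line` and `loocv_predict` and the toy data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def loocv_predict(raman, proteome):
    """raman[i]: Raman coordinate of condition i (one axis only, an assumption).
    proteome[i]: abundances of all proteins in condition i.
    Returns the proteome estimated for each condition when it is held out."""
    n_conditions = len(raman)
    n_proteins = len(proteome[0])
    estimates = []
    for held_out in range(n_conditions):
        train = [i for i in range(n_conditions) if i != held_out]
        x = [raman[i] for i in train]
        estimate = []
        for k in range(n_proteins):
            y = [proteome[i][k] for i in train]
            b0, b1 = fit_line(x, y)  # coefficients estimated without the held-out condition
            estimate.append(b0 + b1 * raman[held_out])
        estimates.append(estimate)
    return estimates

# Toy data: 5 conditions, 2 proteins that depend exactly linearly on the Raman coordinate.
raman = [0.0, 1.0, 2.0, 3.0, 4.0]
proteome = [[10.0 + 2.0 * r, 5.0 - r] for r in raman]
estimated = loocv_predict(raman, proteome)
```

With more Raman axes, the per-protein fit becomes a multiple regression, and the number of axes must stay below the number of training conditions for the fit to remain determined, which is the constraint described above.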
Permutation test
Let a permutation of all the 15 conditions be . In our permutation test, we calculated overall estimation errors as , where is one of the distance measures between and listed in Appendix 1—table 2. There exist 15! possible permutations, and calculating the errors for all of them is computationally intensive. Thus, we randomly generated 10^5 permutation sets.
The result presented in the main text is the case where the Euclidean metric (PRESS) was used as a distance measure. Likewise, we also obtained small p-values with the other metrics (Appendix 1—table 2).
We could also estimate the proteomes with high accuracy using all the 14 dimensions of the LDA space (Appendix 1—table 3). As noted in ‘OLS in LOOCV scheme’ in Materials and methods, the regression is underdetermined in this case. Thus, we simply adopted the minimum-norm solution from among all least-squares solutions.
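A permutation test of this kind can be sketched as follows. The sketch assumes that the overall error is the sum, over conditions, of the distance between the proteome estimated for condition i and the measured proteome of the condition assigned to i by the permutation, and that the p-value is the fraction of random permutations whose error does not exceed the unpermuted one. Both are assumptions about details not recoverable from the text here; `overall_error`, `permutation_p_value`, and the toy distance table are hypothetical.

```python
import random

def overall_error(dist, perm):
    """Sum of dist[i][perm[i]]: estimate for condition i versus measurement for condition perm[i]."""
    return sum(dist[i][perm[i]] for i in range(len(perm)))

def permutation_p_value(dist, n_perm=2000, seed=0):
    """Fraction of random permutations with an overall error no larger than the unpermuted one."""
    rng = random.Random(seed)
    identity = list(range(len(dist)))
    observed = overall_error(dist, identity)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # uniform random permutation of the condition labels
        if overall_error(dist, perm) <= observed:
            hits += 1
    return hits / n_perm

# Toy distance table for 6 conditions: each estimate is closest to its own measurement.
dist = [[abs(i - j) for j in range(6)] for i in range(6)]
p_value = permutation_p_value(dist)
```

Sampling permutations instead of enumerating them keeps the cost proportional to the number of samples, at the price of a p-value resolution limited to the reciprocal of that number.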
Characterizing an SCG by analyzing the Raman-proteome correspondence matrix
Notation
The component representation of Equation 2 is
where is the number of proteins and is the number of culture conditions. and in our case. Let be the -th row of th column of . For example, denotes the constant term for each protein, and the coefficient of LDA1 for each protein. The expression level of protein in the condition is
Stoichiometry conservation of ISP COG class
In the main text, we revealed that many proteins belonging to the ISP COG class were aligned on a straight line passing through the origin when the relations between the columns of were shown in scatterplots (Figure 3A). Consider hypothetical proteins that align perfectly on a straight line through the origin. Let be the indices of such perfectly aligning protein species. Extracting only these rows for the proteins from Equation 3, we obtain
where () are constants and . For our data, Appendix 1—figure 4A and B correspond to Equation 7. In these plots, the -axis represents , and the -axis . Many ISP proteins indeed align on a straight line through the origin with different slopes for different conditions (Appendix 1—figure 4A). In contrast, many proteins in other COG classes do not align on a straight line (Appendix 1—figure 4B).
Importantly, for a pair of proteins that align on the straight line,
holds from Equation 7. The right-hand side of Equation 8 does not contain condition index , which means that the abundance ratio of the proteins remains constant regardless of the conditions.
On the evaluation of stoichiometry conservation by Pearson correlation coefficient
In the main text, we used Pearson correlation coefficients to confirm the stoichiometry conservation of many ISP COG class members (Figure 3B). Strictly speaking, however, cosine similarity is a more appropriate measure for evaluating stoichiometry conservation. In this analysis, cosine similarity can be written as
where and are the vectors representing the protein abundance for the proteome subgroups (‘ISP’ COG class, ‘Cellular processes and signaling’ COG class, and ‘Metabolism’ COG class) for conditions and , respectively (). Cosine similarity version of Figure 3B is Appendix 1—figure 4F. The cosine similarity takes the maximum value 1 only when abundance ratios between all considered proteins are perfectly the same between the two compared conditions.
In addition, we also examined differences between COG classes by calculating Pearson correlation coefficients of log abundances (Appendix 1—figure 4E).
Direct characterization of SCGs in omics data
Notation
Let be a column vector representing the abundances of protein . Each component of this vector indicates the abundance of protein under each condition. Therefore,
where in our case. Note that defined here is a 15-dimensional column vector and different from introduced previously, which was a 2058-dimensional row vector.
Identifying SCGs in omics data
As explained in the main text, we extracted SCGs directly from the omics data, without referring to Raman data or COG classification. We evaluated the similarity of expression patterns for all the combinations of proteins using cosine similarity. Specifically, cosine similarity between proteins and is calculated as
This is the inner product of normalized and . Note that as protein abundances of any proteins are non-negative. takes the maximum value 1 if and only if and point in identical direction, i.e., the abundance ratios of proteins and are constant under all the conditions. Therefore, if we extract only the proteins connected with high cosine similarity from all protein pairs, they would constitute proteome fractions in each of which the abundance ratios of the proteins remain almost constant across all the conditions. We hence extracted only the protein pairs whose cosine similarity was above a high threshold of 0.995. As a result, we obtained several SCGs, in each of which the protein species are linked to each other with high cosine similarity (Figure 4B and C).
The genes in each SCG are listed in Appendix 1—tables 4–8. Note that there are many other minor components (Figure 4B), some of which may have an expression pattern similar to another component but are separated due to the high threshold.
The positions of members of the SCGs on the chromosome are shown in Figure 4D (SCG 1 [homeostatic core]) and Appendix 1—figure 5E (SCGs 2–5).
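The thresholding procedure described in this section can be sketched as below. The sketch assumes that each SCG is read off as a connected component of the graph that keeps only protein pairs with cosine similarity above 0.995, which is one natural reading of the text and is not confirmed by it. The function names, protein names, and abundance values are hypothetical, and all expression vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(u, v):
    """Inner product of the two vectors after scaling each to unit Euclidean length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def stoichiometry_groups(expression, threshold=0.995):
    """expression: dict mapping a protein name to its abundances across conditions.
    Returns the connected components (with at least two members) of the thresholded graph."""
    names = list(expression)
    parent = {name: name for name in names}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_similarity(expression[a], expression[b]) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return [members for members in groups.values() if len(members) > 1]

# Made-up abundances across four conditions.
expression = {
    "protA": [10.0, 20.0, 30.0, 40.0],
    "protB": [1.0, 2.0, 3.0, 4.1],     # nearly proportional to protA
    "protC": [5.0, 10.0, 15.0, 20.0],  # exactly proportional to protA
    "protD": [40.0, 5.0, 1.0, 0.5],    # different expression pattern
}
groups = stoichiometry_groups(expression)
```

In this toy example, protA, protB, and protC form one group because their abundance ratios stay nearly constant across conditions, while protD is left out.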
Global proteome structures based on stoichiometric balance
In the previous section, we identified SCGs by setting a threshold of cosine similarity for extracting protein pairs. We next removed the threshold and considered the ‘distance’ with respect to cosine similarity for all the protein pairs to capture the global proteome structure that includes SCGs.
The cosine similarity for all the pairs of proteins can be summarized in one matrix as
where () component represents cosine similarity between proteins and . Assuming that this matrix is an adjacency matrix in graph theory and network theory, the entire proteomes are considered as a weighted undirected complete graph (with loops), where nodes correspond to protein types and any protein pair is connected by an undirected edge. Each edge is weighted by the cosine similarity between the two protein species at both ends. Note that all the diagonal elements of are one, which represents that each node has a loop with weight of one. These were introduced just for simplicity.
To ask whether the SCGs identified in the previous section have any unique features in this network, we evaluated the degree to which each node is central in the network structure. In graph theory, ‘centrality’ is known as an index to measure how ‘important’ or ‘influential’ each node is. In particular, we employed a measure called ‘degree centrality’ (for weighted graphs) (Nieminen, 1973; Segarra and Ribeiro, 2015). Degree centrality, which is also called ‘degree’, simply measures ‘influence’ of a node on a network on the basis of links with its direct neighborhood. One can obtain a degree centrality value by calculating the sum of the weights of all the edges connected to each node (see also the definition of the degree matrix in Equation 14 below). We note that in our graph, degree centrality vector , where is an -dimensional column vector of which all elements are one, is equal to the eigenvector corresponding to the largest eigenvalue of a ‘normalized’ adjacency matrix up to multiplication by a constant. From this perspective, the centrality index we adopted measures ‘influence’ of a node in a recursive manner depending on ‘influence’ of its neighboring nodes. A well-known example of this centrality indicator is Google’s PageRank (Brin and Page, 1998) used for ranking web pages on the World Wide Web. It can be regarded as a variant of ‘eigenvector centrality’ (the eigenvector corresponding to the largest eigenvalue of the adjacency matrix ) (Bonacich, 1972; Segarra and Ribeiro, 2015). As explained in the main text, the protein species in the homeostatic core (the largest SCG) had high centrality scores, while those in the other condition-specific SCGs had low centrality scores (Figure 5A).
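The degree centrality described above can be sketched as follows. The sketch also checks numerically that the degree vector is left unchanged by one particular normalized adjacency matrix, namely the adjacency matrix with each column divided by the degree of the corresponding node. Which normalization is meant in the text is not recoverable here, so that choice is an assumption, as are the function names and the toy vectors.

```python
import math

def cosine_similarity_matrix(vectors):
    """Adjacency matrix whose (i, j) entry is the cosine similarity of vectors i and j."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    return [[sum(a * b for a, b in zip(vectors[i], vectors[j])) / (norms[i] * norms[j])
             for j in range(n)] for i in range(n)]

def degree_centrality(adjacency):
    """Sum of edge weights at each node, including the self-loop of weight one."""
    return [sum(row) for row in adjacency]

# Made-up expression vectors of four proteins across three conditions.
vectors = [[10.0, 20.0, 30.0], [1.0, 2.0, 3.1], [5.0, 1.0, 0.2], [2.0, 2.0, 2.0]]
adjacency = cosine_similarity_matrix(vectors)
degree = degree_centrality(adjacency)

# Assumed normalization: divide column j of the adjacency matrix by the degree of node j.
n = len(degree)
normalized = [[adjacency[i][j] / degree[j] for j in range(n)] for i in range(n)]
image = [sum(normalized[i][j] * degree[j] for j in range(n)) for i in range(n)]
```

Here `image` equals `degree` up to rounding, so the degree vector is an eigenvector with eigenvalue one of this column-normalized matrix. Since every column of that matrix sums to one and all its entries are positive in this example, one is also its largest eigenvalue.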
We directly observed the global stoichiometry conservation structure of this proteome graph using Laplacian eigenmaps (Figures 5K, L, 6D). In general, a graph can be uniquely specified not only by the adjacency matrix , but also by the Laplacian matrix defined as
where is the degree matrix with the components of
where is an -dimensional column vector of which all elements are one. The -element of represents the sum of the weights of all the edges connected to node . In our case, it represents the sum of cosine similarity values between protein and the other proteins. To see the entire proteome graph structure, we specifically employed the normalized Laplacian,
We remark that there are two types of often-used normalized Laplacian matrices, and , in the field of machine learning (von Luxburg, 2007), and our mathematical analysis can provide a clear interpretation to each of them in the context of the Raman-proteome linear correspondence as described in Appendix 2.1.5.
There exist nontrivial eigenvalues of that are greater than zero and less than one. We write these eigenvalues as from the smallest and the corresponding eigenvectors as . Additionally, we denote the eigenvector corresponding to the eigenvalue zero as . Using these eigenvectors, one can construct a matrix and visualize a proteome, assigning protein with a coordinate specified by the elements after the second column in the -th row of , i.e., by the -th row of . The csLE structure we illustrate in the main figures was produced by using these eigenvectors. For example, the csLE1-csLE2 figure in the main text (Figure 6D) is a scatterplot between and . Note that the closer to one the cosine similarity of a protein pair is (the more similar their expression patterns are), the ‘closer’ the two protein species are placed (see Section 2.1.5 in Appendix for details).
This method of obtaining low-dimensional representation of data using eigenvectors of a graph Laplacian is known as Laplacian eigenmaps (LE) (Belkin and Niyogi, 2001; Belkin and Niyogi, 2003). Thus, what we explained above is the LE of a graph with edges weighted with cosine similarity of expression patterns of nodes (protein species). It differs from the original and common usages of LE in that the graph we considered is a complete graph (with loops) and that the weight of edges (pairwise similarity of nodes) is cosine similarity. It has made all the mathematical formulations linear, which allowed us to biologically interpret the results with mathematically rigorous analyses. We also remark that our graph representation of proteome does not rely on existing knowledge on the underlying interaction and regulatory networks of proteins and is based only on final expression levels of the proteins. Therefore, the results are robust against the uncertainty of underlying molecular detail.
Relevance of centrality of csLE structure to biological functions
Centrality-essentiality correlation
As mentioned in the main text, centrality of protein species with regard to stoichiometry conservation correlates with gene essentiality (Figure 5B). In Figure 5B, we analyzed the proteome data from all the 22 conditions reported by Schmidt et al., 2016. Interestingly, the centrality-essentiality correlation becomes weaker when the analysis is conducted with the data from fewer conditions (Appendix 1—figure 6A).
We obtained the list of essential genes of E. coli from EcoCyc (Keseler et al., 2017) on September 23, 2020. The list contained 318 essential genes in total. The essentiality of the genes in this list was determined on the basis of whether single-gene knockouts of BW25113 (Keio Collection) could grow under LB condition at 37°C (Baba et al., 2006).
We also confirmed centrality-essentiality correlation for S. pombe transcriptome data (Kobayashi-Kirschvink et al., 2018; Appendix 1—figure 6B, see Appendix 3.2). For this analysis, we downloaded the list of essential genes of S. pombe from PomBase (Harris et al., 2022) on May 13, 2022. The list contained 1221 essential genes in total. Here, the essentiality data by PomBase was based on the Fission Yeast Phenotype Ontology terms ‘inviable vegetative cell population’ (FYPO:0002061) and ‘viable vegetative cell population’ (FYPO:0002060) (Harris et al., 2013). Note that in our S. pombe essentiality analysis, we focused only on coding genes, whereas the csLE structure was calculated using both coding and non-coding genes. See ‘Centrality-coding/non-coding correlation’ in Materials and methods and Appendix 1—figure 6C for the proportion of coding genes in each bin in Appendix 1—figure 6B. Eleven coding genes in the S. pombe transcriptome data were not found in current PomBase. Thus, some bins do not show 100% in total in Appendix 1—figure 6B.
Stoichiometry conservation centrality in human cells was evaluated using two kinds of H. sapiens transcriptome data: (i) human cell atlas data reported in Cao et al., 2020 (Figure 5F) and (ii) genome-wide Perturb-seq data reported in Replogle et al., 2022 (Figure 5H).
The human cell atlas data (Cao et al., 2020) contain gene expression profiles in cells from 15 fetal organs. To calculate stoichiometry conservation centrality from the human cell atlas, we analyzed the pseudobulk data (GSE156793_S4_gene_expression_tissue.txt provided at the Gene Expression Omnibus). We calculated stoichiometry conservation centrality value of each gene using expression level data of 53,908 genes that are expressed at least in one organ.
The Perturb-seq data we used are gene expression profiles in a chronic myeloid leukemia cell line K562 (Replogle et al., 2022). This dataset contains single-cell RNA sequencing data of genetically perturbed cells in which expression of targeted genes is inhibited by CRISPR interference. We analyzed the pseudobulk data (K562_gwps_raw_bulk_01.h5ad provided at Figshare) to calculate stoichiometry conservation centrality. We evaluated stoichiometry conservation centrality value of each gene using the expression data of all the 8248 genes in the Perturb-seq data. We remark that this dataset did not contain genes that showed no expression under all the reported genetic perturbation conditions.
Human gene essentiality was determined by referring to another dataset reported in Wang et al., 2015, in which fitness cost imposed by gene inactivation was evaluated by a CRISPR-based method (Wang et al., 2015). The fitness cost was quantified by an index called CRISPR score; genes with lower CRISPR scores are considered more essential (Wang et al., 2015). We used the CRISPR scores calculated with a human chronic myelogenous leukemia cell line KBM7.
The CRISPR scores of 16,996 genes and 7462 genes were found in Wang et al., 2015 among the genes whose stoichiometry conservation centrality was evaluated using the human cell atlas data (Cao et al., 2020) and the Perturb-seq data (Replogle et al., 2022), respectively. We evaluated the correlations between stoichiometry conservation centrality and gene essentiality (CRISPR scores) for these common genes in Figure 5F and H. The correlations were examined with the Brunner-Munzel test (Brunner and Munzel, 2000) using R (version 4.0.3) and ‘brunnermunzel’ package (version 2.0) (Ara, 2022).
Centrality-evolutionary conservation correlation
As mentioned in the main text, centrality of proteins with regard to expression stoichiometry conservation weakly correlates with evolutionary conservation represented by the number of orthologs based on protein sequences (Figure 5C). In Figure 5C, we analyzed the proteome data from all the 22 conditions reported in Schmidt et al., 2016. We also confirmed the relation for the E. coli proteome data from fewer conditions which we had used for our Raman-proteome correspondence analyses (Appendix 1—figure 6D).
We obtained the ortholog data from OrthoMCL-DB (Chen et al., 2006) (release 6.12). We used the number of orthologs in all of the ‘Core species’ and the ‘Peripheral species’ of OrthoMCL, which are across the three domains (Bacteria, Archaea, and Eukaryota), as a proxy for evolutionary conservation of each protein. To examine the correlation, we performed the Brunner-Munzel test (Brunner and Munzel, 2000) using R (version 4.0.3) and ‘brunnermunzel’ package (version 2.0) (Ara, 2022). The E. coli proteome data contain 15 proteins with IDs that were not found in OrthoMCL-DB for technical reasons such as changes in IDs in the past, and thus, we manually processed these 15 proteins.
We also examined S. pombe transcriptome data (Kobayashi-Kirschvink et al., 2018; Appendix 1—figure 6E–G, see Appendix 3.2). We obtained ortholog data from OrthoMCL-DB (Chen et al., 2006) (release 6.12). The S. pombe transcriptome data have 11 coding genes that were found in neither the current PomBase nor OrthoMCL-DB, and two coding genes that were found in PomBase but not in OrthoMCL-DB. The S. pombe transcriptome data contain not only coding genes but also non-coding genes, and we obtained the csLE structure using both.
We also evaluated stoichiometry conservation-evolutionary conservation correlation using the human cell atlas data (Cao et al., 2020; Figure 5G) and the genome-wide Perturb-seq data (Replogle et al., 2022; Figure 5I). Ortholog data for these analyses were obtained from OrthoMCL-DB (release 6.20). We found the ortholog data in OrthoMCL-DB for 18,959 genes among the 53,908 genes with stoichiometry conservation centrality evaluated with the human cell atlas data. We remark that 98.7% of the 18,959 genes were classified as coding genes in the human cell atlas data. We also found the ortholog data for 7957 genes among the 8248 genes with stoichiometry conservation centrality evaluated with the Perturb-seq data. The correlations were examined with the Brunner-Munzel test (Brunner and Munzel, 2000) using R (version 4.0.3) and ‘brunnermunzel’ package (version 2.0) (Ara, 2022).
Centrality-coding/non-coding correlation
As mentioned in the main text and ‘Centrality-essentiality correlation’ in Materials and methods, centrality of genes with regard to stoichiometry conservation clearly correlates with coding/non-coding classification of genes in S. pombe. We observed this trend using S. pombe transcriptome data (Kobayashi-Kirschvink et al., 2018; Appendix 1—figure 6C). The coding/non-coding assignment of each gene is based on PomBase (Harris et al., 2022) data downloaded on October 11, 2022.
We observed a comparable correlation even in the human cell atlas data (Figure 5E). The gene type assignment is based on the human cell atlas data. Note that almost all the genes in the Perturb-seq data were coding genes.
Global omics structures characterized by Raman-omics correspondences
Notation
Let denote the -th row in (see Equation 3). It is an -dimensional row vector whose components represent coefficients of protein . The first component is the constant term, and the -th component is the coefficient for LDA() Raman. Below, we consider the coefficients normalized with the constant terms,
Raman-proteome correspondence matrix as a low-dimensional representation of proteome changes
We asked whether the stoichiometry conservation structure of the proteomes revealed by LE (Figures 5K, L, 6D) is relevant to the low-dimensional Raman LDA space. To address this, we focused on a proteome low-dimensional structure specified by the Raman-proteome coefficients, motivated by the fact that the analysis of led to the discovery of a proteome fraction that conserves mutual stoichiometry (Figure 3). We considered a space where represents the coordinate of each protein. From Equation 7 or Equation 8, protein species whose abundance ratios remain constant have an identical coordinate in this normalized coefficient space. The proteome in this normalized coefficient space is shown in Figure 6A and C.
This structure (Figure 6A and C) is constructed using the Raman LDA axes (dual basis) and is different from the csLE structure (Figures 5K, L, 6D), which is independent of Raman information. Therefore, it is nontrivial that these two structures are similar. This similarity suggests that differences in cellular Raman spectra captured by LDA might be quantitatively related to the omics structure deduced from stoichiometry-conserving relations. We will mathematically analyze this similarity in Section 2 in Appendix.
Evaluating similarity between orthogonal matrix and identity matrix
As we see in Appendix 2.1.5, an orthogonal matrix that appears in the relation connecting the two types of proteome structure must be close to an identity matrix to guarantee the structural similarity. To evaluate to what extent is close to an identity matrix, we generated many random orthogonal matrices (Mezzadri, 2006) and compared and the identity matrix with them.
We first multiplied each orthogonal matrix by itself in the sense of Hadamard product (element-wise product). Then, we regarded the resultant matrix as a scatterplot (Appendix 1—figure 9B) and calculated its Pearson correlation coefficient, assuming that element was the frequency of ‘data points’ at the coordinate . The obtained Pearson correlation coefficient can be regarded as a measure of closeness to the identity matrix. In the case of the identity matrix, the correlation coefficient takes the maximum value, one, because non-zero values are concentrated on the diagonal part.
We calculated the square of each matrix in the sense of Hadamard product for two reasons. First, since all elements of the resultant matrix are non-negative, one can ensure that the number of ‘points’ is non-negative at any coordinate. Note that it is not necessarily an integer here. Second, the sum of all the elements of the resultant matrix is necessarily . Thus, the total number of ‘points’ is equally for any orthogonal matrices compared.
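The closeness measure described in the two preceding paragraphs can be sketched as below. Each squared matrix element is treated as a fractional count of points at the coordinate given by its row and column indices, and the weighted Pearson correlation of those coordinates is returned. The function name `closeness_to_identity` and the example matrices are hypothetical, and generation of random orthogonal matrices is omitted.

```python
import math

def closeness_to_identity(q):
    """Pearson correlation of (row, column) indices weighted by the squared entries of q."""
    n = len(q)
    # Hadamard square: every weight is non-negative, and the weights of an orthogonal matrix sum to n.
    cells = [(i, j, q[i][j] ** 2) for i in range(n) for j in range(n)]
    total = sum(w for _, _, w in cells)
    mean_i = sum(i * w for i, _, w in cells) / total
    mean_j = sum(j * w for _, j, w in cells) / total
    cov = sum(w * (i - mean_i) * (j - mean_j) for i, j, w in cells) / total
    var_i = sum(w * (i - mean_i) ** 2 for i, _, w in cells) / total
    var_j = sum(w * (j - mean_j) ** 2 for _, j, w in cells) / total
    return cov / math.sqrt(var_i * var_j)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
theta = 0.3
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]

score_identity = closeness_to_identity(identity)
score_rotation = closeness_to_identity(rotation)
```

For the identity matrix the score is one. For a 2 × 2 rotation by an angle θ it works out to cos 2θ, so the score falls smoothly as weight moves from the diagonal to the off-diagonal elements.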
In addition to this method, we also evaluated the closeness of to the identity matrix (i) by comparing the magnitudes of off-diagonal elements among , the identity matrix, and random orthogonal matrices, and (ii) by comparing the magnitudes of elements of leading principal submatrices among , the identity matrix, and random orthogonal matrices. In (i), from a part consisting of - and -diagonals ( and elements) to the whole matrix, we expand step by step the area to consider by including - and -diagonals (, the final step is inclusion of the main diagonal), and calculated the sum of the square of the elements in the area at each step. In (ii), from the smallest leading principal submatrix ( element) to the whole matrix, we expand the area to consider step by step and calculate the sum of the square of the elements in the area at each step. See also schematic diagrams in the figures in Appendix (e.g. Appendix 1—figure 9D and E).
Interpretation of L1 norm/L2 norm ratio of an expression vector as a quantitative measure of expression generality
In Appendix 2.1.5, we will also see that even if is close to the identity matrix, there is another condition which must be met to guarantee the similarity of the two types of proteome structure. By considering the mathematics behind the condition, we will reveal that the two indices, stoichiometry conservation centrality (degree) and expression generality score , must be mutually proportional. Here, we explain why is a quantitative measure of the generality (or constancy) of expression levels.
First, we note that the ratio is independent of the magnitude of the expression vector . In other words, normalization does not affect the ratio:
On the basis of this, we only consider normalized expression vectors without loss of generality.
By definition, the L2 norm of a normalized expression vector (the denominator of the leftmost side of Equation 17) equals one. Thus, the ratio we are considering equals the L1 norm of the normalized expression vector:
Here, we write
where . Then,
Note that holds because of normalization. Therefore, any normalized expression vector corresponds to a point on the first orthant division of the unit -sphere .
Next, we consider a hyperplane which passes through the point in Equation 19. Since all the coefficients of this hyperplane are equal, all the intercepts are also equal. The intercept value is , thus equals the ratio in Equation 20. In other words, the ratio from Equation 20 appears as an intercept of the hyperplane passing through the point corresponding to the normalized vector with all the coefficients equal to one.
By simple calculation, one can see that the two surfaces and intersect when . (In other words, holds.) The intercept value takes the maximum when the normalized expression vector points to the ‘center’ of the first orthant division of the unit -sphere, i.e., when
This means that the expression level is even and constant across the conditions. When this evenness of expression level breaks, the intercept value decreases, and it attains the minimum when the normalized expression vector overlaps with an axis, i.e., when
which corresponds to a completely ‘condition-specific expression pattern’ (μ is the condition’s index).
See Appendix 1—figure 8A and B for a graphical explanation of the argument for the two- and three-dimensional cases.
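The properties of the L1 norm/L2 norm ratio described above can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function name `generality_score`, the number of conditions, and the toy expression vectors are all hypothetical.

```python
import math


def generality_score(expr):
    """Ratio of the L1 norm to the L2 norm of a non-negative expression vector."""
    l1 = sum(abs(x) for x in expr)
    l2 = math.sqrt(sum(x * x for x in expr))
    return l1 / l2


m = 4                                # hypothetical number of conditions
even = [3.0] * m                     # equal expression in every condition
specific = [0.0, 5.0, 0.0, 0.0]      # expressed in a single condition only
mixed = [1.0, 2.0, 3.0, 4.0]         # intermediate pattern

score_even = generality_score(even)            # upper bound: sqrt(m)
score_specific = generality_score(specific)    # lower bound: 1
score_mixed = generality_score(mixed)
score_scaled = generality_score([10.0 * x for x in mixed])  # same pattern, larger magnitude
```

In this toy case, the even pattern attains the square root of the number of conditions, the condition-specific pattern attains one, and rescaling the vector leaves the score unchanged.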
Proteome structure obtained with PCA
As mentioned in the main text, we confirmed that PCA could find a proteome structure (Appendix 1—figure 6H) similar to the csLE structure (Figure 5L and Figure 6D). Since the cosine similarity of expression vectors is the inner product of the L2-normalized expression vectors, we also performed L2 normalization of the proteome data before applying PCA in this analysis. In other words, we applied PCA to L2-normalized proteome data (see Appendix 2.1.5).
We remark that, despite the structural similarity between the PCA structure and the csLE structure, csLE has an advantage over PCA in that the relative proximity of positions reflects the strength of stoichiometry conservation between each element. In addition, as shown in the main text and Section 2 in Appendix, csLE of omics data has a direct quantitative connection to cellular Raman spectra, which is not the case for PCA.
Appendix 1
1 Materials and methods
See Materials and methods section in the main text.
2 Mathematical analysis and details
To clarify what is nontrivial in the correspondence between LDA Raman and csLE proteome, here we derive rigorous mathematical relations for the correspondence through linear algebraic calculation.
2.1 Mathematics behind the correspondence
2.1.1 Notations
We use the following notations:
denotes an -dimensional column vector of ones, and does an -dimensional zero column vector.
A vector without a hat (e.g. ) is a column vector, and a vector with a hat (e.g. ) is a row vector.
denotes an identity matrix, and does a zero matrix.
For a square matrix , , and for a vector , , where is the Kronecker delta.
For a matrix , denotes the -th row of , and denotes the -th column of .
2.1.2 Preparations
Let be the number of cells in each condition, be the number of conditions, be the original dimension of proteome, i.e., the number of protein species in the proteome data, and be the original dimension of Raman spectra after the application of the Savitzky-Golay filter. In our main data, , and .
Original preprocessed Raman data: Let
be a preprocessed Raman spectrum from cell under condition (see ‘Raman measurements and preprocessing of spectra’ in Materials and methods). The prime ′ denotes that the variable is the original preprocessed data. The are collected in an matrix:
Combining from different conditions, one can define an matrix
which contains all the preprocessed Raman data.
PCA and centering: Before LDA, PCA was first applied to the preprocessed Raman data to reduce noise. The covariance matrix is
which is positive semi-definite. PCA is formulated as the following eigenvalue problem of :
where is a diagonal matrix with the eigenvalues of as its diagonal elements in decreasing order from the upper left, and is an orthogonal matrix consisting of the eigenvectors of as its columns. Using the first columns of , i.e., the columns corresponding to the first largest eigenvalues, we obtain an matrix representing the post PCA data:
Here, is the -th PCA coefficient vector, and is the reduced dimension of the Raman spectra. The subtraction of the is for centering the data. In our case, the top 218 principal components (i.e. ) explaining 98% of the variance were used to reduce noise and dimensionality.
Let be the post PCA Raman spectrum from cell under condition , i.e.,
and be the collection of , i.e.,
Then, is written as
From Equation 2.6,
namely, for any ,
which means that the post PCA data is centered.
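The PCA step and the centering property can be illustrated with a small numerical sketch. This is an illustration under stated assumptions rather than the pipeline used for the actual data: the array sizes, the random stand-in spectra, the variable names, and the 1/n normalization of the covariance matrix are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_spectra, n_wavenumbers, n_keep = 30, 12, 4   # toy sizes, not those of the main data
spectra = rng.normal(size=(n_spectra, n_wavenumbers))

# Center the data and form the covariance matrix (positive semi-definite).
mean_spectrum = spectra.mean(axis=0)
centered = spectra - mean_spectrum
cov = centered.T @ centered / n_spectra        # 1/n convention is an assumption

# Eigendecomposition; eigh returns ascending eigenvalues, so re-sort to decreasing order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the leading principal axes to obtain the post-PCA data.
scores = centered @ eigvecs[:, :n_keep]
explained = eigvals[:n_keep].sum() / eigvals.sum()
```

Because the mean spectrum is subtracted before projection, every column of `scores` averages to zero, which is the centering property stated above.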
Population average in each condition: Let be the population average of the post PCA spectra of cells under condition . Then,
Also,
where
We define an matrix
Each row of corresponds to a condition. From Equation 2.11 and Equation 2.14, for any ,
Namely,
These relations mean that is also centered.
LDA: The within-class covariance matrix is
Here, Equation 2.12 was used. The between-class covariance matrix is
Here, Equation 2.17 was used. Assume and (the maximum possible values). In fact, and in our data. Note that . From the definitions of and above, both are positive semi-definite. LDA is formulated as the following generalized eigenvalue problem:
where is a diagonal matrix, and is an matrix that simultaneously diagonalizes and (to ):
Here, the diagonal elements in were in decreasing order from the upper left. In our analysis, the columns of were normalized.
Using , we obtain an matrix representing the post LDA data
Each row of represents a dimension-reduced Raman spectrum of each condition. Let us write the -th column of as
Then,
Transforming gives
Therefore, is a diagonal matrix, and are orthogonal to each other. As all the diagonal elements of the diagonal matrix are positive, . Furthermore,
Namely, for any ,
This means that, as the data is centered, all the columns of are perpendicular to .
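The two properties of the post-LDA condition averages derived above (mutually orthogonal columns, each perpendicular to the vector of ones) can be reproduced on synthetic data. The sketch below is only an illustration: the class structure, the sizes, the variable names, and the normalization of the covariance matrices are assumptions, and the generalized eigenproblem is solved by whitening the within-class covariance, which is one of several equivalent routes.

```python
import numpy as np

rng = np.random.default_rng(2)
n_classes, n_per_class, n_dim = 4, 25, 6       # toy sizes with equal class sizes
class_shift = rng.normal(scale=3.0, size=(n_classes, n_dim))
data = np.concatenate(
    [rng.normal(size=(n_per_class, n_dim)) + class_shift[k] for k in range(n_classes)]
)
labels = np.repeat(np.arange(n_classes), n_per_class)
data -= data.mean(axis=0)                      # centered, like the post-PCA data

means = np.stack([data[labels == k].mean(axis=0) for k in range(n_classes)])
within = sum(
    (data[labels == k] - means[k]).T @ (data[labels == k] - means[k])
    for k in range(n_classes)
) / len(data)
between = means.T @ means / n_classes

# Solve between @ w = lambda * within @ w by whitening the within-class covariance.
w_vals, w_vecs = np.linalg.eigh(within)
whiten = w_vecs / np.sqrt(w_vals)              # whiten.T @ within @ whiten = identity
lam, rot = np.linalg.eigh(whiten.T @ between @ whiten)
order = np.argsort(lam)[::-1][: n_classes - 1] # keep the n_classes - 1 leading axes
axes = whiten @ rot[:, order]
axes /= np.linalg.norm(axes, axis=0)           # normalize the columns

lda_means = means @ axes                       # one row per condition
gram = lda_means.T @ lda_means                 # diagonal if columns are orthogonal
within_proj = axes.T @ within @ axes           # diagonal (simultaneous diagonalization)
```

In this sketch, `gram` and `within_proj` are diagonal up to rounding error, and each column of `lda_means` sums to zero because the class means of centered data with equal class sizes sum to zero.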
Proteome
Let
be the absolute abundances of the -th protein. are collected in an matrix:
is the absolute abundance of the -th protein in condition .
We assume , i.e., proteome vectors for different conditions are linearly independent. Actually, in our data. We also assume that proteins with zero expression in all the conditions had been excluded from the proteome data.
2.1.3 Linear transformation between LDA Raman and proteome
We define an matrix
We denote the first column of as . Hence,
From Equation 2.29 and Equation 2.31,
Therefore, is also a diagonal matrix, and has full rank. For convenience, we write
Here, we consider the singular value decomposition (SVD) of ; i.e.,
where is a diagonal matrix whose diagonal elements are the singular values of , and . Note that we can set in the following way. Let
Then,
Thus, SVD of can be written as
As is the eigenvalue matrix of ,
and
Now, we consider linear transformation between and . We introduce the coefficient matrix that connects and as
has full rank and is therefore invertible. Thus, is obtained by
From the viewpoint of linear regression, can be regarded as the constant terms and is the coefficients for the -th LDA dimension.
We can rewrite using the row vectors in and . Writing the -th row of as ,
Likewise, writing the -th row of as ,
Then, Equation 2.46 can be written in another way:
The interpretation of each vector is summarized in Appendix 1—table 9.
2.1.4 Relation between LDA Raman and
Here, we discuss the spatial correspondence between the Raman distribution in LDA space and the normalized Raman-proteome coefficient proteome structure (Figure 6B and C).
Connecting LDA Raman and Raman-omics transformation coefficients
Let us consider in two ways. In this first approach,
In the second approach,
Comparing the two calculations yields
This is equivalent to
for any protein .
Normalization with constant terms
Next, we consider the normalization of the matrices in Equation 2.58 with constant terms.
We first define the normalized coefficient matrix as
This is normalization of the coefficients by the constant terms. Furthermore, we normalize as
Thus, the right-hand side of Equation 2.58 is
Since the first row of is , one can rewrite the left-hand side of Equation 2.58 as
Here,
Therefore,
This relation indicates that the constant term for each protein is its average abundance. Consequently,
Therefore, we obtain
Equivalently,
for any protein . This means that the normalized coefficients of each protein are mainly determined by the weighted averages of the Raman vectors, where the weights are the abundances of the protein.
As we already saw in Equation 7 or Equation 8 in Materials and methods, this equation also shows that protein pairs whose abundance ratio remains constant over all the conditions have identical normalized coefficients.
Special case – condition-specific protein
Consider an imaginary condition-specific protein whose abundance is under condition and zero under the other conditions, i.e.,
From Equation 2.59,
which indicates that the LDA Raman of condition , and Raman-proteome coefficients for -specific protein , are in the same orthant. The normalized version is obtained by dividing both sides by the first row (or from Equation 2.72):
which shows that the LDA Raman of condition , and the normalized Raman-proteome coefficients of -specific protein , are in the same orthant.
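The relations in this section can be verified on a toy example. The sketch below is an illustration under assumptions, not the analysis code: the stand-in for the condition-averaged LDA Raman data is a random matrix constructed to have centered, mutually orthogonal columns, the proteome values are synthetic, and all variable names are hypothetical. The per-axis rescaling in `axis_scale` follows from the orthogonality of the columns in this toy setting.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n_proteins = 5, 6                              # toy numbers of conditions and proteins

# Stand-in for condition-averaged LDA Raman data:
# columns are centered and mutually orthogonal.
raw = rng.normal(size=(m, m - 1))
q, _ = np.linalg.qr(raw - raw.mean(axis=0))
raman = q * np.array([4.0, 3.0, 2.0, 1.0])        # shape (m, m - 1)

proteome = rng.uniform(1.0, 10.0, size=(m, n_proteins))
proteome[:, 1] = 2.5 * proteome[:, 0]             # protein 1 keeps a constant ratio to protein 0
proteome[:, 2] = 0.0
proteome[3, 2] = 7.0                              # protein 2 is expressed only in condition 3

# Exactly determined linear system: [ones, LDA axes] @ coef = proteome.
design = np.hstack([np.ones((m, 1)), raman])
coef = np.linalg.solve(design, proteome)
constants = coef[0]                               # constant term of each protein
normalized = coef[1:] / constants                 # coefficients divided by constant terms

# Abundance-weighted average of the Raman vectors, rescaled per LDA axis.
axis_scale = m / np.sum(raman**2, axis=0)
weighted = axis_scale[:, None] * (raman.T @ proteome) / proteome.sum(axis=0)
```

In this toy setting, the constant term of each protein equals its average abundance, the two proteins with a constant abundance ratio share identical normalized coefficients, and the normalized coefficients of the condition-specific protein have the same signs as the Raman vector of the condition in which it is expressed.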
Application to main data
The LDA Raman distribution shown in Figure 6B corresponds to (scatterplots between different columns of ). On the other hand, the normalized coefficient proteome structure in Figure 6C is the scatterplots between different columns of . The above linear algebra explains the correspondence between the two. In addition, from Equation 2.72, one can understand that the homeostatic core distributes around the center of the structure because its member proteins are expressed in all the conditions.
Equation 2.76 was obtained by considering an imaginary protein whose expression levels were zero under all conditions except for one condition. To confirm this relation with actual data, we picked an almost-condition-specific protein (PaaE, highly expressed in LB condition) and a non-condition-specific protein (AcrR), and confirmed that the former approximately satisfied Equation 2.76, while the relation did not hold for the latter (Appendix 1—figure 10).
2.1.5 Relation between and
Here, we discuss the spatial correspondence between the normalized Raman-proteome coefficient proteome structure and the csLE proteome structure (Figure 6C and D).
csLE proteome structure
Consider an undirected graph where each node corresponds to one type of protein. As previously explained, let the graph be a complete graph, namely every pair of nodes is connected by an edge. Each edge is weighted with cosine similarity between the two types of protein connected by the edge. Cosine similarity of protein and protein is given by
where is the inner product of and , and is the -norm (Euclidean norm) of . Cosine similarity evaluates how similar the expression patterns are between protein and protein . When the abundance ratio between protein and protein remains constant over all the conditions, takes the maximum value 1.
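As a small illustration of this property, the following sketch computes cosine similarity for hypothetical expression vectors; the function name and the abundance values are made up for the example.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity of two expression vectors (one entry per condition)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


protein_a = [2.0, 4.0, 6.0]
protein_b = [5.0, 10.0, 15.0]   # 2.5 times protein_a in every condition
protein_c = [6.0, 4.0, 2.0]     # opposite trend across conditions

sim_ab = cosine_similarity(protein_a, protein_b)   # constant ratio: maximal similarity
sim_ac = cosine_similarity(protein_a, protein_c)   # ratio changes: lower similarity
```

The pair with a constant abundance ratio reaches the maximum value of one regardless of the difference in absolute abundance, whereas the pair whose ratio changes across conditions scores lower.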
The adjacency matrix of this graph is given by
and the degree matrix of this graph is
For simplicity, diagonal element of is for any protein , i.e., each node has a loop. is and real symmetric and is and diagonal. Then, the Laplacian matrix is given by
which is an symmetric matrix, and the symmetric normalized Laplacian is given by
which is an symmetric matrix.
Here, we define by normalizing the columns of :
By using , is rewritten as
Consider an eigenproblem
where is an diagonal matrix in which the eigenvalues of are arranged in increasing order from the upper left, and columns of are the normalized eigenvectors of corresponding to the eigenvalues. Denote
Here, has the following four characteristics:
is positive semi-definite. See, for example, von Luxburg, 2007 for the proof.
In an undirected graph with non-negative weights, the number of separated graph components equals the multiplicity of the eigenvalue zero of . See, for example, von Luxburg, 2007 for the proof. Since our proteome graph is connected, our has the single eigenvalue zero.
From Equation 2.86, . Here, it is obvious that has full rank, hence, . Therefore, . Obviously, has full rank by definition and thus, . Therefore, has singular values of zero. Since is symmetric, its singular values and eigenvalues are the same. Thus, has singular values of zero. Therefore, has singular values of .
For any -dimensional vector , . Therefore, is positive semi-definite, and all of the eigenvalues of are less than or equal to one.
By these four points, we see that the eigenvalues of satisfy
Now we define an matrix as
Then, we can write
Let be the first columns of :
The truncated version of the eigenproblem is
is the -th row of and provides a new -dimensional representation of the -th protein. This representation of the proteome reflects distance between each protein pair in terms of cosine similarity.
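The construction above can be sketched numerically as follows. This is a toy illustration with random abundances and hypothetical variable names, not the code used for the actual data; it builds the cosine similarity graph with self-loops, forms the symmetric normalized Laplacian, and checks the eigenvalue properties listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_proteins, n_conditions = 8, 3               # toy sizes, not those of the main data
abundance = rng.uniform(0.1, 1.0, size=(n_proteins, n_conditions))

# L2-normalize the expression vector of each protein (one row per protein).
unit = abundance / np.linalg.norm(abundance, axis=1, keepdims=True)

adjacency = unit @ unit.T                     # cosine similarities; diagonal = 1 (self-loops)
degree = adjacency.sum(axis=1)                # stoichiometry conservation centrality
d_inv_sqrt = 1.0 / np.sqrt(degree)

# Symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}.
lap_sym = np.eye(n_proteins) - d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :]

eigvals, eigvecs = np.linalg.eigh(lap_sym)    # eigenvalues in increasing order
embedding = d_inv_sqrt[:, None] * eigvecs     # solves the generalized problem L y = lambda D y

n_unit_eigs = int(np.sum(np.abs(eigvals - 1.0) < 1e-9))
```

In this example, the smallest eigenvalue is zero with a constant first embedding coordinate, no eigenvalue exceeds one, and the eigenvalue one appears with multiplicity equal to the number of proteins minus the number of conditions. The columns of `embedding` after the first provide the low-dimensional coordinates of the proteins.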
Here we clarify the correspondence of the eigenproblems defined by the normalized Laplacian matrices and . Let
Then, this eigenproblem can also be regarded as the following generalized eigenproblem,
Remembering that , we can further transform it into an eigenproblem
This is the form of the eigenproblem that we explained previously in ‘Global proteome structures based on stoichiometric balance’ in Materials and methods. We return to it later in this section.
The eigenproblem (Equation 2.87) can be transformed to
where
This means that the columns of are also the (normalized) eigenvectors of , and is the corresponding eigenvalue matrix.
Defining an matrix as
we can write
Note that
Equation 2.97 is further transformed into
Comparing Equation 2.102 and Equation 2.85 leads to
Connecting Raman-proteome transformation coefficients and csLE proteome
We consider in two ways. First, from Equation 2.103,
Since
the diagonal elements of , i.e., the positive eigenvalues of are the square of the singular values of . Compact SVD of is expressed as
where is an diagonal matrix whose diagonal elements are the singular values in decreasing order from the upper left, and . We then obtain
Thus,
On the other hand, from Equation 2.44 and Equation 2.46,
Therefore, comparing Equation 2.109 and Equation 2.112 yields
where is an orthogonal matrix. We define the estimate of as
By this notation, Equation 2.113 can be written as
Here, the left-hand side represents Raman-proteome linear transformation, whereas the right-hand side except for is derived only from proteome data. Note that in order to derive Equation 2.113, LDA does not need to be applied to Raman data because can be written in the form of even if LDA is not applied to Raman data.
Normalization with constant terms
Now we consider normalizing both sides of Equation 2.115 by the first columns.
With Equation 2.60 and Equation 2.61, the left-hand side of Equation 2.115 can be rewritten as
Likewise, for the right-hand side, the first column of (the estimated constant term) is
The first column of is the normalized eigenvector corresponding to the eigenvalue zero of .
By the definition of , . Hence, in general, has an eigenvalue zero and a corresponding eigenvector . Therefore, by the definition of , ; also has an eigenvalue zero and a corresponding eigenvector . The eigenproblems and are equivalent because one can obtain the eigenproblem of by left multiplying both sides of the eigenproblem of by and that of by left multiplying both sides of the eigenproblem of by . Eigenvalues of and are identical, and holds for their eigenvectors. Therefore, has an eigenvalue zero and a corresponding eigenvector . The multiplicity of the eigenvalue zero of our is one, and its corresponding eigenvector is limited to .
By writing as
Thus, Equation 2.117 can be further transformed as
Remembering that both and are diagonal, we obtain
Therefore, from Equation 2.114 and Equation 2.122, the ‘estimated coefficients’ normalized with the ‘estimated constants’ are
We remark that the eigenproblem of , i.e.,
is equivalent to solving a minimization problem
Thus, the closer and are in terms of cosine similarity, the closer and (the -th and -th rows of , respectively).
The relation between the minimization problem and the eigenproblem is the following. The objective function of the minimization problem is
Here,
Therefore, Equation 2.130 can be transformed into
Thus, the minimization problem is
This can be transformed into the generalized eigenproblem by the method of Lagrange multipliers (Ghojogh et al., 2019).
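The link between the pairwise objective and the Laplacian quadratic form can be checked numerically. The sketch below uses the standard identity for graph Laplacians with a prefactor of 1/2; the prefactor convention, the toy data, and the variable names are assumptions and may differ from the convention used in the equations above.

```python
import numpy as np

rng = np.random.default_rng(5)
n_proteins, n_conditions, n_dims = 7, 3, 2    # toy sizes
abundance = rng.uniform(0.1, 1.0, size=(n_proteins, n_conditions))
unit = abundance / np.linalg.norm(abundance, axis=1, keepdims=True)

adjacency = unit @ unit.T                     # cosine similarity weights
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

coords = rng.normal(size=(n_proteins, n_dims))  # arbitrary candidate embedding

# Similarity-weighted sum of squared distances between all pairs of proteins.
diffs = coords[:, None, :] - coords[None, :, :]
pairwise_cost = 0.5 * np.sum(adjacency * np.sum(diffs**2, axis=2))

# Equivalent quadratic form with the graph Laplacian.
trace_form = np.trace(coords.T @ laplacian @ coords)
```

The two quantities agree for any candidate embedding, so minimizing the weighted pairwise distances under a normalization constraint is equivalent to an eigenproblem of the Laplacian. Self-loops contribute nothing because the distance of a point to itself is zero.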
Remember that this property of and is analogous to that of and (the -th and -th rows of , respectively) as explained in Section 2.1.4 in Appendix and ‘Raman-proteome correspondence matrix as a low-dimensional representation of proteome changes’ in Materials and methods.
Considering that the first (upper-left) element of is one, the right-hand side of Equation 2.113 can be transformed into
Therefore, from Equations 2.115, 2.116, 2.136, 2.137,
This is the equation that links the normalized Raman-proteome coefficient proteome structure and the csLE proteome structure.
Mathematical interpretation of the obtained equation
From Equation 2.139, if the distributions of and are similar, the orthogonal matrix must be similar to the identity matrix because large off-diagonal elements of make lower dimensions ‘mix’ much with the higher dimensions. In addition, the directions of diagonal matrices and must also be close to each other even if is close to the identity matrix. Note that the first column of also reflects the relation between and .
The obtained relation between Raman-proteome normalized coefficient structure and csLE structure is summarized in Appendix 1—table 10.
Note that the relation between and can change depending on normalization of . However, this difference is not important for the spatial correspondence between the two structures because it only affects the scaling of the axes. Rather, and , which determine the order of columns of and , are important.
Application to main data
The normalized coefficient proteome structure in Figure 6A and C is the scatterplots between different columns of . The cosine similarity proteome structures in Figure 5L and Figure 6D are the scatterplots between different columns of .
In our data analysis, we calculated as , where each column of was normalized.
On the basis of the results of the mathematical analysis, we compared and in Appendix 1—figure 9G.
The similarity between the projections of the two distinct omics structures onto low-dimensional subspaces suggests that is close to the identity matrix (Equation 2.138 and Appendix 1—table 10). Appendix 1—figure 9A, C–E shows that the actual is indeed significantly close to the identity matrix. This suggests that the major changes in cellular Raman spectra detectable by LDA reflect the major changes in the proteome characterized by LE based on stoichiometry balance (cosine similarity).
The structural similarity also suggests that directions of and are also similar (Equation 2.138 and Appendix 1—table 10).
In fact, Appendix 1—figure 9F confirmed good agreement between and . Since these two quantities are calculated only from proteome data, this agreement is a characteristic of proteome data. See the next section (Section 2.2 in Appendix) for further analyses and discussion on this point.
2.2 Quantitative constraint on omics profiles
2.2.1 From agreement between constant terms and estimated constant terms to proportionality between L1 norm/L2 norm ratio and degree
We observed above that constant terms and the estimated constant terms were strongly correlated (Appendix 1—figure 9F). It is of note that both and can be calculated only from omics data. Specifically, from Equations 2.121 and 2.69,
Here, and are the L1 and L2 norms of and reflect only the expression property of protein . On the other hand, the degree is a measure of the relationships of protein with the other proteins because is the sum of cosine similarities, .
The observed relation
(Appendix 1—figure 9F) indicates that a proportionality relation
must hold approximately.
As mentioned in the main text, we refer to the ratio of the L1 norm to the L2 norm on the left-hand side of Equation 2.143 as the expression generality score because it can be interpreted as a measure of the constancy and generality of the expression levels of the protein (see ‘Interpretation of L1 norm/L2 norm ratio of an expression vector as a quantitative measure of expression generality’ in Materials and methods, Appendix 1—figure 8A and B).
When a protein is perfectly condition-specific and expressed only in a particular condition, its L1 norm equals its L2 norm, and the ratio takes the minimum value, one. When a protein is expressed equally across all the conditions, its L1 norm is greater than its L2 norm, and the ratio takes the maximum value, the square root of the number of conditions (Appendix 1—figure 8A and B). On the other hand, the right-hand side of Equation 2.143, which we refer to as stoichiometry conservation centrality in the main text, measures to what extent the protein conserves its stoichiometry with the other proteins. Therefore, the proportionality relation (Equation 2.143) suggests a global quantitative constraint between condition specificity of expression patterns and stoichiometry conservation strength. Positions of SCGs and density of genes in the csLE structure (Figure 5A, K, and L) already suggested that genes with less condition-specific expression patterns have more genes with stoichiometrically similar expression patterns, and the proportionality here quantitatively captures this property of omics dynamics. We remark that the proportionality relation was confirmed in all the omics data we analyzed in this paper (see Section 3.2 and Appendix 1—figure 7I–N).
2.2.2 Mathematics behind proportionality between stoichiometry conservation centrality and expression generality score
The cosine similarity-based analyses involve normalization of expression vectors by their L2 norms (Equation 2.77). Normalized expression vectors represent points on the first orthant division of a unit sphere in a space whose dimension equals the number of conditions. L2 normalization allows us to compare expression patterns without considering expression magnitudes, and an expression pattern is represented by a position on the unit sphere. Therefore, our cosine similarity-based quantification of stoichiometric balance in omics data is equivalent to evaluating distances (measured with angle) between points on the sphere.
Since stoichiometry conservation centrality is the sum of cosine similarities,
Defining
we obtain
Note that . The last term is the normalized vector of the sum of all the normalized expression vectors, which we refer to as ‘expression-pattern norm vector’. Therefore, Equation 2.147 means that stoichiometry conservation centrality is proportional to the cosine of the angle between the expression-pattern norm vector and protein ’s expression pattern . The more distant the expression pattern of a protein is from that specified by the expression-pattern norm vector, the smaller is.
On the other hand, expression generality score is
Therefore,
The last term is the normalized vector of , corresponding to the ‘center’ of the first orthant division of the unit -sphere. In other words, represents ‘perfectly even expression pattern’ across conditions. Therefore, Equation 2.150 means that the expression generality score is proportional to the cosine of the angle between the ‘perfectly even expression pattern’ and protein ’s expression pattern . The more distant the expression pattern of a protein is from the ‘perfectly even expression pattern’, the smaller the expression generality score is.
Comparing Equations 2.147 and 2.150, we see that if the expression-pattern norm vector and the perfectly even expression pattern are equal, a proportional relationship
holds. Note that this is equivalent to .
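The geometric argument above can be illustrated numerically. In the sketch below, which uses random toy abundances and hypothetical names, centrality is computed both as the row sum of cosine similarities and as the inner product with the sum of all normalized expression vectors, and generality is computed both as the L1 norm/L2 norm ratio and as the inner product with the vector of ones. A second, artificially symmetrized dataset (all cyclic shifts of the conditions) is constructed so that the sum vector points exactly to the even expression pattern.

```python
import numpy as np


def centrality_and_generality(abundance):
    """Return L2-normalized vectors, degree centrality, and L1/L2 generality score."""
    norms = np.linalg.norm(abundance, axis=1, keepdims=True)
    unit = abundance / norms
    centrality = (unit @ unit.T).sum(axis=1)        # self-loop included
    generality = abundance.sum(axis=1) / norms[:, 0]
    return unit, centrality, generality


rng = np.random.default_rng(4)
n_proteins, m = 50, 4                               # toy sizes
abundance = rng.uniform(0.05, 1.0, size=(n_proteins, m))

unit, centrality, generality = centrality_and_generality(abundance)
pattern_sum = unit.sum(axis=0)                      # sum of all normalized expression vectors
centrality_via_sum = unit @ pattern_sum
generality_via_ones = unit @ np.ones(m)

# Symmetrized toy data: the sum vector becomes exactly even across conditions.
symmetric = np.concatenate([np.roll(abundance, k, axis=1) for k in range(m)])
unit_sym, centrality_sym, generality_sym = centrality_and_generality(symmetric)
ratio_sym = centrality_sym / generality_sym
```

For the symmetrized data, the ratio of centrality to generality is the same for every protein, whereas for generic data the ratio varies with the direction of each expression pattern relative to the sum vector.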
Conversely, if deviates from , the proportional relation between and breaks. We found that the proteome data by Schmidt et al., 2016 showed . Instead, we found that the values of the elements of increased approximately linearly with the population growth rates under corresponding conditions (Appendix 1—figure 8D). Such a strong positive correlation between the elements of and the population growth rates is nontrivial and suggests a new growth law constraining the total of relative expression level changes of all the proteins.
Next, we consider the consequence of this positive correlation between the elements of and the population growth rates. Let us consider the proteins whose expression generality score takes the minimum value one. Namely, the expression of these proteins is completely condition-specific (Appendix 1—figure 8A and B). For such proteins, only one component of the expression pattern vector is one, and the other components are zero. Thus, from Equation 2.147, their stoichiometry conservation centrality becomes proportional to the elements of corresponding to the conditions under which they are expressed. Since the values of the elements of are positively correlated with the population growth rates, of completely condition-specific proteins also exhibits a positive correlation with the growth rates under the conditions accompanying their expression (Appendix 1—figure 8C).
Such correlation can be confirmed for the proteins with nearly condition-specific expression patterns (PaaE, Asr, and DgoA in Figure 7B and C).
More generally, the deviation of from the perfect proportionality line can be understood by the relation
which can be derived from Equations 2.147, 2.150. Note that the values of the elements of the last term also increase with the population growth rates, being positive under fast growth conditions and negative under slow growth conditions (Appendix 1—figure 8D). Therefore, when protein tends to be expressed higher under fast growth conditions, i.e., when the elements of corresponding to the fast growth conditions are relatively larger than those corresponding to the slow growth conditions, the left-hand side of Equation 2.152 becomes positive, and its resides above the perfect proportionality line. On the other hand, when protein tends to be expressed higher under slow growth conditions, its resides below the perfect proportionality line.
In the - plot in Figure 7A and B, we find several stretches of protein clusters above and below the perfect proportionality line. As expected from the argument above, each cluster corresponds to a group of proteins with similar expression patterns, and their positions relative to the proportionality line characterize the condition under which they are expressed the most (Figure 7C).
In summary, visualizing omics data by using the stoichiometry conservation centrality and the expression generality score allows us to systematically characterize the condition-dependent expression pattern of each protein on the basis of its position in the plot. Interestingly, this systematic characterization of gene expression patterns (the relation between and ) was derived from our mathematical analyses of the correspondences between Raman and omics as we explained above.
3 Extended data analysis
3.1 Growth laws
3.1.1 Single-gene-level growth law
The bacterial growth law states that the total abundance of ribosomal components increases linearly with growth rate (Neidhardt and Magasanik, 1960; Scott et al., 2010; Bremer and Dennis, 2008). The homeostatic core (the largest SCG) identified in our analysis contains many ribosomal proteins. Hence, it is plausible that the total abundance of homeostatic core proteins also increases linearly with growth rate, which we indeed found to be true (Appendix 1—figure 5A). Furthermore, the abundance ratios of homeostatic core proteins are conserved across conditions. Therefore, the intracellular abundance of each protein species in the homeostatic core is expected to increase linearly with growth rate.
Let be the abundance of protein in the homeostatic core in environment . Since this protein conserves the stoichiometry with the other homeostatic core proteins across conditions,
in any environment ( is the environment-independent abundance ratio of the homeostatic core protein to protein ).
Let be the total abundance of homeostatic core proteins in environment . The growth law for the homeostatic core is
where is the growth rate in environment , is the y-intercept, and is the slope of the linear relation. Therefore,
This shows that the abundance of each homeostatic core protein satisfies a single-gene-level growth law.
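The argument can be made concrete with a short numerical sketch. All numbers below (growth rates, growth-law parameters, and the fixed share of one protein in the core) are hypothetical and chosen only to illustrate the reasoning.

```python
growth_rates = [0.3, 0.6, 0.9, 1.2]      # hypothetical growth rates (1/h)
intercept, slope = 2.0, 5.0              # hypothetical growth-law parameters of the core total
core_total = [intercept + slope * g for g in growth_rates]

fraction = 0.04                          # hypothetical fixed share of one protein in the core
protein = [fraction * t for t in core_total]


def linear_fit(x, y):
    """Ordinary least-squares line through (x, y); returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


fit_intercept, fit_slope = linear_fit(growth_rates, protein)
```

Because the share of the protein in the core is fixed, the fitted line for the single protein has the intercept and slope of the core total multiplied by that share.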
3.1.2 Extended verification of stoichiometry conservation
When the abundance ratios between protein and protein are conserved,
where and specify the environments ( signifies the standard environment). Hence,
where is the common abundance ratio of stoichiometry-conserving proteins with respect to the condition . Note that is common among all the proteins in a stoichiometry-conserving group.
From Equation 3.5,
Therefore, plotting against should find the stoichiometry-conserving proteins aligned on a straight line with a slope of 1. We indeed find such plots for the homeostatic core proteins (Appendix 1—figure 5B and C). This result confirms their stoichiometry conservation from a different perspective.
3.1.3 Linear dependence of common abundance ratio on growth rate
Since the total amount of homeostatic core proteins increases linearly with growth rate (Appendix 1—figure 5A and Equation 3.2),
Since ,
Hence,
where . Therefore, the common abundance ratio also increases linearly with growth rate.
Estimating as the y-intercepts of the regression lines with a slope of 1 (Appendix 1—figure 5B), we confirmed this linear dependence of the common abundance ratio of homeostatic core proteins on growth rate (Appendix 1—figure 5D).
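The estimation of the common abundance ratio from a regression line with a slope of 1 can be sketched as follows. This example assumes that the plot uses logarithmic axes, so that the intercept equals the logarithm of the common ratio; the abundance values and the ratio are hypothetical.

```python
import math

standard = [120.0, 45.0, 800.0, 15.0]    # hypothetical abundances in the standard environment
common_ratio = 1.8                       # hypothetical common abundance ratio
other = [common_ratio * a for a in standard]

log_x = [math.log10(a) for a in standard]
log_y = [math.log10(a) for a in other]

# Least-squares intercept when the slope is fixed to 1: the mean of (log_y - log_x).
intercept = sum(y - x for x, y in zip(log_x, log_y)) / len(log_x)
estimated_ratio = 10 ** intercept
```

With perfectly stoichiometry-conserving proteins, all points fall on a line with a slope of 1, and the intercept recovers the common ratio exactly. With real data, the scatter around this line indicates how well stoichiometry is conserved.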
3.2 Generality of the results
3.2.1 Additional datasets with Raman data
Correspondence of three types of space
The correspondences among LDA Raman, Raman-omics normalized coefficient omics structure, and csLE omics structure were also observed in other datasets. The datasets include Raman (this paper) and proteome (Schmidt et al., 2016) data of E. coli with different genotypes (BW25113, MG1655, and NCM3722) cultured in the ‘LB’ medium (Appendix 1—figure 11A–E) and in the ‘Glucose’ medium (Appendix 1—figure 11F–J), and Raman and transcriptome data of S. pombe cultured under 10 different environmental conditions (Appendix 1—figure 11K–O; Kobayashi-Kirschvink et al., 2018).
Comparison of matrices obtained by mathematical analyses
In addition to the comparison of three types of space, we also examined the matrices on the basis of results of the mathematical analyses (Appendix 1—table 10) using the aforementioned additional datasets; the closeness of and (Appendix 1—figure 12F for Raman-proteome of E. coli with different genotypes cultured in ‘LB’, Appendix 1—figure 12L for Raman-proteome of E. coli with different genotypes cultured in ‘Glucose’, and Appendix 1—figure 12R for Raman-transcriptome of S. pombe cultured in 10 environmental conditions), the closeness of to the identity matrix (Appendix 1—figure 12A–D for Raman-proteome of E. coli with different genotypes cultured in ‘LB’, Appendix 1—figure 12G–J for Raman-proteome of E. coli with different genotypes cultured in ‘Glucose’, and Appendix 1—figure 12M–P for Raman-transcriptome of S. pombe cultured in 10 environmental conditions), and the correspondence between and (Appendix 1—figure 12E for Raman-proteome of E. coli with different genotypes cultured in ‘LB’, Appendix 1—figure 12K for Raman-proteome of E. coli with different genotypes cultured in ‘Glucose’, and Appendix 1—figure 12Q for Raman-transcriptome of S. pombe cultured in 10 environmental conditions).
We confirmed that the same results hold for these additional datasets. Note that the correspondence between and does not involve Raman data. It is an intrinsic property of the omics data.
Proportionality between expression generality and stoichiometry conservation centrality
The correspondence between and suggested the proportionality between expression generality score and stoichiometry conservation centrality in these omics data. We confirmed that the same results hold for these additional datasets (Appendix 1—figure 7I–N).
Correlation between and growth rates
For the proteome data of E. coli with different genotypes (BW25113, MG1655, and NCM3722) cultured in ‘LB’ and in ‘Glucose’, growth rates were also reported (Schmidt et al., 2016). We confirmed a positive correlation between the elements of and growth rates for these datasets (Appendix 1—figure 8E). See also the deviation from the proportionality line in the plot (Appendix 1—figure 7I–J).
Biological relevance of centrality of csLE structure
We also confirmed centrality-essentiality correlation and centrality-evolutionary conservation correlation in the S. pombe transcriptome data (Appendix 1—figure 6B and E–G).
Degree distribution and its destruction by randomization
Degree (stoichiometry conservation centrality) distributions of the csLE structures of the additional datasets showed patterns similar to those of the main data, and randomizing the omics data broke the strong correlation of expression patterns seen in the actual data (Appendix 1—figure 7C, D, and H).
3.2.2 Additional datasets without Raman data
Examining proteome structures with csLE does not require Raman data. Therefore, we additionally analyzed publicly available proteome data of M. tuberculosis and M. bovis under the growth conditions with distinct oxygen levels (Schubert et al., 2015), and the proteome data of S. cerevisiae under various environmental conditions (Lahtvee et al., 2017).
We characterized csLE structures of these datasets (Appendix 1—figure 13). Furthermore, we confirmed the proportionality between expression generality score and stoichiometry conservation centrality (Appendix 1—figure 7K–M). For the proteome data of S. cerevisiae, growth rates were also reported (Lahtvee et al., 2017). The S. cerevisiae cells were cultured in a chemostat at the same dilution rate under all conditions. Accordingly, we observed little variation of (Appendix 1—figure 8F), which leads to little deviation from the proportionality line (Appendix 1—figure 7M).
Degree (stoichiometry conservation centrality) distributions of the csLE structures of the additional datasets showed patterns similar to those of the main data, and randomizing the omics data broke the strong correlation of expression patterns seen in the actual data (Appendix 1—figure 7E–G).
Schematic illustration of the approach in this study.
Related to Figure 1. Raman spectra and gene expression profiles are both high-dimensional vectors and can be represented as points in high-dimensional spaces. Coarse-graining Raman spectra by dimensional reduction finds condition-dependent differences in their global spectral patterns (see Figure 2). The dimension-reduced spectra were linked to and used to predict condition-dependent global gene expression profiles (see Figure 2), which implies that global changes in spectral patterns detect differences in cellular physiological states. The analysis of this linkage led us to discover a stoichiometry-conserving constraint on gene expression, which enabled us to represent gene expression profiles in a functionally relevant low-dimensional space (i; see also Figures 3—5). Then, we find a nontrivial correspondence between these low-dimensional Raman and gene expression spaces (ii; see also Figure 6). This correspondence provides an omics-level interpretation of global Raman spectral patterns and a quantitative constraint between expression generality and stoichiometry conservation centrality (ii; see also Figure 7, Appendix 1—figure 9).
Custom-built Raman microscope and analyses of E. coli Raman spectra.
Related to Figure 2. (A) Schematic diagram of the Raman microscope used in this study. (B) Representative Raman spectra from single E. coli cells. The fingerprint region of one spectrum is shown for each condition. (C) Linear superposition of Raman shifts. Each linear discriminant analysis (LDA) axis is a linear superposition of Raman shifts. These figures show the coefficients for LDA1 (left) and LDA2 (right). (D) Relationship between Raman LDA1 axis and growth rates. The horizontal axis represents Raman LDA1 axis. The vertical axis represents growth rates measured in Schmidt et al., 2016. Each point corresponds to the data for one condition. Pearson correlation coefficient is 0.81±0.09.
Estimation of proteomes from Raman spectra.
Related to Figure 2. Comparing the measured proteomes with those estimated from Raman spectra. The horizontal and vertical axes represent the estimated and measured proteomes, respectively. Proteins with negative estimated abundance are not shown in these figures. The conditions with the largest and the second largest numbers of proteins with negative estimated abundance were ‘stationary3days’ (666 proteins) and ‘LB’ (359 proteins). The conditions with the fewest and the second fewest negatively estimated proteins were ‘GlucosepH6’ (0 proteins) and ‘Xylose’ (7 proteins).
Comparison of stoichiometry conservation among Clusters of Orthologous Group (COG) classes.
Related to Figure 3. (A and B) Relations between protein abundance and constant terms of Raman-proteome coefficients. The horizontal axes are (constant terms), and the vertical axes are (protein abundance). Dashed lines are the least squares regression lines with intercept zero for information storage and processing (ISP) COG class members. The average of was used as an estimate of here. In (A), only ISP COG class members are shown for three representative conditions: ‘Galactose’, ‘Glucose’, and ‘GlycerolAA’. In (B), all proteins are shown for a representative condition, ‘GlycerolAA’. (C) Relations between protein abundance and growth rates of E. coli under 15 environmental conditions. We analyzed the absolute quantitative proteome data, growth rate data, and COG annotation reported by Schmidt et al., 2016. Lines represent different protein species. Error bars are standard errors. The top panel is for the Cellular Processes and Signaling COG class; the middle is for the ISP COG class; and the bottom is for the Metabolism COG class. (D) Relations between protein abundance and growth rates of three E. coli strains (BW25113, MG1655, and NCM3722) under two culture conditions. We again analyzed the data by Schmidt et al., 2016. Lines represent different protein species. Error bars are standard errors. (E and F) COG class-dependent expression pattern similarity of E. coli proteomes between conditions. The E. coli proteome data under the 15 different environmental conditions were analyzed. The similarity is evaluated by Pearson correlation coefficients of log expression levels in (E) and by cosine similarity in (F). We consider all the combinations of the 15 conditions. Thus, there are 105 data points for each COG class. The box-and-whisker plots summarize the distributions of the points. The lines inside the boxes denote the medians. The top and bottom edges of the boxes denote the 75th and 25th percentiles, respectively. 
Note that (E) and (F) are evaluations of the same data used in Figure 3B in the main text with different similarity indices. (G) COG class-dependent expression pattern similarity between different strains of E. coli (BW25113, MG1655, and NCM3722). The absolute quantitative proteome data and COG annotation were taken from Schmidt et al., 2016. The similarity was evaluated by cosine similarity. The data contain three strains. Thus, there are three points for each COG class. The top panel is for the ‘Glucose’ condition, and the bottom is for the ‘LB’ condition. (H–J) COG class-dependent expression pattern similarity in other organisms. (H) is for M. tuberculosis (data from Schubert et al., 2015; six environmental conditions [time points]), (I) for M. bovis (data from Schubert et al., 2015; six environmental conditions [time points]), and (J) for S. cerevisiae (data from Lahtvee et al., 2017; 10 environmental conditions). The COG annotations were taken from the December 2014 release of 2003-2014 COGs (Galperin et al., 2015) and the Release 3 of ‘Mycobrowser’ (Kapopoulou et al., 2011) for (H) and (I) and from the Comprehensive Sake Yeast Genome Database (S288C strain) (Akao et al., 2011) for (J). The unit for protein abundance was fg/cell for (H) and (I) and fg in pg dry cell weight for (J).
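The pairwise comparison in (E) and (F) can be illustrated with a minimal sketch. It assumes that the abundances of one COG class are stored per condition in the same protein order and, for the log-based variant, that all abundances are strictly positive. The names `cosine_similarity`, `pearson`, and `pairwise_similarities` are hypothetical helpers, not the authors' code.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two abundance vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(u, v):
    """Pearson correlation, computed as the cosine similarity of centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u], [b - mean_v for b in v])


def pairwise_similarities(profiles, use_log_pearson=False):
    """Similarity for every unordered pair of conditions.

    profiles: dict mapping a condition name to a list of protein abundances
    (assumed layout; one list per condition, identical protein order).
    """
    result = {}
    for c1, c2 in combinations(sorted(profiles), 2):
        u, v = profiles[c1], profiles[c2]
        if use_log_pearson:
            # Pearson correlation of log expression levels, as in panel (E).
            result[(c1, c2)] = pearson([math.log(a) for a in u],
                                       [math.log(b) for b in v])
        else:
            # Cosine similarity, as in panel (F).
            result[(c1, c2)] = cosine_similarity(u, v)
    return result
```

With 15 conditions, `combinations` yields 15 × 14 / 2 = 105 pairs, which matches the 105 data points per COG class stated in the caption.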
Single-gene-level growth law in the homeostatic core.
Related to Figure 4. (A) Relationship between population growth rates and total abundance of SCG 1 (homeostatic core) proteins. Here, we analyzed the E. coli proteome data (Schmidt et al., 2016), focusing on the 15 conditions for which we obtained Raman data. The dashed line is the least squares regression line. (B) Scatterplots of log abundance of SCG 1 (homeostatic core) proteins. Here, the proteomes under three representative conditions, ‘LB’, ‘Glucose’, and ‘Galactose’, are compared with that under the standard condition ‘Glycerol’. Each colored line is the linear regression line with slope one for the points with the same color. The vertical line is . (C) Relationship between population growth rate and coefficient of determination of linear regression in (B). The vertical line represents the growth rate under the standard condition (‘Glycerol’). (D) Linear relationship between common abundance ratio and growth rates. The vertical axis represents , where is the y-intercepts in (B) (see Section 3.1.2 in Appendix). The dashed line is the linear regression line. The horizontal line is , and the coordinate of the vertical line is the growth rate under the standard condition (‘Glycerol’). (E) The gene loci of the proteins belonging to the condition-specific stoichiometrically conserved groups (SCGs) on the chromosome (ASM75055v1.46; Howe et al., 2020). Colored dots are nodes (genes), and gray lines are edges (high cosine similarity relationships). The edge in the map of SCG 5 cannot be seen because their gene loci are clustered in close proximity in the same operon.
Functional relevance of stoichiometry conservation centrality.
Related to Figure 5. (A) Relationship between gene essentiality and stoichiometry conservation centrality in E. coli. The proportion of essential genes is plotted for each stoichiometry conservation centrality rank range. In this plot, we calculated stoichiometry conservation centrality based on the E. coli proteome data (Schmidt et al., 2016) under the 15 conditions for which we obtained Raman data. The list of essential genes was downloaded from EcoCyc (Keseler et al., 2017). (B) Relationship between gene essentiality and stoichiometry conservation centrality in S. pombe. We calculated stoichiometry conservation centrality based on the S. pombe transcriptome data reported in Kobayashi-Kirschvink et al., 2018. Only coding genes are considered in this plot, though stoichiometry conservation centrality values were calculated using both coding and non-coding genes. Gene classification is based on PomBase (Harris et al., 2022). Some bins do not reach 100% in sum because 11 coding genes in the S. pombe transcriptome data were not found in the current PomBase. (C) Relationship between ratio of coding genes and stoichiometry conservation centrality in the S. pombe transcriptome data. The coding/non-coding assignment is based on PomBase (Harris et al., 2022). (D) Correlation between stoichiometry conservation and evolutionary conservation. In this plot, we calculated stoichiometry conservation centrality based on the E. coli proteome data (Schmidt et al., 2016) under the 15 conditions for which we obtained Raman data. Colors represent the height of each bar. The distributions of stoichiometry conservation centrality were compared between the top 25% and the bottom 25% fractions in the number of orthologs rankings. The fraction with many orthologs tends to have higher stoichiometry conservation centrality (one-sided Brunner-Munzel test, ). 
The distributions of the number of orthologs were compared between the top 25% and the bottom 25% stoichiometry conservation centrality fractions. The high centrality fraction tends to have more orthologs (one-sided Brunner-Munzel test, ). Ortholog data were taken from OrthoMCL-DB (Chen et al., 2006). (E–G) Correlation between stoichiometry conservation and evolutionary conservation in S. pombe. We calculated stoichiometry conservation centrality based on the S. pombe transcriptome data reported in Kobayashi-Kirschvink et al., 2018. In (E), the result is shown by a two-dimensional histogram. Colors represent the height of each bar. The distributions of the number of orthologs were compared between the top 25% and the bottom 25% stoichiometry conservation centrality fractions. The high centrality fraction tends to have more orthologs (one-sided Brunner-Munzel test, ). The direct comparison between the two fractions is shown in (F). The distributions of stoichiometry conservation centrality were compared between the top 25% and the bottom 25% fractions in the number of orthologs rankings. The fraction with many orthologs tends to have higher stoichiometry conservation centrality (one-sided Brunner-Munzel test, ). The direct comparison between the two fractions is shown in (G). Ortholog data were taken from OrthoMCL-DB (Chen et al., 2006). (H) Applying principal component analysis (PCA) to -normalized proteomes. PCA (with mean centering) was applied to -normalized proteome data . Here, we analyzed the E. coli proteome data under the 15 conditions for which we obtained Raman data. The left is a projection onto a two-dimensional space, and the right is a projection onto a three-dimensional space. The axes for visualization were selected by considering similarity to the cosine similarity LE (csLE) structure.
Distributions and constraints with respect to stoichiometry conservation centrality (degree).
Related to Figure 5 and Figure 7. (A) Comparison of degree (stoichiometry conservation centrality) distributions between original (yellow) and randomized (blue) E. coli proteome data. We created randomized proteome data by shuffling the expression levels across the protein species within each condition. We used the E. coli proteome data (Schmidt et al., 2016) under the 15 conditions for which we obtained Raman data. (B) Comparison of the - relationships between original (yellow) and randomized data (blue). The horizontal axis is expression generality score ( norm/ norm), and the vertical axis is stoichiometry conservation centrality (: degree). Each dot represents a protein species. The dashed lines are , (). The solid lines are . (C–H) Degree (stoichiometry conservation centrality) distributions for additional datasets. Yellow histograms are for the original data, and blue histograms are for the randomized data. (C) For the proteomes of three E. coli strains (BW25113, MG1655, and NCM3722) in LB (Schmidt et al., 2016); (D) for the proteomes of the three E. coli strains in M9 Glucose (Schmidt et al., 2016); (E) for the proteomes of M. tuberculosis (Schubert et al., 2015); (F) for the proteomes of M. bovis (Schubert et al., 2015); (G) for the proteomes of S. cerevisiae (Lahtvee et al., 2017); and (H) for the transcriptomes of S. pombe (Kobayashi-Kirschvink et al., 2018). (I–N) - relationships for additional datasets. Each gray dot represents a protein species. The proteins belonging to the homeostatic core in each dataset are shown in magenta; those belonging to condition-specific stoichiometrically conserved groups (SCGs) are indicated in different colors in each plot. See the caption of Appendix 1—figures 11 and 13 for the cosine similarity threshold to specify the homeostatic core and the condition-specific SCGs in each dataset. The dashed lines are . The solid lines through the origins are (I) for the proteomes of the three E. 
coli strains in LB (Schmidt et al., 2016); (J) for the proteomes of the three E. coli strains in M9 Glucose (Schmidt et al., 2016); (K) for the proteomes of M. tuberculosis (Schubert et al., 2015); (L) for the proteomes of M. bovis (Schubert et al., 2015); (M) for the proteomes of S. cerevisiae (Lahtvee et al., 2017); and (N) for the transcriptomes of S. pombe (Kobayashi-Kirschvink et al., 2018).
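The randomization described in (A), shuffling expression levels across protein species within each condition, can be sketched as follows. The degree computation is an assumption made for illustration: it takes the degree of a protein to be the sum of its cosine similarities to all proteins, including itself, and it requires every expression vector to be nonzero. The function names are hypothetical.

```python
import math
import random


def shuffle_within_conditions(data, rng):
    """Shuffle abundances across proteins, independently for each condition.

    data[i][k] is the abundance of protein i under condition k.
    Returns a new table; the input is left unchanged.
    """
    n_proteins = len(data)
    n_conditions = len(data[0])
    shuffled = [row[:] for row in data]
    for k in range(n_conditions):
        column = [data[i][k] for i in range(n_proteins)]
        rng.shuffle(column)
        for i in range(n_proteins):
            shuffled[i][k] = column[i]
    return shuffled


def degrees(data):
    """Sum of cosine similarities of each protein to all proteins (assumed definition)."""
    unit = []
    for row in data:
        norm = math.sqrt(sum(x * x for x in row))
        unit.append([x / norm for x in row])
    # The sum over j of cos(i, j) equals the inner product of the i-th
    # unit vector with the sum of all unit vectors.
    total = [sum(column) for column in zip(*unit)]
    return [sum(a * b for a, b in zip(u, total)) for u in unit]
```

Comparing `degrees(data)` with `degrees(shuffle_within_conditions(data, random.Random(0)))` gives the kind of original-versus-randomized contrast shown in the histograms. Shuffling within a condition preserves the abundance distribution of that condition while destroying the correlation of expression patterns across conditions.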
Properties of normalized expression vectors.
Related to Figure 7. (A and B) Schematic explanation for the interpretation of the norm/ norm ratio of expression vectors as an index of expression generality. (A) is a two-dimensional case, and (B) is a three-dimensional case. The inset in (A) schematically explains norm and norm of an expression vector. See ‘Interpretation of norm/ norm ratio of an expression vector as a quantitative measure of expression generality’ in Materials and methods for details. (C) Schematic explanation for deviations of points from the proportionality line in the - plots. Here, we consider four condition-specific protein species a, b, c, and d labeled in the descending order of growth rates under the conditions accompanying their expression. Note that their norm/ norm ratios are all one on the horizontal axis. One can show that the degree (stoichiometry conservation centrality) is proportional to the inner product of -normalized expression vector and the expression norm vector (see Equation 2.147 in Section 2.2.2). Since the elements of increase approximately linearly with growth rates of the corresponding conditions (see D), the degrees (stoichiometry conservation centrality values) decrease from a to d in the order of growth rates. (D–F) Correlation between elements of and population growth rates. The vertical axis represents the elements of , and the horizontal axis represents the population growth rates. The dashed lines are . (D) is the result from the analysis of the E. coli proteome data (Schmidt et al., 2016) under the 15 conditions for which we obtained Raman data (). (E) is the result from the analysis of the proteome data of three strains of E. coli (BW25113, MG1655, and NCM3722) under ‘LB’ and ‘Glucose’ conditions () (Schmidt et al., 2016). (F) is the result from the analysis of the proteome data of S. cerevisiae under 10 different conditions () (Lahtvee et al., 2017). The cells were cultured in a chemostat with the same dilution rate. 
The numbers of analyzable protein species and the numbers of conditions were different between (D) and (E). Thus, the values of the vertical axes cannot be compared directly between them.
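As a small illustration of the expression generality index in (A–C), the sketch below assumes that the ratio in question is the L1 norm divided by the L2 norm of an expression vector. This reading is inferred from the statement that condition-specific proteins have a ratio of one; `generality_score` is a hypothetical name.

```python
import math


def generality_score(expression):
    """Ratio of the L1 norm to the L2 norm of one protein's expression vector.

    Assumed definition: the ratio is 1 when the protein is expressed under a
    single condition and sqrt(n) when it is expressed equally under n conditions.
    """
    l1 = sum(abs(x) for x in expression)
    l2 = math.sqrt(sum(x * x for x in expression))
    return l1 / l2


specific = generality_score([0.0, 0.0, 5.0, 0.0])  # expressed under one condition only
general = generality_score([1.0, 1.0, 1.0, 1.0])   # expressed equally under four conditions
```

Under this assumption the score is bounded between 1 and the square root of the number of conditions, so values from datasets with different numbers of conditions are not directly comparable.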
Mathematical analyses of the main Raman-proteome data.
Related to Figure 6. Proteomes of E. coli under 15 conditions (Schmidt et al., 2016) and corresponding Raman data we measured in this study were analyzed in this figure. (A) Visual comparison of the unit matrix , the orthogonal matrix obtained from the data, and a random orthogonal matrix. Height of each bar indicates the value of each element. Colors represent the height of each bar. For clarifying the position of each element, a component form of matrix is shown in the middle (). For (middle) and a random orthogonal matrix (right), the original matrices are displayed in the upper row, and matrices whose elements are the absolute values of the corresponding elements of the original matrices are displayed in the lower row. (In this figure, represents a matrix of which the element is the absolute value of the element of .) (B) Representation of matrices as scatterplots. See ‘Evaluating similarity between orthogonal matrix and identity matrix’ in Materials and methods for details. (C) Comparison of the unit matrix , the orthogonal matrix obtained from the data, and random orthogonal matrices by Pearson correlation coefficients. Pearson correlation coefficient of the element-wise squared matrix of each matrix can be regarded as a measure of closeness to the identity matrix ( represents element-wise multiplication). The probability of finding a random orthogonal matrix with Pearson correlation coefficient greater than the Pearson correlation coefficient of was (no occurrence in 10⁵ samplings). See ‘Evaluating similarity between orthogonal matrix and identity matrix’ in Materials and methods for details. (D) Comparison of magnitudes of off-diagonal elements among the unit matrix , the orthogonal matrix obtained from the data, and random orthogonal matrices . The lattice on the top explains the numbering of -diagonals (, ). In the lattices on the bottom, black color indicates areas in which the elements are squared and summed at the corresponding steps (i.e. 
areas represented by in the graph). The sum of the squared values in each step is shown in the middle graph. Error bars of the random matrix line are standard errors of 100 samplings. See ‘Evaluating similarity between orthogonal matrix and identity matrix’ in Materials and methods for details. (E) Comparison of magnitudes of elements of leading principal submatrices among the unit matrix , the orthogonal matrix obtained from the data, and random orthogonal matrices . In the lattices on the bottom, black color indicates an area in which elements are squared and summed at the corresponding step (i.e. an area represented by in the graphs). The sum of the squared values in each area is shown in the top graph. The results shown in the top graph are converted into ratios to the identity matrix and are shown in the middle graph. Error bars of the random matrix line are standard errors of 100 samplings. See ‘Evaluating similarity between orthogonal matrix and identity matrix’ in Materials and methods for details. (F) Comparison of and . axis represents and axis represents . The dashed line indicates . (G) Comparison between (left) and (right). Note that while figure (left) is the same as Figure 6C, the right figure shows , where is shown in Figure 6D.
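The comparison with random orthogonal matrices in (C) can be sketched under explicit assumptions. Here the element-wise squared matrix is treated as a set of weights on (row index, column index) pairs, and the Pearson correlation between the two indices is used as the closeness measure, so the identity matrix scores exactly 1. This is one plausible reading of the scatterplot representation in (B); the authoritative definition is the one in Materials and methods. Random orthogonal matrices are generated by Gram-Schmidt orthonormalization of Gaussian vectors, and all function names are hypothetical.

```python
import math
import random


def random_orthogonal(n, rng):
    """Random n x n orthogonal matrix from Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:
            proj = sum(a * b for a, b in zip(u, v))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # discard numerically degenerate draws
            rows.append([a / norm for a in v])
    return rows


def index_correlation(matrix):
    """Pearson correlation of row and column indices weighted by squared elements."""
    n = len(matrix)
    cells = [(i, j, matrix[i][j] ** 2) for i in range(n) for j in range(n)]
    total = sum(w for _, _, w in cells)
    mean_i = sum(w * i for i, _, w in cells) / total
    mean_j = sum(w * j for _, j, w in cells) / total
    cov = sum(w * (i - mean_i) * (j - mean_j) for i, j, w in cells) / total
    var_i = sum(w * (i - mean_i) ** 2 for i, _, w in cells) / total
    var_j = sum(w * (j - mean_j) ** 2 for _, j, w in cells) / total
    return cov / math.sqrt(var_i * var_j)


def chance_probability(theta, n_samples, rng):
    """Fraction of random orthogonal matrices scoring higher than theta."""
    observed = index_correlation(theta)
    n = len(theta)
    hits = sum(
        1 for _ in range(n_samples)
        if index_correlation(random_orthogonal(n, rng)) > observed
    )
    return hits / n_samples
```

When no random matrix exceeds the observed score, the estimate is zero and the true probability can only be bounded from above by roughly one over the number of samplings.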
Orthant correspondences between Raman spectra in linear discriminant analysis (LDA) space and condition-specific proteins in Raman-proteome coefficient proteome space.
Related to Figure 6. Using the main Raman and proteome data of E. coli under the 15 conditions, we examine the orthant correspondence between Raman spectra in the LDA space and condition-specific proteins in the Raman-proteome coefficient proteome space . Here, we focus on two proteins PaaE and AcrR. (A) Expression patterns of PaaE (left) and AcrR (right) across conditions. Error bars are standard errors. PaaE is expressed under the ‘LB’ condition in a condition-specific manner, whereas AcrR is expressed at high levels not only under ‘LB’ condition but also under several other conditions. (B) Positions of PaaE and AcrR in the Raman-proteome coefficient-based proteome space . (C) Verification of orthant correspondence. We verified the orthant correspondence described by Equation 2.76. We multiplied both sides of Equation 2.76 by , and the elements of the vectors of both sides were compared by scatterplots. The horizontal axes are related to the coordinates in the Raman LDA space; the vertical axes are related to the coordinates in the Raman-proteome coefficient proteome space. The dashed lines are . The nearly perfect agreement of the elements confirms the orthant correspondence for the condition-specific protein PaaE (left). Deviations from the diagonal agreement line are found for AcrR (right).
Stoichiometry-based omics structures and their correspondences to Raman-based omics structures for additional datasets.
Related to Figures 4—6. This figure summarizes the results on omics structures characterized by stoichiometry conservation relations and their correspondences to those characterized by Raman-omics relations for additional datasets. (A–E) show the results from the analyses of the Raman and proteome data of three E. coli strains (BW25113, MG1655, and NCM3722) in LB; (F–J) from the analyses of the Raman and proteome data of the three E. coli strains in M9 Glucose; and (K–O) from the analyses of the Raman and transcriptome data of S. pombe under 10 conditions. We used the E. coli proteome data reported in Schmidt et al., 2016, and the S. pombe transcriptome data reported in Kobayashi-Kirschvink et al., 2018, in the analyses. (A), (F), and (K) show distributions of omics components in cosine similarity LE (csLE) space. Stoichiometry conservation centrality of each component is indicated by color. (B), (G), and (L) show expression patterns of representative condition-specific omics components indicated in the previous figures of omics structures in the csLE spaces. Error bars are standard errors in (B) and (G), and maximum-minimum ranges (two replicates) in (L). (C), (H), and (M) show positions of averaged cellular Raman spectra under different conditions in the linear discriminant analysis (LDA) spaces. (D), (I), and (N) show omics structures in the spaces specified by the Raman-omics coefficients with the homeostatic cores and condition-specific stoichiometrically conserved groups (SCGs) indicated by colored points. (E), (J), and (O) show the omics structures in the csLE omics spaces with the homeostatic cores and condition-specific SCGs indicated by colored points. Columns (the eigenvector corresponding to ’s smallest eigenvalue except for zero) and (the eigenvector corresponding to ’s second smallest eigenvalue except for zero) are shown. We used the cosine similarity thresholds of 0.99993 to specify SCGs both for the three E. 
coli strains under LB data (D and E) and for the three E. coli strains under M9 Glucose data (I and J), and 0.9967 for the S. pombe transcriptome data (N and O).
Analyses of the mathematical relation connecting two types of omics spaces.
Related to Figure 6. This figure shows the analyses of mathematical relation that connects coordinates of omics components in the two types of omics spaces (see Figure 6E and Section 2 in Appendix) using additional datasets. (A–F) show the results from the analyses of the Raman and proteome data of three E. coli strains (BW25113, MG1655, and NCM3722) in LB; (G–L) from the analyses of the Raman and proteome data of the three E. coli strains in M9 Glucose; and (M–R) from the analyses of the Raman and transcriptome data of S. pombe under 10 conditions. We used the E. coli proteome data reported in Schmidt et al., 2016, and the S. pombe transcriptome data reported in Kobayashi-Kirschvink et al., 2018 in the analyses. See the caption of Appendix 1—figure 9 for the explanation of each panel. The stoichiometrically conserved groups (SCGs) in (F), (L), and (R) are the same as in Appendix 1—figure 11. The probability of finding a random orthogonal matrix with Pearson correlation coefficient greater than the Pearson correlation coefficient of was 0.022 in (B), 0.013 in (H), and (no occurrence in 10⁵ samplings) in (N).
Stoichiometry-based proteome structures for additional datasets.
Related to Figures 4 and 5. This figure shows proteome structures in the cosine similarity LE (csLE) proteome spaces for additional datasets. (A–C) show the results from the analyses of the proteome data of M. tuberculosis H37Rv under gradual changes in oxygen levels (Schubert et al., 2015); (D–F) show the results from the analyses of the proteome data of M. bovis BCG under gradual changes in oxygen levels (Schubert et al., 2015); and (G–I) show the results from the analyses of the proteome data of S. cerevisiae under 10 conditions in chemostat with the same dilution rate (Lahtvee et al., 2017). (A), (D), and (G) show the proteome structures in the csLE spaces. The thresholds used to specify the stoichiometrically conserved groups (SCGs) were 0.99965 for (A), 0.9997 for (D), and 0.9989 for (G). (B), (E), and (H) show the same proteome structures as in the previous panels, but with stoichiometry conservation centrality of each protein species indicated by the color. (C), (F), and (I) show expression patterns of representative proteins indicated by the red circles in the previous panels. Error bars in (C) are standard errors.
Dependence of low-dimensional correspondence between Raman spectra and proteomes on the number of conditions.
Related to Figure 6. The dependence of the low-dimensional correspondence between Raman spectra and proteomes on the number of analyzed conditions was systematically investigated by evaluating the similarity of the orthogonal matrix to the identity matrix for all subsampled condition sets. Proteomes of E. coli under 15 conditions (Schmidt et al., 2016) and corresponding Raman data we measured in this study were analyzed in this figure. (A) The relationship between the number of conditions and the probability of obtaining a higher level of low-dimensional correspondence than that of experimental data by chance. This probability is calculated as the probability of finding a random orthogonal matrix with Pearson correlation coefficient greater than the Pearson correlation coefficient of by creating 10⁴ random orthogonal matrices. See ‘Evaluating similarity between orthogonal matrix and identity matrix’ in Materials and methods and Appendix 1—figure 9 for details of the evaluation method. Each green square corresponds to one subsample, and each short horizontal black line represents the median of all the combinations of conditions (i.e. green squares) for each subsample size . The blue dashed line indicates the detection limit (i.e. one over the number of generated random orthogonal matrices). The non-subsampled case (i.e. the case with all 15 conditions) in this figure corresponds to Appendix 1—figure 9C. (B) Visual comparison of , and for six representative subsamples indicated in (A). As in Appendix 1—figure 9A, is visualized using , whose element is the absolute value of the corresponding element of , and height of each bar in the figures of indicates the value of each element of . Colors reflect the height of each bar. Spaces created with columns of and are and , respectively. As deviates from the identity matrix from the cases and to the case of , the low-dimensional correspondence between and collapses naturally. 
Since the case is the non-subsampled case, the figure of is the same as Appendix 1—figure 9A, and those of and are the same as Appendix 1—figure 9G. Note that the figure of of the case is also exactly the same as Figure 6C, and that of of the case is equal to Figure 6D up to a factor of . The stoichiometrically conserved groups (SCGs) shown in this figure were defined in the analysis of the proteomes of all the 15 conditions (Figure 4C).
List of culture conditions.
M9 m.m. and a.a. in this table are the abbreviations for M9 minimal media and amino acids, respectively.
| Phase | Overview of composition | Temperature | pH | Name in this paper |
|---|---|---|---|---|
| Exponential | M9 m.m. + acetate | 37°C | 7 | Acetate |
| Exponential | M9 m.m. + fructose | 37°C | 7 | Fructose |
| Exponential | M9 m.m. + fumarate | 37°C | 7 | Fumarate |
| Exponential | M9 m.m. + galactose | 37°C | 7 | Galactose |
| Exponential | M9 m.m. + glucose | 37°C | 7 | Glucose |
| Exponential | M9 m.m. + glucose | 42°C | 7 | Glucose42C |
| Exponential | M9 m.m. + glucose | 37°C | 6 | GlucosepH6 |
| Exponential | M9 m.m. + glycerol | 37°C | 7 | Glycerol |
| Exponential | M9 m.m. + glycerol + a.a. | 37°C | 7 | GlycerolAA |
| Exponential | M9 m.m. + glucose + NaCl | 37°C | 7 | OsmoticStressGlucose |
| Exponential | M9 m.m. + mannose | 37°C | 7 | Mannose |
| Exponential | M9 m.m. + xylose | 37°C | 7 | Xylose |
| Exponential | LB | 37°C | 7 | LB |
| Stationary for 1 day | M9 m.m. + glucose | 37°C | 7 | stationary1day |
| Stationary for 3 days | M9 m.m. + glucose | 37°C | 7 | stationary3days |
Evaluation of the overall estimation error with various distance measures (the case where LDA1 to LDA4 axes were used).
The sum of estimation errors was calculated, and a permutation test (10⁵ permutations) was conducted. In this table, LDA1 to LDA4 axes were used. represents a vector whose elements are all equal to the mean of all elements of . is the -th element of . represents the median of scalars .
| Metric | Definition of | Sum of estimation errors | p-value |
|---|---|---|---|
| Square of norm (PRESS) | | 2.34 × 10³ | 0.00005 |
| norm | | 1.40 × 10³ | 0.00002 |
| Cosine distance | | 1.52 | 0.0014 |
| 1 – Pearson correlation coefficient | | 1.57 | 0.0012 |
| Median of relative error | | 0.0536 | 0.00022 |
Evaluation of the overall estimation error with various distance measures (the case where all the 14 LDA axes were used).
The results obtained by using all the 14 LDA axes are presented. See Appendix 1—table 2 for notations. Note that the system is underdetermined in this case; thus, we adopted the minimum-norm solution from among all least-squares solutions.
| Metric | Definition of | Sum of estimation errors | p-value |
|---|---|---|---|
| Square of norm (PRESS) | | 1.63 × 10³ | 0.0019 |
| norm | | 1.19 × 10³ | 0.00066 |
| Cosine distance | | 1.18 | 0.0879 |
| 1 – Pearson correlation coefficient | | 1.23 | 0.085 |
| Median of relative error | | 0.0418 | 0.00082 |
Gene list of SCG 1 (homeostatic core).
Members of homeostatic core (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from Schmidt et al., 2016.
| Name | Description |
|---|---|
| rpoC | DNA-directed RNA polymerase subunit beta’ |
| rpoB | DNA-directed RNA polymerase subunit beta |
| tufA | Elongation factor Tu 1 |
| infB | Translation initiation factor IF-2 |
| fusA | Elongation factor G |
| glyS | Glycyl-tRNA synthetase beta subunit |
| rpsA | 30S ribosomal protein S1 |
| leuS | Leucyl-tRNA synthetase |
| pheT | Phenylalanyl-tRNA synthetase beta chain |
| aspS | Aspartyl-tRNA synthetase |
| valS | Valyl-tRNA synthetase |
| secA | Protein translocase subunit SecA |
| gyrA | DNA gyrase subunit A |
| pepN | Aminopeptidase N |
| tsf | Elongation factor Ts |
| tig | Trigger factor |
| pta | Phosphate acetyltransferase |
| bamA | Outer membrane protein assembly factor YaeT |
| rne | Ribonuclease E |
| ftsZ | Cell division protein FtsZ |
| gyrB | DNA gyrase subunit B |
| polA | DNA polymerase I |
| rplB | 50S ribosomal protein L2 |
| prlC | Oligopeptidase A |
| rho | Transcription termination factor Rho |
| ftsH | ATP-dependent zinc metalloprotease FtsH |
| nusA | Transcription elongation protein NusA |
| lysS | Lysyl-tRNA synthetase |
| metG | Methionyl-tRNA synthetase |
| glnS | Glutaminyl-tRNA synthetase |
| lpdA | Dihydrolipoyl dehydrogenase |
| serS | Seryl-tRNA synthetase |
| surA | Chaperone SurA |
| rpsB | 30S ribosomal protein S2 |
| gltX | Glutamyl-tRNA synthetase |
| lptD | LPS-assembly protein LptD |
| argS | Arginyl-tRNA synthetase |
| fabB | 3-Oxoacyl-[acyl-carrier-protein] synthase 1 |
| pheS | Phenylalanyl-tRNA synthetase alpha chain |
| clpX | ATP-dependent Clp protease ATP-binding subunit ClpX |
| accC | Biotin carboxylase |
| pyrG | CTP synthase |
| tolC | Outer membrane protein TolC |
| rplE | 50S ribosomal protein L5 |
| accA | Acetyl-coenzyme A carboxylase carboxyl transferase subunit alpha |
| hflK | Modulator of FtsH protease HflK |
| pdxB | Erythronate-4-phosphate dehydrogenase |
| ygfZ | tRNA-modifying protein YgfZ |
| pmbA | Protein PmbA |
| rplA | 50S ribosomal protein L1 |
| hldD | ADP-L-glycero-D-manno-heptose-6-epimerase |
| mreB | Rod shape-determining protein MreB |
| acrA | Acriflavine resistance protein A |
| gor | Glutathione reductase |
| hisS | Histidyl-tRNA synthetase |
| rpsC | 30S ribosomal protein S3 |
| glmM | Phosphoglucosamine mutase |
| lepA | Elongation factor 4 |
| ffh | Signal recognition particle protein |
| secD | Protein-export membrane protein SecD |
| lpoA | Penicillin-binding protein activator LpoA |
| rhlB | ATP-dependent RNA helicase RhlB |
| rpsG | 30S ribosomal protein S7 |
| rpsD | 30S ribosomal protein S4 |
| minD | Septum site-determining protein MinD |
| cyoA | Ubiquinol oxidase subunit 2 |
| mdoG | Glucans biosynthesis protein G |
| rplC | 50S ribosomal protein L3 |
| glmU | Bifunctional protein GlmU |
| rpsF | 30S ribosomal protein S6 |
| rpsE | 30S ribosomal protein S5 |
| hemL | Glutamate-1-semialdehyde 2,1-aminomutase |
| hldE | Bifunctional protein HldE |
| ubiE | Ubiquinone/menaquinone biosynthesis methyltransferase UbiE |
| sspA | Stringent starvation protein A |
| nusG | Transcription antitermination protein NusG |
| prfB | Peptide chain release factor 2 |
| dacA | D-alanyl-D-alanine carboxypeptidase DacA |
| rplF | 50S ribosomal protein L6 |
| fabG | 3-Oxoacyl-[acyl-carrier-protein] reductase |
| ftsY | Cell division protein FtsY |
| dcrB | Protein DcrB |
| mlaC | Probable phospholipid-binding protein MlaC |
| hflC | Modulator of FtsH protease HflC |
| coaB | Coenzyme A biosynthesis bifunctional protein CoaBC |
| ybiT | Uncharacterized ABC transporter ATP-binding protein YbiT |
| oxyR | Hydrogen peroxide-inducible genes activator |
| rpsH | 30S ribosomal protein S8 |
| fkpA | FKBP-type peptidyl-prolyl cis-trans isomerase FkpA |
| frr | Ribosome-recycling factor |
| fabD | Malonyl CoA-acyl carrier protein transacylase |
| hslO | 33 kDa chaperonin |
| ybeZ | PhoH-like protein |
| hemX | Putative uroporphyrinogen-III C-methyltransferase |
| rplY | 50S ribosomal protein L25 |
| rplK | 50S ribosomal protein L11 |
| rpsI | 30S ribosomal protein S9 |
| bamB | Lipoprotein YfgL |
| bamD | UPF0169 lipoprotein YfiO |
| kdgR | Transcriptional regulator KdgR |
| glnD | [Protein-PII] uridylyltransferase |
| yniC | Phosphatase YniC |
| rpsJ | 30S ribosomal protein S10 |
| rplX | 50S ribosomal protein L24 |
| rplD | 50S ribosomal protein L4 |
| rplQ | 50S ribosomal protein L17 |
| ppa | Inorganic pyrophosphatase |
| rpsM | 30S ribosomal protein S13 |
| rplN | 50S ribosomal protein L14 |
| ybaB | UPF0133 protein YbaB |
| yidC | Inner membrane protein OxaA |
| lptB | Lipopolysaccharide export system ATP-binding protein LptB |
| suhB | Inositol-1-monophosphatase |
| yejK | Nucleoid-associated protein YejK |
| ghrA | Glyoxylate/hydroxypyruvate reductase A |
| rsmI | Ribosomal RNA small subunit methyltransferase I |
| hemY | Protein HemY |
| uup | ABC transporter ATP-binding protein Uup |
| hrpA | ATP-dependent RNA helicase HrpA |
| rplJ | 50S ribosomal protein L10 |
| rplM | 50S ribosomal protein L13 |
| fur | Ferric uptake regulation protein |
| rplS | 50S ribosomal protein L19 |
| rcsB | Capsular synthesis regulator component B |
| mrp | Protein Mrp |
| glyQ | Glycyl-tRNA synthetase alpha subunit |
| greA | Transcription elongation factor GreA |
| nrdB | Ribonucleoside-diphosphate reductase 1 subunit beta |
| wbbI | Uncharacterized protein YefG |
| udk | Uridine kinase |
| mnmG | tRNA uridine 5-carboxymethylaminomethyl modification enzyme MnmG |
| rplL | 50S ribosomal protein L7/L12 |
| rplI | 50S ribosomal protein L9 |
| rpoZ | DNA-directed RNA polymerase subunit omega |
| ybbN | Uncharacterized protein YbbN |
| yfiF | Uncharacterized tRNA/rRNA methyltransferase YfiF |
| yedD | Uncharacterized lipoprotein YedD |
| rpmD | 50S ribosomal protein L30 |
| tatB | Sec-independent protein translocase protein TatB |
| yfgM | UPF0070 protein YfgM |
| kdsB | 3-Deoxy-manno-octulosonate cytidylyltransferase |
| rpoN | RNA polymerase sigma-54 factor |
| fdx | 2Fe-2S ferredoxin |
| rplV | 50S ribosomal protein L22 |
| rplO | 50S ribosomal protein L15 |
| fabZ | (3R)-hydroxymyristoyl-[acyl-carrier-protein] dehydratase |
| mipA | MltA-interacting protein |
| ssb | Single-stranded DNA-binding protein |
| yiaF | Uncharacterized protein YiaF |
| secY | Preprotein translocase subunit SecY |
| rbfA | Ribosome-binding factor A |
| potA | Spermidine/putrescine import ATP-binding protein PotA |
| rimM | Ribosome maturation factor RimM |
| trxA | Thioredoxin-1 |
| rpsS | 30S ribosomal protein S19 |
| rpsU | 30S ribosomal protein S21 |
| accB | Biotin carboxyl carrier protein of acetyl-CoA carboxylase |
| engB | Probable GTP-binding protein EngB |
| tatA | Sec-independent protein translocase protein TatA |
| rfbD | dTDP-4-dehydrorhamnose reductase |
| ribF | Riboflavin biosynthesis protein RibF |
| folP | Dihydropteroate synthase |
| lepB | Signal peptidase I |
| sspB | Stringent starvation protein B |
| hupA | DNA-binding protein HU-alpha |
| rpsP | 30S ribosomal protein S16 |
| rplP | 50S ribosomal protein L16 |
| rpsT | 30S ribosomal protein S20 |
| rpsK | 30S ribosomal protein S11 |
| rplU | 50S ribosomal protein L21 |
| rplR | 50S ribosomal protein L18 |
| lpxA | Acyl-[acyl-carrier-protein]–UDP-N-acetylglucosamine O-acyltransferase |
| yceD | Uncharacterized protein YceD |
| queC | 7-Cyano-7-deazaguanine synthase |
| rpmA | 50S ribosomal protein L27 |
| rpmG | 50S ribosomal protein L33 |
| rpmF | 50S ribosomal protein L32 |
| rpsN | 30S ribosomal protein S14 |
| rplT | 50S ribosomal protein L20 |
| nudK | GDP-mannose pyrophosphatase NudK |
| rplW | 50S ribosomal protein L23 |
| trmB | tRNA (guanine-N(7)-)-methyltransferase |
| rluB | Ribosomal large subunit pseudouridine synthase B |
| rpsR | 30S ribosomal protein S18 |
| secG | Protein-export membrane protein SecG |
| rlmE | Ribosomal RNA large subunit methyltransferase E |
| yfaY | CinA-like protein |
| trmA | tRNA (uracil-5-)-methyltransferase |
| rpmH | 50S ribosomal protein L34 |
| yajC | UPF0092 membrane protein YajC |
| yheU | UPF0270 protein YheU |
Gene list of SCG 2.
Members in SCG 2 (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from Schmidt et al., 2016.
| Name | Description |
|---|---|
| fdoG | Formate dehydrogenase-O major subunit |
| dsdA | D-serine dehydratase |
| treC | Trehalose-6-phosphate hydrolase |
| sdaB | L-serine dehydratase 2 |
| nanA | N-acetylneuraminate lyase |
| garD | D-galactarate dehydratase |
| proV | Glycine betaine/L-proline transport ATP-binding protein ProV |
| garR | 2-Hydroxy-3-oxopropionate reductase |
| nanK | N-acetylmannosamine kinase |
| fdoH | Formate dehydrogenase-O iron-sulfur subunit |
| aphA | Class B acid phosphatase |
| nanE | Putative N-acetylmannosamine-6-phosphate 2-epimerase |
| srlB | Glucitol/sorbitol-specific phosphotransferase enzyme IIA component |
| ibpB | Small heat shock protein IbpB |
| hybC | Hydrogenase-2 large chain |
| proW | Glycine betaine/L-proline transport system permease protein ProW |
| srlE | Glucitol/sorbitol-specific phosphotransferase enzyme IIB component |
| fdoI | Formate dehydrogenase, cytochrome b556(fdo) subunit |
| preT | Uncharacterized oxidoreductase YeiT |
| garL | 5-Keto-4-deoxy-D-glucarate aldolase |
| paaB | Phenylacetic acid degradation protein PaaB |
| paaK | Phenylacetate-coenzyme A ligase |
| paaE | Probable phenylacetic acid degradation NADH oxidoreductase PaaE |
| ykgE | Uncharacterized protein YkgE |
| ybjT | Uncharacterized protein YbjT |
| ykgG | Uncharacterized protein YkgG |
Gene list of SCG 3.
Members in SCG 3 (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from Schmidt et al., 2016.
| Name | Description |
|---|---|
| wzc | Tyrosine-protein kinase Wzc |
| amiC | N-acetylmuramoyl-L-alanine amidase AmiC |
Gene list of SCG 4.
Members in SCG 4 (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from Schmidt et al., 2016.
| Name | Description |
|---|---|
| fruB | Multiphosphoryl transfer protein |
| fruK | 1-Phosphofructokinase |
| fruA | PTS system fructose-specific EIIBC component |
| narI | Respiratory nitrate reductase 1 gamma chain |
Gene list of SCG 5.
Members in SCG 5 (Figure 4, cosine similarity threshold: 0.995). The description of each gene is cited from Schmidt et al., 2016.
| Name | Description |
|---|---|
| hdeB | Protein HdeB |
| hdeA | Chaperone-like protein HdeA |
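The SCG tables above are defined with a cosine similarity threshold of 0.995. A cosine similarity of 1 between two abundance profiles means that the two proteins keep a fixed abundance ratio across conditions, which is the sense in which stoichiometry is conserved. The sketch below shows one way such groups could be extracted: compute pairwise cosine similarities, link pairs at or above the threshold, and report connected components. The connected-component rule, the function names, and the toy gene names and values are assumptions for illustration and are not taken from the source.

```python
import math
from itertools import combinations

def cosine_similarity(x, y):
    # Cosine of the angle between two abundance profiles across conditions.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def threshold_groups(profiles, threshold=0.995):
    """Group genes whose profiles are linked by cosine similarity >= threshold.

    profiles maps a gene name to its abundances across conditions.
    Groups are connected components of the thresholded similarity graph
    (an assumed rule); singletons are omitted.
    """
    names = list(profiles)
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if cosine_similarity(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return [sorted(g) for g in groups.values() if len(g) > 1]

# Hypothetical toy data: abundances of four genes in three conditions.
toy_profiles = {
    "geneA": [10.0, 20.0, 30.0],
    "geneB": [1.0, 2.0, 3.05],   # nearly proportional to geneA
    "geneC": [30.0, 10.0, 5.0],
    "geneD": [6.0, 2.0, 1.0],    # exactly proportional to geneC
}
toy_groups = threshold_groups(toy_profiles)
```

In this toy case `geneA` and `geneB` form one group and `geneC` and `geneD` another, because only those pairs maintain a nearly constant ratio across the three conditions.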
Interpretations of ,,, and .
Interpretations of the columns and rows of and are summarized.
| Matrix | Vector | Dimension | Description | |
|---|---|---|---|---|
| Column | List of -th LDA coordinates of mean LDA Raman of all the conditions | |||
| Row | Mean LDA Raman of condition | |||
| Column | List of coefficients of all the proteins for the -th LDA axis | |||
| Row | Coefficients for protein | |||
Mathematical relation between Raman-proteome coefficients and cosine similarity LE (csLE) proteomes.
The matrices in the left-hand side of Equation 2.138 (a proteome structure based on Raman-proteome coefficients) and their counterparts in the right-hand side of Equation 2.138 (a proteome structure obtained with csLE) are listed.
| Raman-omics coef. structure | csLE | Size and type of matrix | Description |
|---|---|---|---|
| matrix | Coefficients normalized by constants | ||
| orthogonal matrix | Orthogonal transformation | ||
| diagonal matrix | Constant terms | ||
| diagonal matrix | Singular values |
Data availability
All data and analysis codes have been deposited in Zenodo and are publicly available at https://doi.org/10.5281/zenodo.17090710.
- Zenodo. Code and data for "Revealing global stoichiometry conservation architecture in cells from Raman spectral patterns". https://doi.org/10.5281/zenodo.17090710
- Mendeley Data. Data for: Linear Regression Links Transcriptomic Data and Cellular Raman Spectra. https://doi.org/10.17632/2fx3h2rx2m.1
- NCBI Gene Expression Omnibus. ID GSE156793. A human cell atlas of fetal gene expression.
- figshare. "Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq" Replogle et al. 2022 processed Perturb-seq datasets. https://doi.org/10.25452/figshare.plus.20029387.v1
References
- Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Molecular Systems Biology 2:2006.0008. https://doi.org/10.1038/msb4100050
- Laplacian eigenmaps and spectral techniques for embedding and clustering. NIPS’01: Proceedings of the 15th International Conference on Neural Information Processing Systems: Natural and Synthetic. pp. 585–591.
- Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15:1373–1396. https://doi.org/10.1162/089976603321780317
- Factoring and weighting approaches to status scores and clique identification. The Journal of Mathematical Sociology 2:113–120. https://doi.org/10.1080/0022250X.1972.9989806
- The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30:107–117. https://doi.org/10.1016/S0169-7552(98)00110-X
- OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Research 34:D363–D368. https://doi.org/10.1093/nar/gkj123
- Eigenproblems in pattern recognition. In: Corrochano EB, editors. Handbook of Geometric Computing. Springer. pp. 129–167. https://doi.org/10.1007/3-540-28247-5_5
- Advanced methods of microscope control using μManager software. Journal of Biological Methods 1:e10. https://doi.org/10.14440/jbm.2014.36
- Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Research 43:D261–D269. https://doi.org/10.1093/nar/gku1223
- FYPO: the fission yeast phenotype ontology. Bioinformatics 29:1671–1678. https://doi.org/10.1093/bioinformatics/btt266
- Ensembl Genomes 2020-enabling non-vertebrate genomic research. Nucleic Acids Research 48:D689–D695. https://doi.org/10.1093/nar/gkz890
- Raman spectroscopic signature of life in a living yeast cell. Journal of Raman Spectroscopy 35:525–526. https://doi.org/10.1002/jrs.1219
- Physical Constraints on Epistasis. Molecular Biology and Evolution 37:2865–2874. https://doi.org/10.1093/molbev/msaa124
- Ribosome biogenesis and the translation process in Escherichia coli. Microbiology and Molecular Biology Reviews 71:477–494. https://doi.org/10.1128/MMBR.00013-07
- Promoters maintain their relative activity levels under different growth conditions. Molecular Systems Biology 9:701. https://doi.org/10.1038/msb.2013.59
- The EcoCyc database: reflecting new knowledge about Escherichia coli K-12. Nucleic Acids Research 45:D543–D550. https://doi.org/10.1093/nar/gkw1003
- Network-based methods for predicting essential genes or proteins: a survey. Briefings in Bioinformatics 21:566–583. https://doi.org/10.1093/bib/bbz017
- Studies on the role of ribonucleic acid in the growth of bacteria. Biochimica et Biophysica Acta 42:99–116. https://doi.org/10.1016/0006-3002(60)90757-5
- On the centrality in a directed graph. Social Science Research 2:371–378. https://doi.org/10.1016/0049-089X(73)90010-0
- Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Statistical Applications in Genetics and Molecular Biology 9:Article39. https://doi.org/10.2202/1544-6115.1585
- Significance tests which may be applied to samples from any populations. Supplement to the Journal of the Royal Statistical Society 4:119–130. https://doi.org/10.2307/2984124
- Regulation of ribosomal protein mRNA translation in bacteria. In: Ilan J, editors. Translational Regulation of Gene Expression 2. Springer. pp. 23–47. https://doi.org/10.1007/978-1-4615-2894-4_2
- Evolutionary dimension reduction in phenotypic space. Physical Review Research 2:013197. https://doi.org/10.1103/PhysRevResearch.2.013197
- Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry 36:1627–1639. https://doi.org/10.1021/ac60214a047
- The quantitative and condition-dependent Escherichia coli proteome. Nature Biotechnology 34:104–110. https://doi.org/10.1038/nbt.3418
- Stability and continuity of centrality measures in weighted graphs. IEEE Transactions on Signal Processing 64:543–555. https://doi.org/10.1109/TSP.2015.2486740
- Inferring time derivatives including cell growth rates using Gaussian processes. Nature Communications 7:13766. https://doi.org/10.1038/ncomms13766
- A tutorial on spectral clustering. Statistics and Computing 17:395–416. https://doi.org/10.1007/s11222-007-9033-z
Article and author information
Funding
Japan Science and Technology Agency (JPMJCR1927)
- Yuichi Wakamoto
Japan Science and Technology Agency (JPMJER1902)
- Yuichi Wakamoto
Japan Society for the Promotion of Science (19J22448)
- Ken-ichiro F Kamei
Japan Society for the Promotion of Science (21K20672)
- Takashi Nozoe
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Matthias Heinemann and Silke Bonsing-Vedelaar for detailed information on the E. coli culture conditions; Doeke R Hekstra, Tetsuya J Kobayashi, Takafumi Miyamoto, John Russell, and Ian Hunt-Isaak for reading the manuscript and providing critical comments; Kunihiko Kaneko, Chikara Furusawa, Yasushi Okada, and members of the Wakamoto Lab and the Universal Biology Institute for discussion and encouragement. This work was supported by JST CREST Grant Number JPMJCR1927 (YW); JST ERATO Grant Number JPMJER1902 (YW); JSPS KAKENHI Grant Numbers 19J22448 (KFK) and 21K20672 (TN).
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.101485. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2025, Kamei et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.