Applying causal discovery to single-cell analyses using CausalCell
Abstract
Correlation between objects is prone to occur coincidentally, and exploring correlation or association in most situations does not answer scientific questions rich in causality. Causal discovery (also called causal inference) infers causal interactions between objects from observational data. Reported causal discovery methods and single-cell datasets make applying causal discovery to single cells a promising direction. However, evaluating and choosing causal discovery methods and developing and performing proper workflow remain challenges. We report the workflow and platform CausalCell (http://www.gaemons.net/causalcell/causalDiscovery/) for performing single-cell causal discovery. The workflow/platform is developed upon benchmarking four kinds of causal discovery methods and is examined by analyzing multiple single-cell RNA-sequencing (scRNA-seq) datasets. Our results suggest that different situations need different methods and the constraint-based PC algorithm with kernel-based conditional independence tests work best in most situations. Related issues are discussed and tips for best practices are given. Inferred causal interactions in single cells provide valuable clues for investigating molecular interactions and gene regulations, identifying critical diagnostic and therapeutic targets, and designing experimental and clinical interventions.
Editor's evaluation
This manuscript presents an important tool for causal inference intended for the analysis of single cell datasets but possibly with broader applications. It compares several algorithms and incorporates a number of them in the platform and offers convincing evidence of its usefulness. With the rapid expansion of large datasets, this tool is beneficial in offering several causal inference analysis options and expediting the interpretation of data.
https://doi.org/10.7554/eLife.81464.sa0Introduction
RNA-sequencing (RNA-seq) has been used to detect gene expression in a lump of cells for years. Many statistical methods have been developed to explore correlation/association between transcripts in RNA-seq data, including the ‘weighted gene co-expression network analysis’ that infers networks of correlated genes (Joehanes, 2018). Since a piece of tissue may contain many different cells and the sample sizes of most RNA-seq data are <100, causal interactions in single cells, which to a great extent are emergent events (Bhalla and Iyengar, 1999), cannot be revealed by these statistical methods. Averaged gene expression in heterogeneous cells also makes causal interactions blurred or undetectable. Except for some annotated interactions in signaling pathways, most causal interactions in specific cells remain unknown (e.g. in developing cells undergoing rapid fate determination and in diseased cells expressing genes aberrantly).
Single-cell RNA-sequencing (scRNA-seq) has been widely used to detect gene expression in single cells, providing large samples for analyzing cell-specific gene expression and regulation. On statistical data analysis, it is argued that ‘statistics alone cannot tell which is the cause and which is the effect’ (Pearl and Mackenzie, 2019). Corresponding to this, causal discovery is a science that distinguishes between causes and effects and infers causal interactions from observational data. Many methods have been designed to infer causal interactions from observational data. For single-cell analysis, any method faces the three challenges – high-dimensional data, data with missing values, and inferring with incomplete model (with missing variables). The constraint-based methods are a class of causal discovery methods (Glymour et al., 2019; Yuan and Shou, 2022), and the PC algorithm is a classic constraint-based method. Testing conditional independence (CI, CI≠unconditional independence [UI]≠uncorrelation) between variables is at the heart of constraint-based methods. Many CI tests have been developed (Verbyla, 2018; Zhang and Peters, 2011), from the fast GaussCItest to the time-consuming kernel-based CI tests. GaussCItest is based upon partial correlations between variables. Kernel-based CI tests estimate the dependence between variables upon their observations without assuming any relationship between variables or distribution of data. These features of kernel-based CI tests enable relationships between any genes and molecules, not just transcription factors (TFs) and their targets, to be inferred. Thus, CI tests critically characterize constraint-based causal discovery and distinguish causal discovery from other network inferences, including ‘regulatory network inference’ (Nguyen et al., 2021; Pratapa et al., 2020), ‘causal network inference’ (Lu et al., 2021), ‘network inference’ (Deshpande et al., 2019), and ‘gene network inference’ (Marbach et al., 2012).
Kernel-based CI tests are highly time-consuming and thus infeasible for transcriptome-wide causal discovery. Recently other causal discovery methods are reported, especially continuous optimization-based methods (Bello et al., 2022a; Zheng et al., 2018). Thus, identifying the best methods and CI tests, developing reasonable workflows, developing measures for quality control, and making trade-offs between time consumption, network size, and network accuracy are important. This Tools and Resources article addresses the above issues by benchmarking multiple causal discovery methods and CI tests, applying causal discovery to multiple scRNA-seq datasets, developing a causal discovery workflow/platform (called CausalCell), and summarizing tips for best practices. Specifically, the workflow combines feature selection and causal discovery. The benchmarking includes 11 causal discovery methods, 10 CI tests, and 9 feature selection algorithms. In addition, measures for estimating and ensuring the reliability of causal discovery are developed. Our results indicate that when relationships between variables are free of missing variables and missing values, continuous optimization-based methods perform well. Otherwise, the PC algorithm with kernel-based CI tests can better tolerate incomplete models and missing values. Inferred relationships between gene products help researchers draw causal hypotheses and design experimental studies. The remaining sections describe the workflow/platform and data analysis examples, discuss specific issues, and present tips for best practices. The details of methods and algorithms, benchmarking results, and data analysis results are described in appendix files.
Materials and methods
Features of different algorithms
Request a detailed protocolCausal discovery cannot be performed transcriptome-wide due to time consumption and the power of methods. A way to choose a subset of genes based on one or several genes of interest is feature selection. A feature selection algorithm combines a search technique and an evaluation measure and works upon one or several response variables (i.e. genes of interest). After obtaining a measure between the response variable(s) and each feature (i.e. variable, gene), a subset of features most related to the response variable(s) are extracted from the whole dataset. Using simulated data and real scRNA-seq data (Appendix 1—table 1), we benchmarked nine feature selection algorithms. The properties and advantages/disadvantages of these algorithms are summarized, with ‘+++’ and ‘+’ indicating the most and least recommended ones (Table 1; Appendix 2—figures 1–7).
Many causal discovery methods have been proposed. Constraint-based causal discovery identifies causal relationships between a set of variables in two steps: skeleton estimation (determining the skeleton of the causal network) and orientation (determining the direction of edges in the causal network). The PC algorithm is a classic and widely recognized algorithm (Glymour et al., 2019). Causal discovery using the PC algorithm is different in that PC can work with different CI tests to perform the first step. We combined the PC algorithm with 10 CI tests to form 10 constraint-based causal discovery algorithms. The properties and advantages/disadvantages of the 10 algorithms are summarized, with ‘+++’ and ‘+’ indicating the most and least recommended ones (Table 2; Appendix 3—figures 1 and 2). In addition to constraint-based methods, there are other kinds of methods, including score-based methods that assign a score function to each directed acyclic graph (DAG) and optimize the score via greedy searches (Chickering, 2003), hybrid methods that combine score-based and constraint-based methods (Solus et al., 2021), and continuous optimization-based methods that convert the traditional combinatorial optimization problem into a continuous program (Bello et al., 2022a; Zheng et al., 2018). When benchmarking the four classes of methods, multiple simulated data, real scRNA-seq data, and signaling pathways were used to evaluate their performance (Appendix 1—table 1).
The results of benchmarking the 11 causal discovery methods and 10 CI tests show that when causal discovery is without the problems of incomplete models (i.e. ones that miss nodes or edges from the data-generating model) and missing values, nonlinear versions of continuous optimization-based methods (especially DAGMA-nonlinear) perform better than others (Bello et al., 2022a). When causal discovery is applied to a set of highly expressed or differentially expressed genes in an scRNA-seq dataset (which has both missing variables and missing values), the PC algorithm with kernel-based CI tests (especially DCC.gamma) performs well. Therefore, the CausalCell platform includes 4 causal discovery methods (PC, GES, GSP, and DAGMA-nonlinear) to suit different data, together with 10 CI tests and 9 feature selection algorithms.
Developing the workflow/platform for causal discovery
Request a detailed protocolThe CausalCell workflow/platform is implemented using the Docker technique and Shiny language and consists of feature selection, causal discovery, and several auxiliary functions (Figure 1). A parallel version of the PC algorithm is used to realize parallel multi-task causal discovery (Le et al., 2019). In addition, the platform also includes the GES, GSP, and DAGMA-nonlinear methods. PC and GSP can work with 10 CI tests. Annotations of functions and parameters and the detailed description of a causal discovery process are available online.
Data input and pre-processing
Request a detailed protocolscRNA-seq and proteomics data generated by different protocols or methods (e.g. 10x Genomics, Smart-seq2, and flow cytometry) can be analyzed. CausalCell accepts log2-transformed data and z-score data and can turn raw data into either of the two forms. A dataset (i.e. the ‘case’) can be analyzed with or without a control dataset (i.e. the ‘control’). Researchers often identify and analyze special genes, such as highly expressed or differentially expressed genes. For each gene in a case and control, three attributes (the averaged expression value, percentage of expressed cells, and variance) are computed. Fold changes of gene expression are also computed (using the FindMarkers function in the Seurat package) if a control is uploaded. Genes can be ordered upon any attribute and filtered upon a combination of five conditions (i.e. expression value, percentage of expressed cells, variance, fold change, and being a TF or not). Since performing feature selection transcriptome-wide is unreliable due to too many genes, filtering genes before feature selection is necessary, and different filtering conditions generate different candidates for feature selection.
Batch effects may influence identifying differentially expressed genes. Since removing batch effects should be performed with raw data before integrating batches and there are varied batch effect removal methods (Tran et al., 2020), it should be performed by the user if necessary.
Feature selection
Request a detailed protocolFeature selection selects a set of genes (i.e. features) from the candidate genes upon one or multiple genes of interest (i.e. response variables). As above-mentioned, candidate genes are extracted from the whole dataset upon specific conditions because performing feature selection transcriptome-wide is unreliable. Based on the accuracy, time consumption, and scalability of the nine feature selection algorithms (Table 1), BAHSIC is the most recommended algorithm. The joint use of two kinds of algorithms (e.g. Random Forest+BAHSIC) is also recommended to ensure reliability. Feature genes are usually 50–70, but the number also depends on the causal discovery algorithms. Genes can be manually added to or removed from the result of feature selection (i.e., the feature gene list) to address a biological question specifically. The input for causal discovery can also be manually selected without performing feature selection; for example, the user can examine a specific Gene ontology (GO) term.
Causal discovery
Request a detailed protocolThe PC and GSP algorithms can work with the 10 CI tests to provide varied options for causal discovery. In the inferred causal networks, direction of edges is determined by the meek rules (Meek, 1997), and each edge has a sign indicating activation or repression and a thickness indicating CI test’s statistical significance. The sign of an edge from A to B is determined by computing a Pearson correlation coefficient between A and B, which is ‘repression’ if the coefficient is negative or ‘activation’ if the coefficient is positive. In most situations, ‘A activating B’ and ‘A repressing B’ correspond to up-regulated A in the case dataset, with up- and down-regulated B in the case dataset compared with in the control dataset.
There are two ways to construct a consensus network that is statistically more reliable. One way is to run multiple algorithms (i.e. multiple CI tests) and take the intersection of some or all inferred networks as the consensus network (Figure 2). The other is to run an algorithm multiple times and take the intersection of all inferred networks as the consensus network (Figure 3).
If a scRNA-seq dataset is large, a subset of cells should be sampled to avoid excessive time consumption. We suggest that 300 and 600 cells are suitable for reliable inference if the input is Smart-seq2 and 10x Genomics data, respectively, the input contains about 50 genes, and genes are expressed in >50% cells. Here, reliable inference means that key interactions (those with high CI test significance) are inferred (Appendices 3, 4). More cells are needed if the input genes are expressed in fewer cells and if the input contains >50 genes. Larger sample sizes (more cells) may make more interactions be inferred, but the key interactions are stable (Appendix 3—figure 3). As HSIC.perm and DCC.perm employ permutation to perform CI test, the networks inferred each time may be somewhat different. Our data analyses suggest that interactions inferred by running distance covariance criteria (DCC) algorithms multiple times are quite stable (Figure 3).
Four parameters influence causal discovery. First, ‘set the alpha level’ determines the statistical significance cut-off of the CI test, and large and small values make more and fewer interactions be inferred. Second, ‘select the number of cells’ controls sample size, and selecting more cells makes the inference more reliable but also more time-consuming. Third, ‘select how a subset of cells is sampled’ determines how a subset of cells is sampled. If a subset is sampled randomly, the inferred network is not exactly reproducible (but by running multiple times, the inferred edges may show high consistency, see Figure 3). Fourth, ‘set the size of conditional set’ controls the size of conditional set when performing CI tests; it influences both network topology and time consumption and should be set with care. Since some CI tests are time-consuming and running causal discovery with multiple algorithms are especially time-consuming, providing an email address is necessary to make the result sent to the user automatically.
The performance of different PC+CI tests was intensively evaluated. First, we evaluated the accuracy, time consumption, sample requirement, and stability of PC+nine CI tests using simulated data and the non-small cell lung cancer (NSCLC) cell line H2228 and the normal lung alveolar cells (as the case and control) (Tian et al., 2019; Travaglini et al., 2020). Comparing inferred networks with the consensus network suggests that the two DCC CI tests are most accurate and most time-consuming, suitable for small-scale network inference. RCIT and RCoT, two approximated versions of the KCIT, are moderately accurate and relatively fast, suitable for large-scale network inference. GaussCItest is the fastest and suitable for data with Gaussian distribution (Figure 2; Appendix 3—figure 2). Second, we compared the performance of PC+DCC.gamma, GSP+DCC.gamma, and GES. The former two have comparable performance, and both are more accurate and time-consuming than GES (Appendix 3—figures 4 and 5).
Verification of causal discovery
Request a detailed protocolWe used the five NSCLC cell lines (A549, H1975, H2228, H838, and HCC827), the normal alveolar cells, and genes in specific pathways to validate network inference by PC+DCC.gamma (Tian et al., 2019; Travaglini et al., 2020). First, upon the combined conditions of (a) gene expression value >0.1, (b) gene expression in >50% cells, and (c) fold change >0.3, we identified highly and differentially expressed genes in each cell line against the alveolar cells. Second, we applied gene set enrichment analysis to differentially expressed genes in each cell line using the g:Profiler and GSEA programs. g:Profiler identified ‘Metabolic reprogramming in colon cancer’ (WP4290), ‘Pyrimidine metabolism’ (WP4022), and ‘Nucleotide metabolism’ (hsa01232) as enriched pathways in all cancer cell lines, and GSEA identified ‘Non-small cell lung cancer’ (hsa05223) as an enriched pathway in cancer cell lines (‘WP’ and ‘hsa’ indicate WikiPathways and KEGG pathways). Many studies reveal that glucose metabolism is reprogrammed and nucleotide synthesis is increased in cancer cells. Key features of reprogrammed glucose metabolism in cancer cells include increased glucose intake, increased lactate generation, and using the glycolysis/TCA cycle intermediates to synthesize nucleotides. The networks inferred by PC+DCC.gamma capture these features despite of the absence of metabolites in these datasets. The networks of WP4022 also capture the key features of pyrimidine metabolism. In the networks of hsa05223, over 50% inferred interactions agree with pathway annotations. These results support network inference (Appendix 4).
Evaluating and ensuring the reliability
Request a detailed protocolSingle-cell data vary in quality and sample sizes; thus, it is important to effectively evaluate and ensure the reliability of network inference. Inspired by using RNA spike-in to measure RNA-seq quality (Jiang et al., 2011), we developed a method to evaluate and ensure the reliability of causal discovery. This method includes three steps: extracting the data of several well-known genes and their interactions from certain dataset as the ‘spike-in’ data, integrating the spike-in data into the case dataset, and applying causal discovery to the integrated dataset (the latter two steps are performed automatically when a spike-in dataset is chosen or uploaded). The user can choose a spike-in dataset in the platform or design and upload a spike-in dataset. In the inferred network, a clear separation of genes and their interactions in the spike-in dataset from genes and interactions in the case dataset is an indicator of reliable inference (Appendix 4—figure 1). Some public databases (e.g. the STRING database, https://string-db.org/) can also be used to evaluate inferred interactions (Appendix 4—figures 2 and 3).
Results
The analysis of lung cancer cell lines and alveolar epithelial cells
Down-regulated MHC-II genes help cancer cells avoid being recognized by immune cells (Rooney et al., 2015); thus, identifying genes and interactions involved in MHC-II gene down-regulation is important. To assess if causal discovery helps identify the related interactions, we examined the five NSCLC cancer cell lines (A549, H1975, H2228, H838, and HCC827) and the normal alveolar epithelial cells (Tian et al., 2019; Travaglini et al., 2020). For each of the six datasets, we took the five MHC-II genes (HLA-DPA1, HLA-DPB1, HLA-DRA, HLA-DRB1, HLA-DRB5) as the response variables (genes of interest, hereafter also called target genes) and selected 50 feature genes (using BAHSIC, unless otherwise stated) from all genes expressed in >50% cells. Then, we applied the nine causal discovery algorithms to the 50 genes in 300 cells sampled from each of the datasets. The two DCC algorithms performed the best when processing the H2228 cells and lung alveolar epithelial cells (Appendix 5—figures 1 and 2).
The inferred networks also show that down-regulated genes weakly but up-regulated genes strongly regulate downstream targets and that activation and repression lead to up- and down-regulation of target genes. These features are biologically reasonable. Many inferred interactions, including those between MHC-II genes and CD74, between CXCL genes, and between MHC-I genes and B2M, are supported by the STRING database (http://string-db.org) and experimental findings (Figure 4; Appendix 4—figure 2; Castro et al., 2019; Karakikes et al., 2012; Szklarczyk et al., 2021). An interesting finding is the PRDX1→TALDO1→HSP90AA1→NQO1→PSMC4 cascade in H2228 cells. Interactions between PRDX1/TALDO1/HSP90AA1 and NQO1 were reported (Mathew et al., 2013; Yin et al., 2021), but the interaction between NQO1 and PSMC4 was not. Previous findings on NQO1 include that it determines cellular sensitivity to the antitumor agent napabucasin in many cancer cell lines (Guo et al., 2020), is a potential poor prognostic biomarker, and is a promising therapeutic target for patients with lung cancers (Cheng et al., 2018; Siegel et al., 2012), and that mutations in NQO1 are associated with susceptibility to various forms of cancer. Previous findings on PSMC4 include that high levels of PSMC4 (and other PSMC) transcripts were positively correlated with poor breast cancer survival (Kao et al., 2021). Thus, the inferred NQO1→PSMC4 probably somewhat explains the mechanism behind these experimental findings.
The analysis of macrophages isolated from glioblastoma
Macrophages critically influence glioma formation, maintenance, and progression (Gutmann, 2020), and CD74 is the master regulator of macrophage functions in glioblastoma (Alban et al., 2020; Quail and Joyce, 2017; Zeiner et al., 2015). To examine the function of CD74 in macrophages in gliomas, we used CD74 as the target gene and selected 50 genes from genes expressed in >50% of macrophages isolated from glioblastoma patients (Neftel et al., 2019). In the networks of DCC algorithms (Appendix 5—figure 3), CD74 regulates MHC-II genes, agreeing with the finding that CD74 is an MHC-II chaperone and plays a role in the intracellular sorting of MHC class II molecules. The network includes interactions between C1QA/B/C, agreeing that they form the complement C1q complex. The identified TYROBP→TREM2→A2M→APOE→APOC1 cascade is supported by the reports that TREM2 is expressed in tumor macrophages in over 200 human cancer cases (Molgora et al., 2020) and that there are interactions between TREM2/A2M, TREM2/APOE, A2M/APOE, and APOE/APOC1 (Krasemann et al., 2017).
The analysis of tumor-infiltrating exhausted CD8 T cells
Tumor-infiltrating exhausted CD8 T cells are highly heterogeneous yet share common differentially expressed genes (McLane et al., 2019; Zhang et al., 2018), suggesting that CD8 T cells undergo different processes to reach exhaustion. We analyzed three exhausted CD8 T datasets isolated from human liver, colorectal, and lung cancers (Appendix 5—figure 4; Guo et al., 2018; Zhang et al., 2018; Zheng et al., 2017). A key feature of CD8 T cell exhaustion identified in mice is PDCD1 up-regulation by TOX (Khan et al., 2019; Scott et al., 2019; Seo et al., 2019). Using TOX and PDCD1 as the target gene, we selected 50 genes expressed in >50% exhausted CD8 T cells and 50 genes expressed in >50% non-exhausted CD8 T cells, respectively. Transcriptional regulation of PDCD1 by TOX was observed in LCMV-infected mice without mentioning any role of CXCL13 (Khan et al., 2019). Here, indirect TOX→PDCD1 (via genes such as CXCL13) was inferred in exhausted CD8 cells, and direct TOX→PDCD1 was inferred in non-exhausted CD8 T cells (although the expression of TOX and PDCD1 is low in these cells) (Appendix 5—figure 4). Recently, CXCL13 was found to play a critical role in T cells for effective responses to anti-PD-L1 therapies (Zhang et al., 2021b). The causal discovery results help reveal differences in CD8 T cell exhaustion between humans and mice and under different pathological conditions. The PDCD1→TOX inferred in exhausted and non-exhausted CD8 T cells may indicate some feedback between TOX and PDCD1, as on the proteome level, a study reported that the binding of PD1 to TOX in the cytoplasm facilitates the endocytic recycling of PD1 (Wang et al., 2019).
Identifying genes and inferring interactions that signify CD4 T cell aging
How immune cells age and whether some senescence signatures reflect the aging of all cell types draw wide attention (Gorgoulis et al., 2019). We analyzed gene expression in naive, TEM, rTreg, naive_Isg15, cytotoxic, and exhausted CD4 T cells from young (2–3 months, n=4) and old (22–24 months, n=4) mice (Appendix 5—figures 5; Elyahu et al., 2019). For each cell type, we compared the combined data from all four young mice with the data from each old mouse to identify differentially expressed genes. If genes were expressed in >25% cells and consistently up/down-regulated (|fold change|>0) in most of the 24 comparisons, we assumed them as aging-related (Appendix 5—table 1). Some of these identified genes play important roles in the aging of T cells or other cells, such as the mitochondrial genes encoding cytochrome c oxidases and the gene Sub1 in the mTOR pathway (Bektas et al., 2019; Gorgoulis et al., 2019; Goronzy and Weyand, 2019; Walters and Cox, 2021). We directly used these genes, plus one CD4-specific biomarker (Cd28) and two reported aging biomarkers (Cdkn1b, Cdkn2d) (Gorgoulis et al., 2019; Larbi and Fulop, 2014), as feature genes to infer their interactions in different CD4 T cells in young and old mice. The inferred causal networks unveil multiple findings (Appendix 5—figure 5). First, B2m→H2-Q7 (a mouse MHC class I gene), Gm9843→Rps27rt (Gm9846), and the interactions between the five mitochondrial genes (MT-ATP6, MT-CO1/2/3, and MT-Nd1) were inferred in nearly all CD4 T cells. Second, many interactions are supported by the STRING database (Appendix 4—figure 3). Third, some interactions agree with experimental findings, including Sub1-|Lamtor2 (Chen et al., 2021) and the regulations of these mitochondrial genes by Lamtor2 (Morita et al., 2017). Fourth, Gm9843→Rps27rt→Junb were inferred in multiple CD4 T cells (both Gm9843 and Rps27rt are mouse-specific). Since JUNB belongs to the AP-1 family TFs that are increased in all immune cells during human aging (Zheng et al., 2020), Gm9843→Rps27rt→Junb could highlight a counterpart regulation of JUNB in human immune cells.
Discussion
Single-cell causal discovery
Various methods have been proposed to infer interactions between variables from observational data. As surveyed recently (Nguyen et al., 2021; Pratapa et al., 2020), many methods assume linear relationships between variables and the Gaussian distribution of data. These assumptions enable these methods to run fast, handle many genes and even perform transcriptome-wide prediction. However, our algorithm benchmarking results suggest that networks inferred by fast methods with these assumptions should be concerned.
Causal discovery infers causal interactions directly upon observations of variables without assuming relationships between variables and the distribution of data. Because genes and molecules have varied relationships in different cells, causal discovery better satisfies inferring their interactions than other methods. Causal discovery methods have reviewed recently (Glymour et al., 2019; Yuan and Shou, 2022), but workflows and platforms integrating multiple methods for analyzing scRNA-seq data remain rare.
Our integration and benchmarking of multiple methods (note that these methods are not for inferring causal relationships from temporal data) and analysis of multiple datasets generate several conclusions. First, although kernel-based CI tests are time-consuming (Shah and Peters, 2020), applying them to a set of genes is feasible. A set of genes can be generated by feature selection, by gene set enrichment analysis, or by manual selection. Second, the cost of time consumption pays off in network accuracy, as the most time-consuming CI tests generate the most reliable results. Thus, trade-offs between time consumption, network size, and network accuracy should be made. Third, causal discovery can infer signaling networks or gene regulatory networks, depending on the input. If genes encoding TFs and their targets are the input, gene regulatory networks are inferred. Fourth, dropouts and noises in scRNA-seq data concern researchers and trouble correlation analysis (Hou et al., 2020; Mohan and Pearl, 2018; Tu et al., 2019), but can be well tolerated by PC+kernel-based CI tests if samples are sufficiently large. Finally, using ‘spike-in’ data can effectively evaluate the reliability of causal discovery.
Challenges of data analysis
Single-cell causal discovery also faces several challenges. First, causal discovery assumes there are no unmeasured common causes (the causal sufficiency assumption), but in real data latent and unobserved variables are common and hard to identify. Specifically, inferring interactions between highly expressed or differentially expressed genes is a case of causal discovery with incomplete models (i.e. models with missing variables from the data-generating model). In this situation, what are inferred are indirect relationships instead of direct interactions between gene products. Second, constraint-based methods cannot differentiate networks belonging to a Markov equivalent class (the causal Markov assumption). This can be solved partly by combined use of PC and DAGMA-nonlinear (which can better determine the direction of edges). Third, the following examples indicate that the lack of relevant information makes judging inferred interactions and relationships difficult. (a) TOX is reported to activate PDCD1 in exhausted CD8 T cells in mice (Khan et al., 2019), but whether CXCL13 is involved in (or required for) the TOX-PDCD1 interaction in exhausted CD8 T cells in humans is unclear, until recently CXCL13 is reported to play critical roles in T cells for effective responses to anti-PD-L1 therapies (Zhang et al., 2021b). (b) The differences in inferred networks in exhausted CD8 T cells from different cancers are puzzling, until a recent study reports that exhausted CD8 T cells show high heterogeneity and exhaustion can follow different paths (Zheng et al., 2021). (c) It is difficult to explain multiple genes encoding ribosomal proteins in the inferred networks in CD4 cells from old mice, until a recent study reports that aging impairs ribosomes’ ability to synthesize proteins efficiently (Stein et al., 2022).
Limitations of the study
The time consumption of kernel-based CI tests disallows inferring large networks, and how this challenge can be solved remains unsolved. C codes may be developed to replace the most time-consuming parts of the R functions, but this has not been done.
Tips for best practices
First, exploring different biological modules or processes needs careful selection of genes (Figure 5). When it is unclear what genes are most relevant to one or several target genes, it is advisable to run multiple rounds of feature selection using different combination of target genes as response variable(s). Second, when feature genes are identified by gene set enrichment analysis or upon highly expressed genes, PC+kernel-based CI tests perform better than continuous optimization-based methods, and the inferred networks consist more likely of indirect causal relationships instead of direct causal interactions. Third, BAHSIC and SHS are the best feature selection algorithms. Since selecting feature genes from too many candidates is unreliable, filtering genes upon specific conditions (e.g. expression values, expressed cells, fold changes) is necessary. Fourth, DCC.gamma and DCC.perm are the best CI tests working with PC. When building consensus networks, it is advisable to use the results of just DCC CI tests. Fifth, trade-offs between scale, reliability, and accuracy are inevitable. When examining many genes, RCIT/RCoT may be proper, and when examining large datasets, sub-sampling is necessary. For Smart-seq2 and 10x Genomics datasets, 300 and 600 cells are recommended for analyzing 50–60 genes expressed in >50% of cells. More cells are needed if more genes are selected and/or selected genes are expressed in fewer cells (e.g. 25%). Sixth, when it is unclear if a sub-sampled dataset is large enough, repeat causal discovery several times using different sizes of sub-samples. If the inferred networks are similar, the sub-samples should be sufficient. Seventh, using “spike-in” datasets helps measure and ensure reliability. Eighth, carefully inspect the potential influence of cell heterogeneity on causal discovery, and caution is needed when interpreting the results of heterogeneous cells.
Appendix 1
Overview of data and algorithms
1. Algorithms and datasets
We combined feature selection and causal discovery to infer causal interactions among a set of gene products in single cells. We used synthetic data, semi-synthetic data, real scRNA-seq data, and flow cytometry data to benchmark nine feature selection algorithms and nine causal discovery algorithms (Appendix 1—tables 1 and 2; Appendix 1—figure 1).
2. Synthetic data for feature selection
Fully synthetic dataset
N variables (indicating candidate genes) without a specific pattern were generated randomly from a (0, 2) uniform distribution, from which n variables (indicating feature genes) were selected randomly to synthesize a response variable. Each feature influences the response variable depending randomly on one of the five functions: y=x2, y=sin(x), y=cos(x), y=tanh(x), and y=ex.
By combining different numbers of feature genes and candidate genes (4–50, 8–100, 8–200, 20–500, 20–1000, and 50–2000) and generating samples of different sizes (100, 200, 500, 1000, and 2000), we generated 30 schemes. For each algorithm, we ran each scheme 10 times, and each time the true positive rate (TPR) was calculated by:
A TPR of 1.0 means that the feature selection algorithm completely correctly selects the features; a small TPR indicates poor performance.
Semi-synthetic dataset
First, we extracted genes from a benchmark scRNA-seq dataset (https://support.10xgenomics.com/single-cell-gene-expression/datasets/4.0.0/Parent_NGSC3_DI_PBMC), sorted these genes based on the cells in which they were expressed (expression level >0), and obtained the top 5000 genes expressed in most cells. Then, we obtained the top 5000 cells that contained the most expressed genes. These genes and cells formed a 5000*5000 matrix. Different candidate gene sets were sampled from this matrix, and different feature gene sets were selected randomly from each candidate gene set. Next, the response variables (target genes) were synthesized using feature genes.
3. Synthetic data for causal discovery
We used the randomDAG function in the pcalg package (https://cran.r-project.org/web/packages/pcalg/index.html) to generate DAGs with random topologies. Values of nodes (i.e. genes) in these DAGs were randomly generated using the following 10 functions that determined relationships between nodes:
With the variable number ranging from 20 to 80 (step = 20), and the sample size ranging from 500 to 1000 (step = 500), we generated eight datasets with known networks.
4. Real single-cell data for feature selection and causal discovery
Single-cell datasets in Appendix 1—table 1 were used for benchmarking.
Appendix 2
Feature selection algorithms and benchmarking
1. Ensemble learning-based algorithms
Random forests
We used the RandomForestRegressor function (with default parameters) in the sklearn package (https://scikit-learn.org/stable/) to build random forest models (Breiman, 2001). Each model contained 200 decision trees. After regression based on the response variable(s), genes were sorted based on Gini importance, and the top genes were selected as feature genes.
Extremely randomized trees
We used the ExtraTreesRegressor function (with default parameters) in the sklearn package (https://scikit-learn.org/stable/) to generate extremely randomized trees (Geurts et al., 2006). Each tree model contained 200 decision trees. After regression based on the response variable(s), genes were sorted based on Gini importance, and the top genes were selected as feature genes.
XGBoost
We used the XGBRegressor function (with default parameters) in the Scikit-Learn API (https://xgboost.readthedocs.io/en/latest/python/python_api.html) to build the XGBoost models (Chen and Guestrin, 2016). Each XGBoost model contained 200 decision trees. After regression based on the response variable(s), genes were sorted based on Gini importance, and the top genes were selected as feature genes.
2. Regularization-based algorithms
Lasso
We used the Lasso function (with default parameters) in the sklearn package (https://scikit-learn.org/stable/) to produce the regression models. In the Lasso (least absolute shrinkage and selection operator) regression equation (Tibshirani, 1997):
N is the number of samples, is the number of features, is the coefficient of the th feature, and λ (by default λ=0.5) is a penalty coefficient controlling the shrinkage. Feature genes were selected based on the value of , which indicates the importance of the th feature for the response variable(s).
Ridge regression
We used the Ridge function (with default parameters) in the sklearn package (https://scikit-learn.org/stable/) to build Ridge repression models. The equation of Ridge regression is similar to that of Lasso (Hoerl and Kennard, 2000):
but the L2 penalty term is . Feature genes were selected based on the value of , which indicates the importance of the th feature for the response variable(s).
Elastic net
We used the ElasticNet function (with default parameters) in the sklearn package (https://scikit-learn.org/stable/) to build elastic net models. Elastic net linearly combines the L1 and L2 penalties of the Lasso and Ridge methods using the following equation (by default λ=1 and α=0.5) (Zou and Hastie, 2005). In the equation:
feature genes are selected upon the value of , which indicates the importance of the th feature for the response variable(s).
3. HSIC-based algorithms
BAHSIC
Hilbert-Schmidt independence criterion (HSIC) is a measure of dependency between two variables (Gretton et al., 2005). After obtaining a measure between a response variable and a feature, a backward elimination process is used to extract a subset of features that are most relevant to the response variable (Song et al., 2007). We used the BAHSIC program (https://www.cc.gatech.edu/~lsong/code.html), together with the nonlinear radial basis function kernel, to evaluate the dependency between feature genes and response variable(s), and set the parameter flg3 = to accelerate computation.
SHS
Sparse HSIC (SHS), which combines HSIC with fast sparse decomposition of matrices, is an HSIC-based feature selection algorithm without the backward elimination process to identify a sparse projection of all features (Gangeh et al., 2017). We translated the SHS program encoded in MATLAB (https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/shs.zip) into a Python program and used eigenvalue decomposition as the matrix halving procedure. The parameters γ=1.1 (a penalty parameter that controls the sparsity of the solution) and .
Block HSIC Lasso
HSIC Lasso is a variant of the minimum redundancy maximum relevance feature selection algorithm and is suitable for high-dimensional small sample data. We used the pyHSICLasso program (https://github.com/riken-aip/pyHSICLasso; Yamada et al., 2014), which is an approximation of HSIC Lasso but reduces memory usage dramatically while retaining the properties of HSIC Lasso (Climente-González et al., 2019). We used the function get_index_score() to compute feature importance and the function get_features() to return top feature genes.
4. Benchmarking results
We evaluated the time consumption, accuracy, and scalability of nine feature selection algorithms in three categories (Appendix 1—tables 1 and 2). First, tested using synthesized data, all algorithms showed moderate time consumption, which increased insignificantly when the sample size increased (Appendix 2—figure 1). Second, using synthesized data, multiple algorithms selected all features correctly if schemes were simple (e.g. selecting 4 features from 50 candidates). If features and/or candidates increased (e.g. selecting 50 features from 2000 candidates), BAHSIC showed the best performance, with accuracy decreasing more slowly than others (Appendix 2—figure 2). Third, using well-known microglial biomarkers in humans and mice (Butovsky et al., 2014; Patir et al., 2019) and using scRNA-seq data of microglia from the human and mouse brain (Geirsdottir et al., 2019), we further evaluated feature selection algorithms’ accuracy. We merged feature genes generated by the nine algorithms into a superset (Appendix 1—figure 1B), identified a subset generated by the majority of algorithms, and examined how many feature genes of each algorithm overlap with the subset. When selecting 50 genes from 3000 candidates upon a target gene (e.g. Hexb), BAHSIC and SHS were the best and second-best algorithms, and they also selected most microglia biomarkers (Appendix 2—figures 3 and 4). Fourth, we used real scRNA-seq data in applications to evaluate algorithms’ accuracy and found that BAHSIC also performs well. Finally, to evaluate algorithms’ scalability, we let algorithms select feature genes from different numbers of candidate genes. When the number of candidate genes is large (>10,000), the accuracy of feature selection is somewhat decreased.
BAHSIC’s performance was examined further using macrophages isolated from human glioblastoma by checking whether the selected feature genes accurately characterize macrophages (Neftel et al., 2019). We used six macrophage biomarkers (CD14, AIF1, FCER1G, FCGR3A, TYROBP, and CSF1) exclusively expressed in these macrophages as the target genes (response variables) and used BAHSIC to select 50 feature genes from 3000 candidate genes expressed in >50% macrophages (Appendix 2—figure 5). Nearly all feature genes were expressed exclusively in the macrophages (Appendix 2—figure 6). Experimental findings support many feature genes. C1QA/B/C and C3 are supported by the finding that C1Q is produced and the complement cascade is up-regulated in cancer-infiltrated macrophages (Yang et al., 2021). CD74 and MHC-II genes are supported by the finding that CD74 is correlated with malignancies and the immune microenvironment in gliomas (Xu et al., 2021). TREM2 and APOE are supported by the finding that highly expressed TREM2 and APOC2 in macrophages contribute to immune checkpoint therapy resistance (Xiong et al., 2020). MS4A4A and MS4A6A are supported by the finding that APOE and TREM2 are up-regulated by MS4A (Deming et al., 2019). In contrast, feature genes selected by RidgeRegression upon the same six biomarkers were expressed in diverse cells (Appendix 2—figure 7). These confirm that BAHSIC can quite reliably select feature genes upon target genes.
Appendix 3
Causal discovery algorithms and benchmarking
1. Causal discovery methods
CausalCell integrates four causal discovery methods – PC, GES, GSP, and DAGMA-nonliear – which are representative constraint-based, score-based, hybrid, and continuous optimization-based methods. Constraint-based methods identify causal interactions in a set of variables in two steps: skeleton estimation and orientation. Score-based methods assign a score function (e.g. the Bayesian information criterion) to each potential causal network and optimize the score via greedy approaches. Hybrid methods combine score-based methods and CI tests. Continuous optimization-based methods recast the combinatoric graph search problem as a continuous optimization problem. The PC and GSP algorithms can be combined with different CI tests.
To benchmark the performance of different CI tests, we combined 10 CI tests with the parallel version of the PC algorithm (i.e. the pc function in the R package pcalg, with the default setting skel.method="stable") (Le et al., 2019). The results show that kernel-based CI tests (especially the two DCC CI tests) outperform other CI tests (Appendix 3—table 1; Appendix 3—figures 1 and 2). To evaluate the score-based and hybrid methods GES (https://cran.r-project.org/web/packages/pcalg/index.html) and GSP (https://github.com/uhlerlab/causaldag; Chickering, 2003; Solus et al., 2021; Squires, 2018), we compared PC+DCC.gamma, GES, and GSP+DCC.gamma. The results show that PC+DCC.gamma and GSP+DCC.gamma have comparable network accuracy and time consumption, and both are more accurate but more time-consuming than GES (Appendix 3—figures 3–6).
Further, we benchmarked six continuous optimization-based methods (NOTEARS-linear, NOTEARS-nonlinear, DAGMA-linear, DAGMA-nonlinear, GOLEM, and DAG_GNN) (Bello et al., 2022a; Zheng et al., 2018), and two linear non-Gaussian acyclic model methods (ICLiGNAM and DirectLiGNAM). We compared the performance of these methods with PC+DCC.gamma and PC+GaussCItest. Continuous optimization-based methods, especially DAGMA-nonlinear (https://github.com/kevinsbello/dagma; Bello et al., 2022b), perform well when relationships between variables are free of missing variables and missing values, otherwise they perform poorly and underperform PC+DCC.gamma. All benchmarking used both simulated data and multiple scRNA-seq datasets, especially the five lung cancer cell lines (A549, H1975, H2228, H838, HCC827) from the CellBench benchmarking dataset (Tian et al., 2019). Genes differentially expressed in these cell lines were determined upon gene expression in the lung alveolar cells (Travaglini et al., 2020).
2. Partial correlation-based CI test
GaussCItest
Gauss CI test examines CI using partial correlation, assuming that all variables are multivariate Gaussian. The partial correlation coefficient is zero if and only if X is conditionally independent of Y given Z (Kunihiro et al., 2004). is , is , and a hypothesis test (p<0.05) decides whether two variables are conditionally independent given Z. We used the gaussCItest function in the R package pcalg with default parameters (https://cran.r-project.org/web/packages/pcalg/index.html).
3. HSIC-based CI test
HSIC is a measure of dependency between two variables; if X and Y are unconditionally independent. Performing two extra transformations can determine if X and Y are conditionally independent given the conditioning set Z: first, performing nonlinear regressions for X and Z and for Y and Z, respectively, to generate the residuals and based on Z; then, calculating that indicates whether X and Y are conditionally independent given the conditioning set Z () (Verbyla et al., 2017). We used the gam() function in the R package mgcv to build the nonlinear regression model and used the three HSIC-based functions (with default parameters unless otherwise specified) in the R package kpcalg (https://cran.r-project.org/web/packages/kpcalg/index.html) to perform CI test.
hsic.perm
In practice, may be slightly larger than 0.0 when X and Y are independent, making it hard to judge whether X and Y are independent. hsic.perm uses a permutation test to solve this problem by assuming that permuting Y removes any dependency between X and Y. We used the hsic.perm function to permute Y 100 times to calculate , then we compared them with . The p-value was the fraction of times was smaller than the .
hsic.gamma
We used the hsic.gamma function to fit a gamma distribution: Gamma(α, θ) of the HSIC under the null hypothesis. The shape parameter and the scale parameter were calculated using the equation:
A p-value was obtained as an upper-tail quantile of HSIC (X, Y).
hsic.clust
First, samples were clustered using the R function kmeans() by calculating the Euclidean distance between the Z coordinates of samples; then, Y was permutated based on the clustered Z. Within each Z, cluster was generated, ensuring that the permuted samples break the dependency between X and Y but retain the dependency between X and Y on Z. After permutation, a p-value was calculated to make a statistical decision.
4. Distance covariance-based CI test
Distance covariance is an alternative to HSIC for measuring independence (Székely and Rizzo, 2009; Székely et al., 2007). We used two DCC-based functions dcc.perm and dcc.gamma (with default parameters) in the R package kpcalg (https://cran.r-project.org/web/packages/kpcalg/index.html) to perform CI test. Similar to HSIC-based algorithms, the two functions directly calculate for an UI test, then, the nonlinear regression is performed, next, is calculated for a CI test and a statistical decision (Verbyla et al., 2017).
dcc.perm
This program is similar to hsic.perm and uses a permutation test to estimate a p-value. The DCC statistic is calculated in each permutation, and finally, a statistical decision is made based on the p-value. We used the dcov.test function (with default parameters) in the R package energy to calculate the statistic DCC in the permutation test. The p-value was the fraction of times that DCC(X, Yperm) was smaller than DCC(X,Y).
dcc.gamma
Similar to hsic.gamma, dcc.gamma uses the gamma distribution Gamma(α, θ) of the DCC under the null hypothesis. The two parameters were estimated by
We used the dcov.gamma function (with default parameters) in the R package kpcalg to calculate the p-value. The p-value was obtained as an upper-tail quantile of DCC(X, Y).
5. Approximation of KCIT
The KCIT is another powerful CI test (Zhang and Peters, 2011), but it is time-consuming for large datasets. Based on random Fourier features (Rahimi and Recht, 2007), two approximation methods (randomized conditional independence test, RCIT, and randomized conditional correlation test, RCoT) were proposed (Strobl et al., 2019). RCIT and RCoT approximate KCIT by sampling Fourier features, return p-values orders of magnitude faster than KCIT when the sample size is large, and may also estimate the null distribution more accurately than KCIT.
RCIT
We used the RCIT function in the R package RCIT (with default parameters) (https://github.com/ericstrobl/RCIT; Strobl, 2019) to implement the randomized CI test.
RCoT
RCoT often outperforms RCIT, especially when the size of the conditioning set is greater than or equal to 4. We used the RCoT function in the R package RCIT (with default parameters) (https://github.com/ericstrobl/RCIT; Strobl, 2019) to implement the RCoT.
6. Conditional mutual information-based CI test
Mutual information is used to measure mutual dependence between two variables. Conditional mutual information (CMI) is a measure based on mutual information, which is zero if and only if .
CMIknn
CMIknn is a program that combines CMI with a local permutation scheme determined by the nearest-neighbor approach (Runge, 2018). We used the Python package tigramite (with default parameters) (http://github.com/jakobrunge/tigramite; Runge, 2020) to perform the CI test.
7. CI test based on generalized covariance measure
GCM
GCM (https://cran.r-project.org/web/packages/GeneralisedCovarianceMeasure/index.html; Peters and Shah, 2022) is a CI test based on generalized covariance measure. It is also classified as a regression-based CI test because it is based on a suitably normalized version of the empirical covariance between the residual vectors from the regressions (Shah and Peters, 2020).
8. Benchmarking results
The time consumption, accuracy, sample requirement, and stability of the PC+ nine CI tests were evaluated (Appendix 3—table 1). First, we simulated eight datasets with known causal networks, whose variable numbers and sample sizes ranged from 20 to 80 (step = 20) and 500 to 1000 (step = 500), respectively, to evaluate causal discovery algorithms’ time consumption, scalability, and accuracy. Algorithms based on the DCC kernel were more time-consuming than others (Appendix 3—figure 1A,C). Algorithms’ accuracy was assessed based on the structural Hamming distance (SHD) between the inferred and the true networks (SHD = 0 indicates no difference). The networks of all algorithms showed similar SHD when the sample size was 500 (Appendix 3—figure 1B); the close performance was probably because synthetic data were generated using a few simple functions. When the sample size was increased from 500 to 1000, time consumption increased (but was not doubled), but SHD did not decrease (i.e. algorithms’ performance did not increase) significantly, indicating that 500 cells may be adequate for causal discovery (Appendix 3—figure 1D).
Second, to further evaluate algorithms’ accuracy, for each feature gene set, we merged causal networks generated by multiple good algorithms into a consensus network (multi-algorithm-based consensus network), then compared the network of each algorithm with the consensus network (Main text-Figure 2; Appendix 3—figure 2). We used the SHD to define the difference between two networks, and the network with the shortest SHD with the consensus network is assumed to be the most accurate.
Third, to evaluate the impact of sample size on algorithms’ performance, we ran the nine algorithms using 200 (instead of 300) H2228 cells. The results of 200 cells were poorer than the results of 300 cells (compared with the consensus network in Main text-Figure 2 and Appendix 3—figure 2). Still, the two DCC algorithms performed the best and were less sensitive to the decreased sample size than the two HSIC algorithms. We also inferred interactions between genes in the ‘Metabolic reprogramming in colon cancer’ (WP4290) pathway using 200, 400, 600, and 800 cells in the H838 (Appendix 3—figures 3–6). We find that more cells make more interactions be inferred, but the interactions with high significance are quite stable.
Fourth, to evaluate algorithms’ stability, we used the H2228 dataset to run the nine algorithms five times and estimated each algorithm’s stability by computing the mean relative SHD of the five networks. The networks of gaussCItest have the smallest mean relative SHD and the networks of HSIC.perm, HSIC.clust, and DCC.perm have the largest mean relative SHD (Appendix 3—table 1). As DCC.perm and DCC.gamma are the most accurate algorithms, we examined whether their stability impairs their accuracy by checking the distribution of interactions in the five networks. DCC.gamma and DCC.perm inferred 127 and 143 interactions, 78% and 64.3% occurred stably in ≥4 networks, and many inconsistent interactions occurred in just one network (Main text-Figure 3), indicating that most interactions were stably inferred in multiple running. The networks of multiple running can be merged into a consensus network (multi-running-based consensus network), which can be used to examine which algorithm generated the most consistent networks.
Fifth, we compared the accuracy of PC+DCC.gamma, GES, and GSP+DCC.gamma using genes in the WikiPathways ‘Metabolic reprogramming in colon cancer’ (WP4290) and 600 cells in the A549, H2228, and H838 datasets. GSP+DCC.gamma (the significance level alpha = 0.01) inferred much more interactions than PC+DCC.gamma (alpha = 0.1) and GES (alpha = 0.1). The results indicate that PC+DCC.gamma (alpha = 0.1) and GSP+DCC.gamma (alpha = 0.05) have comparable accuracy and time consumption, and both are more accurate but time-consuming than GES (alpha = 0.1) (Appendix 3—figures 7–12).
Appendix 4
Evaluating the reliability and verifying causal discovery results
We evaluated the reliability of causal discovery by examining whether algorithms can differentiate interactions between genes in different cells. Inspired by using RNA spike-in to measure RNA-seq quality, we extracted the data of six MHC-II-related genes (HLA-DRB1, HLA-DMA, HLA-DRA, HLA-DPA, CD74, C3, which have the suffix _si to mark them) from the macrophage dataset (generated by Smart-seq2 sequencing) and the alveolar epithelial cell dataset (generated by 10x Genomics) to form two spike-in datasets. We mixed the spike-in dataset with the dataset of exhausted CD8 T cells and examined if the causal discovery was able to separate MHC-II genes and their interactions in the spike-in dataset from feature genes and their interactions in the exhausted CD8 T dataset. When the datasets contain sufficient cells (usually >300), the two DCC algorithms can discriminate genes and interactions in the two datasets quite well (Appendix 4—figures 1 and 2), indicating the power of causal discovery based on kernel-based CI tests. The inferred causal interactions can be verified using annotated protein interactions in the STRING database (https://string-db.org/). The results of our application cases indicate that many inferred interactions are supported by annotated protein interactions in the STRING database (Appendix 4—figures 3 and 4).
We have taken a systematic approach to validate causal discovery using the five lung cancer cell lines and lung alveolar cells. First, upon (a) gene expression value >0.1, (b) gene expression >50% cells, (c) fold change >0.3, we identified differentially expressed genes in each cell line against the alveolar cells. Second, we applied GO analysis to the differentially expressed genes in each cancer dataset using g:Profiler (https://biit.cs.ut.ee/gprofiler/gost) (parameters: Significance threshold = Benjamini-Hochberg FDR, User threshold = 0.05, Data sources = KEGG and WikiPathways). The WikiPathways and KEGG pathways ‘Metabolic reprogramming in colon cancer’ (WP4290), ‘Pyrimidine metabolism’ (WP4022), and ‘Nucleotide metabolism’ (hsa01232) are commonly enriched in all cancer cell lines (Appendix 4—figure 5). We also performed GO analysis using the GSEA package, which identified the KEGG pathway ‘Non-small cell lung cancer’ (hsa05223) as an enriched pathway in cancer cell lines (note that these lung cancer cell lines were derived from NSCLC). We used the PC+DCC.gamma to infer interactions among genes in the three pathways in the five cancer cell lines and the alveolar cells.
First, we examined the ‘Metabolic reprogramming in colon cancer’ (WP4290) pathway (Appendix 4—figures 6–9). Numerous studies report that glucose metabolism is reprogrammed and nucleotides synthesis is increased in cancer cells. Thus, we first examined and compared the WP4290 pathway in the five lung cancer cell lines and lung alveolar cells. The key features of the reprogrammed glucose metabolism are that (a) glucose intake is increased, (b) the glycolysis/TCA cycle intermediates are used for synthesizing nucleotide, (c) lactate generation is increased. The inferred networks capture these features. (a) multiple activations of SLC2A1 (which encodes a major glucose transporter and controls glucose intake), PGD (which promotes glucose metabolism into the pentose phosphate shunt), PSAT1 (which encodes a phosphoserine aminotransferase that catalyzes the reversible conversion of 3-phosphohydroxypyruvate to phosphoserine), and LDHA (whose protein catalyzes the conversion of pyruvate to lactate) are inferred in all cancer cell lines but not in alveolar cells. (b) Many activations of genes by downstream genes are inferred, and this sort of feedback regulations is an intrinsic feature of metabolism. Especially, the controlling factor SLC2A1 is activated by multiple genes. (c) In contrast, none of these features occur in the alveolar cells (partly due to key genes such as SLC2A1 is not expressed). These inferred results are literature-supported and biologically reasonable, despite that the causal inference is flawed by the absence of metabolites in the data.
Second, we examined the ‘Pyrimidine metabolism’ (WP4022) pathway (Appendix 4—figures 10–13). We used genes in the ‘Pyrimidine metabolism’ (hsa00240) to perform the inference (because WP4022 contains too many POLR gene families) and used the more readable WP4022 pathway to illustrate the results. Compared with glucose metabolism, pyrimidine metabolism has many reversable reactions, making interactions vary greatly in cells and the differences between cancer and alveolar cells opaque. The following genes and reactions are notable. (a) TYMS catalyzes dUMP->dTMP unidirectionally toward DNA synthesis. (b) Tk1/2 catalyze thymidine->dTMP and deoxyuridine->dUMP toward DNA synthesis (while NT5C/E/M do the opposite). (c) DUT catalyzes dUTP->dUMP (and dUMP is the substrate for TYMS). (d) TYMP catalyzes thymidine->thymine unidirectionally away from DNA synthesis. (e) ENTPD1/3 catalyze dTTP->dTDP->dTMP, UTP->UDP->UMP, and CTP->CDP->CMP away from DNA and RNA synthesis (but AK9/NME reverse these reactions). (f) NT5C/E/M catalyze dCMP->deoxycytidine, dUMP->deoxyuridine, and dTMP->thymidine away from DNA synthesis. Accordingly, the following interactions were inferred from cancer cell lines. (a) TYMS (the most critical gene promoting DNA synthesis) is activated in all cancer cell lines but not in alveolar cells, and it is not repressed by any gene in cancer cell lines. (b) Tk1/2 are activated in cancer cells and alveolar cells. (c) DUT is activated in all cancer cell lines but is not expressed in alveolar cells. (d) activations of TYMP (the critical gene making reactions away from DNA synthesis) by multiple others are inferred in alveolar cells. (e) ENTPD1/3 (genes making reactions away from DNA synthesis) are activated only in alveolar cells. (f) NT5C/E/M are repressed in all cancer cell lines but are not expressed in alveolar cells. The most notable may be DUT->Tk1 and DUT->TYMS in all cancer cell lines, indicating feedforward or coordinated regulations that promote DNA synthesis. These features are literature-supported and biologically reasonable, despite that the causal inference is flawed by the absence of metabolites in the data.
Third, we examined the ‘Non-small cell lung cancer’ (hsa05223) pathway (Appendix 4—table 1; Appendix 4—figure 14). We used the ‘graphite’ R package to turn hsa05223 into an adjacency matrix and mapped inferred interactions to the matrix. If an interaction can be mapped to an edge or a path with any directions (forward, inverse, or undirected) in hsa05223, it was assumed mapped to the pathway. hsa05223 contains sub-pathways such as p53 signaling pathway and PI3K-AKT pathway, therefore there are considerable epistatic interactions that are not annotated in hsa05223. Also, synergistic interactions (e.g. CDKN1A->BAX and EGFR->MET, see Dong et al., 2019; Wang et al., 2014), and many of which are literature-supported but not annotated. We additionally examined hsa05223 and sub-pathways wherein manually and found that many inferred interactions can be mapped to epistatic and synergistic interactions. Taken together, in each cell line, about 50% of inferred interactions can be mapped to the pathway. Note that this is the result without considering feedback regulations by TFs. For example, many EGF1-related interactions were inferred (e.g. E2F1->EGFR and RB1->ERBB2), but these interactions were not accounted because they are not annotated in the KEGG database. Two extra notes here. First, unlike reprogrammed glucose metabolism, common interactions between genes in different cell lines are not impressive, probably because these cell lines are generated with different genetic basis despite being derived from NSCLC. Second, the annotation of hsa05223 has defects, because it is not in the list of enriched pathways identified by g:Profiler.
Appendix 5
Additional results of applications
This appendix file describes the additional results of five applications, including the analysis of lung cancer cell lines and alveolar epithelial cells, the analysis of macrophages isolated from glioblastoma, the analysis of tumor-infiltrating exhausted CD8 T cells, identifying genes and inferring interactions that signify CD4 T cell aging, and the analysis of a flow cytometry dataset. These examples were used to examine the applicability of causal discovery to varied cell types and sequencing protocols. To same running time and also examine algorithms’ power, varied sample sizes were used. All of these data were analyzed using the PC+CI method. The results indicate that causal discovery can be applied flexibly to varied cells. The appendix text (including appendix tables and figures) is brief and divided into five subsections, with the first four corresponding to the four subsections in the Results section in the main text, following appendix figures that are ordered accordingly.
1. The analysis of lung cancer cell lines and lung alveolar epithelial cells
As expected, feature genes and causal networks in H2228 and lung alveolar epithelial cells are distinctly different (Main text-Figure 4; Appendix 5—figures 1–8). (a) HLA Class II genes and CD74 are down-regulated in H2228 cells but up-regulated in lung alveolar epithelial cells. (b) LCN2 is up-regulated in H2228 cells but down-regulated in lung alveolar epithelial cells. (c) Algorithms inferred multiple interactions between PRDX1, TALDO1, HSP90AA1, NQO1, and PSMC4 in H2228 cells, but none of them were inferred in lung alveolar epithelial cells. (d) HLA Class I genes are feature genes in H2228 cells but not in the lung alveolar epithelial cells. HLA genes make proteins called human leukocyte antigens (HLA), which take bits and pieces of proteins from inside the cell and display them on the cell’s surface. If the cell is cancerous or infected, the HLA proteins display abnormal fragments that trigger immune cells to destroy that cell. Down-regulated HLA genes may help cancer cells escape from immune cells. Annotating the networks upon related experimental findings suggest that DCC algorithms are the best and cmiKnn and GaussCItest are the poorest.
2. The analysis of the macrophages from glioblastoma
After using the dataset of macrophage isolated from glioblastoma to examine feature selection algorithms, we also used it to examine causal discovery algorithms. Again, feature genes include HLA genes to examine whether reported interactions are inferred (Appendix 5—figures 9 and 10).
3. The analysis of exhausted CD8 T cells from multiple cancers
We used TOX and PDCD1 as the target gene, respectively, to select 50 genes from genes expressed in >50% exhausted CD8 T cells (from liver, colorectal, and lung cancers) and in >50% non-exhausted CD8 T cells (from the normal tissues neighboring these cancers). Networks with TOX and PDCD1 as the target gene are called TOX-network and PDCD1-network, respectively. In this application case, we demonstrate consensus networks; unless otherwise specified, all panels are consensus networks of the two DCC algorithms. Therefore, we use letters but not algorithms to label panels. Networks were inferred from 500 cells (the case of colorectal cancer) and 463 cells (the case of lung cancer). Exhausted and non-exhausted were mutually used as case and control. In panels, →→ and -|-| represent indirect activation and inhibition (Appendix 5—figures 11–17).
4. The analysis of CD4 T cells from young and old mice
Since aging occurs gradually and ubiquitously in almost all cells, we assumed that consistent up- or down-regulation in all CD4 T cell types better defines CD4 aging-related genes than large fold changes. Upon this, we obtained the presumably CD4 aging-related genes (Appendix 5—table 1). Many of these genes are not the senescence signatures (Gorgoulis et al., 2019), indicating that different genes may be involved in the aging of different cells, but the mitochondrial genes have been well recognized as being important for aging in many cells.
Data in the STRING database support many inferred interactions, especially interactions between the mitochondrial genes, between Ccnd2, Ccnd3, Cdkn1b, and Cdkn2d, between B2m and H2-Q7, between Lck and Cd28, and between Gm9843 and Rps27rt (Appendix 4—figure 3). Interactions supported by experimental findings include Cdc42→Coro1a (CDC42 and CORO1A exhibit strong associations both with age) (Kerber et al., 2009), Arpc1b→Coro1a (in mouse T cells Coro1a is involved in Arp2/3 regulation) (Shiow et al., 2008), B2m→H2-Q7 (B2m is associated with the MHC class I heavy chain) (Smith et al., 2015), Lck→Cd28 (Lck is found to associate with CD28 by using its SH2 domain to bind to a phospho-specific site) (Rudd, 2021), Cdc42-|Lamtor2 (mTOR is required for asymmetric division through small GTPases in mouse oocytes) (He et al., 2013; Lee et al., 2012), Ccnd2-|Lamtor2 (mTORC1 activation regulates beta-cell mass and proliferation by modulation of Ccnd2 synthesis and stability) (Balcazar et al., 2009), and Sub1-|Ccnd2-|Lamtor2 (Sub1 can accelerate aging via disturbing mTOR-regulated proteostasis) (Chen et al., 2021).
Several inferred results are noticeable. First, interactions among the mitochondrial genes were inferred in all cases, whose expression levels were low in cells from young mice but high in cells from old mice. These indicate that these genes may be common biomarkers of aging for CD4 T cells. Second, in the inferred networks, these mitochondrial genes do not have consistent inputs and outputs, which can probably be explained by the finding that the metabolic system undergoes extensive rewiring upon normal T-cell activation and differentiation (Zhang et al., 2021b). With the report of increasing experimental findings, mitochondrial dysfunction in aging and diseases of aging has drawn increasing attention (Haas, 2019). Third, Junb is activated. Persistent JUNB activation in human fibroblasts enforces skin aging and the AP-1 family TFs (including FOSL2 and JUNB) are increased in all immune cells during aging (Maity et al., 2021; Zheng et al., 2020). The findings of Junb/JUNB indicate that JUNB/Junb plays a critical role in aging. Fourth, the Gm9843→Rps27rt→Junb cascade (Rps27rt is also called Gm9846, and both Gm9843 and Rps27rt are mouse-specific genes) was inferred in many cases; it is interesting whether these interactions’ counterparts exist in humans (Appendix 5—figures 18–21).
5. The analysis of a flow cytometry dataset
Finally, we analyzed the flow cytometry data reported by Sachs et al. This dataset, due to the ground truth given by the authors, has been used to test other algorithms. The computed structural intervention distance (SID) and SHD between networks inferred by different algorithms and the ground truth network also suggest that the DCC CI tests outperform others. See Appendix 5—figure 22.
Data availability
Only public data were used. Links to all data are provided in the manuscript.
-
NCBI Gene Expression OmnibusID GSE134705. Cross-species analysis across 450 million years of evolution reveals conservation and divergence of the microglia program (scRNA-seq).
-
NCBI Gene Expression OmnibusID GSE126906. Designing a single cell RNA sequencing benchmark dataset to compare protocols and analysis methods [5 Cell Lines 10X].
-
Single Cell PortalID SCP490. Study: Aging promotes reorganization of the CD4 T cell landscape toward extreme regulatory and effector phenotypes.
-
NCBI Gene Expression OmnibusID GSE131928. Single cell RNA-seq analysis of adult and paediatric IDH-wildtype Glioblastomas.
-
NCBI Gene Expression OmnibusID GSE99254. T cell landscape of non-small cell lung cancer revealed by deep single-cell RNA sequencing.
-
NCBI Gene Expression OmnibusID GSE108989. Lineage tracking reveals dynamic relationships of T cells in colorectal cancer.
-
NCBI Gene Expression OmnibusID GSE98638. Landscape of infiltrating T cells in liver cancer revealed by single-cell sequencing.
References
-
Mtorc1 activation regulates beta-cell mass and proliferation by modulation of cyclin D2 synthesis and stabilityThe Journal of Biological Chemistry 284:7832–7842.https://doi.org/10.1074/jbc.M807458200
-
BookDAGMA: Learning DAGs via M-Matrices and a Log-Determinant Acyclicity CharacterizationNeurIPS.
-
ConferenceXGBoost: A Scalable Tree Boosting SystemProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA: Association for Computing Machinery). pp. 785–794.https://doi.org/10.1145/2939672.2939785
-
Enhanced glycometabolism as a mechanism of NQO1 potentiated growth of NSCLC revealed by metabolomic profilingBiochemical and Biophysical Research Communications 496:31–36.https://doi.org/10.1016/j.bbrc.2017.12.160
-
Optimal structure identification with greedy searchJournal of Machine Learning Research 3:507–554.
-
The MS4A gene cluster is a key modulator of soluble TREM2 and Alzheimer's disease risk’Science Translational Medicine 11:eaau2291.https://doi.org/10.1126/scitranslmed.aau2291
-
Identification of cold-shock protein RBM3 as a possible regulator of skeletal muscle size through expression profilingAmerican Journal of Physiology-Regulatory, Integrative and Comparative Physiology 295:R1263–R1273.https://doi.org/10.1152/ajpregu.90455.2008
-
Fast and scalable feature selection for gene expression data using hilbert-schmidt independence criterionIEEE/ACM Transactions on Computational Biology and Bioinformatics 14:167–181.https://doi.org/10.1109/TCBB.2016.2631164
-
Age-Related changes in lck-vav signaling pathways in mouse CD4 T cellsCellular Immunology 259:100–104.https://doi.org/10.1016/j.cellimm.2009.06.001
-
Cdc42 and aging of hematopoietic stem cellsCurrent Opinion in Hematology 20:295–300.https://doi.org/10.1097/MOH.0b013e3283615aba
-
Review of causal discovery methods based on graphical modelsFrontiers in Genetics 10:524.https://doi.org/10.3389/fgene.2019.00524
-
Mechanisms underlying T cell ageingNature Reviews. Immunology 19:573–583.https://doi.org/10.1038/s41577-019-0180-1
-
ConferenceMeasuring statistical dependence with Hilbert-Schemidt normsInternational Conference on AlgorithmicLearning Theory. pp. 63–77.https://doi.org/10.1007/11564089
-
Nqo1 is a determinant for cellular sensitivity to anti-tumor agent napabucasinAmerican Journal of Cancer Research 10:1442–1454.
-
The sociobiology of brain tumorsAdvances in Experimental Medicine and Biology 1225:115–125.https://doi.org/10.1007/978-3-030-35727-6_8
-
Age-Associated decrease in proteasome content and activities in human dermal fibroblasts: restoration of normal level of proteasome subunits reduces aging markers in fibroblasts from elderly personsThe Journals of Gerontology. Series A, Biological Sciences and Medical Sciences 62:490–499.https://doi.org/10.1093/gerona/62.5.490
-
Synthetic spike-in standards for RNA-seq experimentsGenome Research 21:1543–1551.https://doi.org/10.1101/gr.121095.111
-
Network analysis of gene expressionMethods in Molecular Biology 1783:325–341.https://doi.org/10.1007/978-1-4939-7834-2_16
-
Effects of aging on ribosomal protein L7 messenger RNA levels in cultured rat preadipocytesExperimental Gerontology 28:557–563.https://doi.org/10.1016/0531-5565(93)90044-e
-
Partial correlation and conditional correlation as measures of conditional independenceAustralian New Zealand Journal of Statistics 46:657–664.https://doi.org/10.1111/j.1467-842X.2004.00360.x
-
A fast PC algorithm for high dimensional causal discovery with multi-core pcsIEEE/ACM Transactions on Computational Biology and Bioinformatics 16:1483–1495.https://doi.org/10.1109/TCBB.2016.2591526
-
Signal transduction changes in CD4 + and CD8 + T cell subpopulations with agingExperimental Gerontology 105:128–139.https://doi.org/10.1016/j.exger.2018.01.005
-
Mtor is required for asymmetric division through small GTPases in mouse oocytesMolecular Reproduction and Development 79:356–366.https://doi.org/10.1002/mrd.22035
-
Foxp1 controls mesenchymal stem cell commitment and senescence during skeletal agingJournal of Clinical Investigation 127:1241–1253.https://doi.org/10.1172/JCI89511
-
Causal network inference from gene transcriptional time-series response to glucocorticoidsPLOS Computational Biology 17:e1008223.https://doi.org/10.1371/journal.pcbi.1008223
-
Wisdom of crowds for robust gene network inferenceNature Methods 9:796–804.https://doi.org/10.1038/nmeth.2016
-
Role of migratory inhibition factor in age-related susceptibility to radiation lung injury via NF-E2-related factor-2 and antioxidant regulationAmerican Journal of Respiratory Cell and Molecular Biology 49:269–278.https://doi.org/10.1165/rcmb.2012-0291OC
-
Cd8 T cell exhaustion during chronic viral infection and cancerAnnual Review of Immunology 37:457–495.https://doi.org/10.1146/annurev-immunol-041015-055318
-
BookGraphical Models: Selecting Causal and Statistical ModelsCarnegie Mellon University Diss.
-
A comprehensive survey of regulatory network inference methods using single cell RNA sequencing dataBriefings in Bioinformatics 22:bbaa190.https://doi.org/10.1093/bib/bbaa190
-
ConferenceRandom features for large-scale kernel machinesProceedings of the 20th International Conference on Neural Information Processing Systems. pp. 1177–1184.
-
How the discovery of the CD4/CD8-p56lck complexes changed immunology and immunotherapyFrontiers in Cell and Developmental Biology 9:626095.https://doi.org/10.3389/fcell.2021.626095
-
ConferenceConditional Independence Testing Based on a Nearest-Neighbour Estimator of Conditional Mutual InformationProceedings of the 21st International Conference on Artificial Intelligence and Statistics.
-
The hardness of conditional independence testing and the generalised covariance measureThe Annals of Statistics 48:1514–1538.https://doi.org/10.1214/19-AOS1857
-
Nad (P) H: quinone oxidoreductase 1 (NQO1) in the sensitivity and resistance to antitumor quinonesBiochemical Pharmacology 83:1033–1040.https://doi.org/10.1016/j.bcp.2011.12.017
-
Gene selection via the bahsic family of algorithmsBioinformatics 23:i490–i498.https://doi.org/10.1093/bioinformatics/btm216
-
A ribosomal perspective on proteostasis and agingCell Metabolism 23:1004–1012.https://doi.org/10.1016/j.cmet.2016.05.013
-
Approximate kernel-based conditional independence tests for fast non-parametric causal discoveryJournal of Causal Inference 7:20180017.https://doi.org/10.1515/jci-2018-0017
-
Measuring and testing dependence by correlation of distancesThe Annals of Statistics 35:2769–2794.https://doi.org/10.1214/009053607000000505
-
Brownian distance covarianceThe Annals of Applied Statistics 3:1236–1265.https://doi.org/10.1214/09-AOAS312
-
ConferenceCausal discovery in the presence of missing dataProceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019.
-
Age-Related responses to a bout of mechanotherapy in skeletal muscle of ratsJournal of Applied Physiology 127:1782–1791.https://doi.org/10.1152/japplphysiol.00641.2019
-
Intercellular transfer of mitochondria between senescent cells through cytoskeleton-supported intercellular bridges requires mTOR and Cdc42 signallingOxidative Medicine and Cellular Longevity 2021:6697861.https://doi.org/10.1155/2021/6697861
-
Cd74 correlated with malignancies and immune microenvironment in gliomasFrontiers in Molecular Biosciences 8:706949.https://doi.org/10.3389/fmolb.2021.706949
-
P41-arc, a regulatory subunit of Arp2/3 complex, can induce premature senescence in the absence of p53 and RbExperimental & Molecular Medicine 43:389–392.https://doi.org/10.3858/emm.2011.43.7.042
-
ConferenceKernel-based conditional independence test and application in causal discoveryProceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. pp. 804–813.
-
Hallmarks of the aging T-cell systemThe FEBS Journal 288:7123–7142.https://doi.org/10.1111/febs.15770
-
Regularization and variable selection via the elastic netJournal of the Royal Statistical Society Series B 67:301–320.https://doi.org/10.1111/j.1467-9868.2005.00503.x
Article and author information
Author details
Funding
National Natural Science Foundation of China (31771456)
- Hao Zhu
Department of Science and Technology of Guangdong Province (2020A1515010803)
- Hao Zhu
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (31771456) and the Department of Science and Technology of Guangdong Province (2020A1515010803). We appreciate the help from Prof. Ruichu Cai at the Guangdong University of Technology.
Copyright
© 2023, Wen, Huang et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 2,535
- views
-
- 294
- downloads
-
- 7
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Computational and Systems Biology
- Developmental Biology
Understanding the principles underlying the design of robust, yet flexible patterning systems is a key problem in developmental biology. In the Drosophila wing, Hedgehog (Hh) signaling determines patterning outputs using dynamical properties of the Hh gradient. In particular, the pattern of collier (col) is established by the steady-state Hh gradient, whereas the pattern of decapentaplegic (dpp), is established by a transient gradient of Hh known as the Hh overshoot. Here we use mathematical modeling to suggest that this dynamical interpretation of the Hh gradient results in specific robustness and precision properties. For instance, the location of the anterior border of col, which is subject to self-enhanced ligand degradation is more robustly specified than that of dpp to changes in morphogen dosage, and we provide experimental evidence of this prediction. However, the anterior border of dpp expression pattern, which is established by the overshoot gradient is much more precise to what would be expected by the steady-state gradient. Therefore, the dynamical interpretation of Hh signaling offers tradeoffs between
-
- Computational and Systems Biology
- Neuroscience
Animal behaviour alternates between stochastic exploration and goal-directed actions, which are generated by the underlying neural dynamics. Previously, we demonstrated that the compositional Restricted Boltzmann Machine (cRBM) can decompose whole-brain activity of larval zebrafish data at the neural level into a small number (∼100-200) of assemblies that can account for the stochasticity of the neural activity (van der Plas et al., eLife, 2023). Here, we advance this representation by extending to a combined stochastic-dynamical representation to account for both aspects using the recurrent temporal RBM (RTRBM) and transfer-learning based on the cRBM estimate. We demonstrate that the functional advantage of the RTRBM is captured in the temporal weights on the hidden units, representing neural assemblies, for both simulated and experimental data. Our results show that the temporal expansion outperforms the stochastic-only cRBM in terms of generalization error and achieves a more accurate representation of the moments in time. Lastly, we demonstrate that we can identify the original time-scale of assembly dynamics by estimating multiple RTRBMs at different temporal resolutions. Together, we propose that RTRBMs are a valuable tool for capturing the combined stochastic and time-predictive dynamics of large-scale data sets.