Signature-scoring methods developed for bulk samples are not adequate for cancer single-cell RNA sequencing data
Abstract
Quantifying the activity of gene expression signatures is common in analyses of single-cell RNA sequencing data. Methods originally developed for bulk samples are often used for this purpose without accounting for contextual differences between bulk and single-cell data. More broadly, few attempts have been made to benchmark these methods. Here, we benchmark five such methods, including single sample gene set enrichment analysis (ssGSEA), Gene Set Variation Analysis (GSVA), AUCell, Single Cell Signature Explorer (SCSE), and a new method we developed, Jointly Assessing Signature Mean and Inferring Enrichment (JASMINE). Using cancer as an example, we show that cancer cells consistently express more genes than normal cells. This imbalance biases the performance of bulk-sample-based ssGSEA in gold standard tests and down sampling experiments. In contrast, single-cell-based methods are less susceptible. Our results suggest caution should be exercised when using bulk-sample-based methods in single-cell data analyses, and cellular contexts should be taken into consideration when designing benchmarking strategies.
Editor's evaluation
In this revised manuscript, Noureen and collaborators benchmark five methods that are used in single-cell RNA sequencing data analyses, including single sample gene set enrichment analysis (ssGSEA), Gene Set Variation Analysis (GSVA), AUCell, Single Cell Signature Explorer (SCSE), and a newly developed method, Jointly Assessing Signature Mean and Inferring Enrichment (JASMINE). The authors test these in distinct cancer datasets and conclude that caution should be exercised when using bulk sample-based methods in single-cell data analyses, and cellular contexts should be taken into consideration. As a rapidly developing field in need of method benchmarking, we believe that this paper will be of great interest to the bioinformatics and single-cell data analysis communities.
https://doi.org/10.7554/eLife.71994.sa0
Introduction
Single-cell RNA sequencing (scRNA-seq) is a powerful technology to study the ecosystems of normal and diseased tissues. Gene expression signatures can be used to interrogate single cells for cell identities and other cellular properties (Noureen et al., 2021a) using signature-scoring methods. A key consideration for these methods is how to account for the variability of expression within the signature, a characteristic often associated with the overall expression variability of the interrogated sample. Despite their widespread use and benchmarking in certain applications (Holland et al., 2020; Zhang et al., 2020), the limitations of these methods, including some initially developed for bulk samples, remain incompletely understood in scRNA-seq analysis.
Compared with bulk-sample RNAseq, scRNA-seq data have high dropout rates (Hicks et al., 2018). Dropout rates are inversely related to gene counts, that is, the number of genes detected in a cell, a feature associated with cell differentiation status (Gulati et al., 2020). Variability in gene counts may cause imbalanced representation of non-expressed genes in gene signatures, resulting in scoring bias. Thus, we reason that in contexts where cells differ in differentiation status, failing to account for this variability may lead to misleading signature scores.
Cancer cells exhibit stem-cell-like properties (Malta et al., 2018); thus, cancer may represent such a context. To test this, we benchmark five signature-scoring methods in cancer single-cell datasets. The five methods include the bulk-sample-based methods single sample gene set enrichment analysis (ssGSEA) and Gene Set Variation Analysis (GSVA) (both implemented in the R GSVA package; Hänzelmann et al., 2013) and three single-cell-based methods: AUCell (Aibar et al., 2017), Single-Cell Signature Explorer (SCSE) (Pont et al., 2019), and Jointly Assessing Signature Mean and Inferring Enrichment (JASMINE), a new method we developed. We show that cancer cells consistently express more genes than normal cells, and this imbalance significantly biases results from ssGSEA and GSVA but largely spares single-cell-based methods. Our results caution against the use of bulk-sample-based methods in scRNA-seq analyses.
Results
Cancer cells demonstrate higher gene counts than normal cells
Several recent studies showed that gene counts of single cells are associated with cell differentiation status and cell identity (Gulati et al., 2020; Qiu, 2020). Specifically, cells higher in the differentiation hierarchy express more genes. To examine if cancer and normal cells demonstrate similar imbalances, we collected 10 scRNA-seq datasets across different cancer types, technological platforms and sequencing coverage (Supplementary file 1). We also obtained cell type annotations from the original studies. We found that the average number of detected genes was significantly higher in tumor cells than in normal cells across all datasets (Figure 1A, p < 2.2e-16). This imbalance persisted when we separated normal cells into different cell populations (Figure 1—figure supplement 1). However, cancer-associated cells, including cancer-associated fibroblasts (CAF), tumor-associated macrophages (TAM), and tumor-related endothelial cells (TEC) had higher gene counts than other normal cell types, sometimes even comparable to malignant cells (Figure 1—figure supplement 1).

Figure 1. Gene count imbalances affect signature scoring.
(A) The number of detected genes in tumor and normal cell populations in 10 single-cell cancer RNAseq datasets. The height of each bar represents the average, and whiskers represent the standard deviation. In all cases, the difference is statistically significant (Student's t test, p < 2.2e-16). (B) Percentage of up and down regulated gene signatures in cancer cells relative to normal cells based on Cohen's d. Dot size corresponds to the percentage of all signatures tested (n = 7503). (C) Spearman correlation coefficients of Cohen's d with signature sizes across the datasets and methods. An asterisk (*) in a cell indicates p-value < 0.01. The color of the heatmap represents the correlation coefficient. (D) Scores of a cell cycle gene set (GO:0007049) calculated using four methods, along with MKI67 expression, gene counts, and cell cycle phases predicted by Seurat, in tumor and normal cell populations of the HNSC dataset (GSE103322). The red box highlights non-cycling tumor cells that exhibit higher scores than non-cycling normal cells.
Figure 1—source data 1: Source data for Figure 1.
https://cdn.elifesciences.org/articles/71994/elife-71994-fig1-data1-v2.xlsx
Bias of ssGSEA signature scores in cancer datasets
The reliability of signature-scoring methods is measured by their ability to identify quantitative differences between two groups of samples at the gene-signature level. Here, we compare signature scores between tumor and normal cells. One issue is that scores from these methods differ in range and variance, making direct comparison challenging. To overcome this challenge, we used Cohen's d (Cohen, 1988), a measure that normalizes mean differences by the standard deviation, generating a unitless contrast that can be compared across methods. For simplicity, we defined Cohen's d greater than 1 as upregulated and less than –1 as down regulated. These criteria were used throughout this work.
We assembled 7503 gene sets from MSigDB, each containing at least 20 genes (the term 'gene set' is used interchangeably with 'gene signature'; Methods). On average, single-cell methods reported similar proportions of up- and down-regulated gene sets (9% vs 12% of all input gene sets) (Figure 1B). In contrast, ssGSEA and GSVA reported 25 and 23% of the input gene sets as upregulated but only 6 and 5% as down regulated. In six out of the 10 datasets, ssGSEA and GSVA reported more than twice as many upregulated as down regulated gene sets. This pattern remained when we divided normal cells into different populations (Figure 1—figure supplement 2). Notably, in cell types with high gene counts such as TEC, TAM, and CAF, we did not observe significantly more upregulated gene sets by ssGSEA and GSVA (Figure 1—figure supplement 2E,F). These results collectively suggest that ssGSEA and GSVA are sensitive to the variability of gene counts common to scRNA-seq data.
We also found that Cohen’s d from ssGSEA and GSVA showed consistent positive correlation with gene set sizes (Figure 1C). Single-cell methods did not demonstrate this pattern.
We observed that GSVA was slower than the other methods. For instance, running the head and neck dataset with 200 threads on a 2.5 GHz Xeon processor took more than 3 days to complete. Because GSVA outputs were highly consistent with those of ssGSEA (Figure 1—figure supplement 3), and its running speed is unlikely to scale to larger single-cell datasets, we dropped GSVA from subsequent analyses.
We next use an example to illustrate how gene counts may affect signature scores and data interpretation. We scored the cell cycle gene signature from Gene Ontology (GO:0007049) in the head and neck cancer (HNSC) dataset (Puram et al., 2017). For comparison, we used Seurat (Butler et al., 2018) to identify cycling cells at the G2M and S phases, and also included the expression of MKI67, a proliferation marker expressed by cycling cells. All four methods reported higher scores in cycling cells (Figure 1D). However, ssGSEA scores were significantly higher in non-cycling cancer cells than in normal cells, although neither population is cycling (Student's t test, p < 2.2e-16). In contrast, the single-cell-based methods did not show this difference.
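For readers who wish to reproduce this comparison, a minimal sketch of the cell cycle annotation step is shown below. It assumes a normalized Seurat object (here named seu, our placeholder) built from the HNSC matrix; the S and G2M marker lists come from Seurat's built-in cc.genes object.

```r
library(Seurat)

# Annotate cell cycle phases with Seurat's built-in S and G2M marker lists.
# 'seu' is assumed to be a normalized Seurat object from the HNSC matrix.
seu <- CellCycleScoring(seu,
                        s.features   = cc.genes$s.genes,
                        g2m.features = cc.genes$g2m.genes)

# Cells not assigned to S or G2M fall into G1 (non-cycling); MKI67
# expression offers an independent check of cycling status.
table(seu$Phase)
VlnPlot(seu, features = "MKI67", group.by = "Phase")
```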
Sensitivity and specificity
We next generated gold standard up- and down-regulated gene sets at various sizes to benchmark detection sensitivity. To simulate real-world scenarios, we added noise to each gene set, so that higher noise levels attenuate more of the signal (Methods).
We found gene set size had negligible impacts on detection rates (Figure 2—figure supplement 1). Noise levels, on the other hand, had a much bigger impact. For up gene sets, all methods were able to detect 50% of the gene sets on average even at the 80% noise level (Figure 2A). However, for down gene sets, ssGSEA performed worse than other methods (Figure 2B). Without noise, it only detected about 85% of the gene sets. At 80% noise level, it detected around 30% of the gene sets, compared to 70–80% by single-cell methods.

Figure 2. Sensitivity, specificity, and accuracy.
(A) Recovery rate for up gene signatures across five noise levels by the four methods. Each dot represents one dataset. At each noise level, average of all datasets is used to represent the performance of each method. (B) Similarly, for down signatures. (C) Percentages of false up and down signatures. The size of the dots corresponds to the percentages of all the signatures tested. Because the contrasting groups are generated by down sampling, no signatures are expected to be identified. The numbers below the heatmap are the average percentage. (D) Accuracy of the three methods, separated into up and down signatures. Accuracy is calculated as the agreement with consensus calls by at least two methods.
Figure 2—source data 1: Source data for Figure 2.
https://cdn.elifesciences.org/articles/71994/elife-71994-fig2-data1-v2.xlsx
To benchmark detection specificity, we generated null datasets by randomly choosing 100 cancer cells from each dataset. We down sampled each cell to 50% coverage, resulting in lower gene counts. These cells were then normalized to the same total coverage. Because the down-sampled cells were the same cells as the originals, no up or down gene sets were expected between the two groups.
We found AUCell and JASMINE outperformed SCSE and ssGSEA in specificity (Figure 2C). On average, AUCell detected 7.5 and 1.2% of the input gene sets as up and down; JASMINE detected 4 and 1%. In four datasets, SCSE overestimated down regulated gene sets (average 25.4%). ssGSEA grossly overestimated both up- and down regulated gene sets in all datasets (average 11 and 46%, respectively).
Next, we estimated how changes in gene count affected score stability. For each gene signature, we calculated the coefficient of variation (CV) (Figure 2—figure supplement 2). We observed that single-cell methods were robust to our down sampling, with little change in CV. However, in three datasets, ssGSEA had a higher CV in down-sampled cells than in the original cells.
Comparing with consensus and computational efficiency
We next compared the three single-cell-based methods against the consensus, defined as gene sets identified as up- or down regulated by at least two methods. JASMINE aligned better with the consensus in most datasets (Figure 2D and Figure 2—figure supplement 3). AUCell overall had lower accuracies against the consensus, particularly for down regulated gene sets, likely because it is designed to score marker signatures (Figure 2D). When broken down into sensitivity and specificity, AUCell showed higher false positive rates (Figure 2—figure supplement 4A). The three tools showed comparable sensitivity. Across the 10 datasets, SCSE and JASMINE showed better correlation with each other (Figure 2—figure supplement 4B,C), explaining their relatively better alignment with the consensus.
Finally, we tested the computing efficiency of the four methods. We randomly generated gene sets of various sizes and analyzed each gene set in a dataset consisting of 2000 cells. Figure 2—figure supplement 5 shows the time and memory use averaged over 50 iterations for one gene set under the same hardware configuration (2.20 GHz CPU, 32 GB memory). Overall, SCSE was the most computationally efficient. JASMINE was the second fastest algorithm but required memory similar to that of ssGSEA. AUCell was slightly faster than ssGSEA but required the most memory. Performance was robust across signature sizes for all methods.
Dropouts affect ssGSEA scores
To further show the impact of dropouts on ssGSEA scores, we re-ran the down sampling experiment by setting down sampling rates to 20, 40, 60, and 80%. A down sampling rate of 80% indicates that the total sequencing depth of the down sampled cell is at 80% of the original cell. A lower down sampling rate creates more dropouts. We observed that as the down sampling rate shifted from low to high, the pattern of deregulated gene sets by ssGSEA also shifted from more up gene sets towards more down gene sets (Figure 3A and Figure 3—figure supplement 1). This observation strongly associates dropouts with ssGSEA scores. In Figure 3—figure supplement 2, we provide an example using real data to show how ssGSEA scores change when limited to cells with comparable gene counts.

Figure 3. Impact of dropouts on ssGSEA signature scoring.
(A) Percentages of up and down regulated gene signatures in original cells relative to down sampled cells for four levels of down sampling (20, 40, 60, and 80%) based on Cohen's d. Dot size corresponds to the percentage of all signatures tested (n = 7503) in Head and Neck (Puram et al., 2017). (B) Effect of dropouts on ssGSEA scoring using a dummy expression matrix. The black line denotes the cell without any dropouts, and the blue line denotes the same cell with a 60% dropout rate. Note that for the gene signature, the first 99 genes are fixed. The x axis reflects the position of the last signature gene. When that gene is at rank <4000, the two cells give identical scores. However, after it enters the dropout zone, the scores start to deviate.
Figure 3—source data 1: Source data for Figure 3.
https://cdn.elifesciences.org/articles/71994/elife-71994-fig3-data1-v2.xlsx
To understand how dropouts may affect ssGSEA scoring, we simulated a dummy expression profile of a cell with 10,000 genes. We then applied different dropout rates to the cell, ranging from no dropout to 80%. Genes were ranked from high to low expression to mimic the ssGSEA calculation, and the gene rankings were identical at each dropout rate. We then assembled a gene signature of 100 genes. The first 99 genes were drawn from the top 1000 genes, and the last gene was added to the signature from the 1001st gene onward to the bottom of the list. This way, the signature consists of 99 highly expressed genes and a floating gene. As the floating gene moves toward the dropout zone, we can observe how the signature score changes, especially around the dropout zone. The code to reproduce these data can be found in our GitHub repository (Methods). Figure 3B shows the result at the 60% dropout rate. Before entering the dropout zone, the signature score was identical between the no-dropout cell and the 60%-dropout cell. However, when the floating gene was between ranks 4000 and 7000, the signature was scored higher in the dropout cell; when moving further toward ranks 7000–10,000, the signature was scored lower in the dropout cell. Thus, the scores teetered around the tie rank of the dropout genes, that is, 7000. This simple experiment demonstrates that dropouts alone are sufficient to change ssGSEA scores in unintended ways.
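The essence of this simulation can be reproduced with a short script. The sketch below is our illustration rather than the exact published code: it assumes the classic gsva() interface of the GSVA package and evaluates the floating gene on a coarse grid of positions to keep runtime manageable.

```r
library(GSVA)

set.seed(1)
n.genes <- 10000
genes   <- paste0("g", seq_len(n.genes))

# Dummy profile: expression decreases with rank, mimicking a real cell.
full <- sort(rexp(n.genes, rate = 0.5), decreasing = TRUE)
names(full) <- genes

# The same cell at a 60% dropout rate: the bottom 60% of genes set to zero.
dropped <- full
dropped[4001:n.genes] <- 0

expr <- cbind(no_dropout = full, dropout60 = dropped)

# 99 fixed signature genes from the top 1000, plus one floating gene that
# slides from rank 1001 toward the bottom of the list.
fixed     <- sample(genes[1:1000], 99)
positions <- seq(1001, n.genes, by = 500)

scores <- sapply(positions, function(p) {
  sig <- list(sig100 = c(fixed, genes[p]))
  gsva(expr, sig, method = "ssgsea", verbose = FALSE)[1, ]
})
# Each column of 'scores' holds both cells' ssGSEA scores: the two agree
# until the floating gene enters the dropout zone (rank > 4000), then
# deviate around the tie rank of the dropout genes (~7000).
```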
Discussion
In this study, we benchmarked five signature-scoring methods, including two that were developed for bulk sample RNAseq analysis (ssGSEA and GSVA), and three that were developed for scRNA-seq analysis (AUCell, SCSE, and JASMINE). Because ssGSEA and GSVA generate highly correlated scores and GSVA is slower, we focused on ssGSEA and the other three methods. Unlike methods that test for signature enrichment between sample groups, these signature-scoring methods quantify the expression activity of a gene set in a sample independently of other samples. Outputs from these methods can be conveniently tested for association with complex sample characteristics. We show that single-cell-based methods are more robust at identifying bona fide up- and down regulated gene signatures from scRNAseq data. In contrast, bulk-sample-based ssGSEA and GSVA are more susceptible to dropouts. These disparities stem from their distinct mechanisms for scoring signatures. AUCell uses the Area Under the Curve (AUC) to test the enrichment of a gene set among the top expressed genes in a cell. SCSE measures a signature using the normalized total expression of the signature genes. AUCell and SCSE intuitively counter dropout effects by using only expressed genes. JASMINE considers the enrichment of signature genes in expressed genes to counter dropout effects and, meanwhile, evaluates the average expression level of the expressed signature genes. Bulk-sample-based methods were not designed to deal with excessive dropouts, which may mislead analyses, especially when cell biology underpins differences in dropout rates. We further show that among single-cell methods, JASMINE and SCSE showed better concordance, and JASMINE outperformed SCSE in down sampling tests.
Dropouts were initially thought to result from low amounts of input RNA and transcriptional stochasticity. However, more studies have shown that dropouts are associated with cell states and identities (Gulati et al., 2020; Qiu, 2020). Thus, in cellular contexts where cells have systematic differences in dropout rates, any tool that fails to account for dropouts may generate misleading, even erroneous, results. In conclusion, our results caution against using bulk-sample-based signature-scoring methods to score single cells. Our study also suggests that it is important to consider cellular contexts when benchmarking methods for scRNA-seq data analysis, particularly when bulk-sample-based methods are involved.
Materials and methods
Single cell data sets and signature scoring
We used single-cell data sets from 10 published studies (Bi et al., 2021; Chung et al., 2017; Darmanis et al., 2017; He et al., 2021; Lee et al., 2020; Ma et al., 2019; Neftel et al., 2019; Puram et al., 2017; Tirosh et al., 2016; Venteicher et al., 2017) to compare the number of expressed genes between tumor and normal cells. Annotations of cell identity were also downloaded from each publication. We filtered all data sets by removing non-expressed genes and then normalized them using the regularized negative binomial regression implemented in Seurat. We used the C2 (n = 6226), C3 (n = 3556), and Hallmark (n = 50) modules from MSigDB v.7.2 (Subramanian et al., 2005) to calculate the ratio of signature genes across all data sets and for signature scoring. We tested five tools for signature score calculation: SCSE (Pont et al., 2019), AUCell (Aibar et al., 2017), ssGSEA, GSVA (Hänzelmann et al., 2013), and JASMINE. GSVA was included in tumor-normal comparisons but was dropped from gold standard tests and down sampling experiments because of its slow running speed and because its outputs were highly correlated with those of ssGSEA. We used the GSVA and ssGSEA implementations in the GSVA Bioconductor package (Hänzelmann et al., 2013) and the AUCell method from the AUCell Bioconductor package (Aibar et al., 2017), all with default parameters. We implemented SCSE in the R environment (v4.0) according to the equation reported in its paper (Pont et al., 2019). The output scores were used as is in tumor/normal cell comparisons and simulation analyses.
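For readers who want to recapitulate these test conditions, the calls may look like the following sketch. It assumes a normalized expression matrix expr, a raw counts matrix counts, and a named list of gene sets gs (all placeholder names); the SCSE function reflects our reading of the published equation rather than official code.

```r
library(GSVA)
library(AUCell)

# ssGSEA and GSVA with default parameters (classic gsva() interface).
ssgsea.scores <- gsva(expr, gs, method = "ssgsea")
gsva.scores   <- gsva(expr, gs, method = "gsva")

# AUCell with default parameters: rank genes per cell, then compute AUCs.
rankings   <- AUCell_buildRankings(expr)
auc.scores <- AUCell_calcAUC(gs, rankings)

# SCSE re-implemented from the published equation: normalized total
# expression of signature genes per cell (Pont et al., 2019).
scse <- function(counts, signature) {
  sig <- intersect(signature, rownames(counts))
  colSums(counts[sig, , drop = FALSE]) / colSums(counts) * 100
}
```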
Jointly Assessing Signature Mean and Inferring Enrichment
For each signature, JASMINE calculates an approximate mean using gene ranks among expressed genes, and the enrichment of the signature in expressed genes. The two components are then scaled to 0–1 and averaged to produce the final JASMINE score.
Assume Rg represents the rank of an expressed signature gene g among genes with expression value >0 in a cell; then the mean rank component Vmean is calculated as follows:

$$V_{mean} = \frac{\sum_{g=1}^{m} R_g}{m \times N}$$

where m represents the total number of expressed signature genes and N represents the total number of expressed genes in the cell. Because the mean is based on ranks, it is robust to scale normalization methods. It assesses the expression level of the signature using only genes detected in one cell, thus minimizing the effect of dropouts.
To assess signature enrichment in the expressed genes, we calculate the Odds Ratio (OR) using four variables: a = signature genes expressed, b = signature genes not expressed, c = non-signature genes expressed, and d = non-signature genes not expressed. The signature enrichment using OR is calculated as follows:

$$OR = \frac{a \times d}{b \times c}$$
In practice, c is unlikely to be zero. For smaller signatures, b can occasionally be 0. In that case, we replace it with 1. In our testing, we observed that 99% of the gene sets did not encounter this issue in any of the cells we analyzed. Despite this rare incidence, we provide the Likelihood Ratio (LR) as an alternative to the OR in the JASMINE function. LR is calculated as follows:

$$LR = \frac{a/(a+b)}{c/(c+d)} = \frac{a \times (c+d)}{c \times (a+b)}$$
OR and LR generate highly similar results according to our tests. OR (or LR) assesses the distribution of signature genes against the dropout background. Finally, Vmean and OR are linearly scaled to [0–1] and are averaged to generate JASMINE scores.
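A minimal R sketch of the full procedure for one signature is given below. This is our own illustration with placeholder names; the released JASMINE code on GitHub is the authoritative implementation.

```r
# JASMINE sketch: 'counts' is a genes x cells matrix with rownames;
# 'signature' is a vector of gene symbols.
jasmine <- function(counts, signature) {
  sig <- intersect(signature, rownames(counts))
  vmean <- apply(counts, 2, function(x) {
    r <- rank(x[x > 0])                  # ranks among expressed genes only
    N <- length(r)                       # total expressed genes in the cell
    mean(r[names(r) %in% sig]) / N       # Vmean = sum(Rg) / (m * N)
  })
  or <- apply(counts, 2, function(x) {
    a  <- sum(x[sig] > 0)                # signature genes expressed
    b  <- max(length(sig) - a, 1)        # b = 0 replaced with 1
    cc <- sum(x > 0) - a                 # non-signature genes expressed
    d  <- length(x) - length(sig) - cc   # non-signature genes not expressed
    (a * d) / (b * cc)                   # odds ratio
  })
  scale01 <- function(v) (v - min(v)) / (max(v) - min(v))
  (scale01(vmean) + scale01(or)) / 2     # average of the two scaled parts
}
```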
Effect size for gene set scores
We first filtered the 9835 gene sets to 7503 by requiring a minimum size of 20 genes. We scored these gene sets (C2, C3, and Hallmark) in tumor and normal cells and quantified the contrast using Cohen's d (Cohen, 1988), calculated with the R package 'effectsize' (Ben-Shachar et al., 2020). We used Cohen's d ≥ 1 as the threshold for positive (up) cases and d ≤ –1 for negative (down) cases, with respect to tumor versus normal cells.
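As an illustration, the classification of one gene set may be computed as in the following sketch, assuming precomputed per-cell score vectors tumor.scores and normal.scores (placeholder names):

```r
library(effectsize)

# Cohen's d for one gene set: standardized mean difference between the
# per-cell scores of tumor and normal cells.
d <- cohens_d(tumor.scores, normal.scores)$Cohens_d

# Thresholds used throughout the study.
status <- if (d >= 1) "up" else if (d <= -1) "down" else "unchanged"
```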
Gene signature simulation
We first identified differentially expressed genes for each dataset using MAST (Finak et al., 2015), a method that explicitly accounts for dropout rates. We then randomly drew N genes from the upregulated genes to generate an up gene set of size N, and similarly for down gene sets. Because in practice up and down regulated gene sets contain genes that have no expression change, or even changes in the opposite direction, we added noise to the simulated gene sets. For instance, when the noise level is set to 20%, an up gene set of size N has 20% of its genes drawn from the remaining genes rather than the upregulated genes. Following this procedure, we generated gene sets at the sizes of 50, 100, 150, 200, and 300 genes. For each size, we set the noise levels to 0, 20, 40, 60, and 80%. For each noise-size combination, we randomly generated 200 gene sets. In total, 5000 gene sets were generated per data set for each direction.
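A sketch of this sampling scheme is shown below, assuming vectors up.genes (MAST-derived upregulated genes) and all.genes (the gene universe); both names are ours:

```r
# Simulate one 'up' gene set of size n with a given noise fraction:
# (1 - noise) of the genes come from true upregulated genes, the rest
# are drawn from the remaining genes.
simulate_up_set <- function(up.genes, all.genes, n, noise) {
  n.noise  <- round(n * noise)
  n.signal <- n - n.noise
  c(sample(up.genes, n.signal),
    sample(setdiff(all.genes, up.genes), n.noise))
}

# 200 gene sets per size-noise combination, 5000 per dataset and direction.
sizes  <- c(50, 100, 150, 200, 300)
noises <- c(0, 0.2, 0.4, 0.6, 0.8)
up.sets <- lapply(sizes, function(n) lapply(noises, function(p)
  replicate(200, simulate_up_set(up.genes, all.genes, n, p),
            simplify = FALSE)))
```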
Down sampling
We subsetted 100 tumor cells per data set for down sampling. The R package 'scuttle' (McCarthy et al., 2017) was used to down sample each cell to 50% of its total coverage. The down sampled data sets were then scaled back to ensure equal total coverage before comparison. Down sampling was also performed at various levels (20, 40, 60, and 80%) to examine its impact on the scoring of different methods, especially ssGSEA.
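A sketch of the 50% down sampling step, assuming a raw counts matrix counts of the 100 selected tumor cells (placeholder name):

```r
library(scuttle)

# Down sample each of the 100 selected tumor cells to 50% total coverage.
down <- as.matrix(downsampleMatrix(counts, prop = 0.5, bycol = TRUE))

# Scale back so original and down-sampled cells have equal total coverage
# before scoring and comparison.
down.scaled <- sweep(down, 2, colSums(counts) / colSums(down), "*")
```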
Impact of dropouts on ssGSEA
We generated a toy example to demonstrate the impact of dropouts on ssGSEA scores. We used a signature comprising 100 genes: 99 fixed genes randomly selected from the top 1000 genes of the dummy expression matrix, plus one floating gene. Starting from the 1001st gene, we added the floating gene to the signature from each successive position down the remaining list. This allowed us to investigate how the signature score changes as the floating gene gradually moves from high expression into dropouts. We simulated the scenario with 0% (original scores) and 60% dropout rates. This simulation shows how dropout rates affect scoring by ssGSEA.
Data availability
Single-cell data sets used in this study, including their download sources, are listed in Supplementary file 1. Gene sets (C2, C3, and Hallmark modules) were downloaded from MSigDB v.7.2 (http://www.gsea-msigdb.org/gsea/msigdb/index.jsp). JASMINE and ssGSEA testing code is available on GitHub (https://github.com/NNoureen/JASMINE, copy archived at swh:1:rev:ba00996ad165ff471c6fada83e6cf76af50acdfa; Noureen, 2021b).
Data availability
The current manuscript is a computational study, so no new data have been generated for this manuscript. Single-cell data sets used in this study, including their download sources, are listed in Supplementary file 1. Gene sets were downloaded from MSigDB v.7.2. JASMINE source code is available on GitHub (https://github.com/NNoureen/JASMINE, copy archived at swh:1:rev:ba00996ad165ff471c6fada83e6cf76af50acdfa). Source data files contain the numerical data used to generate the figures.
- GitHub: Signature-scoring methods developed for bulk samples are not adequate for cancer single-cell RNA sequencing data.
- NCBI Gene Expression Omnibus, ID GSE72056: Single-cell expression of metastatic melanoma.
- NCBI Gene Expression Omnibus, ID GSE103322: Single-cell expression of head and neck cancer.
- Broad Single Cell Portal, ID SCP1288: Single-cell expression of advanced renal cell carcinoma.
- NCBI Gene Expression Omnibus, ID GSE144735: Single-cell expression of colorectal cancer.
- NCBI Gene Expression Omnibus, ID GSE125449: Single-cell expression of liver cancer.
- Broad Single Cell Portal, ID SCP393: Single-cell expression of IDH wildtype glioblastoma.
- Broad Single Cell Portal, ID SCP50: Single-cell expression of IDH mutant astrocytoma.
- NCBI Gene Expression Omnibus, ID GSE84465: Single-cell expression of glioblastoma.
- NCBI Gene Expression Omnibus, ID GSE75688: Single-cell expression of breast cancer.
- Broad Single Cell Portal, ID SCP1244: Single-cell expression of prostate cancer.
References
- SCENIC: single-cell regulatory network inference and clustering. Nature Methods 14:1083–1086. https://doi.org/10.1038/nmeth.4463
- effectsize: Estimation of Effect Size Indices and Standardized Parameters. Journal of Open Source Software 5:2815. https://doi.org/10.21105/joss.02815
- Single-cell transcriptional diversity is a hallmark of developmental potential. Science 367:405–411. https://doi.org/10.1126/science.aax0249
- Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19:562–578. https://doi.org/10.1093/biostatistics/kxx053
- Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33:1179–1186. https://doi.org/10.1093/bioinformatics/btw777
- Embracing the dropouts in single-cell RNA-seq analysis. Nature Communications 11:1169. https://doi.org/10.1038/s41467-020-14976-9
- Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352:189–196. https://doi.org/10.1126/science.aad0501
- Decoupling genetics, lineages, and microenvironment in IDH-mutant gliomas by single-cell RNA-seq. Science 355:eaai8478. https://doi.org/10.1126/science.aai8478
- Benchmarking algorithms for pathway activity transformation of single-cell RNA-seq data. Computational and Structural Biotechnology Journal 18:2953–2961. https://doi.org/10.1016/j.csbj.2020.10.007
Decision letter
- C Daniela Robles-Espinoza, Reviewing Editor; International Laboratory for Human Genome Research, Mexico
- Naama Barkai, Senior Editor; Weizmann Institute of Science, Israel
- Leng Han, Reviewer; The University of Texas Health Science Center at Houston McGovern Medical School, United States
Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "Signature-scoring methods developed for bulk samples are not adequate for cancer single-cell RNA sequencing data" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Patricia Wittkopp as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Leng Han (Reviewer #3).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
In this manuscript, Noureen and collaborators benchmark four methods that are used in single-cell RNA sequencing data analyses, including single sample gene set enrichment analysis (ssGSEA), AUCell, Single Cell Signature Explorer (SCSE), and a newly developed method, Jointly Assessing Signature Mean and Inferring Enrichment (JASMINE). The authors test these in distinct cancer datasets and conclude that caution should be exercised when using bulk sample-based methods in single-cell data analyses, and cellular contexts should be taken into consideration. Although interesting and informative, reviewers have raised a number of essential considerations that need to be addressed before publication.
Essential revisions:
Major Points
1. A reviewer comments, "The main analysis performed by the authors was identifying gene sets that significantly distinguish between cancer cells and the rest of the cells in the tumor. This a problematic comparison, since the cancer cells are epithelial cells (in most of the studies used) and the "normal" cells are stromal cells, mostly immune. Those are not comparable "normal" cells, and therefore it is expected that all immune-related pathways will be significant. The authors find much more down-regulated gene sets in ssGSEA compared to the other methods, but why are they wrong? If they are all immune related, I would actually conclude that ssGSEA is better than the other methods." They continue, "The comparison of cancer cells to normal microenvironment cells is meaningless. If looking at cancer, a relevant comparison would be cancer cells and normal adjacent cells. But cancer in general is not the best, as there would be way too many significant signatures, and its hard to understand what makes sense and what not". – Can the authors address this crucial observation please?
2. Can the methodology by which each method identifies signatures be explained? This would help identify why these methods perform differentially. Specifically, reviewers strongly suggest understanding what in ssGSEA makes it output different results. Gene dropouts could be the root cause, but this is something that needs proof. Empirically, reviewers suggest that this can be studied by comparing different scRNA-seq methods with different gene dropouts rates – compare 10X to SMART-seq. Do the authors have an idea of how this affects results?
3. The new method the authors propose (JASMINE) is not thoroughly explained and includes formulations which, in a reviewer's evaluation, lead to unintended mathematical behavior and hamper interpretation. Can the explanation on this method be expanded, please? Specifically, reviewers have raised the following points:
a. The four variables (a,b,c,d) considered when calculating the odds ratio make sense. However, the authors claim that "For smaller signatures b can be occasionally 0. In that case we replace it with 1." In a reviewer's words, "this is not a coherent and principled approach. I recommend the authors reconsider the mathematical formulation of this quantification to avoid the necessity of this unprincipled workaround."
b. A reviewer commented that "It is not clear why the average is the best way to combine the Vmean and OR to obtain the JASMINE score". Can this be specified, please?
[Editors’ note: further revisions were suggested prior to acceptance, as described below.]
Thank you for submitting your article "Signature-scoring methods developed for bulk samples are not adequate for cancer single-cell RNA sequencing data" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Naama Barkai as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Leng Han (Reviewer #3).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
In this revised manuscript, Noureen and collaborators benchmark five methods that are used in single-cell RNA sequencing data analyses, including single sample gene set enrichment analysis (ssGSEA), Gene Set Variation Analysis (GSVA), AUCell, Single Cell Signature Explorer (SCSE), and a newly developed method, Jointly Assessing Signature Mean and Inferring Enrichment (JASMINE). The authors test these in distinct cancer datasets and conclude that caution should be exercised when using bulk sample-based methods in single-cell data analyses, and cellular contexts should be taken into consideration. The majority of the reviewers have found the answers to their initial points to be thorough and satisfactory, and so we would only ask the authors to address the points of one new reviewer, who replaced a previous reviewer for this round of revisions.
Essential revisions:
1. It is unclear how calibrated any of these methods are from the analysis presented in this manuscript. To help clarity, please address:
a) Can the authors please detail, in the Methods section, which thresholds are used on each program?
b) Is any test statistic is derived from the JASMINE method?
c) How is the final statistic is derived? That is, what is the range that the statistic is scaled by? In particular, in the eyes of a reviewer, the math for Vmean is unclear. They write, "As written, it is computed only over a fixed index g? The range on the sum is a bit confusing as well. Shouldn't it be over all the cells? If I were to guess, Vmean appears to be Vmean = \sum_{g=1}^{m}\sum_{c=1}^{total cells} R_{g,c}/(m*N). Is this correct?". Please clarify these points in the Methods section.
2. This manuscript infers that the main advantage of methods specific for single-cell data is to account for dropouts. However, in the view of a reviewer, the work presented here does not explicitly compare the differences between the bulk and single-cell methods, and they note that there might be factors that influence the results besides dropouts in real data. They argue that it is inconclusive to attribute poor performance of bulk methods to dropouts, especially when there is only one bulk method included in the analysis. For example, it is unclear how the authors ran ssGSEA, as there are some missing details in the Methods section. In particular, it is common practice to 'pre-filter' RNA-seq data before running any analysis on it. Did the authors do this for the real data analysis? Please add an introduction to each method, and add the details missing so readers can recapitulate the test conditions.
3. From Figure 1D, the authors reveal that the non-cycling tumor cells exhibit higher scores than non-cycling normal cells by ssGSEA. However, it is also noticeable that ssGSEA scores have a much higher variance than the other three methods, and results from SCSE and AUCell scores do not show a significant difference between cycling and non-cycling cells. Are there any explanations and/or intuition for such a result?
4. In the "Dropouts affect ssGSEA scores" section (Line 167), the "dummy expression matrix of 1000 genes and nine columns" is not clearly described. Is it for each cell? Is the ranking of genes the same for each dropout rate? Besides, Figure 3B is not fully explained: Could the authors please explain why the black line has a curvy feature towards the end of x axis while the blue line is linear? How about other dropout rates? There should be nine lines in the figure according to the matrix.
https://doi.org/10.7554/eLife.71994.sa1
Author response
Essential revisions:
1. A reviewer comments, "The main analysis performed by the authors was identifying gene sets that significantly distinguish between cancer cells and the rest of the cells in the tumor. This a problematic comparison, since the cancer cells are epithelial cells (in most of the studies used) and the "normal" cells are stromal cells, mostly immune. Those are not comparable "normal" cells, and therefore it is expected that all immune-related pathways will be significant. The authors find much more down-regulated gene sets in ssGSEA compared to the other methods, but why are they wrong? If they are all immune related, I would actually conclude that ssGSEA is better than the other methods." They continue, "The comparison of cancer cells to normal microenvironment cells is meaningless. If looking at cancer, a relevant comparison would be cancer cells and normal adjacent cells. But cancer in general is not the best, as there would be way too many significant signatures, and its hard to understand what makes sense and what not". – Can the authors address this crucial observation please?
We agree with the reviewer that lineage is an important factor when comparing tumor and normal cells. Lineage-specific signatures are expected to differ, but expectations are not reliable for benchmarking purposes because tumor cells may influence the expression of immune and stromal cells, and cancers of different lineages do not necessarily show the same differences with normal cells. A more objective approach is to compare outputs of each method with ground truth. This is why we designed two objective experiments to benchmark the reliability of the tested methods. In the first experiment, we generated gold standard up/down signatures and tested how well each method rediscovers them (sensitivity test, Figure 2a-b in the initial manuscript). Note that generating these gold standard signatures does not rely on expectations, but rather on their constituent genes, which are sampled from the pool of differentially expressed genes. In the second experiment, we generated pseudo-cells by down sampling. These cells are identical to the original cells in the distribution of expressed genes; thus, no signature should be identified. This experiment tested how well each method avoids false positives (specificity test, Figure 2c in the initial manuscript). These experiments were designed precisely because of the challenges of establishing a functional gold standard, as the reviewer pointed out.
We gently correct the reviewer that we found more up-regulated, not down-regulated, gene sets with ssGSEA. We also respectfully remind the reviewer that in our manuscript, we never concluded that ssGSEA scores were wrong in the tumor vs normal comparisons. Instead, we concluded that “… ssGSEA is sensitive to variability of gene counts common to scRNAseq data” (lines 94-95 on page 4, first submitted version). We respectfully argue that the observation of many more deregulated signatures called by ssGSEA serves as a prelude to the more systematic benchmark analyses we laid out later in the manuscript, because it substantiates the hypothesis that signature scores can be biased if a method does not factor in dropouts in single cell data.
To give the reviewer and readers an intuitive example of how ssGSEA scores may mislead single cell interpretation, we calculated the score of the cell cycle signature (Gene Ontology, GO:0007049). As the reviewer is aware, cycling cells often need to be regressed out before clustering in single cell analysis. In this example, we used the head and neck data (Puram et al., 2017), and normal/tumor cell annotations were downloaded from the publication. We divided cells into cycling (S phase, G2M phase) and non-cycling cells (G1) using Seurat. Very few normal cells were determined as cycling, and thus, were removed from the figure. We additionally added the expression of KI67 to further confirm cell cycle status. KI67 is an established proliferation marker and is expressed by cycling cells but not quiescent cells. As shown in the figure, KI67 expression agrees well with Seurat scores. Notably, in tumor cells highlighted in the red box, ssGSEA scores were significantly higher than those in normal cells, suggesting the tumor cells may be cycling. However, neither these tumor cells nor the normal cells are cycling according to other evidence. Thus, at the minimum, ssGSEA scores complicate the interpretation of cell cycle status in this example. None of the single-cell methods showed the same bias.
This example was included in our preprint but was omitted in our initial submission. In the revision, we have added it back to support our conclusion. We again apologize for the omission, and hope the reviewer finds that this example makes biological sense.
2. Can the methodology by which each method identifies signatures be explained? This would help identify why these methods perform differentially. Specifically, reviewers strongly suggest understanding what in ssGSEA makes it output different results. Gene dropouts could be the root cause, but this is something that needs proof. Empirically, reviewers suggest that this can be studied by comparing different scRNA-seq methods with different gene dropouts rates – compare 10X to SMART-seq. Do the authors have an idea of how this affects results?
We thank the reviewers for suggesting new analyses to better understand the impact of dropouts on signature scores. Comparing SMART-seq with 10x does capture the differences in dropout rates. Among the 10 datasets we used, five were from SMART-seq, four from 10x, and one from the Fluidigm C1. However, one cell cannot be sequenced twice with two scRNAseq platforms. Therefore, it is infeasible to compare data of the same cells generated by the two platforms. Alternatively, we can compare the same cell population sequenced by the two. In a recent study (PMID: 33662621), Wang et al. sequenced the same samples of CD45- cells with both SMART-seq and 10x platforms. They showed that the RNA composition detected by the two platforms differs. Specifically, the 10x platform detected more ribosomal genes and lncRNAs, whereas SMART-seq detected more mitochondrial genes. The differences in RNA composition are critically important for RNA sequencing data analysis (see Robinson et al., Genome Biol, 2010. PMID: 20196867; and Zhao et al., RNA, 2020. PMID: 32284352), and thus, gene expression values are not directly comparable between the two platforms. For the same reason, gene signature scores are not directly comparable either. Another confounding factor is cell-to-cell variation, such as that caused by transcriptional bursting.
Down sampling is an established method in single cell RNAseq analysis to recapitulate the effects of differences in sequencing depth. The method increases technical dropout rates and creates a realistic expression profile at a prespecified depth (Luecken M and Theis F, Mol Syst Biol, 2019. PMID: 31217225). Using this method, we can generate the expression profile of the same cell at a lower depth (conversely, higher dropout) without confounding effects from other factors. As we show in Figure 2C, at the 50% down sampling rate, ssGSEA identifies many deregulated signatures between the original cells and down sampled cells across all ten datasets. These concerning results indicate dropouts likely affect ssGSEA much more than other methods.
To further prove this, we set down sampling levels to 20%, 40%, 60% and 80%. Here, 80% down sampling means after the procedure, the cell is at 80% sequencing depth of the original cell. As down sampling level decreases, more dropouts are created as the procedure primarily affects genes with low counts. New Figure 3A shows how the number of up and down signatures identified by ssGSEA shifts along with down sampling levels. Because the dropout rate is the only parameter we manipulated in this analysis, the results clearly indicate that it impacts ssGSEA scores. In contrast, single-cell methods do not show the same correlation. We replicated the results in other datasets (see Figure 3—figure supplement 1 in the revised manuscript).
An empirical approach to this question is to take advantage of cell-to-cell variation and examine how scoring patterns change as cells with similar dropout rates are compared. In new Figure 3-figure supplement 2, the top panel shows score correlation of a signature between JASMINE and the other three methods using all cells from the head and neck data. Normal cells are scored higher by single cell methods but not ssGSEA, despite that JASMINE and ssGSEA scores are nicely correlated in both normal and tumor cells. In the bottom panel, we limited tumor and normal cells to those with 4,000-5,000 expressed genes. With this filtering, tumor cells still have higher gene counts than normal cells on average (tumor, 4585; normal, 4407), but the difference is much smaller than using the entire dataset. Nevertheless, using these cells led to a much more consistent pattern between ssGSEA and single-cell methods, further supporting our conclusion.
We agree with the reviewers that more insights are needed, particularly related to ssGSEA. The method scores a signature by summing up differences in the empirical cumulative density of genes inside and outside the signature. To do so, ssGSEA orders and ranks genes from high to low expression and iterates through every gene. For bulk expression data, the ranks are continuous; thus, signature scores change gradually depending on the distribution of the signature genes on the rank list. Single cell data often encompass thousands of dropouts. Because dropout genes have equal expression values, i.e., zero, their ranks are tied, thus disrupting the presumption of a continuous data distribution. This predicates that if a signature has no gene in the dropout zone, the signature score is not affected by dropout rates. Otherwise, the score will be affected. We think this is an important issue, because using 10x, it is common to detect only 2,000-4,000 genes; even SMART-seq usually detects only ~6,000 genes. It is thus highly likely that genes from gene signatures fall into the dropout zone. Single cell methods dodge this issue for various reasons. AUCell and SCSE do not use dropout genes in scoring, whereas JASMINE explicitly considers the dropout rate in its calculation.
We next designed a computational approach to test this idea. We first generated a dummy expression vector with 10,000 genes. The genes were ranked from high to low based on expression, the same as ssGSEA. This expression vector mimics bulk expression data. Then we set the bottom of the vector to zeros to create dropouts at various rates. Next, we generated a highly expressed gene signature of 100 genes. This 100-gene signature consisted of 99 genes that were randomly sampled from the top 1000 genes, and the last gene (the floating signature gene) was iterated from the 1001st gene onwards to the bottom of the gene list. This allowed us to observe how the signature score changes as the floating signature gene moved towards the dropout zone. We have provided the source code of this process in our Github repository for the reviewers and other researchers to reproduce our results and test different scenarios using the dummy expression data (https://github.com/NNoureen/JASMINE).
In new Figure 3B, the black line shows the signature score at no dropout. The blue line shows the signature score when the floating gene enters the dropout zone at 60% dropout rate. At this rate, the dropout zone is between the gene rank 4,000 and 10,000. Before the floating signature gene enters the dropout zone (at gene rank 4,000), the signature score is identical between the original cell and down sampled cell. However, between gene rank 4,000 and 7,000, the signature is scored higher in the dropout cell than the control cell, and the pattern is reversed after gene rank 7,000. Note that gene rank 7,000 is the tie rank for all dropout genes. Clearly, the scores deviate from the intended behavior in the dropouts.
In the revision, we have added new Figure 3A, new Figure 3—figure supplement 2 and new Figure 3B to the manuscript. We have also discussed the reasons why these methods perform differently.
3. The new method the authors propose (JASMINE) is not thoroughly explained and includes formulations which, in a reviewer's evaluation, lead to unintended mathematical behavior and hamper interpretation. Can the explanation on this method be expanded, please? Specifically, reviewers have raised the following points:
a. The four variables (a,b,c,d) considered when calculating the odds ratio make sense. However, the authors claim that "For smaller signatures b can be occasionally 0. In that case we replace it with 1." In a reviewer's words, "this is not a coherent and principled approach. I recommend the authors reconsider the mathematical formulation of this quantification to avoid the necessity of this unprincipled workaround."
b. A reviewer commented that "It is not clear why the average is the best way to combine the Vmean and OR to obtain the JASMINE score". Can this be specified, please?
a. When calculating odds ratio, the denominator is b*c, where b is the number of signature genes that are not detected, and c is the number of nonsignature genes that are detected (see Author response table 1 for reviewers’ convenience).
A breakdown of signature genes in a single cell expression profile.
The four numbers are used in signature enrichment calculation by JASMINE.
| | Expressed | Not expressed |
|---|---|---|
| Signature gene | a | b |
| Non-signature gene | c | d |
In practice, c is unlikely to be zero. b would be zero if all genes in the signature are expressed. In this scenario, the signature is highly enriched in the expressed genes, and thus, should be given a high OR score. For example, if all 50 genes of a 50-gene signature are expressed in a cell, the signature is clearly highly enriched among expressed genes. Even if we assume one of the signature genes is not expressed, the enrichment pattern will be largely the same. We think adding a small number is a mathematical necessity that does not significantly change the biological interpretation. Our approach is not much different from other methods such as the Haldane correction, which adds a small number, such as 0.5, to each variable in the OR calculation.
We agree with the reviewer that this workaround may deviate scores from the formula for some signatures. To clarify how many signatures may be affected, the following Author response table 2 summarizes the number of signatures that have b=0 in none of the cells, 1 cell, 2 cells, or ≥3 cells. The number in the parenthesis indicates percentage. The total number of cells analyzed in each dataset is also given in the column name. On average, 99% of the signatures we analyzed did not encounter this issue in any of the cells in the ten datasets. Given that each dataset has hundreds to thousands of cells, this issue is extremely rare.
Counts and frequencies of signatures with b=0 in 0, 1, 2 and ≥3 cells across datasets.
The number of cells analyzed in each dataset is given in the parenthesis at the column name.
| cell # (b=0) | Astrocytoma (n=6176) | BRCA (n=515) | IDHwtGBM (n=5742) | HNSC (n=5560) | GBM (n=3589) |
|---|---|---|---|---|---|
| 0 | 7425 (98.96) | 7438 (99.13) | 7395 (98.56) | 7194 (95.88) | 7326 (99.84) |
| 1 | 33 (0.44) | 25 (0.33) | 33 (0.44) | 161 (2.15) | 7 (0.1) |
| 2 | 18 (0.24) | 10 (0.13) | 17 (0.23) | 46 (0.61) | 2 (0.03) |
| ≥3 | 27 (0.36) | 30 (0.4) | 58 (0.6) | 102 (1.36) | 3 (0.04) |

| cell # (b=0) | LIHC (n=5023) | CRC (n=3336) | ccRCC (n=3933) | Prostate (n=2170) | Melanoma (n=4645) |
|---|---|---|---|---|---|
| 0 | 7482 (99.72) | 7496 (99.91) | 7489 (99.81) | 7493 (99.87) | 7253 (96.67) |
| 1 | 8 (0.11) | 2 (0.03) | 7 (0.09) | 7 (0.09) | 79 (1.05) |
| 2 | 7 (0.09) | 3 (0.04) | 1 (0.01) | 2 (0.03) | 43 (0.57) |
| ≥3 | 6 (0.08) | 2 (0.03) | 6 (0.08) | 1 (0.01) | 128 (1.71) |
Inspired by the reviewer's comments, we have provided the likelihood ratio as an alternative to the odds ratio in the revision. The first likelihood calculates the probability of a signature gene being detected as a/(a+b), and the second likelihood calculates the probability of a non-signature gene being detected as c/(c+d). The second likelihood essentially provides a background and approximately equals the fraction of expressed genes among all genes. The likelihood ratio is calculated as the ratio of the two likelihoods, i.e., LR = a*(c+d)/(c*(a+b)). This effectively avoids the zero-denominator issue. In our testing, the likelihood ratio and odds ratio gave highly consistent results: effect size correlation was higher than 0.95 in four of the five datasets tested. The new function has been implemented in the JASMINE code.
In the revision, we have added more details to the method description, and clarified the extent to which b=0 may affect signature scores. The JASMINE code has also been updated on Github. We thank the reviewer for prompting us to improve the JASMINE function.
b. We cannot say the average is the best way to combine the two. As insight evolves, there might be better and more sensible approaches to signature scoring. During our analysis, we tested multiple ways, including arithmetic mean, geometric mean, and some other non-linear methods. After the tests, we felt equal contribution from the two score components makes best sense, largely because we do not have a good reasoning to overweigh one option over another.
[Editors’ note: further revisions were suggested prior to acceptance, as described below.]
Essential revisions:
1. It is unclear how calibrated any of these methods are from the analysis presented in this manuscript. To help clarity, please address:
a) Can the authors please detail, in the Methods section, which thresholds are used on each program?
When we calculated scores with each method, no thresholds were used. However, when we compared signature scores between tumor and normal cells, we used Cohen’s d > 1 as criterion for up regulation and < -1 for down regulation. These thresholds have been clarified in the main text and Methods.
b) Is any test statistic is derived from the JASMINE method?
JASMINE does not derive a test statistic, similar to other methods benchmarked in the study. These methods calculate a score for an input signature in a cell/sample. These scores can be compared across cells/samples for differences in signature expression. Compared with methods such as regular GSEA, these single sample methods do not rely on class labels, and thus, offer better flexibility for downstream analyses. In our revised manuscript, we have provided further context in Discussion.
c) How is the final statistic is derived? That is, what is the range that the statistic is scaled by? In particular, in the eyes of a reviewer, the math for Vmean is unclear. They write, "As written, it is computed only over a fixed index g? The range on the sum is a bit confusing as well. Shouldn't it be over all the cells? If I were to guess, Vmean appears to be Vmean = \sum_{g=1}^{m}\sum_{c=1}^{total cells} R_{g,c}/(m*N). Is this correct?". Please clarify these points in the Methods section.
We thank the reviewer for meticulously checking the formula, and yes, it is indeed a bit confusing. To make it clearer, we have revised it as follows:
Assume Rg represents the rank of an expressed signature gene g among the genes with expression value > 0 in a cell. With m expressed signature genes and N expressed genes in total in that cell, the mean rank Vmean is calculated per cell (not summed over cells) as:

Vmean = \sum_{g=1}^{m} R_g / (m \times N)

The final score is defined as the arithmetic mean of the two signature components: the signature mean, defined by Vmean, and the signature enrichment, defined by the odds ratio or likelihood ratio. Each component is scaled to a range of 0-1 across cells before averaging.
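For intuition, here is a condensed R sketch of this scoring logic; the function and variable names are ours, and the maintained implementation on GitHub should be treated as authoritative:

```r
# Condensed sketch of JASMINE scoring for a single signature.
# Assumes expr is a genes x cells matrix with rownames and that all
# signature genes appear among its rows.
jasmine_sketch <- function(expr, sig, method = c("oddsratio", "likelihood")) {
  method <- match.arg(method)

  # Component 1: mean rank of expressed signature genes among all
  # expressed genes in each cell, normalized by N (Vmean).
  vmean <- apply(expr, 2, function(x) {
    expressed <- x > 0
    N <- sum(expressed)
    r <- rank(x[expressed])          # ranks among expressed genes only
    mean(r[names(r) %in% sig]) / N
  })

  # Component 2: enrichment of signature genes among expressed genes.
  enrich <- apply(expr, 2, function(x) {
    a  <- sum(x[sig] > 0)            # expressed signature genes
    b  <- length(sig) - a            # non-expressed signature genes
    cc <- sum(x > 0) - a             # expressed non-signature genes
    dd <- sum(x == 0) - b            # non-expressed non-signature genes
    if (method == "oddsratio") (a * dd) / (b * cc)
    else a * (cc + dd) / (cc * (a + b))
  })

  # Scale each component to 0-1 across cells, then average.
  minmax <- function(v) (v - min(v)) / (max(v) - min(v))
  (minmax(vmean) + minmax(enrich)) / 2
}
```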
2. This manuscript infers that the main advantage of methods specific for single-cell data is to account for dropouts. However, in the view of a reviewer, the work presented here does not explicitly compare the differences between the bulk and single-cell methods, and they note that there might be factors that influence the results besides dropouts in real data. They argue that it is inconclusive to attribute poor performance of bulk methods to dropouts, especially when there is only one bulk method included in the analysis. For example, it is unclear how the authors ran ssGSEA, as there are some missing details in the Methods section. In particular, it is common practice to 'pre-filter' RNA-seq data before running any analysis on it. Did the authors do this for the real data analysis? Please add an introduction to each method, and add the details missing so readers can recapitulate the test conditions.
We appreciate the reviewer’s concern. We did pre-filter the single-cell RNA-seq datasets by removing genes that were not detected in any cell. As for how the methods were run, we ran ssGSEA and GSVA using the R Bioconductor package GSVA with default parameters, and we ran AUCell using its package, also with default parameters. SCSE does not provide R code, only a web application; for a fair comparison, we implemented it in R according to the formula reported in its publication. These details have been added to the Methods section, and we have included a brief introduction to each method in the revised manuscript.
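For transparency, a sketch of the default invocations using the classic gsva() interface; object names such as expr and gene_sets are illustrative, and the SCSE function reflects our reading of the formula in Pont et al., 2019:

```r
library(GSVA)    # provides both ssGSEA and GSVA
library(AUCell)

# expr: genes x cells expression matrix after removing genes undetected
# in all cells; gene_sets: a named list of signature gene vectors.
ssgsea_scores <- gsva(expr, gene_sets, method = "ssgsea")  # defaults otherwise
gsva_scores   <- gsva(expr, gene_sets, method = "gsva")

rankings   <- AUCell_buildRankings(expr)                   # default parameters
auc_scores <- AUCell_calcAUC(gene_sets, rankings)

# SCSE re-implemented from the published formula: signature expression
# summed per cell and normalized by the cell's total expression
# (scaled to a percentage, per our reading of the publication).
scse_score <- function(expr, sig) {
  colSums(expr[rownames(expr) %in% sig, , drop = FALSE]) / colSums(expr) * 100
}
```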
3. From Figure 1D, the authors reveal that the non-cycling tumor cells exhibit higher scores than non-cycling normal cells by ssGSEA. However, it is also noticeable that ssGSEA scores have a much higher variance than those from the other three methods, and scores from SCSE and AUCell do not show a significant difference between cycling and non-cycling cells. Are there any explanations and/or intuition for such a result?
We thank the reviewer for raising this great question, which prompted us to include gene counts in the figure. Visually, it is clear that ssGSEA scores are associated with gene counts, which we think further strengthens our conclusion. We have replaced the original Figure 1D with the new one in our revised manuscript.
4. In the "Dropouts affect ssGSEA scores" section (Line 167), the "dummy expression matrix of 1000 genes and nine columns" is not clearly described. Is it for each cell? Is the ranking of genes the same for each dropout rate? Besides, Figure 3B is not fully explained: Could the authors please explain why the black line has a curvy feature towards the end of x axis while the blue line is linear? How about other dropout rates? There should be nine lines in the figure according to the matrix.
The dummy expression matrix of 10,000 genes and nine columns reflects a scenario in which the first column represents a cell with no dropouts, and columns 2 through 9 represent the same cell with dropout rates ranging from 10% to 80% in steps of 10%. The ranking of genes is identical in every column. We agree with the reviewer that stating nine columns in the text may cause confusion because we did not show nine lines in the figure. We showed only the 60% dropout rate (blue line) because one line is sufficient to show the deviation in the dropout zone, whereas multiple lines would make the figure harder to read and, more importantly, would be unnecessary given that they repeat the same message. The code for this simulation is available in our GitHub repository (https://github.com/NNoureen/JASMINE). We respectfully encourage the reviewer to check it.
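A simplified sketch of that setup follows; the expression values are illustrative, and the actual simulation code is in the repository linked above:

```r
# Dummy matrix: one cell, nine dropout variants (0% to 80% in 10% steps).
# Gene ranking is identical across columns; dropouts zero the tail genes.
set.seed(1)
n_genes <- 10000
base  <- sort(rexp(n_genes), decreasing = TRUE)  # shared expression ranking
rates <- seq(0, 0.8, by = 0.1)
expr <- sapply(rates, function(rate) {
  x <- base
  n_drop <- round(rate * n_genes)
  if (n_drop > 0) x[(n_genes - n_drop + 1):n_genes] <- 0
  x
})
rownames(expr) <- paste0("g", seq_len(n_genes))
colnames(expr) <- paste0("dropout_", rates * 100, "pct")

# Scores can then be computed while moving the last signature gene down
# the ranking ("floating gene"), e.g., with GSVA's ssGSEA:
# gsva(expr, list(sig = c(paste0("g", 1:49), "g9000")), method = "ssgsea")
```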
As for the curvy tail of the ssGSEA scores in the cell without dropouts, that is how ssGSEA penalizes genes at the bottom of the ranked gene list, as shown in our floating-gene simulation: ssGSEA penalizes a signature gene less when it appears near the bottom of the list. In this experiment, we changed only the position of the last signature gene while keeping all other settings the same. When the rankings of the bottom genes become identical because of dropouts, the penalization becomes linear, and the score declines as a straight line. We think these data compellingly show how dropouts can affect ssGSEA scoring.
In the revision, we have edited the text to mitigate this confusion. We thank the reviewer for pointing this out.
https://doi.org/10.7554/eLife.71994.sa2
Article and author information
Author details
Funding
Cancer Prevention and Research Institute of Texas (RR170055)
- Siyuan Zheng
Cancer Prevention and Research Institute of Texas (RP170345)
- Nighat Noureen
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was supported by CPRIT (RR170055 to SZ). NN was supported by a CPRIT postdoctoral fellowship award (RP170345).
Senior Editor
- Naama Barkai, Weizmann Institute of Science, Israel
Reviewing Editor
- C Daniela Robles-Espinoza, International Laboratory for Human Genome Research, Mexico
Reviewer
- Leng Han, The University of Texas Health Science Center at Houston McGovern Medical School, United States
Version history
- Preprint posted: July 1, 2021 (view preprint)
- Received: July 6, 2021
- Accepted: February 25, 2022
- Accepted Manuscript published: February 25, 2022 (version 1)
- Version of Record published: March 11, 2022 (version 2)
Copyright
© 2022, Noureen et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
- Page views: 9,527
- Downloads: 614
- Citations: 8
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.