Artificial intelligence approaches for tumor phenotype stratification from single-cell transcriptomic data

  1. Namrata Bhattacharya
  2. Anja Rockstroh
  3. Sanket Suhas Deshpande
  4. Sam Koshy Thomas
  5. Anunay Yadav
  6. Chitrita Goswami
  7. Smriti Chawla
  8. Pierre Solomon
  9. Cynthia Fourgeux
  10. Gaurav Ahuja
  11. Brett Hollier
  12. Himanshu Kumar
  13. Antoine Roquilly
  14. Jeremie Poschmann
  15. Melanie Lehman
  16. Colleen C Nelson (corresponding author)
  17. Debarka Sengupta (corresponding author)
  1. Australian Prostate Cancer Research Centre-Queensland, Faculty of Health, School of Biomedical Sciences, Centre for Genomics and Personalised Health, Queensland University of Technology, Australia
  2. Department of Computer Science and Engineering, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla, Phase III, India
  3. Translational Research Institute, Princess Alexandra Hospital, Australia
  4. Department of Computational Biology, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla, Phase III, India
  5. School of Mathematical Sciences, The University of Adelaide, Australia
  6. Center for Computational Biomedicine, Harvard Medical School, United States
  7. Nantes Université, CHU Nantes, INSERM, Center for Research in Transplantation and Translational Immunology, UMR, France
  8. Centre for Artificial Intelligence, Indraprastha Institute of Information Technology-Delhi (IIIT-Delhi), Okhla, Phase III, India
  9. Laboratory of Immunology and Infectious Disease Biology, Department of Biological Sciences, Indian Institute of Science Education and Research (IISER), India
  10. Vancouver Prostate Centre, Department of Urologic Sciences, University of British Columbia, Canada
16 figures, 7 tables and 3 additional files

Figures

SCellBOW workflow.

(a) Schematic overview of SCellBOW workflow for identifying cellular clusters and assessing the aggressiveness of the predicted clusters. For SCellBOW clustering, firstly, a corpus was created from the gene expression matrix, where cells were analogous to documents and genes to words. Next, the pre-trained model was retrained with the vocabulary of the target dataset. Then, clustering was performed on embeddings generated from the neural network. For SCellBOW phenotype algebra, vectors were created for reference (total tumor) and queries. Then, the query vector was subtracted from the reference vector to calculate the predicted risk score using a bootstrapped random survival forest. Finally, survival probability was evaluated, and phenotypes were stratified by the median predicted risk score. Created using BioRender.com.
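
The corpus-creation step in panel (a) can be illustrated with a minimal sketch. The caption only states that cells play the role of documents and genes the role of words. The encoding below, which repeats each gene name in proportion to its expression and then shuffles the words, is an assumption made for illustration, and the names `cell_to_sentence`, `build_corpus`, and `scale` are hypothetical rather than part of the SCellBOW API.

```python
import random


def cell_to_sentence(expression, scale=1.0, seed=0):
    """Turn one cell's expression profile into a shuffled list of gene-name words.

    expression: dict mapping gene name -> non-negative expression value.
    Each gene is repeated round(value * scale) times (an assumed encoding).
    """
    words = []
    for gene, value in expression.items():
        words.extend([gene] * int(round(value * scale)))
    rng = random.Random(seed)  # local RNG keeps the shuffle reproducible
    rng.shuffle(words)
    return words


def build_corpus(cells, scale=1.0, seed=0):
    """Build one sentence per cell; each cell gets its own derived seed."""
    return {
        cell_id: cell_to_sentence(expr, scale=scale, seed=seed + i)
        for i, (cell_id, expr) in enumerate(cells.items())
    }


# Toy input: two cells and three genes (values are made up for illustration).
toy_cells = {
    "cell_1": {"KLK3": 3.0, "AR": 2.0, "SYP": 0.0},
    "cell_2": {"KLK3": 0.0, "AR": 1.0, "SYP": 4.0},
}
corpus = build_corpus(toy_cells)
```

Each value in `corpus` is a list of gene names that a document-embedding model could treat as one document. Genes with zero expression contribute no words.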

Evaluation of single-cell representations using SCellBOW.

(a–c) UMAP plots for the normal prostate (a), PBMC (b), and pancreas (c) datasets. The coordinates are colored by cell types. (d-f) UMAP plots for normal prostate (d), PBMC (e), and pancreas (f) datasets, where the coordinates are colored by SCellBOW clusters. CL is used as an abbreviation for cluster. (g–i) Radial plot for the percentage of contribution of different methods towards ARI for various resolutions ranging from 0.2 to 2.0. ItClust is a resolution-independent method; thus, the ARI is kept constant across all the resolutions. (j) Box plot for the NMI of different methods across different resolutions ranging from 0.2 to 2.0 in steps of 0.2. (k) Bar plot for the cell type silhouette index (SI) for different methods. The default resolution was set to 1.0.
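
For readers unfamiliar with the ARI reported in panels (g–i), the following is a minimal standard-library sketch of the adjusted Rand index computed from pair counts. It is not the evaluation code used in the paper, and the toy labels are invented for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    # Contingency counts: cells sharing a (true label, predicted label) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    expected = sum_true * sum_pred / total_pairs
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every cell in a single cluster
    return (sum_pairs - expected) / (max_index - expected)


# Toy example: annotated cell types versus two candidate clusterings.
cell_types = ["B", "B", "T", "T", "NK", "NK"]
perfect = adjusted_rand_index(cell_types, [0, 0, 1, 1, 2, 2])
coarse = adjusted_rand_index(cell_types, [0, 0, 0, 1, 1, 1])
```

A clustering that matches the annotation up to a relabeling of clusters scores 1, while the coarser two-cluster solution scores lower.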

Evaluation of in-house splenocytes and matched PBMC dataset.

(a) An experiment schematic diagram highlighting the sites of the organs for tissue collection and sample processing. In this matched PBMC-splenocyte CITE-seq experiment, PBMCs and splenocytes were collected, followed by high-throughput sequencing and downstream analyses. Created using BioRender.com. (b, c) UMAP plots for SCellBOW embedding colored by donors (b) and cell types (c). (d) The UMAP plots for the embedding of SCellBOW compared to different benchmarking methods. The coordinates of all the plots are colored by cell type annotation results using Azimuth. (e) Bar plot for ARI, NMI, cell type SI at resolution 1.0. (f, g) Alluvial plots for Azimuth cell types mapped to SCellBOW clusters (f) and Scanpy clusters (g). The resolution of SCellBOW was set to 1.0. CL is used as an abbreviation for cluster.

Phenotype algebra on GBM and BRCA known molecular subtypes.

(a) Heatmap for GSVA score for three molecular subtypes of GBM: CLA, MES, and PRO, grouped by SCellBOW clusters at resolution 1.0. (b) UMAP plot for the embedding of BRCA target dataset colored by PAM50 molecular subtype. (c) Survival plot for GBM molecular subtypes based on phenotype algebra. (d) Violin plot for predicted risk scores for GBM molecular subtypes, with n = 50 bootstrapped models per subtype. (e) Survival plot for BRCA molecular subtypes based on phenotype algebra. The total tumor is denoted by T. (f) Violin plot for predicted risk scores for BRCA molecular subtypes with n = 50 bootstrapped models per subtype.
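
The vector arithmetic behind phenotype algebra can be sketched as follows. The workflow caption describes subtracting a query vector from the reference (total tumor) vector and scoring the result with bootstrapped survival models. This sketch covers only the subtraction and a ranking by median risk; the survival model itself is not implemented. Averaging per-cell embeddings to obtain a group vector is an assumption, all function names are hypothetical, and the risk scores are made-up toy values.

```python
from statistics import median


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (assumed group summary)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def subtract(reference, query):
    """Reference minus query: the total tumor with the query phenotype removed."""
    return [r - q for r, q in zip(reference, query)]


def rank_by_median_risk(risk_scores):
    """Order phenotypes from highest to lowest median bootstrapped risk score."""
    medians = {name: median(scores) for name, scores in risk_scores.items()}
    return sorted(medians, key=medians.get, reverse=True)


# Toy embeddings: two cells form the total tumor T; the query is one phenotype.
total_tumor = mean_vector([[1.0, 2.0], [3.0, 4.0]])
query = [0.5, 1.0]
residual = subtract(total_tumor, query)  # vector that would be passed to the survival model

# Toy risk scores, standing in for the output of bootstrapped survival models.
toy_risk = {
    "T - A": [0.2, 0.4, 0.3],
    "T - B": [0.7, 0.6, 0.8],
    "T - C": [0.5, 0.5, 0.4],
}
ranking = rank_by_median_risk(toy_risk)
```

In the figure, each violin summarizes n = 50 bootstrapped scores per subtype; the toy dictionary uses three scores per entry only to keep the example short.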

Phenotype algebra on mCRPC known molecular subtypes based on AR- and NE-activity.

(a) Schematic of the transdifferentiation states underlying lineage plasticity that occurs during mCRPC progression from ARPC to NEPC. Created using BioRender.com. (b) Scatter plot of GSVA scores of ARPC and NEPC gene sets; K-means clustering was used to allocate cells into the three high-level categories: ARAH, ARAL, and NEPC. (c) UMAP plot for projection of SCellBOW embedding colored by ARAH, ARAL, and NEPC. (d) Heatmap showing the top differentially expressed genes (y-axis) between each high-level category (x-axis) and all other cells, tested with a Wilcoxon rank-sum test. (e) Survival plot for mCRPC cancer phenotypes based on phenotype algebra. The total tumor is denoted by T. (f) Violin plot for predicted risk scores for mCRPC phenotypes (ARAH, ARAL, and NEPC), with n = 50 bootstrapped models per subtype. (g) Survival plot for mCRPC tumor microenvironment phenotypes based on phenotype algebra. The total tumor is denoted by T. (h) Violin plot of predicted risk scores for mCRPC tumor microenvironment phenotypes, comparing tumor and normal cells, with n = 50 bootstrapped models per group.

Phenotype algebra on He et al., 2021 mCRPC data based on SCellBOW clusters.

(a) UMAP plot for projection of embeddings with coloring based on the SCellBOW clusters at resolution 0.8. CL is used as an abbreviation for cluster. (b) Violin plot of phenotype algebra-based risk scores for each SCellBOW cluster. (c) Patient and organ site distribution across the SCellBOW clusters. (d) Distribution of cells from the three high-level groups (ARAH, ARAL, and NEPC) across the SCellBOW clusters. (e) Bubble plot of row-scaled GSVA scores for custom curated gene sets containing activated and repressed AR- and NE-signatures. (f) Correlation plot of six phenotypic categories based on DSP gene expression correlated with the SCellBOW clusters based on scRNA-seq gene expression. The six phenotypic categories are defined by Brady et al., 2021 based on the activity of AR and NE programs. (g) Top gene sets correlated with SCellBOW clusters. Signatures were collected from the C2 'curated', C5 'Gene Ontology', and H 'hallmark' gene sets from MSigDB (Liberzon et al., 2015). Gene sets are ranked by row-scaled GSVA scores of one cluster against all others.

Appendix 2—figure 1
Cell embeddings visualization.

(a-c) The UMAP plots showing embedding of SCellBOW compared to different existing methods benchmarked on normal prostate (a), peripheral blood mononuclear cells (PBMC) (b), and pancreas datasets (c). The coordinates of all the plots are colored by true cell types.

Appendix 2—figure 2
Cell embedding visualization of the normal prostate scRNA-seq dataset.

(a–f) The UMAP plots showing embedding of SCellBOW, Scanpy, Seurat, ItClust, ProjectR, and DESC on normal prostate. The coordinates of all the plots are colored by clusters. (g–l) Alluvial plots showing the mapping of clusters resulting from the benchmarking tools onto the true cell types from Henry et al. normal prostate dataset. CL is used as an abbreviation for cluster.

Appendix 2—figure 3
Extended performance evaluation of SCellBOW with scBERT and scPhere.

(a–d) Radial plot for the percentage of contribution of different methods towards ARI for various resolutions ranging from 0.2 to 2.0 for normal prostate (a), PBMC (b), pancreas (c), and CITE-seq (d) datasets. (e) Box plot for the NMI of different methods across different resolutions ranging from 0.2 to 2.0 in steps of 0.2. (f) Bar plot for the cell type silhouette index (SI) for different methods. The default resolution was set to 1.0.

Appendix 2—figure 4
Extended analysis of in-house CITE-seq dataset.

(a) Violin plot showing the distribution of UMIs for singlets, doublets, and negative cells. (b) The UMAP plots showing embedding of SCellBOW colored by donors. (c) The UMAP plots showing embedding of SCellBOW colored by tissue of origin. (d) Dot plot to check the expression of marker genes of PBMC per cell type identified by Azimuth. (e) Bar plot showing the proportion of annotated cell types across different clusters of SCellBOW. (f) Compositional difference in the proportion of annotated cell types in PBMC vs. splenocytes. (g) Heatmap for annotated cell type-wise differentially expressed genes in each cell type. (h) UMAP plots of the cells colored by their tissue source. (i) Volcano plot showing the differential genes (red dots) in the spleen and PBMC for B cells (p-value < 0.05, false discovery rate (FDR) < 0.01). (j) Donut plot showing the compositional difference in the proportion of B cell subtypes (B naive, B effector, and plasmablast) in PBMC and spleen. (k) UMAP plot showing the embedding of SCellBOW colored by B cell subtypes. (l, m) Volcano plot showing the differential genes (red dots) in the spleen and PBMC for B effector (l) and B naive cells (m) (p-value < 0.05, FDR < 0.01).

Appendix 2—figure 5
Cell embedding visualization of the in-house CITE-seq scRNA-seq dataset.

(a–f) The UMAP plots showing embedding of SCellBOW, Scanpy, Seurat, ItClust, ProjectR, and DESC on the PBMC-spleen dataset. The coordinates of all the plots are colored by clusters. (g–l) The alluvial plots showing the mapping of clusters resulting from the benchmarking tools onto the cell types identified by Azimuth.

Appendix 2—figure 6
Survival risk inference using phenotype algebra on raw gene expression data.

(a–c) Phenotype algebra-based risk scores using the gene expression profiles of GBM molecular subtypes (a), PAM50 molecular subtypes of BRCA (b), and three high-level categories of mCRPC (c). The total tumor is denoted by T. (d–f) Heatmaps of -log10(p-value) for the predicted risk scores of GBM subtypes (d), BRCA subtypes (e), and mCRPC clusters (f), computed using an unpaired one-sided Wilcoxon test.

Appendix 2—figure 7
Survival risk inference using fixed-length embeddings from scETM, scPhere, and scBERT.

(a–c) Phenotype algebra-based risk scores of GBM molecular subtypes using fixed-length embeddings from scETM (a), scPhere (b), and scBERT (c). The total tumor is denoted by T. (d–f) Phenotype algebra-based risk scores of PAM50 molecular subtypes of BRCA using fixed-length embeddings from scETM (d), scPhere (e), and scBERT (f). (g–i) Phenotype algebra-based risk scores of three high-level categories of mCRPC using fixed-length embeddings from scETM (g), scPhere (h), and scBERT (i).

Appendix 2—figure 8
Extended analysis of He et al., 2021 single-cell mCRPC dataset.

(a) Elbow plot for selecting the best K for K-means clustering. (b, c) Scatter plot of GSVA scores of ARPC and NEPC gene sets colored by the K-means clusters (b) and SCellBOW clusters (c). (d) UMAP plot visualizing the high-level ARAH, ARAL, and NEPC categories on SCellBOW embeddings. (e, f) The UMAP plots showing the embedding of SCellBOW colored by metastasis site (e) and donors (f). (g) Alluvial plot to visualize tumor metastasis site of the donors. (h) UMAP plot visualizing SCellBOW embeddings of tumor microenvironment cells (malignant + non-malignant) in the He et al. dataset based on author annotations.

Appendix 2—figure 9
Assessing the quality of clustering using transfer learning on BOW models using the pancreas dataset.

(a–c) UMAP plot of Scanpy embedding (a), Doc2Vec embedding (b), and SCellBOW embedding (c) of pancreas dataset using Leiden clustering at resolution 1.0. (d–f) UMAP plot of Scanpy embedding (d), Doc2Vec embedding (e), and SCellBOW embedding (f) of pancreas dataset colored by their annotated cell types. (g–i) Alluvial plots for cell types against Leiden clusters for Scanpy (g), Doc2Vec (h), and SCellBOW (i). (j) Bar plot for ARI, NMI, cluster purity, and silhouette index (cell type and cluster).

Appendix 2—figure 10
Assessing the effect of random seed on the quality of cell sentence generation using the pancreas dataset.

(a–f) The UMAP plots showing the embedding of SCellBOW generated using cell sentences with different random seeds ranging from 2 to 25. The colors of the cells in the UMAP plots indicate clusters. (g) Heatmap showing the ARI between each pair of clustering outcomes with distinct seeds.

Tables

Appendix 1—table 1
Overview of tools and benchmarking methods used in this paper.
Appendix 1—table 2
Summary of datasets analyzed in this paper.
| Model | Dataset | Tissue | Technology | Data type | Cell/sample detected | Used in SCellBOW | Data used as | Cell filter | Gene filter | HVG |
|---|---|---|---|---|---|---|---|---|---|---|
| Normal Prostate | Karthaus et al., 2020 | Human primary prostate cancer | 10X | TPM | 120,300 | Clustering | Source | 200 | 20 | 5000 |
| | Henry et al., 2018 | Human normal prostate | 10X | Raw count | 28,702 | Clustering | Target | 200 | 3 | 3000 |
| PBMC | Zheng et al., 2017 | Human PBMC | 10X | Raw count | 68,579 | Clustering | Source | 200 | 20 | 5000 |
| | Zheng et al., 2017 | Human PBMC | 10X | Raw count | 2,700 | Clustering | Target | 200 | 20 | 2000 |
| Pancreas | Baron et al., 2016 | Human pancreas | inDrop | Raw count | 8,562 | Clustering | Source | 200 | 20 | 2000 |
| | Muraro et al., 2016 | Human pancreas | CEL-Seq2 | Raw count | 2,042 | Clustering | Source | | | |
| | Wang et al., 2016 | Human pancreas | SMARTer | Raw count | 430 | Clustering | Source | | | |
| | Segerstolpe et al., 2016 | Human pancreas | Smart-Seq2 | Raw count | 2,068 | Clustering | Target | 200 | 3 | 2000 |
| GBM | Neftel et al., 2019 | Human glioblastoma | 10X | Raw count | 12,074 | Algebra | Source | 200 | 20 | 1000 |
| | Couturier et al., 2020 | Human glioblastoma | 10X | Raw count | 4,508 | Algebra | Target | 200 | 3 | 1000 |
| | TCGA-GBM Weinstein et al., 2013* | Human glioblastoma | Bulk RNA-seq | Raw count | 613 | Algebra | Survival | | | |
| BRCA | Wu et al., 2020 | Human breast cancer | 10X | Raw count | 24,271 | Algebra | Source | 200 | 20 | 1000 |
| | Zhou et al., 2021 | Human breast cancer | Smart-seq2 | Raw count | 545 | Algebra | Target | 200 | 3 | 1000 |
| | TCGA-BRCA Weinstein et al., 2013* | Human breast cancer | Bulk RNA-seq | Raw count | 1,079 | Algebra | Survival | | | |
| mCRPC | He et al., 2021 | Human metastatic prostate cancer | Smart-Seq2 | TPM | 836 | Algebra | Target | 200 | 3 | 1000 |
| | Abida et al., 2019 | Human metastatic prostate cancer | Bulk RNA-seq | TPM | 81 | Algebra | Survival | | | |

* Data downloaded from https://www.cancer.gov/tcga.

Appendix 1—table 3
Summary of evaluation metrics for all target datasets analyzed in this paper at resolution = 1.0.
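| Method | Normal Prostate: ARI | Normal Prostate: NMI | Normal Prostate: SI (cell type) | 3K PBMC: ARI | 3K PBMC: NMI | 3K PBMC: SI (cell type) | Pancreas: ARI | Pancreas: NMI | Pancreas: SI (cell type) | In-house CITE-seq: ARI | In-house CITE-seq: NMI | In-house CITE-seq: SI (cell type) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SCellBOW | 0.26 | 0.56 | 0.13 | 0.49 | 0.52 | 0.11 | 0.56 | 0.82 | 0.53 | 0.65 | 0.72 | 0.06 |
| Scanpy | 0.16 | 0.53 | 0.13 | 0.36 | 0.5 | 0.1 | 0.38 | 0.73 | 0.24 | 0.51 | 0.66 | 0.08 |
| Seurat | 0.15 | 0.52 | 0.21 | 0.31 | 0.48 | 0.1 | 0.52 | 0.79 | 0.53 | 0.47 | 0.66 | 0.04 |
| ItClust | 0.01 | 0.02 | -0.05 | 0.33 | 0.46 | -0.05 | 0.31 | 0.29 | -0.05 | 0.43 | 0.55 | -0.13 |
| scETM | 0.11 | 0.37 | 0.04 | 0.38 | 0.5 | 0.08 | 0.35 | 0.67 | 0.46 | 0.41 | 0.61 | -0.03 |
| scBERT | 0.18 | 0.51 | 0.22 | 0.32 | 0.47 | 0.07 | 0.35 | 0.72 | 0.35 | 0.58 | 0.70 | 0.06 |
| scPhere | 0.04 | 0.42 | 0.05 | 0.19 | 0.38 | 0.04 | 0.23 | 0.63 | 0.33 | 0.24 | 0.55 | 0.21 |
| DESC | 0.17 | 0.52 | 0.02 | 0.46 | 0.49 | -0.06 | 0.39 | 0.72 | 0.2 | 0.54 | 0.65 | 0.01 |
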
Appendix 1—table 4
Marker gene sets for major immune cell types.
| Major cell types | Marker genes |
|---|---|
| B cells | CD19, CD79A, MS4A1, CD74, HLA-DRA |
| CD4 T | IL7R, CCR7, CD3D, CD4 |
| CD8 T | GZMK, CD8A, CD8B, GZMB |
| DC | CST3, CD14, ITGAM, ITGAX |
| MAIT | CD3D, KLRB1, RORA, ZBTB16 |
| Mono | CD14, S100A12 |
| NK cells | NKG7, GNLY, CD247, CCL3, GZMB, CD3D |

Appendix 1—table 5
Computation time across different transfer learning methods under the same hardware conditions (128 GB RAM, 16 core processor).
Wall time (Pancreas dataset)

| Methods | Source Model (~12 K cells) | Target Model (~2 K cells) | Total time |
|---|---|---|---|
| ItClust | 2 min 4 s | | ~2 min |
| SCellBOW (thread = 16) | 2 min 5 s | 1 min 20 s | ~3 min |
| SCellBOW (thread = 1) | 6 min 21 s | 2 min 8 s | ~8 min |
| scETM (600 epoch, thread = 16) | 22 min 38 s | 5 min 46 s | ~27 min |
| scETM (600 epoch, thread = 1) | 23 min 49 s | 5 min 58 s | ~28 min |
| scBERT | 3 hrs 33 min | 2 min | ~3 hrs |

Note: The term 'thread' represents the number of threads used: thread = 1 indicates single-threaded execution, while thread > 1 indicates multi-threaded execution.

Appendix 1—table 6
Gene set for molecular subtypes of Glioblastoma.
| Subtype | Marker genes |
|---|---|
| Proneural | DLL3, BCAN, OLIG2, NCAM1, NKX2-2, ASCL1, PDGFRA |
| Classical | EGFR, CDKN2A, RB1, CDK4, CCDN2 |
| Mesenchymal | CHI3L1, CD44, VIM, RELB, TRADD, PDPN, YKL40, MET, NF1, TNFRSF1A |

Appendix 1—table 7
Labrecque et al., 2019 gene sets for molecular subtypes of mCRPC.
| Subtype | Marker genes |
|---|---|
| NEPC | CHGA, SYP, ACTL6B, SNAP25, INSM1, ASCL1, CHRNB2, SRRM4 |
| ARPC | AR, NKX3-1, KLK3, CHRNA2, SLC45A3, NAP1L2, S100A14, TRGC1, TARP |

Additional files

Supplementary file 1

Result of differential expression analysis for the He et al., 2021 metastatic prostate cancer dataset.

This file also includes information on custom gene sets used for ARPC and NEPC analysis.

https://cdn.elifesciences.org/articles/98469/elife-98469-supp1-v1.xlsx
Supplementary file 2

Result of differential expression analysis for the in-house matched PBMC and splenocyte dataset.

https://cdn.elifesciences.org/articles/98469/elife-98469-supp2-v1.xlsx
MDAR checklist
https://cdn.elifesciences.org/articles/98469/elife-98469-mdarchecklist1-v1.docx

Cite this article

Namrata Bhattacharya, Anja Rockstroh, Sanket Suhas Deshpande, Sam Koshy Thomas, Anunay Yadav, Chitrita Goswami, Smriti Chawla, Pierre Solomon, Cynthia Fourgeux, Gaurav Ahuja, Brett Hollier, Himanshu Kumar, Antoine Roquilly, Jeremie Poschmann, Melanie Lehman, Colleen C Nelson, Debarka Sengupta (2025) Artificial intelligence approaches for tumor phenotype stratification from single-cell transcriptomic data. eLife 13:RP98469. https://doi.org/10.7554/eLife.98469.3