SCellBOW workflow.

a, Schematic overview of the SCellBOW workflow for identifying subclones and assessing subclonal tumor aggressiveness. For SCellBOW clustering, a corpus was first created from the gene expression matrix, where cells were analogous to documents and genes to words. Next, the pre-trained model was retrained with the vocabulary of the target dataset. Then, clustering was performed on the embeddings generated by the neural network. For SCellBOW phenotype algebra, vectors were created for the reference (total tumor) and the queries. Then, each query vector was subtracted from the reference vector, and the predicted risk score was calculated using a bootstrapped random survival forest. Finally, survival probability was evaluated and phenotypes were stratified by the median predicted risk score.
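The corpus construction, vector subtraction, and median stratification steps in panel a can be sketched as follows. This is a minimal illustration and not the SCellBOW implementation: the function names are hypothetical, the encoding rule (repeating a gene name once per rounded unit of expression) is an assumption, and the risk scores are taken as given instead of being produced by a random survival forest.

```python
from statistics import median


def cells_to_documents(expr_rows, gene_names):
    """Turn each cell (one row of expression values) into a 'document'.

    Assumption: a gene name is repeated round(expression) times, so that
    highly expressed genes appear as frequent 'words'.
    """
    docs = []
    for row in expr_rows:
        doc = []
        for gene, value in zip(gene_names, row):
            doc.extend([gene] * max(0, round(value)))
        docs.append(doc)
    return docs


def subtract_query(reference_vec, query_vec):
    """Phenotype algebra step: reference (total tumor) minus query."""
    return [r - q for r, q in zip(reference_vec, query_vec)]


def stratify_by_median(risk_scores):
    """Label each phenotype 'high' or 'low' relative to the median score."""
    cutoff = median(risk_scores.values())
    return {name: ("high" if score > cutoff else "low")
            for name, score in risk_scores.items()}


# Toy example with made-up values.
docs = cells_to_documents([[2, 0, 1], [0, 3, 0]], ["AR", "KLK3", "SYP"])
residual = subtract_query([1.0, 2.0], [0.5, 0.5])
groups = stratify_by_median({"T": 0.5, "T-A": 0.2, "T-B": 0.9})
```

In this sketch, a phenotype whose removal from the total tumor changes the predicted risk is compared against the other phenotypes only through the median cutoff; any finer ranking would use the risk scores directly.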

© 2024, BioRender Inc. Any parts of this image created with BioRender are not made available under the same license as the Reviewed Preprint, and are © 2024, BioRender Inc.

Evaluation of single-cell representations using SCellBOW.

a-c, UMAP plots for the normal prostate (a), PBMC (b), and pancreas (c) datasets. The coordinates are colored by cell types.

d-f, UMAP plots for normal prostate (d), PBMC (e), and pancreas (f) datasets, where the coordinates are colored by SCellBOW clusters. CL is used as an abbreviation for cluster.

g-i, Radial plots for the percentage contribution of different methods towards ARI at resolutions ranging from 0.2 to 2.0. ItClust is a resolution-independent method; thus, its ARI is constant across all resolutions.

j, Box plot for the NMI of different methods across different resolutions ranging from 0.2 to 2.0 in steps of 0.2.

k, Bar plot for the cell type silhouette index (SI) for different methods. The default resolution was set to 1.0.
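For readers unfamiliar with the agreement metrics in panels g-j, the following is a from-scratch sketch of ARI and NMI computed from two label lists. It is illustrative only: the function names are hypothetical, the NMI uses arithmetic-mean normalization (an assumption; other normalizations exist), and in practice a library implementation would normally be used.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    joint = Counter(zip(labels_true, labels_pred))
    same_both = sum(comb(c, 2) for c in joint.values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / comb(n, 2)
    maximum = (same_true + same_pred) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (same_both - expected) / (maximum - expected)


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(labels_true, labels_pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(labels_true)
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum((c / n) * log(n * c / (true_counts[t] * pred_counts[p]))
             for (t, p), c in joint.items())
    denom = (_entropy(labels_true) + _entropy(labels_pred)) / 2
    if denom == 0:  # both labelings consist of a single cluster
        return 1.0
    return mi / denom


# Toy labels: identical partitions with swapped names, then crossed ones.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
nmi_same = normalized_mutual_info([0, 0, 1, 1], [1, 1, 0, 0])
ari_cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
nmi_cross = normalized_mutual_info([0, 0, 1, 1], [0, 1, 0, 1])
```

Both metrics depend only on how cells are grouped, not on the cluster names, which is why the relabeled partition in the toy example scores as a perfect match.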

Evaluation of in-house splenocytes and matched PBMCs.

a, Schematic of the experiment highlighting the organ sites for tissue collection and sample processing. In this matched PBMC-splenocyte CITE-seq experiment, PBMCs and splenocytes were collected, followed by high-throughput sequencing and downstream analyses.

b, The UMAP plots for the embedding of SCellBOW compared to different benchmarking methods. The coordinates of all the plots are colored by cell type annotation results using Azimuth.

c-d, UMAP plots for SCellBOW embedding colored by donors (c) and cell types (d).

e-f, Alluvial plots for Azimuth cell types mapped to SCellBOW clusters (e) and Scanpy clusters (f). The resolution of SCellBOW was set to 1.0. CL is used as an abbreviation for cluster.

g, Bar plot for ARI, NMI, cell type SI at resolution 1.0.

Subclonal survival risk inference.

a, UMAP plot for the embedding of BRCA target dataset colored by PAM50 molecular subtype.

b, Heatmap of GSVA scores for three molecular subtypes of GBM: CLA, MES, and PRO, grouped by SCellBOW clusters at resolution 1.0.

c, Survival plot for BRCA molecular subtypes based on phenotype algebra. The total tumor is denoted by T.

d, Violin plot for predicted risk scores for BRCA molecular subtypes.

e, Survival plot for GBM molecular subtypes based on phenotype algebra.

f, Violin plot for predicted risk scores for GBM molecular subtypes.

Phenotype algebra of metastatic prostate cancer data based on AR- and NE-activity.

a, Schematic of the transdifferentiation states underlying the lineage plasticity that occurs during mCRPC progression from ARPC to NEPC.

b, Scatter plot of GSVA scores of ARPC and NEPC gene sets. K-means clustering was used to allocate cells into the three high-level categories: ARAH, ARAL, and NEPC.
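The K-means allocation in panel b can be illustrated with a short sketch of Lloyd's algorithm on two-dimensional points, where each point stands for a cell's (ARPC score, NEPC score) pair. The function name, the toy scores, and the starting centroids are all hypothetical; the clustering settings actually used for the figure are not stated in this legend.

```python
def kmeans(points, initial_centroids, n_iter=50):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2
                                  for p, c in zip(pt, centroids[k])))
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Toy (ARPC score, NEPC score) pairs; values are made up for illustration.
scores = [(0.9, 0.1), (0.8, 0.0), (0.3, 0.1),
          (0.2, 0.0), (0.0, 0.9), (0.1, 0.8)]
labels, centroids = kmeans(scores, [(1.0, 0.0), (0.25, 0.0), (0.0, 1.0)])
```

With three clusters, the resulting groups would then need to be named by inspecting their centroids, for example treating the cluster with the highest NEPC score as NEPC; that naming step is an interpretation and is not part of the algorithm itself.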

c, UMAP plot for projection of SCellBOW embedding colored by ARAH, ARAL, and NEPC.

d, Heatmap showing the top differentially expressed genes (y-axis) between each high-level category (x-axis) and all other cells, tested with a Wilcoxon rank-sum test.

e, Survival plot for mCRPC phenotypes based on phenotype algebra. The total tumor is denoted by T.

f, Violin plot for predicted risk scores for the mCRPC phenotypes: ARAH, ARAL, and NEPC.

Phenotype algebra of metastatic prostate cancer data based on SCellBOW clusters.

a, UMAP plot for projection of embeddings with coloring based on the SCellBOW clusters at resolution 0.8. CL is used as an abbreviation for cluster.

b, Violin plot of cluster-wise predicted risk scores for SCellBOW clusters, based on phenotype algebra.

c, Distribution of cells from the three high-level groups (ARAH, ARAL, and NEPC) across the SCellBOW clusters.

d, Patient and organ site distribution across the SCellBOW clusters.

e, Bubble plot of row-scaled GSVA scores for custom curated gene sets containing activated and repressed AR- and NE-signatures.

f, Correlation plot between six phenotypic categories, based on DSP gene expression, and the SCellBOW clusters, based on scRNA-seq gene expression. The six phenotypic categories were defined by Brady et al. based on the activity of AR and NE programs.

g, Top gene sets correlated with SCellBOW clusters. Signatures were collected from the C2 "curated", C5 "Gene Ontology", and H "hallmark" gene sets from MSigDB (ref. 94). Gene sets are ranked by row-scaled GSVA scores of one cluster against all others.