1. Computational and Systems Biology
  2. Human Biology and Medicine
Download icon

Unsupervised machine learning reveals risk stratifying glioblastoma tumor cells

  1. Nalin Leelatian
  2. Justine Sinnaeve
  3. Akshitkumar M Mistry
  4. Sierra M Barone
  5. Asa A Brockman
  6. Kirsten E Diggins
  7. Allison R Greenplate
  8. Kyle D Weaver
  9. Reid C Thompson
  10. Lola B Chambless
  11. Bret C Mobley
  12. Rebecca A Ihrie  Is a corresponding author
  13. Jonathan M Irish  Is a corresponding author
  1. Department of Cell and Developmental Biology, Vanderbilt University, United States
  2. Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, United States
  3. Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, United States
  4. Department of Neurological Surgery, Vanderbilt University Medical Center, United States
Research Article
  • Cited 0
  • Views 912
  • Annotations
Cite this article as: eLife 2020;9:e56879 doi: 10.7554/eLife.56879

Abstract

A goal of cancer research is to reveal cell subsets linked to continuous clinical outcomes to generate new therapeutic and biomarker hypotheses. We introduce a machine learning algorithm, Risk Assessment Population IDentification (RAPID), that is unsupervised and automated, identifies phenotypically distinct cell populations, and determines whether these populations stratify patient survival. With a pilot mass cytometry dataset of 2 million cells from 28 glioblastomas, RAPID identified tumor cells whose abundance independently and continuously stratified patient survival. Statistical validation within the workflow included repeated runs of stochastic steps and cell subsampling. Biological validation used an orthogonal platform, immunohistochemistry, and a larger cohort of 73 glioblastoma patients to confirm the findings from the pilot cohort. RAPID was also validated to find known risk stratifying cells and features using published data from blood cancer. Thus, RAPID provides an automated, unsupervised approach for finding statistically and biologically significant cells using cytometry data from patient samples.

Introduction

A modern goal of quantitative analysis of single cell data in human cancers is to move beyond human-driven identification of cell types using known markers (expert gating) to machine learning tools that can reveal and characterize novel and abnormal cells (Diggins et al., 2015; Greenplate et al., 2019; Irish, 2014; Saeys et al., 2016). Citrus, an automated analysis tool based on assignment of samples to binary categories (e.g. ‘healthy’ and ‘disease’) before testing whether cell populations are associated with these categories, was designed with this purpose in mind (Supplementary file 1; Bruggner et al., 2014). However, many important clinical features of patient tissue samples are reported as continuous variables, such as time to progression, overall survival, or percentage of immune infiltrate, which can be challenging to convert to arbitrary binary categories and may not be driven by a single unified cellular phenotype (Gonzalez et al., 2018; Good et al., 2018; Levine et al., 2015). Similarly, known, healthy cell populations from different stages of development or differentiation may be required for some approaches, such as developmentally dependent predictor of relapse (DDPR Good et al., 2018), and are not always available or fully represented for all datasets. This is especially acute for some tissues, such as brain, which may be quiescent in adults and not routinely sampled in clinical care or research. Tools are needed that can take into account continuous clinical variables that may be censored, such as overall survival or progression free survival (PFS), and which operate in an unsupervised manner. Ultimately, tools that work with high dimensional data should help users to translate findings from an algorithmic machine learning tool to common practice by identifying lower dimensional correlates that can be used to validate signatures using a complementary, clinically tractable approach. This transparency was a focus of the tool design and validation strategy used here. A computational workflow constructed for this purpose should also be validated via repeated subsampling of data to ensure the phenotypes identified are robust, by testing of different dimensionality reduction tools, by testing across multiple datasets, and by validation of prognostic signatures using complementary approaches. Finally, a practical challenge of modern single cell discovery projects is that they may often be at a project point where they are working with a smaller initial cohort (around 25 patients). This study size is powered to closely correlate cell subsets with patient outcomes using signaling cytometry data, as this study and others have shown for blood cancers (Gonzalez et al., 2018; Good et al., 2018; Irish et al., 2004; Irish et al., 2010; Kotecha et al., 2008; Levine et al., 2015), but necessitates extensive statistical and biological validation, as discussed below.

RAPID (Risk Assessment Population IDentification) is a newly created algorithm that was designed using single cell cytometry data and which addresses the key challenges of clinical research using discovery cohorts of patients (https://github.com/cytolab/RAPID; Leelatian, 2020; copy archived at https://github.com/elifesciences-publications/RAPID). This open-access tool can couple single cell experiments to clinical outcome and other variables in an unsupervised manner and provide information that can be translated into simplified tests on other platforms. For this study, the algorithm was assessed for 1) cluster stability (Melchiotti et al., 2017) for both cells and phenotypes; 2) modularity (Diggins et al., 2015; Saeys et al., 2016), which would allow the algorithm to function with a range of dimensionality reduction approaches, such as no dimensionality reduction, t-distributed stochastic neighbor embedding (t-SNE Amir et al., 2013), or uniform manifold approximation and projection (UMAP Becht et al., 2019), clustering tools, such as FlowSOM (Van Gassen et al., 2015) or dbscan (Akers et al., 2013), and enrichment analysis tools, such as marker enrichment modeling (MEM Diggins et al., 2017); 3) transparency, evaluated as the ability to derive simple models of data structure (Gandelman et al., 2019), such as decision trees or flow cytometry gating hierarchies, so that new datasets could be easily assessed; 4) independence - whether risk stratifying cell populations are independent of known predictors (age, others); and 5) reproducibility and translational potential, tested by gathering additional data using traditional, one-dimensional immunohistochemistry (IHC) that is widely used in clinical testing.

Here, the utility and validity of the RAPID algorithm were tested using two datasets with varying levels of prior knowledge, numbers of patients and cells, and outcome trajectories. The first was a new data set of 28 glioblastoma patient samples and is described in detail below. Central findings from this first dataset were then validated using 73 additional samples analyzed using a different technology. The second was a previously published data set of 54 bone marrow samples from B cell precursor acute lymphoblastic leukemia (Good et al., 2018). This study was chosen as an example of a dataset in which prognostic features had already been independently identified, and so validation was assessed by whether known features were revealed by RAPID.

When applied to single cell cytometry data from human tumors, as shown here, the aim of RAPID was to reveal and characterize populations of risk stratifying cells. For this goal, glioblastoma, the cancer type in the first dataset, represents an excellent challenge, since glioblastoma is a highly heterogeneous solid tumor that is amenable to single cell approaches (Doxie et al., 2018; Gonzalez et al., 2018; Greenplate et al., 2019; Leelatian et al., 2017a) and where there is a great opportunity for molecular prognostic features to have an impact on new treatments and clinical care. Glioblastoma is the most common primary brain tumor in adults, is highly aggressive, and is known to contain cells with diverse genomic and transcriptomic features reflecting abnormal neural lineages (Bhaduri et al., 2020; Leelatian et al., 2017a; Ostrom et al., 2017; Patel et al., 2014; Wei et al., 2016). Previous studies in glioblastomas have either measured signaling states in bulk primary tumors (Brennan et al., 2009; Brennan et al., 2013; Verhaak et al., 2010) or characterized genomic and transcriptomic profiles in a limited number of single cells (<33,000) (Bhaduri et al., 2020; Johnson and White, 2014; Neftel et al., 2019; Patel et al., 2014; Stommel et al., 2007; Wei et al., 2016). While differing subclasses of glioblastomas were proposed a decade ago (Verhaak et al., 2010), these categories do not correspond to large differences in prognosis and are not always reflected by individual cells (Patel et al., 2014). Mosaic amplifications of receptor tyrosine kinase (RTK) genes are commonly observed in subsets of cells within a single glioblastoma tumor (Snuderl et al., 2011), suggesting that single cell analysis of glioblastoma should include signaling measurements. In other cancer types, phospho-protein signaling has repeatedly revealed cancer cell subsets that are closely linked to patient clinical outcomes (Gonzalez et al., 2018; Good et al., 2018; Irish et al., 2004; Irish et al., 2010; Kotecha et al., 2008; Levine et al., 2015). These results suggest that a protein-level approach in a small pilot cohort may reveal phenotypically distinct cancer cell subsets whose abundance provides new ways to stratify glioblastoma outcomes. While it is known that upstream regulators of pro-growth and pro-survival signaling are altered in brain tumors, little is known about the activation states of signaling effector proteins in single glioblastoma cells, as these features are inaccessible to sequencing modalities (Meyer et al., 2015; Mistry et al., 2019; Snuderl et al., 2011; Spitzer and Nolan, 2016).

Another challenge that the RAPID algorithm was designed to address was the need to work with heterogeneous cell phenotypes and populations that might be rare and variable across patients. Cytometry data are a good match for this type of algorithm, as a large number of cells are collected from each tumor sample, the data have an excellent signal-to-noise ratio and support quantitative comparisons, and cytometry enables direct measurement of signaling pathway activation (Irish et al., 2010; Kotecha et al., 2008; Mistry et al., 2019; Myklebust et al., 2017). When glioblastoma mass cytometry data were analyzed by RAPID, both negative- and positive-prognostic phenotypes were identified, with protein-level phenotypes not described by prior studies. Statistical description of prognostic phenotypes within the RAPID algorithm then enabled the design of a simple workflow using traditional IHC, which stratified outcome in a separate set of 73 glioblastoma patient tissues.

Results

RAPID identifies stratifying cell subsets in an automatic and unsupervised manner

The RAPID algorithm workflow is depicted in Figure 1 using results from Dataset 1. Following patient-specific identification of major cell types (Figure 1a), the algorithm (Figure 1b) randomly sampled an equal number of glioblastoma cells from each patient’s tumor and analyzed the cells on a single, common t-SNE. This even sampling was conducted to generate a t-SNE analysis where each patient contributed equally. Subsequent statistical testing (Figure 1c) included repeated subsampling to ensure that sampled cells were representative of the original tumors. After multiple statistical tests, the most robust and reproducible cell types identified by RAPID were validated biologically, including using a new data type and a larger cohort (Figure 1d).

Figure 1 with 6 supplements see all
RAPID identifies single cell phenotypes associated with continuous clinical variables that are stable and validated via complementary approaches.

(a) Graphic of tumor processing and data collection. After data collection and standard pre-processing, non-immune, non-endothelial glioblastoma cells were computationally isolated for analysis by RAPID. (b) RAPID workflow on glioblastoma cells identified from 28 patients and computationally pooled for t-SNE analysis. Cell subsets were automatically identified by FlowSOM and were systematically assessed for association with patient overall or progression-free survival. 43 glioblastoma cell subsets were identified and were color-coded based on hazard ratio of death and p-values (HR >1, red; HR <1, blue). Cell density, FlowSOM clusters, and cluster significance are depicted on t-SNE plots. (c) RAPID results were tested for stability. Each tumor was randomly subsampled for 4,710 cells multiple times. Each of these cell subsampling runs was subject to 100 iterative FlowSOM analyses and an F-measure was calculated for each cluster. Only clusters with an F-measure of greater than 0.5 were considered stable. Then, the phenotypes of stable clusters associated with patient outcome were assessed via RMSD and used to determine stable phenotypes. (d) Validation of the findings from the mass cytometry data was done using lower dimensional gating strategies and an orthogonal technology to confirm the biological findings.

The RAPID algorithm was unsupervised and included two key statistical decisions. The first decision was the automation of the number of target clusters sought at the clustering step (Figure 1b, middle). This was achieved through repeated analysis with the chosen clustering tool, in this case FlowSOM (Van Gassen et al., 2015), followed by statistical analysis. RAPID iteratively tested a range (cluster number 5–50) of unsupervised self-organizing maps from FlowSOM to identify an appropriate number of stable clusters containing phenotypically homogenous cells. The minimum number of clusters that minimized intra-cluster variance for each feature was calculated after all iterations were completed and set as the optimized target cluster number (see Materials and methods). Clustering with other tools, such as DBSCAN, or clustering on untransformed axes, was both slower and less accurate in identifying stable, phenotypically distinct clusters, consistent with published observations (data not shown and Weber and Robinson, 2016). The second decision was in assessing cluster abundance in patients (Figure 1b, right). RAPID assigned patients to high or low abundance for each automatically identified cluster based on a statistical cut point, set as the interquartile range of the population abundance across the samples (see Materials and methods). These two decisions resulted in automation of steps that are typically manual in cytometry analysis.

After finding clusters in an unsupervised manner and determining which patients’ tumors contained a high level of each cluster, the last step in a run of RAPID was to test whether each cluster stratified risk of death. For this last test, RAPID applied a univariate Cox survival analysis to determine the correlation between the abundance of tumor cells in each cluster and patient survival outcome (Supplementary file 2). Clusters were identified as prognostic by assessing the hazard ratio (HR) of death in patients who had either high or low abundance of the cell cluster. Negative and positive prognostic clusters were colored red or blue, respectively, if they were significantly associated (p<0.05) with an HR that was >1 (negative, red) or <1 (positive, blue). The RAPID algorithm used statistical analysis of the common t-SNE, feature variance, and population abundance to automatically set all computational analysis parameters, independent of clinical outcomes.

The output of RAPID includes a PDF containing a color-coded, 2D t-SNE plot depicting all FlowSOM clusters, a 2D t-SNE plot colored by clusters which were significantly associated with patient outcome, and Kaplan-Meier survival plots of patients for each subset (additional files described in Materials and methods) (Figure 1b). To compactly report and depict the phenotype of algorithmically identified cell subsets, RAPID used Marker Enrichment Modeling (MEM) labels (Diggins et al., 2017). Thus, feature enrichment was reported on a +10 to −10 scale, where +10 indicated that the feature was especially enriched in those cells and −10 indicated that the feature was specifically excluded from those cells, relative to all other cells in other clusters. The MEM label here was thus an objective description of what made each population distinct from the other clusters identified by RAPID. In summary, RAPID provided an unsupervised, automated, statistical approach to revealing and characterizing clinically significant cells.

Identification of risk stratifying glioblastoma cells in Dataset 1

RAPID was designed for datasets like Dataset 1, a pilot glioblastoma mass cytometry dataset including cells collected from 28 patients with isocitrate dehydrogenase (IDH) wild-type glioblastoma at the time of primary surgical resection (Supplementary file 3). This dataset is currently available online (https://flowrepository.org/id/FR-FCM-Z24K). The median PFS and overall survival (OS) after diagnosis were 6.3 and 13 months, respectively, typical of the trajectory of this disease (Stupp et al., 2005). Resected tissues were immediately dissociated into single cell suspensions as previously reported (Leelatian et al., 2017b) and the resulting cells were stained with a customized antibody panel, which was designed to capture the expression of known cell surface proteins, intracellular proteins, and phospho-signaling events (Supplementary file 4). Collectively, the antigens included in this panel positively identified >99% of viable single cells within any given tumor sample (see Materials and methods). To identify glioblastoma cells prior to RAPID, as in Figure 1a, a patient-specific t-SNE was created using 26 of the measured markers for the tumor and stromal cells from each patient’s tumor (Amir et al., 2013; Figure 1—figure supplement 1 and Supplementary file 4). Patient-specific t-SNE maps revealed non-glioblastoma populations of immune (CD45+) and endothelial (CD45-CD31+) cells, consistent with prior mass cytometry and sequencing studies of gliomas (Diggins et al., 2017; Greenplate et al., 2019; Leelatian et al., 2017a; Neftel et al., 2019; Patel et al., 2014). Immune and endothelial cells from each individual patient were computationally excluded prior to subsequent downstream analysis (Figure 1, Figure 1—figure supplement 1), and CD45-CD31- cells were labeled as glioblastoma cells.

Plots of cell density on the t-SNE axes revealed phenotypically distinct subpopulations of glioblastoma cells within a single patient’s tumor (example patient LC26: Figure 1—figure supplement 1, maps for all patients: Supplementary file 6) Intra-tumoral subsets were distinguished by differences in expression of core neural identity proteins and by aberrant co-expression of neural lineage and stem cell proteins. In the example case of tumor LC26, abnormal phenotypes in glioblastoma cells included co-expression of astrocytic S100B and stem-like CD133 or co-expression of markers associated with different molecular subtypes of glioblastoma, such as mesenchymal (CD44) and classical (EGFR) (Figure 1—figure supplement 1; Verhaak et al., 2010). These results with protein confirmed the existence of non-canonical cell types that had previously been observed in single-cell RNA-seq (Patel et al., 2014). The abnormal co-expression of identity proteins seen here, as well as previously reported single cell studies relying on inferred DNA alterations (Neftel et al., 2019), indicate that the large majority of the CD45-CD31- cells were likely cancer lineage cells.

Using an equal number of subsampled glioblastoma cells from each patient (see Materials and methods), a single, common t-SNE map was created to represent glioblastoma cell protein phenotypes across all patients (N = 131,880 cells; 4,710 cells x 28 patients, using 24 measured features). The RAPID algorithm, using the pooled data from all patients, identified 43 phenotypically distinct cell clusters, and then determined for each tumor whether a patient was high or low for a particular cluster using the interquartile range of abundance for that cluster. For example, for glioblastoma cluster 24, the interquartile range was 0.67% to 3.36%, resulting in a cut point of 2.69%. Those patients with ≤2.69% were designated ‘low’ for cluster 24 while those with >2.69% were assigned to the ‘high’ group. Additional cut points, based on splitting populations into quartiles or tertiles, were tested and resulted in consistent prognostic phenotypes (the average F-measure of patients being consistently assigned to the high, low, or neither categories identified below was 0.86). The number of tumors that contributed to each cluster varied between the 43 clusters, but a median of 8 tumors contained cells in each cluster (Supplementary file 2, Supplementary file 6). Furthermore, each cluster contained cells from at least 4 tumors and, at the median, contained cells from 12 tumors (Supplementary file 5, Supplementary file 6).

The RAPID algorithm identified four Glioblastoma Negative Prognostic (GNP) clusters (red; clusters 33, 34, 37, and 42) and five Glioblastoma Positive Prognostic (GPP) clusters (blue; clusters 2, 3, 4, 5, and 41) whose abundance was associated with overall survival (Figure 1b). MEM labels were used to identify the enriched features of risk stratifying glioblastoma cells (Figure 1—figure supplements 2 and 3). MEM labels were calculated for both total proteins (P), such as S100B and EGFR, and signaling effectors (S), such as p-STAT5. GNP cells aberrantly co-expressed neural-lineage proteins (astrocytic S100B and stem-like SOX2). Additionally, GNP cells displayed phosphorylation of RTK signaling effectors known to promote cell survival, growth, and proliferation (e.g. p-STAT5Y694, p-S6S235/S236, p-STAT3Y705) (Figure 1—figure supplements 2 and 4). The MEM protein enrichment values (average and standard deviation) for GNP cells included neural lineage determinants (▲S100B+5±1.6, SOX2+5±1) and phospho-proteins (▲p-STAT3+3±2.1, p-STAT5+2±1.8, p-S6+3±1.4) and identified proteins that were specifically lacking in GNP cells relative to other glioblastoma cell clusters (▼EGFR-2±0.1, GFAP-4±0.7, CD44-4±0) (Figure 1—figure supplement 4). In contrast, GPP cells were positively enriched for EGFR (▲EGFR+5±0.8) and consistently lacked pro-survival phospho-proteins (▼p-S6-4±3.7, p-STAT5-2±0.8, p-STAT3-2±1.6) and one of the proliferation markers measured (▼cyclin B1-3±3.3) (Figure 1—figure supplement 4).

Non-malignant cells, including immune and endothelial cells, were excluded from initial RAPID analyses and subsequent biaxial gating confirmed that the GNP and GPP subsets were not unexpected residual CD45+ or CD31+ cells (Figure 1—figure supplement 4). However, infiltrating immune cells can comprise a large proportion of non-cancer cells in glioblastomas and have highly variable overall abundance across patients (Hussain et al., 2006). Notably, GPP-high (n = 7) patients’ tumors all contained more than 9% CD45+ cells (median %=25.3 ± 13.8), whereas all GNP-high (n = 8) patients’ tumors contained less than 9% CD45+ cells (median %=3.3 ± 2.4, p<0.001, Figure 1—figure supplement 5, Supplementary file 2).

Identification of risk stratifying B-cell leukemia cells in Dataset 2

FCS files from a previously published mass cytometry study of B-cell precursor acute lymphoblastic leukemia (BCP-ALL) by an independent lab were input into the RAPID workflow to test whether the RAPID algorithm could re-discover prognostic cell subsets in other disease settings (Good et al., 2018). Dataset 2 is available online (originally: https://github.com/kara-davis-lab/DDPR/releases, in this study: https://github.com/cytolab/RAPID). This dataset contained almost twice the number of patients (n = 54) but less than half the number of total cells compared to Dataset 1 (48,600) because of a single patient with only 900 live, lineage-negative blast cells (Good et al., 2018). A total of 47 clusters were identified by RAPID, 3 of which were negative prognostic cell subsets that were associated with time to relapse (Figure 2). Importantly, features identified in the original publication as part of the signature associated with relapse (black text, Figure 2) were re-identified using RAPID. In the protein feature MEM values, enrichment of CD38 and CD34 was consistent with previously reported trends in pre-pro B cell-like phenotypes in BCP-ALL. Most notably, the signaling features p-S6, p-SYK, and p-4EBP1, which were important features positively associated with relapse in the DDPR model, were enriched in the negative prognostic populations identified by RAPID. Thus, RAPID was able to identify cells and features associated with time to relapse in another disease setting, generating a signature of negative-prognostic cells consistent with the original findings by another research group.

RAPID analysis of a published B-cell leukemia dataset to identify negative prognostic cell subsets.

(a) t-SNE plot of 54 B-cell leukemia patient samples with negative prognostic populations (A, B, C) colored in red. (b) MEM labels for three negative prognostic cell subsets (NP_A, NP_B, NP_C). Features important in the original discovery of predictors of relapse are colored in black. (c) Kaplan-Meier Curve comparing time to relapse in patients with high abundance of negative prognostic cells (identified by RAPID) to patients with low abundance of negative prognostic cells.

Statistical validation 1: Clusters identified by RAPID were statistically robust

To determine the stability of the clusters identified by RAPID, 99 additional runs of FlowSOM were performed within the RAPID workflow (Figure 1c). Due to the stochastic nature of FlowSOM, the clusters identified in each subsequent run could contain different cells. For each of the clusters, an F-measure was calculated, based on the accuracy of cell assignment within a cluster in subsequent iterations of FlowSOM (see Methods, Supplementary file 2). Of the original 43 clusters, five had an average F-measure of less than 0.5 (average F-measure of all clusters = 0.75). These five clusters, including cluster 33, previously identified as a GNP cluster, were considered unstable and were not included in subsequent analyses (indicated by shading in Figure 1 and Figure 1—figure supplement 2, and asterisks in Figure 1—figure supplement 3 and Supplementary file 2).

Statistical validation 2: Clusters identified by RAPID were not dependent on individual patients or sub-samplings

A key design decision in RAPID was the use of an equal number of cell events from each patient to avoid tumors disproportionately impacting the analysis based on the number of cells collected. However, this decision limits a given RAPID analysis run to a number of cells equal to the smallest collected from any one patient. For the tumors studied here, the number of live glioblastoma cells ranged from 4,710 to 330,000 cells per patient. To test whether the cells subsampled for RAPID were representative of the total tumor sample and eliminate the possibility that randomly subsampled cells from larger samples are not representative, 9 additional t-SNE analyses were generated, each with a different sample of 4,710 cells selected at random, with replacement, from each patient. Each of these 9 t-SNE projections was then used in a new RAPID analysis, creating 10 total analyses (the original and 9 new tests). Of these, a total of 55 clusters from the 10 runs were considered stable (F-measure >0.5) and prognostic (see Methods, Figure 3). An F-measure could not be calculated on a cell-by-cell basis because the cells varied between analyses, but the average F-measure based on patient categorization (GNP-high, GPP-high, and GNP and GPP low) was 0.79 between t-SNE runs.

Subsampling of glioblastoma cells repeatedly resulted in GNP and GPP subsets with similar phenotypes.

RMSD map comparing MEM scores for stable GNP and GPP subsets identified in the main figures and from nine additional t-SNE runs. GNP subsets are noted by red circles and GPP subsets are noted by blue circles. Colored boxes to the left of the red or blue circles indicate the t-SNE run from which the subset is derived. Median MEM labels (± standard deviation) are shown for five major populations to the right. The number of t-SNE analyses represented in each group, as well as median p-value and hazard ratio (HR) are noted in the bottom right corner of each MEM label.

To quantify the degree of similarity between the 47 newly identified prognostic clusters and the 8 representative GNP (34, 37, 42) and GPP (2, 3, 4, 5, 41) clusters, the root-mean-square deviation (RMSD) in the MEM enrichment values was calculated (Diggins et al., 2018; Diggins et al., 2017). GNP subsets from subsequent runs were highly similar to the GNP subsets identified by the initial analysis described above, and the same was observed for GPP subsets (Figure 3; GNP v GNP average RMSD = 92.8, GPP v GPP average RMSD = 88.9, and GNP v GPP average RMSD = 80.9). However, some phenotypes were only observed in a small number of t-SNE runs. For example, the phenotype representing cluster 41 was only seen in one other t-SNE. Because this cell type was not observed in at least 50% of the cell sub-samplings, it was considered phenotypically unstable and removed from subsequent analyses (indicated by shading in Figure 1 and Figure 1—figure supplement 2, and asterisks in Figure 1—figure supplement 3 and Supplementary file 2).

Statistical validation 3: Comparable clusters were identified by RAPID using UMAP instead of t-SNE

To test the modularity of RAPID, the algorithm was implemented using different dimensionality reduction values as input parameters, replacing t-SNE with UMAP, a tool that emphasizes both local and global data structure (Becht et al., 2019). RAPID identified 31 populations using UMAP input; 4 of these were prognostic and significantly associated with OS (1 GNPUMAP and 3 GPPUMAP) (Figure 4). GNPUMAP MEM scores reflected the characteristic S100B and SOX2 co-expression observed in the GNP populations along with an active pro-survival basal signaling status. GPPUMAP subsets were similarly defined by co-expression of EGFR and CD44 and a general lack of the measured phosphorylated signaling effectors (Figure 4). When the cells identified using t-SNE were overlaid on the UMAP axes, they occupied similar phenotypic space as UMAP-identified clusters, and vice versa (F-measure for cell assignment to GNP, GPP, or neither = 0.87, Figure 4). Thus, when UMAP was used in the RAPID algorithm, GNP and GPP populations were identified that had comparable phenotypes to those identified previously in t-SNE analyses, confirming that RAPID is not dependent upon a specific dimensionality reduction tool (Figure 4).

GNP and GPP cells were also identified using dimensionality reduction tool UMAP in the RAPID algorithm.

(a) UMAP analysis of 131,880 cells from 28 patients. Upper left plot - heat on cell density; lower left plot – colored by FlowSOM cluster; right plot – colored by GNP(red)/GPP(blue) designation and p-value. (b) Per-cell expression levels of 5 identity proteins, 3 phosphorylated signaling effectors, and proliferation marker cyclin B1 are depicted. (c) Enrichment of identity proteins (P) and phosphorylated signaling effectors (S) of glioblastoma cell subsets was quantified using MEM. GNP and GPP cells are labeled in red and blue, respectively. (d) Histogram analysis depicts the expression of key identity proteins and phosphorylation signaling effectors of GNP (red) and GPP (blue) compared to all glioblastoma (GBM) cells (gray, top row). (e) Overall survival curves for four UMAP-identified populations associated with survival. Cox-proportional hazard model was used to determine a hazard ratio (HR) of death. Censored patients are indicated by vertical ticks. (f) GNP (red) and GPP (blue) cells identified via t-SNE (‘t-SNE GNP’ or ‘t-SNE GPP’) and UMAP (‘UMAP GNP’ or ‘UMAP GPP’) are overlaid on either UMAP or t-SNE axes. (g) Categorization of each patient (dots) based on GNP high (red), GPP high (blue), or neither (gray) according to abundance based on RAPID using t-SNE or RAPID using UMAP (F-measure = 0.86).

Statistical validation 4: Risk stratifying cells were continuously associated with outcomes and independent of other glioblastoma stratifying features

At the conclusion of the RAPID analysis, to ensure that results were not an artifact of the high-low cut point choice and to determine if the effect of cell subset abundance was continuous and independent of other features known to stratify glioblastoma survival, a multivariate Cox proportional-hazards model analysis was performed incorporating known predictive features and GNP or GPP cell abundance. The included known predictors were age (Ohgaki et al., 2004; Shapiro et al., 1989), O6-alkylguanine DNA alkyltransferase (MGMT) promoter methylation status (Brown et al., 2016a; Hegi et al., 2005), and treatment variables including the extent of surgical resection (Brown et al., 2016b; Grabowski et al., 2014), therapy with temozolomide (Stupp et al., 2005), and radiation (Mirimanoff et al., 2006; Walker et al., 1980). Multivariate survival analysis of GNP cell abundance on a continuous scale, keeping the other predictors constant, indicated that each 1% increase in GNP cells was associated with an approximately 7% increase in mortality compared to baseline (OS HR = 1.07 [95% CI 1.02–1.12], p=0.003). Similarly, a 1% increase in GPP cells was associated with an approximately 7% decrease in mortality rate (OS HR = 0.93 [0.87–1.0], p=0.05) and an approximately 4% increase in time to tumor progression, as compared to baseline (PFS HR = 0.96 [0.93–0.998], p=0.04). When GNP and GPP were assessed simultaneously, abundance of GNP cells was the primary predictor of mortality (OS HR = 1.05 [1.00–1.10], p=0.04), while abundance of GPP cells was the primary predictor of time to tumor progression (PFS HR = 0.96 [0.92–1.00]; p=0.03). Thus, the abundances of GNP and GPP cell subsets were associated with distinct and contrasting patient outcomes (Figure 1—figure supplement 4), and their predictive value was independent of each other and known prognostic factors of patient survival.

Since assessing progression-free survival (PFS) can be especially useful in the clinic for cancers with longer median survival, RAPID was also used for the identification of glioblastoma cell clusters with differential PFS, as opposed to OS. Of the 43 subsets identified by RAPID, 4 subsets were significantly associated with PFS (subsets 20, 33, and 43 with unfavorable PFS (GNPPFS) and subset 3 was associated with favorable PFS (GPPPFS), Figure 1—figure supplement 6).

Tumors are mosaics of multiple subsets but number of subsets does not correlate with outcome

In the representative t-SNE run (Figure 1), RAPID identified 43 phenotypically distinct glioblastoma cell subsets within the tumors analyzed by mass cytometry in this study (Figure 1, Figure 1—figure supplement 4). The abundance of the 43 clusters varied extensively across patients (Supplementary file 2). Tumors contained a median of 14 clusters at >1% with a range from 5 cell clusters in LC06 to a maximum of 27 cell clusters represented in LC25 (Supplementary file 2, per-patient maps in Supplementary file 6). Although intra-tumor diversity has been hypothesized to contribute to poor response to treatment and survival, here, the number of glioblastoma cell clusters present within a tumor at >1% abundance (a surrogate for intra-tumor diversity) was not observed to be associated with differential survival (ρ = 0.047, p=0.812). In contrast, the abundance of each of the 7 stable and prognostic glioblastoma cell clusters was closely correlated with overall survival (Figure 1—figure supplement 4).

Biological validation 1: A transparent algorithm enables creation of a simple cell identification strategy that captures the cells identified in Dataset 1

After patterns are recognized by a machine learning approach, it is useful to learn from key features and create a straightforward test using alternative technologies or simpler models. One such model is a decision tree using one- or two-dimensional cytometry gating (Gandelman et al., 2019), consistent with traditional strategies in immunology and hematopathology. Therefore, a two-dimensional prognostic strategy was designed based on the MEM labels generated from the mass cytometry data. As described above (and Figure 1—figure supplements 2, 3 and 4), MEM labels were generated for each GNP and GPP population, as well as the combined subsets (GNP_Total and GPP_Total), reflecting enriched proteins in each population. These quantitative labels highlighted the most enriched proteins and were used to select S100B (enriched in GNP cells and largely absent from GPP cells) and EGFR (enriched in GPP cells and largely absent from GNP cells) for two-parameter analysis (Figure 5). Using only these two proteins, patients could be grouped as GNP-like, GPP-like or GNP and GPP Low, and these groups again exhibited stratified clinical outcomes (HR = 6.56, GNP-like median OS = 111.5 days, GPP-like median OS = 896 days, Figure 5). Thus, a simple gating model based on the two most divergent features identified by RAPID was able to meaningfully separate patients into clinically distinct groups.

A simple gating strategy based on S100B and EGFR can stratify patients using mass cytometry or immunohistochemistry data.

(a) Biaxial plot of S100B (y-axis) and EGFR (x-axis). Gray contours depict all 131,880 cells from all patients. Density contour overlays depict GNP (top) or GPP (bottom) cells identified by the RAPID algorithm. (b) Biaxial plot of S100B (y-axis) and EGFR (x-axis). Gray contours depict all 131,880 cells from all patients as in (a). Red box indicates gate for S100B+/EGFR- cells, called GNP-like. Blue box indicates gate for EGFR+ cells, called GPP-like. (c) Kaplan Meier curve comparing overall survival (in days) of patients with high percentages of GNP-like cells in red (red gate in a, >65.7% = high) and patients with high percentages of GPP-like cells in blue (blue gate in a, >31.2% = high). The hazard ratio of death, calculated using a Cox proportional hazards model, is 6.56 (p=0.0007). (d) Example TMA cores stained for S100B (left) or EGFR (right). Brown signal is from 3,3′-Diaminobenzidine (DAB). (e) Graph depicting DAB signal intensity for S100B (y-axis) or EGFR (x-axis) from tissue microarray immunohistochemistry on 73 glioblastoma patient samples. The red box outlines patients described as GNP-like (S100Bhigh/EGFRlow) and the blue box outlines patients designated GPP-like (EGFRhigh). All other patients are shown in gray. (f) A Kaplan-Meier curve showing overall survival (in days) of patients in the GNP-like (red) or GPP-like (blue) groups. The hazard ratio of death, calculated using a Cox proportional hazards model, is 2.3 (p-value=0.03).

Biological validation 2: A larger cohort of glioblastoma samples was stratified using IHC based on phenotypes discovered by RAPID

Unlike fluorescence or mass flow cytometry, IHC is routinely used in surgical pathology. To confirm the ability of S100B and EGFR in separating clinically distinct patient populations using an orthogonal approach, a tissue microarray (TMA) of 73 glioblastoma patient samples was developed. Serial TMA sections were stained with antibodies against S100B and EGFR and the overall signal intensity was determined using QuPath software for each feature (see Methods). By comparing S100B and EGFR staining intensity, patients were scored as GNP-like, GPP-like, or GNP and GPP Low (Figure 5). A Kaplan-Meier analysis comparing overall survival between patients enriched with GNP-like cells to those with GPP-like cells confirmed that GNP-like cell enrichment is associated with a shorter overall survival (HR = 2.3, GNP-like median OS = 298 days, GPP-like median OS = 560 days, Figure 5). These results validated the suspension mass cytometry findings and demonstrated that once revealed by RAPID, GNP-like and GPP-like cells could be identified in new samples by complementary approaches used in laboratory and clinical settings.

Discussion

The focus of this study was the creation of an unsupervised approach that could work with pilot datasets to suggest prognostic cell types for validation. Ultimately, the RAPID algorithm was tested using numerous statistical approaches, validated with two datasets, and validated as revealing biologically robust cells detectable on other platforms in a larger follow up cohort with formalin-fixed, paraffin-embedded tissue. Prior workflows and algorithms were developed to identify cell populations of interest in cancer samples and emphasized supervised modeling, as with Citrus (Bruggner et al., 2014) and Cytofast (Beyrend et al., 2018), or comparison to known subsets, as with DDPR (Good et al., 2018) and Phenograph (Levine et al., 2015). These approaches could not be used with Dataset 1, either because they required a level of prior knowledge about non-malignant adult human brain cells which was not available, or because they required supervision using categorical outcomes, which are not always clearly delineated for continuous variables. Another advantage of RAPID is that it does not require a target cluster number, which is important when it is not known how many phenotypically distinct subsets will be observed in a given cancer type. Cell subsets in tumors can be challenging to manually annotate as they may reasonably be assigned to multiple known cell types, as was apparent here and in prior studies (Neftel et al., 2019; Patel et al., 2014). RAPID is unsupervised, provides a quantitative label of features enriched in each cluster, and is modular, such that a variety of dimensionality reduction and clustering tools can be used. Currently, a user inputs raw data files (e.g., FCS files from cytometry platforms or equivalent data types from other platforms) and annotated patient survival data. The recommended use of RAPID is to run the full algorithm at least 10 times to seek consensus populations that are stable in phenotype and risk stratification. Both single-run implementation for discovery and a version using these best practices are included as R markdown scripts on the RAPID Github page (https://github.com/cytolab/RAPID). RAPID outputs quantitatively described cell clusters and their significance with respect to patient outcome. While the focus of this study was cytometry data, the design is suitable to other single cell data types where clinical outcomes or similar continuous variables have been scored for pilot cohorts, typically at least 25 individuals. Published datasets were not available for single-cell RNA-seq that matched the criteria for RAPID, including having thousands of cells per sample, more than 25 individuals with annotated clinical outcomes, and multiple features scored consistently for every cell. As single cell RNA-seq and imaging cytometry technologies advance, we anticipate RAPID will be useful for such datasets, especially given how widespread t-SNE, UMAP, and related approaches are within these fields.

The utility of RAPID includes its ability to identify stable, robust clusters that are independent of known prognosticators, provide users with opportunities to customize the workflow with a variety of tools, and inform subsequent studies on validation datasets or using different technologies. Here, RAPID was extensively probed for its performance in each of these areas. By repeated subsampling of each tumor and iterative FlowSOM analyses, clusters with consistent cell content and phenotypes observed in the majority of subsamplings were identified (Figure 3). Furthermore, these clusters were independently associated with continuous clinical variables - patient overall survival and PFS. A subsequent, low dimensional decision tree applied to both mass cytometry data and a new set of patient samples stained via IHC was also able to stratify patients, suggesting that the biology learned from the high dimensional approach could be used to inform complementary approaches. Critically, RAPID was also used to analyze a dataset from different tissue in a different disease collected at a different institution, Dataset 2 in Figure 2 (Good et al., 2018). In this application of RAPID, features previously identified by the original authors to be associated with time to relapse were re-captured, identifying cellular phenotypes concordant with prior results without requiring the normal developmental trajectory reference used in the original analysis.

Within Dataset 1 analyzing 28 IDH wild-type pre-therapy glioblastoma patient samples, the RAPID workflow automatically uncovered two prognostic phenotypic signatures which were independent of other known predictors of outcome. Glioblastoma Negative Prognostic (GNP) cells, characterized by enrichment for S100B, SOX2, p-STAT3, and p-STAT5, were associated with decreased overall survival, while Glioblastoma Positive Prognostic (GPP) cells, characterized by co-enrichment of EGFR and CD44 proteins, were associated with longer overall survival. Once revealed in high-dimensional data, a simple gating scheme using S100B and EGFR could be used to stratify outcome in a separate, expanded set of samples using traditional pathological approaches. High-dimensional cytometry and RAPID were critical to revealing novel prognostic cells in glioblastoma data in two ways. First, assessment of a large number of cells per tumor – over 2 million viable single cells, with at least 4,710 glioblastoma cells from each patient - enabled the use of an unsupervised approach in the identification of rare, novel cell subsets across patients. Second, per-cell quantification of phosphorylated signaling effector proteins revealed potential mechanisms of tumor cell regulation that are not readily apparent in bulk tumor data, genomic analyses, or lower dimensional approaches such as one- to four-color imaging. Supervised analysis of single cell data has previously uncovered signaling events tied to patient survival in hematologic malignancies (Good et al., 2018; Irish et al., 2004; Levine et al., 2015; Myklebust et al., 2017), and a similar pattern was observed here.

The GNP signature was defined by abnormal neural development features such as co-expression of stem cell transcription factor SOX2 and astrocyte lineage marker S100B (Ikushima et al., 2009; Raponi et al., 2007) and simultaneous high basal phosphorylation of multiple signaling effectors downstream of receptor tyrosine kinases reported to be important in tumor biology (Bhat et al., 2013; Carro et al., 2010; Dolma et al., 2016; Fan et al., 2017; Tan et al., 2019; Wei et al., 2013; Figure 1—figure supplement 4). RAPID also uncovered a connection between p-STAT5 and glioblastoma outcome previously unidentified in primary patient samples. STAT5 signaling is required in development of many tissues to block apoptosis and drive cell cycle entry (Irish et al., 2006) for example, p-STAT5 is an essential feature of negative prognostic acute myeloid leukemia signaling profiles (Irish et al., 2004; Levine et al., 2015). The signaling events of the negative and positive prognostic cells can now be studied in glioblastoma research models, such as patient xenografts and glioblastoma organoids (Bhaduri et al., 2020; Hubert et al., 2016; Jacob et al., 2020; Ogawa et al., 2018), using new combinations of targeted therapies, such as JAK inhibitors that target molecules upstream of STAT5 and STAT3, in combination with PI3K/mTOR pathway inhibitors, which will target molecules upstream of AKT and S6 signaling. In this way, new combinations of existing therapies may prove useful in targeting the signaling that defines the negative prognostic cells seen here.

Recent work using single cell gene expression has described the existence of multiple cellular states in glioblastoma tumors and the ability of cells to transition between states (Neftel et al., 2019). Similar to most transcript-based studies, RAPID analyses were performed on cells collected at a single timepoint, precluding a direct investigation of the ability of GNP or GPP cells to transition to other phenotypes; however, it is possible that phosphorylated, active STAT3, STAT5, and S6 may enable transition between progenitor-like states as they do in earlier development, and thus influence patient outcome (Rushing et al., 2019; Yoshimatsu et al., 2006). Another key research question for the future will be whether the signature features of the risk stratifying cells seen here will also be seen in other types of intractable human malignancies. Intriguingly, p-STAT5, p-ERK, and p-STAT3 signaling profiles reminiscent of the negative prognostic cells from glioblastoma have been seen in leukemia (Irish et al., 2004; Kotecha et al., 2008; Levine et al., 2015) and ovarian cancer (Gonzalez et al., 2018).

The GPP signature, in contrast, was defined by EGFR and CD44 co-enrichment, diminished evidence of proliferation, and specific lack of STAT5 phosphorylation. GPP cells were further associated with higher proportions of tumor-infiltrating immune cells. This result suggests an understanding of prognostic cell content or biomarkers may be relevant for immunotherapy research in glioblastoma. Previous DNA and RNA-driven molecular subtyping predicts EGFR expression in the classical subset of glioblastoma tumors and CD44 expression in mesenchymal tumors (Verhaak et al., 2010). As these categories were primarily based on bulk tumor data, cells co-expressing EGFR and CD44 (classified as GPP cells in this study) may have previously been missed, although single glioma cells have been shown to simultaneously amplify sequence or co-express transcripts for important signaling regulators (Patel et al., 2014; Snuderl et al., 2011). EGFR has been extensively studied as a driver of gliomas in the past (reviewed in Saadeh et al., 2018), and the association of this gene and transcript with outcome has been a matter of debate (Li et al., 2018; Saadeh et al., 2018; Xu et al., 2017). This study finds that expression of EGFR protein is associated with better overall survival. One reason for the difference between this study and other reports may be that EGFR protein levels were measured in individual cells rather than copy number analysis or transcript levels in bulk tumor samples; our own analyses and others’ have indicated that copy number or transcript level are not necessarily predictive of protein expression (Baser et al., 2019; Brennan et al., 2009; Chakravarty et al., 2017). Although antibody-based methods for protein detection, like those used here, depend on the specificity of each selected clone, it is important to note that two different, rigorously validated antibodies (mass cytometry, clone AY13; TMA, clone A-10) gave the same results (Figure 5). S100B has been explored as a serum biomarker (Holla et al., 2016), and S100B is known for its impact on macrophages, including microglia (Wang et al., 2013). These features of negative and positive prognostic cells extend the single cell phospho-specific flow cytometry approach to a new solid tumor that is in urgent need of new biological insights and targets.

When applied to a new glioblastoma dataset as well as a previously published study of blood cancer, RAPID reliably identified cells whose abundance was predictive of good or poor outcome. Cellular identification was robust, stable, and reproducible, and independent of the specific dimensionality reduction tools used. Critically, the discoveries from RAPID were able to inform a scoring system for detection of GNP-like and GPP-like phenotypes in IHC data that stratified patient outcome in 73 patient samples. RAPID also led to the development of a lower-dimensional cytometry pipeline which could be optimized for clinical stratification. There is now the exciting potential to extend the hypotheses suggested by RAPID into clinical research studies using either traditional flow cytometry or IHC on widely available formalin-fixed, paraffin-embedded samples, as in the biological validation here (Figure 5). Thus, techniques accessible to clinical research, such as IHC, could be informed by the results from RAPID and envisioned as a way to assign glioblastoma patients to treatment groups in early phase clinical trials.

Materials and methods

Key resources table
Reagent type
(species) or
resource
DesignationSource or
reference
IdentifiersAdditional
information
Biological sample (Homo Sapien)Primary glioblastoma tumorsVanderbilt University Medical CenterFreshly isolated from primary glioblastoma resections
ReagentRhodiumFluidigmCat# 201103AMC (1:4000)
AntibodyAnti-Cyclin B1 (mouse-monoclonal)BD BiosciencesRRID:AB_395287
Cat#554176 Clone: GNS-1
MC (1:100)
AntibodyAnti-TUJ1 (mouse-monoclonal)BiolegendRRID:AB_2313773
Cat#801201 Clone: TUJ1
MC (1:100)
AntibodyAnti-cCasp3 (rabbit-monoclonal)FluidigmRRID:AB_2847863
Cat#3142004A Clone: 5A1E
MC (1:100)
AntibodyAnti-CD117 (mouse-monoclonal)FluidigmRRID:AB_2847864
Cat#3143001B Clone:104D2
MC (1:100)
AntibodyAnti-S100B (mouse-monoclonal)BD BiosciencesRRID:AB_647296
Cat#612376 Clone: 19/S100B
MC (1:100)
AntibodyAnti-CD31 (mouse-monoclonal)FluidigmRRID:AB_2737262
Cat#3145004B Clone: WM59
MC (1:100)
AntibodyAnti-ɣH2AX (mouse-monoclonal)FluidigmRRID:AB_2847865
Cat# 3147016A Clone: JBW301
MC (1:100)
AntibodyAnti-CD34 (mouse-monoclonal)FluidigmRRID:AB_2810243
Cat#3148001B Clone: 581
MC (1:100)
Antibodyp-4E-BP1 (T37/T46)FluidigmRRID:AB_2847866
Cat# 3149005A Clone: 236B4
MC (1:100)
AntibodyAnti-p-STAT5 (Y694) (mouse-monoclonal)FluidigmRRID:AB_2744690
Cat#3150005A Clone:47
MC (1:100)
AntibodyAnti-BMX (mouse-monoclonal)BD BiosciencesRRID:AB_2290762
Cat# 610793 Clone: 40/BMX
MC (1:100)
AntibodyAnti-p-AKT (S473) (rabbit-monoclonal)FluidigmRRID:AB_2811246
Cat#3152005A Clone: D9E
MC (1:100)
AntibodyAnti-p-STAT1 (Y701)
(rabbit-monoclonal)
FluidigmRRID:AB_2811248
Cat#3153003A Clone: 58D6
MC (1:100)
AntibodyAnti-CD45 (mouse-monoclonal)FluidigmRRID:AB_2810854
Cat# 3154001B Clone: HI30
MC (1:400)
AntibodyAnti-NCAM/CD56 (mouse-monoclonal)BiolegendRRID:AB_604092
Cat# 318302 Clone: HCD56
MC (1:100)
AntibodyAnti-p-p38 (T180/Y182)
(rabbit-monoclonal)
FluidigmRRID:AB_2661826
Cat# 3156002A Clone: D3F9
MC (1:100)
AntibodyAnti-p-STAT3 (Y705) (mouse-monoclonal)FluidigmRRID:AB_2811100
Cat# 3158005A Clone: 4/P-STAT3
MC (1:100)
AntibodyAnti-ITGα6/CD49F (rat-monoclonal)BiolegendRRID:AB_345296
Cat# 313602 Clone: GoH3
MC (1:100)
AntibodyAnti-CD133 (mouse-monoclonal)Miltenyi BiotechRRID:AB_244339
Cat# 130-090-422
Clone: AC133
MC (1:50)
AntibodyAnti-PDGFRα (mouse-monoclonal)BiolegendRRID:AB_755996
Cat#323502 Clone: 16A1
MC (1:50)
AntibodyAnti-SOX2 (mouse-monoclonal)BD BiosciencesRRID:AB_10694256
Cat# 561469 Clone: O30-678
MC (1:100)
AntibodyAnti-SSEA-1/CD15 (mouse-monoclonal)FluidigmRRID:AB_2810970
Cat# 3164001B Clone: W6D3
MC (1:100)
AntibodyAnti-EGFR (mouse-monoclonal)BiolegendRRID:AB_10945161
Cat# 352902 Clone:AY13
MC (1:100)
AntibodyAnti-p-NFκB p65 (S529) (mouse-monoclonal)FluidigmRRID:AB_2847867
Cat# 3166006A Clone: K10-895.12.50
MC (1:100)
AntibodyAnti-L1CAM (mouse-monoclonal)BD BiosciencesRRID:AB_395337
Cat#554273 Clone: 5G3
MC (1:100)
AntibodyAnti-Nestin (mouse-monoclonal)MilliporeRRID:AB_2251134
Cat# MAB5326 Clone:10C2
MC (1:100)
AntibodyAnti-CD44 (mouse-
monoclonal)
BiolegendRRID:AB_1501199
Cat# 338802 Clone: BJ18
MC (1:100)
AntibodyAnti-GFAP (mouse-monoclonal)BD BiosciencesRRID:AB_396366
Cat# 556328 Clone: 1B4
MC (1:200)
AntibodyAnti-p-ERK1/2 (T202/Y204) (rabbit-monoclonal)FluidigmRRID:AB_2811250
Cat#3171010A Clone: D13.14.4E
MC (1:100)
AntibodyAnti-p-S6 (S235/S236) (mouse-monoclonal)FluidigmRRID:AB_2811251
Cat#3172008A Clone: N7-548
MC (1:100)
AntibodyAnti SOX10 (mouse-monoclonal)Santa CruzRRID:AB_10844002
Cat#sc-365692 Clone: A-2
MC (1:100)
AntibodyAnti-HLA-DR (mouse-monoclonal)FluidigmRRID:AB_2665397
Cat# 3174001B Clone: L243
MC (1:200)
AntibodyAnti-p-HH3 (rat-monoclonal)FluidigmRRID:AB_2847869
Cat# 3175012A Clone: HTA28
MC (1:400)
AntibodyAnti-Histone H3 (rabbit-monoclonal)FluidigmRRID:AB_2847870
Cat# 3176016A Clone: D1H2
MC (1:200)
AntibodyS100B (rabbit-polyclonal)DakoRRID:AB_2811056
Cat#GA50461-2
IHC (RTU)
AntibodyEGFRSanta CruzRRID:AB_10920395
Cat# sc-373746
Clone: A-10
IHC (1:100)
Software, algorithmRAPIDhttps://github.com/cytolab/RAPID
Data filesFCS data fileshttps://flowrepository.org/id/FR-FCM-Z24K

Lead contact and materials availability

Request a detailed protocol

Further information and requests for datasets and materials should be addressed to jonathan.irish@vanderbilt.edu.

Experimental model and subject details

Patient samples

Request a detailed protocol

Surgical resection specimens of 28 IDH-wildtype glioblastomas collected at Vanderbilt University Medical Center between 2014 and 2016 were processed into single cell suspensions following an established protocol (Leelatian et al., 2017b). Only samples that were confirmed to be IDH-wildtype glioblastomas by standard pathological diagnosis were used. All samples were collected with patient informed consent in compliance with the Vanderbilt Institutional Review Board (IRBs #030372, #131870, #181970), and in accordance with the declaration of Helsinki.

Patient characteristics and collection of clinical data

Request a detailed protocol

Additional patient characteristics are included in Supplementary file 3 for all samples in this study. All patients were adults (≥18 years of age) at the time of their maximal safe surgical resection of their cerebral (supratentorial) glioblastomas. Extent of surgical resection was independently classified as either gross total or subtotal resection by a neurosurgeon and a neuroradiologist. Gross total resection was defined as agreement by both viewers of no significant residual tumor enhancement on patients’ gadolinium-enhanced magnetic resonance imaging (MRI) of the brain obtained within 24 hr after surgery. All patients were considered for treatment with postoperative chemotherapy (temozolomide) and radiation according to the standard of care (Stupp et al., 2005), after determination of MGMT promoter methylation status by pyrosequencing (Cancer Genetics, Inc, Los Angeles, CA, USA). Multiplex polymerase chain reaction (PCR) was used to determine IDH1/2 mutational status. Patients’ postoperative course was followed until February 2019, noting time to first, definitive radiographic progression or recurrence of glioblastoma as agreed upon by the treating neuro-oncologist and neuroradiologist, and the time to patients’ death. All deaths were deemed to be due to the natural course of patients’ glioblastoma. Median overall survival of the analyzed 28 patients with IDH wild-type glioblastoma was 388.5 days (13 months) and median PFS was 187.5 days (6.3 months), which is typical for the disease (Ostrom et al., 2017; Stupp et al., 2005).

Method details

Mass cytometry analysis

Request a detailed protocol

Cells derived from patient samples were prepared as previously described (Leelatian et al., 2017b). A multi-step staining protocol was used, which included 1) live surface stain, 2) 0.02% saponin permeabilization intracellular stain, and 3) intracellular stain after permeabilization with ice-cold methanol. All antibodies used, including clone information, and the steps when used are given in Supplementary file 4. After staining, cells were resuspended in deionized water containing standard normalization beads (Fluidigm) (Finck et al., 2013), and collected on a CyTOF 1.0 instrument located in the Cancer and Immunology Core facility at Vanderbilt University. Mass cytometry standardization beads were used to remove batch effects and to set the variance stabilizing arcsinh scale transformation for each channel following field-standard protocols (Greenplate et al., 2019; Leelatian et al., 2015; Leelatian et al., 2017b). Rhodium viability stain and cleaved caspase-3 antibody were included in staining to exclude non-viable and apoptotic cells, respectively. Detection of total histone H3 was used to identify intact, nucleated cells (Leelatian et al., 2017a). A 34-dimensional mass cytometry antibody panel was used to analyze over 2 million viable cells from 28 tumors (ranging from 4860 to 336,284 cells per tumor). Data were normalized with MATLAB-based normalization software (Finck et al., 2013), and were arcsinh transformed (cofactor 5), prior to analysis using the Cytobank platform (Kotecha et al., 2010). Positively identified cells were defined by having signal above 10 on any channel on which an antibody was used to detect antigen. A patient-specific t-SNE view was generated, using 26 of the measured markers for all tumor and stromal cells from each patient’s tumor (Amir et al., 2013; Supplementary file 4). Immune (CD45+) and endothelial cells (CD31+) were computationally excluded from each individual patient prior to subsequent downstream analysis. Remaining CD45-CD31- cells were included in a common t-SNE analysis, generated using 24 of 34 measured markers (Supplementary file 4). Distribution of each of the 28 patients’ cells on the common t-SNE axes and mass intensity for each marker are shown in Supplementary file 6. This common t-SNE analysis was used for automated analysis of risk stratifying cell subsets in RAPID (below).

Quantification and statistical analysis

Implementation of RAPID in R

Request a detailed protocol

FCS files for each patient sample (28) containing only cells of interest (non-immune, non-endothelial cells) were input in R (4,710 cells from each patient, 131,880 cells total). Cell subset identification was performed using the previously published FlowSOM R package (Van Gassen et al., 2015). t-SNE values (t-SNE1_glioblastoma and t-SNE2_glioblastoma) from t-SNE (or UMAP values from UMAP) analysis of CD45-CD31- glioblastoma cells from 28 patients were used as parameters for cell subset clustering. Within the RAPID workflow, the optimal number of clusters was determined by first identifying, for each feature, the smallest number of clusters that minimizes the intra-cluster signal variance for that feature. Then, the optimal cluster number of the data set was determined by taking the median of the optimal numbers for each individual feature. Once the cluster number was determined, the abundance of cell subsets and their clinical significance was assessed using outcome-guided analysis. Patients were divided into Low and High groups, based on the distribution (interquartile variance, IQR) of the abundance of a given cell subset across the cohort. A univariate Cox regression analysis was then used to estimate the effect size (hazard ratio, HR, of death) on survival and quantify its statistical significance with a p-value. The RAPID program output included: 1) a PDF containing two color coded, 2D t-SNE (or UMAP) plots (.png), one depicting all FlowSOM clusters and one depicting prognostic status and p-value, Kaplan-Meier survival plots of patients for each subset; 2) MEM outputs including a PDF of the MEM heatmap as well as. txt files of MEM and Median values for each feature, enrichment scores, and IQR values; 3) a .txt file of the FlowSOM cluster value for prognostic subsets, a .txt file of survival statistics for each FlowSOM cluster, and a .csv file with subset abundance information per patient;and 4) new FCS files with added columns for cluster and prognostic status for each cell. In this study, abundance of Glioblastoma Negative Prognostic (GNP) and Glioblastoma Positive Prognostic (GPP) cells in each tumor was quantified as percentages per total glioblastoma cells (i.e. immune and endothelial cells were already excluded). Total GNP and GPP cell abundance was determined for each patient by adding the events in all GNP (or GPP subsets, respectively) together. GNP high patients were identified as containing more GNP cells than the IQR of total GNP abundance (3.1%). GPP high patients were defined in the same manner (total GPP cell abundance IQR = 8.58%). MEM analysis was performed in R, using the previously published R package (Diggins et al., 2017). In short, MEM captured and quantified cell subset-specific feature enrichment by scaling the magnitude (median) differences between clusters, depending on the spread (IQR) of the data. These values were then computed in comparison to the remaining cells in a given dataset. MEM values were interpreted as either being positively enriched (▲, UP positive values) or negatively enriched (▼, DN negative values). The variation of a given cellular feature across GNP or GPP cell subsets was quantified as ± standard deviations (SD). For the primary data set used in this study (131,880 cells), RAPID ran in 15 min from start to finish after dimensionality reduction.

Cluster stability testing

Request a detailed protocol

Ten independent t-SNE analyses were performed on equal numbers of randomly sampled cells from each patient (4,710 cells per patient, 131,880 total cells). RAPID was used to analyze each of these ten t-SNE runs. For each sub-sampling of cells and the respective t-SNE, an additional 99 FlowSOM clusterings were performed without setting a seed for reproducible results. After each analysis, an F-measure was calculated per cluster, measuring both the precision and recall of cell assignment. After 100 total FlowSOM runs, each of the original clusters had an average F-measure, interpreted here as a measure of cluster stability.

Survival and statistical analysis

Request a detailed protocol

Time from surgical resection to death (overall survival, OS) and time from surgical resection to the initial radiographic recurrence or death before radiographic assessment (PFS) were depicted using right-censored Kaplan-Meier curves and analyzed in R. Survival time points were censored if, at last follow up, the patient was known to be alive or had not had radiographic progression. Differences in the survival curves of groups were compared using the Cox univariate regression model, reporting a hazard ratio (HR) with 95% confidence intervals between the survival curves.

A Cox proportional-hazards regression model was created to assess the influence of GNP and GPP cells on OS and PFS as continuous variables while accounting for other factors known to affect survival, including age at diagnosis, MGMT promoter methylation status, extent of surgical resection (EOR), treatment with temozolomide (TMZ), and radiation (XRT). The hazard model can be written as:

HR=h(t)h0(t)=e(bGNPGNP+bageAge+bMGMTMGMT+bEOREOR+bXRTXRT+bTMZTMZ)

where h(t)h0(t) represents the ratio of hazard comparing the risk of death at time t to the baseline hazard (obtained when all variables are equal to zero) and ebx represents the hazard ratio of variable x. The data were fit using R software, version 3.5 (R foundation for Statistical Computing, Vienna, Austria). The proportional-hazards assumption was tested in all multivariate models and supported by a non-significant relationship between Schoenfeld residuals and time for each covariate included in the model (p > 0.38; degree of freedom = 1) and the overall model (p = 0.96; degrees of freedom = 6 and 7). Statistical significance α was set at 0.05 for all statistical analyses, one- or two-tailed noted in figure legends.

An F-measure was used to quantify the level of agreement between classifications of patients or cells between alternative analysis strategies as wells as multiple RAPID iterations. The F-measure is the harmonic mean of the precision and recall given by the equation F = 2 * (Precision * Recall) / (Precision + Recall) where Precision = True Positive / (True Positive + False Positive) and Recall = True Positive / (True Positive + False Negative). An F-measure of 1 indicates perfect agreement between two different strategies or iterations as opposed to an F-measure of 0 which would mean no agreement between classifications of patients or cells from two strategies or iterations. Patients could be classified as GNP high, GNP and GPP low, or GPP high, while cells were classified as GNP, GPP, or neither. None of the patients in this study were classified as both GNP high and GPP high. To calculate the F-measure of patient categorization, the classification of the 28 patients into the three prognostic groups from the t-SNE implementation of RAPID was used as the reference point from which to compare patient classification resulting from the UMAP implementation of RAPID. Similarly, the stability of the RAPID workflow in assigning cells to GNP, GPP, or non-significant clusters was tested by using the t-SNE implementation of RAPID (FlowSOM seed 38) as the reference from which to compare 100 iterations of RAPID (random FlowSOM seed per iteration). Calculation of the F-measure was implemented using R software, version 3.5.

Computer specifications

Request a detailed protocol

R was downloaded from https://cran.r-project.org/bin/ and implemented using the R Studio GUI https://www.rstudio.com/products/rstudio/download/#download. PC users also needed to download R Tools https://cran.r-project.org/bin/windows/Rtools/ and MAC users needed to download X11 Quartz https://www.xquartz.org/. RAPID was implemented, using these tools, on several personal computers. It was developed on a Dell Precision 7820 with a solid state hard drive and 64 GB RAM.

Tissue microarray construction and analysis

TMA sample selection

Request a detailed protocol

Formalin-fixed paraffin-embedded (FFPE) glioblastoma specimens were identified using the Vanderbilt Surgical Pathology database. The absence of IDH mutation was determined by multiplex PCR coupled with base extension assay (SNaPshot reaction mixture, Life Technologies, Carlsad, CA, USA), followed by capillary electrophoresis on an ABI Genetic Analyzer 3130XL and GeneMapper v.4.1. Following confirmation of the previously rendered histologic diagnosis, hematoxylin and eosin stained slides were scanned on the Panoramic P250 (3DHistech) whole slide scanner. Areas containing viable tumor were identified and circled by two pathologists (BM, NL).

TMA construction and staining

Request a detailed protocol

Blocks were delivered to the Vanderbilt University Medical Center TPSR (Translational Pathology Shared Resource), where cores were extracted from the encircled areas. Donor blocks and recipient blocks were loaded into the Tissue Microarray Grandmaster (3DHistech). The virtual slide images were aligned and overlaid on the tissue block and cores were removed from the donor block based on the pathologist annotation. Three 1 mm core samples were collected from each tumor and placed in the recipient block. IHC of serial sections of two TMA blocks (<10 μm thick) were stained with primary antibodies conjugated to HRP and 3,3′-Diaminobenzidine (DAB) detection for EGFR and S100B, and counter stained with Hematoxylin by the Translational Pathology Shared Resource (TPSR) at Vanderbilt University. Digital images were obtained with an Ariol SL-50 automated scanning microscope and the Leica SCN400 Slide Scanner from VUMC Digital Histology Shared Resource.

MarkerCloneCompany
S100BpolyclonalDako
EGFRA-10Santa Cruz Biotechnology

TMA imaging and analysis

Request a detailed protocol

Whole slide imaging was performed in the Digital Histology Shared Resource at Vanderbilt University Medical Center (www.mc.vanderbilt.edu/dhsr). For each marker, a QuPath project was created and all slide images were uploaded to be processed in batch. In QuPath, regions of interest (ROI’s) were designated by circling each tumor core. Each ROI was computationally linked to the patient by a unique identifier, allowing cores from the same patient to be grouped. For each marker, the ‘Estimate Stain Vectors’ function in QuPath was used to find the appropriate deconvolution parameters to isolate the signal intensity contribution from Hematoxylin and DAB respectively. The deconvolution parameters are listed below:

MarkerHematoxylinDABBackground
S100B0.604840.675320.4220440.209960.502340.83879224223221
EGFR0.723530.637370.265080.249520.523840.81445221219220

For each ROI, nuclear segmentation on the Hematoxylin Optical Density (OD) was optimized using the ‘Watershed cell detection’ function in QuPath, and the cytoplasm around each nucleus was estimated by performing a 3 μm expansion from the nuclear outline. All measurements from all detections were exported for analysis in R. In R, specific parameters (Name, Cell.DAB.OD.mean, Cytoplasm.DAB.OD.mean, and Nucleus.DAB.OD.mean) were extracted for every detection (cell) from every patient. These parameters identify the ROI/core from which the cell was segmented, its corresponding patient ID, the mean optical density of the deconvoluted DAB signal in each entire segmented cell, the DAB signal in only the cytoplasm, and the signal exclusively in the nucleus respectively. The full TMA map linking QuPath IDs, Patient_IDs, Block, and Core_IDs was also imported. In addition, for each marker, the median DAB intensity was calculated for each patient (averaged over three cores). The thresholds and measurements on which these thresholds were applied are summarized below:

MarkerMeasurementThreshold - Block AThreshold - Block B
S100BCell_DAB0.40.4
EGFRCell_DAB0.20.2

Patients were categorized as GNP-like if their TMA cores had S100B staining intensity above the first quartile of S100B intensities (>0.6728) and had EGFR staining below the 50th percentile (<0.4199). Patients were categorized as GPP-like if their TMA cores scored in the top tertile of EGFR intensity (>0.6929).

Data and code availability

Data availability

Request a detailed protocol

Annotated flow data files are available at the following link https://flowrepository.org/id/FR-FCM-Z24K. FCS files that contain the cells from the representative t-SNE can also be found on the GitHub page: https://github.com/cytolab/RAPID. Patient-specific views of population abundance and channel mass signals for all analyzed patients in this study are found in Supplementary file 6.

Code availability

Request a detailed protocol

RAPID code is currently available on Github, along with FCS files from Dataset 1 and 2 for analysis, at: https://github.com/cytolab/RAPID ‘2020-01-15 RAPID Workflow Script on Davis Dataset.Rmd’ contains RAPID code for a single run as presented in Figure 1b. ‘2020-04-21 RAPID Stability Tests.Rmd’ contains RAPID code for repeated stability tests as presented in Figure 1c.

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74

Decision letter

  1. C Daniela Robles-Espinoza
    Reviewing Editor; International Laboratory for Human Genome Research, Mexico
  2. Philip A Cole
    Senior Editor; Harvard Medical School, United States
  3. C Daniela Robles-Espinoza
    Reviewer; International Laboratory for Human Genome Research, Mexico
  4. Julie Laffy
    Reviewer; Weizmann Institute of Science, Israel

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

With larger datasets of single-cell data being made available, there is need in the field for analysis pipelines that can couple these experiments to clinical outcome and other variables in an unsupervised manner, and that can provide functional hypotheses that can be later tested in other cohorts in a simplified manner. Here, the authors have built RAPID, an algorithm that can take single-cell cytometry data as input and have extensively tested it in two cancer datasets, identifying populations of risk-stratifying cells. We envision that in years to come, RAPID will be able to identify markers correlated with important clinical characteristics, and that the functional relationships it finds and are confirmed will be useful for clinical practice.

Decision letter after peer review:

[Editors’ note: the authors submitted for reconsideration following the decision after peer review. What follows is the decision letter after the first round of review.]

Thank you for submitting your work entitled "High risk glioblastoma cells revealed by machine learning and single cell signaling profiles" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by a Senior Editor.

Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work will not be considered further for publication in eLife.

The reviewers, whose assessments are included below, agreed that RAPID is potentially interesting and useful to the scientific community but the lack of access to the code to test it prevents further assessment. In discussion, they agreed it is necessary for code and data to be accessible before publication for reviewers' perusal. Additionally, they considered the claim that biological findings "could be immediately used to guide clinical design" misguided, as an extensive comparison with existing literature has not been undertaken and extrapolates from results in a very small number of patients. Another suggestion for improving the work include analysing a larger validation cohort to establish the performance of RAPID in different datasets and under different conditions.

Reviewer #1:

In this work, Leelatian and collaborators study a set of 28 resected glioblastoma tissues from as many patients through single-cell technologies assessing 34 phospho-proteins, transcription factors and lineage proteins. They develop 2 technologies: 1) a set of 34 antibodies for single cell mass cytometry and 2) an unsupervised machine learning algorithm called RAPID – which is able to identify phenotypically similar cells ("clusters") and their association to survival variables.

I think the scale of the work is impressive (the study analyses >2 million cells) and the methodology seems very useful. However, there are a few issues that I think would need to be addressed before this manuscript is published.

1) I believe the software to be the main contribution of this manuscript, as the two proteins discovered at the end of the single-cell RAPID analysis have been studied in the context of glioblastoma/glioma and survival before (more of this below, but e.g., PMIDs 9445288, 28693199, 28885661, 27401156, 23719262). The data needs to be made publicly available, the same as the software. It says in the manuscript that it will be done so upon publication, but it absolutely needs to be released by the time the manuscript is published. At the moment it is impossible for me to assess whether RAPID may be easy to use, what variables it needs as input, and how easy it may be to install. Also, I do not know if the authors plan to release a detailed user manual, something that would be essential for publishing a piece of software.

2) I believe that the validation of the software needs to be done on more than one existing single-cell dataset, if possible. The fact that running different iterations of RAPID resulted in highly variable results in the same dataset (18 to 48 clusters identified), and that only one cluster overlapped in the OS vs PFS analyses calls, in my opinion, for more extensive testing. It's interesting that only 7 out of the 43 clusters identified when all cells were put together were considered "universal". Does this mean that this technique is highly susceptible to the number of tumours tested? Hopefully the software is easy to run and answers to these questions can be achieved relatively easily.

3) How do the authors reconcile their results with the observations by other studies that EGFR overexpression is associated with poor prognosis glioblastoma, seemingly contrary to the results in this study? (e.g. PMIDs 29445288, 28693199, 28885661). Linked to this point, I believe that the results are overhyped at times. For example, the phrase "These findings could be used immediately to guide clinical trial design" should be either removed or rewritten in a more measured way. Which population did the authors study, only European-descent (i.e. white) patients? Also, the number of samples is quite small for such a claim. Generalising in this way would be hurtful to global clinical practice.

4) About the classification of tumours into high/low for distinct markers. Two tumours may be highly similar and classified in different groups (in the example given in the text, tumours with 2.68% of cells in the cluster were classified as “low” but those with 2.7% as “high”). Would the conclusions be maintained if the high group was defined as those above the 75th percentile and the low group below the 25th percentile? How important is this definition to the results of the RAPID workflow?

Reviewer #2:

This is a potentially interesting manuscript that seeks to profile glioblastoma samples by mass cytometry to assess intra-tumoral heterogeneity at the protein level. The authors profile a large number of cells in 28 samples using 34 markers (retaining 26 markers in final set). They then use this information (with various filtering steps) to identify modules of variability within tumors and then use some of those modules/signatures to stratify patients’ outcome with a machine learning approach. They suggest that their findings can be "immediately" used to inform clinical trial design.

The study suffers from intrinsic limitations that are hard to overcome and that would be acceptable if the authors had not jumped to conclusions, they do not have the power to support. First, the analysis relies on 26 measured proteins, which is a rather limited set of parameters; any analysis that relies on protein measurement is also only as good as the antibody used and one should keep in mind that some of the clusters inferred might be technical; hence any major statement would require some form of validation by other approaches which is mostly lacking in this manuscript. But all that would be addressable/ok, if the authors had been more reasonable in their conclusions.

By far the main limitation and major caveat this reviewer sees is that the authors draw major conclusions on stratifying GBM patients for their outcome based on a cohort of 28 patients. This is massively underpowered to address patient stratification. The markers they home on are nothing very new (EGFR, SOX2, S100 etc) that have been interrogated in much larger cohorts (TCGA has >400 samples) and have not yielded as valuable information as claimed by the authors.

This study would only be useful if the 28 samples were a discovery cohort and their findings were validated in hundreds of patients’ samples. If EGFR was that good at predicting outcome of GBM, we would know by now. So unfortunately, underpowered study for the claims on outcome. The mass cytometry profile/data would be of interest if followed-up, but not for survival analysis given limited dataset.

Reviewer #3:

Leelatian, Sinnaeve et al. describes a single-cell mass cytometry data-set measuring 34 proteins and phospho-proteins across 28 glioblastoma patients. This represents a rich data-set which the authors take advantage to develop a novel computational tool, RAPID, to identify cell type clusters that inform on patient survival. This approach identified 9 cell clusters that stratified patients according to their overall survival, and these can be classified into negative and positive prognostic using S100B and EGFR protein abundance.

Although further work would be required to confirm the robustness and utility of these findings to the clinic, this study presents novel insights, tools and data-sets that will be of interest to the scientific community.

1) The separation and classification of the 43 cell clusters (Figure 1B) is not very clear. Consistently, these varied extensively across the 10 down-sample analyses with 18-48 optimal clusters. Have the authors tried multiple configurations of the tSNE parameters (e.g. perplexity, learning rate)? Clustering could also be compared to current state-of-the-art pipelines (PMID: 31217225), for example running Principal Component Analysis (PCA) as a pre-processing step before the non-linear tSNE.

2) Several clusters of glioblastoma cell types exist (Figure 2A), in particular a subset (lower left) seems to have a heterogeneous expression of Nestin and strong phosphorylation of AKT. Is this a common observation across other patient samples? If so, what is the percentage of cells this represents and could there be an impact on the analysis if this population, like the other cell types, were removed? Additionally, p-AKT is enriched in GNP cluster 37 and GPP cluster 41. Specifically, GPP cluster 41 is the only cluster enriched in 2 patients (LC03 and LC09) which are classified as GNP. To me this is counterintuitive, and might indicate some unique variation, either true biological or artefact, of this cell type.

3) The potential translation into the clinic seems to be one of the main messages of the manuscript and is indeed particularly interesting. Specifically, the fact that only S100B and EGFR expression is enough to stratify patient overall survival. Could the observations gained from the single-cell experiments be used to re-analyse bulk patient genomic and proteomic data-sets to support further the clinical relevance of these findings? For example, what is the percentage of glioblastoma IDH1 wild type patients with S100B and EGFR high or low?

4) The availability of RAPID as a tool to the community is important and a positive aspect of this work. Thus, I believe the source should have been made accessible prior to publication.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "Unsupervised machine learning reveals risk stratifying glioblastoma tumor cells" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Philip Cole as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Julie Laffy (Reviewer #4).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, we are asking editors to accept without delay manuscripts, like yours, that they judge can stand as eLife papers without additional data, even if they feel that they would make the manuscript stronger. Thus, the revisions requested below only address clarity and presentation.

Summary:

In this revised manuscript, Leelatian, Sinnaeve and collaborators introduce the RAPID algorithm for identifying clusters of cells relevant for stratification of patient survival in single-cell datasets, especially mass cytometry datasets including more than 25 patients. In this new version, they have tested RAPID with two datasets from different cancer types, one generated by them and another one already published, and have performed extensive statistical validation both technical and biological.

Revisions:

Please address the following points raised by the reviewers:

1) Clustering is performed across all tumours combined, so the authors should distinguish inter-tumour effects from the intra-tumour effects that they describe. Could they show for example that the cluster phenotypes are derivable from a significant number of individual tumours? Inter-tumour effects would imply some different conclusions to the ones the authors draw, depending on the patient specificities of these effects. Assuming no patient specificity, one potentially interesting and even complementary conclusion would be that basal levels of certain proteins in a tumour are indicative of patient outcome (indiscriminately of any particular cellular subpopulation). Do the authors see that EGFR, S100B and other enriched proteins vary primarily within, or between, tumours? To check for patient specificity, the authors should state in the text i) which, if any, clusters were largely tumour-specific and ii) how many tumours contributed to each of the clusters. Without these figures, the biological relevance of each of the clusters (namely the NP and PP subsets) is difficult to assess. It might also be useful to add tSNE plots coloured by patient alongside the existing plots (Figures 1B, 2A, 4A). If there are indeed clusters dominated by individual tumours, I would think that they should be removed early on in the RAPID pipeline. If there are many such cases, then perhaps the authors would have to consider an alternative clustering approach.

2) Do any of the cell subsets identified (Figure 1B, C) reflect a proneural population? This is a known component of glioblastomas and has moreover been postulated to be associated with better prognosis because it is the dominant component in lower-grade gliomas (e.g. Verhaak et al., 2010). How do the authors reconcile this with their own findings? Or, if a proneural phenotype is lacking, could the authors comment on why that might be? One reason could be technical: simply that proneural markers were not sufficiently represented in the panel. If that is the case, the authors should acknowledge this in the text and take care not to overstate their results pertaining to the phenotypes of NP and PP subsets in GBM.

https://doi.org/10.7554/eLife.56879.sa1

Author response

[Editors’ note: the authors resubmitted a revised version of the paper for consideration. What follows is the authors’ response to the first round of review.]

Reviewer #1:

[…]

1) I believe the software to be the main contribution of this manuscript, as the two proteins discovered at the end of the single-cell RAPID analysis have been studied in the context of glioblastoma/glioma and survival before (more of this below, but e.g., PMIDs 9445288, 28693199, 28885661, 27401156, 23719262).

We appreciate the reviewer’s enthusiasm regarding development of the software, and its ability to be used in the context of many different human diseases and we have updated the Discussion to include the references and to compare our results, especially for EGFR. Two key considerations for comparing our results to prior literature are: 1) Our analysis quantifies per-cell level of EGFR protein, as opposed to DNA mutation or RNA transcript in prior studies, and 2) we consider both EGFR and S100B together when scoring patients. Notably, the protein differences in EGFR and S100B revealed by mass cytometry were validated by IHC in a larger cohort.

The data needs to be made publicly available, the same as the software. It says in the manuscript that it will be done so upon publication, but it absolutely needs to be released by the time the manuscript is published. At the moment it is impossible for me to assess whether RAPID may be easy to use, what variables it needs as input, and how easy it may be to install.

All data and code were available to reviewers during the prior review, but we appreciate that the links were not apparent in the Supplement. We have updated the manuscript in several places to clarify how to access data. The link to Dataset 1 is:

https://flowrepository.org/id/RvFrKN2ctDJmmVNE4ZnMJrAZeVraXbwvrhjx3YaBZIV6nWIanMrbhrVBx7yvODtX

FlowRepository is public and enables deeper annotation. Dataset 2 and RAPID code are on Github and public here: https://github.com/cytolab/RAPID/. This includes RAPID source code and cytometry files from the published pre-B ALL example (Dataset 2).

Also, I do not know if the authors plan to release a detailed user manual, something that would be essential for publishing a piece of software.

The RAPID code and a detailed walkthrough with comments is provided as an R code and an R markdown file (at https://github.com/cytolab/RAPID) which includes instructions and guidelines within the code itself.

2) I believe that the validation of the software needs to be done on more than one existing single-cell dataset, if possible.

RAPID was used on an additional, single cell data set from B cell precursor acute lymphoblastic leukemia (Good et al., 2018, Dataset 2). The results from this test were last in the original Supplemental Figure 6 and have now been moved into the main text (Figure 2) and emphasized in the Results to highlight this important validation. We sought additional test datasets, but there were not published, annotated, single cell datasets that met the criteria for this study. We have noted this in the Discussion.

The fact that running different iterations of RAPID resulted in highly variable results in the same dataset (18 to 48 clusters identified), and that only one cluster overlapped in the OS vs PFS analyses calls, in my opinion, for more extensive testing. It's interesting that only 7 out of the 43 clusters identified when all cells were put together were considered "universal". Does this mean that this technique is highly susceptible to the number of tumours tested? Hopefully the software is easy to run and answers to these questions can be achieved relatively easily.

We appreciate the concerns regarding cluster variation and have added significant statistical testing to directly address this (see Statistical validation 1 and 2, especially). Note that we believe it is an advantage of RAPID to aim to be independent of user input and bias. Thus, we prefer to avoid having the user specify a target number of clusters, although it is possible to use the RAPID code in a way where the user can specify this number (and set a seed to achieve deterministic, invariable results). The additional statistical testing includes iterative FlowSOM runs, multiple runs of t-SNE, and comparison of subset features across many such runs to identify stable clusters and phenotypes. These results are now included as main Figure 3 and are graphically summarized in new panels in Figure 1C. Critically, while different absolute numbers of clusters were identified, the cellular phenotypes identified by RAPID from these clusters were consistent across runs and cell subsampling (Figure 3).

3) How do the authors reconcile their results with the observations by other studies that EGFR overexpression is associated with poor prognosis glioblastoma, seemingly contrary to the results in this study? (e.g. PMIDs 29445288, 28693199, 28885661).

We thank the reviewer for raising this concern and have updated the Discussion to compare and contrast our methods and findings to previous literature and have included these 3 EGFR citations.

Linked to this point, I believe that the results are overhyped at times. For example, the phrase "These findings could be used immediately to guide clinical trial design" should be either removed or rewritten in a more measured way. Which population did the authors study, only European-descent (i.e. white) patients? Also, the number of samples is quite small for such a claim. Generalising in this way would be hurtful to global clinical practice.

The reviewer makes a reasonable point, and we have tempered our language in this area, refocusing the manuscript on the use of RAPID as a discovery tool and referring to use in clinical research (as opposed to clinical practice). Regarding patient demographics, we confirmed that the overall survival characteristics were typical for the course of this disease in the United States patient population. Patient samples are reflective of the general demographics of patients at our institution (27 Caucasian patients and 1 African American patient).

4) About the classification of tumours into high/low for distinct markers. Two tumours may be highly similar and classified in different groups (in the example given in the text, tumours with 2.68% of cells in the cluster were classified as “low” but those with 2.7% as “high”). Would the conclusions be maintained if the high group was defined as those above the 75th percentile and the low group below the 25th percentile? How important is this definition to the results of the RAPID workflow?

We appreciate the point about the cutoff potentially being arbitrary and have address this directly in two ways. 1) In the revised manuscript we tried several cutpoints, including the 25%/75% approach suggested above, and found that the f-measure for patients falling into the same group is 0.86. 2) We also complemented the cutpoint version of the analysis with a continuous analysis using a multivariate Cox proportional-hazards model analysis that assessed GNP and GPP content as continuous features instead of using a cutpoint (now reported in the results).

Reviewer #2:

This is a potentially interesting manuscript that seeks to profile glioblastoma samples by mass cytometry to assess intra-tumoral heterogeneity at the protein level. The authors profile a large number of cells in 28 samples using 34 markers (retaining 26 markers in final set). They then use this information (with various filtering steps) to identify modules of variability within tumors and then use some of those modules/signatures to stratify patients’ outcome with a machine learning approach. They suggest that their findings can be "immediately" used to inform clinical trial design.

The study suffers from intrinsic limitations that are hard to overcome and that would be acceptable if the authors had not jumped to conclusions, they do not have the power to support.

We appreciate the concern that our wording was too strong for the data that we present. We have tempered our language throughout the manuscript to avoid over-stating conclusions and to focus on the utility of RAPID as a discovery tool that can generate hypotheses for further clinical research with other techniques (as in the IHC validation example included for Dataset 1).

First, the analysis relies on 26 measured proteins, which is a rather limited set of parameters; any analysis that relies on protein measurement is also only as good as the antibody used and one should keep in mind that some of the clusters inferred might be technical; hence any major statement would require some form of validation by other approaches which is mostly lacking in this manuscript. But all that would be addressable/ok, if the authors had been more reasonable in their conclusions.

The reviewer correctly points out that the validity of mass cytometry data, like other antibody-based approaches, is dependent on the quality of antibodies used for detection of features of interest. Our antibodies were rigorously validated and titrated on known controls prior to use on patient samples, and many of these antibodies have been published previously (Leelatian et al., 2017 and the companion protocol, Leelatian et al., 2017). We used IHC with different, validated antibody clones used routinely in our tissue pathology core to confirm the major findings (S100B v EGFR) in additional patient samples.

By far the main limitation and major caveat this reviewer sees is that the authors draw major conclusions on stratifying GBM patients for their outcome based on a cohort of 28 patients. This is massively underpowered to address patient stratification.

We appreciate the reviewer’s concern; for this reason, we included a tissue microarray with 73 cases for validation of the key biological features identified in the 28 dissociated samples used for mass cytometry analysis (Figure 5). In this case, the effect size of GNP/GPP prevalence is large relative to other studies reporting outcome stratification, and we were able to validate this stratification in an independent cohort.

The markers they home on are nothing very new (EGFR, SOX2, S100 etc) that have been interrogated in much larger cohorts (TCGA has >400 samples) and have not yielded as valuable information as claimed by the authors.

The reviewer correctly notes that EGFR, SOX2, and S100B have been investigated in GBM studies previously. The surprising finding in our work is not the specific features, all of which were included in our panel based on prior results, but that the abnormal co-expression of them at the protein level reveals new, previously unappreciated subsets, which also have distinct signaling phenotypes. The vast majority of GBM studies are based on measuring DNA and RNA, especially in bulk tumor samples, cell lines, or ex vivo conditions. We expect that transcript levels may give different results from protein measurements, especially when considered coordinately with other features rather than as a single variable. By combining primary patient samples from resected tumors and protein measurements, we sought to add to the field. We would note that the unexpected finding regarding EGFR revealing a subset of cells correlated with better outcomes, which was validated using traditional IHC, indicates how we can find unexpected and useful results with well-studied markers when assessing them at the single cell level using unsupervised analyses.

This study would only be useful if the 28 samples were a discovery cohort and their findings were validated in hundreds of patients’ samples. If EGFR was that good at predicting outcome of GBM, we would know by now. So unfortunately, underpowered study for the claims on outcome. The mass cytometry profile/data would be of interest if followed-up, but not for survival analysis given limited dataset.

We appreciate the reviewer’s concern regarding a pilot cohort size of N=28. For this reason, we tested whether the features most enriched on stable, risk-stratifying clusters found in our mass cytometry dataset could then stratify outcome in a separate, immunohistochemical analysis of N=73 glioblastoma cases. We agree that testing of the mass cytometry approach in a very large cohort would also be an exciting cancer biology advance, but it is beyond the scope of the present study focused on the RAPID algorithm.

Reviewer #3:

[…]

1) The separation and classification of the 43 cell clusters (Figure 1B) is not very clear.

We appreciate this feedback and have expanded both graphical depictions of our analysis (in Figure 1) and explanations in the text. In brief, RAPID employs a clustering algorithm called FlowSOM that builds self-organizing maps, based on t-SNE values, then groups nodes together based on similarity and the number of clusters input by the user. To avoid user bias in determining a cluster number, RAPID iteratively performed FlowSOM analyses from 5 to 50 clusters and chose a number that minimized intra-cluster variance for each feature. In Figure 1B, the clusters are projected back onto the t-SNE plot. We have also amended the Materials and methods to clarify this point.

Consistently, these varied extensively across the 10 down-sample analyses with 18-48 optimal clusters. Have the authors tried multiple configurations of the tSNE parameters (e.g. perplexity, learning rate)?

This question was addressed above in the reply to reviewer 1. Briefly, we have included a section on cell subsampling to directly address this concern.

Work in our own labs, as well as that by others (PMID: 31780669) suggest that increasing perplexity in t-SNE beyond optimized values, like those used here, does not significantly alter the results. We have tried different t-SNE parameters, run RAPID without using t-SNE, and run RAPID with UMAP (Figure 4), and the main results of the study are consistently observed.

Clustering could also be compared to current state-of-the-art pipelines (PMID: 31217225), for example running Principal Component Analysis (PCA) as a pre-processing step before the non-linear tSNE.

We appreciate the reviewer’s suggestion. PCA is routinely used as a pre-processing step in RNA-seq analysis prior to t-SNE implementation, due to the large number of genes measured, lack of dynamic range relative to noise, and preponderance of zero values (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). As a result of using targeted measurements and ionizing mass spectrometry, mass cytometry data do not have these issues (multiple log dynamic range, no zero / drop out, and a smaller number of highly impactful features measure). PCA analysis does not significantly reduce the dimensionality of mass cytometry data, as most of the measured features are non-redundant. Thus, PCA is not needed prior to t-SNE and, when used, does not significantly alter the resulting embedding. Notably, UMAP, which focuses more on global structure (more like PCA), returned similar results to the locally-oriented t-SNE analysis. Our workflow is otherwise comparable to the cited workflow, beginning with data clean up (which differs for sequencing data and cytometry data) followed by dimensionality reduction and clustering, (original modular workflow PMID: 25979346, reviewed in PMID: 27320317, updated regularly).

2) Several clusters of glioblastoma cell types exist (Figure 2A), in particular a subset (lower left) seems to have a heterogeneous expression of Nestin and strong phosphorylation of AKT. Is this a common observation across other patient samples? If so, what is the percentage of cells this represents and could there be an impact on the analysis if this population, like the other cell types, were removed?

The reviewer’s observation that several clusters of GBM cells exist is excellent and part of what inspired the development of RAPID to interrogate these clusters. The RAPID analysis will be impacted by which cells are included or excluded. To address this concern, we ran ten separate t-SNE analyses, subsampling cells each time (Statistical validation 2, Figure 3). In this revised manuscript, we have also included additional testing to assess how stable particular clusters and phenotypes are across multiple runs of t-SNE and FlowSOM. These improvements are highlighted in Figure 3.

In the specific case of Nestin/p-AKT, additional tumors in the cohort also have cells with this phenotype, though none as abundantly as LC26. The population is 7.61% of the LC26 tumor cells (9,220 cells out of 30,091). We also included patient-specific t-SNE maps with heat for all antigens, and % abundance for all clusters, as supplemental data (currently available as a PDF at https://www.biorxiv.org/content/10.1101/632208v3.supplementary-material).

Additionally, p-AKT is enriched in GNP cluster 37 and GPP cluster 41. Specifically, GPP cluster 41 is the only cluster enriched in 2 patients (LC03 and LC09) which are classified as GNP. To me this is counterintuitive, and might indicate some unique variation, either true biological or artefact, of this cell type.

The reviewer is correct in pointing out that p-AKT is enriched in GNP cluster 37 and GPP cluster 41. In our revised manuscript, cluster 41, while highly distinct when present, was not stable across multiple iterations of cell subsampling and therefore was removed from analyses of GNP/GPP. Relative to the other patients, LC03 and LC09 are enriched for cluster 41, but relative to their own tumor cell content, they have more GNP cells. We report that GNP cell content is the primary predictor of overall survival, suggesting that it is the balance of GNP and GPP cells that is indicative of a patient’s prognosis.

3) The potential translation into the clinic seems to be one of the main messages of the manuscript and is indeed particularly interesting. Specifically, the fact that only S100B and EGFR expression is enough to stratify patient overall survival. Could the observations gained from the single-cell experiments be used to re-analyse bulk patient genomic and proteomic data-sets to support further the clinical relevance of these findings? For example, what is the percentage of glioblastoma IDH1 wild type patients with S100B and EGFR high or low?

We appreciate the reviewer’s enthusiasm for applying our findings and envision the results being widely applied to current or future GBM data sets and in clinical research. However, bulk analysis techniques which aggregate signals from all cells may lack resolution needed to observe key findings, especially in tumors with large proportions of infiltrating immune cells. Currently, there are many fewer samples in TCGA (N=50) for which bulk proteomic data and outcome are available – notably, fewer than the number included in the validation set here (N=73).

4) The availability of RAPID as a tool to the community is important and a positive aspect of this work. Thus, I believe the source should have been made accessible prior to publication.

All data and code were available to reviewers during the prior review at https://flowrepository.org/id/RvFrKN2ctDJmmVNE4ZnMJrAZeVraXbwvrhjx3YaBZIV6nWIanMrbhrVBx7yvODtX. We have updated the manuscript in several places to clarify how to access data. RAPID code and cytometry files from the BCP-ALL example are publicly available on Github. The other FCS files from the glioblastoma study are available on FlowRepository using the above link.

[Editors’ note: what follows is the authors’ response to the second round of review.]

Revisions:

Please address the following points raised by the reviewers:

1) Clustering is performed across all tumours combined, so the authors should distinguish inter-tumour effects from the intra-tumour effects that they describe. Could they show for example that the cluster phenotypes are derivable from a significant number of individual tumours?

We appreciate this point and have highlighted two key results: 1) cell clusters were observed in multiple tumors and 2) multiple tumors contributed to cell clusters (the reviewer’s question). Thus, there were not specificity relationships between clusters and patients. In this analysis, there were not “private” clusters containing cells from only one tumor.

First, Row 36 of Supplementary file 2 indicates the number of tumors contributing at least 1% of that tumor’s cells to a cluster. For example, 14 tumors had more than 1% of their events assigned to GPP clusters and 13 tumors had more than 1% of events assigned to GNP clusters.

Second, a new Supplementary file 5 has been added to address the reviewer’s question directly by showing the percent of each cluster that was derived from each tumor. Row 32 indicates the number of tumors contributing at least 1% of that cluster’s cells. Notably, the observed clusters were all derived from multiple tumors (at least 4 and a median of 12 tumors contributed to clusters).

Inter-tumour effects would imply some different conclusions to the ones the authors draw, depending on the patient specificities of these effects. Assuming no patient specificity, one potentially interesting and even complementary conclusion would be that basal levels of certain proteins in a tumour are indicative of patient outcome (indiscriminately of any particular cellular subpopulation). Do the authors see that EGFR, S100B and other enriched proteins vary primarily within, or between, tumours?

EGFR and S100B, and many of the features measured, vary both within individual tumors and between patients. Notably, the single cell study design allows the reader to see that these variations are driven by distinct cell types in which proteins were co-expressed (or simultaneously absent). The comparison of the single cell cytometry and imaging makes it clear that this tracking of cell subsets revealed by multiple markers is more sensitive than tracking the markers as separate features, but that the signal from the cells is strong enough that the traditional unimodal analysis can provide a rough surrogate once the key features are known. The variation from patient to patient can also be seen in Supplementary file 6, which shows cell event distribution and “heat” for each patient and parameter on a common (all 28 patients) t-SNE plot, and in our tissue microarray analysis.

To check for patient specificity, the authors should state in the text i) which, if any, clusters were largely tumour-specific and ii) how many tumours contributed to each of the clusters. Without these figures, the biological relevance of each of the clusters (namely the NP and PP subsets) is difficult to assess.

We appreciate this point and have added Supplementary file 5 and highlighted Supplementary file 2, which together indicate no specificity relationships. We have added the following sentences to the manuscript to highlight this point (found in the Results section titled: Identification of risk stratifying glioblastoma cells in Dataset 1):

“The number of tumors that contributed to each cluster varied between the 43 clusters, but a median of 8 tumors contained cells in each cluster (Supplemental file 2, Supplementary file 6). Furthermore, each cluster contained cells from at least 4 tumors and, at the median, contained cells from 12 tumors (Supplemental file 5, Supplementary file 6).”

It might also be useful to add tSNE plots coloured by patient alongside the existing plots (Figures 1B, 2A, 4A). If there are indeed clusters dominated by individual tumours, I would think that they should be removed early on in the RAPID pipeline. If there are many such cases, then perhaps the authors would have to consider an alternative clustering approach.

The 28-page Supplementary file 6 was included to help address the point the reviewer is raising. This supplement shows the requested data in a per-patient view (one page per patient). Along with Supplemental file 2 and the added Supplemental file 5, we believe the reviewer’s concern is well addressed: specificity relationships between clusters and patients were not observed and clusters were not “private” to a specific tumor. Notably, coloring cells in the combined t-SNE by patient results in a view where the 131,880 cell dots outnumber the pixels and obscure each other, making it difficult to see where a given patient falls on the t-SNE axes.

2) Do any of the cell subsets identified (Figure 1B, C) reflect a proneural population? This is a known component of glioblastomas and has moreover been postulated to be associated with better prognosis because it is the dominant component in lower-grade gliomas (e.g. Verhaak et al., 2010). How do the authors reconcile this with their own findings? Or, if a proneural phenotype is lacking, could the authors comment on why that might be? One reason could be technical: simply that proneural markers were not sufficiently represented in the panel. If that is the case, the authors should acknowledge this in the text and take care not to overstate their results pertaining to the phenotypes of NP and PP subsets in GBM.

The proneural, mesenchymal, and classical subtypes are largely defined by DNA alterations or transcript expression, while this paper is focused on per-cell measurements at the protein level. We did include measurements of proteins which might be expected to be enriched in each of these subclasses, including PDGFRA and SOX2 (expected to be found in proneural tumors, as described by Verhaak et al., 2010). We observed many cell subpopulations with SOX2 protein expression, including several GNP subsets, but did not observe co-enrichment of PDGFRA protein in these cells. Within the Discussion, we explore potential reasons why protein-level data may reveal different features than studies of DNA or expressed RNA transcripts. It is also important to note that the referenced classification has been updated; when IDH-mutant tumors are excluded, as in our study, proneural tumors are similar to other subclasses in outcome (Wang et al., Cancer Cell 2017).

https://doi.org/10.7554/eLife.56879.sa2

Article and author information

Author details

  1. Nalin Leelatian

    1. Department of Cell and Developmental Biology, Vanderbilt University, Nashville, United States
    2. Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, United States
    3. Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, United States
    Contribution
    Conceptualization, Data curation, Formal analysis, Supervision, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    Contributed equally with
    Justine Sinnaeve
    Competing interests
    No competing interests declared
  2. Justine Sinnaeve

    1. Department of Cell and Developmental Biology, Vanderbilt University, Nashville, United States
    2. Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, United States
    Contribution
    Conceptualization, Data curation, Formal analysis, Supervision, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    Contributed equally with
    Nalin Leelatian
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-9303-7969
  3. Akshitkumar M Mistry

    1. Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, United States
    2. Department of Neurological Surgery, Vanderbilt University Medical Center, Nashville, United States
    Contribution
    Resources, Software, Formal analysis, Funding acquisition, Validation, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-7918-5153
  4. Sierra M Barone

    Department of Cell and Developmental Biology, Vanderbilt University, Nashville, United States
    Contribution
    Data curation, Software, Formal analysis, Validation, Visualization, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5944-750X
  5. Asa A Brockman

    1. Department of Cell and Developmental Biology, Vanderbilt University, Nashville, United States
    2. Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, United States
    Contribution
    Data curation, Software, Formal analysis, Validation
    Competing interests
    No competing interests declared
  6. Kirsten E Diggins

    1. Department of Cell and Developmental Biology, Vanderbilt University, Nashville, United States
    2. Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, United States
    Contribution
    Software
    Competing interests
    No competing interests declared
  7. Allison R Greenplate

    1. Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, United States
    2. Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, United States
    Contribution
    Software, Writing - review and editing
    Competing interests
    No competing interests declared
  8. Kyle D Weaver

    Department of Neurological Surgery, Vanderbilt University Medical Center, Nashville, United States
    Contribution
    Resources
    Competing interests
    No competing interests declared
  9. Reid C Thompson

    Department of Neurological Surgery, Vanderbilt University Medical Center, Nashville, United States
    Contribution
    Resources
    Competing interests
    No competing interests declared
  10. Lola B Chambless

    Department of Neurological Surgery, Vanderbilt University Medical Center, Nashville, United States
    Contribution
    Resources
    Competing interests
    No competing interests declared
  11. Bret C Mobley

    Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, United States
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Validation, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  12. Rebecca A Ihrie

    1. Department of Cell and Developmental Biology, Vanderbilt University, Nashville, United States
    2. Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, United States
    3. Department of Neurological Surgery, Vanderbilt University Medical Center, Nashville, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Methodology, Project administration, Writing - review and editing
    For correspondence
    rebecca.ihrie@vanderbilt.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0439-0141
  13. Jonathan M Irish

    1. Department of Cell and Developmental Biology, Vanderbilt University, Nashville, United States
    2. Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, United States
    3. Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, United States
    Contribution
    Conceptualization, Resources, Software, Supervision, Funding acquisition, Methodology, Project administration, Writing - review and editing
    For correspondence
    jonathan.irish@vanderbilt.edu
    Competing interests
    was a co-founder and a board member of Cytobank Inc and received research support from Incyte Corp, Janssen, and Pharmacyclics
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-9428-8866

Funding

National Institutes of Health (R00 CA143231)

  • Jonathan M Irish

Vanderbilt Ingram Cancer Center (P30 CA68485)

  • Jonathan M Irish

Vanderbilt University (International Scholars Program)

  • Nalin Leelatian

Vanderbilt University (Discovery Grant)

  • Nalin Leelatian
  • Jonathan M Irish

Alpha Omega Alpha Honor Medical Society (Postgraduate Award)

  • Akshitkumar M Mistry

Society of Neurological Surgeons (RUNN Award)

  • Akshitkumar M Mistry

National Institutes of Health (F32 CA224962-01)

  • Akshitkumar M Mistry

Burroughs Wellcome Fund (1018894)

  • Akshitkumar M Mistry

National Institutes of Health (T32 HD007502)

  • Justine Sinnaeve

National Institutes of Health (F31 CA199993)

  • Allison R Greenplate

National Institutes of Health (R25 CA136440-04)

  • Kirsten E Diggins

Vanderbilt Ingram Cancer Center (Provocative Question)

  • Jonathan M Irish

National Institutes of Health (R01 CA226833)

  • Jonathan M Irish

National Institutes of Health (U54 CA217450)

  • Jonathan M Irish

National Institutes of Health (U01 AI125056)

  • Sierra M Barone
  • Jonathan M Irish

National Institutes of Health (R01 NS096238)

  • Rebecca A Ihrie

U.S. Department of Defense (W81XWH-16-1-0171)

  • Rebecca A Ihrie

Michael David Greene Brain Cancer Fund

  • Rebecca A Ihrie
  • Jonathan M Irish

Vanderbilt Institute for Clinical and Translational Research (VR51342)

  • Bret C Mobley
  • Rebecca A Ihrie

Vanderbilt Ingram Cancer Center (Ambassadors Award)

  • Rebecca A Ihrie
  • Jonathan M Irish

Southeastern Brain Tumor Foundation

  • Rebecca A Ihrie
  • Jonathan M Irish

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank the Irish and Ihrie labs at Vanderbilt University for helpful discussions.

Research was supported by the following funding resources: NIH/NCI R00 CA143231 (JMI), the Vanderbilt-Ingram Cancer Center (VICC, P30 CA68485), the Vanderbilt International Scholars Program (NL), a Vanderbilt University Discovery Grant (JMI and NL), Alpha Omega Alpha Postgraduate Award (AMM), Society of Neurological Surgeons/RUNN Award (AMM), F32 CA224962-01 (AMM), 2018 Burroughs Wellcome Fund Physician-Scientist Institutional Award 1018894 (AMM), T32 HD007502 (JS), F31 CA199993 (ARG), R25 CA136440-04 (KED), a VICC Provocative Question award (JMI), R01 CA226833 (JMI), U54 CA217450 (JMI), U01 AI125056 (JMI and SMB.), R01 NS096238 (RAI), DOD W81XWH-16-1-0171 (RAI), the Michael David Greene Brain Cancer Fund (RAI), the Vanderbilt Institute for Clinical and Translational Research (VR51342, RAI, BCM), VICC Ambassadors awards (JMI and RAI), and the Southeastern Brain Tumor Foundation (JMI and RAI).

Senior Editor

  1. Philip A Cole, Harvard Medical School, United States

Reviewing Editor

  1. C Daniela Robles-Espinoza, International Laboratory for Human Genome Research, Mexico

Reviewers

  1. C Daniela Robles-Espinoza, International Laboratory for Human Genome Research, Mexico
  2. Julie Laffy, Weizmann Institute of Science, Israel

Publication history

  1. Received: March 12, 2020
  2. Accepted: June 4, 2020
  3. Accepted Manuscript published: June 23, 2020 (version 1)
  4. Version of Record published: July 7, 2020 (version 2)

Copyright

© 2020, Leelatian et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 912
    Page views
  • 168
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Computational and Systems Biology
    2. Structural Biology and Molecular Biophysics
    Lukas S Stelzl et al.
    Research Article Updated
    1. Computational and Systems Biology
    2. Human Biology and Medicine
    Praveen Anand et al.
    Short Report Updated