Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data

  1. Julien Racle
  2. Kaat de Jonge
  3. Petra Baumgaertner
  4. Daniel E Speiser
  5. David Gfeller  Is a corresponding author
  1. University of Lausanne, Switzerland
  2. Swiss Institute of Bioinformatics, Switzerland
  3. Lausanne University Hospital (CHUV), Switzerland

Abstract

Immune cells infiltrating tumors can have important impact on tumor progression and response to therapy. We present an efficient algorithm to simultaneously estimate the fraction of cancer and immune cell types from bulk tumor gene expression data. Our method integrates novel gene expression profiles from each major non-malignant cell type found in tumors, renormalization based on cell-type-specific mRNA content, and the ability to consider uncharacterized and possibly highly variable cell types. Feasibility is demonstrated by validation with flow cytometry, immunohistochemistry and single-cell RNA-Seq analyses of human melanoma and colorectal tumor specimens. Altogether, our work not only improves accuracy but also broadens the scope of absolute cell fraction predictions from tumor gene expression data, and provides a unique novel experimental benchmark for immunogenomics analyses in cancer research (http://epic.gfellerlab.org).

https://doi.org/10.7554/eLife.26476.001

eLife digest

Malignant tumors do not only contain cancer cells. Normal cells from the body also infiltrate tumors. These often include a variety of immune cells that can help detect and kill cancer cells. Many evidences suggest that the proportion of different immune cell types in a tumor can affect tumor growth and which treatments are effective.

Researchers often study tumors by measuring the expression of genes, i.e., which genes are active in tumors. However, the proportion of different cell types in the tumor is often not measured for tumors studied at the gene expression level.

Racle et al. have now demonstrated that a new computer-based tool can accurately detect all the main cell types in a tumor directly from the expression of genes in this tumor. The tool is called “Estimating the Proportion of Immune and Cancer cells” – or EPIC for short. It compares the level of expression of genes in a tumor with a library of the gene expression profiles from specific cell types that can be found in tumors and uses this information to predict how many of each type of cell are present. Experimental measurements of several human tumors confirmed that EPIC’s predictions are accurate.

EPIC is freely available online. Since the active genes in tumors from many patients have already been documented together with clinical data, researchers could use EPIC to investigate whether the cell types in a tumor affect how harmful it is or how well a particular treatment works on it. In the future, this information could help to identify the best treatment for a particular patient and may reveal new genes that cause malignant tumors to develop and grow.

https://doi.org/10.7554/eLife.26476.002

Introduction

Tumors form complex microenvironments composed of various cell types such as cancer, immune, stromal and endothelial cells (Hanahan and Weinberg, 2011; Joyce and Fearon, 2015). Immune cells infiltrating the tumor microenvironment play a major role in shaping tumor progression, response to (immuno-)therapy and patient survival (Fridman et al., 2012). Today, gene expression analysis is widely used to characterize tumors at the molecular level. As a consequence, tumor gene expression profiles from tens of thousands of patients are available across all major tumor types in databases such as Gene Expression Omnibus (GEO [Edgar et al., 2002]) or The Cancer Genome Atlas (TCGA [Hoadley et al., 2014]). Unfortunately, flow cytometry or immunohistochemistry (IHC) measurements to quantify the number of both malignant and tumor-infiltrating immune cells are rarely performed for samples analyzed at the gene expression level. Therefore, to correctly interpret these data in particular from an immuno-oncology point of view (Angelova et al., 2015; Gentles et al., 2015; Hackl et al., 2016; Li et al., 2016; Linsley et al., 2015; Rooney et al., 2015; Şenbabaoğlu et al., 2016; Zheng et al., 2017), reliable and carefully validated bioinformatics tools are required to infer the fraction of cancer and immune cell types from bulk tumor gene expression data.

To this end, diverse bioinformatics methods have been developed. Some aim at estimating tumor purity based on copy number variation (Carter et al., 2012; Li and Li, 2014), or expression data (Ahn et al., 2013; Clarke et al., 2010; Quon et al., 2013; Yoshihara et al., 2013), but do not provide information about the different immune cell types. Others focus on predicting the relative proportions of cell types by fitting reference gene expression profiles from sorted cells (Gong and Szustakowski, 2013; Li et al., 2016; Newman et al., 2015; Qiao et al., 2012) or with the help of gene signatures (Becht et al., 2016; Zhong et al., 2013). These approaches have been recently applied to cancer genomics data to investigate the influence of immune infiltrates on survival or response to therapy (Charoentong et al., 2017; Gentles et al., 2015; Şenbabaoğlu et al., 2016) or predict potential targets for cancer immunotherapy (Angelova et al., 2015; Li et al., 2016). However, none of these methods provides quantitative information about both cancer and non-malignant cell type proportions directly from tumor gene expression profiles. In addition, reference gene expression profiles used in previous studies have been mainly obtained from circulating immune cells sorted from peripheral blood and were generally based on microarrays technology. Finally, several of these approaches have not been experimentally validated in solid tumors from human patients.

Here, we developed a robust approach to simultaneously Estimate the Proportion of Immune and Cancer cells (EPIC) from bulk tumor gene expression data. EPIC is based on a unique collection of RNA-Seq reference gene expression profiles from either circulating immune cells or tumor- infiltrating non-malignant cell types (i.e., immune, stromal and endothelial cells). To account for the high variability of cancer cells across patients and tissue of origin, we implemented in our algorithm the ability to consider uncharacterized, possibly highly variable, cell types. To validate our predictions in human solid tumors, we first analyzed melanoma samples with both flow cytometry and RNA-Seq. We then collected publicly available IHC and single-cell RNA-Seq data of colorectal and melanoma tumors. All three validation datasets showed that very accurate predictions of both cancer and non-malignant cell type proportions could be obtained even in the absence of a priori information about cancer cells.

Results

Reference gene expression profiles from circulating and tumor-infiltrating cells

EPIC incorporates reference gene expression profiles from each major immune and other non-malignant cell type to model bulk RNA-Seq data as a superposition of these reference profiles (Figure 1A,B). To tailor our predictions to recent gene expression studies, we first collected and curated RNA-Seq profiles of various human innate and adaptive circulating immune cell types (Hoek et al., 2015; Linsley et al., 2014; Pabst et al., 2016) (CD4 T cells, CD8 T cells, B cells, NK cells, Monocytes and Neutrophils) from a diverse set of patients analyzed in different centers (see Materials and methods). Principal component analysis (PCA) of these data (Figure 1C) showed that samples clustered first according to cell type and not according to experiment of origin, patient age, disease status or other factors, suggesting that they could be used as bona fide reference expression profiles across different patients. Reference gene expression profiles for each major immune cell type were built from these RNA-Seq samples based on the median normalized counts per gene and cell type. The variability in expression for each gene was also considered when predicting the various cell proportions based on these reference profiles (see Materials and methods and Supplementary file 1).

Figure 1 with 2 supplements see all
Estimating the proportion of immune and cancer cells.

(A) Schematic description of our method. (B) Matrix formulation of our algorithm, including the uncharacterized cell types (red box) with no or very low expression of signature genes (green box). (C) Low dimensionality representation (PCA based on the 1000 most variable genes) of the samples used to build the reference gene expression profiles from circulating immune cells (study 1 [Hoek et al., 2015], study 2 [Linsley et al., 2014], study 3 [Pabst et al., 2016]). (D) Low dimensionality representation (PCA based on the 1000 most variable genes) of the tumor- infiltrating cell gene expression profiles from different patients. Each point corresponds to cell-type average per patient of the single-cell RNA-Seq data of Tirosh et al. (2016) (requiring at least 3 cells of a given cell type per patient). Only samples from primary tumors and non-lymphoid tissue metastases were considered. Projection of the original single-cell RNA-Seq data can be found in Figure 1—figure supplement 1.

https://doi.org/10.7554/eLife.26476.003

Immune cells differ in their gene expression profiles depending on their state and site of origin (e.g., blood or tumors) (Ganesan et al., 2017; Speiser et al., 2016; Zheng et al., 2017). To study the potential effect of these differences on our predictions, we established reference gene expression profiles of each major tumor-infiltrating immune cell type (i.e., CD4 T, CD8 T, B, NK, macrophages). We further derived reference profiles for stromal cells (i.e. cancer-associated fibroblasts (CAFs)) and endothelial cells. These reference gene expression profiles were obtained as cell type averages from the single-cell RNA-Seq data of melanoma patients from Tirosh and colleagues (Tirosh et al., 2016), considering only samples from primary tumor and non-lymphoid tissue metastasis (see Materials and methods and Supplementary file 2). As for circulating immune cell data, principal component analysis of the tumor-infiltrating cells’ gene expression profiles showed that samples clustered first according to cell type (Figure 1D and Figure 1—figure supplement 1, see also results in [Tirosh et al., 2016]).

Cancer and non-malignant cell fraction predictions

Reference gene expression profiles from each of the immune and other non-malignant (i.e., stromal and endothelial) cell types were then used to model bulk gene expression data as a linear combination of m different cell types (Figure 1B). To include cell types like cancer cells that show high variability across patients and tissues of origin, we further implemented in our algorithm the ability to consider an uncharacterized cell population. Mathematically this was done by taking advantage of the presence of gene markers of non-malignant cells that are not expressed in cancer cells. Importantly, we do not require our signature genes to be expressed in exactly one cell type, but only to show very low expression in cancer cells. The mRNA proportion of each immune and other non-malignant cell type was inferred using least-square regression, solving first our system of equations for the marker genes (green box in Figure 1B, see Materials and methods). The fraction of cancer cells was then determined as one minus the fraction of all non-malignant cell types. Cell markers used in this work were determined by differential expression analysis based on our reference cell gene expression profiles as well as gene expression data from non-hematopoietic tissues (see Materials and methods and Appendix 1—table 1). Finally, to account for different amounts of mRNA in different cell types and enable meaningful comparison with flow cytometry and IHC data, we measured the mRNA content of all major immune cell types as well as of cancer cells (Figure 1—figure supplement 2) and used these values to renormalize our predicted mRNA proportions (see Materials and methods).

Validation in blood

We first tested our algorithm using three datasets comprising bulk RNA-Seq data from PBMC (Hoek et al., 2015; Zimmermann et al., 2016) or whole blood (Linsley et al., 2014), as well as the corresponding proportions of immune cell types determined by flow cytometry (Figure 2A). These data were collected from various cancer-free human donors (see Materials and methods). Overall, very accurate predictions were obtained by fitting reference profiles from circulating immune cells, considering either all cell types together (Figure 2A) or each cell type separately (Figure 2—figure supplement 1). When comparing with other widely used cell fraction prediction methods (Becht et al., 2016; Gong and Szustakowski, 2013; Li et al., 2016; Newman et al., 2015; Quon et al., 2013; Zhong et al., 2013), we observed a clear improvement (Figure 2B and Figure 2—figure supplement 1). Of note, the very high correlation values can partly result from the broad range of different cell fractions in our data (Figure 2A) and we emphasize that these correlation values should only be used to compare methods tested on the same datasets (Figure 2B). The root mean squared error (RMSE), which is less dependent on such ‘outlier data points’, shows also improved accuracy for EPIC compared to other methods (Figure 2—figure supplement 1B).

Figure 2 with 4 supplements see all
Predicting cell fractions in blood samples.

(A) Predicted vs. measured immune cell proportions in PBMC (dataset 1 (Zimmermann et al., 2016), dataset 2 (Hoek et al., 2015)) and whole blood (dataset 3 (Linsley et al., 2014)); predictions are based on the reference profiles from circulating immune cells. (B) Performance comparison with other methods. Significant correlations are indicated above each bar (*p<0.05; **p<0.01; ***p<0.001). (C) Predicted immune cells' mRNA proportions (i.e., without mRNA renormalization step) vs. measured values in the same datasets. Correlations are based on Pearson correlation; RMSE: root mean squared error. Proportions of cells observed experimentally are given in Supplementary file 3B-D.

https://doi.org/10.7554/eLife.26476.006

The renormalization by mRNA content, which had not been considered in previous approaches, appeared to be important for predicting actual cell fractions (Figure 2C). Moreover, we observed that most other methods could also benefit from such a renormalization step, be it for the global prediction of all cell types together or for the predictions of each cell type (Figure 2—figure supplement 2). A conceptually related approach was developed by Baron et al. (2016), but the rescaling was done on the gene expression reference profiles (in their case of pancreatic cells) based on the total number of transcripts per cell type and the prediction of the proportion of pancreatic cells from a bulk sample was carried out with CIBERSORT (Newman et al., 2015). For comparison purpose, we implemented such an a priori renormalization step in EPIC as well, and observed similar results than with the a posteriori renormalization (Figure 2—figure supplement 3).

With respect to the other methods, EPIC has two other main distinctive features: (i) it allows for a cell type without a known reference gene expression profile (such a feature is also part of ISOpure [Quon et al., 2013]) and (ii) it integrates information about the variability in each signature gene from the reference gene expression profiles. The latter point slightly improves the prediction accuracy but to a lesser extent than the renormalization by mRNA (Figure 2—figure supplement 3). The former point cannot be tested directly in the blood samples as only cell types with known reference gene expression profiles are composing these samples. Therefore, to test the effect of including a cell type without a known reference profile, we removed all the T cell subsets from the reference gene expression profiles and predicted the proportion of the other cell types in the bulk samples, allowing for one uncharacterized cell type in EPIC. As expected, the results from EPIC including or not the T cell reference profiles remained nearly unchanged, while the other methods suffered from this, except for DSA (Zhong et al., 2013) which is based only on signature genes and not reference profiles (Figure 2—figure supplement 4). Such an advantage of EPIC is especially useful in the context of tumor samples where in general no reference gene expression profile is available for the cancer cells.

Validation in solid tumors

To validate our predictions in tumors, we collected single cell suspensions from lymph nodes of four metastatic melanoma patients (see Materials and methods). A fraction of the cell suspension was used to measure the different cell type proportions with flow cytometry (CD4 T, CD8 T, B, NK, melanoma and other cells comprising mostly stromal and endothelial cells; Supplementary file 3A), and the other fraction was used for bulk RNA sequencing (Figure 3—figure supplement 1). EPIC was first run with reference profiles from circulating immune cells. We observed a remarkable agreement between our predictions and experimentally determined cell fractions (Figure 3A). The high correlation value is possibly driven by the two samples containing about 80% of melanoma cells, but we stress that all predicted cell proportions fall nearly on the ‘y = x’ line. This clearly indicates that the absolute cell fractions were correctly predicted for all cell types, as confirmed by a low RMSE. Of note, the proportion of melanoma cells could be very accurately predicted even in the absence of a priori information about their gene expression.

Figure 3 with 1 supplement see all
Predicting cell fractions in solid tumors with reference profiles from circulating cells.

(A) Comparison of EPIC predictions with our flow cytometry data of lymph nodes from metastatic melanoma patients. (B) Comparison with immunohistochemistry data from colon cancer primary tumors (Becht et al., 2016). (C) Comparison with single-cell RNA-Seq data (Tirosh et al., 2016) from melanoma samples either from lymphoid tissues or primary and non-lymphoid metastatic tumors. Correlations are based on Pearson correlation. Proportions of cells observed experimentally are given in Supplementary file 3A,E.

https://doi.org/10.7554/eLife.26476.011

As a second validation, we compared EPIC predictions with IHC data from colon cancer (Becht et al., 2016) (see Materials and methods). Although a limited number of immune cell types had been assayed in this study, we observed a significant correlation between cell proportions measured by IHC and our predictions, except for the CD8 T cells (Figure 3B).

As a third validation, we used recent single-cell RNA-Seq data from 19 melanoma samples (Tirosh et al., 2016). We applied EPIC on the average expression profile over all single cells for each patient and compared the results with the actual cell fractions (see Materials and methods). Here again, our predictions were consistent with the observed cell fractions, even for melanoma cells for which we did not assume any reference gene expression profile (Figure 3C). Notably, the predicted cell fractions from melanoma cells as well as from all other immune cell types fall nearly on the y = x line, showing that not only the relative cell type proportions could be predicted but also the absolute proportions for all cell types.

We next compared these predictions to those obtained with reference profiles from tumor- infiltrating cells, including also CAFs and endothelial cells (Figure 4). For the single-cell RNA-Seq data (Figure 4C), we applied a leave-one-out procedure, to avoid using the same samples both to build the reference profiles and the bulk RNA-Seq data used as input for the predictions (see Materials and methods). Overall, predictions did not change much compared to those based on circulating immune cell reference gene expression profiles (Figure 4), except for the IHC data where the predictions for CD8 T cells and macrophages clearly improved (Figure 4B). Moreover, we could observe some differences between the results obtained from circulating immune cell reference gene expression profiles and those from tumor-infiltrating cell reference gene expression profiles, when considering the proportions from each cell type independently (Figures 34 and Figure 4—figure supplement 1): (i) predictions for CD8 T cells and macrophages improved in the datasets of primary tumors and non-lymph node metastases but not in the datasets from lymph node metastases; (ii) predictions for B and NK cells displayed similar accuracy based on the circulating cells or tumor-infiltrating cells profiles for all datasets; and (iii) CAFs and endothelial cells were not available for the blood-based reference profiles but the proportion of these cells could be well predicted based on the reference profiles constructed from tumor-infiltrating cells (Figure 4 and Figure 4—figure supplement 1).

Figure 4 with 1 supplement see all
Predictions with reference profiles from tumor-infiltrating cells.

Same as Figure 3 but based on reference profiles built from the single-cell RNA-Seq data of primary tumor and non-lymphoid metastatic melanoma samples from Tirosh et al. (2016). (A) Comparison with flow cytometry data of lymph nodes from metastatic melanoma patients. (B) Comparison with IHC from colon cancer primary tumors (Becht et al., 2016). (C) Comparison with single-cell RNA-Seq data (Tirosh et al., 2016). For primary tumor and non-lymphoid metastasis samples, a leave-one-out procedure was used (see Materials and methods). Proportions of cells observed experimentally are given in Supplementary file 3A,E.

https://doi.org/10.7554/eLife.26476.013

Benchmarking of other methods

We took advantage of our unique collection of independent validation datasets to benchmark other methods for predictions of immune cell type fractions in human tumors. We first compared the results of EPIC and ISOpure (Quon et al., 2013), which is the only other method that can consider uncharacterized cell types and therefore predict the fraction of cancer and immune cell types based only on RNA-seq data. EPIC displayed improved accuracy in all three datasets (Figure 5A, and Figure 5—figure supplements 14). To benchmark other methods, we then restricted our analysis to the predictions of the different immune cell types (Figure 5B and Figure 5—figure supplements 14). Predictions from EPIC were in general more accurate, especially when considering all cell types together. Nevertheless, when restricting the comparisons to relative cell type proportions, some methods like MCPcounter (Becht et al., 2016) and TIMER (Li et al., 2016) were quite consistent in their predictions across the various datasets and showed similar accuracy as EPIC for the cell types considered in these methods (Figure 5—figure supplements 14). Of note, MCPcounter could not be included in the global prediction comparison as this method returns scores that are not comparable between different cell types. Predictions from DSA (Zhong et al., 2013) were also quite accurate when available, but in multiple cases some cell type proportions returned by the method were simply equal to 0 in all samples (Figure 5—figure supplements 14 – correlation values were replaced by NA in these cases).

Figure 5 with 6 supplements see all
Performance comparison with other methods in tumor samples.

(A) Pearson correlation R-values between the cell proportions predicted by EPIC and ISOpure and the observed proportions measured by flow cytometry or single-cell RNA-Seq (Tirosh et al., 2016), considering all cell types together (i.e., B, CAFs, CD4 T, CD8 T, endothelial, NK, macrophages and cancer cells). (B) Same analysis as in Figure 5A but considering only immune cell types (i.e., B, CD4 T, CD8 T, NK and macrophages) in order to include more methods in the comparison. (C) Analysis of ESTIMATE predictions in the single-cell RNA-Seq dataset for the sum of all immune cells, the proportion of stromal cells (cancer-associated fibroblasts) and the proportion of cancer cells (cells identified as melanoma cells in Tirosh et al.). (D) Same as Figure 5C but for EPIC predictions of immune, stromal and cancer cells. Significant correlations in (A–B) are indicated above each bar (*p<0.05; **p<0.01; ***p<0.001).

https://doi.org/10.7554/eLife.26476.015

Immune, stromal and tumor purity scores (Yoshihara et al., 2013) based on gene set enrichment analysis were also correlated with the total fraction of immune cells, stromal cells and the fraction of cancer cells (correlation was not significant for the tumor purity score – Figure 5C and Figure 5—figure supplement 5). However, these correlations were considerably lower than those obtained with our approach (Figure 5D and Figure 5—figure supplement 5). Moreover, such scores are less quantitative and are thus more difficult to interpret with respect to actual cell type proportions.

Next, we performed some explorative analysis to test if such cell fraction prediction methods could go further into the details of the T cell subsets. Based on the single-cell RNA-seq data from Tirosh and colleagues (Tirosh et al., 2016), we identified CD4+ Treg and Thelper cells (see Materials and methods), and built reference gene expression profiles for these cell types as we had done for the other cell types. As before, a leave-1-out procedure at the patient level was used to predict the proportions for these cell types based on the bulk samples constructed from this single-cell RNA-seq data. We observed significant correlations, but the R-values are lower than before (Figure 5—figure supplement 6), suggesting that we may be reaching the limits of cell fraction prediction methods for cell types that show lower abundance and substantial transcriptional similarity. Of note, CIBERSORT using either the LM22 signature or our newly derived signature did not perform better than EPIC.

Discussion

By combining RNA-Seq profiles of all major immune and other non-malignant cell types established from both circulating and tumor-infiltrating cells together with information about cell morphology (i.e., mRNA content) and algorithmic developments to consider uncharacterized and possibly highly variable cell types, EPIC overcomes several limitations of previous approaches to predict the fraction of both cancer and immune or other non-malignant cell types from bulk tumor gene expression data. From an algorithmic point of view, EPIC takes advantage of the fact that cancer cells, in general, express no or only low levels of immune and stromal markers. Therefore the method can be broadly applied to most solid tumors, as confirmed by our validation in melanoma and colorectal samples, but it will not be suitable for hematological malignancies like leukemia or lymphoma.

The accuracy of the predictions for some cell types might be sensitive to the origin or condition of the immune cells used for establishing reference profiles. For instance, we observed that CD8 T cells and macrophages from primary tumors and non-lymph node metastases samples were best predicted using the reference profiles from tumor-infiltrating cells. This may be explained by the fact that the reference profiles from circulating cells corresponded to resting CD8 T cells and monocytes as only few activated C8 T cells and no macrophages are circulating in blood.

Overall, our results suggest that for primary tumors or non-lymphoid tissue metastases reference gene expression profiles from tumor-infiltrating immune cells are more appropriate, while for lymph node metastases, profiles from either circulating or tumor-infiltrating immune cells could be used.

One known limitation of cell fraction predictions arises when some cell types are present at very low frequency (Shen-Orr and Gaujoux, 2013). Our results suggest that predictions of cell proportions are reliable within an absolute error of about 8%, as estimated by the root mean squared error (Figure 3 and Figure 4—figure supplement 1B). These estimates are consistent with the lower detection limit proposed by other groups (Becht et al., 2016; Zhong et al., 2013) and may explain why the relative proportions of NK cells, which are present at lower frequency in melanoma tumors (Balch et al., 1990; Sconocchia et al., 2012), could not be predicted with accuracy comparable to other cell types (Figure 4—figure supplement 1). While this may prevent applications of cell fraction predictions in some tumor types that are poorly infiltrated, many other tumors, like melanoma or colorectal cancer, display high level of infiltrating immune cells and the role of immune infiltrations on tumor progression and survival appears to be especially important in these tumors (Clemente et al., 1996; Fridman et al., 2012; Galon et al., 2006).

Another limitation of cell fraction predictions arises when considering cell types that show high transcriptional similarity. For instance, Treg in Figure 5—figure supplement 6 could not be well predicted neither by EPIC nor CIBERSORT. This can be understood because the gene expression profiles of Thelper and Treg are highly similar with only a few genes expressed differently between the two, making it harder for cell fraction prediction methods to work accurately. In addition, Treg were present at a proportion lower than 10% in all samples (Supplementary file 3E). For such cases, gene set enrichment approaches, although less quantitative, may be more convenient, possibly combining them with the predictions obtained by EPIC for the main immune cell types.

Our predictions for the fraction of uncharacterized cells may include non-malignant cells, such as epithelial cells from neighboring tissues in addition to cancer cells. Compared to recent algorithms that first predict tumor purity based on exome sequencing data, and later infer the relative fraction of immune cell types (Li et al., 2016), the predictions of EPIC for immune and stromal cells are likely more quantitative because they can implicitly consider the presence of not only cancer cells but also other non-malignant cells for which reference profiles are not available. Moreover, EPIC does not require both exome and RNA-Seq data from the same tumor samples, thereby reducing the cost and amount of experimental work for prospective studies, and broadening the scope of retrospective analyses of cancer genomics data to studies that only include gene expression data.

Recent technical developments in single-cell RNA-Seq technology enable researchers to directly access information about both the proportion of all cell types together with their gene expression characteristics (Carmona et al., 2017; Efroni et al., 2015; Jaitin et al., 2014; Singer et al., 2016; Stegle et al., 2015; Tirosh et al., 2016). Such data are much richer than anything that can be obtained with computational deconvolution of bulk gene expression profiles and this technology may eventually replace standard gene expression analysis of bulk tumors. Nevertheless, it is important to realize that, even when disregarding the financial aspects, single-cell RNA-Seq of human tumors is still logistically and technically very challenging due to high level of cell death upon sample manipulation (especially freezing and thawing) and high transcript dropout rates (Finak et al., 2015; Saliba et al., 2014; Stegle et al., 2015). Moreover, one cannot exclude that some cells may better survive the processing with microfluidics devices used in some single-cell RNA-Seq platforms, thereby biasing the estimates of cell type proportions. It is therefore likely that bulk tumor gene expression analysis will remain widely used for several years. Our work shows how we can exploit recent single-cell RNA-Seq data of tumors obtained from a few patients to refine cell fraction predictions in other patients that could not be analyzed with this technology, thereby overcoming some limitations of previous computational approaches that relied only on reference gene expression profiles from circulating immune cells.

In this work, we provide a detailed and biologically relevant validation of our predictions using actual tumor samples from human patients analyzed with flow cytometry, IHC and single-cell RNA-Seq. We note that the slightly lower agreement between our predictions and IHC data may be partly explained by the fact that the exact same samples could not be used for both gene expression and IHC analyses because of the incompatibility between the two techniques. Nevertheless, the overall high accuracy of our predictions indicates that infiltrations of major immune cell types can be quantitatively studied directly from bulk tumor gene expression data using computational approaches such as EPIC.

EPIC can be downloaded as a standalone R package (https://github.com/GfellerLab/EPIC/releases/tag/v1.1) (Racle, 2017) and can be used with reference gene expression profiles pre-compiled from circulating or tumor-infiltrating cells, or provided by the user. EPIC is also available as a web application (http://epic.gfellerlab.org) where users can upload bulk samples gene expression data and perform the full analysis.

Materials and methods

Code availability

Request a detailed protocol

EPIC has been written as an R package. It is freely available on GitHub (https://github.com/GfellerLab/EPIC; copy archived on https://github.com/elifesciences-publications/EPIC) for academic non-commercial research purposes. Version v1.1 of the package was used for our analyses.

In addition to the R package, EPIC is available as a web application at the address: http://epic.gfellerlab.org.

Prediction of cancer and immune cell type proportions

Request a detailed protocol

In EPIC, the gene expression of a bulk sample is modeled as the sum of the gene expression profiles from the pure cell types composing this sample (Figure 1A,B). This can be written as (Venet et al., 2001):

(1) b=C × p

Where b is the vector of all n genes expressed from the bulk sample to deconvolve; C is a matrix (n x m) of the m gene expression profiles from the different cell types; and p is a vector of the proportions from the m cell types in the given sample (Figure 1B).

Matrix C consists of m-1 columns corresponding to various reference non-malignant cell types whose gene expression profiles are known, and a last column corresponding to uncharacterized cells (i.e. mostly cancer cells, but possibly also other non-malignant cell types not included in the reference profiles). EPIC assumes the reference gene expression profiles from the non-malignant cell types are well conserved between patients. Such a hypothesis is supported by the analysis in Figure 1C,D. The uncharacterized cells can be more heterogeneous between patients and EPIC makes no assumption on them.

EPIC works with RNA-seq data, which is implicitly normalized. We use data normalized into transcripts per million (TPM) because it has some properties needed for the full cell proportion prediction to work, as will be shown in the next paragraphs. Therefore, instead of the raw data from Equation (1), we are working with TPM-normalized data, which correspond to the following:

(2) bi¯=106k=1nbklkbiliCij¯=106k=1nCkjlkCijli

Where b¯ and C¯ are the TPM-normalized bulk sample and reference cell gene expression profiles respectively and li is the length of gene i.

Using these, Equation (1) is rewritten to:

(3) b¯=C¯×p¯

where

(4) pj¯=k=1nCkjlki=1nbilipj

This normalization guarantees the sum of the new proportions, p-, is equal to 1:

i=1nb¯i=from eq. 2 i(106kbklkbili)=106

and

i=1nb¯i=from eq. 3 i(C¯×p¯)i=i=1nj=1mCij¯p¯j=from eq. 2j[(106kCkjlkiCijli)p¯j]=106jp¯j
(5) j=1mpj¯=1

In addition to p¯ and C¯ we also define p¯ and C¯, which are the same except that they include the normalized proportions and profiles from the reference cell types only (i.e. they have one less element and one less column than p¯ and C¯ respectively).

Using these normalized quantities, EPIC then solves Equation (3) for a subset of ns equations corresponding to the ns signature genes (S) that are expressed by one or more of the reference cell types but only expressed at a negligible level in the uncharacterized cells (Figure 1B). Previous computational work (Clarke et al., 2010; Gosink et al., 2007) showed that the proportion from uncharacterized cells in bulk samples could indeed be inferred with help of genes not expressed by the uncharacterized cells. Importantly, cell-specific signature genes are well established and widely used in flow cytometry to sort immune cells. Thus, EPIC solves the following system of equations:

(6) bi¯|iS=((C¯×p¯)i+C¯imp¯m)|iS=(C¯×p¯)i|iS

where the term corresponding to the uncharacterized cells (m) vanished thanks to the definition of the signature genes (C¯im|iS0).

The solution to Equation (6) can be estimated by a constrained least square optimization. EPIC takes advantage of the known variability for each gene in the reference profile, to give weights in the function to minimize:

(7) fp¯*=iSwib¯i-C¯*×p¯*i2

with constraints

{p¯j0j=1m1p¯j1

Here, the weights, wi, give more importance to the signature genes with low variability in the reference gene expression profiles. In EPIC, these weights are given by:

wi=minui, 100medianui

where

ui=j=1m-1C¯ij*V¯ij+ε

V- is the matrix of the TPM-based variability of each gene for each of the reference cells (Supplementary files 12), ε is a small number to avoid division by 0, and the term '100medianui' is used to avoid giving too much weight to few of the genes.

Finally, thanks to Equation (5), the proportion for the uncharacterized cells can be obtained by:

(8) p¯m=1j=1m1p¯j

Since we used normalized gene expression data, values of p¯ correspond actually to the fraction of mRNA coming from each cell type, rather than the cell proportions. As the mRNA content per cell type can vary significantly (Figure 1—figure supplement 2), the actual proportions of each cell type can be estimated as:

(9) pj=αp-jrj

where rj is the amount of RNA nucleotides in cell type j (or equivalently the total weight of RNA in each cell type) and α is a normalization constant to have pj=1.

Another method, DeMix (Ahn et al., 2013), starts from Equation (1) to estimate the proportion and gene expression profile from cancer cells in mixed samples. In this method the mixed sample is assumed to be composed only by two cell types: cancer cells (without any a priori known gene expression profile) and normal cells (with known gene expression data, which can either come from tumor-matched or unmatched samples). This method was developed for microarray data and shows that it was important to use the raw data as input assuming it follows a log-normal distribution as is the case for microarray, instead of working with log-transformed data like most other methods did. DeMix estimates the variance of the gene expression in the normal samples and uses this in the maximum likelihood estimation to predict the cancer cell gene expression and proportions, using thus implicitly a gene-specific weight for each gene. EPIC was derived for RNA-seq data and is directly using linear (non-log) data. Notably, when solving the linear regression from Equation (7) in EPIC, we are not assuming any specific distribution for the gene expression and the weights we give to each gene are simply based on their interquartile range in the reference samples. Other measures of gene variability could however be given as input into EPIC's R-package. Contrary to DeMix, another important assumption performed in EPIC is that our signature genes are not expressed by the cancer cells, so that we can easily estimate the proportion of multiple non-malignant cell types composing the bulk samples.

Flow cytometry and gene expression analysis of melanoma samples

Request a detailed protocol

Patients agreed to donate metastatic tissues upon informed consent, based on dedicated clinical investigation protocols established according to the relevant regulatory standards. The protocols were approved by the local IRB, that is, the 'Commission cantonale d’éthique de la recherche sur l’être humain du Canton de Vaud'. Lymph nodes (LN) were obtained from stage III melanoma patients, by lymph node dissection that took place before systemic treatment. The LN from one patient was from the right axilla and the LNs from the other three patients were from the iliac and ilio-obturator regions (Appendix 2—table 1). Single cell suspensions were obtained by mechanical disruption and immediately cryopreserved in RPMI 1640 supplemented with 40% FCS and 10% DMSO. Single cell suspensions from four lymph nodes were thawed and used in parallel experiments of flow cytometry and RNA extraction. In order to limit the number of dead cells after thawing, we removed those cells using a dead cell removal kit (Miltenyi Biotech). Proportions of CD4 T (CD45+/CD3+/CD4+/Melan-A-), CD8 T (CD45+/CD3+/CD8+/Melan-A), NK (CD45+/CD56+/CD3/CD33/Melan-A), B (CD45+/CD19+/CD3/CD33/Melan-A) and Melan-A expressing tumor cells (Supplementary file 3A) were acquired via flow cytometry using the following antibodies: anti-CD3 BV711 (RRID:AB_2566035), anti-CD4 BUV737 (RRID:AB_2713927), anti-CD8 PE-Cy5 (RRID:AB_2713928), anti-CD56 BV421 (RRID:AB_11218798), anti-CD19 APCH7 (RRID:AB_1645728), anti-CD33 PE-Cy7 (RRID:AB_2713932), anti-CD45 APC (RRID:AB_314400), anti-Melan-A FITC (RRID:AB_2713930) and Fixable Viability Dye eFluor 455UV (eBioscience). Data was acquired on a BD LSR II SORP flow cytometry machine (BD Bioscience). Analysis was performed using FlowJo (Tree Star). Cell proportions were based on viable cells only. In parallel total RNA was extracted using the RNAeasy Plus mini kit (Qiagen) following the manufactures’ protocol. Starting material always contained minimum 0.2 × 106 cells. RNA was quantified and integrity was analyzed using a Fragment Analyser (Advanced Analytical). Total RNA from all samples used for sequencing had an RQN ≥7. Libraries were obtained using the Truseq stranded RNA kit (Illumina). Single read (100 bp) was performed using an Illumina HiSeq 2500 sequencer (Illumina).

Post processing of the sequencing was done using Illumina pipeline Casava 1.82. FastQC (version 0.10.1) was used for quality control. RNA-seq reads alignment to the human genome, hg19, and TPM quantification were performed with RSEM (Li and Dewey, 2011) version 1.2.19, using Bowtie2 (Langmead and Salzberg, 2012) version 2.2.4 and Samtools (Li et al., 2009) version 1.2.

RNA-Seq data from this experiment have been deposited in NCBI's Gene Expression Omnibus (Edgar et al., 2002) and are accessible through GEO Series accession number GSE93722 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93722).

Amount of mRNA per cell type

Request a detailed protocol

Healthy donor peripheral blood was obtained through the blood transfusion center in Lausanne. PBMCs were purified by density gradient using Lymphoprep (Axis-Shieldy). Mononuclear cells were stained in order to sort monocytes, B, T and NK cells using the following antibodies: CD14 FITC (RRID:AB_130992), CD19 PE (RRID:AB_2716572), CD3 APC (RRID:AB_130788), CD56 BV421 (RRID:AB_11218798) and fixable live/dead near IR stain (ThermoFisher Scientific). 1 × 106 live cells from each cell type were sorted using the BD FACS ARIA III (BD Biosciences). Total RNA was extracted using the RNAeasy Plus mini kit (Qiagen) following the manufactures’ protocol and quantified using a Fragment Analyser (Advanced Analytical). Values obtained are given in Figure 1—figure supplement 2A.

The mRNA content for the cancer cells was estimated from the flow cytometry data described in the previous section from the four patients with melanoma. For this we used the forward scatter width, which is a good proxy of cell size and mRNA content (Padovan-Merhar et al., 2015; Tzur et al., 2011), and observed that cancer cells had similar amount of mRNA than B, NK and T cells (Figure 1—figure supplement 2B). We thus used a value of 0.4 pg of mRNA per cancer cell.

Public external datasets used in this study

Request a detailed protocol
  • Dataset 1 was obtained from Zimmermann and colleagues (Zimmermann et al., 2016), through ImmPort (http://www.immport.org), accession SDY67. It includes RNA-Seq samples from PBMC of healthy donors before and after influenza vaccination. In addition, the original flow cytometry results files were available, containing multiple immune cell markers. As an independent validation of EPIC, we used the data from 12 pre-vaccination samples of healthy donors and we computed the corresponding immune cell proportions from the flow cytometry files merging the results from their innate and Treg panels (obtaining B, CD4 T, CD8 T, NK cells and monocytes, Supplementary file 3B; Figure 2A)

  • Dataset 2 was obtained from Hoek and colleagues (Hoek et al., 2015), GEO accession GSE64655. This corresponds to RNA-Seq samples from two different donors. Samples have been taken before an influenza vaccination and also 1, 3 and 7 days after the vaccination (56 samples in total). In their experiment, the authors measured RNA-Seq from bulk PBMC samples and also from sorted immune cells (B, NK, T cells, myeloid dendritic cells, monocytes and neutrophils). In addition, flow cytometry was performed to measure the proportion of these cell types in PBMC before the influenza vaccination (personally communicated by the authors; Supplementary file 3C).

  • Dataset 3 was obtained from Linsley and colleagues (Linsley et al., 2014), GEO accession GSE60424. This dataset includes 20 donors (healthy donors and other donors with amyotrophic lateral sclerosis, multiple sclerosis, type one diabetes or sepsis), for a total of 134 samples. RNA-Seq from these donors has been extracted from whole blood and sorted immune cells (B, NK cells, monocytes, neutrophils, CD4 T and CD8 T cells). In addition to RNA-Seq data, complete blood counts data were available for 5 of these donors (personally communicated by the authors; Supplementary file 3D).

  • Dataset 4 was obtained from Pabst and colleagues (Pabst et al., 2016), GEO accession GSE51984. This includes RNA-Seq from five healthy donors. Samples are from total white blood cells and sorted immune cells (B cells, granulocytes, monocytes and T cells).

  • Colon cancer dataset was obtained from Becht and colleagues (Becht et al., 2016), GEO accession GSE39582. This corresponds to microarrays of primary colon cancer tumors. In addition to gene expression data, immunohistochemistry data of CD3, CD8 and CD68 was available for 33 patients (personally communicated by the authors).

  • Single-cell RNA-Seq data from tumor-infiltrating cells were obtained from Tirosh and colleagues (Tirosh et al., 2016), GEO accession GSE72056. This corresponds to 19 donors and comprises primary tumors, lymph node metastasis or other lesions. It includes 4645 cells. Cell type identity was taken from Tirosh et al. (2016) (B, NK, T cell, macrophage, CAFs, endothelial cell, cancer cell as well as cells not assigned a specific cell type). Among the T cells, we then defined subsets based on their gene expression: CD4 T cells (expressing CD4 but not CD8A nor CD8B) and CD8 T cells (expressing CD8A or CD8B but not CD4). The other T cells not corresponding to one of these two groups were removed from further analyses. We further defined among the CD4 T cells: Treg (expressing either FOXP3 or CD25 above the median among CD4 T cells) and Thelper (those CD4 T cells not belonging to Treg group). In silico reconstructed bulk samples from each donor were obtained as the average per gene from all samples of the given donor. The corresponding cell fractions from these bulk samples were obtained as the number of cells from each cell type divided by the total number of cells (Supplementary file 3E). In the results, we split this dataset in two depending on the origin of the biopsies: lymphoid tissues for samples obtained from lymph node and spleen metastases, vs. the rest of samples, which were obtained from primary tumor and other metastases.

For the above datasets 2 and 3, we obtained raw fastq files. RNA-seq reads alignment to the human genome, hg19, and TPM quantification were performed with RSEM (Li and Dewey, 2011) version 1.2.19, using Bowtie2 (Langmead and Salzberg, 2012) version 2.2.4 and Samtools (Li et al., 2009) version 1.2.

For the other datasets, we directly obtained the summary counts data from the respective studies without mapping the reads by ourselves, and we transformed these counts to TPM wherever necessary.

Reference gene expression profiles from circulating cells

Request a detailed protocol

Reference gene expression profiles of sorted immune cells from peripheral blood were built from the datasets 2, 3 and 4 described in previous section. We verified no experimental biases were present in these data through unsupervised clustering of the samples, with help of a principal component analysis based on the normalized expression from the 1000 most variable genes (Figure 1C).

The median value of TPM counts was computed per cell type and per gene. Similarly, the interquartile range of the TPM counts was computed per cell type and gene, as a measure of the variability of each gene expression in each cell type. Values of these reference profiles are given in Supplementary file 1). Granulocytes from dataset 4 and neutrophils from datasets 2 and 3 were combined to build the reference profile for neutrophils (neutrophils constitute more than 90% of granulocytes). No reference profile was built for the myeloid dendritic cells as only few samples of these sorted cells existed and they were all from the same experiment. Monocytes are not found in tumors but instead there are macrophages, mostly from monocytic lineage, that are infiltrating tumors and that are not found in blood. For this reason, we also used the monocyte reference gene expression profile as a proxy for macrophages when applying EPIC to tumor samples. Such an assumption gave coherent results as observed in the results.

Reference profiles from tumor-infiltrating cells

Request a detailed protocol

We also built gene expression reference profiles from tumor-infiltrating cells. These are based on the single-cell RNA-Seq data from Tirosh and colleagues (Tirosh et al., 2016) described above. We only used the non-lymphoid tissue samples to build these tumor-infiltrating cell’s profiles, avoiding in this way potential ‘normal immune cells’ present in the lymph nodes and spleen. These reference profiles (Supplementary file 2) were built in the same way as described above for the reference profiles of circulating immune cells, but based on the mean and standard deviation instead of median and interquartile range respectively, due to the nature of single-cell RNA-Seq data and gene dropout present with such technique.

When testing EPIC with these profiles for the single-cell RNA-Seq datasets, for the samples of primary tumor and other non-lymph node metastases, a leave-one-out procedure was applied: for each donor we built reference cell profiles based only on the data coming from the other donors.

Cell marker gene identification

Request a detailed protocol

EPIC relies on signature genes that are expressed by the reference cells but not by the uncharacterized cells (e.g., cancer cells). For each reference cell type, we thus built a list of signature genes through the following steps:

  1. The samples (from the peripheral blood datasets) from each immune cell type were tested for overexpression against:

    1. the samples from each other immune cell of peripheral blood datasets (1 cell type vs. 1 other cell type at a time);

    2. the samples of the Illumina Human Body Map 2.0 Project (ArrayExpress ID: E-MTAB-513) considering all non-immune related tissues;

    3. the samples from GTEx (GTEx Consortium, 2015) from each of the following tissues (one tissue at a time): adipose subcutaneous; bladder; colon-transverse; ovary; pancreas; testis (data version V6p).

  2. Only genes overexpressed in the given cell type with an adjusted p-value<0.01 for all these tests were kept. Conditions 1b) and 1c) are there to ensure signature genes are expressed in the immune cells and no other tissues.

  3. The genes that passed 2) were then ranked by the fold change from the overexpression tests to preselect the genes showing the biggest difference between the various cell types.

  4. The list of genes was then manually curated, comparing the expression of the genes per cell type in the peripheral blood datasets and the tumor-infiltrating cells dataset: only genes expressed at similar levels between the blood and tumor-infiltrating cells were kept (to avoid differences due to exhaustion phenotype for example). Genes expressed at much higher levels than the other genes were also removed from the signature as these could have biased the least-square optimization towards good predictions for these genes only.

  5. In addition to CD4 and CD8 T cells signatures, we built a signature list of general T cell genes in the same way as described above, and it contains genes expressed at similar levels in the two T cell subtypes. This general T cell signature is also part of EPIC, even when predicting the proportions of the T cell subsets.

  6. For the tumor-infiltrating cell reference profiles, signature genes from CAFs, endothelial cells and macrophages were also needed. These were built in a similar way than above, considering the overexpression from each of these cell types against each other cell type from the tumor-infiltrating cells data.

All the differential expression tests were performed with DESeq2 (Love et al., 2014).

Appendix 1—table 1 summarizes the full list of signature genes per cell type.

Prediction of cell proportions in bulk samples with other tools

Request a detailed protocol

We compared EPIC's predictions with those from various cell fraction prediction methods. These other methods were run with the following packages (using the default options when possible):

  • CIBERSORT (Newman et al., 2015) (R package version 1.03) was run based on two different gene expression reference profiles:

    • based on the LM22 reference profiles derived in (Newman et al., 2015). For comparison with the experimentally measured cell proportions, we summed together the sub-types predictions of CIBERSORT within each major immune cell type.

    • based on the reference profiles and signature genes we derived here for EPIC.

  • DeconRNASeq (v1.16) (Gong and Szustakowski, 2013) does not contain immune cell reference profiles and we used the reference profiles we derived here as well as the corresponding signature genes. We present the results with ‘use.scale’ parameter set to FALSE, which usually gave better results.

  • DSA (Zhong et al., 2013) only needs a gene signature per cell type to estimate the proportion of cells in multiple bulk samples together. We used the implementation of DSA found in CellMix (Gaujoux and Seoighe, 2013) R package (version 1.6.2). As DSA needs many samples to estimate simultaneously the proportions of cells in these samples, we considered all the PBMC samples from Hoek et al. data when fitting this dataset (eight samples) and all whole blood samples from Linsley et al. data when fitting this other dataset (20 samples), even though the cell proportions have been measured experimentally only for 2 and 5 samples respectively. For the gene signature, we used the same genes as those used for EPIC (Appendix 1—table 1; markers of B cells, CD4 T cells, CD8 T cells, monocytes, neutrophils and NK cells for the predictions in the blood datasets; markers of B cells, CAFs, CD4 T cells, CD8 T cells, endothelial cells, macrophages, and NK cells for the predictions in solid tumors).

  • ISOpure (Quon et al., 2013) estimates the profile and proportion of cancer cells by comparing many bulk samples containing cancer cells and many healthy bulk samples of the same tissue. Although the primary goal is not to compute the proportions of the different cell types composing a sample, cell fractions can still be obtained with this method. In particular, one output of ISOpure is how much each of the healthy reference samples is contributing to a given bulk sample. Instead of using bulk healthy samples, we used our cell reference profiles, so that each ‘reference sample’ corresponded to a different cell type. These reference samples and the bulk samples were then subsetted by the same signature genes that we derived for EPIC. ISOpure was then run based on these data. The contribution of each cell type was taken as the relative contribution outputted by ISOpure from each of the reference cell sample. The R implementation ISOpureR (Anghel et al., 2015) version 1.0.21 was used.

  • MCP-counter (Becht et al., 2016) (R package version 1.1.0) was run with the ‘HUGO_symbols’ chosen as features or with ‘affy133P2_probesets’ for the microarray-based IHC dataset.

  • TIMER (Li et al., 2016) predictions were obtained by slightly adapting the available source code. The reference profiles available in TIMER were used directly. In addition to bulk gene expression, tumor purity estimates based on DNA copy number variation are needed in TIMER to refine the gene signature. As this information is not available in our benchmarking datasets, we kept all the original immune gene signatures for the predictions in blood. For the tumor datasets, we used the gene signatures obtained from the TCGA data for melanoma or colorectal cancer depending on the origin of cancer.

  • ESTIMATE (Yoshihara et al., 2013) was run with their R package version 1.0.11.

For CIBERSORT, DeconRNASeq and ISOpure, when run based on our gene expression reference profiles, we used the reference profiles from peripheral blood immune cells for the predictions in blood and the reference profiles from tumor-infiltrating cells for the predictions in solid tumors.

List of abbreviations

Request a detailed protocol

CAFs: cancer-associated fibroblasts; EPIC: acronym for our method to ‘Estimate the Proportion of Immune and Cancer cells’; GEO: Gene Expression Omnibus; IHC: immunohistochemistry; PCA: principal component analysis; RMSE: root mean squared error; TCGA: The Cancer Genome Atlas; TPM: transcripts per million.

Appendix 1

Appendix 1—table 1
Gene markers used per cell type.

Only markers of cell types present in the respective reference gene expression profiles are used.

https://doi.org/10.7554/eLife.26476.027
Cell typeGenes markers
B cellsBANK1, CD79A, CD79B, FCER2, FCRL2, FCRL5, MS4A1, PAX5, POU2AF1, STAP1, TCL1A
CAFsADAM33, CLDN11, COL1A1, COL3A1, COL14A1, CRISPLD2, CXCL14, DPT, F3, FBLN1, ISLR, LUM, MEG3, MFAP5, PRELP, PTGIS, SFRP2, SFRP4, SYNPO2, TMEM119
CD4 T cellsANKRD55, DGKA, FOXP3, GCNT4, IL2RA, MDS2, RCAN3, TBC1D4, TRAT1
CD8 T cellsCD8B, HAUS3, JAKMIP1, NAA16, TSPYL1
Endothelial cellsCDH5, CLDN5, CLEC14A, CXorf36, ECSCR, F2RL3, FLT1, FLT4, GPR4, GPR182, KDR, MMRN1, MMRN2, MYCT1, PTPRB, RHOJ, SLCO2A1, SOX18, STAB2, VWF
MacrophagesAPOC1, C1QC, CD14, CD163, CD300C, CD300E, CSF1R, F13A1, FPR3, HAMP, IL1B, LILRB4, MS4A6A, MSR1, SIGLEC1, VSIG4
MonocytesCD33, CD300C, CD300E, CECR1, CLEC6A, CPVL, EGR2, EREG, MS4A6A, NAGA, SLC37A2
NeutrophilsCEACAM3, CNTNAP3, CXCR1, CYP4F3, FFAR2, HIST1H2BC, HIST1H3D, KY, MMP25, PGLYRP1, SLC12A1, TAS2R40
NK cellsCD160, CLIC3, FGFBP2, GNLY, GNPTAB, KLRF1, NCR1, NMUR1, S1PR5, SH2D1B
T cellsBCL11B, CD5, CD28, IL7R, ITK, THEMIS, UBASH3A
https://doi.org/10.7554/eLife.26476.026

Appendix 2

Appendix 2—table 1
Characteristics of the patients with metastatic melanoma and corresponding lymph node samples.
https://doi.org/10.7554/eLife.26476.029
PatientAge (years)GenderTissue
LAU12559maleiliac lymph node
LAU35570femaleiliac-obturator lymph node
LAU125587maleaxillary lymph node
LAU131481maleiliac-obturator lymph node
https://doi.org/10.7554/eLife.26476.028

Data availability

The following data sets were generated
The following previously published data sets were used
    1. Tirosh I
    2. Izar B
    (2016) Single cell RNA-seq analysis of melanoma
    Publicly available at the NCBI Gene Expression Omnibus (accession no: GSE72056).

References

Decision letter

  1. Alfonso Valencia
    Reviewing Editor; Barcelona Supercomputing Center - BSC, Spain

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data" for consideration by eLife. Your article has been favorably evaluated by Naama Barkai (Senior Editor) and three reviewers, one of whom is a member of our Board of Reviewing Editors. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted these comments to understand if it will be possible for you to provide what the reviewers consider essential to the publication of this work. At this point we ask you to provide a detailed response with an action plan and timetable for the completion of the essential tasks noted below. We will have the Board and reviewer consider your responses and issue a binding recommendation as soon as possible.

Summary:

This manuscript describes the EPIC method to estimate the proportion of cancerous cells and immune cells of various types within a bulk gene expression profile of a tumor sample or PBMC. The work fits within the recent renewed interest in the area of expression deconvolution in tumor-infiltrating lymphocytes.

The method is based on the extrapolation of the proportion of RNA in standard cell types to determine the proportion of cell types in mixtures. The main features of the method are: it takes into account the estimated proportion of RNA in the samples, weights highly the contribution of less variable genes in the reference set, it does not make assumptions about the heterogeneity of the non-normal cells in the samples, and, it uses the expression of maker genes that are not expressed in cancer cells to normalize the proportion of tumor/non tumor cells in the samples. The data presented correspond mainly to metastatic melanoma samples with some additional colon cancer cases.

Essential revisions:

1) The works requires a clear demonstration of the capacity of the method to do well estimating cell subsets at higher resolution, as this seems to be the main advantage of the method.

2) A better analysis and clarification of what are the key steps of the procedure that improve the results.

3) Clarify the dependency of the results of outliers and in general revise the significance criteria.

4) Make the system accessible via a web interface.

Other concerns:

1) – The authors should consider to extend the reference gene expression profiles to additional cell types such as stromal and endothelial cells which are known to have a role in tumor microenvironment as well as deepen the separation between cytotoxic T cells (CD8 cells) and CD4 T cells gene expression profiles, which currently is low and at a level less than what has been shown to be important correlates for immune-tumor interactions/therapeutic response.

2) Consider rephrasing, the estimation of the tumor in 3D – it is not too surprising given that the melanoma forms the bulk of the tumor in most of these samples and is quite distinct – this is akin to getting neutrophil estimates correct in whole blood.

3) With respect to the second validation (comparing EPIC to IHC data), the authors claimed they observed "a good agreement between cell proportions measured by IHC and our predictions", however there is a weak non-significant correlation (r=0.3, p=0.09) between these factors in macrophages that should be mentioned in the text. This may be due to the maker used in IHC being expressed in other cell subsets or due to protein/mRNA differences – the former being relatively easy to distinguish.

4) The first test and comparison with other methods is carried out with normal blood samples and (Figure 2B). The EPIC method performs better (pearson R) than other methods. Notice that in whole blood (dataset 3 (Linsley et al., 2014)) the differences with CIBERSORT and DSA are minimal. It will be important to clarify if the additional information is provided by the inclusion of the estimated mRNA content in the calculations (Figure 2A versus 2C).

5) Reconsider the following claim "We observed a remarkable agreement between our predictions and experimentally determined cell fractions (Figure 3A). Of note, the proportion of melanoma cells could be very accurately predicted even in the absence of a priori information about their gene expression." in light of the small size of the data sets (4 donors of which two of them dominate the correlation).

6) Comment of the significance of the results in Figure 3B and Figure 3C where the differences are based on single outliers.

7) It will be good to check if in the information in Figure 5A (and supplementary figures) can be better interpreted comparing the method only in normal tissue samples (equivalent to Figure 5 supplementary figures).

8) Assess the significance of the results in Figure 5B.

9) Benchmarking other methods, such as Cibersort by summing high-resolution cell subsets may be incorrect – a better approach would be to replace LM22 with the matrix derived by the authors – if not, I would suggest saying that this comparison is less than ideal.

10) It is not clear from the manuscript whether or not the advance over CIBERSORT is due to a better marker gene set and reference panel or due to the other methodological innovations that they made. This point should be clarified.

11) The reference to the use of "zeros" in the gene expression profile of cancerous tissue as informative to be leveraged in deconvolution in Gosink 2007 and Clarke 2010 should be included.

12) Respect to ISOpure, it is not clear from the description whether ISOpure was run just with signature genes or with the whole gene expression profile.

13) The ideas implicit in DeMix where the ratio of the variances provides an implicit importance weight (Ahn 2013) should be discussed.

Other points:

1) The authors should change the scale and marking of the statistical significance from (*p<0.1; **p<0.05; ***p<0.01) to (*p<0.05; **p<0.01; ***p<0.001) as p>0.05 is not considered significant.

2) The authors claimed that "Immune cells differ in their gene expression profiles depending on their state and site of origin (e.g., blood or tumors)" – a reference should be added.

3) There is a lack in specifying "Supplementary file 4" in the relevant places.

4) Figures and supplementary figures are not numbered in the manuscript, what makes difficult to know what the supplementary figures correspond to what text.

5) The use of a different scale for each method does not help to the interpretation of the figures.

6) Subsection “Validation in blood”: Renormalization by mRNA content is used in (Qiao 2012) to improve the estimates of cell-type proportions.

7) "Prediction of cancer and immune cell type proportions": I found this explanation somewhat confusing because Equation 1 is never actually used in the deconvolution – only Equations 6 and 7. Because Equation 1 is introduced, I expected EPIC to try to solve it at some point.

Regarding references:

1) The authors claim that this is the first deconvolution methodology to take into account total RNA differences between cell subsets is incorrect, to the best of this reviewers knowledge, Baron et al. (Cell Systems 2016) propose a method, Bseq-SC which uses total RNA to improve deconvolution estimates. Of relevance here, their estimates for total RNA levels come from the single cell expression data, information which could be incorporated into EPIC too to increase its resolution.

Other pertinent references include:

Clarke J, Seo P, Clarke B: Statistical expression deconvolution from mixed tissue samples. Bioinformatics. 2010, 26: 1043-1049

Gosink MM, Petrie HT, Tsinoremas NF: Electronically subtracting expression patterns from a mixed cell population. Bioinformatics. 2007, 23: 3328-3334.

Ahn J, Yuan Y, Parmigiani G, Suraokar MB, Diao L, Wistuba II, Wang W. DeMix: deconvolution for mixed cancer transcriptomes using raw measured data. Bioinformatics. 2013 Aug 1;29(15):1865-71.

Qiao W, Quon G, Csaszar E, Yu M, Morris Q, Zandstra PW. PERT: a method for expression deconvolution of human blood samples from varied microenvironmental and developmental conditions. PLoS Comput Biol. 2012

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data" for further consideration at eLife. Your revised article has been favorably evaluated by Naama Barkai (Senior Editor) and three reviewers, one of whom is a member of our Board of Reviewing Editors.

The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:

Essential revisions:

The claim that "ISOpure is specifically designed to run genome-wide" does not appear anywhere in the ISOpure manuscript, is not consistent with the uses of ISOpure, and there is nothing obvious in the ISOpure algorithm that makes it specifically designed to be applied genome-wide (such a claim is made for PERT, and one could argue that PERT contains a gene-specific weighting mechanism, like EPIC, which makes it suitable for genome-wide use, but no such claim is made for ISOpure).

Indeed, an ISOpure-like method applied genome-wide in the PERT manuscript does quite badly at recovering the mixture proportions of the various cell types. Therefore, the new version in the revised version regarding ISOpure needs to be revised and inaccurate statements removed.

In conclusion, in the new version EPIC should be compared with ISOpure using the same inputs as CIBERSORT and EPIC.

https://doi.org/10.7554/eLife.26476.048

Author response

Essential revisions:

1) The works requires a clear demonstration of the capacity of the method to do well estimating cell subsets at higher resolution, as this seems to be the main advantage of the method.

We agree that it is important to estimate cell subsets at a higher resolution. For this reason, we have now constructed gene expression reference profiles for CD4 T cells and CD8 T cells from the peripheral blood datasets. In addition, we also took advantage of the single-cell RNA-seq data from Tirosh et al., 2016, in order to define these CD4 and CD8 T cells for tumor infiltrating cells. Finally, we have built reference gene expression profiles from stromal cells (cancer associated fibroblasts) and endothelial cells, also from the single-cell RNA-Seq data. As before, we defined signature genes for these new cell types.

Going back to the FACS data from the study of Zimmermann et al., 2016, we could estimate the proportion of CD4 T cells and CD8 T cells that were present in these blood samples, and used this data as a first validation of EPIC in blood data, including these new cell subsets, as shown in Author response image 1.

Author response image 1
Comparison between EPIC predictions and measured cell fractions in PBMC dataset from Zimmermann et al. 2016.
https://doi.org/10.7554/eLife.26476.031

Concerning the validation in tumor samples:

1) We expanded our flow cytometry analysis of the four melanoma samples, including CD4 T cells and CD8 T cells to test the predictions there.

2) We could also validate EPIC predictions for the CD8 T cells in the immunohistochemistry data from Becht et al., 2016.

3) Predictions from all cell types, including stromal and endothelial cells, could also be validated based on the data from Tirosh et al., 2016.

We performed a detailed comparison of the predictions from EPIC and the other methods, including these additional cell types. Some of the results are given in Author response image 2, showing the correlation between the predictions from EPIC and the experimentally observed proportions for the various cell types in different datasets.

Author response image 2
Comparison between the experimentally measured cell fractions and EPIC predictions, including additional cell types in: (A) our expanded flow cytometry analysis of melanoma; (B) lymph node metastasis and primary tumor melanoma data from Tirosh et al., 2016.
https://doi.org/10.7554/eLife.26476.032

Pushing the limits of the cell type fraction predictions to even lower subsets, we also defined Thelper and Treg cell subsets, based on the data from Tirosh et al., 2016. We note that in such cases, the accuracy of the predictions starts to suffer because of the substantial similarity of these cell types at the transcriptional level. The results based on EPIC are nevertheless still as good or even better than those obtained from CIBERSORT (Figure 5—figure supplement 6). Our results suggest that quantitative predictions for such less abundant and more redundant cell types are still challenging and may reach the limits of the signal to noise ratio in gene expression data.

We still note that, in our view, estimating cell subsets at higher resolution, albeit very promising, may not be the main advantage of EPIC compared to other methods. Rather, we believe that inclusion of more relevant reference profiles (tumor infiltrating vs. blood), the ability to consider uncharacterized cell types and the renormalization by mRNA content are more important for the improvement in predictions.

2) A better analysis and clarification of what are the key steps of the procedure that improve the results.

We are sorry we did not satisfactorily detail the improvements brought by each of the steps from EPIC.

We have therefore complemented our analyses, clarifying the key steps of the procedure improving the results. The two most important steps are:

1) Renormalization by mRNA content (Figure 2A vs. Figure 2C; Figure 2—figure supplement 2. Figure 2—figure supplement 3).

2) Ability to consider uncharacterized (and possibly highly variable) cell types in the cell fraction prediction (Figure 2—figure supplement 4; and the comparisons in Figure 5—figure supplements 15).

Other steps, like the use of the variability in the gene expression reference profiles (Figure 2—figure supplement 3) and the use of more biologically relevant reference gene expression profiles (Figure 3 vs. Figure 4, Figure 4—figure supplement 1) further add some improvements to the results, as discussed in the manuscript.

3) Clarify the dependency of the results of outliers and in general revise the significance criteria.

We agree that outliers could have some influence on the results. For this reason, we decided to remove the analysis performed about the melanoma immunohistochemistry data from Jönsson et al., 2010. Indeed this dataset was only qualitative, grouping the B and T cells infiltration into absent, low or high, and within the high groups, only four samples were present, two of which were very highly infiltrated such that results were highly biased by these. If removing these two outlier samples, then not enough samples remained in the high infiltration to be conclusive.

Considering the other results, we revised our significance criteria, lowering the p-values needed for results to be considered as significant.

We also discussed in the updated manuscript that datasets with a large range of data points (i.e. large variability in cell fractions across samples) could present some very high correlations due to few points driving the correlation. We then emphasized that it is important to consider the correlation values only to compare the different methods on the same dataset as the range of values there will be the same.

In addition to the correlation values, we also underline in the revised manuscript that the root mean squared values provide another good indicator of the quality of the predictions. Notably, the RMSE is not influenced by outlier data points that can drive the high correlation values.

Based on all these comparisons, we still believe that EPIC performed very well and often significantly better than other methods.

4) Make the system accessible via a web interface.

We have now implemented such a web interface and made it available at the address: http://epic.gfellerlab.org.

Users can upload their own dataset for which they want to estimate the proportion of cells. They could then choose to use our implemented gene expression reference profiles or to upload some reference profiles of their own, and define various other parameters.

Results are then outputted as a web-table with the various cell type proportions (this table can also easily be downloaded), and different figures complement the results, giving an overview of the various cell type proportions.

Other concerns:

1) The authors should consider to extend the reference gene expression profiles to additional cell types such as stromal and endothelial cells which are known to have a role in tumor microenvironment as well as deepen the separation between cytotoxic T cells (CD8 cells) and CD4 T cells gene expression profiles, which currently is low and at a level less than what has been shown to be important correlates for immune-tumor interactions/therapeutic response.

We have now extended our reference gene expression profiles to also include also CD4+ T cells, CD8+ T cells, stromal (cancer-associated fibroblasts) and endothelial cells, showing good accuracy of the predictions for these cell types as well, as described above in point 1.

2) Consider rephrasing, the estimation of the tumor in 3D – it is not too surprising given that the melanoma forms the bulk of the tumor in most of these samples and is quite distinct – this is akin to getting neutrophil estimates correct in whole blood.

We have updated our discussion of these results in the subsection “Validation in blood” to clarify why these predictions can really be considered as good and are not just an artifact of the cancer cell gene expression being totally different from the other cell types.

We emphasize here that it is not only the correlation that is good in our predictions but also the root mean squared error, showing that our predictions are able to predict accurately the true absolute proportions of the melanoma cells (e.g. in Figure 3C (previous 3D), the predicted proportion of cells fall nearly perfectly on the gray dashed line, which corresponds to the y = x line (i.e. line of "predicted value" = "observed value").

In addition to having good predictions for the melanoma cell fractions, these results also show that the predictions were also accurate for the various immune cell types, which have much more similar reference profiles (Figure 4—figure supplement 1), and thus the results we present there are not just as simple as if we would only be predicting the neutrophils proportions from a whole blood sample.

3) With respect to the second validation (comparing EPIC to IHC data), the authors claimed they observed "a good agreement between cell proportions measured by IHC and our predictions", however there is a weak non-significant correlation (r=0.3, p=0.09) between these factors in macrophages that should be mentioned in the text. This may be due to the maker used in IHC being expressed in other cell subsets or due to protein/mRNA differences – the former being relatively easy to distinguish.

In our revised predictions, the correlation between the macrophages predicted by EPIC and observed IHC values is better, even with the reference profiles from peripheral blood (new correlation is r=0.52, p=0.002). This improvement can be explained by our use of a refined list of signature genes and the use of TPM-normalized reference profiles.

However the results are not significant in this analysis for the predictions of the CD8 T cells that did not exist in the previous version of the manuscript. We made therefore a comment in the revision about this correlation that is not significant.

Importantly however, the correlations for both macrophages and CD8 T cells improve in this immunohistochemistry dataset when using the reference profiles from tumor infiltrating cells (r=0.66, p<10-4 and r=0.53, p=0.001 respectively). We anticipate that this improvement is due to the use of reference gene expression profiles corresponding better to the phenotype of CD8 T cells and macrophages present in primary tumors.

We expanded the discussion about this point in the Discussion section.

4) The first test and comparison with other methods is carried out with normal blood samples and (Figure 2B). The EPIC method performs better (pearson R) than other methods. Notice that in whole blood (dataset 3 (Linsley et al., 2014)) the differences with CIBERSORT and DSA are minimal. It will be important to clarify if the additional information is provided by the inclusion of the estimated mRNA content in the calculations (Figure 2A versus 2C).

Our new comparisons show that indeed other methods like CIBERSORT and DSA could also benefit from the renormalization by mRNA (Figure 2—figure supplement 2).

Importantly however, EPIC is especially designed to predict the proportion of cells in tumor samples and the ability of EPIC to consider a cell type without any reference gene expression profiles is another particularly important improvement in this context (Figure 2—-figure supplement 4).

5) Reconsider the following claim "We observed a remarkable agreement between our predictions and experimentally determined cell fractions (Figure 3A). Of note, the proportion of melanoma cells could be very accurately predicted even in the absence of a priori information about their gene expression." in light of the small size of the data sets (4 donors of which two of them dominate the correlation).

We have complemented the discussion of these results, underlining this possible artifact. We still point out that, despite this small sample size and some data points that might be driving high correlation values, the predictions fall nearly on the y=x line, indicating that all cell types could indeed be predicted accurately, as also reflected by the low RMSE.

In addition, we emphasized in the revised manuscript that the correlation values should only be compared for methods tested on the same datasets, and in our comparison with the other methods these correlations were lower for the other methods, despite the presence of the same data possibly dominated by two donors.

6) Comment of the significance of the results in Figure 3B and Figure 3C where the differences are based on single outliers.

We agree that outliers in Figure 3C have some influence on the results. For this reason, as detailed in the response to major comment 3), we decided to remove the analysis performed about this data. But we do not think that outliers in Figure 3B are biasing the results from this other analysis.

7) It will be good to check if in the information in Figure 5A (and supplementary figures) can be better interpreted comparing the method only in normal tissue samples (equivalent to Figure 5 supplementary figures).

We believe this concern has already been answered. Indeed in Figure 5B we do a comparison of all the methods if considering only the normal immune cell types. And in Figure 2B (and Figure 2—figure supplements 12) we performed a detailed comparison of all methods based on healthy blood samples.

8) Assess the significance of the results in Figure 5B.

We have now indicated with some stars the p-values for each correlation given on Figure 5B as well as on Figure 5A and Figure 2B.

9) Benchmarking other methods, such as Cibersort by summing high-resolution cell subsets may be incorrect – a better approach would be to replace LM22 with the matrix derived by the authors – if not, I would suggest saying that this comparison is less than ideal.

10) It is not clear from the manuscript whether or not the advance over CIBERSORT is due to a better marker gene set and reference panel or due to the other methodological innovations that they made. This point should be clarified.

These two comments are highly related in will thus be answered together.

First, we have now included in our analyses the results from CIBERSORT based on LM22 reference profiles as well as based on the reference profiles and signatures we derived in our study. We can note that the results from CIBERSORT based on either of the reference profiles are similar; in some cases results are better based on LM22 signature and in other cases it is better with our signature (e.g. Figure 2B, Figure 2—figure supplement 1, Figure 5B).

We also showed in the new Figure 2—figure supplements 1-4 that the methodological innovations we implemented in EPIC are important for accurate predictions, which is then verified in the context of tumor samples as exemplified by the results from Figure 5A-B and Figure 5—figure supplement 1.

11) The reference to the use of "zeros" in the gene expression profile of cancerous tissue as informative to be leveraged in deconvolution in Gosink 2007 and Clarke 2010 should be included.

We thank the reviewer for pointing out these papers and we have included references to these in the revised manuscript, discussing their idea of using genes not expressed by a given cell to estimate its proportion.

12) Respect to ISOpure, it is not clear from the description whether ISOpure was run just with signature genes or with the whole gene expression profile.

We apologize if the explanation on how ISOpure was run was not clear. Indeed, we used here the whole gene expression profiles and not only the signature genes, as ISOpure has been specifically designed to account for the gene expression of all the genes from a sample. This point has been clarified in the text.

13) The ideas implicit in DeMix where the ratio of the variances provides an implicit importance weight (Ahn 2013) should be discussed.

We thank the reviewer for this reference and we added it in the text. This is also some interesting framework to estimate the proportion of cell types from a mixed sample in the presence of one uncharacterized cell type. This framework works in a similar way than ISOpure that we have tested in our manuscript.

The weights given to each gene in EPIC (based on their variability in the reference profiles) is included in the constrained least square optimization we perform. In contrary, DeMix method does not perform such regression based on signature genes, but it considers the distribution of each gene in each pure normal tissue sample and in each tumor sample. Based on these distributions, assumed to be log-normal distributed in the normal tissue, DeMix estimates the gene distribution inside the tumor cells and proportion of tumor cells that would best match the observed distribution of genes in the mixed tumor samples. We agree that when doing such predictions, the genes that had a narrower distribution both in the normal and tumor samples will be implicitly more important and should have a bigger impact on the predicted tumor cell fractions. However, no real weights will be given to these genes in solving the problem with DeMix.

Also, this method assumes that the normal tissue is composed of a single type of “normal” cells or at least that the normal cell types have a constant proportion inside normal tissue. Such a framework would thus not be appropriate to study the proportions of various immune cell types infiltrating a tumor, as it has been observed that the proportion of the different normal cells (various immune cell types, stromal and endothelial cells) vary from one patient to another.

Other points:

1) The authors should change the scale and marking of the statistical significance from (*p<0.1; **p<0.05; ***p<0.01) to (*p<0.05; **p<0.01; ***p<0.001) as p>0.05 is not considered significant.

As suggested by the reviewer, we modified the scale of statistical significance to (*p<0.05; **p<0.01; ***p<0.001) on the figures where the exact p-value was not directly given.

2) The authors claimed that "Immune cells differ in their gene expression profiles depending on their state and site of origin (e.g., blood or tumors)" – a reference should be added.

We added various references that discuss the fact that tumor infiltrating immune cells present an altered phenotype compared to circulating immune cells (e.g. describing the exhaustion phenotype of T cells).

3) There is a lack in specifying "Supplementary file 4" in the relevant places.

We have added references to this file (now named Supplementary file 3) in the legend of the respective figures. References to this file were already present in various places in the main text.

4) Figures and supplementary figures are not numbered in the manuscript, what makes difficult to know what the supplementary figures correspond to what text.

We are sorry that the numbers did not appear in the figure. This has now been corrected and all figures are numbered.

5) The use of a different scale for each method does not help to the interpretation of the figures.

The scale of the different figures are now the same for each method except for MCPcounter. Indeed MCPcounter outputs do not correspond to cell fractions so that its exact values are not comparable to the other methods. Then, the scales are not necessarily the same between various datasets due to different ranges in the data present in each dataset. Also for the immunohistochemistry data from Becht et al., 2016 (Figure 5—figure supplement 3), the ranges in the predicted proportions from the various methods is broad and we thus use different scales for each method to better see the results, as the true cell proportions are not known (only the intensity of some cell-specific protein markers are known from the immunohistochemistry).

6) Subsection “Validation in blood”: Renormalization by mRNA content is used in (Qiao 2012) to improve the estimates of cell-type proportions.

We included a reference to the method presented in this paper, PERT, which is similar to the method ISOpure we have tested here. To our understanding, ISOpure is some further improvement over PERT, developed in the same group, in order to allow for the estimation of an uncharacterized cell type in addition to the known cell types.

Also, to our understanding, PERT is allowing the gene expression value from each gene to be multiplied by some factor (different for each gene), in order to account for differences in expression in the reference cells studied in isolation and the same cells present in a mixed sample. However, it seems to us that PERT is not considering any renormalization by mRNA content at the cell level.

7) "Prediction of cancer and immune cell type proportions": I found this explanation somewhat confusing because Equation 1 is never actually used in the deconvolution – only Equations 6 and 7. Because Equation 1 is introduced, I expected EPIC to try to solve it at some point.

We are sorry if the description of the equation was not very clear, but we think that this equation (1) is actually very important, as this is the equation this is at the root of EPIC method (see e.g. Figure 1C). In particular, equations (6) and (7) are directly resulting from equation (1). These two equations are obtained from equation (1) after normalizing the data into TPM as described in equation (2) and restraining the resolution to a subset of signature genes that are not expressed by the cancer cells, which simplify the system to solve. In order to make the explanation more easily understandable, we have decided to use the standard TPM normalization instead of our previous normalization of the counts. We have also extended equation (6) to detail how it is finally obtained.

Regarding references:

1) The authors claim that this is the first deconvolution methodology to take into account total RNA differences between cell subsets is incorrect, to the best of this reviewers knowledge, Baron et al. (Cell Systems 2016) propose a method, Bseq-SC which uses total RNA to improve deconvolution estimates. Of relevance here, their estimates for total RNA levels come from the single cell expression data, information which could be incorporated into EPIC too to increase its resolution.

We thank the reviewer for pointing out this reference and apologize for not citing this paper, which we had overlooked.

In this paper, the authors study normal pancreas at the single-cell level, having sequenced a very large amount of human and mice cells. In their analysis, they also developed a method, BSeq-SC to estimate the proportion of the various type of pancreatic cells from a bulk sample. This method integrates reference profiles the authors built from their single-cell RNA-seq data and then use CIBERSORT for the cell fraction predictions. The authors indeed observed that the different types of cells (mostly pancreatic but also very few immune cells) contained a different number of total transcripts. They used this information when building the reference profiles from pancreatic cells.

Contrary to EPIC, this renormalization by mRNA content is thus performed a priori on the reference gene expression profiles, instead of doing it a posteriori on the predicted mRNA proportions. For completeness in our analyses, we thus also modified EPIC in order to perform such a priori renormalization and observed the results were similar to those based on the a posteriorinormalization developed for EPIC (Figure 2—figure supplement 3).

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Essential revisions:

The claim that "ISOpure is specifically designed to run genome-wide" does not appear anywhere in the ISOpure manuscript, is not consistent with the uses of ISOpure, and there is nothing obvious in the ISOpure algorithm that makes it specifically designed to be applied genome-wide (such a claim is made for PERT, and one could argue that PERT contains a gene-specific weighting mechanism, like EPIC, which makes it suitable for genome-wide use, but no such claim is made for ISOpure).

Indeed, an ISOpure-like method applied genome-wide in the PERT manuscript does quite badly at recovering the mixture proportions of the various cell types. Therefore, the new version in the revised version regarding ISOpure needs to be revised and inaccurate statements removed.

In conclusion, in the new version EPIC should be compared with ISOpure using the same inputs as CIBERSORT and EPIC.

We thank the reviewer for this insightful comment about how to best use ISOpure method and apologize for the confusion between PERT and ISOpure. We have now re-run this method based on the same subset of signature genes than what we had derived for EPIC. Of note, these signature genes (in addition to the reference profiles) are a new list of signature genes that we have derived in this manuscript for EPIC but they can be used in other contexts as well, like here for ISOpure.

We observe some improvement for ISOpure based on such subset of genes compared to previous whole-genome results (see e.g. the figures in Author response image 3 comparing the previous results and updated ones). For this reason, we included the revised gene-signature based results for ISOpure in our manuscript.

Author response image 3
Comparison of the prediction accuracies for EPIC, ISOpure based on all genes and ISOpure based on the subset of signature genes we derived for EPIC.

(A) For all immune cell types in the blood datasets (dataset 1: Zimmermann et al. 2016; dataset2: Hoek et al. 2015; dataset 3: Linsley et al. 2014). (B) and (C) in the tumor datasets, based on all cell types, including immune, stromal and cancer cells (B), or based only on all the immune cell types (C) (flow cytometry: our new experiment; single-cell RNA-seq: data from Tirosh et al. 2016). The stars above each bar indicate if the Pearson correlation was significant (* p < 0.05; ** p < 0.01; *** p < 0.001). These figures are the same than in our manuscript Figures 2B and 5A-B but comparing different ISOpure results and EPIC ones.

https://doi.org/10.7554/eLife.26476.033

We can note that despite this improvement for ISOpure, EPIC is still performing better. It is important to underline that the main goal of ISOpure is to estimate the proportion of the cancer cells present in the bulk samples and to estimate the gene expression profiles from the unknown cancer cells from these bulk samples, as we have also written in the manuscript. This can be a reason why ISOpure performs less good than EPIC to predict the proportion of all cell types present in the bulk as this is not the main focus of ISOpure contrary to EPIC.

Finally, please note that the reason why we had originally used ISOpure based on all genes is that the method description of ISOpure's manuscript (Quon et al. 2013) states that the tumor and healthy profiles are composed of the measured gene expression level of G transcripts, where "G is typically on the order of 10,000". It seemed to us that a whole genome data would better reflect these numbers than a subset of about hundred signature genes, and we had thought that ISOpure would therefore perform better based on all genes, which we have now seen was not really the case here.

https://doi.org/10.7554/eLife.26476.049

Article and author information

Author details

  1. Julien Racle

    1. Ludwig Centre for Cancer Research, Department of Fundamental Oncology, University of Lausanne, Epalinges, Switzerland
    2. Swiss Institute of Bioinformatics, Lausanne, Switzerland
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0100-0323
  2. Kaat de Jonge

    Department of Fundamental Oncology, Lausanne University Hospital (CHUV), Epalinges, Switzerland
    Contribution
    Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
  3. Petra Baumgaertner

    Department of Fundamental Oncology, Lausanne University Hospital (CHUV), Epalinges, Switzerland
    Contribution
    Resources, Supervision, Investigation, Methodology
    Competing interests
    No competing interests declared
  4. Daniel E Speiser

    Department of Fundamental Oncology, Lausanne University Hospital (CHUV), Epalinges, Switzerland
    Contribution
    Resources, Supervision, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
  5. David Gfeller

    1. Ludwig Centre for Cancer Research, Department of Fundamental Oncology, University of Lausanne, Epalinges, Switzerland
    2. Swiss Institute of Bioinformatics, Lausanne, Switzerland
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Validation, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    david.gfeller@unil.ch
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-3952-0930

Funding

Center for Advanced Modelling Science

  • Julien Racle
  • David Gfeller

Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (Project grant 31003A_173156)

  • Julien Racle
  • David Gfeller

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We are grateful to Hélène Maby-El Hajjami for compiling the clinical data. We thank Kristen L. Hoek, Andrew Link and their colleagues (Hoek et al., 2015), Cate Speak, Scott Presnell and their colleagues (Linsley et al., 2014), Aurélien De Reynies, Etienne Becht and their colleagues (Becht et al., 2016), for providing us with additional data relating to their published studies. Computations were performed at the Vital-IT (http://www.vital-it.ch) Center for high-performance computing of the Swiss Institute of Bioinformatics.

Ethics

Human subjects: Patients involved in this study agreed to donate metastatic tissues upon informed consent, based on dedicated clinical investigation protocols established according to the relevant regulatory standards. The protocols were approved by the local IRB, i.e. the Commission cantonale d'éthique de la recherche sur l'être humain du Canton de Vaud.

Reviewing Editor

  1. Alfonso Valencia, Barcelona Supercomputing Center - BSC, Spain

Version history

  1. Received: March 2, 2017
  2. Accepted: November 10, 2017
  3. Accepted Manuscript published: November 13, 2017 (version 1)
  4. Version of Record published: December 6, 2017 (version 2)

Copyright

© 2017, Racle et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 16,667
    Page views
  • 1,894
    Downloads
  • 602
    Citations

Article citation count generated by polling the highest count across the following sources: Scopus, Crossref, PubMed Central.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Julien Racle
  2. Kaat de Jonge
  3. Petra Baumgaertner
  4. Daniel E Speiser
  5. David Gfeller
(2017)
Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data
eLife 6:e26476.
https://doi.org/10.7554/eLife.26476

Further reading

    1. Cancer Biology
    Gehad Youssef, Luke Gammon ... Adrian Biddle
    Research Article

    Cancer stem cells (CSCs) undergo epithelial-mesenchymal transition (EMT) to drive metastatic dissemination in experimental cancer models. However, tumour cells undergoing EMT have not been observed disseminating into the tissue surrounding human tumour specimens, leaving the relevance to human cancer uncertain. We have previously identified both EpCAM and CD24 as CSC markers that, alongside the mesenchymal marker Vimentin, identify EMT CSCs in human oral cancer cell lines. This afforded the opportunity to investigate whether the combination of these three markers can identify disseminating EMT CSCs in actual human tumours. Examining disseminating tumour cells in over 12,000 imaging fields from 74 human oral tumours, we see a significant enrichment of EpCAM, CD24 and Vimentin co-stained cells disseminating beyond the tumour body in metastatic specimens. Through training an artificial neural network, these predict metastasis with high accuracy (cross-validated accuracy of 87-89%). In this study, we have observed single disseminating EMT CSCs in human oral cancer specimens, and these are highly predictive of metastatic disease.

    1. Cancer Biology
    2. Medicine
    Dingyu Rao, Hua Lu ... Defa Huang
    Research Article

    Esophageal cancer (EC) is a fatal digestive disease with a poor prognosis and frequent lymphatic metastases. Nevertheless, reliable biomarkers for EC diagnosis are currently unavailable. Accordingly, we have performed a comparative proteomics analysis on cancer and paracancer tissue-derived exosomes from eight pairs of EC patients using label-free quantification proteomics profiling and have analyzed the differentially expressed proteins through bioinformatics. Furthermore, nano-flow cytometry (NanoFCM) was used to validate the candidate proteins from plasma-derived exosomes in 122 EC patients. Of the 803 differentially expressed proteins discovered in cancer and paracancer tissue-derived exosomes, 686 were up-regulated and 117 were down-regulated. Intercellular adhesion molecule-1 (CD54) was identified as an up-regulated candidate for further investigation, and its high expression in cancer tissues of EC patients was validated using immunohistochemistry, real-time quantitative PCR (RT-qPCR), and western blot analyses. In addition, plasma-derived exosome NanoFCM data from 122 EC patients concurred with our proteomic analysis. The receiver operating characteristic (ROC) analysis demonstrated that the AUC, sensitivity, and specificity values for CD54 were 0.702, 66.13%, and 71.31%, respectively, for EC diagnosis. Small interference (si)RNA was employed to silence the CD54 gene in EC cells. A series of assays, including cell counting kit-8, adhesion, wound healing, and Matrigel invasion, were performed to investigate EC viability, adhesive, migratory, and invasive abilities, respectively. The results showed that CD54 promoted EC proliferation, migration, and invasion. Collectively, tissue-derived exosomal proteomics strongly demonstrates that CD54 is a promising biomarker for EC diagnosis and a key molecule for EC development.