Introduction

Tissue biopsies have traditionally been a definitive way to diagnose and stage cancer; however, tumors at many sites, such as the pancreas, lung and brain, are not easily accessible for biopsy1-5. In addition, the small amount of biopsied tissue does not represent the entire heterogeneous pathological profile of the tumor6. In recent years, liquid biopsy has emerged as a plausible diagnostic and monitoring approach that can detect tumor biomarkers in more accessible biological fluids such as plasma, serum and urine7. Detectable tumor biomarkers include circulating tumor DNA (ctDNA), circulating tumor cells (CTCs) and exosomes8.

Exosomes are extracellular vesicles of endosomal origin, approximately 40-180 nm in diameter, that have been shown to mediate intercellular communication in health and disease8-10. They can contain a variety of biomolecules including DNA, RNA, proteins, lipids, metabolites and other materials representative of the parent cell8. Exosomes are present at high concentrations in biological fluids, which is a potential advantage for their use as biomarkers11, 12. Exosomal mRNAs and miRNAs have been investigated intensively as diagnostic biomarkers, and mounting evidence suggests that exosomal proteins circulating in biological fluids could be used for cancer diagnosis and for monitoring cancer progression7, 13. Challenges that remain include standardization of methods for consistent exosome isolation from various tissues, identification of biomarkers that distinguish cancer and normal exosomes across different cancer types, and identification of biomarkers that are unique to specific biological fluids (e.g., plasma, serum and urine).

Beyond the difficulty of accurately identifying exosomal protein biomarkers, using them to diagnose cancer is challenging because of intratumor and interpatient heterogeneity14. Conventional diagnostic approaches predominantly rely on a single biomarker, which is often neither specific nor sensitive15, 16. However, recent advances in machine learning provide an opportunity to construct a classifier based on a panel of exosomal protein biomarkers that more comprehensively reflects the complex disease status of different patients and distinguishes cancer samples from normal samples with significantly improved sensitivity and specificity.

Results

Unbiased proteomics analysis of exosomes identifies 18 abundant plasma membrane protein markers for various human cell lines

To identify universal exosomal protein biomarkers for differentiating cancer from non-cancer exosomes, we analyzed protein abundance data from 228 cancer and 57 control cell line-derived exosomes, representing various cancer types (Figure 1A; Supplementary Table S1). Because the studies employed distinct isolation and mass spectrometry quantification techniques, the number of identified proteins differs among them. To overcome the bias caused by such technical factors, we examined the proteins common to all studies and identified 1124 overlapping proteins (Figure 2A). To determine the heterogeneity among cancer and control cell line-derived exosomes, we performed principal component analysis (PCA) using these 1124 proteins across the 285 cancer and control cell line-derived exosome samples (Figure 2B). The PCA indicated that the exosomes derived from cancer and control cell lines are heterogeneous and show significant variation in protein expression across cell lines.
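The restriction to proteins shared by every study can be sketched as a set intersection over per-study abundance tables. This is a minimal illustration only: the table layout (samples as rows, proteins as columns), the function names `common_proteins` and `merge_on_common`, and the toy values are all assumptions and are not taken from the original analysis.

```python
import pandas as pd


def common_proteins(studies):
    """Return the sorted protein IDs that appear in every study table.

    `studies` maps a (hypothetical) study name to a DataFrame with
    samples as rows and proteins as columns.
    """
    protein_sets = [set(df.columns) for df in studies.values()]
    return sorted(set.intersection(*protein_sets))


def merge_on_common(studies):
    """Stack all samples, keeping only the proteins shared by all studies."""
    shared = common_proteins(studies)
    return pd.concat([df[shared] for df in studies.values()], axis=0)


# Toy tables with made-up abundances, for illustration only.
studies = {
    "study_1": pd.DataFrame(
        {"CLTC": [5.1, 4.8], "EZR": [3.2, 3.0], "CD9": [1.0, 0.0]},
        index=["s1", "s2"],
    ),
    "study_2": pd.DataFrame(
        {"CLTC": [6.0], "EZR": [2.9], "MSN": [4.4]},
        index=["s3"],
    ),
}

shared = common_proteins(studies)
merged = merge_on_common(studies)
```

The merged table (one row per sample, one column per shared protein) is the kind of matrix that a PCA or a classifier would then take as input.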

Figure 1. Overview of the study

Figure 2. Proteomic characterization of exosomes derived from 285 cell lines from four studies

(A) Overlapped proteins from four different studies of cell line-derived exosomes. (B) PCA plot of cancer and control cell line-derived exosomes. (C) Positivity for 8 commonly used exosomal protein biomarkers in various cell lines. The percentage of samples expressing each protein is shown in the boxes. Darker red indicates a higher percentage. (D) Annotation of the proteins detected in more than 90% of all samples. (E) GO and KEGG pathway enrichment analysis of the proteins detected in more than 90% of all samples. (F) Plasma membrane proteins detected in more than 90% of all samples.

We next investigated the frequency of the proteins detected in the exosomes from all cell lines, from only the cancer cell lines and from only the control cell lines. Commonly used exosome biomarkers (e.g., CD9 and other tetraspanins) were examined first; of the 12 traditional exosome markers17-19, only eight were detected among the 1124 overlapping proteins from all cell lines. Six of the eight proteins were detected in at least 90% of all samples, with CD9 and HSPA4 detected least frequently (Figure 2C). In addition, the frequency of FLOT1, FLOT2 and TSG101 proteins was higher in cancer cell line-derived exosomes than in control cell line-derived exosomes (Figure 2C).

To identify biomarkers detectable at a high frequency in all cancer and control cell line-derived exosomes, we searched for the proteins that were detected in ≥ 90% of all samples (Supplementary Table S2). We annotated these proteins using IPA and found that 78.0% of them localize to the cytoplasm and 13.5% are associated with the plasma membrane (Figure 2D). Gene Ontology (GO) analysis revealed enrichment of proteins from pathways related to vesicle-mediated transport, secretory vesicles, exocytosis, endocytosis, and other exosome-related pathways (Figure 2E). To further explore the utility of the proposed biomarkers, we examined the proteins located on the plasma membrane that met the threshold of detection in ≥ 90% of all samples (Figure 2F). Clathrin Heavy Chain (CLTC) ranked as the top plasma membrane protein, detected in 99.6% of all samples and 100% of control samples (Figure 2F). In addition, the scaffolding protein Syntenin-1 (SDCBP) was detected at a high frequency, in 97.9% of all samples, corroborating previous findings20. Next, we sought to identify unique markers of cancer cell-derived exosomes (cancer exosomes) by selecting proteins detected in ≤ 10% of the 57 control cell line-derived exosomes. Interestingly, Ataxin 2 Like (ATXN2L), which has been reported to promote cancer cell invasiveness and resistance to chemotherapy, was uniquely detected in the cancer cell-derived exosomes (Figure 3A)21. In total, we identified a set of 18 exosome protein markers that are present at a high abundance in all the cancer exosomes examined (Figure 2F).
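The two frequency filters described above (detection in ≥ 90% of all samples, and detection in ≤ 10% of controls) can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes that a protein counts as "detected" when its abundance is non-missing and greater than zero, and the protein names, the `detection_frequency` helper and the toy values are hypothetical.

```python
import pandas as pd


def detection_frequency(abundance, mask=None):
    """Fraction of samples in which each protein is detected.

    Assumption: "detected" means a non-missing abundance greater than zero.
    `mask` is an optional boolean Series selecting a subset of samples.
    """
    detected = abundance.fillna(0) > 0
    if mask is not None:
        detected = detected.loc[mask]
    return detected.mean(axis=0)


# Toy matrix: 6 cancer samples followed by 4 control samples (made-up values).
abundance = pd.DataFrame(
    {
        "P_common": [3, 2, 4, 5, 3, 2, 4, 3, 2, 5],
        "P_cancer": [2, 3, 1, 4, 2, 3, 0, 0, 0, 0],
        "P_rare": [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    },
    index=[f"sample_{i}" for i in range(10)],
    dtype=float,
)
is_control = pd.Series([False] * 6 + [True] * 4, index=abundance.index)

freq_all = detection_frequency(abundance)
freq_control = detection_frequency(abundance, is_control)
freq_cancer = detection_frequency(abundance, ~is_control)

# Proteins detected in at least 90% of all samples.
universal = freq_all[freq_all >= 0.90].index.tolist()

# Proteins detected in at most 10% of controls and more often in cancer
# (the second condition is an added assumption to exclude never-detected proteins).
cancer_enriched = freq_control[
    (freq_control <= 0.10) & (freq_cancer > freq_control)
].index.tolist()
```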

Figure 3. Proteomic characterization of exosomes derived from cell lines and tissues

(A) Proteins detected at higher frequency in cancer cell line-derived exosomes. (B) Positivity for 11 commonly used exosomal protein biomarkers in various tissues. (C) Overlapping proteins (>90% frequency) between cell line- and tissue-derived exosomes. (D) Positivity of five plasma membrane proteins detected in more than 90% of both cell line- and tissue-derived exosomes.

Comparison of exosomal proteins derived from cell lines and tissues identified five universal plasma membrane protein markers

We next sought to investigate common biomarkers for tissue-derived exosomes across cancer types. We calculated the detection frequency of commonly used exosome markers in the 157 tissue samples (101 cancers; 56 controls). Two established exosome markers, CD63 and TSG101, were detected in only 33.1% and 45.9% of all samples, respectively (Figure 3B). To identify high-frequency biomarkers for exosomes derived from both cell lines and tissues, we examined the proteins that met the detection threshold of ≥ 90% of all samples in both cell line- and tissue-derived exosomes and found 31 common proteins (Figure 3C). Among these 31 proteins, five are plasma membrane proteins (Figure 3D): Clathrin Heavy Chain (CLTC), Ezrin (EZR), Talin-1 (TLN1), Adenylyl cyclase-associated protein 1 (CAP1) and Moesin (MSN).

An exosome proteome signature of 18 proteins can differentiate cancer exosomes from non-cancer exosomes across multiple cancer types

Plasma or serum is the most readily accessible source for non-invasive biopsies. We next sought to determine whether exosome proteins in plasma and serum could differentiate cancer exosomes from non-cancer exosomes across multiple cancer types. We pooled exosome proteomics data derived from plasma or serum for 205 cancer and 51 control samples from five different studies22-26, which included breast cancer, colorectal cancer, glioblastoma, lung carcinoma, liver cancer, neuroblastoma, and pancreatic cancer (Supplementary Table S1). To account for differences in the methodologies of these studies, we first identified the 46 proteins that were detected in all the studies (Figure 4A). We then examined their abundances in the 205 cancer and 51 control samples (Figure 4B). Although we could detect differences between exosomes from cancer and control samples, it was difficult to classify them robustly based on PCA (Supplementary Figure S1A). Therefore, we employed a machine learning algorithm to differentiate cancer exosomes from non-cancer exosomes. We first calculated the mutual information (MI) score for each protein and trained random forest classifiers on different numbers of top-ranked proteins to determine the best set of proteins to include in the classifier. We found that the model performed best using 18 proteins, where performance was evaluated by the area under the receiver operating characteristic curve (AUROC) (Figure 4C). Several key cancer-associated proteins were included among these 18 proteins. For example, apolipoprotein C1 (APOC1), which was ranked as the top protein, is significantly decreased in samples from cancer patients (Supplementary Figure S1B) and has been previously reported to be down-regulated in non-small cell lung cancer, colorectal cancer, papillary thyroid carcinoma and pediatric nephroblastoma27-30.
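The feature-count search described above (rank proteins by mutual information, then score random forests trained on the top k) can be sketched with scikit-learn. This is an illustrative sketch under stated assumptions, not the authors' implementation: the data are synthetic (generated to mimic 256 samples and 46 proteins), and the function name `sweep_top_k`, the candidate k values, the number of trees and the random seeds are all hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score


def sweep_top_k(X, y, k_values, seed=0):
    """Rank features by mutual information with the label, then score a
    random forest (mean 5-fold AUROC) on the top-k features for each k."""
    mi = mutual_info_classif(X, y, random_state=seed)
    order = np.argsort(mi)[::-1]  # feature indices, most informative first
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = {}
    for k in k_values:
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        fold_scores = cross_val_score(
            clf, X[:, order[:k]], y, cv=cv, scoring="roc_auc"
        )
        scores[k] = float(fold_scores.mean())
    return order, scores


# Synthetic stand-in for a 256-sample x 46-protein matrix (not real data).
X, y = make_classification(
    n_samples=256, n_features=46, n_informative=8,
    weights=[0.2, 0.8], random_state=0,
)
order, scores = sweep_top_k(X, y, k_values=[5, 10, 18, 30, 46])
best_k = max(scores, key=scores.get)
```

One design point worth noting: this sketch ranks features on the full data set before cross-validation, which is simple but can leak label information into the validation folds. Ranking inside each training fold avoids that; the source does not state which variant was used.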

Figure 4. Identification of the signature proteins of plasma or serum-derived exosomes and the evaluation of random forest classifier

(A) Overlapping exosome proteins detected in the plasma and serum of 205 cancer and 51 control samples from five studies. (B) Heat map of 46 overlapping exosome proteins in cancer and control plasma or serum samples. (C) AUROC score of the random forest classifier on including various numbers of protein features. (D) AUROC of different models in comparison. (E) Classification error matrix of the 75% training set using a random forest classifier for the 18 selected proteins. The number of samples is indicated in each box. (F) AUROC score of the random forest classifier trained using 75% of the dataset. Other metrics are indicated on the right. (G) Classification error matrix of the 25% testing set using a random forest classifier for the 18 selected proteins. The number of samples is indicated in each box. (H) AUROC score of the random forest classifier tested using 25% of the dataset. Other metrics are indicated on the right.

Employing all 18 proteins with 5-fold cross-validation, we constructed a random forest classifier31 to distinguish cancer from control samples and compared it with several popular machine learning models, including Support Vector Machine (SVM)32, K Nearest Neighbor Classifier (K-NN)33 and Gaussian Naive Bayes34. Our random forest classifier demonstrated the highest AUROC (Figure 4D). More specifically, on the training set it yielded an AUROC of 0.96, with an accuracy of 0.92, a precision of 0.94, and a recall of 0.96 (Figure 4E, F). When applied to the independent test set, the model yielded an AUROC of 0.99, an accuracy of 0.95, a precision of 0.96 and a recall of 0.98 (Figure 4G, H). Importantly, only one sample was misclassified in the independent test set, and 51 cancer samples were correctly classified. Taken together, our results show the advantage and clinical potential of applying the random forest classifier to plasma or serum exosome protein-based liquid biopsy for cancer diagnosis.
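A model comparison of this kind, followed by evaluation on a held-out 25% split, might look like the sketch below. The data are synthetic, all hyperparameters are scikit-learn defaults or arbitrary choices, and the standard scaling applied before the SVM and K-NN models is an assumption of this sketch; the source does not describe preprocessing for those models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, train_test_split,
)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for 256 samples x 18 selected proteins (not real data).
X, y = make_classification(
    n_samples=256, n_features=18, n_informative=8,
    weights=[0.2, 0.8], random_state=0,
)
# Stratified 75% / 25% split, so both sets keep the cancer/control ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "gaussian_nb": GaussianNB(),
}

# Mean 5-fold cross-validated AUROC on the training portion for each model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auroc = {
    name: float(
        cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc").mean()
    )
    for name, model in models.items()
}

# Refit the random forest on all training data and score the held-out set.
rf = models["random_forest"].fit(X_train, y_train)
test_proba = rf.predict_proba(X_test)[:, 1]
test_pred = rf.predict(X_test)
test_metrics = {
    "auroc": roc_auc_score(y_test, test_proba),
    "accuracy": accuracy_score(y_test, test_pred),
    "precision": precision_score(y_test, test_pred),
    "recall": recall_score(y_test, test_pred),
}
```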

Five plasma/serum exosomal proteins can reliably differentiate five common cancer types

We next sought to further enhance the clinical utility of the exosomes by differentiating between cancer types. We analyzed proteomics data from plasma or serum-derived exosomes from patients with five common cancer types: breast cancer, colorectal cancer, glioma, lung cancer and pancreatic cancer. Initial PCA revealed differences in exosome levels among cancer patients but failed to distinguish the five cancer types (Figure 5A). We determined the crucial features for cancer-type classification by computing mutual information scores for the 46 common proteins and built random forest classifiers to determine the optimal number of features to include in the final classifier. We ultimately selected a set of five proteins based on AUROC scores (Figure 5B). To minimize overfitting, we then enlarged the independent test set to 40% of the total samples and used the remaining 60% as the training set. When the random forest classifier was trained on the five proteins with 5-fold cross-validation, the model achieved a very high accuracy of 0.99 (Figure 5C), and when applied to the independent test set, it yielded a similarly high accuracy of 0.94 (Figure 5D). The abundance of the five protein features varied across the five cancer types, reflecting the potential roles of these proteins in specific cancers (Figure 5E). For example, Histidine-rich glycoprotein (HRG), which was abundant in the colorectal cancer plasma-derived exosomes, has been reported to promote tumor migration in colorectal cancer patients35. Overall, our results demonstrated that this exosome protein-based classification model can reliably differentiate between cancer types and further enhance the diagnostic value of our approach.
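The cancer-type classification step (a multiclass random forest with a stratified 60% / 40% split and a confusion matrix on the held-out portion) can be sketched as below. The data are synthetic and only mimic the shape of the problem (five classes, five features); the sample count, seeds and number of trees are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

CANCER_TYPES = ["breast", "colorectal", "glioma", "lung", "pancreatic"]

# Synthetic stand-in: 200 samples, 5 protein features, 5 classes (not real data).
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=5, n_redundant=0,
    n_classes=len(CANCER_TYPES), n_clusters_per_class=1, random_state=0,
)

# Stratified 60% training / 40% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes, in CANCER_TYPES order.
cm = confusion_matrix(y_test, y_pred, labels=list(range(len(CANCER_TYPES))))
test_accuracy = accuracy_score(y_test, y_pred)
```

The diagonal of `cm` counts correctly classified samples per cancer type, which corresponds to the kind of classification error matrix shown for the training and testing sets.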

Figure 5. Identification of signature proteins expressed by plasma or serum-derived exosomes for classifying five common cancer types and evaluation of random forest classifier

(A) PCA plot of plasma or serum-derived exosomes from five cancer types. (B) AUROC score of the random forest classifier on including various numbers of protein features. (C-D) Classification error matrix of a 60% training set and 40% testing set to classify the five cancer types using a random forest classifier for the 5 selected proteins. The number of samples is indicated in each box. (E) Protein abundance of five selected protein features in five cancer types.

A urinary exosome proteome signature consisting of 17 proteins detects cancer exosomes across multiple cancer types

Urine is emerging as a superior non-invasive marker for urologic cancers, as its composition directly reflects the physiological changes in the urogenital system36. To test the use of urinary exosome proteins in cancer diagnosis, we collected data from 261 cancer patient samples and 124 control samples from four studies, covering bladder cancer, prostate cancer, renal cancer, lung cancer, cervical cancer, colorectal cancer, esophageal cancer and gastric cancer (Supplementary Table S1)36-39. Examination of the proteins common across all four studies identified 229 proteins (Figure 6A). PCA revealed the variance between samples but failed to differentiate between cancer and control samples (Figure 6B). As described earlier, we next employed the random forest classifier method to differentiate cancer and control samples based on their exosomal proteomic profiles. To reduce the feature space and select the most relevant features, we used the mutual information score to rank the 229 protein features and then trained the random forest classifier using varying numbers of the top-ranking proteins (Figure 6C). Based on the AUROC scores obtained with different numbers of features, we selected the 17 features that resulted in the highest AUROC score (Figure 6C). A majority of these 17 proteins displayed significant variations in abundance between cancer and control samples (Figure 6D). When a random forest classifier was trained with the 17 protein features and 5-fold cross-validation, the model achieved an AUROC of 0.96, an accuracy of 0.90, a precision of 0.92, and a recall of 0.93 (Figure 6E, F). When tested on an independent set, the model produced an AUROC of 0.91, an accuracy of 0.82, a precision of 0.83, and a recall of 0.92 (Figure 6G, H). To summarize, our findings indicated the promising clinical potential of using urinary exosome proteins for the diagnosis of urologic cancers as well as other non-urologic cancer types.

Figure 6. Identification of signature proteins expressed by urine-derived exosomes and evaluation of random forest classifier

(A) Overlapping exosome proteins detected in the urine from 261 cancer and 124 control samples from four studies. (B) PCA plot of cancer and control urine-derived exosomes. (C) AUROC score of the random forest classifier on including various numbers of protein features. (D) Protein abundance of 17 selected protein features in cancer- and control urine-derived exosomes. (E) Classification error matrix of the 75% training set using a random forest classifier for the 17 selected proteins. The number of samples is indicated in each box. (F) AUROC score of the random forest classifier trained using 75% of the dataset. Other metrics are indicated on the right. (G) Classification error matrix of the 25% testing set using a random forest classifier for the 17 selected proteins. The number of samples is indicated in each box. (H) AUROC score of the random forest classifier tested using 25% of the dataset. Other metrics are indicated on the right.

Discussion

Liquid biopsy has numerous benefits in the early detection of cancer, categorizing cancer types, tracking cancer progression, and monitoring response to treatment40. Exosomes found in biological fluids can provide a forensic view of their cells of origin. Despite previous studies proposing common protein biomarkers to identify exosomes, a comprehensive set of exosome biomarkers derived from different biological materials has not been established, owing to limitations in isolation and quantification methods41-44. Additionally, a reliable diagnostic tool based on proteins associated with exosomes that can be applied across all cancers is yet to be identified.

Here, we generate a comprehensive proteomics profile of exosomes derived from cell lines, tissues, plasma, serum, and urine from 1083 cancer and control samples. An extensive analysis of these samples showed that several widely used exosome markers, such as CD63, CD81, HSP70, and HSPA8, are absent in exosomes derived from a subset of cell lines. Further, FLOT1, FLOT2, TSG101, and CD63 are detected at low frequency (< 60%) in exosomes derived from tissues, indicating a need for the identification of additional universal markers for both cell line- and tissue-derived exosomes. Our study identifies five highly abundant universal exosome biomarkers (CLTC, EZR, TLN1, CAP1, and MSN) that were present in over 90% of all cell line- and tissue-derived samples. Additionally, we found that ATXN2L was only present in cancer exosomes and absent in non-cancer exosomes. In this regard, GPC1, which has been identified in many studies as a cancer exosome-specific marker, was not identified here because many of the datasets did not detect this protein.

Here we describe a novel computational approach using the random forest classifier method to define exosome protein panels that serve as effective biomarkers specifically for plasma, serum or urine across cancer types. By training the random forest model and testing with independent datasets, our model yields excellent scores in AUROC, sensitivity, and specificity and differentiates cancer exosomes from non-cancer exosomes. We show that this approach can also be used to classify, with high accuracy, five common cancer types based on their exosome protein signatures.

A majority of the protein markers identified in this study have demonstrable biological relevance in cancer. As an example, ITIH3, which was identified among the protein features for the plasma or serum-based classifier and was highly abundant in cancer samples, has been reported to be more highly expressed in the plasma of gastric cancer patients than in controls45 and to increase with tumor stage in clear cell renal cell carcinoma patients46. Importantly, the biomarker panels identified for cell lines, tissues, plasma/serum, and urine overcome the bias associated with the isolation and quantification methods as well as the inter-patient variability inherent to the complex process of cancer. These panels are designed to be applied to a variety of cancers.

Collectively, our results demonstrate that exosome protein features can be utilized as reliable biomarkers for the early detection of cancer, classification of cancer types, and potentially for diagnosing tumors of undetermined origin. These results have the potential to advance the development and standardization of optimized methods for exosome isolation and the implementation of routine plasma-, serum- and urine-based exosome screening in clinical settings.

Methods

Public exosome proteomics data

We collected publicly available exosome protein data from cell lines, plasma, serum, and urine from previous studies, as summarized in Supplementary Table S1 (refs. 20, 22-26, 36-39, 43, 47). We obtained the raw spectral count and intensity data from the original papers or directly from the authors through communication. The obtained data were log-normalized accordingly.
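The text does not specify the exact log transform, so the following is only one plausible sketch: a base-2 logarithm with a pseudocount of 1, with missing values treated as zero. The base, the pseudocount, the handling of missing values and the function name `log_normalize` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def log_normalize(raw, pseudocount=1.0):
    """Apply log2(x + pseudocount) to a samples x proteins table.

    Assumptions: missing values are treated as zero (not detected), and the
    pseudocount keeps zero counts finite after the transform.
    """
    return np.log2(raw.fillna(0) + pseudocount)


# Toy spectral counts (made-up values).
raw = pd.DataFrame(
    {"CLTC": [0.0, 1.0, 3.0], "EZR": [7.0, np.nan, 15.0]},
    index=["s1", "s2", "s3"],
)
normalized = log_normalize(raw)
```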

Feature selection and machine learning algorithm

We employed mutual information scores to evaluate the importance of protein features for each prediction; mutual information quantifies the reduction in uncertainty about one variable given knowledge of another. We calculated the mutual information score of each protein with the target label (cancer/control) using SelectKBest from the scikit-learn 1.1.1 Python library. According to the sample size and characteristics of the data collected for plasma- and urine-derived exosomes, we selected a customized number of best features for each prediction. To select the optimum number of proteins to include in the prediction, we ranked the protein features by mutual information score and built random forest classifiers to evaluate performance over a range of feature counts. The random forest model can decrease the probability of over-fitting and enhance resilience to outliers and noise in the input data. The area under the receiver operating characteristic curve (AUROC) was employed to evaluate the performance of the classifier. We also calculated accuracy, precision and recall for comprehensive evaluation. All models were evaluated using 5-fold cross-validation with stratified train-test splits that preserved the percentage of samples for the prediction target. We also tested the performance of alternative machine learning algorithms, including the support vector classifier, K nearest neighbor classifier and Gaussian naive Bayes. Overall, the random forest classifier achieved the best performance in our analysis (Figure 4D). To visualize high-dimensional datasets, kernel principal component analysis (PCA) from the scikit-learn 1.1.1 Python library was employed, and plots were generated using the scatterplot function from the seaborn 0.11.2 Python library.
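The visualization step can be sketched as a two-component kernel PCA projection returned as a table ready for plotting. The kernel is not stated in the text, so the RBF kernel used here is an assumption, as are the function name `project_2d`, the column names and the random toy data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import KernelPCA


def project_2d(X, labels, kernel="rbf", seed=0):
    """Project samples onto the first two kernel principal components.

    Assumption: an RBF kernel with scikit-learn defaults; the source does
    not state which kernel or parameters were used.
    """
    kpca = KernelPCA(n_components=2, kernel=kernel, random_state=seed)
    coords = kpca.fit_transform(X)
    return pd.DataFrame(
        {"PC1": coords[:, 0], "PC2": coords[:, 1], "group": labels}
    )


# Random toy matrix: 30 samples x 10 proteins (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
labels = ["cancer"] * 20 + ["control"] * 10
embedding = project_2d(X, labels)
```

The resulting table could then be passed to seaborn, for example `seaborn.scatterplot(data=embedding, x="PC1", y="PC2", hue="group")`, to draw the scatter plot colored by group.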

Gene ontology and pathway enrichment analysis

The WebGestalt 2019 online tool48 was used to perform the gene ontology and pathway enrichment analysis of the selected proteins. Pathways with FDR < 0.05 were considered significant.

Statistical analysis

All statistical analyses were conducted using R 4.2.1 software. Significance was determined by the Wilcoxon rank-sum test unless specified otherwise. Significance was concluded if the p-value was <0.05, while in pathway analysis, significance was concluded if the FDR was <0.05 after correction for multiple comparisons.
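For readers working in Python rather than R, the two-sample Wilcoxon rank-sum test is equivalent to the Mann-Whitney U test, which SciPy provides as `scipy.stats.mannwhitneyu`. The sketch below is a swapped-in Python illustration, not the R code used in the study; the helper name `compare_groups` and the abundance values are hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def compare_groups(cancer, control, alpha=0.05):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test between two groups."""
    statistic, p_value = mannwhitneyu(cancer, control, alternative="two-sided")
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),
    }


# Made-up log abundances of one protein in cancer and control samples.
cancer = np.array([1.2, 0.8, 1.0, 0.5, 0.9, 0.7, 1.1, 0.6])
control = np.array([2.4, 2.9, 3.1, 2.2, 2.7, 3.3, 2.5, 2.8])
result = compare_groups(cancer, control)
```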

Disclosure of conflict of interest

UT MD Anderson Cancer Center and R.K. hold patents in the area of exosome biology licensed to Codiak Biosciences, Inc. UT MD Anderson Cancer Center and R.K. are stock equity holders in Codiak Biosciences, Inc. R.K. is a consultant and scientific adviser for Codiak Biosciences, Inc. The remaining authors declare no competing interests.

Acknowledgements

The EV work in the Kalluri lab is supported by MD Anderson Cancer Center, NIH R35CA263815, and NIH P40OD024628 and gifts from Fifth Generation (Love, Tito’s), Lyda Hill Philanthropies, and Bosarge Family Trust.

Data availability

The public datasets used in this study are described in the Methods section.

Contributions

B.L. and R.K. conceived and designed the study. B.L. developed the computational algorithm and performed the analysis with supervision from R.K. F.G.K. generated the proteomics data of the cell lines. B.L. and R.K. wrote the manuscript. All the authors read the manuscript and discussed the results.

Supplementary Figure S1. Machine learning models for plasma-derived exosomes (A) PCA plot of cancer and control plasma or serum-derived exosomes. (B) Protein abundance of 18 selected protein features in cancer- and control plasma or serum-derived exosomes.