A survey of optimal strategy for signature-based drug repositioning and an application to liver cancer

Abstract
Editor's evaluation
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

Pharmacologic perturbation projects, such as Connectivity Map (CMap) and Library of Integrated Network-based Cellular Signatures (LINCS), have produced many perturbed expression data, providing enormous opportunities for computational therapeutic discovery. However, there is no consensus on which methodologies and parameters are the most optimal to conduct such analysis. Aiming to fill this gap, new benchmarking standards were developed to quantitatively evaluate drug retrieval performance. Investigations of potential factors influencing drug retrieval were conducted based on these standards. As a result, we determined an optimal approach for LINCS data-based therapeutic discovery. With this approach, homoharringtonine (HHT) was identified to be a candidate agent with potential therapeutic and preventive effects on liver cancer. The antitumor and antifibrotic activity of HHT was validated experimentally using subcutaneous xenograft tumor model and carbon tetrachloride (CCL₄)-induced liver fibrosis model, demonstrating the reliability of the prediction results. In summary, our findings will not only impact the future applications of LINCS data but also offer new opportunities for therapeutic intervention of liver cancer.

Editor's evaluation

This paper describes a new method and experimental manipulations to identify homoharringtonine as a new potential therapy for liver cancer and the underlying liver disease.

https://doi.org/10.7554/eLife.71880.sa0

Introduction

Despite the major advances in drug research and development (R&D), the cost for de novo drug development remains high, ranging from $3 billion to more than $30 billion. Moreover, it usually takes over 10 years to bring a new drug from bench to bedside, reflecting the complex challenges in this area (Scannell et al., 2012). Within this context, exploring new indications for existing drugs (drug-centric) or identifying effective drugs for certain diseases (disease-centric) represents an appealing concept, namely ‘drug repositioning’ (or ‘drug repurposing’), which can greatly shorten the gap between preclinical drug research and clinical applications (Ashburn and Thor, 2004; Liu et al., 2013). Leveraging big data-driven approaches, drug repositioning can be conducted computationally, which has the potential to complement traditional therapeutic discovery means and further improve the cost-effectiveness of drug development (Li et al., 2016). The most notable data resources supporting the in silico-based therapeutic discovery campaigns would be the Connectivity Map (CMap) (Lamb et al., 2006) and its recent extension called Library of Integrated Network-Based Cellular Signatures (LINCS) (Subramanian et al., 2017). These two projects have generated large-scale drug-induced gene expression profiles on multiple cancer cell lines under different treatment conditions (CMap Build 2: 3 cell lines, 1309 compounds; LINCS: 77 cell lines, 19,811 compounds), representing a treasure trove for in silico therapeutic exploration (Musa et al., 2018). As a 1000-fold scale-up of the original CMap, LINCS contained dramatic increases in both cell line types and perturbations, making it the focus of the present investigation.

The computational drug discovery approach using LINCS (also CMap) data is based upon a basic concept called ‘signature reversion’ (Li et al., 2016). Briefly, compounds with the ability to reverse disease-specific gene expression pattern are considered therapeutic candidates (Figure 1). To date, although there have been many successful applications, many problems with this approach remain unsolved (Chen et al., 2017a; Chen et al., 2017b; van Noort et al., 2014). Due to the lack of appropriate benchmarking standards, limited studies have investigated the factors influencing the accuracy of this approach. Therefore, no consensus regarding the implementation details has been reached across current studies. Constructing rational benchmarking standards and developing the best practice approach can facilitate the development of signature reversion approach and help to identify more effective therapeutic strategies for refractory diseases.

Figure 1 with 2 supplements see all

Download asset Open asset

Overview of LINCS data-driven therapeutic discovery.

The working principle of ‘signature reversion’-based computational approach. A disease signature representing discordant expression pattern needs first to be identified (G1, G2, and G3 stand for upregulated genes while G4, G5, and G6 stand for down-regulated genes in disease state). With this signature, pharmacologic perturbation data sets can be queried to find compounds with the ability to reverse disease expression pattern (suppress expression of G1, G2, and G3 and induce expression of G4, G5, and G6). After determining the candidate compounds, experimental and clinical validation are required to translate computational findings to clinical applications. LINCS, Library of Integrated Network-based Cellular Signatures.

Herein, we mainly focused on the disease of liver cancer. As one of the most lethal malignancies worldwide, liver cancer directly accounts for nearly one million deaths each year (Bray et al., 2018). Hepatocellular carcinoma (HCC) is the major type of liver cancer, representing approximately 90% of all liver cancer cases (Llovet et al., 2016). Although many standard of care therapies, including Lenvatinib (Kudo et al., 2018), regorafenib (Bruix et al., 2017), cabozantinib (Abou-Alfa et al., 2018), ramucirumab (Zhu et al., 2019), pembrolizumab (Finn et al., 2020b), nivolumab (El-Khoueiry et al., 2017), and atezolizumab-bevacizumab (Finn et al., 2020a), have been approved for treating HCC in recent years, most of them can yield only marginal survival benefit. Thus, more effective therapeutics treatments for HCC are highly desired. The objectives of the present study were threefold. The first objective was to develop novel benchmarking standards for evaluating drug retrieval performance. The second one was to determine the best practice approach for LINCS data-based signature reversion. For the last objective, we sought to identify novel drug candidates against liver cancer, exploiting the findings from the second objective.

Results

Summary of influencing factors and compound experiments in LINCS

Many factors may affect the accuracy of signature-based drug retrieval. We have categorized these factors into three main aspects: acquisition of compound signature (reference signature), generation of disease signature (query signature), and selection of disease-compound matching methods (Figure 1, Figure 1—figure supplement 1). Although all factors were mentioned and discussed, not all of them were included in the present analyses, considering that some factors have been covered elsewhere and some were challenging to explore due to data and method restrictions. In this study, systematic analyses were carried out to assess the influences of four major factors on signature matching-based drug discovery, including source of cell line, clinical phenotype of query signature, query signature size, and signature matching method.

Since only compound-induced expression data was the focus of this study, we first excluded experiments of other perturbagens, including gene knockdown (or knockout) and gene overexpression manipulations. Subsequently, the distribution of compound profiles was visualized based on their perturbation time, perturbation dose, and cell line used. Most of the measurements were made in the treatment durations of 6 hr (43%) or 24 hr (56.6%), and under the concentrations of 5 μM (21%) or 10 μM (63%) (Figure 1—figure supplement 2A). The count distribution of all cell lines in LINCS was also presented in Figure 1—figure supplement 2B. Although 71 cell lines were included in LINCS project in total, not all of them were extensively profiled, and only 9 cell lines contained more than 5000 profiles, which, however, account for 77.8% of all compound profiles. There were 2912 compounds shared by these nine cell lines. We further integrated annotation of the most profiled cell lines with treatment duration and concentration information, and illustrated the specific profile numbers of each cell line under the conditions of certain time and dose (Figure 1—figure supplement 2C). Unless otherwise indicated, all the following analyses were performed on a fixed perturbation condition of 10 μM for 6 hr. Besides, compound profiles of all top nine cell lines were only utilized when investigating the factor ‘Source of cell line.’ In other cases, we focused exclusively on the cell line of HepG2, as our main point was to uncover novel therapeutics for liver cancer in this study. A systematic summary of included data sets for analyses was presented in Supplementary file 1A and B.

Compound-induced expression changes are highly cell line-specific

Some previous studies utilized compound profiles from cell lines irrelevant to the disease they are studied for signature reversion prediction. To investigate whether this was a reasonable practice, we conducted following analyses based on LINCS data of the nine most profiled cell lines. First, we visualized the compound profiles in a cosine distance-based two-dimensional t-distributed stochastic neighbor embedding (t-SNE) plot that represented the overall compound perturbation space wherein each dot was equivalent to a unique perturbation and each cell line was color-coded (Figure 2A). As shown in the figure, most dots with the same color clustered together, indicating that most of compound-induced gene expression changes tended to be cell-type specific. Intriguingly, dots with different colors in the white region seemed to mix together, suggesting that some compounds might induce similar gene expression changes across cell lines. To figure out which compounds were likely to cause cell-type specific gene expression changes and which tended to induce universal changes independent of cell lines, we calculated the pairwise cosine similarities (L1) among the profiles from the same compounds measured in different cell lines (Figure 2B). The cosine similarity measures range from –1 to 1, where higher values indicate increased similarity. The similarity scores (compound-level, L2) of the 2912 unique compounds were determined by calculating the median pairwise cosine similarity values (L1) across the nine cell lines (Supplementary file 2). As a result, a high degree of cell-specificity was observed for most compounds, with a median L2 similarity score of 0.078 (Figure 2C). Furthermore, we retrieved the mechanism of action (MOA) information and mapped them to the compounds to determine the MOA-level similarity scores (L3). L3 similarity scores were calculated based on the median values of L2 similarity scores of compounds within the same MOA. Results suggested that inhibitors targeting core cellular processes (e.g., cell cycle, RNA transcription, and protein synthesis) tended to induce similar changes across all cell lines, generally in agreement with previous findings (Figure 2D, Supplementary file 2; Niepel et al., 2017; Subramanian et al., 2017; Wang et al., 2018). We then marked the dots representing the compounds of the top five MOAs in the t-SNE plot. As expected, most of marked dots fell in clusters within the nonspecific region (Figure 2E).

Figure 2 with 1 supplement see all

Download asset Open asset

Highly cell-type specific compound-induced expression changes.

(A) Two-dimensional t-SNE projection based on cosine distance between compound signatures. Each dot represents a unique perturbation-induced expression profile, and each color represents one type of cell line. Drug perturbation data was obtained from GSE92742 and GSE70138. (B) Schematic diagram displaying the calculation process of compound-level (L2) and MOA-level (L3) similarity scores. (C) Distribution of compound-level (L2) cosine similarity scores, which range from –1 (completely opposite pattern) to 1 (perfectly identical pattern). Three examples are presented (left to right: etodolac, geldanamycin, and doxorubicin). (D) Illustration of MOA-level (L3) similarities. Only MOAs with more than five compounds included are shown in the figure. (E) A t-SNE projection showing the distribution of compounds (indicated by purple dots) in top ranked five MOAs (including HDAC inhibitors, IKK inhibitors, mTOR inhibitors, CDK inhibitors, and topoisomerase inhibitors). (F) Schematic diagram displaying the calculation process of cell line pair-level (L4) similarity scores. (G) Correlation between basal expression similarities and perturbed expression similarities (L4) of 36 cell line pairs (nine cell lines in total). Statistical significance and correlation coefficient were determined by ranked-based Spearman correlation. (H) Schematic view of the calculation of cell line-level (L5) similarity scores (upper) and the presentation of L5 similarity scores of nine cell lines in the boxplot (lower). Data are presented as median±quartiles. MOA, mechanism of action; t-SNE, t-distributed stochastic neighbor embedding.

Apart from investigating the similarity of perturbed expression profiles at compound level, we further sought to further investigate the cell line pair/cell line-level similarity. Nine cell lines contributed a total of 36 unique cell line pairs. The cell line pair-level perturbed expression similarities (L4) were determined through calculating the median value of similarity scores of all compound pairs between two cell lines, and the corresponding basal expression similarities were computed using Spearman ranked correlation on expression data from CCLE project (Figure 2F). The result showed that there was a significant, albeit not very remarkable, association between the perturbed expression similarities (cell line pair-level, L4) and basal expression similarities (ρ=0.344; p=0.040), suggesting that cell lines with similar molecular features were more likely to have consistent gene expression changes upon perturbation (Figure 2G). Similarities within the nine cell lines were also explored (cell line-level, L5). Among the nine cell lines we tested, PC3 cell line showed the highest L5 similarity score (median value=0.122) (Figure 2H). Notably, the cosine similarity of 0.122 still denoted a weak relationship, which further supported the conclusion that compound-induced gene expression changes were highly cell line-specific.

Among the nine most profiled cell lines, HepG2 was the only one derived from liver. To investigate whether HepG2 was an appropriate cell line model for computational therapeutics discovery for liver cancer or other liver-associated diseases, we calculated the expression correlation between HepG2 and other cell lines (921 CCLE cell lines) or tissues (17,382 normal tissues from GTEx and 9701 tumor tissues from TCGA PanCancer). Compared to other tissue-derived cancer cell lines or normal/tumor tissues, HepG2 exhibited a significantly higher expression correlation with liver cancer cell lines (median correlation coefficient=0.729), normal liver tissues (median correlation coefficient=0.616), and liver cancer tissues (median correlation coefficient=0.631) (Figure 2—figure supplement 1). Collectively, we supposed that the use of LINCS-derived HepG2 data was preferable to be limited within liver diseases.

Developing benchmarking standards for evaluating drug retrieval performance

Owing to the lack of benchmarking standards, accurate assessment of retrieval performance of signature matching methods remains challenging. Inspired by previous findings (Chen et al., 2017a; Chen et al., 2017b; Cheng et al., 2014; Wagner et al., 2015), we proposed two novel benchmarking standards, namely area under the curve (AUC)-based standard and Kolmogorov-Smirnov (KS) statistic-based standard. They were built upon different notions and thus were independent of each other, which helped to avoid potential bias introduced by single standard. The corresponding benchmarking data sets were developed mainly based on preclinical/clinical data of liver cancer. Detailed processes of data collection and metrics calculation were described in Materials and methods and visualized in Figure 3A.

Figure 3

Download asset Open asset

Establishment of novel benchmarking standards.

(A) Flow chart of the data collection and hypothesis validation for the AUC-based (left) and KS statistic-based (right) benchmarking standards. (B) Correlation between drug efficacy (AUDRC values) and reversal potency (KS-based similarity scores). Two previously published query signatures, including Sig_gastro (left) and Sig_NC (right), were utilized to calculate similarity scores. Drug response data was achieved from CTRP data set. Note that lower similarity scores indicate higher reversal potency and lower AUDRC values imply greater drug sensitivity. Color toward gray indicates no statistical significance determined by KS test. (C) Reversal potency of HCC agents demonstrated by enrichment analysis. Sig_gastro (upper) and Sig_NC (lower) were used to compute similarity scores. AUC, area under the curve; HCC, hepatocellular carcinoma; KS, Kolmogorov-Smirnov.

The development of AUC-based standard was based on the finding that there existed correlation between the reversal potency and treatment efficacy (Chen et al., 2017a; Wagner et al., 2015). In order to further validate whether this correlation remained significant in other conditions, we retrieved drug response data from CTRP data set in which area under the dose-response curve (AUDRC) values were used as measurements of drug sensitivity, and utilized two different HCC signatures as query signatures to obtain KS-based similarity scores (Chen et al., 2017a; Chen et al., 2017b). A total of 109 compounds shared by two data sets were selected to conduct correlation analyses. As a result, statistically significant correlation could still be observed between similarity scores and AUDRC values in these scenarios, further proving the reliability of this standard (Figure 3B). A benchmark data set was then generated, composed of 117 unique compounds with both LINCS and drug efficacy (IC₅₀) data available, which was taken as a basis for the application of AUC-based standard (Supplementary file 3). The resultant AUC from this standard was termed as drug retrieval-associated AUC (DR-AUC). Higher DR-AUC value indicated better performance.

As for KS statistic-based standard, we assumed that agents under evaluation in clinical trial for HCC treatment, namely HCC-related agents, might possess an increased reversal capacity (Chen et al., 2017b). In other words, HCC agents were more likely to cause negative enrichment in KS test. To verify this hypothesis, we compiled a set of 27 potential HCC agents which were both included in LINCS and under clinical trials for liver cancer treatment. Besides, similarity scores of all compounds tested in HepG2 were also calculated, which were then used as ranked list for KS test. The results of KS test demonstrated that the HCC agent set was indeed negatively enriched (Figure 3C). The resultant enrichment scores (ES) here were termed as drug retrieval-associated ES (DR-ES) (Supplementary file 4). Of note, in contrast to DR-AUC, lower DR-ES values denoted better performance.

XSum is the optimal signature matching method for drug retrieval

The two independent benchmarking standards enabled us to quantitatively assess the retrieval performance of different signatures matching methods. Six available methods, including eXtreme Sum (XSum) (Cheng et al., 2014), eXtreme Cosine (XCos) (Cheng et al., 2013; Cheng et al., 2014), eXtreme Pearson (XCor) (Zhou et al., 2018), eXtreme Spearman (XSpe) (Zhou et al., 2018), KS test (Lamb et al., 2006), and the Reverse Gene Expression Score (RGES) (Chen et al., 2017a), were included for performance comparison. To minimize technical bias introduced by different query signatures, four HCC signatures with different sizes generated from distinct data sets were utilized for benchmarking (Supplementary file 5). Of these, Sig_gastro (Chen et al., 2017b) and Sig_NC (Chen et al., 2017a) were directly obtained from previous publications, while Sig_LIRI and Sig_GSE54236 were generated using RNA-seq data from LIRI cohort and microarray data from GSE54236, respectively. A brief summary of the above essential components involved in the evaluation process was presented in Figure 4A.

Figure 4 with 2 supplements see all

Download asset Open asset

Benchmarking different methodologies and parameters.

(A) Diagram summarizing the workflow and the important components involved in the evaluation process of drug retrieval performance of six different signature matching methods. (B) Retrieval performance of six matching methods evaluated by AUC-based benchmarking standard (left) and KS statistic-based benchmarking standard (right). Query signature was generated based on LIRI cohort. (C) Visualization of AUC-based (left) and KS statistic-based (right) performance measurements of XSum method on standardized data for discerning the optimal operating parameter. (D) Diagram summarizing the workflow and the important components associated with the investigation process of the optimal query signature size. (E) Relationship between query signature size determined by iterative fold change-based approach and retrieval performance evaluated by AUC-based standard (left) and KS statistic-based standard (right). (F) Relationship between query signature size determined by random sampling-based approach and retrieval performance evaluated by AUC-based standard (left) and KS statistic-based standard (right). LOESS polynomial regression analysis was performed for curve fitting. AUC, area under the curve; KS, Kolmogorov-Smirnov.

Considering that the performance of eXtreme methods (including XSum, XCos, XCor, and XSpe) may be affected by the number of top genes (topN), we thus calculated the DR-AUC or DR-ES values of each eXtreme methods iteratively, using topN ranging from 50 to 489. In the condition of using Sig_LIRI as query signature, both benchmarking standards demonstrated that XSum outperformed other five methods across almost all candidate topNs (Figure 4B). Concordantly, when using other three query signatures, XSum also achieved better performance compared with other methods, except in the case of using Sig_Gastro as query signature and AUC-based standard for benchmarking, where RGES showed a similar performance with XSum (Figure 4—figure supplement 1A-C). Generally, XSum exhibited a consistently excellent performance, independently of the query signature and benchmarking standard (Supplementary file 6A and B). In addition, our analyses also demonstrated that the recently developed RGES (a modification of the KS method) was superior to the KS method and might serve as an alternative approach for KS-based connectivity mapping (Chen et al., 2017a).

We next sought to find the most appropriate topN value for applying XSum method to achieve the best retrieval performance. Directly selecting the exact topN value where corresponding DR-AUC/DR-ES reached their maximum/minimum might cause bias and could be prone to overfitting. Given the continuous trait of candidate topNs, we chose to divide them into several smaller bins, typically 50 topN values in each bin. The DR-AUC/DR-ES of a given bin was defined as the mean DR-AUC/DR-ES values within this bin, which could decrease potential influences brought about by outliers. With this normalization approach, relatively consistent results across varying conditions were obtained. The optimal window with the best performance was either ‘top150–200’ or ‘top200–250’ (Supplementary file 6C). Besides, we also observed a biphasic pattern of fitting curves, with the inflection points appearing where topNs were around 200 (Figure 4C, Figure 4—figure supplement 1D-F). Collectively, we supposed that topN of 200 could serve as a rough guide.

A query signature size of 100 is applicable for drug retrieval

Next, we intended to further discern the optimal query signature size. Considering that Sig_gastro (N_gene=44) and Sig_NC (N_gene=73) had fixed and relatively small signature sizes, only signatures generated from LIRI and GSE54236 cohorts were utilized for the following investigation. We adopted two complementary approaches: (i) iterative fold change-based and (ii) random sampling-based approaches, to obtain query signatures with varying sizes (Figure 4D). The iterative fold change-based approach could create a number of signatures with discontinuous sizes through setting iterative threshold values of fold changes. The exact sizes of optimal signatures identified by this approach varied substantially (including 55, 79, 140, and 167). Despite this, similar trends of biphasic pattern with inflection points at around 100 under different conditions could still be observed (Figure 4E, Figure 4—figure supplement 2A). The approach based on random sampling was adopted as a complement. The results showed that, as the signature sizes increased, the DR-AUC/DR-ES values also increased/decreased and eventually converged when the signature size was more than 100 (Figure 4F, Figure 4—figure supplement 2B). Accordingly, we considered that a signature size of 100 could be selected as a good compromise. This conclusion remained valid in the conditions when other topN values were applied, such as 100 and 400 (results not shown).

A good query signature should comprehensively reflect the clinical characteristics of corresponding disease

Many previous studies chose to compare normal versus diseased states to define disease signatures. However, signatures that are generated based on other clinical phenotypes, such as prognosis and metastasis, can also be used to query LINCS. Aiming to figure out whether this factor could also affect the performance of drug retrieval, we designed a forward and a backward strategy (Figure 5A). The application of forward strategy was based on two types of signatures, general signatures (representing discordant expression pattern between normal and tumor tissues) and prognostic signatures (associated with survival outcomes). We compared the above two signature phenotypes across varied signature sizes. Unfortunately, the results under different data sets and benchmarking standards were highly inconsistent. This strategy thus failed to provide a definitive conclusion (Figure 5—figure supplement 1A and B).

Figure 5 with 1 supplement see all

Download asset Open asset

Necessary properties of a good query signatures.

(A) Schematic illustration of forward and backward strategy adopted to investigate whether the factor associated with clinical phenotype of query signature can affect computational therapeutic discovery. (B) The DR-AUC value and DR-ES value of the optimal randomized signature showed by ROC curve (upper) and enrichment plot (lower). (C) The association between the optimal signature and the clinical phenotype of discordant expression pattern suggested by ROC curves based on RNA sequencing cohorts (upper) and Microarray cohorts (lower). (D) The association between the optimal signature and the clinical phenotype of prognosis. Color toward gray indicates no statistical significance. (E) The association between the optimal signature and multiple clinical features, including BCLC and TNM stage, tumor thrombus, AFP level, and histologic grade. Data are presented as median±quartiles. N≥100. Statistical significance of difference between groups was determined using either Kruskal-Wallis or Wilcoxon sum rank tests.

Opposed to the forward strategy, backward strategy started from creating a collection of 10,000 random signatures, followed by determining the optimal signature for clinical implication evaluation. The optimal random signature was determined according to both benchmarking standards. Exploring the clinical values of this signature might reveal some necessary features possessed by a ‘good’ query signature (Figure 5B). A comprehensive clinical evaluation on the optimal signature was carried out based on five RNA-seq and five microarray clinical cohorts from three perspectives. First, the ability of this signature to distinguish tumors from non-tumors was investigated. Briefly, we extracted the first principal components (PC1) of this signature to represent its overall expression pattern. AUC was used here as a measurement of the classification capability. The results showed that more than 0.90 of AUC can be obtained in seven out of the eight cohorts (87.5%), indicating that the ability to discern the difference between diseased and normal states might be an indispensable property for achieving good retrieval performance (Figure 5C). Next, we intended to find out whether the optimal signature should be a prognostic indicator. Cox regression analyses were conducted to investigate the association between the signature expression (PC1) and clinical outcome. As a result, significant prognostic implications of the optimal signature could be observed in six out of the eight cohorts (75%), suggesting that prognostic significance was also a necessary characteristic (Figure 5D). At last, the association between signature expression and other clinical features was explored. Considering that CHCC and LIHC cohorts held the most abundant clinical information, corresponding analyses were thus conducted on these two cohorts. The results showed that there was a significant correlation between the optimal signature and multiple clinical features, including BCLC stage (p=0.011), tumor thrombus (p=0.001), AFP level (p=0.022), TNM stage (p<0.001), and histologic grade (p<0.001) (Figure 5E). Accordingly, we concluded that a good query signature should possess the ability to comprehensively recapitulate the clinical features of corresponding disease, rather than only reflect the disease characteristic from single perspective.

Generation of novel liver cancer signature

The conclusions from the above analyses were then applied to establish a signature representing liver cancer initiation and development, which could be utilized to query compounds with potential therapeutic as well as preventive effects against liver cancer. The generation of this evolution-associated signature was based on the concept that the initiation and progression of liver cancer was a stepwise process with gradually acquired advantageous biological capabilities (Figure 6A). Therefore, conceptually, antagonizing genes that were most related to these stages could be a potential therapeutic strategy. Through implementing random forests algorithm on GSE89377 cohort, preliminary screening was performed to include stage-associated genes, where genes with greater predictive power were selected for further analysis. This screening yielded a total of 6017 stage-associated genes (23.9%), of which 309 were landmark genes (Figure 6B). Next, we conducted weighted gene co-expression network analysis (WGCNA) to obtain co-expressed modules with diverse expression patterns (Figure 6—figure supplement 1A). Seven gene modules were discerned by WGCNA analysis (Figure 6—figure supplement 1B and C), and two of them, which we termed the ‘ascending’ module (N=1738) and the ‘descending’ module (N=350) for their greatest relevance to stages and patterns of linear evolution from normal to cancer, were retained for further analyses (Figure 6C and D). Biological processes associated with genes in these two modules were investigated. We found that the ‘ascending’ module was closely associated with proliferation (Figure 6C), while the ‘descending’ module was enriched in several different types of processes (Figure 6D). There were 159 genes in common between these two modules and landmarks. Based on the aforementioned recommendation of query signature size, we sought to further reduce the size of 159–100. This procedure was carried out using HCC occurrence-related clinical and molecular data from GSE15654 cohort. In brief, 10,000 random signatures, each containing 100 genes, were generated based on the 159-gene panel. The one which had the most significant association with HCC occurrence was considered as the optimal query signature (Figure 6—figure supplement 2A and B). This analysis yielded a signature comprised of 82 ascending genes and 18 descending genes, which was then named as Sig_evo (Supplementary file 7). The linear evolution pattern of Sig_evo remained present in training (Figure 6—figure supplement 2C) as well as an independent validation cohort (Figure 6—figure supplement 2D).

Figure 6 with 2 supplements see all

Download asset Open asset

Development of a novel signature representing the initiation and progression of liver cancer.

(A) Schematic of the stepwise process of liver cancer initiation and progression. (B) Preliminary screening of developmental stage-associated genes by random forests algorithm based on GSE89377. (C) The expression pattern of the ‘ascending’ module discerned by WGCNA analysis (left) and the enriched biological processes determined by hypergeometric test (right). (D) The expression pattern of the ‘descending’ module (left) and the enriched biological processes (right). (E) The performance evaluation of the Sig_evo for discerning the difference between tumor and normal tissues based on RNA sequencing cohorts (left) and microarray cohorts (right). (F) The association between the Sig_evo and the clinical phenotype of prognosis. Color toward gray indicates no statistical significance. (G) The association between the Sig_evo and fibrosis-related phenotype suggested by ROC curve. (H) The association between Sig_evo and CCl4-induced expression changes in liver tissues of mice. The enrichment scores and statistical significance were determined by gene set enrichment analysis. (I) The association between Sig_evo and DEN-induced expression changes in liver tissues of rats. WGCNA, weighted gene co-expression network analysis.

As previously discussed, a good query signature should reflect the clinical features of corresponding disease comprehensively. Therefore, we systematically surveyed the association between Sig_evo and the clinical phenotypes of precancerous/cancerous liver lesions using clinical and experimental data from both human and animal data sets. First, based on clinical cohorts of HCC, we demonstrated that Sig_evo had a remarkable capability for distinguishing tumors from non-tumors, with a median AUC of 0.972 in all eight cohorts (Figure 6E). Besides, this signature also held great prognostic power in HCC, as indicated by the results of Cox analyses (Figure 6F). Next, in view of the crucial role of fibrosis in driving hepatocarcinogenesis, further investigation was performed to validate its relevance to fibrosis-related phenotype. The result suggested that Sig_evo could also effectively differentiate between mild (S0/S1) and severe (S3/S4) fibrosis (Figure 6G). Additionally, we collected four experimental data sets, including two carbon tetrachloride (CCl₄)-treated mouse data sets and two diethylnitrosamine (DEN)-treated rat data sets, to assess the enrichment levels of Sig_evo in mouse and rat fibrosis models. It could be observed that ascending genes in Sig_evo were significantly enriched in both CCl₄-treated (GSE27640) and DEN-treated (GSE19057) liver tissues (Figure 6H and I). However, descending genes did not exhibit any significant enrichment pattern in all included data sets, possibly due to the limited gene number (Figure 6H, I). Notably, the expression profiles in GSE63726 were derived from non-parenchymal cell fractions which had abundant hepatic stellate cells (HSCs), and thus the significant enrichment could provide the evidence that this signature might reflect the molecular feature of HSC activation (Figure 6I). In summary, the Sig_evo fully complied with the criteria of a good query signature and was then employed for querying LINCS.

Using the optimal method (XSum) and a compromising parameter (topN=200), we matched Sig_evo with HepG2-derived compound signatures in LINCS and obtained the similarity scores of all compounds (lower scores implied higher reversal potency and greater potential for application). After excluding preclinical agents or agents withdrawn from the market, 793 agents remained (Corsello et al., 2017). These agents were then considered repositioning candidates (Supplementary file 8). Interestingly, some agents which were previously proved to have chemopreventive effects, including erlotinib (Fuchs et al., 2014), caffeine (Hoshida et al., 2012), and fasudil (Nakagawa et al., 2016), dominated relatively high rankings on the list (Figure 7A). Besides, anti-HCC agents were also found to be enriched significantly in compounds with reversal potency (Figure 7B). These findings collectively supported the reliability of the prediction results.

Figure 7 with 5 supplements see all

Download asset Open asset

Homoharringtonine (HHT) has significant tumor killing activity both in vitro and in vivo.

(A) Results of best practice approach-based computational dr_evo as query signature. Top ranked 10 compounds with highest reversal potency were illustrated in the right panel. (B) Enrichment of HCC agents in compounds with reversal potency (XSum score<0). Statistical significance was determined based on the null distribution formed by 10,000 permutations. (C) 2D (left) and 3D (right) chemical structure of HHT. (D) Comparison of distribution of compound activity between HHT and three different drug categories, including chemotherapy (N=45 compounds), targeted cancer agents (N=419 compounds), and non-oncology (N=362 compounds). The IC₅₀ values (from PRISM data set) of each drug category in each cell line (N=482) were determined through calculating the median IC₅₀ value across all the compounds in this category. Data are presented as median±quartiles, N≥100. (E) The drug sensitivity data of HHT (achieved from PRISM data set) across liver cancer cell lines. The drug sensitivities of two HCC agents in the first-line (sorafenib and lenvatinib) and one HCC agent in the second-line (regorafenib) were also presented for comparison. Areas with different colors denote the interquartile range of median IC₅₀ values of compounds within different drug categories. (F) Long-term cell proliferation assay for testing the anti-tumor activity of HHT across 10 liver cancer cell lines. Of these, four cell lines have not been profiled by PRISM for the sensitivity to HHT. (G) Macroscopic image of tumors harvested from xenograft mice treated with vehicle (upper) and HHT (lower). (H) Longitudinal tumor volume progression of subcutaneous MHCC97H xenograft tumors treated with vehicle (N=6) and HHT (N=6). The statistical significance of difference between groups was determined using Student’s t-test. Data are represented as mean ± SD. (I) Body weight changes of mice in control (N=6) and HHT-treated (N=6) groups. Statistical significance was determined using Student’s t-test. Data are represented as mean ± SD. *p<0.05, **p<0.01, ***p<0.001. NS, not significant. HCC, hepatocellular carcinoma.

Figure 7—source data 1 Drug-induced expression changes across different cell lines as well as different concentrations.: https://cdn.elifesciences.org/articles/71880/elife-71880-fig7-data1-v2.xlsx
Download elife-71880-fig7-data1-v2.xlsx

Homoharringtonine is a candidate anti-liver agent

According to the computational results, homoharringtonine (HHT) (Figure 7C), a protein synthesis inhibitor targeting RPL3, had the highest reversal potency among 793 repositioning candidates (Tujebajeva et al., 1989). To prove that the reversal effect of HHT is not cell type- or concertation-dependent, we generated HHT-perturbed expression data using five different liver cancer cell lines (Hep3B, HepG2, Huh6, Huh7, and PLC) and four different concentrations (0.1 μM, 0.5 μM, 1 μM, and 10 μM). HHT with a fixed concentration of 10 μM (a standard concentration in CMap and LINCS) was used to treat different cell lines and a single cell line HepG2 (a cell line used in LINCS) was perturbed by HHT with varying concentrations (Figure 7—figure supplement 1A). HHT signatures were obtained through calculating the fold changes of HHT-treated samples to control samples. Subsequently, GSEA was conducted against different HHT signatures, taking ascending and descending genes in Sig_evo as query gene sets separately. The results indicated that the ascending genes tended to enrich in HHT-induced downregulated genes (ES<0), while descending genes appeared to be more associated with HHT-induced upregulated genes (ES>0), suggesting that the ability of HHT to reverse the Sig_evo was independent of cell type and treatment concentration (Figure 7—figure supplement 1B and C, Figure 7—source data 1, GSE193897).

As the drug target of HHT, RPL3 was characterized for its clinical and biological implications. Comprehensive comparisons of the expression of RPL3 between tumor and non-tumor tissues were conducted using seven clinical cohorts with available expression profiles of both tumor and non-tumor tissues. The results showed that RPL3 had higher expression levels in tumor compared with non-tumor tissues in more than half the clinical cohorts (57.1%) (Figure 7—figure supplement 2A). The increase of protein expression of RPL3 could also be observed in tumor tissues (Figure 7—figure supplement 2B), as shown by immunohistochemical images from the Human Protein Atlas (Uhlén et al., 2015). Higher expression of RPL3 also indicated worse prognosis (Figure 7—figure supplement 2C). In addition, leveraging CRISPR-based screening data from Project Achilles (Meyers et al., 2017), we found that RPL3 was essential for maintaining the survival and growth of all liver cancer cell lines (Figure 7—figure supplement 2D, E). Above results demonstrated the rationality of RPL3 inhibition for treating liver cancer.

Homoharringtonine has a remarkable therapeutic effect against liver cancer

To investigate the in vitro anti-tumor activity of HHT against liver cancer, we analyzed the drug response data of HHT from PRISM data set (Corsello et al., 2020). It could be observed that HHT had a lower distribution of IC₅₀ values across 482 PRISM cell lines compared with molecular-targeted agents and non-oncology agents (Figure 7D). Of note, in liver cancer cell lines, HHT exhibited a powerful tumor suppressor activity with a median IC₅₀ value of 0.408 μM, which was numerically lower than that of other three Food and Drug Administration (FDA)-approved HCC agents (lenvatinib: 0.617 μM; regorafenib: 2.009 μM; sorafenib: 3.348 μM) (Figure 7E). The in vitro anti-tumor activity of HHT was corroborated by the long-term cell proliferation assay (Figure 7F) and short-term IncuCyte real-time assay (Figure 7—figure supplement 3). In addition, in vivo efficacy of HHT was also evaluated using subcutaneous xenograft model of MHCC97H cell line. The result demonstrated that HHT could significantly inhibit the growth of xenograft tumors (Figure 7G and H), with limited drug-related toxicity (Figure 7I). Since co-administration of HHT with other approved agents was more likely to have clinical significance, we also interrogated whether HHT could augment the tumor-killing effect of sorafenib. Three different statistical models were adopted for synergy estimation. The results suggested that HHT could indeed synergize with sorafenib in many conditions, albeit not very remarkable in general (Figure 7—figure supplements 4 and 5).

Homoharringtonine treatment can alleviate liver fibrosis both in vivo and in vitro

Liver fibrosis occurs when the liver tissue is repeatedly and continuously injured, which is a crucial risk factor for hepatocarcinogenesis (O’Rourke et al., 2018). Since we have proved that Sig_evo was associated with liver fibrosis using clinical and animal-derived data, it could be postulated that HHT also had the potential to alleviate liver fibrosis. The antifibrotic effect of HHT was first assessed using carbon tetrachloride (CCL₄)-induced mouse liver fibrosis model (Figure 8A). The results suggested that HHT could significantly reduce Ishak scores and positive area of Sirius Red staining compared to vehicle controls (Figure 8B and C). Besides, HHT treatment could also lead to significant reduction of serum levels of alanine transaminase (ALT) and aspartate transaminase (AST) (Figure 8D). These observations demonstrated that HHT can impede fibrosis development and partially rescued hepatic function in CCL₄-induced mouse model. The activation of HSCs is one of the key steps in fibrosis development (Zhou et al., 2014). To determine whether HHT could inhibit the activation of HSCs, in vitro experiments based on TGF-β1-activated human HSC line LX-2 were further conducted. LX-2 cells were treated with vehicle or HHT (1 μM and 5 μM) for 6 hr, followed by RNA-seq for quantifying HHT-induced expression changes. Nine fibrotic genes from previous publications were collected; high expression level of these genes represented the activation status of HSCs. After HHT treatments, the expression level of almost all fibrotic genes was downregulated (Figure 8E, Figure 8—figure supplement 1, Figure 8—source data 1, GSE180243). The downregulated tendency of the two most critical genes which encoded collagen I and α-SMA were further corroborated by the quantitative real-time PCR (Figure 8—figure supplement 2A). Additionally, the protein-level expression of collagen I and α-SMA was also detected using western blot (Figure 8—figure supplement 2B, Figure 8—source data 2) and immunofluorescence (Figure 8—figure supplement 2C). The results showed that HHT could inhibit the protein expression of collagen I and α-SMA as well. Taken collectively, HHT can inhibit the progression of liver fibrosis via suppressing HSC activation and thus may have certain preventive effects on liver cancer.

Figure 8 with 2 supplements see all

Download asset Open asset

HHT has significant in vivo anti-fibrotic effects.

(A) Schematic diagram (upper) of the experimental design for validating the anti-fibrotic ability of HHT and representative photographs (lower) of the livers harvested from different groups at the time of sacrifice. (B) Representative images of Masson’s trichrome staining and Sirius Red staining of liver tissues from different groups (scale bars: 250 µm). (C) Comparisons of Ishak scores (left) and Sirius Red-based collagen quantification (right) between different groups. Statistical significance was determined using one-way ANOVA followed by Tukey multiple comparison test. Data are represented as mean ± SD (N=6 in each group). (D) Comparisons of serum levels of ALT, AST, ALP, and Alb between different groups. Statistical significance was determined using one-way ANOVA followed by Tukey multiple comparison test. Data are represented as mean ± SD (N=6 in each group). (E) Differential expression of nine fibrosis-associated genes between HHT-treated and HHT-untreated LX-2 cells. *p<0.05, **p<0.01, ***p<0.001, ****p<0.0001. HHT, homoharringtonine; NS, not significant.

Figure 8—source data 1 Sequencing results of HHT-treated LX2 cells.: https://cdn.elifesciences.org/articles/71880/elife-71880-fig8-data1-v2.xlsx
Download elife-71880-fig8-data1-v2.xlsx
Figure 8—source data 2 Raw unedited plots.: https://cdn.elifesciences.org/articles/71880/elife-71880-fig8-data2-v2.pdf
Download elife-71880-fig8-data2-v2.pdf

Discussion

In recent years, the explosive growth of pharmacogenomic data enables the development of computational drug discovery and repositioning, leading to many remarkable findings of novel therapeutics (Kong et al., 2020; Stathias et al., 2018; Yang et al., 2021). Owing to the success of CMap and LINCS projects (Lamb et al., 2006; Subramanian et al., 2017), signature reversion-based computational drug discovery approach has been extensively used (Chen et al., 2017a; Chen et al., 2017b; Dudley et al., 2011; van Noort et al., 2014; Wei et al., 2006). However, lack of suitable benchmarking standards for evaluating drug repositioning performance limits further improvement of this approach. Some studies proposed that the benchmarks assessing drug-drug similarity, such as the anatomical therapeutic chemical (ATC) system, could be taken as alternative standards to indirectly determine the optimal methodologies and parameters of computational repositioning (Cheng et al., 2013; Zhou et al., 2018). However, considering the great difference between the two situations, developing tailored benchmarking standards for assessing disease-drug similarity would be more desirable (Cheng et al., 2014). In this study, we proposed two novel benchmarking standards, AUC-based and KS statistic-based standards. Despite being mutually independent, the evaluation results of the two standards were highly consistent, demonstrating their rationality and robustness.

These two standards enable the establishment of a standardized procedure for performing more effective signature-based drug prediction. We first determined that using reference signatures from one of the most relevant cell lines with the disease of interest instead of from a non-touchstone cell line or aggregation-based consensus results was a preferable option to exploit LINCS data. Next, XSum was identified as an optimal method for matching compound and disease signatures. Interestingly, a prior study that made a comparison of drug retrieval performance between XSum, XCos, and KS methods using a totally different benchmarking standard from ours also come to the same conclusion (Cheng et al., 2014). Furthermore, we also uncovered an appropriate parameter of XSum (topN=200), which lacked guidance previously.

Most of the current investigations and methodological developments were focused on reference signatures and signature matching methods. By contrast, relatively limited efforts have been made to standardize the generation of query signatures (Chan et al., 2019; Wen et al., 2016; Wen et al., 2015). In this study, two potential factors, signature phenotypes and signature size, were systematically analyzed. Through adopting two independent approaches, an appropriate query signature size of 100 was determined. However, prior studies considered a reduced number of 50 as the optimal size of query signatures (Chen et al., 2017a; Parkkinen and Kaski, 2014). It is reasonable to speculate that the utility of different signature matching methods (XSum in this study and KS-based methods in other studies) and also the different benchmarking standards may be responsible for the discrepancy. Next, we determined that a good query signature should hold the ability to comprehensively characterize the clinical features of corresponding disease. This finding seemed to be reasonable since disease was highly likely to be underrepresented when the query signature was generated based on a single clinical phenotype.

Based on these findings, we summarized the best practice approach for LINCS-based drug prediction. An application of this approach to liver cancer was then carried out. An evolution-associated query signature related to the development and progression of liver cancer was first constructed for drug retrieval. Following the best practice approach, HHT was identified as the candidate agent for its highest reversal potency. Since the query signature (Sig_evo) could reflect the properties of liver cancer initiation and development, we considered that HHT might have both therapeutic and preventive effects on liver cancer. The therapeutic effect of HHT was assessed by in vitro cell line models as well as in vivo subcutaneous xenograft model. Both of them suggested remarkable tumor-killing activity of HHT. For validating the preventive effect, an indirect approach that focused on proving the anti-fibrotic effect of HHT was adopted. The results demonstrated that HHT could alleviate liver fibrosis in vivo and inhibit the activation of HSCs in vitro. Inhibition of liver fibrogenesis might prevent the progression of cirrhosis and thereby suppress HCC tumorigenesis (Fuchs et al., 2014). Therefore, we supposed that HHT had the potential to be taken as preventive agents for liver cancer as well. Notably, in view of the grim prognosis and imperfect treatment modalities of liver cancer, prevention of HCC development in patients at high risk of primary malignancy rather than treating patients at advanced stages is theoretically the most desirable approach to improve patient prognosis (Fujiwara et al., 2018; Nakagawa et al., 2016). As HHT has been approved by FDA for the treatment of chronic myelogenous leukemia, it can be tested directly in clinic without worrying about its safety problem (Kantarjian et al., 2013).

In this study, we have performed the most comprehensive surveys so far about the influencing factors of signature reversion-based drug prediction. Two novel benchmarking standards are proposed, providing new insight into the evaluation of related methodologies. All the findings in this study are verified independently by at least two different approaches, ensuring the reliability of the conclusions. Nevertheless, we also recognize several important limitations. First, with our design, our conclusions are conditional and hold only under the conditions of using compound profiles of HepG2 in LINCS as reference signatures. Further investigations using other LINCS data are required to extend current conclusions to other conditions. Second, the parameters recommended by us, including topN of 200 and query signature size of 100, are more or less based on our subjective judgments and should be taken as a rough guide. Although there are sufficient non-quantitative estimates supporting the use of these two parameters, more efforts are still needed to accurately determine the optimal parameters. Third, we only focused on analyzing the data from the project that utilized transcriptomic platforms to measure cell responses during perturbation experiments, and other omics data which are actively being generated by different LINCS centers might also be a good choice for computational drug discovery and repositioning (Keenan et al., 2018; Koleti et al., 2018). Recently, large-scale resources (CPPA) of perturbed protein responses have been generated (Zhao et al., 2020). Considering that proteins are the components of the basic functional units in biological pathways, investigating the optimal repositioning strategy based on proteomic resources may also have important implications.

In summary, our findings fill a knowledge gap in the area of LINCS-based computational repositioning. Through exploiting these findings, we also determined a promising anti-liver cancer agent HHT, of which the therapeutic and preventive effects have been validated experimentally.

Materials and methods

Key resources table

Reagent type (species) or resource	Designation	Source or reference	Identifiers	Additional information
Cell line (Homo sapiens)	Hep3B	ATCC	Cat#: HB-8064;RRID:CVCL_0326
Cell line (H. sapiens)	HepG2	ATCC	Cat#: HB-8065; RRID:CVCL_0027
Cell line (H. sapiens)	Huh6	RCB	Cat#: RCB1367; RRID:CVCL_4381
Cell line (H. sapiens)	Huh7	JCRB	Cat#: JCRB0403; RRID:CVCL_0336
Cell line (H. sapiens)	MHCC97H	Zhongshan Hospital	RRID:CVCL_4972	Liver Cancer Institute of Zhongshan Hospital (Shanghai, China)
Cell line (H. sapiens)	PLC/PRF/5	ATCC	Cat#: CRL-802;RRID:CVCL_0485
Cell line (H. sapiens)	SNU398	ATCC	Cat#: CRL-2233; RRID:CVCL_0077
Cell line (H. sapiens)	SNU449	ATCC	Cat#: CRL-2234; RRID:CVCL_0454
Cell line (H. sapiens)	SNU475	ATCC	Cat#: CRL-2236;RRID:CVCL_0497
Cell line (H. sapiens)	SK-Hep1	ATCC	Cat#: HTB-52; RRID:CVCL_0525
Cell line (H. sapiens)	LX2	ATCC	Cat#: SCC064;RRID:CVCL_5792
Chemical compound, drug	Homoharringtonine	Selleck Chemicals	S9015
Antibody	Anti-HSP90 (Mouse monoclonal)	Santa Cruz Biotechnology	Cat#: sc-13119;RRID:AB_675659	WB (1:5000)
Antibody	Anti-α-SMA (Mouse monoclonal)	Sigma-Aldrich	Cat#: A5228; RRID:AB_262054	WB (1:2000)IF (1:200)
Antibody	Anti-Collagen I (Rabbit polyclonal)	ProteinTech	Cat#: 14695-1-AP; RRID:AB_2082037	WB (1:2000)IF (1:200)
Sequence-based reagent	ACTA2_F	This paper	PCR primer	5′GACAATGGCTCTGGGCTCTGTAA3′
Sequence-based reagent	ACTA2_R	This paper	PCR primer	5′CTGTGCTTCGTCACCCACGTA3′
Sequence-based reagent	COL1A1_F	This paper	PCR primer	5′TCCTGGTCCTGCTGGCAAAGAA3′
Sequence-based reagent	COL1A1_R	This paper	PCR primer	5′CACGCTGTCCAGCAATACCTTGA3′
Software, algorithm	R software, version 3.6.0	https://cran.r-project.org/	RRID:SCR_001905
Software, algorithm	ImageJ, version 1.53k	http://imagej.net/	RRID:SCR_003070
Software, algorithm	Combenefit, version 2.02	https://sourceforge.net/projects/combenefit/

Share this article

Cite this article

Overview of LINCS data-driven therapeutic discovery.

Highly cell-type specific compound-induced expression changes.

Establishment of novel benchmarking standards.

Benchmarking different methodologies and parameters.

Necessary properties of a good query signatures.

Development of a novel signature representing the initiation and progression of liver cancer.

Homoharringtonine (HHT) has significant tumor killing activity both in vitro and in vivo.

Figure 7—source data 1

HHT has significant in vivo anti-fibrotic effects.

Figure 8—source data 1

Figure 8—source data 2

Author details

Chen Yang

Contribution

Contributed equally with

Competing interests

Hailin Zhang

Contribution

Contributed equally with

Competing interests

Mengnuo Chen

Contribution

Contributed equally with

Competing interests

Siying Wang

Contribution

Contributed equally with

Competing interests

Ruolan Qian

Contribution

Competing interests

Linmeng Zhang

Contribution

Competing interests

Xiaowen Huang

Contribution

Competing interests

Jun Wang

Contribution

Competing interests

Zhicheng Liu

Contribution

Competing interests

Wenxin Qin

Contribution

Competing interests

Cun Wang

Contribution

For correspondence

Competing interests

Hualian Hang

Contribution

For correspondence

Competing interests

Hui Wang

Contribution

For correspondence

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Research organism