Systematic integration of biomedical knowledge prioritizes drugs for repurposing

  1. Daniel Scott Himmelstein
  2. Antoine Lizee
  3. Christine Hessler
  4. Leo Brueggeman
  5. Sabrina L Chen
  6. Dexter Hadley
  7. Ari Green
  8. Pouya Khankhanian
  9. Sergio E Baranzini  Is a corresponding author
  1. University of California, San Francisco, United States
  2. University of Pennsylvania, United States
  3. University of Nantes, France
  4. University of Iowa, United States
  5. Johns Hopkins University, United States
  6. University of California, San Fransisco, United States

Abstract

The ability to computationally predict whether a compound treats a disease would improve the economy and success rate of drug approval. This study describes Project Rephetio to systematically model drug efficacy based on 755 existing treatments. First, we constructed Hetionet (neo4j.het.io), an integrative network encoding knowledge from millions of biomedical studies. Hetionet v1.0 consists of 47,031 nodes of 11 types and 2,250,197 relationships of 24 types. Data were integrated from 29 public resources to connect compounds, diseases, genes, anatomies, pathways, biological processes, molecular functions, cellular components, pharmacologic classes, side effects, and symptoms. Next, we identified network patterns that distinguish treatments from non-treatments. Then, we predicted the probability of treatment for 209,168 compound–disease pairs (het.io/repurpose). Our predictions validated on two external sets of treatment and provided pharmacological insights on epilepsy, suggesting they will help prioritize drug repurposing candidates. This study was entirely open and received realtime feedback from 40 community members.

https://doi.org/10.7554/eLife.26726.001

eLife digest

Of all the data in the world today, 90% was created in the last two years. However, taking advantage of this data in order to advance our knowledge is restricted by how quickly we can access it and analyze it in a proper context.

In biomedical research, data is largely fragmented and stored in databases that typically do not “talk” to each other, thus hampering progress. One particular problem in medicine today is that the process of making a new therapeutic drug from scratch is incredibly expensive and inefficient, making it a risky business. Given the low success rate in drug discovery, there is an economic incentive in trying to repurpose an existing drug that has already been shown to be safe and effective towards a new disease or condition.

Himmelstein et al. used a computational approach to analyze 50,000 data points – including drugs, diseases, genes and symptoms – from 19 different public databases. This approach made it possible to create more than two million relationships among the data points, which could be used to develop models that predict which drugs currently in use by doctors might be best suited to treat any of 136 common diseases. For example, Himmelstein et al. identified specific drugs currently used to treat depression and alcoholism that could be repurposed to treat smoking addition and epilepsy.

These findings provide a new and powerful way to study drug repurposing. While this work was exclusively performed with public data, an expanded and potentially stronger set of predictions could be obtained if data owned by pharmaceutical companies were incorporated. Additional studies will be needed to test the predictions made by the models.

https://doi.org/10.7554/eLife.26726.002

Introduction

The cost of developing a new therapeutic drug has been estimated at 1.4 billion dollars (DiMasi et al., 2016), the process typically takes 15 years from lead compound to market (Reichert, 2003), and the likelihood of success is stunningly low (Hay et al., 2014). Strikingly, the costs have been doubling every 9 years since 1970, a sort of inverse Moore’s law, which is far from an optimal strategy from both a business and public health perspective (Scannell et al., 2012). Drug repurposing — identifying novel uses for existing therapeutics — can drastically reduce the duration, failure rates, and costs of approval (Ashburn and Thor, 2004). These benefits stem from the rich preexisting information on approved drugs, including extensive toxicology profiling performed during development, preclinical models, clinical trials, and postmarketing surveillance.

Drug repurposing is poised to become more efficient as mining of electronic health records (EHRs) to retrospectively assess the effect of drugs gains feasibility (Wang et al., 2015; Xu et al., 2015; Brilliant et al., 2016; Tatonetti et al., 2012). However, systematic approaches to repurpose drugs based on mining EHRs alone will likely lack power due to multiple testing. Similar to the approach followed to increase the power of genome-wide association studies (GWAS) (Stephens and Balding, 2009; Sawcer, 2008), integration of biological knowledge to prioritize drug repurposing will help overcome limited EHR sample size and data quality.

In addition to repurposing, several other paradigm shifts in drug development have been proposed to improve efficiency. Since small molecules tend to bind to many targets, polypharmacology aims to find synergy in the multiple effects of a drug (Roth et al., 2004). Network pharmacology assumes diseases consist of a multitude of molecular alterations resulting in a robust disease state. Network pharmacology seeks to uncover multiple points of intervention into a specific pathophysiological state that together rehabilitate an otherwise resilient disease process (Hopkins, 2008; Hopkins, 2007). Although target-centric drug discovery has dominated the field for decades, phenotypic screens have more recently resulted in a comparatively higher number of first-in-class small molecules (Swinney and Anthony, 2011). Recent technological advances have enabled a new paradigm in which mid- to high-throughput assessment of intermediate phenotypes, such as the molecular response to drugs, is replacing the classic target discovery approach (Iskar et al., 2012; Lamb, 2007; Qu and Rajpal, 2012). Furthermore, integration of multiple channels of evidence, particularly diverse types of data, can overcome the limitations and weak performance inherent to data of a single domain (Hodos et al., 2016). Modern computational approaches offer a convenient platform to tie these developments together as the reduced cost and increased velocity of in silico experimentation massively lowers the barriers to entry and price of failure (Hurle et al., 2013; Liu et al., 2013).

Hetnets (short for heterogeneous networks) are networks with multiple types of nodes and relationships. They offer an intuitive, versatile, and powerful structure for data integration by aggregating graphs for each relationship type onto common nodes. In this study, we developed a hetnet (Hetionet v1.0) by integrating knowledge and experimental findings from decades of biomedical research spanning millions of publications. We adapted an algorithm originally developed for social network analysis and applied it to Hetionet v1.0 to identify patterns of efficacy and predict new uses for drugs. The algorithm performs edge prediction through a machine learning framework that accommodates the breadth and depth of information contained in Hetionet v1.0 (Himmelstein and Baranzini, 2015a; Sun et al., 2011). Our approach represents an in silico implementation of network pharmacology that natively incorporates polypharmacology and high-throughput phenotypic screening.

One fundamental characteristic of our method is that it learns and evaluates itself on existing medical indications (i.e. a 'gold standard’). Next, we introduce previous approaches that also performed comprehensive evaluation on existing treatments. A 2011 study, named PREDICT, compiled 1933 treatments between 593 drugs and 313 diseases (Gottlieb et al., 2011). Starting from the premise that similar drugs treat similar diseases, PREDICT trained a classifier that incorporates five types of drug-drug and two types of disease-disease similarity. A 2014 study compiled 890 treatments between 152 drugs and 145 diseases with transcriptional signatures (Cheng et al., 2014). The authors found that compounds triggering an opposing transcriptional response to the disease were more likely to be treatments, although this effect was weak and limited to cancers. A 2016 study compiled 402 treatments between 238 drugs and 78 diseases and used a single proximity score — the average shortest path distance between a drug’s targets and disease’s associated proteins on the interactome — as a classifier (Guney et al., 2016).

We build on these successes by creating a framework for incorporating the effects of any biological relationship into the prediction of whether a drug treats a disease. By doing this, we were able to capture a multitude of effects that have been suggested as influential for drug repurposing including drug-drug similarity (Gottlieb et al., 2011; Li and Lu, 2012), disease-disease similarity (Gottlieb et al., 2011; Chiang and Butte, 2009), transcriptional signatures (Lamb, 2007; Qu and Rajpal, 2012; Cheng et al., 2014; Lamb et al., 2006; Iorio et al., 2013), protein interactions (Guney et al., 2016), genetic association (Nelson et al., 2015; Sanseau et al., 2012), drug side effects (Campillos et al., 2008; Nugent et al., 2016), disease symptoms (Zhou et al., 2014), and molecular pathways (Pratanwanich and Lió, 2014). Our ability to create such an integrative model of drug efficacy relies on the hetnet data structure to unite diverse information. On Hetionet v1.0, our algorithm learns which types of compound–disease paths discriminate treatments from non-treatments in order to predict the probability that a compound treats a disease.

We refer to this study as Project Rephetio (pronounced as rep-het-ee-oh). Both Rephetio and Hetionet are portmanteaus combining the words repurpose, heterogeneous, and network with the URL het.io.

Results

Hetionet v1.0

We obtained and integrated data from 29 publicly available resources to create Hetionet v1.0 (Figure 1). The hetnet contains 47,031 nodes of 11 types (Table 1) and 2,250,197 relationships of 24 types (Table 2). The nodes consist of 1552 small molecule compounds and 137 complex diseases, as well as genes, anatomies, pathways, biological processes, molecular functions, cellular components, perturbations, pharmacologic classes, drug side effects, and disease symptoms. The edges represent relationships between these nodes and encompass the collective knowledge produced by millions of studies over the last half century.

Hetionet v1.0.

(A) The metagraph, a schema of the network types. (B) The hetnet visualized. Nodes are drawn as dots and laid out orbitally, thus forming circles. Edges are colored by type. (C) Metapath counts by path length. The number of different types of paths of a given length that connect two node types is shown. For example, the top-left tile in the Length 1 panel denotes that Anatomy nodes are not connected to themselves (i.e. no edges connect nodes of this type between themselves). However, the bottom-left tile of the Length 4 panel denotes that 88 types of length-four paths connect Symptom to Anatomy nodes.

https://doi.org/10.7554/eLife.26726.003
Table 1
Metanodes.

Hetionet v1.0 includes 11 node types (metanodes). For each metanode, this table shows the abbreviation, number of nodes, number of nodes without any edges, and the number of metaedges connecting the metanode.

https://doi.org/10.7554/eLife.26726.004
MetanodeAbbrNodesDisconnectedMetaedges
AnatomyA40224
Biological processBP11,38101
Cellular componentCC139101
CompoundC1552148
DiseaseD13718
GeneG20,945180016
Molecular functionMF288401
PathwayPW182201
Pharmacologic classPC34501
Side effectSE5734331
SymptomS438231
Table 2
Metaedges.

Hetionet v1.0 contains 24 edge types (metaedges). For each metaedge, the table reports the abbreviation, the number of edges, the number of source nodes connected by the edges, and the number of target nodes connected by the edges. Note that all metaedges besides Gene→regulates→Gene are undirected.

https://doi.org/10.7554/eLife.26726.005
MetaedgeAbbrEdgesSourcesTargets
Anatomy–downregulates–GeneAdG102,2403615,097
Anatomy–expresses–GeneAeG526,40724118,094
Anatomy–upregulates–GeneAuG97,8483615,929
Compound–binds–GeneCbG11,57113891689
Compound–causes–Side EffectCcSE138,94410715701
Compound–downregulates–GeneCdG21,1027342880
Compound–palliates–DiseaseCpD39022150
Compound–resembles–CompoundCrC648610421054
Compound–treats–DiseaseCtD75538777
Compound–upregulates–GeneCuG18,7567033247
Disease–associates–GeneDaG12,6231345392
Disease–downregulates–GeneDdG7623445745
Disease–localizes–AnatomyDlA3602133398
Disease–presents–SymptomDpS3357133415
Disease–resembles–DiseaseDrD543112106
Disease–upregulates–GeneDuG7731445630
Gene–covaries–GeneGcG61,69090439532
Gene–interacts–GeneGiG147,164952614,084
Gene–participates–Biological ProcessGpBP559,50414,77211,381
Gene–participates–Cellular ComponentGpCC73,56610,5801391
Gene–participates–Molecular FunctionGpMF97,22213,0632884
Gene–participates–PathwayGpPW84,37289791822
Gene→regulates→GeneGr > G265,67246347048
Pharmacologic Class–includes–CompoundPCiC1029345724

For example, Compound–binds–Gene edges represent when a compound binds to a protein encoded by a gene. This information has been extracted from the literature by human curators and compiled into databases such as DrugBank, ChEMBL, DrugCentral, and BindingDB. We combined these databases to create 11,571 binding edges between 1389 compounds and 1689 genes. These edges were compiled from 10,646 distinct publications, which Hetionet binding edges reference as an attribute. Binding edges represent a comprehensive catalog constructed from low-throughput experimentation. However, we also integrated findings from high-throughput technologies — many of which have only recently become available. For example, we generated consensus transcriptional signatures for compounds in LINCS L1000 and diseases in STARGEO.

While Hetionet v1.0 is ideally suited for drug repurposing, the network has broader biological applicability. For example, we have prototyped queries for (a) identifying drugs that target a specific pathway, (b) identifying biological processes involved in a specific disease, (c) identifying the drug targets responsible for causing a specific side effect, and (d) identifying anatomies with transcriptional relevance for a specific disease (Himmelstein, 2016j). Each of these queries was simple to write and took less than a second to run on our publicly available Hetionet Browser. Although it is possible that existing services provide much of the aforementioned functionality, they offer less versatility. Hetionet differentiates itself in its ability to flexibly query across multiple domains of information. As a proof of concept, we enhanced the biological process query (b), which identified processes that were enriched for disease-associated genes, using multiple sclerosis (MS) as an example disease. The verbose Cypher code for this query is shown below:

MATCH path =
 //Specify the type of path to match
 (n0:Disease)-[e1:ASSOCIATES_DaG]-(n1:Gene)-[:INTERACTS_GiG]-
 (n2:Gene)-[:PARTICIPATES_GpBP]-(n3:BiologicalProcess)
WHERE
 //Specify the source and target nodes
 n0.name = 'multiple sclerosis' AND
 n3.name = 'retina layer formation'
 //Require GWAS support for the Disease-associates-Gene relationship
 AND 'GWAS Catalog' in e1.sources
 //Require the interacting gene to be upregulated in a relevant tissue
 AND exists((n0)-[:LOCALIZES_DlA]-(:Anatomy)-[:UPREGULATES_AuG]-(n2))
RETURN path

The query above identifies genes that interact with MS GWAS-genes. However, interacting genes are discarded unless they are upregulated in an MS-related anatomy (i.e. anatomical structure, e.g. organ or tissue). Then relevant biological processes are identified. Thus, this single query spans four node and five relationship types.

The integrative potential of Hetionet v1.0 is reflected by its connectivity. Among the 11 metanodes, there are 66 possible source–target pairs. However, only 11 of them have at least one direct connection. In contrast, for paths of length 2, 50 pairs have connectivity (paths types that start on the source node type and end on the target node type, see Figure 1C). At length 3, all 66 pairs are connected. At length 4, the source–target pair with the fewest types of connectivity (Side Effect to Symptom) has 13 metapaths, while the pair with the most connectivity types (Gene to Gene) has 3542 pairs. This high level of connectivity across a diversity of biomedical entities forms the foundation for automated translation of knowledge into biomedical insight.

Hetionet v1.0 is accessible via a Neo4j Browser at https://neo4j.het.io. This public Neo4j instance provides users an installation-free method to query and visualize the network. The Browser contains a tutorial guide as well as guides with the details of each Project Rephetio prediction. Hetionet v1.0 is also available for download in JSON, Neo4j, and TSV formats (Himmelstein, 2017a). The JSON and Neo4j database formats include node and edge properties — such as URLs, source and license information, and confidence scores — and are thus recommended.

Systematic mechanisms of efficacy

One aim of Project Rephetio was to systematically evaluate how drugs exert their therapeutic potential. To address this question, we compiled a gold standard of 755 disease-modifying indications, which form the Compound–treats–Disease edges in Hetionet v1.0. Next, we identified types of paths (metapaths) that occurred more frequently between treatments than non-treatments (any compound–disease pair that is not a treatment). The advantage of this approach is that metapaths naturally correspond to mechanisms of pharmacological efficacy. For example, the Compound–binds–Gene–associates–Disease (CbGaD) metapath identifies when a drug binds to a protein corresponding to a gene involved in the disease.

We evaluated all 1206 metapaths that traverse from compound to disease and have length of 2–4 (Figure 2A). To control for the different degrees of nodes, we used the degree-weighted path count (DWPC, see Materials and methods) — which downweights paths going through highly connected nodes (Himmelstein and Baranzini, 2015a) — to assess path prevalence. In addition, we compared the performance of each metapath to a baseline computed from permuted networks. Hetnet permutation preserves node degree while eliminating edge specificity, allowing us to isolate the portion of unpermuted metapath performance resulting from actual network paths. We refer to the permutation-adjusted performance measure as Δ AUROC. A positive Δ AUROC indicates that paths of the given type tended to occur more frequently between treatments than non-treatments, after accounting for different levels of connectivity (node degrees) in the hetnet. In general terms, Δ AUROC assesses whether paths of a given type were informative of drug efficacy.

Performance by type and model coefficients.

(A) The performance of the DWPCs for 1206 metapaths, organized by their composing metaedges. The larger dots represent metapaths that were significantly affected by permutation (false discovery rate < 5%). Metaedges are ordered by their best performing metapath. Since a metapath’s performance is limited by its least informative metaedge, the best performing metapath for a metaedge provides a lower bound on the pharmacologic utility of a given domain of information. (B) Barplot of the model coefficients. Features were standardized prior to model fitting to make the coefficients comparable (Himmelstein and Lizee, 2016a).

https://doi.org/10.7554/eLife.26726.006

Overall, 709 of the 1206 metapaths exhibited a statistically significant Δ AUROC at a false discovery rate cutoff of 5%. These 709 metapaths included all 24 metaedges, suggesting that each type of relationship we integrated provided at least some therapeutic utility. However, not all metaedges were equally present in significant metapaths: 259 significant metapaths included a Compound–binds–Gene metaedge, whereas only four included a Gene–participates–Cellular Component metaedge. Table 3 lists the predictiveness of several metapaths of interest. Refer to the Discussion for our interpretation of these findings.

Table 3
The predictiveness of select metapaths.

A small selection of interesting or influential metapaths is provided (complete table online). Len. refers to number of metaedges composing the metapath. Δ AUROC and −log10(p) assess the performance of a metapath’s DWPC in discriminating treatments from non-treatments (in the all-features stage as described in Materials and methods). p assesses whether permutation affected AUROC. For reference, p=0.05 corresponds to −log10(p) = 1.30. Note that several metapaths shown here provided little evidence that Δ AUROC ≠ 0 underscoring their poor ability to predict whether a compound treated a disease. Coef. reports a metapath’s logistic regression coefficient as seen in Figure 2B. Metapaths removed in feature selection have missing coefficients, whereas metapaths given zero-weight by the elastic net have coef. = 0.0.

https://doi.org/10.7554/eLife.26726.007
Abbrev.Len.Δ auroc−log₁₀(P)Coef.Metapath
CbGaD214.5%6.20.20Compound–binds–Gene–associates–Disease
CdGuD21.7%4.5Compound–downregulates–Gene–upregulates–Disease
CrCtD222.8%6.90.15Compound–resembles–Compound–treats–Disease
CtDrD217.2%5.80.13Compound–treats–Disease–resembles–Disease
CuGdD21.1%2.6Compound–upregulates–Gene–downregulates–Disease
CbGbCtD321.7%6.50.22Compound–binds–Gene–binds–Compound–treats–Disease
CbGeAlD38.4%5.20.04Compound–binds–Gene–expresses–Anatomy–localizes–Disease
CbGiGaD39.0%4.40.00Compound–binds–Gene–interacts–Gene–associates–Disease
CcSEcCtD314.0%6.80.08Compound–causes–Side Effect–causes–Compound–treats–Disease
CdGdCtD33.8%4.60.00Compound–downregulates–Gene–downregulates–Compound–treats–Disease
CdGuCtD3−2.1%2.4Compound–downregulates–Gene–upregulates–Compound–treats–Disease
CiPCiCtD323.3%7.50.16Compound–includes–Pharmacologic Class–includes–Compound–treats–Disease
CpDpCtD34.3%3.90.06Compound–palliates–Disease–palliates–Compound–treats–Disease
CrCrCtD317.0%5.00.12Compound–resembles–Compound–resembles–Compound–treats–Disease
CrCbGaD38.2%6.10.002Compound–resembles–Compound–binds–Gene–associates–Disease
CtDdGdD34.2%3.9Compound–treats–Disease–downregulates–Gene–downregulates–Disease
CtDdGuD30.5%1.0Compound–treats–Disease–downregulates–Gene–upregulates–Disease
CtDlAlD312.4%6.0Compound–treats–Disease–localizes–Anatomy–localizes–Disease
CtDpSpD313.9%6.1Compound–treats–Disease–presents–Symptom–presents–Disease
CtDuGdD30.7%1.3Compound–treats–Disease–upregulates–Gene–downregulates–Disease
CtDuGuD31.1%1.4Compound–treats–Disease–upregulates–Gene–upregulates–Disease
CuGdCtD3−1.6%2.9Compound–upregulates–Gene–downregulates–Compound–treats–Disease
CuGuCtD34.4%3.50.00Compound–upregulates–Gene–upregulates–Compound–treats–Disease
CbGiGiGaD47.0%5.10.00Compound–binds–Gene–interacts–Gene–interacts–Gene–associates–Disease
CbGpBPpGaD44.9%3.80.00Compound–binds–Gene–participates–Biological Process–participates–Gene–associates–Disease
CbGpPWpGaD47.6%7.90.05Compound–binds–Gene–participates–Pathway–participates–Gene–associates–Disease

Predictions of drug efficacy

We implemented a machine learning approach to translate the network connectivity between a compound and a disease into a probability of treatment (Himmelstein, 2016k; Himmelstein, 2017b). The approach relies on the 755 known treatments as positives and 29,044 non-treatments as negatives to train a logistic regression model. Note that 179,369 non-treatments were omitted as negative training observations because they had a prior probability of treatment equal to zero (see Materials and methods). The features consisted of a prior probability of treatment, node degrees for 14 metaedges, and DWPCs for 123 metapaths that were well suited for modeling. A cross-validated elastic net was used to minimize overfitting, yielding a model with 31 features (Figure 2B). The DWPC features with negative coefficients appear to be included as node-degree-capturing covariates, i.e. they reflect the general connectivity of the compound and disease rather than specific paths between them. However, the 11 DWPC features with non-negligible positive coefficients represent the most salient types of connectivity for systematically modeling drug efficacy. See the metapaths with positive coefficients in Table 3 for unabbreviated names. As an example, the CcSEcCtD feature assesses whether the compound causes the same side effects as compounds that treat the disease. Alternatively, the CbGeAlD feature assesses whether the compound binds to genes that are expressed in the anatomies affected by the disease.

We applied this model to predict the probability of treatment between each of 1538 connected compounds and each of 136 connected diseases, resulting in predictions for 209,168 compound–disease pairs (Himmelstein et al., 2016a), available at http://het.io/repurpose/. The 755 known disease-modifying indications were highly ranked (AUROC = 97.4%, Figure 3). The predictions also successfully prioritized two external validation sets: novel indications from DrugCentral (AUROC = 85.5%) and novel indications in clinical trial (AUROC = 70.0%). Together, these findings indicate that Project Rephetio has the ability to recognize efficacious compound–disease pairs.

Predictions performance on four indication sets.

We assess how well our predictions prioritize four sets of indications. (A) The y-axis labels denote the number of indications (+) and non-indications (−) composing each set. Violin plots with quartile lines show the distribution of indications when compound–disease pairs are ordered by their prediction. In all four cases, the actual indications were ranked highly by our predictions. (B) ROC Curves with AUROCs in the legend. (C) Precision–Recall Curves with AUPRCs in the legend.

https://doi.org/10.7554/eLife.26726.008

Predictions were scaled to the overall prevalence of treatments (0.36%). Hence a compound–disease pair that received a prediction of 1% represents a twofold enrichment over the null probability. Of the 3980 predictions with a probability exceeding 1%, 586 corresponded to known disease-modifying indications, leaving 3394 repurposing candidates. For a given compound or disease, we provide the percentile rank of each prediction. Therefore, users can assess whether a given prediction is a top prediction for the compound or disease. In addition, our table-based prediction browser links to a custom guide for each prediction, which displays in the Neo4j Hetionet Browser. Each guide includes a query to display the top paths supporting the prediction and lists clinical trials investigating the indication.

Nicotine dependence case study

There are currently two FDA-approved medications for smoking cessation (varenicline and bupropion) that are not nicotine replacement therapies. PharmacotherapyDB v1.0 lists varenicline as a disease-modifying indication and nicotine itself as a symptomatic indication for nicotine dependence, but is missing bupropion. Bupropion was first approved for depression in 1985. Owing to the serendipitous observation that it decreased smoking in depressed patients taking this drug, Bupropion was approved for smoking cessation in 1997 (Harmey et al., 2012). Therefore, we looked whether Project Rephetio could have predicted this repurposing. Bupropion was the ninth best prediction for nicotine dependence (99.5th percentile) with a probability 2.50-fold greater than the null. Figure 4 shows the top paths supporting the repurposing of bupropion.

Evidence supporting the repurposing of bupropion for smoking cessation.

This figure shows the 10 most supportive paths (out of 365 total) for treating nicotine dependence with bupropion, as available in this prediction’s Neo4j Browser guide. Our method detected that bupropion targets the CHRNA3 gene, which is also targeted by the known-treatment varenicline (Mihalak et al., 2006). Furthermore, CHRNA3 is associated with nicotine dependence (Thorgeirsson et al., 2008) and participates in several pathways that contain other nicotinic-acetylcholine-receptor (nAChR) genes associated with nicotine dependence. Finally, bupropion causes terminal insomnia (Boshier et al., 2003) as does varenicline (Hays et al., 2008), which could indicate an underlying common mechanism of action.

https://doi.org/10.7554/eLife.26726.009

Atop the nicotine dependence predictions were nicotine (10.97-fold over null), cytisine (10.58-fold), and galantamine (9.50-fold). Cytisine is widely used in Eastern Europe for smoking cessation due to its availability at a fraction of the cost of other pharmaceutical options (Cahill et al., 2016). In the last half decade, large-scale clinical trials have confirmed cytisine’s efficacy (West et al., 2011; Walker et al., 2014). Galantamine, an approved Alzheimer’s treatment, is currently in Phase 2 trial for smoking cessation and is showing promising results (Ashare et al., 2016). In summary, nicotine dependence illustrates Project Rephetio’s ability to predict efficacious treatments and prioritize historic and contemporary repurposing opportunities.

Epilepsy case study

Several factors make epilepsy an interesting disease for evaluating repurposing predictions (Khankhanian and Himmelstein, 2016). Antiepileptic drugs work by increasing the seizure threshold — the amount of electric stimulation that is required to induce seizure. The effect of a drug on the seizure threshold can be cheaply and reliably tested in rodent models. As a result, the viability of most approved drugs in treating epilepsy is known.

We focused our evaluation on the top 100 scoring compounds — referred to as the epilepsy predictions in this section — after discarding a single combination drug. We classified each compound as anti-ictogenic (seizure suppressing), unknown (no established effect on the seizure threshold), or ictogenic (seizure generating) according to medical literature (Khankhanian and Himmelstein, 2016). Of the top 100 epilepsy predictions, 77 were anti-ictogenic, eight were unknown, and 15 were ictogenic (Figure 5A). Notably, the predictions contained 23 of the 25 disease-modifying antiepileptics in PharamcotherapyDB v1.0.

Top 100 epilepsy predictions.

(A) Compounds — ranked from 1 to 100 by their predicted probability of treating epilepsy — are colored by their effect on seizures (Khankhanian and Himmelstein, 2016). The highest predictions are almost exclusively anti-ictogenic. Further down the prediction list, the prevalence of drugs with an ictogenic (contraindication) or unknown (novel repurposing candidate) effect on epilepsy increases. All compounds shown received probabilities far exceeding the null probability of treatment (0.36%). (B) A chemical similarity network of the epilepsy predictions, with each compound’s 2D structure (Himmelstein et al., 2017a). Edges are Compound–resembles–Compound relationships from Hetionet v1.0. Nodes are colored by their effect on seizures. (C) The relative contribution of important drug targets to each epilepsy prediction (Himmelstein et al., 2017a). Specifically, pie charts show how the eight most-supportive drug targets across all 100 epilepsy predictions contribute to individual predictions. Other Targets represents the aggregate contribution of all targets not listed. The network layout is identical to B.

https://doi.org/10.7554/eLife.26726.010

Many of the 77 anti-ictogenic compounds were not first-line antiepileptic drugs. Instead, they were used as ancillary drugs in the treatment of status epilepticus. For example, we predicted four halogenated ethers, two of which (isoflurane and desflurane) are used clinically to treat life-threatening seizures that persist despite treatment (Mirsattari et al., 2004). As inhaled anesthetics, these compounds are not appropriate as daily epilepsy medications, but are feasible for refractory status epilepticus where patients are intubated.

Given this high precision (77%), the eight compounds of unknown effect are promising repurposing candidates. For example, acamprosate — whose top prediction was epilepsy — is a taurine analog that promotes alcohol abstinence. Support for this repurposing arose from acamprosate’s inhibition of the glutamate receptor and positive modulation of the GABAA receptor (Figure 5C). If effective against epilepsy, acamprosate could serve a dual benefit for recovering alcoholics who experience seizures from alcohol withdrawal.

While certain classes of compounds were highly represented in our epilepsy predictions, such benzodiazepines and barbiturates, there was also considerable diversity (Khankhanian and Himmelstein, 2016). The 100 predicted compounds encompassed 26 third-level ATC codes (Knaus, 2016), such as antiarrhythmics (quinidine, classified as anti-ictogenic) and urologicals (phenazopyridine, classified as unknown). Furthermore, 25 of the compounds were chemically distinct, i.e. they did not resemble any of the other epilepsy predictions (Figure 5B).

Next, we investigated which components of Hetionet contributed to the epilepsy predictions (Khankhanian and Himmelstein, 2016). In total, 392,956 paths of 12 types supported the predictions. Using several different methods for grouping paths, we were able to quantify the aggregate biological evidence. Our algorithm primarily drew on two aspects of epilepsy: its known treatments (76% of the total support) and its genetic associations (22% of support). In contrast, our algorithm drew heavily on several aspects of the predicted compounds: their targeted genes (44%), their chemically similar compounds (30%), their pharmacologic classes, their palliative indications (5%), and their side effects (4%).

Specifically, 266,192 supporting paths originated with a Compound–binds–Gene relationship. Aggregating support by these genes shows the extent that 121 different drug targets contributed to the predictions (Khankhanian and Himmelstein, 2016). In order of importance, the predictions targeted GABAA receptors (15.3% of total support), cytochrome P450 enzymes (5.6%), the sodium channel (4.6%), glutamate receptors (3.8%), the calcium channel (2.7%), carbonic anhydrases (2.5%), cholinergic receptors (2.1%), and the potassium channel (1.4%). Besides cytochrome P450, which primarily influences pharmacokinetics (Johannessen and Landmark, 2010), our method detected and leveraged bonafide anti-ictogenic mechanisms (Rogawski and Löscher, 2004). Figure 5C shows drug target contributions per compound and illustrates the considerable mechanistic diversity among the predictions.

Also notable are the 15 ictogenic compounds in our top 100 predictions. Nine of the ictogenic compounds share a tricyclic structure (Figure 5B), five of which are tricyclic antidepressants. While the ictogenic mechanisms of these antidepressants are still unclear (Johannessen Landmark et al., 2016), Figure 5C suggests their anticholinergic effects may be responsible (Himmelstein, 2017d), in accordance with previous theories (Dailey and Naritoku, 1996).

We also ranked the contribution of the 1137 side effects that supported the epilepsy predictions through 117,720 CcSEcCtD paths. The top five side effects — ataxia (0.069% of total support), nystagmus (0.049%), diplopia (0.045%), somnolence (0.044%), and vomiting (0.043%) — reflect established adverse effects of antiepileptic drugs (Zadikoff et al., 2007; Wu and Thijs, 2015; ROFF HILTONHilton et al., 2004; Placidi et al., 2000; Jahromi et al., 2011). In summary, our method simultaneously identified the hallmark side effects of antiepileptic drugs while incorporating this knowledge to prioritize 1538 compounds for anti-ictogenic activity.

Discussion

We created Hetionet v1.0 by integrating 29 resources into a single data structure — the hetnet. Consisting of 11 types of nodes and 24 types of relationships, Hetionet v1.0 brings more types of information together than previous leading-studies in biological data integration (Gligorijević and Pržulj, 2015). Moreover, we strove to create a reusable, extensible, and property-rich network. While all the resources we include are publicly available, their integration was a time-intensive undertaking and required careful consideration of legal barriers to data reuse. Hetionet allows researchers to begin answering integrative questions without having to first spend months processing data.

Our public Neo4j instance allows users to immediately interact with Hetionet. Through the Cypher language, users can perform highly specialized graph queries with only a few lines of code. Queries can be executed in the web browser or programmatically from a language with a Neo4j driver. For users that are unfamiliar with Cypher, we include several example queries in a Browser guide. In contrast to traditional REST APIs, our public Neo4j instance provides users with maximal flexibility to construct custom queries by exposing the underlying database.

As data has grown more plentiful and diverse, so has the applicability of hetnets. Unfortunately, network science has been naturally fragmented by discipline resulting in relatively slow progress in integrating heterogeneous data. A 2014 analysis identified 78 studies using multilayer networks — a superset of hetnets (heterogeneous information networks) with the potential for additional dimensions, such as time. However, the studies relied on 26 different terms, 9 of which had multiple definitions (Kivela et al., 2014; Himmelstein et al., 2015b). Nonetheless, core infrastructure and algorithms for hetnets are emerging. Compared to the existing mathematical frameworks for multilayer networks that must deal with layers other than type (such as the aspect of time) (Kivela et al., 2014), the primary obligation of hetnet algorithms is to be type aware. One goal of our project has been to unite hetnet research across disciplines. We approached this goal by making Project Rephetio entirely available online and inviting community feedback throughout the process (Himmelstein et al., 2015c).

Integrating every resource into a single interconnected data structure allowed us to assess systematic mechanisms of drug efficacy. Using the max performing metapath to assess the pharmacological utility of a metaedge (Figure 2A), we can divide our relationships into tiers of informativeness. The top tier consists of the types of information traditionally considered by pharmacology: Compound–treats–Disease, Pharmacologic Class–includes–Compound, Compound–resembles–Compound, Disease–resembles–Disease, and Compound–binds–Gene. The upper-middle tier consists of types of information that have been the focus of substantial medical study, but have only recently started to play a bigger role in drug development, namely the metaedges Disease–associates–Gene, Compound–causes–Side Effect, Disease–presents–Symptom, Disease–localizes–Anatomy, and Gene–interacts–Gene.

The lower-middle tier contains the transcriptomics metaedges such as Compound–downregulates–Gene, Anatomy–expresses–Gene, Gene→regulates→Gene, and Disease–downregulates–Gene. Much excitement surrounds these resources due to their high-throughput and genome-wide scope, which offers a route to drug discovery that is less biased by existing knowledge. However, our findings suggest that these resources are only moderately informative of drug efficacy. Other lower-middle tier metaedges were the product of time-intensive biological experimentation, such as Gene–participates–Pathway, Gene–participates–Molecular Function, and Gene–participates–Biological Process. Unlike the top tier resources, this knowledge has historically been pursued for basic science rather than primarily medical applications. The weak yet appreciable performance of the Gene–covaries–Gene suggests the synergy between the fields of evolutionary genomics and disease biology. The lower tier included the Gene–participates–Cellular Component metaedge, which may reflect that the relevance of cellular location to pharmacology is highly case dependent and not amenable to systematic profiling.

The performance of specific metapaths (Table 3) provides further insight. For example, significant emphasis has been put on the use of transcriptional data for drug repurposing (Iorio et al., 2013). One common approach has been to identify compounds with opposing transcriptional signatures to a disease (Qu and Rajpal, 2012; Sirota et al., 2011). However, several systematic studies report underwhelming performance of this approach (Gottlieb et al., 2011; Cheng et al., 2014; Guney et al., 2016) — a finding supported by the low performance of the CuGdD and CdGuD metapaths in Project Rephetio. Nonetheless, other transcription-based methods showed some promise. Compounds with similar transcriptional signatures were prone to treating the same disease (CuGuCtD and CdGdCtD metapaths), while compounds with opposing transcriptional signatures were slightly averse to treating the same disease (CuGdCtD and CdGuCtD metapaths). In contrast, diseases with similar transcriptional profiles were not prone to treatment by the same compound (CtDdGuD and CtDuGdD).

By comparably assessing the informativeness of different metaedges and metapaths, Project Rephetio aims to guide future research towards promising data types and analyses. One caveat is that omics-scale experimental data will likely play a larger role in developing the next generation of pharmacotherapies. Hence, were performance reevaluated on treatments discovered in the forthcoming decades, the predictive ability of these data types may rise. Encouragingly, most data types were at least weakly informative and hence suitable for further study. Ideally, different data types would provide orthogonal information. However, our model for whether a compound treats a disease focused on 11 metapaths — a small portion of the hundreds of metapaths available. While parsimony aids interpretation, our model did not draw on the weakly-predictive high-throughput data types — which are intriguing for their novelty, scalability, and cost-effectiveness — as much as we had hypothesized.

Instead our model selected types of information traditionally considered in pharmacology. However, unlike a pharmacologist whose area of expertise may be limited to a few drug classes, our model was able to predict probabilities of treatment for all 209,168 compound–disease pairs. Furthermore, our model systematically learned the importance of each type of network connectivity. For any compound–disease pair, we now can immediately provide the top network paths supporting its therapeutic efficacy. A traditional pharmacologist may be able to produce a similar explanation, but likely not until spending substantial time researching the compound’s pharmacology, the disease’s pathophysiology, and the molecular relationships in between. Accordingly, we hope certain predictions will spur further research, such as trials to investigate the off-label use of acamprosate for epilepsy, which is supported by one animal model (Farook et al., 2008).

As demonstrated by the 15 ictogenic compounds in our top 100 epilepsy predictions, Project Rephetio’s predictions can include contraindications in addition to indications. Since many of Hetionet v1.0’s relationship types are general (e.g. the Compound–binds–Gene relationship type conflates antagonist with agonist effects), we expect some high scoring predictions to exacerbate rather than treat the disease. However, the predictions made by Hetionet v1.0 represent such substantial relative enrichment over the null that uncovering the correct directionality is a logical next step and worth undertaking. Going forward, advances in automated mining of the scientific literature could enable extraction of precise relationship types at omics scale (Ehrenberg et al., 2016; Himmelstein et al., 2016b).

Future research should focus on gleaning orthogonal information from data types that are so expansive that computational methods are the only option. Our CuGuCtD feature — measuring whether a compound upregulates the same genes as compounds which treat the disease — is a good example. This metapath was informative by itself (Δ AUROC = 4.4%) but was not selected by the model, despite its orthogonal origin (gene expression) to selected metapaths. Using a more extensive catalog of treatments as the gold standard would be one possible approach to increase the power of feature selection.

Integrating more types of information into Hetionet should also be a future priority. The ‘network effect’ phenomenon suggests that the addition of each new piece of information will enhance the value of Hetionet’s existing information. We envision a future where all biological knowledge is encoded into a single hetnet. Hetionet v1.0 was an early attempt, and we hope the strong performance of Project Rephetio in repurposing drugs foreshadows the many applications that will thrive from encoding biology in hetnets.

Materials and methods

Hetionet was built entirely from publicly available resources with the goal of integrating a broad diversity of information types of medical relevance, ranging in scale from molecular to organismal. Practical considerations such as data availability, licensing, reusability, documentation, throughput, and standardization informed our choice of resources. We abided by a simple litmus test for determining how to encode information in a hetnet: nodes represent nouns, relationships represent verbs (Chen, 1997; Himmelstein et al., 2016c).

Our method for relationship prediction creates a strong incentive to avoid redundancy, which increases the computational burden without improving performance. In a previous study to predict disease–gene associations using a hetnet of pathophysiology (Himmelstein and Baranzini, 2015a), we found that different types of gene sets contributed highly redundant information. Therefore, in Hetionet v1.0, we reduced the number of gene set node types from 14 to 3 by omitting several gene set collections and aggregating all pathway nodes.

Nodes

Request a detailed protocol

Nodes encode entities. We extracted nodes from standard terminologies, which provide curated vocabularies to enable data integration and prevent concept duplication. The ease of mapping external vocabularies, adoption, and comprehensiveness were primary selection criteria. Hetionet v1.0 includes nodes from five ontologies — which provide hierarchy of entities for a specific domain — selected for their conformity to current best practices (Malone et al., 2016).

We selected 137 terms from the Disease Ontology (Schriml et al., 2012; Kibbe et al., 2015) (which we refer to as DO Slim (Himmelstein and Li, 2015d; Himmelstein, 2016g)) as our disease set. Our goal was to identify complex diseases that are distinct and specific enough to be clinically relevant yet general enough to be well annotated. To this end, we included diseases that have been studied by GWAS and cancer types from TopNodes_DOcancerslim (Wu et al., 2015). We ensured that no DO Slim disease was a subtype of another DO Slim disease. Symptoms were extracted from MeSH by taking the 438 descendants of Signs and Symptoms (Himmelstein and Pankov, 2015a; Himmelstein, 2016h).

Approved small molecule compounds with documented chemical structures were extracted from DrugBank version 4.2 (Law et al., 2014; Himmelstein, 2015b; Himmelstein, 2016i). Unapproved compounds were excluded because our focus was repurposing. In addition, unapproved compounds tend to be less studied than approved compounds making them less attractive for our approach where robust network connectivity is critical. Finally, restricting to small molecules with known documented structures enabled us to map between compound vocabularies (see Mappings).

Side effects were extracted from SIDER version 4.1 (Kuhn et al., 2016; Himmelstein, 2015c; Himmelstein, 2016j). SIDER codes side effects using UMLS identifiers (Bodenreider, 2004), which we also adopted. Pharmacologic Classes were extracted from the DrugCentral data repository (Ursu et al., 2017; Himmelstein et al., 2016d). Only pharmacologic classes corresponding to the ‘Chemical/Ingredient’, ‘Mechanism of Action’, and ‘Physiologic Effect’ FDA class types were included to avoid pharmacologic classes that were synonymous with indications (Himmelstein et al., 2016d).

Protein-coding human genes were extracted from Entrez Gene (Maglott et al., 2011; Himmelstein et al., 2015h; Himmelstein, 2016l). Anatomical structures, which we refer to as anatomies, were extracted from Uberon (Mungall et al., 2012). We selected a subset of 402 Uberon terms by excluding terms known not to exist in humans and terms that were overly broad or arcane (Malladi et al., 2015; Himmelstein, 2016m).

Pathways were extracted by combining human pathways from WikiPathways (Kutmon et al., 2016; Pico et al., 2008), Reactome (Fabregat et al., 2016), and the Pathway Interaction Database (Schaefer et al., 2009). The latter two resources were retrieved from Pathway Commons (RRID:SCR_002103) (Cerami et al., 2011), which compiles pathways from several providers. Duplicate pathways and pathways without multiple participating genes were removed (Pico and Himmelstein, 2015; Himmelstein and Pico, 2016a). Biological processes, cellular components, and molecular functions were extracted from the Gene Ontology (Ashburner et al., 2000). Only terms with 2–1000 annotated genes were included.

Mappings

Request a detailed protocol

Before adding relationships, all identifiers needed to be converted into the vocabularies matching that of our nodes. Oftentimes, our node vocabularies included external mappings. For example, the Disease Ontology includes mappings to MeSH, UMLS, and the ICD, several of which we submitted during the course of this study (Himmelstein, 2015e). In a few cases, the only option was to map using gene symbols, a disfavored method given that it can lead to ambiguities.

When mapping external disease concepts onto DO Slim, we used transitive closure. For example, the UMLS concept for primary progressive multiple sclerosis (C0751964) was mapped to the DO Slim term for multiple sclerosis (DOID:2377).

Chemical vocabularies presented the greatest mapping challenge (Himmelstein, 2015b), since these are poorly standardized (Hersey et al., 2015). UniChem’s (Chambers et al., 2013) Connectivity Search (Chambers et al., 2014) was used to map compounds, which maps by atomic connectivity (based on First InChIKey Hash Blocks (Heller et al., 2013)) and ignores small molecular differences.

Edges

Request a detailed protocol

Anatomy–downregulates–Gene and Anatomy–upregulates–Gene edges (Himmelstein et al., 2016f; Himmelstein and Bastian, 2015e; Himmelstein and Bastian, 2015f) were extracted from Bgee (Bastian et al., 2008), which computes differentially expressed genes by anatomy in post-juvenile adult humans. Anatomy–expresses–Gene edges were extracted from Bgee and TISSUES (Santos et al., 2015; Himmelstein and Jensen, 2015g; Himmelstein and Jensen, 2015h).

Compound–binds–Gene edges were aggregated from BindingDB (Chen et al., 2001; Gilson et al., 2016), DrugBank (Law et al., 2014; Wishart et al., 2006), and DrugCentral (Ursu et al., 2017). Only binding relationships to single proteins with affinities of at least 1 μM (as determined by Kd, Ki, or IC₅₀) were selected from the October 2015 release of BindingDB (Himmelstein and Gilson, 2015i; Himmelstein et al., 2015d). Target, carrier, transporter, and enzyme interactions with single proteins (i.e. excluding protein groups) were extracted from DrugBank 4.2 (Himmelstein, 2016i; Himmelstein and Protein, 2015j). In addition, all mapping DrugCentral target relationships were included (Himmelstein et al., 2016d).

Compound–treats–Disease (disease-modifying indications) and Compound–palliates–Disease (symptomatic indications) edges are from PharmacotherapyDB as described in Intermediate resources. Compound–causes–Side Effect edges were obtained from SIDER 4.1 (Kuhn et al., 2016; Himmelstein, 2015c; Himmelstein, 2016j), which uses natural language processing to identify side effects in drug labels. Compound–resembles–Compound relationships (Himmelstein, 2016i; Himmelstein and Chen, 2015k; Himmelstein et al., 2015q) represent chemical similarity and correspond to a Dice coefficient ≥0.5 (Dice, 1945) between extended connectivity fingerprints (Rogers and Hahn, 2010; Morgan, 1965). Pharmacologic Class–includes–Compound edges were extracted from DrugCentral for three FDA class types (Ursu et al., 2017; Himmelstein et al., 2016d). Compound–downregulates–Gene and Compound–upregulates–Gene relationships were computed from LINCS L1000 as described in Intermediate resources.

Disease–associates–Gene edges were extracted from the GWAS Catalog (Himmelstein and Baranzini, 2016b), DISEASES (Himmelstein and Jensen, 2015l; Himmelstein and Jensen, 2016c), DisGeNET (Himmelstein, 2015f; Himmelstein and Piñero, 2016d), and DOAF (Himmelstein, 2015g; Himmelstein, 2016s). The GWAS Catalog compiles disease–SNP associations from published GWAS (MacArthur et al., 2017). We aggregated overlapping loci associated with each disease and identified the mode reported gene for each high confidence locus (Himmelstein, 2015h; Himmelstein et al., 2015v). DISEASES integrates evidence of association from text mining, curated catalogs, and experimental data (Pletscher-Frankild et al., 2015). Associations from DISEASES with integrated scores ≥ 2 were included after removing the contribution of DistiLD. DisGeNET integrates evidence from over 10 sources and reports a single score for each association (Piñero et al., 2015; Piñero et al., 2017). Associations with scores ≥ 0.06 were included. DOAF mines Entrez Gene GeneRIFs (textual annotations of gene function) for disease mentions (Xu et al., 2012). Associations with three or more supporting GeneRIFs were included. Disease–downregulates–Gene and Disease–upregulates–Gene relationships (Himmelstein et al., 2015a; Himmelstein et al., 2016j) were computed using STARGEO as described in Intermediate resources.

Disease–localizes–Anatomy, Disease–presents–Symptom, and Disease–resembles–Disease edges were calculated from MEDLINE co-occurrence (Himmelstein and Pankov, 2015a; Himmelstein, 2016u). MEDLINE is a subset of 21 million PubMed articles for which designated human curators have assigned topics. When retrieving articles for a given topic (MeSH term), we activated two non-default search options as specified below: majr for selecting only articles where the topic is major and noexp for suppressing explosion (returning articles linked to MeSH subterms). We identified 4,161,769 articles with two or more disease topics; 696,252 articles with both a disease topic (majr) and an anatomy topic (noexp) (Himmelstein, 2015i); and 363,928 articles with both a disease topic (majr) and a symptom topic (noexp). We used a Fisher’s exact test (Fisher, 1922) to identify pairs of terms that occurred together more than would be expected by chance in their respective corpus. We included co-occurring terms with p<0.005 in Hetionet v1.0.

Gene→regulates→Gene directed edges were generated from the LINCS L1000 genetic interference screens (see Intermediate resources) and indicate that knockdown or overexpression of the source gene significantly dysregulated the target gene (Himmelstein and Chung, 2015q; Himmelstein et al., 2016k). Gene–covaries–Gene edges represent evolutionary rate covariation ≥0.75 (Priedigkeit et al., 2015; Himmelstein and Partha, 2015r; Himmelstein, 2016w). Gene–interacts–Gene edges (Himmelstein et al., 2015z; Himmelstein and Baranzini, 2016e) represent when two genes produce physically interacting proteins. We compiled these interactions from the Human Interactome Database (Rual et al., 2005; Venkatesan et al., 2009; Yu et al., 2011; Rolland et al., 2014), the Incomplete Interactome (Menche et al., 2015), and our previous study (Himmelstein and Baranzini, 2015a). Gene–participates–Biological Process, Gene–participates–Cellular Component, and Gene–participates–Molecular Function edges are from Gene Ontology annotations (Huntley et al., 2015). As described in Intermediate resources, annotations were propagated (Himmelstein et al., 2015g; Himmelstein et al., 2015f). Gene–participates–Pathway edges were included from the human pathway resources described in the Nodes section (Pico and Himmelstein, 2015; Himmelstein and Pico, 2016a).

Directionality

Request a detailed protocol

Whether a certain type of relationship has directionality is defined at the metaedge level. Directed metaedges are only necessary when they connect a metanode to itself and correspond to an asymmetric relationship. In the case of Hetionet v1.0, the sole directed metaedge was Gene→regulates→Gene. To demonstrate the implications of directionality, Hetionet v1.0 contains two relationships between the genes HADH and STAT1: HADH–interacts–STAT1 and HADH→regulates→STAT1. Both edges can be represented in the inverse orientation: STAT1–interacts–HADH and STAT1←regulates←HADH. However due to directed nature of the regulates relationship, STAT1→regulates→HADH is a distinct edge, which does not exist in the network. Similarly, HADH–associates–obesity and obesity–associates–HADH are inverse orientations of the same underlying undirected relationship. Accordingly, the following path exists in the network: obesity–associates–HADH→regulates→STAT1, which can also be inverted to STAT1←regulates←HADH–associates–obesity.

Intermediate resources

Request a detailed protocol

In the process of creating Hetionet, we produced several datasets with broad applicability that extended beyond Project Rephetio. These resources are referred to as intermediate resources and described below.

Transcriptional signatures of disease using STARGEO

Request a detailed protocol

STARGEO is a nascent platform for annotating and meta-analyzing differential gene expression experiments (Hadley et al., 2017). The STAR acronym stands for Search-Tag-Analyze Resources, while GEO refers to the Gene Expression Omnibus (Edgar et al., 2002; Barrett et al., 20122013). STARGEO is a layer on top of GEO that crowdsources sample annotation and automates meta-analysis.

Using STARGEO, we computed differentially expressed genes between healthy and diseased samples for 49 diseases (Himmelstein et al., 2015a; Himmelstein et al., 2016j). First, we and others created case/control tags for 66 diseases. After combing through GEO series and tagging samples, 49 diseases had sufficient data for case-control meta-analysis: multiple series with at least three cases and three controls. For each disease, we performed a random effects meta-analysis on each gene to combine log₂ fold-change across series. These analyses incorporated 27,019 unique samples from 460 series on 107 platforms.

Differentially expressed genes (false discovery rate ≤0.05) were identified for each disease. The median number of upregulated genes per disease was 351 and the median number of downregulated genes was 340. Endogenous depression was the only of the 49 diseases without any significantly dysregulated genes.

Transcriptional signatures of perturbation from LINCS L1000

Request a detailed protocol

LINCS L1000 profiled the transcriptional response to small molecule and genetic interference perturbations. To increase throughput, expression was only measured for 978 genes, which were selected for their ability to impute expression of the remaining genes. A single perturbation was often assayed under a variety of conditions including cell types, dosages, timepoints, and concentrations. Each condition generates a single signature of dysregulation z-scores. We further processed these signatures to fit into our approach (Himmelstein et al., 2016m; Himmelstein et al., 2016n).

First, we computed consensus signatures — which meta-analyze multiple signatures to condense them into one — for DrugBank small molecules, Entrez genes, and all L1000 perturbations (Himmelstein and Chung, 2015q; Himmelstein et al., 2016k). First, we discarded non-gold (non-replicating or indistinct) signatures. Then, we meta-analyzed z-scores using Stouffer’s method. Each signature was weighted by its average Spearman’s correlation to other signatures, with a 0.05 minimum, to de-emphasize discordant signatures. Our signatures include the 978 measured genes and the 6489 imputed genes from the ‘best inferred gene subset’. To identify significantly dysregulated genes, we selected genes using a Bonferroni cutoff of p=0.05 and limited the number of imputed genes to 1000.

The consensus signatures for genetic perturbations allowed us to assess various characteristics of the L1000 dataset. First, we looked at whether genetic interference dysregulated its target gene in the expected direction (Himmelstein, 2016c). Looking at measured z-scores for target genes, we found that the knockdown perturbations were highly reliable, while the overexpression perturbations were only moderately reliable with 36% of overexpression perturbations downregulating their target. However, imputed z-scores for target genes barely exceeded chance at responding in the expected direction to interference. Hence, we concluded that the imputation quality of LINCS L1000 is poor. However, when restricting to significantly dyseregulated targets, 22 out of 29 imputed genes responded in the expected direction. This provides some evidence that the directional fidelity of imputation is higher for significantly dysregulated genes. Finally, we found that the transcriptional signatures of knocking down and overexpressing the same gene were positively correlated 65% of the time, suggesting the presence of a general stress response (Himmelstein et al., 2016o).

Based on these findings, we performed additional filtering of signifcantly dysregulated genes when building Hetionet v1.0. Compound–down/up-regulates–Gene relationships were restricted to the 125 most significant per compound-direction-status combination (status refers to measured versus imputed). For genetic interference perturbations, we restricted to the 50 most significant genes per gene-direction-status combination and merged the remaining edges into a single Gene→regulates→Gene relationship type containing both knockdown and overexpression perturbations.

PharmacotherapyDB: physician curated indications

Request a detailed protocol

We created PharmacotherapyDB, an open catalog of drug therapies for disease (Himmelstein, 2016a; Himmelstein et al., 2016p; Himmelstein et al., 2016q). Version 1.0 contains 755 disease-modifying therapies and 390 symptomatic therapies between 97 diseases and 601 compounds.

This resource was motivated by the need for a gold standard of medical indications to train and evaluate our approach. Initially, we identified four existing indication catalogs (Himmelstein et al., 2015e): MEDI-HPS which mined indications from RxNorm, SIDER 2, MedlinePlus, and Wikipedia (Wei et al., 2013); LabeledIn which extracted indications from drug labels via human curation (Khare et al., 2014; Khare et al., 2015; Himmelstein and Khare, 2015s); EHRLink which identified medication–problem pairs that clinicians linked together in electronic health records (McCoy et al., 2012; Himmelstein, 2015j); and indications from PREDICT, which were compiled from UMLS relationships, drugs.com, and drug labels (Gottlieb et al., 2011). After mapping to DO Slim and DrugBank Slim, the four resources contained 1388 distinct indications.

However, we noticed that many indications were palliative and hence problematic as a gold standard of pharmacotherapy for our in silico approach. Therefore, we recruited two practicing physicians to curate the 1388 preliminary indications (Himmelstein et al., 2015j). After a pilot on 50 indications, we defined three classifications: disease modifying meaning a drug that therapeutically changes the underlying or downstream biology of the disease; symptomatic meaning a drug that treats a significant symptom of the disease; and non-indication meaning a drug that neither therapeutically changes the underlying or downstream biology nor treats a significant symptom of the disease. Both curators independently classified all 1388 indications.

The two curators disagreed on 444 calls (Cohen’s κ = 49.9%). We then recruited a third practicing physician, who reviewed all 1388 calls and created a detailed explanation of his methodology (Himmelstein et al., 2015j). We proceeded with the third curator’s calls as the consensus curation. The first two curators did have reservations with classifying steroids as disease modifying for autoimmune diseases. We ultimately considered that these indications met our definition of disease modifying, which is based on a pathophysiological rather than clinical standard. Accordingly, therapies we consider disease modifying may not be used to alter long-term disease course in the modern clinic due to a poor risk–benefit ratio.

User-friendly gene ontology annotations

Request a detailed protocol

We created a browser (http://git.dhimmel.com/gene-ontology/) to provide straightforward access to Gene Ontology annotations (Himmelstein et al., 2015g; Himmelstein et al., 2015f). Our service provides annotations between Gene Ontology terms and Entrez Genes. The user chooses propagated/direct annotation and all/experimental evidence. Annotations are currently available for 37 species and downloadable as user-friendly TSV files.

Data copyright and licensing

Request a detailed protocol

We committed to openly releasing our data and analyses from the origin of the project (Spaulding et al., 2015). Our goals were to contribute to the advancement of science (Hrynaszkiewicz, 2011; Molloy, 2011), maximize our impact (McKiernan et al., 2016; Piwowar and Vision, 2013), and enable reproducibility (Stodden et al., 2016; Stodden and Miguez, 2014; Baggerly, 2010). These objectives required publicly distributing and openly licensing Hetionet and Project Rephetio data and analyses (Hrynaszkiewicz and Cockerill, 2012; Hagedorn et al., 2011).

Since we integrated only public resources, which were overwhelmingly funded by academic grants, we had assumed that our project and open sharing of our network would not be an issue. However, upon releasing a preliminary version of Hetionet (Himmelstein and Jensen, 2015u), a community reviewer informed us of legal barriers to integrating public data. In essence, both copyright (rights of exclusivity automatically granted to original works) and terms of use (rules that users must agree to in order to use a resource) place legally binding restrictions on data reuse. In short, public data is not by default open data.

Hetionet v1.0 integrates 29 resources (Table 4), but two resources were removed prior to the v1.0 release. Of the total 31 resources (Himmelstein et al., 2015i), 5 were United States government works not subject to copyright, and 12 had licenses that met the Open Definition of knowledge version 2.1. Four resources allowed only non-commercial reuse. Most problematic were the remaining nine resources that had no license — which equates to all rights reserved by default and forbids reuse (Oxenham, 2016) — and one resource that explicitly forbid redistribution.

Table 4
The 29 public data resources integrated to construct Hetionet v1.0.

Components notes which types of nodes and edges in Hetionet v1.0 derived from the resource (as per the abbreviations in Table 1 and 2). Cat. notes the general category of license (Himmelstein et al., 2015i). Category 1 refers to United States government works that we deemed were not subject to copyright. Category 2 refers to resources with licenses that allow use, redistribution, and modification (although some restrictions may still exist). The subset of category 2 licenses that we deemed to meet the the Open Definition are denoted with OD. Category 4 refers to resources without a license, hence with all rights reserved. References provides Research Resource Identifiers as well as citations to resource publications and related Project Rephetio materials. For information on license provenance, institutional affiliations, and funding for each resource, see the online table.

https://doi.org/10.7554/eLife.26726.011
ResourceComponentsLicenseCat.References
Entrez GeneGcustom1RRID:SCR_002473 (Maglott et al., 2011; Himmelstein et al., 2015h; Himmelstein, 2016l)
LabeledInCtD, CpDcustom1RRID:SCR_015667 (Khare et al., 2014; Khare et al., 2015; Himmelstein and Khare, 2015s)
MEDLINEDlA, DpS, DrDcustom1RRID:SCR_002185 (Himmelstein and Pankov, 2015a; Himmelstein, 2016u)
MeSHScustom1RRID:SCR_004750 (Himmelstein and Pankov, 2015a; Himmelstein, 2016h)
Pathway Interaction DatabasePW, GpPW1RRID:SCR_006866 (Schaefer et al., 2009; Pico and Himmelstein, 2015; Himmelstein and Pico, 2016a)
Disease OntologyDCC BY 3.02ODRRID:SCR_000476 (Schriml et al., 2012; Kibbe et al., 2015; Himmelstein and Li, 2015d; Himmelstein, 2016g)
DISEASESDaGCC BY 4.02ODRRID:SCR_015664 (Himmelstein and Jensen, 2015l; Himmelstein and Jensen, 2016c; Pletscher-Frankild et al., 2015)
DrugCentralPC, CbG, PCiCCC BY 4.02ODRRID:SCR_015663 (Ursu et al., 2017; Himmelstein et al., 2016d)
Gene OntologyBP, CC, MF, GpBP, GpCC, GpMFCC BY 4.02ODRRID:SCR_002811 (Ashburner et al., 2000; Huntley et al., 2015; Himmelstein et al., 2015g; Himmelstein et al., 2015f)
GWAS CatalogDaGcustom2ODRRID:SCR_012745 (Himmelstein and Baranzini, 2016b; MacArthur et al., 2017; Himmelstein, 2015h; Himmelstein et al., 2015v)
ReactomePW, GpPWcustom2ODRRID:SCR_003485 (Fabregat et al., 2016; Cerami et al., 2011; Pico and Himmelstein, 2015; Himmelstein and Pico, 2016a)
LINCS L1000CdG, CuG, Gr > Gcustom2OD(Himmelstein and Chung, 2015q; Himmelstein et al., 2016k; Himmelstein, 2015k)
TISSUESAeGCC BY 4.02ODRRID:SCR_015665 (Santos et al., 2015; Himmelstein and Jensen, 2015g; Himmelstein and Jensen, 2015h)
UberonACC BY 3.02ODRRID:SCR_010668 (Mungall et al., 2012; Malladi et al., 2015; Himmelstein, 2016m)
WikiPathwaysPW, GpPWCC BY 3.0/custom2ODRRID:SCR_002134 (Kutmon et al., 2016; Pico et al., 2008; Pico and Himmelstein, 2015; Himmelstein and Pico, 2016a)
BindingDBCbGmixed CC BY 3.0 and CC BY-SA 3.02ODRRID:SCR_000390 (Chen et al., 2001; Gilson et al., 2016; Himmelstein and Gilson, 2015i; Himmelstein et al., 2015d)
DisGeNETDaGODbL2ODRRID:SCR_006178 (Himmelstein, 2015f; Himmelstein and Piñero, 2016d; Piñero et al., 2015; Piñero et al., 2017)
DrugBankC, CbG, CrCcustom2RRID:SCR_002700 (Law et al., 2014; Himmelstein, 2015b; Himmelstein, 2016i; Himmelstein et al., 2016r)
MEDICtD, CpDCC BY-NC-SA 3.02RRID:SCR_015668 (Himmelstein et al., 2015e; Wei et al., 2013)
PREDICTCtD, CpDCC BY-NC-SA 3.02(Gottlieb et al., 2011; Himmelstein et al., 2015e)
SIDERSE, CcSECC BY-NC-SA 4.02RRID:SCR_004321 (Kuhn et al., 2016; Himmelstein, 2015c; Himmelstein, 2016j)
BgeeAeG, AdG, AuG4RRID:SCR_002028 (Himmelstein et al., 2016f; Himmelstein and Bastian, 2015e; Himmelstein and Bastian, 2015f; Bastian et al., 2008)
DOAFDaG4RRID:SCR_015666 (Himmelstein, 2015g; Himmelstein, 2016s; Xu et al., 2012)
ehrlinkCtD, CpD4(McCoy et al., 2012; Himmelstein, 2015j)
Evolutionary Rate CovariationGcG4RRID:SCR_015669 (Priedigkeit et al., 2015; Himmelstein and Partha, 2015r; Himmelstein, 2016w)
hetio-dagGiG4(Himmelstein and Baranzini, 2015a; Himmelstein et al., 2015z; Himmelstein and Baranzini, 2016e)
Incomplete InteractomeGiG4(Himmelstein et al., 2015z; Himmelstein and Baranzini, 2016e; Menche et al., 2015; Himmelstein, 2015a)
Human Interactome DatabaseGiG4RRID:SCR_015670 (Himmelstein et al., 2015z; Himmelstein and Baranzini, 2016e; Rual et al., 2005; Venkatesan et al., 2009; Yu et al., 2011; Rolland et al., 2014)
STARGEODdG, DuG4(Himmelstein et al., 2015a; Himmelstein et al., 2016j; Hadley et al., 2017)

Additional difficulty resulted from license incompatibles across resources, which was caused primarily by non-commercial and share-alike stipulations. Furthermore, it was often unclear who owned the data (Elliott, 2005). Therefore, we sought input from legal experts and chronicled our progress (Himmelstein et al., 2015i; Himmelstein, 2015k; Himmelstein et al., 2016r; Himmelstein, 2015a; Himmelstein, 2015d).

Ultimately, we did not find an ideal solution. We had to choose between absolute compliance and Hetionet: strictly adhering to copyright and licensing arrangements would have decimated the network. On the other hand, in the United States, mere facts are not subject to copyright, and fair use doctrine helps protect reuse that is transformative and educational. Hence, we choose a path forward which balanced legal, normative, ethical, and scientific considerations.

If a resource was in the public domain, we licensed any derivatives as CC0 1.0. For resources licensed to allow reuse, redistribution, and modification, we transmitted their licenses as properties on the specific nodes and relationships in Hetionet v1.0. For all other resources — for example, resources without licenses or with licenses that forbid redistribution — we sent permission requests to their creators. The median time till first response to our permission requests was 16 days, with only two resources affirmatively granting us permission. We did not receive any responses asking us to remove a resource. However, we did voluntarily remove MSigDB (Liberzon et al., 2011), since its license was highly problematic (Himmelstein, 2015d). As a result of our experience, we recommend that publicly funded data should be explicitly dedicated to the public domain whenever possible.

Permuted hetnets

Request a detailed protocol

From Hetionet, we derived five permuted hetnets (Himmelstein, 2016b). The permutations preserve node degree but eliminate edge specificity by employing an algorithm called XSwap to randomly swap edges (Hanhijärvi et al., 2009). To extend XSwap to hetnets (Himmelstein and Baranzini, 2015a), we permuted each metaedge separately, so that edges were only swapped with other edges of the same type. We adopted a Markov chain approach, whereby the first permuted hetnet was generated from Hetionet v1.0, the second permuted hetnet was generated from the first, and so on. For each metaedge, we assessed the percent of edges unchanged as the algorithm progressed to ensure that a sufficient number of swaps had been performed to randomize the network (Himmelstein, 2016b). Permuted hetnets are useful for computing the baseline performance of meaningless edges while preserving node degree (Himmelstein, 2015l). Since, our use of permutation focused on assessing Δ AUROC, a small number of permuted hetnets was sufficient, as the variability in a metapath’s AUROC across the permuted hetnets was low.

Graph databases and Neo4j

Request a detailed protocol

Traditional relational databases — such as SQLite, MySQL, and PostgreSQL — excel at storing highly structured data in tables. Connectivity between tables is accomplished using foreign-key references between columns. However, for many biomedical applications the connectivity between entities is of foremost importance. Furthermore, enforcing a rigid structure of what attributes an entity may possess is less important and often unnecessarily prohibitive. Graph databases focus instead on capturing connectivity (relationships) between entities (nodes). Accordingly, graph databases such as Neo4j offer greater ease when modeling biomedical relationships and superior performance when traversing many levels of connectivity (Yoon et al., 2017; Jaiswal, 2013). Until recently, graph database adoption in bioinformatics was limited (Have and Jensen, 2013). However lately, the demand to model and capture biological connectivity at scale has led to increasing adoption (Lysenko et al., 2016; Balaur et al., 2016; Summer et al., 2016; Mungall et al., 2017).

We used the Neo4j graph database for storing and operating on Hetionet and noticed major benefits from tapping into this large open source ecosystem (Himmelstein, 2015m). Persistent storage with immediate access and the Cypher query language — a sort of SQL for hetnets — were two of the biggest benefits. To facilitate our migration to Neo4j, we updated hetio — our existing Python package for hetnets (Himmelstein, 2016g) — to export networks into Neo4j and DWPC queries to Cypher. In addition, we created an interactive GraphGist for Project Rephetio, which introduces our approach and showcases its Cypher queries. Finally, we created a public Neo4j instance (Himmelstein, 2016i), which leverages several modern technologies such Neo4j Browser guides, cloud hosting with HTTPS, and Docker deployment (Belmann et al., 2015; Beaulieu-Jones and Greene, 2017).

Machine learning approach

Request a detailed protocol

Project Rephetio relied on the previously published DWPC metric to generate features for compound–disease pairs. The DWPC measures the prevalence of a given metapath between a given source and target node (Himmelstein and Baranzini, 2015a). It is calculated by first extracting all paths from the source to target node that follow the specified metapath. Next, each path is weighted by taking the product of the node degrees along the path raised to a negative exponent. This damping exponent — the sole parameter — thereby determines the extent that paths through high-degree nodes are downweighted: we chose w = 0.4 based on our past optimizations (Himmelstein and Baranzini, 2015a). The DWPC equals the sum of the path weights (referred to as path-degree products). Traversing the hetnet to extract all paths between a source and target node, which we performed in Neo4j, is the most computationally intensive step in computing DWPCs (Himmelstein and Lizee, 2016t). For future work, we are exploring matrix multiplication approaches, which could improve runtime several orders of magnitude.

Project Rephetio made several refinements to metapath-based hetnet edge prediction compared to previous studies (Himmelstein and Baranzini, 2015a; Sun et al., 2011). First, we transformed DWPCs by mean scaling and then taking the inverse hyperbolic sine (Burbidge et al., 1988) to make them more amenable to modeling (Himmelstein et al., 2016s). Second, we bifurcated the workflow into an all-features stage and an all-observations stage (Himmelstein, 2016k). The all-features stage assesses feature performance and does not require computing features for all negatives. Here, we selected a random subset of 3020 (4 × 755) negatives. Little error was introduced by this optimization, since the predominant limitation to performance assessment was the small number of positives (755) rather than negatives. Based on the all-features performance assessment (Himmelstein, 2015n), we selected 142 DWPCs to compute on all observations (all 209,168 compound–disease pairs). The feature selection was designed to remove uninformative features (according to permutation) and guard against edge-dropout contamination (Himmelstein, 2016h). Third, we included 14 degree features, which assess the degree of a specific metaedge for either the source compound or target disease.

Network support of predictions

Request a detailed protocol

To improve the interpretability of the predictions, we developed a method for decomposing a prediction into its network support (Himmelstein, 2016e). This information is deployed to our Neo4j Browser guides, allowing users to assess the biomedical evidence contributing to a given prediction. First, we used logistic regression terms to quantify the contribution of metapaths that positively support a prediction. Second, we decomposed a metapath’s contribution, according to its DWPC, into specific paths contributions. Finally, we aggregated paths based on their source (first) or target (last) edge to quantify the contribution of specific edges of the source compound or target disease (Himmelstein, 2016f).

Using the acamprosate–epilepsy prediction as an example, we first quantified metapath contributions: 40% of the prediction was supported by CbGbCtD paths, 36% by CbGaD paths, 11% by CcSEcCtD paths, 8% by CbGpPWpGaD paths, and 5% by CbGeAlD paths. Second, we calculated path contributions: Acamprosate–binds–GRM5–associates–epilepsy syndrome was the most supportive path, contributing 11% of the prediction. Finally, we aggregated path contributions to calculate that the source edge of Acamprosate—binds—GRM5 contributed 23% of the prediction, while the target edge of epilepsy syndrome–treats–Felbamate contributed 12%.

Prior probability of treatment

Request a detailed protocol

The 755 treatments in Hetionet v1.0 are not evenly distributed between all compounds and diseases. For example, methotrexate treats 19 diseases and hypertension is treated by 68 compounds. We estimated a prior probability of treatment — based only on the treatment degree of the source compound and target disease — on 744,975 permutations of the bipartite treatment network (Lizee and Himmelstein, 2016a). Methotrexate received a 79.6% prior probability of treating hypertension, whereas a compound and disease that both had only one treatment received a prior of 0.12%.

Across the 209,168 compound–disease pairs, the prior predicted the known treatments with AUROC = 97.9%. The strength of this association threatened to dominate our predictions. However, not modeling the prior can lead to omitted-variable bias and confounded proxy variables. To address the issue, we included the logit-transformed prior, without any regularization, as a term in the model. This restricted model fitting to the 29,799 observations with a nonzero prior — corresponding to the 387 compounds and 77 diseases with at least one treatment. To enable predictions for all 209,168 observations, we set the prior for each compound–disease pair to the overall prevalence of positives (0.36%).

This method succeeded at accommodating the treatment degrees. The prior probabilities performed poorly on the validation sets with AUROC = 54.1% on DrugCentral indications and AUROC = 62.5% on clinical trials. This performance dropoff compared to training shows the danger of encoding treatment degree into predictions. The benefits of our solution are highlighted by the superior validation performance of our predictions compared to the prior (Figure 3).

Indication sets

Request a detailed protocol

We evaluated our predictions on four sets of indications as shown in Figure 3.

  • Disease Modifying — the 755 disease-modifying treatments in PharmacotherapyDB v1.0. These indications are included in the hetnet as treats edges and used to train the logistic regression model. Due to edge dropout contamination and self-testing (Himmelstein, 2016h; Lizee and Himmelstein, 2016b), overfitting could potentially inflate performance on this set. Therefore, for the three remaining indication sets, we removed any observations that were positives in this set.

  • DrugCentral — We discovered the DrugCentral database after completing our physician curation for PharmacotherapyDB. This database contained 210 additional indications (Himmelstein et al., 2016d). While we didn’t curate these indications, we observed a high proportion of disease-modifying therapy.

  • Clinical Trial — We compiled indications that have been investigated by clinical trial from ClinicalTrials.gov (Himmelstein, 2016d). This set contains 5594 indications. Since these indications were not manually curated and clinical trials often show a lack of efficacy, we expected lower performance on this set.

  • Symptomatic — 390 symptomatic indications from PharacotherapyDB. These edges are included in the hetnet as palliates edges.

Only the Clinical Trial and DrugCentral indication sets were used for external validation, since the Disease Modifying and Symptomatic indications were included in the hetnet. As an aside, several additional indication catalogs have recently been published, which future studies may want to also consider (Himmelstein et al., 2015e; Brown and Patel, 2017; Shameer et al., 2017; Sharp, 2017).

Realtime open science and thinklab

Request a detailed protocol

We conducted our study using Thinklab — a platform for real-time open collaborative science — on which this study was the first project (Himmelstein et al., 2015c). We began the study by publicly proposing the idea and inviting discussion (Himmelstein et al., 2015k). We continued by chronicling our progress via discussions. We used Thinklab as the frontend to coordinate and report our analyses and GitHub as the backend to host our code, data, and notebooks. On top of our Thinklab team consisting of core contributors, we welcomed community contribution and review. In areas where our expertise was lacking or advice would be helpful, we sought input from domain experts and encouraged them to respond on Thinklab where their comments would be CC BY licensed and their contribution rated and rewarded.

In total, 40 non-team members commented across 86 discussions, which generated 622 comments and 191 notes (Figure 6). Thinklab content for this project totaled 145,771 words or 918,837 characters (Himmelstein and Lizee, 2016v). Using an estimated 7000 words per academic publication as a benchmark, Project Rephetio generated written content comparable in volume to 20.8 publications prior to its completion. We noticed several other benefits from using Thinklab including forging a community of contributors (Patil and Siegel, 2009); receiving feedback during the early stages when feedback was most actionable (Mietchen et al., 2015); disseminating our research without delay (Powell, 2016; Vale, 2015); opening avenues for external input (Allison et al., 2016); facilitating problem-oriented teaching (Himmelstein et al., 2016t; Waldrop, 2015); and improving our documentation by maintaining a publication-grade digital lab notebook (Giles, 2012).

The growth the Project Rephetio corpus on Thinklab over time.

This figure shows Project Rephetio contributions by user over time. Each band represented the cumulative contribution of a Thinklab user to discussions in Project Rephetio (Himmelstein and Lizee, 2016v). Users are ordered by date of first contribution. Users who contributed over 4500 characters are named. The square root transformation of characters written per user accentuates the activity of new contributors, thereby emphasizing collaboration and diverse input.

https://doi.org/10.7554/eLife.26726.012

Thinklab began winding down operations in July 2017 and has switched to a static state. While users will no longer be able to add comments, the corpus of content remains browsable at https://think-lab.github.io and available in machine-readable formats at dhimmel/thinklytics.

The preprint for this study is available at doi.org/bs4f (Himmelstein et al., 2016u). The manuscript was written in markdown, originally on Thinklab at doi.org/bszr (Himmelstein et al., 2016v). In August 2017, we switched to using the Manubot system to generate the manuscript. With Manubot, a GitHub repository (dhimmel/rephetio-manuscript) tracks the manuscript’s source code, while continuous integration automatically rebuilds the manuscript upon changes. As a result, the latest version of the manuscript is always available at dhimmel.github.io/rephetio-manuscript. Additionally, readers can leave feedback or questions for the Project Rephetio team via GitHub Issues.

Software and data availability

Request a detailed protocol

All software and datasets from Project Rephetio are publicly available on GitHub, Zenodo, or Figshare (Himmelstein et al., 2017b). Additional documentation for these materials is available in the corresponding Thinklab discussions. For reader convenience, software, datasets, and Thinklab discussions have been cited throughout the manuscript as relevant. Copies of the most relevant Github repositories are archived at: https://github.com/elifesciences-publications/hetionethttps://github.com/elifesciences-publications/integratehttps://github.com/elifesciences-publications/learnhttps://github.com/elifesciences-publications/hetio and https://github.com/elifesciences-publications/rephetio-manuscript.

Data availability

The following previously published data sets were used

References

    1. Cahill K
    2. Lindson-Hawley N
    3. Thomas KH
    4. Fanshawe TR
    5. Lancaster T
    (2016)
    Nicotine receptor partial agonists for smoking cessation
    The Cochrane Database of Systematic Reviews 9:CD006103.
  1. Conference
    1. Ehrenberg HR
    2. Shin J
    3. Ratner AJ
    4. Fries JA
    5. Ré C
    (2016)
    Data Programming with DDLite
    Proceedings of the Workshop on Human-in-the-Loop Data Analytics - HILDA’. pp. 1–6.
    1. Hanhijärvi S
    2. Garriga GC
    3. Puolamäki K
    (2009)
    Proceedings of the 2009 SIAM International Conference on Data Mining
    Randomization Techniques for Graphs, Proceedings of the 2009 SIAM International Conference on Data Mining, 10.1137/1.9781611972795.67.
  2. Website
    1. Himmelstein D
    2. Jensen LJ
    (2015u) One network to rule them all
    ThinkLab. Accessed September 11, 2017.
  3. Website
    1. Himmelstein D
    2. Li TS
    (2015d) Unifying disease vocabularies
    ThinkLab. Accessed September 11, 2017.
  4. Website
    1. Himmelstein D
    (2015a) Incomplete Interactome licensing
    ThinkLab. Accessed September 11, 2017.
  5. Website
    1. Himmelstein D
    (2015b) Unifying drug vocabularies
    ThinkLab. Accessed September 11, 2017.
  6. Website
    1. Himmelstein D
    (2015c) Extracting side effects from SIDER 4
    ThinkLab. Accessed September 11, 2017.
  7. Website
    1. Himmelstein D
    (2015d) MSigDB licensing
    ThinkLab. Accessed September 11, 2017.
  8. Website
    1. Himmelstein D
    (2015e) Disease Ontology feature requests
    ThinkLab. Accessed September 11, 2017.
  9. Website
    1. Himmelstein D
    (2015k) LINCS L1000 licensing
    ThinkLab. Accessed September 11, 2017.
  10. Website
    1. Khankhanian P
    2. Himmelstein D
    (2016) Prediction in epilepsy
    ThinkLab. Accessed September 11, 2017.
    1. Knaus K
    (2016)
    The SAGE Encyclopedia of Pharmacology and Society
    Anatomical Therapeutic Chemical Classification System (WHO), The SAGE Encyclopedia of Pharmacology and Society, 10.4135/9781483349985.n37.
    1. Li J
    2. Lu Z
    (2012) A New Method for Computational Drug Repositioning Using Drug Pairwise Similarity
    Proceedings. IEEE International Conference on Bioinformatics and Biomedicine 2012:1–4.
    https://doi.org/10.1109/BIBM.2012.6392722
  11. Website
    1. Malladi V
    2. Himmelstein D
    3. Mungall C
    (2015) Tissue node
    ThinkLab. Accessed September 11, 2017.
    1. Rogers D
    2. Hahn M
    (2010) Extended-connectivity fingerprints
    Journal of Chemical Information and Modeling 50:742–754.
    https://doi.org/10.1021/ci100050t
  12. Conference
    1. Sun Y
    2. Barber R
    3. Gupta M
    4. Aggarwal CC
    5. Jiawei H
    (2011)
    Co-author relationship prediction in heterogeneous bibliographic networks
    2011 International Conference on Advances in Social Networks Analysis and Mining. pp. 121–128.

Decision letter

  1. Alfonso Valencia
    Reviewing Editor; Barcelona Supercomputing Center (BSC), Spain

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Systematic integration of biomedical knowledge prioritizes drugs for repurposing" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Aviv Regev as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Jason Moore (Reviewer #1); Amitabh Sharma (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

This paper describes a very large resource of information on potential drug target interaction and its use for the prediction of drug repositioning opportunities The large network ensembles 47k nodes (11 kinds) and 2.25M edges (24 kinds) from 29 public databases. The network provides new relations that could lead to interesting discoveries. The algorithmic proposal is simple and reasonable, the testing makes sense and it is based on independent data sets. The examples provided are interesting as illustration of the best results, i.e., insights into epilepsy and nicotine dependence treatment.

There are a few points that require careful attention:

Justifications for the cutoffs used to define edges. Edges have a different level of reliability ranging very differently depending on the considered domain (ex. gene expression level). It is not clear how this has been taken into account/normalised. How edge reliability has been made comparable across resources and data types? Additionally, the authors have used different discretization threshold for different information domains. It is not clear how these thresholds have been determined and how/if they could potentially impact the predictive ability of the presented approach. This should be explained and each choice on this regard numerically justified.

The test sets used to validate the predictive ability of the Rephetio approach look strongly unbalanced toward negative cases (in 3 cases out of 4 the number of negatives is 3 orders of magnitude larger than the positives). It is necessary to assess the impact performance evaluation comparing with balanced sets of negatives and positives.

Other comments below are related with areas that require better explanations, clarification and editing:

- It was a bit confusing to map the nodes mentioned in the Introduction and Results to the Materials and methods. For example, anatomy is not mentioned in the nodes section of the Materials and methods. Making this more consistent might help the reader.

- It might be useful to expand some of your text on graph databases since they are very new and not many people know what they are or why they are useful.

- Edges of Hetionet are directed/undirected or a mixture of both. This inherently affects the definition of a path.

- The analysis of pathways of at most length 4 should be justified. What is the average length of a path between any node-pairs? And how loops are managed in this regard?

- The selection of the cases (e.g. methapaths Figure 3) should make clear how many good cases are there by for example providing a table with the best performing ones.

- The labels in Figure 2 are annoyingly all abbreviated. To improve readability I would suggest explicitly indicating the meta edge type.

- Even if referencing their relevant previous publication the "degree-weighted path count" should be briefly explained. Particularly, how this algorithm penalizes the paths involving high degree nodes?

- The multiple sclerosis example mentioned in the third paragraph of the subsection “Hetionet v1.0”: are the results shown anywhere? I believe the 4 nodes mentioned gene, disease, BP and anatomy? What exactly are the 5 types of interactions (guess: GpBP, DaG, DdG, DuG, AuG)? I understand that the authors have to contend with only a few examples to demonstrate the functionalities of their tools but still, it would help the reader to visualize this example if these details were presented.

- Related to the above: It would help the reader to show the four lines of Cypher code in the MS example to demonstrate the ease of use claimed by the authors. Or alternatively, to refer the reader to the part of the website where the syntax is introduced. Without this, the sentence "Furthermore, the portion of the query to identify paths meeting the above specification required only four lines of Cypher code" is left up in the air a little bit.

- In the second paragraph of the subsection “Systematic mechanisms of efficacy”, could the authors elaborate on the DWPC delta AUROC? The concept has been introduced in a previous publication, but it would be helpful to recap the definition so as to induce more biological insight in the reader.

- In Figure 2A, it took me a while to figure out that the dot sizes are not continuous but instead there are only two dot sizes with small dots corresponding to non-significant (FDR>0.05) and larger dots corresponding to significant (FDR<0.05) metapaths. It would help to put a small legend clarifying this.

- The selected metapaths in Table 3. We see that good (all of the 11 positive covariates of the following section) as well as poor predictors are both shown. According to what criteria were they selected? Is it just a random selection intended to show the range of parameters and what they signify? If so, this should be noted.

- "29,044 non-treatments": If non-treatments are any disease-compound pairs that are not the 755 known treatments, then the number should be much higher. It would be helpful if the authors refer the reader to the Materials and methods section "Prior probability of treatment" here.

- The permutation test has to be more clearly explained and potentially combined or compared with some previously published approaches (BiRewire for example.

- The baseline performances of their network the authors assembled 5 randomized versions of it. This number seems to be very small and this section requires clarification.

- Did the Thinklab contributors – pre-reviewing the paper – consent to having their names appear in a publication?

- The analysis described in the first Results section (Systematic mechanisms of efficacy) leads to the authors' conclusion that the frequency of the information types in the metapaths for selected existing drug-disease pairs is higher for those traditionally considered by pharmacology and it is particularly low for metaedges involving gene expression and transcriptional data. In the eye of the authors this should tune down the recent excitement surrounding this type of data. This argument looks a bit circular. The authors have used a set of established drug-disease possibly involving many approved drugs. Drug discovery pipelines have been typically guided so far by knowledge of disease mechanisms, chemical structures of drug candidates and targets. Should not be obvious that for approved drugs what lead to their development is reflected by the enriched meta edges found by the authors?

- Additionally, most of the recent excitement around the use of transcriptional data for drug repurposing comes from the development of signature matching tools exploiting drug-drug similarities and drug-disease anti-similarity at the level of transcriptional signatures (respectively elicited by the drugs under consideration upon treatment of in vitro models, and from contrasting diseased vs. normal state). Would this be worthy the inclusion of other two type of edges (transcriptional signature similarity/anti-similarity at the whole signature level)? This possibility should be at least discussed.

Finally, Heterogeneous network presented in this study unambiguously refers to "aggregated" networks consisting of the sum of multiple types of nodes and edges. While heterogeneous networks present perhaps the most viable way of integrating multiple and diverse types of biomedical data and are therefore very valuable to gaining integrated biological insights through machine learning approaches, it is important to distinguish them from multilayer networks, whose theory and mathematical framework has been established in the reviews mentioned. The current diversity of nomenclature in the filed stems from the fact that, at least for now, these slightly different types of multilayer networks have to distinguished for their mathematical treatment. On the other hand, no such mathematical framework exists for generic heterogeneous networks such as HetNet, but rather, an exhaustive survey of node/edge combinations (metapaths) is needed, such as the one presented in this paper. The authors should note the fundamental difference between multilayer and aggregated network approaches when making this comparison.

https://doi.org/10.7554/eLife.26726.016

Author response

There are a few points that require careful attention:

Justifications for the cutoffs used to define edges. Edges have a different level of reliability ranging very differently depending on the considered domain (ex. gene expression level). It is not clear how this has been taken into account/normalised. How edge reliability has been made comparable across resources and data types? Additionally, the authors have used different discretization threshold for different information domains. It is not clear how these thresholds have been determined and how/if they could potentially impact the predictive ability of the presented approach. This should be explained and each choice on this regard numerically justified.

We appreciate this reviewer’s viewpoint on the potentially different thresholds for each edge type and have made our best efforts to choose them rigorously. However, our approach does not require that each type of edge is equally reliable. For example, let’s assume 90% of Gene-associates-Disease edges are true whereas only 70% of Compound-upregulates-Gene edges are true. Such a disparity would potentially affect the performance of a specific metaedge, but since we use a supervised machine learning approach, the impact on predictions of a given edge type is dependent on its reliability.

Whenever selecting a threshold for a given edge type, our goal was to balance precision (percent of edges that are true) and recall (percent of real relationships that are included as edges). We attempted to pick thresholds so that we were reasonably confident in any given edge being true (high precision). In order to achieve moderate levels of recall, we generally selected more permissive thresholds for high-throughput data sources.

While constructing gold standards for each metaedge would be an interesting direction, it was outside of the scope of this study. Therefore, for most metaedges, we did not have objective measures of precision and recall rates at various inclusion thresholds. However, we did attempt to balance precision and recall when selecting inclusion thresholds, although this process was often qualitative rather than quantitative. In many cases, the final decision was reached after considerable discussion on Thinklab and exploratory data analyses. In addition, we often asked the authors of a particular resource to comment on an appropriate threshold.

Of course, this is not an entirely satisfactory answer. One intriguing (although application specific) solution would be a performance-driven approach. For example, optimizing an edge inclusion threshold to maximize performance. Since Hetionet v1.0 has many edge inclusion thresholds, difficult decisions would need to be made on whether to optimize the thresholds independently or jointly. The primary barrier we faced, and why we didn’t perform these analyses, was computational. Such an approach would add considerable code complexity by requiring a unified pipeline that interacts with many discrete repositories. In addition, we were already running up against compute time barriers, without performance-driven threshold optimization. We’re currently pursuing matrix multiplication approaches to compute DWPCs, which would render performance-driven edge threshold optimization computationally feasible. Therefore, we agree with the reviewers that this is an important future direction.

We recommend readers that are interested in our edge threshold determinations to refer to the cited Thinklab discussions and source code on GitHub. As a quick note for technical readers, the exact provenance for each relationship in Hetionet v1.0 can be deduced from working backwards from the integrate.ipynb notebook source code. Below we elaborate on several examples regarding edge threshold determination:

CompoundbindsGene relationships from BindingDB. We discussed various techniques for consolidating multiple binding affinity experiments and selecting an edge inclusion threshold. Mike Gilson, the principal investigator behind BindingDB, commented: “If a given compound and protein target have multiple measurements of different types, I’d probably use them in the following order of preference: Kd over Ki over IC50.” In a subsequent comment, we describe the method we chose: “multiple affinities for the same bindingdb–uniprot pairs were resolved by preferentially selecting Kd over Ki over IC50 and taking a geometric mean when there were multiple measurements of the same measure.” We further explain our choice of binding affinity threshold, writing: “Setting an affinity threshold at 1 μM (1000 nanomolar) – suggested by both [Mike Gilson] and [Alessandro Didonna] – retained ~20% of interactions.”

Compound–resembles–Compound relationships. We generated chemical similarity relationships based on the Dice coefficient between extended connectivity fingerprints (ECFPs). Mike Keiser commented in reference to his past research:

“Using ECFP4 (i.e., Morgan with radius 2 in RDKit) and a Tanimoto coefficient, we found cutoffs more around 0.28 (the range can vary pretty substantially depending on fingerprint type used). In general, 0.5 is considered pretty high similarity for ECFP/Morgan fingerprints at least with Tanimoto coefficients (I’m less sure of the Dice coefficient equivalents, off-hand).”

After evaluating the distribution of pairwise similarity scores as well as examining how chemical similarity associated with transcriptional similarity and the prevalence of shared targets between compound pairs, we decided on a threshold of 0.5.

AnatomyexpressesGene relationships from TISSUES. We discussed how to consolidate evidence from multiple channels of evidence with TISSUES’ principal investigator Lars Juhl Jensen. Dr. Jensen wrote: “If you want to enforce a hard cutoff to make things binary, I would urge you to at least take the integrated scores that takes everything into account and apply the cutoff to that. In this case a score of 3 might be appropriate.” We ended up selecting a more stringent threshold of 2 based on the observed distribution of integrated scores (notebook Cell 3), which indicated that increasing the threshold from 2 to 3 would benefit recall only minimally.

Anatomyup/downregulatesGene relationships from Bgee. For a given gene–anatomy pair, Bgee provides several reliability measures for whether the gene is underexpressed or overexpressed. We included unambiguously “low quality” or “high quality” relationships, based on the advice of Frederic Bastian, a lead architect of Bgee, who wrote:

“I would definitely use the “low quality” as well here. Because the overall call generated is based on a voting system weighted by p-values, so even if it is “low quality” because of conflicting analyses, the best p-value has won anyway.”

We investigated the frequency of different call qualities (notebook Cell 14) and discussed the inclusion threshold options extensively.

Gene–covaries–Gene relationships from Evolutionary Rate Covariation. We selected a covariation threshold of 0.75. This determination was based on the relative frequency of positive to negative covariation coefficients. From this analysis, we reasoned that “selecting a threshold of ERC > 0.75 would lead to a false discovery rate of approximately 10%."

GeneassociatesDisease relationships from DisGeNET. We discussed the threshold with the resource’s creator Janet Piñero, positing that “we would like a permissive threshold, allowing up to a ~30% false discovery rate.” Dr. Piñero responded: “If you choose score ≥ 0.06, then you will be including associations reporting by curated sources, or having animal models supporting them, or being reported by several papers (20–200). It will not be permissive, though (less than 10% of GDAs satisfies this criteria).” Despite not being the most permissive, we went with the score ≥ 0.06 threshold, since we compiled gene–disease associations from three other resources, and therefore could afford to be more stringent for each given resource.

MEDLINE co-occurrence relationships (Disease–localizes–Anatomy, Disease–presents–Symptom, and Disease–resembles–Disease). We included co-occurring terms with p < 0.005 from a Fisher’s exact test assessing whether terms (MEDLINE topics) occurred more frequently together than expected by chance. We selected this threshold after examining the resulting datasets for each relationship type (DlA, DpS, DrD) and observing that this threshold appeared to give high precision with sufficient recall.

As the above examples illustrate, edge inclusion thresholds were chosen based on extensive empirical assessment, data exploration, and advice from resource creators and other experts. Finally, for many relationships, we include the confidence scores as an edge property, so Hetionet users can perform additional filtering according to their needs.

The test sets used to validate the predictive ability of the Rephetio approach look strongly unbalanced toward negative cases (in 3 cases out of 4 the number of negatives is 3 orders of magnitude larger than the positives). It is necessary to assess the impact performance evaluation comparing with balanced sets of negatives and positives.

We rely on AUROC as our primary measure of classifier performance since it is prevalence agnostic. Hence, Figure 3A and 3B are unaffected by the imbalance of positives and negatives. We acknowledge that both precision and recall are affected by the imbalance (Figure 3C), however this is an important aspect of their utility. In the context of genomic predictions, Myers et al. 2006 go into more detail on this point, stating:

“To avoid such misleading evaluations, the balance of positives and negatives in the gold standard should match that of the application domain as closely as possible. Precision, or PPV, then becomes a direct, representative measure of how well one could expect a dataset or method to perform on whole-genome tasks.”

In our case, we included the full set of negatives in our various gold standards (validation sets), since we aimed to capture the unbalanced nature of drug efficacy. The point we are trying to highlight is that a random compound-disease pair has a low probability of being a treatment, and that’s an important aspect of the problem domain that should be reflected in our evaluation.

Other comments below are related with areas that require better explanations, clarification and editing:

- It was a bit confusing to map the nodes mentioned in the Introduction and Results to the Materials and methods. For example, anatomy is not mentioned in the nodes section of the Materials and methods. Making this more consistent might help the reader.

The Nodes section of Materials and methods does document each node type. To help readers locate node descriptions, we have now bolded the first mention for each metanode in this section.

We did notice the Edges section of Materials and methods did not explicitly mention Gene→regulates→Gene, Gene–participates–Pathway, and Pharmacologic Class–includes–Compound edges, and has now been corrected. We thank the reviewer for catching this oversight.

- It might be useful to expand some of your text on graph databases since they are very new and not many people know what they are or why they are useful.

We have now updated the Neo4j section of the Materials and methods to “Graph databases and Neo4j”. This revised section contains a paragraph describing the utility of graph databases, especially in the historical context of relational databases. This section also provides a location for us to reference many of the exciting biomedical applications of Neo4j that have occurred concurrently with Project Rephetio.

- Edges of Hetionet are directed/undirected or a mixture of both. This inherently affects the definition of a path.

We added a Materials and methods section entitled “Directionality”, which aims to demonstrate how this is handled by Project Rephetio. This section distinguishes directionality from orientation, by exhibiting how undirected metaedges/edges can be traversed in an oriented manner (e.g. that HADH–associates–obesity and obesity–associates–HADH are the same undirected edge but in inverse orientations). In addition, we provide additional notes below.

Directionality is defined at the metaedge level. With the exception of Gene→regulates→Gene, all of Hetionet v1.0’s metaedges are undirected. In short, directionality is only ever necessary for metaedges that connect a metanode to itself. Note however that not all Gene–Gene metaedges are directed. For example, Gene–interacts–Gene and Gene–covaries–Gene are undirected, since there is no inherent directionality in the relationships they encode. Since the Gene→regulates→Gene metaedge is directed, a direction must be specified whenever an edge of that type is created in the hetio python package.

Hetio accounts for directionality, as needed, when traversing metapaths or paths. For example, the metapaths CbGr>GaD and CbG<rGaD are distinct. In the former, the first gene regulates the second gene. In the latter, the second gene regulates the first gene. However, since Gene–interacts–Gene edges are undirected, there is only one CbGiGaD metapath. When our network traversal infrastructure encounters a directed metaedge, only edges conforming to the specified direction are traversed. In the case of undirected metaedges, all of the corresponding edges are traversed, since there is no concept of directionality.

- The analysis of pathways of at most length 4 should be justified. What is the average length of a path between any node-pairs? And how loops are managed in this regard?

The limitation of metapaths at length 4 was a primarily computational decision. As seen in Figure 1C, the number of possible metapaths grows combinatorially with increasing length. For Compound–Disease metapaths on Hetionet v1.0, there are 13 metapaths of length 2, 121 metapaths of length 3, and 1072 metapaths of length 4. Not only are longer metapaths more abundant, but computing each DWPC value for a longer metapath is more runtime intensive when using network traversal algorithms. The matrix multiplication methods currently under development will allow us to efficiently compute path counts and DWPCs for longer metapaths. However, we anticipate that the informativeness of DWPCs will diminish with excessively long metapaths.

Duplicate metanodes in a metapath are permitted. In other words, the metapaths we analyzed correspond to all walks on the metagraph from Compound to Disease with length 2–4. When calculating path counts and degree-weighted path counts, duplicate nodes are excluded.

- The selection of the cases (e.g. methapaths Figure 3) should make clear how many good cases are there by for example providing a table with the best performing ones.

Unfortunately, we are not sure what the reviewer means by “cases” in his/her comment. We thus interpreted this comment to mean either:

1) How many of the 209,168 compound–disease pairs received noteworthy predicted probabilities?

2) How many of 1,206 metapaths assessed were informative?

Regarding the first interpretation, the manuscript states: “Of the 3,980 predictions with a probability exceeding 1%, 586 corresponded to known disease-modifying indications, leaving 3,394 repurposing candidates.” Regarding the second interpretation, the manuscript states: “Overall, 709 of the 1,206 metapaths exhibited a statistically significant Δ AUROC at a false discovery rate cutoff of 5%.” For more information, we also refer readers to the interactive tables for predictions and metapaths online.

- The labels in Figure 2 are annoyingly all abbreviated. To improve readability I would suggest explicitly indicating the meta edge type.

We updated Figure 2A to use full metaedge names (commit). Regrettably, we still use abbreviations for metapaths in Figure 2B as their full names were too long.

- Even if referencing their relevant previous publication the "degree-weighted path count" should be briefly explained. Particularly, how this algorithm penalizes the paths involving high degree nodes?

We have now added a paragraph in Materials and methods, under “Machine learning approach”, that describes the DWPC metric in detail.

- The multiple sclerosis example mentioned in the third paragraph of the subsection “Hetionet v1.0”: are the results shown anywhere? I believe the 4 nodes mentioned gene, disease, BP and anatomy? What exactly are the 5 types of interactions (guess: GpBP, DaG, DdG, DuG, AuG)? I understand that the authors have to contend with only a few examples to demonstrate the functionalities of their tools but still, it would help the reader to visualize this example if these details were presented.

In response to this suggestion, we now include the Cypher code for the enhanced multiple sclerosis query within the manuscript. From the query, one can observe that the 4 node types are Disease, Gene, Biological Process, and Anatomy. Similarly the 5 relationship types are ASSOCIATES_DaG, INTERACTS_GiG, PARTICIPATES_GpBP, LOCALIZES_DlA, and UPREGULATES_AuG. Note that we adopted a slightly different nomenclature for Neo4j node labels and relationships types, as per Cypher style conventions and to help disambiguate metaedges.

- Related to the above: It would help the reader to show the four lines of Cypher code in the MS example to demonstrate the ease of use claimed by the authors. Or alternatively, to refer the reader to the part of the website where the syntax is introduced. Without this, the sentence "Furthermore, the portion of the query to identify paths meeting the above specification required only four lines of Cypher code" is left up in the air a little bit.

Thanks for this suggestion, which we have now incorporated this example into the manuscript.

- In the second paragraph of the subsection “Systematic mechanisms of efficacy”, could the authors elaborate on the DWPC delta AUROC? The concept has been introduced in a previous publication, but it would be helpful to recap the definition so as to induce more biological insight in the reader.

We have now added the following description: “A positive Δ AUROC indicates that paths of the given type tended to occur more frequently between treatments than non-treatments, after accounting for different levels of connectivity (node degrees) in the hetnet. In general terms, Δ AUROC assesses whether paths of a given type were informative of drug efficacy."

- In Figure 2A, it took me a while to figure out that the dot sizes are not continuous but instead there are only two dot sizes with small dots corresponding to non-significant (FDR>0.05) and larger dots corresponding to significant (FDR<0.05) metapaths. It would help to put a small legend clarifying this.

We added a legend for dot size (commit), in addition to the sentence in the figure caption.

- The selected metapaths in Table 3. We see that good (all of the 11 positive covariates of the following section) as well as poor predictors are both shown. According to what criteria were they selected? Is it just a random selection intended to show the range of parameters and what they signify? If so, this should be noted.

Table 3 shows the metapaths that received a non-negligible positive coefficient in the logistic regression model as well as metapaths that are discussed as interesting findings. The full metapath table is available online. We have now clarified the caption.

- "29,044 non-treatments": If non-treatments are any disease-compound pairs that are not the 755 known treatments, then the number should be much higher. It would be helpful if the authors refer the reader to the Materials and methods section "Prior probability of treatment" here.

The 29,044 negatives used for training were all non-treatments between compounds and diseases that had at least one treatment. In other words, 29,044 negatives = 387 compounds × 77 diseases − 755 treatments. We now note in the Results section that “179,369 non-treatments were omitted as negative training observations because they had a prior probability of treatment equal to zero."

- The permutation test has to be more clearly explained and potentially combined or compared with some previously published approaches (BiRewire for example.

We appreciate the reviewers’ recommendation to look into BiRewire, which we were previously unaware of. Both BiRewire and our implementation apply the same underlying algorithm, which we refer to as XSwap as per Hanhijärvi et al. 2009. In both approaches, edge swaps (or “switching-steps") are successively applied to randomize edges while preserving node degree. A Markov chain approach begins each round of permutation from the last. For more information, see our detailed comparison online. In addition, the manuscript now describes the permutation methodology in greater detail.

- The baseline performances of their network the authors assembled 5 randomized versions of it. This number seems to be very small and this section requires clarification.

The primary purpose of the permuted hetnets was to compute the Δ AUROC of metapaths. This calculation takes the difference between the AUROC of a metapath and the average AUROC of the metapath on the 5 permuted hetnets. Given the large number of positives and negatives evaluated by the ROC curve, the AUROC values are highly stable. Therefore, 5 permutations was sufficient to calculate Δ AUROCs. In addition, we argue that the benefit to a much larger number of permutations would be small, since the noise inherent to the sole AUROC from the single unpermuted hetnet remains constant. We’ve updated the Permuted Hetnets section of the Methods to justify our choice.

- Did the Thinklab contributors – pre-reviewing the paper – consent to having their names appear in a publication?

Consent was not sought since Thinklab contributions were made publicly. Thinklab contributor names are listed in the acknowledgements and displayed in Figure 6. This is not considered as an endorsement of the study by the named individuals, but rather as a reflection of the publicly-available scientific record behind the study.

- The analysis described in the first Results section (Systematic mechanisms of efficacy) leads to the authors' conclusion that the frequency of the information types in the metapaths for selected existing drug-disease pairs is higher for those traditionally considered by pharmacology and it is particularly low for metaedges involving gene expression and transcriptional data. In the eye of the authors this should tune down the recent excitement surrounding this type of data. This argument looks a bit circular. The authors have used a set of established drug-disease possibly involving many approved drugs. Drug discovery pipelines have been typically guided so far by knowledge of disease mechanisms, chemical structures of drug candidates and targets. Should not be obvious that for approved drugs what lead to their development is reflected by the enriched meta edges found by the authors?

We acknowledge that our gold standard of treatments is biased towards traditional pharmacology and existing knowledge. However, we assume that known treatments are enriched for drug efficacy compared to random non-treatments (compound–disease pairs without known efficacy). Further, we hypothesize that high-throughput information types, if predictive of drug efficacy, would apply equally well to known and unknown treatments. Therefore, we would expect that metapaths based on high-throughput data sources (e.g. transcriptional edges) still prioritize known treatments if truly informative. Nonetheless, we agree that knowledge biases could favor the performance of metapaths based on traditional pharmacology. Accordingly we added the following sentences to the Discussion:

“One caveat is that omics-scale experimental data will likely play a larger role in developing the next generation of pharmacotherapies. Hence, were performance reevaluated on treatments discovered in the forthcoming decades, the predictive ability of these data types may rise.”

- Additionally, most of the recent excitement around the use of transcriptional data for drug repurposing comes from the development of signature matching tools exploiting drug-drug similarities and drug-disease anti-similarity at the level of transcriptional signatures (respectively elicited by the drugs under consideration upon treatment of in vitro models, and from contrasting diseased vs. normal state). Would this be worthy the inclusion of other two type of edges (transcriptional signature similarity/anti-similarity at the whole signature level)? This possibility should be at least discussed.

Regarding transcription-based methods for repurposing, our findings provide little support for “drug-disease anti-similarity” approaches, but do provide stronger support “drug-drug similarity” approaches. The “drug-disease anti-similarity” approach is evaluated by the CuGdD and CdGuD metapaths, whose respective Δ AUROCs were 1.1% and 1.7% (Table 3). The “drug-drug similarity” approach is evaluated by the CuGuCtD and CdGdCtD metapaths, whose respective Δ AUROCs were 4.4% and 3.8%. We do think that our current Discussion section accurately portrays these findings.

Not including metaedges for transcriptional (anti-)similarity at the whole signature level was a design decision motivated by the desire to avoid encoding duplicate information into multiple metaedges. For example, the hypothetical metaedge of Compound–transcriptionally-resembles–Compound (CtrC) would replicate information already provided by Compound–upregulates–Gene and Compound–downregulates–Gene edges. However, if we only included CtrC edges and not CuG/CdG edges, we could not assess metapaths such as CuGdD. But including CuG/CdG edges did allow us to assess CtrC relationships, just through longer metapaths that substitute CuGuC or CdGdC for CtrC.

Whether to collapse transcriptional signatures into similarity relationships likely depends on the application. The benefits of collapsing are lower dimensionality, reduced computational complexity, and fine-tuning the similarity metric. In certain cases, the decision was obvious in favor of collapsing. For example, the Compound–resembles–Gene relationships were calculated from chemical fingerprints that are composed of many bits representing molecular features. Here, we took a Dice coefficient between two compounds to create a similarity score, rather than having “molecular feature” nodes in the hetnet. This made sense since the only purpose of the molecular features was to assess compound resemblance. In addition, the chemical substructures encoded by the molecular features are computational abstractions rather than true biochemical entities.

Going forward, similar design decisions will be relevant. For example, morphological profiling can assess how a compound affects cell morphology (e.g. “size, shape, texture, intensity"). Since the morphological features are machine-generated and rather obscure in meaning, it likely makes the most sense to create Compound-morphologically-resembles-Compound relationships type. Of course, as more research becomes available on the utility of morphological features, the decision on how to encode these features in a hetnet can rely more heavily on the findings of prior work.

Finally, Heterogeneous network presented in this study unambiguously refers to "aggregated" networks consisting of the sum of multiple types of nodes and edges. While heterogeneous networks present perhaps the most viable way of integrating multiple and diverse types of biomedical data and are therefore very valuable to gaining integrated biological insights through machine learning approaches, it is important to distinguish them from multilayer networks, whose theory and mathematical framework has been established in the reviews mentioned. The current diversity of nomenclature in the filed stems from the fact that, at least for now, these slightly different types of multilayer networks have to distinguished for their mathematical treatment. On the other hand, no such mathematical framework exists for generic heterogeneous networks such as HetNet, but rather, an exhaustive survey of node/edge combinations (metapaths) is needed, such as the one presented in this paper. The authors should note the fundamental difference between multilayer and aggregated network approaches when making this comparison.

We appreciate the reviewer’s insights on the nuances of different graph conceptualizations. We agree that the term “aggregated” is helpful for explaining hetnets, and updated an Introduction sentence:

“Hetnets (short for heterogeneous networks) are networks with multiple types of nodes and relationships. They offer an intuitive, versatile, and powerful structure for data integration by aggregating graphs for each relationship type onto common nodes.”

Furthermore, we updated the Discussion to better reflect the “fundamental difference between multilayer and aggregated network approaches” (citations removed below):

“A 2014 analysis identified 78 studies using multilayer networks — a superset of hetnets (heterogeneous information networks) with the potential for additional dimensions, such as time. […] Compared to the existing mathematical frameworks for multilayer networks that must deal with layers other than type (such as the aspect of time), the primary obligation of hetnet algorithms is to be type aware.”

https://doi.org/10.7554/eLife.26726.017

Article and author information

Author details

  1. Daniel Scott Himmelstein

    1. Biological and Medical Informatics Program, University of California, San Francisco, San Francisco, United States
    2. Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-3012-7446
  2. Antoine Lizee

    1. Department of Neurology, University of California, San Francisco, San Francisco, United States
    2. ITUN-CRTI-UMR 1064 Inserm, University of Nantes, Nantes, France
    Contribution
    Software, Formal analysis, Visualization, Methodology
    Competing interests
    No competing interests declared
  3. Christine Hessler

    Department of Neurology, University of California, San Francisco, San Francisco, United States
    Contribution
    Supervision, Validation, Methodology
    Competing interests
    No competing interests declared
  4. Leo Brueggeman

    1. Department of Neurology, University of California, San Francisco, San Francisco, United States
    2. University of Iowa, Iowa City, United States
    Contribution
    Resources, Data curation, Software, Formal analysis, Methodology
    Competing interests
    No competing interests declared
  5. Sabrina L Chen

    1. Department of Neurology, University of California, San Francisco, San Francisco, United States
    2. Johns Hopkins University, Baltimore, United States
    Contribution
    Data curation, Formal analysis, Validation
    Competing interests
    No competing interests declared
  6. Dexter Hadley

    1. Department of Pediatrics, University of California, San Fransisco, San Fransisco, United States
    2. Institute for Computational Health Sciences, University of California, San Francisco, San Francisco, United States
    Contribution
    Resources, Data curation, Software, Formal analysis
    Competing interests
    No competing interests declared
  7. Ari Green

    Department of Neurology, University of California, San Francisco, San Francisco, United States
    Contribution
    Conceptualization, Validation
    Competing interests
    No competing interests declared
  8. Pouya Khankhanian

    1. Department of Neurology, University of California, San Francisco, San Francisco, United States
    2. Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis
    Competing interests
    No competing interests declared
  9. Sergio E Baranzini

    1. Biological and Medical Informatics Program, University of California, San Francisco, San Francisco, United States
    2. Department of Neurology, University of California, San Francisco, San Francisco, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing—review and editing
    For correspondence
    sergio.baranzini@ucsf.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0067-194X

Funding

National Science Foundation (1144247)

  • Daniel Scott Himmelstein

Heidrich Family and Friends Foundation

  • Sergio E Baranzini

National Institutes of Health (5R01NS088155)

  • Sergio E Baranzini

National Cancer Institute (UH2CA203792)

  • Dexter Hadley

U.S. National Library of Medicine (1U01LM012675)

  • Dexter Hadley

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We are immensely grateful to our Thinklab contributors who joined us in our experiment of radically open science. The following non-team members provided contributions that received five or more Thinklab points: Lars Juhl Jensen, Frederic Bastian, Alexander Pico, Casey Greene, Benjamin Good, Craig Knox, Mike Gilson, Chris Mungall, Katie Fortney, Venkat Malladi, Tudor Oprea, MacKenzie Smith, Caty Chung, Allison McCoy, Alexey Strokach, Ritu Khare, Greg Way, Marina Sirota, Raghavendran Partha, Oleg Ursu, Jesse Spaulding, Gaya Nadarajan, Alex Ratner, Scooter Morris, Alessandro Didonna, Alex Pankov, Tong Shu Li, and Janet Piñero. Additionally, the founder of Thinklab, Jesse Spaulding, supported community contributions and developed the platform with Project Rephetio’s needs in mind. We also appreciate DigitalOcean’s sponsorship the Hetionet Browser to cover its hosting costs. Finally, we would like to thank Neo Technology, whose staff provided excellent technical support. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant Number 1144247 to DSH. SEB is supported by NINDS/NIH grant number 5R01NS088155 and the Heidrich Family and Friends Foundation. DH is supported by the the National Cancer Institute of the National Institutes of Health under Award Number UH2CA203792 and the National Library of Medicine under Award Number 1U01LM012675. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Reviewing Editor

  1. Alfonso Valencia, Barcelona Supercomputing Center (BSC), Spain

Version history

  1. Received: March 11, 2017
  2. Accepted: September 11, 2017
  3. Accepted Manuscript published: September 22, 2017 (version 1)
  4. Version of Record published: October 13, 2017 (version 2)

Copyright

© 2017, Himmelstein et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 14,906
    Page views
  • 1,834
    Downloads
  • 230
    Citations

Article citation count generated by polling the highest count across the following sources: Scopus, Crossref, PubMed Central.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Daniel Scott Himmelstein
  2. Antoine Lizee
  3. Christine Hessler
  4. Leo Brueggeman
  5. Sabrina L Chen
  6. Dexter Hadley
  7. Ari Green
  8. Pouya Khankhanian
  9. Sergio E Baranzini
(2017)
Systematic integration of biomedical knowledge prioritizes drugs for repurposing
eLife 6:e26726.
https://doi.org/10.7554/eLife.26726

Further reading

    1. Biochemistry and Chemical Biology
    2. Computational and Systems Biology
    Ian R Outhwaite, Sukrit Singh ... Markus A Seeliger
    Research Article

    Kinase inhibitors are successful therapeutics in the treatment of cancers and autoimmune diseases and are useful tools in biomedical research. However, the high sequence and structural conservation of the catalytic kinase domain complicates the development of selective kinase inhibitors. Inhibition of off-target kinases makes it difficult to study the mechanism of inhibitors in biological systems. Current efforts focus on the development of inhibitors with improved selectivity. Here, we present an alternative solution to this problem by combining inhibitors with divergent off-target effects. We develop a multicompound-multitarget scoring (MMS) method that combines inhibitors to maximize target inhibition and to minimize off-target inhibition. Additionally, this framework enables optimization of inhibitor combinations for multiple on-targets. Using MMS with published kinase inhibitor datasets we determine potent inhibitor combinations for target kinases with better selectivity than the most selective single inhibitor and validate the predicted effect and selectivity of inhibitor combinations using in vitro and in cellulo techniques. MMS greatly enhances selectivity in rational multitargeting applications. The MMS framework is generalizable to other non-kinase biological targets where compound selectivity is a challenge and diverse compound libraries are available.

    1. Computational and Systems Biology
    2. Physics of Living Systems
    Jennifer K Briggs, Anne Gresch ... Richard KP Benninger
    Research Article

    Diabetes is caused by the inability of electrically coupled, functionally heterogeneous -cells within the pancreatic islet to provide adequate insulin secretion. Functional networks have been used to represent synchronized oscillatory [Ca2+] dynamics and to study -cell subpopulations, which play an important role in driving islet function. The mechanism by which highly synchronized -cell subpopulations drive islet function is unclear. We used experimental and computational techniques to investigate the relationship between functional networks, structural (gap-junction) networks, and intrinsic -cell dynamics in slow and fast oscillating islets. Highly synchronized subpopulations in the functional network were differentiated by intrinsic dynamics, including metabolic activity and KATP channel conductance, more than structural coupling. Consistent with this, intrinsic dynamics were more predictive of high synchronization in the islet functional network as compared to high levels of structural coupling. Finally, dysfunction of gap junctions, which can occur in diabetes, caused decreases in the efficiency and clustering of the functional network. These results indicate that intrinsic dynamics rather than structure drive connections in the functional network and highly synchronized subpopulations, but gap junctions are still essential for overall network efficiency. These findings deepen our interpretation of functional networks and the formation of functional sub-populations in dynamic tissues such as the islet.