Adverse Drug Reactions: The benefits of data mining

Careful analysis of a database populated by physicians and patients sheds new light on the side effects of drugs.
  1. Audrey Bone
  2. Keith Houck (corresponding author)
  1. US Environmental Protection Agency, United States

The goal of pharmaceutical drug development is to produce compounds that can treat medical conditions effectively without causing side effects (known in the pharmaceutical industry as adverse drug reactions, or ADRs). It has been estimated that serious ADRs occur in over two million patients per year in the US, with 100,000 of them resulting in death (Giacomini et al., 2007). Potential new drugs are subject to in vivo testing in laboratory animals and in vitro studies in cell lines before they are ever used in human clinical trials. However, humans differ from laboratory animals in many ways, and there are limits to the applicability of in vitro studies. As a result, ADRs are often not identified until a drug is tested in a clinical trial, which can result in costly failures.

Moreover, even if a drug is approved for use after clinical trials, some critical ADRs only become apparent after a large number of patients have been treated over a long time. This is because clinical trials struggle to account for several important factors, such as patient age, co-exposure to other drugs, genetic differences, environmental and dietary variation, and long-term use (Woodcock, 2016). The diet drug fenfluramine-phentermine (fenphen), for example, had to be withdrawn in 1997 following patient deaths caused by a drug metabolite binding to an off-target receptor (5HT2B), which led to heart valve disease (Rothman et al., 2000). Hence, there is a real need for methods that can predict ADRs much earlier in the drug development process.

Current methods to predict ADRs are limited. Many animal models do not adequately predict human responses (Kullak-Ublick et al., 2017; FDA, 2017), and in vitro studies can examine the molecular pathways underlying an adverse reaction only if something is already known about the mechanisms driving the ADR. The receptor implicated in fenphen toxicity is an example of a molecular target that compounds can be tested against with in vitro assays, as is an ion-channel protein called hERG that has been linked to heart arrhythmias (Roy et al., 1996). However, the majority of ADRs do not have known underlying mechanisms.

Now, in eLife, Mateusz Maciejewski of Pfizer, Brian Shoichet of UCSF, Laszlo Urban of Novartis and colleagues report a third approach that involves analyzing a large, crowd-sourced database of ADRs maintained by the Food and Drug Administration (FDA) in the United States (Maciejewski et al., 2017). The FDA Adverse Event Reporting System (FAERS) is a publicly accessible, voluntary database that allows physicians, pharmacists and patients (and also lawyers involved in drug litigation) to report adverse events associated with prescription or over-the-counter medicines, along with nutritional products, cosmetics and food/beverages (Sakaeda et al., 2013). The database now contains over nine million records reaching back to 1969 and continues to grow rapidly (Figure 1). However, while FAERS contains a wealth of real-world information, these data must be handled with care.

Figure 1. Making the most of the FDA Adverse Event Reporting System (FAERS).

Patients, physicians, pharmacists and other health-care professionals input information about adverse drug reactions into the FAERS database. Maciejewski et al. have shown that it is possible to use data mining and statistical analysis to extract new insights about adverse drug reactions from the database; the first step is to deal with the noise and other problems associated with such crowd-sourced databases. The number of reports in FAERS has grown rapidly over the past decade (top right; data from FDA).

As an example, FAERS identifies drugs by name rather than by chemical structure, and each chemical structure has an average of 16 different names (378 in the case of fluoxetine, also known as Prozac), so Maciejewski et al. first had to aggregate all the information associated with each chemical structure. They also had to remove redundant data (e.g., where the same event was entered multiple times) and misleading data (e.g., where the reported adverse event was actually a pre-existing medical condition).
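The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline Maciejewski et al. used: the synonym table, the record fields (`case_id`, `drug`, `event`, `indication`) and the duplicate rule are all hypothetical, and real FAERS records are considerably messier.

```python
from collections import defaultdict

# Hypothetical synonym table: every known name maps to one structure key.
# A real table would hold many more names per structure.
SYNONYMS = {
    "prozac": "fluoxetine",
    "fluoxetine": "fluoxetine",
    "fluoxetine hcl": "fluoxetine",
    "vioxx": "rofecoxib",
    "rofecoxib": "rofecoxib",
}


def clean_reports(reports):
    """Group reported events by structure key, skipping noisy records.

    Assumed record layout (not the real FAERS schema): a dict with
    'case_id', 'drug', 'event' and an optional 'indication'.
    """
    seen = set()
    by_structure = defaultdict(list)
    for report in reports:
        key = SYNONYMS.get(report["drug"].strip().lower())
        if key is None:
            continue  # name not in the synonym table
        event = report["event"].strip().lower()
        indication = (report.get("indication") or "").strip().lower()
        if event == indication:
            continue  # the "event" is the condition being treated
        fingerprint = (report["case_id"], key, event)
        if fingerprint in seen:
            continue  # same event entered more than once for this case
        seen.add(fingerprint)
        by_structure[key].append(event)
    return dict(by_structure)
```

Keying the duplicate check on the structure rather than the reported name means that "Prozac" and "fluoxetine HCl" entries for the same case collapse into one record.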

Using various data visualization techniques, Maciejewski et al. were then able to dig deeper into the data and identify a number of potentially confounding factors that can undermine overly simplistic interpretations. Data for individual drugs plotted chronologically showed distinct spikes in reports that could be tied to specific events. For example, the initial reports of cardiovascular and cerebrovascular events associated with rofecoxib (the nonsteroidal anti-inflammatory drug with the brand name Vioxx) came primarily from physicians. The number of reports later increased dramatically, first from patients and later from lawyers, following the publication of a clinical study linking Vioxx to cardiovascular events and, two years later, the addition of warnings to the Vioxx label.
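A rough sketch of this kind of chronological breakdown is shown below. The record fields (`year`, `reporter`) and the spike rule (a year whose total exceeds three times the previous recorded year) are assumptions made for illustration; the article does not say how spikes were detected beyond visual inspection of the plots.

```python
from collections import Counter


def counts_by_year_and_reporter(reports):
    """Count reports for each (year, reporter type) pair."""
    return Counter((r["year"], r["reporter"]) for r in reports)


def spike_years(reports, factor=3.0):
    """Return years whose report total exceeds `factor` times the
    total of the previous year present in the data.

    The threshold is arbitrary and only meant to illustrate the idea.
    """
    totals = Counter(r["year"] for r in reports)
    years = sorted(totals)
    return [
        year
        for previous, year in zip(years, years[1:])
        if totals[year] > factor * totals[previous]
    ]
```

Splitting each flagged year by reporter type then shows whether a surge came from physicians, patients or lawyers, which is the distinction drawn in the rofecoxib example.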

An analysis of the diabetes drugs rosiglitazone and pioglitazone (which have similar structures) illustrated how the database can be used to differentiate between a class effect (in which an effect is seen across an entire class of drugs) and a drug-specific effect. Rosiglitazone showed a strong signal of cardiovascular events, such as congestive cardiac failure, that persisted over time. Pioglitazone, on the other hand, showed only a small, inconsistent spike in cardiovascular events that coincided with the increased public scrutiny of rosiglitazone. Over time, a strong bladder cancer signal appeared for pioglitazone that was not seen with rosiglitazone. This result is supported by recent epidemiological studies which suggest that differences in receptor selectivity are responsible for the differences between the drugs (Tuccori et al., 2016).

Maciejewski et al. also illustrated the need to take pharmacokinetics into account when analyzing the FAERS database by examining hypertension associated with a cancer treatment involving the inhibition of vascular endothelial growth factor receptor. Nineteen different inhibitors were analyzed and only those with exposure margins (the ratio of the potency against the target to the patient’s serum concentration) under 10 were linked to hypertension. Maciejewski et al. concluded that this could be used as a drug development guideline for this class of compounds and that the use of exposure margins in the FAERS analysis may help define drugs that cause adverse events.
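The exposure-margin rule can be written out as a small calculation. The sketch below follows the definition given in the text (potency against the target divided by serum concentration) and the threshold of 10; the function names, the choice of nanomolar units and the example compounds are assumptions, not values from the study.

```python
def exposure_margin(potency_nm, serum_conc_nm):
    """Ratio of potency against the target to serum concentration.

    Both values must be in the same units (nanomolar is assumed here).
    """
    if potency_nm <= 0 or serum_conc_nm <= 0:
        raise ValueError("potency and concentration must be positive")
    return potency_nm / serum_conc_nm


def flag_low_margin(compounds, threshold=10.0):
    """Return names of compounds whose exposure margin is under `threshold`.

    `compounds` maps a name to a (potency, serum concentration) pair.
    """
    return [
        name
        for name, (potency, conc) in compounds.items()
        if exposure_margin(potency, conc) < threshold
    ]


# Made-up values for two hypothetical inhibitors.
example = {
    "inhibitor_a": (50.0, 100.0),  # margin 0.5, flagged
    "inhibitor_b": (500.0, 10.0),  # margin 50, not flagged
}
flagged = flag_low_margin(example)
```

A compound is flagged only when the margin is strictly below the threshold, matching the "under 10" wording above.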

Crowd-sourced databases are often noisy and subject to interference from many factors because the data are entered by non-experts. However, such databases can be an invaluable resource when analyzed appropriately. Maciejewski et al. have shown how to handle the noise in the FAERS database and the limitations of the database structure, and how to deal with social factors such as news reports, drug recalls, and ongoing litigation. Moreover, using relatively simple statistical methods, they demonstrated how to extract useful information about adverse events (including information about relationships and mechanisms) from the data. Their work will also provide a foundation for the use of sophisticated methods (such as empirical Bayesian statistics and hierarchical methods) in future studies. The recommendations they make for improving the database, such as including pharmacokinetics information, would make it even more valuable.
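The article does not specify which simple statistics were applied. As one illustration of the kind of measure commonly used with spontaneous-report databases, the sketch below computes a proportional reporting ratio (PRR) from a 2×2 table of report counts. Its use here is an assumption and should not be read as the authors' method.

```python
def proportional_reporting_ratio(a, b, c, d):
    """Proportional reporting ratio from a 2x2 table of report counts.

    a: reports with the drug of interest and the event of interest
    b: reports with the drug of interest and any other event
    c: reports with all other drugs and the event of interest
    d: reports with all other drugs and any other event
    """
    if a + b == 0 or c + d == 0:
        raise ValueError("each drug group needs at least one report")
    if c == 0:
        raise ValueError("PRR is undefined when no other drug has the event")
    rate_drug = a / (a + b)    # share of the drug's reports with the event
    rate_others = c / (c + d)  # same share across all other drugs
    return rate_drug / rate_others
```

A PRR well above 1 means the event makes up a larger share of reports for the drug than for the rest of the database; it indicates disproportionate reporting rather than causation, and it remains sensitive to the social factors described above.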


The views expressed in this paper are those of the authors and do not necessarily represent the views or policies of the US Environmental Protection Agency.


References

  1. Woodcock J (2016) "Precision" drug development? Clinical Pharmacology & Therapeutics 99:152–154.

Article and author information

Author details

  1. Audrey Bone

    Audrey Bone is in the National Center for Computational Toxicology, Office of Research and Development, US Environmental Protection Agency, Research Triangle Park, United States

    Competing interests
    No competing interests declared
  2. Keith Houck

    Keith Houck is in the National Center for Computational Toxicology, Office of Research and Development, US Environmental Protection Agency, Research Triangle Park, United States

    Competing interests
    No competing interests declared

Publication history

  1. Version of Record published: August 16, 2017 (version 1)


This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.



Cite this article

Audrey Bone, Keith Houck (2017) Adverse Drug Reactions: The benefits of data mining. eLife 6:e30280.
