
A meta-analysis of threats to valid clinical inference in preclinical research of sunitinib

  1. Valerie C Henderson
  2. Nadine Demko
  3. Amanda Hakala
  4. Nathalie MacKinnon
  5. Carole A Federico
  6. Dean Fergusson
  7. Jonathan Kimmelman (corresponding author)
  1. McGill University, Canada
  2. Ottawa Hospital Research Institute, Canada
Research Article
Cite as: eLife 2015;4:e08351 doi: 10.7554/eLife.08351

Abstract

Poor study methodology leads to biased measurement of treatment effects in preclinical research. We used available sunitinib preclinical studies to evaluate relationships between study design and experimental tumor volume effect sizes. We identified published animal efficacy experiments where sunitinib monotherapy was tested for effects on tumor volume. Effect sizes were extracted alongside experimental design elements addressing threats to valid clinical inference. Reported use of practices to address internal validity threats was limited, with no experiments using blinded outcome assessment. Most malignancies were tested in one model only, raising concerns about external validity. We calculate a 45% overestimate of effect size across all malignancies due to potential publication bias. Pooled effect sizes for specific malignancies did not show apparent relationships with effect sizes in clinical trials, and we were unable to detect dose–response relationships. Design and reporting standards represent an opportunity for improving clinical inference.

https://doi.org/10.7554/eLife.08351.001

eLife digest

Developing a new drug can take years, partly because preclinical research on non-human animals is required before any clinical trials with humans can take place. Nevertheless, only a fraction of cancer drugs that are put into clinical trials after showing promising results in preclinical animal studies end up proving safe and effective in human beings.

Many researchers and commentators have suggested that this high failure rate reflects flaws in the way preclinical studies in cancer are designed and reported. Now, Henderson et al. have looked at all the published animal studies of a cancer drug called sunitinib and asked how well the design of these studies attempted to limit bias and match the clinical scenarios they were intended to represent.

This systematic review and meta-analysis revealed that many common practices, like randomization, were rarely implemented. None of the published studies used ‘blinding’, whereby information about which animals are receiving the drug and which animals are receiving the control is kept from the experimenter until after the test; this technique can help prevent any expectations or personal preferences from biasing the results. Furthermore, most tumors were tested in only one model system, namely, mice that had been injected with specific human cancer cells. This makes it difficult to rule out that any anti-cancer activity was in fact unique to that single model.

Henderson et al. went on to find evidence that suggests that the anti-cancer effects of sunitinib might have been overestimated by as much as 45% because those studies that found no or little anti-cancer effect were simply not published. Though it is known that the anti-cancer activity of the drug increases with the dose given in both human beings and animals, an evaluation of the effects of all the published studies combined did not detect such a dose-dependent response.

The poor design and reporting issues identified provide further grounds for concern about the value of many preclinical experiments in cancer. These findings also suggest that there are many opportunities for improving the design and reliability of study reports. Researchers studying certain medical conditions (such as strokes) have already developed, and now routinely implement, a set of standards for the design and reporting of preclinical research. It now appears that the cancer research community should do the same.

https://doi.org/10.7554/eLife.08351.002

Introduction

Preclinical experiments provide evidence of clinical promise, inform trial design, and establish the ethical basis for exposing patients to a new substance. However, preclinical research is plagued by poor design and reporting practices (van der Worp et al., 2010; Begley, 2013a; Begley and Ioannidis, 2015). Recent reports also suggest that many effects in preclinical studies fail replication (Begley and Ellis, 2012). Drug development efforts grounded on non-reproducible findings expose patients to harmful and inactive agents; they also absorb scarce scientific and human resources, the costs of which are reflected as higher drug prices.

Several studies have evaluated the predictive value of animal models in cancer drug development (Johnson et al., 2001; Voskoglou-Nomikos et al., 2003; Corpet and Pierre, 2005). However, few have systematically examined experimental design—as opposed to use of specific models—and its impact on effect sizes across different malignancies (Amarasingh et al., 2009; Hirst et al., 2013). A recent systematic review of guidelines for limiting bias in preclinical research design was unable to identify any guidelines in oncology (Henderson et al., 2013). Validity threats in preclinical oncology may be particularly important to address in light of the fact that cancer drug development has one of the highest rates of attrition (Hay et al., 2014), and oncology drug development commands billions of dollars in funding each year (Adams and Brantner, 2006).

We conducted a systematic review and meta-analysis of features of design and outcomes for preclinical efficacy studies of the highly successful drug sunitinib. Sunitinib (SU11248, Sutent) is a multi-targeted tyrosine kinase inhibitor licensed as monotherapy for three different malignancies (Chow and Eckhardt, 2007; Raymond et al., 2011). As it was introduced into clinical development around 2000 and tested against numerous malignancies, sunitinib provided an opportunity to study a large sample of preclinical studies across a broad range of malignancies, including several supporting successful translation trajectories.

Results

Study characteristics

Our screen from database and reference searches captured 74 studies eligible for extraction, corresponding to 332 unique experiments investigating tumor volume response (Figure 1, Table 1, Table 1—source data 1E). Effect sizes (standardized mean difference [SMD] using Hedges' g) could not be computed for 174 experiments (52%) due to inadequate reporting (e.g., sample size not provided, effect size reported as a median, lack of error bars, Figure 1—figure supplement 1). Overall, 158 experiments, involving 2716 animals, were eligible for meta-analysis. The overall pooled SMD for all extracted experiments across all malignancies was −1.8 [−2.1, −1.6] (Figure 2—figure supplement 1). Mean duration of experiments used in meta-analysis (Figures 2–4) was 31 days (±14 days standard deviation of the mean (SDM)).

Figure 1 (with 1 supplement)
Descriptive analysis of (A) internal, construct, and (B) external validity design elements.

External validity scores were calculated for each malignancy type tested, according to the formula: number of species used + number of models used; an extra point was assigned if a malignancy type was tested in more than one species and more than one model.

https://doi.org/10.7554/eLife.08351.003
Table 1

Demographics of included studies

https://doi.org/10.7554/eLife.08351.006
Study level demographics        Included studies (n = 74)
Conflict of interest
 Declared                       19 (26%)
Funding statement*
 Private, for-profit            44 (59%)
 Private, not-for-profit        35 (47%)
 Public                         37 (50%)
 Other                          2 (3%)
Recommended clinical testing
 Yes                            37 (50%)
Publication date
 2003–2006                      13 (18%)
 2007–2009                      17 (23%)
 2010–2013                      44 (59%)

* Does not sum to 100% as many studies declared more than one funding source.

Figure 2 (with 1 supplement)
Summary of pooled SMDs for each malignancy type.

Shaded region denotes the pooled standardized mean difference (SMD) and 95% confidence interval (CI) (−1.8 [−2.1, −1.6]) for all experiments combined at the last common time point (LCT).

https://doi.org/10.7554/eLife.08351.008
Figure 3
Relationship between study design elements and effect sizes.

The shaded region denotes the pooled SMD and 95% CI (−1.8 [−2.1, −1.6]) for all experiments combined at the LCT.

https://doi.org/10.7554/eLife.08351.011
Figure 4
Funnel plot to detect publication bias.

Trim and fill analysis was performed on pooled malignancies, as well as the three malignancies with the greatest study volume. (A) All experiments for all malignancies (n = 182), (B) all experiments within renal cell carcinoma (RCC) (n = 35), (C) breast cancer (n = 32), and (D) colorectal cancer (n = 29). Time point was the LCT. Open circles denote original data points whereas black circles denote ‘filled’ experiments. Trim and fill did not produce an estimate in RCC; therefore, no overestimation of effect size could be found.

https://doi.org/10.7554/eLife.08351.012

Design elements addressing validity threats

Effects in preclinical studies can fail clinical generalization because of bias or random variation (internal validity), a mismatch between experimental operations and the clinical scenario modeled (construct validity), or idiosyncratic causal mediators in an experimental system (external validity) (Henderson et al., 2013). We extracted design elements addressing each using consensus design practices identified in a systematic review of validity threats in preclinical research (Henderson et al., 2013).

Few studies used practices like blinding or randomization to address internal validity threats (Figure 1A). Only 6% of experiments investigated a dose–response relationship (3 or more doses). Concealment of allocation or blinded outcome assessment was never reported in studies that advanced to meta-analysis. It is worth noting that one research group employed concealed allocation and blinded assessment for the many experiments it described (Maris et al., 2008). However, its statistics were reported in a form that did not allow us to calculate SMD. We found that 58.8% of experiments included active drug comparators, thus facilitating interpretation of sunitinib activity (although in some of these experiments, sunitinib was itself the active comparator in a test of a different drug or drug combination). Construct validity practices can only be meaningfully evaluated against a particular, matched clinical trial. Nevertheless, Figure 1A shows that experiments predominantly relied on juvenile, female, immunocompromised mouse models, and very few animal efficacy experiments used genetically engineered cancer models (n = 4) or spontaneously arising tumors (n = 0). Malignancies generally scored low (score = 1) for addressing external validity (Figure 1B), with breast cancer studies employing the greatest variety of species (n = 2) and models (n = 4).

Implementation of internal validity practices did not show clear relationships with effect sizes (Figure 3A). However, sunitinib effect sizes were significantly greater when active drug comparators were present in an experiment compared to when they were not (−2.2 [−2.5, −1.9] vs −1.4 [−1.7, −1.1], p-value <0.001).

Within construct validity, there was a significant difference in pooled effect size between genetically engineered mouse models and human xenograft (p-value <0.0001) and allograft (p-value 0.001) model types (Figure 3B). For external validity (Figure 3C), malignancies tested in more and diverse experimental systems tended to show less extreme effect sizes (p < 0.001).

Evidence of publication bias

For the 158 individual experiments, 65.8% showed statistically significant activity at the experiment level (p < 0.05, Figure 2—figure supplement 1), with an average sample size of 8.03 animals per treatment arm and 8.39 animals per control arm. Funnel plots for all studies (Figure 4A), as well as our renal cell carcinoma (RCC) subset (Figure 4B) suggest potential publication bias. Trim and fill analysis suggests an overestimation of effect size of 45% (SMD changed from −1.8 [−2.1, −1.7] to −1.3 [−1.5, −1.0]) across all indications. For high-grade glioma and breast cancer, the overestimation was 11% and 52%, respectively. However, trim and fill analysis suggested excellent symmetry for the RCC subgroup, suggesting coverage of the overall effect size and confidence intervals and not overestimation of effect size.

Preclinical studies and clinical correlates

Every malignancy tested with sunitinib showed statistically significant anti-tumor activity (Figure 2). Though we did not perform a systematic review to estimate clinical effect sizes for sunitinib against various malignancies, a perusal of the clinical literature suggests little relationship between pooled effect sizes and demonstrated clinical activity. For instance, sunitinib monotherapy is highly active in RCC patients (Motzer et al., 2006a, 2006b) and yet showed a relatively small preclinical effect; in contrast, sunitinib monotherapy was inactive against small cell lung cancer in a phase 2 trial (Han et al., 2013), but showed relatively large preclinical effects.

Using measured effect sizes at a standardized time point of 14 days after first administration (a different time point than in Figures 2–4 to better align our evaluation of dose–response), we were unable to observe a dose–response relationship over three orders of magnitude (0.2–120 mg/kg/day) for all experiments (Figure 5A). We were also unable to detect a dose–response relationship over the full dose range (4–80 mg/kg/day) tested in the RCC subset (Figure 5B). The same results were observed when we performed the same analyses using the last time point in common between the experimental and control arms.

Figure 5
Dose–response curves for sunitinib preclinical studies.

Only experiments with a once daily (no breaks) administration schedule were included in both graphs. Effect size data were taken from a standardized time point (14 days after first sunitinib administration). (A) Experiments (n = 158) from all malignancies tested failed to show a dose–response relationship. (B) A dose–response relationship was not detected for RCC (n = 24). (C) Dose–response curves reported in individual studies within the RCC subset showed dose–response patterns (blue diamond = Huang 2010a [n = 3], red square = Huang 2010d [n = 3], green triangle = Ko 2010a [n = 3], purple X = Xin 2009 [n = 3]).

https://doi.org/10.7554/eLife.08351.013

Discussion

Preclinical studies serve an important role in formulating clinical hypotheses and justifying the advance of a new drug into clinical testing. Our meta-analysis, which included malignancies that respond to sunitinib in human beings and those that do not, raises several questions about methods and reporting practices in preclinical oncology—at least in the context of one well-established drug.

First, reporting of design elements and data was poor and inconsistent with widely recognized standards for animal studies (Kilkenny et al., 2010). Indeed, 98 experiments (30% of qualitative sample) could not be quantitatively analyzed because sample sizes or measures of dispersion were not provided. Experimenters only sporadically addressed major internal validity threats and tended not to test indication-activity in more than one model and species. This finding is consistent with what others have observed in experimental stroke and other research areas (Macleod et al., 2004; van der Worp et al., 2005; Kilkenny et al., 2009; Glasziou et al., 2014). Some teams have shown a relationship between failure to address internal validity threats and exaggerated effect size (Crossley et al., 2008; Rooke et al., 2011); we did not observe a clear relationship. Consistent with what has been reported in stroke (O'Collins et al., 2006), our findings suggest that testing in more models tends to produce smaller effect sizes. However, since a larger sample of studies will provide a more precise estimate of effect, we cannot rule out that the trends observed for external validity reflect a regression to the mean.

Second, preclinical studies for sunitinib seem to be prone to publication bias. Notwithstanding limitations on using funnel plots to detect publication bias (Lau et al., 2006), our plots were highly asymmetrical. That all malignancy types tested showed statistically significant anti-cancer activity strains credulity. Others have reported that far more animal studies report statistical significance than would be expected (Wallace et al., 2009; Tsilidis et al., 2013), and our observation that two thirds of individual experiments showed significance extends these findings.

Third, we were unable to detect a meaningful relationship between preclinical effect sizes and known clinical behavior. Although a full analysis correlating trial and preclinical effect sizes will be needed, we did not observe obvious relationships between the two. We also did not detect a dose–response effect over three orders of magnitude even within an indication—RCC—known to respond to sunitinib and even when different time points were used. It is possible that heterogeneity in cell lines or strains may have obscured the effects of dose. For example, experimenters may have delivered higher doses to xenografts known to show slow tumor growth. However, RCC patients—each of whom harbors genetically distinct tumors—show dose–response effects in trials (Faivre et al., 2006) and between trials in a meta-analysis (Houk et al., 2010). It is also possible that the toxicity of sunitinib may have limited the ability to demonstrate dose response, though this contradicts demonstration of dose response within studies (Abrams et al., 2003; Amino et al., 2006; Ko et al., 2010). Finally, the tendency for preclinical efficacy studies to report drug dose, but rarely drug exposure (i.e., serum measurement of active drug), further limits the construct validity of these studies (Peterson and Houghton, 2004).

One explanation for our findings is that human xenograft models, which dominated our meta-analytic sample, have little predictive value, at least in the context of receptor tyrosine kinase inhibitors. This is a possibility that contradicts other reports (Kerbel, 2003; Voskoglou-Nomikos et al., 2003). We disfavor this explanation in light of the suggestion of publication bias; also, xenografts should show a dose–response regardless of whether they are useful clinical models. A second explanation is that experimental methods are so varied as to mask real effects. However, we note that the observed patterns on experimental design are based purely on what was reported in the ‘Materials and methods’ sections. Third, experiments assessing changes in tumor volume might only be interpretable in the context of other experiments within a preclinical report, such as with mechanistic and pharmacokinetic studies. This explanation is consistent with our observation that studies testing effect along a causal pathway tended to produce smaller effect sizes. A fourth possible explanation for our findings is that the predictive value of a small number of preclinical studies was obscured by inclusion of poorly designed and executed preclinical studies in our meta-analysis. Quantitative analysis of preclinical design factors that confer greater clinical generalizability awaits side-by-side comparison with pooled effects in clinical trials. Finally, it may be that design and reporting practices are so poor in preclinical cancer research as to make interpretation of tumor volume curves useless. Or, non-reporting may be so rampant as to render meta-analysis of preclinical research impossible. If so, this raises very troubling questions for the publication economy of cancer biology: even well-designed and reported studies may be difficult to interpret if their results cannot be compared to and synthesized with other studies.

Our systematic review has several limitations. First, we relied on what authors reported in the published study. It is possible certain experimental practices, like randomization, were used but not reported in methods. Further to this, we relied only on published reports, and restriction of searches to the English language may have excluded some articles. In February of 2012, we filed a Freedom of Information Act request with the Food and Drug Administration (FDA) for additional preclinical data submitted in support of sunitinib's licensure; nearly 4 years later, the request has not been honored. Second, effect sizes were calculated using graph digitizer software from tumor volume curves: minor distortion of effect sizes may have occurred but was likely non-differential between groups. Third, subtle experimental design features that are not apparent in ‘Materials and methods’ sections may explain our failure to detect a dose–response effect. For instance, few reports provide detailed animal housing and testing conditions, perhaps leading to important inter-laboratory differences in tumor growth. It should also be emphasized that our study was exploratory in nature; findings like ours will need to be confirmed using prespecified protocols. Fourth, our study represents analysis of a single drug, and it may be that our findings do not extend beyond receptor tyrosine kinase inhibitors, or sunitinib. However, many of our findings are consistent with those observed in other systematic reviews of preclinical cancer interventions (Amarasingh et al., 2009; Sugar et al., 2012; Hirst et al., 2013). Fifth, our analysis does not directly address many design elements, like duration of experiment or choice of tissue xenograft, that are likely to bear on study validity. Finally, we acknowledge that there may be funding constraints that limit implementation of the validity practices described above. We note, nevertheless, that other realms, in particular neurology, have found ways to make such methods a mainstay.

Numerous commentators have raised concerns about the design and reporting of preclinical cancer research (Sugar et al., 2012; Begley, 2013b). In one report, only 11% of preclinical cancer studies submitted to a major biotechnology company withstood in-house replication (Begley and Ellis, 2012). The Center for Open Science and Science Exchange has initiated a project that will attempt to reproduce 50 of the highest impact papers in cancer biology published between 2010 and 2012 (Morrison, 2014). In a recent commentary, Smith and Houghton fault many researchers for performing in vitro preclinical tests using drug levels that are clinically unachievable due to toxicity (Smith and Houghton, 2013). Unaddressed preclinical validity threats like this, and the ones documented in our study, encourage futile clinical development trajectories. Many research areas, like stroke, epilepsy, and cardiology, have devised design guidelines aimed at improving the clinical generalizability of preclinical studies (Fisher et al., 2009; Galanopoulou et al., 2012; Curtis et al., 2013; Pusztai et al., 2013); and the ARRIVE guidelines (Kilkenny et al., 2010) for reporting animal experiments have been taken up by numerous journals and funding bodies. Our findings provide further impetus for developing and implementing guidelines for the design, reporting, and synthesis of preclinical studies in cancer.

Materials and methods

Literature search

To identify all in vivo animal studies testing the anti-cancer properties of sunitinib (‘efficacy studies’), we queried the following databases on 27 February 2012 using a search strategy adapted from Hooijmans et al. (2010) and de Vries et al. (2011): Ovid MEDLINE In-Process & Other Non-Indexed Citations and Ovid MEDLINE (dates of coverage from 1948 to 2012), EMBASE Classic and EMBASE database (dates of coverage from 1974 to 2012) and BIOSIS Previews (dates of coverage from 1969 to 2012). Search results were entered into an EndNote library and duplicates were removed. Additional citations were identified during the screening of identified articles. See Table 1—source data 1C,D for detailed search strategy and PRISMA flow diagram.

Screening was performed at citation level by two reviewers (CF and VCH), and at full-text by one reviewer (VCH). Inclusion criteria were (a) original reports or abstracts, (b) English language, (c) at least one experiment measuring disease response in live, non-human animals, (d) use of sunitinib in a control, comparator, or experimental context, and (e) a test of anti-cancer activity. To avoid capturing the same experiment twice, in rare cases where the same experiment was reported in different articles, the most detailed and/or recent publication was included.

Extraction

All included studies were evaluated at the study-level, but only those with eligible experiments (e.g., those evaluating the effect of monotherapy on tumor volume and that were reported with sample sizes and error measurements) were forwarded to experiment-level extractions. We excluded experiments when they had been reported in a previous publication after specifically searching for duplicates during screening and analysis. For each eligible experiment, we extracted experimental design elements derived from a prior systematic review of validity threats in preclinical research (Henderson et al., 2013).

Details regarding the coding of internal and construct validity categories are given in Figure 1—source data 1A. To score for external validity, we created an index that summed the number of species and models tested for a given malignancy and awarded an extra point if more than one species and model was tested. For example, if experiments within a malignancy tested two species and three different model types, the external validity score would be 4 (1 point for the second species, one point for the second model type, one point for the third model type, and an extra point because more than one model and species were employed).
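The scoring rule in the worked example above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name and argument names are hypothetical, and the arithmetic follows the worked example (one point for each species or model type beyond the first, plus one bonus point), which is worded slightly differently from the formula given in the Figure 1 caption.

```python
def external_validity_score(n_species: int, n_models: int) -> int:
    """Illustrative external validity index for one malignancy type.

    Assumption: follows the worked example in the text, i.e. one point per
    species beyond the first, one point per model type beyond the first,
    and one extra point when more than one species AND more than one
    model type were tested.
    """
    if n_species < 1 or n_models < 1:
        raise ValueError("at least one species and one model are required")
    score = (n_species - 1) + (n_models - 1)
    if n_species > 1 and n_models > 1:
        score += 1  # bonus for diversity in both species and models
    return score


# The example from the text: two species and three model types.
print(external_validity_score(2, 3))  # 4
```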

Our primary outcome was experimental tumor volume and we extracted necessary information (sample size, mean measure of treatment effect, and SDM/SEM) to enable calculation of study and aggregate level effect sizes. Since the units of tumor volume were not always consistent between experiments, we extracted those experiments for which a reasonable proxy of tumor volume could be obtained. These included physical caliper measurements (often reported in mm³ or cm³), tumor weights (often reported in mg), optical measurements made from luminescent tumor cell lines (often reported in photons/second), and fold differences in tumor volumes between the control and treatment arms. We extracted experiments of both primary and metastatic tumors, but not experiments where tumor incidence was reported. To account for these different measures of tumor volume, SMDs were calculated using Hedges' g. Hedges' g is a widely accepted standardized measure of effect in meta-analyses where units are not always identical. For experiments where more than one dose of sunitinib was tested against the same control arm, we created a pooled SMD to adjust appropriately for the multiple use of the same control group. Data were extracted at baseline (Day 0, defined as the first day of drug administration), Day 14 (the closest measured data point to 14 days following first dose), and the last common time point (LCT) between the control group and the treatment group. The LCT varied between experiments; it was the last time point for which we could calculate SMD and often represented the point at which the greatest difference was observed between the arms. Data presented graphically were extracted using the graph digitizer software GraphClick (Arizona Software). Extraction was performed by four independent and trained coders (VCH, ND, AH, and NM) using DistillerSR. There was a 12% double-coding overlap to minimize inter-rater heterogeneity and prevent coder drift. Discrepancies in double coding were reconciled through discussion, and if necessary, by a third coder. The gross agreement rate before reconciliation for all double-coded studies was 83%.
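For readers unfamiliar with Hedges' g, the following Python sketch shows one common textbook formulation computed from the extracted quantities (sample size, mean, and SD or SEM per arm). It is a sketch under stated assumptions, not the software used in the study: the function names are hypothetical, the small-sample correction uses the usual approximation J = 1 − 3/(4·df − 1), and the input numbers are invented for illustration.

```python
import math


def sem_to_sd(sem: float, n: int) -> float:
    """Convert a standard error of the mean to a standard deviation."""
    return sem * math.sqrt(n)


def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Return (g, variance of g) for a treatment arm versus a control arm.

    Negative values mean the treated tumors were smaller than controls.
    """
    df = n_t + n_c - 2
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_t - mean_c) / pooled_sd      # Cohen's d
    j = 1 - 3 / (4 * df - 1)               # small-sample correction factor
    var_d = (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
    return j * d, j**2 * var_d


# Hypothetical experiment: 8 animals per arm, tumor volume in mm^3.
g, var_g = hedges_g(mean_t=100.0, sd_t=50.0, n_t=8,
                    mean_c=200.0, sd_c=50.0, n_c=8)
print(round(g, 2), round(var_g, 2))  # -1.89 0.34
```

Because g is expressed in pooled standard deviation units, experiments reporting caliper volumes, tumor weights, or luminescence can be placed on a common scale.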

Meta-analysis

Effect sizes were calculated as SMDs using Hedges' g with 95% confidence intervals. Pooled effect sizes were calculated using a random effects model employing the DerSimonian and Laird (1986) method, in OpenMeta[Analyst] (Wallace et al., 2009). We also calculated heterogeneity within each malignancy using I² statistics (Figure 2—source data 1B). To assess the predictive value of preclinical studies in our sample, we calculated pooled effect sizes for each type of malignancy. Subgroup analyses were performed for each validity element. p-values were calculated by a two-sided independent group t-test. Statistical significance was set at a p-value <0.05; as this was an exploratory study we did not adjust for multiple analyses.
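As a rough illustration of the pooling step, the sketch below implements the standard DerSimonian and Laird moment estimator in plain Python. It is not the OpenMeta[Analyst] implementation; the function name, the returned dictionary keys, and the two-experiment input are assumptions made for the example, and the 95% interval uses the normal approximation (±1.96 standard errors).

```python
import math


def dersimonian_laird(effects, variances):
    """Random-effects pooling of effect sizes (DerSimonian-Laird estimator)."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two effects with matching variances")

    # Fixed-effect (inverse-variance) weights and pooled mean.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q and the moment estimate of between-experiment variance.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi**2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights, pooled estimate, and 95% confidence interval.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0
    return {"pooled": pooled, "se": se, "tau2": tau2, "i2": i2,
            "ci": (pooled - 1.96 * se, pooled + 1.96 * se)}


# Two hypothetical experiments with unit variances.
result = dersimonian_laird([0.0, 2.0], [1.0, 1.0])
print(result["pooled"], result["tau2"], result["i2"])  # 1.0 1.0 50.0
```

When the experiments agree more closely than their sampling error predicts, the estimated between-experiment variance is truncated at zero and the result reduces to the fixed-effect estimate.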

Funnel plots to assess publication bias and Duval and Tweedie's trim and fill estimates were generated using Comprehensive Meta Analyst software (Dietz et al., 2014). Funnel plots were created for all experiments in aggregate, and for the three indications for which greater than 20 experiments were analyzable.

Dose–response curves are a widely used tool for testing the strength of causal relationships (Hill, 1965), and if preclinical studies indicate real drug responses, we should be able to detect a dose–response effect across different experiments. Dose–response relationships were found in post-analysis of sunitinib clinical studies in metastatic RCC and gastrointestinal stromal tumor (GIST) (Houk et al., 2010). We tested for all indications in aggregate, as well as for RCC, an indication known to respond to sunitinib in human beings (Motzer et al., 2006a, 2006b, 2009). To eliminate variation at the LCT between treatment and control arms, dose–response curves were created using data from a time point 14 days from the initiation of sunitinib treatment. Experiments with more than one treatment arm were not pooled as in other analyses, but expanded out so that each treatment arm (with its respective dose) could be plotted properly. As we were unable to find experiments that reported drug exposure (e.g., drug serum levels), we calculated pooled effect sizes in OpenMeta[Analyst] and plotted against dose. To avoid the confounding effect of discontinuous dosing, we included only experiments that used a regular administration schedule without breaks (i.e., sunitinib administered at a defined dose once a day instead of experiments where sunitinib was dosed more irregularly or only once).

As this meta-analysis was exploratory and involved development of methodology, we did not prospectively register a protocol.

References

  2. TJ Abrams, LJ Murray, E Pesenti, VW Holway, T Colombo, LB Lee, JM Cherrington, NK Pryer (2003) Preclinical evaluation of the tyrosine kinase inhibitor SU11248 as a single agent and in combination with ‘standard of care’ therapeutic agents for the treatment of breast cancer. Molecular Cancer Therapeutics 2:1011–1021.
  16. G Dietz, IJ Dahabreh, J Gurevitch, MJ Lajeunesse, CH Schmid, TA Trikalinos, BC Wallace (2014) OpenMEE: software for ecological and evolutionary meta-analysis.
  24. AB Hill (1965) The environment and disease: association or causation? Proceedings of the Royal Society of Medicine 58:295–300.
  29. RS Kerbel (2003) Human tumor xenografts as predictive preclinical models for anticancer drug activity in humans: better than commonly perceived-but they can be improved. Cancer Biology & Therapy 2:S134–S139.
  50. T Voskoglou-Nomikos, JL Pater, L Seymour (2003) Clinical predictive value of the in vitro cell line, human xenograft, and mouse allograft preclinical cancer models. Clinical Cancer Research 9:4227–4239.

Decision letter

  1. M Dawn Teare
    Reviewing Editor; University of Sheffield, United Kingdom

eLife posts the editorial decision letter and author response on a selection of the published articles (subject to the approval of the authors). An edited version of the letter sent to the authors after peer review is shown, indicating the substantive concerns or comments; minor concerns are not usually shown. Reviewers have the opportunity to discuss the decision before the letter is sent (see review process). Similarly, the author response typically shows only responses to the major concerns raised by the reviewers.

Thank you for submitting your work entitled “A Meta-Analysis of Validity Threats and Clinical Correlates in Preclinical Research of Sunitinib” for peer review at eLife. Your submission has been favorably evaluated by Prabhat Jha (Senior Editor) and three reviewers, one of whom, Dawn Teare, is a member of our Board of Reviewing Editors. The other two reviewers, Carlijn Hooijmans and Glenn Begley, have agreed to share their identity.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

The authors describe a meta-analysis of preclinical studies performed with sunitinib. Their analysis found poor scientific methodology, including lack of blinding, lack of evidence of power calculations, and likely bias from over-reporting of positive results. It provides objective evidence to support claims that have been made by others. This is an important and valuable contribution. It addresses an issue of increasing scientific and societal importance.

Essential revisions:

1) The dose response analysis needs revision. The systematic review has collected the data from animal research but I cannot see where the clinical outcome data comes from. In fact I do not see any clinical data. Is the assumption that the clinical data does show a dose response? I find the lack of a dose response rather curious in that presumably individual studies did report an effect but somehow there is heterogeneity between the studies and hence this reduces to a random scatter? There appear to be a lot of points in Figure 5A but I cannot easily see how many studies were eligible for the dose-response. I also do not understand the statistical analysis going along with Figure 5. What sort of curve/line is being fitted? What role does the two-group t-test have in a dose response analysis? The line drawn on the figure appears to be a straight line. I think this section needs to be more clearly written.

2) It would be very helpful for the authors to comment on the number of studies that reported drug exposure rather than drug dose. It is common for preclinical studies to report the intended drug dose, but without reference to the actual levels of drug attained. Assessment of drug levels confirms (i) that drug was actually administered and (ii) to the intended cohort. Moreover, it is frequently the case that when levels rather than dose are examined, it is evident that the levels achieved in rodents are not tolerated in humans. This has specifically been recognized for sunitinib and is an additional explanation for the disconnection between preclinical versus clinical effectiveness.

3) It would also be helpful if the authors could comment on the number of studies that included a positive control. While sunitinib may have had activity, without a positive control it is difficult to place that activity into context. Was the efficacy comparable or greater than that seen with another known active agent?

4) It would be beneficial to understand how many of the studies used an appropriate statistical test when analyzing their data.

5) The authors did not comment on the ARRIVE Guidelines for pharmacology studies. These should be discussed (cited in Begley and Ioannidis, Reproducibility in Science: Improving the Standard for Basic and Preclinical Research, Circulation Research, 2015; 116: 116-126).

6) The authors should check that their references are correctly cited. For example, the statement (in the Introduction) “A recent systematic review of guidelines for limiting bias in preclinical research design was unable to identify any guidelines in oncology (Han et al., 2013)” is not supported by Han et al., 2013.

7) The observation that “Our findings suggest that testing in more models tends to produce smaller effect sizes” has also been made previously (cited in Begley and Ioannidis, 2015).

8) Why did the authors of this review decide a) to include only papers published in English b) not to retrieve data via contacting the authors of the original papers? The authors conclude that there is a substantial risk of publication bias. But the observed risk of publication bias might have been much smaller had the authors contacted the original authors for missing data and also included papers published in languages other than English.

9) In this review, the outcome measure of interest needs a more detailed description in the Material and methods section. In the subsection “Extraction,” the authors state that they are only interested in tumor volume (“[…] those evaluating the effect of monotherapy on tumor volume”), whereas the Abstract and Introduction talk about tumor growth (which is a much broader term). Did the authors only assess tumor volume of primary tumors? Or did they include the volumes of metastases in the body as well? Did the authors include studies in which the weight of tumors and metastases were assessed? And what about tumor incidence? The authors also state: “[…] To account for different measures of tumor growth”; which measures of tumor growth did they include?

10) More details are needed in the Materials and methods concerning the meta analyses:

a) Are data pooled in case tumor volume was determined in various regions of the body within one experiment?

Are the data pooled in the overall analyses (Figure 3) when results were assessed at various time points?

b) In animal studies the same control group is often used for multiple experimental groups. Did the authors correct for multiple use of the control groups?

c) What was the minimum group size to allow subgroup analyses?

11) The authors should report group sizes of all subgroups (also in figures; e.g. Figure 3), and take group size into account when interpreting the results.

12) The authors should report heterogeneity statistics (for example I²), and take these results into account when interpreting the data.

https://doi.org/10.7554/eLife.08351.014

Author response

Essential revisions:

1) The dose response analysis needs revision. The systematic review has collected the data from animal research but I cannot see where the clinical outcome data comes from. In fact I do not see any clinical data. Is the assumption that the clinical data does show a dose response? I find the lack of a dose response rather curious in that presumably individual studies did report an effect but somehow there is heterogeneity between the studies and hence this reduces to a random scatter? There appear to be a lot of points in Figure 5A but I cannot easily see how many studies were eligible for the dose-response. I also do not understand the statistical analysis going along with Figure 5. What sort of curve/line is being fitted? What role does the two-group t-test have in a dose response analysis? The line drawn on the figure appears to be a straight line. I think this section needs to be more clearly written.

Dose response is discussed twice in the manuscript. First it appears when we report the fraction of studies that tested a dose response. Here, our emphasis is on dose response testing as a means of enhancing the internal validity of claims regarding clinical promise. The second mention is at the end of our Results section. There, the dose-response analysis was not meant to compare our preclinical results directly with sunitinib clinical data (although a dose-response relationship in GIST and mRCC has been demonstrated by others – see below). Our premise was that preclinical studies should demonstrate a dose-response relationship in the absence of overwhelming inter-experimental heterogeneity and publication bias. This premise was based on the known dose response in human beings, and the fact that dose-response was shown within individual preclinical experiments. The reviewers have interpreted our point correctly: our inability to show a clear dose-response after pooling all results raises some concerns about the predictive value of studies included in our meta-analysis.

Regarding sample size, there are 158 data points in Figure 5A. Each data point represents a single experiment or sub-experiment (in cases where one experiment had more than one treatment arm with a defined dose). Inclusion in the dose-response analysis required experiments or sub-experiments to have used a regular dosing schedule (i.e. sunitinib administered at a defined dose once a day, as opposed to experiments where sunitinib was dosed more irregularly or only once). These inclusion criteria led to the exclusion of 26 experiments or sub-experiments. It is worth reiterating that the number of data points possible in our dose-response analysis for all malignancies was 183. This number differs from the total number of data points possible in our other analyses, where we calculated a weighted average of the SMDs and 95% CIs for experiments that had more than one treatment arm.

Regarding the curve we fitted, we believe our simple regression (calculated using the linear regression function in Excel) is sound for descriptive purposes and for assessing a simple relationship between dose and effect. Our intent was simply to judge the presence or absence of an expected positive correlation between effect and increasing dose, rather than to identify precise dose-response curves.
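For readers who want to see what such a descriptive fit involves, the following is a minimal sketch of an ordinary (unweighted) least-squares line through (dose, effect size) points. It is an illustration only, not the Excel calculation itself; the function name and the example data are hypothetical, and the example points were constructed to lie exactly on a line.

```python
from typing import List, Tuple


def least_squares_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept by ordinary least squares.

    `points` holds (dose, effect size) pairs. A positive slope would be
    the descriptive signature of a dose-response relationship.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all doses are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying on y = 0.05 * x + 0.5.
example = [(10.0, 1.0), (20.0, 1.5), (40.0, 2.5), (80.0, 4.5)]
slope, intercept = least_squares_line(example)
```

Because the fit is unweighted, every experiment or sub-experiment counts equally regardless of its precision, which is adequate for judging the sign of a trend but not for estimating a precise curve.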

Regarding the source of clinical outcome data, we think this confusion stems from a poor choice of title, which we changed to “A Meta-Analysis of Threats to Valid Clinical Inference in Preclinical Research of Sunitinib”.

We modified the manuscript as follows:

We enhanced the description of dose response in Methods, Results, and Discussion sections. We also cite Houk et al (PMID: 19967539) to show that a dose-response relationship was found in GIST and mRCC in clinical studies. This citation helps support our assertion that the absence of a dose-response in preclinical models (especially when RCC was analyzed alone) is odd.

We also removed the two-group t-test, which was a remnant of a previous draft and out of place.

We made some edits to the figure legend, as well as the Methods section, to explain more clearly how (and why) we constructed dose-response curves, as well as the number of experiments included in them.

2) It would be very helpful for the authors to comment on the number of studies that reported drug exposure rather than drug dose. It is common for preclinical studies to report the intended drug dose, but without reference to the actual levels of drug attained. Assessment of drug levels confirms (i) that drug was actually administered and (ii) to the intended cohort. Moreover, it is frequently the case that when levels rather than dose are examined, it is evident that the levels achieved in rodents are not tolerated in humans. This has specifically been recognized for sunitinib and is an additional explanation for the disconnection between preclinical versus clinical effectiveness.

We agree that reporting drug exposure (and measured serum levels of drug in experimental animals) would be the optimal measurement upon which to construct dose-response curves and to later correlate with clinical data. Unfortunately, these measurements were rarely included within tumour volume experiments we extracted. Indeed, this comment prompted us to reexamine the first 20% of studies in our sample and we found no instances where levels were measured instead of dose. We had included 4 studies that included serum measurements of sunitinib post-dosing (PK). However, none of these serum levels were taken during efficacy experiments; they were only performed in studies to correlate with molecular data (e.g. FLT3 expression levels).

We protocolized and standardized eligibility criteria and dose exposure measurement to reduce the variability inherent in using dose, given that drug serum levels were not available to us. For instance, in the dose-response analysis, we only included experiments where the dose was clearly stated and was administered once per day (as opposed to more complex or variable dosing schedules). Additionally, we used a common time point between experiments (14 days after the first day of drug administration).

The difficulty we faced constructing a robust measure of drug exposure and effect further highlights the poor reporting practices in preclinical research and the need for improvement in this regard.

In light of the above, we made the following modifications to the manuscript:

We added a statement in the Methods section as to why we used dose instead of drug exposure (“As we were unable to find experiments that reported drug exposure (e.g. drug serum levels), we calculated pooled effect sizes in OpenMeta[Analyst] and plotted against dose”).

We also added a statement in the Discussion section stating “the tendency for preclinical efficacy studies to report drug dose, but rarely drug exposure (i.e. serum measurement of active drug), further limits the construct validity of these studies,” and we buttressed this with a new reference (Peterson et al., 2004).

3) It would also be helpful if the authors could comment on the number of studies that included a positive control. While sunitinib may have had activity, without a positive control it is difficult to place that activity into context. Was the efficacy comparable or greater than that seen with another known active agent?

We agree that efficacy data are difficult to make sense of absent positive controls. Examining our dataset, we found that 93/158 experiments included in the meta-analysis included an active comparator drug arm. Note that we do not use the term “positive control” strictly here, as sunitinib was sometimes itself the positive control in a study of a newer drug.

We inserted two sentences into the subsection “Design Elements Addressing Validity Threats” describing the frequency of active drug comparators in experiments, and the relationship between their inclusion and reported effect sizes.

4) It would be beneficial to understand how many of the studies used an appropriate statistical test when analyzing their data.

We agree this would be a useful analysis but it is beyond the scope of this paper. To address appropriateness adequately we would need raw subject data, subject flow, protocols, and analytical plans. While some issues are black and white, many are nuanced or subjective and require complete background information. Instead, we focused our energies on the validity criteria identified in our PLoS Medicine systematic review of preclinical design guidelines.

5) The authors did not comment on the ARRIVE Guidelines for pharmacology studies. These should be discussed (cited in Begley and Ioannidis, Reproducibility in Science: Improving the Standard for Basic and Preclinical Research, Circulation Research, 2015; 116: 116-126).

We heartily agree that citing ARRIVE is important for this audience and we’ve now cited and singled it out for mention at the end of the Discussion. We also added a reference to the Circulation Research article, which provides a nice overview of reproducibility challenges. We also took this occasion to mention the Center for Open Science Reproducibility Project: Cancer Biology.

6) The authors should check that their references are correctly cited. For example, the statement (in the Introduction) “A recent systematic review of guidelines for limiting bias in preclinical research design was unable to identify any guidelines in oncology (Han et al., 2013)” is not supported by Han et al., 2013.

Thanks to the reviewers for catching this. It has now been corrected.

7) The observation that “Our findings suggest that testing in more models tends to produce smaller effect sizes” has also been made previously (cited in Begley and Ioannidis, 2015).

Thank you. We now cite the Circulation article (see above). For this particular claim, the referees are correct – this has been suggested before – in particular by CAMARADES. We now cite the paper (O’Collins et al, Annals Neurol, 2006).

8) Why did the authors of this review decide a) to include only papers published in English b) not to retrieve data via contacting the authors of the original papers? The authors conclude that there is a substantial risk of publication bias. But the observed risk of publication bias might have been much smaller had the authors contacted the original authors for missing data and also included papers published in languages other than English.

This was decided mainly because of feasibility and funding constraints. We did not have a budget to both identify relevant expertise and to translate articles. Contacting authors for additional data would pull in experiments and findings that have not been subject to peer review. Our experience contacting drug companies for missing data does not lead us to think this would have been a productive use of our time. We did, however, attempt to obtain Sugen/Pfizer’s preclinical dataset from FDA by filing a Freedom of Information Act (FOIA) request. The FOIA request was submitted in February of 2012, around the time the grant received funding. The 3-year grant ended in 2015, with the FOIA request not yet honoured. We added a sentence in the Discussion stating this as a limitation.

9) In this review, the outcome measure of interest needs a more detailed description in the Material and methods section. In the subsection “Extraction,” the authors state that they are only interested in tumor volume (“[…] those evaluating the effect of monotherapy on tumor volume”), whereas the Abstract and Introduction talk about tumor growth (which is a much broader term). Did the authors only assess tumor volume of primary tumors? Or did they include the volumes of metastases in the body as well? Did the authors include studies in which the weight of tumors and metastases were assessed? And what about tumor incidence? The authors also state: “[…] To account for different measures of tumor growth”; which measures of tumor growth did they include?

We expanded the Methods section to reflect the reviewers’ queries. For instance, we changed our broad references to “tumour growth” into references to tumour volume measurements (which is what we were extracting). We also provide a more detailed description of outcome measurement (see, for example, text beginning “Our primary outcome was experimental […]”, in the subsection “Extraction”).

10) More details are needed in the Materials and methods concerning the meta analyses.

a) Are data pooled in case tumor volume was determined in various regions of the body within one experiment?

We were a bit unsure what the reviewers mean by “various regions of the body within one experiment”. We did not encounter studies where tumors were injected into more than one location within a single experiment. Perhaps the referees are referring to inter-experimental variability in tumour implantation site (e.g. experimental renal tumour injected subcutaneously into the hindleg vs. injected orthotopically into the renal capsule of the mouse). If this is the case, we did record this information (as to whether the tumour implantation site was heterotopic, orthotopic, or systemic), but we did not perform separate analyses according to this variable. Our concern is that doing so would divide our sample into slices too thin to enable analysis. If the referees think it useful, we would be happy to add a sentence or two about the use of orthotopic vs. heterotopic models in our sample of experiments.

Are the data pooled in the overall analyses (Figure 3) when results were assessed at various time points?

We added clarifying information on our time points to the Methods (in the subsection “Extraction”). We also clarified which time point was used for which analysis in the Results section and in each figure legend. As explained in the subsection “Study Characteristics”, the mean number of days to the last common time point (LCT) was 31 days (± 14 days SDM).

b) In animal studies the same control group is often used for multiple experimental groups. Did the authors correct for multiple use of the control groups?

Indeed, we did adjust. In cases where multiple arms of treatment were reported versus the same control, a pooled estimate of the treatment arms was calculated using the inverse variance approach and then compared to the single control to produce an SMD estimate and 95% CIs.
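The inverse variance approach mentioned here can be sketched as follows. This is a simplified illustration under stated assumptions, not the calculation performed in OpenMeta[Analyst]: it pools arm-level SMDs that have already been computed against the shared control, it treats those SMDs as independent (which a shared control group strictly violates), and it uses a normal-approximation 95% CI. The function name and example values are hypothetical.

```python
import math
from typing import List, Tuple


def pool_inverse_variance(
    estimates: List[Tuple[float, float]]
) -> Tuple[float, float, float]:
    """Combine (smd, standard_error) pairs into one estimate.

    Each arm is weighted by 1 / SE^2, so more precise arms count more.
    Returns (pooled_smd, ci_low, ci_high) with a 95% confidence interval.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * smd for w, (smd, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    half_width = 1.96 * pooled_se  # normal approximation
    return pooled, pooled - half_width, pooled + half_width


# Made-up example: two treatment arms with equal precision.
pooled, ci_low, ci_high = pool_inverse_variance([(1.0, 0.5), (2.0, 0.5)])
```

With equal standard errors the pooled value is simply the mean of the two arm-level SMDs; unequal standard errors pull the pooled value toward the more precise arm.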

c) What was the minimum group size to allow subgroup analyses?

We did not specify a minimum group size for meta-analysis. Instead we presented all available data. The goal of subgroup analyses in meta-analysis is to present all available data relevant to the strata even if just 1 experiment, thus a minimum subgroup size is not as relevant. Now that we have integrated sample sizes into our figures more clearly, readers can perhaps make their own judgments.

11) The authors should report group sizes of all subgroups (also in figures; e.g. Figure 3), and take group size into account when interpreting the results.

We converted Figure 3 to forest plots and have included the subgroup sizes.

12) The authors should report heterogeneity statistics (for example I²), and take these results into account when interpreting the data.

The primary purpose of our paper was to evaluate potential sources of methodological and pre-clinical heterogeneity through pre-specified items of internal, external, and construct validity. We now report I² statistics, but we must note that we (and others – e.g. Paul SR, Donner A., Small sample performance of tests of homogeneity of odds ratios in k 2×2 tables., Stat Med 1992; 11: 159-65; Hardy RJ, Thompson SG., Detecting and describing heterogeneity in meta-analysis., Stat Med 1998; 17: 841-56) are somewhat skeptical about the value of heterogeneity statistics due to their extreme sensitivity in the presence of a small number of studies. As the referees will see, heterogeneity within each malignancy was, as expected, generally high (see Figure 2–source data 1B), with some notable exceptions (e.g. high-grade glioma, prostate cancer). By design, our study sheds light on possible sources of methodological heterogeneity (i.e. Figure 3) and we have incorporated such observations in our interpretation and conclusions.
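As a reminder of what the statistic measures, the following minimal sketch computes I² from Cochran's Q using the standard definition, I² = max(0, (Q − df) / Q) × 100, where df is the number of estimates minus one. It is an illustration rather than the OpenMeta[Analyst] output reported in the source data; the function name and example values are hypothetical.

```python
from typing import List


def i_squared(effects: List[float], standard_errors: List[float]) -> float:
    """Percentage of total variation attributable to heterogeneity.

    Q is the inverse-variance weighted sum of squared deviations of each
    effect from the fixed-effect pooled estimate.
    """
    if len(effects) != len(standard_errors) or not effects:
        raise ValueError("need matching, non-empty lists")
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0.0:
        return 0.0  # identical effects (or a single study): no heterogeneity
    return max(0.0, (q - df) / q) * 100.0


# Made-up example: three experiments with equal standard errors.
heterogeneous = i_squared([0.5, 1.5, 2.5], [0.5, 0.5, 0.5])
homogeneous = i_squared([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
```

In the first example Q is 8 on 2 degrees of freedom, giving an I² of 75%. With so few experiments, a small change in any one effect size moves Q, and therefore I², considerably, which is the sensitivity noted above.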

https://doi.org/10.7554/eLife.08351.015

Article and author information

Author details

  1. Valerie C Henderson

    Studies of Translation, Ethics and Medicine Research Group, Biomedical Ethics Unit, McGill University, Montréal, Canada
    Contribution
    VCH, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
  2. Nadine Demko

    Studies of Translation, Ethics and Medicine Research Group, Biomedical Ethics Unit, McGill University, Montréal, Canada
    Contribution
    ND, Acquisition of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
  3. Amanda Hakala

    Studies of Translation, Ethics and Medicine Research Group, Biomedical Ethics Unit, McGill University, Montréal, Canada
    Contribution
    AH, Acquisition of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
  4. Nathalie MacKinnon

    Studies of Translation, Ethics and Medicine Research Group, Biomedical Ethics Unit, McGill University, Montréal, Canada
    Contribution
    NMK, Acquisition of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
  5. Carole A Federico

    Studies of Translation, Ethics and Medicine Research Group, Biomedical Ethics Unit, McGill University, Montréal, Canada
    Contribution
    CAF, Acquisition of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
  6. Dean Fergusson

    Department of Clinical Epidemiology, Ottawa Hospital Research Institute, Ottawa, Canada
    Contribution
    DF, Conception and design, Analysis and interpretation of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
  7. Jonathan Kimmelman

    Studies of Translation, Ethics and Medicine Research Group, Biomedical Ethics Unit, McGill University, Montréal, Canada
    Contribution
    JK, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article
    For correspondence
    jonathan.kimmelman@mcgill.ca
    Competing interests
    The authors declare that no competing interests exist.

Funding

Canadian Institutes of Health Research (Instituts de recherche en santé du Canada) (EOG 111391)

  • Jonathan Kimmelman

The funder had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

Dan G Hackam, Jeremy Grimshaw, Malcolm Smith, Elham Sabri, Benjamin Carlisle.

Reviewing Editor

  1. M Dawn Teare, Reviewing Editor, University of Sheffield, United Kingdom

Publication history

  1. Received: April 25, 2015
  2. Accepted: September 5, 2015
  3. Version of Record published: October 13, 2015 (version 1)

Copyright

© 2015, Henderson et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

