Point of View: Four erroneous beliefs thwarting more trustworthy research

  1. Mark Yarborough (corresponding author)
  2. Robert Nadon
  3. David G Karlin
  1. University of California, Davis, United States
  2. McGill University, Canada
  3. Independent researcher, France

Abstract

A range of problems currently undermines public trust in biomedical research. We discuss four erroneous beliefs that may prevent the biomedical research community from recognizing the need to focus on deserving this trust, beliefs that thus act as powerful barriers to necessary improvements in the research process.

Introduction

In 2014, in an essay titled ‘Why scientists should be held to a higher standard of honesty than the average person,’ a former editor of the British Medical Journal argued that science depends wholly on trust (Smith, 2014). While many in the biomedical research community may quibble over the word ‘wholly’ here, few would dispute his overall point: the public’s confidence is essential to the future of research. According to a noted scholar on the subject, the best way to enjoy trust is to deserve it (Hardin, 2002). One would hope that the research community is a deserving case, given the existence of safeguards such as professional norms, regulatory compliance and peer review. Unfortunately, there is an ever-growing body of evidence that calls into question the effectiveness of these measures.

This evidence includes, but is by no means limited to, findings about underpowered studies (Ioannidis, 2005), routine overestimations of efficacy (Sena et al., 2010; Tsilidis et al., 2013), the failure to take prior research into account (Robinson and Goodman, 2011; Lund et al., 2016), a propensity to confuse hypothesis-generating studies with hypothesis-confirming ones (Kimmelman et al., 2014), a worrisome waste of resources (Chalmers and Glasziou, 2009), and the low uptake of critical reforms meant to improve research (Enserink, 2017; Peers et al., 2014). A recent popular book, Rigor Mortis, synthesizes such evidence into a compelling narrative that casts the reputation of research in a negative light (Harris, 2017).

While all of this evidence is cause for concern, we are most concerned by the reluctance of the research community to implement the reforms that could improve research quality. One can imagine a continuum of research practices that impact how scientific understanding advances. At one end one encounters the unforgivable, such as data fabrication or falsification. At the other end one finds the perfect, such as published research reports so thorough that findings can be easily reproduced from them.

The concerns of interest to us in what follows have little to do with the misconduct found on the unforgivable end of the continuum. Instead, they fall all along it and pertain to unsound research practices (such as non-robust reporting of methods, flawed study designs, incomplete reporting of data handling, and deficient statistical analyses) that nevertheless impede the advance of science. These are the practices that reform measures could counter if researchers were less reluctant to adopt them. In an effort to account for this reluctance, we review four erroneous beliefs that we think contribute to it.

We acknowledge that we lack extensive data confirming the prevalence and distribution of these beliefs. Thus, readers can form their own opinions about whether the beliefs are as widespread as we fear they are. Our concerns have grown out of our careers in and around biomedical research, which will be the focus of our remarks below, though we think the issues are relevant to life sciences research more broadly. One of us (MY) has extensively studied how to promote trustworthiness in biomedical research, and another (RN) has a long and successful career devoted to understanding the role of sound methodologies in producing it. The final author (DGK) is a preclinical researcher who was among those who pioneered early efforts to learn how researchers and research institutions can meaningfully connect the research community with the publics it seeks to serve. We think this collective background lends credence to our analysis and to the strategy for moving forward that we recommend in the conclusion.

Recognizing the barriers to a greater focus on deserving trust

It’s about the science, not the scientists

Erroneous belief one is that questioning the trustworthiness of research simultaneously questions the integrity of researchers. As a result, many individuals react counterproductively to calls to improve trustworthiness. They are akin to pilots who confuse discussions about improving the flightworthiness of airplanes with criticism of their aviation skills. Though understandable, such concerns miss the point (Yarborough, 2014a). The multitude of methods, materials, highly sophisticated procedures and complex analyses intrinsic to biomedical research all create ways for it to err, making it exceptionally difficult to detect problems (Hines et al., 2014). These are the critical matters that all researchers must learn to direct their attention to. Yet they cannot do so if constructive criticism about how to improve science is taken personally.

We need to focus on the health of the orchard, not just the bad apples in it

Erroneous belief two is that the bulk of problems in research is due to bad actors. There is no doubt that misconduct is a substantial problem (Fang et al., 2012). This should not blind us, however, to how common study design and data analysis errors are in biomedical research (Altman, 1994). Indeed, these errors are likely to increase due to trends in current scientific practice, particularly the growing size and interdisciplinarity of investigative teams (Wuchty et al., 2007; He and Zhang, 2009; Gazni et al., 2012). Because they require divisions of labor and expertise, such collaborations create fertile ground for producing unreliable research. Affected publications draw much less scrutiny than those of authors who engage in misconduct (Steen et al., 2013), and thus problems in them are likely to be discovered much later, if at all. For example, consider that the number of retracted publications is much less than 1% of published articles (Grieneisen and Zhang, 2012), yet publication bias has been found to affect entire classes of research (Tsilidis et al., 2013; Macleod et al., 2015).

The prevalence of erroneous research results and the enduring problems they cause require proactive efforts to detect and prevent them. What we find instead is a disproportionate emphasis on detecting and punishing ‘bad apples.’ The more we concentrate on this, the more difficult it becomes to identify strategies that allow us to focus on what should be seen as more pressing issues.

Our beliefs about self-correcting science need self-correcting

Erroneous belief three is that science self-corrects. Assumptions that published studies are systematically replicated (or at least replicable), or that those that are not will eventually be identified as such, build resistance to reform. In theory, reproducibility injects quality assurance into the very heart of research. When one adds other traditional safeguards such as professional research norms and peer review, the reliability of research seems well guarded.

However, a growing body of research to check whether scientific results can be reproduced confirms the shortcomings of these safeguards (Hudson, 2003; Allchin, 2015; Banobi et al., 2011; Zimmer, 2011; Twaij et al., 2014; Drew, 2019). We mention just two examples of this research here. The Reproducibility Project: Cancer Biology has been underway for almost five years and originally sought to reproduce 50 critical cancer biology studies (Couzin-Frankel, 2013). The project was scaled back to 18 studies, due largely to costs, but also because important details about research methods were unreported in some of the studies the effort sought to reproduce. As for results, of the first 13 completed replication studies, only five produced results similar to the original studies while the other eight produced either mixed or negative results (Kaiser, 2018).

An effort to replicate the findings of 100 experimental studies in psychology journals produced a similarly low rate of replication. Only 36% of the original findings were replicated according to the conventional statistical significance standard of p<0.05 for an effect in the same direction (Open Science Collaboration, 2015).
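To make concrete how low statistical power alone can produce such numbers, consider the following minimal simulation (our own illustration; the effect size and sample sizes are assumptions, not values taken from either replication project). It estimates how often a faithful replication of a true but modest effect reaches p<0.05 in the original direction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def replication_rate(effect=0.4, n=20, trials=10_000):
    """Fraction of faithful replications of a true effect reaching p < 0.05.

    Each trial simulates a two-group comparison (Cohen's d = `effect`,
    `n` subjects per group) and counts a 'successful' replication when
    the result is significant in the original direction.
    """
    successes = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(effect, 1.0, n)
        t, p = stats.ttest_ind(treated, control)
        if p < 0.05 and t > 0:
            successes += 1
    return successes / trials

# With a modest true effect and small samples, well under half of
# faithful replications 'succeed' by the p < 0.05 criterion (roughly 0.23 here).
print(f"replication rate: {replication_rate():.2f}")
```

Even when every original finding is real, low power alone can drive replication rates to levels similar to those reported above, which is one reason failed replications require careful interpretation.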

When errors get corrected, it is more often due to happenstance than any kind of methodical effort

Such findings serve as a vivid wake-up call that alerts us to how easily and how often erroneous research results make their way into print, often in leading journals. Once there, they may linger for years or even decades prior to being discovered (if they are ever discovered) (Judson, 2004; Bar-Ilan and Halevi, 2017), and may continue to be cited post-discovery (Steen, 2011). And when errors get corrected, it is more often due to happenstance than any kind of methodical effort (Allchin, 2015). All this is sobering when we consider that erroneous findings can result in potentially dangerous clinical trials (Steen, 2011).

Further shaking our confidence in the ability of science to self-correct is how few opportunities there actually are to confirm results. Efforts such as the Reproducibility Project: Cancer Biology notwithstanding, most research sponsors and publishers value, and thus fund and publish, innovative studies rather than research that tries to confirm past findings. And even if sponsors did place higher value on confirmatory studies, the growing complexity of science can make confirmation difficult, or even impossible (Jasny et al., 2011). Beyond the possibility that information about study methods and materials is unavailable, studies may also use novel and/or highly sensitive or volatile study materials (Hines et al., 2014), impinge on intellectual property rights (Williams, 2010; Godfrey and German, 2008), or deal with proprietary data sets (Peng, 2011). Thus, even if there was once a time in science when there were chances ‘to get it right’ or when consensus could emerge, that is no longer the case (Yarborough, 2014b).

Following the rules does not guarantee we are getting it right

Erroneous belief four is that compliance with regulations is capable of solving the problems that gave rise to the regulations themselves. Governments, research sponsors and publishers have gone to great lengths to implement reforms that one hopes contribute to deserved trust. But such measures help only to a point; one can follow all the rules, extensive though they may be, and still not get it right (Yarborough et al., 2009). We offer efforts to combat research misconduct in the United States as evidence.

The United States Congress, following a series of research scandals, issued a mandate for corrective action to combat falsification, fabrication and plagiarism. This eventually led to a program that endures to this day (Office of Research Integrity, 2015), requiring federally funded institutions to investigate allegations of research misconduct. The much larger body of poor-quality science is left completely unaddressed by these government rules. Research shows that about 2% of researchers report engaging in misconduct while fifteen times as many (30%) report having engaged in practices that contribute to irreproducible research (Fanelli, 2009); other studies report even higher percentages (John et al., 2012; Agnoli et al., 2017). Yet, because compliance with these rules is mandatory, resources go overwhelmingly to investigating misconduct. Thus, while such rules confer quite modest protections on research, they require significant time, energy and money (Michalek et al., 2010), and simultaneously provide a false sense of security that problems are being resolved – when in fact they are not (Yarborough, 2014b).

Suggestions to help build cultures and climates that assure deserved trust

If we can find a way to shed these erroneous beliefs, we can become more proactive in demonstrating that we deserve the public’s trust. We would not need to start de novo. There are already some proven solutions, as well as promising new recommendations and reforms, that can make inroads into many of the problems identified above. We highlight just a few of them below. Broad implementation of such initiatives could pay valuable dividends. For instance, rather than expend extraordinary resources on investigations of misconduct after it has caused damage (Michalek et al., 2010), we might instead fund empirical studies of both existing and proposed reforms. We could thereby determine which reforms are most capable of strengthening the overall health of biomedical research (Ioannidis, 2014).

We recognize that the solutions we highlight below do not do justice to the full class of available reforms, but we do believe they constitute a reasonably representative group. Nor do we mean to suggest that they are without controversy. The main point of our essay, however, is not to provide a thorough review of current and proposed reforms and their individual merits. To do so would focus readers’ attention on what changes need to be made in research; our purpose is to explore erroneous beliefs that may prevent sufficient focus on why changes are needed in the first place.

If authors felt safe bringing honest errors to the attention of others, it would encourage much-needed openness about the mistakes that inevitably occur within fields as complex as biomedical research.

Publishing reforms: underway but they could be more ambitious

It is encouraging to see that many journals have begun to implement important reform measures. Among the most encouraging is that some now perform rigorous statistical review of appropriate studies, or make such reviews available to peer reviewers or associate editors who request them. Some journals have also modified their instructions to authors in order to improve the reporting of research results. The improved instructions bring transparency to research and aid reproducibility efforts. Recent studies of these modified instructions show that they improve published preclinical study reports, suggesting that even modest journal reforms can work to good effect (The NPQIP Collaborative group, 2019; Minnerup et al., 2016). It should be noted, though, that the benefits of such reforms might be small. A recent study showed that a checklist designed to improve compliance with the ARRIVE guidelines had a quite limited effect (Hair et al., 2018), showing that having helpful tools is no guarantee that they will be used. Thus, it remains unclear what the ultimate impact of such reform measures might be.

With this evidence in mind, it would be nice if journals were even more ambitious and took up some of the more novel recommendations. One example is to consider expanding the taxonomy for correcting and retracting publications so that authors can avoid the current stigma around correcting the scientific record (Fanelli et al., 2018). This would make it possible to take up a 2016 recommendation to reward authors for self-corrections and retractions (Fanelli, 2016). If authors felt safe bringing honest errors to the attention of others, it would encourage much-needed openness about the mistakes that inevitably occur within fields as complex as biomedical research.

Researcher practices: plentiful recommendations with too few takers

Publisher reforms can only accomplish so much. Most of the improvements that are required to demonstrate how the research community deserves the public’s trust need to arise from how research is conducted. A wealth of thoughtful recommendations is already in place, but too many await widespread adoption. Among the most notable are a set of recommendations for increasing value and reducing waste in biomedical research that appeared as part of a series of articles in The Lancet in 2014.

Those recommendations center on several needs: to carefully set research priorities; improve research design, conduct and analysis; improve research regulation and management; reduce incomplete or unusable reports of studies; and make research results more accessible (Macleod et al., 2014; Chalmers et al., 2014; Ioannidis et al., 2014; Salman et al., 2014; Glasziou et al., 2014; Chan et al., 2014). The series has not gone without notice, with more than 46,000 downloads of articles in the series within the first year of publication (Moher et al., 2016) and over 900 citations (as of early 2019) in PubMed Central-registered articles. Early evidence suggested that the series placed the issues that it addressed on the radar screens of research sponsors, regulators and journals. Disappointingly, academic institutions initially did not seem to pay them much notice (Moher et al., 2016). This reinforces our concern that we need to identify what it is about the mindset of so many in the research community that is currently stifling interest in reform. So long as this lack of interest persists, there is little hope that what we consider the highest impact changes will occur anytime soon. We have two such changes in mind that researchers themselves need to take more of the lead on.

We need to improve research design and its reporting

Researchers need to pay more attention to research methodology, given its central role in establishing the reliability of published research results. Some journals now encourage this behavior by, for instance, requiring that authors complete checklists to indicate whether or not they have used study design procedures such as blinding, randomization and statistical power analysis. Depending on the journal and type of study, modest to substantial gains in the reporting of study design details are achieved when researchers complete these requirements (The NPQIP Collaborative group, 2019; Hair et al., 2018; Han et al., 2017). Such improved reporting allows for better assessment of the published literature. Better still would be researchers routinely using universally accepted basic procedures. For example, it is widely acknowledged that, for animal studies, randomly allocating animals to groups and blinding experimenters to group allocations are required for sound statistical inference (Macleod, 2014).
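As a concrete illustration of these two procedures, here is a minimal sketch (our own, not taken from any cited guideline; the animal names and group labels are invented) of how a randomized, blinded allocation might be generated, with opaque codes hiding group identity from the experimenter until the analysis is locked.

```python
import random

def blinded_allocation(animal_ids, groups=("control", "treatment"), seed=42):
    """Randomly assign animals to groups and return (blinded view, key).

    The experimenter works only from `blinded`, which maps each animal
    to an opaque code; `key` maps codes back to group names and is held
    by a third party until unblinding.
    """
    rng = random.Random(seed)
    ids = list(animal_ids)
    rng.shuffle(ids)                      # randomization step
    codes = [f"G{i}" for i in range(len(groups))]
    rng.shuffle(codes)                    # code order reveals nothing about group order
    key = dict(zip(codes, groups))
    # Deal shuffled animals round-robin into coded groups for near-equal sizes
    blinded = {animal: codes[i % len(codes)] for i, animal in enumerate(ids)}
    return blinded, key

blinded, key = blinded_allocation([f"mouse{n:02d}" for n in range(1, 13)])
print(blinded)  # e.g. {'mouse07': 'G1', 'mouse03': 'G0', ...}
# `key` (e.g. {'G1': 'control', 'G0': 'treatment'}) stays sealed until unblinding.
```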

We need to increase data sharing

Routine sharing of data should be the new default for researchers, unless there are compelling reasons not to share. Data sharing can, among other things, promote reproducibility, improve the accuracy of results, accelerate research, and promote better risk-benefit analysis in clinical trials (Institute of Medicine, 2013). Despite the growing consensus about the value that data sharing brings to research, we must acknowledge that when and how data sharing should occur remains controversial. As recently noted, “[s]ome argue that the researchers who invested time, dollars, and effort in producing data should have exclusive rights to analyze the data and publish their findings. Others point out that data sharing is difficult to enforce in any case, leading to an imbalance in who benefits from the practice – a problem that some researchers say has yet to be satisfactorily resolved” (Callier, 2019). Given such issues, it comes as no surprise that compliance with journal data sharing policies can be lackluster (Stodden et al., 2018).

Taking these difficulties into consideration, realistic suggestions to encourage data sharing include: 1) that all journals implement a clear data sharing policy (Nosek et al., 2015) that allows reasonable flexibility to take into account cases when data cannot be shared because of ethical or identity protection concerns, or that allows ‘embargo’ periods during which data are not shared (Banks et al., 2019); 2) that journals systematically require data sharing during the review process, to help reviewers evaluate the results (with the additional benefit that no further effort would be required afterward to make the data public); 3) that training courses in Responsible Conduct of Research (RCR) include methods to de-identify study participants and aggregate their results, a major prerequisite to data sharing (Banks et al., 2019) that is illustrated in the sketch below; and 4) the creation of awards for researchers who promote data sharing (Callier, 2019).
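As a minimal sketch of the de-identification and aggregation step mentioned in suggestion 3 (our own illustration; the column names, salt, and age bands are assumptions, and real de-identification requires considerably more care than this), one might:

```python
import hashlib
import pandas as pd

SALT = "project-specific-secret"  # kept separate from the shared data

def deidentify(df):
    """Drop direct identifiers, replace IDs with salted hashes, coarsen age."""
    out = df.drop(columns=["name", "email"])  # remove direct identifiers
    out["participant"] = [
        hashlib.sha256((SALT + pid).encode()).hexdigest()[:12]
        for pid in out.pop("participant_id")
    ]
    out["age_band"] = pd.cut(out.pop("age"), bins=[0, 40, 65, 120],
                             labels=["<40", "40-64", "65+"])
    return out

df = pd.DataFrame({
    "participant_id": ["p1", "p2", "p3"],
    "name": ["A", "B", "C"], "email": ["a@x", "b@x", "c@x"],
    "age": [34, 58, 71], "outcome": [1.2, 0.8, 1.5],
})
shared = deidentify(df)
# Share aggregated results rather than row-level values where required
print(shared.groupby("age_band", observed=True)["outcome"].mean())
```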

Finally, we need to know whether improved methodology and increased data sharing are really leading to reproducible research. Unfortunately, we could not locate studies that have addressed this question, making this an important line of future research.

Institution level practices: promising and proven remedies looking for suitors

When it comes to institutional practices that could strengthen the trustworthiness of research, surely the holy grail would be to better align researcher incentives with good science (Ware and Munafò, 2015). This would be a heavy lift since it would involve changes to how institutions collectively approach recruitment, tenure and promotion. Rather than relying upon current surrogates such as bibliometrics for assessing faculty productivity and success (McKiernan, 2019), they would need to use more direct measures of good science. A workshop involving experts in research quality and other fields was convened in Washington DC in 2017 to explore what such measures might be and how they might be used. It identified six key principles that institutions could embrace to effect such a transition (Moher et al., 2018), but their effectiveness remains untested as they have yet to be implemented. It is worth noting, however, that at least one institution – the University Medical Center Utrecht – has tried to reengineer how it assesses its research programs and faculty in order to better align incentives with good science. In the words of the champions of that change initiative, they are learning how to better “shape the structures that shape science…[to] make sure that [those structures] do not warp it” (Benedictus et al., 2016).

There are smaller scale reforms that institutions could also embrace to help ensure high quality standards in research. For example, there are many innovative practices that institutions could currently use to prevent problems, but are not using. Perhaps the most obvious one is a research data audit. Akin to a financial audit, a research data audit is meant to check that published data are “quantifiable and verifiable” by examining “the degree of correspondence of the published data with the original source data” (Shamoo, 2013). First proposed at scientific conferences in the 1970s (Shamoo, 2013) and later in print in Nature in 1987 (Dawson, 1987), such audits “would typically require the examination of data in laboratory notebooks and other work sheets, upon which research publications are based” (Glick, 1989). Advocates argue that data audits should be routine in as many settings as possible. This would provide a double benefit: it would help to deter fraud on the one hand and promote quality assurance on the other (Shamoo, 2013).
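To illustrate what the core of such an audit might look like in practice, here is a minimal sketch (our own illustration, not a procedure from Shamoo or Glick; the measure names and data are hypothetical) that recomputes published summary statistics from source data and flags discrepancies.

```python
import pandas as pd

def audit_summary(raw: pd.DataFrame, published: dict, tol: float = 0.005):
    """Flag published summary statistics that the source data do not support.

    `published` maps a measure name to the value claimed in the paper;
    each is recomputed from the raw data, and discrepancies beyond `tol`
    are reported.
    """
    recomputed = {
        "n": float(len(raw)),
        "mean_outcome": raw["outcome"].mean(),
        "sd_outcome": raw["outcome"].std(ddof=1),
    }
    findings = []
    for measure, claimed in published.items():
        actual = recomputed.get(measure)
        if actual is None:
            findings.append(f"{measure}: not recomputable from source data")
        elif abs(actual - claimed) > tol:
            findings.append(f"{measure}: published {claimed}, source gives {actual:.4g}")
    return findings

# Stand-in for data transcribed from a lab notebook or instrument file
raw = pd.DataFrame({"outcome": [1.2, 0.8, 1.5, 1.1, 0.9, 1.4]})
claims = {"n": 6, "mean_outcome": 1.15, "sd_outcome": 0.50}
for issue in audit_summary(raw, claims):
    print("FLAG:", issue)  # flags the sd, which the source data do not support
```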

The FDA and the United States Office of Research Integrity currently conduct such audits ‘for cause’ when misconduct or other misbehaviors are suspected. The FDA also uses them for certain new drugs deemed to be potentially ‘high risk.’ Although most current audits typically review the proper use of specified research procedures, there is no reason that they could not also be used to encourage the proper generation and use of actual data (Shamoo, 2013).

Critical incident reporting (CIR) is another promising prevention practice. It can be used to uncover problems that, if left unchecked, might prove detrimental to a group’s research or reports about their research. Open software exists for implementing such a system. Accessed anonymously online, the system prompts users to report in their own terms what happened that concerns them. Experts can then promptly analyze incidents to see what systems changes might prevent future recurrences. The first adopters of such a system report that it “has led to the emergence of a mature error culture, and has made the laboratory a safer and more communicative environment” (Dirnagl et al., 2016).
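The essential data model of such a system is simple. Below is a minimal sketch (our own illustration, not the software used by Dirnagl et al.) of an anonymous incident store that records no reporter-identifying fields and coarsens timestamps to the month, so that reports cannot easily be traced back to individuals.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class IncidentReport:
    """A free-text report with no reporter-identifying fields."""
    what_happened: str                 # reporter's own words
    suggested_fix: str = ""            # optional
    month: str = field(default_factory=lambda: date.today().strftime("%Y-%m"))

REPORTS: list[IncidentReport] = []     # in practice, a database behind a web form

def submit(report: IncidentReport) -> None:
    """Accept a report; nothing about the submitter is stored or logged."""
    REPORTS.append(report)

submit(IncidentReport(
    what_happened="Plate reader calibration skipped; two runs may be off.",
    suggested_fix="Add calibration to the weekly checklist.",
))
# Experts periodically review REPORTS for systemic fixes, not for blame.
```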

The same opportunity pertains to two other successful problem-reduction methods: root cause analysis (RCA) and failure modes and effects analysis (FMEA) (Yarborough, 2014a). RCA examines past near misses and problems in order to identify their main contributors. FMEA anticipates ways that a process might fail in the future and prioritizes failure modes by the severity of their negative consequences (for example, in aviation one might compare increased fuel consumption by a plane versus the catastrophic failure of a wing). The most critically needed preventive measures can then be targeted to prevent severe problems from occurring in the first place.
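To show how FMEA prioritization might work in a research setting, here is a minimal sketch using the conventional FMEA risk priority number (severity × occurrence × detection), a standard scoring scheme that the text above does not spell out; the failure modes and scores below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    """One way a research process could go wrong, scored 1-10 on each axis."""
    name: str
    severity: int    # how bad the consequences would be
    occurrence: int  # how likely it is to happen
    detection: int   # how likely it is to slip past current checks (10 = undetectable)

    @property
    def rpn(self) -> int:
        # Conventional FMEA risk priority number
        return self.severity * self.occurrence * self.detection

modes = [
    FailureMode("unblinded outcome scoring", severity=8, occurrence=6, detection=7),
    FailureMode("mislabeled cell line", severity=9, occurrence=4, detection=8),
    FailureMode("typo in figure legend", severity=2, occurrence=7, detection=3),
]
# Target preventive measures at the highest-RPN modes first
for m in sorted(modes, key=lambda m: m.rpn, reverse=True):
    print(f"{m.rpn:4d}  {m.name}")
```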

RCA and FMEA have both been used to good effect across a wide spectrum of industries and endeavors, including the pharmaceutical industry and clinical medicine. Their track record clearly shows that they can be used to reduce medication, surgical and anesthesia errors, and ensure quality in the drug manufacturing process. Both these methods lend themselves most easily to manufacturing and engineering settings, but their successes suggest they also warrant testing for use in research. In particular, they may improve the human factors that can lead to avoidable problems, especially in team-based science settings where geographic dispersion and distributed expertise are the norm (Yarborough, 2014a; Dirnagl et al., 2016).

It seems clear that data audits, CIR, RCA, and FMEA each have tremendous potential for improving research: potential that, like the above publishing reforms and researcher practices, has gone largely untapped to this point. We worry that the four erroneous beliefs that we have highlighted are blunting curiosity about the health of biomedical research, and are thereby preventing the adoption of a more proactive stance toward quality concerns. Hence, a critical next challenge is learning how to erode the appeal of these beliefs.

One strategy that we think is particularly worth considering is education. A wider appreciation of evidence that demonstrates the range and extent of quality concerns in research, combined with evidence about how few of them stem from research misconduct, should diminish the belief that a few bad apples are our biggest problem. A placeholder for this education is already in place: RCR education is now firmly ensconced in many graduate and postgraduate life sciences courses and could naturally incorporate modules that tackle the erroneous beliefs head on.

We should note, however, that this strategy is far from perfect, given longstanding concerns about the effectiveness of RCR curricula (Antes et al., 2010; Presidential Commission for the Study of Bioethical Issues, 2011) and the fact that sponsors who mandate RCR instruction, like the National Institutes of Health (NIH) and the National Science Foundation (NSF) in the United States, often stipulate content that needs to be covered by it. The latter challenge need not be insuperable, though, since both NIH and NSF also encourage innovation and customization of RCR learning activities. Using RCR education as a vehicle for fostering improved quality in research may also help to make such instruction appear more relevant to the careers of learners.

As an example, RCR sessions could examine the scientific record on self-correction. The aforementioned cancer and psychology replication projects would surely warrant consideration, but an equally relevant and highly illustrative case study is a recently published analysis (Border et al., 2019) of the lasting detrimental impact that a 1996 study of the SLC6A4 gene (Lesch et al., 1996) has had on depression research. That 1996 publication spurred at least 450 additional ones and consumed millions of dollars, and controversy about it continues to this day (Yong, 2019). Such case studies can drive home multiple lessons because they simultaneously show that science cannot be relied upon to self-correct in a timely or efficient way and that regulations often fail to touch upon matters critical to the health of research.

There are plenty of thoughtfully tailored recommendations that have not yet resulted in the improvements to research they are surely capable of producing

Conclusion

Readers may be tempted to dismiss the foregoing analysis of erroneous beliefs as mere personal observation. They may prefer instead hard data about how research measures up against metrics that contribute to deserved trust, or yet another round of study design and data analysis recommendations capable of solving the broad range of ills currently diminishing the quality of research. The recommendations would plot the path to progress, while the data would make our pace of progress apparent to all.

As we have tried to make clear, there are plenty of thoughtfully tailored recommendations that have not yet resulted in the improvements to research they are surely capable of producing – simply because there has been too little uptake of them. Nor, for that matter, is there any shortage of calls to arms and manifestos, including those from some of the most eminent scholars and leaders in biomedical research (Alberts et al., 2014; Munafò et al., 2017). Since these have had so little effect so far, especially at the institutional level, it is not clear why we would expect yet more recommendations to enjoy a better reception. Besides, many questionable research practices are hidden from view. For example, inconvenient data points, or even entire experiments, are at times ignored (Martinson et al., 2005); data are added to experiments until desired p-values are obtained (Simmons et al., 2011); and unreliable methods are used when randomizing animals in studies (Institute for Laboratory Animal Research Roundtable on Science and Welfare in Laboratory Animal Use, 2015). Because these behaviors are hidden, traditional metrics are unlikely to capture their extent or their influence on the trustworthiness of research.
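To make concrete how much one of these hidden practices can distort the literature, the following minimal simulation (our own, loosely in the spirit of Simmons et al., 2011; the sample sizes and stopping rule are assumptions) shows that adding data and re-testing until p<0.05 inflates the false-positive rate well above the nominal 5%, even when no effect exists.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def false_positive_rate(n_start=10, n_max=50, step=5, trials=5_000):
    """False-positive rate when data are added until p < 0.05 or n_max.

    Both groups are drawn from the same distribution, so every
    'significant' result is a false positive.
    """
    hits = 0
    for _ in range(trials):
        a = list(rng.normal(size=n_start))
        b = list(rng.normal(size=n_start))
        while True:
            _, p = stats.ttest_ind(a, b)
            if p < 0.05:
                hits += 1
                break
            if len(a) >= n_max:
                break
            a.extend(rng.normal(size=step))   # optional stopping: add data, test again
            b.extend(rng.normal(size=step))
    return hits / trials

# Nominal alpha is 0.05; repeated testing yields roughly 2-3 times that rate.
print(f"realized false-positive rate: {false_positive_rate():.3f}")
```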

These behaviors notwithstanding, ‘open science’ practices offer one way to increase confidence in research results while also providing metrics of trustworthiness. For example, some questionable research practices, such as p-hacking (Head et al., 2015), could be detected more easily by requiring that data and analysis code be publicly available in all but the most exceptional circumstances. Indeed, one group has called for traditional institutional performance metrics such as impact factor and number of publications to be replaced with open science metrics (Barnett and Moher, 2019). Although measurable open science would not eliminate questionable research practices, it would move biomedical research toward increased accountability.

Open science practices are still no panacea, however, for all the quality concerns we have highlighted here. What is most needed at this juncture is a collective focus on deserving trust. Such a focus could make researchers and the leaders of research institutions more receptive to reform efforts. The four erroneous beliefs we have discussed surely hinder that collective focus, and thus deter the research community from adopting reforms that can secure the public’s trust – which is vital to biomedical research.

References

    1. Allchin D (2015) Correcting the “self-correcting” mythos of science. Filosofia e História da Biologia 10:19–35.
    2. Barnett AG, Moher D (2019) Turning the tables: a university league-table based on quality not quantity [version 1; peer review: 1 approved]. F1000Research.
    3. Drew A (2019) APS replication initiative under way. Observer, Vol 26, Association for Psychological Science, 2013. Accessed July 18, 2019.
    4. Glick JL (1989) On the cost effectiveness of data auditing. In: Principles of Research Data Audit. Taylor & Francis.
    5. Hardin R (2002) Trust and Trustworthiness. New York: Russell Sage Foundation.
    6. Harris RF (2017) Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions. New York: Basic Books.
    7. Institute of Medicine (2013) Sharing Clinical Research Data: Workshop Summary. Washington, DC: Institute of Medicine.
    8. Judson HF (2004) The Great Betrayal: Fraud in Science. Orlando: Harcourt.
    9. McKiernan EC (2019) Use of the journal impact factor in academic review, promotion, and tenure evaluations. PeerJ Preprints 7:e27638.

Decision letter

  1. Peter Rodgers
    Senior Editor; eLife, United Kingdom
  2. Emma Pewsey
    Reviewing Editor; eLife, United Kingdom
  3. Malcolm R MacLeod
    Reviewer; The University of Edinburgh, United Kingdom
  4. Martin Michel
    Reviewer
  5. Olavo B Amaral
    Reviewer; Federal University of Rio de Janeiro, Brazil

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Four erroneous beliefs thwarting more trustworthy research" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by Emma Pewsey (Associate Features Editor) and Peter Rodgers (Features Editor). The following individuals involved in review of your submission have agreed to reveal their identity: Malcolm R MacLeod (Reviewer #1); Martin Michel (Reviewer #2); Olavo B Amaral (Reviewer #3).

The reviewers have discussed the reviews with one another and I have drafted this decision to help you prepare a revised submission.

Please note that the reviewers have raised a number of discussion points. You do not need to address these in the manuscript itself, but you may wish to respond to them in your author response, which will appear on the eLife website along with your article and the decision letter.

Summary:

The article nicely summarises most of the major problems surrounding trust in biomedical research. While it is not the first article on this topic, the authors have written it up in a very refreshing and enjoyable way.

Essential revisions:

1) Throughout the text it appears to be more accurate to refer to "irreproducible" instead of "erroneous" research results. A result that is not reproduced might be a fully accurate description of a real experiment, but one that either describes a chance finding, a finding that is dependent on very specific experimental conditions, or it could indeed be erroneous (i.e. fabricated, falsified, or due to an erroneous analysis or method). If you feel it is important to use the term "error", please clearly define what it means.

One suggestion from the reviewers to help you reframe the discussion to avoid a focus on "errors" is to think of research practices on a continuum, from the unforgivable to the perfect. At each point, the value of research can be reduced by making "errors" (be these deliberate or inadvertent, or even just consequences of the limitations of the methodologies used) or enhanced by adopting better practice (e.g. open data). For instance, the value of a non-randomised animal study is enhanced if this is clearly stated (because you can interpret the findings accordingly), and the value of any evidentiary claim is enhanced if it was asserted as the primary outcome measure in an a priori study protocol. Strategies that incorporate these different approaches might produce marginal improvements across the continuum.

2) Although the recommendations that follow the discussion of the "four erroneous beliefs" are all sensible, most of them do not address the beliefs directly. Please discuss in greater depth how the beliefs thwart attempts at reform, the approaches that could be taken to change them, and any further barriers that prevent this from happening. Ideally this discussion will include some empirical evidence and concrete examples.

Other discussion points:

1) In reviewer #2's experience of teaching on better reproducibility by more vigorous study design, data analysis and reporting, the PhD students in the class immediately see the point. However, they often come back and ask for advice when their supervisor dismisses concerns about reproducibility with "we've always done it that way" and "everybody does it that way". Have you also found that young researchers are more open to change than established ones?

2) When discussing beliefs 1 and 2 ("it's about the science, not the scientists" and "we need to focus on the health of the orchard, not just the bad apples"), it could be worth discussing the fact that much of the training in research ethics over the last decades has revolved around misconduct. This is likely to reinforce these two beliefs – and, according to the authors' view, might preclude scientists from seeing the problem as their own (as misconduct and ethics breaches are rarely something that one will admit to). Perhaps framing research ethics training around the science (e.g. irreproducible research and its consequences), rather than the scientists, might help in overcoming these beliefs.

3) The authors mention at various points that there is "reluctance within the research community to implement the reforms", that "there has been too little uptake of them" and that "calls to arms have had little effect so far". The authors could discuss this issue in more depth and provide evidence to show that this is the case. What constitutes "little effect"? Some change does appear to be happening, albeit slowly. Although data and code sharing are still subpar, they are likely to be on the rise (e.g. Campbell et al., 2019 at doi: 10.1016/j.tree.2018.11.010), as are open access (Piwowar et al., 2018 at doi: 10.7717/peerj.4375) and preprint deposition (Kaiser, 2017 at doi: 10.1126/science.357.6358.1344).

4) The authors speak of "reluctance within the research community to implement the reforms". But is the community really reluctant, or are many people unaware of the problem or unfamiliar with the possible solutions and necessary reforms? Similarly, the authors state that "these beliefs are blunting curiosity about the health of biomedical research". The rising interest in the subject over the last few years suggests that inertia in making necessary changes and a feeling of non-responsibility and/or powerlessness towards them ("it's the system, not me") might be more important than lack of acknowledgement. How do these different factors contribute and interact to prevent necessary changes?

5) On Belief 4, the authors argue that the fact that investment in tackling misconduct has failed to prevent irreproducible research is evidence that "following the rules doesn't guarantee we are getting it right". But couldn't it be the case that this effort has focused on the major rules (e.g. do not falsify, fabricate or plagiarize) but has ignored the (many more) minor ones that are just as vital for published research findings to be reproducible (e.g. use adequate power, differentiate exploratory and confirmatory research, avoid p-hacking and HARKing)? The failure to prevent bad science could be because there are too few rules, rather than because following the rules doesn't work.

6) On the issue of science not being systematically self-correcting, it might be worth mentioning the high prevalence of failed replication attempts that are not published, most commonly due to the authors of replications not attempting to publish them – see Baker et al., 2016 (doi: 10.1038/533452a) for a survey-based indication of that.

7) "… rather than research that tries to confirm past findings." One possibility to increase confirmation is to place higher value on confirmatory studies of past findings, but the other would be to raise the threshold for publication in the first place in some instances (e.g. requiring preregistration or independent confirmation, see Mogil and Macleod, 2017 – doi: 10.1038/542409a). As there are arguments for both sides, it could be worth touching on this point.

8) Subsection “Publishing Reforms: underway but they could be more ambitious”: You could discuss whether, with such a large number of journals (in which peer review varies widely in quality), and preprints making important headway in biology, we should expect the main source of quality control to come from journals. Note that arguments have been made in the opposite direction (e.g. removing barriers to publication to diminish its reward value and make 'publish or perish' senseless – e.g. Nosek and Bar-Anan, 2012 http://dx.doi.org/10.1080/1047840X.2012.692215 and others).

9) It sounds somewhat incongruent to state that there has been little change in research practices, while at the same time arguing that many of these practices cannot be measured. A counterargument to the statements that metrics are unable to counter irreproducibility is that to change incentives in order to foster trust, assessing what kind of research is more "deserving" of trust is important – thus, good metrics are perhaps precisely what is needed to build up trust. Sharing of data and analysis code, for example, can help to assess whether p values have been hacked (as they allow for reanalysis of the data using other methods). Thus, using appropriate sharing of data as a metric is likely to improve some of these issues.

https://doi.org/10.7554/eLife.45261.002

Author response

Essential revisions:

1) Throughout the text it appears to be more accurate to refer to "irreproducible" instead of "erroneous" research results. A result that is not reproduced might be a fully accurate description of a real experiment, but one that either describes a chance finding, a finding that is dependent on very specific experimental conditions, or it could indeed be erroneous (i.e. fabricated, falsified, or due to an erroneous analysis or method). If you feel it is important to use the term "error", please clearly define what it means.

One suggestion from the reviewers to help you reframe the discussion to avoid a focus on "errors" is to think of research practices on a continuum, from the unforgivable to the perfect. At each point, the value of research can be reduced by making "errors" (be these deliberate or inadvertent, or even just consequences of the limitations of the methodologies used) or enhanced by adopting better practice (e.g. open data). For instance, the value of a non-randomised animal study is enhanced if this is clearly stated (because you can interpret the findings accordingly), and the value of any evidentiary claim is enhanced if it was asserted as the primary outcome measure in an a priori study protocol. Strategies that incorporate these different approaches might produce marginal improvements across the continuum.

To address these comments, we have dropped most of the uses of the terms errors and mistakes and, unless the context for their use is clear, use instead such terms as “problems” or “unreliable research.” We thank the reviewer for helping us to see how vague the terms “errors” and “mistakes” can be. We gave a lot of thought to alternative phrasing, including using “irreproducible” as suggested, but in the end decided against using that particular term. We explain our thinking here. First, our original reason for using terms like error and mistake was that we wanted to be clear to readers that our main focus was not research misconduct. As we explain later in the text, research misconduct is already consuming enough of the oxygen in the room that could be used instead to combat more prevalent problems. Second, while “irreproducible research” certainly covers much, perhaps even most, of the terrain that is in fact our focus, we worried that the phrase may be off-putting for some readers and a bit narrow for our purposes. Despite the widely cited surveys documenting that most researchers believe there is a “reproducibility crisis,” many in the research community nevertheless reject that narrative, and these are the very readers we are most hoping to engage in our essay. Third, we thought that, at least technically speaking, framing everything in terms of irreproducible research might perhaps also be a bit vague, given the distinct ways that results might fail to reproduce. (Here we have the Goodman/Fanelli/Ioannidis STM discussion in mind.) We think that terms such as “unreliable research”, “problems”, and “erroneous research results” avoid these issues while accurately conveying our intent, so those are the new terms we settled on. We made similar changes near the end of the manuscript. Finally, we found the suggestion to reference a continuum of research practices especially helpful, so we incorporated it into the manuscript near the outset.

2) Although the recommendations that follow the discussion of the "four erroneous beliefs" are all sensible, most of them do not address the beliefs directly. Please discuss in greater depth how the beliefs thwart attempts at reform, the approaches that could be taken to change them, and any further barriers that prevent this from happening. Ideally this discussion will include some empirical evidence and concrete examples.

This is a somewhat complex request to address. The recommendations following our discussion of the erroneous beliefs were added at the request of the editors, which they wanted to see prior to deciding to send the manuscript out for review. So, we feel like they should remain in the manuscript. The reviewers are correct in pointing out, however, that they do not directly mitigate the erroneous beliefs themselves. Thus, we have added a discussion of educational activities to show how one could take on the erroneous beliefs more directly, as well as an example as to how this might be done. We are not sure how to integrate into the manuscript, though, more in-depth discussion about exactly how the four beliefs are thwarting attempts at reform. The impetus for the manuscript was curiosity about why so many reforms that target a range of concerns have yet to be acted upon by too many in the research community, a question that we acknowledge mainly affords speculation at this point. We attribute this inattention to a failure to fully appreciate the need for these reforms, despite the wealth of published evidence establishing how much they are needed. What might account for this inattention is surely a complex range of factors that it would be hard to study and definitively quantify, but it seems reasonable to assume that a certain mindset, characterized in part by the erroneous beliefs, is part of that range. So, we wanted to try to tease out some of the components about that mindset in order to spur discussion about them. We continue to believe that an ensuing discussion about them should prove valuable even in the absence of targeted empirical study about how widespread they are and how they may be specifically thwarting individual recommendations. We think that highlighting the general issues covered by the four beliefs can better spur reflection and conversation than would an effort to try to tie specific beliefs to the uptake rate of specific reforms in the absence of studies designed to test those ties. We trust that this more general approach is acceptable for articles in the “Features” section of the journal.

Other discussion points:

We have given a lot of thought to how/whether to address the discussion points below in the manuscript, and decided to make only one minor addition; we would like to briefly explain our thinking here. While all of the suggestions have a lot of merit, we worried that discussing them in the manuscript would focus readers’ attention on what changes need to be made in research, whereas the broad focus of our manuscript is why changes need to be made. We have inserted new text in the manuscript to this effect. We would very much like to keep that the focus, and we worry that too much discussion about actual proposed changes will prompt readers to dwell on whether they agree with those particular changes rather than on the need for change itself. Reforms preceded by widespread recognition of the need to change will arguably fare better than ones undertaken when there is no consensus about the need for them in the first place. That is why we prefer not to include more discussion than we already have about representative reforms to tackle various problems. We know that the reviewers are very familiar with the extensive landscape of current reforms underway and the many candidates for additional ones. We also hope that the reviewers and editors will agree that global considerations about the health of the research enterprise and the current systems that shape it have value. We wanted to preserve this more global perspective and were a bit worried that too much discussion of particular reforms would draw attention away from the broader considerations. Our edits are an attempt to strike this balance. We trust that this will be acceptable to the reviewers and editors.

1) In reviewer #2's experience of teaching on better reproducibility by more vigorous study design, data analysis and reporting, the PhD students in the class immediately see the point. However, they often come back and ask for advice when their supervisor dismisses concerns about reproducibility with "we've always done it that way" and "everybody does it that way". Have you also found that young researchers are more open to change than established ones?

2) When discussing beliefs 1 and 2 ("it's about the science, not the scientists" and "we need to focus on the health of the orchard, not just the bad apples"), it could be worth discussing the fact that much of the training in research ethics over the last decades has revolved around misconduct. This is likely to reinforce these two beliefs – and, according to the authors' view, might preclude scientists from seeing the problem as their own (as misconduct and ethics breaches are rarely something that one will admit to). Perhaps framing research ethics training around the science (e.g. irreproducible research and its consequences), rather than the scientists, might help in overcoming these beliefs.

The discussion we added includes some of these same points. Also, with respect to Other discussion point 1 above, in the experience of the lead author, students in his research ethics class show much more interest in research quality and reproducibility issues than was the case as recently as 3–4 years ago. However, given the length of our manuscript and the anecdotal nature of the observation, we chose not to include this in the body of the paper.

3) The authors mention at various points that there is "reluctance within the research community to implement the reforms", that "there has been too little uptake of them" and that "calls to arms have had little effect so far". The authors could discuss this issue in more depth and provide evidence to show that this is the case. What constitutes "little effect"? Some change does appear to be happening, albeit slowly. Although data and code sharing are still subpar, they are likely to be on the rise (e.g. Campbell et al., 2019 at doi:10.1016/j.tree.2018.11.010), as are open access (Piwowar et al., 2018 at doi:10.7717/peerj.4375) and preprint deposition (Kaiser, 2017 at 10.1126/science.357.6358.1344).

We have also chosen not to add any extensive additional text, though we did add “especially at the institutional level” to moderate our claim somewhat. We trust that the reviewers and editors will be ok with this decision. One complicating factor here is that much of the evidence would be found in the blogosphere and popular press, where comments make it clear that there is still a lot of resistance in many quarters about both the extent and severity of problems. See, for example, https://www.theatlantic.com/science/archive/2018/11/psychologys-replication-crisis-real/576223/?utm_source=feed and the skeptics addressed there. We can also point to numerous conversations of our own and of colleagues who regularly get pushback when the topic is problems in research. Also, we think that we have made reference in the manuscript to several examples of reforms in place and the extent of their impact to date.

4) The authors speak of "reluctance within the research community to implement the reforms". But is the community really reluctant, or are many people unaware of the problem or unfamiliar with the possible solutions and necessary reforms? Similarly, the authors state that "these beliefs are blunting curiosity about the health of biomedical research". The rising interest in the subject over the last few years suggests that inertia in making necessary changes and a feeling of non-responsibility and/or powerlessness towards them ("it's the system, not me") might be more important than lack of acknowledgement. How do these different factors contribute and interact to prevent necessary changes?

Again, we have chosen not to address this suggestion specifically. The reviewer is no doubt correct to point out that some important changes are afoot. It is also worth noting that reluctance may not be the preferred term to use here since, as the reviewer suggests, it might be a feeling of powerlessness, lack of awareness, or something else. But we remain comfortable with the term “reluctance.” Our collective sense is that the impetus for reform is still confined to a significant degree within the meta-research community, despite years, at times even decades, of publicizing and discussing the problems (e.g., the continued use of mislabeled cell lines and citations of studies known to involve them). This gap is what we are curious to try to understand. Hence this essay speculating that the four erroneous beliefs we identify are likely contributors to it. Perhaps one finds it easy to deflect concern about irreproducible or otherwise unreliable research if one is confident that science self-corrects. If it does, then there is less need to worry about methodologic quality or reproducibility. Or, why worry about bias if there are now required disclosures about financial interest? We think such erroneous beliefs are at least as contributory to the current disappointing pace of reform as is lack of awareness or feelings of powerlessness, and as such they warrant the consideration of readers. And please note that our discussion about RCR does pertain to the awareness issue.

5) On Belief 4, the authors argue that the fact that investment in tackling misconduct has failed to prevent irreproducible research is evidence that "following the rules doesn't guarantee we are getting it right". But couldn't it be the case that this effort has focused on the major rules (e.g. do not falsify, fabricate or plagiarize) but has ignored the (many more) minor ones that are just as vital for published research findings to be reproducible (e.g. use adequate power, differentiate exploratory and confirmatory research, avoid p-hacking and HARKing)? The failure to prevent bad science could be because there are too few rules, rather than because following the rules doesn't work.

Again, we have chosen not to address this suggestion specifically. While we share much of this reviewer’s diagnosis, we also believe that researchers have to internalize a deep commitment to proper research methodology and we are skeptical that assuring the use of proper methods is best accomplished by having more rules for researchers to follow. We think that a more productive approach is having more virtuous researchers in the Aristotelian sense (habitually doing things in the right way with the proper motivations and the right reasons) and we are suggesting that unhelpful beliefs like the ones we have highlighted are a possible major culprit hindering a stronger allegiance to proper research methodology that deserves greater focus.

6) On the issue of science not being systematically self-correcting, it might be worth mentioning the high prevalence of failed replication attempts that are not published, most commonly due to the authors of replications not attempting to publish them – see Baker et al., 2016 (doi: 10.1038/533452a) for a survey-based indication of that.

We think that devoting space within the article to this suggestion would take us a bit off target. There are many causes that hinder self-correction, including failure to publish results; these are already discussed and documented in the literature, some of which we have cited, while our focus is the broad issue of science not self-correcting.

7) "… rather than research that tries to confirm past findings." One possibility to increase confirmation is to place higher value on confirmatory studies of past findings, but the other would be to raise the threshold for publication in the first place in some instances (e.g. requiring preregistration or independent confirmation, see Mogil and Macleod, 2017; doi:10.1038/542409a). As there are arguments for both sides, it could be worth touching on this point.

We completely agree that this would be another possibility. Our response here is similar to the immediate one above. We are a bit reluctant to extend our discussion because we think the current discussion already supports the major points we wanted to make. In addition, changing the nature of journals and what they publish, as the reviewer suggests, is a long and heterogenous process and shaking faith that science self-corrects, which is part of our aim here, might prove to be a useful accelerant to that process. Thus, rather than focus on what changes are most needed, we have chosen instead to try to spur greater reflection that could help show why changes are needed in the first place.

8) Subsection “Publishing Reforms: underway but they could be more ambitious”: You could discuss whether, with such a large number of journals (in which peer review varies widely in quality), and preprints making important headway in biology, we should expect the main source of quality control to come from journals. Note that arguments have been made in the opposite direction (e.g. removing barriers to publication to diminish its reward value and make 'publish or perish' senseless – e.g. Nosek and Bar-Anan, 2012 http://dx.doi.org/10.1080/1047840X.2012.692215 and others).

This suggestion has a lot to recommend it as well, but we only have so much space, and we had to set some parameters for our discussion of publisher reforms, so we chose to limit it to the kinds of incremental reforms we highlighted, as opposed to a radical rethink of scientific publishing. We trust this demarcation will be acceptable to the reviewers and editors.

9) It sounds somewhat incongruent to state that there has been little change in research practices, while at the same time arguing that many of these practices cannot be measured. A counterargument to the statements that metrics are unable to counter irreproducibility is that to change incentives in order to foster trust, assessing what kind of research is more "deserving" of trust is important – thus, good metrics are perhaps precisely what is needed to build up trust. Sharing of data and analysis code, for example, can help to assess whether p values have been hacked (as they allow for reanalysis of the data using other methods). Thus, using appropriate sharing of data as a metric is likely to improve some of these issues.

We like this idea and have modified the text accordingly.

https://doi.org/10.7554/eLife.45261.003

Article and author information

Author details

  1. Mark Yarborough

    Mark Yarborough is in the Bioethics Program, University of California, Davis, Sacramento, CA, United States

    Contribution
    Conceptualization, Writing—original draft, Writing—review and editing
    For correspondence
    mayarborough@ucdavis.edu
    Competing interests
    No competing interests declared
ORCID iD: 0000-0001-8188-4968
  2. Robert Nadon

    Robert Nadon is in the Department of Human Genetics, McGill University, Montreal, Canada

    Contribution
    Conceptualization, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
  3. David G Karlin

    David G Karlin is an independent researcher based in Marseille, France

    Contribution
    Conceptualization, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared

Funding

The authors declare that there was no funding for this work

Publication history

  1. Received:
  2. Accepted:
  3. Accepted Manuscript published:
  4. Version of Record published:

Copyright

© 2019, Yarborough et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


Cite this article

  1. Mark Yarborough
  2. Robert Nadon
  3. David G Karlin
(2019)
Point of View: Four erroneous beliefs thwarting more trustworthy research
eLife 8:e45261.
https://doi.org/10.7554/eLife.45261