1. Biochemistry and Chemical Biology
  2. Neuroscience

Science Forum: Addressing selective reporting of experiments through predefined exclusion criteria

  1. Kleber Neves
  2. Olavo B Amaral (corresponding author)
  1. Institute of Medical Biochemistry Leopoldo de Meis, Universidade Federal do Rio de Janeiro, Brazil
Feature Article
Cite this article as: eLife 2020;9:e56626 doi: 10.7554/eLife.56626

Abstract

The pressure for every research article to tell a clear story often leads researchers in the life sciences to exclude experiments that 'did not work' when they write up their results. However, this practice can lead to reporting bias if the decisions about which experiments to exclude are taken after data have been collected and analyzed. Here we discuss how to balance clarity and thoroughness when reporting the results of research, and suggest that predefining the criteria for excluding experiments might help researchers to achieve this balance.

Introduction

Experiments fail all the time, for both complex and trivial reasons. Because of this, many experiments are repeated – typically after tinkering a bit with the protocol – and the initial failed attempts often go unreported in published articles. This is understandable: trying to present all the unsuccessful results of a long, obstacle-prone project might make an article almost unreadable. Scientific articles thus frequently privilege conciseness over completeness, in order to tell an intuitive story and facilitate the understanding of complex ideas (Sanes, 2019; Sollaci and Pereira, 2004).

Narrative quality, however, can go against the need for transparency in reporting to ensure reproducibility: as the process of conducting research is vulnerable to a large amount of biases (Chavalarias and Ioannidis, 2010), selective reporting of results can have detrimental consequences on the scientific record (Nosek et al., 2015; Nissen et al., 2016). If enough experiments are conducted, some are bound to attain significant results by chance alone using typical statistical standards (Ioannidis, 2005). Failure to fully report on all attempts to carry out an experiment, thus, can lead to a scenario where data can be cherry-picked to support one's hypothesis (Simmons et al., 2011).

As laboratory scientists, we understand both sides of the argument: reporting on every experimental failure will increase noise without adding much value to the reported results; on the other hand, having unlimited flexibility to decide whether an experiment can be excluded from a research article opens up a huge avenue for bias to creep in. Here we make the case that predefined inclusion and exclusion criteria for experiments can help solve this conundrum, and discuss ways in which they can be implemented in the workflow of experimental projects, particularly in those of a confirmatory nature. We also describe how we are taking this approach in the Brazilian Reproducibility Initiative, a large-scale multicenter replication of experimental findings in basic biomedical science (Amaral et al., 2019).

The many-level file drawer

Selective reporting can appear at many levels. A considerable body of literature exists on the omission of whole studies, a phenomenon best known as the 'file drawer effect' (Rosenthal, 1979). This is best studied in areas such as clinical trials and psychology (Dwan et al., 2013; Ferguson and Heene, 2012; Fanelli, 2012), in which meta-analytic statistical methods are routinely used for estimating publication bias (Jin et al., 2015). At the level of analysis, there is also evidence of bias in selective reporting of measured outcomes within trials (Williamson et al., 2005). At the other end of the scale, at the level of data collection, there has been a reasonable amount of discussion about the selective post-hoc exclusion of data points identified as outliers (Holman et al., 2016).

Not reporting the results of individual experiments is an intermediate level of bias that lies between the omission of studies and the omission of data points. It appears to be common in scientific fields where a single article typically includes multiple experiments or datasets, as in much of the life sciences. Although this is potentially one of the largest sources of bias in bench research, it has been relatively underdiscussed. The case has been made that the prevalence of significant results within single articles is frequently too high to be credible, considering the statistical power of individual experiments (Schimmack, 2012; Lakens and Etz, 2017). However, such statistical evidence cannot identify whether this is due to experiments going missing, leading to selective reporting of positive results, or whether the published experiments are biased towards positive findings at the level of measurement or analysis. Whereas one can locate unpublished clinical trials because they are preregistered, or look for mismatching sample sizes in articles to infer removal/loss of subjects, detecting an unreported experiment requires information that is not usually available to the reader.

Once more, the problem is that reporting the full information on every experiment conducted within a project might be counterproductive as well. Laboratory science can be technically challenging, and experimental projects hardly ever run smoothly from start to finish; thus, a certain degree of selective reporting can be helpful to separate signal from noise. After all, hardly anyone would be interested to know that your histological sections failed to stain, or that your culture behaved in strange ways because of contamination.

What is the limit, however, to what can be left out of an article? While most scientists will agree that suppressing results from an article because they do not fit a hypothesis is unethical, few would argue in favor of including every methodological failure in it. However, if researchers are free to classify experiments into either category after the results are in, there will inevitably be room for bias in this decision.

The reverse Texas sharpshooter

It is all too easy to find something that went wrong with a particular experiment to justify its exclusion. Maybe the controls looked different than before, or one remembers that someone had complained about that particular antibody vial. Maybe the animals seemed very stressed that day, or the student who ran the experiment didn't have a good hand for surgery. Or maybe an intentional protocol variation apparently made a previously observed effect disappear. This is particularly frequent in exploratory research, where protocols are typically adjusted along the way. It is thus common that people will repeat an experiment again and again with minor tweaks until a certain result is found – frequently one that confirms an intuition or a previous finding (e.g. 'I got it the first time, something must have gone wrong this time').

All of the factors above might be sensible reasons to exclude an experiment. However, this makes it all too easy to find a plausible explanation for a result that does not fit one's hypothesis. Confirmation bias can easily lead one to discard experiments that 'didn't work' by attributing the results to experimental artifacts. In this case, 'not working' conflates negative results – i.e. non-significant differences between groups – with methodological failures – i.e. an experiment that is uninterpretable because its outcome could not be adequately measured. Even with the best of intentions, a scientist with too much freedom to explore reasons to exclude an experiment will allow unconscious biases related to its results to influence his or her decision (Holman et al., 2015).

This problem is analogous to the forking paths in data analysis (Gelman and Loken, 2013), or to the Texas sharpshooter fallacy (Biemann, 2013), in which hypothesizing after the results are known (HARKing) leads a claim to be artificially confirmed by the same data that inspired it (Hollenbeck and Wright, 2017). But while the Texas sharpshooter hits the bullseye because he or she draws the target at the point where the bullet landed, the scientist looking to invalidate an experiment draws his or her validation target away from the results – usually based on a problem that is only considered critical after the results are seen. Importantly, these decisions – and experiments – will be invisible in the final article if the norm is to set aside the pieces that do not fit the story.

Paths to confirmatory research

One much-discussed solution to the problem of analysis flexibility is preregistration of hypotheses and methods (Forstmeier et al., 2017; Nosek et al., 2018). The practice is still mostly limited to areas such as clinical trials (in which registration is mandatory in many countries) and psychology (in which the movement has gained traction over reproducibility concerns), and is not easy to implement in laboratory science, where protocols are frequently decided over the course of a project, as hypotheses are built and remodeled along the way.

Although exploratory science is the backbone of most basic science projects, a confirmation step with preregistered methods could greatly improve the validation of published findings (Mogil and Macleod, 2017). As it stands today, most basic research in the life sciences is performed in an exploratory manner, and should be taken as preliminary rather than confirmatory evidence – more akin to a series of case reports than to a clinical trial. This type of research can still provide interesting and novel insights, but its weight as evidence for a given hypothesis should be differentiated from that of confirmatory research that follows a predefined protocol (Kimmelman et al., 2014).

Interestingly, the concept of preregistration can also be applied to criteria that determine whether an experiment is methodologically sound or not, and thus amenable to suppression from a published article. Laboratory scientists are used to including controls to assess the internal validity of their methods. In PCR experiments, for instance, measures are typically taken along the way to alert the researcher when something goes wrong: the ratio of absorbance of an RNA sample at 280 and 260 nm is used as a purity test for the sample, and non-template controls are typically used to check for specificity of amplification (Matlock, 2015).
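
Such a control only becomes a prespecified criterion once the acceptable range is stated in advance. The following minimal sketch (in Python, with threshold values chosen purely for illustration, not as recommended standards) shows what registering such a rule before data collection might look like:

```python
# Illustrative prespecified validity rule for RNA samples, written down
# before any outcome data are seen. The thresholds below are assumptions
# for the sake of the example, not universal standards.

RNA_PURITY_CRITERIA = {
    "a260_a280_min": 1.8,  # assumed lower bound (protein contamination)
    "a260_a280_max": 2.2,  # assumed upper bound
}

def sample_is_valid(a260_a280, criteria=RNA_PURITY_CRITERIA):
    """Return True if the sample meets the prespecified purity criteria.

    The decision depends only on the control measurement, never on the
    experiment's outcome measure.
    """
    return criteria["a260_a280_min"] <= a260_a280 <= criteria["a260_a280_max"]

# Every sample is filtered by the same registered rule.
samples = {"s1": 2.05, "s2": 1.45, "s3": 1.92}
included = {name for name, ratio in samples.items() if sample_is_valid(ratio)}
```

Because the rule refers only to the control measurement, it can be applied identically to every sample, regardless of how the experiment turns out.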

Rarely, however, are criteria for what constitutes an appropriate result for a positive or negative control decided and registered in advance, leaving the researcher free to make this decision once the results of the experiment are in. This not only fails to prevent bias, but actually adds degrees of freedom: much like adding variables or analysis options, adding methodological controls can provide the researcher with more possible justifications to exclude experiments (Wicherts et al., 2016). Once more, the solution seems to require that the criteria to discard an experiment based on these controls are set without seeing the results, in order to counter the possibility of bias.

Preregistration is not the only possible path to confirmatory research. Preregistering exclusion criteria may be unnecessary if the data for all experiments performed are presented with no censoring (Oberauer and Lewandowsky, 2019). In this setting, it is up to the reader to judge whether the data fit a hypothesis, as performing the entire set of possible analyses (Steegen et al., 2016) can show how much the conclusions depend on certain decisions, such as ignoring an experiment. However, this can only happen if all collected data are presented, which is not common practice in the life sciences (Wallach et al., 2018), partly because it goes against the tradition of conveying information in a narrative form (Sanes, 2019). If some form of data filtering – for failed experiments, noisy signals or uninformative data – is important for clarity (and this may very well be the case), preventing bias requires that exclusion criteria are set independently of the results.

A third option to address bias in the decision to include or exclude experiments is to perform blind data analysis – in which inclusion choices are made by experts who are blinded to the results, but have solid background knowledge of the method in order to devise sensible criteria. This is commonly performed in physics, for instance, and allows validity criteria to be defined even after data collection (MacCoun and Perlmutter, 2015). Such a procedure might be less rigorous than establishing and publicly registering criteria, as preregistration offers additional advantages such as allowing unpublished studies to be tracked (Powell-Smith and Goldacre, 2016). Nevertheless, it is likely easier to implement, allows greater flexibility in analysis decisions, and can still go a long way in limiting selective reporting and analysis bias.
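
A minimal sketch of what such blinding might look like in practice (with hypothetical field names and an illustrative RNA-integrity threshold): the function that decides inclusion is only ever given a view of the data with the outcome stripped out.

```python
# Hypothetical experiment records; 'effect_size' is the outcome measure
# that must remain hidden from the person applying validity criteria.
experiments = [
    {"id": 1, "rna_integrity": 8.9, "effect_size": -0.1},
    {"id": 2, "rna_integrity": 4.2, "effect_size": 2.3},
    {"id": 3, "rna_integrity": 9.5, "effect_size": 0.8},
]

def blind_view(exp):
    # Strip the outcome so the criteria-setter cannot be influenced by it.
    return {k: v for k, v in exp.items() if k != "effect_size"}

def decide_inclusion(blinded, min_rin=7.0):
    # This rule may be devised or refined after data collection, as long
    # as the decision-maker only ever sees blinded views.
    return blinded["rna_integrity"] >= min_rin

included_ids = [e["id"] for e in experiments if decide_inclusion(blind_view(e))]
```

The design choice here is that blinding is enforced structurally: the inclusion rule simply has no access to the outcome, so it cannot be gamed, consciously or not.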

Pre-specified criteria to clean up the data record

A solution to selective reporting, thus, is to set criteria to consider an experiment as valid, or a set of data as relevant for analysis, ideally before it is performed/collected. These include inclusion and exclusion criteria for animals or cultures to be used, positive and negative controls to determine if an assay is sensitive and/or specific, and additional variables or experiments to verify that an intervention's known effects have been observed. Ideally, these criteria should be as objective as possible, with thresholds and rules for when data must be included and when they should be discarded. They should also be independent of the outcome measure of the experiment – that is, the observed effect size should not be used as a basis for exclusion – and applied equally to all experimental groups. Importantly, when criteria for validity are met, this should be taken as evidence that the experiment is appropriate, and that it would thus be unethical to exclude it from an article reporting on the data.

As for any decision involving predefined thresholds, concerns over sensitivity and specificity arise: criteria that are too loose might lead to the inclusion of questionable or irrelevant data in an article, whereas those that are too stringent could lead meaningful experiments to be discarded. As with preregistration or statistical significance thresholds, this should not discourage researchers from addressing these limitations in an exploratory manner – one is always free to show data that do not fit validity criteria if this is clearly pointed out. What is important is that authors are transparent about it – and that the reader knows whether they are following prespecified criteria to ignore an experiment or have decided on it after seeing the results. Importantly, this can only happen when data are shown – meaning that decisions to ignore an experiment with no predefined reason must inevitably be discussed alongside its results.

For widely used resources such as antibodies and cell lines, there are already a number of recommendations for validation that have been developed by large panels of experts and can be used for this purpose. For cell line authentication, for example, the International Cell Line Authentication Committee (ICLAC) recommends a ≥ 80% match threshold for short-tandem-repeat (STR) profiling, which allows for some variation between passages (e.g., due to genetic drift; Capes-Davis et al., 2013). For antibodies, the International Working Group for Antibody Validation recommends a set of strategies for validation (Uhlen et al., 2016), such as quantitative immunoprecipitation assays that use predefined thresholds for antibody specificity (Marcon et al., 2015). Other areas and methods are likely to have similarly established guidelines that can be used as references for setting inclusion and exclusion criteria for experiments.
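
To illustrate how such a guideline translates into an objective rule, the sketch below computes a simple STR match score in the spirit of the ICLAC threshold. The formula (shared alleles counted against the total in both profiles) is one common variant, and the profiles themselves are invented:

```python
def str_match_percent(profile_a, profile_b):
    """Percent of shared alleles across loci typed in both profiles."""
    shared = total = 0
    for locus in set(profile_a) & set(profile_b):
        a, b = set(profile_a[locus]), set(profile_b[locus])
        shared += 2 * len(a & b)  # alleles present in both profiles
        total += len(a) + len(b)  # all alleles observed at this locus
    return 100.0 * shared / total if total else 0.0

# Invented profiles: allele calls at three example loci.
reference = {"D5S818": [11, 12], "TH01": [6, 9.3], "TPOX": [8, 8]}
sample = {"D5S818": [11, 12], "TH01": [6, 9.3], "TPOX": [8, 11]}

# Prespecified rule: authenticate only if the match score is >= 80%.
AUTHENTICATED = str_match_percent(sample, reference) >= 80.0
```

Here the sample matches the reference at roughly 91%, so it passes the prespecified 80% threshold despite a single discordant allele – exactly the kind of tolerance for passage-to-passage variation that the guideline intends.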

As coordinators of the Brazilian Reproducibility Initiative, a multicenter replication of 60–100 experiments from the Brazilian biomedical literature over the last 20 years, conducted by a team of more than 60 labs, we have been faced with the need for validation criteria in many stages during protocol development (Amaral et al., 2019). As the project is meant to be confirmatory in nature, we intend to preregister every protocol, including the analysis plan. Furthermore, to make sure that each replication is methodologically sound, we are encouraging laboratories to add as many additional controls as they judge necessary to each experiment. To deal with the problem raised in this essay, however, we also require that they prespecify their criteria for using the data from these controls in the analysis.

For RT-PCR experiments, for instance, controls for RNA integrity and purity must be accompanied by the ratios that will allow inclusion of the sample in the final experiment – or, conversely, will lead data to be discarded. For cell viability experiments using the MTT assay, positive controls for cell toxicity are recommended to test the sensitivity of the assay, but must include thresholds for inclusion of the experiment (e.g., a reduction of at least X% in cell viability). For behavioral experiments, accessory measurements to evaluate an intervention's known effects (such as weight in the case of high-calorie diets) can be used to confirm that it has worked as expected, and that testing its effects on other variables is warranted. Once more, thresholds must be set beforehand, and failure to meet inclusion criteria will lead the experiment to be considered invalid and repeated in order to attain a usable result.
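
The resulting workflow can be sketched as follows. The control field and threshold are illustrative placeholders, but the key property is visible in the code: the decision function never reads the effect size, so a valid negative result must be reported, while an invalid positive one must be repeated.

```python
MIN_TOXICITY_DROP = 0.5  # assumed MTT positive-control threshold (illustrative)

def classify_attempt(controls):
    """Classify from control measurements only; never from the effect size."""
    if controls["positive_control_drop"] >= MIN_TOXICITY_DROP:
        return "report"  # valid experiment: must appear in the article
    return "repeat"      # methodological failure: rerun before analysis

attempts = [
    {"positive_control_drop": 0.7, "effect_size": -0.1},  # negative but valid
    {"positive_control_drop": 0.2, "effect_size": 1.9},   # positive but invalid
]
decisions = [classify_attempt(a) for a in attempts]
```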

Defining validation criteria in advance for every experiment has not been an easy exercise: even though researchers routinely devise controls for their experiments, they are not used to setting objective criteria to decide whether or not the results of an experiment should be included when reporting the data. However, in a preregistered, confirmatory project, we feel that this is vital to allow us to decide if a failure to replicate a result represents a contradiction of the original finding or is due to a methodological artifact. Moreover, we feel that predefining validation criteria will help to protect the project from criticism about a lack of technical expertise by the replicating labs, which has been a common response to failed replication attempts in other fields (Baumeister, 2019).

As one cannot anticipate all possible problems, it is likely that, in some experiments at least, such prespecified criteria might turn out not to be ideal for separating successful experiments from failed ones. Nevertheless, for the sake of transparency, we feel that it is important that any post-hoc decisions for considering experiments as unreliable are marked as such, and that both the decision and its impact on the results are disclosed. Once more, there is no inherent problem with exploratory research or data-dependent choices; the problem is when these are done secretly and communicated selectively (Hollenbeck and Wright, 2017; Simmons et al., 2011).

Conclusion

Although we have focused on the use of validation criteria to make decisions about experiments, they can also be used to make decisions about which data to analyze. In fields such as electrophysiology or functional neuroimaging, for example, data typically pass through preprocessing pipelines before analysis: the use of predefined validation criteria could thus prevent the introduction of bias by researchers when exploring these pipelines (Phillips, 2004; Carp, 2012; Botvinik-Nezer et al., 2019). Genomics and a number of other high-throughput fields have also developed standard evaluation criteria to avoid bias in analysis (Kang et al., 2012). This suggests that communities centering on specific methods can reach a consensus on which criteria are minimally necessary to draw the line between data that can be censored and those that must be analyzed.

Such changes will only happen on a larger scale, however, if researchers are aware of the potential impacts of post-hoc exclusion decisions on the reliability of results, an area in which the life sciences still lag behind other fields of research. Meanwhile, individual researchers focusing on transparency and reproducibility should consider the possibility of setting – and ideally registering – predefined inclusion and exclusion criteria for experiments in their protocols. Some recommendations to consider include the following:

  • whenever possible, prespecify and register the criteria that will be used to define whether an experiment is valid for analysis – or, conversely, whether it should be excluded from it;

  • do not use criteria based on the effect size of the outcome measures of interest: agreement with predictions or previous findings should never be a criterion for including an experiment in an article;

  • implement public preregistration (Nosek et al., 2018), blind analysis (MacCoun and Perlmutter, 2015) and/or full data reporting with multiverse analysis (Steegen et al., 2016) to ensure that data inclusion choices are transparent and independent of the data.

Making these criteria as objective as possible can help researchers make inclusion decisions in an unbiased way, avoiding reliance on 'gut feelings' that can easily lead one astray. As Richard Feynman once said: ‘science is a way of trying not to fool yourself, and you are the easiest person to fool’ (Feynman, 1974). An easy way to make this advice heard is to explicitly state what an appropriate experiment means before starting it, and to stick to your view after the results are in.

References

  1. Feynman R (1974) Cargo cult science. Accessed February 28, 2020.
  2. Matlock B (2015) Assessment of nucleic acid purity. Accessed February 28, 2020.
  3. Sollaci LB, Pereira MG (2004) The introduction, methods, results, and discussion (IMRAD) structure: a fifty-year survey. Journal of the Medical Library Association 92:364–371.

Decision letter

  1. Peter Rodgers
    Senior and Reviewing Editor; eLife, United Kingdom
  2. Anita E Bandrowski
    Reviewer; University of California, San Diego, United States
  3. Wolfgang Forstmeier
    Reviewer; Max Planck Institute for Ornithology, Germany

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Thank you for submitting your article "Addressing selective reporting of experiments – the case for predefined exclusion criteria" for consideration by eLife. Your article has been reviewed by two peer reviewers, both of whom have agreed to reveal their identity: Anita E Bandrowski (Reviewer #1); Wolfgang Forstmeier (Reviewer #2). If you are able to address the points raised by the reviewers (see below) in a revised version, we will be happy to accept your article for publication.

Summary:

Reviewer #1

This is an excellent article about a very important and largely overlooked topic in the reproducibility conversation. It absolutely should be published. As an experimental scientist, I appreciate the author's practical stance, something that many conversations miss. Indeed, in that vein I am suggesting a few tidbits that may help the article become a better reference for experimental scientists, not just a very interesting position paper. However, these are merely suggestions and the manuscript should be published with or without addressing them.

Reviewer #2

I think this manuscript makes a very useful contribution by promoting higher standards of scientific objectivity. The issue of selective reporting of experiments within publications that present several experiments has received only limited attention so far. Hence, the present paper is helpful in terms of inspiring a debate about scientific objectivity and in terms of clarifying to some extent what is and what is not legitimate. I think the present manuscript could still be improved in two ways.

Essential revisions:

1) The authors state: "...only way to prevent bias is by setting exclusion criteria beforehand."

This is a very interesting argument, but I believe that there are other possibilities. Preregistration has not eliminated bias from the psychology literature, instead there are now many abandoned preregistrations. While this is a better state than biomedicine, where we really don't know who attempted which study, it is still not satisfying.

One could argue that it may become possible to label the study as exploratory vs validated. It seems that failure to validate is mostly problematic when clinical trials are involved and those might be reserved for only studies that have been validated using preregistration, proper controls, power analysis, blinding and a full description of the experimental group subjects. These criteria are addressed in less than 10% of studies, in our experience (MacLeod's laboratory has looked at percentages of rigor criteria across certain aspects of the disease literature much earlier than we did).

It should be noted, however, that exploratory studies can provide glimpses of interesting results not easily ascertained by holding all variables constant. One might argue that the real problem is that these exploratory studies are treated like real validated studies instead of something more akin to case reports. There is a good reason to have case reports, they are intended to describe an interesting observation. If biomedicine adopts a similar labeling system, it may also help to put the study in context.

2. In several sections, the authors describe antibody and cell lines examples, but give more practical examples of validation criteria for RT-PCR.

What should help the paper is a bit of practical advice for studies involving antibodies and cell lines. Across the biomedical literature, there are some technique specific advice that has been offered, not by a single expert, but by large panels of experts that have met and created a set of recommendations for properly validating the techniques.

Please consider adding some resources for different experimental paradigms in the form of a set of key references.

For antibodies I would suggest the Ulhen et al., 2016: PMID: 27595404 paper (based on a panel of experts).

For cell lines, I would suggest ICLAC.org (again a panel of experts with specific guidelines; there are multiple papers here, but probably pointing to the organization is sufficient).

3. I think it would be good if the paper could end with some clear messages (recommendations presented as numbered bullet points), which every study (i.e. publication) should consider. I would like to leave this up to the authors, but it should contain the message that the decision about the validity of an experiment must not be based on the effect size of the outcome measure.

4. I think that the paper would benefit from widening its scope such that its messages are applicable not only to preregistered studies but also to others. Rather than saying that scientific objectivity would require preregistration of criteria of experimental validity, I would say that there are 3 ways of achieving high objectivity standards: i) Preregistered criteria of validity; ii) Blinding of the person, who develops and applies the criteria of validity, from the research outcome of each experiment; iii) Complete reporting and summary of all experiments and pilot trials.

Currently the paper already hints at the possibility iii) in the subsection “Preregistration in confirmatory research” as an alternative to preregistration. The idea of blinding the decision maker has been explained elsewhere (e.g. MacCoun & Perlmutter, 2015; MacCoun & Perlmutter, 2017).

I guess it should be feasible for many studies to find an independent expert who is blinded from the data (outcome of experiments) and who can develop criteria for validity of experiments (even after data collection) and who applies those criteria strictly without knowing how this affects the overall outcome and conclusions. If studies implement such blinding techniques, I think it is also helpful to see these procedures described in the methods section. This signals awareness and will increase the credibility of a study compared to other studies where the authors may not even be aware of the issue of confirmation bias and selective reporting.

- MacCoun, R., & Perlmutter, S. (2015). Blind analysis: Hide results to seek the truth. Nature, 526(7572), 187-189.

- MacCoun, R. J., & Perlmutter, S. (2017). Blind analysis as a correction for confirmatory bias in physics and in psychology. Psychological science under scrutiny: Recent challenges and proposed solutions, 297-322.

https://doi.org/10.7554/eLife.56626.sa1

Author response

[We repeat the reviewers’ points here in italic, and include our replies point by point, as well as a description of the changes made, in plain text].

Essential revisions:

1) The authors state: "...only way to prevent bias is by setting exclusion criteria beforehand."

This is a very interesting argument, but I believe that there are other possibilities. Preregistration has not eliminated bias from the psychology literature, instead there are now many abandoned preregistrations. While this is a better state than biomedicine, where we really don't know who attempted which study, it is still not satisfying.

We agree with the reviewers that preregistrations are neither the only solution nor an infallible one. To address the reviewer’s concern, we have changed the passage mentioned to avoid conveying that interpretation. That said, we believe the point of preregistrations is to have a track record of what was originally planned, even if this is later abandoned and/or changed (as discussed in the subsection “Paths to confirmatory research”). Thus, the fact that not every preregistered study is published and not every preregistered protocol is followed faithfully is not necessarily an argument against the practice – one could argue that preregistration will at least allow a reader to know what remains unpublished or was changed in these cases. In any case, conforming to the reviewers’ suggestions, we now also discuss alternatives to preregistration to address reporting bias (see subsection “Paths to confirmatory research” and response to point #4 below).

One could argue that it may become possible to label the study as exploratory vs validated. It seems that failure to validate is mostly problematic when clinical trials are involved and those might be reserved for only studies that have been validated using preregistration, proper controls, power analysis, blinding and a full description of the experimental group subjects. These criteria are addressed in less than 10% of studies, in our experience (MacLeod's laboratory has looked at percentages of rigor criteria across certain aspects of the disease literature much earlier than we did).

It should be noted, however, that exploratory studies can provide glimpses of interesting results not easily ascertained by holding all variables constant. One might argue that the real problem is that these exploratory studies are treated like real validated studies instead of something more akin to case reports. There is a good reason to have case reports, they are intended to describe an interesting observation. If biomedicine adopts a similar labeling system, it may also help to put the study in context.

We agree with the reviewer that most basic biomedical research is indeed analogous to case reports or small exploratory studies in clinical research, and that only a small fraction of basic science studies can be thought of as confirmatory research in the same sense as a large clinical trial. We also agree that it is important to point out that this does not make them worthless – in fact, it can be argued that most discovery science should be exploratory in nature – and that a way to explicitly label studies as exploratory or confirmatory could be of use. We now highlight this distinction more clearly (using the case report analogy suggested) in the subsection “The reverse Texas sharpshooter”.

2. In several sections, the authors describe antibody and cell line examples, but give practical examples of validation criteria only for RT-PCR.

What would help the paper is a bit of practical advice for studies involving antibodies and cell lines. Across the biomedical literature, technique-specific advice has been offered, not by a single expert, but by large panels of experts that have met and created sets of recommendations for properly validating these techniques.

Please consider adding some resources for different experimental paradigms in the form of a set of key references.

For antibodies, I would suggest the Uhlen et al., 2016 paper (PMID: 27595404), which is based on a panel of experts.

For cell lines, I would suggest ICLAC.org (again, a panel of experts with specific guidelines; there are multiple papers here, but pointing to the organization is probably sufficient).

We thank the reviewer for pointing out these excellent references. We had indeed overemphasized certain methods in our original manuscript due to our experience in the Brazilian Reproducibility Initiative – which drove us to write this piece and is currently limited to three methods (MTT assays, RT-PCR and elevated plus maze), although it might add antibody-based techniques (Western blotting and/or immunohistochemistry) in the future.

We have used the references provided to add a paragraph about the use of pre-specified criteria for validation in cell line authentication and antibody quality control (see subsection “Pre-specified criteria to clean up the data record”), as suggested. We are aware that this list is still far from extensive, but such examples are indeed useful to drive home the kind of criteria that we are advocating for.

3. I think it would be good if the paper could end with some clear messages (recommendations presented as numbered bullet points), which every study (i.e. publication) should consider. I would like to leave this up to the authors, but it should contain the message that the decision about the validity of an experiment must not be based on the effect size of the outcome measure.

We agree with the reviewer that clear recommendations are useful, and have tried to provide them by adding bullet points to the Conclusion section as suggested, making an effort to be brief and straight to the point. We also fully agree with the message that the validity of an experiment cannot be based on the effect size of the outcome measure, and now state this explicitly in the subsection “Pre-specified criteria to clean up the data record” and in the Conclusion section.
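The principle that validity criteria must never look at the effect size of the outcome measure can be illustrated with a minimal sketch. All field names, thresholds and values below are hypothetical, introduced only for illustration – the criteria inspect quality-control measures (e.g. a positive-control response, assay viability), while the outcome measure is deliberately ignored:

```python
# Hypothetical sketch: applying pre-specified exclusion criteria to a set of
# experiments. Names and thresholds are illustrative, not from the article.
# The criteria check only quality-control fields, never "effect_size".

def is_valid(experiment, criteria):
    """Return True if the experiment passes every predefined criterion."""
    return all(check(experiment) for check in criteria)

# Criteria are defined (and ideally preregistered) before the data are seen.
criteria = [
    lambda e: e["positive_control"] >= 2.0,   # positive control responded
    lambda e: e["viability"] >= 0.8,          # e.g. MTT assay viability
]

experiments = [
    {"id": 1, "positive_control": 2.5, "viability": 0.9, "effect_size": 0.1},
    {"id": 2, "positive_control": 1.2, "viability": 0.9, "effect_size": 1.4},
    {"id": 3, "positive_control": 2.1, "viability": 0.7, "effect_size": 0.8},
]

# Experiments 2 and 3 fail quality-control checks and are excluded,
# regardless of how large their effect sizes are.
valid = [e["id"] for e in experiments if is_valid(e, criteria)]
print(valid)  # → [1]
```

Note that experiment 2, which has the largest effect size, is excluded on the positive-control criterion alone; decisions of this kind are defensible precisely because they could have been made before any outcome was known.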

4. I think that the paper would benefit from widening its scope such that its messages are applicable not only to preregistered studies but also to others. Rather than saying that scientific objectivity would require preregistration of criteria of experimental validity, I would say that there are three ways of achieving high objectivity standards: i) preregistered criteria of validity; ii) blinding of the person who develops and applies the criteria of validity to the research outcome of each experiment; iii) complete reporting and summary of all experiments and pilot trials.

Currently the paper already hints at possibility iii) in the subsection “Preregistration in confirmatory research” as an alternative to preregistration. The idea of blinding the decision maker has been explained elsewhere (e.g. MacCoun & Perlmutter, 2015; MacCoun & Perlmutter, 2017).

I guess it should be feasible for many studies to find an independent expert who is blinded from the data (outcome of experiments) and who can develop criteria for validity of experiments (even after data collection) and who applies those criteria strictly without knowing how this affects the overall outcome and conclusions. If studies implement such blinding techniques, I think it is also helpful to see these procedures described in the methods section. This signals awareness and will increase the credibility of a study compared to other studies where the authors may not even be aware of the issue of confirmation bias and selective reporting.

- MacCoun, R., & Perlmutter, S. (2015). Blind analysis: Hide results to seek the truth. Nature, 526(7572), 187-189.

- MacCoun, R. J., & Perlmutter, S. (2017). Blind analysis as a correction for confirmatory bias in physics and in psychology. Psychological science under scrutiny: Recent challenges and proposed solutions, 297-322.

We agree with the reviewer that there are alternatives to preregistration that can increase the confirmatory value of research, including independent blind analysis. We have thus widened the scope of the section on preregistration (now titled “Paths to confirmatory research”), which now describes different ways in which research may move towards confirmatory status, and we also mention blind analysis as an alternative in the Conclusion section. That said, we do believe that preregistration offers some advantages over blind analysis, in the sense of making the record of planning public and helping to address publication bias as well. This is now discussed in the subsection “Pre-specified criteria to clean up the data record”.
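The outcome-blinding procedure the reviewer describes – an independent expert developing and applying validity criteria without seeing the results – can be sketched in a few lines. The field names below are illustrative assumptions, not part of any actual workflow described in the article:

```python
# Hypothetical sketch of outcome-blinding: before an independent expert
# develops and applies validity criteria, the outcome fields are stripped
# from each record, so the decision cannot be influenced by the result.

def blind(experiments, outcome_fields=("effect_size", "p_value")):
    """Return copies of the records with outcome fields removed."""
    return [
        {k: v for k, v in e.items() if k not in outcome_fields}
        for e in experiments
    ]

experiments = [
    {"id": 1, "positive_control": 2.5, "effect_size": 0.1, "p_value": 0.60},
    {"id": 2, "positive_control": 1.1, "effect_size": 1.3, "p_value": 0.01},
]

# The expert sees only identifiers and quality-control fields.
blinded = blind(experiments)
print(blinded[0])
```

Describing such a blinding step in the methods section, as the reviewer suggests, signals awareness of confirmation bias even when criteria were not preregistered.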

https://doi.org/10.7554/eLife.56626.sa2

Article and author information

Author details

  1. Kleber Neves

    Kleber Neves is in the Institute of Medical Biochemistry Leopoldo de Meis, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil

    Contribution
    Writing - original draft
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0001-9519-4909
  2. Olavo B Amaral

    Olavo B Amaral is in the Institute of Medical Biochemistry Leopoldo de Meis, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil

    Contribution
    Conceptualization, Writing - review and editing
    For correspondence
    olavo@bioqmed.ufrj.br
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-4299-8978

Funding

Serrapilheira Institute (Brazilian Reproducibility Initiative)

  • Kleber Neves

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Senior and Reviewing Editor

  1. Peter Rodgers, eLife, United Kingdom

Reviewers

  1. Anita E Bandrowski, University of California, San Diego, United States
  2. Wolfgang Forstmeier, Max Planck Institute for Ornithology, Germany

Publication history

  1. Received: March 5, 2020
  2. Accepted: May 15, 2020
  3. Version of Record published: May 22, 2020 (version 1)

Copyright

© 2020, Neves and Amaral

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


