Science Forum: How failure to falsify in high-volume science contributes to the replication crisis

College of Information Sciences and Technology, The Pennsylvania State University, United States
Center for Open Science, United States
Department of Psychology and the Social Life and Engineering Sciences Imaging Center, The Pennsylvania State University, United States

Aug 8, 2022

Abstract

The number of scientific papers published every year continues to increase, but scientific knowledge is not progressing at the same rate. Here we argue that a greater emphasis on falsification – the direct testing of strong hypotheses – would lead to faster progress by allowing well-specified hypotheses to be eliminated. We describe an example from neuroscience where there has been little work to directly test two prominent but incompatible hypotheses related to traumatic brain injury. Based on this example, we discuss how building strong hypotheses and then setting out to falsify them can bring greater precision to the clinical neurosciences, and argue that this approach could be beneficial to all areas of science.

Background and motivation

The “replication crisis” in various areas of research has been widely discussed in journals over the past decade [see, for example, Gilbert et al., 2016; Baker, 2016; Open Science Collaboration, 2015; Munafò et al., 2017]. At the center of this crisis is the concern that any given scientific result may not be reliable; in this way, the crisis is ultimately a question about the collective confidence we have in our methods and results (Alipourfard et al., 2012). The past decade has also witnessed many advances in data science, and “big data” has both contributed to concerns about scientific reliability (Bollier and Firestone, 2010; Calude and Longo, 2017) and also offered the possibility of improving reliability in some fields (Rodgers and Shrout, 2018).

In this article we discuss scientific progress in the clinical neurosciences, and focus on an example related to traumatic brain injury (TBI). Using this example, we argue that the rapid pace of work in this field, coupled with a failure to directly test and eliminate (falsify) hypotheses, has resulted in an expansive literature that lacks the precision necessary to advance science. Instead, we suggest that falsification – where one develops a strong hypothesis, along with methods that can test and refute this hypothesis – should be used more widely by researchers. The strength of a hypothesis refers to how specific and how refutable it is (Popper, 1963; see Table 1 for examples). We also argue for greater emphasis on testing and refuting strong hypotheses through a “team science” framework that allows us to address the heterogeneity in samples and/or methods that makes so many published findings tentative (Cwiek et al., 2021; Bryan et al., 2021).

Table 1

Examples of hypotheses of different strength.

Exploratory research does not generally involve testing a hypothesis. A Testable Association is a weak hypothesis as it is difficult to refute. A Testable/Falsifiable Position is stronger, and a hypothesis that is Testable/Falsifiable with Alternative Finding is stronger still.

Type of research/hypothesis	Example
Exploratory	“We examine the neural correlates of cognitive deficit after brain injury implementing graph theoretical measures of whole brain neural networks”
Testable Association	“We hypothesize that graph theoretical measures of whole brain neural networks predict cognitive deficit after brain injury”
Testable/Falsifiable Position (offers possible mechanism and direction/magnitude of expected finding)	“We hypothesize that memory deficits during the first 6 months post injury are due to white matter connection loss and maintain a linear and positive relationship with increased global network path length”
Testable/Falsifiable with Alternative Finding (indicates how the hypothesis would and would not be supported)	“We hypothesize that memory deficits during the first 6 months post injury are due to white matter connection loss and maintain a linear and positive relationship with increased global network path length. Diminished global path length in individuals with greatest memory impairment would challenge this hypothesis”
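
The difference between the last two rows of Table 1 can be made concrete by writing the decision rule down before any data are seen. The following is a minimal sketch in Python, not a procedure taken from any of the cited studies: the `Prediction` class, the `evaluate` function and the threshold `min_abs_r` are hypothetical names and values introduced here purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """A pre-registered directional prediction about a correlation.

    Both fields are illustrative assumptions, not values from the article.
    """
    expected_sign: int   # +1 for a positive relationship, -1 for a negative one
    min_abs_r: float     # smallest correlation treated as meaningful support


def evaluate(pred: Prediction, r_low: float, r_high: float) -> str:
    """Classify an observed confidence interval [r_low, r_high] for r.

    The interval is first oriented so that the predicted direction is positive.
    - "supported": the whole interval lies at or above min_abs_r
    - "challenged": the whole interval lies in the opposite direction
    - "inconclusive": anything else
    """
    a = pred.expected_sign * r_low
    b = pred.expected_sign * r_high
    lo, hi = min(a, b), max(a, b)
    if lo >= pred.min_abs_r:
        return "supported"
    if hi < 0:
        return "challenged"
    return "inconclusive"


# Hypothetical prediction: memory deficit rises with global path length.
path_length_prediction = Prediction(expected_sign=+1, min_abs_r=0.2)
print(evaluate(path_length_prediction, 0.30, 0.60))
print(evaluate(path_length_prediction, -0.60, -0.10))
print(evaluate(path_length_prediction, -0.10, 0.40))
```

Committing to the "challenged" branch in advance is what separates the fourth row of the table from the third: the outcome that would count against the hypothesis is fixed before the analysis is run.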

Hyperconnectivity hypothesis in brain connectomics

To provide a specific example for the concerns outlined in this critique, we draw from the literature using resting-state fMRI methods and network analysis (typically graph theory; see Caeyenberghs et al., 2017) to examine systems-level plasticity in TBI. Beginning with one of the first papers combining functional neuroimaging and graph theory to examine network topology (Nakamura et al., 2009), an early observation in the study of TBI was that physical disruption of pathways due to focal and diffuse injury results in regional expansion (increase) in strength or number of functional connections. This initial finding was observed in a small longitudinal sample, but then similar effects were observed in other samples (Mayer et al., 2011; Bharath et al., 2015; Hillary et al., 2015; Johnson et al., 2012; Sharp et al., 2011; Iraji et al., 2016) and animal models of TBI (Harris et al., 2016). These findings were summarized in a paper by one of the current authors (FGH) outlining potential mechanisms for hyperconnectivity and its possible long-term consequences, including elevated metabolic demand, abnormal protein aggregation and, ultimately, increased risk for neurodegeneration (see Hillary and Grafman, 2017). The “hyperconnectivity response” to neurological insult was proposed as a possible biomarker for injury/recovery in a review summarizing findings in TBI brain connectomics (Caeyenberghs et al., 2017).

Nearly simultaneously, other researchers offered a distinct – in fact, nearly the opposite – set of findings. Several studies of moderate to severe brain injury (as examined above) found that white matter disruption during injury resulted in structural and functional disconnection of networks. The authors in these papers outline a “disconnection” hypothesis: the physical degradation of white matter secondary to traumatic axonal injury results in reduced connectivity of brain networks, which is visible both structurally in diffusion imaging studies (Fagerholm et al., 2015) and functionally using resting-state fMRI approaches (Bonnelle et al., 2011). These findings were summarized in a high-profile review (Sharp et al., 2014) where the authors argue that TBI “substantially disrupts [connectivity], and that this disruption predicts cognitive impairment …”.

When juxtaposed, these two hypotheses hold distinct explanations for the same phenomenon with the first proposing that axonal injury results in a paradoxically enhanced functional network response and the second that the same pathophysiology results in reduced functional connectivity. Both cannot be true as they have been proposed, so which is correct? Even with two apparently contradictory hypotheses in place, there has been no direct testing of these positions against one another to determine the scenarios where either may have merit. Instead, each of these hypotheses remained unconditionally intact and served to support distinct sets of outcomes.

The most important point to be made from this example is not that competing theories in this literature exist. On the contrary, having competing theories for understanding a phenomenon places science in a strong position; the theories can be tested against one another to qualify (or even eliminate) one position. The point is that there have been no attempts to falsify either a hyperconnectivity or disconnection hypothesis, allowing researchers to invoke one or the other depending upon the finding for a given dataset (i.e., disconnection due to white matter loss, or functional “compensation” in the case of hyperconnectivity). What has contributed to this problem is that increasingly complex computational modeling also increases the investigator degrees of freedom, both implicitly and explicitly, to support their hypotheses. In the case of the current example of neural networks, these include selection from a number of brain atlases or other methods for brain parcellation and likewise numerous approaches to neural network definition (see Hallquist and Hillary, 2019). Figure 1 provides a schematic representation of two distinct and simultaneously supported hypotheses in head injury.

Figure 1

Two competing theories for functional network response after brain injury.

Panel A represents the typical pattern of resting connectivity for the default mode network (DMN) and the yellow box shows a magnified area of neuronal bodies and their axonal projections. Panel B reveals three active neuronal projections (red) that are then disrupted by hemorrhagic lesion of white matter (Panel C). In response to this injury, a hyperconnectivity response (Panel D, left) shows increased signaling to adjacent areas resulting in a pronounced DMN response (Panel D, right). By contrast a disconnection hypothesis maintains that signaling from the original neuronal assemblies is diminished due to axonal degradation and neuronal atrophy secondary to cerebral diaschisis (Panel E, left) resulting in reduced functional DMN response (Panel E, right).

To be clear, the approach taken by investigators in this TBI literature is consistent with a research agenda designed to meet the demands for high publication throughput (more on this below). Investigators publish preliminary findings but remain appropriately tentative in their conclusions given that the sample is small and unexplained factors are numerous. Indeed, a common refrain in many publications is the “need for replication in a larger sample”. Rather than pre-registering and testing strong hypotheses, investigators are rewarded for identifying significant results (any result) for publication. In brain injury work examining network plasticity, investigators have often made general claims that brain injury results in “different” or “altered” connectivity (a problem dating back to early fMRI studies in TBI; Hillary, 2008). Although the imprecision is unintentional, such hypotheses increase the likelihood that chance findings are published. The primary consequence is that all findings are “winners”, permitting growing support for either position without movement toward resolution.

Overall, the TBI connectomics literature presents a clear example of a failure to falsify and we argue that it is attributable, at least in part, to the publication of large numbers of papers reporting the results of studies in which small samples were used to examine under-specified hypotheses. This “science-by-volume” approach is exacerbated by the overuse of inappropriate statistical tests, which increases the probability that spurious findings will be reported as meaningful (Button et al., 2013).

The challenges outlined here, where there is a general failure to test and refute strong hypotheses, are not isolated to the TBI literature. Similar issues have been raised in preclinical studies of stroke (Corbett et al., 2017), in the translational neurosciences, where investigators maintain flexible theory and predictions to fit methodological limitations (Macleod et al., 2014; Pound and Ritskes-Hoitinga, 2018; Henderson et al., 2013), and in cancer research, where only portions of published data sets provide support for hypotheses (Begley and Ellis, 2012). These factors have likely contributed to the repeated failure of clinical trials to move from animal models to successful Phase III interventions in clinical neuroscience (Tolchin et al., 2020). This example in the neurosciences also mirrors the longstanding problems of co-existing yet inconsistent theories in other disciplines like social psychology (see Watts, 2017).

Big data and computational methods as friend and foe

On the one hand, the big data revolution and the advancement of computational modeling, powered by enhanced computing infrastructure, have magnified concerns about scientific reliability through unprecedented flexibility in data exploration and analysis. Sufficiently large datasets provably contain spurious correlations, and the number of these coincidental regularities increases as the dataset size increases (Calude and Longo, 2017; Graham and Spencer, 1990). Adding to this flexibility, predictive algorithms built on top of these large datasets typically involve a great number of investigator decisions, the combined effects of which undermine the reliability of findings [for an example in connectivity modeling see Hallquist and Hillary, 2019]. Results of machine learning models, for example, are sensitive to model specification and parameter tuning (Pineau, 2021; Bouthillier et al., 2019; Cwiek et al., 2021). Computational approaches permit systematically combing through a great number of potential variables of interest and their statistical relationships, at scales that would be infeasible manually. Consequently, the burden of reliability falls upon the existence of strong, well-founded hypotheses with sufficient power and clear pre-analysis plans. It has even been suggested that null hypothesis significance testing should only be used in the neurosciences in support of pre-registered hypotheses based on strong theory (Szucs and Ioannidis, 2017).
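
A simple way to see how coincidental regularities accumulate is to count "significant" correlations among variables that are unrelated by construction. The sketch below is a multiple-comparisons illustration only; it is not the Ramsey-theoretic argument of Calude and Longo (2017). The function names, the sample size of 30 observations and the random seed are assumptions chosen for illustration; 0.361 is the two-sided 5% critical value of Pearson's r for 30 observations.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_vars, n_obs=30, r_crit=0.361, seed=0):
    """Count variable pairs whose |r| exceeds r_crit in pure noise.

    Every variable is drawn independently from a standard normal
    distribution, so any pair that passes the threshold is a chance finding.
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    hits = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            if abs(pearson_r(data[i], data[j])) > r_crit:
                hits += 1
    return hits


for n_vars in (10, 20, 40):
    n_pairs = n_vars * (n_vars - 1) // 2
    print(n_vars, "variables,", n_pairs, "pairs,",
          count_spurious(n_vars), "chance 'findings'")
```

Because the number of pairs grows roughly with the square of the number of variables, an unconstrained search over a wide dataset yields more chance findings even though no true relationship exists anywhere in the data.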

So, while there is concern that Big Data moves too fast and without the necessary constraints of theory, there is also emerging sentiment that the tremendous computational power coupled with unparalleled data access has the potential to transform some of the most basic scientific tenets, including introduction of a “third scientific pillar” to be added to theory and experimentation (see National Science Foundation, 2010). While this latter position has received criticism (Andrews, 2012), computational methods have demonstrably offered novel tools to address the replication crisis – an issue addressed in greater detail in the section on metascience below.

Operating without anchors in a sea of high-volume science

One challenge, then, is to determine where the bedrock of our field (our foundational knowledge) ends, and where areas of discovery that show promise (but have yet to be established) begin. By some measures neurology is a fledgling field in the biological sciences: the publication of De humani corporis fabrica by Vesalius in 1543 is often taken to mark the start of the study of human anatomy (Vesalius, 1555), and Jean-Martin Charcot – often referred to as the “founder of neurology” – arrived approximately 300 years later (Zalc, 2018). If we simplify our task and start with the work of Milner, Geschwind and Luria in the 1950s, it is still a challenge to determine what is definitively known and what remains conjectural in the field. This challenge is amplified by the pressure on researchers to publish or perish (Macleod et al., 2014; Kiai, 2019; Lindner et al., 2018). The number of papers published per year continues to increase without asymptote (Bornmann and Mutz, 2015). When considering all papers published in the clinical neurosciences since 1900, more than 50% of the entire literature has been published in the last 10 years and 35% in the last five years (see supplementary figures S1a,b in Priestley et al., 2022). In the most extreme examples, “hyperprolific” lab directors publish a scientific paper roughly every 5 days (Ioannidis et al., 2018). It is legitimate to ask if the current proliferation of published findings has been matched by advances in scientific knowledge, or if the rate of publishing is outpacing scientific ingenuity (Sandström and van den Besselaar, 2016) and impeding the emergence of new theories (Chu and Evans, 2021).

We argue that a culture of science-by-volume is problematic for the reliability of science, particularly when paired with research agendas not designed to test and refute hypotheses. First, without pruning possible explanations through falsification, the science-by-volume approach creates an ever-expanding search space where finite human and financial resources are deployed to maximize breadth in published findings as opposed to depth of understanding (Figure 2A). Second, and as an extension of the first point, failure to falsify in a high-volume environment challenges our capacity to know which hypotheses represent foundational theory, which hypotheses are encouraging but require further confirmation, and which hypotheses should be rejected. Finally, in the case of the least-publishable-unit (Broad, 1981), a single data set may be carved into several smaller papers, resulting in circles of self-citation and the illusion of reliable support for a hypothesis (or hypotheses) (Gleeson and Biddle, 2000).

Figure 2

The role of falsification in pruning high volume science to identify the fittest theories.

Panels A and B illustrate the conceptual steps in theory progression from exploration through confirmation and finally application. The x-axis is theoretical progression (time) and the y-axis is the number of active theories. Panel A depicts progression in the absence of falsification with continued branching of theories in the absence of pruning (theory reduction through falsification). By contrast the “Confirmatory Stage” in Panel B includes direct testing and refutation of theories/explanations resulting in only the fittest theories to choose from during application. Note: both Panels A and B include replication, but falsification during the “confirmation” phase results in a linear pathway and fewer choices from the “fittest” theories at the applied stage.

There have even been efforts internationally to make science more deliberate through de-emphasis of publication rates in academic circles (Dijstelbloem et al., 2013). Executing this type of systemic change in publication rate poses significant challenges and may ultimately be counterproductive because it fails to acknowledge the advancements in data aggregation and analysis afforded by high performance computing and rapid scientific communication through technology. So, while an argument can be made that our rate of publishing is not commensurate with our scientific progress, a path backward to a lower annual publication rate seems an unlikely solution and ignores the advantages of modernity. Instead, we should work toward establishing scientific foundation by testing and refuting strong hypotheses and these efforts may hold the greatest benefit when used to prune theories to determine the fittest prior to replication (Figure 2B). This effort maximizes resources and makes the goals for replication, as a confrontation of theoretical expectations, very clear (Nosek and Errington, 2020a). The remainder of the paper outlines how this can be achieved with focus on several contributors to the replication crisis.

Accelerating science by falsifying strong hypotheses

In praise of strong hypotheses

Successful refutation of hypotheses ultimately depends upon a number of factors, not the least of which is the specificity of the hypothesis (Earp and Trafimow, 2015). A simple but well-specified hypothesis brings greater leverage to science than a hypothesis that is far reaching with broad implications but cannot be directly tested or refuted. Even Popper wrote about concerns in the behavioral sciences regarding the rather general nature of hypotheses (Bartley, 1978), a sentiment that has recently been described as a “crisis” in psychological theory advancement (Rzhetsky et al., 2015). As discussed in the TBI connectomics example, hypotheses may have been broad and “exploratory” because authors remained conservative in their claims and conclusions, given that studies have been systematically under-powered (one report estimating power at 8%; Button et al., 2013). While exploration is a vital part of science (Figure 2), it must be recognized as scientific exploration as opposed to an empirical test of a hypothesis. Under-developed hypotheses have been argued to be at least one contributor to repeated failure of clinical trials in acute neurological interventions (Schwamm, 2014), yet, paradoxically, strong hypotheses may offer increased sensitivity to subtle effects even in small samples (Lazic, 2018).
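
To see why under-powered studies invite tentative, exploratory framing, statistical power can be estimated by simulation. The sketch below is illustrative only and is not drawn from Button et al. (2013): the group size of 20, the standardized effect sizes, the normally distributed data and the function name `simulated_power` are all assumptions. The value 2.024 is the two-sided 5% critical value of the t distribution with 38 degrees of freedom.

```python
import math
import random
import statistics


def simulated_power(effect_d, n_per_group, t_crit, n_sims=2000, seed=1):
    """Estimate the power of a two-sample t-test by Monte Carlo simulation.

    effect_d is the true standardized mean difference between groups.
    Returns the fraction of simulated studies in which |t| exceeds t_crit.
    """
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        patient = [rng.gauss(effect_d, 1.0) for _ in range(n_per_group)]
        standard_error = math.sqrt(
            statistics.variance(control) / n_per_group
            + statistics.variance(patient) / n_per_group
        )
        t = (statistics.mean(patient) - statistics.mean(control)) / standard_error
        if abs(t) > t_crit:
            significant += 1
    return significant / n_sims


# Hypothetical scenario: 20 participants per group.
for d in (0.2, 0.5, 0.8):
    print("d =", d, "estimated power =", simulated_power(d, 20, t_crit=2.024))
```

Under these assumptions a small true effect is detected in only a small minority of simulated studies, so a single significant result from such a design carries little weight as a test of a hypothesis.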

If we appeal to Popper, the strongest hypotheses make “risky predictions”, therefore prohibiting alternative explanations (see Popper, 1963). Moreover, the strongest hypotheses make clear at the outset the findings that would support the prediction, and also those that would not. Practically speaking this could take the form of teams of scientists developing opposing sets of hypotheses and then agreeing on both the experiments and the outcomes that would falsify one or both positions (what Nosek and Errington refer to as precommitment; Nosek and Errington, 2020b). This creates scenarios a priori where strong hypotheses are matched with methods that can provide clear tests. This approach is currently being applied in the “accelerating research on consciousness” programme funded by the Templeton World Charity Foundation. The importance of coupling strong hypotheses with methods that can provide clear tests cannot be overstated. In the brain imaging literature alone, there are poignant examples where flawed methods (or misunderstanding of their applications) have resulted in the repeated substantiation of spurious results (in structural covariance analysis see Carmon et al., 2020; in resting-state fMRI see Satterthwaite et al., 2012; Van Dijk et al., 2012).

Addressing heterogeneity to create strong hypotheses

One approach to strengthen hypotheses is to address sample and methodological heterogeneity which plagues the clinical neurosciences (Benedict and Zivadinov, 2011; Bennett et al., 2019; Schrag et al., 2019; Zucchella et al., 2020; Yeates et al., 2019). To echo a recent review of work in the social sciences, the neurosciences require a “heterogeneity revolution” (Bryan et al., 2021). Returning again to the TBI connectomics example, investigators relied upon small datasets heterogeneous with respect to age of injury, time post injury, injury severity, and other factors that could critically influence the response of the neural system to injury. Strong hypotheses determine the influence of sample characteristics by directly modeling the effects of demographic and clinical factors (Bryan et al., 2021) as opposed to statistically manipulating the variance accounted for by them – including the widespread and longstanding misapplication of covariance statistics to “equilibrate” groups in case-control designs (Miller and Chapman, 2001; Zinbarg et al., 2010; Storandt and Hudson, 1975). Finally, strong hypotheses leverage the pace of our current science as an ally, where studies designed specifically to address sample heterogeneity can test the role of clinical and demographic predictors in brain plasticity and outcome.

Open science and sharing to bolster falsification efforts

Addressing sample heterogeneity requires large diverse samples, and one way to achieve this is via data sharing. While data-sharing practices and availability differ across scientific disciplines (Tedersoo et al., 2021), there are enormous opportunities for sharing data in the clinical neurosciences (see, for example, the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Transforming Research and Clinical Knowledge in Traumatic Brain Injury (TRACK-TBI) initiative), even in cases where data were not collected with identical methods (such as the Enhancing NeuroImaging Genetics through Meta-Analysis (ENIGMA) Consortium; see Olsen et al., 2021 for more on severe brain injury, and Thompson et al., 2020 for a broad summary of work in clinical neuroscience). However, data aggregation and harmonization approaches remain largely untested as a solution to science-by-volume problems in the neurosciences.

It should be stressed that data sharing as a practice is not a panacea for poor study design and/or an absence of theory. Combining data does not eliminate existing issues related to instrumentation and data collection at individual sites; data sharing permits faster accumulation of data while retaining any existing methodological concerns (e.g., harmonization). If unaddressed, these concerns introduce magnified noise or systematic bias masquerading as high-powered findings (Maikusa et al., 2021). However, well-designed data sharing efforts with rigorous harmonization approaches (e.g., Fortin et al., 2017; Tate et al., 2021) hold opportunities for falsification through meta-analyses, mega-analyses, and between-site data comparisons (Thompson et al., 2022). Data sharing and team science also provide realistic opportunities to address sample heterogeneity and site-level idiosyncrasies in method.

Returning to the TBI connectomics example above, data sharing could play a central role in resolving this literature. The neural network response to injury most likely depends upon where one looks (specific neural networks), time post injury, and perhaps a range of clinical and demographic factors such as age of injury, current age, sex, and premorbid status. Clinically and demographically heterogeneous samples of n~40–50 subjects do not have the resolution necessary to determine when hyperconnectivity occurs and when it may give way to disconnection (see Caeyenberghs et al., 2017; Hillary and Grafman, 2017). Data sharing and team science organized to test strong hypotheses can provide clarity to this literature.
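
The way an unmodeled moderator can let both hypotheses appear supported can be sketched with a toy simulation. Everything in it is hypothetical: the linear dependence of the patient-minus-control connectivity difference on months post injury, the coefficients, the noise level and the function names are invented for illustration and are not estimates from the TBI literature.

```python
import random
import statistics


def connectivity_difference(months_post_injury, rng):
    """Toy patient-minus-control connectivity difference (arbitrary units).

    Hypothetical model: positive (hyperconnectivity) early after injury,
    declining linearly and turning negative (disconnection) later.
    """
    return 1.0 - 0.1 * months_post_injury + rng.gauss(0.0, 1.0)


def mean_difference(n, min_months, max_months, seed=0):
    """Mean difference in a sample recruited within a time window."""
    rng = random.Random(seed)
    values = [
        connectivity_difference(rng.uniform(min_months, max_months), rng)
        for _ in range(n)
    ]
    return statistics.mean(values)


early = mean_difference(200, 0, 6)      # recruited in the first 6 months
late = mean_difference(200, 14, 20)     # recruited 14 to 20 months post injury
pooled = mean_difference(2000, 0, 20)   # time post injury ignored

print("early window: ", round(early, 2))
print("late window:  ", round(late, 2))
print("pooled sample:", round(pooled, 2))
```

In this toy model a study recruiting early would report hyperconnectivity, a study recruiting late would report disconnection, and a pooled analysis that ignores time post injury would report little difference at all. Modeling the moderator directly, which requires the larger shared samples discussed above, is what would allow the two accounts to be tested against each other.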

Harnessing big data to advance metascience

Metascience (Peterson and Panofsky, 2014) has become central to many of the issues raised here. Metascience uses the tools of science to describe and evaluate science on a macro scale and to motivate reforms in scientific practice (Munafò et al., 2017; Ioannidis et al., 2015; Gurevitch et al., 2018). The emergence of metascience is at least partially attributable to advances in web search and indexing, network science, natural language processing, and computational modeling. Amongst other aims, work under this umbrella has sought to diagnose biases in research practice (Larivière et al., 2013; Clauset et al., 2015; Huang et al., 2020), understand how researchers select new work to pursue (Rzhetsky et al., 2015; Jia et al., 2020), identify factors contributing to academic productivity (Liu et al., 2018; Li et al., 2018; Pluchino et al., 2019; Janosov et al., 2020), and forecast the emergence of new areas of research (Prabhakaran et al., 1959; Asooja et al., 2016; Salatino et al., 2018; Chen et al., 2017; Krenn and Zeilinger, 2020; Behrouzi et al., 2020).

A newer thread of ongoing efforts within the metascience community is working to build and promote infrastructure for reproducible and transparent scholarly communication (see Konkol et al., 2020 for a recent review, Wilkinson et al., 2016; Nosek et al., 2015). As part of this vision, primary deliverables of research processes include machine-readable outputs that can be queried by researchers for meta-analyses and theory development (Priem, 2013; Lakens and DeBruine, 2021; Brinckman et al., 2019). These efforts are coupled with recent major investments in approaches to further automate research synthesis and hypothesis generation. The Big Mechanism program, for example, was set up by the Defense Advanced Research Projects Agency (DARPA) to fund the development of technologies to read the cancer biology literature, extract fragments of causal mechanisms from publications, assemble these mechanisms into executable models, and use these models to explain and predict new findings, and even test these predictions (Cohen, 2015).

Lines of research have also emerged that use creative assembly of experts (e.g., prediction markets; Dreber et al., 2015; Camerer et al., 2016; Camerer et al., 2018; Gordon et al., 2020) and AI-driven approaches (Altmejd et al., 2019; Pawel and Held, 2020; Yang et al., 2020) to estimate confidence in specific research hypotheses and findings. These too have been facilitated by advances in information extraction, natural language processing, machine learning, and larger training datasets. The DARPA-funded Systematizing Confidence in Open Research and Evidence (SCORE) program, for example, is nearing the end of a coordinated three-year effort to develop technologies to predict and explain the replicability, generalizability and robustness of published claims in the social and behavioral sciences literatures (Alipourfard et al., 2012). As it continues to advance, the metascience community may serve to revolutionize the research process, resulting in a literature that is readily interrogated and upon which strong hypotheses can be built.

Falsification for scaffolding convergence research

Advances in computing hold the promise of richer datasets, AI-driven meta-analyses, and even automated hypothesis generation. However, thus far, efforts to harness big data and emerging technologies for falsification and replication have been relatively uncoordinated, with the aforementioned Big Mechanism and SCORE programs amongst a few notable exceptions.

The need to refine theories becomes increasingly apparent when confronted with resource, ethical, and practical constraints that limit what can be further pursued empirically. At the same time, addressing pressing societal needs requires innovation and convergence research. One example is the call for “grand challenges”, a family of initiatives focused on tackling daunting unsolved problems with large investments intended to make an applied impact. These targeted investments tend to lead to a proliferation of science; however, these mechanisms could also incorporate processes to refine and interrogate theories as they progress towards addressing a specific and compelling issue. A benefit of incorporating falsification into this pipeline is that it encourages differing points of view, a desired feature of grand challenges (Helbing, 2012) and other translational research programs. For example, including clinical researchers in the design of experiments conducted at the preclinical stage can strengthen hypotheses before they are tested, potentially increasing the utility of the result regardless of the outcome (Seyhan, 2019). To realize the full potential, investment in developing and maturing computational models is also needed to leverage the sea of scientific data to help identify the level of confidence in the fitness and replicability of each theory, and where best to deploy resources. This could lead to more rapid theory refinements and greater feedback for what new data to collect than would be possible using hypothesis-driven or data-intensive approaches in isolation (Peters et al., 2014).

Practical challenges to falsification

We have proposed that falsification of strong hypotheses provides a mechanism to increase study reliability. High-volume science should ideally function to eliminate possible explanations; otherwise productivity obfuscates progress. But can falsification ultimately achieve this goal? A strict Popperian approach, in which every observation represents either a confirmation or refutation of a hypothesis, is challenging to implement in day-to-day scientific practice (Lakatos, 1970; Kuhn, 1970). What’s more, one cannot, with complete certainty, disprove a hypothesis any more than one can hope to prove a hypothesis (see Lakatos, 1970). It was Popper who emphasized that truth is ephemeral and even when it can be accessed, it remains provisional (Popper, 1959).

The philosophical dilemma in establishing the “true” nature of a scientific finding is reflected in the pragmatic challenges facing replication science. Even after an effort to replicate a finding, when investigators are presented with the results and asked if the replication was a success, the outcome is often disagreement, resulting in “intellectual gridlock” (Nosek and Errington, 2020b). So, if the goal of falsifying a hypothesis is both practically and philosophically flawed, why the emphasis? The answer is that, while falsification cannot remove the foibles of human nature, systematic methodological error, and noise from the scientific process, setting our sights on testing and refuting strong a priori hypotheses may uncover the shortcomings of our explanations. Attempts to falsify through refutation cannot be definitive, but the outcome of multiple efforts can critically inform the direction of a science (Earp and Trafimow, 2015) when formally integrated into the scientific process (as depicted in Figure 2).

Finally, falsification alone is an incomplete response to problems of scientific reliability, but it becomes a powerful tool when combined with efforts that maximize transparency in methods, make null results available, facilitate data and code sharing, and strengthen the incentives for investigators to refocus on open and transparent science.

Conclusion

Due to several factors, including a high-volume science culture and previously unavailable computational resources, the empirical sciences have never been more productive. This unparalleled productivity invites questions about the rigor and direction of science and, ultimately, how these efforts translate to scientific advancement. We have proposed that a primary goal should be to identify the “ground truths” that can serve as a foundation for more deliberate study and that, to do so, there must be greater emphasis on testing and refuting strong hypotheses. The falsification of strong hypotheses enhances the power of replication by pruning options and identifying the most promising hypotheses, including possible mechanisms. When conducted through a team science framework, the endeavor leverages shared datasets that allow us to address the heterogeneity that makes so many findings tentative. We must take steps toward more transparent and open science, including, most importantly, the pre-registration of strong hypotheses. The ultimate goal is to harness rapid advancements in big data and computational power, together with strong, well-defined theory, to accelerate science.

Data availability

There are no data associated with this article.

References

Preprint
1. Alipourfard N
2. Arendt B
3. Benjamin DM
4. Benkler N
5. Bishop MM
6. Burstein M
7. Bush M
8. Caverlee J
9. Chen Y
10. Clark C
11. Dreber A
12. Errington TM
13. Fidler F
14. Fox NW
15. Frank A
16. Fraser H
17. Friedman S
18. Gelman B
19. Gentile J
20. Giles CL
21. Gordon MB
22. Gordon-Sarney R
23. Griffin C
24. Gulden T
25. Hahn K
26. Hartman R
27. Holzmeister F
28. Hu XB
29. Johannesson M
30. Kezar L
31. Kline Struhl M
32. Kuter U
33. Kwasnica AM
34. Lee DH
35. Lerman K
36. Liu Y
37. Loomas Z
38. Luis B
39. Magnusson I
40. Miske O
41. Mody F
42. Morstatter F
43. Nosek BA
44. Parsons ES
45. Pennock D
46. Pfeiffer T
47. Pujara J
48. Rajtmajer S
49. Ren X
50. Salinas A
51. Selvam RK
52. Shipman F
53. Silverstein P
54. Sprenger A
55. Squicciarini AM
56. Stratman S
57. Sun K
58. Tikoo S
59. Twardy CR
60. Tyner A
61. Viganola D
62. Wang J
63. Wilkinson DP
64. Wintle B
65. Wu J
(2012) Systematizing Confidence in Open Research and Evidence (SCORE)
SocArXiv.

https://osf.io/preprints/socarxiv/46mnb
1. Altmejd A
2. Dreber A
3. Forsell E
4. Huber J
5. Imai T
6. Johannesson M
7. Kirchler M
8. Nave G
9. Camerer C
10. Wicherts JM
(2019) Predicting the replicability of social science lab experiments
PLOS ONE 14:e0225826.

https://doi.org/10.1371/journal.pone.0225826
1. Andrews GE
(2012) Drowning in the data deluge
Notices of the American Mathematical Society 59:933.

https://doi.org/10.1090/noti871
Conference
1. Asooja K
2. Bordea G
3. Vulcu G
4. Buitelaar P
(2016)
Forecasting emerging trends from scientific literature

Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016.
1. Baker M
(2016) 1,500 scientists lift the lid on reproducibility
Nature 533:452–454.

https://doi.org/10.1038/533452a
1. Bartley WW
(1978) The philosophy of Karl Popper
Philosophia 7:675–716.

https://doi.org/10.1007/BF02378843
1. Begley CG
2. Ellis LM
(2012) Drug development: Raise standards for preclinical cancer research
Nature 483:531–533.

https://doi.org/10.1038/483531a
(2020) Predicting scientific research trends based on link prediction in keyword networks
Journal of Informetrics 14:101079.

https://doi.org/10.1016/j.joi.2020.101079
1. Benedict RHB
2. Zivadinov R
(2011) Risk factors for and management of cognitive dysfunction in multiple sclerosis
Nature Reviews Neurology 7:332–342.

https://doi.org/10.1038/nrneurol.2011.61
(2019) Practitioner review: unguided and guided self-help interventions for common mental health disorders in children and adolescents: A systematic review and meta-analysis
Journal of Child Psychology and Psychiatry, and Allied Disciplines 60:828–847.

https://doi.org/10.1111/jcpp.13010
1. Bharath RD
2. Munivenkatappa A
3. Gohel S
4. Panda R
5. Saini J
6. Rajeswaran J
7. Shukla D
8. Bhagavatula ID
9. Biswal BB
(2015) Recovery of resting brain connectivity ensuing mild traumatic brain injury
Frontiers in Human Neuroscience 9:513.

https://doi.org/10.3389/fnhum.2015.00513
Website
1. Bollier D
2. Firestone CM
(2010) The promise and peril of big data Aspen Institute, Communications and Society Program
Accessed August 2, 2022.

https://www.aspeninstitute.org/publications/promise-peril-big-data/
1. Bonnelle V
2. Leech R
3. Kinnunen KM
4. Ham TE
5. Beckmann CF
6. De Boissezon X
7. Greenwood RJ
8. Sharp DJ
(2011) Default mode network connectivity predicts sustained attention deficits after traumatic brain injury
Journal of Neuroscience 31:13442–13451.

https://doi.org/10.1523/JNEUROSCI.1163-11.2011
1. Bornmann L
2. Mutz R
(2015) Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references
Journal of the Association for Information Science and Technology 66:2215–2222.

https://doi.org/10.1002/asi.23329
Conference
(2019)
Unreproducible research is reproducible

International Conference on Machine Learning PMLR.
1. Brinckman A
2. Chard K
3. Gaffney N
4. Hategan M
5. Jones MB
6. Kowalik K
7. Kulasekaran S
8. Ludäscher B
9. Mecum BD
10. Nabrzyski J
11. Stodden V
12. Taylor IJ
13. Turk MJ
14. Turner K
(2019) Computing environments for reproducibility: capturing the “whole tale.”
Future Generation Computer Systems 94:854–867.

https://doi.org/10.1016/j.future.2017.12.029
1. Broad WJ
(1981) The publishing game: Getting more for less
Science 211:1137–1139.

https://doi.org/10.1126/science.7008199
(2021) Behavioural science is unlikely to change the world without a heterogeneity revolution
Nature Human Behaviour 5:980–989.

https://doi.org/10.1038/s41562-021-01143-3
1. Button KS
2. Ioannidis JPA
3. Mokrysz C
4. Nosek BA
5. Flint J
6. Robinson ESJ
7. Munafò MR
(2013) Power failure: why small sample size undermines the reliability of neuroscience
Nature Reviews Neuroscience 14:365–376.

https://doi.org/10.1038/nrn3475
(2017) Mapping the functional connectome in traumatic brain injury: What can graph metrics tell us?
NeuroImage 160:113–123.

https://doi.org/10.1016/j.neuroimage.2016.12.003
1. Calude CS
2. Longo G
(2017) The deluge of spurious correlations in big data
Foundations of Science 22:595–612.

https://doi.org/10.1007/s10699-016-9489-4
1. Camerer CF
2. Dreber A
3. Forsell E
4. Ho TH
5. Huber J
6. Johannesson M
7. Kirchler M
8. Almenberg J
9. Altmejd A
10. Chan T
11. Heikensten E
12. Holzmeister F
13. Imai T
14. Isaksson S
15. Nave G
16. Pfeiffer T
17. Razen M
18. Wu H
(2016) Evaluating replicability of laboratory experiments in economics
Science 351:1433–1436.

https://doi.org/10.1126/science.aaf0918
1. Camerer CF
2. Dreber A
3. Holzmeister F
4. Ho TH
5. Huber J
6. Johannesson M
7. Kirchler M
8. Nave G
9. Nosek BA
10. Pfeiffer T
11. Altmejd A
12. Buttrick N
13. Chan T
14. Chen Y
15. Forsell E
16. Gampa A
17. Heikensten E
18. Hummer L
19. Imai T
20. Isaksson S
21. Manfredi D
22. Rose J
23. Wagenmakers EJ
24. Wu H
(2018) Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015
Nature Human Behaviour 2:637–644.

https://doi.org/10.1038/s41562-018-0399-z
1. Carmon J
2. Heege J
3. Necus JH
4. Owen TW
5. Pipa G
6. Kaiser M
7. Taylor PN
8. Wang Y
(2020) Reliability and comparability of human brain structural covariance networks
NeuroImage 220:117104.

https://doi.org/10.1016/j.neuroimage.2020.117104
Conference
1. Chen C
2. Wang Z
3. Li W
4. Sun X
(2017) Modeling scientific influence for research trending topic prediction
Proceedings of the AAAI Conference on Artificial Intelligence.

https://doi.org/10.1609/aaai.v32i1.11882
1. Chu JSG
2. Evans JA
(2021) Slowed canonical progress in large fields of science
PNAS 118:e2021636118.

https://doi.org/10.1073/pnas.2021636118
(2015) Systematic inequality and hierarchy in faculty hiring networks
Science Advances 1:e1400005.

https://doi.org/10.1126/sciadv.1400005
1. Cohen PR
(2015) DARPA’s Big Mechanism program
Physical Biology 12:045008.

https://doi.org/10.1088/1478-3975/12/4/045008
1. Corbett D
2. Carmichael ST
3. Murphy TH
4. Jones TA
5. Schwab ME
6. Jolkkonen J
7. Clarkson AN
8. Dancause N
9. Weiloch T
10. Johansen-Berg H
11. Nilsson M
12. McCullough LD
13. Joy MT
(2017) Enhancing the alignment of the preclinical and clinical stroke recovery research pipeline: Consensus-based core recommendations from the Stroke Recovery and Rehabilitation Roundtable Translational Working Group
Neurorehabilitation and Neural Repair 31:699–707.

https://doi.org/10.1177/1545968317724285
1. Cwiek A
2. Rajtmajer SM
3. Wyble B
4. Honavar V
5. Grossner E
6. Hillary FG
(2021) Feeding the machine: challenges to reproducible predictive modeling in resting-state connectomics
Network Neuroscience 1:1–20.

https://doi.org/10.1162/netn_a_00212
Website
(2013) Position paper: Why science does not work as it should and what to do about it
Accessed August 2, 2022.

http://scienceintransition.nl/app/uploads/2013/10/Science-in-Transition-Position-Paper-final.pdf
1. Dreber A
2. Pfeiffer T
3. Almenberg J
4. Isaksson S
5. Wilson B
6. Chen Y
7. Nosek BA
8. Johannesson M
9. Wachter KW
(2015) Using prediction markets to estimate the reproducibility of scientific research
PNAS 112:15343–15347.

https://doi.org/10.1073/pnas.1516179112
1. Earp BD
2. Trafimow D
(2015) Replication, falsification, and the crisis of confidence in social psychology
Frontiers in Psychology 6:621.

https://doi.org/10.3389/fpsyg.2015.00621
1. Fagerholm ED
2. Hellyer PJ
3. Scott G
4. Leech R
5. Sharp DJ
(2015) Disconnection of network hubs and cognitive impairment after traumatic brain injury
Brain: A Journal of Neurology 138:1696–1709.

https://doi.org/10.1093/brain/awv075
1. Fortin JP
2. Parker D
3. Tunç B
4. Watanabe T
5. Elliott MA
6. Ruparel K
7. Roalf DR
8. Satterthwaite TD
9. Gur RC
10. Gur RE
11. Schultz RT
12. Verma R
13. Shinohara RT
(2017) Harmonization of multi-site diffusion tensor imaging data
NeuroImage 161:149–170.

https://doi.org/10.1016/j.neuroimage.2017.08.047
(2016) Comment on “Estimating the reproducibility of psychological science.”
Science 351:1037.

https://doi.org/10.1126/science.aad7243
1. Gleeson M
2. Biddle S
(2000) Duplicate publishing and the least publishable unit
Journal of Sports Sciences 18:227–228.

https://doi.org/10.1080/026404100364956
1. Gordon M
2. Viganola D
3. Bishop M
4. Chen Y
5. Dreber A
6. Goldfedder B
7. Holzmeister F
8. Johannesson M
9. Liu Y
10. Twardy C
11. Wang J
12. Pfeiffer T
(2020) Are replication rates the same across academic fields? Community forecasts from the DARPA SCORE programme
Royal Society Open Science 7:200566.

https://doi.org/10.1098/rsos.200566
1. Graham RL
2. Spencer JH
(1990) Ramsey theory
Scientific American 263:112–117.

https://doi.org/10.1038/scientificamerican0790-112
(2018) Meta-analysis and the science of research synthesis
Nature 555:175–182.

https://doi.org/10.1038/nature25753
1. Hallquist MN
2. Hillary FG
(2019) Graph theory approaches to functional network organization in brain disorders: A critique for brave new small-world
Network Neuroscience 3:1–26.

https://doi.org/10.1162/netn_a_00054
1. Harris NG
2. Verley DR
3. Gutman BA
4. Thompson PM
5. Yeh HJ
6. Brown JA
(2016) Disconnection and hyper-connectivity underlie reorganization after TBI: A rodent functional connectomic analysis
Experimental Neurology 277:124–138.

https://doi.org/10.1016/j.expneurol.2015.12.020
1. Helbing D
(2012) Accelerating scientific discovery by formulating grand scientific challenges
The European Physical Journal Special Topics 214:41–48.

https://doi.org/10.1140/epjst/e2012-01687-x
(2013) Threats to validity in the design and conduct of preclinical efficacy studies: A systematic review of guidelines for in vivo animal experiments
PLOS Medicine 10:e1001489.

https://doi.org/10.1371/journal.pmed.1001489
1. Hillary FG
(2008) Neuroimaging of working memory dysfunction and the dilemma with brain reorganization hypotheses
Journal of the International Neuropsychological Society 14:526–534.

https://doi.org/10.1017/S1355617708080788
(2015) Hyperconnectivity is a fundamental response to neurological disruption
Neuropsychology 29:59–75.

https://doi.org/10.1037/neu0000110
1. Hillary FG
2. Grafman JH
(2017) Injured brains and adaptive networks: the benefits and costs of hyperconnectivity
Trends in Cognitive Sciences 21:385–401.

https://doi.org/10.1016/j.tics.2017.03.003
(2020) Historical comparison of gender inequality in scientific careers across countries and disciplines
PNAS 117:4609–4616.

https://doi.org/10.1073/pnas.1914221117
(2015) Meta-research: Evaluation and improvement of research methods and practices
PLOS Biology 13:e1002264.

https://doi.org/10.1371/journal.pbio.1002264
(2018) Thousands of scientists publish a paper every five days
Nature 561:167–169.

https://doi.org/10.1038/d41586-018-06185-8
1. Iraji A
2. Chen H
3. Wiseman N
4. Welch RD
5. O’Neil BJ
6. Haacke EM
7. Liu T
8. Kou Z
(2016) Compensation through functional hyperconnectivity: A longitudinal connectome assessment of mild traumatic brain injury
Neural Plasticity 2016:4072402.

https://doi.org/10.1155/2016/4072402
(2020) Success and luck in creative careers
EPJ Data Science 9:9.

https://doi.org/10.1140/epjds/s13688-020-00227-w
(2020) Quantifying patterns of research-interest evolution
Nature Human Behaviour 1:0078.

https://doi.org/10.1038/s41562-017-0078
1. Johnson B
2. Zhang K
3. Gay M
4. Horovitz S
5. Hallett M
6. Sebastianelli W
7. Slobounov S
(2012) Alteration of brain default network in subacute phase of injury in concussed individuals: Resting-state fMRI study
NeuroImage 59:511–518.

https://doi.org/10.1016/j.neuroimage.2011.07.081
1. Kiai A
(2019) To protect credibility in science, banish “publish or perish.”
Nature Human Behaviour 3:1017–1018.

https://doi.org/10.1038/s41562-019-0741-0
(2020) Publishing computational research - a review of infrastructures for reproducible and transparent scholarly communication
Research Integrity and Peer Review 5:10.

https://doi.org/10.1186/s41073-020-00095-y
1. Krenn M
2. Zeilinger A
(2020) Predicting research trends with semantic and neural networks with an application in quantum physics
PNAS 117:1910–1916.

https://doi.org/10.1073/pnas.1914370116
Book
1. Kuhn TS
(1970) Logic of discovery or psychology of research
In: Lakatos I, Musgrave A, editors. Criticism and the Growth of Knowledge. Cambridge: Cambridge University Press. pp. 1–23.

https://doi.org/10.1017/CBO9781139171434
1. Lakatos I
(1970) History of science and its rational reconstructions
PSA 1970:91–136.

https://doi.org/10.1086/psaprocbienmeetp.1970.495757
1. Lakens D
2. DeBruine LM
(2021) Improving transparency, falsifiability, and rigor by making hypothesis tests machine-readable
Advances in Methods and Practices in Psychological Science 4:251524592097094.

https://doi.org/10.1177/2515245920970949
1. Larivière V
2. Ni C
3. Gingras Y
4. Cronin B
5. Sugimoto CR
(2013) Bibliometrics: Global gender disparities in science
Nature 504:211–213.

https://doi.org/10.1038/504211a
1. Lazic SE
(2018) Four simple ways to increase power without increasing the sample size
Laboratory Animals 52:621–629.

https://doi.org/10.1177/0023677218767478
1. Li W
2. Aste T
3. Caccioli F
4. Livan G
(2018) Early coauthorship with top scientists predicts success in academic careers
Nature Communications 10:5170.

https://doi.org/10.1038/s41467-019-13130-4
(2018) Scientific productivity: An exploratory study of metrics and incentives
PLOS ONE 13:e0195321.

https://doi.org/10.1371/journal.pone.0195321
1. Liu L
2. Wang Y
3. Sinatra R
4. Giles CL
5. Song C
6. Wang D
(2018) Hot streaks in artistic, cultural, and scientific careers
Nature 559:396–399.

https://doi.org/10.1038/s41586-018-0315-8
(2014) Biomedical research: increasing value, reducing waste
Lancet 383:101–104.

https://doi.org/10.1016/S0140-6736(13)62329-6
1. Maikusa N
2. Zhu Y
3. Uematsu A
4. Yamashita A
5. Saotome K
6. Okada N
7. Kasai K
8. Okanoya K
9. Yamashita O
10. Tanaka SC
11. Koike S
(2021) Comparison of traveling-subject and combat harmonization methods for assessing structural brain characteristics
Human Brain Mapping 1:5278–5287.

https://doi.org/10.1002/hbm.25615
1. Mayer AR
2. Mannell MV
3. Ling J
4. Gasparovic C
5. Yeo RA
(2011) Functional connectivity in mild traumatic brain injury
Human Brain Mapping 32:1825–1835.

https://doi.org/10.1002/hbm.21151
1. Miller GA
2. Chapman JP
(2001) Misunderstanding analysis of covariance
Journal of Abnormal Psychology 110:40–48.

https://doi.org/10.1037//0021-843x.110.1.40
(2017) A manifesto for reproducible science
Nature Human Behaviour 1:0021.

https://doi.org/10.1038/s41562-016-0021
(2009) Resting network plasticity following brain injury
PLOS ONE 4:e8220.

https://doi.org/10.1371/journal.pone.0008220
Website
1. National Science Foundation
(2010) Computational and Data-enabled Science and Engineering
Accessed August 2, 2022.

http://www.nsf.gov/mps/cds-e/
1. Nosek BA
2. Alter G
3. Banks GC
4. Borsboom D
5. Bowman SD
6. Breckler SJ
7. Buck S
8. Chambers CD
9. Chin G
10. Christensen G
11. Contestabile M
12. Dafoe A
13. Eich E
14. Freese J
15. Glennerster R
16. Goroff D
17. Green DP
18. Hesse B
19. Humphreys M
20. Ishiyama J
21. Karlan D
22. Kraut A
23. Lupia A
24. Mabry P
25. Madon T
26. Malhotra N
27. Mayo-Wilson E
28. McNutt M
29. Miguel E
30. Paluck EL
31. Simonsohn U
32. Soderberg C
33. Spellman BA
34. Turitto J
35. VandenBos G
36. Vazire S
37. Wagenmakers EJ
38. Wilson R
39. Yarkoni T
(2015) Promoting an open research culture
Science 348:1422–1425.

https://doi.org/10.1126/science.aab2374
1. Nosek BA
2. Errington TM
(2020a) What is replication?
PLOS Biology 18:e3000691.

https://doi.org/10.1371/journal.pbio.3000691
1. Nosek BA
2. Errington TM
(2020b) The best time to argue about what a replication means? Before you do it
Nature 583:518–520.

https://doi.org/10.1038/d41586-020-02142-6
1. Olsen A
2. Babikian T
3. Bigler ED
4. Caeyenberghs K
5. Conde V
6. Dams-O’Connor K
7. Dobryakova E
8. Genova H
9. Grafman J
10. Håberg AK
11. Heggland I
12. Hellstrøm T
13. Hodges CB
14. Irimia A
15. Jha RM
16. Johnson PK
17. Koliatsos VE
18. Levin H
19. Li LM
20. Lindsey HM
21. Livny A
22. Løvstad M
23. Medaglia J
24. Menon DK
25. Mondello S
26. Monti MM
27. Newcombe VFJ
28. Petroni A
29. Ponsford J
30. Sharp D
31. Spitz G
32. Westlye LT
33. Thompson PM
34. Dennis EL
35. Tate DF
36. Wilde EA
37. Hillary FG
(2021) Toward a global and reproducible science for brain imaging in neurotrauma: the ENIGMA adult moderate/severe traumatic brain injury working group
Brain Imaging and Behavior 15:526–554.

https://doi.org/10.1007/s11682-020-00313-7
1. Open Science Collaboration
(2015) Estimating the reproducibility of psychological science
Science 349:aac4716.

https://doi.org/10.1126/science.aac4716
1. Pawel S
2. Held L
(2020) Probabilistic forecasting of replication studies
PLOS ONE 15:e0231416.

https://doi.org/10.1371/journal.pone.0231416
(2014) Harnessing the power of big data: infusing the scientific method with machine learning to transform ecology
Ecosphere 5:art67.

https://doi.org/10.1890/ES13-00359.1
Preprint
1. Peterson D
2. Panofsky DPA
(2014) Metascience as a Scientific Social Movement
SocArXiv.

https://osf.io/preprints/socarxiv/4dsqa/
1. Pineau J
(2021)
Improving reproducibility in machine learning research: a report from the neurips 2019 reproducibility program

Journal of Machine Learning Research 22:1–20.
1. Pluchino A
2. Burgio G
3. Rapisarda A
4. Biondo AE
5. Pulvirenti A
6. Ferro A
7. Giorgino T
(2019) Exploring the role of interdisciplinarity in physics: success, talent and luck
PLOS ONE 14:e0218793.

https://doi.org/10.1371/journal.pone.0218793
Book
1. Popper KR
(1959)
The Logic of Scientific Discovery

Julius Springer, Hutchinson & Co.
Book
1. Popper K
(1963)
Conjectures and Refutations: The Growth of Scientific Knowledge

Routledge.
1. Pound P
2. Ritskes-Hoitinga M
(2018) Is it possible to overcome issues of external validity in preclinical animal research? Why most animal models are bound to fail
Journal of Translational Medicine 16:304.

https://doi.org/10.1186/s12967-018-1678-1
Conference
(2016) Predicting the Rise and Fall of Scientific Topics from Trends in their Rhetorical Framing
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

https://doi.org/10.18653/v1/P16-1111
1. Priem J
(2013) Beyond the paper
Nature 495:437–440.

https://doi.org/10.1038/495437a
Preprint
(2022) Establishing Ground Truth in the Clinical Neurosciences: If Replication Is the Answer, Then What Are the Questions?
PsyArXiv.

https://psyarxiv.com/rb32d/
1. Rodgers JL
2. Shrout PE
(2018) Psychology’s replication crisis as scientific opportunity: A précis for policymakers
Policy Insights from the Behavioral and Brain Sciences 5:134–141.

https://doi.org/10.1177/2372732217749254
(2015) Choosing experiments to accelerate collective discovery
PNAS 112:14569–14574.

https://doi.org/10.1073/pnas.1509757112
Conference
(2018) AUGUR: Forecasting the Emergence of New Research Topics
Proceedings of the ACM/IEEE Joint Conference on Digital Libraries.

https://doi.org/10.1145/3197026.3197052
1. Sandström U
2. van den Besselaar P
(2016) Quantity and/or quality? The importance of publishing many papers
PLOS ONE 11:e0166149.

https://doi.org/10.1371/journal.pone.0166149
1. Satterthwaite TD
2. Wolf DH
3. Loughead J
4. Ruparel K
5. Elliott MA
6. Hakonarson H
7. Gur RC
8. Gur RE
(2012) Impact of in-scanner head motion on multiple measures of functional connectivity: relevance for studies of neurodevelopment in youth
NeuroImage 60:623–632.

https://doi.org/10.1016/j.neuroimage.2011.12.063
1. Schrag A
2. Zhelev SS
3. Hotham S
4. Merritt RD
5. Khan K
6. Graham L
(2019) Heterogeneity in progression of prodromal features in Parkinson’s disease
Parkinsonism & Related Disorders 64:275–279.

https://doi.org/10.1016/j.parkreldis.2019.05.013
1. Schwamm LH
(2014) Progesterone for traumatic brain injury--resisting the sirens’ song
The New England Journal of Medicine 371:2522–2523.

https://doi.org/10.1056/NEJMe1412951
1. Seyhan AA
(2019) Lost in translation: the valley of death across preclinical and clinical divide – identification of problems and overcoming obstacles
Translational Medicine Communications 4:18.

https://doi.org/10.1186/s41231-019-0050-7
1. Sharp DJ
2. Beckmann CF
3. Greenwood R
4. Kinnunen KM
5. Bonnelle V
6. De Boissezon X
7. Powell JH
8. Counsell SJ
9. Patel MC
10. Leech R
(2011) Default mode network functional and structural connectivity after traumatic brain injury
Brain: A Journal of Neurology 134:2233–2247.

https://doi.org/10.1093/brain/awr175
1. Sharp DJ
2. Scott G
3. Leech R
(2014) Network dysfunction after traumatic brain injury
Nature Reviews Neurology 10:156–166.

https://doi.org/10.1038/nrneurol.2014.15
1. Storandt M
2. Hudson W
(1975) Misuse of analysis of covariance in aging research and some partial solutions
Experimental Aging Research 1:121–125.

https://doi.org/10.1080/03610737508257953
1. Szucs D
2. Ioannidis JPA
(2017) When null hypothesis significance testing is unsuitable for research: A reassessment
Frontiers in Human Neuroscience 11:390.

https://doi.org/10.3389/fnhum.2017.00390
1. Tate DF
2. Dennis EL
3. Adams JT
4. Adamson MM
5. Belanger HG
6. Bigler ED
7. Bouchard HC
8. Clark AL
9. Delano-Wood LM
10. Disner SG
11. Eapen BC
12. Franz CE
13. Geuze E
14. Goodrich-Hunsaker NJ
15. Han K
16. Hayes JP
17. Hinds SR
18. Hodges CB
19. Hovenden ES
20. Irimia A
21. Kenney K
22. Koerte IK
23. Kremen WS
24. Levin HS
25. Lindsey HM
26. Morey RA
27. Newsome MR
28. Ollinger J
29. Pugh MJ
30. Scheibel RS
31. Shenton ME
32. Sullivan DR
33. Taylor BA
34. Troyanskaya M
35. Velez C
36. Wade BS
37. Wang X
38. Ware AL
39. Zafonte R
40. Thompson PM
41. Wilde EA
(2021) Coordinating global multi-site studies of military-relevant traumatic brain injury: opportunities, challenges, and harmonization guidelines
Brain Imaging and Behavior 15:585–613.

https://doi.org/10.1007/s11682-020-00423-2
1. Tedersoo L
2. Küngas R
3. Oras E
4. Köster K
5. Eenmaa H
6. Leijen Ä
7. Pedaste M
8. Raju M
9. Astapova A
10. Lukner H
11. Kogermann K
12. Sepp T
(2021) Data sharing practices and data availability upon request differ across scientific disciplines
Scientific Data 8:192.

https://doi.org/10.1038/s41597-021-00981-0
1. Thompson PM
2. Jahanshad N
3. Ching CRK
4. Salminen LE
5. Thomopoulos SI
6. Bright J
7. Baune BT
8. Bertolín S
9. Bralten J
10. Bruin WB
11. Bülow R
12. Chen J
13. Chye Y
14. Dannlowski U
15. de Kovel CGF
16. Donohoe G
17. Eyler LT
18. Faraone SV
19. Favre P
20. Filippi CA
21. Frodl T
22. Garijo D
23. Gil Y
24. Grabe HJ
25. Grasby KL
26. Hajek T
27. Han LKM
28. Hatton SN
29. Hilbert K
30. Ho TC
31. Holleran L
32. Homuth G
33. Hosten N
34. Houenou J
35. Ivanov I
36. Jia T
37. Kelly S
38. Klein M
39. Kwon JS
40. Laansma MA
41. Leerssen J
42. Lueken U
43. Nunes A
44. Neill JO
45. Opel N
46. Piras F
47. Piras F
48. Postema MC
49. Pozzi E
50. Shatokhina N
51. Soriano-Mas C
52. Spalletta G
53. Sun D
54. Teumer A
55. Tilot AK
56. Tozzi L
57. van der Merwe C
58. Van Someren EJW
59. van Wingen GA
60. Völzke H
61. Walton E
62. Wang L
63. Winkler AM
64. Wittfeld K
65. Wright MJ
66. Yun JY
67. Zhang G
68. Zhang-James Y
69. Adhikari BM
70. Agartz I
71. Aghajani M
72. Aleman A
73. Althoff RR
74. Altmann A
75. Andreassen OA
76. Baron DA
77. Bartnik-Olson BL
78. Marie Bas-Hoogendam J
79. Baskin-Sommers AR
80. Bearden CE
81. Berner LA
82. Boedhoe PSW
83. Brouwer RM
84. Buitelaar JK
85. Caeyenberghs K
86. Cecil CAM
87. Cohen RA
88. Cole JH
89. Conrod PJ
90. De Brito SA
91. de Zwarte SMC
92. Dennis EL
93. Desrivieres S
94. Dima D
95. Ehrlich S
96. Esopenko C
97. Fairchild G
98. Fisher SE
99. Fouche JP
100. Francks C
101. Frangou S
102. Franke B
103. Garavan HP
104. Glahn DC
105. Groenewold NA
106. Gurholt TP
107. Gutman BA
108. Hahn T
109. Harding IH
110. Hernaus D
111. Hibar DP
112. Hillary FG
113. Hoogman M
114. Hulshoff Pol HE
115. Jalbrzikowski M
116. Karkashadze GA
117. Klapwijk ET
118. Knickmeyer RC
119. Kochunov P
120. Koerte IK
121. Kong XZ
122. Liew SL
123. Lin AP
124. Logue MW
125. Luders E
126. Macciardi F
127. Mackey S
128. Mayer AR
129. McDonald CR
130. McMahon AB
131. Medland SE
132. Modinos G
133. Morey RA
134. Mueller SC
135. Mukherjee P
136. Namazova-Baranova L
137. Nir TM
138. Olsen A
139. Paschou P
140. Pine DS
141. Pizzagalli F
142. Rentería ME
143. Rohrer JD
144. Sämann PG
145. Schmaal L
146. Schumann G
147. Shiroishi MS
148. Sisodiya SM
149. Smit DJA
150. Sønderby IE
151. Stein DJ
152. Stein JL
153. Tahmasian M
154. Tate DF
155. Turner JA
156. van den Heuvel OA
157. van der Wee NJA
158. van der Werf YD
159. van Erp TGM
160. van Haren NEM
161. van Rooij D
162. van Velzen LS
163. Veer IM
164. Veltman DJ
165. Villalon-Reina JE
166. Walter H
167. Whelan CD
168. Wilde EA
169. Zarei M
170. Zelman V
171. ENIGMA Consortium
(2020) ENIGMA and global neuroscience: A decade of large-scale studies of the brain in health and disease across more than 40 countries
Translational Psychiatry 10:100.

https://doi.org/10.1038/s41398-020-0705-1
(2022) The Enhancing NeuroImaging Genetics through Meta-Analysis Consortium: 10 years of global collaborations in human brain mapping
Human Brain Mapping 43:15–22.

https://doi.org/10.1002/hbm.25672
(2020) AAN position statement: ethical issues in clinical research in neurology
Neurology 94:661–669.

https://doi.org/10.1212/WNL.0000000000009241
(2012) The influence of head motion on intrinsic functional connectivity MRI
NeuroImage 59:431–438.

https://doi.org/10.1016/j.neuroimage.2011.07.044
Book
1. Vesalius A
(1555)
De Humani Corporis Fabrica (Of the Structure of the Human Body)

Basel: Johann Oporinus.
1. Watts DJ
(2017) Should social science be more solution-oriented?
Nature Human Behaviour 1:0015.

https://doi.org/10.1038/s41562-016-0015
1. Wilkinson MD
2. Dumontier M
3. Aalbersberg IJJ
4. Appleton G
5. Axton M
6. Baak A
7. Blomberg N
8. Boiten JW
9. da Silva Santos LB
10. Bourne PE
11. Bouwman J
12. Brookes AJ
13. Clark T
14. Crosas M
15. Dillo I
16. Dumon O
17. Edmunds S
18. Evelo CT
19. Finkers R
20. Gonzalez-Beltran A
21. Gray AJG
22. Groth P
23. Goble C
24. Grethe JS
25. Heringa J
26. ’t Hoen PAC
27. Hooft R
28. Kuhn T
29. Kok R
30. Kok J
31. Lusher SJ
32. Martone ME
33. Mons A
34. Packer AL
35. Persson B
36. Rocca-Serra P
37. Roos M
38. van Schaik R
39. Sansone SA
40. Schultes E
41. Sengstag T
42. Slater T
43. Strawn G
44. Swertz MA
45. Thompson M
46. van der Lei J
47. van Mulligen E
48. Velterop J
49. Waagmeester A
50. Wittenburg P
51. Wolstencroft K
52. Zhao J
53. Mons B
(2016) The FAIR guiding principles for scientific data management and stewardship
Scientific Data 3:160018.

https://doi.org/10.1038/sdata.2016.18
1. Yang Y
2. Youyou W
3. Uzzi B
(2020) Estimating the deep replicability of scientific findings using human and artificial intelligence
PNAS 117:10762–10768.

https://doi.org/10.1073/pnas.1909046117
(2019) Derivation and initial validation of clinical phenotypes of children presenting with concussion acutely in the emergency department: Latent class analysis of a multi-center, prospective cohort, observational study
Journal of Neurotrauma 36:1758–1767.

https://doi.org/10.1089/neu.2018.6009
1. Zalc B
(2018) One hundred and fifty years ago Charcot reported multiple sclerosis as a new neurological disease
Brain: A Journal of Neurology 141:3482–3488.

https://doi.org/10.1093/brain/awy287
(2010) Biased parameter estimates and inflated type I error rates in analysis of covariance (and analysis of partial variance) arising from unreliability: Alternatives and remedial strategies
Journal of Abnormal Psychology 119:307–319.

https://doi.org/10.1037/a0017552
(2020) Non-invasive brain stimulation for gambling disorder: A systematic review
Frontiers in Neuroscience 14:729.

https://doi.org/10.3389/fnins.2020.00729

Decision letter

Peter Rodgers

Senior and Reviewing Editor; eLife, United Kingdom

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Thank you for submitting your article "How Failure to Falsify in High Volume Science contributes to the Replication Crisis" to eLife for consideration as a Feature Article. Your article has been reviewed by four peer reviewers, and the evaluation has been overseen by a member of the eLife Features Team (Peter Rodgers). The following individuals involved in review of your submission have agreed to reveal their identity: Robert Thibault; Jean-Baptiste Poline; Nishant Sinha; and Yujiang Wang.

The reviewers and editors have discussed the reviews and we have drafted this decision letter to help you prepare a revised submission.

Summary:

This manuscript is of broad interest to the scientific community and particularly engaging for the readers in neuroscience. As the number of scientific articles exploring new hypotheses grows exponentially every year, there have not been many attempts to select the most competent hypotheses while falsifying the others. The authors juxtapose two prevalent hypotheses in network neuroscience literature on brain injury as an example. The manuscript offers suggestions on selecting the most compelling hypothesis and directions for designing studies to falsify them. However, there are a number of points that need to be addressed to make the article suitable for publication.

Essential revisions:

1. Lines 55-62: Please give examples of strong and weak hypotheses. Please also explain why "hypotheses that are supported by a range of findings should be considered weak". You might also want to consider linking strong/weak hypotheses to the idea of "research questions", which can be broad and vague, as compared to "hypotheses" which should be more precise.

2. Lines 120-122: Are there any systematic reviews on this topic that use one (or several) strong hypotheses and then check if the various studies support the stronger hypothesis? (e.g., does area X change within timeframe Y, in Z types of patients)? Also, I don't think the issue here is that they lack power to refute hypotheses, but instead that this lack of power is coupled with many statistical tests, leading to false positives that are used to support weak hypotheses (e.g. that brain activity "changes").

3. Lines 230-232: This paragraph could benefit from having a concrete example of a strong hypothesis and a matching weak hypothesis. This will likely help the reader grasp the issue more concretely.

4. Line 261: I am confused by the statement "sample heterogeneity requires large diverse samples". What part of heterogeneity are you trying to address? If testing a treatment that will be deployed to a diverse population of TBIs, I get that you would want a diverse sample, but a diverse sample will increase heterogeneity.

5. Lines 280-282: Or…it might be almost all noise? If we take Button's 2013 finding of 8% power in neuroimaging studies, then differences in findings are more likely due to noise than to demographics or other characteristics.

6. One issue to consider is that the hypotheses considered in the TBI example seem to be too vague or too generally formulated.

7. The interaction between publication incentivization and weak hypotheses is alluded to but leaves the reader unclear on the topic; please say more about this.

8. As noted in the last part of the paper, one of the common tools to limit false discovery is pre-registration. It is unclear why this is not emphasized in the core of the text, rather than only late in the paper, as a tool to *review* hypotheses.

9. The social mechanisms to lead us to "team science" are unclear – if this article is meant to help the research community to move in this specific direction, a practical path should be proposed, as "modification of mindset" is certainly more a goal than a practical approach.

10. Specifically, framing all of scientific research in neuroscience in terms of hypotheses that can be confirmed or refuted is limiting; and as the authors acknowledged, the answer might be not a clear yes/no. Please consider discussing how it is sometimes more productive to revise aspects of a particular hypothesis. Nevertheless, I agree with the underlying sentiment that a key to replication is being able to receive recognition for the effort and rigour rather than the outcome.

11. As a computational researcher I would also like to see a stronger emphasis on both:

i) how data, code, and informatics are creating some of the replication crisis;

ii) how stronger informatics frameworks may be part of the solution.

Just an anecdote from personal experience and years of headache: Our lab has been trying to apply a specific networks neuroscience approach called "structural covariance networks". We initially applied it to test the local hyperconnection hypothesis in ASD. It took us years to realise and acknowledge that the computational method itself is very problematic in terms of being replicable even in data from the same scanner, site etc., and despite controlling for every biological variable imaginable. It took us another year, and a very talented student, to understand that it is the method itself that enhances the noise in the data in a very "unuseful" way thus drowning any biological effect. We finally could publish this insight here: https://pubmed.ncbi.nlm.nih.gov/32621973/ I use this example to highlight that our "problem" could not really be framed as a hypothesis refuting exercise, as the real insight for us, and I hope also for the community, was not whether the original hypothesis was right or wrong, but that our tool was flawed.

12. The caption for Figure 2 is confusing, and the content of (Priestley et al., under review) is not clear; please delete this figure and add one or two sentences to the text to say that the number of papers in [subject] has increased from about XX per year in 1970 to YYY per year in 2020.

https://doi.org/10.7554/eLife.78830.sa1

Author response

Essential revisions:

1. Lines 55-62: Please give examples of strong and weak hypotheses. Please also explain why "hypotheses that are supported by a range of findings should be considered weak". You might also want to consider linking strong/weak hypotheses to the idea of "research questions", which can be broad and vague, as compared to "hypotheses" which should be more precise.

This is an important point. Based on this feedback, we have addressed the issue in the text (~line 57) and now include a table with examples.

“In the work of falsification, the more specific and more refutable a hypothesis is, the stronger it is, and hypotheses that can be supported by different sets of findings should be considered weak (Popper, 1963; see Table 1 for examples of hypotheses).”

2. Lines 120-122: Are there any systematic reviews on this topic that use one (or several) strong hypotheses and then check if the various studies support the stronger hypothesis? (e.g., does area X change within timeframe Y, in Z types of patients)?

The lack of direct examination of this problem in the hyperconnectivity literature was a primary impetus for the current review. There has been no systematic effort to hold these positions (hyperconnectivity v. disconnection) side-by-side to test them.

Also, I don't think the issue here is that they lack power to refute hypotheses, but instead that this lack of power is coupled with many statistical tests, leading to false positives that are used to support weak hypotheses (e.g. that brain activity "changes").

This is an outstanding point and we agree that the sheer volume of statistical tests in a number of studies increases the probability that chance findings are published as significant and important. On line 121, we have modified this statement to read:

“Overall, the TBI connectomics literature presents a clear example of a failure to falsify and we argue that it is attributable at least in part to science-by-volume, where small samples are used to examine non-specific hypotheses. This scenario is further worsened by the use of a number of statistical tests, which increases the probability that spurious findings are cast as meaningful [40,95].”
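
The multiple-testing point in this exchange can be made concrete with a minimal sketch. It assumes that all tests are independent, that each is run at the same significance level, and that every null hypothesis is true; these are simplifying assumptions for illustration, not a description of any particular study. Under those assumptions the probability of at least one false positive across m tests is 1 − (1 − α)^m. The function name is hypothetical.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent
    tests, each run at level alpha, when every null hypothesis is true."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    # P(no false positive in any test) = (1 - alpha) ** n_tests
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Illustrative test counts only; real studies rarely have independent tests.
    for n_tests in (1, 5, 20, 100):
        rate = familywise_error_rate(0.05, n_tests)
        print(f"{n_tests:>3} tests at alpha = 0.05: {rate:.3f}")
```

Under these assumptions, 20 uncorrected tests at α = 0.05 carry roughly a 64% chance of producing at least one spurious "significant" result, which is why a non-specific hypothesis paired with many tests is so easy to "support".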

3. Lines 230-232: This paragraph could benefit from having a concrete example of a strong hypothesis and a matching weak hypothesis. This will likely help the reader grasp the issue more concretely.

This is an important point and one we discussed as a group prior to the initial submission. We agree completely that concrete examples help to understand the problem and have added several to the text and to Table 1 (line 63).

4. Line 261: I am confused by the statement "sample heterogeneity requires large diverse samples". What part of heterogeneity are you trying to address? If testing a treatment that will be deployed to a diverse population of TBIs, I get that you would want a diverse sample, but a diverse sample will increase heterogeneity.

This is a good point and we have worked to clarify this statement. The point is that small samples do not allow investigation of the effects of sex, education, age, and other factors. Larger, more diverse samples permit direct modeling of these effects. This statement now reads (line 276):

“Addressing sample heterogeneity requires large diverse samples for direct modeling of influencing factors and one avenue to make this possible is data sharing.”

5. Lines 280-282: Or…it might be almost all noise? If we take Button's 2013 finding of 8% power in neuroimaging studies, then differences in findings are more likely due to noise than to demographics or other characteristics.

This is an interesting point, but there does appear to be a there, there. A number of higher-powered studies do track changes in connectivity that appear to be directly related to pathophysiology and, importantly, correlate with behavior. However, one cannot deny that at least a subset of these studies present results that capitalize upon spurious signal or noise.
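
As a rough illustration of the reviewer's concern about low power, the sketch below computes the positive predictive value (PPV) of a significant finding using the standard formula PPV = (power × R) / (power × R + α), where R is the pre-study odds that a tested effect is real. The 8% power figure is the one quoted by the reviewer; the pre-study odds of 0.25 and the function name are assumptions chosen only for illustration.

```python
def positive_predictive_value(power: float, alpha: float, prior_odds: float) -> float:
    """Probability that a statistically significant finding reflects a true effect.

    power      : probability of detecting a true effect (1 - beta)
    alpha      : significance threshold (type I error rate)
    prior_odds : pre-study odds R that a tested effect is real (an assumption)
    """
    if not 0.0 <= power <= 1.0:
        raise ValueError("power must lie in [0, 1]")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if prior_odds < 0.0:
        raise ValueError("prior_odds must be non-negative")
    true_positive_weight = power * prior_odds
    return true_positive_weight / (true_positive_weight + alpha)


if __name__ == "__main__":
    # Hypothetical pre-study odds of 1:4 that a tested effect is real.
    assumed_prior_odds = 0.25
    for power in (0.08, 0.50, 0.80):
        ppv = positive_predictive_value(power, 0.05, assumed_prior_odds)
        print(f"power = {power:.2f}: PPV = {ppv:.2f}")
```

With these illustrative inputs, a significant result from a study with 8% power is more likely false than true, whereas the same result from a study with 80% power is true most of the time. This is consistent with both positions in the exchange: low-powered studies produce a good deal of noise, while higher-powered studies can still carry real signal.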

6. One issue to consider is that the hypotheses considered in the TBI example seem to be too vague or too generally formulated.

The point made here by the Referee is not entirely clear. If this is a statement about the need for stronger hypotheses, we agree that greater specificity is needed and we hope to add context for this point by adding example hypotheses in Table 1. Alternatively, if the Reviewer aims to indicate that this is a TBI-specific phenomenon, that is also possible, though it is unclear why this would occur only in TBI within the clinical neurosciences.

7. The interaction between publication incentivization and weak hypotheses is alluded to but leaves the reader unclear on the topic; please say more about this.

This relationship is made more explicit with the passage on line 115:

“As opposed to pre-registering and testing strong hypotheses, investigators are compelled to identify significant results (any result) for publication. In brain injury work examining network plasticity, investigators have often made general claims that brain injury results in “different” or “altered” connectivity (a problem dating back to early fMRI studies in TBI; [Hillary, 2008]). While it is unlikely the intention, under-specified hypotheses increase the likelihood that chance findings are published. The primary consequence is that all findings are “winners”, permitting growing support for either position without movement toward resolution.”

8. As noted in the last part of the paper, one of the common tools to limit false discovery is pre-registration. It is unclear why this is not emphasized in the core of the text, rather than only late in the paper, as a tool to *review* hypotheses.

This is an excellent point, and we now make clear the importance of study preregistration at the outset of the paper (line 55) so that when we return to it, there is context.

9. The social mechanisms to lead us to "team science" are unclear – if this article is meant to help the research community to move in this specific direction, a practical path should be proposed, as "modification of mindset" is certainly more a goal than a practical approach.

We appreciate this point and agree that this statement is confusing. We have removed this statement.

10. Specifically, framing all of scientific research in neuroscience in terms of hypotheses that can be confirmed or refuted is limiting; and as the authors acknowledged, the answer might be not a clear yes/no. Please consider discussing how it is sometimes more productive to revise aspects of a particular hypothesis. Nevertheless, I agree with the underlying sentiment that a key to replication is being able to receive recognition for the effort and rigour rather than the outcome.

This is an important point. Part of the scientific process clearly requires revisions of our theory. As this reviewer alludes to, however, when we revise our hypotheses to fit our outcomes, we risk advancing hypotheses supported by spurious data. We see preregistration as part of the solution and based upon comments elsewhere in this critique, we have refocused on how preregistration can help not only in the development of strong hypotheses, but also in their modification. We also now include Table 1 to provide modern context for what might be considered a “falsifiable” hypothesis. We also include a section titled “Practical Challenges to Falsification” to make clear that falsification of strong hypotheses is one tool of many to improve our science.

11. As a computational researcher I would also like to see a stronger emphasis on both:

i) how data, code, and informatics are creating some of the replication crisis;

ii) how stronger informatics frameworks may be part of the solution.

We agree and now add a section to outline the parameters of this natural tension (see “Big Data as Friend and Foe”, line 146).

Just an anecdote from personal experience and years of headache: Our lab has been trying to apply a specific networks neuroscience approach called "structural covariance networks". We initially applied it to test the local hyperconnection hypothesis in ASD. It took us years to realise and acknowledge that the computational method itself is very problematic in terms of being replicable even in data from the same scanner, site etc., and despite controlling for every biological variable imaginable. It took us another year, and a very talented student, to understand that it is the method itself that enhances the noise in the data in a very "unuseful" way thus drowning any biological effect. We finally could publish this insight here: https://pubmed.ncbi.nlm.nih.gov/32621973/ I use this example to highlight that our "problem" could not really be framed as a hypothesis refuting exercise, as the real insight for us, and I hope also for the community, was not whether the original hypothesis was right or wrong, but that our tool was flawed.

We appreciate the reviewer sharing this illustrative example. It does add a dimension (weak hypothesis v. weak method) that requires recognition in this manuscript. I might additionally argue, though, that stronger hypotheses (including alternative hypotheses) place the investigator in a better position to detect flawed methodology. That is, truly spurious results may stand out against the sanity checks offered by strong hypotheses, but the point still stands that faulty methods contribute to problems of scientific reliability (something we allude to briefly at the outset with reference to Alipourfard et al., 2021). We now add comment on this, with references to examples of how methods/stats can lead to systematically flawed results (line 238). We now write:

“Strong hypotheses must be matched with methods that can provide clear tests, a coupling that cannot be overstated. In the brain imaging literature alone, there are poignant examples where flawed methods (or misunderstanding of their applications) have resulted in the repeated substantiation of spurious results (in structural covariance analysis see Carmen et al., 2021 and in resting-state fMRI see Satterthwaite et al., 2016; Van Dijk et al., 2012).”

12. The caption for Figure 2 is confusing, and the content of (Priestley et al., under review) is not clear; please delete this figure and add one or two sentences to the text to say that the number of papers in [subject] has increased from about XX per year in 1970 to YYY per year in 2020.

We have accepted this recommendation and have deleted the figure and replaced it with statistics highlighting the annual increase in publication numbers.

https://doi.org/10.7554/eLife.78830.sa2

Article and author information

Author details

Sarah M Rajtmajer

Sarah M Rajtmajer is in the College of Information Sciences and Technology, The Pennsylvania State University, University Park, United States

Contribution
Writing – original draft, Writing – review and editing

Competing interests
No competing interests declared

ORCID iD: 0000-0002-1464-0848

Timothy M Errington

Timothy M Errington is at the Center for Open Science, Charlottesville, United States

Contribution
Writing – review and editing

Competing interests
No competing interests declared

ORCID iD: 0000-0002-4959-5143

Frank G Hillary

Frank G Hillary is in the Department of Psychology and the Social Life and Engineering Sciences Imaging Center, The Pennsylvania State University, University Park, United States

Contribution
Conceptualization, Writing – original draft, Writing – review and editing

For correspondence
fhillary@psu.edu

Competing interests
No competing interests declared

ORCID iD: 0000-0002-1427-0218

Funding

No external funding was received for this work.

Publication history

Received: March 22, 2022
Accepted: July 28, 2022
Accepted Manuscript published: August 8, 2022
Version of Record published: August 23, 2022

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.