Replication of “null results” – Absence of evidence or evidence of absence?

Samuel Pawel; Rachel Heyard; Charlotte Micheloud; Leonhard Held

doi:10.7554/eLife.92311.2

eLife assessment

By assessing what it means to replicate a null finding, and by proposing two methods that can be used to evaluate whether null findings have been replicated (frequentist equivalence testing, and Bayes factors), this article represents an important contribution to work on reproducibility. Through a compelling re-analysis of results from the Reproducibility Project: Cancer Biology, the authors demonstrate that even when 'replication success' is reduced to a single criterion, different methods to assess replication of a null finding can lead to different conclusions.

https://doi.org/10.7554/eLife.92311.2.sa2

Significance of findings

important: Findings that have theoretical or practical implications beyond a single subfield

Strength of evidence

compelling: Evidence that features methods, data and analyses more rigorous than the current state-of-the-art

During the peer-review process the editor and reviewers write an eLife assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate).

Abstract

In several large-scale replication projects, statistically non-significant results in both the original and the replication study have been interpreted as a “replication success”. Here we discuss the logical problems with this approach: Non-significance in both studies does not ensure that the studies provide evidence for the absence of an effect and “replication success” can virtually always be achieved if the sample sizes are small enough. In addition, the relevant error rates are not controlled. We show how methods, such as equivalence testing and Bayes factors, can be used to adequately quantify the evidence for the absence of an effect and how they can be applied in the replication setting. Using data from the Reproducibility Project: Cancer Biology, the Experimental Philosophy Replicability Project, and the Reproducibility Project: Psychology we illustrate that many original and replication studies with “null results” are in fact inconclusive. We conclude that it is important to also replicate studies with statistically non-significant results, but that they should be designed, analyzed, and interpreted appropriately.

Introduction

Absence of evidence is not evidence of absence – the title of the 1995 paper by Douglas Altman and Martin Bland has since become a mantra in the statistical and medical literature (Altman and Bland, 1995). Yet, the misconception that a statistically non-significant result indicates evidence for the absence of an effect is unfortunately still widespread (Greenland, 2011; Makin and de Xivry, 2019). Such a “null result” – typically characterized by a p-value p > 0.05 for the null hypothesis of an absent effect – may also occur if an effect is actually present. For example, if the sample size of a study is chosen to detect an assumed effect with a power of 80%, null results will incorrectly occur 20% of the time when the assumed effect is actually present. If the power of the study is lower, null results will occur more often. In general, the lower the power of a study, the greater the ambiguity of a null result. To put a null result in context, it is therefore critical to know whether the study was adequately powered and under what assumed effect the power was calculated (Hoenig and Heisey, 2001; Greenland, 2012). However, if the goal of a study is to explicitly quantify the evidence for the absence of an effect, more appropriate methods designed for this task, such as equivalence testing (Wellek, 2010; Lakens, 2017; Senn, 2021) or Bayes factors (Kass and Raftery, 1995; Goodman, 1999, 2005; Dienes, 2014; Keysers et al., 2020), should be used from the outset.
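The link between power and the rate of null results can be checked numerically. The following Python sketch is an illustration only (it is not the authors' R code). It assumes a two-sided z-test for an SMD with known unit variance and equal group sizes, so that the standard error is approximately sqrt(2/n); the function name and the example effect size are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_two_sided(smd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for a standardized mean difference.

    Assumes known unit variance and equal group sizes, so that the
    standard error of the estimate is sqrt(2 / n_per_group).
    """
    se = sqrt(2 / n_per_group)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = smd / se
    # Probability that |estimate / se| exceeds the critical value
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


# Hypothetical example: an effect of SMD = 0.5 is truly present
for n in (10, 30, 63):
    pw = power_two_sided(0.5, n)
    print(f"n per group = {n:3d}: power = {pw:.2f}, "
          f"chance of a null result = {1 - pw:.2f}")
```

Under these assumptions, the chance of a null result is simply one minus the power, so it grows quickly as the group size shrinks.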

The interpretation of null results becomes even more complicated in the setting of replication studies. In a replication study, researchers attempt to repeat an original study as closely as possible in order to assess whether consistent results can be obtained with new data (National Academies of Sciences, Engineering, and Medicine, 2019). In the last decade, various large-scale replication projects have been conducted in diverse fields, from the biomedical to the social sciences (Prinz et al., 2011; Begley and Ellis, 2012; Klein et al., 2014; Open Science Collaboration, 2015; Camerer et al., 2016, 2018; Klein et al., 2018; Cova et al., 2018; Errington et al., 2021, among others). The majority of these projects reported alarmingly low replicability rates across a broad spectrum of criteria for quantifying replicability. While most of these projects restricted their focus to original studies with statistically significant results (“positive results”), the Reproducibility Project: Cancer Biology (RPCB, Errington et al., 2021), the Experimental Philosophy Replicability Project (EPRP, Cova et al., 2018), and the Reproducibility Project: Psychology (RPP, Open Science Collaboration, 2015) also attempted to replicate some original studies with null results – either non-significant or interpreted as showing no evidence for a meaningful effect by the original authors.

Although the EPRP and RPP interpreted non-significant results in both original and replication study as a “replication success” for some individual replications (see, for example, the replication of McCann (2005, replication report: https://osf.io/wcm7n) or the replication of Ranganath and Nosek (2008, replication report: https://osf.io/9xt25)), they excluded the original null results in the calculation of an overall replicability rate based on significance. In contrast, the RPCB explicitly defined null results in both the original and the replication study as a criterion for “replication success”. According to this “non-significance” criterion, 11/15 = 73% of replications of original null effects were successful. Four additional criteria were used to provide a more nuanced assessment of replication success for original null results: (i) whether the original effect estimate was included in the 95% confidence interval of the replication effect estimate (success rate 11/15 = 73%), (ii) whether the replication effect estimate was included in the 95% confidence interval of the original effect estimate (success rate 12/15 = 80%), (iii) whether the replication effect estimate was included in the 95% prediction interval based on the original effect estimate (success rate 12/15 = 80%), and (iv) whether the p-value obtained from combining the original and replication effect estimate with a meta-analysis was non-significant (success rate 10/15 = 67%). Criteria (i) to (iii) are useful for assessing compatibility in effect estimates between the original and the replication study. Their suitability has been extensively discussed in the literature. The prediction interval criterion (iii) or equivalent criteria (e.g., the Q-test) are usually recommended because they account for the uncertainty from both studies and have adequate error rates when the true effect sizes are the same (Patil et al., 2016; Mathur and VanderWeele, 2020; Schauer and Hedges, 2021).

While the effect estimate criteria (i) to (iii) can be applied regardless of whether or not the original study was non-significant, the “meta-analytic non-significance” criterion (iv) and the aforementioned non-significance criterion refer specifically to original null results. We believe that there are several logical problems with both, and that it is important to highlight and address them, especially since the non-significance criterion has already been used in three replication projects without much scrutiny. It is crucial to note that it is not our intention to diminish the enormously important contributions of the RPCB, the EPRP, and the RPP, but rather to build on their work and provide recommendations for ongoing and future replication projects (e.g., Amaral et al., 2019; Murphy et al., 2022).

The logical problems with the non-significance criterion are as follows: First, if the original study had low statistical power, a non-significant result is highly inconclusive and does not provide evidence for the absence of an effect. It is then unclear what exactly the goal of the replication should be – to replicate the inconclusiveness of the original result? On the other hand, if the original study was adequately powered, a non-significant result may indeed provide some evidence for the absence of an effect when analyzed with appropriate methods, so that the goal of the replication is clearer. However, the criterion by itself does not distinguish between these two cases. Second, with this criterion researchers can virtually always achieve replication success by conducting a replication study with a very small sample size, such that the p-value is non-significant and the result is inconclusive. This is because the null hypothesis under which the p-value is computed is misaligned with the goal of inference, which is to quantify the evidence for the absence of an effect. Third, the criterion does not control the error of falsely claiming the absence of an effect at a predetermined rate. This is in contrast to the standard criterion for replication success, which requires significance from both studies (also known as the two-trials rule, see Section 12.2.8 in Senn, 2021), and ensures that the error of falsely claiming the presence of an effect is controlled at a rate equal to the squared significance level (for example, 5% × 5% = 0.25% for a 5% significance level). The non-significance criterion may be intended to complement the two-trials rule for null results. However, it fails to do so because it offers no comparable error control, which regulators and funders may require. These logical problems are equally applicable to the meta-analytic non-significance criterion.
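The second and third problems can be made concrete with a small calculation. The Python sketch below is illustrative only: it assumes independent studies analyzed with a two-sided z-test, an SMD with known unit variance, and equal group sizes (standard error sqrt(2/n)). The true effect of 0.5 and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def prob_nonsignificant(true_smd, n_per_group, alpha=0.05):
    """Probability that a two-sided z-test is non-significant (p > alpha)."""
    se = sqrt(2 / n_per_group)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = true_smd / se
    return Z.cdf(z_crit - shift) - Z.cdf(-z_crit - shift)


def prob_nonsignificance_success(true_smd, n_orig, n_rep, alpha=0.05):
    """Probability that both independent studies are non-significant."""
    return (prob_nonsignificant(true_smd, n_orig, alpha)
            * prob_nonsignificant(true_smd, n_rep, alpha))


# Hypothetical true effect of SMD = 0.5, i.e. the effect is not absent
for n in (3, 10, 50, 200):
    p = prob_nonsignificance_success(0.5, n, n)
    print(f"n per group = {n:3d}: P(both non-significant) = {p:.2f}")

# Two-trials rule, as stated in the text: squared significance level
alpha = 0.05
print(f"two-trials error rate = {alpha * alpha:.4f}")
```

With very small groups the probability of meeting the non-significance criterion is high even though an effect is present, and nothing in the criterion bounds this probability in advance.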

In the following, we present two principled approaches for analyzing replication studies of null results – frequentist equivalence testing and Bayesian hypothesis testing – that can address the limitations of the non-significance criterion. We use the null results replicated in the RPCB, RPP, and EPRP to illustrate the problems of the non-significance criterion and how they can be addressed. We conclude the paper with practical recommendations for analyzing replication studies of original null results, including simple R code for applying the proposed methods.

Null results from the Reproducibility Project: Cancer Biology

Figure 1 shows effect estimates on standardized mean difference (SMD) scale with 95% confidence intervals from two RPCB study pairs. In both study pairs, the original and replication studies are “null results” and therefore meet the non-significance criterion for replication success (the two-sided p-values are greater than 0.05 in both the original and the replication study). The same is true when applying the meta-analytic non-significance criterion (the two-sided p-values of the meta-analyses p_MA are greater than 0.05). However, intuition would suggest that the conclusions in the two pairs are very different.

Figure 1. Two examples of original and replication study pairs which meet the non-significance replication success criterion from the Reproducibility Project: Cancer Biology (Errington et al., 2021). Shown are standardized mean difference effect estimates with 95% confidence intervals, sample sizes n, and two-sided p-values p for the null hypothesis that the effect is absent. Effect estimate, 95% confidence interval, and p-value from a fixed-effect meta-analysis p_MA of original and replication study are shown in gray.

The original study from Dawson et al. (2011) and its replication both show large effect estimates in magnitude, but due to the very small sample sizes, the uncertainty of these estimates is large, too. With such low sample sizes, the results seem inconclusive. In contrast, the effect estimates from Goetz et al. (2011) and its replication are much smaller in magnitude and their uncertainty is also smaller because the studies used larger sample sizes. Intuitively, the results seem to provide more evidence for a zero (or negligibly small) effect. While these two examples show the qualitative difference between absence of evidence and evidence of absence, we will now discuss how the two can be quantitatively distinguished.

Methods for assessing replicability of null results

There are both frequentist and Bayesian methods that can be used for assessing evidence for the absence of an effect. Anderson and Maxwell (2016) provide an excellent summary in the context of replication studies in psychology. We now briefly discuss two possible approaches – frequentist equivalence testing and Bayesian hypothesis testing – and their application to the RPCB, EPRP, and RPP data.

Frequentist equivalence testing

Equivalence testing was developed in the context of clinical trials to assess whether a new treatment – typically cheaper or with fewer side effects than the established treatment – is practically equivalent to the established treatment (Wellek, 2010; Lakens, 2017). The method can also be used to assess whether an effect is practically equivalent to an absent effect, usually zero. Using equivalence testing as a way to put non-significant results into context has been suggested by several authors (Hauck and Anderson, 1986; Campbell and Gustafson, 2018). The main challenge is to specify the margin Δ > 0 that defines an equivalence range [−Δ, +Δ] in which an effect is considered as absent for practical purposes. The goal is then to reject the null hypothesis that the true effect is outside the equivalence range. This is in contrast to the usual null hypotheses of superiority tests which state that the effect is zero or smaller than zero, see Figure 2 for an illustration.

Figure 2. Null hypothesis (H₀) and alternative hypothesis (H₁) for superiority and equivalence tests (with equivalence margin Δ > 0).

To ensure that the null hypothesis is falsely rejected at most α × 100% of the time, the standard approach is to declare equivalence if the (1-2α)×100% confidence interval for the effect is contained within the equivalence range, for example, a 90% confidence interval for α = 0.05 (Westlake, 1972). This procedure is equivalent to declaring equivalence when two one-sided tests (TOST) for the null hypotheses of the effect being greater/smaller than +Δ and −Δ, are both significant at level α (Schuirmann, 1987). A quantitative measure of evidence for the absence of an effect is then given by the maximum of the two one-sided p-values – the TOST p-value (Greenland, 2023, section 4.4). In case a dichotomous replication success criterion for null results is desired, it is natural to require that both the original and the replication TOST p-values are smaller than some level α (conventionally α = 0.05). Equivalently, the criterion would require the (1 – 2α) × 100% confidence intervals of the original and the replication to be included in the equivalence region. In contrast to the non-significance criterion, this criterion controls the error of falsely claiming replication success at level α² when there is a true effect outside the equivalence margin, thus complementing the usual two-trials rule in drug regulation (Senn, 2021, Section 12.2.8).
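The correspondence between the confidence interval rule and the TOST p-value can be verified directly. The following Python sketch assumes a normally distributed effect estimate with known standard error; the function names and the example estimates are hypothetical and only serve to show that both rules lead to the same decision.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def tost_pvalue(estimate, se, margin, null_value=0.0):
    """TOST p-value: maximum of the two one-sided p-values."""
    # H0: effect >= null_value + margin
    p_upper = Z.cdf((estimate - null_value - margin) / se)
    # H0: effect <= null_value - margin
    p_lower = 1 - Z.cdf((estimate - null_value + margin) / se)
    return max(p_upper, p_lower)


def ci_within_range(estimate, se, margin, alpha=0.05, null_value=0.0):
    """Check whether the (1 - 2*alpha) confidence interval lies in the range."""
    z = Z.inv_cdf(1 - alpha)
    lower, upper = estimate - z * se, estimate + z * se
    return null_value - margin <= lower and upper <= null_value + margin


# Hypothetical (estimate, standard error) pairs on SMD scale
cases = [(0.10, 0.20), (0.50, 0.30), (-0.30, 0.15), (0.00, 0.60)]
for est, se in cases:
    p = tost_pvalue(est, se, margin=0.74)
    inside = ci_within_range(est, se, margin=0.74)
    print(f"estimate = {est:5.2f}, se = {se:.2f}: "
          f"p_TOST = {p:.4f}, 90% CI inside range: {inside}")
```

In each case the 90% confidence interval lies inside the equivalence range exactly when the TOST p-value is at most 0.05.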

Returning to the RPCB data, Figure 3 shows the standardized mean difference effect estimates with 90% confidence intervals for all 15 effects which were treated as null results by the RPCB. Most of them showed non-significant p-values (p > 0.05) in the original study. It is noteworthy, however, that two effects from the second experiment of original paper 48 were regarded as null results despite their statistical significance. According to the non-significance criterion (requiring p > 0.05 in original and replication study), there are 11 “successes” out of a total of 15 null effects, as reported in Table 1 from Errington et al. (2021).

Figure 3. Effect estimates on standardized mean difference (SMD) scale with 90% confidence interval for the 15 “null results” and their replication studies from the Reproducibility Project: Cancer Biology (Errington et al., 2021). The title above each plot indicates the original paper, experiment and effect numbers. Two original effect estimates from original paper 48 were statistically significant at p < 0.05, but were interpreted as null results by the original authors and therefore treated as null results by the RPCB. The two examples from Figure 1 are indicated in the plot titles. The dashed gray line represents the value of no effect (SMD = 0), while the dotted red lines represent the equivalence range with a margin of Δ = 0.74, classified as “liberal” by Wellek (2010, Table 1.1). The p-value p_TOST is the maximum of the two one-sided p-values for the null hypotheses of the effect being greater/less than +Δ and −Δ, respectively. The Bayes factor BF₀₁ quantifies the evidence for the null hypothesis H₀ : SMD = 0 against the alternative H₁ : SMD ≠ 0 with normal unit-information prior assigned to the SMD under H₁.

We will now apply equivalence testing to the RPCB data. The dotted red lines in Figure 3 represent an equivalence range for the margin Δ = 0.74, which Wellek (2010, Table 1.1) classifies as “liberal”. However, even with this generous margin, only 4 of the 15 study pairs are able to establish replication success at the 5% level, in the sense that both the original and the replication 90% confidence interval fall within the equivalence range (or, equivalently, that their TOST p-values are smaller than 0.05). For the remaining 11 study pairs, the situation remains inconclusive and there is no evidence for the absence or the presence of the effect. For instance, the previously discussed example from Goetz et al. (2011) marginally fails the criterion (p_TOST = 0.06 in the original study and p_TOST = 0.04 in the replication), while the example from Dawson et al. (2011) is a clearer failure (p_TOST = 0.75 in the original study and p_TOST = 0.88 in the replication) as both effect estimates even lie outside the equivalence range.

The post-hoc specification of equivalence margins is controversial. Ideally, the margin should be specified on a case-by-case basis in a pre-registered protocol before the studies are conducted by researchers familiar with the subject matter. In the social and medical sciences, the conventions of Cohen (1992) are typically used to classify SMD effect sizes (SMD = 0.2 small, SMD = 0.5 medium, SMD = 0.8 large). While effect sizes are typically larger in preclinical research, it seems unrealistic to specify margins larger than 1 on SMD scale to represent effect sizes that are absent for practical purposes. It could also be argued that the chosen margin Δ = 0.74 is too lax compared to margins commonly used in clinical research (Lange and Freitag, 2005). We therefore report a sensitivity analysis regarding the choice of the margin in Figure 4 in Appendix A. This analysis shows that for realistic margins between 0 and 1, the proportion of replication successes remains below 50% for the conventional α = 0.05 level. To achieve a success rate of 11/15 = 73%, as was achieved with the non-significance criterion from the RPCB, unrealistic margins of Δ > 2 are required.

Appendix B shows similar equivalence test analyses for the four study pairs with original null results from the RPP and EPRP. Three study pair results turn out to be inconclusive due to the large uncertainty around their effect estimates.

Bayesian hypothesis testing

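The distinction between absence of evidence and evidence of absence is naturally built into the Bayesian approach to hypothesis testing. A central measure of evidence is the Bayes factor (Kass and Raftery, 1995; Goodman, 1999; Dienes, 2014; Keysers et al., 2020), which is the updating factor of the prior odds to the posterior odds of the null hypothesis H₀ versus the alternative hypothesis H₁:

$$\underbrace{\frac{\Pr(H_0 \mid \text{data})}{\Pr(H_1 \mid \text{data})}}_{\text{posterior odds}} = \underbrace{\frac{\Pr(H_0)}{\Pr(H_1)}}_{\text{prior odds}} \times \underbrace{\frac{p(\text{data} \mid H_0)}{p(\text{data} \mid H_1)}}_{\text{Bayes factor } \mathrm{BF}_{01}}$$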

The Bayes factor BF₀₁ quantifies how much the observed data have increased or decreased the probability Pr(H₀) of the null hypothesis relative to the probability Pr(H₁) of the alternative. As such, Bayes factors are direct measures of evidence for the null hypothesis, in contrast to p-values, which are only indirect measures of evidence as they are computed under the assumption that the null hypothesis is true (Held and Ott, 2018). If the null hypothesis states the absence of an effect, a Bayes factor greater than one (BF₀₁ > 1) indicates evidence for the absence of the effect and a Bayes factor smaller than one indicates evidence for the presence of the effect (BF₀₁ < 1), whereas a Bayes factor not much different from one indicates absence of evidence for either hypothesis (BF₀₁ ≈ 1). Bayes factors are quantitative summaries of the evidence provided by the data in favor of the null hypothesis as opposed to the alternative hypothesis. If a dichotomous criterion for successful replication of a null result is desired, it seems natural to require a Bayes factor larger than some level γ > 1 from both studies, for example, γ = 3 or γ = 10 which are conventional levels for “substantial” and “strong” evidence, respectively (Jeffreys, 1961). In contrast to the non-significance criterion, this criterion provides a genuine measure of evidence that can distinguish absence of evidence from evidence of absence.
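To see how a Bayes factor acts as an updating factor, the following Python sketch converts a prior probability of H₀ into a posterior probability. The function name and the prior probability of 0.5 are assumptions made for illustration; the Bayes factor values are arbitrary examples.

```python
def posterior_prob_h0(bf01, prior_prob_h0=0.5):
    """Update the probability of H0 with a Bayes factor BF01.

    Posterior odds = Bayes factor * prior odds, where odds refer to
    H0 versus H1 and the two hypotheses are assumed to be exhaustive.
    """
    prior_odds = prior_prob_h0 / (1 - prior_prob_h0)
    posterior_odds = bf01 * prior_odds
    return posterior_odds / (1 + posterior_odds)


# Arbitrary example values: evidence for H0, no evidence, evidence for H1
for bf01 in (10, 3, 1, 1 / 3):
    print(f"BF01 = {bf01:5.2f}: Pr(H0 | data) = {posterior_prob_h0(bf01):.2f}")
```

A Bayes factor of one leaves the prior probability unchanged, which is the sense in which BF₀₁ ≈ 1 signals absence of evidence.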

The main challenge with Bayes factors is the specification of the effect under the alternative hypothesis H₁. The assumed effect under H₁ is directly related to the Bayes factor, and researchers who assume different effects will end up with different Bayes factors. Instead of specifying a single effect, one therefore typically specifies a “prior distribution” of plausible effects. Importantly, the prior distribution, like the equivalence margin, should be determined by researchers with subject knowledge and before the data are collected.

To compute the Bayes factors for the RPCB null results, we used the observed effect estimates as the data and assumed a normal sampling distribution for them (Dienes, 2014), as typically done in a meta-analysis. The Bayes factors BF₀₁ shown in Figure 3 then quantify the evidence for the null hypothesis of no effect against the alternative hypothesis that there is an effect using a normal “unit-information” prior distribution (Kass and Wasserman, 1995) for the effect size under the alternative H₁, see Appendix C for further details on the calculation of these Bayes factors. We see that in most cases there is no substantial evidence for either the absence or the presence of an effect, as with the equivalence tests. For instance, with a lenient Bayes factor threshold of 3, only 1 of the 15 replications is successful, in the sense of having BF₀₁ > 3 in both the original and the replication study. The Bayes factors for the two previously discussed examples are consistent with our intuitions – in the Goetz et al. (2011) example there is indeed substantial evidence for the absence of an effect (BF₀₁ = 5 in the original study and BF₀₁ = 4.1 in the replication), while in the Dawson et al. (2011) example there is even anecdotal evidence for the presence of an effect, though the Bayes factors are very close to one due to the small sample sizes (BF₀₁ = 1/1.1 in the original study and BF₀₁ = 1/1.8 in the replication).

As with the equivalence margin, the choice of the prior distribution for the SMD under the alternative H₁ is debatable. The normal unit-information prior seems to be a reasonable default choice, as it implies that small to large effects are plausible under the alternative, but other normal priors with smaller/larger standard deviations could have been considered to make the test more sensitive to smaller/larger true effect sizes. The sensitivity analysis in Appendix A therefore also includes an analysis on the effect of varying prior standard deviations and the Bayes factor thresholds. However, again, to achieve replication success for a larger proportion of replications than the observed 1/15 = 7%, unreasonably large prior standard deviations have to be specified.

Of note, among the 15 RPCB null results, there are three interesting cases (the three effects from original paper 48 by Lin et al., 2012 and its replication by Lewis et al., 2018) where the Bayes factor is qualitatively different from the equivalence test, revealing a fundamental difference between the two approaches. The Bayes factor is concerned with testing whether the effect is exactly zero, whereas the equivalence test is concerned with whether the effect is within an interval around zero. Due to the very large sample size in the original study (n = 514) and the replication (n = 1,153), the data are incompatible with an exactly zero effect, but compatible with effects within the equivalence range. Apart from this example, however, both approaches lead to the same qualitative conclusion – most RPCB null results are highly ambiguous.

Appendix B also shows Bayes factor analyses for the four study pairs with original null results from the RPP and EPRP. In contrast to the RPCB results, most Bayes factors indicate non-anecdotal evidence for a null effect in cases where the non-significance criterion was met, possibly because of the larger sample sizes and smaller effects in these fields.

Conclusions

The concept of “replication success” is inherently multifaceted. Reducing it to a single criterion seems to be an oversimplification. Nevertheless, we believe that the “non-significance” criterion – declaring a replication as successful if both the original and the replication study produce non-significant results – is not fit for purpose. This criterion does not ensure that both studies provide evidence for the absence of an effect, it can be easily achieved for any outcome if the studies have sufficiently small sample sizes, and it does not control the relevant error rates. While it is important to replicate original studies with null results, we believe that they should be analyzed using more informative approaches. Box 1 summarizes our recommendations.

Box 1

Recommendations for the analysis of replication studies of original null results. Calculations are based on effect estimates θ̂_i with standard errors σ_i from an original study (i = o) and its replication (i = r). Both effect estimates are assumed to be normally distributed around the true effect size θ with known variance σ²_i.

The effect size θ₀ represents the value of no effect, typically θ₀ = 0.

Equivalence test

Specify a margin Δ > 0 that defines an equivalence range [θ₀ − Δ, θ₀ + Δ] in which effects are considered absent for practical purposes.
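Compute the TOST p-values for original (i = o) and replication (i = r) data

$$p_{\mathrm{TOST},i} = \max\left\{ \Phi\!\left(\frac{\hat{\theta}_i - \theta_0 - \Delta}{\sigma_i}\right),\; 1 - \Phi\!\left(\frac{\hat{\theta}_i - \theta_0 + \Delta}{\sigma_i}\right) \right\}$$

with $\hat{\theta}_i$ the effect estimate, σ_i its standard error, and Φ(⋅) the cumulative distribution function of the standard normal distribution.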
Declare replication success at level α if p_TOST,o ≤ α and p_TOST,r ≤ α, conventionally α = 0.05.
Perform a sensitivity analysis with respect to the margin Δ. For example, visualize the TOST p-values for different margins to assess the robustness of the conclusions.
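The equivalence test steps can be sketched in a few lines of Python (an illustration, not the authors' R code). The effect estimates and standard errors below are hypothetical and are not taken from the RPCB, RPP, or EPRP data; the function names are likewise assumptions.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def tost_pvalue(estimate, se, margin, null_value=0.0):
    """Maximum of the two one-sided p-values for the equivalence test."""
    p_upper = Z.cdf((estimate - null_value - margin) / se)
    p_lower = 1 - Z.cdf((estimate - null_value + margin) / se)
    return max(p_upper, p_lower)


# Hypothetical effect estimates and standard errors on SMD scale
original = {"estimate": 0.10, "se": 0.25}
replication = {"estimate": -0.05, "se": 0.20}
alpha = 0.05


def replication_success(margin):
    """Return both TOST p-values and whether both are at most alpha."""
    p_o = tost_pvalue(original["estimate"], original["se"], margin)
    p_r = tost_pvalue(replication["estimate"], replication["se"], margin)
    return p_o, p_r, (p_o <= alpha and p_r <= alpha)


# Sensitivity analysis: repeat the test for a range of margins
for margin in (0.2, 0.5, 0.74, 1.0):
    p_o, p_r, success = replication_success(margin)
    print(f"margin = {margin:.2f}: p_TOST,o = {p_o:.4f}, "
          f"p_TOST,r = {p_r:.4f}, success = {success}")
```

Tabulating or plotting the p-values over a grid of margins shows how strongly the conclusion depends on the chosen margin.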

Bayes factor

Specify a prior distribution for the effect size θ that represents plausible values under the alternative hypothesis that there is an effect (H₁ : θ ≠ θ₀). For example, specify the mean m and standard deviation s of a normal distribution θ | H₁ ∼ N(m, s²).
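Compute the Bayes factors contrasting H₀ : θ = θ₀ to H₁ : θ ≠ θ₀ for original (i = o) and replication (i = r) data. Assuming a normal prior distribution, the Bayes factor is

$$\mathrm{BF}_{01,i} = \sqrt{1 + \frac{s^2}{\sigma_i^2}} \, \exp\left[-\frac{1}{2}\left\{\frac{(\hat{\theta}_i - \theta_0)^2}{\sigma_i^2} - \frac{(\hat{\theta}_i - m)^2}{\sigma_i^2 + s^2}\right\}\right]$$

with $\hat{\theta}_i$ the effect estimate and σ_i its standard error.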
Declare replication success at level γ > 1 if BF_01,o ≥ γ and BF_01,r ≥ γ, conventionally γ = 3 (substantial evidence) or γ = 10 (strong evidence).
Perform a sensitivity analysis with respect to the prior distribution. For example, visualize the Bayes factors for different prior standard deviations to assess the robustness of the conclusions.
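The Bayes factor steps can be sketched similarly in Python (an illustration, not the authors' R code). The sketch assumes a normally distributed effect estimate with known standard error and a normal prior under H₁, so that the Bayes factor is the ratio of two normal densities. The estimates, standard errors, and prior standard deviations are hypothetical.

```python
from math import exp, log


def bf01_normal_prior(estimate, se, prior_mean=0.0, prior_sd=1.0,
                      null_value=0.0):
    """Bayes factor for H0: theta = null_value versus
    H1: theta ~ N(prior_mean, prior_sd^2), given estimate ~ N(theta, se^2)."""
    var_h0 = se ** 2
    var_h1 = se ** 2 + prior_sd ** 2  # marginal variance under H1
    log_bf = (0.5 * log(var_h1 / var_h0)
              - 0.5 * ((estimate - null_value) ** 2 / var_h0
                       - (estimate - prior_mean) ** 2 / var_h1))
    return exp(log_bf)


# Hypothetical effect estimates and standard errors on SMD scale
original = {"estimate": 0.10, "se": 0.25}
replication = {"estimate": -0.05, "se": 0.20}
gamma = 3  # threshold for replication success

# Sensitivity analysis: repeat the calculation for several prior SDs
for s in (0.5, 1.0, 2.0, 4.0):
    bf_o = bf01_normal_prior(original["estimate"], original["se"], prior_sd=s)
    bf_r = bf01_normal_prior(replication["estimate"], replication["se"],
                             prior_sd=s)
    success = bf_o >= gamma and bf_r >= gamma
    print(f"prior sd = {s:.1f}: BF01,o = {bf_o:.2f}, "
          f"BF01,r = {bf_r:.2f}, success = {success}")
```

Working on the log scale avoids numerical underflow when an estimate lies many standard errors away from the null value.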

Our reanalysis of the RPCB studies with original null results showed that for most studies that meet the non-significance criterion, the conclusions are much more ambiguous – both with frequentist and Bayesian analyses. While the exact success rate depends on the equivalence margin and the prior distribution, our sensitivity analyses show that even with unrealistically liberal choices, the success rate remains below 40% which is substantially lower than the 73% success rate based on the non-significance criterion.

This is not unexpected, as a study typically requires larger sample sizes to detect the absence of an effect than to detect its presence (Matthews, 2006, Section 11.5.3). Of note, the RPCB sample sizes were chosen so that each replication had at least 80% power to detect the original effect estimate based on a standard superiority test. However, the design of replication studies should ideally align with the planned analysis (Anderson and Kelley, 2022), so if the goal of the study is to find evidence for the absence of an effect, the replication sample size should be determined based on a test for equivalence, see Flight and Julious (2015) and Pawel et al. (2023) for frequentist and Bayesian approaches, respectively.
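The difference in sample size requirements can be illustrated with a simplified calculation. The Python sketch below is not one of the approaches cited above; it assumes z-tests for an SMD with known unit variance and equal group sizes (standard error sqrt(2/n)). The target power of 80%, the effect size of 0.5, and the margins are hypothetical choices.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_superiority(true_smd, n_per_group, alpha=0.05):
    """Power of a two-sided z-test of H0: SMD = 0."""
    se = sqrt(2 / n_per_group)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = true_smd / se
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


def power_equivalence(true_smd, n_per_group, margin, alpha=0.05):
    """Power of the TOST procedure: probability that the (1 - 2*alpha)
    confidence interval falls inside [-margin, +margin]."""
    se = sqrt(2 / n_per_group)
    z_crit = Z.inv_cdf(1 - alpha)
    upper = margin - z_crit * se   # estimate must lie below this value
    lower = -margin + z_crit * se  # and above this value
    if upper <= lower:
        return 0.0
    return Z.cdf((upper - true_smd) / se) - Z.cdf((lower - true_smd) / se)


def smallest_n(power_fn, target=0.8, n_max=100_000):
    """Smallest group size for which power_fn(n) reaches the target."""
    for n in range(2, n_max):
        if power_fn(n) >= target:
            return n
    raise ValueError("target power not reached")


n_sup = smallest_n(lambda n: power_superiority(0.5, n))
n_eq = smallest_n(lambda n: power_equivalence(0.0, n, margin=0.5))
n_eq_strict = smallest_n(lambda n: power_equivalence(0.0, n, margin=0.25))
print(f"detect SMD = 0.5:                   n per group = {n_sup}")
print(f"equivalence, margin 0.5, SMD = 0:   n per group = {n_eq}")
print(f"equivalence, margin 0.25, SMD = 0:  n per group = {n_eq_strict}")
```

Under these assumptions, establishing equivalence needs more observations than detecting an effect of the same size as the margin, and the requirement grows rapidly as the margin shrinks.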

Our reanalysis of the RPP and EPRP studies with original null results showed that Bayes factors indeed indicate some evidence for no effect in cases where the non-significance criterion was satisfied, possibly due to the smaller effects and typically larger sample sizes in these fields compared to cancer biology. On the other hand, in most cases the precision of the effect estimates was still limited so that only one study pair achieved replication success with the equivalence testing approach. However, it is important to note that the conclusions from the RPP and EPRP analyses are merely anecdotal, as there were only four study pairs with original null results to analyze.

For both the equivalence test and the Bayes factor approach, it is critical that the equivalence margin and the prior distribution are specified independently of the data, ideally before the original and replication studies are conducted. Typically, however, the original studies were designed to find evidence for the presence of an effect, and the goal of replicating the “null result” was formulated only after failure to do so. It is therefore important that margins and prior distributions are motivated from historical data and/or field conventions (Campbell and Gustafson, 2021), and that sensitivity analyses regarding their choice are reported.

In addition, when analyzing a single pair of original and replication studies, we recommend interpreting Bayes factors and TOST p-values as quantitative measures of evidence and discourage dichotomizing them into “success” or “failure”. For example, two TOST p-values p_TOST = 0.049 and p_TOST = 0.051 carry similar evidential weight regardless of one being slightly smaller and the other being slightly larger than 0.05. On the other hand, when more than one pair of original and replication studies are analyzed, dichotomization may be required for computing an overall success rate. In this case, the rate may be computed for different thresholds that correspond to qualitatively different levels of evidence (e.g., 1, 3, and 10 for Bayes factors, or 0.05, 0.01, and 0.005 for p-values).
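When several study pairs are analyzed, the success rate can be reported for a range of thresholds instead of a single one. The Python sketch below uses hypothetical Bayes factors and TOST p-values that are not taken from any replication project; the function name is an assumption.

```python
def success_rate(pairs, threshold, success_if="greater"):
    """Share of study pairs in which both studies pass the threshold.

    Use success_if="greater" for Bayes factors BF01 and
    success_if="smaller" for TOST p-values.
    """
    if success_if == "greater":
        hits = [o >= threshold and r >= threshold for o, r in pairs]
    else:
        hits = [o <= threshold and r <= threshold for o, r in pairs]
    return sum(hits) / len(hits)


# Hypothetical (original, replication) values for four study pairs
bf_pairs = [(6.0, 4.5), (0.8, 0.5), (12.0, 15.0), (2.5, 3.5)]
p_pairs = [(0.07, 0.03), (0.60, 0.70), (0.001, 0.003), (0.03, 0.02)]

for gamma in (1, 3, 10):
    rate = success_rate(bf_pairs, gamma)
    print(f"BF01 threshold {gamma:2d}: success rate = {rate:.2f}")

for alpha in (0.05, 0.01, 0.005):
    rate = success_rate(p_pairs, alpha, success_if="smaller")
    print(f"p_TOST threshold {alpha}: success rate = {rate:.2f}")
```

Reporting the rate at several thresholds shows how much of the overall conclusion depends on the cutoff.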

Researchers may also ask whether the equivalence test or the Bayes factor is “better”. We believe that this is the wrong question to ask, because both methods address different questions and are better in different senses; the equivalence test is calibrated to have certain frequentist error rates, which the Bayes factor is not. The Bayes factor, on the other hand, seems to be a more natural measure of evidence as it treats the null and alternative hypotheses symmetrically and represents the factor by which rational agents should update their beliefs in light of the data. Replication success is ideally evaluated along multiple dimensions, as nicely exemplified by the RPCB, EPRP, and RPP. Replications that are successful on multiple criteria provide more convincing support for the original finding, while replications that are successful on fewer criteria require closer examination. Fortunately, the use of multiple methods is already standard practice in replication assessment, so our proposal to use both of them does not require a major paradigm shift.

While the equivalence test and the Bayes factor are two principled methods for analyzing original and replication studies with null results, they are not the only possible methods for doing so. A straightforward extension would be to first synthesize the original and replication effect estimates with a meta-analysis, and then apply the equivalence and Bayes factor tests to the meta-analytic estimate similar to the meta-analytic non-significance criterion used by the RPCB. This could potentially improve the power of the tests, but consideration must be given to the threshold used for the p-values/Bayes factors, as naive use of the same thresholds as in the standard approaches may make the tests too liberal (Shun et al., 2005). Furthermore, there are various advanced methods for quantifying evidence for absent effects which could potentially improve on the more basic approaches considered here (Lindley, 1998; Johnson and Rossell, 2010; Morey and Rouder, 2011; Kruschke, 2018; Stahel, 2021; Micheloud and Held, 2023; Izbicki et al., 2023).
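As a rough illustration of the meta-analytic extension mentioned above, the following sketch pools an original and a replication estimate with an inverse-variance weighted meta-analysis and then checks equivalence through the pooled 90% confidence interval. The fixed-effect model, the numbers, the margin, and the function names are all assumptions made for illustration; as noted above, the conventional 5% level may be too liberal in this setting.

```python
from math import sqrt
from statistics import NormalDist

def pool_fixed_effect(estimates, ses):
    """Inverse-variance weighted (fixed-effect) pooled estimate and its standard error."""
    weights = [1 / se**2 for se in ses]
    pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    return pooled, sqrt(1 / sum(weights))

def ci_within_margin(estimate, se, margin, alpha=0.05):
    """True if the (1 - 2*alpha) confidence interval lies inside (-margin, +margin).

    Under the normal approximation this corresponds to a TOST p-value below alpha.
    """
    z = NormalDist().inv_cdf(1 - alpha)
    return -margin < estimate - z * se and estimate + z * se < margin

# Hypothetical original and replication SMD estimates with standard errors
pooled, pooled_se = pool_fixed_effect([0.10, -0.10], [0.20, 0.20])
print(pooled, pooled_se)
print(ci_within_margin(pooled, pooled_se, margin=0.5))  # hypothetical margin
```

Pooling shrinks the standard error, which is why the pooled interval can fall inside a margin that neither study would satisfy on its own.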

Acknowledgements

We thank the RPCB, EPRP, and RPP contributors for their tremendous efforts and for making their data publicly available. We thank Maya Mathur for helpful advice on data preparation. We thank Benjamin Ineichen for helpful comments on drafts of the manuscript. We thank the three reviewers and the reviewing editor for useful comments that substantially improved the paper. Our acknowledgment of these individuals does not imply their endorsement of our work. We thank the Swiss National Science Foundation for financial support (grant #189295).

Software and data

The code and data to reproduce our analyses are openly available at https://gitlab.uzh.ch/samuel.pawel/rsAbsence. A snapshot of the repository at the time of writing is available at https://doi.org/10.5281/zenodo.7906792. We used the statistical programming language R version 4.3.2 (R Core Team, 2022) for analyses. The R packages ggplot2 (Wickham, 2016), dplyr (Wickham et al., 2022), knitr (Xie, 2022), and reporttools (Rufibach, 2009) were used for plotting, data preparation, dynamic reporting, and formatting, respectively. The data from the RPCB were obtained by downloading the files from https://github.com/mayamathur/rpcb (commit a1e0c63) and extracting the relevant variables as indicated in the R script preprocess-rpcb-data.R, which is available in our git repository. The RPP and EPRP data were obtained from the RProjects data set available in the R package ReplicationSuccess (Held, 2020); see the package documentation (https://CRAN.R-project.org/package=ReplicationSuccess) for details on data extraction.

Appendix A: Sensitivity analyses

The post-hoc specification of equivalence margins Δ and prior distribution for the SMD under the alternative H₁ is debatable. Commonly used margins in clinical research are much more stringent (Lange and Freitag, 2005); for instance, in oncology, a margin of Δ = log(1.3) is commonly used for log odds/hazard ratios, whereas in bioequivalence studies a margin of Δ = log(1.25) is the convention (Senn, 2021, Chapter 22). These margins would translate into margins of Δ = 0.14 and Δ = 0.12 on the SMD scale, respectively, using the SMD = (√3/π)log OR conversion (Cooper et al., 2019, p. 233). Similarly, for the Bayes factor we specified a normal unit-information prior under the alternative while other normal priors with smaller/larger standard deviations could have been considered. Here, we therefore investigate the sensitivity of our conclusions with respect to these parameters.
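The conversion of the two clinical margins to the SMD scale can be checked with a few lines of code. This is a minimal sketch of the SMD = (√3/π) log OR conversion stated above; the function name is hypothetical.

```python
import math

def log_or_to_smd(log_or: float) -> float:
    """Convert a log odds ratio to the SMD scale via SMD = (sqrt(3) / pi) * log OR."""
    return math.sqrt(3) / math.pi * log_or

# Margins on the ratio scale: 1.3 (oncology) and 1.25 (bioequivalence)
for ratio in (1.3, 1.25):
    print(f"ratio margin {ratio}: SMD margin {log_or_to_smd(math.log(ratio)):.2f}")
# prints 0.14 for the margin of 1.3 and 0.12 for the margin of 1.25
```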

The top plot of Figure 4 shows the number of successful replications as a function of the margin Δ and for different TOST p-value thresholds. Such an “equivalence curve” approach was first proposed by Hauck and Anderson (1986). We see that for realistic margins between 0 and 1, the proportion of replication successes remains below 50% for the conventional α = 0.05 level. To achieve a success rate of 11/15 = 73%, as was achieved with the non-significance criterion from the RPCB, unrealistic margins of Δ > 2 are required. Changing the success criterion to a more lenient level (α = 0.1) or a more stringent level (α = 0.01) hardly changes the conclusion.
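The logic behind such an equivalence curve can be sketched as follows: the TOST p-value is computed for a grid of margins, and the number of study pairs with both p-values at or below α is counted. The study pairs below are invented for illustration and are not the RPCB data; the normal approximation and the function names are likewise assumptions.

```python
from statistics import NormalDist

def tost_pvalue(estimate, se, margin):
    """TOST p-value under a normal approximation: the larger of the two
    one-sided p-values for H0: theta >= +margin and H0: theta <= -margin."""
    z = NormalDist()
    p_upper = z.cdf((estimate - margin) / se)      # test of theta >= +margin
    p_lower = 1 - z.cdf((estimate + margin) / se)  # test of theta <= -margin
    return max(p_upper, p_lower)

def n_successes(pairs, margin, alpha=0.05):
    """Count pairs whose original and replication TOST p-values are both <= alpha."""
    return sum(
        tost_pvalue(eo, so, margin) <= alpha and tost_pvalue(er, sr, margin) <= alpha
        for eo, so, er, sr in pairs
    )

# Hypothetical (original estimate, original SE, replication estimate, replication SE)
pairs = [(0.05, 0.10, -0.02, 0.08), (0.30, 0.40, 0.10, 0.25), (-0.20, 0.60, 0.15, 0.50)]

for margin in (0.0, 0.25, 0.5, 1.0, 2.0):
    print(margin, [n_successes(pairs, margin, a) for a in (0.1, 0.05, 0.01)])
```

Because both one-sided p-values decrease as the margin grows, the success count can only stay constant or increase with Δ, which is what gives the curve its monotone shape.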

Number of successful replications of original null results in the RPCB as a function of the margin Δ of the equivalence test (p_TOST ≤ α in both studies for α = 0.1, 0.05, 0.01) or the standard deviation of the zero-mean normal prior distribution for the SMD effect size under the alternative H₁ of the Bayes factor test (BF₀₁ ≥ γ in both studies for γ = 3, 6, 10).

The bottom plot of Figure 4 shows a sensitivity analysis regarding the choice of the prior standard deviation and the Bayes factor threshold. It is uncommon to specify prior standard deviations larger than the unit-information standard deviation of 2, as this corresponds to the assumption of very large effect sizes under the alternative. However, to achieve replication success for a larger proportion of replications than the observed 1/15 = 7%, unreasonably large prior standard deviations have to be specified. For instance, a standard deviation of roughly 5 is required to achieve replication success in 50% of the replications at a lenient Bayes factor threshold of γ = 3. The standard deviation needs to be almost 20 to achieve the same success rate of 11/15 = 73% as with the non-significance criterion. The necessary standard deviations are even higher for stricter Bayes factor thresholds, such as γ = 6 or γ = 10.

Appendix B: Null results from the RPP and EPRP

Here we perform equivalence test and Bayes factor analyses for the three original null results from the Reproducibility Project: Psychology (Eastwick and Finkel, 2008; Ranganath and Nosek, 2008; Reynolds and Besner, 2008) and the original null result from the Experimental Philosophy Replicability Project (McCann, 2005). To enable comparison of effect sizes across different studies, both the RPP and the EPRP provided effect estimates as Fisher z-transformed correlations, which we use in the following.

Figure 5 shows effect estimates with 90% confidence intervals, two-sided p-values for the null hypothesis that the effect size is zero, TOST p-values for a margin of Δ = 0.2, and Bayes factors using a normal prior centered around zero with a standard deviation of 2. We see that all replications except the replication of Ranganath and Nosek (2008) would be considered successful with the non-significance criterion, as the original and replication p-values are greater than 0.05. In all three cases, the Bayes factors also indicate substantial (BF₀₁ > 3) to strong evidence (BF₀₁ > 10) for the null hypothesis of no effect. Compared to the Bayes factors in the RPCB, the evidence is stronger, possibly due to the mostly larger sample sizes in the RPP and EPRP.

Effect estimates on Fisher z-transformed correlation scale with 90% confidence interval for the “null results” and their replication studies from the Reproducibility Project: Psychology (RPP, Open Science Collaboration, 2015) and the Experimental Philosophy Replicability Project (EPRP, Cova et al., 2018). The dashed gray line represents the value of no effect (z = 0), while the dotted red lines represent the equivalence range with a margin of Δ = 0.2. The p-value p_TOST is the maximum of the two one-sided p-values for the null hypotheses of the effect being greater/less than +Δ and −Δ, respectively. The Bayes factor BF₀₁ quantifies the evidence for the null hypothesis H₀ : z = 0 against the alternative H₁ : z ≠ 0 with normal prior centered around zero and standard deviation of 2 assigned to the effect size under H₁.

Interestingly, the opposite conclusion is reached when we analyze the data using an equivalence test with a margin of Δ = 0.2 (which may be considered liberal as it represents a small to medium effect based on the Cohen, 1992 convention). In this case, equivalence at the 5% level can only be established for the Ranganath and Nosek (2008) original study and its replication simultaneously, as the confidence intervals from the other studies are too wide to be included in the equivalence range. Furthermore, the Ranganath and Nosek (2008) replication also illustrates the conceptual difference between testing for an exactly zero effect versus testing for an effect within an interval around zero. That is, the Bayes factor indicates no evidence for a zero effect (because the estimate is clearly not zero), but the equivalence test indicates evidence for a negligible effect (because the estimate is clearly within the equivalence range).
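The conceptual difference between testing for an exactly zero effect and testing for an effect within an interval around zero can be made concrete with a small sketch. The estimate and standard error below are hypothetical values chosen to mimic a precise but clearly non-zero estimate, not the actual Ranganath and Nosek (2008) replication data; the function names and the normal approximation are also assumptions.

```python
from statistics import NormalDist

def tost_pvalue(estimate, se, margin):
    """Larger of the one-sided p-values for H0: theta >= +margin and H0: theta <= -margin."""
    z = NormalDist()
    return max(z.cdf((estimate - margin) / se), 1 - z.cdf((estimate + margin) / se))

def bf01(estimate, se, prior_sd, null=0.0, prior_mean=0.0):
    """Bayes factor for H0: theta = null against H1: theta ~ N(prior_mean, prior_sd^2)."""
    like_h0 = NormalDist(null, se).pdf(estimate)
    # Marginal likelihood under H1: normal with variance se^2 + prior_sd^2
    marg_h1 = NormalDist(prior_mean, (se**2 + prior_sd**2) ** 0.5).pdf(estimate)
    return like_h0 / marg_h1

estimate, se = 0.10, 0.02  # hypothetical: small effect, estimated very precisely
p = tost_pvalue(estimate, se, margin=0.2)
bf = bf01(estimate, se, prior_sd=2)
print(f"p_TOST = {p:.2g}, BF01 = {bf:.2g}")
```

With these inputs the TOST p-value is far below 0.05 while BF₀₁ is far below 1, so the same data support a negligible effect and speak against an exactly zero effect at the same time.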

As before, the particular choices of the equivalence margin Δ for the equivalence test and prior standard deviation of the Bayes factor are debatable. We therefore report sensitivity analyses in Figure 6 which show the TOST p-values and Bayes factors of original and replication studies for a range of margins and prior standard deviations, respectively. Apart from the Ranganath and Nosek (2008) study pair, all study pairs require large margins of about Δ = 0.4 to establish replication success at the 5% level (in the sense of original and replication TOST p-values being smaller than 0.05). On the other hand, in all but the Ranganath and Nosek (2008) replication, the data provide substantial evidence for a null effect (BF₀₁ > 3) for prior standard deviations of about one, while larger prior standard deviations of about three are required for the data to indicate strong evidence (BF₀₁ > 10) for a null effect. In contrast, the data from the Ranganath and Nosek (2008) replication provide very strong evidence against a null effect for all prior standard deviations considered.

Sensitivity analyses for the “null results” and their replication studies from the Reproducibility Project: Psychology (RPP, Open Science Collaboration, 2015) and the Experimental Philosophy Replicability Project (EPRP, Cova et al., 2018). The Bayes factor of the replication of Ranganath and Nosek (2008) decreases very quickly and is only shown for a limited range.

Appendix C: Technical details on Bayes factors

We assume that effect estimates are normally distributed around an unknown effect size θ with known variance equal to their squared standard error, i.e.,

$$\hat{\theta}_i \mid \theta \sim \mathrm{N}(\theta, \sigma_i^2)$$

for original (i = o) and replication (i = r), where σ_i denotes the standard error of the effect estimate. This framework is similar to meta-analysis and can be applied to many types of effect sizes and data (Spiegelhalter et al., 2004, Section 2.4). We want to quantify the evidence for the null hypothesis that the effect size is equal to a null effect (H₀ : θ = θ₀, typically θ₀ = 0) against the alternative hypothesis that the effect size is non-null (H₁ : θ ≠ θ₀). This requires specification of a prior distribution for the effect size under the alternative, and we will assume a normal prior θ | H₁ ~ N(m, s²) in the following. The Bayes factor based on an effect estimate is then given by the ratio of its likelihood under the null hypothesis to its marginal likelihood under the alternative hypothesis, i.e.,

$$\mathrm{BF}_{01}(\hat{\theta}_i) = \frac{f(\hat{\theta}_i \mid H_0)}{f(\hat{\theta}_i \mid H_1)} = \sqrt{1 + \frac{s^2}{\sigma_i^2}} \, \exp\left[-\frac{1}{2}\left\{\frac{(\hat{\theta}_i - \theta_0)^2}{\sigma_i^2} - \frac{(\hat{\theta}_i - m)^2}{\sigma_i^2 + s^2}\right\}\right].$$
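Under the normal model and normal prior described above, the Bayes factor is a ratio of two normal densities and therefore has a closed form. The following is a minimal sketch of that closed form; the function name, argument names, and example numbers are hypothetical.

```python
from math import exp, sqrt

def bf01_normal(estimate, se, prior_mean=0.0, prior_sd=2.0, null=0.0):
    """BF01 for H0: theta = null against H1: theta ~ N(prior_mean, prior_sd^2),
    assuming estimate | theta ~ N(theta, se^2)."""
    # Ratio of the standard deviations of the two normal densities
    scale = sqrt(1 + prior_sd**2 / se**2)
    # Difference of the standardized squared distances under H0 and H1
    quad = (estimate - null) ** 2 / se**2 - (estimate - prior_mean) ** 2 / (se**2 + prior_sd**2)
    return scale * exp(-0.5 * quad)

# Hypothetical estimate and standard error; vary the prior standard deviation
for s in (0.5, 1, 2, 4):
    print(s, bf01_normal(0.1, 0.3, prior_sd=s))
```

When the estimate equals the null value and the prior is centered there too, the exponential term equals one and the Bayes factor reduces to sqrt(1 + s²/σ²), so a wider prior yields more apparent evidence for the null.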

In the main analysis we used a normal unit-information prior, that is, a normal distribution centered around the value of no effect (m = 0) with a standard deviation s corresponding to the standard error of an SMD estimate based on one observation (Kass and Wasserman, 1995). Assuming two equally sized groups whose means are normally distributed, $\bar{X}_1 \sim \mathrm{N}(\theta_1, 2\tau^2/n)$ and $\bar{X}_2 \sim \mathrm{N}(\theta_2, 2\tau^2/n)$, with n the total sample size and τ the known data standard deviation, the distribution of the SMD is $\mathrm{SMD} = (\bar{X}_1 - \bar{X}_2)/\tau \sim \mathrm{N}\{(\theta_1 - \theta_2)/\tau, 4/n\}$. The standard error σ of the SMD based on one unit (n = 1) is hence 2, meaning that the standard deviation of the unit-information prior is s = 2.

References

Altman D. G., Bland J. M. (1995). Statistics notes: Absence of evidence is not evidence of absence. BMJ 311:485–485. https://doi.org/10.1136/bmj.311.7003.485
Amaral O. B., Neves K., Wasilewska-Sampaio A. P., Carneiro C. F. (2019). Science forum: The Brazilian reproducibility initiative. eLife 8. https://doi.org/10.7554/elife.41602
Anderson S. F., Kelley K. (2022). Sample size planning for replication studies: The devil is in the design. Psychological Methods. https://doi.org/10.1037/met0000520
Anderson S. F., Maxwell S. E. (2016). There's more than one way to conduct a replication study: Beyond statistical significance. Psychological Methods 21:1–12. https://doi.org/10.1037/met0000051
Begley C. G., Ellis L. M. (2012). Raise standards for preclinical cancer research. Nature 483:531–533. https://doi.org/10.1038/483531a
Camerer C. F., Dreber A., Forsell E., Ho T., Huber J., Johannesson M., Kirchler M., Almenberg J., Altmejd A., et al. (2016). Evaluating replicability of laboratory experiments in economics. Science 351:1433–1436. https://doi.org/10.1126/science.aaf0918
Camerer C. F., Dreber A., Holzmeister F., Ho T., Huber J., Johannesson M., Kirchler M., Nave G., Nosek B., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour 2:637–644. https://doi.org/10.1038/s41562-018-0399-z
Campbell H., Gustafson P. (2018). Conditional equivalence testing: An alternative remedy for publication bias. PLOS ONE 13:e0195145. https://doi.org/10.1371/journal.pone.0195145
Campbell H., Gustafson P. (2021). What to make of equivalence testing with a post-specified margin? Meta-Psychology 5. https://doi.org/10.15626/mp.2020.2506
Cohen J. (1992). A power primer. Psychological Bulletin 112:155–159. https://doi.org/10.1037/0033-2909.112.1.155
Cooper H., Hedges L. V., Valentine J. C., editors (2019). The Handbook of Research Synthesis and Meta-Analysis. Russell Sage Foundation. https://doi.org/10.7758/9781610448864
Cova F., Strickland B., Abatista A., Allard A., Andow J., Attie M., Beebe J., Berniūnas R., Boudesseul J., Colombo M., et al. (2018). Estimating the reproducibility of experimental philosophy. Review of Philosophy and Psychology. https://doi.org/10.1007/s13164-018-0400-9
Dawson M. A., Prinjha R. K., Dittmann A., Giotopoulos G., Bantscheff M., Chan W.-I., Robson S. C., wa Chung C., Hopf C., Savitski M. M., Huthmacher C., Gudgin E., Lugo D., Beinke S., Chapman T. D., Roberts E. J., Soden P. E., Auger K. R., Mirguet O., Doehner K., Delwel R., Burnett A. K., Jeffrey P., Drewes G., Lee K., Huntly B. J. P., Kouzarides T. (2011). Inhibition of BET recruitment to chromatin as an effective treatment for MLL-fusion leukaemia. Nature 478:529–533. https://doi.org/10.1038/nature10509
Dienes Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology 5. https://doi.org/10.3389/fpsyg.2014.00781
Eastwick P. W., Finkel E. J. (2008). Sex differences in mate preferences revisited: Do people know what they initially desire in a romantic partner? Journal of Personality and Social Psychology 94:245–264. https://doi.org/10.1037/0022-3514.94.2.245
Errington T. M., Mathur M., Soderberg C. K., Denis A., Perfito N., Iorns E., Nosek B. A. (2021). Investigating the replicability of preclinical cancer biology. eLife 10. https://doi.org/10.7554/elife.71601
Flight L., Julious S. A. (2015). Practical guide to sample size calculations: non-inferiority and equivalence trials. Pharmaceutical Statistics 15:80–89. https://doi.org/10.1002/pst.1716
Goetz J. G., Minguet S., Navarro-Lérida I., Lazcano J. J., Samaniego R., Calvo E., Tello M., Osteso-Ibáñez T., Pellinen T., Echarri A., Cerezo A., Klein-Szanto A. J., Garcia R., Keely P. J., Sánchez-Mateos P., Cukierman E., Pozo M. A. D. (2011). Biomechanical remodeling of the microenvironment by stromal caveolin-1 favors tumor invasion and metastasis. Cell 146:148–163. https://doi.org/10.1016/j.cell.2011.05.040
Goodman S. N. (1999). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine 130:1005. https://doi.org/10.7326/0003-4819-130-12-199906150-00019
Goodman S. N. (2005). Introduction to Bayesian methods I: measuring the strength of evidence. Clinical Trials 2:282–290. https://doi.org/10.1191/1740774505cn098oa
Greenland S. (2011). Null misinterpretation in statistical testing and its impact on health risk assessment. Preventive Medicine 53:225–228. https://doi.org/10.1016/j.ypmed.2011.08.010
Greenland S. (2012). Nonsignificance plus high power does not imply support for the null over the alternative. Annals of Epidemiology 22:364–368. https://doi.org/10.1016/j.annepidem.2012.02.007
Greenland S. (2023). Divergence versus decision P-values: A distinction worth making in theory and keeping in practice: Or, how divergence P-values measure evidence even when decision P-values do not. Scandinavian Journal of Statistics 50:54–88. https://doi.org/10.1111/sjos.12625
Hauck W. W., Anderson S. (1986). A proposal for interpreting and reporting negative studies. Statistics in Medicine 5:203–209. https://doi.org/10.1002/sim.4780050302
Held L. (2020). A new standard for the analysis and design of replication studies (with discussion). Journal of the Royal Statistical Society: Series A (Statistics in Society) 183:431–448. https://doi.org/10.1111/rssa.12493
Held L., Ott M. (2018). On p-values and Bayes factors. Annual Review of Statistics and Its Application 5:393–419. https://doi.org/10.1146/annurev-statistics-031017-100307
Hoenig J. M., Heisey D. M. (2001). The abuse of power. The American Statistician 55:19–24. https://doi.org/10.1198/000313001300339897
Izbicki R., Cabezas L. M. C., Colugnatti F. A. B., Lassance R. F. L., de Souza A. A. L., Stern R. B. (2023). Rethinking hypothesis tests. Preprint.
Jeffreys H. (1961). Theory of Probability. Oxford: Clarendon Press.
Johnson V. E., Rossell D. (2010). On the use of non-local prior densities in Bayesian hypothesis tests. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72:143–170. https://doi.org/10.1111/j.1467-9868.2009.00730.x
Kass R. E., Raftery A. E. (1995). Bayes factors. Journal of the American Statistical Association 90:773–795. https://doi.org/10.1080/01621459.1995.10476572
Kass R. E., Wasserman L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association 90:928–934. https://doi.org/10.1080/01621459.1995.10476592
Keysers C., Gazzola V., Wagenmakers E.-J. (2020). Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence. Nature Neuroscience 23:788–799. https://doi.org/10.1038/s41593-020-0660-4
Klein R. A., Ratliff K. A., Vianello M., Adams R. B., Bahník Š., Bernstein M. J., Bocian K., Brandt M. J., Brooks B., et al. (2014). Investigating variation in replicability: A “many labs” replication project. Social Psychology 45:142–152. https://doi.org/10.1027/1864-9335/a000178
Klein R. A., Vianello M., Hasselman F., Adams B. G., Adams R. B. Jr., Alper S., Aveyard M., Axt J. R., Babalola M. T., et al. (2018). Many labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science 1:443–490. https://doi.org/10.1177/2515245918810225
Kruschke J. K. (2018). Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science 1:270–280. https://doi.org/10.1177/2515245918771304
Lakens D. (2017). Equivalence tests. Social Psychological and Personality Science 8:355–362. https://doi.org/10.1177/1948550617697177
Lange S., Freitag G. (2005). Choice of delta: Requirements and reality – results of a systematic review. Biometrical Journal 47:12–27. https://doi.org/10.1002/bimj.200410085
Lewis L. M., Edwards M. C., Meyers Z. R., Talbot C. C., Hao H., Blum D., Iorns E., Tsui R., Denis A., Perfito N., Errington T. M. (2018). Replication study: Transcriptional amplification in tumor cells with elevated c-Myc. eLife 7. https://doi.org/10.7554/elife.30274
Lin C. Y., Loven J., Rahl P. B., Paranal R. M., Burge C. B., Bradner J. E., Lee T. I., Young R. A. (2012). Transcriptional amplification in tumor cells with elevated c-Myc. Cell 151:56–67. https://doi.org/10.1016/j.cell.2012.08.026
Lindley D. V. (1998). Decision analysis and bioequivalence trials. Statistical Science 13. https://doi.org/10.1214/ss/1028905932
Makin T. R., de Xivry J.-J. O. (2019). Ten common statistical mistakes to watch out for when writing or reviewing a manuscript. eLife 8. https://doi.org/10.7554/elife.48175
Mathur M. B., VanderWeele T. J. (2020). New statistical metrics for multisite replication projects. Journal of the Royal Statistical Society: Series A (Statistics in Society) 183:1145–1166. https://doi.org/10.1111/rssa.12572
Matthews J. N. (2006). Introduction to Randomized Controlled Clinical Trials. New York: Chapman and Hall/CRC. https://doi.org/10.1201/9781420011302
McCann H. J. (2005). Intentional action and intending: Recent empirical studies. Philosophical Psychology 18:737–748. https://doi.org/10.1080/09515080500355236
Micheloud C., Held L. (2023). The replication of equivalence studies. arXiv preprint. https://doi.org/10.48550/ARXIV.2204.06960
Morey R. D., Rouder J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods 16:406–419. https://doi.org/10.1037/a0024377
Murphy J., Mesquida C., Caldwell A. R., Earp B. D., Warne J. P. (2022). Proposal of a selection protocol for replication of studies in sports and exercise science. Sports Medicine 53:281–291. https://doi.org/10.1007/s40279-022-01749-1
National Academies of Sciences, Engineering, and Medicine (2019). Reproducibility and Replicability in Science. National Academies Press. https://doi.org/10.17226/25303
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science 349:aac4716. https://doi.org/10.1126/science.aac4716
Patil P., Peng R. D., Leek J. T. (2016). What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. Perspectives on Psychological Science 11:539–544. https://doi.org/10.1177/1745691616646366
Pawel S., Consonni G., Held L. (2023). Bayesian approaches to designing replication studies. Psychological Methods. https://doi.org/10.1037/met0000604
Prinz F., Schlange T., Asadullah K. (2011). Believe it or not: how much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery 10:712–712. https://doi.org/10.1038/nrd3439-c1
R Core Team (2022). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/
Ranganath K. A., Nosek B. A. (2008). Implicit attitude generalization occurs immediately; explicit attitude generalization takes time. Psychological Science 19:249–254. https://doi.org/10.1111/j.1467-9280.2008.02076.x
Reynolds M., Besner D. (2008). Contextual effects on reading aloud: Evidence for pathway control. Journal of Experimental Psychology: Learning, Memory, and Cognition 34:50–64. https://doi.org/10.1037/0278-7393.34.1.50
Rufibach K. (2009). reporttools: R functions to generate LaTeX tables of descriptive statistics. Journal of Statistical Software, Code Snippets 31. https://doi.org/10.18637/jss.v031.c01
Schauer J. M., Hedges L. V. (2021). Reconsidering statistical methods for assessing replication. Psychological Methods 26:127–139. https://doi.org/10.1037/met0000302
Schuirmann D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics 15:657–680. https://doi.org/10.1007/bf01068419
Senn S. (2021). Statistical Issues in Drug Development. Wiley. https://doi.org/10.1002/9781119238614
Shun Z., Chi E., Durrleman S., Fisher L. (2005). Statistical consideration of the strategy for demonstrating clinical evidence of effectiveness-one larger vs two smaller pivotal studies. Statistics in Medicine 24:1619–1637. https://doi.org/10.1002/sim.2015
Spiegelhalter D. J., Abrams R., Myles J. P. (2004). Bayesian Approaches to Clinical Trials and Health-Care Evaluation. New York: Wiley.
Stahel W. A. (2021). New relevance and significance measures to replace p-values. PLOS ONE 16:e0252991. https://doi.org/10.1371/journal.pone.0252991
Wellek S. (2010). Testing statistical hypotheses of equivalence and noninferiority. CRC press.
Westlake W. J. (1972). Use of confidence intervals in analysis of comparative bioavailability trials. Journal of Pharmaceutical Sciences 61:1340–1341. https://doi.org/10.1002/jps.2600610845
Wickham H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer International Publishing. https://doi.org/10.1007/978-3-319-24277-4
Wickham H., Francois R., Henry L., Muller K. (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.10. https://CRAN.R-project.org/package=dplyr
Xie Y. (2022). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.40. https://yihui.org/knitr/

Article and author information

Author information

Samuel Pawel
Epidemiology, Biostatistics and Prevention Institute, Center for Reproducible Science, University of Zurich, Switzerland
- For correspondence: samuel.pawel@uzh.ch
- Contributed equally
Rachel Heyard
Epidemiology, Biostatistics and Prevention Institute, Center for Reproducible Science, University of Zurich, Switzerland
- Contributed equally
Charlotte Micheloud
Epidemiology, Biostatistics and Prevention Institute, Center for Reproducible Science, University of Zurich, Switzerland
Leonhard Held
Epidemiology, Biostatistics and Prevention Institute, Center for Reproducible Science, University of Zurich, Switzerland

Version history

Preprint posted: May 8, 2023
Sent for peer review: September 8, 2023
Reviewed Preprint version 1: November 22, 2023
Reviewed Preprint version 2: February 16, 2024
Version of Record published: May 13, 2024

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.92311. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Revised: This Reviewed Preprint has been revised by the authors in response to the previous round of peer review; the eLife assessment and the public reviews have been updated where necessary by the editors and peer reviewers.

Reviewing Editor
Philip Boonstra
University of Michigan, Ann Arbor, United States of America
Senior Editor
Peter Rodgers
eLife, Cambridge, United Kingdom

Reviewer #1 (Public Review):

Summary:
The goal of Pawel et al. is to provide a more rigorous and quantitative approach for judging whether or not an initial null finding (conventionally with p ≥ 0.05) has been replicated by a second similarly null finding. They discuss important objections to relying on the qualitative significant/non-significant dichotomy to make this judgement. They present two complementary methods (one frequentist and the other Bayesian) which provide a superior quantitative framework for assessing the replicability of null findings.

Strengths:
Clear presentation; illuminating examples drawn from the well-known Reproducibility Project: Cancer Biology data set; R-code that implements suggested analyses. Using both methods as suggested provides a superior procedure for judging the replicability of null findings.

Weaknesses:
The frequentist and the Bayesian methods can be used to make binary assessments of an original finding and its replication. The authors clarify, though, that they can also be used to make continuous quantitative judgements about strength of evidence. I believe that most will use the methods in a binary fashion, but the availability of more nuanced assessments is welcome. This revision has addressed what I initially considered a weakness.

https://doi.org/10.7554/eLife.92311.2.sa1

Reviewer #2 (Public Review):

Summary and strengths:

1) The work provides significant insights because usually non-significant studies can be considered replicated by their null replications as well. The work discusses and provides data demonstrating that, when the result to be replicated has p > 0.05, equivalence tests and Bayes factor approaches are more suitable, since studies can be underpowered even if replications in general use larger samples than their original studies. Non-significant p-values are highly expected even with 80% power for a true effect.

2) The evidence used features methods and analyses more rigorous than current state-of-the-art research on replicability.

Weaknesses:
I am satisfied with the revisions made by the authors in response to my initial suggestions, as well as their subsequent responses to my observations throughout the reviewing process.

https://doi.org/10.7554/eLife.92311.2.sa0

Author Response

The following is the authors’ response to the original reviews.

eLife assessment

This work provides a valuable contribution and assessment of what it means to replicate a null study finding, and what are the appropriate methods for doing so (apart from a rote p-value assessment). Through a convincing re-analysis of results from the Reproducibility Project: Cancer Biology using frequentist equivalence testing and Bayes factors, the authors demonstrate that even when reducing 'replicability success' to a single criterion, how precisely replication is measured may yield differing results. Less focus is directed to appropriate replication of non-null findings.

Reviewer #1 (Public Review):

Summary:

The goal of Pawel et al. is to provide a more rigorous and quantitative approach for judging whether or not an initial null finding (conventionally with p ≥ 0.05) has been replicated by a second similarly null finding. They discuss important objections to relying on the qualitative significant/non-significant dichotomy to make this judgment. They present two complementary methods (one frequentist and the other Bayesian) which provide a superior quantitative framework for assessing the replicability of null findings.

Strengths:

Clear presentation; illuminating examples drawn from the well-known Reproducibility Project: Cancer Biology data set; R-code that implements suggested analyses. Using both methods as suggested provides a superior procedure for judging the replicability of null findings.

Weaknesses:

The proposed frequentist and the Bayesian methods both rely on binary assessments of an original finding and its replication. I'm not sure if this is a weakness or is inherent to making binary decisions based on continuous data.

For the frequentist method, a null finding is considered replicated if the original and replication 90% confidence intervals for the effects both fall within the equivalence range. According to this approach, a null finding would be considered replicated if the p-values of both equivalence tests (original and replication) were, say, 0.049, whereas it would not be considered replicated if, for example, the equivalence test of the original study had a p-value of 0.051 and the replication had a p-value of 0.001. Intuitively, the evidence for replication would seem to be stronger in the second instance. The recommended Bayesian approach similarly relies on a dichotomy (e.g., Bayes factor > 1).

Thanks for the suggestions, we now emphasize more strongly in the “Methods for assessing replicability of null results” and “Conclusions” sections that both TOST p-values and Bayes factors are quantitative measures of evidence that do not require dichotomization into “success” or “failure”.
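
To make the criterion discussed in this exchange concrete, the following is a minimal Python sketch of a TOST p-value under a normal approximation. It is not the authors' R code. The function names and the study values (estimate 0.5, standard error 0.4) are hypothetical; the margin of 0.74, the 5% level, and the p-values 0.049, 0.051, and 0.001 are taken from the comments in this review.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def tost_p_value(estimate, se, margin):
    """TOST p-value for H0: |effect| >= margin (normal approximation)."""
    p_lower = 1 - STD_NORMAL.cdf((estimate + margin) / se)  # H0: effect <= -margin
    p_upper = STD_NORMAL.cdf((estimate - margin) / se)      # H0: effect >= +margin
    return max(p_lower, p_upper)


def ci_within_margin(estimate, se, margin, level=0.90):
    """Check whether the confidence interval lies inside (-margin, margin)."""
    half_width = STD_NORMAL.inv_cdf(0.5 + level / 2) * se
    return -margin < estimate - half_width and estimate + half_width < margin


def both_equivalent(p_original, p_replication, alpha=0.05):
    """Dichotomous criterion: both TOST p-values must fall below alpha."""
    return max(p_original, p_replication) < alpha


# Hypothetical study (illustrative numbers, not from the RPCB data)
p = tost_p_value(0.5, 0.4, 0.74)
print(round(p, 3), ci_within_margin(0.5, 0.4, 0.74))

# The two scenarios raised by the reviewer
print(both_equivalent(0.049, 0.049))  # both p-values just below 0.05
print(both_equivalent(0.051, 0.001))  # original p-value just above 0.05
```

Under this approximation, a TOST p-value below 0.05 is equivalent to the 90% confidence interval lying inside the equivalence range. The sketch also shows why the p-values themselves carry more information than the yes/no outcome of the dichotomous rule.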

Reviewer #2 (Public Review):

Summary:

The study demonstrates how inconclusive replications of studies with an original p > 0.05 can be, and employs equivalence tests and Bayes factor approaches to illustrate this concept. Interestingly, the study reveals that achieving a success rate of 11 out of 15, or 73%, as was accomplished with the non-significance criterion from the RPCB (Reproducibility Project: Cancer Biology), requires unrealistic margins of Δ > 2 for equivalence testing.

Strengths:

The study uses reliable, openly shared data to demonstrate its findings and also shares the code for the statistical analysis. It provides sensitivity analyses for different equivalence margins and alpha levels, as well as for different prior standard deviations and thresholds for the Bayes factors. All analyses and code are open and can be replicated. In addition, the study demonstrates on a case-by-case basis how the different criteria can diverge for a sample from one field of science: preclinical cancer biology. It also explains clearly what Bayes factors and equivalence tests are.

Weaknesses:

It would be interesting to investigate whether using Bayes factors and equivalence tests in addition to p-values results in a clearer scenario when applied to replication data from other fields. As mentioned by the authors, the Reproducibility Project: Experimental Philosophy (RPEP) and the Reproducibility Project: Psychology (RPP) have data attempting to replicate some original studies with null results. While the RPCB analysis yielded a similar picture when using both criteria, it is worth exploring whether this holds true for RPP and RPEP. Considerations for further research in this direction are suggested. Even if the original null results were excluded in the calculation of an overall replicability rate based on significance, sensitivity analyses considering them could have been conducted. The present authors can demonstrate replication success using the significance criteria in these two projects with initially p < 0.05 studies, both positive and non-positive.

Other comments:

Introduction: The study demonstrates how inconclusive replications of studies with an original p > 0.05 can be, and employs equivalence tests and Bayes factor approaches to illustrate this concept. Interestingly, the study reveals that achieving a success rate of 11 out of 15, or 73%, as was accomplished with the non-significance criterion from the RPCB (Reproducibility Project: Cancer Biology), requires unrealistic margins of Δ > 2 for equivalence testing.

Overall picture vs. case-by-case scenario: An interesting finding is that the authors observe that in most cases, there is no substantial evidence for either the absence or the presence of an effect, as evidenced by the equivalence tests. Thus, using both suggested criteria results in a picture similar to the one initially raised by the paper itself. The work done by the authors highlights additional criteria that can be used to further analyze replication success on a case-by-case basis, and I believe that this is where the paper's main contributions lie. Despite not changing the overall picture much, I agree that the p-value criterion by itself does not distinguish between (1) a situation where the original study had low statistical power, resulting in a highly inconclusive non-significant result that does not provide evidence for the absence of an effect and (2) a scenario where the original study was adequately powered, and a non-significant result may indeed provide some evidence for the absence of an effect when analyzed with appropriate methods. Equivalence testing and Bayes factor approaches are valuable tools in both cases.

Regarding the 0.05 threshold, the choice of the prior distribution for the SMD under the alternative H1 is debatable, and this also applies to the equivalence margin. Sensitivity analyses, as highlighted by the authors, are helpful in these scenarios.

Thank you for the thorough review and constructive feedback. We have added an additional “Appendix B: Null results from the RPP and EPRP” that shows equivalence testing and Bayes factor analyses for the RPP and EPRP null results.

Reviewer #3 (Public Review):

Summary:

The paper points out that non-significance in both the original study and a replication does not ensure that the studies provide evidence for the absence of an effect. Nor can it be considered a "replication success". The main point of the paper is rather obvious. It may be that both studies are underpowered, in which case their non-significance does not prove anything. The absence of evidence is not evidence of absence! On the other hand, statistical significance is a confusing concept for many, so some extra clarification is always welcome.

One might wonder if the problem that the paper addresses is really a big issue. The authors point to the "Reproducibility Project: Cancer Biology" (RPCB, Errington et al., 2021). They criticize Errington et al. because they "explicitly defined null results in both the original and the replication study as a criterion for replication success." This is true in a literal sense, but it is also a little bit uncharitable. Errington et al. assessed replication success of "null results" with respect to 5 criteria, just one of which was statistical (non-)significance.

It is very hard to decide if a replication was "successful" or not. After all, the original significant result could have been a false positive, and the original null-result a false negative. In light of these difficulties, I found the paper of Errington et al. quite balanced and thoughtful. Replication has been called "the cornerstone of science" but it turns out that it's actually very difficult to define "replication success". I find the paper of Pawel, Heyard, Micheloud, and Held to be a useful addition to the discussion.

Strengths:

This is a clearly written paper that is a useful addition to the important discussion of what constitutes a successful replication.

Weaknesses:

To me, it seems rather obvious that non-significance in both the original study and a replication does not ensure that the studies provide evidence for the absence of an effect. I'm not sure how often this mistake is made.

Thanks for the feedback. We do not have systematic data on how often the mistake of confusing absence of evidence with evidence of absence has been made in the replication context, but we do know that it has been made in at least three prominent large-scale replication projects (the RPP, EPRP, and RPCB). We therefore believe that there is a need for our article.

Moreover, we agree that the RPCB provided a nuanced assessment of replication success using five different criteria for the original null results. We emphasize this now more in the “Introduction” section. However, we do not consider our article as “a little bit uncharitable” to the RPCB, as we discuss all other criteria used in the RPCB and note that our intent is not to diminish the important contributions of the RPCB, but rather to build on their work and provide constructive recommendations for future researchers. Furthermore, in response to comments made by Reviewer #2, we have added an additional “Appendix B: Null results from the RPP and EPRP” that shows equivalence testing and Bayes factor analyses for null results from two other replication projects, where the same issue arises.

Reviewer #1 (Recommendations For The Authors):

The authors may wish to address the dichotomy issue I raise above, either in the analysis or in the discussion.

Thank you, we now emphasize that Bayes factors and TOST p-values do not need to be dichotomized but can be interpreted as quantitative measures of evidence, both in the “Methods for assessing replicability of null results” and the “Conclusions” sections.

Reviewer #2 (Recommendations For The Authors):

Given that, here follow additional suggestions that the authors should consider in light of the manuscript's word count limit, to avoid confusing the paper's main idea:

2) Referencing: Could you reference the three interesting cases among the 15 RPCB null results (specifically, the three effects from the original paper #48) where the Bayes factor differs qualitatively from the equivalence test?

We now explicitly cite the original and replication study from paper #48.

3) Equivalence testing: As the authors state, only 4 out of the 15 study pairs are able to establish replication success at the 5% level, in the sense that both the original and the replication 90% confidence intervals fall within the equivalence range. Among these 4, two (Paper #48, Exp #2, Effect #5 and Paper #48, Exp #2, Effect #6) were initially positive with very low p-values, one (Paper #48, Exp #2, Effect #4) had an initial p of 0.06 and was very precisely estimated, and the only one in which equivalence testing provides a clearer picture of replication success is Paper #41, Exp #2, Effect #1, which had an initial p-value of 0.54 and a replication p-value of 0.05. In this latter case (or in all these ones), one might question whether the "liberal" equivalence range of Δ = 0.74 is the most appropriate. As the authors state, "The post-hoc specification of equivalence margins is controversial."

We agree that the post hoc choice of equivalence ranges is a controversial issue. The margins define an equivalence region where effect sizes are considered practically negligible, and we agree that in many contexts SMD = 0.74 is a large effect size that is not practically negligible. We therefore present sensitivity analyses for a wide range of margins. However, we do not think that the choice of this margin is more controversial for the mentioned studies with low p-values than for other studies with greater p-values, since the question of whether a margin plausibly encodes practically negligible effect sizes is not related to the observed p-value of a study. Nevertheless, for the new analyses of the RPP and EPRP data in Appendix B, we have added additional sensitivity analyses showing how the individual TOST p-values and Bayes factors vary as a function of the margin and the prior standard deviation. We think that these analyses provide readers with an even more transparent picture regarding the implications of the choice of these parameters than the “project-wise” sensitivity analyses in Appendix A.
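
The kind of margin sensitivity analysis described in this response can be sketched in Python as follows. This is a minimal illustration under a normal approximation, not the authors' analysis. The function names and the study values (estimate 0.2, standard error 0.25) are hypothetical, and the grid of margins is chosen only for illustration, although 0.74 and 2 are values mentioned in this review.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def tost_p_value(estimate, se, margin):
    """TOST p-value for H0: |effect| >= margin (normal approximation)."""
    p_lower = 1 - STD_NORMAL.cdf((estimate + margin) / se)
    p_upper = STD_NORMAL.cdf((estimate - margin) / se)
    return max(p_lower, p_upper)


def smallest_margin(estimate, se, alpha=0.05):
    """Margin at which the TOST p-value equals alpha; larger margins give p < alpha."""
    return abs(estimate) + STD_NORMAL.inv_cdf(1 - alpha) * se


# Hypothetical study (illustrative numbers, not from any replication project)
estimate, se = 0.2, 0.25
for margin in (0.3, 0.5, 0.74, 1.0, 2.0):
    p = tost_p_value(estimate, se, margin)
    print(f"margin = {margin:4.2f}  TOST p = {p:.4f}")
print(f"smallest margin with p < 0.05: {smallest_margin(estimate, se):.3f}")
```

Reporting the smallest margin at which equivalence would be established lets readers judge for themselves whether that margin plausibly encodes a practically negligible effect.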

4) Bayes factor suggestions: For the Bayes factor approach, it would be interesting to discuss examples where the BF differs slightly. This is likely to occur in scenarios where sample sizes differ significantly between the original study and replication. For example, in Paper #48, Exp #2 and Effect #4, the initial p is 0.06, but the BF is 8.1. In the replication, the BF dramatically drops to < 1/1000, as does the p-value. The initial evidence of 8.1 indicates some evidence for the absence of an effect, but not strong evidence ("strong evidence for H0"), whereas a p-value of 0.06 does not lead to such a conclusion; instead, it favors H1. It would be interesting if the authors discussed other similar cases in the paper. It's worth noting that in Paper #5, Exp #1, Effect #3, the replication p-value is 0.99, while the BF01 is 2.4, almost indicating "moderate" evidence for H0, even though the p-value is inconclusive.

We agree that some of the examples nicely illustrate conceptual differences between p-values and Bayes factors, e.g., how they take into account sample size and effect size. As methodologists, we find these aspects interesting ourselves, but we think that emphasizing them is beyond the scope of the paper and would distract eLife readers from the main messages.

Concerning the conceptual differences between Bayes factors and TOST p-values, we already discuss a case where there are qualitative differences in more detail (original paper #48). We added another discussion of this phenomenon in Appendix B, as it also occurs for the replication of Ranganath and Nosek (2008) that was part of the RPP.

5) p-values, magnitude and precision: It's noteworthy to emphasize, if the authors decide to discuss this, that the p-value is influenced by both the effect's magnitude and its precision, so in Paper #9, Exp #2, Effect #6, BF01 = 4.1 has a higher p-value than a BF01 = 2.3 in its replication. However, there are cases where both p-values and BF agree. For example, in Paper #15, Exp #2, Effect #2, both the original and replication studies have similar sample sizes, and as the p-value decreases from p = 0.95 to p = 0.23, BF01 decreases from 5.1 ("moderate evidence for H0") to 1.3 (region of "Absence of evidence"), moving away from H0 in both cases. This also occurs in Paper #24, Exp #3, Effect #6.

We appreciate the suggestions but, as explained before, think that the message of our paper is better understood without additional discussion of more general differences between p-values and Bayes factors.
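
For readers who want to see how a p-value and a Bayes factor can respond differently to effect size and precision, here is a minimal Python sketch. It assumes a normal likelihood for the effect estimate and a zero-centred normal prior under H1; the prior standard deviation of 2, the function names, and both pairs of estimate and standard error are illustrative assumptions and are not taken from the article or the RPCB data.

```python
from statistics import NormalDist


def bf01(estimate, se, prior_sd):
    """Bayes factor for H0: effect = 0 against H1: effect ~ N(0, prior_sd^2)."""
    marginal_h0 = NormalDist(0, se).pdf(estimate)
    # Under H1 the estimate is marginally normal with inflated variance
    marginal_h1 = NormalDist(0, (se**2 + prior_sd**2) ** 0.5).pdf(estimate)
    return marginal_h0 / marginal_h1


def two_sided_p(estimate, se):
    """Two-sided p-value for H0: effect = 0 (normal approximation)."""
    return 2 * (1 - NormalDist().cdf(abs(estimate) / se))


PRIOR_SD = 2.0  # illustrative assumption

# Two hypothetical studies with the same z-value but different precision
for estimate, se in [(0.1, 0.05), (1.0, 0.5)]:
    print(
        f"estimate = {estimate:.2f}, se = {se:.2f}: "
        f"p = {two_sided_p(estimate, se):.4f}, "
        f"BF01 = {bf01(estimate, se, PRIOR_SD):.2f}"
    )
```

Both hypothetical studies have the same two-sided p-value, yet the precise study with the small estimate yields a BF01 above 1 and the imprecise study with the large estimate yields a BF01 below 1.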

6) The grey zone: Given the above topic, it is important to highlight that in the "Absence of evidence grey zone" for the null hypothesis, for example, in Paper #5, Exp #1, Effect #3 with a p = 0.99 and a BF01 = 2.4 in the replication, BF and p-values reach similar conclusions. It's interesting to note, as the authors emphasize, that Dawson et al. (2011), Exp #2, Effect #2 is an interesting example, as the p-value decreases, favoring H1, likely due to the effect's magnitude, even with a small sample size (n = 3 in both original and replications). Bayes factors are very close to one due to the small sample sizes, as discussed by the authors.

We appreciate the constructive comments. We think that the two examples from Dawson et al. (2011) and Goetz et al. (2011) already nicely illustrate absence of evidence and evidence of absence, respectively, and therefore decided not to discuss additional examples in detail, to avoid redundancy.

7) Using meta-analytical results (?): For papers from RPCB, comparing the initial study with the meta-analytical results using Bayes factor and equivalence testing approaches (thus, increasing the sample size of the analysis, but creating dependency of results since the initial study would affect the meta-analytical one) could change the conclusions. This would be interesting to explore in initial studies that are replicated by much larger ones, such as: Paper #9, Exp #2, Effect #6; Goetz et al. (2011), Exp #1, Effect #1; Paper #28, Exp #3, Effect #3; Paper #41, Exp #2, Effect #1; and Paper #47, Exp #1, Effect #5.

Thank you for the suggestion. We considered adding meta-analytic TOST p-values and Bayes factors before, but decided that Figure 3 and the results section are already quite technical, so adding more analyses may confuse more than help. Nevertheless, these meta-analytic approaches are discussed in the “Conclusions” section.
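
As a rough illustration of the pooling step that such a meta-analytic approach would involve, the following Python sketch combines an original and a replication estimate with inverse-variance (fixed-effect) weights. The choice of a fixed-effect model, the function name, and the numbers are assumptions made for illustration only; the article's own treatment is in its “Conclusions” section.

```python
def fixed_effect_pool(estimates, standard_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1 / se**2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = total_weight ** -0.5
    return pooled, pooled_se


# Hypothetical small original study and much larger replication
pooled, pooled_se = fixed_effect_pool([0.3, 0.05], [0.4, 0.1])
print(f"pooled estimate = {pooled:.3f}, pooled se = {pooled_se:.3f}")
```

The pooled estimate and standard error could then be passed to an equivalence test or a Bayes factor calculation. As the reviewer notes, the pooled result is not independent of the original study, and a much larger replication dominates the weights.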

8) Other samples of fields of science: It would be interesting to investigate whether using Bayes factors and equivalence tests in addition to p-values results in a clearer scenario when applied to replication data from other fields. As mentioned by the authors, the Reproducibility Project: Experimental Philosophy (RPEP) and the Reproducibility Project: Psychology (RPP) have data attempting to replicate some original studies with null results. While the RPCB analysis yielded a similar picture when using both criteria, it is worth exploring whether this holds true for RPP and RPEP. Considerations for further research in this direction are suggested. Even if the original null results were excluded in the calculation of an overall replicability rate based on significance, sensitivity analyses considering them could have been conducted. The present authors can demonstrate replication success using the significance criteria in these two projects with initially p < 0.05 studies, both positive and non-positive.

Thank you for the excellent suggestion. We added an Appendix B where the null results from the RPP and EPRP are analyzed with our proposed approaches. The results are also discussed in the “Results” and “Conclusions” sections.

9) Other approaches: I am curious about the potential impact of using an approach based on equivalence testing (as described in https://arxiv.org/abs/2308.09112). It would be valuable if the authors could run such analyses or reference the mentioned work.

Thank you. We were unaware of this preprint. It seems related to the framework proposed by Stahel W. A. (2021) New relevance and significance measures to replace p-values. PLoS ONE 16(6): e0252991. https://doi.org/10.1371/journal.pone.0252991

We now cite both papers in the discussion.

10) Additional evidence: There is another study in which replications of initially p > 0.05 studies with p > 0.05 replications were also considered as replication successes. You can find it here: https://www.medrxiv.org/content/10.1101/2022.05.31.22275810v2. Although it involves a small sample of initially p > 0.05 studies with already large sample sizes, the work is currently under consideration for publication in PLOS ONE, and all data and materials can be accessed through OSF (links provided in the work).

Thank you for sharing this interesting study with us. We feel that it is beyond the scope of the paper to include further analyses as there are already analyses of the RPCB, RPP, and EPRP null results. However, we will keep this study in mind for future analysis, especially since all data are openly available.

11) Additional evidence 02: Ongoing replication projects, such as the Brazilian Reproducibility Initiative (BRI) and The Sports Replication Centre (https://ssreplicationcentre.com/), continue to generate valuable data. BRI is nearing completion of its results, and it promises interesting data for analyzing replication success using p-values, equivalence regions, and Bayes factor approaches.

We now cite these two initiatives as examples of ongoing replication projects in the introduction. As with your previous point, we think that it is beyond the scope of the paper to include further analyses as there are already analyses of the RPCB, RPP, and EPRP null results.

Reviewer #3 (Recommendations For The Authors):

I have no specific recommendations for the authors.

Thank you for the constructive review.

Reviewing Editor (Recommendations For the Authors):

I recognize that it was suggested to the authors by the previous Reviewing Editor to reduce the amount of statistical material to be made more suitable for a non-statistical audience, and so what I am about to say contradicts advice you were given before. But, with this revised version, I actually found it difficult to understand the particulars of the construction of the Bayes Factors and would have appreciated a few more sentences on the underlying models that fed into the calculations. In my opinion, the provided citations (e.g., Dienes Z. 2014. Using Bayes to get the most out of non-significant results) did not provide sufficient background to warrant a lack of more technical presentation here.

Thank you for the feedback. We added a new “Appendix C: Technical details on Bayes factors” that provides technical details on the models, priors, and calculations underlying the Bayes factors.

https://doi.org/10.7554/eLife.92311.2.sa3

Replication of “null results” – Absence of evidence or evidence of absence?

Significance of findings

Strength of evidence

Abstract

Introduction

Null results from the Reproducibility Project: Cancer Biology

Methods for assessing replicability of null results

Frequentist equivalence testing

Bayesian hypothesis testing

Conclusions

Box 1

Equivalence test

Bayes factor

Acknowledgements

Software and data

Appendix A: Sensitivity analyses

Appendix B: Null results from the RPP and EPRP

Appendix C: Technical details on Bayes factors

References

Article and author information

Author information

Samuel Pawel

Rachel Heyard

Charlotte Micheloud

Leonhard Held

Version history

Cite all versions

Copyright

Peer review process

Editors
