# Abstract

In several large-scale replication projects, statistically non-significant results in both the original and the replication study have been interpreted as a “replication success”. Here we discuss the logical problems with this approach: Non-significance in both studies does not ensure that the studies provide evidence for the absence of an effect and “replication success” can virtually always be achieved if the sample sizes are small enough. In addition, the relevant error rates are not controlled. We show how methods, such as equivalence testing and Bayes factors, can be used to adequately quantify the evidence for the absence of an effect and how they can be applied in the replication setting. Using data from the Reproducibility Project: Cancer Biology we illustrate that many original and replication studies with “null results” are in fact inconclusive. We conclude that it is important to also replicate studies with statistically non-significant results, but that they should be designed, analyzed, and interpreted appropriately.

**eLife assessment**

This work provides a **valuable** contribution and assessment of what it means to replicate a null study finding, and what are the appropriate methods for doing so (apart from a rote p-value assessment). Through a **convincing** re-analysis of results from the Reproducibility Project: Cancer Biology using frequentist equivalence testing and Bayes factors, the authors demonstrate that even when reducing 'replicability success' to a single criterion, how precisely replication is measured may yield differing results. Less focus is directed to appropriate replication of non-null findings.

# Introduction

*Absence of evidence is not evidence of absence* – the title of the 1995 paper by Douglas Altman and Martin Bland has since become a mantra in the statistical and medical literature (Altman and Bland, 1995). Yet, the misconception that a statistically non-significant result indicates evidence for the absence of an effect is unfortunately still widespread (Makin and de Xivry, 2019). Such a “null result” – typically characterized by a *p*-value *p* > 0.05 for the null hypothesis of an absent effect – may also occur if an effect is actually present. For example, if the sample size of a study is chosen to detect an assumed effect with a power of 80%, null results will incorrectly occur 20% of the time when the assumed effect is actually present. If the power of the study is lower, null results will occur more often. In general, the lower the power of a study, the greater the ambiguity of a null result. To put a null result in context, it is therefore critical to know whether the study was adequately powered and under what assumed effect the power was calculated (Hoenig and Heisey, 2001; Greenland, 2012). However, if the goal of a study is to explicitly quantify the evidence for the absence of an effect, more appropriate methods designed for this task, such as equivalence testing (Wellek, 2010; Lakens, 2017; Senn, 2021) or Bayes factors (Kass and Raftery, 1995; Goodman, 1999, 2005; Dienes, 2014; Keysers et al., 2020), should be used from the outset.

The interpretation of null results becomes even more complicated in the setting of replication studies. In a replication study, researchers attempt to repeat an original study as closely as possible in order to assess whether consistent results can be obtained with new data (National Academies of Sciences, Engineering, and Medicine, 2019). In the last decade, various large-scale replication projects have been conducted in diverse fields, from the biomedical to the social sciences (Prinz et al., 2011; Begley and Ellis, 2012; Klein et al., 2014; Open Science Collaboration, 2015; Camerer et al., 2016, 2018; Klein et al., 2018; Cova et al., 2018; Errington et al., 2021, among others). The majority of these projects reported alarmingly low replicability rates across a broad spectrum of criteria for quantifying replicability. While most of these projects restricted their focus to original studies with statistically significant results (“positive results”), the *Reproducibility Project: Psychology* (RPP, Open Science Collaboration, 2015), the *Reproducibility Project: Experimental Philosophy* (RPEP, Cova et al., 2018), and the *Reproducibility Project: Cancer Biology* (RPCB, Errington et al., 2021) also attempted to replicate some original studies with null results – either non-significant or interpreted as showing no evidence for a meaningful effect by the original authors.

Although the RPEP and RPP interpreted non-significant results in both original and replication study as a “replication success” for some individual replications (see, for example, the replication of McCann (2005, replication report: https://osf.io/wcm7n) or the replication of Ranganath and Nosek (2008, replication report: https://osf.io/9xt25)), they excluded the original null results in the calculation of an overall replicability rate based on significance. In contrast, the RPCB explicitly defined null results in both the original and the replication study as a criterion for “replication success”. According to this “non-significance” criterion, 11/15 = 73% replications of original null effects were successful. Four additional criteria were used to assess successful replications of original null results: (i) whether the original effect estimate was included in the 95% confidence interval of the replication effect estimate (success rate 11/15 = 73%), (ii) whether the replication effect estimate was included in the 95% confidence interval of the original effect estimate (success rate 12/15 = 80%), (iii) whether the replication effect estimate was included in the 95% prediction interval based on the original effect estimate (success rate 12/15 = 80%), and (iv) whether the *p*-value obtained from combining the original and replication effect estimate with a meta-analysis was non-significant (success rate 10/15 = 67%). Criteria (i) to (iii) are useful for assessing compatibility in effect estimates between the original and the replication study. Their suitability has been extensively discussed in the literature. The prediction interval criterion (iii) or equivalent criteria (e.g., the *Q*-test) are usually recommended because they account for the uncertainty from both studies and have adequate error rates when the true effect sizes are the same (Patil et al., 2016; Mathur and VanderWeele, 2020; Schauer and Hedges, 2021).

While the effect estimate criteria (i) to (iii) can be applied regardless of whether or not the original study was non-significant, the “meta-analytic non-significance” criterion (iv) and the aforementioned non-significance criterion refer specifically to original null results. We believe that there are several logical problems with both, and that it is important to highlight and address them, especially since the non-significance criterion has already been used in three replication projects without much scrutiny. It is crucial to note that it is not our intention to diminish the enormously important contributions of the RPCB, the RPEP, and the RPP, but rather to build on their work and provide recommendations for future replication researchers.

The logical problems with the non-significance criterion are as follows: First, if the original study had low statistical power, a non-significant result is highly inconclusive and does not provide evidence for the absence of an effect. It is then unclear what exactly the goal of the replication should be – to replicate the inconclusiveness of the original result? On the other hand, if the original study was adequately powered, a non-significant result may indeed provide some evidence for the absence of an effect when analyzed with appropriate methods, so that the goal of the replication is clearer. However, the criterion by itself does not distinguish between these two cases. Second, with this criterion researchers can virtually always achieve replication success by conducting a replication study with a very small sample size, such that the *p*-value is non-significant and the result is inconclusive. This is because the null hypothesis under which the *p*-value is computed is misaligned with the goal of inference, which is to quantify the evidence for the absence of an effect. Third, the criterion does not control the error of falsely claiming the absence of an effect at a predetermined rate. This is in contrast to the standard criterion for replication success, which requires significance from both studies (also known as the two-trials rule, see Section 12.2.8 in Senn, 2021), and ensures that the error of falsely claiming the presence of an effect is controlled at a rate equal to the squared significance level (for example, 5% × 5% = 0.25% for a 5% significance level). The non-significance criterion may be intended to complement the two-trials rule for null results. However, it fails to do so in this respect, which may be required by regulators and funders. These logical problems are equally applicable to the meta-analytic non-significance criterion.
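The second problem can be made concrete with a small calculation. The following Python sketch is an illustration rather than part of the original analysis: it assumes a two-group comparison with equal group sizes, a normal approximation with standard error √(2/*n*) for the SMD, and a hypothetical true effect of SMD = 0.5. The function names are ours. It computes the probability that both studies are non-significant even though an effect is present.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def prob_nonsignificant(smd, n_per_group, alpha=0.05):
    """Probability of a two-sided p-value > alpha when the true standardized
    mean difference is `smd` (normal approximation, equal group sizes)."""
    se = (2 / n_per_group) ** 0.5            # assumed standard error of the SMD
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = smd / se                          # true effect in standard-error units
    # P(|Z + shift| <= z_crit) for Z ~ N(0, 1)
    return STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift)


def prob_success_nonsignificance(smd, n_orig, n_rep, alpha=0.05):
    """Probability that two independent studies are both non-significant,
    that is, that the non-significance criterion declares 'success'."""
    return (prob_nonsignificant(smd, n_orig, alpha)
            * prob_nonsignificant(smd, n_rep, alpha))


# Hypothetical true effect SMD = 0.5: the smaller the studies,
# the more likely the criterion is met.
for n in (3, 10, 50, 200):
    print(n, round(prob_success_nonsignificance(0.5, n, n), 3))
```

Under these assumptions the probability of "success" approaches (1 − *α*)^{2} as the sample sizes shrink, whatever the true effect, which is why the criterion rewards small studies.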

In the following, we present two principled approaches for analyzing replication studies of null results – frequentist equivalence testing and Bayesian hypothesis testing – that can address the limitations of the non-significance criterion. We use the null results replicated in the RPCB to illustrate the problems of the non-significance criterion and how they can be addressed. We conclude the paper with practical recommendations for analyzing replication studies of original null results, including simple R code for applying the proposed methods.

# Null results from the Reproducibility Project: Cancer Biology

Figure 1 shows effect estimates on the standardized mean difference (SMD) scale with 95% confidence intervals from two RPCB study pairs. In both study pairs, the original and replication studies are “null results” and therefore meet the non-significance criterion for replication success (the two-sided *p*-values are greater than 0.05 in both the original and the replication study). The same is true when applying the meta-analytic non-significance criterion (the two-sided *p*-values of the meta-analyses *p*_{MA} are greater than 0.05). However, intuition would suggest that the conclusions in the two pairs are very different.

The original study from Dawson et al. (2011) and its replication both show large effect estimates in magnitude, but due to the very small sample sizes, the uncertainty of these estimates is large, too. With such low sample sizes, the results seem inconclusive. In contrast, the effect estimates from Goetz et al. (2011) and its replication are much smaller in magnitude and their uncertainty is also smaller because the studies used larger sample sizes. Intuitively, the results seem to provide more evidence for a zero (or negligibly small) effect. While these two examples show the qualitative difference between absence of evidence and evidence of absence, we will now discuss how the two can be quantitatively distinguished.

# Methods for assessing replicability of null results

There are both frequentist and Bayesian methods that can be used for assessing evidence for the absence of an effect. Anderson and Maxwell (2016) provide an excellent summary in the context of replication studies in psychology. We now briefly discuss two possible approaches – frequentist equivalence testing and Bayesian hypothesis testing – and their application to the RPCB data.

## Frequentist equivalence testing

Equivalence testing was developed in the context of clinical trials to assess whether a new treatment – typically cheaper or with fewer side effects than the established treatment – is practically equivalent to the established treatment (Wellek, 2010; Lakens, 2017). The method can also be used to assess whether an effect is practically equivalent to an absent effect, usually zero. Using equivalence testing as a way to put non-significant results into context has been suggested by several authors (Hauck and Anderson, 1986; Campbell and Gustafson, 2018). The main challenge is to specify the margin Δ > 0 that defines an equivalence range [−Δ, +Δ] in which an effect is considered as absent for practical purposes. The goal is then to reject the null hypothesis that the true effect is outside the equivalence range. This is in contrast to the usual null hypotheses of superiority tests which state that the effect is zero or smaller than zero, see Figure 2 for an illustration.

To ensure that the null hypothesis is falsely rejected at most *α* × 100% of the time, the standard approach is to declare equivalence if the (1 − 2*α*) × 100% confidence interval for the effect is contained within the equivalence range, for example, a 90% confidence interval for *α* = 5% (Westlake, 1972). This procedure is equivalent to declaring equivalence when two one-sided tests (TOST) for the null hypotheses of the effect being greater/smaller than +Δ and −Δ, are both significant at level *α* (Schuirmann, 1987). A quantitative measure of evidence for the absence of an effect is then given by the maximum of the two one-sided *p*-values (the TOST *p*-value). A natural criterion for replication success of original null results may therefore be to require that both the original and the replication TOST *p*-values are smaller than some level *α* (conventionally *α* = 0.05). Equivalently, the criterion would require the (1 − 2*α*) × 100% confidence intervals of the original and the replication to be included in the equivalence region. In contrast to the non-significance criterion, this criterion controls the error of falsely claiming replication success at level *α*^{2} when there is a true effect outside the equivalence margin, thus complementing the usual two-trials rule in drug regulation (Senn, 2021, Section 12.2.8).
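As an illustration, the TOST *p*-value and the confidence interval rule can be sketched in a few lines of Python. The sketch assumes a normally distributed effect estimate with known standard error; the function names and the example numbers are hypothetical and are not taken from the RPCB data.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def tost_pvalue(estimate, se, margin, null_value=0.0):
    """Maximum of the two one-sided p-values (normal model, known se)."""
    # H0: effect >= null_value + margin
    p_upper = STD_NORMAL.cdf((estimate - null_value - margin) / se)
    # H0: effect <= null_value - margin
    p_lower = 1 - STD_NORMAL.cdf((estimate - null_value + margin) / se)
    return max(p_upper, p_lower)


def ci_within_range(estimate, se, margin, alpha=0.05, null_value=0.0):
    """Is the (1 - 2*alpha) confidence interval inside the equivalence range?"""
    z = STD_NORMAL.inv_cdf(1 - alpha)
    lower, upper = estimate - z * se, estimate + z * se
    return null_value - margin <= lower and upper <= null_value + margin


# Hypothetical estimates: same point estimate, different precision
for est, se in [(0.1, 0.2), (0.1, 0.6)]:
    p = tost_pvalue(est, se, margin=0.74)
    print(est, se, round(p, 4), ci_within_range(est, se, margin=0.74))
```

Both functions lead to the same decision: the TOST *p*-value is at most *α* exactly when the (1 − 2*α*) × 100% confidence interval lies inside the equivalence range.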

Returning to the RPCB data, Figure 3 shows the standardized mean difference effect estimates with 90% confidence intervals for all 15 effects which were treated as null results by the RPCB. Most of them showed non-significant *p*-values (*p* > 0.05) in the original study. It is noteworthy, however, that two effects from the second experiment of the original paper 48 were regarded as null results despite their statistical significance. According to the non-significance criterion (requiring *p* > 0.05 in original and replication study), there are 11 “successes” out of a total of 15 null effects, as reported in Table 1 from Errington et al. (2021).

We will now apply equivalence testing to the RPCB data. The dotted red lines in Figure 3 represent an equivalence range for the margin Δ = 0.74, which Wellek (2010, Table 1.1) classifies as “liberal”. However, even with this generous margin, only 4 of the 15 study pairs are able to establish replication success at the 5% level, in the sense that both the original and the replication 90% confidence interval fall within the equivalence range (or, equivalently, that their TOST *p*-values are smaller than 0.05). For the remaining 11 study pairs, the situation remains inconclusive and there is no evidence for the absence or the presence of the effect. For instance, the previously discussed example from Goetz et al. (2011) marginally fails the criterion (*p*_{TOST} = 0.06 in the original study and *p*_{TOST} = 0.04 in the replication), while the example from Dawson et al. (2011) is a clearer failure (*p*_{TOST} = 0.75 in the original study and *p*_{TOST} = 0.88 in the replication) as both effect estimates even lie outside the equivalence margin.

The post-hoc specification of equivalence margins is controversial. Ideally, the margin should be specified on a case-by-case basis in a pre-registered protocol before the studies are conducted by researchers familiar with the subject matter. In the social and medical sciences, the conventions of Cohen (1992) are typically used to classify SMD effect sizes (SMD = 0.2 small, SMD = 0.5 medium, SMD = 0.8 large). While effect sizes are typically larger in preclinical research, it seems unrealistic to specify margins larger than 1 on SMD scale to represent effect sizes that are absent for practical purposes. It could also be argued that the chosen margin Δ = 0.74 is too lax compared to margins commonly used in clinical research (Lange and Freitag, 2005). We therefore report a sensitivity analysis regarding the choice of the margin in Figure 4 in the Appendix. This analysis shows that for realistic margins between 0 and 1, the proportion of replication successes remains below 50% for the conventional *α* = 0.05 level. To achieve a success rate of 11/15 = 73%, as was achieved with the non-significance criterion from the RPCB, unrealistic margins of Δ > 2 are required.

## Bayesian hypothesis testing

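The distinction between absence of evidence and evidence of absence is naturally built into the Bayesian approach to hypothesis testing. A central measure of evidence is the Bayes factor (Kass and Raftery, 1995; Goodman, 1999; Dienes, 2014; Keysers et al., 2020), which is the updating factor of the prior odds to the posterior odds of the null hypothesis *H*_{0} versus the alternative hypothesis *H*_{1}:

$$
\frac{\Pr(H_0 \mid \text{data})}{\Pr(H_1 \mid \text{data})} = \frac{\Pr(H_0)}{\Pr(H_1)} \times \underbrace{\frac{p(\text{data} \mid H_0)}{p(\text{data} \mid H_1)}}_{\text{Bayes factor } \mathrm{BF}_{01}}.
$$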

The Bayes factor BF_{01} quantifies how much the observed data have increased or decreased the probability Pr(*H*_{0}) of the null hypothesis relative to the probability Pr(*H*_{1}) of the alternative. If the null hypothesis states the absence of an effect, a Bayes factor greater than one (BF_{01} > 1) indicates evidence for the absence of the effect and a Bayes factor smaller than one indicates evidence for the presence of the effect (BF_{01} < 1), whereas a Bayes factor not much different from one indicates absence of evidence for either hypothesis (BF_{01} ≈ 1). A reasonable criterion for successful replication of a null result may hence be to require both studies to report a Bayes factor larger than some level *γ* > 1, for example, *γ* = 3 or *γ* = 10 which are conventional levels for “substantial” and “strong” evidence, respectively (Jeffreys, 1961). In contrast to the non-significance criterion, this criterion provides a genuine measure of evidence that can distinguish absence of evidence from evidence of absence.

The main challenge with Bayes factors is the specification of the effect under the alternative hypothesis *H*_{1}. The assumed effect under *H*_{1} is directly related to the Bayes factor, and researchers who assume different effects will end up with different Bayes factors. Instead of specifying a single effect, one therefore typically specifies a “prior distribution” of plausible effects. Importantly, the prior distribution, like the equivalence margin, should be determined by researchers with subject knowledge and before the data are collected.

To compute the Bayes factors for the RPCB null results, we used the observed effect estimates as the data and assumed a normal sampling distribution for them (Dienes, 2014), as typically done in a meta-analysis. The Bayes factors BF_{01} shown in Figure 3 then quantify the evidence for the null hypothesis of no effect against the alternative hypothesis that there is an effect using a normal “unit-information” prior distribution (Kass and Wasserman, 1995) for the effect size under the alternative *H*_{1}. We see that in most cases there is no substantial evidence for either the absence or the presence of an effect, as with the equivalence tests. For instance, with a lenient Bayes factor threshold of 3, only 1 of the 15 replications is successful, in the sense of having BF_{01} > 3 in both the original and the replication study. The Bayes factors for the two previously discussed examples are consistent with our intuitions – in the Goetz et al. (2011) example there is indeed substantial evidence for the absence of an effect (BF_{01} = 5 in the original study and BF_{01} = 4.1 in the replication), while in the Dawson et al. (2011) example there is even weak evidence for the *presence* of an effect, though the Bayes factors are very close to one due to the small sample sizes (BF_{01} = 1/1.1 in the original study and BF_{01} = 1/1.8 in the replication).
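The computation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for Figure 3: the effect estimate is treated as normally distributed with known standard error, the prior under *H*_{1} is a normal distribution centred at zero with standard deviation 2 (the unit-information standard deviation on the SMD scale), and the function name and example estimates are hypothetical.

```python
from math import exp, log, sqrt


def bf01_normal_prior(estimate, se, prior_mean=0.0, prior_sd=2.0, null_value=0.0):
    """Bayes factor for H0: effect = null_value against H1: effect ~ N(prior_mean, prior_sd^2).

    Ratio of two normal densities of the estimate: N(null_value, se^2) under H0
    and N(prior_mean, se^2 + prior_sd^2) under H1.
    """
    var = se ** 2
    marginal_var = var + prior_sd ** 2   # variance of the estimate under H1
    log_bf = (0.5 * log(marginal_var / var)
              - 0.5 * ((estimate - null_value) ** 2 / var
                       - (estimate - prior_mean) ** 2 / marginal_var))
    return exp(log_bf)


# Hypothetical estimates: a precise estimate near zero versus an imprecise one
print(bf01_normal_prior(0.1, 0.2))   # precise, close to zero
print(bf01_normal_prior(1.0, 1.0))   # imprecise
```

With these illustrative inputs, the precise estimate near zero yields a Bayes factor above the threshold of 3, whereas the imprecise estimate yields a Bayes factor between 1/3 and 3, that is, absence of evidence rather than evidence of absence.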

As with the equivalence margin, the choice of the prior distribution for the SMD under the alternative *H*_{1} is debatable. The normal unit-information prior seems to be a reasonable default choice, as it implies that small to large effects are plausible under the alternative, but other normal priors with smaller/larger standard deviations could have been considered to make the test more sensitive to smaller/larger true effect sizes. The sensitivity analysis in the appendix therefore also includes an analysis on the effect of varying prior standard deviations and the Bayes factor thresholds. However, again, to achieve replication success for a larger proportion of replications than the observed 1/15 = 7%, unreasonably large prior standard deviations have to be specified.

Of note, among the 15 RPCB null results, there are three interesting cases (the three effects from original paper 48) where the Bayes factor is qualitatively different from the equivalence test, revealing a fundamental difference between the two approaches. The Bayes factor is concerned with testing whether the effect is *exactly zero*, whereas the equivalence test is concerned with whether the effect is within an *interval around zero*. Due to the very large sample size in the original study (*n* = 514) and the replication (*n* = 1,153), the data are incompatible with an exactly zero effect, but compatible with effects within the equivalence range. Apart from this example, however, both approaches lead to the same qualitative conclusion – most RPCB null results are highly ambiguous.

# Conclusions

The concept of “replication success” is inherently multifaceted. Reducing it to a single criterion seems to be an oversimplification. Nevertheless, we believe that the “non-significance” criterion – declaring a replication as successful if both the original and the replication study produce non-significant results – is not fit for purpose. This criterion does not ensure that both studies provide evidence for the absence of an effect, it can be easily achieved for any outcome if the studies have sufficiently small sample sizes, and it does not control the relevant error rates. While it is important to replicate original studies with null results, we believe that they should be analyzed using more informative approaches. Box 1 summarizes our recommendations.

Box 1

Recommendations for the analysis of replication studies of original null results. Calculations are based on effect estimates *θ̂*_{i} with standard errors *σ*_{i} from an original study (*i* = *o*) and its replication (*i* = *r*). Both effect estimates are assumed to be normally distributed around the true effect size *θ* with known variance *σ*_{i}^{2}. The effect size *θ*_{0} represents the value of no effect, typically *θ*_{0} = 0.

## Equivalence test

1. Specify a margin Δ > 0 that defines an equivalence range [*θ*_{0} − Δ, *θ*_{0} + Δ] in which effects are considered absent for practical purposes.
2. Compute the TOST *p*-values for original (*i* = *o*) and replication (*i* = *r*) data

   $$
   p_{\mathrm{TOST},i} = \max\left\{ \Phi\left(\frac{\hat{\theta}_i - \theta_0 - \Delta}{\sigma_i}\right),\; 1 - \Phi\left(\frac{\hat{\theta}_i - \theta_0 + \Delta}{\sigma_i}\right) \right\},
   $$

   with Φ(⋅) the cumulative distribution function of the standard normal distribution.
3. Declare replication success at level *α* if *p*_{TOST,o} ≤ *α* and *p*_{TOST,r} ≤ *α*, conventionally *α* = 0.05.
4. Perform a sensitivity analysis with respect to the margin Δ. For example, visualize the TOST *p*-values for different margins to assess the robustness of the conclusions.

## Bayes factor

1. Specify a prior distribution for the effect size *θ* that represents plausible values under the alternative hypothesis that there is an effect (*H*_{1}: *θ* ≠ *θ*_{0}). For example, specify the mean *m* and standard deviation *s* of a normal distribution *θ* | *H*_{1} ∼ N(*m*, *s*^{2}).
2. Compute the Bayes factors contrasting *H*_{0}: *θ* = *θ*_{0} to *H*_{1}: *θ* ≠ *θ*_{0} for original (*i* = *o*) and replication (*i* = *r*) data. Assuming a normal prior distribution, the Bayes factor is

   $$
   \mathrm{BF}_{01,i} = \sqrt{1 + \frac{s^2}{\sigma_i^2}} \, \exp\left[ -\frac{1}{2} \left\{ \frac{(\hat{\theta}_i - \theta_0)^2}{\sigma_i^2} - \frac{(\hat{\theta}_i - m)^2}{\sigma_i^2 + s^2} \right\} \right].
   $$
3. Declare replication success at level *γ* > 1 if BF_{01,o} ≥ *γ* and BF_{01,r} ≥ *γ*, conventionally *γ* = 3 (substantial evidence) or *γ* = 10 (strong evidence).
4. Perform a sensitivity analysis with respect to the prior distribution. For example, visualize the Bayes factors for different prior standard deviations to assess the robustness of the conclusions.
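The recommendations in Box 1 can be combined into a short Python sketch. It is an illustration under the normal model with known standard errors, with the TOST *p*-value and the normal-prior Bayes factor written out in the code. The function names, the study pair, the margins, and the prior standard deviations are hypothetical.

```python
from math import exp, log
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def p_tost(estimate, se, margin, theta0=0.0):
    """TOST p-value: maximum of the two one-sided p-values."""
    return max(PHI((estimate - theta0 - margin) / se),
               1 - PHI((estimate - theta0 + margin) / se))


def bf01(estimate, se, prior_mean, prior_sd, theta0=0.0):
    """Bayes factor for H0: theta = theta0 against a N(prior_mean, prior_sd^2) prior."""
    var, marginal_var = se ** 2, se ** 2 + prior_sd ** 2
    return exp(0.5 * log(marginal_var / var)
               - 0.5 * ((estimate - theta0) ** 2 / var
                        - (estimate - prior_mean) ** 2 / marginal_var))


def assess_pair(orig, rep, margin, prior_sd, alpha=0.05, gamma=3.0, prior_mean=0.0):
    """Apply both criteria to an (estimate, standard error) pair of studies."""
    p_values = [p_tost(est, se, margin) for est, se in (orig, rep)]
    bayes_factors = [bf01(est, se, prior_mean, prior_sd) for est, se in (orig, rep)]
    return {
        "p_tost": p_values,
        "bf01": bayes_factors,
        "equivalence_success": max(p_values) <= alpha,
        "bayes_factor_success": min(bayes_factors) >= gamma,
    }


# Hypothetical study pair (SMD estimate, standard error)
original, replication = (0.05, 0.25), (-0.10, 0.20)

# Sensitivity analysis: vary the margin, then the prior standard deviation
for margin in (0.25, 0.5, 0.74, 1.0):
    print("margin", margin, assess_pair(original, replication, margin, 2.0)["equivalence_success"])
for prior_sd in (0.5, 1.0, 2.0, 4.0):
    print("prior sd", prior_sd, assess_pair(original, replication, 0.74, prior_sd)["bayes_factor_success"])
```

Printing the decisions over a grid of margins and prior standard deviations is a plain-text stand-in for the sensitivity plots recommended above.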

Our reanalysis of the RPCB studies with original null results showed that for most studies that meet the non-significance criterion, the conclusions are much more ambiguous – both with frequentist and Bayesian analyses. While the exact success rate depends on the equivalence margin and the prior distribution, our sensitivity analyses show that even with unrealistically liberal choices, the success rate remains below 40% which is substantially lower than the 73% success rate based on the non-significance criterion.

This is not unexpected, as a study typically requires larger sample sizes to detect the absence of an effect than to detect its presence (Matthews, 2006, Section 11.5.3). Of note, the RPCB sample sizes were chosen so that each replication had at least 80% power to detect the original effect estimate based on a standard superiority test. However, the design of replication studies should ideally align with the planned analysis (Anderson and Kelley, 2022), so if the goal of the study is to find evidence for the absence of an effect, the replication sample size should be determined based on a test for equivalence; see Flight and Julious (2015) and Pawel et al. (2023) for frequentist and Bayesian approaches, respectively.
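The difference in required sample size can be illustrated with a rough power calculation. The Python sketch below is not the method of Flight and Julious (2015) or Pawel et al. (2023). It assumes a two-group design with equal group sizes, a normal approximation with standard error √(2/*n*) for the SMD, and the hypothetical value 0.74 both as the effect to detect and as the equivalence margin. All function names are ours.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def power_superiority(smd, n_per_group, alpha=0.05):
    """Power of a two-sided test of 'no effect' when the true SMD is `smd`."""
    se = sqrt(2 / n_per_group)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = smd / se
    return 1 - (STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift))


def power_tost(margin, n_per_group, alpha=0.05):
    """Power of the TOST procedure when the true effect is exactly zero."""
    se = sqrt(2 / n_per_group)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    # Equivalence is declared when |estimate| <= margin - z_crit * se
    return max(0.0, 2 * STD_NORMAL.cdf(margin / se - z_crit) - 1)


def smallest_n(power_function, effect, target_power=0.8, n_max=100_000):
    """Smallest per-group sample size reaching the target power."""
    for n in range(2, n_max + 1):
        if power_function(effect, n) >= target_power:
            return n
    raise ValueError("target power not reached within n_max")


n_presence = smallest_n(power_superiority, 0.74)  # detect SMD = 0.74
n_absence = smallest_n(power_tost, 0.74)          # establish equivalence within +/- 0.74
print(n_presence, n_absence)
```

Under these assumptions, establishing equivalence within a margin needs more observations per group than detecting an effect of the same size, and the gap widens when the margin is smaller than the effect the original study was powered for.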

For both the equivalence test and the Bayes factor approach, it is critical that the equivalence margin and the prior distribution are specified independently of the data, ideally before the original and replication studies are conducted. Typically, however, the original studies were designed to find evidence for the presence of an effect, and the goal of replicating the “null result” was formulated only after failure to do so. It is therefore important that margins and prior distributions are motivated from historical data and/or field conventions (Campbell and Gustafson, 2021), and that sensitivity analyses regarding their choice are reported.

Researchers may also ask which of the two approaches is “better”. We believe that this is the wrong question to ask, because both methods address slightly different questions and are better in different senses; the equivalence test is calibrated to have certain frequentist error rates, which the Bayes factor is not. The Bayes factor, on the other hand, seems to be a more natural measure of evidence as it treats the null and alternative hypotheses symmetrically and represents the factor by which rational agents should update their beliefs in light of the data. Replication success is ideally evaluated along multiple dimensions, as nicely exemplified by the RPCB, RPEP, and RPP. Replications that are successful on multiple criteria provide more convincing support for the original finding, while replications that are successful on fewer criteria require closer examination. Fortunately, the use of multiple methods is already standard practice in replication assessment, so our proposal to use both of them does not require a major paradigm shift.

While the equivalence test and the Bayes factor are two principled methods for analyzing original and replication studies with null results, they are not the only possible methods for doing so. A straightforward extension would be to first synthesize the original and replication effect estimates with a meta-analysis, and then apply the equivalence and Bayes factor tests to the meta-analytic estimate similar to the meta-analytic non-significance criterion used by the RPCB. This could potentially improve the power of the tests, but consideration must be given to the threshold used for the *p*-values/Bayes factors, as naive use of the same thresholds as in the standard approaches may make the tests too liberal. Furthermore, there are various advanced methods for quantifying evidence for absent effects which could potentially improve on the more basic approaches considered here (Lindley, 1998; Johnson and Rossell, 2010; Morey and Rouder, 2011; Kruschke, 2018; Micheloud and Held, 2022).

# Acknowledgements

We thank the RPCB, RPEP, and RPP contributors for their tremendous efforts and for making their data publicly available. We thank Maya Mathur for helpful advice on data preparation. We thank Benjamin Ineichen for helpful comments on drafts of the manuscript. Our acknowledgment of these individuals does not imply their endorsement of our work. We thank the Swiss National Science Foundation for financial support (grant #189295).

# Software and data

The code and data to reproduce our analyses are openly available at https://gitlab.uzh.ch/samuel.pawel/rsAbsence. A snapshot of the repository at the time of writing is available at https://doi.org/10.5281/zenodo.7906792. We used the statistical programming language R version 4.3.1 (R Core Team, 2022) for analyses. The R packages ggplot2 (Wickham, 2016), dplyr (Wickham et al., 2022), knitr (Xie, 2022), and reporttools (Rufibach, 2009) were used for plotting, data preparation, dynamic reporting, and formatting, respectively. The data from the RPCB were obtained by downloading the files from https://github.com/mayamathur/rpcb (commit a1e0c63) and extracting the relevant variables as indicated in the R script preprocess-rpcb-data.R which is available in our git repository.

# Appendix: Sensitivity analyses

The post-hoc specification of equivalence margins Δ and prior distribution for the SMD under the alternative *H*_{1} is debatable. Commonly used margins in clinical research are much more stringent (Lange and Freitag, 2005); for instance, in oncology, a margin of Δ = log(1.3) is commonly used for log odds/hazard ratios, whereas in bioequivalence studies a margin of Δ = log(1.25) is the convention (Senn, 2021, Chapter 22). These margins would translate into margins of Δ = 0.14 and Δ = 0.12 on the SMD scale, respectively, using the SMD = (√3/*π*) log OR conversion (Cooper et al., 2019, p. 233). Similarly, for the Bayes factor we specified a normal unit-information prior under the alternative while other normal priors with smaller/larger standard deviations could have been considered. Here, we therefore investigate the sensitivity of our conclusions with respect to these parameters.
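The conversion of the two clinical margins to the SMD scale can be checked with a few lines of Python. The sketch only applies the SMD = (√3/*π*) log OR formula quoted above; the function name is ours.

```python
from math import log, pi, sqrt


def log_or_to_smd(log_odds_ratio):
    """Convert a log odds ratio to the SMD scale via SMD = (sqrt(3) / pi) * log OR."""
    return sqrt(3) / pi * log_odds_ratio


print(round(log_or_to_smd(log(1.3)), 2))   # oncology margin        -> 0.14
print(round(log_or_to_smd(log(1.25)), 2))  # bioequivalence margin  -> 0.12
```

Both values are far smaller than the margin Δ = 0.74 used in the main analysis.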

The top plot of Figure 4 shows the number of successful replications as a function of the margin Δ and for different TOST *p*-value thresholds. Such an “equivalence curve” approach was first proposed by Hauck and Anderson (1986). We see that for realistic margins between 0 and 1, the proportion of replication successes remains below 50% for the conventional *α* = 0.05 level. To achieve a success rate of 11/15 = 73%, as was achieved with the non-significance criterion from the RPCB, unrealistic margins of Δ > 2 are required. Changing the success criterion to a more lenient level (*α* = 0.1) or a more stringent level (*α* = 0.01) hardly changes the conclusion.

The bottom plot of Figure 4 shows a sensitivity analysis regarding the choice of the prior standard deviation and the Bayes factor threshold. In the main analysis we used a normal unit-information prior, that is, a normal distribution centered around the value of no effect with a standard deviation corresponding to the standard error of an SMD estimate based on one observation (Kass and Wasserman, 1995). Assuming that the two group means are normally distributed and each based on *n*/2 observations, with *n* the total sample size and *τ* the known data standard deviation, the distribution of the SMD is SMD ∼ N(*θ*, 4/*n*), where *θ* denotes the true SMD. The standard error *σ* of the SMD based on one unit (*n* = 1) is hence 2. It is uncommon to specify prior standard deviations larger than the unit-information standard deviation of 2, as this corresponds to the assumption of very large effect sizes under the alternative. However, to achieve replication success for a larger proportion of replications than the observed 1/15 = 7%, unreasonably large prior standard deviations have to be specified. For instance, a standard deviation of roughly 5 is required to achieve replication success in 50% of the replications at a lenient Bayes factor threshold of *γ* = 3. The standard deviation needs to be almost 20 so that the same success rate 11/15 = 73% as with the non-significance criterion is achieved. The necessary standard deviations are even higher for stricter Bayes factor thresholds, such as *γ* = 6 or *γ* = 10.

# References

- Statistics notes: Absence of evidence is not evidence of absence. *BMJ* **311**:485. https://doi.org/10.1136/bmj.311.7003.485
- Sample size planning for replication studies: The devil is in the design. *Psychological Methods*. https://doi.org/10.1037/met0000520
- There’s more than one way to conduct a replication study: Beyond statistical significance. *Psychological Methods* **21**:1–12. https://doi.org/10.1037/met0000051
- Raise standards for preclinical cancer research. *Nature* **483**:531–533. https://doi.org/10.1038/483531a
- Evaluating replicability of laboratory experiments in economics. *Science*:1433–1436. https://doi.org/10.1126/science.aaf0918
- Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. *Nature Human Behaviour* **2**:637–644. https://doi.org/10.1038/s41562-018-0399-z
- Conditional equivalence testing: An alternative remedy for publication bias. *PLOS ONE* **13**. https://doi.org/10.1371/journal.pone.0195145
- What to make of equivalence testing with a post-specified margin? *Meta-Psychology* **5**. https://doi.org/10.15626/mp.2020.2506
- A power primer. *Psychological Bulletin* **112**:155–159. https://doi.org/10.1037/0033-2909.112.1.155
- The Handbook of Research Synthesis and Meta-Analysis. https://doi.org/10.7758/9781610448864
- Estimating the reproducibility of experimental philosophy. *Review of Philosophy and Psychology*. https://doi.org/10.1007/s13164-018-0400-9
- Inhibition of BET recruitment to chromatin as an effective treatment for MLL-fusion leukaemia. *Nature* **478**:529–533. https://doi.org/10.1038/nature10509
- Using Bayes to get the most out of non-significant results. *Frontiers in Psychology* **5**. https://doi.org/10.3389/fpsyg.2014.00781
- Investigating the replicability of preclinical cancer biology. *eLife* **10**. https://doi.org/10.7554/elife.71601
- Practical guide to sample size calculations: non-inferiority and equivalence trials. *Pharmaceutical Statistics* **15**:80–89. https://doi.org/10.1002/pst.1716
- Biomechanical remodeling of the microenvironment by stromal caveolin-1 favors tumor invasion and metastasis. *Cell* **146**:148–163. https://doi.org/10.1016/j.cell.2011.05.040
- Toward evidence-based medical statistics. 2: The Bayes factor. *Annals of Internal Medicine* **130**. https://doi.org/10.7326/0003-4819-130-12-199906150-00019
- Introduction to Bayesian methods I: measuring the strength of evidence. *Clinical Trials* **2**:282–290. https://doi.org/10.1191/1740774505cn098oa
- Nonsignificance plus high power does not imply support for the null over the alternative. *Annals of Epidemiology* **22**:364–368. https://doi.org/10.1016/j.annepidem.2012.02.007
- A proposal for interpreting and reporting negative studies. *Statistics in Medicine* **5**:203–209. https://doi.org/10.1002/sim.4780050302
- The abuse of power. *The American Statistician* **55**:19–24. https://doi.org/10.1198/000313001300339897
- Theory of Probability.
- On the use of non-local prior densities in Bayesian hypothesis tests. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)* **72**:143–170. https://doi.org/10.1111/j.1467-9868.2009.00730.x
- Bayes factors. *Journal of the American Statistical Association* **90**:773–795. https://doi.org/10.1080/01621459.1995.10476572
- A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. *Journal of the American Statistical Association* **90**:928–934. https://doi.org/10.1080/01621459.1995.10476592
- Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence. *Nature Neuroscience* **23**:788–799. https://doi.org/10.1038/s41593-020-0660-4
- Investigating variation in replicability: A “many labs” replication project. *Social Psychology* **45**:142–152. https://doi.org/10.1027/1864-9335/a000178
- Many labs 2: Investigating variation in replicability across samples and settings. *Advances in Methods and Practices in Psychological Science* **1**:443–490. https://doi.org/10.1177/2515245918810225
- Rejecting or accepting parameter values in Bayesian estimation. *Advances in Methods and Practices in Psychological Science* **1**:270–280. https://doi.org/10.1177/2515245918771304
- Equivalence tests. *Social Psychological and Personality Science* **8**:355–362. https://doi.org/10.1177/1948550617697177
- Choice of delta: Requirements and reality – results of a systematic review. *Biometrical Journal* **47**:12–27. https://doi.org/10.1002/bimj.200410085
- Decision analysis and bioequivalence trials. *Statistical Science* **13**. https://doi.org/10.1214/ss/1028905932
- Ten common statistical mistakes to watch out for when writing or reviewing a manuscript. *eLife* **8**. https://doi.org/10.7554/elife.48175
- New statistical metrics for multisite replication projects. *Journal of the Royal Statistical Society: Series A (Statistics in Society)* **183**:1145–1166. https://doi.org/10.1111/rssa.12572
- Introduction to Randomized Controlled Clinical Trials. https://doi.org/10.1201/9781420011302
- Intentional action and intending: Recent empirical studies. *Philosophical Psychology* **18**:737–748. https://doi.org/10.1080/09515080500355236
- Micheloud, C. and Held, L. (2022). The replication of non-inferiority and equivalence studies. arXiv preprint. https://doi.org/10.48550/ARXIV.2204.06960
- Bayes factor approaches for testing interval null hypotheses. *Psychological Methods* **16**:406–419. https://doi.org/10.1037/a0024377
- Reproducibility and Replicability in Science. https://doi.org/10.17226/25303
- Estimating the reproducibility of psychological science. *Science* **349**. https://doi.org/10.1126/science.aac4716
- What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. *Perspectives on Psychological Science* **11**:539–544. https://doi.org/10.1177/1745691616646366
- Bayesian approaches to designing replication studies. *Psychological Methods*. https://doi.org/10.1037/met0000604
- Believe it or not: how much can we rely on published data on potential drug targets? *Nature Reviews Drug Discovery* **10**:712. https://doi.org/10.1038/nrd3439-c1
- R: A Language and Environment for Statistical Computing.
- Implicit attitude generalization occurs immediately; explicit attitude generalization takes time. *Psychological Science* **19**:249–254. https://doi.org/10.1111/j.1467-9280.2008.02076.x
- reporttools: R functions to generate LaTeX tables of descriptive statistics. *Journal of Statistical Software, Code Snippets* **31**. https://doi.org/10.18637/jss.v031.c01
- Reconsidering statistical methods for assessing replication. *Psychological Methods* **26**:127–139. https://doi.org/10.1037/met0000302
- A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. *Journal of Pharmacokinetics and Biopharmaceutics* **15**:657–680. https://doi.org/10.1007/bf01068419
- Statistical Issues in Drug Development. https://doi.org/10.1002/9781119238614
- Testing statistical hypotheses of equivalence and noninferiority.
- Use of confidence intervals in analysis of comparative bioavailability trials. *Journal of Pharmaceutical Sciences* **61**:1340–1341. https://doi.org/10.1002/jps.2600610845
- ggplot2: Elegant Graphics for Data Analysis. https://doi.org/10.1007/978-3-319-24277-4
- Wickham, H., François, R., Henry, L., and Müller, K. (2022). dplyr: A Grammar of Data Manipulation. URL https://CRAN.R-project.org/package=dplyr. R package version 1.0.10.
- Xie, Y. (2022). knitr: A General-Purpose Package for Dynamic Report Generation in R. URL https://yihui.org/knitr/. R package version 1.40.
