1 Introduction

In many fields of medical research, drugs that show promising results in preclinical studies in animals frequently fail to do the same in human clinical trials [1]. This “translation failure” is one of the biggest challenges in biomedical research today: it is estimated that around two-thirds to 95% of therapeutics found to be safe and effective during animal testing fail when tested in humans [1, 2, 3, 4]. Translatability has been defined as “the ability to apply research discoveries from experimental models to applications that directly benefit humans” [5]. Reasons for low translatability are multifaceted. However, the pervasiveness of suboptimal study design, analysis, and reporting, potentially resulting in a lack of reproducibility (i.e., the “extent to which the results of a study agree with those of replication studies” [5]), has been flagged as a key concern [2, 6]. Animal studies often demonstrate deficiencies such as inaccurate or inconsistent data collection procedures, poor reporting of key variables, including the age and sex of animals used, and a lack of measures to reduce risks of bias, such as the absence of randomization or blinding [7, 8, 9]. In terms of statistical methodology, frequent issues include small sample sizes leading to low-powered studies, inadequate control for confounding variables, and insufficient description of statistical methods or reporting of uncertainty measures [10, 11, 12, 13]. Publication bias, the phenomenon in which the decision to publish a study is based on the direction or strength of its findings, is also rampant. Animal studies reporting positive and statistically significant results are more likely to be published than those with negative or statistically non-significant findings, meaning that subsequent human studies may be based upon biased conclusions [14, 15].
Such issues are a detriment to both animals, whose lives are wasted when we draw incorrect conclusions from research performed on them; and humans, who are put at unnecessary risk during clinical trials when an intervention’s reported safety or efficacy is overstated or outright false [7, 16].

Replication and translation

Recently, there has been a growing interest in replications of previously published studies. A replication is defined as a “study that repeats all or part of another study and allows researchers to compare their findings” [5]. To perform a replication study, researchers could for example use the same methodology and/or analysis as presented in an original study on newly collected data. They then attempt to determine if the results from the replication study are consistent with those in the original study [17]. A multitude of metrics have been used or proposed to quantify the consistency of results, and ultimately to decide if a replication was “successful” or not [18]. We will refer to these metrics as “replication success metrics”. They might compare, for example, the p-values or the magnitude, direction, or uncertainty of estimated treatment effects obtained from the original and replication studies. Other metrics, such as the one based on a meta-analysis, combine results from an original study and its replication attempt(s) to estimate an overall effect size [19, 20, 21]. So far, studies attempting to estimate how often translation failure occurs have largely utilized the simple statistical significance criterion, i.e., assessing if the animal and human studies both report a statistically significant treatment effect in the same direction, often referred to as the two-trials rule [22]. To our knowledge, the usage of alternative replication success metrics in a translation setting has not yet been investigated. Translation contrasts with replication in that animal and human studies examine different populations and often have different experimental designs, and thus inherently produce different results. As such, a human study is a “conceptual” rather than a “direct” replication of the animal study [23]. As a result, metrics which are useful for measuring replication may not be as applicable for translation. This distinction motivates the current simulation study.

Study objectives

To date, little is known about the most appropriate metrics to assess or quantify “translatability” or “translation success”. In this study, we aim to investigate whether metrics proposed to quantify replication success can be applied, and are useful, in the context of the translation of results from animals to humans. We will investigate the ability of these metrics to quantify translation success under various simulation conditions, for example in different scenarios of effect sizes and sample sizes, in order to gain a better understanding of their behaviour and characteristics.

2 Methods

2.1 Study design, data, and protocol

This is a simulation study. Synthetic data sets, representing animal studies and human studies, were generated according to various simulation conditions (see Sections 2.3, 2.4 and 2.5). The simulated animal and human findings were then used to evaluate the selected translation success metrics (presented in Section 2.7) using pre-specified performance measures (Section 2.8). A detailed protocol of the present simulation study, following the ADEMP (Aims, Data-generating mechanisms, Estimands and other targets, Methods, Performance measure) preregistration template [24], was preregistered on the Open Science Framework prior to running the simulation study [25].

2.2 Protocol amendments

During drafting of this manuscript, we found a conceptual error in our data generating mechanism. Initially, we planned to incorporate the heterogeneity variance directly into the simulation of the individual observations for the animal and human studies. We now first use the heterogeneity variance to simulate an effect size for each animal study and for the human study, and then use this effect size, together with only the sampling variation, to simulate the individual observations. Finally, a coding error in the protocol (the “−2” was missing in the denominator of the pooled variance calculation, page 4 of the protocol) led to the wrong human group sample size appearing in the protocol text (103 instead of the correct 107).

2.3 Data generation

We assumed that the synthetic studies investigate the effect of a treatment (e.g., prenatal amino acid supplementation) on an outcome that is comparable across animals and humans (e.g., maternal blood pressure as in Terstappen et al. [26]). We simulated individual observations of this outcome measurement for the animals and for the humans, in the treatment group and the control group. To generate the synthetic data, the true animal and human means in the treatment groups were set to µA and µH. The mean in the animal and human control groups was always set to 0, so that µA and µH correspond to mean difference effect sizes. The true effect size heterogeneity variances were set to τ²A and τ²H. For each simulation repetition i we performed the following:

  1. Simulation of effect sizes: We first simulated k animal effect sizes and one human effect size:

     θA,j ~ N(µA, τ²A) for j = 1, …, k,  and  θH ~ N(µH, τ²H).

  2. Simulation of the animal finding: We generated k synthetic animal studies. Each study j (j = 1, …, k) had nA animals in the control group (C) and nA animals in the treatment group (T). The outcomes were simulated as

     Y(T)j,l ~ N(θA,j, σ²A)  and  Y(C)j,l ~ N(0, σ²A),

    with l = 1, …, nA. For each of the k synthetic animal studies we then performed a one-sided two-sample t-test, comparing the outcomes for the treatment group to the control group. The k effect estimates were pooled using a random-effects meta-analysis (with restricted maximum likelihood estimator for the heterogeneity variance) and the resulting pooled effect size, standard error and p-value constituted the “animal finding”.

  3. Simulation of the human finding: We simulated outcome measurements for the human study, with nH humans in the control group (C) and nH humans in the treatment group (T):

     Y(T)m ~ N(θH, σ²H)  and  Y(C)m ~ N(0, σ²H),

    where m = 1, …, nH. A one-sided two-sample t-test was then performed on the simulated human outcome measurements, comparing treatment and control group, and the resulting effect size (mean difference), standard error of the mean difference and p-value were extracted. This constitutes the “human finding”.

We performed one-sided tests as they take into account the direction of the effect estimate and we were testing for a “beneficial treatment effect”. Note that we use the one-sided tests with significance level α = 0.025, which is equivalent to performing two-sided tests at level 0.05 and checking that the effect goes in the beneficial direction.
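The three-step generation procedure above can be sketched as follows. This is an illustrative Python translation (the study itself was implemented in R), and it simplifies one step: the k animal estimates are pooled with plain inverse-variance (fixed-effect) weights instead of the REML random-effects meta-analysis used in the actual simulation. A beneficial effect is encoded as a negative mean difference (a decrease, as in the blood-pressure example).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_pair(mu_A, mu_H, tau2_A, tau2_H, sigma2_A, sigma2_H, n_A, n_H, k):
    """One simulation repetition: an animal finding (k pooled studies) and a human finding."""
    # Step 1: draw study-level true effects using the heterogeneity variances
    theta_A = rng.normal(mu_A, np.sqrt(tau2_A), size=k)
    theta_H = rng.normal(mu_H, np.sqrt(tau2_H))

    # Step 2: k animal studies, each with n_A animals per group
    est = np.empty(k)
    se = np.empty(k)
    for j in range(k):
        trt = rng.normal(theta_A[j], np.sqrt(sigma2_A), size=n_A)
        ctl = rng.normal(0.0, np.sqrt(sigma2_A), size=n_A)
        est[j] = trt.mean() - ctl.mean()
        sp2 = (trt.var(ddof=1) + ctl.var(ddof=1)) / 2  # pooled variance, equal group sizes
        se[j] = np.sqrt(2 * sp2 / n_A)

    # Simple fixed-effect pooling as a stand-in for the REML random-effects model
    w = 1 / se**2
    est_A = np.sum(w * est) / np.sum(w)
    se_A = np.sqrt(1 / np.sum(w))
    p_A = stats.norm.cdf(est_A / se_A)  # one-sided p for a negative (beneficial) effect

    # Step 3: one human study with n_H humans per group
    trt_H = rng.normal(theta_H, np.sqrt(sigma2_H), size=n_H)
    ctl_H = rng.normal(0.0, np.sqrt(sigma2_H), size=n_H)
    t_res = stats.ttest_ind(trt_H, ctl_H, alternative="less")
    est_H = trt_H.mean() - ctl_H.mean()
    se_H = np.sqrt((trt_H.var(ddof=1) + ctl_H.var(ddof=1)) / n_H)
    return est_A, se_A, p_A, est_H, se_H, t_res.pvalue
```

With a strongly beneficial effect in both species (e.g., mu_A = mu_H = −5 with unit within-study variance), both one-sided p-values should be small in virtually all repetitions.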

2.4 Motivating dataset

The selection of the values of the simulation parameters in Section 2.3 was based on a systematic review and meta-analysis by Terstappen et al. [26] (also Figure 2.1 in Huang and Heyard [25]), which assessed the effects of prenatal amino acid supplementation on birth weight and, as a secondary outcome, maternal blood pressure (BP). In this simulation study, we focused only on the blood pressure data, for which measures from animal and human subjects were comparable. From these data we extracted the species-specific random-effects meta-analytical treatment effects, θA and θH, the estimated heterogeneity variances, and the typical within-study variances, σ²A and σ²H. The meta-analysis included 15 animal studies and six human studies, with average sample sizes per group of 8.7 for the animal studies and 26.8 for the human studies.

Figure 1: (a) Nested loop plot of the proportion of statistically significant animal and human findings over all simulation repetitions depending on the simulation conditions, i.e. animal and human effect sizes, heterogeneity across animal and human studies, animal study sample sizes, and the number of animal studies pooled together to obtain the animal finding. The dotted horizontal lines represent a proportion of 2.5% and 80%. The legend under each plot shows which of the progressively thinner columns in the plot correspond to which combination of simulation conditions. Each horizontal line segment contains the proportion of significant findings under each combination of conditions. For example, the segment highlighted with an arrow represents the proportion of significant findings in the animal studies, when the smaller sample size was used, the animal effect size was small, there was no heterogeneity across the animal studies, and 5 animal studies were pooled. (b) Nested loop plot of the average animal effect size over all simulation repetitions, depending on the simulation conditions and the decision criterion applied on the animal finding. The average effect size observed in the human studies is not affected by the applied criterion. Note that since the criterion was not added as a simulation condition, the represented data is correlated, as the same simulation repetitions are used to calculate the average effect size for the strict, lenient and no criterion.

Figure 2: Grid of nested loop plots of the proportions of animal-human pairs for which the different metrics flagged successful translation across simulation conditions under no criterion. Each of the plots in the grid represents another animal-human finding combination. In the first column, for example, the human studies are all simulated under the null hypothesis of no effect. Note that the results for the replication BF and the meta-analysis are not shown here for better readability. The dotted horizontal lines represent α² = 0.000625, α = 0.025, 1 − β = 0.8 and (1 − β)² = 0.64. All animal studies in this representation are simulated with a small sample size per group (nA = 10).

2.5 Simulation conditions

To investigate the applicability of the replication success metrics in the context of animal to human translation, we simulated the animal and human findings under various conditions (i.e., values for parameters in Section 2.3). The conditions were chosen based on previous literature [7, 13] and expert knowledge, with the aim of emulating plausible real-world translation scenarios as closely as possible.

Animal and human effect sizes, µA and µH In the motivating dataset, we found that θA = 24.37 and θH = 4.44, meaning that the beneficial treatment effect (a reduction in blood pressure) found in animals was larger than the beneficial treatment effect in humans. However, this might not always be the case and the true treatment effect in animals and humans might be the same, i.e., both small, both large, or both absent entirely. The true effect size in humans could also be larger than in animals. We therefore simulated under all possible combinations of animal and human effect sizes, summarized in Table 1.

Summary of the simulation factors used to generate animal and human studies (varied in a fully factorial way).

Animal and human study heterogeneity, τ²A and τ²H We implemented a similar setup for the between-study heterogeneity variances. In our motivating dataset, animal studies had a higher degree of heterogeneity than human studies. Again, we simulated under all possible combinations of small, large, or zero animal and human study heterogeneity to investigate the behaviour of translation metrics across different scenarios (see Table 1).

Animal and human study sample sizes, nA and nH Animal preclinical studies commonly suffer from insufficient sample sizes, leading to underpowered studies [2, 14, 6]. Therefore, we included two different sample sizes per group for the simulated animal studies. The typical sample size observed in animal studies is small, with approximately 10 animals per group [26, 10, 27]. To investigate the effect of an increased sample size on translation success, we additionally simulated with a larger sample size of 20 animals per group. This represents the maximum number of animals per group observed in our data example from Terstappen et al. [26] and in Table 1 of Hooijmans et al. [27]. For the human findings we used a fixed sample size and always simulated with nH = 107 humans per group. This sample size was chosen to reach 80% power for an absolute effect of |θH| = 4.44 and typical within-study variance σ²H with a one-sided two-sample t-test and significance level α = 0.025.

Number of pooled animal studies, k In real-world drug development, multiple animal studies are usually performed and results are pooled before deciding to progress to a human study. We investigated the effect of pooling together different numbers of animal studies on translation success: k = 2, 3, 4, and 5. As described above, we performed a random-effects meta-analysis (using the restricted maximum likelihood estimator for the heterogeneity variance) of the k animal studies to generate the animal finding.

We varied the factors above, summarized in Table 1, in a fully factorial manner. This resulted in 3 (animal effect size) × 3 (human effect size) × 3 (animal heterogeneity) × 3 (human heterogeneity) × 2 (animal sample size) × 4 (number of animal studies to pool together, k) = 648 simulation conditions.

2.6 Criteria to continue from an animal study to a human study

Usually, animal studies must show evidence of a beneficial treatment effect before a treatment is tested in humans. Alternatively, treatments that show no evidence for a beneficial effect in animal studies may continue to testing in humans if the treatment is safe and its mechanism of action is plausible in humans [28, 29]. When analysing the applicability of translation success metrics, we considered both of these continuation criteria and excluded the human studies accordingly.

  1. Strict criterion – Here, we only “moved on” to a human study, if the random-effects meta-analysis of the corresponding k animal studies found a significant beneficial treatment effect with one-sided significance level α = 0.025.

  2. Lenient criterion – Under this more lenient criterion, we “moved on” to a human study whenever the random-effects meta-analysis of the k animal studies found a beneficial effect (i.e., if the estimated effect was negative), even if it was not statistically significant at α = 0.025.

  3. No criterion – As a point of reference, we also computed performance measures for all simulated animal and human findings, regardless of the results of the animal studies.
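The three continuation criteria amount to a single filter applied to the pooled animal finding. The following illustrative Python sketch (the study itself used R) encodes a beneficial effect as a negative pooled estimate:

```python
ALPHA = 0.025  # one-sided significance level

def continue_to_human(est_A, p_A, criterion):
    """Decide whether the human study is conducted, given the pooled
    animal effect estimate est_A and its one-sided p-value p_A."""
    if criterion == "strict":
        # significant beneficial (negative) effect in the meta-analysis
        return est_A < 0 and p_A < ALPHA
    if criterion == "lenient":
        # beneficial direction, significance not required
        return est_A < 0
    if criterion == "none":
        # all simulated pairs are retained
        return True
    raise ValueError(f"unknown criterion: {criterion}")
```

For example, a beneficial but non-significant animal finding passes the lenient criterion while failing the strict one.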

2.7 Translation success metrics

We compared the characteristics of nine translation success metrics, including the replication success metrics used in Freuli, Held, and Heyard [19] and Muradchanian et al. [20] as well as some more recently developed metrics, across the previously defined simulation conditions: the significance criterion, the meta-analysis, the replication Bayes factor, the unweighted and the weighted Edgington method, the controlled sceptical p-value and three versions of the golden sceptical p-value. These metrics were primarily designed to assess replication success in the pairwise comparison of one original study with one replication study. In the current translation setting, the result of the random-effects meta-analysis of the k animal studies was treated as the “animal finding”.

Significance criterion (Two-trials rule) The significance criterion, often referred to as the two-trials rule, is the current standard for a new drug to meet prior to its approval. It requires that two independent studies demonstrate a drug’s efficacy at a certain significance level, usually α = 0.025 for one-sided hypothesis testing [22] to take the direction of the effect into account. This criterion is also often used to identify replication success in large-scale reproducibility projects [18]. According to the significance criterion, we flagged a successful translation if both the animal and human studies yielded evidence for a beneficial treatment effect, both at a significance level of α = 0.025 [20]:

    pA < α  and  pH < α,

where pA and pH represent the p-values found in the animal and human studies, respectively. By setting α = 0.025, the significance criterion controls the overall type I error (T1E) rate, or in our case the rate of a false positive translation success, at α² = 0.025² = 0.000625 [22]. Note that this metric treats the animal and human finding as interchangeable.
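The claimed error rate of the two-trials rule is easy to verify numerically: under the global null, the two one-sided p-values are independent and uniform, so the success rate should be close to α². A minimal Python sketch:

```python
import numpy as np

ALPHA = 0.025

def two_trials_rule(p_A, p_H, alpha=ALPHA):
    """Success iff both one-sided p-values are significant."""
    return (p_A < alpha) & (p_H < alpha)

# Under the null, p-values are Uniform(0, 1), so the flagging rate
# should be close to alpha^2 = 0.000625
rng = np.random.default_rng(0)
n = 2_000_000
rate = two_trials_rule(rng.uniform(size=n), rng.uniform(size=n)).mean()
```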

Meta-analysis According to the meta-analysis criterion, we flagged translation success if a fixed-effects meta-analysis combining the animal and the human findings found a significant effect in the desired direction (here, a decrease), at a one-sided significance level α², i.e., pMA < α². This threshold again ensured an overall T1E control at α² [30]. We followed Freuli, Held, and Heyard [19] for the implementation of the method via the weighted version of Stouffer’s method described in Cousins [31], and define the meta-analytic p-value pMA of a one-sided test for a negative effect (i.e., a beneficial effect) as follows:

    pMA = Φ((wA zA + wH zH) / √(w²A + w²H)),

where Φ(·) is the standard normal cumulative distribution function, zA and zH are the z-values representing the findings in the synthetic animal studies (pooled) and human study, and wA and wH are the corresponding weights. If we were to test for a positive effect, the formula for the desired p-value would change to 1 − Φ(·). Note that we used fixed-effects meta-analysis as this represents the commonly used metric in the replication context.

Alternatively, we could have used random-effects meta-analysis, but the assessment of heterogeneity is challenging when only two findings (animal and human) are considered [32]. The metric based on meta-analysis treats the animal and human finding as interchangeable.
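The weighted Stouffer combination can be sketched in a few lines of Python. The weights below are placeholders (equal by default); the study itself follows the weighting described in Cousins [31]:

```python
import numpy as np
from scipy.stats import norm

ALPHA = 0.025

def stouffer_p(z_A, z_H, w_A=1.0, w_H=1.0):
    """Weighted Stouffer combination: one-sided meta-analytic p-value
    for a negative (beneficial) effect."""
    z = (w_A * z_A + w_H * z_H) / np.sqrt(w_A**2 + w_H**2)
    return norm.cdf(z)

# translation success is flagged when stouffer_p(...) < ALPHA**2
```

For two borderline-significant findings (z = −1.96 each), the combined p-value is roughly 0.003, much smaller than either individual p-value but still above the α² = 0.000625 threshold.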

Replication Bayes factor In the translation setting, the replication Bayes factor (BF) quantifies the evidence that the outcome observed in a human study is absent or spurious (H0) relative to the evidence that the outcome in the human study is consistent with that found in the (original) animal studies (Hr) [33]. To calculate the replication BF, BF0r, Hr is defined as the alternative hypothesis that the human effect is distributed according to the posterior distribution of the effect after observing the animal finding. A translation was flagged as successful if

    BF0r < 1/3,

using the conventional threshold for substantial evidence for Hr over H0 [34].
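Under normal approximations to the likelihoods, the replication BF has a simple closed form: the marginal density of the observed human effect under H0 divided by its density under Hr, where Hr uses the posterior of the effect given the animal finding. The Python sketch below assumes this normal-approximation form (the study itself used the BFr function from the BayesRep R package):

```python
from scipy.stats import norm

def replication_bf(est_A, se_A, est_H, se_H):
    """BF_0r under normal likelihoods: H0 (no human effect) versus Hr
    (human effect distributed as the posterior given the animal finding)."""
    m0 = norm.pdf(est_H, loc=0.0, scale=se_H)
    mr = norm.pdf(est_H, loc=est_A, scale=(se_A**2 + se_H**2) ** 0.5)
    return m0 / mr

# flag translation success when replication_bf(...) < 1/3
```

Two consistent, strongly beneficial findings yield a tiny BF_0r (evidence for Hr), while a null-like human finding after a strong animal finding yields a large BF_0r (evidence for H0).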

Unweighted & weighted Edgington’s method Edgington [35] developed an additive method of combining p-values from independent experiments, which has been applied more recently in a replication success setting [36]. Under the original version of Edgington’s method, to control the overall T1E rate across two studies at level α² = 0.025², a successful replication is flagged if the sum of the p-value in the original study po and the p-value in the replication study pr is smaller than √2 · α ≈ 0.035. With Edgington’s method, it is possible to flag success even if one of po or pr is not significant, as long as po + pr ≤ 0.035. In our study, a successful translation was flagged with Edgington’s method if

    pA + pH ≤ √2 · α ≈ 0.035.

Even more recently, a weighted version of Edgington’s method has been proposed [36]. Here, an original study is down-weighted and a replication study is up-weighted to account for potential biases in the original study, and the two study findings are no longer interchangeable. For the same overall T1E control at level α² = 0.025², and in the case in which the replication study carries twice the weight of the original study, a successful replication is flagged if po + 2pr ≤ 2α = 0.05 [36]. In our study, we gave the human result more weight than the animal finding, and flagged a successful translation with the weighted Edgington’s method if

    pA + 2pH ≤ 2α = 0.05.
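Both decision rules, and their T1E control, can be checked numerically: under independent uniform p-values, the triangular region pA + pH ≤ √2·α has area α², and the region pA + 2pH ≤ 2α has area (2α)²/4 = α² as well. An illustrative Python check:

```python
import numpy as np

ALPHA = 0.025

def edgington(p_A, p_H, alpha=ALPHA):
    """Unweighted Edgington: success iff p_A + p_H <= sqrt(2) * alpha."""
    return p_A + p_H <= np.sqrt(2) * alpha

def edgington_weighted(p_A, p_H, alpha=ALPHA):
    """Weighted Edgington with the human p-value counted twice."""
    return p_A + 2 * p_H <= 2 * alpha

# Monte Carlo check of the overall T1E rate under the global null
rng = np.random.default_rng(0)
n = 4_000_000
u_A, u_H = rng.uniform(size=n), rng.uniform(size=n)
t1e_unweighted = edgington(u_A, u_H).mean()        # ~ alpha^2 = 0.000625
t1e_weighted = edgington_weighted(u_A, u_H).mean()  # ~ alpha^2 as well
```

Note how the unweighted rule can flag success with, say, pA = 0.03 (non-significant) as long as pH is small enough that the sum stays below 0.035.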

Golden & controlled sceptical p-value The sceptical p-value combines a reverse-Bayes technique with a prior-data conflict assessment. It quantifies the extent to which the data in a replication (or human) study conflict with a sceptical prior that would render the original (or animal) finding unconvincing [37, 20]. In our study, we examine two specific versions of the sceptical p-value: the golden and the controlled sceptical p-value [38, 39].

With the golden sceptical p-value ps, success can be flagged if the p-value of the animal finding is sufficiently small (i.e., pA ≈ α), even if it is not necessarily significant, as long as there is no shrinkage in effect size in the human study [38, 19]. However, in the translation setting, shrinkage of effect size is expected in human studies relative to animal studies. Held, Micheloud, and Pawel [38] have developed a method to calculate a threshold for the golden sceptical p-value in the presence of shrinkage, which is equivalent to a pre-specified threshold of α when no shrinkage is present. The golden sceptical p-value controls the overall T1E rate at a maximum level of α², provided that the sample size of the replication or human study is larger than in the original or animal study. We can therefore calculate threshold values below which a successful translation can be flagged even if pA is slightly higher than α, in the presence of no (0%), moderate (25%), or high (50%) shrinkage. Note that these levels of shrinkage were selected rather arbitrarily, based on an observed shrinkage of about 50% in the Reproducibility Project: Psychology [40], while it was even higher in the Reproducibility Project: Cancer Biology (mean effect size of 6.15 in the original studies and 1.37 in the replication studies) [41]. In our study, translation success was flagged if the following conditions were satisfied, depending on the allowed level of shrinkage:

  1. when allowing for no shrinkage;

  2. when allowing for moderate shrinkage; or,

  3. when allowing for high shrinkage.

Finally, the controlled sceptical p-value, another recalibration presented in Micheloud, Balabdaoui, and Held [39], guarantees control of the overall T1E rate at level α². Translation success was flagged if

Note that whenever the animal finding indicated a harmful effect (i.e., the effect goes in the opposite direction than what was expected), we implemented all sceptical p-value metrics in a way that forced them to flag failure. This approach is valid, as no version of the sceptical p-value would ever flag success if the original or the animal finding has a very high p-value.

2.8 Performance measures

To evaluate the performance of each translation success metric, we calculated and compared the proportion P of synthetic pairs of animal and human findings for which the metric flagged successful translation under the different simulation conditions. The denominator for this proportion depends on the animal finding (i.e., the results of the meta-analysis of k animal studies) and the different continuation criteria (strict, lenient, none) described in Section 2.6. This leads to the following three versions of P:

  • Under the strict criterion: Pstrict, the number of pairs flagged as translation successes divided by the number of pairs in which the meta-analysis of the k animal studies found a significant beneficial effect.

  • Under the lenient criterion: Plenient, the number of pairs flagged as translation successes divided by the number of pairs in which the meta-analysis of the k animal studies found a beneficial effect estimate.

  • Under no criterion: Pno, the number of pairs flagged as translation successes divided by the total number of simulated pairs.

Under the assumption of animal and human null effects, this proportion reflects the overall T1E rate – the rate of false positive translation success. The lower this proportion under null effects, the better the metric is at avoiding false positive declarations of translation success. Under the assumption that both animal and human effects are not null, the proportion can be interpreted as translation power, i.e., the probability of true positive translation success. The higher the proportion of true positive translation success, the better the metric is suited to “correctly” declare translation success under the chosen simulation conditions.
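In code, the three versions of P amount to one success indicator averaged over different denominators. An illustrative Python sketch (a beneficial effect is again encoded as a negative pooled animal estimate):

```python
import numpy as np

def success_proportion(success, est_A, p_A, criterion, alpha=0.025):
    """Proportion P of animal-human pairs flagged as translation successes,
    restricting the denominator according to the continuation criterion."""
    success = np.asarray(success, dtype=bool)
    est_A = np.asarray(est_A, dtype=float)
    p_A = np.asarray(p_A, dtype=float)
    if criterion == "strict":
        keep = (est_A < 0) & (p_A < alpha)   # significant beneficial animal finding
    elif criterion == "lenient":
        keep = est_A < 0                     # beneficial direction only
    elif criterion == "none":
        keep = np.ones_like(success, dtype=bool)
    else:
        raise ValueError(criterion)
    return success[keep].mean()
```

With four toy pairs, the same success indicators can yield P = 1, 2/3 or 1/2 depending solely on which denominator the criterion retains.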

We used so-called “nested loop plots” to represent and compare the proportions for the different metrics across simulation conditions as recommended by Rücker and Schwarzer [42]. The combinations of simulation conditions are ordered and arranged on the horizontal axis, while the proportion of successful translations is presented on the vertical axis (we refer to the caption of Figure 1.(a) for a brief description of how to interpret these plots).

2.9 Monte Carlo uncertainty and number of simulation repetitions

The number of simulation repetitions was calculated based on a maximum desired Monte Carlo standard error (MCSE) of 0.5% for P [43]. We considered the “worst-case” scenario of P = 0.5 (i.e., the metric is not better than tossing a coin), as well as the strictest criterion for a human study to be performed. From this, we obtained a maximum of 400000 animal studies to simulate in order to move on to at least 10000 human studies, while maintaining a maximum MCSE of 0.5%. For simplicity, we simulated 400000 animal studies under all combinations of simulation conditions.
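The repetition count follows directly from the standard MCSE formula for a proportion, sqrt(P(1 − P)/n). A short Python check of the arithmetic:

```python
import math

def mcse(P, n):
    """Monte Carlo standard error of an estimated proportion P based on n repetitions."""
    return math.sqrt(P * (1 - P) / n)

# Worst case P = 0.5: 10,000 human studies give an MCSE of exactly 0.5%
n_human = 10_000
worst_case_mcse = mcse(0.5, n_human)

# Under the strict criterion with null effects, only about alpha = 2.5% of
# animal findings continue to a human study, hence the number of animal
# repetitions needed to retain n_human human studies:
n_animal = n_human / 0.025  # 400,000
```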

2.10 Implementation

Our simulation study was implemented in R (version 4.5) and designed using the SimDesign package [44]. We used the BFr function from the BayesRep package to compute the replication BF [45], and the ReplicationSuccess package for all versions of the sceptical p-value [46]. Following Pawel et al. [47], we recorded and reported the proportion of missingness. This is a common issue in simulation studies in which problems such as non-convergence of optimization algorithms may cause some simulation repetitions and conditions to yield invalid outputs, leading to missing values for the performance measures.

3 Results

3.1 Characteristics of simulated animal and human studies

We first illustrate the impact of our simulation design choices on the animal and human studies separately, and verify that the simulations were performed as expected. For this, Figure 1.(a) shows a nested loop plot with the proportion of significant animal and human findings (one-sided p < α = 0.025) according to the simulation conditions.

As expected, both animal and human studies show a T1E rate (i.e., proportion under the null) of about α = 0.025 under the null hypothesis of no effect (i.e., µA = 0 and µH = 0) combined with no heterogeneity across studies. Increasing heterogeneity increases the T1E rates. For the animal findings, the T1E rate decreases with an increasing number of pooled studies k.

The human study sample size of nH = 107 was chosen in order to achieve 80% power assuming a small effect size and no heterogeneity. Accordingly, under these conditions, we also find that about 80% of the human findings are significant. Also as expected, the power decreases with increasing heterogeneity. Simulating under the large human effect using the same sample size nH, naturally results in higher power close to 1, except when heterogeneity is large.

On the other hand, the animal findings have low power under the simulation condition of a small animal effect. As expected, power increases with increasing k and with the larger animal sample size, but still remains rather low. Increasing heterogeneity across the animal studies further lowers the proportion of significant animal findings. Finally, simulating under the large animal effect results in highly powered findings, with almost 100% power, except in the case of high heterogeneity.

Then, Figure 1.(b) shows that conditioning the decision to conduct a human study on the animal finding (being beneficial or significant) results in overestimated effect sizes for the animal finding. The stricter the decision criterion, the more inflated the estimated average effect size in the animal studies.

Missingness Our simulation study was also affected by missingness, though only in rare cases. Specifically, missingness occurred only in the data generating mechanism (see classification in Pawel et al. [47]) and was due to non-convergence of the Fisher scoring algorithm in the meta-analysis of the animal studies using the rma function. A table in our online supplement (https://rachelheyard.pages.uzh.ch/translation_simulation/) summarizes the proportion of missingness for each combination of simulation conditions, with a maximum of 0.0025% (i.e., 10 missing values out of 400000 repetitions). These repetitions were omitted from the analyses.

3.2 Performance of translation success metrics

Figure 2 shows the proportion of animal-human pairs for which the different metrics flagged successful translation across simulation conditions. The figure specifically shows the proportion Pno (no criterion) and the small animal sample size nA = 10. Note that Figure 2 does not include the results for the replication BF and the meta-analysis for readability reasons as they behave very differently (see the corresponding Figure A.1 in the appendix). Our online supplement allows the reader to zoom into the different plots and also contains the results with the larger animal sample size; see https://rachelheyard.pages.uzh.ch/translation_simulation/.

Assuming a large animal effect and a small human effect (bottom center plot in Figure 2) This combination of animal and human effect sizes is closest to the results from the meta-analysis in Terstappen et al. [26], and is therefore potentially the most realistic in the translation setting. Here, a well-performing metric should find a relatively high proportion of translation successes, i.e., high translation power.

When no heterogeneity is present in either animals or humans, the translation power for all metrics in the figure is at least 1 − β = 80%. The two-trials rule and the weighted Edgington method are both equal to 80% (lines overlap). The unweighted Edgington method behaves similarly, with a slightly higher proportion. The controlled and golden (high shrinkage) sceptical p-values find the highest translation power.

Increasing the heterogeneity in human studies decreases the translation power of all metrics and brings them closer together. When heterogeneity of the human studies is low and animal heterogeneity is none, all metrics are close to (1 − β)². These results barely change when increasing the animal study heterogeneity from none to low. This might be due to the fact that the relative sample size c = nH/nA is always larger than 1, even if the sample size of the animal finding is artificially increased with increasing k. A c > 1 forces some metrics to give more weight to the human study; therefore, even slight increases in the heterogeneity across human studies affect translation results. The translation power of the three golden sceptical p-values generally increases with k, as a higher k leads to a higher chance of observing a significant animal finding. Their translation power is lowest when animal study heterogeneity is high, which can be explained by the decrease in power of the individual animal studies. The controlled sceptical p-value shows a similar pattern with respect to k and animal heterogeneity, but increasing k to 5 counteracts the loss, and its translation power is equal to 80%.

The metric based on a meta-analysis outperforms all other metrics in most conditions represented in Figure A.1. From previous research [19, 39], we know that if either the animal or the human finding is convincing, the likelihood for the meta-analysis to flag success is high, regardless of the evidence in the other study. The replication BF results in very low proportions of successful translation when there is no or low heterogeneity across animal and human studies, because the effect sizes from animal and human findings are too inconsistent. The results for the replication BF are more comparable to the results of the other metrics in the presence of high study heterogeneity in either animals or humans.
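
The behaviour of the meta-analysis metric can be illustrated with a fixed-effect (inverse-variance) pooling of one animal and one human estimate; the significance threshold and the exact pooling model used in the paper are not specified here, so treat the details as illustrative assumptions:

```python
import math

def fixed_effect_meta(theta_a, se_a, theta_h, se_h):
    """Inverse-variance (fixed-effect) pooling of one animal and one human
    effect estimate; returns the pooled estimate, its standard error and a
    one-sided p-value for a positive (beneficial) effect."""
    w_a, w_h = 1 / se_a**2, 1 / se_h**2
    theta = (w_a * theta_a + w_h * theta_h) / (w_a + w_h)
    se = math.sqrt(1 / (w_a + w_h))
    z = theta / se
    p = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided upper-tail p-value
    return theta, se, p

# A precise, convincing human finding dominates an imprecise null animal
# finding, so the pooled result flags "success" regardless of the animal
# evidence -- the behaviour described above:
theta, se, p = fixed_effect_meta(theta_a=0.0, se_a=0.3, theta_h=0.5, se_h=0.1)
```

Because the precise study dominates the pooled estimate, a single very convincing finding in either species is enough to flag success, consistent with the pattern reported for this metric.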

Under the lenient criterion, the conclusions are the same. Under the strict criterion, most conclusions hold, while the proportions for the two-trials rule, both Edgington variants and the controlled sceptical p-value are now independent of increases in k and of the level of heterogeneity across animal studies. A larger animal sample size per group increases all proportions slightly, apart from the proportion for the replication BF, which decreases.

Assuming null effects in animal and human studies (upper left corner in Figure 2) Here, a well-performing metric should find a low proportion of translation successes, i.e., a low overall T1E rate or false positive translation success rate. All metrics except the replication BF (Figure A.1) control the overall T1E rate at α² when there is no animal or human study heterogeneity. An increase in human study heterogeneity inflates the proportion of false positive translations. The two-trials rule is least affected by changes in heterogeneity across human studies, followed by the unweighted Edgington, the weighted Edgington and the controlled sceptical p-value. The golden sceptical p-value (no shrinkage) keeps the overall T1E rate low, especially when there is no heterogeneity across studies of either animals or humans, which aligns with the theoretically expected pattern [38]. The golden sceptical p-values allowing for moderate/high shrinkage permit weaker animal findings to translate even if shrinkage is observed in the human study, but raise the overall T1E rate.
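
The α² benchmark follows from independence: under the null, one-sided p-values are uniform on [0, 1], so two independent significant results occur with probability α² = 0.025² = 0.000625. A quick Monte Carlo check:

```python
import random

random.seed(1)
alpha, n_sim = 0.025, 200_000

# Under the null, one-sided p-values are uniform on [0, 1]. With independent
# animal and human studies, both are significant with probability alpha^2.
hits = 0
for _ in range(n_sim):
    if random.random() < alpha and random.random() < alpha:
        hits += 1
rate = hits / n_sim
print(rate)  # close to alpha**2 = 0.000625
```

Heterogeneity breaks the uniformity of the null p-values (the partial T1E rates rise above α), which is why the joint rate is no longer controlled at α² in those scenarios.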

More animal study heterogeneity also increases the proportion of (false positive) translation successes for all metrics. For all golden sceptical p-values, the increase in the proportion with k when there is no animal study heterogeneity is due to the corresponding decrease in the relative sample size c. However, when there is low or high animal study heterogeneity, an increase in k leads to a decrease in the overall T1E rate for all metrics in Figure 2. This is likely related to the fact that increases in k decrease the partial animal T1E rate when animal study heterogeneity is low or high (see Figure 1.(a)). Generally, when there is no or low heterogeneity across animal studies combined with any level of heterogeneity across human studies, the two-trials rule performs relatively well, with an overall T1E rate mostly below α, and the weighted Edgington follows closely behind. However, when the heterogeneity across animal studies is high, the golden sceptical p-values (no and moderate shrinkage) perform better than the other metrics.

The metric based on meta-analysis and the replication BF, visible in Figure A.1, result in very high overall T1E rates. The replication BF weights the evidence of the “replication”, i.e., the human study, more heavily than the “original” animal finding. Increases in human study heterogeneity increase the partial T1E rate of the human finding, and the same is true for the overall T1E rate in the case of the replication BF. The meta-analysis metric treats the animal and human findings as interchangeable; it is therefore enough for just one of the findings to be very convincing to flag success. Since high heterogeneity in human studies results in an increased risk of a false positive human result, the overall T1E rate of the meta-analysis metric increases as well, up to 40% in extreme cases. Neither the replication BF nor the meta-analysis metric is affected much by increases in animal study heterogeneity. An increase in k decreases the overall T1E rate for the meta-analysis slightly, while it increases the overall T1E rate for the replication BF, especially when there is high heterogeneity across animal studies.
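
Under normal approximations, the replication BF has a simple closed form that makes its asymmetric weighting visible; this sketch uses the posterior of the "original" (animal) finding as prior for the human effect, and the exact specification used in the paper may differ:

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with given mean and variance."""
    return math.exp(-(x - mean)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def replication_bf(theta_o, se_o, theta_r, se_r):
    """Replication Bayes factor under normal approximations: evidence for the
    null against an alternative that uses the posterior of the 'original'
    (animal) estimate as prior for the 'replication' (human) effect.
    Small values indicate translation/replication success."""
    return normal_pdf(theta_r, 0.0, se_r**2) / \
           normal_pdf(theta_r, theta_o, se_o**2 + se_r**2)

# Consistent animal and human estimates give strong evidence against the null...
bf_consistent = replication_bf(theta_o=0.5, se_o=0.2, theta_r=0.5, se_r=0.1)
# ...while a significant but much smaller human estimate favours the null,
# because it contradicts the animal-based prior:
bf_inconsistent = replication_bf(theta_o=0.9, se_o=0.1, theta_r=0.2, se_r=0.1)
```

The numerator depends only on the human data, so a noisy or heterogeneous human finding directly drives the result, matching the sensitivity to human study heterogeneity described above.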

Results under the lenient criterion can be studied in the online appendix and follow similar trends. Under the strict criterion, translation success is conditional on the animal finding being significant. Consequently, the two-trials rule, both versions of Edgington’s method and the controlled sceptical p-value, which previously controlled the overall T1E rate at α² when there was no study heterogeneity in either animals or humans, now do so at level α = 0.025. The golden sceptical p-value (no shrinkage) now yields the lowest translation success rates across all scenarios. The results for the replication BF and the meta-analysis are more comparable to those of the other metrics, aside from the replication BF when human study heterogeneity is high. Under the strict criterion, the overall T1E rate tends to increase with k, except when there is high heterogeneity across animal studies.

Assuming small animal and human effects (center plot in Figure 2) Here, we observe translation success rates that are much lower than what we would expect, i.e., < (1 − β)² = 0.64, except for the replication BF in Figure A.1. This is because the animal studies with nA = 10 have insufficient power to detect a small effect. As shown in Figure A.3, the results look slightly better with the larger animal study sample size. In addition, by artificially increasing the sample size of the animal finding with k, we also observe that the rates for all metrics increase at least slightly. Notably, however, this trend is reversed when there is high heterogeneity across animal studies. Increases in heterogeneity across studies of either species decrease the translation power of all metrics except the replication BF and meta-analysis. The replication BF results in the highest proportion of successful translations under all conditions, generally followed by the meta-analysis. This can be explained by the fact that the replication BF puts more weight on the human study, and the meta-analysis treats human and animal findings as interchangeable. Among the golden sceptical p-values, the version allowing for no shrinkage would be the most appropriate here, since the true animal and human effect sizes are assumed to be equal. This metric leads to the lowest translation power, except when the heterogeneity across human studies is high; then, the two-trials rule leads to similar or smaller translation success rates. The controlled sceptical p-value generally yields higher translation power than the other metrics, and is even the highest when there is high animal study heterogeneity, aside from the replication BF and meta-analysis. The golden sceptical p-value (high shrinkage) also performs similarly well, especially with no or low animal study heterogeneity.
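
The underpowering can be made concrete with a normal-approximation power calculation; the standardised effect size d = 0.2 below is an illustrative assumption, not the value used in the simulation:

```python
import math
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=0.025):
    """Approximate one-sided power of a two-sample z-test for a
    standardised mean difference d with n_per_group animals per arm."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    ncp = d * math.sqrt(n_per_group / 2)  # approximate noncentrality
    return 1 - NormalDist().cdf(z_alpha - ncp)

# With n_A = 10 per group, power for the assumed small effect is well
# below 10%, and doubling the sample size helps only modestly:
print(round(power_two_sample(0.2, 10), 3))
print(round(power_two_sample(0.2, 20), 3))
```

With the individual animal study this underpowered, even a perfect combination metric cannot reach (1 − β)², which is consistent with the low translation power reported above.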

Under the strict criterion (see online appendix), the two-trials rule, unweighted and weighted Edgington and the controlled sceptical p-value are approximately equivalent to the power of the human studies (1 − β = 0.8), as illustrated in Figure 1.(a). The golden sceptical p-value (no shrinkage), which penalises borderline significant animal findings, generally finds the lowest proportion of translation success across conditions. The low-powered animal studies might lead to overestimated effect sizes for the animal findings, which is penalised most heavily by this version of the metric. In addition, weighted Edgington finds proportions of translation success that are equivalent to or slightly smaller than those of unweighted Edgington, while the proportions are larger when applying no criterion. This is because weighted Edgington puts less weight on the animal findings, which are heavily inflated under the strict criterion.

Assuming large animal and human effects (bottom right plot in Figure 2) When applying no criterion, most metrics find a translation power of almost 100% under most simulation conditions, except when there is high heterogeneity across animal studies, as in that case the animal studies have low power. Translation success rates are even closer to 100% under the strict criterion.

Assuming a small animal effect and a human null effect (center left plot in Figure 2) Under this combination of effect sizes, translation success should not occur. Indeed, the translation success rates for all metrics are generally small (< α) when there is no heterogeneity across animal or human studies. However, the proportion increases substantially with increasing levels of heterogeneity across human studies, and decreases with increasing levels of heterogeneity across animal studies. The meta-analysis, replication BF, golden sceptical p-value (high shrinkage) and controlled sceptical p-value lead to the highest proportions, while the golden sceptical p-value (no shrinkage) and the two-trials rule lead to the lowest. Note that the animal studies have low power to detect a small effect.

Assuming an animal null effect and a small human effect (top center plot in Figure 2) When the human effect is small, for which the human studies were powered at 80%, and the animal effect is null, all metrics but the replication BF and the meta-analysis generally result in translation success rates close to α. The proportion for the replication BF is close to the power of the human studies. Under the strict criterion, all metrics get closer to the human study power. Interestingly, under the strict criterion, the metric based on the meta-analysis is one of the metrics with the lowest proportions. This might be due to the human studies being powered at 80% and not higher.

Assuming a large animal effect and a human null effect (bottom left plot in Figure 2) Here, the animal studies have high power, though power decreases with increasing heterogeneity across animal studies. Accordingly, when animal study heterogeneity is non-existent or low, there is a high chance of a very convincing animal finding, which then results in high translation success rates for the meta-analysis metric. The replication BF tends to be the most conservative unless there is high animal study heterogeneity. The remaining metrics all perform similarly well, are only slightly affected by changes in k and generally follow the partial T1E rate of the human studies.

Assuming a small animal effect and a large human effect (center right plot in Figure 2) Here, the human studies are highly powered and the animal findings have low power. Hence, applying the strict criterion (see online appendix) leads to a translation success rate of almost 100% for all conditions except when heterogeneity across human studies is high, where it is then close to 90%. When no criterion is applied, the high-powered human studies still lead to a proportion of 1 or close to 1 for the replication BF and the meta-analysis. The remaining metrics are more conservative, with the two-trials rule yielding the smallest proportions across conditions. Proportions for all metrics increase with increasing k unless there is high animal study heterogeneity, and decrease with increasing human study heterogeneity.

Assuming an animal null effect and a large human effect (top right plot in Figure 2) Here, all metrics aside from the replication BF and the meta-analysis behave as one would expect: they rarely flag translation success. When there is no heterogeneity across animal studies, the proportion ranges from α for the two-trials rule to 10% for the golden sceptical p-value (high shrinkage) and the controlled sceptical p-value. These proportions further increase with increasing levels of animal study heterogeneity. Applying the strict criterion again results in proportions close to 1 for all metrics.

4 Discussion

In our simulation study, we investigated whether metrics used or developed to assess replication success can be applied and are useful in the context of the translation of results from animal studies to human studies. Our study was motivated by documented cases of translation failure in biomedical research. We aimed to assess how well various statistical metrics capture the concept of translation under a wide range of simulated conditions, including differences in effect sizes, effect size heterogeneity, animal study sample sizes and the number of animal studies pooled together. For this, we simulated animal and human studies using parameters informed by a real-world meta-analysis of the effect of prenatal amino acid supplementation on maternal blood pressure [26]. We also simulated different scenarios for the decision to move on to a human study: (1) any animal finding leads to a subsequent human study, (2) only beneficial animal findings lead to a subsequent human study, and (3) only significant beneficial animal findings lead to a subsequent human study. Based on the pairs of findings from the simulated animal and human studies, we evaluated nine metrics that have previously been discussed in the replication literature.

We show that the performance of the different metrics depends strongly on the simulation conditions. First, when both the animal and human true effects are null, most metrics, except for the replication BF and meta-analysis, control the overall T1E rate close to the theoretical α² under no heterogeneity. When heterogeneity increases, especially across human studies, the overall T1E rate increases. When both animals and humans had non-null effects, translation power was most influenced by whichever of the animal or human findings had lower power. For example, under small effects in both animals and humans and small animal sample sizes, translation power fell below (1 − β)². Conversely, assuming large effects in both animals and humans yielded near-perfect translation power, except in cases of high heterogeneity across animal studies, where translation power was lower.

Asymmetric effect size scenarios revealed systematic tendencies. Meta-analysis generally flagged success more often, driven by strong evidence in either animals or humans, while the two-trials rule and the golden sceptical p-value (no shrinkage) were more conservative and aligned more closely with the weaker of the two findings. The replication BF did not perform well whenever asymmetric effect sizes were simulated. Increasing the number of animal studies pooled (k) typically improved translation power when animal effects were non-null and heterogeneity was low, but had little benefit and even a negative impact when animal study heterogeneity was high, which, however, is often the case in practice.

Conditioning on significant animal findings as a strict criterion to move on to human testing inflated animal effect size estimates and affected the operating characteristics of the metrics, e.g., sometimes substantially increasing T1E rates or power. Overall, no single metric was uniformly optimal. The controlled sceptical p-value and weighted Edgington performed relatively well across many scenarios, while the replication BF and meta-analysis were highly sensitive to strong findings in either animals or humans. The golden sceptical p-values offered more conservative control at the cost of reduced power when true effects were small.
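
This inflation is an instance of the winner's curse and is easy to reproduce in a short simulation (the true effect and standard error below are illustrative assumptions, not the paper's parameters):

```python
import random

random.seed(7)
true_effect, se = 0.2, 0.15  # illustrative values for one animal estimate
z_crit = 1.96                # approx. one-sided significance at alpha = 0.025

all_estimates, selected = [], []
for _ in range(100_000):
    est = random.gauss(true_effect, se)  # one simulated animal estimate
    all_estimates.append(est)
    if est > z_crit * se:                # strict criterion: significant only
        selected.append(est)

mean_all = sum(all_estimates) / len(all_estimates)
mean_selected = sum(selected) / len(selected)
print(mean_all, mean_selected)  # the selected estimates are markedly inflated
```

Because only estimates that happen to exceed the significance threshold proceed, the animal findings carried forward systematically overestimate the true effect, which distorts the operating characteristics of any downstream translation metric.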

A conceptual challenge uncovered in our simulation study was how to interpret cases in which the true effect sizes in animals and humans differ in complex ways. For example, is a “translation success” desirable when the true animal effect is null but the human effect is small? Most probably it is not. Such cases would benefit from a deeper discussion in the community of what constitutes a successful translation, especially because animal testing is often treated as a precursor to human studies rather than an end in itself. It is therefore important to recognize that translation differs fundamentally from replication: in the translation setting, the human finding is the reference point and the target population against which success is ultimately judged.

Limitations

This study has various limitations. Our simulation study assumes a degree of comparability of effect sizes between animal and human studies that may not exist in practice. In reality, the magnitude of effects often differs substantially across species due to biological, methodological and environmental factors. The types of effects and outcome measurements investigated in animals might differ from those of interest in human studies. Human studies typically progress through various clinical phases with distinct goals, and our simulation study did not distinguish between these phases. Our choice to pool a maximum of five animal studies to form a single “animal finding” may be overly simplistic. In real-world settings, the decision to move to testing in humans is often not (solely) based on the statistical significance and direction of effect in a (relatively small) set of animal studies. A broader array of factors may be considered, including pharmacokinetics, safety profiles and ethical considerations. Our study looks only at the statistical aspect of translation, which is just one component of a more complex decision-making process. While the choice of our study parameters was informed by a real meta-analysis, our simulations are based on a single dataset and domain, which might limit the generalisability of our results. Other biomedical fields might exhibit different patterns of effect size differences and heterogeneity. To allow for the broadest possible range of scenarios, we used a fully factorial design and assigned all possible effect sizes and levels of heterogeneity to both animal and human studies. This may have introduced unrealistic scenarios. Finally, we focussed on a specific set of metrics. Other, more appropriate metrics might exist that we are unaware of.

Recommendations

Our findings highlight that the choice of translation success metrics, along with the design features of both animal and human studies, can meaningfully influence conclusions about “translatability”. The low translation power of small-sample animal studies, even if effects are truly present, suggests that pooling multiple studies or increasing sample sizes is crucial to reduce false negatives and avoid inflated effect size estimates, especially when results will be used to justify human clinical trials. Special attention should also be given to heterogeneity when interpreting translation failures, as even modest heterogeneity across studies can reduce the chance of translation success according to most metrics. Our results also suggest caution when basing the justification of clinical trials solely on statistical significance in animal findings, i.e., the strict criterion, as this can lead to overly optimistic expectations for human outcomes. When assessing translation outcomes, metrics that balance information from both animals and humans – such as the controlled sceptical p-value or weighted Edgington – may provide more robust conclusions than metrics that are driven by strong evidence in just one species (e.g., the replication BF, which focuses mainly on the human finding, and the meta-analysis).

Conclusions

We conclude that metrics developed for assessing replication success can offer valuable insights for assessing translation success. However, their utility depends strongly on the context, underlying assumptions, and the characteristics of the available evidence. No single metric performed optimally across all simulated scenarios. A combined approach, using multiple metrics alongside an understanding of their respective strengths and limitations, is recommended to assess when and how animal findings translate to human outcomes. Future research is needed to explore and better understand the behaviour of the metrics in the translation setting from a theoretical perspective to draw generalisable conclusions in biomedical contexts.

Data availability

All data and code files to reproduce our simulation results, this manuscript and the online supplement are available via GitLab, https://gitlab.uzh.ch/rachelheyard/translation_simulation. A citable snapshot of the repository at the time of writing is archived at https://doi.org/10.5281/zenodo.13587432. Online versions of the figures are available at https://rachelheyard.pages.uzh.ch/translation_simulation/.

Acknowledgements

We thank Gillian Currie and Bernhard Voelkl for valuable feedback on an earlier version of our manuscript. Additionally, we would like to thank the iRISE consortium, and especially work package 1, for continuous feedback on the conceptualization and reporting of our work.

Additional information

Funding statement

RH and KW receive funding from iRISE. iRISE receives funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101094853. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Executive Agency (ERA). Neither the European Union nor the ERA can be held responsible for them. iRISE also receives funding from the Swiss State Secretariat for Education, Research and Innovation (SERI): Direct Funding for Collaborative Projects as part of the transitional measures, and from UK Research and Innovation (UKRI). BVI receives funding from the Swiss National Science Foundation under grant number 407940_206504.

Author contributions

  • Conceptualization: CJH, SP, KEW, BVI, RH

  • Data curation: CJH, KEW

  • Formal Analysis: CJH, RH

  • Funding acquisition: KEW, BVI, RH

  • Methodology: CJH, SP, RH

  • Project administration: RH

  • Software: CJH, RH

  • Supervision: RH

  • Visualization: CJH, RH

  • Writing – original draft: CJH, RH

  • Writing – review & editing: CJH, RH, SP, KEW, BVI

Funding

European Commission (EC)

https://doi.org/10.3030/101094853

  • Rachel Heyard

  • Kimberley E Wever

Swiss National Science Foundation (407940_206504)

  • Benjamin Victor Ineichen

Appendix

Complete Figures

Figure A.1 shows the results of the simulation study for all metrics, including the replication BF and meta-analysis, across scenarios when the animal studies’ sample size is fixed to 10 per group. Figure A.2 shows the same type of results when the animal studies’ sample size is fixed to 20 per group, while Figure A.3 shows the zoomed-in results, in which the replication BF and meta-analysis were dropped for readability.

Figure A.1: Grid of nested loop plots of the proportions of animal-human pairs for which the different metrics flagged successful translation across simulation conditions under no criterion.

Each of the plots in the grid represents a different animal-human finding combination. In the first column, for example, the human studies are all simulated under the null hypothesis of no effect. The dotted horizontal lines represent α² = 0.000625, α = 0.025, 1 − β = 0.8 and (1 − β)² = 0.64. All animal studies in this representation are simulated with a small sample size per group (nA = 10).

Figure A.2: Grid of nested loop plots of the proportions of animal-human pairs for which the different metrics flagged successful translation across simulation conditions under no criterion.

Each of the plots in the grid represents a different animal-human finding combination. In the first column, for example, the human studies are all simulated under the null hypothesis of no effect. The dotted horizontal lines represent α² = 0.000625, α = 0.025, 1 − β = 0.8 and (1 − β)² = 0.64. All animal studies in this representation are simulated with a larger sample size per group (nA = 20).

Figure A.3: Grid of nested loop plots of the proportions of animal-human pairs for which the different metrics flagged successful translation across simulation conditions under no criterion.

Each of the plots in the grid represents a different animal-human finding combination. In the first column, for example, the human studies are all simulated under the null hypothesis of no effect. Note that the results for the replication BF and the meta-analysis are not shown here for better readability. The dotted horizontal lines represent α² = 0.000625, α = 0.025, 1 − β = 0.8 and (1 − β)² = 0.64. All animal studies in this representation are simulated with a larger sample size per group (nA = 20).