Introduction

Meta-analyses are critical for synthesising medical evidence and providing robust estimates of treatment effects (Murad et al., 2014; Dechartres et al., 2013; Nordmann et al., 2012). However, there are concerns that meta-analyses are frequently improperly conducted, and in recent years methodological concerns have been raised over the rigour of an increasing number of these publications (Lawrence et al., 2021; Siemens et al., 2021; Hameed et al., 2020), particularly when constituent Randomized Controlled Trials (RCTs) have data integrity issues which meta-analysis cannot solve (Wilkinson et al., 2024, 2025). An additional consideration deserves mention in dichotomous or binary outcome trials: the issue of fragility. In RCTs, many trial results are fragile, with recoding of a small number of events to non-events, or vice versa, in either the experimental or control arm often sufficient to convert a seemingly statistically significant result to a null one, and vice versa (Grimes, 2022; Tignanelli and Napolitano, 2019; Baer et al., 2021a,b).

While this issue has generated much recent discussion for RCTs, it has received little consideration for meta-analyses, which are often thought of as more robust to such manipulations. This, however, may not be the case. Previously, Atal et al (Atal et al., 2019) suggested that meta-analyses of binary outcomes may be more fragile than might be presumed. In that work, they established an algorithm to find the minimum recoding of patients between studies that would engineer a seemingly significant result from a null one, or vice versa. Applying this to a large sample of meta-analyses, Atal et al found that many were inherently fragile. This is a powerful method, but one possible objection is its non-generalisability: it finds very specific combinations of trials and patients that would have to be recoded. For example, an Atal meta-analytic fragility of 4 pertains to a specific and often unique circumstance in which 4 patients could be recoded from a specific study, or combination of studies, to change the output; it does not generalise to any 4 patients in that meta-analysis. This makes this definition of meta-analytic fragility useful but not generalisable, and less intuitive to interpret than a typical RCT fragility metric.

In this work, we establish a generalisable meta-analytic fragility metric based upon Ellipse of Insignificance (EOI) analysis for dichotomous outcome trials (Grimes, 2022). EOI is a refined, analytical fragility metric that simultaneously finds the degree of fragility in both the control and experimental arms of dichotomous outcome trials. It derives its name from a geometrical property of chi-squared tests: for any seemingly significant result mapped on a plane where one axis corresponds to recoding in the experimental arm and the perpendicular axis to the control arm, there is an associated ellipse inside which all results are null. EOI analysis analytically finds and resolves the minimum displacement (the Fewest Experimental/Control Knowingly Uncoded Participants (FECKUP) vector) from this region, reporting fragility as a proportion of group sizes in context. As it is analytical and based upon chi-squared testing, it handles samples of all sizes, reporting a readily interpreted, contextualised measure.
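The fragility concept underlying EOI can be illustrated numerically. The Python sketch below (our illustration only, not the analytic EOI implementation; function names and the example table are hypothetical) tests a 2×2 dichotomous outcome table with a Pearson chi-squared test and finds, by brute-force search, the smallest total number of event/non-event recodings across the two arms that renders a seemingly significant result null:

```python
def chi2_stat(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 table
    (a, b: events/non-events in the experimental arm;
     c, d: events/non-events in the control arm)."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den

def fragility(a, b, c, d, crit=3.841):
    """Smallest total recoding |dx| + |dy| (events <-> non-events,
    arm sizes fixed) that drops chi-squared to or below the 5%
    critical value; 0 if the result is already non-significant."""
    if chi2_stat(a, b, c, d) <= crit:
        return 0
    k = 1
    while True:
        for dx in range(-k, k + 1):   # recodings spent in experimental arm
            rem = k - abs(dx)         # remainder spent in control arm
            for dy in {-rem, rem}:
                na, nb, nc, nd = a + dx, b - dx, c + dy, d - dy
                if min(na, nb, nc, nd) < 0:
                    continue
                if chi2_stat(na, nb, nc, nd) <= crit:
                    return k
        k += 1

# hypothetical pooled table: 30/100 events vs 50/100 events
print(chi2_stat(30, 70, 50, 50))  # significant at p < 0.05
print(fragility(30, 70, 50, 50))
```

The analytic EOI method resolves this minimum displacement exactly rather than by exhaustive search; this sketch merely shows why a small ellipse of null results surrounds many significant findings.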

Intimately related to this is Region of Attainable Redaction (ROAR) analysis (Grimes, 2024b), a methodology for ascertaining the potential influence of redacted data on reported results. There are many reasons why relevant data might be missing from an analysis, ranging from simply absent data to missed studies, inappropriate cut-offs, and even deliberate cherry-picking (Grimes and Heathers, 2021). This is highly relevant to meta-analysis, and it would be useful to have a method that quantifies the potential impact of redaction on reported results and gives an additional estimate of how robust reported results may be as new data accrue. Accordingly, in this work we extend these methodologies to handle meta-analysis, contrast them with Atal et al’s metric, and apply them to conflicting vitamin D trials to illustrate their epidemiological utility. For brevity, we name this package of meta-analytic fragility tools EOIMETA.
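The redaction idea can be sketched in the same spirit. The brute-force Python example below (our illustration, not the analytic ROAR solution; names and the example table are hypothetical) finds the minimum number of additional event patients that would need to be appended to the experimental arm of a significant 2×2 result before it loses significance:

```python
def chi2_stat(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 table."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den

def redactions_to_null(a, b, c, d, crit=3.841, max_add=10**6):
    """Minimum number of event patients added to the experimental
    arm (growing its total size) before the result becomes null;
    returns 0 if the starting table is already non-significant."""
    for k in range(max_add + 1):
        if chi2_stat(a + k, b, c, d) <= crit:
            return k
    return None

# hypothetical table in which the experimental arm has the lower event risk
print(redactions_to_null(30, 70, 50, 50))
```

ROAR proper maps the full region of attainable redactions in both arms analytically; this one-directional search simply illustrates how few unseen patients can overturn a reported result.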

Methods

Derivation of Meta-analytic fragility and redaction analysis (EOIMETA)

For both EOI and ROAR, full mathematical details have been previously published (Grimes, 2022, 2024b) and an online implementation is available at https://drg85.shinyapps.io/EOIROAR/ for any dichotomous outcome trial. Extending this to the meta-analytic situation, consider a set of studies indexed by i, each with an experimental group with a_i events and b_i non-events, and a control group with c_i events and d_i non-events. Consider for any given meta-analysis the crude unadjusted pooled relative risk (RRp) and the Cochran–Mantel–Haenszel risk ratio (RRCMH), where the confounding effect between studies is given by

Δ = |RRp − RRCMH| / RRCMH × 100%.

When this is small (< 10%), confounding is minimal and EOI analysis can be deployed on pooled meta-analysis results by treating the entire meta-analysis as a single pooled sample. As meta-analyses weight studies by variance, simple pooling does not inherently account for study heterogeneity. We can also adjust for this by using a generic inverse-variance weighted average model – in this case, for each included study, we compute the relative risk and its standard error as
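This confounding screen can be computed directly from the per-study 2×2 counts. A minimal Python sketch follows (function names are our own; each tuple holds (a_i, b_i, c_i, d_i)):

```python
def crude_pooled_rr(studies):
    """Crude pooled relative risk from summed events and non-events."""
    a = sum(s[0] for s in studies); b = sum(s[1] for s in studies)
    c = sum(s[2] for s in studies); d = sum(s[3] for s in studies)
    return (a / (a + b)) / (c / (c + d))

def cmh_rr(studies):
    """Cochran-Mantel-Haenszel pooled risk ratio."""
    num = sum(a * (c + d) / (a + b + c + d) for a, b, c, d in studies)
    den = sum(c * (a + b) / (a + b + c + d) for a, b, c, d in studies)
    return num / den

def confounding_pct(studies):
    """Relative difference between crude and CMH estimates, in %."""
    rr_p, rr_cmh = crude_pooled_rr(studies), cmh_rr(studies)
    return abs(rr_p - rr_cmh) / rr_cmh * 100

studies = [(10, 90, 20, 80), (5, 95, 10, 90)]  # hypothetical counts
print(confounding_pct(studies))  # small value -> pooled EOI applicable
```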

RR_i = (a_i / (a_i + b_i)) / (c_i / (c_i + d_i)),   SE(ln(RR_i)) = √(1/a_i − 1/(a_i + b_i) + 1/c_i − 1/(c_i + d_i)),

where a continuity correction of 0.5 may be applied for any zero values. The weight of each trial is w_i = SE(ln(RR_i))^(−2) and the total pooled risk ratio RR_w is given by

ln(RR_w) = Σ_i w_i ln(RR_i) / Σ_i w_i.
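This weighting amounts to standard generic inverse-variance pooling on the log scale, which can be sketched as follows (function name is our own; here the 0.5 continuity correction is applied to every cell of a study containing a zero):

```python
import math

def pooled_rr(studies):
    """Inverse-variance weighted pooled risk ratio on the log scale;
    each study is a tuple (a, b, c, d) of events/non-events."""
    num = den = 0.0
    for a, b, c, d in studies:
        if 0 in (a, b, c, d):            # continuity correction
            a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
        rr = (a / (a + b)) / (c / (c + d))
        se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
        w = se ** -2                      # weight w_i = SE^(-2)
        num += w * math.log(rr)
        den += w
    return math.exp(num / den)

print(pooled_rr([(10, 90, 20, 80), (5, 95, 10, 90)]))
```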

Once the pooled risk ratio RR_w is known, we can scale the event and non-event counts in the experimental group to match the adjusted relative risk while keeping the total number in the experimental arm constant. Letting a = Σa_i, b = Σb_i, c = Σc_i and d = Σd_i, we allow a → a + x and b → b − x, where x may be positive or negative, to adjust the experimental group to the derived risk ratio. We can solve for this value to obtain

x = RR_w (a + b) c / (c + d) − a.

For consistency, the adjusted values of a and b are rounded to integers. This modification accounts for study-level heterogeneity, and a standard EOI analysis can then be applied to the vector (a, b, c, d). In addition, we can also apply ROAR analysis to the same vector, yielding the raw number of patients who could be added in a given direction to either or both arms to change the result, and the exact combination of control and experimental group redactions required to change the result from a significant finding to a null one. Caveats for implementation and interpretation are outlined in the discussion section.
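The adjustment step can be sketched in Python as follows (function name is our own; the control arm is left untouched and only the experimental events/non-events are shifted by x):

```python
def synthetic_table(studies, rr_w):
    """Pooled 2x2 table rescaled so its risk ratio matches rr_w,
    keeping the experimental arm total (a + b) constant."""
    a = sum(s[0] for s in studies); b = sum(s[1] for s in studies)
    c = sum(s[2] for s in studies); d = sum(s[3] for s in studies)
    # solve ((a + x)/(a + b)) / (c/(c + d)) = rr_w for x
    x = rr_w * (a + b) * c / (c + d) - a
    return round(a + x), round(b - x), c, d

studies = [(10, 90, 20, 80)]          # hypothetical counts
print(synthetic_table(studies, 0.6))  # -> (12, 88, 20, 80)
```

The returned vector is the synthetic pooled table on which EOI and ROAR are then run.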

Application to vitamin D studies

Results of vitamin D supplementation and cancer mortality trials have been markedly inconsistent. In recent years, several meta-analyses of published trials yielded markedly conflicting results. A review and meta-analysis by Zhang et al (Zhang et al., 2019) and its correction (Zhang et al., 2020) of 5 RCTs involving 39,168 patients found a significant association between vitamin D supplementation and reduced cancer mortality, reporting a 15% decrease in the risk of cancer death with vitamin D supplementation. Similarly, a meta-analysis by Guo et al (Guo et al., 2023) of 11 RCTs involving 112,165 patients found a significant 12% reduction in the relative risk of cancer death with supplementation. By contrast, a subsequent study of 6 RCTs involving 61,853 patients (Zhang et al., 2022) found no significant reduction in cancer death risk with vitamin D supplementation. All 3 meta-analyses involved combinations drawn from just 12 trials.

To investigate this apparent case of meta-analytic fragility, we extracted data from all 12 studies included in the 3 meta-analyses. A small inconsistency between the 3 meta-analyses was found in relation to one RCT (Scragg et al., 2018): in Zhang et al, the rates of cancer deaths in the supplementation and control groups were reported as 44/2558 and 45/2550 respectively, versus 28/2558 and 30/2558 in Guo et al and 30/2550 and 30/2550 in Zheng et al. Disagreement between these figures appears to stem from cancers detected in both groups prior to the randomisation process. Accordingly, the adjusted ITT figures used in this work are 30/2544 and 30/2535 respectively for Scragg et al’s work. A further nuance is that for another small RCT (Ammann et al., 2017), direct data were not given on the number of cases. This can, however, be estimated with minimal assumptions through the hazard ratio. For all included studies, relative risk and 95% confidence intervals were calculated explicitly, except for Lehouck et al (Lehouck et al., 2012), where zero recorded deaths in the supplementation sample required confidence intervals to be estimated by bootstrap methods. Initially, we ran fragility analysis on all of these meta-analyses, with the new metric outlined here as well as Atal et al’s algorithm. After this, we combined data and performed a meta-analysis on all 12 studies to ascertain potential effects of vitamin D on cancer mortality. A full fragility and robustness analysis was then performed on this, and results discussed to elucidate findings.

Meta-analytic fragility package

We developed a custom R package (Grimes, 2025) (eoirroar, available at https://github.com/drg85/EOIMETA and on Zenodo) to perform the fragility and redaction analyses described in the methods, as well as Atal et al’s study-specific algorithm for comparison.

Results

Meta-analytic fragility estimation of Vitamin D Cancer Mortality studies

For all conflicting meta-analyses, the unadjusted crude pooled risk ratio and CMH risk ratio were calculated prior to the analysis being conducted. For the three meta-analyses, confounding ranged from 0.08% to 1.1%, negligible differences which suggest the meta-analytic approach outlined in this work is applicable. Table 1 gives the results for EOI and ROAR meta-analysis as well as the Atal et al method. Figure 1 depicts the EOI analysis results for the meta-analyses by Zhang et al (including 5 RCTs, 38,538 patients) and Guo et al (11 RCTs, 111,952 patients).

Application of meta-analytic fragility / redaction sensitivity to conflicting meta-analyses

Figure 1 gives the EOI meta-analytic fragility vectors for both positive meta-analyses, showing the degree of recoding in both experimental and control groups required to flip results. The negative values on both axes arise because the relative risk in the experimental group (vitamin D supplementation) is ostensibly lower than in the placebo group, so in the convention of EOI analysis, negative values imply addition of events to the experimental group and/or subtraction of events from the control. In the case of Zhang et al, it would take recoding of just 4 events / non-events in either the experimental or control arm, or 4 recodings in total in combination (< 0.01% of the total sample), to lose significance, even less than the fragility estimated by Atal et al’s algorithm. In the case of Guo et al, recoding of 38 events / non-events (< 0.038% of sample) achieves the same result, whereas an even more extreme specific fragility is detected by Atal et al’s algorithm.

Meta-analytic EOI fragility analysis results for (a) Zhang et al and (b) Guo et al.

See text for details.

The redaction analysis is also informative, because even if there were no coding or data errors in any study or their synthesis, the addition of small numbers of patients can profoundly change the result. In the case of Zhang et al, adding just 3 patients to the cohort would be enough to lose significance, while in Guo et al it is 37. Despite Zheng et al having substantially fewer patients than Guo et al, it appears more robust by all metrics. This in turn leads to an important observation: the fact that a meta-analysis has more patients and studies does not inherently make it more robust or less fragile, as discussed in the next section.

Meta-analysis of all studies

Figure 2 depicts the meta-analysis forest plot of all 12 RCTs, comprising 133,475 patients, using a random effects model, while Table 2 gives the summary meta-analytic fragility statistics when all 12 studies are combined, finding that both EOI fragility and Atal et al’s fragility actually increase despite the increase in the number of subjects. Even with a large sample, the recoding of just 5 non-events to events, in this case in the placebo arm (additional deaths in the non-vitamin D group), would alter the results to cross the threshold of significance (p = 0.048) and likely change inferences.

Meta-analysis of all 12 RCTS including study details and relative risk of vitamin D supplementation on cancer mortality

Meta-analytic fragility of all 12 studies

Discussion

The meta-analytic fragility tool presented here, EOIMETA, is generalisable and flexible, complementing Atal et al’s study- and patient-specific method. The nuance between the two approaches is that Atal et al’s algorithm identifies a specific combination of patients from specific studies that may be moved to alter results, whereas the current work establishes a method for pooling meta-analyses to ascertain their group fragility and the potential impact of missing data. In this respect, they answer complementary and subtly different questions specific to meta-analyses.

In the RCT literature, a frequent objection to the use of fragility analysis methods is that the mere existence of a small fragility index might be an artefact of trial design. For clinical trials especially, a well-designed trial should be structured so as to minimize patient exposure to unknown harms, striving to ensure that just enough participants are enrolled to allow the detection of clinically relevant effects. The argument that RCTs at least might be fragile ‘by design’ is however convincingly countered by other authors, who find no evidence of p-value distributions clustering around significance thresholds after a sample size calculation, and find additionally that fragility in well-designed studies is not always low. More recent robust methods for fragility analysis like EOI, and for redaction like ROAR, also have applications far beyond ostensibly fragile-by-design RCTs, extending to cohort studies, preclinical work, and even ecological studies. The fragile-by-design argument is also a weaker one in the context of meta-analysis, where the marshalling of many studies to increase the precision of effect estimates should strengthen the evidence, but results here and previously suggest that even meta-analyses of large groups of patients can be profoundly fragile.

There are a few subtle issues and limitations to consider additionally. Atal et al’s meta-analytic fragility metric in principle identifies the precise combination of inter-study and patient recoding required to flip a result, a “worst-case” scenario identifying the most fragile point in a meta-analysis. This usually results in the Atal et al fragility metric being smaller than the complementary EOI fragility metric outlined here, but this is not always the case: because Atal et al’s method is a greedy algorithm, it is time-consuming to run on large collections of studies and can sometimes miss optimal, smaller solutions. Secondly, the adjustment of the events and non-events in the experimental group by inverse-variance weighting in the EOI fragility method subtly alters group composition, which can result in a smaller EOI fragility than a crude pooled estimate. What is evident from the examples shown in this work is that even in large cohorts with relatively low heterogeneity, recoding only minuscule numbers of patients can drastically alter interpretation; in the 12-study meta-analysis, recoding a mere 5 patients in 133,262 was sufficient to alter the result from a null to a positive finding.

While the method outlined here is analytic and rapid, it has some limitations that must be considered. Like Atal’s method, it applies only to meta-analyses of trials with dichotomous outcomes and is not suitable for continuous outcomes. It is also appropriate only in contexts where internal confounding is low, and it may not be reliable in the presence of high between-study heterogeneity. Accordingly, careful application is required. At its most essential, this implementation reflects the meta-analytic relative risk by generating a synthetic pooled 2×2 table that preserves the total sample size in each arm. This can be interpreted as the best single-study approximation of all studies under fixed-effect assumptions. The resulting meta-analytic fragility metric is useful for modeling perturbations in results, such as the addition of patients (who were previously redacted) or the recoding of events and non-events. The advantage of this approach lies in its tractability, offering a simpler alternative to working with multiple studies individually and a deterministic and analytic resultant fragility metric. However, it is important to emphasize that it does not correspond to true patient-level pooling or standardisation.

The use of vitamin D meta-analyses in this work was chosen as illustrative rather than specific, but it is worth noting that there are methodological concerns with much vitamin D research (Grimes et al., 2024). This goes beyond the scope of the current work, but serves as an example of the reality that a meta-analysis is only as strong as its underlying data, and the conclusions drawn from it must always be seen in context. While meta-analysis is a powerful method for refining effect size estimates, it cannot overcome poorly conducted research or bad data (Jané et al., 2025), nor is it intrinsically robust. These limitations should be kept in mind when considering meta-analytic results, and scientists should consider them doubly when opting to perform a meta-analysis. Mass-produced and unreliable meta-analyses are a recognised and growing problem (Ioannidis, 2016), and we must be mindful not to add to the growing issue of research waste and unreliable research (Glasziou and Chalmers, 2018; Grimes, 2024a).

Data Availability

All relevant code for this undertaking, including the study data from the included trials, is available at the linked GitHub repository and the Zenodo DOI:

https://zenodo.org/records/15878923

Additional information

Funding statement

No external funding received.