Determining fragility and robustness to missing data in binary outcome meta-analyses, illustrated with conflicting associations between vitamin D and cancer mortality

  1. TCD Biostatistics Unit, Discipline of Public Health and Primary Care, School of Medicine, Trinity College Dublin, Dublin, Ireland
  2. Retraction Watch, The Center For Scientific Integrity, New York, United States

Peer review process

Revised: This Reviewed Preprint has been revised by the authors in response to the previous round of peer review; the eLife assessment and the public reviews have been updated where necessary by the editors and peer reviewers.

Editors

  • Reviewing Editor
    Philip Boonstra
    University of Michigan, Ann Arbor, United States of America
  • Senior Editor
    Eduardo Franco
    McGill University, Montreal, Canada

Reviewer #1 (Public review):

[Editors' note: this version has been assessed by the Reviewing Editor without further input from the original reviewers. The authors have addressed the comments raised in the previous round of review.]

Summary:

This manuscript addresses an important methodological issue, the fragility of meta-analytic findings, by extending fragility concepts beyond trial-level analysis. The proposed EOIMETA framework provides a generalizable and analytically tractable approach that complements existing methods such as the traditional Fragility Index and Atal et al.'s algorithm. The findings are significant in showing that even large meta-analyses can be highly fragile, with results overturned by very small numbers of event recodings or additions. The evidence is clearly presented, supported by applications to vitamin D supplementation trials, and contributes meaningfully to ongoing debates about the robustness of meta-analytic evidence. Overall, the strength of evidence is moderate to strong.

Strengths:

(1) The manuscript tackles a highly relevant methodological question on the robustness of meta-analytic evidence.

(2) EOIMETA represents an innovative extension of fragility concepts from single trials to meta-analyses.

(3) The applications are clearly presented and highlight the potential importance of fragility considerations for evidence synthesis.

Reviewer #3 (Public review):

Summary and strengths:

In this manuscript, Grimes presents an extension of the Ellipse of Insignificance (EOI) and Region of Attainable Redaction (ROAR) metrics to the meta-analysis setting, as metrics for evaluating the fragility and robustness of a meta-analysis. The author applies these metrics to three meta-analyses of vitamin D and cancer mortality, finding substantial fragility in their conclusions. Overall, I think this extension/adaptation is a conceptually valuable addition to meta-analysis evaluation, and the manuscript is generally well-written.

Author response:

The following is the authors’ response to the previous reviews

Public Reviews:

Reviewer #1 (Public review):

Summary:

This manuscript addresses an important methodological issue, the fragility of meta-analytic findings, by extending fragility concepts beyond trial-level analysis. The proposed EOIMETA framework provides a generalizable and analytically tractable approach that complements existing methods such as the traditional Fragility Index and Atal et al.'s algorithm. The findings are significant in showing that even large meta-analyses can be highly fragile, with results overturned by very small numbers of event recodings or additions. The evidence is clearly presented, supported by applications to vitamin D supplementation trials, and contributes meaningfully to ongoing debates about the robustness of meta-analytic evidence. Overall, the strength of evidence is moderate to strong.

Strengths:

(1) The manuscript tackles a highly relevant methodological question on the robustness of meta-analytic evidence.

(2) EOIMETA represents an innovative extension of fragility concepts from single trials to meta-analyses.

(3) The applications are clearly presented and highlight the potential importance of fragility considerations for evidence synthesis.

Reviewer #3 (Public review):

(1) The manuscript would benefit from a clearer explanation of in what sense EOIMETA is generalizable. The author mentions this several times, but without a clear explanation of what they mean here.

This is a point I was remiss not to better elucidate. With regard to generalisation, the text has been modified to state explicitly that generalisability in this context means no specific study dependence, just a net number of subjects required to flip a result. The text reads:

“Atal's method is highly useful, but one possible objection is that it has the downside of non-generalisability, as it finds very specific combinations of trials and patients that would have to be re-coded (events classified as non-events and vice-versa) for results to become insignificant. For example, an Atal meta-analytic fragility of 4 pertains to a specific and often unique circumstance when 4 patients could be recoded from a specific study or combinations thereof to change outputs, but this does not generalise to any 4 patients in that meta-analysis. This makes this definition of meta-analytic fragility useful but not general, and perhaps less intuitive to interpret than a typical RCT fragility metric. In this work, we establish a generalizable meta-analytic fragility metric, based upon Ellipse of Insignificance (EOI) analysis for dichotomous outcome trials. This method creates a pool of events and non-events in both arms, adjusted for weighting, and answers the general question of how many patients would have to be effectively recoded in a meta-analysis for results to flip, without requiring specific study identification.”

(2) The authors mentioned the proposed tools assume low between-study heterogeneity. Could the author illustrate mathematically in the paper how the between-study heterogeneity would influence the proposed measures? Moreover, the between-study heterogeneity is high in Zhang et al's 2022 study. It would be a good place to comment on the influence of such high heterogeneity on the results, and specifying a practical heterogeneity cutoff would better guide future users.

This is a very fair observation, and I need to better explain myself here! There are effectively two measures of heterogeneity considered in this work: the typical value from a meta-analysis, and the divergence between the crude and the inverse-variance weighted adjusted measures. When these differ by small amounts, one could conceivably use either measure. I’ve changed the text to better reflect this, including:

“This modification is akin to pooled in a meta-analysis, and adjusts for study level heterogeneity. After this modification, a standard EOI analysis can then be applied to the vector . In addition, we can also employ ROAR analysis to the same vector, yielding the raw number of patients in either or both arms who could be added in a given direction to change the result, and the exact combination of control and experimental group redactions required to change the result from a significant finding to a null one. Caveats for implementation and interpretation are outlined in the discussion section.”
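
As a rough illustration of the divergence mentioned above, the sketch below compares a crude pooled risk ratio (events and totals simply summed across trials) with a fixed-effect inverse-variance weighted pooled risk ratio. This is a generic textbook calculation and not the EOIMETA implementation. The function names and the trial counts are hypothetical and chosen purely for illustration, and every arm is assumed to contain at least one event.

```python
import math

# Each trial is (events_exp, total_exp, events_ctrl, total_ctrl).
# All counts below are hypothetical, not taken from any vitamin D study.

def crude_pooled_rr(trials):
    """Risk ratio obtained by summing events and totals across trials."""
    e_events = sum(t[0] for t in trials)
    e_total = sum(t[1] for t in trials)
    c_events = sum(t[2] for t in trials)
    c_total = sum(t[3] for t in trials)
    return (e_events / e_total) / (c_events / c_total)

def iv_pooled_rr(trials):
    """Fixed-effect inverse-variance pooled risk ratio (pooled on the log scale)."""
    weighted_sum = 0.0
    weight_total = 0.0
    for a, n1, c, n2 in trials:
        log_rr = math.log((a / n1) / (c / n2))
        # Large-sample variance of the log risk ratio; needs a > 0 and c > 0.
        var = 1 / a - 1 / n1 + 1 / c - 1 / n2
        weight = 1 / var
        weighted_sum += weight * log_rr
        weight_total += weight
    return math.exp(weighted_sum / weight_total)

trials = [(15, 100, 20, 100), (30, 400, 45, 410), (8, 150, 9, 140)]
crude = crude_pooled_rr(trials)
weighted = iv_pooled_rr(trials)
# Relative divergence between the two pooled estimates.
divergence = abs(crude - weighted) / weighted
print(f"crude RR = {crude:.3f}, IV-weighted RR = {weighted:.3f}, "
      f"relative divergence = {divergence:.1%}")
```

When the relative divergence is small, the crude and weighted estimates tell much the same story. A large divergence suggests the crude pooling is not a safe stand-in for the weighted one.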

(3) I think clarifying the concepts of "small effect", "fragile result", and "unreliable result" would be helpful for preventing misinterpretation by future users. I am concerned that the audience may be confusing these concepts. A small effect may be related to a fragile meta-analysis result. A fragile meta-analysis doesn't necessarily mean wrong/untrustworthy results. A fragile but precise estimate can still reflect a true effect, but whether that size of true effect is clinically meaningful is another question. Clarifying the effect magnitude, fragility, and reliability in the discussion would be helpful.

This is an excellent suggestion. I’ve tried to do it with percentages, as in Table 2, but these are minute in the case of the vitamin D trials, partially, I suspect, because they are extraordinarily weak. The Cohen’s h for these meta-analyses yields tiny values, which I think might be tied to the virtually negligible percentages we obtain for number needed to flip. With stronger data, it might be worth expanding this into a useful heuristic measure for robustness, though I don’t think vitamin D data as in this work is going to help us much. In light of the reviewer’s excellent comment, I added lines 230-240 in the revised manuscript.
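
For readers unfamiliar with the effect-size measure mentioned above, Cohen's h is the difference between two arcsine-transformed proportions. The sketch below uses the standard formula. The function name and the two event proportions are hypothetical placeholders, not values from the vitamin D meta-analyses.

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Hypothetical event proportions in the experimental and control arms.
p_experimental = 0.020
p_control = 0.023
h = cohens_h(p_experimental, p_control)

# Cohen's conventional benchmarks: 0.2 small, 0.5 medium, 0.8 large.
print(f"Cohen's h = {h:.4f}")
```

Under Cohen's conventional benchmarks, an absolute value well below 0.2 indicates a very small effect, which is the kind of magnitude the response above describes.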

(4) Comments on revisions:

I am unable to find the author's responses to my previous round comments (Reviewer #3) in the revision package, though replies to the other reviewers are present. I will provide my updated feedback once these responses are available for review.

My sincere apologies, I neglected the specific comments in error – this document should address them now, thank you again for giving this your time and consideration!
