Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Read more about eLife’s peer review process.
Editors:
- Reviewing Editor: Peter Rodgers, eLife, Cambridge, United Kingdom
- Senior Editor: Peter Rodgers, eLife, Cambridge, United Kingdom
Reviewer #1 (Public review):
Summary:
The authors set out on the ambitious task of establishing the reproducibility of claims from the Drosophila immunity literature. Starting from a corpus of 400 articles published between 1959 and 2011, the authors sought to determine whether their claims were confirmed or contradicted by previous or subsequent publications. Additionally, they actively sought to replicate a subset of the claims for which no previous replications were available (although this subset was not representative of the whole sample, as the authors focused on suspicious and/or easily testable claims). The focus of the article is on inferential reproducibility; thus, methods do not necessarily map exactly to the original ones.
The authors present a large-scale analysis of the individual replication findings, which are reported in a companion article (Westlake et al., 2025, DOI 10.1101/2025.07.07.663442). In their retrospective analysis of reproducibility, the authors find that 61% of the original claims were verified by the literature, 7.5% were partially verified, and only 6.8% were challenged, with 23.8% having no replication available. This is in stark contrast with the result of their prospective replications, in which only 16% of claims were successfully reproduced.
The authors proceed to investigate correlates of replicability, with the most consistent finding being that findings stemming from higher-ranked universities (and possibly from very high impact journals) were more likely to be challenged.
Strengths:
(1) The work presents a large-scale, in-depth analysis of a particular field of science, conducted by authors with deep domain expertise in that field. This is a rare endeavour to establish the reproducibility of a particular subfield of science, and I'd argue that we need many more of these in different areas.
(2) The project was built on a collaborative basis, using an online database (https://ReproSci.epfl.ch/) to organize the annotations and comments of the community about the claims. The website remains online and can be a valuable resource to the Drosophila immunity community.
(3) Data and code are shared in the authors' GitHub repository, with a Jupyter notebook available to reproduce the results.
Main concerns:
(1) Although the authors claim that "Drosophila immunity claims are mostly replicable", this conclusion is strictly based on the retrospective analysis, in which around 84% of the claims for which a published verification attempt was found were verified. This is in very stark contrast with the claims that the authors replicated prospectively, of which only 16% were verified.
Although this large discrepancy may be explained by the fact that the authors focused on unchallenged and suspicious claims (which seems to be their preferred explanation), an alternative hypothesis is that there is a large amount of confirmation bias in the Drosophila immunity literature, either because attempts to replicate previous findings tend to reach similar results due to researcher bias, or because results that validate previous findings are more likely to be published.
Both explanations are plausible (and, not being an expert in the field, I'd have a hard time estimating their relative probability), and in the absence of prospective replication of a systematic sample of claims - which could determine whether the replication rate for a random sample of claims is as high as that observed in the literature - both should be considered in the manuscript.
(2) The fact that the analysis of factors correlating with reproducibility includes both prospective and retrospective replications also raises the possibility of confounding bias in this analysis. If most of the challenged claims come from the authors' prospective replications, while most of the verified ones come from those that were replicated by the literature, it becomes unclear whether the identified factors are correlated with the actual reproducibility of the claims or with the likelihood that a given claim will be tested by other authors and that this replication will be published.
(3) The methods are very brief for a project of this size, and many aspects of how it was determined whether claims were conceptually replicated, and of how the replications were set up, are missing.
Some of these - such as the PubMed search string for the publications and a better description of the annotation process - are described in the companion article, but this could be stated more explicitly. Others, however, remain obscure. Statements such as "Claims were cross-checked with evidence from previous, contemporary and subsequent publications and assigned a verification category" summarize a very complex process for which more detail should be given - in particular because what constitutes inferential reproducibility is not a self-evident concept. And although I appreciate that what constitutes a replication is ultimately a case-by-case decision, a general description of the guidelines the authors used to determine this should be provided. As these processes were performed by one author and reviewed by another, it would also be useful to know the agreement rates between them, to get a general sense of how reproducible the annotation process itself might be.
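To illustrate what I mean by agreement rates, a chance-corrected statistic over a double-annotated subset would suffice. A minimal sketch in Python (the labels and variable names below are hypothetical placeholders, not the authors' actual data):

```python
# A minimal sketch (hypothetical data): chance-corrected agreement between the
# annotator who assigned verification categories and the author who reviewed them.
from sklearn.metrics import cohen_kappa_score

# Placeholder labels; in practice these would be the two raters' categories per claim.
annotator = ["verified", "challenged", "verified", "unchallenged"]
reviewer  = ["verified", "verified",   "verified", "unchallenged"]

print(f"Cohen's kappa: {cohen_kappa_score(annotator, reviewer):.2f}")
```

Even a simple percent-agreement figure alongside kappa would give readers a sense of how stable the categorization is.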
The same gap in methods descriptions holds for the prospective replications. How were labs selected, how were experimental protocols developed, and how was the validity of the experiments as conceptual replications assessed? I understand that providing the methods for each individual replication is beyond the scope of the article, but a general description of how they were developed would be important.
(4) As far as I could tell, the large-scale analysis of the replication results was not preregistered, and many decisions seem somewhat ad hoc. In particular, the categorization of journals (e.g. low impact, high impact, "trophy") and universities (e.g. top 50, 51-100, 101+) relies on arbitrary thresholds, and it is unclear how much the results are dependent on these decisions, as no sensitivity analyses are provided.
In particular, for analyses that correlate reproducibility with continuous variables (such as year of publication, impact factor or university ranking), I'd strongly favor using these as continuous variables in the analysis (e.g. using logistic regression) rather than performing pairwise comparisons between categories determined by arbitrary cutoffs. This would not only reduce the impact of arbitrary thresholds on the analysis, but would also increase statistical power in the univariate analyses (as the whole sample can be used at once) and reduce the number of parameters in the multivariate model (as each predictor would be included as a single variable rather than as multiple dummy variables when there are more than two categories).
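As a minimal sketch of what I have in mind, something along these lines would fit naturally in the authors' existing Python/Jupyter workflow (the column names and file name below are hypothetical assumptions about the per-claim table, not the authors' actual schema):

```python
# A minimal sketch (hypothetical column names): logistic regression treating
# year, impact factor, and university rank as continuous predictors instead of
# binning them into arbitrary categories.
import pandas as pd
import statsmodels.formula.api as smf

claims = pd.read_csv("claims.csv")  # hypothetical per-claim table
claims["verified"] = (claims["verification_category"] == "verified").astype(int)

model = smf.logit(
    "verified ~ publication_year + impact_factor + university_rank",
    data=claims,
).fit()
print(model.summary())
```

The categorical figures could of course be kept for presentation, but inference would then rest on the continuous model rather than on the cutoffs.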
(5) The multivariate model used to investigate predictors of replicability includes unchallenged claims along with verified ones in the outcome, which seems like an odd decision. If the intention is to analyze which factors are correlated with reproducibility, it would make more sense to remove the unchallenged findings, as these are likely uninformative in this sense. In fact, based on the authors' own replications of unchallenged findings, such claims may be more likely to end up in the "challenged" category than in the "verified" one if verification were actually attempted.
Reviewer #2 (Public review):
Summary:
Lemaitre et al. conducted an analysis of 400 publications in the Drosophila immunity field (1959-2011), performing both univariable and multivariable analyses to identify factors that correlate with or influence the irreproducibility of scientific claims. Some of the findings are unexpected, for instance, neither the career stage of the PI nor that of the first author appears to matter that much, while others, such as the influence of institutional prestige or publication in "trophy journals," are more predictable. The results provide valuable insight into patterns of irreproducibility in academia and may help inform policies to improve research reproducibility in the field.
Strengths:
This study is based on a large, manually curated dataset, complemented by a companion paper (Westlake et al., 2025. DOI 10.1101/2025.07.07.663442) that provides additional details on experimentally documented cases. The statistical methods are appropriate, and the findings are both important and informative. The results are clearly presented and supported by accessible documentation through the ReproSci project.
Weaknesses:
The analysis is limited to a specific field (immunity) and model system (Drosophila). Since biological context may influence reproducibility -- for example, depending on whether mechanisms are more hardwired or variable -- and the model system itself may contribute to these effects (as the authors note), it remains unclear to what extent these findings generalize to other fields or organisms. The authors could expand the discussion to address the potential scope and limitations of the study's generalizability.
Reviewer #3 (Public review):
Summary:
The authors of this paper were trying to identify how reproducible, or not, their subfield (Drosophila immunity) has been since its inception over 50 years ago. This required identifying not only the papers, but the specific claims made in each paper, assessing whether these claims were followed up in the literature, and if so whether the subsequent papers supported or refuted the original claim. In addition to this large manually curated effort, the authors further investigated some claims that were left unchallenged in the literature by conducting replications themselves. This provided a rich corpus of the subfield that could be investigated to determine what characteristics influence reproducibility.
Strengths:
A major strength of this study is the focus on a subfield, the detailed identification of the main, major, and minor claims - a very challenging manual task - and the subsequent cataloging not only of whether these claims were followed up in the literature, but also of what characteristics might contribute to reproducibility, which required further manual effort to supplement the data that could be extracted from the published papers. While this provides a rich dataset for analysis, there is a major weakness with this approach, which is not unique to this study.
Weaknesses:
The main weakness is the heavy reliance on the published literature as the source for determining whether a claim was verified or not. There are many documented issues with this across every field of research - from publication bias and selective reporting all the way to fraud. It is understandable why the authors took this approach - it is the only way to cover the breadth of the literature - but the flaw is that it treats the literature as solid ground truth, which it is not. At the same time, it is not reasonable to expect the authors to have conducted independent replications for all of the 400 papers they identified. However, there is a big difference between assessing the reproducibility of the literature using the literature itself as the 'ground truth' and doing so independently, as other large-scale replication projects have attempted to do. This makes the interpretation of the data somewhat challenging.
Below are suggestions for the authors and readers to consider:
(1) I understand why the authors prefer to use claims as their primary unit of reporting, but claims are nested within papers, and that makes it very hard to understand how to interpret these results at times. I also cannot work out, at a high level, the relationship between claims and papers. The methods suggest there are 3-4 major claims per paper, but with 400 papers and 1,006 claims, this averages to ~2.5 claims per paper. Could the authors describe this relationship better (e.g., the distribution of claims across papers) and/or consider presenting the data two ways (primary figures with claims as the unit and complementary supplementary figures with papers as the unit), as sketched below? This would help the reader interpret the data both ways without confusion. I am also curious how the results look when presented both ways (e.g., does shifting to the paper as the unit of analysis change the figures and interpretation?). This is especially relevant since the first and last author analysis shows a varying distribution of papers and claims across authors (and thus the relationship between these matters to the reader).
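A minimal sketch of the kind of summary I have in mind, assuming a per-claim table with a paper identifier (the file and column names are hypothetical, not the authors' actual schema):

```python
# A minimal sketch (hypothetical column names): claims-per-paper distribution and
# a paper-level outcome for complementary, paper-as-unit figures.
import pandas as pd

claims = pd.read_csv("claims.csv")  # hypothetical per-claim table with a 'paper_id' column

claims_per_paper = claims.groupby("paper_id").size()
print(claims_per_paper.describe())                   # mean should be ~2.5 (1,006 claims / 400 papers)
print(claims_per_paper.value_counts().sort_index())  # full distribution of claims per paper

# Collapse to one row per paper, e.g. flagging papers with at least one challenged claim.
paper_level = claims.groupby("paper_id")["verification_category"].agg(
    any_challenged=lambda s: (s == "challenged").any()
)
print(paper_level["any_challenged"].mean())          # fraction of papers with a challenged claim
```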
(2) As mentioned above, I think the biggest weakness is that the authors take the literature at face value when assigning whether a claim was validated or challenged, rather than gathering new independent evidence. This means the paper leans heavily on the published record, making it more like a citation analysis than an independent effort like other large-scale replication projects. I highly recommend the authors state this in their limitations section.
On top of that, I have questions that I could not answer myself (though I acknowledge I did not dig very deep into the data to try). The main one is: how was "verified" (and "challenged") status determined? From the methods, it appears to have been determined as follows: "Claims were cross-checked with evidence from previous, contemporary and subsequent publications and assigned a verification category". If this is true, and all claims were handled this way, are verified claims then double counted (e.g., an original claim is found by a later claim to be verified, and that later claim is in turn considered verified because of the original claim)?
Relatedly, did the authors look at the strength of validation of verified or challenged claims? That is, given the mapping the authors made between original claims and follow-up claims, I would imagine some claims have deeper (i.e., more) follow-up than others. This might be interesting to look at as well.
(3) I recommend the authors add sample sizes where they are not present (e.g., Fig 4C). I also find the sample sizes a bit confusing, and I recommend the authors check them and add more explanation where they are incomplete, as they did for Fig 4A. For example, Fig 7B sums to 178 labs (how did more than 156 labs end up here?), and yet the total number of claims is 996 (as opposed to 1,006). Another example: why does Fig 8B not have all 156 labs accounted for? (Related to Fig 8B, I caution against reporting a p value and drawing strong conclusions from this very small sample of 22 authors.) As a last example, Fig 8C has all 156 labs and 1,006 claims - is that expected? I guess it means that authors who published before 1995 (as shown in Figure 8A) continued to publish after 1995, in which case this is all authors? But the text says this concerns authors who 'set up their lab' after 1995, so how can that be?
(4) Finally, I think it would help if the authors expanded on the limitations more generally, and on potential alternative explanations and/or driving factors. For example, the phrase "though likely underestimated" appears in the discussion of the low rate of challenged claims; it might be useful to state explicitly that publication bias is likely the driver here and that this needs to be carefully considered when interpreting the result. Relatedly, I caution the authors against overinterpreting their suggestive evidence. The abstract, for example, presents as findings results that are suggestive at best, as the authors themselves acknowledge in the paper. But since most people start with the abstract, I worry this conveys stronger evidence than the authors actually have.
The authors should be applauded for the monumental effort they put into this project, which does a wonderful job of having experts within a subfield engage their community to understand the connectedness of the literature, and of attempting to understand how reliable specific results are and what factors might contribute to their reliability. This project provides a nice blueprint for others to build on, as well as data from this subfield for others to leverage, and thus should have an impact on the broader discussion of the reproducibility and reliability of research evidence.