Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Read more about eLife’s peer review process.
Editors
- Reviewing Editor: Alex Fornito, Monash University, Clayton, Australia
- Senior Editor: Jonathan Roiser, University College London, London, United Kingdom
Reviewer #1 (Public review):
Summary:
The manuscript provides a well-argued discussion of the misalignment between the predictive performance evaluations commonly reported in the literature and actual measures of clinical utility in the context of predictive psychiatry. Specifically, the authors discuss measurement reliability and prevalence as two neglected factors that can substantially inflate the assessment of model performance for clinical practice. To mitigate this, the authors offer a concrete framework and an accompanying web tool with which to adjust performance metrics and to compute additional predictive-value and decision-analytic measures.
Strengths:
The manuscript speaks convincingly about the risk of relying on face validity and the practical irrelevance of seemingly promising predictive models in psychiatry. The authors outline how estimates of predictive performance often fail to generalize to clinical contexts and thereby potentially mislead scientific efforts. In the face of ubiquitous biomarker models and incremental improvements in the literature, the reader is reminded that, irrespective of the glory of the proposed model, low reliability of clinical measurements fundamentally affects (and limits) both effect sizes and predictive performance ("garbage in, garbage out"), and that neglecting this can ultimately lead to misinformed decisions in the treatment of individual patients. The provision of an online tool with a user-friendly interface and clearly worked examples is a major practical asset that will facilitate adoption of the proposed framework beyond quantitative methodologists.
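To make the attenuation point concrete, here is a minimal sketch (not from the manuscript or the tool, and with purely illustrative numbers) of the classical test theory attenuation formula the reviewer alludes to:

```python
from math import sqrt

def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation under classical test theory:
    r_obs = r_true * sqrt(rel_x * rel_y)."""
    return true_r * sqrt(rel_x * rel_y)

# Illustrative values: a true association of r = 0.50 measured with a
# predictor reliability of 0.70 and an outcome reliability of 0.60
print(round(attenuated_r(0.50, 0.70, 0.60), 2))  # 0.32 -- substantially attenuated
```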
Weaknesses:
While the outlined issues highlight important aspects of the translational gap, the suggested solutions remain somewhat theoretical. For example, using population prevalence might not reflect what a model would see in practice, as it assumes that population prevalence and the composition of actual clinical cohorts are aligned. Accounting for who presents to care, and under which referral or triage patterns, is a crucial determinant of effective base rates. While the authors do acknowledge the importance of using base rates from the target population, these nuances could be emphasized more prominently at the points where practical recommendations are made. Relatedly, the analytical context and the methodological assumptions are not clearly specified. Many arguments and demonstrations are derived in univariate, group-comparison settings and then discussed in a way that can be read as broadly applicable.
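As a hedged illustration of why effective base rates matter (the function and the numbers below are hypothetical, not taken from the manuscript), the positive predictive value of the same classifier changes sharply between a population-level prevalence and an enriched clinical cohort:

```python
def ppv(sensitivity, specificity, base_rate):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

# The same classifier (80% sensitivity, 80% specificity) evaluated at a
# 2% population prevalence versus a 30% base rate in a referred clinical sample
print(round(ppv(0.80, 0.80, 0.02), 2))  # 0.08 -- most positive calls are false alarms
print(round(ppv(0.80, 0.80, 0.30), 2))  # 0.63 -- far more clinically usable
```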
Reviewer #2 (Public review):
Summary and strengths:
The authors present a description of their online tool to estimate real-world performance of predictive models. The authors bring together different calculations to make better-informed implementation choices. It is a very nice tool to go from effect sizes to base rates to decision curve analysis. The paper describes the background and use of the tool with examples and seems like an extended version of their online how-to. The methods themselves are not new, but I think the tool will be valuable for researchers from different fields. Tools already exist for the conversion of effect sizes (my current favorite is https://www.escal.site/), but I haven't seen measurement noise being incorporated previously. The main benefit is the evaluation of performance under different real-world scenarios. Code is available on GitHub, and the manuscript is well-written.
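For readers unfamiliar with the pipeline the reviewer describes, a minimal sketch of going from an effect size to a discrimination metric and then to a decision-curve quantity might look like the following (assuming equal-variance normal groups; the functions and numbers are illustrative and are not the tool's actual code):

```python
from math import sqrt
from statistics import NormalDist

def d_to_auc(d):
    """Convert Cohen's d to AUC, assuming equal-variance normal groups."""
    return NormalDist().cdf(d / sqrt(2))

def net_benefit(sensitivity, specificity, base_rate, threshold):
    """Decision-curve net benefit at a given probability threshold."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos - false_pos * threshold / (1 - threshold)

print(round(d_to_auc(0.8), 2))                        # 0.71
print(round(net_benefit(0.80, 0.80, 0.30, 0.20), 3))  # 0.205
```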
Weaknesses:
While comprehensive explanation and examples are important for correct use of the tool, I don't really see the added value beyond their online how-to guide, as the software itself has already been published (Karvelis, P. and Diaconescu, A. O. (2025b). E2P Simulator: An interactive tool for estimating real-world predictive utility of research findings. Journal of Open Source Software, 10(114):8334).
Reviewer #3 (Public review):
Summary:
This important work provides a web-based tool to contextualize effect sizes in psychiatry with respect to reliability and base rates (collectively referred to as predictive utility analysis). The methods underlying this seemingly easy-to-use tool incorporate established psychometric principles that I think will be of use to multiple fields. I agree with the critical importance of this tool and the methodological points made in this manuscript. Enthusiasm for the manuscript is weakened by a lack of clarity in the framing of the paper and the stated goals of the examples used, with the inferences and implications for clinical decision-making from the various parameterizations of this tool left open-ended.
Strengths:
This paper presents a well-considered and, I think, highly useful web-based tool to contextualize effect sizes with respect to reliability and base rates. As the authors rightly point out, such a tool could be used in conjunction with widely used power analysis tools in study planning. The paper also contextualizes well the need for such a tool within the relatively recent history of concerns about power, reliability, and inference in psychiatry specifically, and the more general meta-scientific debates in psychology and neuroscience.
Weaknesses:
My primary feedback on this manuscript is the lack of clarity about what the paper itself, as distinct from the tool, is hoping to achieve. There is a central, but unresolved, tension in whether the reader is supposed to:
(1) focus on the specifics of the examples used and whether to reevaluate the substantive claims from those studies,
(2) buy in to how various reliability and base rate parameters impact modeling outcomes, or
(3) receive an introduction to the tool itself.
In my estimation, the largest contribution to the field here lies in (2) and (3), but currently much of the real estate of the paper is dedicated to several examples of (1). While these specific examples may be illustrative to some degree, I think that, given their number and brevity, they are unlikely to incidentally achieve points (2) and (3) above. Specific examples include the assertion of kappas for DSM diagnoses without much nuance (e.g., see https://psycnet.apa.org/buy/2015-27500-001). Given the relatively limited space devoted to this example, however, it's hard to be entirely certain what the reader should take away.
A second point of concern is where this tool would be situated in the research pipeline. I agree with the authors that this tool could be used in ways that parallel power analysis. With that in mind, it seems the most common use of this tool for an individual investigator is likely to be in a priori study planning. In contrast, and with my point above in mind, the use of the tool for existing results is likely best done with multiple estimates of effect sizes, reliability, and base rates, as is common in meta-analysis or consensus reviews. Nevertheless, there is no real example or guidance around how this influences new study planning.
A third point is that more nuance would be useful in the introduction about the current state of psychiatry research. For example, I share many of the authors' concerns about reliability, power, reproducibility, and barriers to translation. That said, while effect sizes should be considered considerably more, they are already widely considered in psychiatry research through the commonplace use of meta-analysis and other data-pooling approaches. Another example is the authors' statement in the context of reliability: "However, this [reliability] attenuation is rarely accounted for in routine analyses in psychiatry". This is true in practice, but somewhat misleading insofar as the method by which to do this remains unclear. For example, should we all report disattenuated associations, as if there were no error and everything were perfectly reliable? Expecting zero error would, of course, be unrealistic. That we can achieve this with the new tool is clear, but how and under what circumstances it should be done is not, and such nuance should be better reflected in the framing of the problem. That is, there is also a lack of clarity about what ought to be best practices and field-wide goals, rather than simply the lack of an ability to model these factors.
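For context on the disattenuation question the reviewer raises, the standard Spearman correction (shown here as an illustrative sketch with hypothetical numbers, not as a recommendation from the manuscript) simply inverts the attenuation formula, which is part of why blanket disattenuation is contentious:

```python
from math import sqrt

def disattenuated_r(observed_r, rel_x, rel_y):
    """Spearman's correction for attenuation: r_true = r_obs / sqrt(rel_x * rel_y)."""
    return observed_r / sqrt(rel_x * rel_y)

# Illustrative values: an observed r of 0.30 with reliabilities of 0.70 and 0.60
print(round(disattenuated_r(0.30, 0.70, 0.60), 2))  # 0.46 -- the correction can substantially inflate estimates
```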
Minor point:
For conceptual clarity, it would benefit the manuscript to at least briefly mention the role of validity in translational importance. Of course, the psychometric issues of reliability, base rate, power, etc. are critical, but given the potentially wide audience of this manuscript, it should at least be mentioned that validity is important as well. For example, highly reliable measures may not be valid indicators of underlying disease etiology (e.g., fMRI head motion is a highly reliable trait-level feature, but it is typically not considered an important predictor or consequence of mental health worth investing translational resources in). Relatedly, confounding as a general topic would be useful to mention briefly, to help with the spirit of considering underlying issues in translation.