1 Introduction

The search for clinically useful psychiatric biomarkers has been a major pursuit of the past several decades, spanning structural and functional neuroimaging, genetic and multiomic approaches, cognitive and computational assays, and, more recently, digital and deep phenotyping (Singh and Rose, 2009; García-Gutiérrez et al., 2020; Jollans and Whelan, 2018; Abi-Dargham et al., 2023; Stephan and Mathys, 2014; Onnela and Rauch, 2016; Clark et al., 2025). With the emergence of precision medicine (National Research Council, 2011; Collins and Varmus, 2015), these efforts have recently culminated in the vision of precision psychiatry: the idea that accounting for individual variability across different modalities can help go beyond broad diagnostic categories and trial-and-error treatments, towards more personalized mental health care (Fernandes et al., 2017; Bzdok and Meyer-Lindenberg, 2018; Rutledge et al., 2019; Patzelt et al., 2018; Huys et al., 2016; Paulus and Thompson, 2021; Karvelis et al., 2023; Van Dellen, 2024).

Despite extensive research and methodological advances, the real-world clinical impact of individualized prediction models remains difficult to achieve (Salazar de Pablo et al., 2021). While there is a lot of excitement in the field, with many empirical studies reporting high predictive accuracy, a more critical look suggests that this optimism may be misplaced: overfitting (Meehan et al., 2022; Chekroud et al., 2024; Tornero-Costa et al., 2023; Bracher-Smith et al., 2021), small sample sizes, and biased validation methods are the rule rather than the exception (e.g., Karvelis et al., 2022; Schnack and Kahn, 2016; Hunt et al., 2024; Chen et al., 2025). Most individual biomarkers have likewise failed to replicate or to demonstrate clinical utility (Abi-Dargham et al., 2023; Cortese et al., 2023). The literature seems to be filled with weak and false-positive findings (Ioannidis and Panagiotou, 2011; Open Science Collaboration, 2015; Ioannidis, 2005).

A natural question arises: what is holding back progress? While many factors are at play (Kambeitz-Ilankovic et al., 2022; Felsky et al., 2023), here we argue that one crucial bottleneck is a pervasive misalignment between routine statistical analysis practices and the requirements for clinical translation. The most common analytic approaches are optimized for establishing the existence of effects (p-values, model evidence), while effect sizes, which convey practical significance, often receive much less attention (Loth et al., 2021; Kapur et al., 2012; Wasserstein et al., 2019; Lo et al., 2015; Flora, 2020). This is further exacerbated by failing to account for measurement reliability, which attenuates effect sizes (Karvelis and Diaconescu, 2025a; Karvelis et al., 2023; Hedge et al., 2018), diminishing their translational value. When multiple predictors with small effect sizes are then combined to build prediction models, the resulting models underperform (a shortfall often masked by performance metrics inflated through overfitting). On top of that, predictive performance is often evaluated using metrics (e.g., sensitivity, specificity) (Cortese et al., 2023) that do not account for real-world outcome base rates (e.g., prevalence, response rates) and therefore convey neither real-world predictive value (Abi-Dargham and Horga, 2016; Carter et al., 2017; Jeni et al., 2013; Brabec et al., 2020; Foody, 2023; Guesné et al., 2024) nor clinical utility (Rousson and Zumbrunn, 2011). Although these problems are documented in the literature and many are addressed by guidelines such as TRIPOD (Moons et al., 2015; Collins et al., 2024) and STARD (Bossuyt et al., 2015), the lack of accessible tools for dealing with them presents ongoing challenges for the wider research community.

In what follows, we will elaborate on how these methodological challenges collectively undermine progress towards precision psychiatry (Fig. 1). To address them, we will introduce the E2P (effect-to-prediction) Simulator, www.e2p-simulator.com (Karvelis and Diaconescu, 2025b), an open-source web tool for evaluating effect sizes and predictive performance while accounting for real-world outcome base rates and measurement reliability, yielding predictive value and clinical utility measures - a procedure we term predictive utility analysis. To demonstrate the range of its applications, we will consider three key areas - diagnostic, treatment response, and risk prediction - highlighting important insights along the way.

Fig. 1. Precision psychiatry: from individual differences to clinical prediction.

The path to precision psychiatry begins with research on individual differences, which exist across many dimensions. The first challenge is to develop tools to measure and characterize these differences (phenotyping). Next, clinically relevant differences must be identified (biomarker research). Finally, collections of relevant markers can be used to build prediction models. However, systemic problems in research practices (highlighted in red) at multiple steps in this pathway undermine progress.

2 Statistical challenges in understanding practical significance

2.1 Statistical significance is not practical significance

Research in psychiatry and the behavioral sciences is dominated by null-hypothesis significance testing (NHST) and p-values, which have well-known limitations (Kirk, 1996; Szucs and Ioannidis, 2017; Ioannidis, 2005; Wasserstein and Lazar, 2016; Wasserstein et al., 2019; Yarkoni, 2022; Meehl, 1992; Kapur et al., 2012). Most problematically, based solely on statistical significance, researchers often conclude that their findings have potential for clinical utility - but significant variables are not automatically good predictors (Wasserstein and Lazar, 2016; Lo et al., 2015; Szucs and Ioannidis, 2017; Kapur et al., 2012; Loth et al., 2021; Kühberger et al., 2015). The same limitations apply to more sophisticated measures of Bayesian model evidence (Stephan et al., 2009; Friston et al., 2016; Palminteri et al., 2017; Wilson and Collins, 2019) - both approaches quantify confidence in the existence of effects, not their strength (Fig. 2). In the age of increasingly large datasets, this is problematic because ever smaller - and thus practically more negligible - effects can reach statistical significance and become the focus of scientific inquiry (Szucs and Ioannidis, 2017; Marek et al., 2022). To gauge the translational value of research findings, it is therefore important to focus on the estimation and interpretation of effect sizes (Calin-Jageman, 2018; Flora, 2020; Nakagawa and Cuthill, 2007).

Fig. 2. Different levels of research questions with associated methods and metrics.

Much of the research planning and clinical translation strategy is currently formulated based on the mere existence of effects. This can be misleading, as translational potential is proportional not to statistical significance or model evidence, but to effect size. These observed effects can be further expressed as discriminative ability and - by accounting for real-world base rates - as predictive value and clinical utility. Each subsequent step becomes increasingly informative for gauging translational potential, as conveyed by the emojis.

2.2 Poor measurement reliability attenuates effect sizes and predictive utility

The lack of focus on effect size interpretation also leads to overlooking the importance of measurement reliability, which attenuates observed effects. This is particularly problematic in psychiatry, where many constructs - from cognitive (Karvelis et al., 2023; Enkavi et al., 2019), to neuroimaging (Elliott et al., 2020; Nikolaidis et al., 2022; Gell et al., 2024; Vidal-Piñeiro et al., 2025), to diagnostic categories themselves (Regier et al., 2013) - are measured with substantial unreliability.

For correlational analysis, the attenuation of observed effects by poor measurement reliability has long been established (Spearman, 1904):

$$ r_{\text{observed}} = r_{\text{true}} \sqrt{\mathrm{ICC}_x \cdot \mathrm{ICC}_y} $$

where ICCx and ICCy denote the reliabilities of the two variables expressed as the Intraclass Correlation Coefficient (ICC). However, this attenuation is rarely accounted for in routine analyses in psychiatry. Attenuation is considered even less in the context of group differences (e.g., patients vs. controls), where the formulae describing it were not established until very recently (Karvelis and Diaconescu, 2025a). In the general case considered in this paper, the observed Cohen's d is attenuated by both the predictor reliability in each of the groups (ICC1 and ICC2) and the reliability of the group labels themselves (measured as Cohen's κ; e.g., inter-rater agreement).

Given that poor measurement reliability makes the groups overlap more, we would expect it to have important consequences in predictive modeling. Yet, its effects have been explored only by a small number of studies (Gell et al., 2024; Nikolaidis et al., 2022; Jacobucci and Grimm, 2020; Whittle et al., 2018; Lionetti et al., 2025; Cullen et al., 2023). As we will see later, the nature of the E2P Simulator allows anyone to explore these effects without needing to code their own simulation pipelines.
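For readers who wish to verify this attenuation numerically before turning to the simulator, the following minimal Python sketch (ours, for illustration only; it is not the simulator's implementation) mimics unreliable measurement by adding noise to true scores, and mimics unreliable labels by flipping a fraction of group assignments - a simplification of label unreliability rather than an exact treatment of Cohen's κ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_true = 100_000, 1.0
icc, flip_rate = 0.6, 0.10   # predictor reliability; fraction of mislabeled cases

group = np.repeat([0, 1], n)                        # true group membership
x_true = rng.normal(0, 1, 2 * n) + d_true * group   # true scores, separated by d_true
# Unreliable measurement: add noise so that Var(true) / Var(observed) = ICC
x_obs = x_true + rng.normal(0, np.sqrt(1 / icc - 1), 2 * n)
# Unreliable labels: flip a fraction of the group assignments
flip = rng.random(2 * n) < flip_rate
label = np.where(flip, 1 - group, group)

def cohens_d(x, g):
    m1, m0 = x[g == 1].mean(), x[g == 0].mean()
    s = np.sqrt((x[g == 1].var() + x[g == 0].var()) / 2)
    return (m1 - m0) / s

print(cohens_d(x_true, group))  # ~1.0: the true effect
print(cohens_d(x_obs, label))   # ~0.6: attenuated by noisy scores and noisy labels
```

With ICC = 0.6 and 10% label errors, a true d = 1.0 shrinks to roughly d ≈ 0.6, illustrating how quickly unreliability erodes group separation.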

2.3 The challenge of interpreting effect sizes

Even when effect sizes are reported, researchers struggle to interpret them in a meaningful way. The most common approach is to simply rely on conventional levels of small (d = .2, r = .1), medium (.5, .3), and large (.8, .5) - however, these labels are arbitrary, do not convey translational potential, and were never meant to be used so ubiquitously (Funder and Ozer, 2019; Anvari et al., 2023; Carey et al., 2023; Correll et al., 2020; Giner-Sorolla et al., 2024; Rothman and Greenland, 2018; Durlak, 2009). Another common approach is to compare observed effects with other empirical findings in one’s research domain (Lovakov and Agadullina, 2021; Thompson, 2007; Durlak, 2009; Gignac and Szodorai, 2016; Bosco et al., 2015). Although this may serve as a more meaningful reference point, empirically reported effect sizes are often inflated due to publication bias (Schafer and Schwarz, 2019), and, even more importantly, they do not directly convey practical significance (Giner-Sorolla et al., 2024; Kirk, 1996; Kapur et al., 2012). Given that our ultimate goal is to build prediction models to personalize psychiatry, practical significance is best understood in terms of predictive performance (Shmueli, 2010).

2.3.1 ROC-AUC, sensitivity, specificity

Expressing effect sizes in terms of prediction metrics is actually quite straightforward. All we need is to draw a classification threshold between the two groups and determine the true and false positives and negatives - all prediction metrics are then derived from these four categories (see Fig. 3). The most common metrics are sensitivity (true positive rate, or the proportion of cases correctly identified) and specificity (true negative rate, or the proportion of non-cases correctly identified). Moving the threshold across the whole range reveals how these metrics trade off against one another, which is summarized by the receiver operating characteristic (ROC) curve. Calculating the area under the ROC curve (ROC-AUC) provides an informative summary metric: the probability that a predictor ranks a randomly chosen positive case (e.g., a patient with a psychiatric condition) higher than a negative case, across all classification thresholds.
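As a concrete illustration of how the four outcome counts generate every metric discussed below, here is a minimal Python sketch (the function name and example counts are ours and purely hypothetical):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard classification metrics derived from the four outcome counts."""
    return {
        "sensitivity (recall, TPR)": tp / (tp + fn),
        "specificity (TNR)": tn / (tn + fp),
        "PPV (precision)": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical screen of 1,000 people at 10% prevalence:
print(confusion_metrics(tp=70, fp=90, fn=30, tn=810))
# sensitivity = 0.70, specificity = 0.90, PPV = 0.44, NPV = 0.96, accuracy = 0.88
```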

Fig. 3. Effect size vs. prediction metrics.

Standardized mean difference measures, such as Cohen’s d, capture how far the two distributions are from each other. If we place a decision threshold (red) to classify cases into two groups, there are four possible outcomes: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). These four outcomes are used to derive all classification metrics, as presented on the right. In predictive utility analysis, we place a special emphasis on PR-AUC and NB (highlighted in blue), as they can convey both overall and context-specific model performance while accounting for real-world base rate, making them highly informative for understanding the translational potential of research findings.

Even more straightforwardly, ROC-AUC can be computed directly from Cohen's d:

$$ \text{ROC-AUC} = \Phi\!\left( \frac{d}{\sqrt{2}} \right) $$

where Φ is the cumulative distribution function of the standard normal distribution.
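This relationship is simple enough to compute directly; the sketch below (ours, assuming two equal-variance normal distributions, as in the figure) also derives sensitivity and specificity for any threshold placement:

```python
import numpy as np
from scipy.stats import norm

def d_to_auc(d):
    """ROC-AUC implied by Cohen's d for two equal-variance normal groups."""
    return norm.cdf(d / np.sqrt(2))

def sens_spec(d, t):
    """Sensitivity and specificity when groups are N(0,1) and N(d,1)
    and cases are those scoring above threshold t."""
    return norm.cdf(d - t), norm.cdf(t)

print(d_to_auc(0.8))        # ~0.71, as in the depression example below
print(sens_spec(0.8, 1.0))  # sensitivity ~0.42, specificity ~0.84 at t = 1.0
```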

2.4 Prediction metrics can be misleading if real-world outcome base rate is not accounted for

While ROC-AUC, sensitivity, and specificity are useful metrics for assessing a predictor’s discriminative performance, they may create a misleading impression of translational potential because these metrics are invariant to outcome base rates (e.g., disease prevalence, treatment response rate, event rate). This is problematic because in clinical contexts many decisions occur under low or varying prevalence (Carter et al., 2017; Abi-Dargham and Horga, 2016). When prevalence is low, most predicted positives will be false - this phenomenon and the associated mistakes are known as the false positive paradox, base rate fallacy, and base rate neglect (Casscells et al., 1978; Kahneman and Tversky, 1973; Tversky and Kahneman, 1974). The relevant metric in such cases is the posterior probability of disease given a positive test, which is most commonly known as positive predictive value (PPV):

$$ \mathrm{PPV} = \frac{\text{sensitivity} \cdot \phi}{\text{sensitivity} \cdot \phi + (1 - \text{specificity}) \cdot (1 - \phi)} $$

where ϕ is the base rate.
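The base rate fallacy is easy to demonstrate with this formula; in the following sketch (ours), a test with seemingly strong sensitivity and specificity yields mostly false positives at low prevalence:

```python
def ppv(sens, spec, base_rate):
    """Positive predictive value from sensitivity, specificity, and base rate."""
    return sens * base_rate / (sens * base_rate + (1 - spec) * (1 - base_rate))

print(ppv(0.9, 0.9, 0.50))  # ~0.90 in a balanced (50/50) sample
print(ppv(0.9, 0.9, 0.02))  # ~0.16 at 2% prevalence: most positives are false
```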

When base rates are low, achieving high PPV (also known as precision) is only possible at the expense of sensitivity (also known as recall). This trade-off is more transparently summarized by the area under the Precision-Recall curve (PR-AUC), which has been argued to be a more informative metric for imbalanced classification contexts (Saito and Rehmsmeier, 2015; Cartus et al., 2023; Ozenne et al., 2015; Pinker, 2018), although see Van Calster et al. (2025).

However, PR-AUC itself can be misleading when models are evaluated on artificially balanced groups, which is common practice (Jeni et al., 2013; Brabec et al., 2020; Foody, 2023; Guesné et al., 2024). In such cases, the reported PR-AUC reflects the base rate of the test set rather than the deployment conditions. Fortunately, PR-AUC can be rescaled analytically by substituting the expected real-world base rate into the PPV formula above and recomputing the PR-AUC (Brabec et al., 2020). This means that for any ROC-AUC (or PR-AUC, if the base rate of the test set is known) reported in the literature, it is possible to estimate the PR-AUC that can be expected under deployment conditions.
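A sketch of this rescaling logic (ours; it assumes the binormal model used throughout this paper rather than an empirical ROC curve) sweeps classification thresholds, recomputes precision at the deployment base rate, and integrates over recall:

```python
import numpy as np
from scipy.stats import norm

def pr_auc_at_base_rate(d, base_rate, n_points=10_000):
    """PR-AUC for two unit-variance normal groups separated by Cohen's d,
    evaluated at an arbitrary real-world base rate."""
    t = np.linspace(-6, 6 + d, n_points)      # sweep of classification thresholds
    tpr = 1 - norm.cdf(t - d)                 # sensitivity (recall)
    fpr = 1 - norm.cdf(t)                     # 1 - specificity
    precision = tpr * base_rate / (tpr * base_rate + fpr * (1 - base_rate) + 1e-12)
    order = np.argsort(tpr)
    return np.trapz(precision[order], tpr[order])  # integrate precision over recall

print(pr_auc_at_base_rate(0.8, 0.50))  # ~0.7 in a balanced sample
print(pr_auc_at_base_rate(0.8, 0.08))  # ~0.2 at 8% prevalence
```

The same discriminative ability (here, d = 0.8) thus produces very different predictive value depending on the base rate at which the model is deployed.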

2.4.1 Real-world outcome base rates in Decision Curve Analysis

The challenge of accounting for real-world outcome base rates extends beyond standard prediction metrics to more clinically oriented evaluation frameworks, such as decision curve analysis (DCA) (Vickers and Elkin, 2006). DCA is based on calculating Net Benefit (NB), a clinical utility metric which balances true positives (TP) against the potential harms of false positives (FP; e.g., overdiagnosis or overtreatment):

$$ \mathrm{NB} = \frac{\mathrm{TP}}{N} - \frac{\mathrm{FP}}{N} \cdot \frac{p_t}{1 - p_t} $$

where pt is the probability threshold used for classification (reflecting the relative cost of false positives versus the benefit of true positives) and N is the total number of cases. NB can also be expressed in terms of sensitivity, specificity, and base rate ϕ:

$$ \mathrm{NB} = \text{sensitivity} \cdot \phi - (1 - \text{specificity}) \cdot (1 - \phi) \cdot \frac{p_t}{1 - p_t} $$
In DCA, NB is plotted against a range of probability thresholds, creating a decision curve that shows the net benefit of using the model compared to two alternative strategies: “all” (classifying everyone as positive) and “none” (classifying everyone as negative). The difference in NB between model-based predictions and the default “all” and “none” lines (ΔNB) conveys how much additional clinical utility the model can offer at relevant thresholds. An optimal threshold would reflect the costs of false negatives versus the costs of false positives. For example, a very low threshold (e.g., <5%) is appropriate for suicide risk screening in primary care where the intervention (e.g., safety planning) is low-cost but missing a case is fatal (Ross et al., 2021), whereas a high threshold (e.g., >50%) is required for initiating clozapine in early psychosis due to monitoring costs and side effects (Farooq et al., 2024).
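The following sketch (ours; the operating point of sensitivity = 0.63 and specificity = 0.85 is purely hypothetical) computes a model's NB against the "treat all" and "treat none" strategies across a few thresholds:

```python
def net_benefit(sens, spec, base_rate, pt):
    """Net Benefit of a model at threshold probability pt (Vickers and Elkin, 2006)."""
    return sens * base_rate - (1 - spec) * (1 - base_rate) * pt / (1 - pt)

def net_benefit_all(base_rate, pt):
    """Net Benefit of classifying everyone as positive ("treat all")."""
    return base_rate - (1 - base_rate) * pt / (1 - pt)

for pt in (0.05, 0.10, 0.15):   # "treat none" always has NB = 0
    nb_model = net_benefit(sens=0.63, spec=0.85, base_rate=0.08, pt=pt)
    nb_best_default = max(net_benefit_all(0.08, pt), 0.0)
    print(f"pt={pt:.2f}  model NB={nb_model:.3f}  best default NB={nb_best_default:.3f}")
```

ΔNB at a given threshold is simply the model's NB minus that of the best default strategy.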

DCA is an increasingly popular method for evaluating clinical prediction models, has desirable decision-theoretic properties (Van Calster et al., 2025), and is recommended by the TRIPOD guidelines (Moons et al., 2015; Collins et al., 2024). However, just as with PR-AUC, NB can be severely inflated if models are evaluated on artificially balanced case-control samples (Rousson and Zumbrunn, 2011). Fortunately, these estimates can also be adjusted post-hoc to reflect real-world base rates without requiring model retraining.

2.5 E2P Simulator and predictive utility analysis

To jointly address these challenges, we recently developed the open-source web tool E2P Simulator, www.e2p-simulator.com (Karvelis and Diaconescu, 2025b). It is an interactive tool that provides a general framework for translating among effect size metrics (e.g., Cohen’s d, Pearson’s r), discriminative measures (e.g., ROC-AUC, sensitivity, specificity), predictive value metrics (PR-AUC, PPV, NPV), clinical utility (Net Benefit), and many other metrics, highlighting how all of them describe the same underlying data distribution.

Most importantly, E2P Simulator can be used for evaluating real-world predictive utility of biomarkers and prediction models by accounting for real-world base rates (e.g., disease prevalence, response rates, incidence rates) and measurement reliability in the target populations - a procedure we term predictive utility analysis. Similar to how power analysis (Cohen, 1992) can help determine whether an effect can be reliably detected (statistical significance), predictive utility analysis can help determine whether an effect is likely to be useful in practice (practical significance). Following the same analogy, E2P Simulator can be thought of as software for performing predictive utility analysis, the same way G*Power is used for power analysis (Erdfelder et al., 1996).

As such, E2P Simulator can be used for both interpreting existing findings and planning studies. For instance, it can be used to determine whether investing in more reliable measurement instruments would meaningfully improve predictive performance, or to identify optimal target populations and indications with base rates that would yield clinically meaningful predictive values (see Supp. note 1 for more details). Additionally, E2P Simulator can be used for education and reporting standardization. In the next section, we illustrate its use across diagnostic, treatment response, and risk prediction.

3 Demonstration of predictive utility analysis with E2P Simulator

3.1 Diagnostic prediction

3.1.1 A cognitive marker for depression

Consider finding a cognitive biomarker with Cohen’s d = 0.8 between healthy controls and a group diagnosed with depression. Such an effect size is rare in practice and would conventionally be interpreted as “large”; even some of the strongest known biomarkers, such as negative interpretation bias and decreased autobiographical memory specificity, have effect sizes below d = 0.8 (Weiss-Cowie et al., 2023; Everaert et al., 2017). But how would that translate to diagnostic prediction?

First, we have to account for the inter-rater reliability of the depression diagnosis, which is κ = 0.28 as determined by the DSM-5 field trials (Regier et al., 2013). Next, given that the effect size is derived by comparing patients to healthy controls, the relevant prevalence (or base rate) is that of the general population, which is around 8% in the USA and Canada (Shorey et al., 2022). Finally, for this hypothetical example, we can assume the biomarker to have a test-retest reliability of ICC = 0.6, in line with the average reliability of cognitive markers in recent studies (Karvelis et al., 2023).

With these parameters, the observed Cohen’s d = 0.8 yields ROC-AUC = 0.71 and PR-AUC = 0.19 (Fig. 4a). While conventionally d = 0.8 would be called “large”, the predictive utility is modest, as indicated by the low PR-AUC, which captures the trade-off between PPV and sensitivity. Next, we set the classification threshold to correspond to pt = 15% risk of having depression in the DCA plot (a suitable threshold for screening or diagnosis if false positives have relatively low harms/costs). This results in PPV = 0.22 and sensitivity = 0.31, meaning that at this threshold, 78% of predicted cases would be false positives, and 69% of actual cases would still be missed. This corresponds to ΔNB = 0.009, or 9 additional true positives per 1,000 individuals assessed.
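These headline numbers can be approximately reproduced outside the simulator; the following Monte-Carlo sketch (ours; exact values fluctuate slightly with sample and seed) draws a population at 8% prevalence with group separation d = 0.8:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n, prevalence, d = 200_000, 0.08, 0.8
y = rng.random(n) < prevalence          # ~8% of the population are cases
x = rng.normal(0, 1, n) + d * y         # biomarker shifted by d in cases
print(roc_auc_score(y, x))              # ~0.71
print(average_precision_score(y, x))    # ~0.19-0.20 (PR-AUC)
```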

Fig. 4. Diagnostic prediction examples.

(a) A cognitive marker for depression, (b) FDA-cleared diagnostic biomarker p-tau217/Aβ42 for Alzheimer’s disease. Both figures are screenshots of E2P Simulator’s interface showing inputs on the left (effect size metrics, reliability values, and the base rate), the resulting data distributions in the center, and all predictive and utility metrics at the bottom and on the right. The red line is the classification threshold with the corresponding red markers on ROC-AUC, PR-AUC, and DCA plots.

Accounting for measurement reliability reveals that this observed effect would correspond to a much larger true effect, d = 1.58, and, in turn, much better predictive performance: ROC-AUC = 0.87, PR-AUC = 0.46, PPV = 0.34, sensitivity = 0.63, and ΔNB = 0.033 at pt = 0.15, highlighting how much improvement in diagnostic prediction could be achieved simply by improving measurement reliability.

3.1.2 Combining multiple biomarkers to improve diagnostic prediction

E2P Simulator also includes multivariable calculators for estimating how many biomarkers would be needed to achieve a specified target performance (see Supp. note 2 for more details). Let us say our target is PR-AUC = 0.8, which at 8% prevalence corresponds to ROC-AUC = 0.96. Using the multivariable calculator, we find that with small collinearity among the predictors (0.05), reaching this target would require 20 predictors of d = 0.8; with no collinearity (0.00), it would still require 10 such predictors; whereas with stronger predictors of d = 1.35, even at a higher collinearity of 0.1, only 5 would be needed (Supplementary Fig. 1). This suggests that focusing on identifying larger effects (e.g., by improving measurement reliability) may be a more promising research strategy than accumulating many weak predictors (as in multi-modal big data approaches).

3.2 FDA-cleared diagnostic biomarker p-tau217/Aβ42 for Alzheimer’s disease

To provide a reference point for the previous example, let us consider an example from precision medicine: a biomarker recently cleared by the U.S. Food and Drug Administration (FDA), the plasma p-tau217/Aβ42 ratio for aiding Alzheimer’s disease diagnosis (FDA, 2025b). More specifically, it was cleared to identify amyloid pathology in individuals aged 50 and older with cognitive symptoms, based on a study reporting a PPV of 91.8% and an NPV of 97.3% against amyloid Positron Emission Tomography (PET) or cerebrospinal fluid (CSF) reference standards (FDA, 2025a). Note, however, that these results were achieved by using two classification thresholds - one for identifying positives and one for identifying negatives - leaving out approximately 20% of cases in the middle that would require confirmatory PET or CSF tests.

Let us recreate this biomarker in E2P Simulator to obtain a clearer picture in terms of all the metrics. The prevalence of amyloid positivity in the study sample was 51.1% (FDA, 2025a). Test-retest reliability of both p-tau217 and Aβ42 is very high, around ICC ≈ 0.9 (Della Monica et al., 2024). Inter-rater reliability of PET tracers tends to be quite high (κ > 0.9; Harn et al., 2017), and PET shows high agreement with CSF (κ > 0.8); given that the study used a mixture of both, we can set κ ≈ 0.9 as a rough approximation. Now, we jointly adjust the effect size and decision threshold until we reproduce the reported PPV ≈ 92% and NPV ≈ 81% (here we re-calculated NPV based on the single threshold used for PPV, which gave 54/255 false positives).

We find that this performance corresponds to an observed Cohen’s d = 2.25, ROC-AUC = 0.94, and PR-AUC = 0.95 (similar results were reported in a recent study in China by Wang et al. (2025), achieving ROC-AUC ≈ 0.96); the classification threshold in this case corresponds to pt = 0.68, which gives ΔNB = 0.324 (Fig. 4b). In real-world settings, we may expect the prevalence to be lower, or at least to vary: in cohorts with mild cognitive impairment, amyloid positivity ranges from ~30% at age 50 to ~60% at age 80 (Jansen et al., 2015). However, even at 30% prevalence, d = 2.25 would still result in an impressive PR-AUC = 0.89.

3.3 Treatment response prediction

3.3.1 Multivariable neurocognitive predictors of antidepressant response

Can we predict treatment response to antidepressants? The latest research using multivariable/machine learning models suggests that neurocognitive predictors can explain around R2 = 0.2 of the variance in symptom improvement in response to antidepressants and other treatments (Karvelis et al., 2022). While this estimate already combines multiple predictors and may be optimistic due to overfitting, let us consider what it would translate to in practice.

The reliability of task-evoked BOLD responses is, on average, rather low, ICC = 0.4 (Elliott et al., 2020), while the reliability of the most popular scale for assessing depression symptoms, the Hamilton Depression Rating Scale (HAMD), is quite high, ICC = 0.94 (Trajković et al., 2011). The rate of response to antidepressant treatment beyond placebo is about 15% (Stone et al., 2022).

Entering these values into E2P Simulator yields ROC-AUC = 0.73 and PR-AUC = 0.33, indicating rather modest predictive performance (Fig. 5). At pt = 0.2, which reflects the relative harms of antidepressant side effects, this would result in sensitivity = 0.54, PPV = 0.29, and ΔNB = 0.030, meaning that 46% of those who would benefit from treatment would not receive it, 71% of those given treatment would not benefit from it, and we would gain an additional 3 true responders per 100 people receiving treatment. Improving measurement reliability alone could improve performance quite substantially, up to ROC-AUC = 0.87 and PR-AUC = 0.57, which at pt = 0.2 would result in sensitivity = 0.74, PPV = 0.42, and ΔNB = 0.073.

Fig. 5. Treatment response prediction example: task-based fMRI for predicting response to antidepressants.

A screenshot of E2P Simulator’s interface showing inputs on the left (effect size metrics, reliability values, and the base rate), the resulting data distributions in the center, and all predictive and utility metrics at the bottom and on the right. The red line is the classification threshold with the corresponding red markers on ROC-AUC, PR-AUC, and DCA plots.

Using E2P Simulator, we find that achieving PR-AUC = 0.8 would require explaining 80% of the variance (R2 = 0.80), which is rather ambitious. This nicely demonstrates the well-known problems of dichotomizing continuous measures (Collins et al., 2016; MacCallum et al., 2002; Royston et al., 2006; Naggara et al., 2011; Streiner, 2002; Karvelis and Diaconescu, 2025a). When symptom improvement on a continuous scale is converted into responders vs. non-responders, much of the information is lost. This is important to highlight because most research on treatment response prediction continues to use dichotomized outcome measures (Karvelis et al., 2022; Vieira et al., 2022; Amleshi et al., 2025).
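The information loss from dichotomization is easy to see in a simulation; in this sketch (ours), a predictor explaining R2 = 0.2 of a continuous outcome is evaluated against a dichotomized version of that outcome (top 15% labeled as responders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n, r = 200_000, np.sqrt(0.20)            # correlation implied by R^2 = 0.2
x = rng.normal(size=n)                   # predictor
y = r * x + np.sqrt(1 - r**2) * rng.normal(size=n)   # continuous improvement
responder = y > np.quantile(y, 0.85)     # dichotomize: top 15% = responders
print(roc_auc_score(responder, x))       # ~0.73, despite r ~0.45 with the raw outcome
```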

3.3.2 Combining multiple biomarkers to improve treatment response prediction

When building prediction models in neuroscience, a common approach is to first show that predictors are associated with clinical variables, after which both are jointly used to train a classifier (e.g., Hauke et al., 2022; Tozzi et al., 2020; de la Salle et al., 2022; Karvelis et al., 2022). This approach all but guarantees high collinearity among predictors, which hurts predictive performance. For example, assuming an average collinearity of 0.15 among all predictors, we would need 17 predictors of r = 0.4 to achieve R2 = 0.8 (Supplementary Fig. 2). Interestingly, if we reduce the effect size of each predictor to r = 0.3, we would never reach R2 = 0.8, no matter how many predictors we add (Supplementary Fig. 2).

3.4 Risk Prediction

3.4.1 Mismatch negativity (MMN) as a predictor of psychosis conversion in at-risk individuals

Most research on predictive modeling in psychiatry has thus far focused on predicting the transition to psychosis (Salazar de Pablo et al., 2021), with positive and negative symptoms and verbal memory deficits being the most prominent clinical and cognitive predictors, and mismatch negativity (MMN) emerging as the most promising biomarker (Andreou et al., 2023; Rosen et al., 2021).

How promising is MMN for predicting the transition to psychosis? The primary intended clinical application of MMN is as a prognostic tool within already-identified clinical high-risk (CHR) groups. The largest longitudinal study to date measured MMN at baseline and, after 24 months of follow-up, found that converters showed the largest deficit in the double-deviant (duration + pitch) condition, with an effect size of d = 0.43 (Hamilton et al., 2022). Over a 24-month follow-up period, the average transition rate in CHR cohorts is about 19% (Salazar de Pablo et al., 2021). Test-retest reliability of double-deviant MMN is around ICC = 0.5 (Roach et al., 2020). The inter-rater reliability of DSM-5-based criteria for schizophrenia spectrum and other psychotic disorders is κ = 0.46 (Regier et al., 2013).

Putting this all together with E2P Simulator, we find that double-deviant MMN provides only modest predictive performance within CHR cohorts: ROC-AUC = 0.62 and PR-AUC = 0.27 (Fig. 6a). At a decision threshold of pt = 0.15, which corresponds to relatively low-cost interventions such as increased monitoring, the net benefit is ΔNB = 0.010, with PPV = 0.22 and sensitivity = 0.81. Improving measurement reliability could improve performance up to ROC-AUC = 0.70, PR-AUC = 0.36; at pt = 0.15 this would give PPV = 0.26, sensitivity = 0.78, ΔNB = 0.026. These numbers highlight that MMN offers incremental discrimination but remains far from sufficient as a stand-alone biomarker.

Fig. 6. Risk prediction examples.

(a) Double-deviant MMN for predicting transition to psychosis, (b) electronic health records for predicting suicide attempts. Both figures are screenshots of E2P Simulator’s interface showing inputs on the left (effect size metrics, reliability values, and the base rate), the resulting data distributions in the center, and all predictive and utility metrics at the bottom and on the right. The red line is the classification threshold with the corresponding red markers on ROC-AUC, PR-AUC, and DCA plots.

These estimates also need to be considered within the broader context. CHR criteria - based on subthreshold psychotic symptoms, brief psychotic episodes, familial risk, and functional decline - identify only about 6–7% of the people who will later develop psychosis (Talukder et al., 2025). In other words, CHR criteria already miss more than 93% of future psychosis cases, which substantially limits the added value of MMN in identifying psychosis risk and suggests that a more pressing challenge may be to develop more sensitive CHR criteria.

3.4.2 Predicting 12-month suicide risk using baseline electronic health records

Clinicians rate prediction of suicidality as the highest priority for AI tool development in mental health (Fischer et al., 2025). While many predictive models have been developed using cross-sectional suicidality data (Pigoni et al., 2024), only a few have used baseline predictors to estimate prospective risk. One of the largest studies to do so (Edgcomb et al., 2021) followed women (N = 67,000) with serious mental illness for 12 months after a general medical hospitalization and trained models on pre-discharge electronic health records to predict readmission for a suicide attempt or self-harm, achieving ROC-AUC of 0.73 (derivation sample) and 0.71 (external sample). A companion study in men (N = 1.4 million) reported similar results (ROC-AUC ≈ 0.73 in the derivation sample) (Thiruvalluru et al., 2023).

Assuming that the 3.9% prevalence reported in the men’s cohort is comparable to that in the women’s cohort, let us consider how ROC-AUC translates to more informative metrics of PR-AUC and NB. Outcome reliability (hospital admissions for attempts or self-harm) can be assumed to be near perfect (κ ≈ 1), while the overall reliability of structured electronic health records (healthcare utilization, prior attempts, psychiatric diagnoses, etc.) can also be assumed to be rather high (κ ≈ 0.8).

Using these inputs, ROC-AUC = 0.71 yields only PR-AUC = 0.10 (Fig. 6b). Using pt = 3% absolute risk as a reasonable threshold for intervention, we find sensitivity = 0.77, PPV = 0.06, and ΔNB = 0.006. This means that while the model would capture about three-quarters of true cases, only 6% of those flagged would actually attempt suicide, and the added benefit would translate into finding six additional true cases per 1,000 individuals. Achieving PR-AUC = 0.8 in this population would require ROC-AUC = 0.98. At pt = 0.03, this would achieve sensitivity = 0.94, PPV = 0.30, and ΔNB = 0.025.

We should also consider the fact that many studies compare attempters to healthy controls rather than psychiatric controls. In such cases, a predictor or predictive model should be evaluated using the prevalence of suicide attempts in the general population, which is around 10 times lower, with a 12-month prevalence of 0.3% (Borges et al., 2010). Such low prevalence, however, makes prediction in the general population practically impossible.

The challenges that low prevalence poses for predicting suicide attempts are well known (Carter et al., 2017; Kessler et al., 2020), so our results here are not surprising. However, we hope that this analysis will help clarify the fundamental prediction challenges for those who continue to work on suicide risk prediction. Considering that the ultimate goal is to reduce suicide rates, improving universal prevention methods may be more productive than developing tools for individual risk assessment (Quinlivan et al., 2017; Steeg et al., 2018).

4 Discussion

In this paper, we highlighted a number of statistical pitfalls that are undermining clinical translation efforts and introduced E2P Simulator together with predictive utility analysis to help researchers address these challenges. We provided detailed examples of how this tool can help interpret research findings and plan research in three key areas: diagnostic, treatment response, and risk prediction. We further showed that even seemingly large effect sizes (Cohen’s d) or strong discrimination performance (ROC-AUC) can yield modest predictive value and clinical utility when accounting for real-world outcome base rates. We also demonstrated that poor measurement reliability can be a significant factor in attenuating predictive performance - a research area that remains understudied (Gell et al., 2024; Jacobucci and Grimm, 2020; Whittle et al., 2018; Cullen et al., 2023). Overall, these observations point to the need to improve measurement reliability and to identify larger effects in order to move closer towards translational goals. This echoes recent calls to find ways to increase effect sizes in neuroimaging and psychiatry research to improve statistical power (DeYoung et al., 2025; Makowski et al., 2024; Jahanshad et al., 2025), but here we directly connect this need to translational challenges.

Although we focused on precision psychiatry to convey our points, the challenges we address are general and apply across all health research in the areas of diagnostic test development, biomarker research, screening instruments, and prediction models (Van Calster et al., 2025; Collins et al., 2025; Leeflang et al., 2013; Maxim et al., 2014; Monsarrat and Vergnes, 2018; McGeechan et al., 2008; Hicks et al., 2022; Rutledge and Loh, 2004), and could be meaningfully applied in many other areas, including forensic psychology (Weber et al., 2025), law (Kirk, 2019; Chin, 2023), and education (Glutting et al., 1997). All of these research areas grapple with the challenge of interpreting effect sizes and using them to inform policy and real-world decision-making.

4.1 No single metric tells the whole story

While in this paper we have highlighted the limitations of ROC-AUC and the informativeness of PR-AUC and Net Benefit for assessing translational value, it is important to note that all metrics come with trade-offs (Van Calster et al., 2025). For example, although the invariance of ROC-AUC to base rates can obscure a model’s actual translational value, this same property makes it valuable for benchmarking discriminative ability across different contexts. Furthermore, while both ROC-AUC and PR-AUC provide convenient summary measures of performance across the entire range of classification thresholds, only a small range of those thresholds may be clinically relevant. Net Benefit, on the other hand, requires specifying clinically relevant thresholds - something that is not always easy for non-clinical experts to do, and that may also vary with patients’ preferences or resource constraints, leading to inaccurate NB estimates. Finally, rank-based metrics like ROC-AUC and PR-AUC provide information only about relative risk. This does not guarantee that the absolute risk estimates are accurate and match observed probabilities - a property known as calibration (Van Calster et al., 2019). While we did not address calibration in this paper, E2P Simulator does include a module allowing researchers to explore how measurement reliability and base rates may affect it. Overall, a robust evaluation requires a multi-faceted approach, combining multiple complementary metrics and visualization tools. This comprehensive perspective is exactly what E2P Simulator was designed to facilitate.

4.2 Discriminative vs. causal effect sizes

To clarify the scope of E2P Simulator, we must distinguish between discriminative effects, which reflect individual or group differences (e.g., patients vs. controls), and causal effects, which quantify the impact of an intervention (e.g., treatment vs. placebo), as these categories are often conflated (Shmueli, 2010; Ramspek et al., 2021; Dyer, 2025). E2P Simulator was designed for interpreting only discriminative effects, which are relevant for building prediction models. For interpreting causal effect sizes (Kraemer and Kupfer, 2006), other interactive tools are already available: https://rpsychologist.com/cohend.

5 Limitations

E2P Simulator in its current form relies on idealized normal distributions. Empirical data may be skewed in various ways (Loth et al., 2021), rendering the parametric metrics less applicable and the resulting conversions across predictive metrics less exact. To fully mitigate this, predictive utility analysis would need to be performed on the actual data - a capability planned for future extensions. Furthermore, the simulations do not automatically account for sampling error, or the uncertainty around the effect size estimates, which is often a concern, especially when samples are small (Ioannidis and Panagiotou, 2011; Button et al., 2013; Ioannidis, 2008). As a workaround, E2P Simulator can take the bounds of a confidence interval around the effect size (one at a time) as input to determine the uncertainty around the metrics of interest. The current implementation is also limited to binary classification, excluding sub-typing (multiclass classification), risk stratification (multiple thresholds), and longitudinal (time-to-event) prediction - although low base rates and poor measurement reliability would be just as problematic in these contexts.

Supplementary material

1 Supplementary note 1: predictive utility analysis

Predictive utility analysis consists of two key components: (1) estimating the predictive value and clinical utility of research findings by accounting for the real-world outcome base rate, and (2) determining how much performance is lost to measurement unreliability. It can be applied in two main scenarios: interpreting existing research findings and planning new studies. Below we outline the workflow for each and provide some caveats along the way.

1.1 Interpreting existing research findings

When evaluating published research or your own completed studies, the workflow starts with observed metrics and works forward to understand real-world utility:

  1. Set measurement reliability: Ideally, reliability estimates should come from the same dataset as the observed effect sizes. If unavailable, use estimates from other relevant research. One could ignore measurement reliability (setting all reliabilities to 1), which would still allow estimating real-world utility but without assessing how much performance is lost due to measurement error.

  2. Set observed metrics: This can be an effect size (e.g., Cohen’s d, OR, Pearson’s r) or a predictive performance metric (e.g., ROC-AUC, PR-AUC). Ensure you use a robust or conservative estimate (not inflated due to small samples or overfitting). We set measurement reliability first because observed effect sizes are already attenuated by it.

  3. Set base rate: Use the expected prevalence in the real-world population where the predictor will be applied - not the study sample composition. For example, if the study used a balanced case-control design (50% cases, 50% controls) but the condition affects only 5% of the target population, use 5% as the base rate. Note: if using PR-AUC as the observed metric, first enter the study sample’s base rate; once set, adjust to the real-world base rate to rescale PR-AUC.

  4. Set classification threshold: Choose a threshold probability (pt) that reflects the clinical context, balancing the costs of false positives against false negatives. The optimal pt can be estimated as:

    $$ p_t = \frac{C_{FP}}{C_{FP} + C_{FN}} $$

    where CFP is the cost of a false positive (unnecessary intervention) and CFN is the cost of a false negative (missed case). pt reflects the absolute risk at which clinical action is warranted, and can be informed by expert surveys, stakeholder preferences, or established guidelines (see the sketch after this list for a worked example).

  5. Document relevant metrics: Record ROC-AUC, PR-AUC, PPV, NPV, and Net Benefit at the chosen threshold. Together, these provide a comprehensive picture of discriminative ability, predictive value, and clinical utility.
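As a worked example of step 4, a minimal sketch (ours) of the threshold calculation:

```python
def optimal_threshold(cost_fp, cost_fn):
    """Threshold probability implied by the relative costs of the two error types."""
    return cost_fp / (cost_fp + cost_fn)

# If a missed case is judged ~9x as costly as an unnecessary intervention:
print(optimal_threshold(cost_fp=1, cost_fn=9))  # pt = 0.1
```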

1.2 Planning studies

When planning new research, the workflow starts from a target level of real-world performance and works backward to determine what predictors are needed.

  1. Identify clinically meaningful targets: Determine what level of PR-AUC, PPV, NPV, or Net Benefit would be clinically meaningful. This could come from cost-benefit analysis, existing guidelines, or benchmarking against existing clinical instruments.

  2. Determine required group separation: Use the simulator to translate clinical targets into required group separation (e.g., Cohen’s d, ROC-AUC), setting the base rate to the expected real-world prevalence. Compare with typical effect sizes in the literature to assess feasibility.

  3. Explore ways to improve performance: Use the simulator to explore how to get closer to your goal by:

    • Improving measurement reliability of each predictor (e.g., using more reliable assessment methods or protocols)

    • Targeting higher base rate populations: pre-screened or high-risk populations can improve PPV and PR-AUC. However, this should be guided by real-world feasibility. Pre-screening is itself a classification problem that may exclude many actual cases if it has poor sensitivity. Also, group separation found in one population (e.g., healthy controls vs. cases) should not be expected to remain the same in another population (e.g., psychiatric controls vs. cases).

    • Using multiple predictors: estimate how many predictors are needed to achieve target performance. For average effect size and collinearity, use values typical in the field or from your own data. The multivariable calculator provides a rough estimate of model performance without needing to train it.

2 Supplementary note 2: combining multiple predictors

E2P Simulator includes calculators for estimating what predictive performance can be reached by combining multiple predictors, when their individual effect sizes and collinearity are known. We start by computing Mahalanobis D - a generalization of Cohen’s d to multidimensional space. Given p predictors with Cohen’s d values d1, d2, ..., dp and a p × p correlation matrix R among them, the general formula for Mahalanobis D (Mahalanobis, 1936; Del Giudice, 2009) is:

$$ D = \sqrt{\mathbf{d}^{\top} R^{-1} \mathbf{d}} $$

where d = (d1, d2, ..., dp). To make the tool practical, we assume equal effect sizes (d1 = d2 = ... = dp = d) and equal pairwise correlations (rij for all pairs). This simplifies to:

$$ D = d \sqrt{\frac{p}{1 + (p - 1)\, r_{ij}}} $$

The numerator (p) reflects the number of predictors; the denominator adjusts for shared information. When predictors are uncorrelated (rij = 0), each adds fully: D = d√p. When perfectly correlated (rij = 1), they are redundant: D = d, regardless of p.

Once we obtain D, the conversion to predictive metrics follows the same relationship as for Cohen’s d:

$$ \text{ROC-AUC} = \Phi\!\left( \frac{D}{\sqrt{2}} \right) $$

where Φ is the cumulative normal distribution function. Substituting the simplified D gives the full formula:

$$ \text{ROC-AUC} = \Phi\!\left( \frac{d}{\sqrt{2}} \sqrt{\frac{p}{1 + (p - 1)\, r_{ij}}} \right) $$

The same logic applies to continuous outcomes. With p predictors each having correlation r with the outcome, and collinearity rij among predictors:

$$ R^2 = \frac{p \, r^2}{1 + (p - 1)\, r_{ij}} $$

Here, the conversion to ROC-AUC is done not analytically but by sampling data and dichotomizing the outcome, just as for single continuous predictors.
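Both calculators are straightforward to reproduce; the sketch below (ours; function names are illustrative) implements the closed-form expressions above and reproduces the worked examples from the main text:

```python
import numpy as np
from scipy.stats import norm

def mahalanobis_D(d, p, r_ij):
    """Multivariate separation for p predictors with equal d and equal collinearity."""
    return d * np.sqrt(p / (1 + (p - 1) * r_ij))

def binary_auc(d, p, r_ij):
    """ROC-AUC implied by the combined separation D."""
    return norm.cdf(mahalanobis_D(d, p, r_ij) / np.sqrt(2))

def continuous_r2(r, p, r_ij):
    """Multiple R^2 for p equicorrelated predictors, each correlating r with the outcome."""
    return p * r**2 / (1 + (p - 1) * r_ij)

print(binary_auc(d=0.8, p=20, r_ij=0.05))         # ~0.96 (cf. Supplementary Fig. 1)
print(continuous_r2(r=0.4, p=17, r_ij=0.15))      # ~0.80 (cf. Supplementary Fig. 2)
print(continuous_r2(r=0.3, p=10_000, r_ij=0.15))  # plateaus at r^2 / r_ij = 0.6
```

The last line shows why weak predictors cannot be rescued by quantity: as p grows, R2 approaches r2/rij, which here falls short of the 0.8 target.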

The additive (no-interactions) assumption means these formulas best approximate linear models such as logistic regression. Research suggests that in clinical prediction, complex non-linear models generally do not outperform logistic regression (Christodoulou et al., 2019), making this a reasonable approximation for many practical settings. Some research fields, such as genetics (Hill, 2010), rely primarily on linear models to begin with. Even when real-world predictors vary in strength and collinearity, the general trends (e.g., diminishing returns with correlated predictors) remain informative for research planning.

In the main text we consider two examples to demonstrate the use of each multivariable calculator (Supp. Fig. 1 and Supp. Fig. 2).

Supplementary Fig. 1. Estimating the strength and number of predictors needed for diagnostic prediction in depression.

A screenshot of the binary multivariable calculator with inputs on the left and results on the right. Here we show three example settings for achieving PR-AUC = 0.8, which at an 8% base rate corresponds to ROC-AUC = 0.96: 20 predictors of d = 0.8 each with 0.05 collinearity (green line); 10 predictors of d = 0.8 each with no (0.0) collinearity (red line); or 5 predictors of d = 1.35 each with 0.1 collinearity (yellow line).

Supplementary Fig. 2. Estimating the strength and number of predictors needed for predicting treatment response to antidepressants.

A screenshot of the continuous multivariable calculator with inputs on the left and results on the right: the resulting R2 and PR-AUC as a function of the number of predictors. Here we show two example curves for achieving R2 = 0.8, which at a 15% base rate corresponds to PR-AUC = 0.8: 17 predictors of r = 0.4 would be needed to achieve R2 = 0.8 (green line), while, interestingly, if we reduce the effect size of each predictor to r = 0.3, R2 = 0.8 is never reached, no matter how many predictors are added (red line).

Data availability

This work presents a statistical tool and does not involve empirical data. The source code for E2P Simulator is openly available at https://github.com/povilaskarvelis/e2p-simulator under the MIT license. An archived version is available at https://doi.org/10.5281/zenodo.17112626.

Acknowledgements

Andreea Diaconescu is supported by the Krembil Foundation, the Canadian Institutes of Health Research, and the NSERC Discovery Fund.

Additional information

Funding

Krembil Foundation (1000824)

  • Andreea Oliviana Diaconescu