Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.
Read more about eLife’s peer review process.
Editors
- Reviewing Editor: Andre Marquand, Radboud University Nijmegen, Nijmegen, Netherlands
- Senior Editor: Andre Marquand, Radboud University Nijmegen, Nijmegen, Netherlands
Reviewer #1 (Public review):
Summary:
Overall, this is a well-designed and carefully executed study that delivers clear and actionable guidance on the sample size and representative demographic requirements for robust normative modelling in neuroimaging. The central claims are convincingly supported.
Strengths:
The study has multiple strengths. First, it offers a comprehensive and methodologically rigorous analysis of sample size and age distribution, supported by multiple complementary fit indices. Second, the learning-curve results are compelling and reproducible and will be of immediate utility to researchers planning normative modelling projects. Third, the study includes both replication in an independent dataset and an adaptive transfer analysis from UK Biobank, highlighting both the robustness of the results and the practical advantages of transfer learning for smaller clinical cohorts. Finally, the clinical validation ties the methodological work back to clinical application.
Weaknesses:
There are two minor points for consideration:
(1) Calibration of percentile estimates could be shown for the main evaluation (similar to that done in Figure 4E). Because the clinical utility of normative models often hinges on identifying individuals outside the 5th or 95th percentiles, readers would benefit from visual overlays of model-derived percentile curves on the curves from the full training data and simple reporting of the proportion of healthy controls falling outside these bounds for the main analyses (i.e., 2.1. Model fit evaluation).
(2) The larger negative effect of left-skewed sampling likely reflects a mismatch between the younger training set and the older test set; accounting explicitly for this mismatch would make the conclusions more generalisable.
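The calibration check suggested in point (1) can be made concrete with a simple summary statistic: for a well-calibrated normative model, the proportion of healthy controls whose model-derived percentile falls below the 5th or above the 95th percentile should be close to 10%. The sketch below is illustrative only; the function name and inputs are hypothetical and not taken from the paper under review.

```python
import numpy as np

def calibration_check(percentiles, lower=0.05, upper=0.95):
    """Proportion of healthy controls falling outside the given percentile
    bounds. A well-calibrated model yields roughly lower + (1 - upper),
    i.e. ~0.10 for the 5th/95th percentiles."""
    percentiles = np.asarray(percentiles)
    outside = np.mean((percentiles < lower) | (percentiles > upper))
    expected = lower + (1.0 - upper)
    return outside, expected

# Illustrative input: uniform percentiles, as produced by a perfectly
# calibrated model evaluated on held-out healthy controls.
rng = np.random.default_rng(0)
observed, expected = calibration_check(rng.uniform(0.0, 1.0, 10_000))
```

Reporting `observed` alongside `expected` for each training condition would give readers a direct, single-number view of calibration in the main evaluation.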
Reviewer #2 (Public review):
Summary:
The authors test how the sample size and demographic balance of reference cohorts affect the reliability of normative models in ageing and Alzheimer's disease. Using OASIS-3 and replicating in AIBL, they vary the age and sex distributions and the number of samples, and show that age alignment matters more than overall sample size. They also demonstrate that models adapted from a large dataset (UK Biobank) can achieve stable performance with fewer samples. The results suggest that moderately sized but demographically well-balanced cohorts can provide robust performance.
Strengths:
The study is thorough and systematic, varying sample size, age, and sex distributions in a controlled way. Results are replicated in two independent datasets with relatively large sample sizes, thereby strengthening confidence in the findings. The analyses are clearly presented and use widely applied evaluation metrics. Clinical validation (outlier detection, classification) adds relevance beyond technical benchmarks. The comparison between within-cohort training and adaptation from a large dataset is valuable for real-world applications.
The work convincingly shows that age alignment is crucial and that adapted models can reach good performance with fewer samples. However, some dataset-specific patterns (noted in the weaknesses below) should be acknowledged more directly, and the practical guidance could be sharper.
Weaknesses:
The paper uses a simple regression framework, which is understandable for scalability, but this limits generalization to multi-site settings, where a hierarchical approach could better account for site differences. This limitation is acknowledged; a brief sensitivity analysis (or a clearer discussion) would help readers weigh the trade-offs. Beyond this, several points are not fully explained in the paper:
(1) The replication in AIBL does not fully match the OASIS results. In AIBL, left-skewed age sampling converges with other strategies as sample size grows, unlike in OASIS. This suggests that skew effects depend on where variability lies across the age span.
(2) Sex imbalance effects are difficult to interpret, since sex is included only as a fixed effect, and residual age differences may drive some errors.
(3) In Figure 3, performance drops around n≈300 across conditions. This consistent pattern raises the question of sensitivity to individual samples or sub-sampling strategy.
(4) The total outlier count (tOC) analysis is interesting but hard to generalize. For example, in AIBL, left-skew sometimes performs slightly better despite a weaker model fit. Clearer guidance on how to weigh model fit versus outlier detection would strengthen the practical message.
(5) The suggested plateau at n≈200 seems context-dependent. It may be better to frame sample size targets in relation to coverage across age bins rather than as an absolute number.
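The reframing proposed in point (5), targeting coverage across age bins rather than an absolute n, could be operationalised as a simple per-bin count check. The sketch below is a hypothetical illustration: the bin edges and the `min_per_bin` threshold are assumptions for demonstration, not values from the paper.

```python
import numpy as np

def age_bin_coverage(ages, bin_edges, min_per_bin=25):
    """Count training samples per age bin and flag under-covered bins.
    min_per_bin is an illustrative threshold, not a recommendation
    from the paper under review."""
    counts, _ = np.histogram(ages, bins=bin_edges)
    under_covered = [(bin_edges[i], bin_edges[i + 1])
                     for i, c in enumerate(counts) if c < min_per_bin]
    return counts, under_covered

# Illustrative cohort: skewed toward older ages, so the youngest
# bin falls below the (assumed) coverage threshold.
ages = np.concatenate([np.full(10, 45.0),
                       np.full(120, 65.0),
                       np.full(120, 80.0)])
counts, under = age_bin_coverage(ages, bin_edges=[40, 55, 70, 90])
```

Framed this way, a cohort "passes" only when every age bin spanned by the intended test population is adequately covered, which would make the sample-size guidance portable across datasets with different age spans.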