Abstract
Normative models are increasingly used to characterize individual-level brain deviations in neuroimaging studies, but their performance depends heavily on the reference sample used for training or adaptation. In this study, we systematically investigated how the sample size and covariate composition of the reference cohort influence model fit, deviation estimates, and clinical readouts in Alzheimer’s disease (AD). Using a discovery dataset (OASIS-3, n = 1032), we trained models on healthy control (HC) subsamples ranging from 5 to 600 individuals, while varying age and sex distributions to simulate biases in reference populations. We further assessed adaptive transfer learning by pre-training models on the UK Biobank (n = 42,747) and adapting them to the clinical dataset using the same sub-sampling strategies. We evaluated model performance on a fixed HC test set and quantified deviation score errors, outlier detection, and classification accuracy in both the HC test set and the AD cohort. The findings were replicated in an external validation sample (AIBL, n = 463). Across all settings, model performance improved with increasing sample size, but demographic alignment of the covariates, particularly age, was essential for reliable deviation estimates. Models trained directly within the dataset achieved stable fit with approximately 200 HCs, while adapted models reached comparable performance with as few as 50 individuals when pre-trained on large-scale data. These results show that robust individual-level modeling can be achieved using moderately sized but demographically matched cohorts, supporting broader application of normative modeling in ageing and neurodegeneration research.
1. Introduction
Normative modeling is a statistical framework for quantifying individual deviations from typical brain structure or function (Marquand et al., 2016). It estimates covariate-adjusted percentiles of a given measurement in a reference population. These percentiles serve as a basis for quantifying individual deviations from population norms, typically accounting for covariates such as age and sex. Unlike traditional case–control analyses that rely on group-level means, normative models generate subject-specific deviation scores that preserve inter-individual variability. This allows the characterization of brain atypicality at the individual level, without relying on the assumption that individuals within the same group share common patterns (Marquand et al., 2016; Rutherford, Fraza, et al., 2022; Rutherford et al., 2023). Warped Bayesian Linear Regression (Fraza et al., 2021) is a widely used normative modeling approach that combines flexibility and scalability, making it particularly well-suited for large neuroimaging datasets (Corrigan et al., 2024; Holz et al., 2023; Meijer et al., 2024; Rutherford et al., 2023; Savage et al., 2024; Verdi et al., 2024). Long-term, these models can be envisioned as interpretable and quantitative neuroradiology tools to support clinical decision making (Bozek et al., 2023; Goodkin et al., 2019).
Among clinical applications of normative models, Alzheimer’s disease (AD) represents a particularly relevant use case. AD is the most common cause of dementia and remains challenging to characterize due to its marked heterogeneity in clinical presentation and neuroanatomical patterns (Duara & Barker, 2022; Lam et al., 2013; Rajan et al., 2021). In clinical practice, structural imaging already plays a central role in diagnosis, and hippocampal volume is routinely evaluated in patients relative to expectations built from reference data (Vernooij et al., 2019). Normative modeling builds on this approach by enabling a more fine-grained characterization of neuroanatomical variability, supporting more precise and personalized assessments in AD. Recent applications to AD research have demonstrated its potential to reveal individualized patterns of cortical atrophy that correlate with cognitive performance and key biomarkers, including CSF Aβ₄₂ and phosphorylated tau (Loreto et al., 2024; Verdi et al., 2021, 2023, 2024).
However, the performance of normative models depends critically on the reference population used to fit them (Bethlehem et al., 2022; Bozek et al., 2023). The size of the training data, and the extent to which relevant covariates (e.g., age and sex) are adequately represented, can substantially influence model fit and the accuracy of the resulting deviation scores. Large-scale samples have been recommended to accurately estimate outlying percentiles for clinical use (Bozek et al., 2023). However, collecting sufficiently large neuroimaging datasets is particularly challenging, as it often requires multi-site data aggregation and extensive efforts to harmonize differences in acquisition protocols and scanner hardware. To address the challenge of limited sample sizes, studies often employ adaptive transfer learning strategies that allow pre-trained models to be recalibrated on smaller samples (Bayer et al., 2022; Gaiser et al., 2023; Kia et al., 2021). Transfer learning, however, faces a similar challenge: while it offers a pragmatic solution, its success ultimately depends on the quality and representativeness of the adaptation sample used for re-calibration. Whether models are directly trained within cohorts or adapted from large-scale references, both strategies raise concerns about the robustness of clinical interpretations, particularly when the available data are limited, as is commonly observed in neuroimaging studies. Beyond sample size, careful consideration of covariates is critical, especially in applications to AD. The disease typically emerges in older adulthood and is more commonly diagnosed in women (Ferretti et al., 2018; Riedel et al., 2016), making accurate modeling of age and sex covariates especially important in this context. Despite the growing adoption of normative modeling, systematic evaluations of how performance is affected by the composition and size of the training or adaptation sample remain limited.
To address this gap, we conducted a systematic investigation of how reference sample size and covariate composition affect normative model performance and clinical readouts. Using a single-site dataset (OASIS-3), we varied the size of the healthy control (HC) reference cohort (from 5 to 600 individuals) and manipulated age and sex distributions to simulate biases in reference populations. We replicated the approach in a two-site cohort (AIBL) and further assessed whether models pre-trained on a large external dataset (UK Biobank; n = 42,747) could be effectively adapted through transfer learning (Rutherford, Kia, et al., 2022), using the same sub-sampling strategies for the adaptation set. Adapted models were then applied to the same fixed test set, allowing direct comparison between models trained within each cohort and UKB-adapted models across all sampling conditions. Models’ performances were evaluated using standard normative modeling evaluation metrics in an independent fixed HC test set. We assessed the error in deviation scores relative to those obtained using the full, theoretically optimal, reference sample in both the HC test set and the clinical sample diagnosed with AD. To assess clinical implications at the application level, we further quantified differences in outlier detection and classification performance when distinguishing AD from HC groups.
This work provides a systematic analysis of how reference sample characteristics influence model fit, deviation estimates, and their clinical interpretability, offering practical guidance for applying normative modeling in ageing and neurodegeneration research and for maximizing the value of existing, deeply phenotyped cohorts when assembling reference datasets.
2. Results
We first evaluated how reference population characteristics influenced normative model accuracy. Using Warped Bayesian Linear Regression (Fraza et al., 2021), we trained normative models for 167 neuroanatomical regions of interest (ROIs) using two independent datasets, OASIS-3 and AIBL (Figure 1, Table 1, see Supplementary Figures S17–S33 for the replication in AIBL). Training subsets ranged from 5 participants to the entire training cohort (n = 692), obtained by subsampling to reproduce realistic age- and sex-related skews. For each subsampling scenario, we generated 10 random draws and benchmarked their performance against the model trained on the full training sample.
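For illustration, the age-skewed subsampling can be sketched as below. This is a minimal sketch, assuming a pandas DataFrame `hc_train` with an `age` column (hypothetical names); bin weights follow the beta parameters given in Figure 1, and sex balancing is omitted for brevity:

```python
import numpy as np
import pandas as pd
from scipy import stats

def skewed_subsample(hc_train: pd.DataFrame, n: int, a: float, b: float,
                     n_bins: int = 10, seed: int = 0) -> pd.DataFrame:
    """Draw an age-skewed subsample of size ~n from the HC training set.

    Bin weights follow a Beta(a, b) density over n_bins equally spaced age
    bins: (2, 5) overrepresents younger ages (left-skewed), (5, 2) older
    ages (right-skewed). Sex balancing is omitted for brevity.
    """
    rng = np.random.default_rng(seed)
    age = hc_train["age"].to_numpy()
    edges = np.linspace(age.min(), age.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(age, edges) - 1, 0, n_bins - 1)
    # Evaluate the beta density at bin centres rescaled to [0, 1]
    centres = (np.arange(n_bins) + 0.5) / n_bins
    weights = stats.beta.pdf(centres, a, b)
    quota = np.round(n * weights / weights.sum()).astype(int)
    picks = []
    for k in range(n_bins):
        members = hc_train.index[bin_idx == k]
        take = min(quota[k], len(members))  # a bin may hold fewer subjects
        picks.extend(rng.choice(members, size=take, replace=False))
    return hc_train.loc[picks]

# e.g., 10 left-skewed draws of n = 100:
# draws = [skewed_subsample(hc_train, 100, a=2, b=5, seed=i) for i in range(10)]
```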

Methodology diagram for evaluating normative model estimation under different sampling scenarios.
A: Analyses were conducted on the OASIS-3 dataset, with replication in AIBL. Normative models were first fitted using the entire training set (i.e., 80% of HC from the cohort). For each sampling strategy (s), models were re-fitted on randomly drawn subsamples of the training set, with sample sizes (n) ranging from 5 to the maximum available (n = 692 for OASIS-3, n = 322 for AIBL), and repeated across 10 iterations per sample size. Representative sampling: subsamples preserved the original age distribution using 10 age bins with an equal number of individuals per bin, while also ensuring balanced sex distributions. Left-skewed sampling: overrepresented younger age ranges by applying a beta distribution (α = 2, β = 5) across 10 equally spaced bins. Right-skewed sampling: overrepresented older age ranges using a beta distribution (α = 5, β = 2). Each model was evaluated on a fixed test set composed of 20% of HC from the same cohort and all AD individuals. Example normative fits illustrate how identical test values are interpreted under different models: red dots represent values outside the 95% centile range (outliers), while orange dots fall within the normative range. B: In a parallel analysis, normative models were pre-trained on the UK Biobank dataset. Adaptive transfer learning was used to adapt these pre-trained models to the clinical cohorts. The same sub-sampling strategies and sample sizes described in panel A were used here to define the adaptation sets. Adapted models were then applied to the same fixed test set as in panel A, allowing direct comparison between within-cohort models and adapted UKB models across all sampling conditions. C: Each resulting model was evaluated using standard performance metrics to assess model fit. Z-score errors were additionally computed and compared to those obtained from the model fitted using the full training data. Clinical validation was then performed by analyzing outlier detection and predictive performance.

Demographics of HC and AD participants across datasets and sites. Summary of age ranges, mean ages, standard deviations, and sex ratios (F/M) for HC and AD groups within the UKB, OASIS-3, and AIBL datasets, including site-specific subgroups.
2.1. Model fit evaluation
First, to quantify training-set sample size effects, we trained normative models on HC subsamples with representative age distributions (mirroring the full training set age distribution) and balanced sex ratios. We evaluated these models on a fixed HC test set and benchmarked their performance against corresponding full training sample models (Figure 1A). Normative model fits were assessed using Mean Standardized Log Loss (MSLL), Standardized Mean Squared Error (SMSE), Explained Variance (EV), Pearson Correlation (Rho), and Intraclass Correlation Coefficient (ICC) (Figure 1C).
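For reference, the non-ICC metrics can be computed from a model’s predictive mean and variance as in the sketch below (a minimal illustration using standard definitions from the Gaussian-process literature; array names are hypothetical, and the ICC across iterations is omitted):

```python
import numpy as np
from scipy import stats

def fit_metrics(y_true, y_pred, s2_pred, y_train):
    """Normative-model fit metrics for one ROI (illustrative definitions).

    y_true/y_pred: observed and predicted test values; s2_pred: predictive
    variance; y_train: training values for the trivial reference model.
    """
    resid = y_true - y_pred
    # SMSE: squared error standardized by the variance of the test data
    smse = np.mean(resid ** 2) / np.var(y_true)
    # Explained variance and Pearson correlation
    ev = 1 - np.var(resid) / np.var(y_true)
    rho = stats.pearsonr(y_pred, y_true)[0]
    # MSLL: predictive log loss relative to a trivial Gaussian model
    # fitted on the training data (mean and variance only)
    nll_model = 0.5 * np.log(2 * np.pi * s2_pred) + resid ** 2 / (2 * s2_pred)
    mu0, s20 = np.mean(y_train), np.var(y_train)
    nll_trivial = 0.5 * np.log(2 * np.pi * s20) + (y_true - mu0) ** 2 / (2 * s20)
    msll = np.mean(nll_model - nll_trivial)
    return {"MSLL": msll, "SMSE": smse, "EV": ev, "Rho": rho}
```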
Sample size demonstrated a strong influence on model fit for cortical thicknesses and subcortical volumes (Figure 2). Specifically, model performance improved with larger sample sizes, as reflected by reductions in MSLL (β = -0.496, p < .001) and SMSE (β = -0.427, p < .001) (Figure 2A-B) and increases in EV (β = 0.499, p < .001), Rho (β = 0.527, p < .001), and ICC (β = 0.728, p < .001) (Figure 2C-E, Table 2). The standardized effect sizes (β values) allow for direct comparisons across metrics and sampling strategies. Performance improved consistently from n = 10 to the full training sample size (n = 692), with n = 5 showing unstable values and falling outside the overall trend. Rapid improvements were observed between n = 10 and 50, capturing ∼70–80% of total gains across metrics. Gains continued more gradually between n = 50 and 200, accounting for most of the remaining improvements (∼15–25%). Beyond n = 200, all metrics reached ∼92–95% of their final values. After n = 300, changes became minimal, and metrics plateaued near 97–99% (Figure 2). Notably, ICC reached excellent reliability from n = 50 onwards across all lobes (Koo & Li, 2016).

Model fit evaluation in the HC test set.
Performance evaluation of models as a function of the number of participants used for fitting the models (n), calculated using the HC test set from the OASIS-3 dataset. Depicted are: A: MSLL; B: SMSE; C: EV; D: Rho; E: ICC. For all plots, solid lines represent the mean performance of the evaluation metric across all cortical region models grouped by lobe and the mean performance across subcortical region models. The thick black line indicates the overall mean performance across all cortical and subcortical models. In Panels A–D, dashed lines denote the performance using the full training set (n = 692) and the shaded areas indicate the standard deviation, reflecting variability across 10 iterations of the same sample size. In Panel E, grey dotted lines indicate commonly accepted reliability thresholds: below 0.5 (poor), up to 0.75 (moderate), up to 0.9 (good), and above 0.9 (excellent) (Koo & Li, 2016).

Linear mixed model results for evaluation metrics under age-skewed sampling conditions.
Models assess the influence of standardized and log-transformed sample size (n) and age sampling strategy (Representative, Left-skewed, Right-skewed) on model performance metrics (MSLL, SMSE, EV, Rho, and ICC). All variables were standardized to allow comparison of effect sizes. Sample size was log-transformed to account for the non-linear association between sample size and model performance. Representative sampling serves as the reference level. Reported β coefficients and corresponding p-values indicate the direction and significance of each effect.
Next, to isolate the effect of age distribution, we compared models trained on HC subsamples with representative age distributions (as used above) to those trained using left-skewed (younger-biased) and right-skewed (older-biased) age distributions, while maintaining balanced sex ratios. All models used the same incremental sample sizes and were evaluated on the same fixed HC test set (see Figure 1A, C, and Table 2).
Left-skewed sampling (i.e., oversampling of younger subjects) had the most pronounced adverse effect on model fit, with a marked increase in SMSE (β = 1.323, p < .001), an increase in MSLL (β = 0.219, p < .001), and decreases in EV (β = -0.927, p < .001), Rho (β = -0.642, p < .001), and ICC (β = -0.300, p < .001). Right-skewed sampling (i.e., oversampling of older subjects) had a smaller effect on model fits compared to left-skewed sampling. It increased SMSE (β = 0.585, p < .001) and MSLL (β = 0.115, p < .001) and decreased EV (β = -0.692, p < .001) and Rho (β = -0.567, p < .001). The effect on ICC was minimal (β = -0.024, p = .026).
Model fit generally improved with increasing sample size across all metrics and sampling strategies. Interaction effects, however, revealed that the rate of improvement varied depending on the sampling strategy. Under left-skewed sampling, SMSE improved substantially with larger sample sizes (interaction β = -0.636, p < .001), suggesting a steeper recovery. Rho and ICC improved, but at a slower rate than under representative sampling (interaction β = -0.112, p < .001 and β = -0.032, p = .003, respectively). Right-skewed sampling showed uniform improvements across metrics, with interaction effects indicating modest gains in EV (β = 0.047, p < .001), Rho (β = -0.034, p < .001), and ICC (β = -0.048, p < .001), and reductions in MSLL (β = -0.084, p < .001) and SMSE (β = -0.139, p < .001).
These results highlight that the effect of increasing sample size varies depending on the sampling strategy, with less consistent improvements observed under left-skewed conditions. All results for age-skewed samplings are summarized in Table 2.
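The mixed-model specification summarized in Table 2 can be expressed as in the following sketch (illustrative; `metrics_df` is a hypothetical long-format DataFrame with one row per ROI, iteration, and condition, and statsmodels is assumed):

```python
import statsmodels.formula.api as smf

# metrics_df (hypothetical): columns 'SMSE', 'log_n' (log-transformed,
# standardized sample size), 'strategy', and 'roi'
model = smf.mixedlm(
    "SMSE ~ log_n * C(strategy, Treatment(reference='Representative'))",
    data=metrics_df,
    groups=metrics_df["roi"],  # random intercept per region model
)
result = model.fit()
print(result.summary())  # standardized beta coefficients and p-values
```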
Finally, to examine the effect of sex ratio imbalance, we trained models on HC subsamples with representative age distributions and varying sex ratios: balanced (1F:1M), 1F:10M, 1F:4M, 4F:1M, and 10F:1M, where F denotes females and M denotes males. As before, all models used the same incremental sample sizes and were evaluated on the fixed HC test set.
Sex distribution imbalances had a comparatively smaller effect on model fit than age distribution shifts (Table 3, Figure 3). Across metrics, the extent of deviation increased with the degree of imbalance. More extreme ratios (1F:10M and 10F:1M) were associated with moderate increases in SMSE (β = 0.166 and β = 0.076, respectively; both p < .001) and reductions in EV (β = -0.124 and β = -0.033, respectively; both p < .001), while more moderate imbalances (1F:4M and 4F:1M) produced minimal changes. ICC was reduced in all imbalanced conditions, with the largest effects observed under the most extreme ratios (β = -0.152 for 1F:10M and β = -0.162 for 10F:1M; both p < .001). Interaction effects with sample size were statistically significant in some cases but remained of small magnitude. All results for sex-imbalanced samplings are summarized in Table 3.

Evaluation of model fits in the HC test set across different sampling strategies of the training set, shown for varying sample sizes (n).
Age-skewed sampling strategies include representative (matching the initial age distribution with balanced sex, 1F:1M), left-skewed (favoring younger individuals), and right-skewed (favoring older individuals), each with balanced sex distributions. Sex-imbalanced sampling strategies include female-to-male ratios of 1:1, 1:4, 1:10, 4:1, and 10:1, all with representative age distributions. Solid lines represent the mean metric values across ROIs and iterations, with shaded areas indicating the standard deviation across iterations for MSLL, SMSE, EV, and Rho. For ICC, shaded areas represent variation across ROIs, as ICC already reflects variability across iterations. Dashed lines indicate the mean metric values for models trained with the full sample for MSLL, SMSE, EV, and Rho.


Linear mixed model results for evaluation metrics under sex-imbalanced sampling conditions.
Models assess the influence of standardized and log-transformed sample size (n) and sex ratio in the training set (1F:1M, 1F:4M, 1F:10M, 4F:1M, 10F:1M; F = female, M = male) on model performance metrics (MSLL, SMSE, EV, Rho, and ICC). All variables were standardized to allow comparison of effect sizes. Sample size was log-transformed to account for the non-linear association between sample size and model performance. Representative sampling (1F:1M) serves as the reference level. Reported β coefficients and corresponding p-values indicate the direction and significance of each effect.
In summary, these findings demonstrate that representative sampling generally provided the best model fit across metrics on the test set, while skewed distributions introduced substantial model fit degradation. Left-skewed sampling particularly exacerbated errors, as shown by the larger effect sizes. Increasing the sample size improved fits in all configurations. Sex ratio imbalance had a statistically significant but limited effect on model fit, with only minor deviations observed across metrics. These findings underscore the dominant influence of age distribution and sample size on model fit. Individual effects of sample size for the different age and sex distributions are presented in the supplementary material (Figures S1–S3), and, importantly, we replicated these findings in an independent dataset (AIBL) (Figures S17–S20, Tables S5–S6).
2.2. Z-scores errors
Subsequently, we assessed the direct impact of age and sex distributions and sample size on normative model outcomes, quantified by the mean squared error (MSE) and mean bias error (MBE) of Z-scores relative to models trained with the full training set.
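These two error measures can be written compactly as below (a minimal sketch; `z_sub` and `z_full` are hypothetical subjects × ROIs arrays of deviation scores from a subsampled model and the full-training-set model, respectively):

```python
import numpy as np

def z_score_errors(z_sub, z_full):
    """Error of deviation scores from a subsampled model relative to the
    full-training-set model."""
    diff = z_sub - z_full
    mse = np.mean(diff ** 2)  # magnitude of the error
    # Direction of the error: negative MBE means Z-scores are lower (more
    # negative) than under the full model, i.e., deviations are overestimated
    mbe = np.mean(diff)
    return mse, mbe
```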
Increasing the sample size significantly reduced MSE in both HC and AD groups (β = -0.443, p < .001), with no significant interaction, indicating similar effects across groups. Age distribution sampling (Figure 4A) had a significant effect on Z-score errors: left-skewed sampling (i.e., oversampling of younger individuals) led to a larger increase in MSE (β = 1.045, p < .001) compared to right-skewed sampling (i.e., oversampling of older individuals) (β = 0.101, p < .001) (Figure 4B, Table 4).

Z-score errors: Influence of sample size, age distribution, and sex imbalance on normative model outcomes.
A: Age distributions for left-skewed (younger-biased), representative, and right-skewed (older-biased) sampling strategies, compared to the full training set in the OASIS-3 dataset. B: Mean squared error (MSE) of Z-scores relative to the full training set model across sample sizes, age distributions, and diagnostic groups. C: Mean bias error (MBE) per region across sample sizes and age distributions in the test set. Shown are the 20 brain regions with the highest Cohen’s d effect sizes distinguishing HC from AD, calculated from models trained on the full training set. From left to right: results for left-skewed, representative, and right-skewed training sets. Blue indicates negative MBE (underestimation), red indicates positive MBE (overestimation), and white indicates close alignment with the full training set model. D: Cubic regression of MSE as a function of age, across sample sizes and sampling strategies. Left-skewed sampling shows increased errors in older individuals; right-skewed sampling shows increased errors in younger individuals. E: Centile curves for the left hippocampus as a function of age, derived from models trained on left-skewed, representative, and right-skewed sampling (n = 100). Colored lines represent the 5th, 50th, and 95th percentiles; grey lines show centiles from the full training set model. F: MSE across sample sizes and test set sex, obtained using sex-imbalanced training sets (female-to-male ratios: 10:1, 4:1, 1:1, 1:4, 1:10), all with representative age distributions.

Linear mixed model results for MSE and total outlier count (tOC) under age-skewed sampling.
Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), and age sampling strategy (Representative, Left-skewed, Right-skewed) on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. Sample size was log-transformed to account for the non-linear association between sample size and model performance. Representative sampling and HC are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects. Age was not included as an additional predictor to keep the modeling approach aligned with the sex-based analysis (Table 5) and to limit the complexity of the models.
Because the MSE quantifies the magnitude of errors but does not provide information about their direction, we additionally computed the MBE to assess whether errors reflected systematic over- or underestimation of deviation scores. Left-skewed sampling consistently resulted in negative MBE values, indicating overestimation of deviations, while right-skewed sampling produced positive MBE values, indicating underestimation, particularly at smaller sample sizes (Figure 4C).
Cubic regression analyses on age revealed that, under the representative distribution, Z-score errors were lowest in mid-range ages and increased toward both extremes. Left-skewed sampling amplified errors in older individuals, with deviations persisting even at larger sample sizes (e.g., n = 100). In contrast, right-skewed sampling led to elevated errors in younger individuals, which progressively decreased with increasing sample size (Figure 4D).
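The age trend in Figure 4D corresponds to a simple cubic fit, sketched below (illustrative; `age` and `subject_mse` are hypothetical per-subject arrays over the test set):

```python
import numpy as np

def cubic_age_trend(age, subject_mse, grid_points=100):
    """Cubic regression of per-subject Z-score MSE on age (Figure 4D-style)."""
    coefs = np.polyfit(age, subject_mse, deg=3)
    grid = np.linspace(age.min(), age.max(), grid_points)
    return grid, np.polyval(coefs, grid)
```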
Age-related trends in MSE were exemplified by the centile curves for the left hippocampus (Figure 4E), where left-skewed sampling led to higher centiles in older individuals (indicating overestimation), while right-skewed sampling produced lower centiles in younger individuals (indicating underestimation).
Sex imbalances in the training set had a smaller but statistically significant effect on Z-score errors (Table 5). Extreme imbalances led to higher MSE compared to the representative configuration, with the largest increases observed for the 10F:1M (β = 0.089, p < .001) and 1F:10M (β = 0.083, p < .001) ratios and smaller increases for 4F:1M (β = 0.031, p < .001) and 1F:4M (β = 0.046, p < .001). Errors did not statistically differ for the HC and AD groups (β = 0.021, p = .137). Errors tended to be lower for individuals matching the overrepresented sex in the training set, with discrepancies increasing with the degree of imbalance (Figure 4F).


Linear mixed model results for MSE and total outlier count (tOC) under sex-imbalanced sampling.
Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), and sex ratio in the training set (1:1, 1F:4M, 1F:10M, 4F:1M, 10F:1M) on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. Sample size was log-transformed to account for the non-linear association between sample size and model performance. The 1:1 ratio and HC are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects. Sex was not included as an additional predictor to keep the modeling approach aligned with the age-based analysis (Table 4) and to limit the complexity of the models.
Highly similar results were found for the AIBL dataset, with the exception that the right-skewed distribution led to a larger effect than in the OASIS-3 dataset (Figure S21, Tables S5–S6).
2.3. Clinical validation
Clinical validity was evaluated for every sampling strategy and sample size using three metrics: (1) the proportion of subjects with atrophy outliers in each ROI (Z < –1.96); (2) the total outlier count (tOC) per subject, defined as the number of ROIs in which a subject’s deviation score exceeds the Z threshold; and (3) the ROC-AUC of a support vector classifier (SVC) distinguishing HC versus AD based on deviation scores (Figure 1C).
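The three readouts can be computed as in the sketch below (a minimal illustration; `z` is a hypothetical subjects × ROIs array of deviation scores and `y` the corresponding diagnosis labels):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

Z_THRESH = -1.96  # bottom 2.5% of the normative distribution

# z: subjects x ROIs deviation scores; y: diagnosis labels (0 = HC, 1 = AD)
outliers = z < Z_THRESH
roi_outlier_pct = 100 * outliers.mean(axis=0)  # % of subjects flagged per ROI
toc = outliers.sum(axis=1)                     # total outlier count per subject

# ROC-AUC of an SVC distinguishing HC from AD, 10-fold cross-validated
auc = cross_val_score(SVC(), z, y, cv=10, scoring="roc_auc").mean()
```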
In the model fitted using the full training set, the highest percentages of volume outliers in the AD group were observed in subcortical regions such as the left hippocampus (30.5%) and left amygdala (28.1%). Among cortical areas, the middle temporal gyrus and parahippocampal gyrus also showed frequent deviations (e.g., 22.8% in the right middle temporal gyrus, 19.2% in the right parahippocampal gyrus).
To illustrate how outlier estimation varies with training sample size, example iterations are shown in Figure 5A. At n = 25, outlier percentages were substantially lower than in the full sample model, for example, 8.4% in the left hippocampus (–22.1% compared to the full sample) and 0% in the left amygdala (–28.1%). At n = 50, estimates became closer to the full model in some regions (e.g., 24.0% in the left hippocampus, –6.5%), while others showed higher deviation rates (e.g., 41.3% in the left amygdala, +13.2%). Outlier estimates in the example at n = 75 closely matched those of the full model across many regions; for example, the left hippocampus showed 30.5% (±0.0%). At n = 100, values showed minor variability, with the left hippocampus slightly above the reference at 34.7% (+4.2%). This residual fluctuation across runs highlights some instability in outlier estimation even at higher sample sizes. Similar variability was observed in cortical areas. For instance, the percentage of outliers in the right middle temporal gyrus increased from 12.6% at n = 25 (–10.2%) to 18.0% at n = 100 (–4.8%), compared to 22.8% in the full model. In the right parahippocampal gyrus, values rose from 3.6% at n = 25 (–15.6%) to 18.0% at n = 100 (–1.2%), relative to 19.2% in the full model.

Clinical validation: effect of sample size and age distributions in outlier detection and classification performance.
A: Percentage of participants with extreme negative deviations (Z < -1.96) in each brain region in the AD group, shown for independent example iterations at different sample sizes with representative age distribution. B: Average total outlier count (tOC) in the AD and HC groups, represented with dashed and solid lines respectively as a function of sample size (n). The shaded areas indicate the standard deviation across iterations. Dotted grey lines correspond to the tOC estimates for the HC and AD groups obtained with the full sample size. C: ROC-AUC of a support vector classifier with 10-fold cross-validation as a function of sample size in the training set. The solid lines represent the average AUC across iterations for each sampling strategy. The dotted line shows the performance of the models trained with the full sample size.
tOC analysis (Figure 5B, Tables 4–5) showed consistently higher estimated outlier counts in AD compared to HC. Models trained on the full training set estimated an average of 14 outliers per individual in AD compared to 4 in HC. This group difference was confirmed by the statistical models (Tables 4–5), which reported a significant main effect of diagnosis (β = 0.665, p < .001), independent of sample size or sampling distributions.
No significant main effect of sample size was found in HC, as outlier counts quickly converged to full-model estimates. In AD, a significant interaction with sample size (β = 0.096, p < .001) indicated that tOC was underestimated at small sample sizes and increased with larger sample sizes, with representative and right-skewed sampling converging to full-model estimates around n = 100 (Figure 5B).
Left-skewed sampling strongly increased tOC (β = 0.461, p < .001), an effect amplified in AD (interaction β = 0.721, p < .001). Under this sampling strategy, tOC only slowly decreased with larger sample sizes (β = -0.323, p < .001 for the interaction of sample size with left-skewed sampling; β = -0.290, p < .001 for the three-way interaction with AD).
Sex-imbalanced samplings showed no significant main effects on tOC (all p > .05). Several interactions with sample size were significant for imbalanced sex ratios (e.g., 4F:1M: β = -0.012, p = .001; 1F:10M: β = 0.014, p < .001), but these effects were very small compared to those observed for age distributions. The only interaction with AD reaching significance was under the extreme male-dominated condition (1F:10M × AD: β = -0.013, p = .012), indicating a slight moderation of tOC in the AD group in that scenario.
Finally, the influence of training sample size and age distribution on classification performance was assessed by evaluating the ability to distinguish the AD and HC groups based on deviation scores from the normative models (i.e., Z-scores). ROC-AUC increased with sample size for all sampling strategies, reaching performance comparable to the models trained with the full training sample size (AUC = 0.86) at around n = 15, and stabilized from that point onwards (Figure 5C). This suggests that small training sets may already capture relevant group-level patterns.
Importantly, all findings from the clinical validation analyses were replicated in the independent AIBL dataset (Figures S22, Tables S7–S8).
2.4. Adaptation from large dataset
To motivate the need for model adaptation across cohorts (Figure 1B), we examined differences in mean cortical thickness between UKB and the clinical datasets. Systematic differences were evident, with UKB consistently showing higher mean cortical thickness compared to AIBL and OASIS-3, regardless of age (Figure 6). These differences most likely reflect acquisition-related variability and highlight the necessity of adapting models when applying the method to new datasets.

Site effects on mean cortical thickness (whole-brain average) in the UK Biobank (UKB), AIBL, and OASIS-3 datasets for HC.
A: Regression plots showing the relationship between age and mean cortical thickness across different datasets and imaging sites (UKB: 3 sites, AIBL: 2 sites, OASIS-3: 1 site). Marginal density plots are displayed on the sides of each axis, illustrating the distribution of mean cortical thickness and age for each dataset’s imaging sites. The UKB exhibits notable differences in mean cortical thickness estimations compared to AIBL and OASIS-3, unrelated to the age distribution of participants. B: Boxplots representing the mean cortical thickness for participants in each dataset and imaging site. C: Bivariate kernel density estimates plots showing the joint distribution of age and mean cortical thickness for each dataset. The contour plots represent the density of data points, with filled areas reflecting higher concentrations of data.
Adapted models based on pre-trained UKB data and transferred to OASIS-3 rapidly achieved the performance level of models trained directly within the cohort (Figure 7). Across all evaluation metrics (MSLL, SMSE, EV, Rho, and MSE), adapted models plateaued around n = 50, whereas models trained directly within the cohort required approximately n = 200 to reach comparable performance. Explained variance (EV) and correlation (Rho) remained stable across all adaptation sample sizes, as adaptation modifies only the mean of the predictions without changing their variance or rank structure.
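Conceptually, this recalibration step can be sketched as a mean shift estimated on the adaptation sample (a simplification for illustration, not the exact PCNtoolkit transfer procedure; array names are hypothetical):

```python
import numpy as np

def adapted_z_scores(y_test, yhat_test, s_test, y_adapt, yhat_adapt):
    """Deviation scores from a pre-trained model after mean recalibration.

    The predictive mean is shifted by the average residual on the adaptation
    set; because only the mean changes, EV and Rho are unaffected
    (consistent with Figure 7A).
    """
    offset = np.mean(y_adapt - yhat_adapt)  # site-specific mean shift
    return (y_test - (yhat_test + offset)) / s_test  # s_test: predictive SD
```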

Adaptive transfer learning evaluation.
A. Comparison of model performance between directly trained models (OASIS-3-trained) and models pretrained on UKB then adapted to OASIS-3 (UKB-adapted). The grey curve shows the difference in performance (within-cohort minus adapted), plotted on a separate y-axis (right). For MSLL, SMSE, and MSE, lower values indicate better performance; therefore, positive differences indicate better performance of the within-cohort models. For EV and Rho, higher values indicate better performance; hence, positive differences reflect better performance of the adapted models. EV and Rho values remain constant across models, as the adaptation procedure only modifies the mean of the models and does not affect their shape. B. Example adaptation iteration showing the percentage of individuals identified as outliers per region. The last row shows the outlier detection obtained using the full adaptation sample (n = 692).
Despite the convergence in evaluation metrics, outlier detection at the full adaptation sample size (n = 692) revealed differences between the two modeling strategies, with and without adaptive transfer learning. In the left amygdala, the adapted model detected 15.0% outliers compared to 28.1% for the models trained directly without adaptation (–13.1%); in the left hippocampus, 17.4% versus 30.5% (–13.1%); in the right amygdala, 11.4% versus 23.4% (–12.0%); and in the right hippocampus, 15.0% versus 22.2% (–7.2%). Cortical regions also showed discrepancies, such as the right middle temporal gyrus (17.4% adapted vs. 22.8% within-cohort, –5.4%) and the left parahippocampal gyrus (14.4% vs. 16.2%, –1.8%).
Adapted models, while converging faster to the full adaptation set model estimation compared to within-cohort trainings, still showed instability in outlier detection at small adaptation sample sizes. In the example iterations in Figure 7B, at n = 25, the percentage of outliers was 7.2% in the left amygdala and 1.8% in the left hippocampus (–7.8 and –15.6%, respectively); at n = 50, values reached 19.2% and 9.6% (+4.2 and –7.8%). At n = 100, percentages were 15.0% in the left amygdala and 18.0% in the left hippocampus, closely aligning with the full model (differences under ±1%).
Similar to the direct training case, these values correspond to individual example iterations and are shown for illustration. The full replication of the previous analyses using UKB models adapted to OASIS-3 and AIBL is provided in the Supplementary Materials (Figures S9–S16 and S26–S33, respectively).
3. Discussion
In this study, we systematically examined how sample size and covariate distributions impact the fit and clinical utility of normative models applied to AD datasets. To ensure robustness and generalizability, we leveraged a large-scale discovery sample (OASIS-3) and a replication sample (AIBL). Models were first fitted on 80% of the HC participants from OASIS-3, representing the optimal scenario in which the largest available dataset is leveraged to maximize accuracy and reliability. These full-sample training models served as a benchmark against which we evaluated the impact of reduced sample sizes and covariate misalignment. We systematically sub-sampled the full training set to vary sample sizes and manipulate covariate distributions (i.e., age and sex) and then assessed their effects on model fit in the HC test set as well as on deviation scores and outlier detection in both the HC test set and the AD cohorts. Finally, we applied the same sampling and evaluation framework to an adaptive transfer learning setting, in which pre-trained UK Biobank models were adapted using the same clinical training subsets previously used for direct training.
Across all analyses, increasing sample size yielded more stable model fits and more consistent results. In contrast, misaligned covariate distributions, particularly with respect to age, introduced systematic distortions in model predictions and outlier detection. In the following sections, we provide a detailed analysis of how sample size and covariate distributions affected model evaluation and practical application.
Across all analyses, models trained on the full training set of available HC yielded the best overall fit, demonstrating superior performance when all data are utilized. While we acknowledge that this model does not represent an absolute ground truth, it served as a robust reference for evaluating how variations in sample size and covariate distribution affect model fit. Models trained directly on OASIS-3 achieved model fit comparable to UKB-pretrained models, indicating that training solely on the target cohort was sufficient for reliable estimation. The models fitted with the full training set also provided a strong reference for biological plausibility and interpretability, producing results consistent with established AD pathology (Igarashi, 2023; Planche et al., 2022). Specifically, in the HC test set, full training set models performed as expected, with an average tOC per individual of around 4 in both datasets, aligning with expectations given the defined threshold for outlier detection (bottom 2.5% among 167 ROIs). In contrast, individuals with AD exhibited elevated tOC values, averaging 14 in OASIS-3 and 20 in AIBL. Importantly, regions surrounding the entorhinal cortex, hippocampus, and amygdala exhibited the highest outlier percentages, highlighting their well-established involvement in AD pathology (Igarashi, 2023).
Subsampling analyses revealed that the largest improvements in model performance occurred up to approximately 50 subjects, at which point metrics such as the intraclass correlation coefficient (ICC) reached excellent reliability and 60–90% of total performance gains were achieved. Beyond 200 subjects, performance began to plateau with diminishing returns up to 300 and with only marginal improvements up to 600 subjects.
Sample-size requirements depend strongly on the intended use of a normative model. In the present study we exemplified two applications commonly used in recent normative-modeling work: (1) outlier detection through regional deviation mapping across the brain and the tOC per individual (Bhome et al., 2024; Floris et al., 2021; Loreto et al., 2024; Rutherford et al., 2023; Verdi et al., 2023, 2024; Zabihi et al., 2020); (2) case-control classification to assess whether the relative population ranks were preserved. Subsampling of real data showed that approximately 100-200 well-characterized HC were sufficient to generate clinically robust deviation maps and tOC values, while adding additional participants primarily reduced residual variance at a high acquisition cost. For the classification task, performance in OASIS-3 was indistinguishable from the full model with only fifteen training samples.
Sample size had no significant effect on tOC in the HC group, likely due to the low number of deviations expected in healthy aging. In contrast, tOC estimates in the AD group stabilized with increasing sample size. This was particularly evident in regions such as the hippocampus and amygdala, where outlier detection was inconsistent at lower sample sizes despite their known vulnerability in AD (Barnes et al., 2009; Qu et al., 2023). Previous work using simulated hippocampal-volume data suggested that several thousand subjects are needed to stabilize the extreme lower percentiles (Bozek et al., 2023). In practice, achieving cohorts of that scale is hindered by high data acquisition costs, privacy constraints, and the logistical challenges of harmonizing data from multiple sites. Moreover, our clinical-validation tasks illustrated that much smaller sample sizes saturate the estimation of clinical readouts. Our results therefore do not prescribe a universal threshold; rather, they provide a reference for the precision that can reasonably be expected from sample sizes commonly attainable in clinical studies. Roughly 200–300 participants emerge as a pragmatic baseline for deviation mapping in ageing cohorts, while larger datasets are strongly recommended for accurate percentile screening. Notably, broader lifespan applications may require larger samples to ensure coverage across the full age range. This is further supported by our analysis of age-skewed distributions, which highlights the importance of aligning the training cohort with the target population.
Indeed, beyond sample size, skewed age distributions affected both model fit and the accuracy of deviation scores. Deviation scores were systematically overestimated in older individuals when models were trained on younger (left-skewed) samples and underestimated in younger individuals when trained on older (right-skewed) samples. These patterns align with well-documented age-related brain changes, including cortical thinning and subcortical atrophy with increasing age (Lemaitre et al., 2012). Left-skewed samplings inflated normative means, leading to more pronounced negative deviations in older individuals, whereas right-skewed samplings lowered reference values, underestimating deviations in the young. As also observed in the simulation study by Bozek et al. (2023), errors were most pronounced at the extremes of the age distribution, where data were sparse. Centile analyses confirmed that models performed reliably across well-sampled age ranges but produced unstable estimates at underrepresented ages, even under representative sampling. These edge effects persisted despite covariate alignment and were attenuated with larger sample sizes.
Our findings highlight the critical importance of age distribution in clinical applications. When the age range of the training data failed to align with that of the target population, deviation estimates became systematically biased, affecting both global outlier counts and regional detection patterns. This was most notable under left-skewed sampling, where younger-biased training sets inflated deviation scores across both HC and AD groups, with pronounced effects in regions such as the hippocampus and amygdala. Importantly, these biases were not fully resolved until relatively large sample sizes were reached. While errors under right-skewed sampling gradually diminished with increasing sample size, overestimation persisted even at large sample sizes under left-skewed distributions.
Left-skewed sampling had an overall greater effect than right-skewed sampling in both model evaluation and clinical validation, likely due to (1) the dataset’s original bias toward older individuals, making younger-skewed samples less representative, and (2) the older age structure of the AD population, which exacerbates the mismatch when younger HC are used to calibrate models in the clinical population. Although we generated skewed distributions artificially, they mirror real-world imbalances frequently observed in neuroimaging studies, including in our own dataset, where recruitment constraints often result in age-biased cohorts (Bethlehem et al., 2022; LeWinn et al., 2017).
Together, these findings underscore the need for careful covariate matching when developing normative models for clinical application, not only in terms of sample size but also distributional structure. This is particularly important in disorders such as AD, where age is tightly linked to disease progression and anatomical change. Deviations from a representative age distribution can propagate errors that directly affect individual-level interpretation. However, recruiting very old healthy individuals is inherently challenging. With increasing age, clinical and biological heterogeneity also increases, due in part to the greater inter-individual variability in brain structure (Fjell & Walhovd, 2010; Lemaitre et al., 2012), the accumulation of comorbidities, and the difficulty of confidently excluding preclinical or undiagnosed conditions. Additionally, selective survival effects may introduce sampling biases, further complicating the construction of representative reference cohorts in the oldest age ranges.
Given the higher prevalence of AD in women and documented sex-related differences in brain anatomy (Ferretti et al., 2018; Küchenhoff et al., 2024; Riedel et al., 2016; Ruigrok et al., 2014), we examined whether imbalances in sex distribution influenced model fit or deviation scores. Overall, sex imbalances had minimal effects in both model evaluation and clinical validation. Model fit remained stable, and only small differences in deviation score accuracy were observed. Deviation score accuracy was higher for individuals whose sex was more represented in the training data, and the strength of this association increased with the degree of sex imbalance. While sex imbalances had little impact on global model performance, some interaction effects with diagnosis were significant for deviation scores and outlier detection, particularly under extreme sex ratios. However, these effects were small and inconsistent, indicating limited practical impact overall. This limited sensitivity suggests that sex-related anatomical variation may generalize more robustly than age-related changes, especially in ageing populations. Moreover, unlike age, which introduces continuous, nonlinear variability, sex is a discrete covariate, which makes it easier to model. These findings indicate that moderate sex imbalances are unlikely to affect overall model performance, although sex-matched training data may still be beneficial for improving individual-level assessments.
Our findings highlight the potential of adaptive transfer learning for efficient model fitting in clinical cohorts when large reference datasets are available for pre-training. Normative models pre-trained on the external large-scale UKB dataset rapidly achieved performance levels comparable to models trained directly within the cohort, requiring substantially fewer individuals. Evaluation metrics plateaued around 50 participants for adapted models, whereas within-cohort models required approximately 200 to achieve similar fit. This suggests that pre-trained models can be effectively recalibrated with modest sample sizes, offering practical advantages in clinical or low-resource contexts where large datasets are often unavailable. Notably, this stabilization threshold of 50 participants was higher than the 25 participants reported by Gaiser et al. (2024), who used Hierarchical Bayesian Regression (HBR). HBR’s hierarchical structure allows information sharing across sites, which may facilitate convergence with fewer samples. In contrast, our study used Bayesian Linear Regression (BLR), a method that does not account for cross-site dependencies and may require more data to accommodate population heterogeneity. Nevertheless, BLR remains widely used in recent normative modeling applications (Bhome et al., 2024; Corrigan et al., 2024; Holz et al., 2023; Meijer et al., 2024; Rutherford, Kia, et al., 2022; Savage et al., 2024) because of its computational efficiency, ease of implementation, and flexibility. Another factor likely contributing to this difference in stabilization thresholds is the demographic composition of the studied populations. Gaiser et al. (2024) examined a younger, developmentally homogeneous cohort (ages 6–17), whereas our sample spanned older adults (44–82 years), a range characterized by greater inter-individual variability in brain structure (Fjell & Walhovd, 2010; Lemaitre et al., 2012). This increased heterogeneity may necessitate larger adaptation sets to achieve stable and reliable deviation estimates.
While evaluation metrics indicated convergence in overall model performance, differences remained in outlier detection between adapted and within-cohort models. In several regions commonly affected in AD, adapted models consistently identified fewer outliers, even at the largest adaptation sample size. These discrepancies suggest that, although adaptation effectively aligns prediction accuracy, subtle differences in the underlying normative distributions may persist and influence deviation-based measures. Furthermore, the age distribution of the adaptation set continued to influence both model fit and outlier detection, despite the pre-trained UKB model already covering the same age range extensively. In contrast, sex distribution had minimal influence on the adaptation outcomes.
Overall, these results support the use of adaptive transfer as a resource-efficient alternative to full retraining, particularly when pre-trained models are available for the targeted demographics.
Some limitations should be considered. This study relied on datasets in which adults of European ancestry were overrepresented. Although the UK Biobank, which was used to pre-train the model, provided a robust reference distribution for the targeted populations of OASIS-3 and AIBL in this study, expanding reference datasets to include populations with more diverse socioeconomic and ethnic backgrounds would further improve representativeness and ensure broader applicability of normative models (Harnett et al., 2024). Moreover, we did not simultaneously manipulate age and sex distributions in the training and adaptation sets. Exploring these factors jointly would have required a substantially more complex experimental design and statistical modeling, as their interactions could introduce confounding effects that are difficult to disentangle within the current framework. Finally, our analysis was restricted to cortical thickness for cortical regions and volumes for subcortical structures. Thickness was preferred in the cortex as it is more sensitive to age-related cellular alterations (Lemaitre et al., 2012), while volumetric measures were used for subcortical regions, where the absence of a clear laminar organization and current MRI resolution limit reliable thickness estimation (Dima et al., 2021; Lemaitre et al., 2012). Future work could incorporate additional structural or microstructural metrics to enhance the characterization of brain changes and improve the applicability of normative models across conditions.
Collectively, our results indicate that the benefits of enlarging a reference cohort depend critically on how closely its demographic profile mirrors that of the clinical sample. Gains from additional participants plateau when age alignment is poor, whereas even moderate-sized cohorts perform well when demographic concordance is preserved. This nuance underscores the importance of balancing sample size with representativeness when assembling reference datasets for normative modeling in ageing and neurodegeneration research. Our findings provide practical guidelines for maximizing the value of moderately sized but deeply phenotyped datasets, thereby reducing costs and limiting the need for additional large-scale data collection without compromising accuracy.
4. Methods
4.1. Data
4.1.1. Cohorts
We used data from three publicly available datasets: two clinical datasets including individuals diagnosed with AD and HC, namely the Open Access Series of Imaging Studies-3 (OASIS-3) and, for replication, the Australian Imaging Biomarkers and Lifestyle Study of Ageing (AIBL) (Fowler et al., 2021). In addition, the UKB (Sudlow et al., 2015) was used for pre-training normative models for the adaptive transfer learning experiments. In the following, we provide more details on each dataset (see also Table 1):
OASIS-3
The OASIS-3 sample included 1,098 participants from the United States. For the present project, we included participants from OASIS-3 whose ages overlapped with the UKB age range, had available T1-weighted MRI scans, passed QC checks, and were diagnosed with either AD or identified as HC. Only baseline scans were included to ensure independence across observations, and individuals with mild cognitive impairment or other dementia-related diagnoses were excluded. The final OASIS-3 sample consisted of 1,032 individuals aged from 45 to 82 years (58% female), including 865 HC participants and 167 diagnosed with AD (LaMontagne et al., 2019). Image acquisitions were performed at a single site. Detailed acquisition protocols are provided at https://bpb-us-w2.wpmucdn.com/sites.wustl.edu/dist/6/4383/files/2024/04/OASIS-3_Imaging_Data_Dictionary_v2.3-a93c947a586e7367.pdf.
AIBL
AIBL is a longitudinal Australian cohort study comprising 2,359 participants. Data were collected by the AIBL study group, and the AIBL study methodology has been reported previously (Ellis et al., 2009). Participants were included if they had available T1-weighted MRI scans that passed QC checks (see section 4.1.2), were aged 44–82 years (overlapping with the UKB range), and were classified as HC or diagnosed with AD. Individuals with other dementia-related diagnoses were excluded, and only baseline time points were retained, consistent with OASIS-3. The final AIBL sample consisted of 463 individuals aged 58–82 years (58% female), including 403 HC and 60 AD participants. MRI data were acquired at two sites; acquisition protocols are described in detail elsewhere (Ayton et al., 2017; Dima et al., 2021). Ethics approval was obtained from the institutional human research ethics committees of Austin Health, St Vincent’s Health, Hollywood Private Hospital, and Edith Cowan University, and all participants provided written informed consent.
UKB
The UKB is a large-scale health and imaging databank of 500,000 participants, from which 42,747 healthy individuals aged 44 to 82 years (53% female) were selected based on the availability of T1-weighted MRI scans and their quality control (QC) assessment (see section 4.1.2. for more details on quality control and preprocessing steps of all datasets). Only baseline time points were included when longitudinal data were available. MRI data were collected at three different sites using harmonized protocols. The detailed acquisition protocols are provided in https://www.fmrib.ox.ac.uk/ukbiobank/protocol/. All participants provided full informed consent, and the study received ethics approval from the National Health Service National Research Ethics Service (Ref 11/NW/0382). This research was conducted using the UKB resource under application number 96841.
4.1.2. MRI pre-processing and ROI extraction
All datasets underwent processing using the FreeSurfer software, with quality control (QC) performed primarily after segmentation (Fischl, 2012). The standard FreeSurfer pipeline included intensity normalization, skull stripping, and segmentation of grey matter (GM) and white matter (WM) surfaces, ultimately producing a tessellated representation of the GM/WM boundary. These surfaces were then corrected for topological errors and aligned to a spherical atlas based on individual cortical folding patterns. Cortical thickness was defined as the shortest distance between the vertices of the GM/WM boundary and the pial surface, and was measured for each individual using the Destrieux parcellation scheme, which includes 74 regions per hemisphere (Destrieux et al., 2010). For a comprehensive overview of the pipeline and its documentation, refer to the official FreeSurfer website (https://surfer.nmr.mgh.harvard.edu/).
In this study, we focused on cortical thickness and subcortical volumes. Cortical thickness was included as it is a sensitive biomarker of age-related atrophy, particularly in the parietal cortex, and provides valuable insight into neurodegeneration in AD (Lemaitre et al., 2012; McGinnis et al., 2011). Cortical volumes, which reflect both cortical folding and thickness, may be less sensitive to specific cellular changes such as neuronal, dendritic, and synaptic alterations (Lemaitre et al., 2012). Subcortical volumes were also considered, as AD is known to particularly affect the entorhinal cortex and surrounding regions (Qu et al., 2023; Yan et al., 2019; Zhao et al., 2019). These structures lack the laminar organization of the cortex, making thickness estimates unreliable; as a result, volumetric measures are typically used for subcortical regions (Dima et al., 2021; Lemaitre et al., 2012).
For labeling subcortical tissue classes, the volume-based stream was utilized. After affine registration to MNI305 space, initial volumetric labeling, and B1 bias field intensity correction, a high-dimensional nonlinear volumetric alignment to the MNI305 atlas was performed, with the volume labeled according to the ASEG atlas (Fischl et al., 2002). From this atlas, we retained 19 subcortical regions.
4.1.3. Quality Control
QC procedures were implemented across all datasets. For the OASIS-3 and AIBL datasets, QC was performed internally by our research team (BHV and CE). We used the quality assessment tools bundled with FreeSurfer, which include several automated QC measures, such as signal-to-noise ratio, topological defects, Euler number, and rotation parameters, to identify potential issues with the MRI data. Each scan flagged by these measures underwent thorough visual inspection, and scans with poor contrast, incorrect orientation, or significant artifacts were excluded. Consequently, 39 scans were excluded from the OASIS-3 dataset, while no exclusions were necessary for the AIBL dataset. For the UKB, QC was conducted as part of the original data release using a semi-automated pipeline: T1-weighted images were scored by an automated classifier, with manual review of cases close to the ‘bad data’ threshold, and seriously problematic data were excluded. FreeSurfer outputs were additionally checked using the Qoala-T approach (Klapwijk et al., 2019), supplemented by manual review, ensuring that only high-quality data were used. For more details, refer to the documentation available at https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/brain_mri.pdf.
4.2. Normative modeling
4.2.1. Estimation of normative models
To capture individual deviations from typical ageing patterns, normative models were fitted independently within each dataset (OASIS-3 and AIBL) for each ROI (Figure 1A). Each dataset was stratified by site, sex, and age (5-year bins), and split into a training set (80% of HC; n = 691 for OASIS-3, n = 322 for AIBL) and a test set (20% of HC plus all AD participants; n = 174 HC and n = 167 AD for OASIS-3, and n = 81 HC and n = 60 AD for AIBL) (Table 1).
We employed Warped Bayesian Linear Regression (BLR) with a B-spline basis expansion, using Python 3.9.16 and PCNtoolkit version 0.29 (Fraza et al., 2021). All procedures followed the recommendations for BLR models detailed in Rutherford, Kia, et al. (2022). Age, sex, and a dummy-coded site variable were included as covariates. Age and sex were modeled using a cubic B-spline basis with three evenly spaced knots, yielding smooth curves that can flexibly represent changes in brain structure across ages and between sexes. The Sinh-Arcsinh function and the corresponding warped likelihood were used to handle non-Gaussian distributions in the data (Fraza et al., 2021; Jones & Pewsey, 2009), and the model was optimized by minimizing the negative log-likelihood using Powell’s conjugate direction method.
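As an illustration, a minimal sketch of this estimation step with PCNtoolkit is shown below; the file names, covariate layout, and knot placement are our assumptions for illustration, not the authors’ exact pipeline.

```python
# Minimal sketch (assumed file layout): one text file per matrix,
# covariates ordered as [age, sex, site dummies].
import numpy as np
from pcntoolkit.normative import estimate
from pcntoolkit.util.utils import create_bspline_basis

cov_tr = np.loadtxt('cov_train.txt')   # hypothetical training covariates
cov_te = np.loadtxt('cov_test.txt')    # hypothetical test covariates

# Cubic B-spline expansion of age (column 0); three knots, as in the text.
B = create_bspline_basis(cov_tr[:, 0].min(), cov_tr[:, 0].max(), p=3, nknots=3)
np.savetxt('cov_bspline_train.txt', np.c_[cov_tr, [B(a) for a in cov_tr[:, 0]]])
np.savetxt('cov_bspline_test.txt', np.c_[cov_te, [B(a) for a in cov_te[:, 0]]])

# Warped BLR: Sinh-Arcsinh likelihood, negative log-likelihood minimized
# with Powell's method; one response file per ROI.
estimate('cov_bspline_train.txt', 'resp_train.txt',
         testcov='cov_bspline_test.txt', testresp='resp_test.txt',
         alg='blr', optimizer='powell',
         warp='WarpSinArcsinh', warp_reparam=True, savemodel=True)
```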
From normative models, deviation scores (Z-scores) were derived to quantify how much an individual’s measurement deviates from the predicted normative range. These scores were computed as the difference between the observed value and the model’s predicted conditional mean, scaled by the estimated variance. Unlike traditional Z-scores, which assume a fixed population variance, normative modeling Z-scores account for two sources of uncertainty commonly defined in machine learning: (1) aleatoric uncertainty (irreducible uncertainty), which captures true underlying variability across individuals that cannot be reduced with more data; and (2) epistemic uncertainty (reducible uncertainty), which reflects parameter uncertainty due to limited data and can be minimized as more data becomes available (Fraza et al., 2021; Marquand et al., 2016). By integrating these components, Z-scores provide a standardized measure of deviation, indicating whether an individual’s measurement falls within or outside the normative range and how atypical it is relative to the reference population.
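In our notation (not the authors’ exact symbols), the deviation score for individual i and ROI j therefore takes the standard form used in warped BLR, computed in the warped (Gaussian) space (Fraza et al., 2021; Marquand et al., 2016):

```latex
Z_{ij} \;=\; \frac{y_{ij}-\hat{y}_{ij}}{\sqrt{\hat{\sigma}_{ij}^{2}+\left(\sigma_{*}^{2}\right)_{ij}}}
```

where $y_{ij}$ is the observed measure, $\hat{y}_{ij}$ the predicted conditional mean, $\hat{\sigma}_{ij}^{2}$ the aleatoric variance, and $(\sigma_{*}^{2})_{ij}$ the epistemic variance.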
4.3. Subsampling
We first fitted models for each ROI using 80% of the HC participants from OASIS-3 (or from AIBL for the replication analysis). These models represent the optimal scenario, leveraging the maximum available data to achieve the highest accuracy and reliability, and served as a reference benchmark for evaluating the impact of reduced sample sizes and covariate misalignment. To assess how systematic variations in (1) sample size and (2) covariate distributions influenced model performance, we applied a series of subsampling strategies to generate training sets with varying sizes and age or sex distributions.
For each subsampling condition, a set of individuals was selected to form the training set, which was then used to fit separate models for each ROI. This approach allowed us to systematically assess how variations in sample composition affected model fit across brain regions. Across all subsampling strategies, the test set remained identical (see section 4.2.1), ensuring comparability. All primary results are presented for the OASIS-3 dataset, with replication in the AIBL dataset detailed in the supplementary material.
4.3.1. Sample Size Variations
Subsamples were drawn in increments of 5 from size 5 to 200, and in increments of 50 from 200 up to 600. Finer steps allowed us to closely track changes in model performance at small sample sizes, while larger steps were applied beyond 200, where convergence was expected, to reduce computational demands. For each sample size, 10 random iterations were performed to estimate variability and reduce the influence of individual draws. For replication in AIBL, samples were drawn from 5 to 100 per site, owing to the more limited available data.
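The resulting size grid and iteration loop can be sketched as follows (the helper functions are hypothetical placeholders, named by us for illustration):

```python
# Sample-size grid: steps of 5 up to 200, then steps of 50 up to 600.
sizes = list(range(5, 201, 5)) + list(range(250, 601, 50))

for n in sizes:
    for seed in range(10):                             # 10 random draws per size
        subsample = draw_subsample(hc_train, n, seed)  # hypothetical helper
        fit_models_per_roi(subsample)                  # hypothetical helper
```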
4.3.2. Subsampling with different covariate distributions
We assessed the influence of age and sex distributions in the training set by starting from a representative age distribution with balanced sex ratios and then altering either the age distribution (skewed subsampling) or the sex balance (uneven female-to-male ratios), while systematically varying sample size as described above.
To generate subsamples that mimicked the original age distributions of the data, we partitioned the age distribution into 10 bins using a quantile discretizer, which ensured equal numbers of participants in each bin. An equal number of males and females (±1) was then randomly selected across bins. This approach, referred to as representative sampling, ensured that the subsampled training sets closely matched the original age distribution of the dataset while maintaining a balanced sex ratio.
To create training sets with skewed (non-representative) age distributions, we applied two strategies that preferentially selected either younger (left-skewed) or older (right-skewed) individuals. More specifically, the age distribution was divided into 10 equally spaced age bins, using a uniform discretizer (which divides the age range into equal-width intervals but does not ensure an equal number of participants per bin as in the representative subsampling). Samples were randomly drawn from these bins, and a beta distribution was applied to weight the probability of drawing samples from each bin. For the left-skewed sampling, we used parameters α=2 and β=5, increasing the likelihood of selecting younger samples. Conversely, for the right-skewed sampling, parameters α=5 and β=2 were used, concentrating more samples in the older age bins.
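A compact sketch of both age-sampling schemes is given below, assuming a pandas DataFrame with 'age' and 'sex' columns; the function names, sex coding, and tie-breaking details are our simplifications, not the original code.

```python
import numpy as np
import pandas as pd
from scipy.stats import beta

def representative_sample(hc, n, seed):
    """10 quantile age bins (equal counts) with a balanced sex ratio."""
    rng = np.random.default_rng(seed)
    bins = pd.qcut(hc['age'], q=10, labels=False)
    picks = []
    for b in range(10):
        for sex in ('F', 'M'):                 # assumed coding
            pool = hc[(bins == b) & (hc['sex'] == sex)]
            k = min(n // 20, len(pool))        # n / (10 bins x 2 sexes)
            picks.append(pool.sample(k, random_state=int(rng.integers(1_000_000_000))))
    return pd.concat(picks)

def skewed_sample(hc, n, a, b, seed):
    """10 equal-width age bins, draw weights from a Beta(a, b) density:
    (a=2, b=5) favours younger bins ('left-skewed' sampling),
    (a=5, b=2) favours older bins ('right-skewed' sampling)."""
    rng = np.random.default_rng(seed)
    bins = pd.cut(hc['age'], bins=10, labels=False).astype(int)
    centers = (np.arange(10) + 0.5) / 10       # bin midpoints mapped to (0, 1)
    w = beta.pdf(centers, a, b)                # per-bin draw weight
    p = w[bins.to_numpy()]                     # per-individual draw weight
    idx = rng.choice(hc.index.to_numpy(), size=n, replace=False, p=p / p.sum())
    return hc.loc[idx]
```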
To examine the influence of sex imbalance, we adapted the representative age-sampling strategy but altered the male-to-female ratio according to predefined levels (1:4, 1:10, 4:1, 10:1). This enabled us to assess how sex-distribution imbalances affected model performance and the clinical interpretation of deviations.
4.4. Model evaluation
Subsequently, models trained on subsampled data were evaluated on the same fixed test set of participants, assessing model fit in the HC test set and Z-score errors in both HC and AD test sets (Figure 1C). Performance of the subsampled models was then compared with that of the models trained on the full training set across three subsampling scenarios: 1) representative age distribution with balanced sex ratios, 2) skewed age subsampling with balanced sex ratios, and 3) representative age sampling with unbalanced sex ratios.
4.4.1. Evaluation of normative models fitting
Each model was evaluated using established evaluation metrics: the mean standardized log loss (MSLL), standardized mean squared error (SMSE), explained variance (EV), and Pearson correlation (Rho) (Fraza et al., 2021; Rutherford, Kia, et al., 2022), as well as the intraclass correlation coefficient (ICC) (Koo & Li, 2016). Evaluation was restricted to HC individuals, as model fit is intended to assess performance relative to the reference population; the AD cohort was instead considered for Z-score errors and clinical validation. MSLL, standardized by the mean loss of the training dataset (Bayer et al., 2022; Rasmussen & Williams, 2008), indicates the predictive performance of a model compared to the mean of the training set, with more negative values indicating better predictive accuracy. SMSE assesses the average squared difference between predicted and actual values, standardized by the variance of the test data; lower values indicate smaller errors and better predictive performance. EV measures the proportion of variance in the test data explained by the model, with higher values reflecting better performance. Rho measures the linear correlation between predicted and actual values; a Pearson correlation close to 1 indicates that the model’s predictions are closely aligned with the actual values. The ICC quantifies the reliability of the models’ outputs across iterations (where iterations refer to repeated random sampling of the same sample size within a subsampling strategy); it was computed on the Z-scores of the HC test set for each model, using a two-way random-effects model for single measurements with absolute agreement.
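For reference, SMSE and MSLL follow their standard definitions (Rasmussen & Williams, 2008); in our notation, with predictions $\hat{y}_i$, predictive variances $\hat{\sigma}_i^2$, and training mean and variance $\bar{y}_{tr}$ and $\sigma_{tr}^2$:

```latex
\mathrm{SMSE} \;=\; \frac{\tfrac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^{2}}{\operatorname{Var}\!\left(y^{\mathrm{test}}\right)},
\qquad
\mathrm{MSLL} \;=\; \frac{1}{N}\sum_{i=1}^{N}\left[
\frac{\left(y_i-\hat{y}_i\right)^{2}}{2\hat{\sigma}_i^{2}}
+\frac{1}{2}\ln\!\left(2\pi\hat{\sigma}_i^{2}\right)
-\frac{\left(y_i-\bar{y}_{tr}\right)^{2}}{2\sigma_{tr}^{2}}
-\frac{1}{2}\ln\!\left(2\pi\sigma_{tr}^{2}\right)
\right]
```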
4.4.2. Evaluation of Z-scores estimations
Z-score errors were computed for each sampling strategy and sample size by comparing models trained on subsampled data with those trained on the full training set. We used the mean squared error (MSE) to summarize overall accuracy, averaging the squared Z-score differences across ROIs for each individual. MSE captures the overall magnitude of errors and, because of the squaring operation, penalizes larger deviations more strongly, making it well suited for assessing general model accuracy at the individual level. To reduce the influence of rare iterations producing atypically large errors, outliers were excluded using a conservative interquartile-range threshold (3×IQR), removing only extreme values while preserving most of the data. Age effects were evaluated using cubic regressions across the age range; sex effects were assessed by comparing MSEs between females and males under imbalanced sampling.
We additionally computed the mean bias error (MBE) to assess the direction of Z-score differences, indicating whether estimates tend to be systematically over- or underestimated. Because opposing errors can cancel each other out, MBE may understate the overall level of error. MSE and MBE therefore provide complementary information: MSE captures the total error magnitude, while MBE reveals directional biases.
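Consistent with the descriptions above, for individual $i$ with $R$ ROIs, writing $Z^{\mathrm{sub}}$ and $Z^{\mathrm{full}}$ for the scores from the subsampled and full models:

```latex
\mathrm{MSE}_{i}=\frac{1}{R}\sum_{r=1}^{R}\left(Z_{ir}^{\mathrm{sub}}-Z_{ir}^{\mathrm{full}}\right)^{2},
\qquad
\mathrm{MBE}_{i}=\frac{1}{R}\sum_{r=1}^{R}\left(Z_{ir}^{\mathrm{sub}}-Z_{ir}^{\mathrm{full}}\right)
```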
4.5. Clinical validation
To evaluate clinical validity, we assessed whether deviation patterns aligned with known AD-related neurobiological evidence and whether they could reliably distinguish HC from AD. Specifically, we examined whether the most deviating regions (i.e., outliers) corresponded to established patterns of AD pathology (biological plausibility) and whether classification based on deviation scores could separate the two groups. For both analyses, we compared the model fitted on the full training set with models fitted on subsampled training sets to determine whether deviation patterns and classification performance remained consistent across sampling strategies and sample sizes.
4.5.1. Outlier detection and biological plausibility
To assess biological plausibility, we computed the percentage of outliers per ROI in the test set, examining the consistency and group-specific patterns of deviations in HC and AD across the model trained on the full set and those trained on each subsampled set. A data point was defined as an outlier if its deviation score was Z < -1.96. This threshold, representing the bottom 2.5% of the normative range, has been commonly used to report extreme deviations in previous work (Bhome et al., 2024; F et al., 2024; Verdi et al., 2023, 2024). Only negative outliers were considered here, as AD is mainly characterized by diffuse, progressive atrophy of brain regions (Pini et al., 2016; Sabuncu et al., 2011; Ten Kate et al., 2018). Under this framework, a properly fitted model is expected to show outlier distributions that align with known neuroanatomical patterns of AD, with regions commonly affected by the disease exhibiting higher outlier counts.
To further capture overall trends in outlier detection, we analyzed the mean total outlier count (tOC) per individual in both HC and AD groups. The tOC, defined as the number of ROIs with negative outliers for each participant (Loreto et al., 2024; Verdi et al., 2021, 2023, 2024), was computed within each group (i.e., HC and AD) for every sample size and subsampling iteration.
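Both readouts reduce to simple operations on the deviation-score matrix; a sketch, assuming a NumPy array `z` of shape (individuals × ROIs) and a matching `groups` label array (names are ours):

```python
import numpy as np

Z_THRESH = -1.96                      # bottom 2.5% of the normative range
outliers = z < Z_THRESH               # boolean (individuals x ROIs) map
pct_outliers_per_roi = 100 * outliers.mean(axis=0)   # per-ROI percentage
toc = outliers.sum(axis=1)            # total outlier count per individual
toc_mean_by_group = {g: toc[groups == g].mean() for g in ('HC', 'AD')}
```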
4.5.2. Classification performance
We evaluated classification performance to determine whether deviations from normative models (i.e., Z-scores) retained their clinical validity and preserved the relative ranking of individuals for each sampling configuration. Support Vector Classifiers (SVCs) were trained using the ROI-level deviation scores from the test set. Classification was performed using 10-fold cross-validation and the Receiver Operating Characteristic - Area Under the Curve (ROC-AUC) was computed for each fold. For each sampling configuration, ROC-AUC values were averaged across folds and iterations to obtain a robust measure of classification performance.
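A minimal sketch of this step with scikit-learn is shown below; the variable names are assumptions, with `z` holding the ROI-level Z-scores of the test set and `y` the binary HC/AD labels.

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

clf = SVC()                                 # Support Vector Classifier
aucs = cross_val_score(clf, z, y, cv=10, scoring='roc_auc')  # 10-fold CV
mean_auc = aucs.mean()                      # fold-averaged ROC-AUC
```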
4.6. Statistical analysis
We used linear mixed-effects models (LMMs) to evaluate the effects of sample size and subsampling strategy on model fit metrics (MSLL, EV, SMSE, Rho, and ICC) and deviation-score outcomes (MSE and tOC). Subsampling strategies included age-skewed (left-skewed, right-skewed) and sex-imbalanced (1:4, 1:10, 4:1, 10:1 female-to-male ratios) sampling. The representative sampling strategy, which preserved the original age and sex distribution, served as the reference.
For model fit, LMMs included sample size (n), sampling strategy, and their interaction as fixed effects, with a random intercept for ROI to account for variability across regions:
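In Wilkinson notation, this corresponds to the following specification (our reconstruction from the description above; the logarithmic transform of n is motivated below):

```latex
\text{metric} \;\sim\; \log(n) \times \text{strategy} + (1 \mid \text{ROI})
```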
The relationship between sample size and model performance is characterized by steep improvements at small n that gradually level off with larger samples (Figures 2-3). To capture this non-linear pattern and avoid overemphasizing differences at high n, we applied a logarithmic transformation to sample size. This transformation linearizes the effect, resulting in more appropriate estimates of the influence of n and its interactions with sampling strategies. Both performance metrics and sample size were standardized prior to modeling to facilitate comparability.
For deviation scores, LMMs additionally included group (HC or AD) and its interactions with sample size and sampling strategy as fixed effects, with a random intercept for individuals to account for subject-level variability. As before, sample size was log-transformed, and both outcomes and sample size were standardized prior to modeling:
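Analogously (again our reconstruction from the description above):

```latex
\text{outcome} \;\sim\; \log(n) \times \text{strategy} + \text{group}
+ \log(n)\!:\!\text{group} + \text{strategy}\!:\!\text{group} + (1 \mid \text{subject})
```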
Again, the representative sampling strategy and HC group served as baseline levels for comparison. Fixed effects and interaction terms captured the influence of sample size, sampling strategy, and group on the outcomes.
To illustrate the effect of sampling strategies and sample size for specific brain regions, additional LMMs were computed independently for each ROI. These results are presented in the supplementary materials and were corrected using the false discovery rate (FDR) to account for multiple comparisons.
4.7. Adaptive transfer learning
To evaluate whether models trained on large external datasets can generalize to clinical cohorts, we additionally compared our models trained within cohorts (i.e., OASIS-3 or AIBL) with models adapted from a large independent cohort using adaptive transfer learning. To motivate the need for model adaptation across cohorts, we first compared mean cortical thickness and mean subcortical grey-matter volumes for HC participants from the UKB, OASIS-3, and AIBL datasets. Differences in mean cortical thickness across datasets were assessed using boxplots and bivariate density plots (mean cortical thickness vs. age), and separate linear regression models were fitted for each dataset, with mean cortical thickness as the dependent variable and age as the independent variable.
Adaptive transfer learning was applied to adjust the UKB-based normative models to the clinical cohorts (Figure 1B). Models were pre-fitted using 80% of UKB HC, with 20% held out for internal evaluation. Adaptation sets and within-cohort training sets were sampled from the same training cohort and tested on the same evaluation set, ensuring direct comparability between models. This procedure assumes that biological effects, such as age-related trajectories, are shared across datasets and that the normative patterns learned from a large reference cohort remain valid, with residual differences, such as scanner or cohort biases, captured through distributional adjustments at the mean level.
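A hedged sketch of this adaptation step, following PCNtoolkit’s transfer interface for BLR models, is shown below; the paths and file names are hypothetical, and the exact options used may differ.

```python
# Adapting a pre-trained UKB model to a clinical cohort with PCNtoolkit.
# The adaptation files contain the (subsampled) target-cohort HCs; the
# vargroup files carry site labels used for the distributional adjustment.
from pcntoolkit.normative import predict

yhat, s2, Z = predict('cov_bspline_test.txt', alg='blr',
                      respfile='resp_test.txt',
                      model_path='ukb_models',              # pre-fitted on UKB HC
                      adaptrespfile='resp_adapt.txt',       # target-cohort responses
                      adaptcovfile='cov_bspline_adapt.txt', # target-cohort covariates
                      adaptvargroupfile='site_adapt.txt',
                      testvargroupfile='site_test.txt')
```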
To evaluate the two modeling strategies, either training directly on the target dataset (i.e., OASIS-3 or AIBL) or adapting a pre-trained UKB model, we compared model fit using the same metrics as above (MSLL, SMSE, EV, Rho, and ICC), assessed Z-score errors relative to the full adaptation model, and illustrated the percentage of outliers per brain region across sample sizes. To facilitate direct comparison, we also computed the difference in performance between the two approaches (direct training minus adapted) for each metric. Additionally, we replicated all analyses conducted in the direct-training setting, including covariate skewing, and present the adapted versions in the Supplementary Materials.
Acknowledgements
This research was supported by the Dementia Research Switzerland - Synapsis Foundation [No. 2021-PI02] and the Swiss National Science Foundation (SNSF) under the following grants [10001C 197480 & IC00I0 227750].
Data were provided in part by OASIS-3: Longitudinal Multimodal Neuroimaging: Principal Investigators: T. Benzinger, D. Marcus, J. Morris; NIH P30 AG066444, P50 AG00561, P30 NS09857781, P01 AG026276, P01 AG003991, R01 AG043434, UL1 TR000448, R01 EB009352. AV-45 doses were provided by Avid Radiopharmaceuticals, a wholly owned subsidiary of Eli Lilly.
References
- Cerebral quantitative susceptibility mapping predicts amyloid-β-related cognitive decline. Brain 140:2112–2119. https://doi.org/10.1093/brain/awx137
- A meta-analysis of hippocampal atrophy rates in Alzheimer’s disease. Neurobiology of Aging 30:1711–1723. https://doi.org/10.1016/j.neurobiolaging.2008.01.010
- Accommodating site variation in neuroimaging data using normative and hierarchical Bayesian models. NeuroImage 264:119699. https://doi.org/10.1016/j.neuroimage.2022.119699
- Brain charts for the human lifespan. Nature 604:525–533. https://doi.org/10.1038/s41586-022-04554-y
- A neuroimaging measure to capture heterogeneous patterns of atrophy in Parkinson’s disease and dementia with Lewy bodies. NeuroImage: Clinical 42:103596. https://doi.org/10.1016/j.nicl.2024.103596
- Normative models for neuroimaging markers: Impact of model selection, sample size and evaluation criteria. NeuroImage 268:119864. https://doi.org/10.1016/j.neuroimage.2023.119864
- COVID-19 lockdown effects on adolescent brain structure suggest accelerated maturation that is more pronounced in females than in males. Proceedings of the National Academy of Sciences 121:e2403200121. https://doi.org/10.1073/pnas.2403200121
- Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage 53:1–15. https://doi.org/10.1016/j.neuroimage.2010.06.010
- Subcortical volumes across the lifespan: Data from 18,605 healthy individuals aged 3–90 years. Human Brain Mapping 43:452–469. https://doi.org/10.1002/hbm.25320
- Heterogeneity in Alzheimer’s Disease Diagnosis and Progression Rates: Implications for Therapeutic Trials. Neurotherapeutics 19:8–25. https://doi.org/10.1007/s13311-022-01185-z
- The Australian Imaging, Biomarkers and Lifestyle (AIBL) study of aging: Methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of Alzheimer’s disease. International Psychogeriatrics 21:672–687. https://doi.org/10.1017/S1041610209009405
- Alzheimer’s disease heterogeneity revealed by neuroanatomical normative modeling. Alzheimer’s & Dementia (Amsterdam, Netherlands) 16:e12559. https://doi.org/10.1002/dad2.12559
- Sex differences in Alzheimer disease—The gateway to precision medicine. Nature Reviews Neurology 14:457–469. https://doi.org/10.1038/s41582-018-0032-9
- FreeSurfer. NeuroImage 62:774–781. https://doi.org/10.1016/j.neuroimage.2012.01.021
- Structural brain changes in aging: Courses, causes and cognitive consequences. Reviews in the Neurosciences 21:187–221. https://doi.org/10.1515/revneuro.2010.21.3.187
- Atypical Brain Asymmetry in Autism—A Candidate for Clinically Meaningful Stratification. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging 6:802–812. https://doi.org/10.1016/j.bpsc.2020.08.008
- Fifteen Years of the Australian Imaging, Biomarkers and Lifestyle (AIBL) Study: Progress and Observations from 2,359 Older Adults Spanning the Spectrum from Cognitive Normality to Alzheimer’s Disease. Journal of Alzheimer’s Disease Reports 5:443–468. https://doi.org/10.3233/ADR-210005
- Warped Bayesian linear regression for normative modelling of big data. NeuroImage 245:118715. https://doi.org/10.1016/j.neuroimage.2021.118715
- Estimating cortical thickness trajectories in children across different scanners using transfer learning from normative models. bioRxiv 2023.03.02.530742. https://doi.org/10.1101/2023.03.02.530742
- The quantitative neuroradiology initiative framework: Application to dementia. British Journal of Radiology 92. https://doi.org/10.1259/bjr.20190365
- Population-level normative models reveal race- and socioeconomic-related variability in cortical thickness of threat neurocircuitry. Communications Biology 7:1–11. https://doi.org/10.1038/s42003-024-06436-7
- A stable and replicable neural signature of lifespan adversity in the adult brain. Nature Neuroscience 26:1603–1612. https://doi.org/10.1038/s41593-023-01410-8
- Entorhinal cortex dysfunction in Alzheimer’s disease. Trends in Neurosciences 46:124–136. https://doi.org/10.1016/j.tins.2022.11.006
- Sinh-arcsinh distributions. Biometrika 96:761–780. https://doi.org/10.1093/biomet/asp053
- Federated Multi-Site Normative Modeling using Hierarchical Bayesian Regression. bioRxiv 2021.05.28.446120. https://doi.org/10.1101/2021.05.28.446120
- Qoala-T: A supervised-learning tool for quality control of FreeSurfer segmented MRI data. NeuroImage 189:116–129. https://doi.org/10.1016/j.neuroimage.2019.01.014
- A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. Journal of Chiropractic Medicine 15:155–163. https://doi.org/10.1016/j.jcm.2016.02.012
- Relating sex-bias in human cortical and hippocampal microstructure to sex hormones. Nature Communications 15:7279. https://doi.org/10.1038/s41467-024-51459-7
- Clinical, imaging, and pathological heterogeneity of the Alzheimer’s disease syndrome. Alzheimer’s Research & Therapy 5:1. https://doi.org/10.1186/alzrt155
- OASIS-3: Longitudinal Neuroimaging. medRxiv 2019.12.13.19014902. https://doi.org/10.1101/2019.12.13.19014902
- Normal age-related brain morphometric changes: Nonuniformity across cortical thickness, surface area and gray matter volume? Neurobiology of Aging 33:617–9. https://doi.org/10.1016/j.neurobiolaging.2010.07.013
- Sample composition alters associations between age and brain structure. Nature Communications 8:874. https://doi.org/10.1038/s41467-017-00908-7
- Understanding Heterogeneity in Clinical Cohorts Using Normative Models: Beyond Case-Control Studies. Biological Psychiatry 80:552–561. https://doi.org/10.1016/j.biopsych.2015.12.023
- Toward understanding autism heterogeneity: Identifying clinical subgroups and neuroanatomical deviations. Journal of Psychopathology and Clinical Science 133:667–677. https://doi.org/10.1037/abn0000914
- Brain atrophy in Alzheimer’s Disease and aging. Ageing Research Reviews 30:25–48. https://doi.org/10.1016/j.arr.2016.01.002
- Structural progression of Alzheimer’s disease over decades: The MRI staging scheme. Brain Communications 4:fcac109. https://doi.org/10.1093/braincomms/fcac109
- Volume changes of hippocampal and amygdala subfields in patients with mild cognitive impairment and Alzheimer’s disease. Acta Neurologica Belgica 123:1381–1393. https://doi.org/10.1007/s13760-023-02235-9
- Population estimate of people with clinical Alzheimer’s disease and mild cognitive impairment in the United States (2020–2060). Alzheimer’s & Dementia: The Journal of the Alzheimer’s Association 17:1966–1975. https://doi.org/10.1002/alz.12362
- Gaussian Processes for Machine Learning (3rd printing). MIT Press.
- Age, APOE and sex: Triad of risk of Alzheimer’s disease. The Journal of Steroid Biochemistry and Molecular Biology 160:134–147. https://doi.org/10.1016/j.jsbmb.2016.03.012
- A meta-analysis of sex differences in human brain structure. Neuroscience & Biobehavioral Reviews 39:34–50. https://doi.org/10.1016/j.neubiorev.2013.12.004
- Evidence for embracing normative modeling. eLife 12:e85082. https://doi.org/10.7554/eLife.85082
- Charting brain growth and aging at high spatial precision. eLife 11:e72904. https://doi.org/10.7554/eLife.72904
- The normative modeling framework for computational psychiatry. Nature Protocols 17:1711–1734. https://doi.org/10.1038/s41596-022-00696-5
- The dynamics of cortical and hippocampal atrophy in Alzheimer disease. Archives of Neurology 68:1040–1048. https://doi.org/10.1001/archneurol.2011.167
- Dissecting task-based fMRI activity using normative modelling: An application to the Emotional Face Matching Task. Communications Biology 7:1–14. https://doi.org/10.1038/s42003-024-06573-z
- UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLoS Medicine 12:e1001779. https://doi.org/10.1371/journal.pmed.1001779
- Atrophy subtypes in prodromal Alzheimer’s disease are associated with cognitive decline. Brain: A Journal of Neurology 141:3443. https://doi.org/10.1093/brain/awy264
- Revealing Individual Neuroanatomical Heterogeneity in Alzheimer Disease Using Neuroanatomical Normative Modeling. Neurology 100:e2442–e2453. https://doi.org/10.1212/WNL.0000000000207298
- Beyond the average patient: How neuroimaging models can address heterogeneity in dementia. Brain: A Journal of Neurology 144:2946–2953. https://doi.org/10.1093/brain/awab165
- Personalizing progressive changes to brain structure in Alzheimer’s disease using normative modeling. Alzheimer’s & Dementia: The Journal of the Alzheimer’s Association 20:6998–7012. https://doi.org/10.1002/alz.14174
- Dementia imaging in clinical practice: A European-wide survey of 193 centres and conclusions by the ESNR working group. Neuroradiology 61:633–642. https://doi.org/10.1007/s00234-019-02188-y
- Early-Stage Identification and Pathological Development of Alzheimer’s Disease Using Multimodal MRI. Journal of Alzheimer’s Disease 68:1013–1027. https://doi.org/10.3233/JAD-181049
- Fractionating autism based on neuroanatomical normative modeling. Translational Psychiatry 10:384. https://doi.org/10.1038/s41398-020-01057-0
- Automated Brain MRI Volumetry Differentiates Early Stages of Alzheimer’s Disease From Normal Aging. Journal of Geriatric Psychiatry and Neurology 32:354–364. https://doi.org/10.1177/0891988719862637
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.108952. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2025, Elleaume et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.