The influence of sample size and covariate distributions on neuroanatomical normative modeling

  1. Camille Elleaume (corresponding author)
  2. Bruno Hebling Vieira
  3. Dorothea L Floris
  4. Nicolas Langer
  1. Methods of Plasticity Research, Department of Psychology, University of Zürich, Switzerland
  2. Neuroscience Center Zürich (ZNZ), Switzerland
9 figures, 24 tables and 1 additional file

Figures

Methodology diagram for evaluating normative model estimation under different sampling scenarios.

(A) Analyses were conducted on the OASIS-3 dataset, with replication in AIBL. Normative models were first fitted using the entire training set (i.e., 80% of HC from the cohort). For each sampling strategy (s), models were re-fitted on randomly drawn subsamples of the training set, with sample sizes (n) ranging from 5 to the maximum available (n = 692 for OASIS-3, n = 322 for AIBL) and repeated across 10 iterations per sample size. Representative sampling: subsamples preserved the original age distribution using 10 age bins with an equal number of individuals per bin, while also ensuring balanced sex distributions. Left-skewed sampling: overrepresented younger age ranges by applying a beta distribution (α = 2, β = 5), across 10 equally spaced bins. Right-skewed sampling: overrepresented older age ranges using a beta distribution (α = 5, β = 2). Each model was evaluated on a fixed test set composed of 20% of HC from the same cohort and all AD individuals. Example normative fits illustrate how identical test values are interpreted under different models: red dots represent values outside the 95% centile range (outliers), while orange dots fall within the normative range. (B) In a parallel analysis, normative models were pre-trained on the UK Biobank (UKB) dataset. Adaptive transfer learning was used to adapt these pre-trained models to the clinical cohorts. The same subsampling strategies and sample sizes described in panel A were used here to define the adaptation sets. Adapted models were then applied to the same fixed test set as in panel A, allowing direct comparison between within-cohort models and adapted UKB models across all sampling conditions. (C) Each resulting model was evaluated using standard performance metrics to assess model fit. Z-score errors were additionally computed and compared to those obtained from the model fitted using the full training data. Clinical validation was then performed by analyzing outlier detection and predictive performance.
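
The age-skewed subsampling described in panel A can be illustrated with a minimal sketch. This is not the authors' code: the function names `beta_bin_weights` and `skewed_subsample`, the use of bin centres to evaluate the beta density, and the toy age range are assumptions made for illustration only.

```python
import numpy as np

def beta_bin_weights(alpha, beta, n_bins=10):
    """Beta(alpha, beta) density evaluated at bin centres, rescaled to sum to 1."""
    centres = (np.arange(n_bins) + 0.5) / n_bins
    dens = centres ** (alpha - 1) * (1 - centres) ** (beta - 1)
    return dens / dens.sum()

def skewed_subsample(ages, n, alpha, beta, n_bins=10, seed=0):
    """Draw n indices without replacement, weighting equally spaced age bins
    by a beta density (hypothetical helper, not from the paper)."""
    rng = np.random.default_rng(seed)
    ages = np.asarray(ages, dtype=float)
    edges = np.linspace(ages.min(), ages.max(), n_bins + 1)
    # Interior edges give bin ids 0 .. n_bins-1 (the maximum falls in the last bin).
    bin_id = np.digitize(ages, edges[1:-1])
    counts = np.bincount(bin_id, minlength=n_bins)
    # Spread each bin's weight evenly over the individuals it contains.
    p = beta_bin_weights(alpha, beta, n_bins)[bin_id] / counts[bin_id]
    p = p / p.sum()  # renormalise in case some bins are empty
    return rng.choice(len(ages), size=n, replace=False, p=p)

# Toy cohort (assumed ages, not real data).
toy_ages = np.random.default_rng(1).uniform(45, 95, 1000)
younger = skewed_subsample(toy_ages, 100, alpha=2, beta=5)  # "left-skewed" in the caption
older = skewed_subsample(toy_ages, 100, alpha=5, beta=2)    # "right-skewed" in the caption
```

With α = 2, β = 5 most of the sampling weight falls on the lower age bins, and with α = 5, β = 2 on the upper bins, which matches the caption's use of "left-skewed" and "right-skewed".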

Figure 2 with 4 supplements
Model fit evaluation in the HC test set.

Performance evaluation of models as a function of the number of participants used for fitting the models (n), calculated using the HC test set from the OASIS-3 dataset. Panels show (A) mean standardized log loss (MSLL); (B) standardized mean squared error (SMSE); (C) explained variance (EV); (D) Pearson correlation coefficient (Rho); (E) lower-tail HC percentage (below the 2.5% bound); (F) upper-tail HC percentage (above the 97.5% bound); and (G) intraclass correlation coefficient (ICC). For all plots, solid lines represent the mean performance of the evaluation metric across all cortical region models grouped by lobe and the mean performance across subcortical region models. The thick black line indicates the overall mean performance across all cortical and subcortical models. In panels (A–F), dashed lines denote the performance using the full training set (n = 692) and the shaded areas indicate the standard deviation, reflecting variability across 10 iterations of the same sample size. In panel (G), gray dotted lines indicate commonly accepted reliability thresholds: below 0.5 (poor), up to 0.75 (moderate), up to 0.9 (good), and above 0.9 (excellent) (Koo and Li, 2016).
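
For readers unfamiliar with the metrics in panels A to F, the sketch below computes them for a single region using their common textbook definitions. It is an illustration under stated assumptions, not the authors' pipeline: the function name `fit_metrics`, the Gaussian predictive form, and the use of the training mean and variance as the trivial baseline for MSLL are all assumptions.

```python
import numpy as np

def fit_metrics(y, mu, sigma2, train_mean, train_var, z):
    """Model-fit metrics for one region (hypothetical helper).

    y: observed test values; mu, sigma2: predictive mean and variance;
    train_mean, train_var: moments of the training targets (trivial baseline);
    z: deviation scores of the test subjects.
    """
    y, mu, sigma2, z = (np.asarray(a, dtype=float) for a in (y, mu, sigma2, z))
    resid = y - mu
    smse = np.mean(resid ** 2) / np.var(y)       # standardized mean squared error
    ev = 1.0 - np.var(resid) / np.var(y)         # explained variance
    rho = np.corrcoef(y, mu)[0, 1]               # Pearson correlation
    # Negative log predictive density of the model and of the trivial baseline.
    nll_model = 0.5 * np.log(2 * np.pi * sigma2) + resid ** 2 / (2 * sigma2)
    nll_trivial = (0.5 * np.log(2 * np.pi * train_var)
                   + (y - train_mean) ** 2 / (2 * train_var))
    msll = np.mean(nll_model - nll_trivial)      # negative values beat the baseline
    lower_tail = 100.0 * np.mean(z < -1.96)      # % of HC below the 2.5% bound
    upper_tail = 100.0 * np.mean(z > 1.96)       # % of HC above the 97.5% bound
    return {"SMSE": smse, "EV": ev, "Rho": rho, "MSLL": msll,
            "lower_tail": lower_tail, "upper_tail": upper_tail}

# Toy check with perfect mean predictions (made-up numbers).
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
m = fit_metrics(y_toy, y_toy, np.ones(4), train_mean=2.5, train_var=1.25,
                z=np.array([-2.5, 0.0, 0.3, 2.1]))
```

For a well-calibrated model, the lower and upper tail percentages in healthy controls should each sit near 2.5%.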

Figure 2—figure supplement 1
Model fit evaluation of normative models trained within OASIS-3.

Performance is assessed in the healthy control (HC) test set of OASIS-3 for different sampling strategies of the training set, including (A) Representative, (B) Left-skewed, and (C) Right-skewed age distributions, as well as (D, E, F, G) sex-imbalanced samples with female-to-male ratios of 1:4, 4:1, 1:10, and 10:1, respectively. Model performance is shown as a function of the training set size (n) using the HC test set from the OASIS-3 dataset. Each panel presents the following evaluation metrics: Mean Standardized Log Loss (MSLL), Standardized Mean Squared Error (SMSE), Explained Variance (EV), Pearson correlation coefficient (Rho), Intraclass Correlation Coefficient (ICC), lower-tail HC percentage (below the 2.5% bound), and upper-tail HC percentage (above the 97.5% bound). Solid lines indicate the mean performance across all regions within each lobe (as well as the mean of subcortical regions), while shaded areas represent the standard deviation across 10 iterations per sample size. In ICC plots, gray dashed lines denote commonly accepted reliability thresholds: <0.5 (poor), 0.5–0.75 (moderate), 0.75–0.9 (good), and >0.9 (excellent), and shaded areas indicate the standard deviation across ROIs.

Figure 2—figure supplement 2
Model fit evaluation of normative models pre-trained on the UKB and adapted to OASIS-3.

Performance is assessed in the healthy control (HC) test set of the OASIS-3 dataset for different sampling strategies of the adaptation set, including (A) Representative, (B) Left-skewed, and (C) Right-skewed age distributions, as well as (D, E, F, G) sex-imbalanced adaptations with female-to-male ratios of 1:4, 4:1, 1:10, and 10:1, respectively. Model performance is shown as a function of the adaptation set size (n) using the HC test set from the OASIS-3 dataset. Each panel presents the following evaluation metrics: Mean Standardized Log Loss (MSLL), Standardized Mean Squared Error (SMSE), Explained Variance (EV), Pearson correlation coefficient (Rho), Intraclass Correlation Coefficient (ICC), lower-tail HC percentage (below the 2.5% bound), and upper-tail HC percentage (above the 97.5% bound). Solid lines indicate the mean performance across all regions within each lobe (as well as the mean of subcortical regions), while shaded areas represent the standard deviation across 10 iterations per sample size. In ICC plots, gray dashed lines denote commonly accepted reliability thresholds: <0.5 (poor), 0.5–0.75 (moderate), 0.75–0.9 (good), and >0.9 (excellent), and shaded areas indicate the standard deviation across ROIs.

Figure 2—figure supplement 3
Model fit evaluation of normative models trained within AIBL.

Performance is assessed in the healthy control (HC) test set of AIBL for different sampling strategies of the training set, including (A) Representative, (B) Left-skewed, and (C) Right-skewed age distributions, as well as (D, E, F, G) sex-imbalanced samples with female-to-male ratios of 1:4, 4:1, 1:10, and 10:1, respectively. Model performance is shown as a function of the training set size (n) using the HC test set from the AIBL dataset. Each panel presents the following evaluation metrics: Mean Standardized Log Loss (MSLL), Standardized Mean Squared Error (SMSE), Explained Variance (EV), Pearson correlation coefficient (Rho), Intraclass Correlation Coefficient (ICC), lower-tail HC percentage (below the 2.5% bound), and upper-tail HC percentage (above the 97.5% bound). Solid lines indicate the mean performance across all regions within each lobe (as well as the mean of subcortical regions), while shaded areas represent the standard deviation across 10 iterations per sample size. In ICC plots, gray dashed lines denote commonly accepted reliability thresholds: <0.5 (poor), 0.5–0.75 (moderate), 0.75–0.9 (good), and >0.9 (excellent), and shaded areas indicate the standard deviation across ROIs.

Figure 2—figure supplement 4
Model fit evaluation of normative models pre-trained on the UKB and adapted to AIBL.

Performance is assessed in the healthy control (HC) test set of the AIBL dataset for different sampling strategies of the adaptation set, including (A) Representative, (B) Left-skewed, and (C) Right-skewed age distributions, as well as (D, E, F, G) sex-imbalanced adaptations with female-to-male ratios of 1:4, 4:1, 1:10, and 10:1, respectively. Model performance is shown as a function of the adaptation set size (n) using the HC test set from the AIBL dataset. Each panel presents the following evaluation metrics: Mean Standardized Log Loss (MSLL), Standardized Mean Squared Error (SMSE), Explained Variance (EV), Pearson correlation coefficient (Rho), Intraclass Correlation Coefficient (ICC), lower-tail HC percentage (below the 2.5% bound), and upper-tail HC percentage (above the 97.5% bound). Solid lines indicate the mean performance across all regions within each lobe (as well as the mean of subcortical regions), while shaded areas represent the standard deviation across 10 iterations per sample size. In ICC plots, gray dashed lines denote commonly accepted reliability thresholds: <0.5 (poor), 0.5–0.75 (moderate), 0.75–0.9 (good), and >0.9 (excellent), and shaded areas indicate the standard deviation across ROIs.

Figure 3 with 12 supplements
Evaluation of model fits in the HC test set across different sampling strategies of the training set, shown for varying sample sizes (n).

Age-skewed sampling strategies include representative (matching the initial age distribution with balanced sex, 1F:1M), left-skewed (favoring younger individuals), and right-skewed (favoring older individuals), each with balanced sex distributions. Sex-imbalanced sampling strategies include female-to-male ratios of 1:1, 1:4, 1:10, 4:1, and 10:1, all with representative age distributions. Depicted metrics include MSLL, SMSE, EV, Rho, ICC, and the lower and upper tail HC%. Solid lines represent the mean metric values across ROIs and iterations, with shaded areas indicating the standard deviation across iterations for MSLL, SMSE, EV, Rho, and tail percentages. For ICC, shaded areas represent variation across ROIs, as ICC already reflects variability across iterations. Dashed lines indicate the mean metric values for models trained with the full sample for MSLL, SMSE, EV, Rho, and tail percentages.

Figure 3—figure supplement 1
Regional linear mixed-effects results for model-fit metrics in models trained in OASIS-3 with age-skewed samples.

Standardized betas (β) are shown for each performance metric (MSLL, SMSE, EV, Rho, ICC; rows), with each metric regressed on log-standardized sample size (n), age-distribution contrasts (left-skewed, right-skewed; representative = reference), and their interactions. The sex ratio is balanced (1:1). Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 3—figure supplement 2
Regional linear mixed-effects results for model-fit metrics in models trained in OASIS-3 with sex-imbalanced samples.

Standardized betas (β) are shown for each performance metric (MSLL, SMSE, EV, Rho, ICC; rows), with each metric regressed on log-standardized sample size (n), sex-ratio contrasts (female-to-male 10:1, 4:1, 1:1 [reference], 1:4, 1:10), and their interactions. Age distribution is representative. Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 3—figure supplement 3
Model fit evaluation of normative models pre-trained on UKB and adapted to OASIS-3 across different sampling strategies.

Models were evaluated in the HC test set for various adaptation sample sizes (n). Age-skewed sampling strategies include representative (matching the initial age distribution with balanced sex, 1F:1M), left-skewed (favoring younger individuals), and right-skewed (favoring older individuals), each with balanced sex distributions. Sex-imbalanced sampling strategies include female-to-male ratios of 1:1.1, 1:4, 1:10, 4:1, and 10:1, all with representative age distributions. Solid lines represent the mean metric values across ROIs and iterations, with shaded areas indicating the standard deviation across iterations for MSLL, SMSE, EV, and Rho. For ICC, shaded areas represent variation across ROIs, as ICC already reflects variability across iterations. Dashed lines indicate the mean metric values for models trained with the full sample for MSLL, SMSE, EV, Rho, and the lower and upper tail HC%.

Figure 3—figure supplement 4
Regional linear mixed-effects results for model-fit metrics in models pre-fitted in the UKB and adapted to OASIS-3 with age-skewed samples.

Standardized betas (β) are shown for each performance metric (MSLL, SMSE, EV, Rho, ICC; rows), with each metric regressed on log-standardized sample size (n), age-distribution contrasts (left-skewed, right-skewed; representative = reference), and their interactions. The sex ratio is balanced (1:1). Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 3—figure supplement 5
Regional linear mixed-effects results for model-fit metrics in models pre-fitted in the UKB and adapted to OASIS-3 with sex-imbalanced samples.

Standardized betas (β) are shown for each performance metric (MSLL, SMSE, EV, Rho, ICC; rows), with each metric regressed on log-standardized sample size (n), sex-ratio contrasts (female-to-male 10:1, 4:1, 1:1 [reference], 1:4, 1:10), and their interactions. Age distribution is representative. Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 3—figure supplement 6
Model fit evaluation of normative models trained on AIBL across different sampling strategies.

Models were evaluated in the HC test set for various training sample sizes (n). Age-skewed sampling strategies include representative (matching the initial age distribution with balanced sex, 1F:1M), left-skewed (favoring younger individuals), and right-skewed (favoring older individuals), each with balanced sex distributions. Sex-imbalanced sampling strategies include female-to-male ratios of 1:1.1, 1:4, 1:10, 4:1, and 10:1, all with representative age distributions. Solid lines represent the mean metric values across ROIs and iterations, with shaded areas indicating the standard deviation across iterations for MSLL, SMSE, EV, Rho, and the lower and upper tail HC%. For ICC, shaded areas represent variation across ROIs, as ICC already reflects variability across iterations. Dashed lines indicate the mean metric values for models trained with the full sample for MSLL, SMSE, EV, Rho, and the lower and upper tail HC%.

Figure 3—figure supplement 7
Regional linear mixed-effects results for model-fit metrics in models trained in AIBL with age-skewed samples.

Standardized betas (β) are shown for each performance metric (MSLL, SMSE, EV, Rho, ICC; rows), with each metric regressed on log-standardized sample size (n), age-distribution contrasts (left-skewed, right-skewed; representative = reference), and their interactions. The sex ratio is balanced (1:1). Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 3—figure supplement 8
Regional linear mixed-effects results for model-fit metrics in models trained in AIBL with sex-imbalanced samples.

Standardized betas (β) are shown for each performance metric (MSLL, SMSE, EV, Rho, ICC; rows), with each metric regressed on log-standardized sample size (n), sex-ratio contrasts (female-to-male 10:1, 4:1, 1:1 [reference], 1:4, 1:10), and their interactions. Age distribution is representative. Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 3—figure supplement 9
Model fit evaluation of normative models pre-trained on the UKB and adapted to AIBL across different sampling strategies.

Models were evaluated in the HC test set for various adaptation sample sizes (n). Age-skewed sampling strategies include representative (matching the initial age distribution with balanced sex, 1F:1M), left-skewed (favoring younger individuals), and right-skewed (favoring older individuals), each with balanced sex distributions. Sex-imbalanced sampling strategies include female-to-male ratios of 1:1.1, 1:4, 1:10, 4:1, and 10:1, all with representative age distributions. Solid lines represent the mean metric values across ROIs and iterations, with shaded areas indicating the standard deviation across iterations for MSLL, SMSE, EV, Rho, and the lower and upper tail HC%. For ICC, shaded areas represent variation across ROIs, as ICC already reflects variability across iterations. Dashed lines indicate the mean metric values for models trained with the full sample for MSLL, SMSE, EV, and Rho.

Figure 3—figure supplement 10
Regional linear mixed-effects results for model-fit metrics in models pre-fitted in the UKB and adapted to AIBL with age-skewed samples.

Standardized betas (β) are shown for each performance metric (MSLL, SMSE, EV, Rho, ICC; rows), with each metric regressed on log-standardized sample size (n), age-distribution contrasts (left-skewed, right-skewed; representative = reference), and their interactions. The sex ratio is balanced (1:1). Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 3—figure supplement 11
Regional linear mixed-effects results for model-fit metrics in models pre-fitted in the UKB and adapted to AIBL with sex-imbalanced samples.

Standardized betas (β) are shown for each performance metric (MSLL, SMSE, EV, Rho, ICC; rows), with each metric regressed on log-standardized sample size (n), sex-ratio contrasts (female-to-male 10:1, 4:1, 1:1 [reference], 1:4, 1:10), and their interactions. Age distribution is representative. Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 3—figure supplement 12
Training age-distribution coverage with respect to the HC test cohorts in OASIS-3 (left) and AIBL (right).

Mean age-distribution coverage (± SD across 1000 repetitions) between the sampled training sets and the HC test cohorts is shown for each age-sampling strategy as a function of training sample size (n). Coverage is defined as the distribution intersection between age-bin counts of the sampled training data and the HC test sets, using 20 quantile-based bins derived from the full training sets of each dataset. In OASIS-3, representative sampling yields the highest coverage, while left-skewed sampling remains consistently lower than right-skewed sampling across sample sizes, reflecting the older and skewed age distribution of the cohort. In AIBL, representative sampling also provides the highest coverage, and left-skewed sampling remains slightly below right-skewed sampling but with reduced differences compared to OASIS-3. Sampling was only performed up to n = 200 due to sample availability in AIBL; the gray region marks larger sample sizes not evaluated, while the x-axis is extended to n = 600 for comparability across datasets.
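
The coverage measure described above can be sketched as a histogram intersection. This is a minimal illustration, not the authors' implementation: the function name `age_coverage` and the choice to normalise bin counts to proportions before intersecting are assumptions.

```python
import numpy as np

def age_coverage(sample_ages, test_ages, full_train_ages, n_bins=20):
    """Histogram intersection between sampled-training and HC-test age
    distributions, using quantile bins of the full training set
    (hypothetical helper; normalisation to proportions is an assumption)."""
    edges = np.quantile(full_train_ages, np.linspace(0.0, 1.0, n_bins + 1))
    inner = edges[1:-1]  # interior edges map values to bin ids 0 .. n_bins-1
    p = np.bincount(np.digitize(sample_ages, inner), minlength=n_bins)
    q = np.bincount(np.digitize(test_ages, inner), minlength=n_bins)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.minimum(p, q).sum())  # 1 = identical, 0 = disjoint

# Toy example with made-up ages.
full_train = np.linspace(50.0, 90.0, 400)
hc_test = full_train[::4]
cov_representative = age_coverage(hc_test, hc_test, full_train)
cov_young_only = age_coverage(np.full(50, 50.0), hc_test, full_train)
```

Under this definition a sample that mirrors the test distribution scores close to 1, while a sample concentrated in one age bin scores close to that bin's share of the test set.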

Figure 4 with 6 supplements
Z-score errors: Influence of sample size, age distribution, and sex imbalance on normative model outcomes.

(A) Age distributions for left-skewed (younger-biased), representative, and right-skewed (older-biased) sampling strategies, compared to the full training set in the OASIS-3 dataset. (B) Mean squared error (MSE) of Z-scores relative to the full training set model across sample sizes, age distributions, and diagnostic groups. (C) Mean bias error (MBE) per region across sample sizes and age distributions in the test set. Shown are the 20 brain regions with the highest Cohen’s d effect sizes distinguishing between HC and AD calculated from models trained on the full training set. From left to right: results for left-skewed, representative, and right-skewed training sets. Blue indicates negative MBE (underestimation), red indicates positive MBE (overestimation), and white indicates close alignment with the full training set model. (D) Cubic regression of MSE as a function of age, across sample sizes and sampling strategies. Left-skewed sampling shows increased errors in older individuals; right-skewed sampling shows increased errors in younger individuals. (E) Centile curves for the left hippocampus as a function of age, derived from models trained on left-skewed, representative, and right-skewed sampling (n = 100). Colored lines represent the 5th, 50th, and 95th percentiles; gray lines show centiles from the full training set model. (F) MSE across sample sizes and test set sex, obtained using sex-imbalanced training sets (female-to-male ratios: 10:1, 4:1, 1:1, 1:4, 1:10), all with representative age distributions.
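
The two Z-score error summaries in panels B and C can be written compactly. The sketch below assumes Z-scores are stored as subjects × regions arrays and that errors are taken as subsample model minus full model, so that positive MBE corresponds to overestimation as in the caption; the function name `z_score_errors` is hypothetical.

```python
import numpy as np

def z_score_errors(z_sub, z_full):
    """MSE and per-region MBE of Z-scores from a subsample model relative
    to the full-training-set model (arrays of shape subjects x regions)."""
    diff = np.asarray(z_sub, dtype=float) - np.asarray(z_full, dtype=float)
    mse = float(np.mean(diff ** 2))   # overall magnitude of disagreement
    mbe = diff.mean(axis=0)           # signed bias per region
    return mse, mbe

# Toy example: the subsample model shifts every Z-score up by 0.5.
z_full_toy = np.zeros((3, 2))
mse_toy, mbe_toy = z_score_errors(z_full_toy + 0.5, z_full_toy)
```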

Figure 4—figure supplement 1
Cohen’s d effect sizes computed on full models’ Z-scores for discriminating HC from AD groups for each ROI.

Dashed lines correspond to Cohen’s classification thresholds for effect sizes: small (d < 0.5), medium (0.5 ≤ d < 0.8), and large (d ≥ 0.8). Left and right panels show Cohen’s d effect sizes for OASIS-3 and AIBL test sets, respectively.
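
As a reminder of how the effect sizes and thresholds relate, here is a minimal sketch of Cohen's d with a pooled standard deviation. The pooled-SD form, the sign convention (first group minus second), and the helper names are assumptions; the thresholds are those stated in the caption.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent groups using the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

def effect_label(d):
    """Classify |d| with the thresholds given in the caption."""
    d = abs(d)
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"

# Toy Z-scores (made-up values): the second group is shifted by one pooled SD.
d_toy = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```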

Figure 4—figure supplement 2
Centile curve overlays for selected cortical and subcortical regions across sampling strategies and sample sizes in OASIS-3 dataset.

For each ROI, models trained using representative, left-skewed, and right-skewed sampling are shown for multiple sample sizes. Colored lines depict the 5th, 50th, and 95th percentiles estimated from each model, while gray lines indicate the corresponding centiles from the full training set. These overlays highlight how centile estimates diverge when age coverage is limited, particularly at the extremes of the age range.

Figure 4—figure supplement 3
Z-score errors in OASIS-3 using models trained in UKB and adapted to OASIS-3: Influence of sample size, age distribution, and sex imbalance on normative model outcomes.

(A) Age distributions for left-skewed (younger-biased), representative, and right-skewed (older-biased) sampling strategies, compared to the full training set in the OASIS-3 dataset. (B) Mean squared error (MSE) of Z-scores relative to the full training set model across sample sizes, age distributions, and diagnostic groups. (C) Mean bias error (MBE) per region across sample sizes and age distributions in the test set. Shown are the 20 brain regions with the highest Cohen’s d effect sizes based on models using the full training set. From left to right: results for left-skewed, representative, and right-skewed training sets. Blue indicates negative MBE (underestimation), red indicates positive MBE (overestimation), and white indicates close alignment with the full training set model. (D) Cubic regression of MSE as a function of age, across sample sizes and sampling strategies. Left-skewed sampling shows increased errors in older individuals; right-skewed sampling shows increased errors in younger individuals. (E) Centile curves for the left hippocampus as a function of age (females only), derived from models trained on left-skewed, representative, and right-skewed sampling (n = 100). Colored lines represent the 5th, 50th, and 95th percentiles; gray lines show centiles from the full training set model. (F) MSE across sample sizes and test set sex, obtained using sex-imbalanced training sets (female-to-male ratios: 10:1, 4:1, 1:1, 1:4, 1:10), all with representative age distributions.

Figure 4—figure supplement 4
Z-score errors in AIBL using models trained directly on the dataset: Influence of sample size, age distribution, and sex imbalance on normative model outcomes.

(A) Age distributions for left-skewed (younger-biased), representative, and right-skewed (older-biased) sampling strategies, compared to the full training set in the AIBL dataset. (B) Mean squared error (MSE) of Z-scores relative to the full training set model across sample sizes, age distributions, and diagnostic groups. (C) Mean bias error (MBE) per region across sample sizes and age distributions in the test set. Shown are the 20 brain regions with the highest Cohen’s d effect sizes based on models using the full training set. From left to right: results for left-skewed, representative, and right-skewed training sets. Blue indicates negative MBE (underestimation), red indicates positive MBE (overestimation), and white indicates close alignment with the full training set model. (D) Cubic regression of MSE as a function of age, across sample sizes and sampling strategies. Left-skewed sampling shows increased errors in older individuals; right-skewed sampling shows increased errors in younger individuals. (E) Centile curves for the left hippocampus as a function of age (females only), derived from models trained on left-skewed, representative, and right-skewed sampling (n = 100). Colored lines represent the 5th, 50th, and 95th percentiles; gray lines show centiles from the full training set model. (F) MSE across sample sizes and test set sex, obtained using sex-imbalanced training sets (female-to-male ratios: 10:1, 4:1, 1:1, 1:4, 1:10), all with representative age distributions.

Figure 4—figure supplement 5
Centile curve overlays for selected cortical and subcortical regions across sampling strategies and sample sizes in AIBL dataset.

For each ROI, models trained using representative, left-skewed, and right-skewed sampling are shown for multiple sample sizes. Colored lines depict the 5th, 50th, and 95th percentiles estimated from each model, while gray lines indicate the corresponding centiles from the full training set. These overlays highlight how centile estimates diverge when age coverage is limited, particularly at the extremes of the age range.

Figure 4—figure supplement 6
Z-score errors in AIBL using models pretrained in UKB and adapted to AIBL: Influence of sample size, age distribution, and sex imbalance on normative model outcomes.

(A) Age distributions for left-skewed (younger-biased), representative, and right-skewed (older-biased) sampling strategies, compared to the full training set in the AIBL dataset. (B) Mean squared error (MSE) of Z-scores relative to the full training set model across sample sizes, age distributions, and diagnostic groups. (C) Mean bias error (MBE) per region across sample sizes and age distributions in the test set. Shown are the 20 brain regions with the highest Cohen’s d effect sizes based on models using the full training set. From left to right: results for left-skewed, representative, and right-skewed training sets. Blue indicates negative MBE (underestimation), red indicates positive MBE (overestimation), and white indicates close alignment with the full training set model. (D) Cubic regression of MSE as a function of age, across sample sizes and sampling strategies. Left-skewed sampling shows increased errors in older individuals; right-skewed sampling shows increased errors in younger individuals. (E) MSE across sample sizes and test set sex, obtained using sex-imbalanced training sets (female-to-male ratios: 10:1, 4:1, 1:1, 1:4, 1:10), all with representative age distributions.

Figure 5 with 11 supplements
Clinical validation: effect of sample size and age distributions in outlier detection and classification performance.

(A) Percentage of participants with extreme negative deviation (Z < –1.96) in each brain region in the AD group, shown for independent example iterations at different sample sizes with representative age distribution. (B) Average total Outlier Count (tOC) in the AD and HC groups is represented with dashed and solid lines, respectively, as a function of sample size (n). The shaded areas indicate the standard deviation across iterations. Dotted gray lines correspond to the estimates of tOC for the HC and AD groups obtained with the full sample size. (C) ROC-AUC of a support vector classifier with 10-fold cross-validation as a function of sample size in the training set. The solid lines represent the average AUC across iterations for each sampling strategy. The dotted line shows the performance of the models trained with the full sample size.
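
The outlier summaries in panels A and B can be illustrated as follows. This is a sketch under assumptions: the function name `outlier_summaries` is hypothetical, Z-scores are assumed to be stored as a subjects × regions array, and tOC is assumed here to count only extreme negative deviations (Z < –1.96), mirroring panel A.

```python
import numpy as np

def outlier_summaries(z, threshold=-1.96):
    """Per-subject total outlier count and per-region outlier percentage
    for extreme negative deviations (z: subjects x regions)."""
    is_outlier = np.asarray(z, dtype=float) < threshold
    toc = is_outlier.sum(axis=1)                   # regions flagged per subject
    region_pct = 100.0 * is_outlier.mean(axis=0)   # % of subjects flagged per region
    return toc, region_pct

# Toy Z-scores for two subjects and three regions (made-up values).
z_toy = np.array([[-2.5, 0.0, -3.0],
                  [0.1, -1.0, -2.0]])
toc_toy, pct_toy = outlier_summaries(z_toy)
```

Averaging `toc` within the AD and HC groups for each fitted model gives one point per group on a curve like the one in panel B.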

Figure 5—figure supplement 1
Regional linear mixed-effects results for deviation-score metrics in models trained in OASIS-3 with age-skewed samples.

Standardized Betas (β) are shown for the mean-squared error of z-scores relative to the full reference model and for the count of extreme z-score outliers per region (rows). Predictors include Alzheimer’s disease diagnosis (AD), log-standardized sample size (n), age-distribution contrasts (left-skewed, right-skewed; representative = reference), and their two-way interactions with n and AD. Sex ratio is balanced (1:1). Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 5—figure supplement 2
Regional linear mixed-effects results for deviation-score metrics in models trained in OASIS-3 with sex-imbalanced samples.

Standardized Betas (β) are shown for the mean-squared error of z-scores relative to the full reference model and for the count of extreme z-score outliers per region (rows). Predictors include Alzheimer’s disease diagnosis (AD), log-standardized sample size (n), sex-ratio contrasts (female-to-male 10:1, 4:1, 1:1 [reference], 1:4, 1:10), and their two-way interactions with n and AD. Age distribution is representative. Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 5—figure supplement 3
Clinical validation for models pre-fitted in UKB and adapted to OASIS-3: effect of sample size and age distributions in outlier detection and classification performance.

(A) Average total Outlier Count (tOC) in the AD and HC groups is represented with dashed and solid lines, respectively, as a function of sample size (n). The shaded areas indicate the standard deviation across iterations. Dotted gray lines correspond to the estimates of tOC for the HC and AD groups obtained with the full sample size. (B) ROC–AUC of a support vector classifier with 10-fold cross-validation as a function of sample size in the training set. The solid lines represent the average AUC across iterations for each sampling strategy. The dotted line shows the performance of the models trained with the full sample size.

Figure 5—figure supplement 4
Regional linear mixed-effects results for deviation-score metrics in models pre-fitted in the UKB and adapted to OASIS-3 with age-skewed samples.

Standardized Betas (β) are shown for the mean-squared error of z-scores relative to the full reference model and for the count of extreme z-score outliers per region (rows). Predictors include Alzheimer’s disease diagnosis (AD), log-standardized sample size (n), age-distribution contrasts (left-skewed, right-skewed; representative = reference), and their two-way interactions with n and AD. Sex ratio is balanced (1:1). Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 5—figure supplement 5
Regional linear mixed-effects results for deviation-score metrics in models pre-fitted in the UKB and adapted to OASIS-3 with sex-imbalanced samples.

Standardized Betas (β) are shown for the mean-squared error of z-scores relative to the full reference model and for the count of extreme z-score outliers per region (rows). Predictors include Alzheimer’s disease diagnosis (AD), log-standardized sample size (n), sex-ratio contrasts (female-to-male 10:1, 4:1, 1:1 [reference], 1:4, 1:10), and their two-way interactions with n and AD. Age distribution is representative. Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 5—figure supplement 6
Clinical validation for models trained in AIBL: effect of sample size and age distributions in outlier detection and classification performance.

(A) Percentage of participants with extreme negative deviation (Z < –1.96) in each brain region in the AD group, shown for independent example iterations at different sample sizes with representative age distribution. (B) Average total Outlier Count (tOC) in the AD and HC groups is represented with dashed and solid lines, respectively, as a function of sample size (n). The shaded areas indicate the standard deviation across iterations. Dotted gray lines correspond to the estimates of tOC for the HC and AD groups obtained with the full sample size. (C) ROC-AUC of a support vector classifier with 10-fold cross-validation as a function of sample size in the training set. The solid lines represent the average AUC across iterations for each sampling strategy. The dotted line shows the performance of the models trained with the full sample size.

Figure 5—figure supplement 7
Regional linear mixed-effects results for deviation-score metrics in models trained in AIBL with age-skewed samples.

Standardized Betas (β) are shown for the mean-squared error of z-scores relative to the full reference model and for the count of extreme z-score outliers per region (rows). Predictors include Alzheimer’s disease diagnosis (AD), log-standardized sample size (n), age-distribution contrasts (left-skewed, right-skewed; representative = reference), and their two-way interactions with n and AD. Sex ratio is balanced (1:1). Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 5—figure supplement 8
Regional linear mixed-effects results for deviation-score metrics in models trained in AIBL with sex-imbalanced samples.

Standardized Betas (β) are shown for the mean-squared error of z-scores relative to the full reference model and for the count of extreme z-score outliers per region (rows). Predictors include Alzheimer’s disease diagnosis (AD), log-standardized sample size (n), sex-ratio contrasts (female-to-male 10:1, 4:1, 1:1 [reference], 1:4, 1:10), and their two-way interactions with n and AD. Age distribution is representative. Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 5—figure supplement 9
Clinical validation for models pre-fitted in UKB and adapted to AIBL: effect of sample size and age distributions in outlier detection and classification performance.

(A) Average total Outlier Count (tOC) in the AD and HC groups is represented with dashed and solid lines, respectively, as a function of sample size (n). The shaded areas indicate the standard deviation across iterations. Dotted gray lines correspond to the estimations of tOC for HC and AD groups obtained with full sample size. (B) ROC-AUC of the support vector classifier with 10-fold cross-validation as a function of sample size in the training set. The solid lines represent the average AUC across iterations for each sampling strategy. The dotted line shows the performance of the models trained with full sample size.

Figure 5—figure supplement 10
Regional linear mixed-effects results for deviation-score metrics in models pre-fitted in the UKB and adapted to AIBL with age-skewed samples.

Standardized Betas (β) are shown for the mean-squared error of z-scores relative to the full reference model and for the count of extreme z-score outliers per region (rows). Predictors include Alzheimer’s disease diagnosis (AD), log-standardized sample size (n), age-distribution contrasts (left-skewed, right-skewed; representative = reference), and their two-way interactions with n and AD. Sex ratio is balanced (1:1). Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 5—figure supplement 11
Regional linear mixed-effects results for deviation-score metrics in models pre-fitted in the UKB and adapted to AIBL with sex-imbalanced samples.

Standardized Betas (β) are shown for the mean-squared error of z-scores relative to the full reference model and for the count of extreme z-score outliers per region (rows). Predictors include Alzheimer’s disease diagnosis (AD), log-standardized sample size (n), sex-ratio contrasts (female-to-male 10:1, 4:1, 1:1 [reference], 1:4, 1:10), and their two-way interactions with n and AD. Age distribution is representative. Gray shading denotes regions that did not survive FDR correction (p ≥ 0.05).

Figure 6 with 1 supplement
Site effects on mean cortical thickness (whole-brain average) in the UK Biobank (UKB), AIBL, and OASIS-3 datasets for HC.

(A) Regression plots showing the relationship between age and mean cortical thickness across different datasets and imaging sites (UKB: three sites, AIBL: two sites, OASIS-3: one site). Marginal density plots are displayed on the sides of each axis, illustrating the distribution of mean cortical thickness and age for each dataset’s imaging sites. The UKB exhibits notable differences in mean cortical thickness estimations compared to AIBL and OASIS-3, unrelated to the age distribution of participants. (B) Boxplots representing the mean cortical thickness for participants in each dataset and imaging site. (C) Bivariate kernel density estimate plots showing the joint distribution of age and mean cortical thickness for each dataset. The contour plots represent the density of data points, with filled areas reflecting higher concentrations of data.

Figure 6—figure supplement 1
Site effects on sub-cortical gray matter volume in the UK Biobank (UKB), AIBL, and OASIS-3 datasets for healthy controls.

(A) Regression plots showing the relationship between age and mean sub-cortical volume across different datasets and imaging sites (UKB: three sites, AIBL: two sites, OASIS: two sites). Marginal density plots are displayed on the sides of each axis, illustrating the distribution of mean sub-cortical volume and age for each dataset’s imaging sites. (B) Boxplots representing the mean sub-cortical volume for participants in each dataset and imaging site. (C) Bivariate kernel density estimate plots showing the joint distribution of age and mean sub-cortical volume for each dataset. The contour plots represent the density of data points, with filled areas reflecting higher concentrations of data.

Figure 7 with 1 supplement
Adaptive transfer learning evaluation.

(A) Comparison of model performance between directly trained models (OASIS-3-trained) and models pretrained on UKB then adapted to OASIS-3 (UKB-adapted). The gray curve shows the difference in performance (within-cohort minus adapted), plotted on a separate y-axis (right). For MSLL, SMSE, and MSE, lower values indicate better performance; therefore, positive differences indicate better performance of the adapted models. For EV and Rho, higher values indicate better performance; hence, positive differences reflect better performance of the within-cohort models. EV and Rho values remain constant across models, as the adaptation procedure modifies only the mean of the models and does not affect their shape. (B) Example adaptation iteration showing the percentage of individuals identified as outliers per region. The last row shows the outlier detection obtained using the full adaptation sample (n = 692).
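
The statement that EV and Rho stay constant when only the model mean is adjusted can be checked with a toy example. The sketch below assumes the simplest form of mean-only adaptation, a constant offset added to the pretrained predictions, and assumes Rho is Pearson's correlation; all values and helper names are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])


def mse(y, yhat):
    return mean([(a - b) ** 2 for a, b in zip(y, yhat)])


def explained_variance(y, yhat):
    """EV = 1 - Var(y - yhat) / Var(y)."""
    resid = [a - b for a, b in zip(y, yhat)]
    return 1.0 - variance(resid) / variance(y)


def pearson_rho(y, yhat):
    my, mh = mean(y), mean(yhat)
    cov = sum((a - my) * (b - mh) for a, b in zip(y, yhat))
    ss_y = sum((a - my) ** 2 for a in y)
    ss_h = sum((b - mh) ** 2 for b in yhat)
    return cov / math.sqrt(ss_y * ss_h)


# Hypothetical observed values and pretrained predictions with a site offset.
y = [2.40, 2.55, 2.31, 2.62, 2.48]
pretrained = [2.60, 2.70, 2.58, 2.79, 2.66]

# Mean-only adaptation: shift predictions so their mean matches the data.
offset = mean(y) - mean(pretrained)
adapted = [p + offset for p in pretrained]

ev_before, ev_after = explained_variance(y, pretrained), explained_variance(y, adapted)
rho_before, rho_after = pearson_rho(y, pretrained), pearson_rho(y, adapted)
mse_before, mse_after = mse(y, pretrained), mse(y, adapted)
```

A constant shift leaves the variance of the residuals and the correlation unchanged, so EV and Rho are identical before and after, while MSE drops because the systematic offset is removed.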

Figure 7—figure supplement 1
Adaptive transfer learning evaluation in AIBL.

(A) Comparison of model performance between directly trained models (AIBL-trained) and models pretrained on UKB then adapted to AIBL (UKB-adapted). The gray curve shows the difference in performance (within-cohort minus adapted), plotted on a separate y-axis (right). For MSLL, SMSE, and MSE, lower values indicate better performance; therefore, positive differences indicate better performance of the adapted models. For EV and Rho, higher values indicate better performance; hence, positive differences reflect better performance of the within-cohort models. EV and Rho values remain constant across models, as the adaptation procedure modifies only the mean of the models and does not affect their shape. (B) Example adaptation iteration showing the percentage of individuals identified as outliers per region. The last row shows the outlier detection obtained using the full adaptation sample (n = 322).

Appendix 5—figure 1
Evaluation of model fits in the HC test set across different sampling strategies of the training set for varying sample sizes (n), following re-estimation with 20 independent random seeds to assess the robustness of an apparent artifact observed at n = 300 in left-skewed sampling in the original analysis.

Sampling strategies and metrics are identical to those shown in Figure 3. Solid lines represent the mean metric values across ROIs and random seeds, with shaded areas indicating the standard deviation across seeds for MSLL, SMSE, EV, Rho, the lower and upper tail HC%. For ICC, shaded areas represent variability across ROIs. Dashed lines indicate the mean metric values obtained with the full sample. The absence of a systematic deviation at n = 300 across random seeds indicates that the previously observed effect was driven by stochastic sampling variability rather than a stable modeling artifact.

Author response image 1
Median ROI variance across age bins for OASIS-3 and AIBL.

Shaded areas represent variability across regions within each age bin.

Tables

Table 1
Demographics of HC and AD participants across datasets and sites.

Summary of age ranges, mean ages, standard deviations, and sex ratios (F/M) for HC and AD groups within the UKB, OASIS-3, and AIBL datasets, including site-specific subgroups.

| Dataset | Site | HC: N | HC: Age range (mean ± sd) | HC: Sex ratio F/M (%) | AD: N | AD: Age range (mean ± sd) | AD: Sex ratio F/M (%) |
|---|---|---|---|---|---|---|---|
| UKB | Full dataset | 42,747 | 44.61 – 82.79 (64.50 ± 7.67) | 52.75/47.25 | | | |
| | UKB 1 | 25,536 | 44.61 – 82.34 (63.68 ± 7.57) | 51.78/48.22 | | | |
| | UKB 2 | 10,914 | 48.68 – 82.79 (66.33 ± 7.87) | 54.20/45.80 | | | |
| | UKB 3 | 6,297 | 47.92 – 81.93 (65.36 ± 7.52) | 54.19/45.81 | | | |
| OASIS-3 | Full dataset | 865 | 45.41 – 82.69 (68.21 ± 7.83) | 58.96/41.04 | 167 | 50.79 – 82.67 (73.79 ± 5.99) | 44.91/55.09 |
| AIBL | Full dataset | 403 | 60.13 – 82.61 (72.24 ± 5.40) | 56.67/43.33 | 60 | 58.69 – 82.75 (73.59 ± 6.62) | 56.67/43.33 |
| | AIBL 1 | 274 | 60.99 – 82.61 (72.64 ± 5.26) | 58.03/41.97 | 39 | 60.33 – 82.75 (73.05 ± 6.35) | 53.85/46.15 |
| | AIBL 2 | 129 | 60.13 – 82.56 (71.39 ± 5.63) | 60.16/39.84 | 21 | 58.69 – 81.43 (74.60 ± 7.14) | 61.90/38.10 |
Table 2
Linear mixed model results for evaluation metrics under age-skewed sampling conditions.

Models assess the influence of standardized and log-transformed sample size (n) and age sampling strategy (Representative, Left-skewed, Right-skewed) on model performance metrics (MSLL, SMSE, EV, Rho, ICC, the lower and upper tail HC%). All variables were standardized to allow comparison of effect sizes. Sample size was log-transformed to account for the non-linear association between sample size and model performance. Representative sampling serves as the reference level. Reported coefficients and corresponding p-values indicate the direction and significance of each effect.
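
The predictor coding described above can be sketched as follows. This is a minimal illustration of the fixed-effect columns only; the intermediate sample sizes and the helper names are hypothetical, and the random-effects structure of the mixed models is not shown because it is not specified in this legend.

```python
import math


def standardize(xs):
    """Z-score a list using the sample standard deviation."""
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))
    return [(x - m) / sd for x in xs]


def design_row(log_n_z, strategy):
    """Fixed-effect columns with representative sampling as the reference.

    Column order: intercept, log(n), left-skewed, right-skewed,
    log(n):left-skewed, log(n):right-skewed.
    """
    left = 1.0 if strategy == "left-skewed" else 0.0
    right = 1.0 if strategy == "right-skewed" else 0.0
    return [1.0, log_n_z, left, right, log_n_z * left, log_n_z * right]


# Hypothetical grid of sample sizes between the stated limits (5 and 692).
sample_sizes = [5, 10, 25, 50, 100, 300, 692]

# Log-transform first, then standardize, as described in the legend.
log_n_z = standardize([math.log(n) for n in sample_sizes])

reference_row = design_row(0.5, "representative")
left_row = design_row(log_n_z[0], "left-skewed")
```

With this coding, the Left-skewed and Right-skewed coefficients describe differences from representative sampling at the mean of standardized log(n), and the interaction terms describe how those differences change with sample size.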

| Predictor | MSLL β (p-value) | SMSE β (p-value) | EV β (p-value) | Rho β (p-value) | ICC β (p-value) | Lower tail HC% β (p-value) | Upper tail HC% β (p-value) |
|---|---|---|---|---|---|---|---|
| Intercept (representative) | –0.031 (p = 0.022) | –0.318 (p < 0.001) | 0.260 (p < 0.001) | 0.183 (p < 0.001) | 0.106 (p < 0.001) | –0.203 (p < 0.001) | –0.035 (p = 0.322) |
| Log(n) | –0.496 (p < 0.001) | –0.427 (p < 0.001) | 0.499 (p < 0.001) | 0.527 (p < 0.001) | 0.728 (p < 0.001) | –0.015 (p = 0.126) | –0.603 (p < 0.001) |
| Left-skewed | 0.219 (p < 0.001) | 1.323 (p < 0.001) | –0.927 (p < 0.001) | –0.642 (p < 0.001) | –0.300 (p < 0.001) | 1.410 (p < 0.001) | –0.205 (p < 0.001) |
| Right-skewed | 0.115 (p < 0.001) | 0.585 (p < 0.001) | –0.692 (p < 0.001) | –0.567 (p < 0.001) | –0.024 (p = 0.026) | –0.068 (p < 0.001) | 0.364 (p < 0.001) |
| Log(n):Left-skewed | –0.064 (p < 0.001) | –0.636 (p < 0.001) | 0.019 (p = 0.047) | –0.112 (p < 0.001) | –0.032 (p = 0.003) | –0.983 (p < 0.001) | 0.340 (p < 0.001) |
| Log(n):Right-skewed | –0.084 (p < 0.001) | –0.139 (p < 0.001) | 0.047 (p < 0.001) | –0.034 (p < 0.001) | –0.048 (p < 0.001) | 0.056 (p < 0.001) | –0.295 (p < 0.001) |
Table 3
Linear mixed model results for evaluation metrics under sex-imbalanced sampling conditions.

Models assess the influence of standardized and log-transformed sample size (n) and sex ratio in the training set (1F:1M, 1F:4M, 1F:10M, 4F:1M, 10F:1M; F = female, M = male) on model performance metrics (MSLL, SMSE, EV, Rho, ICC, the lower and upper tail HC%). All variables were standardized to allow comparison of effect sizes. Sample size was log-transformed to account for the non-linear association between sample size and model performance. Representative sampling (1F:1M) serves as the reference level. Reported β coefficients and corresponding p-values indicate the direction and significance of each effect.

| Predictor | MSLL β (p-value) | SMSE β (p-value) | EV β (p-value) | Rho β (p-value) | ICC β (p-value) | Lower tail HC% β (p-value) | Upper tail HC% β (p-value) |
|---|---|---|---|---|---|---|---|
| Intercept (1F1M) | –0.031 (p = 0.005) | –0.318 (p < 0.001) | 0.260 (p < 0.001) | 0.183 (p < 0.001) | NA | –0.203 (p < 0.001) | –0.035 (p = 0.308) |
| Log(n) | –0.496 (p < 0.001) | –0.427 (p < 0.001) | 0.498 (p < 0.001) | 0.527 (p < 0.001) | 0.728 (p < 0.001) | –0.015 (p = 0.004) | –0.603 (p < 0.001) |
| 10F1M | 0.001 (p = 0.965) | 0.076 (p < 0.001) | –0.023 (p < 0.001) | 0.044 (p < 0.001) | –0.152 (p < 0.001) | 0.010 (p = 0.194) | 0.151 (p < 0.001) |
| 4F1M | –0.015 (p = 0.252) | 0.010 (p = 0.030) | 0.010 (p = 0.108) | 0.042 (p < 0.001) | –0.055 (p < 0.001) | 0.015 (p = 0.043) | 0.091 (p < 0.001) |
| 1F4M | –0.039 (p = 0.003) | 0.068 (p < 0.001) | –0.062 (p < 0.001) | –0.062 (p < 0.001) | –0.048 (p < 0.001) | 0.020 (p = 0.007) | –0.028 (p = 0.002) |
| 1F10M | –0.062 (p < 0.001) | 0.166 (p < 0.001) | –0.121 (p < 0.001) | –0.098 (p < 0.001) | –0.162 (p < 0.001) | 0.031 (p < 0.001) | –0.127 (p < 0.001) |
| Log(n):10F1M | 0.057 (p < 0.001) | –0.018 (p < 0.001) | 0.005 (p = 0.428) | –0.034 (p < 0.001) | 0.073 (p < 0.001) | –0.027 (p < 0.001) | –0.033 (p < 0.001) |
| Log(n):4F1M | 0.060 (p < 0.001) | –0.009 (p = 0.051) | 0.015 (p = 0.018) | –0.011 (p = 0.114) | 0.048 (p < 0.001) | –0.033 (p < 0.001) | –0.022 (p = 0.014) |
| Log(n):1F4M | 0.159 (p < 0.001) | –0.020 (p < 0.001) | 0.001 (p = 0.874) | –0.011 (p = 0.134) | –0.013 (p = 0.221) | –0.022 (p = 0.003) | –0.038 (p < 0.001) |
| Log(n):1F10M | 0.279 (p < 0.001) | –0.032 (p < 0.001) | –0.019 (p = 0.002) | –0.049 (p < 0.001) | –0.017 (p = 0.120) | 0.033 (p < 0.001) | 0.163 (p < 0.001) |
Table 4
Linear mixed model results for MSE and total outlier count (tOC) under age-skewed sampling.

Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), and age sampling strategy (Representative, Left-skewed, Right-skewed) on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. Sample size was log-transformed to account for the non-linear association between sample size and model performance. Representative sampling and HC are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects. Age was not included as an additional predictor to keep the modeling approach aligned with the sex-based analysis (Table 5) and to limit the complexity of the models.

| Predictor | MSE β (p-value) | tOC β (p-value) |
|---|---|---|
| Intercept (HC, Representative) | –0.245 (p < 0.001) | –0.443 (p < 0.001) |
| Log(n) | –0.443 (p < 0.001) | –0.004 (p = 0.405) |
| AD | 0.021 (p = 0.570) | 0.665 (p < 0.001) |
| Left-skewed | 1.045 (p < 0.001) | 0.461 (p < 0.001) |
| Right-skewed | 0.101 (p < 0.001) | –0.021 (p = 0.004) |
| Log(n):AD | –0.007 (p = 0.530) | 0.096 (p < 0.001) |
| Log(n):Left-skewed | –0.752 (p < 0.001) | –0.323 (p < 0.001) |
| Log(n):Right-skewed | –0.148 (p < 0.001) | 0.019 (p = 0.008) |
| Left-skewed:AD | 0.575 (p < 0.001) | 0.721 (p < 0.001) |
| Right-skewed:AD | –0.077 (p < 0.001) | 0.016 (p = 0.136) |
| Log(n):Left-skewed:AD | –0.234 (p < 0.001) | –0.290 (p < 0.001) |
| Log(n):Right-skewed:AD | 0.100 (p < 0.001) | –0.004 (p = 0.674) |
Table 5
Linear mixed model results for MSE and total outlier count (tOC) under sex-imbalanced sampling.

Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), and sex ratio in the training set (1:1, 1F:4M, 1F:10M, 4F:1M, 10F:1M) on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. Sample size was log-transformed to account for the non-linear association between sample size and model performance. The 1:1 ratio and HC are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects. Sex was not included as an additional predictor to keep the modeling approach aligned with the age-based analysis (Table 4) and to limit the complexity of the models.

| Predictor | MSE β (p-value) | tOC β (p-value) |
|---|---|---|
| Intercept (HC, 1F 1M) | –0.245 (p < 0.001) | –0.443 (p < 0.001) |
| Log(n) | –0.443 (p < 0.001) | –0.004 (p = 0.084) |
| AD | 0.021 (p = 0.137) | 0.665 (p < 0.001) |
| 10F1M | 0.089 (p < 0.001) | 0.005 (p = 0.173) |
| 4F1M | 0.031 (p < 0.001) | 0.006 (p = 0.076) |
| 1F4M | 0.046 (p < 0.001) | 0.004 (p = 0.212) |
| 1F10M | 0.083 (p < 0.001) | 0.005 (p = 0.170) |
| Log(n):AD | –0.007 (p = 0.153) | 0.096 (p < 0.001) |
| Log(n):10F1M | –0.021 (p < 0.001) | –0.010 (p = 0.005) |
| Log(n):4F1M | –0.012 (p = 0.018) | –0.012 (p = 0.001) |
| Log(n):1F4M | –0.015 (p = 0.003) | –0.005 (p = 0.148) |
| Log(n):1F10M | 0.037 (p < 0.001) | 0.014 (p < 0.001) |
| 10F1M:AD | 0.025 (p < 0.001) | –0.006 (p = 0.258) |
| 4F1M:AD | 0.013 (p = 0.073) | 0.009 (p = 0.079) |
| 1F4M:AD | –0.011 (p = 0.133) | –0.002 (p = 0.718) |
| 1F10M:AD | –0.013 (p = 0.072) | –0.013 (p = 0.012) |
| Log(n):10F1M:AD | –0.013 (p = 0.078) | 0.009 (p = 0.092) |
| Log(n):4F1M:AD | –0.013 (p = 0.081) | 0.001 (p = 0.900) |
| Log(n):1F4M:AD | 0.013 (p = 0.081) | –0.006 (p = 0.220) |
| Log(n):1F10M:AD | –0.002 (p = 0.750) | 0.000 (p = 0.930) |
Appendix 1—table 1
Linear mixed model results for evaluation metrics under age-skewed sampling conditions using normative models pre-trained on the UKB and adapted to the OASIS-3 dataset.

Models assess the influence of standardized and log-transformed sample size (n) and age sampling strategy (Representative, Left-skewed, Right-skewed) on model performance metrics (MSLL, SMSE, EV, Rho, ICC, the lower and upper tail HC%). All variables were standardized to allow comparison of effect sizes. Representative sampling serves as the reference level. Reported β coefficients and corresponding p-values indicate the direction and significance of each effect.

| Predictor | MSLL β (p-value) | SMSE β (p-value) | EV β (p-value) | Rho β (p-value) | ICC β (p-value) | Lower tail HC% β (p-value) | Upper tail HC% β (p-value) |
|---|---|---|---|---|---|---|---|
| Intercept (representative) | –0.195 (p < 0.001) | –0.567 (p < 0.001) | 0.644 (p < 0.001) | 0.701 (p < 0.001) | 0.504 (p < 0.001) | –0.219 (p < 0.001) | –0.373 (p < 0.001) |
| Log(n) | –0.054 (p < 0.001) | –0.102 (p < 0.001) | 0.000 (p = 1.000) | 0.000 (p = 1.000) | 0.245 (p < 0.001) | –0.228 (p < 0.001) | –0.210 (p < 0.001) |
| Left-skewed | 0.062 (p < 0.001) | 0.220 (p < 0.001) | 0.000 (p = 1.000) | –0.000 (p = 1.000) | 0.041 (p < 0.001) | 0.186 (p < 0.001) | 0.778 (p < 0.001) |
| Right-skewed | –0.005 (p < 0.001) | 0.006 (p = 0.049) | –0.000 (p = 1.000) | –0.000 (p = 1.000) | 0.020 (p < 0.001) | 0.076 (p < 0.001) | –0.057 (p < 0.001) |
| Log(n):Left-skewed | –0.037 (p < 0.001) | –0.019 (p < 0.001) | 0.000 (p = 1.000) | 0.000 (p = 1.000) | –0.002 (p = 0.645) | –0.010 (p = 0.133) | –0.161 (p < 0.001) |
| Log(n):Right-skewed | 0.003 (p = 0.087) | 0.008 (p = 0.010) | –0.000 (p = 1.000) | 0.000 (p = 1.000) | –0.001 (p = 0.686) | 0.008 (p = 0.242) | 0.012 (p = 0.187) |
Appendix 1—table 2
Linear mixed model results for evaluation metrics under sex-imbalanced sampling conditions using normative models pre-trained on the UKB and adapted to the OASIS-3 dataset.

Models assess the influence of standardized and log-transformed sample size (n) and sex ratio in the training set (1F:1M, 1F:4M, 1F:10M, 4F:1M, 10F:1M; F=female, M=male) on model performance metrics (MSLL, SMSE, EV, Rho, ICC, the lower and upper tail HC%). All variables were standardized to allow comparison of effect sizes. Representative sampling (1F:1M) serves as the reference level. Reported β coefficients and corresponding p-values indicate the direction and significance of each effect.

| Predictor | MSLL β (p-value) | SMSE β (p-value) | EV β (p-value) | Rho β (p-value) | ICC β (p-value) | Lower tail HC% β (p-value) | Upper tail HC% β (p-value) |
|---|---|---|---|---|---|---|---|
| Intercept (1F1M) | –0.195 (p < 0.001) | –0.567 (p < 0.001) | 0.644 (p < 0.001) | 0.701 (p < 0.001) | 0.504 (p < 0.001) | –0.219 (p < 0.001) | –0.373 (p < 0.001) |
| Log(n) | –0.054 (p < 0.001) | –0.102 (p < 0.001) | 0.000 (p = 1.000) | 0.000 (p = 1.000) | 0.245 (p < 0.001) | –0.228 (p < 0.001) | –0.210 (p < 0.001) |
| 10F1M | –0.011 (p < 0.001) | –0.005 (p = 0.002) | –0.000 (p = 1.000) | –0.000 (p = 1.000) | –0.000 (p = 0.919) | 0.088 (p < 0.001) | 0.059 (p < 0.001) |
| 4F1M | –0.009 (p < 0.001) | –0.006 (p < 0.001) | 0.000 (p = 1.000) | 0.000 (p = 1.000) | –0.000 (p = 0.995) | 0.050 (p < 0.001) | 0.036 (p < 0.001) |
| 1F4M | 0.002 (p = 0.031) | 0.013 (p < 0.001) | 0.000 (p = 1.000) | 0.000 (p = 1.000) | –0.002 (p = 0.595) | –0.006 (p = 0.266) | 0.036 (p < 0.001) |
| 1F10M | –0.001 (p = 0.593) | 0.020 (p < 0.001) | –0.000 (p = 1.000) | –0.000 (p = 1.000) | –0.005 (p = 0.138) | 0.011 (p = 0.032) | 0.036 (p < 0.001) |
| Log(n):10F1M | 0.004 (p < 0.001) | 0.006 (p < 0.001) | –0.000 (p = 1.000) | –0.000 (p = 1.000) | –0.004 (p = 0.280) | 0.006 (p = 0.267) | 0.001 (p = 0.855) |
| Log(n):4F1M | 0.004 (p < 0.001) | 0.004 (p = 0.014) | –0.000 (p = 1.000) | –0.000 (p = 1.000) | –0.003 (p = 0.366) | 0.008 (p = 0.133) | 0.000 (p = 0.970) |
| Log(n):1F4M | 0.000 (p = 0.975) | –0.001 (p = 0.479) | –0.000 (p = 1.000) | 0.000 (p = 1.000) | 0.001 (p = 0.825) | 0.007 (p = 0.208) | –0.046 (p < 0.001) |
| Log(n):1F10M | 0.006 (p < 0.001) | –0.002 (p = 0.175) | –0.000 (p = 1.000) | –0.000 (p = 1.000) | 0.004 (p = 0.192) | 0.001 (p = 0.793) | 0.015 (p = 0.001) |
Appendix 1—table 3
Linear mixed model results for MSE and total outlier count (tOC) under age-skewed sampling for models adapted from the UK Biobank to OASIS-3.

Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), and age sampling strategy (Representative, Left-skewed, Right-skewed) on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. Representative sampling and HC are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects.

| Predictor | MSE β (p-value) | tOC β (p-value) |
|---|---|---|
| Intercept (HC, Representative) | –0.313 (p < 0.001) | –0.462 (p < 0.001) |
| Log(n) | –0.451 (p < 0.001) | 0.026 (p = 0.015) |
| AD | 0.070 (p = 0.243) | 0.909 (p < 0.001) |
| Left-skewed | 0.930 (p < 0.001) | 0.281 (p < 0.001) |
| Right-skewed | 0.560 (p < 0.001) | –0.068 (p < 0.001) |
| Log(n):AD | –0.073 (p = 0.039) | 0.144 (p < 0.001) |
| Log(n):Left-skewed | –0.359 (p < 0.001) | –0.142 (p < 0.001) |
| Log(n):Right-skewed | –0.521 (p < 0.001) | 0.012 (p = 0.427) |
| Left-skewed:AD | 0.211 (p < 0.001) | 0.670 (p < 0.001) |
| Right-skewed:AD | –0.061 (p = 0.224) | –0.192 (p < 0.001) |
| Log(n):Left-skewed:AD | –0.030 (p = 0.555) | –0.187 (p < 0.001) |
| Log(n):Right-skewed:AD | 0.056 (p = 0.264) | 0.005 (p = 0.819) |
Appendix 1—table 4
Linear mixed model results for MSE and total outlier count (tOC) under sex-imbalanced sampling for models adapted from the UK Biobank to OASIS-3.

Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), and sex ratio in the training set (1:1, 1F:4M, 1F:10M, 4F:1M, 10F:1M) on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. The 1:1 ratio and HC are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects.

| Predictor | MSE β (p-value) | tOC β (p-value) |
|---|---|---|
| Intercept (HC, 1F 1M) | NA | –0.443 (p < 0.001) |
| Log(n) | 0.010 (p < 0.001) | 0.029 (p < 0.001) |
| AD | 0.003 (p = 0.168) | 0.016 (p < 0.001) |
| 10F1M | 0.010 (p < 0.001) | –0.003 (p = 0.202) |
| 4F1M | 0.014 (p < 0.001) | 0.003 (p = 0.119) |
| 1F4M | NA | 0.576 (p < 0.001) |
| 1F10M | 0.002 (p = 0.576) | 0.055 (p < 0.001) |
| Log(n):AD | 0.001 (p = 0.809) | 0.033 (p < 0.001) |
| Log(n):10F1M | –0.000 (p = 0.964) | –0.012 (p < 0.001) |
| Log(n):4F1M | 0.001 (p = 0.701) | –0.007 (p = 0.011) |
| Log(n):1F4M | –0.142 (p < 0.001) | –0.075 (p < 0.001) |
| Log(n):1F10M | 0.004 (p = 0.052) | 0.003 (p = 0.185) |
| 10F1M:AD | 0.005 (p = 0.010) | 0.003 (p = 0.100) |
| 4F1M:AD | –0.003 (p = 0.106) | 0.004 (p = 0.060) |
| 1F4M:AD | 0.001 (p = 0.628) | 0.002 (p = 0.404) |
| 1F10M:AD | –0.005 (p = 0.014) | –0.044 (p < 0.001) |
| Log(n):10F1M:AD | –0.002 (p = 0.604) | 0.002 (p = 0.558) |
| Log(n):4F1M:AD | –0.001 (p = 0.749) | 0.003 (p = 0.260) |
| Log(n):1F4M:AD | 0.002 (p = 0.560) | 0.006 (p = 0.025) |
| Log(n):1F10M:AD | –0.001 (p = 0.817) | 0.002 (p = 0.514) |
Appendix 2—table 1
Linear mixed model results for evaluation metrics under age-skewed sampling conditions using normative models trained on AIBL dataset.

Models assess the influence of standardized and log-transformed sample size (n) and age sampling strategy (Representative, Left-skewed, Right-skewed) on model performance metrics (MSLL, SMSE, EV, Rho, ICC, the lower and upper tail HC%). All variables were standardized to allow comparison of effect sizes. Representative sampling serves as the reference level. Reported β coefficients and corresponding p-values indicate the direction and significance of each effect.

| Predictor | MSLL β (p-value) | EV β (p-value) | SMSE β (p-value) | Rho β (p-value) | ICC β (p-value) | Lower tail HC% β (p-value) | Upper tail HC% β (p-value) |
|---|---|---|---|---|---|---|---|
| Intercept (Representative) | –0.293 (p < 0.001) | –0.332 (p < 0.001) | 0.175 (p < 0.001) | 0.067 (p = 0.181) | 0.100 (p < 0.001) | –0.163 (p = 0.004) | –0.026 (p = 0.626) |
| Log(n) | –0.426 (p < 0.001) | –0.308 (p < 0.001) | 0.262 (p < 0.001) | 0.219 (p < 0.001) | 0.755 (p < 0.001) | 0.103 (p < 0.001) | –0.307 (p < 0.001) |
| Left-skewed | 1.080 (p < 0.001) | 1.382 (p < 0.001) | –0.517 (p < 0.001) | –0.212 (p < 0.001) | 0.043 (p = 0.011) | 1.096 (p < 0.001) | –0.457 (p < 0.001) |
| Right-skewed | 0.465 (p < 0.001) | 0.438 (p < 0.001) | –0.338 (p < 0.001) | –0.127 (p < 0.001) | –0.052 (p = 0.002) | –0.265 (p < 0.001) | 0.562 (p < 0.001) |
| Log(n):Left-skewed | –0.484 (p < 0.001) | –0.670 (p < 0.001) | 0.196 (p < 0.001) | 0.002 (p = 0.897) | –0.208 (p < 0.001) | –0.555 (p < 0.001) | 0.182 (p < 0.001) |
| Log(n):Right-skewed | –0.411 (p < 0.001) | –0.284 (p < 0.001) | 0.072 (p < 0.001) | 0.006 (p = 0.703) | 0.002 (p = 0.912) | 0.047 (p = 0.010) | –0.572 (p < 0.001) |
Appendix 2—table 2
Linear mixed model results for evaluation metrics under sex-imbalanced sampling conditions using normative models trained on AIBL dataset.

Models assess the influence of standardized and log-transformed sample size (n) and sex ratio in the training set (1F:1M, 1F:4M, 1F:10M, 4F:1M, 10F:1M; F = female, M = male) on model performance metrics (MSLL, SMSE, EV, Rho, ICC, the lower and upper tail HC%). All variables were standardized to allow comparison of effect sizes. Representative sampling (1F:1M) serves as the reference level. Reported β coefficients and corresponding p-values indicate the direction and significance of each effect.

| Predictor | MSLL β (p-value) | SMSE β (p-value) | EV β (p-value) | Rho β (p-value) | ICC β (p-value) | Lower tail HC% β (p-value) | Upper tail HC% β (p-value) |
|---|---|---|---|---|---|---|---|
| Intercept (1F1M) | –0.294 (p < 0.001) | –0.333 (p < 0.001) | 0.177 (p = 0.001) | 0.067 (p = 0.285) | 0.100 (p < 0.001) | –0.163 (p < 0.001) | –0.026 (p = 0.635) |
| Log(n) | –0.423 (p < 0.001) | –0.306 (p < 0.001) | 0.258 (p < 0.001) | 0.219 (p < 0.001) | 0.755 (p < 0.001) | 0.103 (p < 0.001) | –0.307 (p < 0.001) |
| 10F1M | 0.107 (p < 0.001) | 0.118 (p < 0.001) | –0.111 (p < 0.001) | –0.035 (p = 0.021) | –0.223 (p < 0.001) | 0.041 (p < 0.001) | 0.015 (p = 0.157) |
| 4F1M | 0.056 (p < 0.001) | 0.054 (p < 0.001) | –0.042 (p = 0.004) | –0.010 (p = 0.490) | –0.113 (p < 0.001) | 0.048 (p < 0.001) | –0.005 (p = 0.651) |
| 1F4M | 0.101 (p < 0.001) | 0.100 (p < 0.001) | –0.054 (p < 0.001) | –0.026 (p = 0.086) | –0.081 (p < 0.001) | 0.076 (p < 0.001) | 0.037 (p < 0.001) |
| 1F10M | 0.248 (p < 0.001) | 0.234 (p < 0.001) | –0.164 (p < 0.001) | –0.062 (p < 0.001) | –0.272 (p < 0.001) | 0.141 (p < 0.001) | 0.033 (p = 0.002) |
| Log(n):10F1M | –0.056 (p < 0.001) | –0.069 (p < 0.001) | 0.029 (p = 0.048) | –0.022 (p = 0.146) | 0.095 (p < 0.001) | –0.040 (p < 0.001) | –0.067 (p < 0.001) |
| Log(n):4F1M | –0.076 (p < 0.001) | –0.077 (p < 0.001) | 0.037 (p = 0.011) | –0.013 (p = 0.400) | 0.105 (p < 0.001) | –0.059 (p < 0.001) | –0.074 (p < 0.001) |
| Log(n):1F4M | 0.041 (p < 0.001) | 0.026 (p = 0.005) | –0.073 (p < 0.001) | –0.011 (p = 0.461) | –0.069 (p < 0.001) | 0.040 (p < 0.001) | –0.007 (p = 0.501) |
| Log(n):1F10M | 0.036 (p = 0.001) | 0.060 (p < 0.001) | –0.076 (p < 0.001) | 0.002 (p = 0.914) | –0.112 (p < 0.001) | 0.080 (p < 0.001) | 0.018 (p = 0.083) |
Appendix 2—table 3
Linear mixed model results for MSE and total outlier count (tOC) under age-skewed sampling for models trained in AIBL.

Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), and age sampling strategy (Representative, Left-skewed, Right-skewed) on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. Representative sampling and HC are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects.

| Predictor | MSE β (p-value) | tOC β (p-value) |
|---|---|---|
| Intercept (HC, Representative) | –0.313 (p < 0.001) | –0.462 (p < 0.001) |
| Log(n) | –0.451 (p < 0.001) | 0.026 (p = 0.015) |
| AD | 0.070 (p = 0.243) | 0.909 (p < 0.001) |
| Left-skewed | 0.930 (p < 0.001) | 0.281 (p < 0.001) |
| Right-skewed | 0.560 (p < 0.001) | –0.068 (p < 0.001) |
| Log(n):AD | –0.073 (p = 0.039) | 0.144 (p < 0.001) |
| Log(n):Left-skewed | –0.359 (p < 0.001) | –0.142 (p < 0.001) |
| Log(n):Right-skewed | –0.521 (p < 0.001) | 0.012 (p = 0.427) |
| Left-skewed:AD | 0.211 (p < 0.001) | 0.670 (p < 0.001) |
| Right-skewed:AD | –0.061 (p = 0.224) | –0.192 (p < 0.001) |
| Log(n):Left-skewed:AD | –0.030 (p = 0.555) | –0.187 (p < 0.001) |
| Log(n):Right-skewed:AD | 0.056 (p = 0.264) | 0.005 (p = 0.819) |
Appendix 2—table 4
Linear mixed model results for MSE and total outlier count (tOC) under sex-imbalanced sampling for models trained in AIBL.

Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), and sex ratio in the training set (1:1, 1F:4M, 1F:10M, 4F:1M, 10F:1M) on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. The 1:1 ratio and HC are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects.

| Predictor | MSE β (p-value) | tOC β (p-value) |
|---|---|---|
| Intercept (HC, 1F 1M) | 0.052 (p < 0.001) | –0.462 (p < 0.001) |
| Log(n) | 0.093 (p < 0.001) | 0.011 (p = 0.207) |
| AD | 0.183 (p < 0.001) | 0.012 (p = 0.141) |
| 10F1M | 0.070 (p = 0.009) | 0.020 (p = 0.020) |
| 4F1M | 0.011 (p = 0.404) | 0.036 (p < 0.001) |
| 1F4M | 0.007 (p = 0.614) | 0.909 (p < 0.001) |
| 1F10M | 0.002 (p = 0.896) | 0.019 (p = 0.148) |
| Log(n):AD | 0.003 (p = 0.823) | 0.031 (p = 0.017) |
| Log(n):10F1M | –0.451 (p < 0.001) | 0.012 (p = 0.346) |
| Log(n):4F1M | –0.055 (p < 0.001) | 0.006 (p = 0.667) |
| Log(n):1F4M | –0.070 (p < 0.001) | 0.026 (p < 0.001) |
| Log(n):1F10M | –0.010 (p = 0.264) | –0.010 (p = 0.221) |
| 10F1M:AD | 0.042 (p < 0.001) | –0.015 (p = 0.073) |
| 4F1M:AD | –0.073 (p < 0.001) | 0.010 (p = 0.225) |
| 1F4M:AD | –0.000 (p = 0.980) | 0.021 (p = 0.014) |
| 1F10M:AD | –0.004 (p = 0.771) | 0.144 (p < 0.001) |
| Log(n):10F1M:AD | 0.001 (p = 0.949) | –0.005 (p = 0.719) |
| Log(n):4F1M:AD | 0.004 (p = 0.765) | –0.017 (p = 0.188) |
| Log(n):1F4M:AD | 0.052 (p < 0.001) | –0.006 (p = 0.657) |
| Log(n):1F10M:AD | 0.093 (p < 0.001) | 0.021 (p = 0.101) |
Appendix 3—table 1
Linear mixed model results for evaluation metrics under age-skewed sampling conditions using normative models pre-trained on the UKB and adapted to the AIBL dataset.

Models assess the influence of standardized and log-transformed sample size (n) and age sampling strategy (Representative, Left-skewed, Right-skewed) on model performance metrics (MSLL, SMSE, EV, Rho, ICC, the lower and upper tail HC%). All variables were standardized to allow comparison of effect sizes. Representative sampling serves as the reference level. Reported β coefficients and corresponding p-values indicate the direction and significance of each effect.

| Predictor | MSLL β (p-value) | SMSE β (p-value) | EV β (p-value) | Rho β (p-value) | ICC β (p-value) | Lower tail HC% β (p-value) | Upper tail HC% β (p-value) |
|---|---|---|---|---|---|---|---|
| Intercept (Representative) | 0.792 (p < 0.001) | 1.040 (p < 0.001) | –2.562 (p < 0.001) | –0.076 (p < 0.001) | 0.502 (p < 0.001) | –0.010 (p = 0.869) | –0.168 (p = 0.009) |
| Log(n) | –0.436 (p < 0.001) | –0.211 (p < 0.001) | –0.000 (p = 1.000) | –0.000 (p = 1.000) | 0.515 (p < 0.001) | –0.364 (p < 0.001) | –0.295 (p < 0.001) |
| Left-skewed | 0.862 (p < 0.001) | 0.550 (p < 0.001) | 0.000 (p = 1.000) | 0.000 (p = 1.000) | 0.078 (p < 0.001) | –0.130 (p < 0.001) | 1.701 (p < 0.001) |
| Right-skewed | 0.505 (p < 0.001) | 0.461 (p < 0.001) | 0.000 (p = 1.000) | 0.000 (p = 1.000) | 0.093 (p < 0.001) | 0.946 (p < 0.001) | –0.244 (p < 0.001) |
| Log(n):Left-skewed | –0.445 (p < 0.001) | –0.012 (p = 0.261) | 0.000 (p = 1.000) | 0.000 (p = 1.000) | 0.062 (p < 0.001) | 0.087 (p < 0.001) | –0.303 (p < 0.001) |
| Log(n):Right-skewed | –0.214 (p < 0.001) | –0.009 (p = 0.374) | 0.000 (p = 1.000) | 0.000 (p = 1.000) | –0.009 (p = 0.267) | –0.180 (p < 0.001) | 0.013 (p = 0.537) |
Appendix 3—table 2
Linear mixed model results for evaluation metrics under sex-imbalanced sampling conditions using normative models pre-trained on the UKB and adapted to the AIBL dataset.

Models assess the influence of standardized and log-transformed sample size (n) and sex ratio in the training set (1F:1M, 1F:4M, 1F:10M, 4F:1M, 10F:1M; F = female, M = male) on model performance metrics (MSLL, SMSE, EV, Rho, ICC, the lower and upper tail HC%). All variables were standardized to allow comparison of effect sizes. Representative sampling (1F:1M) serves as the reference level. Reported β coefficients and corresponding p-values indicate the direction and significance of each effect.

| Predictor | MSLL β (p-value) | EV β (p-value) | SMSE β (p-value) | Rho β (p-value) | ICC β (p-value) | Lower tail HC% β (p-value) | Upper tail HC% β (p-value) |
|---|---|---|---|---|---|---|---|
| Intercept (1F1M) | 0.792 (p < 0.001) | 1.040 (p < 0.001) | –2.561 (p < 0.001) | –0.075 (p < 0.001) | 0.502 (p < 0.001) | –0.010 (p = 0.864) | –0.168 (p = 0.002) |
| Log(n) | –0.436 (p < 0.001) | –0.211 (p < 0.001) | –0.000 (p = 1.000) | –0.000 (p = 1.000) | 0.515 (p < 0.001) | –0.364 (p < 0.001) | –0.295 (p < 0.001) |
| 10F1M | 0.056 (p < 0.001) | 0.060 (p < 0.001) | –0.000 (p = 1.000) | –0.000 (p = 1.000) | 0.002 (p = 0.781) | 0.097 (p < 0.001) | 0.021 (p = 0.018) |
| 4F1M | 0.034 (p < 0.001) | 0.032 (p < 0.001) | 0.000 (p = 1.000) | –0.000 (p = 1.000) | 0.012 (p = 0.105) | 0.056 (p < 0.001) | 0.009 (p = 0.319) |
| 1F4M | 0.025 (p = 0.011) | 0.017 (p < 0.001) | –0.000 (p = 1.000) | 0.000 (p = 1.000) | 0.031 (p < 0.001) | 0.056 (p < 0.001) | 0.033 (p < 0.001) |
| 1F10M | 0.030 (p = 0.002) | 0.034 (p < 0.001) | –0.000 (p = 1.000) | –0.000 (p = 1.000) | 0.036 (p < 0.001) | 0.063 (p < 0.001) | 0.085 (p < 0.001) |
| Log(n):10F1M | –0.032 (p = 0.001) | 0.003 (p = 0.462) | 0.000 (p = 1.000) | –0.000 (p = 1.000) | –0.025 (p < 0.001) | –0.014 (p = 0.189) | –0.027 (p = 0.002) |
| Log(n):4F1M | –0.037 (p < 0.001) | 0.007 (p = 0.128) | –0.000 (p = 1.000) | –0.000 (p = 1.000) | –0.039 (p < 0.001) | –0.023 (p = 0.029) | –0.046 (p < 0.001) |
| Log(n):1F4M | –0.021 (p = 0.030) | 0.017 (p < 0.001) | 0.000 (p = 1.000) | 0.000 (p = 1.000) | –0.049 (p < 0.001) | –0.016 (p = 0.130) | –0.002 (p = 0.836) |
| Log(n):1F10M | 0.008 (p = 0.410) | 0.016 (p < 0.001) | 0.000 (p = 1.000) | 0.000 (p = 1.000) | –0.037 (p < 0.001) | 0.007 (p = 0.538) | –0.014 (p = 0.113) |
Appendix 3—table 3
Linear mixed model results for MSE and total outlier count (tOC) under age-skewed sampling for models adapted from the UK Biobank to AIBL.

Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), and age sampling strategy (Representative, Left-skewed, Right-skewed) on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. Representative sampling and HC are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects.

| Term | MSE β (p-value) | tOC β (p-value) |
| --- | --- | --- |
| Intercept (HC, Representative) | –0.572 (p < 0.001) | –0.415 (p < 0.001) |
| Log(n) | –0.277 (p < 0.001) | –0.096 (p < 0.001) |
| AD | 0.009 (p = 0.613) | 0.792 (p < 0.001) |
| Left-skewed | 0.712 (p < 0.001) | –0.039 (p < 0.001) |
| Right-skewed | 0.356 (p < 0.001) | 0.252 (p < 0.001) |
| Log(n):AD | –0.011 (p = 0.305) | –0.036 (p < 0.001) |
| Log(n):Left-skewed | –0.193 (p < 0.001) | 0.026 (p < 0.001) |
| Log(n):Right-skewed | –0.070 (p < 0.001) | –0.048 (p < 0.001) |
| Left-skewed:AD | –0.111 (p < 0.001) | –0.019 (p = 0.113) |
| Right-skewed:AD | 0.036 (p = 0.020) | 0.213 (p < 0.001) |
| Log(n):Left-skewed:AD | 0.002 (p = 0.878) | 0.005 (p = 0.682) |
| Log(n):Right-skewed:AD | –0.014 (p = 0.368) | 0.005 (p = 0.650) |
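
As a reading aid for Appendix 3—table 3, the sketch below combines the reported fixed-effect estimates for MSE into a predicted value for one condition. It assumes treatment (dummy) coding with HC and Representative sampling as reference levels, as stated in the caption, and it ignores random effects. The function name `predicted_mse` and the argument `z_log_n` (the standardized log sample size) are illustrative choices, not taken from the authors' code.

```python
# Fixed-effect estimates for standardized MSE, copied from Appendix 3—table 3.
COEF = {
    "Intercept": -0.572,
    "Log(n)": -0.277,
    "AD": 0.009,
    "Left-skewed": 0.712,
    "Right-skewed": 0.356,
    "Log(n):AD": -0.011,
    "Log(n):Left-skewed": -0.193,
    "Log(n):Right-skewed": -0.070,
    "Left-skewed:AD": -0.111,
    "Right-skewed:AD": 0.036,
    "Log(n):Left-skewed:AD": 0.002,
    "Log(n):Right-skewed:AD": -0.014,
}


def predicted_mse(z_log_n, sampling="Representative", diagnosis="HC"):
    """Fixed-effect prediction in standardized units (assumed dummy coding)."""
    ad = 1.0 if diagnosis == "AD" else 0.0
    # Reference condition: HC under Representative sampling.
    total = COEF["Intercept"] + COEF["Log(n)"] * z_log_n
    # Diagnosis main effect and its interaction with sample size.
    total += ad * (COEF["AD"] + COEF["Log(n):AD"] * z_log_n)
    if sampling != "Representative":
        # Sampling main effect and its interaction with sample size.
        total += COEF[sampling] + COEF[f"Log(n):{sampling}"] * z_log_n
        # Sampling-by-diagnosis terms, including the three-way interaction.
        total += ad * (
            COEF[f"{sampling}:AD"] + COEF[f"Log(n):{sampling}:AD"] * z_log_n
        )
    return total
```

Under these assumptions, `predicted_mse(0.0)` returns the intercept, and `predicted_mse(0.0, "Left-skewed")` adds only the Left-skewed main effect, which shows how each row of the table shifts the prediction relative to the reference condition.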
Appendix 3—table 4
Linear mixed model results for MSE and total outlier count (tOC) under sex-imbalanced sampling for models adapted from the UK Biobank to AIBL.

Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), and sex ratio in the training set (1:1, 1F:4M, 1F:10M, 4F:1M, 10F:1M) on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. The 1:1 ratio and HC are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects.

| Term | MSE β (p-value) | tOC β (p-value) |
| --- | --- | --- |
| Intercept (HC, 1F1M) | –0.572 (p < 0.001) | –0.415 (p < 0.001) |
| Log(n) | 0.027 (p < 0.001) | 0.025 (p < 0.001) |
| AD | 0.010 (p = 0.132) | 0.014 (p = 0.002) |
| 10F1M | 0.025 (p < 0.001) | 0.014 (p = 0.002) |
| 4F1M | 0.046 (p < 0.001) | 0.016 (p < 0.001) |
| 1F4M | 0.009 (p = 0.201) | 0.792 (p < 0.001) |
| 1F10M | 0.003 (p = 0.746) | 0.049 (p < 0.001) |
| Log(n):AD | 0.002 (p = 0.838) | 0.031 (p < 0.001) |
| Log(n):10F1M | 0.002 (p = 0.844) | 0.009 (p = 0.188) |
| Log(n):4F1M | 0.002 (p = 0.817) | 0.015 (p = 0.031) |
| Log(n):1F4M | –0.277 (p < 0.001) | –0.096 (p < 0.001) |
| Log(n):1F10M | –0.001 (p = 0.923) | –0.003 (p = 0.554) |
| 10F1M:AD | 0.002 (p = 0.750) | –0.005 (p = 0.281) |
| 4F1M:AD | 0.006 (p = 0.334) | –0.003 (p = 0.560) |
| 1F4M:AD | 0.002 (p = 0.699) | 0.003 (p = 0.477) |
| 1F10M:AD | –0.011 (p = 0.111) | –0.036 (p < 0.001) |
| Log(n):10F1M:AD | –0.002 (p = 0.816) | –0.001 (p = 0.911) |
| Log(n):4F1M:AD | –0.003 (p = 0.778) | –0.008 (p = 0.278) |
| Log(n):1F4M:AD | –0.002 (p = 0.872) | –0.007 (p = 0.293) |
| Log(n):1F10M:AD | 0.000 (p = 0.996) | –0.006 (p = 0.370) |
Appendix 4—table 1
Empirical age distributions by sex for the training and test splits of OASIS-3, AIBL, and UK Biobank.

Age is reported as mean ± SD and interquartile range (P25–P75) separately for males and females in each split (Train HC, Test HC, Test AD where available).

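| Dataset | Split | Sex | Mean ± SD | P25–P75 |
| --- | --- | --- | --- | --- |
| OASIS-3 | Train HC | Male | 68.99 ± 7.55 | 65.6–74.1 |
| OASIS-3 | Train HC | Female | 67.66 ± 7.97 | 63.2–73.7 |
| OASIS-3 | Test HC | Male | 69.05 ± 7.67 | 66.1–73.6 |
| OASIS-3 | Test HC | Female | 67.69 ± 7.97 | 63.9–73.3 |
| OASIS-3 | Test AD | Male | 73.69 ± 6.06 | 69.6–78.6 |
| OASIS-3 | Test AD | Female | 73.91 ± 5.93 | 70.4–77.9 |
| AIBL | Train HC | Male | 72.43 ± 5.19 | 67.9–76.0 |
| AIBL | Train HC | Female | 72.02 ± 5.56 | 67.7–76.2 |
| AIBL | Test HC | Male | 72.38 ± 5.46 | 69.0–77.0 |
| AIBL | Test HC | Female | 72.47 ± 5.42 | 68.8–77.3 |
| AIBL | Test AD | Male | 73.51 ± 6.59 | 69.3–79.0 |
| AIBL | Test AD | Female | 73.65 ± 6.75 | 71.5–79.8 |
| UKB | Train HC | Male | 65.22 ± 7.80 | 59.3–71.3 |
| UKB | Train HC | Female | 63.87 ± 7.52 | 58.0–69.7 |
| UKB | Test HC | Male | 65.20 ± 7.72 | 59.3–71.2 |
| UKB | Test HC | Female | 63.85 ± 7.51 | 58.0–69.6 |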
Appendix 4—table 2
Expected age distributions under representative, left-skewed, and right-skewed age sampling for OASIS-3 and AIBL.

Expected distributions were estimated by repeated simulation of the age-skewed sampling procedure at n = 200 across 1000 independent samples and are reported as mean ± SD and P25–P75 separately for males and females.

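| Dataset | Sampling | Sex | Mean ± SD | P25–P75 |
| --- | --- | --- | --- | --- |
| OASIS-3 | Representative | Male | 68.24 ± 7.75 | 64.4–73.8 |
| OASIS-3 | Representative | Female | 68.20 ± 7.82 | 64.1–73.8 |
| OASIS-3 | Left | Male | 57.58 ± 5.23 | 54.0–60.0 |
| OASIS-3 | Left | Female | 57.05 ± 5.40 | 53.3–60.1 |
| OASIS-3 | Right | Male | 72.10 ± 6.08 | 67.9–76.8 |
| OASIS-3 | Right | Female | 71.94 ± 5.89 | 68.2–75.8 |
| AIBL | Representative | Male | 72.19 ± 5.34 | 67.7–76.0 |
| AIBL | Representative | Female | 72.18 ± 5.39 | 68.0–76.2 |
| AIBL | Left | Male | 67.54 ± 3.16 | 65.5–69.1 |
| AIBL | Left | Female | 67.29 ± 3.19 | 65.4–69.0 |
| AIBL | Right | Male | 76.04 ± 3.49 | 74.0–78.4 |
| AIBL | Right | Female | 75.66 ± 3.21 | 73.9–77.4 |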
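
The simulation behind Appendix 4—table 2 can be illustrated with a minimal sketch, under several assumptions. The Beta parameters (α = 2, β = 5 for left-skewed; α = 5, β = 2 for right-skewed) and the 10 equally spaced age bins follow the figure caption. Everything else is hypothetical: the function names, the per-person weighting (each bin's Beta weight divided by the bin's head count), and sampling without replacement via exponential keys.

```python
import random
import statistics


def beta_bin_weights(alpha, beta, n_bins=10):
    """Unnormalized Beta(alpha, beta) density at each bin centre on [0, 1]."""
    centres = [(i + 0.5) / n_bins for i in range(n_bins)]
    return [c ** (alpha - 1) * (1 - c) ** (beta - 1) for c in centres]


def skewed_sample(ages, n, alpha, beta, rng, n_bins=10):
    """Draw n ages without replacement, favouring bins with high Beta weight."""
    lo, hi = min(ages), max(ages)
    width = (hi - lo) / n_bins
    # Assign each age to one of n_bins equally spaced bins.
    bins = [min(int((a - lo) / width), n_bins - 1) for a in ages]
    counts = [bins.count(b) for b in range(n_bins)]
    bin_w = beta_bin_weights(alpha, beta, n_bins)
    keyed = []
    for age, b in zip(ages, bins):
        weight = bin_w[b] / counts[b]  # assumed per-person weight
        # Exponential keys: the n smallest keys form a weighted sample
        # without replacement.
        keyed.append((rng.expovariate(1.0) / weight, age))
    keyed.sort()
    return [age for _, age in keyed[:n]]


def expected_distribution(ages, n, alpha, beta, n_rep, rng):
    """Pool n_rep simulated samples; return mean, SD, P25, and P75."""
    drawn = []
    for _ in range(n_rep):
        drawn.extend(skewed_sample(ages, n, alpha, beta, rng))
    quartiles = statistics.quantiles(drawn, n=4)
    return (statistics.mean(drawn), statistics.stdev(drawn),
            quartiles[0], quartiles[2])
```

With a pool of training ages, a call such as `expected_distribution(pool, 200, 2, 5, 1000, random.Random(0))` would mirror the n = 200, 1000-sample setting in the caption. The values it returns depend on the pool and on these assumptions, so they are not expected to match the table.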
Appendix 4—table 3
Expected age distributions under sex-imbalance sampling for OASIS-3 and AIBL.

Sex ratios include 1F:1M (representative), 1F:4M, 1F:10M, 4F:1M, and 10F:1M. Expected distributions were estimated by repeated simulation of the sampling procedure at n = 200 across 1000 independent samples and are reported as mean ± SD and P25–P75 separately for males and females.

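| Dataset | Sampling | Sex | Mean ± SD | P25–P75 |
| --- | --- | --- | --- | --- |
| OASIS-3 | 1F1M | Male | 68.24 ± 7.74 | 64.4–73.8 |
| OASIS-3 | 1F1M | Female | 68.19 ± 7.83 | 64.1–73.8 |
| OASIS-3 | 1F4M | Male | 68.24 ± 7.74 | 64.4–73.8 |
| OASIS-3 | 1F4M | Female | 68.19 ± 7.83 | 64.1–73.8 |
| OASIS-3 | 1F10M | Male | 68.23 ± 7.74 | 64.4–73.8 |
| OASIS-3 | 1F10M | Female | 68.2 ± 7.85 | 64.1–73.8 |
| OASIS-3 | 4F1M | Male | 68.24 ± 7.73 | 64.4–73.8 |
| OASIS-3 | 4F1M | Female | 68.19 ± 7.83 | 64.1–73.8 |
| OASIS-3 | 10F1M | Male | 68.22 ± 7.78 | 64.2–73.9 |
| OASIS-3 | 10F1M | Female | 68.19 ± 7.84 | 64.1–73.8 |
| AIBL | 1F1M | Male | 72.19 ± 5.34 | 67.6–76.0 |
| AIBL | 1F1M | Female | 72.19 ± 5.39 | 68.1–76.2 |
| AIBL | 1F4M | Male | 72.19 ± 5.34 | 67.6–76.0 |
| AIBL | 1F4M | Female | 72.18 ± 5.39 | 68.0–76.2 |
| AIBL | 1F10M | Male | 72.19 ± 5.34 | 67.6–76.0 |
| AIBL | 1F10M | Female | 72.19 ± 5.39 | 68.0–76.2 |
| AIBL | 4F1M | Male | 72.2 ± 5.33 | 67.5–76.0 |
| AIBL | 4F1M | Female | 72.18 ± 5.39 | 68.1–76.2 |
| AIBL | 10F1M | Male | 72.18 ± 5.33 | 67.6–76.0 |
| AIBL | 10F1M | Female | 72.18 ± 5.39 | 68.0–76.2 |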
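
To make the sex ratios in Appendix 4—table 3 concrete, the following sketch shows one way a target ratio could be turned into per-sex sample sizes at a given n. The rounding rule and the helper names `sex_counts` and `draw_sex_imbalanced` are assumptions for illustration. The uniform draw within each sex is also a simplification, since the near-identical age distributions in the table suggest that the study's procedure preserved the age structure within each sex.

```python
import random


def sex_counts(n, females, males):
    """Split n into (n_female, n_male) for a females:males ratio.

    Rounds the female count to the nearest integer (assumed rule) and
    assigns the remainder to males, so the two counts always sum to n.
    """
    n_female = round(n * females / (females + males))
    return n_female, n - n_female


def draw_sex_imbalanced(female_ages, male_ages, n, females, males, rng):
    """Draw a sample of size n with the requested females:males ratio."""
    n_f, n_m = sex_counts(n, females, males)
    # Uniform sampling without replacement within each sex (simplification).
    return rng.sample(female_ages, n_f), rng.sample(male_ages, n_m)
```

At n = 200, a 1F:4M ratio gives 40 females and 160 males under this rule, while 1F:10M gives 18 females and 182 males because 200/11 is not an integer.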
Appendix 4—table 4
Linear mixed model results for MSE and total outlier count (tOC) under age-skewed sampling for models trained with OASIS-3 or AIBL.

Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), age sampling strategy (Representative, Left-skewed, Right-skewed), age, sex, and their interactions on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. Representative sampling, HC, and female sex are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects.

| Term | OASIS-3 MSE β (p-value) | OASIS-3 tOC β (p-value) | AIBL MSE β (p-value) | AIBL tOC β (p-value) |
| --- | --- | --- | --- | --- |
| Intercept | 0.113 (p = 0.824) | 0.202 (p = 0.916) | 0.245 (p = 0.876) | 0.614 (p = 0.874) |
| n | –1.237 (p < 0.001) | 0.539 (p < 0.001) | –0.972 (p = 0.096) | 1.112 (p = 0.003) |
| AD | –0.964 (p = 0.242) | –0.795 (p = 0.798) | –0.132 (p = 0.952) | –7.493 (p = 0.165) |
| left | –6.233 (p < 0.001) | –3.919 (p < 0.001) | –12.217 (p < 0.001) | –5.242 (p < 0.001) |
| right | 1.270 (p < 0.001) | –0.604 (p < 0.001) | 8.295 (p < 0.001) | –1.852 (p < 0.001) |
| n:AD | 1.197 (p < 0.001) | 0.480 (p = 0.017) | –0.285 (p = 0.726) | –0.914 (p = 0.079) |
| n:left | 4.473 (p < 0.001) | 2.006 (p < 0.001) | 6.366 (p < 0.001) | 1.041 (p = 0.049) |
| n:right | –1.482 (p < 0.001) | 0.077 (p = 0.662) | –6.562 (p < 0.001) | –0.793 (p = 0.134) |
| left:AD | –2.744 (p < 0.001) | –4.194 (p < 0.001) | 0.434 (p = 0.705) | –4.832 (p < 0.001) |
| right:AD | –0.142 (p = 0.730) | 0.550 (p = 0.054) | –0.003 (p = 0.998) | 1.009 (p = 0.171) |
| n:left:AD | 0.558 (p = 0.175) | 0.389 (p = 0.172) | –1.107 (p = 0.335) | 0.962 (p = 0.192) |
| n:right:AD | –0.415 (p = 0.313) | 0.253 (p = 0.375) | –1.304 (p = 0.256) | 0.368 (p = 0.618) |
| age | –0.005 (p = 0.477) | –0.008 (p = 0.786) | –0.008 (p = 0.706) | –0.014 (p = 0.795) |
| left:age | 0.104 (p < 0.001) | 0.065 (p < 0.001) | 0.183 (p < 0.001) | 0.076 (p < 0.001) |
| right:age | –0.017 (p < 0.001) | 0.009 (p < 0.001) | –0.108 (p < 0.001) | 0.024 (p < 0.001) |
| AD:age | 0.014 (p = 0.228) | 0.017 (p = 0.685) | 0.003 (p = 0.922) | 0.108 (p = 0.141) |
| left:AD:age | 0.037 (p < 0.001) | 0.057 (p < 0.001) | –0.007 (p = 0.650) | 0.072 (p < 0.001) |
| right:AD:age | 0.002 (p = 0.686) | –0.007 (p = 0.064) | 0.002 (p = 0.884) | –0.014 (p = 0.167) |
| n:age | 0.012 (p < 0.001) | –0.007 (p < 0.001) | 0.008 (p = 0.330) | –0.014 (p = 0.006) |
| n:left:age | –0.075 (p < 0.001) | –0.034 (p < 0.001) | –0.094 (p < 0.001) | –0.016 (p = 0.030) |
| n:right:age | 0.019 (p < 0.001) | –0.001 (p = 0.690) | 0.084 (p < 0.001) | 0.010 (p = 0.152) |
| n:AD:age | –0.017 (p < 0.001) | –0.005 (p = 0.073) | 0.003 (p = 0.820) | 0.013 (p = 0.068) |
| n:left:AD:age | –0.006 (p = 0.299) | –0.005 (p = 0.166) | 0.016 (p = 0.294) | –0.015 (p = 0.132) |
| n:right:AD:age | 0.005 (p = 0.362) | –0.003 (p = 0.383) | 0.016 (p = 0.318) | –0.005 (p = 0.612) |
| sex | 0.073 (p = 0.809) | –0.521 (p = 0.647) | –0.290 (p = 0.759) | –0.660 (p = 0.778) |
| left:sex | 0.002 (p = 0.990) | 0.722 (p < 0.001) | 0.057 (p = 0.909) | 1.052 (p = 0.001) |
| right:sex | 0.018 (p = 0.907) | 0.228 (p = 0.029) | 0.288 (p = 0.563) | 0.854 (p = 0.008) |
| AD:sex | 0.009 (p = 0.987) | 1.404 (p = 0.475) | 0.323 (p = 0.806) | 5.076 (p = 0.120) |
| left:AD:sex | 0.086 (p = 0.741) | 0.347 (p = 0.055) | 1.359 (p = 0.050) | 0.541 (p = 0.224) |
| right:AD:sex | –0.106 (p = 0.685) | –0.438 (p = 0.015) | –0.694 (p = 0.317) | –1.900 (p < 0.001) |
| n:sex | –0.034 (p = 0.751) | –0.200 (p = 0.007) | 0.187 (p = 0.595) | –0.302 (p = 0.182) |
| n:left:sex | –0.222 (p = 0.142) | –0.285 (p = 0.006) | –0.394 (p = 0.430) | 0.061 (p = 0.850) |
| n:right:sex | –0.089 (p = 0.554) | –0.024 (p = 0.817) | –0.555 (p = 0.266) | 0.240 (p = 0.453) |
| n:AD:sex | –0.067 (p = 0.716) | 0.259 (p = 0.042) | –0.178 (p = 0.717) | 1.452 (p < 0.001) |
| n:left:AD:sex | –0.176 (p = 0.499) | –0.067 (p = 0.712) | –0.170 (p = 0.806) | –0.605 (p = 0.174) |
| n:right:AD:sex | 0.433 (p = 0.096) | 0.070 (p = 0.699) | 1.254 (p = 0.071) | –0.708 (p = 0.112) |
| age:sex | –0.001 (p = 0.803) | 0.006 (p = 0.696) | 0.004 (p = 0.743) | 0.008 (p = 0.793) |
| left:age:sex | 0.002 (p = 0.466) | –0.011 (p < 0.001) | –0.002 (p = 0.767) | –0.015 (p < 0.001) |
| right:age:sex | –0.000 (p = 0.884) | –0.003 (p = 0.028) | –0.003 (p = 0.635) | –0.012 (p = 0.009) |
| AD:age:sex | –0.000 (p = 0.993) | –0.018 (p = 0.517) | –0.004 (p = 0.803) | –0.065 (p = 0.143) |
| left:AD:age:sex | –0.001 (p = 0.826) | –0.001 (p = 0.757) | –0.018 (p = 0.062) | –0.006 (p = 0.310) |
| right:AD:age:sex | 0.001 (p = 0.709) | 0.006 (p = 0.020) | 0.009 (p = 0.364) | 0.024 (p < 0.001) |
| n:age:sex | 0.000 (p = 0.787) | 0.002 (p = 0.026) | –0.003 (p = 0.538) | 0.004 (p = 0.245) |
| n:left:age:sex | 0.002 (p = 0.330) | 0.004 (p = 0.004) | 0.006 (p = 0.373) | –0.001 (p = 0.789) |
| n:right:age:sex | 0.001 (p = 0.540) | 0.000 (p = 0.764) | 0.007 (p = 0.300) | –0.003 (p = 0.510) |
| n:AD:age:sex | 0.001 (p = 0.717) | –0.004 (p = 0.047) | 0.003 (p = 0.697) | –0.019 (p < 0.001) |
| n:left:AD:age:sex | 0.003 (p = 0.460) | –0.000 (p = 0.931) | 0.002 (p = 0.834) | 0.008 (p = 0.183) |
| n:right:AD:age:sex | –0.006 (p = 0.111) | –0.001 (p = 0.707) | –0.016 (p = 0.088) | 0.010 (p = 0.111) |
Appendix 4—table 5
Linear mixed model results for MSE and total outlier count (tOC) under sex-imbalanced sampling for models trained with OASIS-3 or AIBL.

Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), sex ratio in the training set (1:1, 1F:4M, 1F:10M, 4F:1M, 10F:1M), age, sex, and their interactions on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. The 1:1 ratio, HC, and female sex are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects.

| Term | OASIS-3 MSE β (p-value) | OASIS-3 tOC β (p-value) | AIBL MSE β (p-value) | AIBL tOC β (p-value) |
| --- | --- | --- | --- | --- |
| Intercept | 0.113 (p = 0.674) | 0.202 (p = 0.905) | 0.245 (p = 0.754) | 0.614 (p = 0.868) |
| n | –1.237 (p < 0.001) | 0.539 (p < 0.001) | –0.972 (p < 0.001) | 1.112 (p < 0.001) |
| AD | –0.964 (p = 0.027) | –0.795 (p = 0.771) | –0.132 (p = 0.903) | –7.493 (p = 0.144) |
| 10F1M | 0.461 (p = 0.002) | 0.052 (p = 0.618) | 0.649 (p = 0.086) | –0.027 (p = 0.934) |
| 4F1M | 0.129 (p = 0.383) | 0.055 (p = 0.595) | 0.381 (p = 0.313) | –0.040 (p = 0.901) |
| 1F4M | 0.080 (p = 0.587) | –0.155 (p = 0.135) | –0.033 (p = 0.931) | 0.295 (p = 0.361) |
| 1F10M | –0.061 (p = 0.677) | –0.101 (p = 0.331) | –0.097 (p = 0.797) | 0.474 (p = 0.142) |
| n:AD | 1.197 (p < 0.001) | 0.480 (p < 0.001) | –0.285 (p = 0.443) | –0.914 (p = 0.004) |
| n:10F1M | –0.150 (p = 0.308) | 0.035 (p = 0.732) | –0.587 (p = 0.120) | 0.067 (p = 0.836) |
| n:4F1M | 0.032 (p = 0.826) | 0.066 (p = 0.523) | –0.452 (p = 0.231) | 0.183 (p = 0.571) |
| n:1F4M | –0.178 (p = 0.228) | 0.020 (p = 0.844) | 0.285 (p = 0.450) | 0.062 (p = 0.849) |
| n:1F10M | 0.338 (p = 0.022) | –0.157 (p = 0.129) | 0.469 (p = 0.214) | –0.165 (p = 0.610) |
| 10F1M:AD | 0.729 (p = 0.002) | –0.558 (p < 0.001) | –0.102 (p = 0.845) | –0.071 (p = 0.874) |
| 4F1M:AD | 0.351 (p = 0.142) | –0.404 (p = 0.016) | 0.050 (p = 0.925) | –0.140 (p = 0.755) |
| 1F4M:AD | –0.535 (p = 0.025) | 0.190 (p = 0.257) | 0.102 (p = 0.846) | –0.201 (p = 0.654) |
| 1F10M:AD | –0.890 (p < 0.001) | 0.407 (p = 0.015) | –0.028 (p = 0.958) | –0.067 (p = 0.881) |
| n:10F1M:AD | –0.462 (p = 0.053) | –0.078 (p = 0.641) | –0.020 (p = 0.969) | 0.000 (p = 0.999) |
| n:4F1M:AD | –0.239 (p = 0.318) | –0.022 (p = 0.897) | –0.053 (p = 0.919) | 0.105 (p = 0.814) |
| n:1F4M:AD | 0.291 (p = 0.223) | –0.056 (p = 0.739) | –0.548 (p = 0.297) | 0.156 (p = 0.728) |
| n:1F10M:AD | 0.302 (p = 0.206) | –0.198 (p = 0.239) | –0.492 (p = 0.349) | –0.264 (p = 0.556) |
| age | –0.005 (p = 0.180) | –0.008 (p = 0.757) | –0.008 (p = 0.450) | –0.014 (p = 0.785) |
| 10F1M:age | –0.000 (p = 0.819) | –0.001 (p = 0.667) | –0.004 (p = 0.463) | 0.000 (p = 0.931) |
| 4F1M:age | 0.001 (p = 0.692) | –0.000 (p = 0.789) | –0.002 (p = 0.659) | 0.001 (p = 0.855) |
| 1F4M:age | –0.002 (p = 0.245) | 0.002 (p = 0.158) | –0.001 (p = 0.889) | –0.004 (p = 0.336) |
| 1F10M:age | –0.002 (p = 0.426) | 0.001 (p = 0.441) | –0.001 (p = 0.802) | –0.007 (p = 0.121) |
| AD:age | 0.014 (p = 0.023) | 0.017 (p = 0.645) | 0.003 (p = 0.844) | 0.108 (p = 0.122) |
| 10F1M:AD:age | –0.010 (p = 0.002) | 0.006 (p = 0.011) | 0.001 (p = 0.850) | –0.001 (p = 0.826) |
| 4F1M:AD:age | –0.005 (p = 0.137) | 0.005 (p = 0.035) | –0.001 (p = 0.928) | 0.001 (p = 0.906) |
| 1F4M:AD:age | 0.008 (p = 0.023) | –0.003 (p = 0.283) | –0.001 (p = 0.906) | 0.003 (p = 0.616) |
| 1F10M:AD:age | 0.013 (p < 0.001) | –0.005 (p = 0.032) | 0.001 (p = 0.879) | 0.002 (p = 0.802) |
| n:age | 0.012 (p < 0.001) | –0.007 (p < 0.001) | 0.008 (p = 0.033) | –0.014 (p < 0.001) |
| n:10F1M:age | –0.000 (p = 0.817) | –0.001 (p = 0.375) | 0.006 (p = 0.230) | –0.002 (p = 0.718) |
| n:4F1M:age | –0.002 (p = 0.328) | –0.002 (p = 0.250) | 0.004 (p = 0.432) | –0.003 (p = 0.469) |
| n:1F4M:age | 0.003 (p = 0.121) | –0.001 (p = 0.644) | –0.004 (p = 0.481) | –0.001 (p = 0.837) |
| n:1F10M:age | –0.003 (p = 0.130) | 0.002 (p = 0.162) | –0.006 (p = 0.259) | 0.002 (p = 0.679) |
| n:AD:age | –0.017 (p < 0.001) | –0.005 (p = 0.002) | 0.003 (p = 0.619) | 0.013 (p = 0.003) |
| n:10F1M:AD:age | 0.006 (p = 0.052) | 0.002 (p = 0.503) | 0.000 (p = 0.985) | –0.000 (p = 0.943) |
| n:4F1M:AD:age | 0.003 (p = 0.314) | 0.001 (p = 0.795) | 0.001 (p = 0.935) | –0.002 (p = 0.748) |
| n:1F4M:AD:age | –0.004 (p = 0.226) | 0.001 (p = 0.772) | 0.007 (p = 0.340) | –0.002 (p = 0.705) |
| n:1F10M:AD:age | –0.004 (p = 0.222) | 0.002 (p = 0.350) | 0.006 (p = 0.371) | 0.003 (p = 0.609) |
| sex | 0.073 (p = 0.648) | –0.521 (p = 0.603) | –0.290 (p = 0.539) | –0.660 (p = 0.767) |
| 10F1M:sex | –0.134 (p = 0.125) | –0.003 (p = 0.960) | –0.308 (p = 0.177) | 0.083 (p = 0.672) |
| 4F1M:sex | –0.019 (p = 0.826) | –0.012 (p = 0.849) | –0.135 (p = 0.554) | 0.071 (p = 0.716) |
| 1F4M:sex | 0.007 (p = 0.939) | 0.072 (p = 0.244) | 0.071 (p = 0.755) | –0.101 (p = 0.605) |
| 1F10M:sex | 0.042 (p = 0.629) | 0.046 (p = 0.455) | 0.219 (p = 0.337) | –0.158 (p = 0.419) |
| AD:sex | 0.009 (p = 0.975) | 1.404 (p = 0.417) | 0.323 (p = 0.623) | 5.076 (p = 0.102) |
| 10F1M:AD:sex | –0.519 (p < 0.001) | 0.369 (p < 0.001) | 0.092 (p = 0.773) | 0.438 (p = 0.107) |
| 4F1M:AD:sex | –0.267 (p = 0.078) | 0.241 (p = 0.024) | –0.026 (p = 0.935) | 0.340 (p = 0.211) |
| 1F4M:AD:sex | 0.300 (p = 0.047) | –0.194 (p = 0.068) | –0.036 (p = 0.909) | –0.024 (p = 0.930) |
| 1F10M:AD:sex | 0.552 (p < 0.001) | –0.333 (p = 0.002) | –0.002 (p = 0.995) | –0.136 (p = 0.616) |
| n:sex | –0.034 (p = 0.584) | –0.200 (p < 0.001) | 0.187 (p = 0.246) | –0.302 (p = 0.029) |
| n:10F1M:sex | 0.009 (p = 0.920) | –0.068 (p = 0.273) | 0.082 (p = 0.721) | –0.082 (p = 0.673) |
| n:4F1M:sex | –0.063 (p = 0.469) | –0.074 (p = 0.230) | –0.003 (p = 0.991) | –0.118 (p = 0.546) |
| n:1F4M:sex | 0.061 (p = 0.487) | –0.016 (p = 0.795) | –0.217 (p = 0.342) | –0.010 (p = 0.959) |
| n:1F10M:sex | –0.018 (p = 0.834) | 0.090 (p = 0.145) | –0.286 (p = 0.210) | 0.147 (p = 0.451) |
| n:AD:sex | –0.067 (p = 0.531) | 0.259 (p < 0.001) | –0.178 (p = 0.428) | 1.452 (p < 0.001) |
| n:10F1M:AD:sex | 0.305 (p = 0.044) | 0.044 (p = 0.680) | 0.011 (p = 0.972) | 0.162 (p = 0.551) |
| n:4F1M:AD:sex | 0.199 (p = 0.188) | 0.052 (p = 0.628) | 0.042 (p = 0.895) | 0.083 (p = 0.760) |
| n:1F4M:AD:sex | –0.188 (p = 0.214) | 0.061 (p = 0.566) | 0.305 (p = 0.337) | –0.073 (p = 0.787) |
| n:1F10M:AD:sex | –0.302 (p = 0.046) | 0.128 (p = 0.229) | 0.337 (p = 0.288) | 0.088 (p = 0.747) |
| age:sex | –0.001 (p = 0.637) | 0.006 (p = 0.658) | 0.004 (p = 0.510) | 0.008 (p = 0.782) |
| 10F1M:age:sex | –0.001 (p = 0.360) | 0.000 (p = 0.998) | 0.002 (p = 0.570) | –0.001 (p = 0.693) |
| 4F1M:age:sex | –0.001 (p = 0.363) | –0.000 (p = 0.964) | 0.000 (p = 0.887) | –0.001 (p = 0.700) |
| 1F4M:age:sex | 0.001 (p = 0.365) | –0.001 (p = 0.317) | 0.001 (p = 0.856) | 0.002 (p = 0.526) |
| 1F10M:age:sex | 0.002 (p = 0.159) | –0.000 (p = 0.676) | 0.000 (p = 0.939) | 0.003 (p = 0.308) |
| AD:age:sex | –0.000 (p = 0.987) | –0.018 (p = 0.461) | –0.004 (p = 0.617) | –0.065 (p = 0.124) |
| 10F1M:AD:age:sex | 0.007 (p < 0.001) | –0.004 (p = 0.008) | –0.001 (p = 0.794) | –0.004 (p = 0.246) |
| 4F1M:AD:age:sex | 0.004 (p = 0.071) | –0.003 (p = 0.059) | 0.000 (p = 0.926) | –0.004 (p = 0.333) |
| 1F4M:AD:age:sex | –0.004 (p = 0.044) | 0.003 (p = 0.088) | 0.000 (p = 0.968) | 0.000 (p = 0.952) |
| 1F10M:AD:age:sex | –0.008 (p < 0.001) | 0.004 (p = 0.007) | –0.000 (p = 0.933) | 0.002 (p = 0.684) |
| n:age:sex | 0.000 (p = 0.642) | 0.002 (p < 0.001) | –0.003 (p = 0.178) | 0.004 (p = 0.057) |
| n:10F1M:age:sex | 0.001 (p = 0.275) | 0.001 (p = 0.112) | –0.000 (p = 0.893) | 0.001 (p = 0.583) |
| n:4F1M:age:sex | 0.002 (p = 0.148) | 0.001 (p = 0.102) | 0.001 (p = 0.803) | 0.002 (p = 0.471) |
| n:1F4M:age:sex | –0.001 (p = 0.246) | 0.000 (p = 0.641) | 0.003 (p = 0.385) | 0.000 (p = 0.924) |
| n:1F10M:age:sex | –0.000 (p = 0.708) | –0.001 (p = 0.225) | 0.004 (p = 0.210) | –0.002 (p = 0.552) |
| n:AD:age:sex | 0.001 (p = 0.532) | –0.004 (p < 0.001) | 0.003 (p = 0.395) | –0.019 (p < 0.001) |
| n:10F1M:AD:age:sex | –0.004 (p = 0.040) | –0.001 (p = 0.555) | –0.000 (p = 0.979) | –0.002 (p = 0.596) |
| n:4F1M:AD:age:sex | –0.003 (p = 0.174) | –0.001 (p = 0.539) | –0.001 (p = 0.896) | –0.001 (p = 0.803) |
| n:1F4M:AD:age:sex | 0.003 (p = 0.214) | –0.001 (p = 0.589) | –0.004 (p = 0.386) | 0.001 (p = 0.769) |
| n:1F10M:AD:age:sex | 0.004 (p = 0.046) | –0.001 (p = 0.352) | –0.004 (p = 0.312) | –0.001 (p = 0.853) |
Appendix 4—table 6
Linear mixed model results for MSE and total outlier count (tOC) under age-skewed sampling for models trained with UKB and transferred to OASIS-3 or AIBL.

Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), age sampling strategy (Representative, Left-skewed, Right-skewed), age, sex, and their interactions on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. Representative sampling, HC, and female sex are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects.

| Term | OASIS-3 MSE β (p-value) | OASIS-3 tOC β (p-value) | AIBL MSE β (p-value) | AIBL tOC β (p-value) |
| --- | --- | --- | --- | --- |
| Intercept | –0.475 (p < 0.001) | 1.421 (p = 0.448) | –0.475 (p = 0.345) | 7.329 (p = 0.072) |
| n | –0.172 (p = 0.004) | –0.261 (p < 0.001) | –0.439 (p = 0.154) | –1.094 (p < 0.001) |
| AD | –0.013 (p = 0.893) | 0.355 (p = 0.907) | –0.061 (p = 0.930) | –8.922 (p = 0.116) |
| left | 0.187 (p = 0.025) | 0.215 (p = 0.002) | –1.989 (p < 0.001) | –0.726 (p = 0.011) |
| right | 0.009 (p = 0.918) | 0.397 (p < 0.001) | 1.258 (p = 0.004) | 5.751 (p < 0.001) |
| n:AD | 0.013 (p = 0.889) | 0.110 (p = 0.167) | 0.146 (p = 0.733) | 0.843 (p = 0.003) |
| n:left | –0.063 (p = 0.447) | 0.019 (p = 0.783) | 0.138 (p = 0.752) | 0.206 (p = 0.469) |
| n:right | 0.001 (p = 0.990) | 0.109 (p = 0.117) | –0.404 (p = 0.354) | 0.013 (p = 0.964) |
| left:AD | 0.031 (p = 0.818) | –0.213 (p = 0.058) | 4.092 (p < 0.001) | 0.250 (p = 0.528) |
| right:AD | –0.000 (p = 0.999) | 0.058 (p = 0.608) | –0.663 (p = 0.274) | –5.204 (p < 0.001) |
| n:left:AD | –0.002 (p = 0.987) | 0.041 (p = 0.712) | 0.040 (p = 0.947) | –0.094 (p = 0.812) |
| n:right:AD | 0.012 (p = 0.930) | 0.016 (p = 0.889) | 0.556 (p = 0.359) | –0.058 (p = 0.884) |
| age | –0.000 (p = 0.743) | –0.024 (p = 0.366) | –0.001 (p = 0.848) | –0.104 (p = 0.064) |
| left:age | 0.000 (p = 0.907) | –0.001 (p = 0.160) | 0.037 (p < 0.001) | 0.009 (p = 0.021) |
| right:age | 0.000 (p = 0.994) | –0.005 (p < 0.001) | –0.012 (p = 0.041) | –0.074 (p < 0.001) |
| AD:age | 0.000 (p = 0.850) | 0.002 (p = 0.955) | 0.001 (p = 0.930) | 0.122 (p = 0.117) |
| left:AD:age | –0.000 (p = 0.940) | 0.004 (p = 0.010) | –0.056 (p < 0.001) | –0.003 (p = 0.551) |
| right:AD:age | –0.000 (p = 0.993) | –0.001 (p = 0.662) | 0.009 (p = 0.264) | 0.072 (p < 0.001) |
| n:age | 0.000 (p = 0.628) | 0.002 (p = 0.003) | 0.002 (p = 0.599) | 0.014 (p < 0.001) |
| n:left:age | –0.000 (p = 0.857) | –0.000 (p = 0.778) | –0.005 (p = 0.444) | –0.002 (p = 0.527) |
| n:right:age | 0.000 (p = 0.994) | –0.001 (p = 0.161) | 0.004 (p = 0.454) | –0.001 (p = 0.887) |
| n:AD:age | –0.000 (p = 0.822) | –0.002 (p = 0.127) | –0.002 (p = 0.735) | –0.012 (p = 0.002) |
| n:left:AD:age | –0.000 (p = 0.960) | –0.001 (p = 0.665) | –0.000 (p = 0.959) | 0.001 (p = 0.810) |
| n:right:AD:age | –0.000 (p = 0.928) | –0.000 (p = 0.968) | –0.008 (p = 0.360) | 0.001 (p = 0.921) |
| sex | –0.013 (p = 0.708) | –0.983 (p = 0.377) | –0.088 (p = 0.773) | –3.181 (p = 0.197) |
| left:sex | –0.025 (p = 0.611) | –0.150 (p < 0.001) | 0.386 (p = 0.142) | 0.166 (p = 0.335) |
| right:sex | 0.001 (p = 0.988) | –0.152 (p < 0.001) | –0.102 (p = 0.697) | –1.733 (p < 0.001) |
| AD:sex | 0.017 (p = 0.775) | 1.334 (p = 0.488) | 0.125 (p = 0.768) | 10.016 (p = 0.004) |
| left:AD:sex | –0.038 (p = 0.654) | 0.320 (p < 0.001) | –1.816 (p < 0.001) | –0.124 (p = 0.605) |
| right:AD:sex | 0.002 (p = 0.982) | 0.079 (p = 0.270) | 0.351 (p = 0.338) | 3.313 (p < 0.001) |
| n:sex | 0.019 (p = 0.592) | 0.094 (p = 0.001) | 0.119 (p = 0.521) | 0.170 (p = 0.161) |
| n:left:sex | 0.015 (p = 0.759) | 0.009 (p = 0.835) | 0.170 (p = 0.520) | 0.042 (p = 0.808) |
| n:right:sex | 0.002 (p = 0.960) | –0.064 (p = 0.120) | 0.130 (p = 0.622) | –0.178 (p = 0.300) |
| n:AD:sex | –0.022 (p = 0.710) | –0.181 (p < 0.001) | –0.186 (p = 0.474) | –0.402 (p = 0.018) |
| n:left:AD:sex | 0.007 (p = 0.933) | –0.005 (p = 0.938) | –0.263 (p = 0.472) | 0.020 (p = 0.934) |
| n:right:AD:sex | –0.013 (p = 0.881) | 0.032 (p = 0.654) | –0.333 (p = 0.364) | 0.257 (p = 0.284) |
| age:sex | 0.000 (p = 0.715) | 0.013 (p = 0.435) | 0.001 (p = 0.774) | 0.042 (p = 0.216) |
| left:age:sex | 0.001 (p = 0.484) | 0.002 (p = 0.005) | –0.005 (p = 0.138) | –0.002 (p = 0.403) |
| right:age:sex | –0.000 (p = 0.975) | 0.002 (p < 0.001) | 0.001 (p = 0.725) | 0.023 (p < 0.001) |
| AD:age:sex | –0.000 (p = 0.761) | –0.017 (p = 0.511) | –0.002 (p = 0.778) | –0.129 (p = 0.006) |
| left:AD:age:sex | 0.000 (p = 0.814) | –0.004 (p < 0.001) | 0.024 (p < 0.001) | 0.001 (p = 0.687) |
| right:AD:age:sex | –0.000 (p = 0.996) | –0.001 (p = 0.345) | –0.005 (p = 0.367) | –0.043 (p < 0.001) |
| n:age:sex | –0.000 (p = 0.608) | –0.001 (p = 0.020) | –0.002 (p = 0.522) | –0.002 (p = 0.174) |
| n:left:age:sex | –0.000 (p = 0.693) | –0.000 (p = 0.795) | –0.002 (p = 0.521) | –0.001 (p = 0.799) |
| n:right:age:sex | –0.000 (p = 0.971) | 0.001 (p = 0.157) | –0.002 (p = 0.637) | 0.002 (p = 0.335) |
| n:AD:age:sex | 0.000 (p = 0.688) | 0.002 (p = 0.002) | 0.002 (p = 0.493) | 0.005 (p = 0.024) |
| n:left:AD:age:sex | –0.000 (p = 0.991) | 0.000 (p = 0.891) | 0.004 (p = 0.471) | –0.000 (p = 0.956) |
| n:right:AD:age:sex | 0.000 (p = 0.897) | –0.000 (p = 0.666) | 0.004 (p = 0.379) | –0.003 (p = 0.311) |
Appendix 4—table 7
Linear mixed model results for MSE and total outlier count (tOC) under sex-imbalanced sampling for models trained with UKB and transferred to OASIS-3 or AIBL.

Models evaluate the influence of diagnosis (HC, AD), log-transformed and standardized sample size (n), sex ratio in the training set (1:1, 1F:4M, 1F:10M, 4F:1M, 10F:1M), age, sex, and their interactions on standardized deviation score outcomes. Continuous variables were standardized to allow comparison of effect sizes. The 1:1 ratio, HC, and female sex are used as reference levels. Reported β coefficients and corresponding p-values indicate the direction and significance of the effects.

| Term | OASIS-3 MSE β (p-value) | OASIS-3 tOC β (p-value) | AIBL MSE β (p-value) | AIBL tOC β (p-value) |
| --- | --- | --- | --- | --- |
| Intercept | NA | 1.421 (p = 0.433) | –0.475 (p = 0.022) | 7.329 (p = 0.062) |
| n | –0.172 (p < 0.001) | –0.261 (p < 0.001) | –0.439 (p = 0.034) | –1.094 (p < 0.001) |
| AD | 0.000 (p = 1.000) | 0.355 (p = 0.904) | –0.061 (p = 0.831) | –8.922 (p = 0.103) |
| 10F1M | 0.014 (p = 0.822) | 0.323 (p < 0.001) | –0.026 (p = 0.929) | –0.038 (p = 0.844) |
| 4F1M | 0.006 (p = 0.929) | 0.216 (p < 0.001) | –0.027 (p = 0.925) | 0.033 (p = 0.864) |
| 1F4M | 0.015 (p = 0.821) | –0.009 (p = 0.880) | 0.088 (p = 0.763) | 0.435 (p = 0.025) |
| 1F10M | 0.016 (p = 0.809) | 0.027 (p = 0.647) | 0.072 (p = 0.805) | 0.425 (p = 0.029) |
| n:AD | 0.013 (p = 0.856) | 0.110 (p = 0.108) | 0.146 (p = 0.611) | 0.843 (p < 0.001) |
| n:10F1M | 0.006 (p = 0.920) | 0.013 (p = 0.821) | 0.091 (p = 0.756) | 0.130 (p = 0.504) |
| n:4F1M | 0.007 (p = 0.910) | –0.002 (p = 0.968) | 0.072 (p = 0.805) | 0.009 (p = 0.963) |
| n:1F4M | –0.011 (p = 0.858) | –0.002 (p = 0.969) | –0.050 (p = 0.865) | –0.177 (p = 0.361) |
| n:1F10M | 0.003 (p = 0.964) | –0.009 (p = 0.878) | 0.130 (p = 0.656) | 0.029 (p = 0.883) |
| 10F1M:AD | 0.004 (p = 0.969) | –0.409 (p < 0.001) | 0.050 (p = 0.902) | 0.424 (p = 0.117) |
| 4F1M:AD | 0.000 (p = 0.999) | –0.267 (p = 0.006) | 0.021 (p = 0.959) | 0.347 (p = 0.200) |
| 1F4M:AD | –0.003 (p = 0.974) | –0.002 (p = 0.983) | –0.135 (p = 0.741) | –0.537 (p = 0.047) |
| 1F10M:AD | –0.000 (p = 0.997) | –0.015 (p = 0.873) | –0.128 (p = 0.754) | –0.756 (p = 0.005) |
| n:10F1M:AD | –0.008 (p = 0.937) | 0.015 (p = 0.878) | –0.105 (p = 0.796) | –0.513 (p = 0.058) |
| n:4F1M:AD | –0.004 (p = 0.969) | 0.049 (p = 0.611) | –0.047 (p = 0.909) | –0.301 (p = 0.265) |
| n:1F4M:AD | 0.014 (p = 0.893) | 0.019 (p = 0.847) | 0.130 (p = 0.750) | 0.198 (p = 0.465) |
| n:1F10M:AD | 0.004 (p = 0.968) | 0.037 (p = 0.698) | –0.108 (p = 0.791) | 0.127 (p = 0.640) |
| age | 0.000 (p = 1.000) | –0.024 (p = 0.350) | –0.001 (p = 0.641) | –0.104 (p = 0.055) |
| 10F1M:age | –0.000 (p = 0.943) | –0.004 (p < 0.001) | 0.001 (p = 0.849) | 0.001 (p = 0.677) |
| 4F1M:age | –0.000 (p = 0.963) | –0.003 (p = 0.003) | 0.001 (p = 0.895) | –0.000 (p = 0.994) |
| 1F4M:age | –0.000 (p = 0.932) | –0.000 (p = 0.968) | –0.001 (p = 0.826) | –0.006 (p = 0.027) |
| 1F10M:age | –0.000 (p = 0.977) | –0.000 (p = 0.567) | –0.000 (p = 0.921) | –0.006 (p = 0.028) |
| AD:age | –0.000 (p = 1.000) | 0.002 (p = 0.954) | 0.001 (p = 0.831) | 0.122 (p = 0.104) |
| 10F1M:AD:age | –0.000 (p = 0.988) | 0.006 (p < 0.001) | –0.001 (p = 0.905) | –0.005 (p = 0.156) |
| 4F1M:AD:age | 0.000 (p = 0.992) | 0.004 (p = 0.002) | –0.000 (p = 0.963) | –0.004 (p = 0.236) |
| 1F4M:AD:age | 0.000 (p = 0.967) | –0.000 (p = 0.895) | 0.002 (p = 0.738) | 0.007 (p = 0.050) |
| 1F10M:AD:age | 0.000 (p = 0.989) | 0.000 (p = 0.928) | 0.002 (p = 0.752) | 0.010 (p = 0.006) |
| n:age | 0.000 (p = 0.529) | 0.002 (p < 0.001) | 0.002 (p = 0.434) | 0.014 (p < 0.001) |
| n:10F1M:age | –0.000 (p = 0.980) | –0.000 (p = 0.831) | –0.001 (p = 0.747) | –0.002 (p = 0.475) |
| n:4F1M:age | –0.000 (p = 0.991) | 0.000 (p = 0.939) | –0.001 (p = 0.802) | –0.000 (p = 0.893) |
| n:1F4M:age | 0.000 (p = 0.862) | 0.000 (p = 0.887) | 0.001 (p = 0.852) | 0.002 (p = 0.378) |
| n:1F10M:age | –0.000 (p = 0.993) | 0.000 (p = 0.806) | –0.002 (p = 0.662) | –0.000 (p = 0.888) |
| n:AD:age | –0.000 (p = 0.770) | –0.002 (p = 0.076) | –0.002 (p = 0.614) | –0.012 (p < 0.001) |
| n:10F1M:AD:age | 0.000 (p = 0.949) | –0.000 (p = 0.894) | 0.001 (p = 0.800) | 0.007 (p = 0.062) |
| n:4F1M:AD:age | 0.000 (p = 0.978) | –0.001 (p = 0.619) | 0.001 (p = 0.913) | 0.004 (p = 0.271) |
| n:1F4M:AD:age | –0.000 (p = 0.886) | –0.000 (p = 0.859) | –0.002 (p = 0.746) | –0.003 (p = 0.463) |
| n:1F10M:AD:age | –0.000 (p = 0.969) | –0.001 (p = 0.648) | 0.001 (p = 0.798) | –0.002 (p = 0.660) |
| sex | NA | –0.983 (p = 0.361) | –0.088 (p = 0.482) | –3.181 (p = 0.181) |
| 10F1M:sex | –0.004 (p = 0.907) | –0.165 (p < 0.001) | 0.016 (p = 0.926) | 0.133 (p = 0.259) |
| 4F1M:sex | –0.003 (p = 0.939) | –0.114 (p = 0.001) | 0.002 (p = 0.990) | 0.040 (p = 0.734) |
| 1F4M:sex | –0.003 (p = 0.943) | 0.016 (p = 0.650) | –0.045 (p = 0.801) | –0.157 (p = 0.181) |
| 1F10M:sex | –0.000 (p = 0.995) | 0.008 (p = 0.822) | –0.055 (p = 0.755) | –0.148 (p = 0.207) |
| AD:sex | NA | 1.334 (p = 0.474) | 0.125 (p = 0.473) | 10.016 (p = 0.002) |
| 10F1M:AD:sex | 0.001 (p = 0.988) | 0.335 (p < 0.001) | –0.003 (p = 0.989) | –0.107 (p = 0.514) |
| 4F1M:AD:sex | 0.001 (p = 0.989) | 0.213 (p < 0.001) | 0.008 (p = 0.975) | –0.105 (p = 0.520) |
| 1F4M:AD:sex | 0.001 (p = 0.989) | –0.022 (p = 0.715) | 0.086 (p = 0.728) | 0.308 (p = 0.060) |
| 1F10M:AD:sex | 0.003 (p = 0.970) | –0.002 (p = 0.968) | 0.111 (p = 0.652) | 0.472 (p = 0.004) |
| n:sex | 0.019 (p = 0.486) | 0.094 (p < 0.001) | 0.119 (p = 0.340) | 0.170 (p = 0.040) |
| n:10F1M:sex | 0.002 (p = 0.958) | –0.006 (p = 0.873) | –0.027 (p = 0.879) | –0.064 (p = 0.588) |
| n:4F1M:sex | 0.002 (p = 0.957) | 0.002 (p = 0.946) | –0.005 (p = 0.978) | 0.002 (p = 0.986) |
| n:1F4M:sex | 0.006 (p = 0.879) | 0.003 (p = 0.928) | 0.030 (p = 0.867) | 0.097 (p = 0.408) |
| n:1F10M:sex | –0.003 (p = 0.944) | 0.006 (p = 0.864) | –0.011 (p = 0.952) | 0.044 (p = 0.705) |
| n:AD:sex | –0.022 (p = 0.629) | –0.181 (p < 0.001) | –0.186 (p = 0.286) | –0.402 (p < 0.001) |
| n:10F1M:AD:sex | 0.001 (p = 0.992) | 0.005 (p = 0.933) | 0.036 (p = 0.885) | 0.288 (p = 0.079) |
| n:4F1M:AD:sex | 0.001 (p = 0.992) | –0.003 (p = 0.957) | –0.003 (p = 0.989) | 0.135 (p = 0.408) |
| n:1F4M:AD:sex | –0.002 (p = 0.977) | 0.016 (p = 0.794) | –0.067 (p = 0.784) | –0.085 (p = 0.604) |
| n:1F10M:AD:sex | –0.002 (p = 0.971) | 0.001 (p = 0.985) | 0.010 (p = 0.969) | –0.100 (p = 0.541) |
| age:sex | 0.000 (p = 1.000) | 0.013 (p = 0.421) | 0.001 (p = 0.485) | 0.042 (p = 0.199) |
| 10F1M:age:sex | 0.000 (p = 0.910) | 0.002 (p < 0.001) | –0.000 (p = 0.920) | –0.002 (p = 0.221) |
| 4F1M:age:sex | 0.000 (p = 0.937) | 0.001 (p = 0.006) | –0.000 (p = 0.986) | –0.001 (p = 0.666) |
| 1F4M:age:sex | 0.000 (p = 0.930) | –0.000 (p = 0.773) | 0.001 (p = 0.797) | 0.002 (p = 0.166) |
| 1F10M:age:sex | 0.000 (p = 0.989) | –0.000 (p = 0.962) | 0.001 (p = 0.748) | 0.002 (p = 0.175) |
| AD:age:sex | 0.000 (p = 1.000) | –0.017 (p = 0.497) | –0.002 (p = 0.494) | –0.129 (p = 0.004) |
| 10F1M:AD:age:sex | –0.000 (p = 0.981) | –0.005 (p < 0.001) | 0.000 (p = 0.987) | 0.002 (p = 0.488) |
| 4F1M:AD:age:sex | –0.000 (p = 0.984) | –0.003 (p < 0.001) | –0.000 (p = 0.974) | 0.001 (p = 0.505) |
| 1F4M:AD:age:sex | –0.000 (p = 0.980) | 0.000 (p = 0.682) | –0.001 (p = 0.729) | –0.004 (p = 0.070) |
| 1F10M:AD:age:sex | –0.000 (p = 0.973) | 0.000 (p = 0.945) | –0.002 (p = 0.653) | –0.006 (p = 0.006) |
| n:age:sex | –0.000 (p = 0.505) | –0.001 (p = 0.007) | –0.002 (p = 0.341) | –0.002 (p = 0.047) |
| n:10F1M:age:sex | –0.000 (p = 0.948) | 0.000 (p = 0.858) | 0.000 (p = 0.871) | 0.001 (p = 0.568) |
| n:4F1M:age:sex | –0.000 (p = 0.941) | –0.000 (p = 0.953) | 0.000 (p = 0.970) | 0.000 (p = 0.966) |
| n:1F4M:age:sex | –0.000 (p = 0.839) | –0.000 (p = 0.872) | –0.000 (p = 0.871) | –0.001 (p = 0.414) |
| n:1F10M:age:sex | 0.000 (p = 0.960) | –0.000 (p = 0.794) | 0.000 (p = 0.953) | –0.001 (p = 0.706) |
| n:AD:age:sex | 0.000 (p = 0.601) | 0.002 (p < 0.001) | 0.002 (p = 0.308) | 0.005 (p < 0.001) |
| n:10F1M:AD:age:sex | –0.000 (p = 0.995) | –0.000 (p = 0.940) | –0.000 (p = 0.886) | –0.004 (p = 0.084) |
| n:4F1M:AD:age:sex | –0.000 (p = 0.996) | 0.000 (p = 0.928) | 0.000 (p = 0.989) | –0.002 (p = 0.400) |
| n:1F4M:AD:age:sex | 0.000 (p = 0.955) | –0.000 (p = 0.847) | 0.001 (p = 0.782) | 0.001 (p = 0.620) |
| n:1F10M:AD:age:sex | 0.000 (p = 0.981) | 0.000 (p = 0.930) | –0.000 (p = 0.979) | 0.001 (p = 0.573) |

Cite this article

  1. Camille Elleaume
  2. Bruno Hebling Vieira
  3. Dorothea L Floris
  4. Nicolas Langer
(2026)
The influence of sample size and covariate distributions on neuroanatomical normative modeling
eLife 14:RP108952.
https://doi.org/10.7554/eLife.108952.3