A: ROCF figure with its 18 scored elements. B: Demographics of the participants and the clinical population. C: Examples of hand-drawn ROCF images. D: Pie chart illustrating the proportions of the different clinical conditions. E: Performance in the copy and (immediate) recall conditions across the lifespan in the present data set. F: Distribution of the number of images for each total score (online raters).

A: Network architecture, consisting of a shared feature extractor followed by 18 item-specific feature extractors and output blocks. The shared feature extractor comprises three convolutional blocks, whereas each item-specific feature extractor has one convolutional block with global max-pooling. A convolutional block consists of two pairs of convolution and batch normalization, followed by max-pooling; ReLU activation is applied after each batch normalization, and dropout is applied after pooling. Each output block consists of two fully connected layers. B: Item-specific MAE for the regression-based network (blue) and the multilabel classification network (orange). In the final model, the regressor or classifier network was chosen per item based on its performance on the validation data set, indicated by an opaque color in the bar chart; in case of identical performance, the model with the lower variance was selected. C: Comparison of model variants; the performance of the best model on the original, retrospectively collected (green) and the independent, prospectively collected (purple) test set is displayed. Clf: multilabel classification network; Reg: regression-based network; NA: no augmentation; DA: data augmentation; TTA: test-time augmentation. D: Convergence analysis revealed that beyond ∼8000 images, no substantial improvement was achieved by including more data. E: Effect of image size on model performance, measured in terms of MAE.
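The per-item choice between the regression and classification heads described in panel B can be sketched as a simple selection rule: pick the network with the lower validation MAE, and break exact ties by the lower error variance. The sketch below is illustrative only (the metric values are made up, and the function name is hypothetical, not the authors' implementation):

```python
import numpy as np

# Hypothetical per-item validation metrics for the two candidate networks.
# 18 entries, one per ROCF item: MAE and error variance on the validation set.
rng = np.random.default_rng(0)
val_mae = {"regressor": rng.uniform(0.1, 0.4, 18),
           "classifier": rng.uniform(0.1, 0.4, 18)}
val_var = {"regressor": rng.uniform(0.01, 0.05, 18),
           "classifier": rng.uniform(0.01, 0.05, 18)}

def select_heads(val_mae, val_var):
    """For each item, pick the network with the lower validation MAE;
    break exact ties by choosing the one with the lower variance."""
    choices = []
    for i in range(len(val_mae["regressor"])):
        if val_mae["regressor"][i] < val_mae["classifier"][i]:
            choices.append("regressor")
        elif val_mae["regressor"][i] > val_mae["classifier"][i]:
            choices.append("classifier")
        else:
            better = ("regressor"
                      if val_var["regressor"][i] <= val_var["classifier"][i]
                      else "classifier")
            choices.append(better)
    return choices

heads = select_heads(val_mae, val_var)
print(heads)  # one head choice per figure item
```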

Contrasting the ratings of our model (A) and the clinicians (D) against the ground truth revealed a larger deviation from the regression line for the clinicians. Jitter is applied to better highlight dot density. The distributions of errors for our model (B) and the clinicians' ratings (E) are displayed. The MAE of our model (C) and of the clinicians (F) is displayed for each individual item of the figure (see also supplementary Table S2). The corresponding plots for performance on the prospectively collected data are shown in supplementary Figure S5. Model performance for the retrospective (green) and prospective (purple) samples across the entire range of total scores is presented for the model (G), the clinicians (H), and the online raters (I).
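The item-wise MAE reported in panels C and F is simply the mean absolute deviation between predicted and ground-truth item scores, averaged over drawings; the total-score error additionally sums the 18 items first. A minimal sketch with made-up scores (the 0–2 item scoring in half-point steps is assumed from the standard ROCF scoring system):

```python
import numpy as np

rng = np.random.default_rng(1)
n_drawings, n_items = 100, 18

# Made-up item scores: each of the 18 items scored 0-2 in 0.5 steps (total 0-36).
ground_truth = rng.integers(0, 5, size=(n_drawings, n_items)) / 2.0
predictions = np.clip(ground_truth + rng.normal(0, 0.2, size=ground_truth.shape), 0, 2)

# One MAE per figure item (as in the per-item bar charts).
item_mae = np.abs(predictions - ground_truth).mean(axis=0)

# MAE on the total score (sum over the 18 items per drawing).
total_mae = np.abs(predictions.sum(axis=1) - ground_truth.sum(axis=1)).mean()

print(item_mae.round(3))
print(round(float(total_mae), 3))
```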

A: Mean absolute error and bootstrapped 95% confidence intervals of model performance across ROCF conditions (copy and recall), demographics (age, gender), and clinical status (healthy individuals and patients) for the retrospective data. B: Model performance across different diagnostic conditions. C & D: Number of subjects in each subgroup. The same model-performance analysis for the prospective data is reported in supplementary Figure S6.
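Bootstrapped confidence intervals like those in panel A can be obtained by resampling drawings with replacement and recomputing the MAE on each resample. The sketch below assumes a simple percentile bootstrap on placeholder absolute errors; it is not the authors' exact procedure:

```python
import numpy as np

rng = np.random.default_rng(2)
# Placeholder: one absolute error per drawing in a subgroup.
abs_errors = np.abs(rng.normal(0.0, 1.5, size=500))

def bootstrap_mae_ci(errors, n_boot=2000, alpha=0.05, rng=rng):
    """Percentile-bootstrap confidence interval for the mean absolute error."""
    n = len(errors)
    # Resample drawings with replacement and recompute the MAE each time.
    boot_means = np.array([errors[rng.integers(0, n, n)].mean()
                           for _ in range(n_boot)])
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return errors.mean(), lo, hi

mae, lo, hi = bootstrap_mae_ci(abs_errors)
print(f"MAE = {mae:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```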

Robustness to geometric, brightness, and contrast variations. The MAE is depicted for different degrees of transformation. In addition, examples of the transformed ROCF drawings are provided.
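Perturbations of this kind can be generated with simple array operations: brightness and contrast are a linear transform of pixel values, and a rotation is a geometric perturbation. The sketch below uses a random array as a stand-in for a grayscale ROCF scan and a 90° rotation as a stand-in for arbitrary-angle rotation; the function names and parameters are illustrative, not the transformations used in the study:

```python
import numpy as np

def adjust_contrast_brightness(img, contrast=1.0, brightness=0.0):
    """Linear pixel transform: out = contrast * img + brightness, clipped to [0, 1]."""
    return np.clip(contrast * img + brightness, 0.0, 1.0)

def rotate90(img, k=1):
    """Geometric perturbation: rotate the image by k * 90 degrees."""
    return np.rot90(img, k)

rng = np.random.default_rng(3)
# Placeholder grayscale drawing with pixel values in [0, 1].
drawing = rng.random((232, 300))

# Apply a geometric perturbation, then lower contrast and raise brightness.
perturbed = adjust_contrast_brightness(rotate90(drawing), contrast=0.8, brightness=0.1)
print(perturbed.shape)
```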