Automating clinical assessments of memory deficits: Deep Learning based scoring of the Rey-Osterrieth Complex Figure

  1. Methods of Plasticity Research, Department of Psychology, University of Zurich, Zurich, Switzerland
  2. University Research Priority Program (URPP) Dynamics of Healthy Aging, Zurich, Switzerland
  3. Neuroscience Center Zurich (ZNZ), University of Zurich and ETH Zurich, Zurich, Switzerland
  4. Department of Computer Science, ETH, Zurich, Switzerland
  5. Virginia Commonwealth University, Richmond, Virginia
  6. Department of Health Science, Public University of Navarre, Pamplona, Spain
  7. Instituto de Investigación Sanitaria de Navarra (IdiSNA), Pamplona, Spain
  8. "Rita Levi Montalcini" Department of Neurosciences, University of Turin, Italy
  9. I.R.C.C.S. Istituto Auxologico Italiano, U.O. di Neurologia e Neuroriabilitazione, Ospedale San Giuseppe, Piancavallo (VCO), Italy
  10. Huashan Hospital, Shanghai, China
  11. Smartcode, Zurich, Switzerland
  12. University Children’s Hospital Zurich, Child Development Center, Zurich, Switzerland
  13. Rehabilitation Center, Valens, Switzerland
  14. University Hospital Magdeburg, University Department of Neurology, Magdeburg, Germany
  15. MRC Cognition and Brain Sciences Unit, University of Cambridge, Cambridge, United Kingdom
  16. Department of Neurophysics, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
  17. Stanford University, Stanford, CA, United States

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.


Editors

  • Reviewing Editor
    Juan Zhou
    National University of Singapore, Singapore, Singapore
  • Senior Editor
    Yanchao Bi
    Beijing Normal University, Beijing, China

Reviewer #1 (Public Review):

Summary:

The authors aimed to develop and validate an automated, deep learning-based system for scoring the Rey-Osterrieth Complex Figure Test (ROCF), a widely used tool in neuropsychology for assessing memory deficits. Their goal was to overcome the limitations of manual scoring, such as subjectivity and time consumption, by creating a model that provides automatic, accurate, objective, and efficient assessments of memory deterioration in individuals with various neurological and psychiatric conditions.

Strengths:

Comprehensive Data Collection:
The authors collected over 20,000 hand-drawn ROCF images from a wide demographic and geographic range, ensuring a robust and diverse dataset. This extensive data collection is critical for training a generalizable and effective deep learning model.

Advanced Deep Learning Approach:
Utilizing a multi-head convolutional neural network to automate ROCF scoring represents a sophisticated application of current AI technologies. This approach allows for detailed analysis of individual figure elements, potentially increasing the accuracy and reliability of assessments.

Validation and Performance Assessment:
The model's performance was rigorously evaluated against crowdsourced human intelligence and professional clinician scores, demonstrating its ability to outperform both groups. The inclusion of an independent prospective validation study further strengthens the credibility of the results.

Robustness Analysis Efficacy:
The model underwent a thorough robustness analysis, testing its adaptability to variations in rotation, perspective, brightness, and contrast. Such meticulous examination ensures the model's consistent performance across different clinical imaging scenarios, significantly bolstering its utility for real-world applications.
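A robustness check of this kind can be sketched as a perturbation sweep: apply a controlled change to a drawing, rescore it, and record how far the score moves. The following is a minimal illustration in plain Python, not the authors' pipeline. The helper names and `toy_score` are hypothetical stand-ins, the image is a tiny grayscale grid with values in [0, 1], and only brightness and contrast are covered (rotation and perspective would need an image library).

```python
from statistics import mean


def adjust_brightness(img, delta):
    """Shift every pixel by delta and clamp the result to [0, 1]."""
    return [[min(1.0, max(0.0, p + delta)) for p in row] for row in img]


def adjust_contrast(img, factor):
    """Scale each pixel's deviation from the image mean by factor, clamped to [0, 1]."""
    mu = mean(p for row in img for p in row)
    return [[min(1.0, max(0.0, mu + factor * (p - mu))) for p in row] for row in img]


def robustness_sweep(score_fn, img, perturbations):
    """Return {name: absolute score change} for each named perturbation."""
    baseline = score_fn(img)
    return {name: abs(score_fn(fn(img)) - baseline)
            for name, fn in perturbations.items()}


def toy_score(img):
    """Hypothetical scorer (NOT the paper's model): share of dark pixels, scaled to 0-36."""
    pixels = [p for row in img for p in row]
    return 36.0 * sum(p < 0.5 for p in pixels) / len(pixels)


# Tiny made-up "drawing": low values are ink, high values are paper.
image = [[0.1, 0.9, 0.9],
         [0.1, 0.2, 0.9],
         [0.9, 0.9, 0.9]]

perturbations = {
    "brighter": lambda im: adjust_brightness(im, 0.2),
    "darker": lambda im: adjust_brightness(im, -0.2),
    "low_contrast": lambda im: adjust_contrast(im, 0.5),
}

deltas = robustness_sweep(toy_score, image, perturbations)
for name, change in deltas.items():
    print(f"{name}: score change = {change:.2f}")
```

In a real evaluation, `toy_score` would be replaced by the trained model, and the score changes would be aggregated over many drawings and perturbation strengths.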

Weaknesses:

Insufficient Network Analysis for Explainability:
The paper does not sufficiently delve into network analysis to determine whether the model's predictions are based on accurately identifying and matching the 18 items of the ROCF or if they rely on global, item-irrelevant features. This gap in analysis limits our understanding of the model's decision-making process and its clinical relevance.

Generative Model Consideration:
The critique suggests exploring generative models to model the joint distribution of images and scores, which could offer deeper insights into the relationship between scores and specific visual-spatial disabilities. The absence of this consideration in the study is seen as a missed opportunity to enhance the model's explainability and clinical utility.

Appraisal and discussion:
By leveraging a comprehensive dataset and employing advanced deep learning techniques, the authors demonstrated the model's ability to outperform both crowdsourced raters and professional clinicians in scoring the ROCF. This achievement represents a significant step forward in automating neuropsychological assessments, potentially revolutionizing how memory deficits are evaluated in clinical settings. Furthermore, the application of deep learning to clinical neuropsychology opens avenues for future research, including the potential automation of other neuropsychological tests and the integration of AI tools into clinical practice. The success of this project may encourage further exploration into how AI can be leveraged to improve diagnostic accuracy and efficiency in healthcare.

However, the critiques regarding the lack of detailed analysis across different patient demographics and the inadequacy of network explainability, together with concerns about the selection of median crowdsourced scores as ground truth, raise questions about how fully the authors met their objectives. These aspects suggest that while the aims were achieved to a considerable extent, there are areas of improvement that could make the results more robust and the conclusions stronger.
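To make the ground-truth concern concrete, a median-based consensus label can be sketched as follows. This is an illustrative assumption about how such a label might be formed, not a description of the authors' procedure: it assumes each rater scores all 18 items on the conventional 0, 0.5, 1, 2 scale, takes the median per item, and sums the medians into a total. The function and variable names are hypothetical.

```python
from statistics import median

N_ITEMS = 18                              # number of ROCF items
VALID_ITEM_SCORES = {0.0, 0.5, 1.0, 2.0}  # assumed conventional per-item scale


def consensus_scores(ratings):
    """Return (per-item medians, total) from a list of per-rater item score lists."""
    for rater in ratings:
        if len(rater) != N_ITEMS:
            raise ValueError("each rater must score all 18 items")
        if any(score not in VALID_ITEM_SCORES for score in rater):
            raise ValueError("item scores must be one of 0, 0.5, 1, 2")
    # zip(*ratings) regroups the scores item by item across raters.
    item_medians = [median(item) for item in zip(*ratings)]
    return item_medians, sum(item_medians)


# Three made-up raters scoring the same drawing.
rater_a = [2.0] * 18
rater_b = [1.0] * 18
rater_c = [2.0] * 9 + [0.5] * 9

item_medians, total = consensus_scores([rater_a, rater_b, rater_c])
print(item_medians, total)
```

With an even number of raters the median can fall between scale points (for example 0.75), and the median hides how much raters disagreed, which is one reason the choice of ground truth deserves scrutiny.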

Reviewer #2 (Public Review):

Summary:

The authors aimed to develop and validate a machine-learning-driven neural network capable of automatically scoring the Rey-Osterrieth Complex Figure. They further aimed to assess the robustness of the model to various parameters, such as tilt and perspective shift, in real drawings. The authors used a very large sample of lay workers to score figures and a large sample of trained clinicians to score a subsample of figures. Overall, the authors found their model to have exceptional accuracy and to perform similarly to crowdsourced workers and clinicians, in some cases with a lower degree of error/score dispersion than clinicians.

Strengths:

The authors used a very large dataset, including a large number of Rey-Osterrieth Complex Figures, a huge crowdsourced human worker sample, and a large clinician sample.

The authors describe their model in depth and in relatively accessible terms.

The writing style of the paper is accessible, scientific, and thorough.

Pre-registration of the prospectively collected new data was acceptable.

Weaknesses:

There is no detail on how the final scoring app can be accessed and whether it is medical device-regulated.

There is no discussion of the difference in sample sizes between the pre-registration of the prospective study and the results (e.g., the authors aimed for 500 neurological patients but reported data from 288).

Details in pre-registration and paper regarding samples obtained in the prospective study were lacking.

Demographics for the assessment of the representation of healthy and non-healthy participants were not present.

The authors achieved their aims, and their results and conclusions are supported by strong methods and analyses. The resulting app, if suitable for clinical practice, will have an impact on automated scoring, which many clinicians will be exceptionally happy with.

Reviewer #3 (Public Review):

Summary:

This study presents a valuable method for scoring a neuropsychological test, the ROCFT, by constructing an artificial intelligence model.

Strengths:

The authors assembled a very large sample collected across multiple international centers and tested the model rigorously using internet data.
The model scored the test with excellent ability, surpassing even experts. The product can run as an application on a tablet, which helps clinicians and patients.
Their method of building and testing the deep learning model is applicable to tests in all fields, not just psychology.

Weaknesses:

Considerable effort and cost were required to build a model for only a single, existing neuropsychological test.
