Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Editors
- Reviewing Editor: Mishaela Rubin, Columbia University Medical Center, New York, United States of America
- Senior Editor: Diane Harper, University of Michigan–Ann Arbor, Ann Arbor, United States of America
Reviewer #1 (Public Review):
This is an interesting study, covering a future direction for the diagnosis of osteoporosis.
Strengths: well-validated cohorts, authors who are leading experts in the field, and the use of advanced technology.
Weaknesses: the approach is still very experimental and far from being clinically relevant.
The authors have performed a very interesting analysis combining data from different, well-designed cohorts.
The authors are leaders in the field. The topic is of interest, the statistical analysis is well designed, and the paper is well written and easy to read, even for non-experts.
I have a few comments:
- Although the authors are very optimistic about HR-pQCT, they should recognize (and acknowledge in the discussion) that their data have very low clinical impact for the majority of the population. The cost of the machine is still prohibitive for the majority of clinical centers, the technology needs further validation outside reference centers, and there is considerable controversy over the methodology for cortical porosity. Essentially, 20 years after its introduction, it remains more a research tool than a clinical opportunity. This comment is of course not directed against the scientific hypothesis or the conduct of the study, which remain brilliant.
- How have the authors managed the role of possible secondary causes of osteoporosis? Did they exclude patients with GIOP, for example? Are all study subjects treatment-naïve?
- It would be worthwhile to better describe the role of cortical porosity and the predictive value of this parameter, which has been extensively studied by Dr Seeman.
Reviewer #2 (Public Review):
The authors apply a deep learning approach to predict fracture using forearm HR-pQCT data pooled from 3 longitudinal cohorts totaling 2,666 postmenopausal women. The deep-learning-based 'Structural Fragility Score - AI' (SFS-AI) was compared to FRAX with BMD and to BMD alone in its ability to identify women who went on to fracture within the next 5 years. SFS-AI performed significantly better than FRAX with BMD and BMD alone on all metrics except specificity. This work establishes that deep learning methods applied to HR-pQCT data have great potential for use in predicting (and therefore preventing) fractures.
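As an illustration of the head-to-head metrics discussed here (this is not the authors' code, and the arrays below are synthetic stand-ins for the study data), a minimal Python sketch of how sensitivity, specificity, and AUC at fixed thresholds can be computed:

```python
# Minimal sketch with synthetic scores; `sfs_ai` and `frax` stand in for
# model output and FRAX 10-year risk on a test set (hypothetical data).
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def sens_spec(y_true, score, threshold):
    """Sensitivity and specificity of the binary rule score > threshold."""
    pred = (score > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)                                   # 1 = fracture within 5 years
sfs_ai = np.clip(0.4 * y + rng.normal(0.3, 0.2, 500), 0, 1)   # score in [0, 1]
frax = np.clip(10 * y + rng.normal(10, 6, 500), 0, 60)        # 10-year % risk

print("SFS-AI AUC:", roc_auc_score(y, sfs_ai))
print("SFS-AI sens/spec @ 0.5:", sens_spec(y, sfs_ai, 0.5))
print("FRAX   sens/spec @ 20%:", sens_spec(y, frax, 20.0))
```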
The low specificity of SFS-AI compared to FRAX and BMD is not adequately acknowledged or addressed: will this lead to overdiagnosis and unnecessary interventions, and is that a problem?
The paper does not adequately address the relative role of bone vs soft tissue features in the determination of SFS-AI. It would be possible to feed the algorithm only the segmented bone volumes, and compare AUC, etc, of SFS-AI (bone) to that acquired using the entire bone + muscle volume. It's possible (likely?) that most of the predictive power will remain. If muscle is an important part of this algorithm, then mid-diaphyseal tibia scans will be an interesting next application - since that scan site is closer to the muscle belly compared to the distal radius site which contains very little muscle volume.
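One way to carry out the ablation suggested above, sketched here under the assumption that the two model variants (full volume vs. bone-only input) have already been run on the held-out scans; the score arrays are synthetic placeholders, not results from the study:

```python
# Bootstrap the AUC difference between two input variants of the same model.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y, score_a, score_b, n_boot=2000, seed=0):
    """Mean and 95% CI of AUC(score_a) - AUC(score_b) under resampling."""
    rng = np.random.default_rng(seed)
    diffs = []
    n = len(y)
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, n)
        if len(np.unique(y[idx])) < 2:      # AUC needs both classes in the resample
            continue
        diffs.append(roc_auc_score(y[idx], score_a[idx]) -
                     roc_auc_score(y[idx], score_b[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(np.mean(diffs)), (lo, hi)

rng = np.random.default_rng(3)
y = rng.integers(0, 2, 400)                       # 1 = fracture within 5 years (synthetic)
sfs_full = 0.50 * y + rng.normal(0.3, 0.25, 400)  # bone + muscle input (synthetic)
sfs_bone = 0.45 * y + rng.normal(0.3, 0.25, 400)  # bone-only input (synthetic)
print(bootstrap_auc_diff(y, sfs_full, sfs_bone))
```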
Reviewer #3 (Public Review):
This work presents a novel approach for predicting fracture risk from high-resolution peripheral quantitative computed tomography (HR-pQCT): a deep learning model is trained to predict five-year fracture risk with the full 3D HR-pQCT image as the sole input. Prior studies have developed models, of varying complexity, to predict fracture risk from HR-pQCT. However, this study is novel in that neither the typical manual effort required for HR-pQCT image analysis nor additional biomarker collection is required, simplifying potential clinical implementation. The authors show that their model predicts fracture within five years with greater sensitivity than FRAX (with an assumed diagnostic threshold of FRAX > 20% or T-score < -2.5 SD), albeit with reduced specificity. The authors further investigate how their model output, the structural fragility score derived by artificial intelligence (SFS-AI), is correlated with two microarchitectural parameters that can be measured with HR-pQCT, demonstrating that their model captures many relevant characteristics of a patient's bone quality that cannot be captured by the standard clinical tools used to diagnose osteoporosis, and thus to identify patients at elevated risk of fracture.
Strengths
The authors use a very large dataset and a combination of state-of-the-art methods for training and validating their fracture prediction model: k-fold cross-validation is used for training and a held-out external test dataset is used to evaluate ensembled model predictions compared to the current clinical standard for fracture screening. The results with the test dataset show that the model can identify women at risk of fracture in the next five years with greater sensitivity than both FRAX with BMD and BMD alone.
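The training and evaluation scheme described above (k-fold cross-validation followed by ensembling the fold models on a held-out external test set) can be sketched as follows; the logistic regression and random features are stand-ins for the authors' 3D CNN and HR-pQCT volumes:

```python
# Sketch of k-fold training with ensembled prediction on an external test set.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(400, 16)), rng.integers(0, 2, 400)
X_test, y_test = rng.normal(size=(200, 16)), rng.integers(0, 2, 200)

fold_scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, _ in cv.split(X_train, y_train):
    model = LogisticRegression(max_iter=1000).fit(X_train[train_idx], y_train[train_idx])
    fold_scores.append(model.predict_proba(X_test)[:, 1])

ensemble = np.mean(fold_scores, axis=0)           # average the 5 fold models
print("ensembled test AUC:", roc_auc_score(y_test, ensemble))
```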
Because the model takes only a full 3D HR-pQCT image as input, the feasibility of clinical implementation is maximized. Standard morphological analysis with HR-pQCT is semi-automated and the labour required for the manual portions of analysis poses a significant barrier to clinical implementation. There is mounting evidence for the clinical utility of HR-pQCT (see Gazzotti et al. Br. J. Radiol. 2023) and fully automated models such as the one presented in this work will be critical for making clinical applications of HR-pQCT feasible.
The authors quantify the contributions to the variance of the model output and examine activation maps overlaid on the HR-pQCT images. These sub-analyses indicate that the model is identifying relevant characteristics of hierarchical bone structure for fracture prediction that are not available from aBMD measurements from DXA and thus are not accounted for in the current standard clinical diagnostic tool.
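For readers unfamiliar with activation maps, a common way to produce them is Grad-CAM; the review does not state which method the authors used, so the sketch below is only an assumed illustration with a toy 3D CNN, not the authors' network or attribution pipeline:

```python
# Grad-CAM sketch for a toy 3D CNN on a fake HR-pQCT-like volume.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, 1),
)
target_layer = net[0]

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 1, 32, 32, 32, requires_grad=True)   # fake input volume
net(x).squeeze().backward()                              # gradient of the score

weights = grads["g"].mean(dim=(2, 3, 4), keepdim=True)   # channel-wise gradient averages
cam = torch.relu((weights * acts["a"]).sum(dim=1))       # activation map
print(cam.shape)  # (1, 32, 32, 32): overlay on the input volume
```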
Weaknesses
The authors make the claim that SFS-AI outperforms FRAX with BMD and BMD in terms of sensitivity and specificity of predicting fragility fractures within 5 years. This claim is supported by looking at the ROCs in Figure 1, but the specific comparison made in the discussion is not completely fair as currently presented in the article. The thresholds of FRAX > 20% and T-score < -2.5 SD were selected by the authors for binary comparison. FRAX and BMD achieve specificities of ~95% at these thresholds, while SFS-AI achieves a specificity of only 77% at the selected threshold, SFS-AI > 0.5. Conversely, SFS-AI achieves a sensitivity of 50% to 60% while FRAX and BMD achieve very poor sensitivities, between 4% and 16%. The authors have not justified their choice of binarization thresholds for FRAX or BMD by citing literature or clinical guidelines, nor have they motivated their choice of any of the thresholds with a discussion of how clinical considerations could influence the sensitivity-specificity trade-off. It is difficult to directly compare the prognostic performance of SFS-AI to that of FRAX or BMD when the thresholds for FRAX and BMD sit at such different locations on their respective ROCs compared to where the threshold for SFS-AI places it on its ROC. The authors have also not compared their estimates of the sensitivity and specificity of FRAX and BMD to the literature to provide important context for the comparison to SFS-AI. An additional unacknowledged limitation is that the FRAX tool is designed to predict 10-year fracture risk, while the outcome used to train the SFS-AI model and to compare to FRAX was 5-year fracture risk.
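One way to make the comparison at matched operating points, rather than at arbitrary thresholds, is to read sensitivity off each ROC at the same specificity. A minimal sketch under that assumption, with synthetic scores standing in for SFS-AI and FRAX on the test set:

```python
# Compare sensitivities at matched specificities instead of at fixed thresholds.
import numpy as np
from sklearn.metrics import roc_curve

def sensitivity_at_specificity(y, score, target_spec):
    """Best sensitivity achievable while keeping specificity >= target_spec."""
    fpr, tpr, _ = roc_curve(y, score)
    ok = (1 - fpr) >= target_spec
    return tpr[ok].max() if ok.any() else 0.0

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 500)                       # 1 = fracture within 5 years (synthetic)
sfs_ai = 0.5 * y + rng.normal(0.3, 0.25, 500)     # synthetic SFS-AI scores
frax = 8.0 * y + rng.normal(12.0, 7.0, 500)       # synthetic FRAX 10-year % risk

for spec in (0.77, 0.95):
    print(f"spec={spec}: "
          f"SFS-AI sens={sensitivity_at_specificity(y, sfs_ai, spec):.2f}, "
          f"FRAX sens={sensitivity_at_specificity(y, frax, spec):.2f}")
```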
Direct comparison may be impossible due to differences in study design or reported performance metrics, but the authors have not at all discussed the quantitative performance of prior models for fracture prediction or discrimination that use HR-pQCT (see Lu et al. Bone 2023 or Whittier et al. JBMR 2023) to contextualize the performance of their novel model. While the model presented in this article has the advantage that it does not require the typical expertise and manual effort needed for HR-pQCT image analysis, it is still important to acknowledge the potential trade-off of ease of implementation vs performance. Models that incorporate additional clinical data or that use standard HR-pQCT analysis outputs rather than raw images may perform well enough to justify the increase in the difficulty of clinical implementation or to motivate further work on fully automating microarchitectural analysis with HR-pQCT images.
Finally, the article does not indicate that either the code used for model training or the trained model itself will be made publicly available. This limits the ability of future researchers to replicate and build on the results presented in the article.