Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Read more about eLife's peer review process.

Editors
- Reviewing Editor: Jessica Dubois, Inserm Unité NeuroDiderot, Université Paris Cité, Paris, France
- Senior Editor: Jonathan Roiser, University College London, London, United Kingdom
Reviewer #1 (Public Review):
Main contributions / strengths
The authors propose a process to improve the ground truth segmentation of fetal brain MRI via a semi-supervised approach based on several iterations of manual refinement of atlas label propagations. This procedure represents an impressive amount of work, likely resulting in a very high-quality ground truth dataset. The corrected labels (obtained from multiple different datasets) are then used to train the final model, which performs the brain extraction and tissue segmentation tasks. We also acknowledge the caution exercised by the authors regarding the future application of their pipeline to unseen datasets.
The conclusions of this paper are mostly well supported by data, but some aspects of the analysis and validation procedure need to be clarified and extended. In addition, the article would greatly benefit from providing further descriptions of crucial aspects of the study.
Main limitations and potential improvements
- New nomenclature/atlas not sufficiently described/justified.
The proposed nomenclature and atlas are among the main contributions of this work, and we clearly acknowledge their importance for the community. Defining any nomenclature implies decisions about the acceptable level of ambiguity when identifying the boundary between neighboring anatomical structures relative to the intensity gradients in the MRI. It is acceptable (and probably inevitable) to set relatively arbitrary criteria in ambiguous regions, provided that these criteria are explicitly stated. An explicit statement of the decisions taken is essential, in particular for better interpretation of residual segmentation inaccuracies in application studies.
As a matter of comparison, the postnatal atlas and nomenclature were based on the Albert protocol, which is described in extensive detail. While such a complete description might fall beyond the scope of this work, we believe that an additional description of the nomenclature and protocol, allowing reproduction of the manual segmentation on external datasets, is required, at least for the most ambiguous junctions between structures. For instance, the boundaries across substructures within the DGM are difficult to visualize on the exemplar subjects shown in Fig. 5 and Fig. 6.
Please provide additional precision on how the following were defined: boundaries between lateral ventricles and cavum; between cavum and CSF; the delineation of 3rd and 4th ventricles; the definition of the vermis, especially its junctions with the cerebellum and the brainstem.
How are these boundaries impacted by the changes in the image intensities related to tissue maturation?
We would also greatly appreciate an extension of the qualitative comparison with the two most commonly used protocols (Albert and FETA): for instance, why didn't the authors isolate the hippocampus/amygdala structure, and how is the boundary between gray and white matter then defined in this region?
- More detailed comparison with FETA for some structures would be informative despite obvious limitations.
More specifically, the GM should have a very similar definition. In the "Impact of anomalies" section (page 7), the authors compare their results with the Dice scores from the FETA challenge and conclude that the difference "highlights the advantages of using high-quality consistent ground truth labels for training". The better performance (from ~0.78 to ~0.88) might be mostly due to the improvement of the ground truth (of the test set). This could be confirmed by inspecting the FETA ground truth of the GM for a few cases in which the Dice score shows a strong increase with respect to FETA. Note that the gain in performance is appreciable even if it is due to a better ground truth.
- Improvement of the ground truth labels is an important contribution of this work, thus we would appreciate a more quantitative description of the impact of the manual correction, such as reporting the change in the Dice score induced by the correction.
Quantification of the refinement process would help to better evaluate the relevance of the proposed approach in future studies, e.g. ones introducing a different nomenclature. More specifically, a marked change would be expected after the first training, when there is a switch (and refinement) from the registration-propagated labels to the ones predicted by the DL model (as shown in Fig. 5, the changes are quite strong). Again, a Dice score indicating quantitatively how much improvement results from each iteration would be informative. Along the same lines, was the last iteration of this process needed, or did the authors observe a 'stabilization' (i.e. less and less manual editing being required)?
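The per-iteration quantification requested above amounts to computing a label-wise Dice overlap between consecutive versions of the label maps. The sketch below is purely illustrative: the toy arrays stand in for hypothetical propagated and refined segmentations, and are not data from the study under review.

```python
import numpy as np

def dice_score(a, b, label):
    """Dice overlap of one label between two segmentation maps."""
    a_mask, b_mask = (a == label), (b == label)
    denom = a_mask.sum() + b_mask.sum()
    if denom == 0:
        return 1.0  # label absent from both maps: treat as perfect agreement
    return 2.0 * np.logical_and(a_mask, b_mask).sum() / denom

# Hypothetical 3D label maps from two consecutive refinement iterations
prev_iter = np.zeros((4, 4, 4), dtype=int)
curr_iter = np.zeros((4, 4, 4), dtype=int)
prev_iter[1:3, 1:3, 1:3] = 1  # e.g. a cortical GM label
curr_iter[1:3, 1:3, 0:3] = 1  # the same label after a small boundary edit

print(round(dice_score(prev_iter, curr_iter, label=1), 3))  # → 0.8
```

Reporting this value per structure and per iteration would directly show whether the refinement loop converges (Dice approaching 1 between successive iterations).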
- The testing / training data-splitting strategy is not sufficiently detailed and is difficult to follow. The following points deserve clarification:
a) Why did the authors select only four sites for the test set (out of six studies presented in the 3.1 section)?
b) Data used for training: in the first step, the authors selected 200 images for label propagation and kept only the best 100. In the second stage, predictions are computed for the whole training/validation set (380 images) and only 200 are selected. When the process was iterated, why did the authors select only 200 out of the 380? Are the same subjects selected across iterations?
Were the acquisition parameters / gestational age controlled for each selection? If yes please specify the distributions precisely.
Did the authors control for the potential imbalance present in the dataset (more subjects from dHCP, for instance)? (Line 316: 100 subjects were selected from only three centers. Why only three? Did the authors keep the same sub-sites for the other stages?)
c) "The testing dataset includes 40 randomly selected images from four different acquisition protocols" which shows that attention was paid to variations in the scanning parameters, which is of crucial importance. However, no precision is provided regarding the gestational age of this dataset, which impedes the interpretation since a potential influence of age on the accuracy of the segmentation would be problematic. Indeed, the authors mention that the manual correction deserved special attention for late GA (>34 weeks). Please specify precisely the age distribution across the 10 subjects of each of the four acquisition protocols. In addition, the qualitative results shown in Fig.6 and subsection "Impact of GA at scan" are not sufficient and an additional result table reporting the same population and metrics as in Table 2, but dissociating younger versus older fetuses, would be much more informative to rule out potential bias related to gestational age.
d) The definition of the ground truth labels for the test set is not described.
We understand (from the results) that the ground truth for the test set is defined by manual refinement of the propagated atlas labels. This should be explicitly described on page 5, after the "Preparation of training datasets" section.
- The validation of segmentation accuracy based on the volumetry growth chart is invalid.
In Section "4.3. Growth charts of normal fetal brain development", since manual corrections were involved, the reported results cannot be considered as a validation of the segmentation pipeline. Regarding the validation of the segmentation pipeline, the quantitative and qualitative results provided in Table 2 and the corresponding text and figures seem sufficient to us (providing our concerns above are addressed, especially regarding the impact of the gestational age).
The growth charts are still valuable to support the validity of the nomenclature and segmentation protocol, but then why are the growth charts computed only for some structures? Reporting the growth chart and statistical evaluation of the impact of acquisition settings using ANCOVA for all the substructures from the proposed protocol would be expected here, in particular for the structures for which the delineation might be ambiguous such as the cavum, the vermis, and DGM substructures such as the thalamus.
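The ANCOVA suggested above (testing for an acquisition-protocol effect on each structure's volume while adjusting for gestational age) reduces to a nested-model F-test. The sketch below is a minimal self-contained illustration on synthetic data; the protocol labels, effect sizes, and sample size are all hypothetical and not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120
ga = rng.uniform(21, 38, n)                  # gestational age at scan (weeks)
protocol = rng.integers(0, 3, n)             # 3 hypothetical acquisition protocols
volume = 0.9 * ga + rng.normal(0.0, 1.5, n)  # synthetic volume: GA effect only

def ols_rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

ones = np.ones(n)
dummies = np.eye(3)[protocol][:, 1:]  # protocol indicators (first level as reference)

# Reduced model: intercept + GA; full model additionally includes protocol terms
rss_reduced = ols_rss(np.column_stack([ones, ga]), volume)
rss_full = ols_rss(np.column_stack([ones, ga, dummies]), volume)

# F-test for the protocol effect after adjusting for GA (the ANCOVA question)
df1, df2 = 2, n - 4
f_stat = ((rss_reduced - rss_full) / df1) / (rss_full / df2)
print(f"F({df1},{df2}) = {f_stat:.2f}")
```

Running this per structure (cavum, vermis, thalamus, etc.) and reporting the F statistics would give exactly the protocol-effect table the protocol validation calls for.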
Finally, please provide further details on the type and amount of manual correction needed for computing the growth charts.
- MRI data was acquired only on Philips scanners.
We acknowledge the efforts to maximize heterogeneity in the MRIs, e.g. with both 1.5T and 3T scanners and variations in TE and image resolution, but still, all MRIs included in this study were acquired using the SSTSE sequence on Philips scanners. The study does not include any MRI acquired on Siemens or GE scanners, and no image was acquired using a balanced-FFE/TrueFISP/FIESTA-type sequence. This might limit generalizability.
Reviewer #2 (Public Review):
This work presents a new, automated, deep learning-based segmentation pipeline for fetal cerebral MRI based on the anatomical definitions of the new fetal atlas of the Developing Human Connectome Project. The authors' new software pipeline demonstrated robust performance across different acquisition protocols and gestational age ranges, reducing the need for manual refinement. To provide ground truth data for training their deep learning network, the authors employed a semi-supervised approach, in which atlas labels were propagated to the datasets, and they were corrected manually.
This work stands out for its extensive training on a large number of datasets, it achieves precise anatomical definition through a refined brain tissue parcellation protocol, and it evaluates the segmentation results against growth curves, allowing for a comprehensive assessment of fetal brain development. However, because abnormal anatomy was largely unseen by the segmentation network during training, it is highly likely that the BOUNTI pipeline would produce some incorrect segmentations in subjects with moderate to large ventriculomegaly, as well as in cases of malformations of the corpus callosum, brainstem, or neural tube defects. Further work is required for BOUNTI to generalize to pathological brains, as the vast majority of fetal cerebral MRI cases in clinical practice involve such abnormalities rather than normal brain development. This step is crucial for facilitating the clinical translation of BOUNTI. The algorithm is publicly available and works without limitations on datasets acquired in other centers.
Reviewer #3 (Public Review):
This work provides a novel framework for semi-automatic segmentation and parcellation of brain tissues from fetal magnetic resonance imaging (MRI) by fusing an advanced deep learning technique with manual correction by experts. Over the broad age spectrum spanning newborns to adults, several fully automatic segmentation/parcellation techniques have been proposed, showing robust, reliable performance across MR images of varying quality. Unlike in other age groups, however, scanning of the fetal brain is conducted in the womb; thus, there are additional and unique challenges, such as ambiguous positioning of the fetal brain, the surrounding maternal tissue in the fetal MRI, and fetal and maternal motion. These challenges have collectively served as important bottlenecks in developing robust, reliable automatic segmentation/parcellation frameworks to date. This paper proposes a methodological framework for the segmentation and parcellation of fetal MRI scans using a two-step deep learning model, one step each for segmentation and parcellation. It is also noteworthy that the validity of the proposed framework has been extensively tested over datasets with different image quality and different recording parameters, so robust generalizability of the framework to other fetal MRI datasets is clearly suggested.
Strengths:
In general, a novel design framework, with separation of the segmentation and parcellation schemes under separate deep learning models, provides ample room for improving model performance, as suggested by the results of this study. In addition, thanks to the flexibility in the model design (e.g., the choice of deep learning model) and parameters (e.g., the manual correction step during training), an identical or similar framework can be easily extended to other datasets for different age groups or diagnostic groups/brain disorders. Another strength is the minimal requirement for human interaction after the training stage, as significant time and effort for manual correction is often required following automatic segmentation of fetal MR images. Lastly, the thorough investigation of the inter-dataset generalizability of the proposed segmentation/parcellation framework will be well received by the fetal neuroscience community.
Weakness:
The main weakness of this paper is the vague definition of its scientific novelty. By design, this paper is a technical study. The technical advancement claimed by the authors is a novel two-step deep-learning framework, with one model each for segmentation and parcellation. There have been, however, other deep learning studies, and some share a nearly identical model architecture to the one published by Asis-Cruz et al. (Frontiers in Neuroscience, 2022). As such, the conceptual improvement in terms of deep learning model architecture is overstated. Regarding the separate framework for segmentation and parcellation, the conventional preprocessing protocol (e.g., Draw-EM; Makropoulos et al., IEEE Transactions on Medical Imaging, 2014) already presented a similar concept. Overall, it is unclear what unique technical advances have been made in the current paper.
A second weakness of the work is the insufficient comparison to other conventional published methods. While the authors' claim that there is no "universally accepted" protocol for fetal brain segmentation/parcellation is at least partially true, Draw-EM, which was originally designed for neonatal brain segmentation, has been widely and successfully utilized in many fetal MRI studies, as discussed by the authors. Instead of a direct comparison to Draw-EM, the authors only performed a descriptive comparison using two exemplar MRI scans. It is unclear whether the superior performance of the proposed framework on these selected scans would generalize to others. Similarly, the authors claim that the proposed deep-learning-based segmentation/parcellation framework required minimal time for manual post-processing refinement (1-3 mins), compared to 1-3 hours in another study using Draw-EM (Story et al., Neuroimage: Clinical, 2021). Again, this may not be a fair comparison, considering that the intensity/precision of manual refinement may differ depending on the goals/objectives of other studies.