Predicting functional topography of the human visual cortex from cortical anatomy at scale

  1. Department of Medicine, Justus-Liebig University Giessen, Giessen, Germany
  2. School of Electrical Engineering and Computer Science, The University of Queensland, Brisbane, Australia
  3. Vision and Computational Cognition Group, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
  4. Center for Mind, Brain and Behavior (CMBB), University of Marburg, Giessen, Germany
  5. Institute of Neuroscience and Medicine (INM-7: Brain and Behaviour), Research Center Jülich, Jülich, Germany
  6. Institute of Systems Neuroscience, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  7. NeuroDataScience – ORIGAMI laboratory, Montréal Neurological Institute-Hospital, McGill University, Montréal, Canada
  8. High Field MR Center, Center for Medical Physics and Biomedical Engineering, Medical University of Vienna, Vienna, Austria
  9. eScience Institute, University of Washington, Seattle, United States
  10. School of Optometry & Vision Science, Waipapa Taumata Rau, University of Auckland, Auckland, New Zealand
  11. Experimental Psychology, University College London, London, United Kingdom
  12. Graduate School of Health, University of Technology Sydney, Sydney, Australia
  13. Queensland Digital Health Centre, University of Queensland, Brisbane, Australia

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Read more about eLife’s peer review process.

Editors

  • Reviewing Editor
    Peter Kok
    University College London, London, United Kingdom
  • Senior Editor
    Yanchao Bi
    Peking University, Beijing, China

Reviewer #1 (Public review):

Summary:

This paper describes a deep learning toolbox that can be used to automatically estimate functional topographic maps directly from human brain anatomy. Building on the first author's earlier work, which demonstrated the feasibility of using deep learning for this purpose, the new version of the toolbox now requires only a single anatomical MRI scan to generate predictions, eliminating the need for a myelin scan. This represents a significant practical improvement.

Strengths:

Having such a toolbox is very useful, since manual annotation and delineation of functional visual field maps is a laborious process that also requires deep expertise. The toolbox can save researchers substantial amounts of time and money, and also allows less experienced researchers to now perform this type of analysis. Notably, for certain participants and patients, the time they are able to reside in the scanner might be limited. Being able to focus on the primary research question, rather than the essential yet basic topographic information, could boost data quality and evaluation and might limit the number of participants that need to be included.

Weaknesses:

In the paper, the authors compare the performance of their new version to two previous approaches. Figure 2b shows that the new toolbox performs similarly to the previous deep-learning-based toolbox, but requires only an anatomical scan, which is a significant improvement. They also compare it to an older method that uses an atlas without requiring deep learning. For eccentricity and pRF size predictions, both deep-learning methods perform better than the older approach. For polar angle, a critical parameter for delineating visual field maps, the gain is substantially less. Moreover, the comparison to the atlas method (Benson2014) is not entirely fair, as, to our knowledge, there is also a more advanced atlas version that uses Bayesian fitting methods and already performs better than the old method. To better understand the gain of using deep learning, it would be beneficial if the authors also made the comparison to this more recent atlas-based approach. Moreover, it would be useful to know the correlations for the representative participant. Some examples of relatively "bad" maps would also be useful to have (and could be provided as supplementary information).

Figure 2b shows that the toolbox is quite good at estimating eccentricity and polar angle parameters, but less good at estimating the population receptive field (pRF) size. I will return to this latter point.

An interesting feature is that while the toolbox is trained on a specific data set (HCP), it can, "out-of-the-box", be applied to different existing data sets, without the need to retrain the model. This is quite important for the general utility of the method. The results for this are shown in Figure 3. Again, in panel b, it can be seen that the toolbox does a good job at estimating eccentricity and polar angle values, but performs rather poorly for pRF size: the deepRetinotopy toolbox has a strong tendency to only estimate very small pRFs, particularly when applying it across different datasets. For this reason, at the moment, these estimates appear hardly useful. It would be very helpful for readers if the authors could clarify or elaborate on this point, particularly regarding the limitations of pRF size predictions. They explain that this could be due to the use of different types of stimuli, but even within the same (HCP) dataset, the predictions primarily suggest tiny pRFs, even though the training dataset also contains larger ones (which can be better seen in supplementary Figure 4). Showing the predictions for higher-order brain areas, which have larger pRFs on average, could serve a similar evaluation purpose. Presumably, the underlying reasons are complex and could relate to the use of different stimuli, different analysis toolboxes, and how the deep learning model is currently being trained. Possibly, the abundance of small pRFs at lower eccentricity in the training set (which is usually the case in any empirical analysis) has given the model a very strong bias toward predicting small pRFs.

There would be various ways to verify which of these components is critical. For example, the model could be trained only on the bar stimuli of the HCP dataset, or the pRFs for all stimuli and datasets could be estimated using the same software tool. The latter seems important. For example, Supplementary Figure 4 indicates a high correlation between the Stanford and NYU cohorts that have used the same stimulus and analysis package, despite having different resolutions and scanners. Further investigation into the underlying reasons for these discrepancies would strengthen the paper. It would also provide valuable guidance for users of the toolbox on which toolbox predictions to trust and which not, as well as how well the model generalizes to other stimulus types, scanners, and image resolutions.

An aspect that is not directly apparent from the title, abstract, and introduction is that the deepRetinotopy toolbox does not by itself produce estimates of visual area labels or boundaries. It predicts only polar angle and eccentricity values. To predict labels and boundaries, the authors combine the toolbox with an atlas (the aforementioned Bayesian atlas). For visual areas V1 - V3, it does a very good job, in that the predictions are as good as the empirical ones. Notably, the authors indicate that the predictions for V2 and, in particular, V3 are worse than for V1, but Figure 4 clearly shows that predictions are as good as the empirical ones. More cannot be expected from a model that is trained on such empirical data.

Irrespective of the limitations with respect to predicting pRF size, the toolbox opens up functionally oriented analyses of very large cohorts of healthy participants, of which only anatomical data is available. The authors present an example of this by confirming the existence of differences in horizontal and vertical asymmetries in the field maps of the visual cortex of children and adults. While Figure 5 confirms the existence of differences, the analysis could be expanded to provide deeper insights, such as normalized developmental trajectories for both asymmetries, given the size of the dataset. This would better highlight the true power of their approach.

While the authors address limitations with respect to studying experience-dependent atypical functional organization, they do not address how the deepRetinotopy toolbox would handle (acquired) brain lesions. Addressing this, even if only speculative, would be welcome. Another welcome addition would be to see the predictions for additional brain areas, even if those would (presumably) be worse at present. Such information would nevertheless be essential for users considering applying this toolbox. Moreover, this could be a valuable resource serving as a benchmark for future iterations of either deepRetinotopy or other approaches.

Reviewer #2 (Public review):

Summary:

The authors introduce the deepRetinotopy toolbox, a deep learning-based software package that allows for user-friendly automatic delineation of visual areas based on anatomical (T1-weighted) MRI scans. This is an important evolution over a prior published version of the software, which required myelin maps additionally. The new version will hence allow many more users to obtain high-fidelity field-map delineations based on existing data or using standard protocols, providing a huge advance to the field. The authors exploited this strength and mapped visual field maps (for areas V1-V3) in 11060 human MRI scans covering different age classes to quantify changes of retinotopic organization across age groups, showing that previously functionally identified imbalances of early visual cortex field maps can now be identified on the basis of anatomical scans alone.

Strengths:

Overall, this is a tremendously important methodological contribution of primarily high practical and applied value. It allows functional imaging labs to delineate human cortical visual field maps with confirmed high fidelity using anatomical T1-weighted scans only. This will save expensive functional imaging and time-consuming analyses that were previously required to achieve nearly the same result and far better results than prior model-based approaches offered.

Also, the quantification of the accumulated very large dataset is meticulous and provides impressively detailed results of the field map changes for areas V1-V3 as a function of age.

Weaknesses:

(1) The weak point of the contribution is the choice to limit anatomical quality assessments and error quantifications to just three early regions, V1-V3, even though the deepRetinotopy toolbox can delineate over 20 regions (including parietal, ventral, and lateral regions, such as IPS0-5, hV4, VO1-2, V3A, PHC1-2, LO1-2, and TO1-2).

(2) The limit is fine for their large-scale application of the toolbox to age groups, as here, a clear hypothesis on early cortex variability was tested.

(3) However, the introduction of the toolbox itself warrants quality assessments and comparisons to prior models and ground truth beyond V1-V3, just like the authors did in their prior publication of the predecessor model.

(4) This is important as the vast majority of applications of this toolbox will likely go beyond V1-V3 to delineate dorsal, ventral, and lateral regions.

(5) For the present paper, this will require only 1 or 2 additional figures, or extending their present figures 2 and 4 along the lines of their previous figure 7 (Ribeiro et al 2021), which included error measures for high-level regions. Ideally, you provide sub-graphs separately for early visual, dorsal, ventral, and lateral regions.

(6) Going beyond V1-V3 is important for several reasons: first, future studies applying the software beyond V3 will need quantification for reassurance and justification. Second, for the sake of transparency, even if results are noisy or on par with prior models. Third, as a benchmark or reference point for future approaches.

Reviewer #3 (Public review):

Summary:

This valuable study presents a tool that uses brain anatomy to predict the layout and size of early visual maps, and it is strengthened by testing across a large and diverse collection of scans. The work will be useful for researchers who want to estimate likely visual map layout from standard anatomical scans and to relate anatomical differences to differences in visual organization across groups. The evidence is solid for the general usefulness of the approach, but incomplete for broader claims about prediction accuracy and use across datasets, particularly for estimates of map size and for showing that the model improves on repeated functional measurements.

Strengths:

The paper addresses a useful and important problem: estimating early visual map organization from anatomical measurements alone. Tools that predict these types of functional data from anatomical measurements were introduced more than a decade ago by Benson and colleagues, and the present authors have significantly extended that work. That is a real strength of the manuscript, because there is genuine value in having a practical tool that can estimate likely visual organization from standard anatomical scans.

Another major strength is the rigorous cross-dataset benchmarking and the accumulation of multiple datasets. The authors assembled a large and diverse set of scans and assessed model performance across different scanners, field strengths, and visual stimuli, which gives the reader a much better sense of how broadly the approach may apply. The retrospective analysis of more than 11,000 scans is especially notable and creates an unusual opportunity to ask how anatomical variation may relate to population differences in visual organization.

I also think the paper does a good job of showing why such a tool could matter in practice. A complete tool could be used in several ways. First, it could help users identify the locations of activations measured in other experiments with respect to the typical V1-V3 maps. Second, maps measured from an individual subject or patient could be compared with the predictions from the tool to ask whether they differ meaningfully from a standard anatomy-based map. Third, the tool can be used, as the authors have done here, to examine differences in anatomy across populations and interpret these differences with respect to retinotopic maps. Of these uses, the first already seems well supported by the current presentation.

Weaknesses:

(1) Quantification of the Analysis

My main concern is that the analysis relies heavily on global summary measures such as correlation and Dice score. Those measures are useful, but the paper would be more informative if it also quantified boundary differences in millimeters, especially for comparisons such as the V1/V2 boundary in Figure 2. That kind of analysis would help readers understand how large the errors are in physically meaningful terms.

(2) Model fitting methods

I also think the discussion of prediction failures for pRF size should be more explicit. The mismatch is likely influenced by the fact that the training data and several evaluation datasets were fit with different models and different analysis software. In particular, the network was trained on non-linear size estimates from the HCP data, while the comparison datasets were derived using other packages and, in some cases, different model assumptions. That likely contributes to the spread in Figure 3b and should be discussed more directly. It is important to discuss that the pRF parameters were derived using different software tools.

- HCP dataset (training data): analyzePRF (Compressive Spatial Summation model)

- NYU dataset: vistasoft

- Stanford dataset: vistasoft

- New Zealand dataset: SamSrf

- CHN dataset: Custom MATLAB software

(3) Clarifying Model Accuracy

If deepRetinotopy generates a true "noise-removed" representation of functional mapping based on anatomy, then fitting it to one fMRI measurement should predict a second, independent fMRI run better than the noisy data from the first run does.

The authors possess the exact data for this test. For the HCP dataset, the empirical fMRI data were explicitly separated into two halves: "fit 2" (the first half of the fMRI runs) and "fit 3" (the second half). They correlated these two halves to establish a "noise ceiling," the maximum possible reliability of the data. Looking at their results in Figure 2b, the correlation of the deepRetinotopy predictions falls below this noise ceiling. This means that the noisy functional Half 1 actually predicts functional Half 2 better than the anatomical model does.

The authors should state this explicitly. A side-by-side plot of Half 1 predicting Half 2 versus deepRetinotopy predicting Half 2 would show that the anatomical model regularizes map location well, but misses reliable subject-specific variation that anatomy alone cannot capture.

(4) The Hemodynamic Response Function

The assumptions used to generate the original empirical maps are permanently baked into the deep learning model. However, the authors explicitly mention the hemodynamic response function (HRF) only once, noting in the Methods that the modeled time series was "convolved with a canonical hemodynamic response function."

Beyond this single mention, there is no direct discussion of how the assumption of a single canonical HRF across all 161 HCP training subjects might have systematically impacted or biased the network's predictions. The authors address cross-dataset differences broadly under the umbrella of "experimental design" and "fMRI preprocessing pipeline" biases, but the HRF is a core biological property that mediates the connection between the anatomy and the data. The authors should explicitly discuss how this canonical assumption limits or biases the resulting deepRetinotopy network.

(5) Scoping the Input Data and Normative Use

The authors use FreeSurfer to generate a mean curvature map for the entire midthickness cortical surface. This full-hemisphere curvature map is resampled to a standard template surface space (32k_fs_LR), acting as the data frame that feeds input features into the neural network. However, while the network receives the full geometric structure of the hemisphere, it is explicitly trained to predict retinotopic parameters only within a restricted posterior ROI, based on the Wang et al. atlas and containing roughly 3,200 vertices per hemisphere.

A useful experiment to try, and perhaps the authors have already considered this, would be to restrict the input features exclusively to the posterior vertices. Including all anterior vertices may make it harder for the network to fit the localized visual data. A brief commentary on why the full hemisphere was retained as input could be highly informative for researchers adapting this geometric deep learning pipeline.

Author response:

Public Reviews:

Reviewer #1 (Public review):

In the paper, the authors compare the performance of their new version to two previous approaches. Figure 2b shows that the new toolbox performs similarly to the previous deep-learning-based toolbox, but requires only an anatomical scan, which is a significant improvement. They also compare it to an older method that uses an atlas without requiring deep learning. For eccentricity and pRF size predictions, both deep-learning methods perform better than the older approach. For polar angle, a critical parameter for delineating visual field maps, the gain is substantially less. Moreover, the comparison to the atlas method (Benson2014) is not entirely fair, as, to our knowledge, there is also a more advanced atlas version that uses Bayesian fitting methods and already performs better than the old method. To better understand the gain of using deep learning, it would be beneficial if the authors also made the comparison to this more recent atlas-based approach. Moreover, it would be useful to know the correlations for the representative participant. Some examples of relatively "bad" maps would also be useful to have (and could be provided as supplementary information).

We thank the reviewer for their constructive feedback. We plan to expand our benchmarking section to include the Bayesian model comparison. Note, however, that the additional accuracy gain afforded with the Bayesian model of retinotopy (Benson and Winawer, 2018) results from combining anatomical data with retinotopic maps estimated with a few minutes of functional data. The Bayesian model of retinotopy without such functional data is equivalent to Benson14. We plan to report the correlations (between predicted and empirical maps) for the representative participant shown in Figure 2 and include an additional supplementary figure showing retinotopic map predictions for a participant whose predictions deviate the most from empirical maps, as suggested by the reviewer.

Figure 2b shows that the toolbox is quite good at estimating eccentricity and polar angle parameters, but less good at estimating the population receptive field (pRF) size. I will return to this latter point.

An interesting feature is that while the toolbox is trained on a specific data set (HCP), it can, "out-of-the-box", be applied to different existing data sets, without the need to retrain the model. This is quite important for the general utility of the method. The results for this are shown in Figure 3. Again, in panel b, it can be seen that the toolbox does a good job at estimating eccentricity and polar angle values, but performs rather poorly for pRF size: the deepRetinotopy toolbox has a strong tendency to only estimate very small pRFs, particularly when applying it across different datasets. For this reason, at the moment, these estimates appear hardly useful. It would be very helpful for readers if the authors could clarify or elaborate on this point, particularly regarding the limitations of pRF size predictions. They explain that this could be due to the use of different types of stimuli, but even within the same (HCP) dataset, the predictions primarily suggest tiny pRFs, even though the training dataset also contains larger ones (which can be better seen in supplementary Figure 4). Showing the predictions for higher-order brain areas, which have larger pRFs on average, could serve a similar evaluation purpose. Presumably, the underlying reasons are complex and could relate to the use of different stimuli, different analysis toolboxes, and how the deep learning model is currently being trained. Possibly, the abundance of small pRFs at lower eccentricity in the training set (which is usually the case in any empirical analysis) has given the model a very strong bias toward predicting small pRFs.

There would be various ways to verify which of these components is critical. For example, the model could be trained only on the bar stimuli of the HCP dataset, or the pRFs for all stimuli and datasets could be estimated using the same software tool. The latter seems important. For example, Supplementary Figure 4 indicates a high correlation between the Stanford and NYU cohorts that have used the same stimulus and analysis package, despite having different resolutions and scanners. Further investigation into the underlying reasons for these discrepancies would strengthen the paper. It would also provide valuable guidance for users of the toolbox on which toolbox predictions to trust and which not, as well as how well the model generalizes to other stimulus types, scanners, and image resolutions.

We will expand our discussion of the limitations of pRF size prediction, highlighting that differences in visual stimuli, analysis toolboxes used to estimate pRF parameters from empirical data, and the current training of deepRetinotopy affect prediction accuracy. As the reviewer pointed out, the underlying reasons are complex, and it is difficult to isolate all the potential contributing factors. However, in addition to our expanded discussion, we also intend to present results from additional experiments that assess the impact of different loss functions on the range of predicted pRF sizes (to explain how training may partly account for the differences observed in the HCP dataset). We will also perform pRF fitting on at least one dataset using the same software/encoding model as in the HCP dataset (the training data) to illustrate that the lower performance in pRF size prediction in out-of-distribution datasets is also partly explained by differences in how the empirical maps were obtained.

An aspect that is not directly apparent from the title, abstract, and introduction is that the deepRetinotopy toolbox does not by itself produce estimates of visual area labels or boundaries. It predicts only polar angle and eccentricity values. To predict labels and boundaries, the authors combine the toolbox with an atlas (the aforementioned Bayesian atlas). For visual areas V1 - V3, it does a very good job, in that the predictions are as good as the empirical ones. Notably, the authors indicate that the predictions for V2 and, in particular, V3 are worse than for V1, but Figure 4 clearly shows that predictions are as good as the empirical ones. More cannot be expected from a model that is trained on such empirical data.

We will edit the introduction and abstract to make it clearer that the deepRetinotopy toolbox does not yet produce estimates of visual boundaries on its own.

Irrespective of the limitations with respect to predicting pRF size, the toolbox opens up functionally oriented analyses of very large cohorts of healthy participants, of which only anatomical data is available. The authors present an example of this by confirming the existence of differences in horizontal and vertical asymmetries in the field maps of the visual cortex of children and adults. While Figure 5 confirms the existence of differences, the analysis could be expanded to provide deeper insights, such as normalized developmental trajectories for both asymmetries, given the size of the dataset. This would better highlight the true power of their approach.

Although providing insights into developmental trajectories for horizontal and vertical asymmetries is beyond the scope of the current work, as it would require aggregating datasets such that individuals’ age span a larger range (ABCD dataset only contains individuals between 9-11 years old and the HCP Young Adult dataset between 22-36 years old), we plan to provide some complementary analyses (differences across ages and sex within the ABCD dataset).

While the authors address limitations with respect to studying experience-dependent atypical functional organization, they do not address how the deepRetinotopy toolbox would handle (acquired) brain lesions. Addressing this, even if only speculative, would be welcome. Another welcome addition would be to see the predictions for additional brain areas, even if those would (presumably) be worse at present. Such information would nevertheless be essential for users considering applying this toolbox. Moreover, this could be a valuable resource serving as a benchmark for future iterations of either deepRetinotopy or other approaches.

We plan to expand and report performance evaluation across other visual areas (using Wang atlas’ parcels) to serve as a benchmarking resource. Moreover, we will expand our discussion on how deepRetinotopy would handle brain lesions.

Reviewer #2 (Public review):

(1) The weak point of the contribution is the choice to limit anatomical quality assessments and error quantifications to just three early regions, V1-V3, even though the deepRetinotopy toolbox can delineate over 20 regions (including parietal, ventral, and lateral regions, such as IPS0-5, hV4, VO1-2, V3A, PHC1-2, LO1-2, and TO1-2).

(2) The limit is fine for their large-scale application of the toolbox to age groups, as here, a clear hypothesis on early cortex variability was tested.

(3) However, the introduction of the toolbox itself warrants quality assessments and comparisons to prior models and ground truth beyond V1-V3, just like the authors did in their prior publication of the predecessor model.

(4) This is important as the vast majority of applications of this toolbox will likely go beyond V1-V3 to delineate dorsal, ventral, and lateral regions.

(5) For the present paper, this will require only 1 or 2 additional figures, or extending their present figures 2 and 4 along the lines of their previous figure 7 (Ribeiro et al 2021), which included error measures for high-level regions. Ideally, you provide sub-graphs separately for early visual, dorsal, ventral, and lateral regions.

(6) Going beyond V1-V3 is important for several reasons: first, future studies applying the software beyond V3 will need quantification for reassurance and justification. Second, for the sake of transparency, even if results are noisy or on par with prior models. Third, as a benchmark or reference point for future approaches.

We thank the reviewer for their constructive feedback, and we agree that expanding our performance assessment beyond V1-3 would be a valuable benchmarking resource. Thus, we plan to evaluate retinotopic map prediction accuracy across visual areas defined by the Wang atlas’ parcels, expanding on the results reported in Figure 2, and provide it as a supplementary figure. However, performance estimation ultimately depends on the quality of the dataset used for evaluation. The empirical maps, although treated as ground truth, may themselves misrepresent the underlying retinotopic organization. As a matter of fact, the quality of the empirical data (HCP dataset and others) is indeed lowest in some of the higher-order visual areas.

It may be unclear from the text that the deepRetinotopy toolbox does not yet produce estimates of visual boundaries on its own. Accordingly, we illustrate how deepRetinotopy toolbox’s predictions can be combined with another tool [the Ba yesian model of retinotopy from Benson and Winawer (2018)] to obtain visual area boundaries automatically. We will edit the introduction and abstract to make it clearer. Given the availability of empirical labels (currently only for V1-3) and the segmentation tool (which was only assessed for V1-3), we cannot expand Figure 4 to other visual areas as suggested.

Reviewer #3 (Public review):

Quantification of the Analysis: My main concern is that the analysis relies heavily on global summary measures such as correlation and Dice score. Those measures are useful, but the paper would be more informative if it also quantified boundary differences in millimeters, especially for comparisons such as the V1/V2 boundary in Figure 2. That kind of analysis would help readers understand how large the errors are in physically meaningful terms.

We thank the reviewer for their constructive feedback. Following the reviewer’s suggestion, we plan to expand our segmentation evaluation to quantify the extent to which boundary predictions from deepRetinotopy’s maps deviate from those from empirical maps, in millimetres.

Model fitting methods: I also think the discussion of prediction failures for pRF size should be more explicit. The mismatch is likely influenced by the fact that the training data and several evaluation datasets were fit with different models and different analysis software. In particular, the network was trained on non-linear size estimates from the HCP data, while the comparison datasets were derived using other packages and, in some cases, different model assumptions. That likely contributes to the spread in Figure 3b and should be discussed more directly. It is important to discuss that the pRF parameters were derived using different software tools.

We will expand our discussion of the limitations of pRF size prediction, highlighting that differences in visual stimuli, different encoding models for estimating pRF parameters from empirical data, and the current training of deepRetinotopy affect prediction accuracy. In addition to our expanded discussion, we intend to also present results from additional experiments that assess the impact of those factors on pRF size prediction performance.

Clarifying Model Accuracy: If deepRetinotopy generates a true "noise-removed" representation of functional mapping based on anatomy, then fitting it to one fMRI measurement should predict a second, independent fMRI run better than the noisy data from the first run does.

The authors possess the exact data for this test. For the HCP dataset, the empirical fMRI data were explicitly separated into two halves: "fit 2" (the first half of the fMRI runs) and "fit 3" (the second half). They correlated these two halves to establish a "noise ceiling," the maximum possible reliability of the data. Looking at their results in Figure 2b, the correlation of the deepRetinotopy predictions falls below this noise ceiling. This means that the noisy functional Half 1 actually predicts functional Half 2 better than the anatomical model does.

The authors should state this explicitly. A side-by-side plot of Half 1 predicting Half 2 versus deepRetinotopy predicting Half 2 would show that the anatomical model regularizes map location well, but misses reliable subject-specific variation that anatomy alone cannot capture.

We will expand our benchmarking session to make these comparisons (“Half 1 predicting Half 2 versus deepRetinotopy predicting Half 2”) more explicit. It is important to highlight that there is more subject-specific variation that is currently not captured by our model, and it can also serve as a benchmarking resource for future model versions and newer approaches.

The Hemodynamic Response Function: The assumptions used to generate the original empirical maps are permanently baked into the deep learning model. However, the authors explicitly mention the hemodynamic response function (HRF) only once, noting in the Methods that the modeled time series was "convolved with a canonical hemodynamic response function."

Beyond this single mention, there is no direct discussion of how the assumption of a single canonical HRF across all 161 HCP training subjects might have systematically impacted or biased the network's predictions. The authors address cross-dataset differences broadly under the umbrella of "experimental design" and "fMRI preprocessing pipeline" biases, but the HRF is a core biological property that mediates the connection between the anatomy and the data. The authors should explicitly discuss how this canonical assumption limits or biases the resulting deepRetinotopy network.

As Reviewers 3 and 1 have noted, the observed limitations in pRF size prediction stem from multiple underlying factors. One of those factors is indeed the HRF assumed in the encoding models. We will expand our discussion about factors that may introduce biases into deepRetinotopy predictions, including the HRF.

Scoping the Input Data and Normative Use: The authors use FreeSurfer to generate a mean curvature map for the entire midthickness cortical surface. This full-hemisphere curvature map is resampled to a standard template surface space (32k_fs_LR), acting as the data frame that feeds input features into the neural network. However, while the network receives the full geometric structure of the hemisphere, it is explicitly trained to predict retinotopic parameters only within a restricted posterior ROI, based on the Wang et al. atlas and containing roughly 3,200 vertices per hemisphere.

A useful experiment to try, and perhaps the authors have already considered this, would be to restrict the input features exclusively to the posterior vertices. Including all anterior vertices may make it harder for the network to fit the localized visual data. A brief commentary on why the full hemisphere was retained as input could be highly informative for researchers adapting this geometric deep learning pipeline.

Thanks for this suggestion. We have not performed a systematic evaluation of using ROIs that span a larger portion of the cortex (including the full hemisphere). It is a great idea to do so and report it in our manuscript to inform other researchers interested in adapting our pipeline. We intend to also update our toolbox by retraining our models to take all posterior vertices as suggested, which would improve the coverage of current predictions.

  1. Howard Hughes Medical Institute
  2. Wellcome Trust
  3. Max-Planck-Gesellschaft
  4. Knut and Alice Wallenberg Foundation