Deep learning-driven characterization of single cell tuning in primate visual area V4 supports topological organization

  1. Institute for Bioinformatics and Medical Informatics, Tübingen University, Tübingen, Germany
  2. Institute of Computer Science and Campus Institute Data Science, University of Göttingen, Göttingen, Germany
  3. Department of Ophthalmology, Byers Eye Institute, Stanford University School of Medicine, Stanford, United States
  4. Stanford Bio-X, Stanford University, Stanford, United States
  5. Wu Tsai Neurosciences Institute, Stanford University, Stanford, United States
  6. Department of Neuroscience, Baylor College of Medicine, Houston, United States
  7. Center for Neuroscience and Artificial Intelligence, Baylor College of Medicine, Houston, United States
  8. Department of Pediatrics; Allergy & Immunology, Baylor College of Medicine, Houston, United States
  9. Max Planck Institute for Dynamics and Self-Organization, Göttingen, Germany

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

Editors

  • Reviewing Editor
    Tirin Moore
    Stanford University, Howard Hughes Medical Institute, Stanford, United States of America
  • Senior Editor
    Tirin Moore
    Stanford University, Howard Hughes Medical Institute, Stanford, United States of America

Reviewer #1 (Public review):

Willeke et al. hypothesize that macaque V4, like other visual areas, may exhibit a topographic functional organization. One challenge to studying the functional (tuning) organization of V4 is that its neurons are selective for complex visual stimuli that are hard to parameterize. The authors therefore leverage an approach they have pioneered, combining digital twins with most exciting images (MEIs); this data-driven, deep-learning framework can effectively handle the difficulty of parameterizing relevant stimuli. They verify that the model-synthesized MEIs indeed drive V4 neurons more effectively than matched natural image controls. They then perform psychophysics experiments on humans, along with contrastive learning, to show that anatomically neighboring neurons are often tuned to similar stimuli. Importantly, the weaknesses of the approach are clearly appreciated and discussed.
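For readers less familiar with the approach, MEI synthesis is gradient ascent on a differentiable encoding model. Below is a minimal sketch, assuming a fitted PyTorch model `model` that maps an image tensor to predicted per-neuron responses; the function, shapes, and norm budget are illustrative assumptions, not the authors' code.

```python
import torch

def synthesize_mei(model, neuron_idx, image_shape=(1, 1, 100, 100),
                   steps=1000, lr=0.1, norm_budget=25.0):
    """Sketch of MEI synthesis for one model neuron via gradient ascent."""
    mei = torch.randn(image_shape, requires_grad=True)   # start from noise
    optimizer = torch.optim.Adam([mei], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        activation = model(mei)[0, neuron_idx]           # predicted response
        (-activation).backward()                         # ascend activation
        optimizer.step()
        with torch.no_grad():                            # constrain image
            norm = mei.norm()                            # energy so the
            if norm > norm_budget:                       # problem stays
                mei.mul_(norm_budget / norm)             # well-posed
    return mei.detach()
```

The norm constraint stands in for whatever contrast or energy budget the actual pipeline enforces; without some such constraint, activation maximization diverges toward arbitrarily high-contrast images.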

Comments:

(1) The correlation between predictions and data is 0.43. I'd agree with the authors that this is "reliable", but would recommend that they discuss how this far-from-saturated performance influences the results.
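For context, a figure like 0.43 is typically a per-neuron Pearson correlation between model predictions and held-out responses, averaged over neurons. A minimal sketch of that metric, with assumed array shapes (not taken from the paper):

```python
import numpy as np

def per_neuron_correlation(pred, obs):
    """Pearson correlation per neuron; pred and obs are
    (n_images, n_neurons) arrays of predicted and observed responses."""
    pred = pred - pred.mean(axis=0)
    obs = obs - obs.mean(axis=0)
    num = (pred * obs).sum(axis=0)
    den = np.sqrt((pred ** 2).sum(axis=0) * (obs ** 2).sum(axis=0))
    return num / den          # average over neurons for a summary score
```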

(2) Modeling V4 using a CNN and claiming that the identified functional groups look like those found in artificial vision systems may be a bit circular.

(3) No architecture other than ResNet-50 was tested. This might be a major drawback, since the MEIs could well reflect the architecture and the statistics of the dataset rather than intrinsic biological properties. Do the authors find the same result with different architectures as the basis of the goal-driven model?
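A minimal sketch of the control this comment asks for: hold the readout family fixed and swap the core feature extractor. The torchvision backbones below are real; the `NeuronReadout` wrapper is a hypothetical stand-in for the pipeline's neuron-specific readout, and a real fit would tap an intermediate layer rather than reuse the 1000-dimensional classifier output as features, as this sketch does for brevity.

```python
import torch.nn as nn
from torchvision import models

# Candidate cores to compare against the robust ResNet-50 baseline.
backbones = {
    "resnet50": models.resnet50(weights="IMAGENET1K_V2"),
    "vgg16": models.vgg16(weights="IMAGENET1K_V1"),
    "densenet121": models.densenet121(weights="IMAGENET1K_V1"),
}

class NeuronReadout(nn.Module):
    """Hypothetical linear readout on top of a frozen backbone."""
    def __init__(self, backbone, n_features, n_neurons):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)                     # freeze the core
        self.linear = nn.Linear(n_features, n_neurons)  # fit only this

    def forward(self, x):
        return self.linear(self.backbone(x).flatten(1))
```

Refitting the readout on each core and re-running MEI synthesis would show directly whether the reported selectivities survive a change of architecture.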

(4) The closed-loop analysis seems to use a much smaller sample of the recorded neurons - "resulting in n=55 neurons for the analysis of the closed-loop paradigm".

(5) A discussion of adversarial machine learning and of the adversarial training that was used is lacking.

Reviewer #2 (Public review):

This is an ambitious and technically powerful study that investigates a long-standing question about the functional organization of area V4. The project combined large-scale single-unit electrophysiology in macaque V4 with deep learning-based activation maximization to characterize neuronal tuning in natural image space. The authors built predictive encoding models for V4 neurons and used these models to synthesize most exciting images (MEIs), which were then validated in vivo using a closed-loop experimental paradigm.

Overall, the manuscript advances three main claims:

(1) Individual V4 neurons showed complex and highly structured selectivity for naturalistic visual features, including textures, curvatures, repeating patterns, and apparently eye-like motifs.

(2) Neurons recorded along the same linear probe penetration tended to have more similar MEIs than neurons recorded at different cortical locations; this similarity was supported by human psychophysics and by distances in a learned, contrastive image embedding space (a sketch of such a comparison follows this list).

(3) MEIs clustered into a limited number of functional groups that resembled feature visualizations observed in deep convolutional neural networks.
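Claim (2) invites a straightforward quantitative check: compare the mean within-penetration MEI-embedding distance against a null in which penetration labels are shuffled. A sketch under assumed shapes; the embedding could come from any image model, and nothing below is the authors' code.

```python
import numpy as np

def within_vs_across(embeddings, penetration_ids, n_perm=10000, seed=0):
    """Permutation test: are MEI embeddings closer within penetrations?

    embeddings: (n_neurons, d) array; penetration_ids: (n_neurons,) labels.
    Returns the observed mean within-penetration distance and a p-value.
    """
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None], axis=-1)
    off_diag = ~np.eye(len(embeddings), dtype=bool)

    def mean_within(labels):
        same = labels[:, None] == labels[None]
        return dists[same & off_diag].mean()

    observed = mean_within(penetration_ids)
    null = np.array([mean_within(rng.permutation(penetration_ids))
                     for _ in range(n_perm)])
    p_value = (null <= observed).mean()   # small p: within < chance
    return observed, p_value
```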

Strengths:

(1) The study is important in that it is the first to apply activation maximization to neurons sampled at such fine spatial resolution. The authors used 32-channel linear silicon probes, spanning approximately 2 mm of cortical depth, with inter-contact spacing of roughly 60 µm. This enabled fine sampling across most of the cortical thickness of V4, substantially finer resolution than prior Utah-array or surface-biased approaches.

(2) A key strength is the direct in vivo validation of model-derived synthetic images on the same neurons used to build the models, a critical step often absent in other neural network-based encoding studies.

(3) More broadly, the study highlights the value of probing neuronal selectivity with rich, naturalistic stimulus spaces rather than relying exclusively on oversimplified stimuli such as Gabors.

Weaknesses:

(1) A central claim is that neurons sampled within the same penetration shared MEI tuning properties, compared to neurons sampled in different penetrations, because of functional organization. I am concerned that correlated activity could instead arise from technical or methodology-related factors (for example, shared reference or grounding) rather than from functional organization alone. These recordings were obtained with linear silicon probes, and there have been observations that neuronal activity along this type of probe (including Neuropixels probes) may be correlated beyond what prior work using manually advanced single electrodes reported. For example, Fujita et al. (1992) showed finer micro-domains and systematic changes in selectivity along a cortical penetration, and it is not clear whether that holds, or is detectable, here. The manuscript would be strengthened by a more thorough and explicit characterization of lower-level response correlations (at the level of the raw electrophysiology) prior to model fitting. In particular, the authors could examine noise correlations along the electrode shaft (using the repeated test images, for example), as well as signal correlations in tuning, both within and across sessions. It would also help to clarify whether these correlations depended on penetration day, recording chamber hole (how many were used?), or spatial separation between penetrations, and whether repeated use of the same hole yielded stable or changing correlations. Illustrations of peristimulus time histogram changes across the shaft and across penetrations would also help. All of this would clarify whether the reported clustering was an inevitable byproduct of the recording technique.
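The correlation analyses proposed above could take the following shape, assuming a repeats x images x neurons response array from the repeated test images; names and shapes are illustrative, not the authors' pipeline.

```python
import numpy as np

def noise_correlations(responses):
    """Noise correlations from repeated presentations.

    responses: (n_repeats, n_images, n_neurons). Subtracting each image's
    mean removes stimulus-driven covariation; correlating the residuals
    isolates trial-to-trial ('noise') co-fluctuations.
    """
    residuals = responses - responses.mean(axis=0, keepdims=True)
    flat = residuals.reshape(-1, responses.shape[-1])  # pool repeats/images
    return np.corrcoef(flat, rowvar=False)

def signal_correlations(responses):
    """Signal correlations: correlate trial-averaged tuning curves."""
    mean_resp = responses.mean(axis=0)                 # (n_images, n_neurons)
    return np.corrcoef(mean_resp, rowvar=False)
```

Regressing both matrices against inter-contact distance, penetration identity, and session would help separate shared-reference artifacts from genuine tuning similarity.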

(2) It is difficult to understand a story about visual cortex neurons without more information about their receptive field (RF) locations and widths, particularly given that the stimulus was full-screen. I understand that a sparse random dot stimulus was used to find the population RF, so it should be possible to visualize the individual and population RFs. Also, the investigators inferred the locations of the important patches using a masking algorithm, but where were those masks relative to the retinal image, and how were they distributed as a function of position along the shaft? This would help us understand how similar each contact was.
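For the RF question, a sparse random dot stimulus lends itself to a standard reverse-correlation estimate. A minimal sketch with hypothetical inputs (a binary dot-position movie and appropriately lagged spike counts); in practice one would also subtract the mean stimulus as a baseline.

```python
import numpy as np

def sparse_dot_rf(dot_frames, spike_counts):
    """Spike-weighted average of a sparse dot stimulus.

    dot_frames: (n_frames, h, w) binary maps of dot positions;
    spike_counts: (n_frames, n_neurons) responses, lagged by the
    assumed response latency. Returns (n_neurons, h, w) RF maps.
    """
    n_frames, h, w = dot_frames.shape
    flat = dot_frames.reshape(n_frames, -1).astype(float)
    rf = spike_counts.T @ flat / spike_counts.sum(axis=0)[:, None]
    return rf.reshape(-1, h, w)
```

Overlaying these maps, per contact, with the inferred MEI masks would show where the "important patches" sat relative to each neuron's RF.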

(3) A major claim is that V4 MEIs formed groups comparable to those produced by artificial vision systems, "suggesting potential shared encoding strategies." The issue is that the "shared encoding strategy" might be the authors' use of this same class of models in the first place. It would be useful to know whether different functional groups arise with other encoding models, beyond the robustly trained ResNet-50. I am unsure to what extent the reported clustering, depth-wise similarity, and correspondence to artificial features depended on architectural and training biases. It would substantially strengthen the manuscript to test whether a similar organizational structure emerges with alternative encoding models, such as attention-based vision transformers, self-supervised visual representations, or other non-convolutional architectures. Another important point of contrast would be to examine the functional groups encoded by the ResNet architecture before its activations were fit to V4 neuronal activity: put simply, is ResNet just re-stating what it already knows?
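The "is ResNet just re-stating what it already knows?" question can be made quantitative: cluster MEI embeddings derived from two different encoders (or from the task-pretrained core versus the neural-fit model) and measure the agreement of the partitions. A sketch using scikit-learn; the embeddings and cluster count are assumptions, not from the paper.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def compare_functional_groups(emb_a, emb_b, n_clusters=10, seed=0):
    """Cluster two embeddings of the same neurons' MEIs and compare.

    An adjusted Rand index near 0 suggests model-specific groups;
    near 1, groups that are robust to the choice of encoder.
    """
    labels_a = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(emb_a)
    labels_b = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(emb_b)
    return adjusted_rand_score(labels_a, labels_b)
```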

(4) Several comparisons to prior work are presented largely at a qualitative level, without quantitative support. For example, the authors state that their MEIs are consistent with known tuning properties of macaque V4, such as selectivity for shape, curvature, and texture. However, this claim is not supported by explicit image analyses or metrics that would substantiate these correspondences beyond appeal to visual inspection. Incorporating quantitative analyses, for instance, measures of curvature, texture statistics, or comparisons to established stimulus sets, would strengthen these links to prior literature and clarify the relationship between the synthesized MEIs and previously characterized V4 tuning properties.
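As one concrete instance of such a bridge, simple gray-level co-occurrence (GLCM) texture statistics could be computed for each MEI and compared across functional groups or against established texture stimulus sets. A sketch with scikit-image; the discretization level is an arbitrary choice.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def mei_texture_stats(mei, levels=32):
    """GLCM texture descriptors for a grayscale MEI scaled to [0, 1].

    Returns contrast, homogeneity, and energy, averaged over four
    orientations at unit pixel offset.
    """
    img = np.clip(mei * (levels - 1), 0, levels - 1).astype(np.uint8)
    glcm = graycomatrix(img, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=levels, symmetric=True, normed=True)
    return {prop: graycoprops(glcm, prop).mean()
            for prop in ("contrast", "homogeneity", "energy")}
```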
