Layer-specific spatiotemporal dynamics of feedforward and feedback in human visual object perception

Tony Carricarte; Siying Xie; Johannes Singer; Robert Trampel; Laurentius Huber; Zejin Lu; Tim C Kietzmann; Nikolaus Weiskopf; Radoslaw M Cichy

doi:10.7554/eLife.111186.1

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

Reviewing Editor
Shuo Wang
Washington University in St. Louis, St. Louis, United States of America
Senior Editor
Tirin Moore
Stanford University, Howard Hughes Medical Institute, Stanford, United States of America

Reviewer #1 (Public review):

Summary:

This study combines representational similarity analysis (RSA) with 7T layer-specific fMRI and EEG to examine how neural representations in specific cortical layers of EVC and LOC correspond to the temporal dynamics of visual processing. The authors interpret these correspondences as reflecting feedforward and feedback processes, based on their relative timing and their similarity to representations in different layers of a deep neural network (DNN).

Strengths:

The combination of RSA with laminar fMRI is a promising approach for dissociating the functional roles and dynamics of different cortical layers within the same functional region, and it holds considerable potential for elucidating computational mechanisms both within and between levels of the visual hierarchy. However, several issues should be addressed before the authors' conclusions can be fully supported.

Weaknesses:

(1) The authors report that the representation in the LOC superficial layer resembles EEG-derived neural representations at ~400 ms post-stimulus, and that this similarity is best explained by representations in the higher layers of the DNN. From these two observations, they conclude that activity in the LOC superficial layer is driven by feedback signals. However, neither line of evidence directly dissociates feedforward from feedback contributions.

Specifically, late-stage representations in LOC could instead reflect the outcome of local recurrent computation, given that the superficial layer also serves as an output layer of the local cortical circuit. Moreover, the correlation with the DNN peaks at higher layers rather than being dominated by them, and feature tuning in higher DNN layers does not necessarily map onto higher-order cortical regions such as PFC.

While a feedback contribution to the LOC superficial layer is consistent with theoretical predictions and known cortical anatomy, the current evidence is indirect. I would recommend that the authors either tone down this conclusion or, at a minimum, explicitly clarify the strength and limitations of the evidence in the Discussion.

(2) I could not find information regarding the fMRI slice orientation or whether temporal regions beyond LOC were covered. The reported FOV (192 × 192 mm) seems quite large if only EVC and LOC were targeted. Did the authors acquire data from other object-selective regions in the temporal cortex, and if so, did they analyze these?

It would strengthen the feedback interpretation considerably if the RDM of the LOC superficial layer could be shown to resemble RDMs from more anterior temporal regions, which would be consistent with feedback originating from higher-order object-processing areas.

(3) Related to the previous point, LOC is a relatively large region, and based on the figures, it appears that the LOC ROI may contain two subregions. It would be helpful for the authors to show the location and extent of the LOC ROI in example participants.

If the ROI does indeed span two subregions, do these subregions share the same laminar profile and temporal dynamics?

(4) The authors report no feedback-related information in EVC, which contrasts with a number of prior fMRI studies that have demonstrated object-related feedback signals in EVC. One plausible explanation for this discrepancy is task relevance: in the present study, participants performed only a fixation color-change task, whereas in previous work they were required to attend to object features or identity (e.g., Morgan et al., 2019, J Neurosci; Kok et al., 2016, Curr Biol; Mohsenzadeh et al., 2018, eLife; Hou et al., 2026, eLife). Task demands on object processing may substantially modulate the strength of feedback signals to EVC, and this possibility warrants discussion.

(5) A substantial body of work has used specialized paradigms to dissociate feedforward and feedback signals in EVC (e.g., Williams et al., 2008, Nat Neurosci; Fan et al., 2016, PNAS; Hou et al., 2026, eLife). These studies are directly relevant to the current work but are not cited.

(6) Multidimensional scaling (MDS) visualizations of the RDMs (as in, e.g., Mohsenzadeh et al., 2018) are not included in the manuscript. These visualizations are important for interpreting the representational format across different layers of LOC and EVC, and I would encourage the authors to include them.

https://doi.org/10.7554/eLife.111186.1.sa2

Reviewer #2 (Public review):

Summary:

Carricarte and colleagues set out to identify and functionally characterize feedforward (FF) and feedback (FB) information flow during object perception in humans, a question that has been difficult to address non-invasively because FF and FB signals overlap rapidly in time and across regions. The authors capitalize on the canonical cortical microcircuit (FF terminations primarily in middle layers, FB terminations primarily in superficial and deep layers) to spatially separate these signals using sub-millimeter (0.9 mm isotropic) GE-BOLD fMRI at 7T in early visual cortex (EVC) and lateral occipital complex (LOC). They combine these layer-resolved fMRI patterns with millisecond-resolution EEG (from a previously published dataset using the same 24 images) via representational similarity analysis-based EEG-fMRI fusion, and use a Vision Transformer (DeiT) trained on ImageNet to characterize the feature complexity of the resulting spatiotemporal signatures.

The authors first validate their approach at the macroscale, replicating the expected EVC-then-LOC temporal hierarchy and the EVC-low/LOC-high feature complexity gradient. They then apply the same framework at the mesoscale of cortical layers, reporting: (a) early middle-layer signals in both EVC (~100 ms) and LOC (~160 ms) consistent with FF processing; (b) a later superficial-layer signal in LOC (~400 ms) interpreted as FB; (c) a layer-uniform feature-complexity profile in EVC (peaking at low-mid DNN layers across all depths); and (d) a feature-complexity dissociation in LOC, where middle-layer signals correspond to mid-to-high DNN layers and superficial-layer signals to high DNN layers. They argue that this complexity shift, combined with the timing difference, indicates interareal FB into LOC.
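
For readers unfamiliar with the fusion logic summarized above, the following is a minimal sketch, not the authors' pipeline. It assumes that each region or layer yields one fMRI representational dissimilarity matrix (RDM), that EEG yields one RDM per time point, and that the two are compared with a Spearman correlation over the unique off-diagonal entries. All function names and the toy data (6 images, 3 time points) are hypothetical.

```python
import random

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def upper_triangle(rdm):
    """Vectorize the unique pairwise dissimilarities of a symmetric RDM."""
    n = len(rdm)
    return [rdm[i][j] for i in range(n) for j in range(i + 1, n)]

def fuse(fmri_rdm, eeg_rdms):
    """Spearman-correlate one fMRI RDM with the EEG RDM at each time point."""
    target = upper_triangle(fmri_rdm)
    return [spearman(target, upper_triangle(r)) for r in eeg_rdms]

def random_rdm(n, rng):
    """Symmetric toy RDM with a zero diagonal."""
    rdm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            rdm[i][j] = rdm[j][i] = rng.random()
    return rdm

# Toy data (hypothetical): 6 images, 3 EEG time points.
rng = random.Random(0)
fmri_rdm = random_rdm(6, rng)
eeg_rdms = [
    random_rdm(6, rng),                          # unrelated geometry
    [[2 * v for v in row] for row in fmri_rdm],  # same rank order as fMRI
    random_rdm(6, rng),                          # unrelated geometry
]
fusion = fuse(fmri_rdm, eeg_rdms)
peak_time = max(range(len(fusion)), key=lambda t: fusion[t])
```

In this toy case the fusion time course peaks at the second time point, where the EEG geometry matches the fMRI geometry. The peak identifies when the two representational geometries are most similar; it does not by itself indicate the direction of information flow.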

Strengths:

(1) The combination of layer-fMRI at 7T, EEG, and DNN-based representational analysis is well motivated through RSA. Each modality compensates for a known limitation of the others (fMRI: poor temporal resolution; EEG: poor spatial resolution; DNN: surrogate for representational format), and the RSA framework provides a principled common currency. Relatedly, the two-step macroscale-then-mesoscale design, in which the macroscale fusion replicates established findings before the same approach is applied at the layer level, is a sound and welcome scientific strategy that strengthens confidence in the combined-modality inferences.

(2) The authors include multiple complementary controls: partialing out lower layers to mitigate vascular draining, voxel-count matching across layers, an alternative DNN (AlexNet), an alternative time-window definition based on between-layer differences, and time-resolved commonality analyses. The convergence across these analyses is reassuring.

(3) Methodological transparency: The authors are forthright about partial-volume effects, foveal-confluence aggregation, and the indirect nature of the temporal estimates derived from EEG-fMRI fusion.
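
The draining control mentioned in point (2) can be illustrated with a partial correlation. This is a sketch under stated assumptions: the review does not specify the exact formulation used, and the vectors below are made-up stand-ins for vectorized RDMs. Applying the same function to rank-transformed vectors would give a partial Spearman correlation.

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def partial_corr(x, y, z):
    """Correlation between x and y after removing what each shares with z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5

# Hypothetical toy vectors: three mutually orthogonal components.
deep = [-1, -1, 1, 1, -1, -1, 1, 1]           # deeper-layer RDM vector
unique_sup = [-1, 1, -1, 1, -1, 1, -1, 1]     # variance unique to superficial
unique_eeg = [-1, -1, -1, -1, 1, 1, 1, 1]     # variance unique to EEG

# Superficial and EEG vectors overlap only through the deep-layer component,
# mimicking a superficial signal inherited from below via draining veins.
superficial = [d + u for d, u in zip(deep, unique_sup)]
eeg = [d + u for d, u in zip(deep, unique_eeg)]

raw = pearson(superficial, eeg)                  # inflated by shared component
controlled = partial_corr(superficial, eeg, deep)  # shared component removed
```

Here the raw correlation is 0.5, while the partial correlation is zero, because everything the superficial and EEG vectors share is already carried by the deep-layer vector.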

Weaknesses:

The central interpretive claim, that the late (~400 ms), superficial-layer LOC signal indexes interareal feedback that increases representational complexity, is intriguing, but in my view it is not yet fully supported by the evidence presented, for the following reasons.

(1) Eye movements as a possible confound for late signals. Stimuli were presented for 1 second, and fixation was enforced only behaviorally via a color-change task on a central cross. No eye-tracking is reported for either the fMRI or EEG datasets. While this approach is not uncommon, the absence of gaze monitoring introduces ambiguity when the goal is to decouple feedforward and feedback contributions at fine temporal resolution in EEG recordings. Under these conditions, multiple image-driven saccades within a trial are plausible, and saccade patterns are likely to be systematically image-specific, given the small (n = 24) and heterogeneous naturalistic stimulus set. Critically, the temporal window over which RDM correlations are interpreted as feedback coincides with the period during which observers typically make 2-4 fixations (average fixation durations of ~250-330 ms; Rayner, 1998; Henderson, 2003), meaning the late EEG-fMRI fusion peaks fall in a window where image-locked saccadic activity and successive foveation-driven feedforward responses would be expected to accumulate. Late peaks could therefore reflect cumulative feedforward responses across successive foveations rather than top-down feedback. The manuscript would be strengthened by providing eye-tracking data (if available), control analyses leveraging post-hoc indicators, or a discussion citing prior evidence that EEG/fMRI response profiles in this paradigm are robust to such eye movements.

(2) Decoding accuracy along the visual hierarchy raises questions about whether LOC is adequately engaged. Pairwise decoding accuracy is substantially higher in EVC than in LOC (Figure 1D), and the noise ceiling for LOC RDMs is markedly lower than for EVC across all layers (Supplementary Figure 4D-F). This pattern inverts the canonical hierarchical gradient of progressively stronger object decoding along the ventral visual stream, as well as the analogous gradient observed in DNN late layers that underlies the commonality analyses. As written, it is unclear how the manuscript reconciles this with its emphasis on LOC's role in higher-order, feedback-modulated representations with greater tolerance or increased complexity, unless decoding accuracies are understood as reflecting image-level rather than object-category-level discrimination. A parsimonious alternative is that the 24-image set is too small or too coarse to reveal category-level representations in LOC robustly, such that LOC RDMs may be driven by lower-level or background/contextual variance and noise. This concern has direct bearing on the mesoscale commonality analyses supporting the "feedback transmits high-complexity features" conclusion. I would encourage the authors to (a) report split-half reliability of LOC RDMs alongside the commonality analyses, and either (b) acknowledge that the feature-complexity inferences are conditional on LOC RDMs faithfully capturing object structure rather than residual contextual/low-level variance, or (c) discuss how replication with a richer stimulus set might bear on the feedback-content interpretation.

(3) The interareal feedback interpretation could be more robustly defended against intra-areal alternatives. In EVC, the authors carefully consider non-feedback explanations for layer-specific dynamics, including lateral connections modulating gain and superficial GE-BOLD bias, and conclude these are sufficient. The same skepticism is not extended to LOC, where the corresponding superficial-layer signal is interpreted as interareal feedback, with speculative sourcing to DLPFC. Slow (unmyelinated) horizontal/lateral propagation in superficial cortical layers (e.g., Davis et al., 2024) can, in principle, produce delayed superficial-layer signals on the timescale observed here without any interareal contribution. This asymmetry is compounded by the treatment of the absence of sustained EVC activity following the middle-layer peak, which is dismissed as a "limitation of the spatial and temporal sensitivity of our measurements" (lines 388-390). If feedback to EVC truly cannot be resolved with this method, the corresponding feedback claim in LOC, imaged with the same protocol, warrants comparable caution. The manuscript would benefit from either presenting positive evidence that distinguishes interareal feedback from intra-areal recurrence (e.g., frequency-band signatures, source-resolved EEG, or coupling with frontal regions), or qualifying the conclusion to "delayed superficial-layer activity consistent with either interareal feedback or intra-areal recurrence."

(4) The predictive coding framing is invoked but not well-grounded. The Discussion (lines 349-357) includes a theoretical implication of predictive coding. Predictive coding makes content-specific claims: feedback carries predictions, and feedforward carries error signals relative to those predictions. Dissociating these requires manipulations of expectation, congruence, or predictability, none of which are present in the current design. The observed layer-wise timing differences do not provide evidence for rejecting non-predictive accounts. I would suggest either removing this framing or explicitly noting that the present data neither support nor refute predictive coding.
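
The split-half reliability requested in point (2)(a) could be computed along the following lines. This is a minimal sketch, assuming RDMs are estimated separately from two halves of the data (for example odd and even runs; the split scheme is an assumption) and compared with a Spearman correlation, optionally followed by a Spearman-Brown correction. The dissimilarity values are invented for illustration.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def split_half_reliability(rdm_vec_a, rdm_vec_b):
    """Spearman correlation between RDM vectors from two data halves."""
    return pearson(ranks(rdm_vec_a), ranks(rdm_vec_b))

def spearman_brown(r):
    """Project half-data reliability to the full dataset."""
    return 2 * r / (1 + r)

# Hypothetical vectorized RDMs (4 images -> 6 pairwise dissimilarities).
half_a = [0.1, 0.4, 0.2, 0.8, 0.5, 0.9]
half_b = [0.2, 0.3, 0.1, 0.9, 0.6, 0.7]

r_half = split_half_reliability(half_a, half_b)
r_full = spearman_brown(r_half)
```

Reporting this value per layer alongside the commonality results would show how much of each LOC RDM is reproducible structure rather than noise.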

https://doi.org/10.7554/eLife.111186.1.sa1

Reviewer #3 (Public review):

Summary.

Carricarte and colleagues use 0.9 mm 7T fMRI in EVC and LOC, fused with previously collected EEG using the same stimulus set, in order to dissect feedforward and feedback contributions to human object processing through their layer-specific termination patterns. They report a feedforward signal in middle layers of EVC (~100 ms) and LOC (~160 ms), and a later signal in superficial LOC (~400 ms) that they interpret as interareal feedback. Using commonality analysis with a Vision Transformer, they argue that this late signal carries higher-complexity features than the earlier signal, and conclude that feedback actively increases representational complexity in LOC.
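
As background on the commonality analysis mentioned in this summary, the sketch below computes the variance in one RDM that is jointly explained by two others. It assumes the standard two-predictor form C = R²(y~x1) + R²(y~x2) - R²(y~x1+x2), with the EEG RDM as the outcome and the fMRI and DNN RDMs as predictors; which variable the authors treat as the outcome is not stated in the review, and the vectors are hypothetical stand-ins for vectorized RDMs.

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def r2_two_predictors(y, x1, x2):
    """R^2 of y regressed on x1 and x2, from pairwise correlations."""
    ry1, ry2, r12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    return (ry1 ** 2 + ry2 ** 2 - 2 * ry1 * ry2 * r12) / (1 - r12 ** 2)

def commonality(y, x1, x2):
    """Variance in y explained jointly (not uniquely) by x1 and x2."""
    return pearson(y, x1) ** 2 + pearson(y, x2) ** 2 - r2_two_predictors(y, x1, x2)

# Hypothetical toy vectors built from three mutually orthogonal components.
shared = [-1, -1, 1, 1, -1, -1, 1, 1]
only_fmri = [-1, 1, -1, 1, -1, 1, -1, 1]
only_dnn = [-1, -1, -1, -1, 1, 1, 1, 1]

eeg = shared
fmri = [s + u for s, u in zip(shared, only_fmri)]
dnn = [s + u for s, u in zip(shared, only_dnn)]

c = commonality(eeg, fmri, dnn)
```

In this toy case each predictor alone explains half of the EEG variance, both together explain two thirds, and the commonality coefficient is one third. Because the coefficient is bounded by the reliable variance in every RDM involved, a low noise ceiling in any one of them limits what the analysis can show.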

Strengths.

The empirical work is methodologically ambitious. Sub-millimeter 7T coverage of both EVC and LOC, combined with layer-resolved EEG-fMRI fusion, represents a substantial technical achievement. The authors first reproduce established macroscale EEG-fMRI fusion patterns at 7T before extending the approach to the layer level. The figures throughout are beautifully designed and convey complex analyses with clarity. The empirical core of the paper, that LOC contains layer-distinct dynamics at distinct times, with the late signal carrying representational structure that differs in some way from the early signal, is supported by the data, though with caveats imposed by the LOC noise ceiling.

Weaknesses.

The authors' interpretation of these data (interareal feedback that reflects feature-complexity, related to the functional role of these signals) is not adequately supported and requires either reframing or substantial additional evidence.

Feedback vs. recurrence. The late superficial-LOC signal is interpreted as interareal feedback, but the data are equally consistent with within-area recurrence, lateral connections, or sustained feedforward dynamics. A reader expecting evidence of higher-area signals returning to early-time middle layers - a signature of interareal feedback - finds none in either region.

"Functional role" overclaim. The paper repeatedly claims to characterize the "functional role" of feedforward and feedback, but contains no behavioral linkage, no perturbation, and no analysis relating signals to perceptual outcomes; the fMRI task is explicitly orthogonal to object processing. What is demonstrated is spatiotemporal dynamics and representational format. Both are valuable, but neither is equivalent to functional role.

DNN analysis. The DNN analyses use several non-standard modeling choices that introduce more uncertainty than clarity. In the main analyses, the authors only use four sampling points from a single model (DeiT-small): transformer blocks 1, 7, and 12, plus the classification head. Then, the authors make their headline claims about complexity by comparing block 12 and the classification head; within the model, this is a distinction between an embedding layer and a supervised category readout, not a feature-complexity gradient. As such, the authors' interpretation conflates semantic layers with representational "complexity." A more convincing use of this modeling strategy would be to demonstrate these effects in multiple models that might disentangle these factors, such as supervised (ResNet/ViT), self-supervised (DINOv2), and vision-language (CLIP) models, and then to visualize these brain-model relationships across all layers. Alternatively, there are many suitable model-free analyses that could demonstrate the unique representational information within LOC without introducing any model-related concerns.

Reliability of LOC layer-resolved RDMs. The lower-bound noise ceiling for LOC mesoscale RDMs is approximately 0.05 across layers, with deep-LOC reliability essentially at zero. The central layer-resolved dissociation rests on RDMs that individual subjects barely reproduce; consequently, the deep LOC layer is dropped from the commonality analysis (Figure 4C shows only middle/superficial layers, while Figure 4B shows all three for EVC) because the data cannot support it. This is not damning, but it is consequential, and not sufficiently addressed in the manuscript.
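
To make the noise-ceiling figures above concrete, the sketch below shows one common leave-one-subject-out formulation. It is an assumption that the authors used this variant: the lower bound correlates each subject's RDM with the mean RDM of the remaining subjects, and the upper bound correlates it with the mean of all subjects. The three subject vectors are hypothetical, and the sketch omits the rank or z-transformation of single-subject RDMs that is often applied before averaging.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def noise_ceiling(subject_vecs):
    """Return (lower, upper) bounds averaged over subjects."""
    lower, upper = [], []
    grand_mean = mean_vector(subject_vecs)
    for i, vec in enumerate(subject_vecs):
        others = subject_vecs[:i] + subject_vecs[i + 1:]
        lower.append(spearman(vec, mean_vector(others)))  # subject left out
        upper.append(spearman(vec, grand_mean))           # subject included
    n = len(subject_vecs)
    return sum(lower) / n, sum(upper) / n

# Hypothetical vectorized RDMs: 3 subjects, 3 pairwise dissimilarities each.
subjects = [[1, 2, 3], [1, 3, 2], [2, 1, 3]]
lower_bound, upper_bound = noise_ceiling(subjects)
```

For these toy subjects the lower bound is 1/3 and the upper bound is 2/3. A lower bound near zero, as reported for deep LOC, means that a single subject's RDM is essentially unpredictable from the rest of the group.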

https://doi.org/10.7554/eLife.111186.1.sa0
