Stimuli, experimental design and multivariate pattern classification

(A) Stimulus set. The stimuli consisted of 24 different naturalistic object images. (B) fMRI experimental design. On each trial, participants viewed images for 1 s, followed by a 4-s baseline interval. Participants were required to detect color changes of the fixation cross, which occurred at random times throughout the experiment. (C) Extraction of voxel values to form condition-specific pattern vectors from a region of interest (example here: EVC) and pairwise object classification using a support vector machine. ROIs are depicted for visualization purposes only. (D) Object-pairwise multivariate decoding output. Object-specific information was reliably decoded from EVC (71.16%, P = 0.0039) and LOC (60.69%, P = 0.0092) using one-sample permutation tests. Error bars indicate the standard error of the mean across participants.

Macroscale representational EEG-fMRI fusion

(A) Representational similarity analysis. For each condition, we extracted the neural pattern from the region of interest (example here: EVC). We quantified pattern dissimilarity as 1 - Pearson’s correlation for all pairs of experimental conditions (i, j) and stored each value at entry (i, j) of an fMRI representational dissimilarity matrix (RDM) whose rows and columns are indexed by the conditions. ROIs are depicted for visualization purposes only. (B) Representational EEG-fMRI fusion. For each time point t, we correlated the EEG-RDM with the fMRI-RDMs of EVC and LOC using Spearman’s rank-order correlation. (C) Spatiotemporal neural dynamics at the macroscale level. The time course in EVC peaked earlier than that in LOC. (D) Difference between EVC and LOC curves in C. EEG signals initially correlated more strongly with EVC than with LOC, and later more strongly with LOC than with EVC. (E) Commonality analysis. For each DNN layer, we correlated its RDM to each ROI-specific fMRI-RDM and the mean EEG-RDM at the time interval with significant ROI-specific temporal dynamics. (F) Format of representation (≈ feature complexity) in EVC and LOC. Visual representations of low-to-mid-level complexity emerge early in EVC, while mid-to-high-level object representations emerge later in LOC. Shaded regions and error bars indicate the standard error of the mean across participants. Colored circles denote significant time points (N = 32; cluster-defining threshold P < 0.05; cluster threshold P < 0.05); uncolored circles and horizontal lines indicate peak latency means and 95% confidence intervals, respectively. Colored asterisks indicate significant correlations (right-tailed permutation tests, FDR-corrected; *P < 0.05; **P < 0.01; ***P < 0.001). Colored triangles represent model layers with the highest occurrence proportion, determined through 1,000-iteration bootstraps.
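The RDM construction in (A) and the time-resolved fusion in (B) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the array shapes, and the choice to compare only the upper triangles of the RDMs are assumptions made for the example.

```python
import numpy as np

def rdm_from_patterns(patterns):
    """patterns: (n_conditions, n_voxels). Returns the RDM of 1 - Pearson's r."""
    return 1.0 - np.corrcoef(patterns)

def upper_triangle(rdm):
    """Vectorize the unique off-diagonal entries (i < j) of a symmetric RDM."""
    i, j = np.triu_indices(rdm.shape[0], k=1)
    return rdm[i, j]

def rank_average(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float).ravel()
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def spearman(a, b):
    """Spearman's rank-order correlation as Pearson's r on average ranks."""
    return float(np.corrcoef(rank_average(a), rank_average(b))[0, 1])

def fusion_time_course(eeg_rdms, fmri_rdm):
    """eeg_rdms: (n_times, n_conditions, n_conditions). One correlation per time point."""
    target = upper_triangle(fmri_rdm)
    return np.array([spearman(upper_triangle(r), target) for r in eeg_rdms])
```

Calling `fusion_time_course` once per ROI (for example with the EVC and LOC fMRI-RDMs) yields one time course per region, which is the kind of curve shown in (C).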

Mesoscale representational EEG-fMRI fusion

(A) For each cortical layer (deep, middle, and superficial; example here: EVC), we computed the partial Spearman’s rank-order correlation between its layer-specific fMRI-RDM and the EEG-RDM at each time point t. (B) Layer-specific spatiotemporal neural dynamics in EVC. EEG signals correlated early across layers, with the time course in the middle layer peaking earlier than in the deep and superficial layers. (C) Difference between EVC layer curves in B. EEG signals later correlated more strongly with the deep and superficial layers than with the middle layer. (D) Layer-specific spatiotemporal neural dynamics in LOC. EEG signals correlated early in the middle layer and later in the superficial layer. (E) Difference between LOC layer curves in D. EEG signals initially correlated more strongly with the middle layer than with the superficial layer, and later more strongly with the superficial layer than with the middle layer. Shaded regions indicate the standard error of the mean across participants. Colored circles denote significant time points (N = 32, cluster-defining threshold P < 0.05, cluster threshold P < 0.05); uncolored circles and horizontal lines indicate peak latency means and 95% confidence intervals, respectively.
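A partial Spearman correlation of the kind named in (A) can be sketched by rank-transforming all variables, regressing the control variables out of both ranked vectors, and correlating the residuals. The caption does not state which variables were partialed out; the sketch assumes they are the vectorized fMRI-RDMs of the other two cortical layers. All names are hypothetical.

```python
import numpy as np

def rank_average(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float).ravel()
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def partial_spearman(x, y, covariates):
    """Spearman correlation of x and y after removing the covariates.

    x, y: vectorized RDMs (e.g. one layer's fMRI-RDM and the EEG-RDM at time t).
    covariates: list of vectorized RDMs to partial out (assumed here to be
    the RDMs of the other layers).
    """
    rx, ry = rank_average(x), rank_average(y)
    # Design matrix: intercept plus the ranked covariates.
    design = np.column_stack([np.ones(len(rx))] + [rank_average(c) for c in covariates])
    res_x = rx - design @ np.linalg.lstsq(design, rx, rcond=None)[0]
    res_y = ry - design @ np.linalg.lstsq(design, ry, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])
```

With an empty covariate list the function reduces to the ordinary Spearman correlation, which makes it easy to compare fusion results with and without partialing.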

Mesoscale commonality analysis between fMRI, EEG and ViT

(A) Procedure. For each transformer layer, we correlated its RDM to each layer-specific fMRI-RDM in EVC and LOC and the mean EEG-RDM at the time interval with significant layer-specific temporal dynamics. (B) Format of representation (≈ feature complexity) across cortical layers in EVC. Low model layers correlated strongly with all cortical layers in EVC. (C) Format of representation (≈ feature complexity) across cortical layers in LOC. Middle model layers correlated strongly with the middle cortical layer in LOC, while high model layers correlated primarily with the superficial cortical layer. Colored asterisks indicate significant correlations (N = 32, right-tailed permutation tests, FDR-corrected; *P < 0.05; **P < 0.01; ***P < 0.001). Colored triangles represent model layers with the highest occurrence proportion, determined through 1,000-iteration bootstraps.
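For readers unfamiliar with commonality analysis, the quantity of interest is the variance in one RDM that is jointly explained by two others. The sketch below uses the standard two-predictor commonality coefficient, C = R²(y; x1) + R²(y; x2) - R²(y; x1, x2). Treating the EEG-RDM as the dependent variable and the fMRI-RDM and model-layer RDM as predictors is an assumption of this example; the caption does not specify the exact formulation, and the function names are hypothetical.

```python
import numpy as np

def r_squared(y, predictors):
    """Proportion of variance in y explained by a least-squares fit with intercept."""
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y))] + [np.asarray(p, dtype=float) for p in predictors])
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    residuals = y - design @ beta
    return 1.0 - residuals.var() / y.var()

def commonality(y, x1, x2):
    """Variance in y shared by x1 and x2 (two-predictor commonality coefficient).

    Assumed roles: y = vectorized mean EEG-RDM, x1 = vectorized fMRI-RDM,
    x2 = vectorized model-layer RDM.
    """
    return r_squared(y, [x1]) + r_squared(y, [x2]) - r_squared(y, [x1, x2])
```

Evaluating `commonality` once per model layer gives a profile over layers, comparable in spirit to the curves in (B) and (C).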

Macroscale representational EEG-fMRI fusion and commonality analysis using partial correlations, and laminar profiles of the GE-BOLD response

(A) Spatiotemporal neural dynamics at the macroscale level. For EVC we partialed out the effect of LOC, and for LOC we partialed out the effect of EVC. Early representations correlated more strongly with EVC than with LOC, and later representations correlated more strongly with LOC than with EVC. (B) Difference between EVC and LOC curves in (A). (C) Macroscale commonality analysis for time intervals showing significant EVC–LOC differences. Visual representations of low-to-mid-level complexity emerge primarily in EVC, while mid-to-high-level object representations emerge in LOC. (D) Laminar profiles of the GE-BOLD response to objects in EVC and LOC. Laminar responses derived from GE-BOLD signals are strongly affected by non-specific macrovascular signals, leading to higher activation in superficial layers. Data were averaged across 24 object conditions and 10 fMRI participants. Shaded regions denote the standard error of the mean across participants; colored circles indicate significant time points (N = 32, cluster-defining threshold P < 0.05, cluster threshold P < 0.05); uncolored circles and horizontal lines indicate peak latency means and 95% confidence intervals, respectively. Colored asterisks indicate significant correlations (N = 32, right-tailed permutation tests, FDR-corrected; *P < 0.05; **P < 0.01; ***P < 0.001); colored triangles represent model layers with the highest occurrence proportion, determined through 1,000-iteration bootstraps.

Mesoscale representational EEG–fMRI fusion and commonality analyses under alternative experimental choices

(A–D) Representational EEG–fMRI fusion without partial correlations. (A) Layer-specific spatiotemporal neural dynamics in early visual cortex (EVC). (B) Differences between EVC layer-specific curves shown in (A). (C) Layer-specific spatiotemporal neural dynamics in LOC. (D) Differences between LOC layer-specific curves shown in (C). (E–H) Representational EEG–fMRI fusion and commonality analyses with matched voxel counts across cortical layers. For each layer, voxels were randomly subsampled to match the layer with the fewest voxels; this procedure was repeated ten times and results were averaged across repetitions. (E) Layer-specific spatiotemporal neural dynamics in EVC. (F) Layer-specific spatiotemporal neural dynamics in LOC. (G,H) Representational format (≈ feature complexity) across cortical layers in EVC and LOC, respectively. (I,J) Representational format (≈ feature complexity) across cortical layers in EVC and LOC, respectively, using AlexNet. (K,L) Representational format (≈ feature complexity) across cortical layers in EVC and LOC for time intervals showing significant between-layer differences. Two temporal clusters (labeled 1 and 2) were defined from these intervals. For each cluster, commonality analysis linked model layer-specific DNN-RDMs to layer-specific fMRI-RDMs and the mean EEG-RDM within the corresponding time window in EVC (K) and LOC (L). Shaded regions and error bars indicate the standard error of the mean across participants. Colored circles denote significant time points (N = 32; cluster-defining threshold P < 0.05; cluster threshold P < 0.05); uncolored circles and horizontal lines indicate peak latency means and 95% confidence intervals, respectively. Colored asterisks indicate significant correlations (right-tailed permutation tests, FDR-corrected; *P < 0.05; **P < 0.01; ***P < 0.001). Colored triangles represent model layers with the highest occurrence proportion, determined through 1,000-iteration bootstraps.
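The voxel-count matching described for (E–H) can be sketched as follows: every layer is randomly subsampled to the voxel count of the smallest layer, an analysis is run on the subsample, and the outcome is averaged over ten repetitions. The function names, the dictionary layout, and the use of an RDM as the example analysis are assumptions made for illustration; the caption only states that results were averaged.

```python
import numpy as np

def rdm_from_patterns(patterns):
    """patterns: (n_conditions, n_voxels). Returns the RDM of 1 - Pearson's r."""
    return 1.0 - np.corrcoef(patterns)

def subsample_and_average(layer_patterns, analysis, n_repeats=10, seed=0):
    """Run `analysis` on voxel-matched subsamples of each layer and average.

    layer_patterns: dict mapping a layer name to a (n_conditions, n_voxels) array.
    analysis: hypothetical callable mapping a pattern array to a numeric array.
    """
    rng = np.random.default_rng(seed)
    # The layer with the fewest voxels sets the common voxel count.
    n_min = min(p.shape[1] for p in layer_patterns.values())
    averaged = {}
    for name, patterns in layer_patterns.items():
        results = []
        for _ in range(n_repeats):
            # Draw voxels without replacement so no voxel is counted twice.
            voxels = rng.choice(patterns.shape[1], size=n_min, replace=False)
            results.append(analysis(patterns[:, voxels]))
        averaged[name] = np.mean(results, axis=0)
    return averaged
```

For the layer with the fewest voxels, every draw selects all of its voxels, so its averaged result equals the result on the full layer; only the larger layers are affected by the subsampling.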

Mesoscale commonality analysis across individual time points

Representational format (≈ feature complexity) in the deep (A, D), middle (B, E), and superficial (C, F) layers of EVC (A–C) and LOC (D–F). Shaded regions denote the standard error of the mean across participants. Colored circles indicate significant time points (N = 32, cluster-defining threshold P < 0.05, cluster threshold P < 0.05).

Reliability and noise ceiling of the fMRI data at macroscale and mesoscale

RDM reliability at the macroscale across brain regions (A) and at the mesoscale across cortical layers in EVC (B) and LOC (C). Noise ceiling estimates at the macroscale (D) and mesoscale in EVC (E) and LOC (F). Error bars indicate the standard error of the mean across participants (N = 10).