Figures and data

Overview of the study design.
a Encoding model construction and generalization testing (Study 1 & Study 2). Left panel: Whole-brain BOLD responses were recorded while participants viewed naturalistic movies from the HCP 7T dataset. Visual, linguistic, and multimodal (video–text aligned) features were extracted at each TR and used to train voxel-wise encoding models. During training, the encoding model learns voxel-specific weight matrices that map stimulus features to BOLD responses, capturing each voxel’s feature tuning. Right panel: To assess generalization, the learned voxel-wise weights were held fixed and applied to features extracted from novel movies in an independent dataset (ShortFunMovies, SFM), collected at a different site with different participants. Predicted BOLD responses were compared with empirically measured group-mean fMRI responses in the SFM dataset using Pearson correlation, providing a test of cross-subject and cross-dataset transfer of stimulus-locked representations. b Schematic of the video–text alignment model (VALOR). VALOR encodes temporally extended video segments (comprising multiple consecutive frames over several seconds) and their associated textual descriptions into separate visual and linguistic embedding spaces. These representations are aligned into a shared 512-dimensional joint embedding space via contrastive learning, such that matched video–text pairs are pulled together while mismatched pairs are pushed apart. By operating on dynamic video input rather than isolated frames, VALOR captures both semantic content and temporal structure, distinguishing it from static image–text alignment models such as CLIP. c Using VALOR to study predictive coding (Study 4). To probe predictive coding, the encoding model was extended to include a temporal forecast window. Features at the current time point (Ft) were combined with features from a future time point (Ft+d), where d denotes the prediction distance in TRs. 
Comparing models with and without future features allowed us to quantify regional differences in predictive timescales across cortex. All human-related images in this figure are AI-generated illustrations. No real faces or identifiable individuals are shown.
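The voxel-wise encoding and generalization procedure described above can be sketched in a few lines. This is a minimal illustration with synthetic data: the feature dimensionality, the closed-form ridge solution, and the regularization strength are assumptions for the sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: T TRs, F stimulus features, V voxels (all sizes hypothetical).
T, F, V = 200, 16, 50
X_train = rng.standard_normal((T, F))            # per-TR features (e.g., VALOR embeddings)
W_true = rng.standard_normal((F, V))
Y_train = X_train @ W_true + 0.5 * rng.standard_normal((T, V))

# Voxel-wise encoding weights via closed-form ridge regression.
lam = 1.0
W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(F), X_train.T @ Y_train)

# Generalization test: hold the learned weights fixed and predict responses
# to features extracted from novel movies.
X_test = rng.standard_normal((T, F))
Y_test = X_test @ W_true + 0.5 * rng.standard_normal((T, V))
Y_pred = X_test @ W

def pearson_per_voxel(a, b):
    # Column-wise Pearson correlation between predicted and measured time courses.
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return (a * b).mean(0)

r = pearson_per_voxel(Y_pred, Y_test)            # one correlation per voxel
```

Keeping `W` fixed at test time is what makes this a transfer test: any accuracy on the new dataset must come from stimulus-locked tuning learned during training, not refitting.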

Comparison of encoding and generalization performance across models.
a Whole-brain encoding performance (Study 1). Voxel-wise and ROI-level comparisons of four encoding models trained on the HCP 7T dataset: VALOR (video–text alignment, red), AlexNet (visual features, blue), WordNet (semantic features, pink), and CLIP (image-based multimodal features, gray). Left panel: Cortical surface maps show the best-performing model at each voxel; mean whole-brain prediction accuracy across participants is also reported, with VALOR significantly outperforming all baselines. Right panel: ROI-wise prediction accuracy across predefined regions (e.g., visual, language, and high-level association areas). Bars indicate group-level means, with error bars denoting standard error of the mean (SEM). Statistical comparisons (two-sided paired t-tests, FDR-corrected) are reported only between VALOR and each of the other models. b Generalization performance on independent data (Study 2). Models trained on HCP data were tested on the independent ShortFunMovies (SFM) dataset to assess generalizability. Left: Voxel-wise surface maps highlight where VALOR outperforms the other models across cortex. Right: ROI-level generalization performance, with color coding and statistical tests consistent with panel (a). In both voxel-wise and ROI analyses, VALOR generalizes broadly across sensory and high-level regions, exceeding the performance of all other models. Larger brain renderings are provided in the Supplementary Material (Fig. S4).

Cortical semantic mapping from video-text alignment features.
a Voxel-wise prediction performance. Top: Encoding accuracy maps based on video-text alignment features (VALOR) versus WordNet-based semantic features from Huth et al. (2012). Bottom: Difference map showing voxel-wise performance gains (VALOR minus WordNet). Red areas indicate regions where VALOR outperforms WordNet. Results are shown for one representative subject; similar patterns were observed in the remaining four participants (see Supplementary Fig. S2). b Semantic dimensions revealed by video-text alignment. Principal component analysis (PCA) was applied to encoding weights derived from VALOR features across ∼20,000 video clips. Example frames illustrate the semantic meaning of the top four PCs: PC1 = Mobility (e.g., movement vs. stillness; aligns with Huth's PC1), PC2 = Social content (e.g., people interacting vs. nature; aligns with Huth's PC2), PC3 = Mechanical vs. non-mechanical stimuli (aligns with Huth's PC4), PC4 = Civilization vs. natural environments (aligns with Huth's PC3). All human-related images in this figure have been replaced with AI-generated illustrations to avoid including identifiable individuals. No real faces or photographs of people are shown. c Spatial correspondence with manually annotated semantic maps in Huth et al. (2012). Jaccard similarity matrix quantifying spatial overlap between the top four PCs from the VALOR model (rows) and those from the WordNet-based model (columns). The highest similarity score in each column fell on the diagonal, indicating strong alignment between automatically derived and manually annotated semantic components. Detailed projections of each semantic component onto the cortical surface can be found in the Supplementary Material (Fig. S3).

Predictive coding revealed by video-text alignment features.
a Voxel-wise predictive scores. For each voxel, a predictive score was computed as the improvement in prediction accuracy when future stimulus features (Ft+d) were added to current features (Ft) in the encoding model. Only voxels with significant predictive enhancement across participants are shown (Wilcoxon rank-sum test, FDR-corrected). b Regional prediction distances. The voxels with the highest predictive scores were located in the superior temporal gyrus (STG), middle temporal gyrus (MTG), superior parietal gyrus (SPG), and precuneus (PCu). Prediction distance was calculated for these regions on a per-voxel, per-participant basis and then averaged. c Brain–behavior correlation. Scatter plot showing a positive correlation between prediction distance in the SPG and individual fluid cognition scores across HCP participants (r = 0.172, p < .05, FDR-corrected), suggesting that individuals with broader predictive horizons in parietal cortex tend to exhibit stronger fluid reasoning ability.
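One plausible way to compute the per-voxel quantities in panels a and b is sketched below, assuming the prediction distance is the lag d that yields the largest accuracy improvement; the caption does not specify the exact rule, and the accuracy values here are made up for illustration.

```python
import numpy as np

# Hypothetical prediction accuracies (Pearson r) for a single voxel:
# r_current uses current features Ft only; r_future[i] additionally
# includes future features Ft+d at distance d = distances[i] (in TRs).
r_current = 0.20
distances = np.arange(1, 9)
r_future = np.array([0.26, 0.28, 0.27, 0.24, 0.22, 0.21, 0.20, 0.19])

# Predictive score at each distance: improvement from adding future features.
predictive_scores = r_future - r_current

# Prediction distance (illustrative definition): the d with the largest
# improvement, computed per voxel and per participant, then averaged.
best = int(distances[np.argmax(predictive_scores)])
```

With these toy numbers the improvement peaks at d = 2 TRs and decays thereafter, the qualitative profile one would expect for a region with a short predictive horizon.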

Representational similarity analysis (RSA) across feature types and brain regions.
Bar plots show RSA results comparing four feature extraction models—video-text alignment (VALOR), CLIP, AlexNet, and WordNet—across seven predefined regions of interest (ROIs): V1, V4 (visual), MTG, AG (language), and PCu, PCC, mPFC (higher-order cognitive regions). RSA values reflect the similarity between multivoxel neural patterns and feature-derived representational dissimilarity matrices (RDMs), aggregated across all participants and movie segments from the HCP 7T dataset. VALOR consistently outperformed or matched other models across most ROIs, particularly in V4, MTG, AG, PCu, and PCC, indicating its ability to capture both low-level perceptual and high-level semantic structure. In V1, VALOR was comparable to AlexNet. In mPFC, performance was similar between VALOR and WordNet. Statistical comparisons were conducted using paired-sample t-tests, with false discovery rate (FDR) correction applied for multiple comparisons. Asterisks indicate significance levels (*p < .05, ***p < .001); "n.s." = not significant. Error bars represent ±1 standard error of the mean (SEM) across participants.
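The RSA pipeline this figure reports can be sketched in numpy with synthetic patterns. This minimal version assumes correlation-distance RDMs and a Pearson comparison of their upper triangles; the paper's exact distance measure and comparison statistic (e.g., Spearman) may differ.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: N stimuli (movie segments) x voxels / x features.
N = 30
neural = rng.standard_normal((N, 100))    # multivoxel patterns for one ROI
feats = rng.standard_normal((N, 16))      # model features (e.g., VALOR embeddings)

def rdm(patterns):
    # Representational dissimilarity matrix: 1 - Pearson correlation across rows.
    return 1.0 - np.corrcoef(patterns)

def rsa(rdm_a, rdm_b):
    # Correlate the upper triangles (diagonal excluded) of the two RDMs.
    iu = np.triu_indices_from(rdm_a, k=1)
    return np.corrcoef(rdm_a[iu], rdm_b[iu])[0, 1]

score = rsa(rdm(neural), rdm(feats))      # similarity of neural and feature geometry
```

A high `score` means stimuli that evoke similar neural patterns also have similar model features, which is the sense in which a feature space "captures" an ROI's representational structure.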

Individual subject maps showing performance difference between feature models (related to Figure 3a).
Cortical flat maps show voxel-wise differences in prediction accuracy between the video-text alignment model (VALOR) and the WordNet-based semantic model, computed as VALOR minus WordNet. Data are shown for four individual subjects (S2–S5) from the Huth et al. (2012) dataset. Red voxels indicate regions where VALOR achieved higher prediction accuracy than WordNet; blue voxels reflect regions where WordNet outperformed VALOR. These maps mirror the analysis presented in Figure 3a (main text) and demonstrate that VALOR consistently outperforms WordNet across a wide range of cortical areas, including both hemispheres and across subjects. The color scale reflects the magnitude of performance differences, ranging from -0.10 to 0.10 in Pearson correlation units.

Projections of semantic components onto the cortical surface.
Left: Component 1 (PC1) loadings projected onto the individual surface of S1 as continuous, signed weights (color bar indicates magnitude). Right: Components 2–4 displayed with a discrete scheme: PC2 = red, PC3 = green, PC4 = blue (greater saturation denotes larger loadings; near-zero shown in gray).

Cortical surface projections of voxel-wise winner-take-all maps showing the best-performing representation at each voxel.
(a) Encoding performance on HCP (Study 1). (b) Generalization performance on SFM (Study 2; trained on HCP, tested on SFM). Colors indicate the representation with the highest prediction accuracy: VALOR (red), AlexNet (blue), WordNet (pink), and CLIP (gray).
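The winner-take-all assignment behind these maps is a per-voxel argmax over model accuracies; a sketch with random stand-in accuracy values (the voxel count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

models = ["VALOR", "AlexNet", "WordNet", "CLIP"]
# Hypothetical prediction accuracy for each model at each voxel (4 x V).
acc = rng.random((4, 500))

winner = np.argmax(acc, axis=0)          # index of the best model per voxel
labels = np.array(models)[winner]        # model label to color-code each voxel
```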