Figures and data

Overview of the study design.
a Encoding model construction and generalization testing (Study 1 & Study 2). Left panel: Whole-brain BOLD responses were recorded while participants passively viewed naturalistic movies from the HCP 7T dataset. We extracted visual, linguistic, and multimodal (video-text aligned) features from the movie clips and trained voxel-wise encoding models using kernel ridge regression to predict brain activity. The resulting model weights reflect each voxel’s tuning to these features, providing insight into how the brain represents dynamic, real-world stimuli. Right panel: The trained models were tested on a separate dataset (‘ShortFunMovies’) in which 20 participants watched 8 novel movie segments. Generalization performance was assessed by computing the Pearson correlation between predicted BOLD responses (from HCP-trained models) and empirically measured group-level fMRI responses in the SFM dataset. b Schematic of the video-text alignment model (VALOR). VALOR extracts multimodal features by encoding videos and their corresponding textual descriptions into separate embedding spaces, which are aligned in a 512-dimensional joint space via contrastive learning. The model minimizes the distance between matched video-text pairs and maximizes it for mismatched pairs, enabling temporally and semantically rich multimodal representations. c Using VALOR to study predictive coding (Study 4). To probe predictive coding, we extended the encoding model to include a forecast window. Features at the current timepoint (Ft) were concatenated with features from a future timepoint (Ft+d), allowing us to assess how well different brain regions anticipate upcoming stimuli. d is the temporal offset (prediction distance) from the current TR (repetition time). Predictive performance was compared against models using only current features, revealing regional differences in prediction horizons. All human-related images in this figure have been replaced with AI-generated illustrations to avoid including identifiable individuals. No real faces or photographs of people are shown.
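
As a schematic illustration of the encoding-and-generalization pipeline described above, the Python sketch below fits voxel-wise kernel ridge regression models on training features and BOLD responses and scores cross-dataset generalization with per-voxel Pearson correlations. Variable names, array shapes, and the regularization setting are illustrative assumptions rather than the exact implementation used in the study.

```python
# Minimal sketch of the voxel-wise encoding and generalization pipeline.
# Hyperparameters and variable names are assumptions for illustration only.
import numpy as np
from sklearn.kernel_ridge import KernelRidge


def fit_encoding_model(X_train, Y_train, alpha=1.0):
    """Fit a kernel ridge encoding model.

    X_train : (n_TRs, n_features) stimulus features (e.g., VALOR embeddings)
    Y_train : (n_TRs, n_voxels) BOLD responses
    """
    # KernelRidge supports multi-output regression, so one fit covers all voxels.
    model = KernelRidge(kernel="linear", alpha=alpha)
    model.fit(X_train, Y_train)
    return model


def generalization_score(model, X_test, Y_test):
    """Per-voxel Pearson correlation between predicted and measured BOLD."""
    Y_pred = model.predict(X_test)
    # Column-wise Pearson r via z-scored products
    Yp = (Y_pred - Y_pred.mean(0)) / Y_pred.std(0)
    Yt = (Y_test - Y_test.mean(0)) / Y_test.std(0)
    return (Yp * Yt).mean(0)


# Example usage with random stand-ins for HCP (training) and SFM (test) data
rng = np.random.default_rng(0)
X_hcp, Y_hcp = rng.standard_normal((500, 512)), rng.standard_normal((500, 1000))
X_sfm, Y_sfm = rng.standard_normal((200, 512)), rng.standard_normal((200, 1000))
enc = fit_encoding_model(X_hcp, Y_hcp)
r_per_voxel = generalization_score(enc, X_sfm, Y_sfm)
```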

Comparison of encoding and generalization performance across models.
a Whole-brain encoding performance (Study 1). Voxel-wise and ROI-level comparisons of four encoding models trained on the HCP 7T dataset: VALOR (video-text alignment, red), AlexNet (visual features, blue), WordNet (linguistic features, pink), and CLIP (image-based multimodal features, gray). Left panel: Cortical surface maps show the best-performing model at each voxel; mean whole-brain prediction accuracy across participants is also shown, with VALOR significantly outperforming all baselines. Right panel: ROI-wise prediction accuracy across predefined regions (e.g., visual, language, high-level association areas). Bars indicate group-level means, with error bars denoting standard error of the mean (SEM). Statistical comparisons (one-sided paired t-tests, FDR-corrected) are reported only between VALOR and each of the other models. b Generalization performance on independent data (Study 2). Models trained on HCP data were tested on the independent ShortFunMovies (SFM) dataset to assess generalizability. Left panel: Voxel-wise surface maps highlight where VALOR outperforms other models across cortex. Right panel: ROI-level generalization performance, with color coding and statistical tests consistent with panel (a). For both voxel-wise and ROI analyses, VALOR demonstrates broad generalization across both sensory and high-level regions, exceeding the performance of all other models.
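
The ROI-level statistics in this figure (one-sided paired t-tests of VALOR against each baseline, FDR-corrected) can be sketched as follows; the data layout and the Benjamini-Hochberg correction routine are assumptions for illustration, not the study's exact analysis code.

```python
# Sketch of the ROI-level comparisons: VALOR vs. each baseline model,
# one-sided paired t-tests with FDR correction across all tests.
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests


def compare_to_baselines(acc, roi_names, baselines=("AlexNet", "WordNet", "CLIP")):
    """acc: dict mapping model name -> (n_participants, n_ROIs) accuracy array."""
    pvals, labels = [], []
    for roi_idx, roi in enumerate(roi_names):
        for base in baselines:
            # One-sided test: VALOR > baseline within this ROI
            _, p = ttest_rel(acc["VALOR"][:, roi_idx],
                             acc[base][:, roi_idx],
                             alternative="greater")
            pvals.append(p)
            labels.append((roi, base))
    # Benjamini-Hochberg FDR correction across all ROI x baseline tests
    reject, p_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    return {lab: (p, sig) for lab, p, sig in zip(labels, p_fdr, reject)}
```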

Cortical semantic mapping from video-text alignment features.
a Voxel-wise prediction performance. Top: Encoding accuracy maps based on video-text alignment features (VALOR) versus WordNet-based semantic features from Huth et al. (2012). Bottom: Difference map showing voxel-wise performance gains (VALOR minus WordNet). Red areas indicate regions where VALOR outperforms WordNet. Results are shown for one representative subject; similar patterns were observed in the remaining four participants (see Supplementary Fig. S2). b Semantic dimensions revealed by video-text alignment. Principal component analysis (PCA) was applied to encoding weights derived from VALOR features across ∼20,000 video clips. Example frames illustrate the semantic meaning of the top four PCs: PC1 = Mobility (e.g., movement vs. stillness; aligns with Huth’s PC1), PC2 = Social content (e.g., people interaction vs. nature; aligns with Huth’s PC2), PC3 = Mechanical vs. non-mechanical stimuli (aligns with Huth’s PC4), PC4 = Civilization vs. natural environments (aligns with Huth’s PC3). All human-related images in this figure have been replaced with AI-generated illustrations to avoid including identifiable individuals. No real faces or photographs of people are shown. c Spatial correspondence with manually annotated semantic maps in Huth et al. (2012). Jaccard similarity matrix quantifying spatial overlap between the top four PCs from the VALOR model (rows) and those from the WordNet-based model (columns). The highest similarity score in each column fell on the diagonal, indicating strong alignment between automatically derived and manually annotated semantic components.
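
A minimal sketch of the analyses in panels b and c, under assumed data shapes: PCA over voxel-wise VALOR encoding weights to obtain the top semantic components, followed by a Jaccard similarity matrix between binarized PC maps. The percentile threshold used for binarization is an assumption, not the study's exact criterion.

```python
# Sketch of the semantic-dimension analysis: PCA on encoding weights and
# Jaccard overlap between two sets of PC maps (e.g., VALOR vs. WordNet).
import numpy as np
from sklearn.decomposition import PCA


def top_pc_maps(weights, n_pcs=4):
    """weights: (n_voxels, n_features) encoding-model weights.
    Returns (n_pcs, n_voxels) voxel projections onto the top PCs."""
    pca = PCA(n_components=n_pcs)
    return pca.fit_transform(weights).T  # voxel scores per PC


def jaccard_matrix(pcs_a, pcs_b, q=90):
    """Jaccard similarity between PC maps binarized at the q-th percentile."""
    bin_a = pcs_a > np.percentile(pcs_a, q, axis=1, keepdims=True)
    bin_b = pcs_b > np.percentile(pcs_b, q, axis=1, keepdims=True)
    J = np.zeros((len(bin_a), len(bin_b)))
    for i, a in enumerate(bin_a):
        for j, b in enumerate(bin_b):
            J[i, j] = (a & b).sum() / (a | b).sum()  # intersection over union
    return J
```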

Predictive coding revealed by video-text alignment features.
a Voxel-wise predictive scores. For each voxel, a predictive score was computed as the improvement in prediction accuracy when future stimulus features (Ft+d) were added to current features (Ft) in the encoding model. Only voxels with significant predictive enhancement across participants are shown (Wilcoxon rank-sum test, FDR-corrected). b Regional prediction distances. The voxels with the highest predictive scores were concentrated in the STG, MTG, SPG, and PCu. Prediction distance was calculated for these regions on a per-voxel, per-participant basis and then averaged. c Brain–behavior correlation. Scatter plot showing a positive correlation between prediction distance in the SPG and individual fluid cognitive scores across HCP participants (r = 0.172, p < .05, FDR-corrected). This suggests that individuals with broader predictive horizons in the parietal cortex tend to exhibit stronger fluid reasoning ability.
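
The predictive score and prediction distance can be illustrated with the sketch below, which augments current features Ft with future features Ft+d and measures the per-voxel accuracy gain. Treating the best-performing offset as the prediction distance, the simple train/test split, and the shared baseline across offsets are simplifying assumptions for illustration.

```python
# Sketch of the predictive-score and prediction-distance computation.
import numpy as np
from sklearn.kernel_ridge import KernelRidge


def encoding_accuracy(X, Y, n_train):
    """Fit on the first n_train TRs; return per-voxel Pearson r on the rest."""
    model = KernelRidge(kernel="linear", alpha=1.0).fit(X[:n_train], Y[:n_train])
    Yp, Yt = model.predict(X[n_train:]), Y[n_train:]
    Yp = (Yp - Yp.mean(0)) / Yp.std(0)
    Yt = (Yt - Yt.mean(0)) / Yt.std(0)
    return (Yp * Yt).mean(0)


def predictive_profile(F, Y, n_train, max_d=8):
    """For each offset d, accuracy gain of [F_t, F_{t+d}] over F_t alone."""
    # Baseline computed once on the full series (a simplification; test
    # segments shrink slightly as d grows).
    base = encoding_accuracy(F, Y, n_train)
    gains = []
    for d in range(1, max_d + 1):
        X_future = np.concatenate([F[:-d], F[d:]], axis=1)  # (T-d, 2*features)
        gains.append(encoding_accuracy(X_future, Y[:-d], n_train) - base)
    return np.stack(gains)  # (max_d, n_voxels) predictive scores


def prediction_distance(gains):
    """Per-voxel offset (in TRs) with the largest predictive gain."""
    return gains.argmax(axis=0) + 1
```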

Representational similarity analysis (RSA) across feature types and brain regions.
Bar plots show RSA results comparing four feature extraction models—video-text alignment (VALOR), CLIP, AlexNet, and WordNet—across seven predefined regions of interest (ROIs): V1, V4 (visual), MTG, AG (language), and PCu, PCC, mPFC (higher-order cognitive regions). RSA values reflect the similarity between multivoxel neural patterns and feature-derived representational dissimilarity matrices (RDMs), aggregated across all participants and movie segments from the HCP 7T dataset. VALOR consistently outperformed or matched other models across most ROIs, particularly in V4, MTG, AG, PCu, and PCC, indicating its ability to capture both low-level perceptual and high-level semantic structure. In V1, VALOR was comparable to AlexNet. In mPFC, performance was similar between VALOR and WordNet. Statistical comparisons were conducted using paired-sample t-tests, with false discovery rate (FDR) correction applied for multiple comparisons. Asterisks indicate significance levels (*p < .05, ***p < .001); “n.s.” = not significant. Error bars represent ±1 standard error of the mean (SEM) across participants.
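
A minimal RSA sketch under assumed data shapes: representational dissimilarity matrices (RDMs) are computed from ROI voxel patterns and from model features, then compared via Spearman correlation over their condensed (pairwise) forms. The correlation-distance metric and the rank-based comparison are assumptions for illustration.

```python
# Minimal RSA sketch: neural RDM vs. feature-derived RDM for one ROI.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def rdm(patterns, metric="correlation"):
    """patterns: (n_stimuli, n_dims) -> condensed pairwise dissimilarity vector."""
    return pdist(patterns, metric=metric)


def rsa_score(roi_patterns, feature_matrix):
    """Spearman similarity between a neural RDM and a model RDM."""
    neural_rdm = rdm(roi_patterns)
    model_rdm = rdm(feature_matrix)
    rho, _ = spearmanr(neural_rdm, model_rdm)
    return rho


# Example: compare VALOR features to a V1 ROI (random stand-ins)
rng = np.random.default_rng(1)
v1 = rng.standard_normal((100, 300))          # 100 movie segments x 300 voxels
valor_feats = rng.standard_normal((100, 512))  # 100 segments x 512-d embeddings
print(rsa_score(v1, valor_feats))
```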

Individual subject maps showing performance difference between feature models (related to Figure 3a).
Cortical flat maps show voxel-wise differences in prediction accuracy between the video-text alignment model (VALOR) and the WordNet-based semantic model, computed as VALOR minus WordNet. Data are shown for four individual subjects (S2–S5) from the Huth et al. (2012) dataset. Red voxels indicate regions where VALOR achieved higher prediction accuracy than WordNet; blue voxels reflect regions where WordNet outperformed VALOR. These maps mirror the analysis presented in Figure 3a (main text) and demonstrate that VALOR consistently outperforms WordNet across a wide range of cortical areas, including both hemispheres and across subjects. The color scale reflects the magnitude of performance differences, ranging from −0.10 to 0.10 in Pearson correlation units.