Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Read more about eLife’s peer review process.

Editors
- Reviewing Editor: Anna Schapiro, University of Pennsylvania, Philadelphia, United States of America
- Senior Editor: Andre Marquand, Radboud University Nijmegen, Nijmegen, Netherlands
Reviewer #1 (Public review):
Summary:
This study compares four models - VALOR (dynamic visual-text alignment), CLIP (static visual-text alignment), AlexNet (vision-only), and WordNet (text-only) - in their ability to predict human brain responses using voxel-wise encoding modeling. The results show that VALOR not only achieves the highest accuracy in predicting neural responses but also generalizes more effectively to novel datasets. In addition, VALOR captures meaningful semantic dimensions across the cortical surface and demonstrates impressive predictive power for brain responses elicited by future events.
Strengths:
The study leverages a multimodal machine learning model to investigate how the human brain aligns visual and textual information. Overall, the manuscript is logically organized, clearly written, and easy to follow. The results provide good support for the paper's main conclusions.
Weaknesses:
(1) My primary concern is that the performance difference between VALOR and CLIP is not sufficiently explained. Both models are trained using contrastive learning on visual and textual inputs, yet CLIP performs significantly worse. The authors suggest that this may be due to VALOR being trained on dynamic movie data while CLIP is trained on static images. However, this explanation remains speculative. More in-depth discussion is needed on the architectural and inductive biases of the two models, and how these may contribute to their differences in modeling brain responses.
(2) The methods section lacks clarity regarding which layers of VALOR and CLIP were used to extract features for voxel-wise encoding modeling. A more detailed methodological description is necessary to ensure reproducibility and interpretability. Furthermore, discussion of the inductive biases inherent in these models - and their implications for brain alignment - is crucial.
(3) A broader question remains insufficiently addressed: what is the purpose of visual-text alignment in the human brain? One hypothesis is that it supports the formation of abstract semantic representations that are not tied to any specific input modality. While VALOR performs well in voxel-wise encoding, it is unclear whether this necessarily indicates the emergence of such abstract semantics. The authors are encouraged to discuss how the computational architecture of VALOR may reflect this alignment mechanism and what implications it has for understanding brain function.
(4) The current methods section does not provide enough details about the network architectures, parameter settings, or whether pretrained models were used. If so, please provide links to the pretrained models to facilitate reproducible science.
Reviewer #2 (Public review):
Summary:
Fu and colleagues have shown that VALOR, a model of multimodal and dynamic stimulus features, better predicts brain responses compared to unimodal or static models such as AlexNet, WordNet, or CLIP. The authors demonstrated the robustness of their findings by generalizing encoding results to an external dataset. They demonstrated the models' practical benefit by showing that semantic mappings were comparable to another model that required labor-intensive manual annotation. Finally, the authors showed that the model reveals predictive coding mechanisms of the brain, which held a meaningful relationship with individuals' fluid intelligence measures.
Strengths:
Recent advances in neural network models that extract visual, linguistic, and semantic features from real-world stimuli have enabled neuroscientists to build encoding models that predict brain responses from these features. Higher prediction accuracy indicates greater explained variance in neural activity, and therefore a better model of brain function. Commonly used models include AlexNet for visual features, WordNet for audio-semantic features, and CLIP for visuo-semantic features; these served as comparison models in the study. Building on this line of work, the authors developed an encoding model using VALOR, which captures the multimodal and dynamic nature of real-world stimuli. VALOR outperformed the comparison models in predicting brain responses. It also recapitulated known semantic mappings and revealed evidence of predictive processing in the brain. These findings support VALOR as a strong candidate model of brain function.
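A minimal sketch of the voxel-wise encoding approach described above is given here for concreteness; the array names, shapes, and the ridge penalty are hypothetical placeholders, not the authors' actual pipeline.

```python
# Minimal sketch of a voxel-wise encoding analysis; array names, shapes, and the
# ridge penalty are hypothetical placeholders, not the authors' pipeline.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 3000, 300, 768, 1000

# Stimulus features (e.g., VALOR embeddings) and fMRI responses.
X_train = rng.standard_normal((n_train, n_features))
X_test = rng.standard_normal((n_test, n_features))
Y_train = rng.standard_normal((n_train, n_voxels))
Y_test = rng.standard_normal((n_test, n_voxels))

# One linear map from stimulus features to every voxel, fit with ridge regularization.
model = Ridge(alpha=100.0).fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# Prediction accuracy: Pearson r between predicted and measured responses, per voxel.
def columnwise_pearson(a, b):
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return (a * b).mean(0)

accuracy = columnwise_pearson(Y_pred, Y_test)  # one r value per voxel
print(f"mean prediction accuracy: {accuracy.mean():.3f}")
```

A single shared ridge penalty is only one choice; per-voxel or banded regularization is also common in this literature.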
Weaknesses:
The authors argue that this modeling contributes to a better understanding of how the brain works. However, upon reading, I am less convinced about how VALOR's superior performance over other models tells us more about the brain. VALOR is a better model of the audiovisual stimulus because it processes multimodal and dynamic stimuli compared to other unimodal or static models. If the model better captures real-world stimuli, then I almost feel that it has to better capture brain responses, assuming that the brain is a system that is optimized to process multimodal and dynamic inputs from the real world. The authors could strengthen the manuscript if the significance of their encoding model findings were better explained.
In Study 3, the authors show high alignment between WordNet and VALOR feature PCs. Upon reading the method together with Figure 3, I suspect that the alignment almost has to be high, given that the authors projected VALOR features onto Huth et al.'s PC space. Could the authors conduct non-parametric permutation tests, such as shuffling the VALOR features prior to mapping them onto Huth et al.'s PC space, and then calculating the Jaccard scores? I imagine that the null distribution would be positively shifted. Still, I would be convinced if the alignment is higher than this shifted null distribution for each PC. If my understanding of this is incorrect, I suggest editing the relevant Method section (line 508), because this analysis was not easy to understand.
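One possible form of the suggested permutation test is sketched below, assuming the alignment is computed from voxel-wise VALOR weights projected into the PC space; the matrices, shapes, and top-voxel thresholding rule are illustrative stand-ins, not the authors' actual analysis.

```python
# Sketch of the suggested permutation test; matrices, shapes, and the top-voxel
# thresholding rule are illustrative stand-ins, not the authors' actual analysis.
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_valor_dims, n_pcs, n_perms = 5000, 768, 4, 1000

valor_weights = rng.standard_normal((n_voxels, n_valor_dims))  # voxel-wise VALOR weights
projection = rng.standard_normal((n_valor_dims, n_pcs))        # mapping into Huth et al.'s PC space
wordnet_pc_maps = rng.standard_normal((n_voxels, n_pcs))       # reference WordNet PC maps

def jaccard(map_a, map_b, top_frac=0.1):
    """Jaccard overlap of the top fraction of voxels in two cortical maps."""
    k = int(top_frac * len(map_a))
    a, b = set(np.argsort(map_a)[-k:]), set(np.argsort(map_b)[-k:])
    return len(a & b) / len(a | b)

observed = np.array([jaccard(valor_weights @ projection[:, i], wordnet_pc_maps[:, i])
                     for i in range(n_pcs)])

# Null distribution: shuffle the VALOR feature dimensions before projecting,
# then recompute the Jaccard score for each PC.
null = np.empty((n_perms, n_pcs))
for p in range(n_perms):
    perm = rng.permutation(n_valor_dims)
    shuffled = valor_weights[:, perm] @ projection
    null[p] = [jaccard(shuffled[:, i], wordnet_pc_maps[:, i]) for i in range(n_pcs)]

p_values = (null >= observed).mean(0)  # one-sided p-value per PC
print(p_values)
```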
In Study 4, the authors show that individuals whose superior parietal gyrus (SPG) exhibited high prediction distance had high fluid cognitive scores (Figure 4C). I had a hard time believing that this was a hypothesis-driven analysis. The authors motivate the analysis by stating that "SPG and PCu have been strongly linked to fluid intelligence (line 304)". Did the authors conduct only two analyses - SPG-fluid intelligence and PCu-fluid intelligence - without relating other brain regions to other individual differences measures? Even if so, the authors should have reported the corresponding r-value and p-value for PCu-fluid intelligence. If SPG-fluid intelligence indeed holds specificity in terms of statistical significance compared to all possible scenarios that were tested, is this rationally an expected result, and could the authors explain the specificity? Also, the authors should explain why they considered fluid intelligence to be a proxy for one's ability to anticipate upcoming scenes during movie watching. I would have understood the rationale better if the authors had at least aggregated predictive scores for all brain regions that held significance into one summary statistic and found a significant correlation with the fluid intelligence measure.
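An illustrative sketch of the suggested summary-statistic analysis follows; the region count, score arrays, and subject numbers are hypothetical, not drawn from the authors' data.

```python
# Illustrative sketch of the suggested summary-statistic analysis; the region
# count, score arrays, and subject numbers are hypothetical, not the authors' data.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_subjects, n_regions = 150, 12  # e.g., 12 regions with significant prediction distance

prediction_distance = rng.standard_normal((n_subjects, n_regions))  # per subject and region
fluid_scores = rng.standard_normal(n_subjects)                      # fluid cognition composite

# Aggregate the significant regions into one summary statistic per subject, so that a
# single brain-behavior correlation is tested instead of many region-by-region tests.
summary = prediction_distance.mean(axis=1)
r, p = pearsonr(summary, fluid_scores)
print(f"summary prediction distance vs fluid cognition: r = {r:.3f}, p = {p:.3g}")
```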
Reviewer #3 (Public review):
Summary:
In this work, the authors aim to improve neural encoding models for naturalistic video stimuli by integrating temporally aligned multimodal features derived from a deep learning model (VALOR) to predict fMRI responses during movie viewing.
Strengths:
The major strength of the study lies in its systematic comparison across unimodal and multimodal models using large-scale, high-resolution fMRI datasets. The VALOR model demonstrates improved predictive accuracy and cross-dataset generalization. The model also reveals inherent semantic dimensions of cortical organization and can be used to evaluate the integration timescale of predictive coding.
This study demonstrates the utility of modern multimodal pretrained models for improving brain encoding in naturalistic contexts. While not conceptually novel, the application is technically sound, and the data and modeling pipeline may serve as a valuable benchmark for future studies.
Weaknesses:
The overall framework of using data-driven features derived from pretrained AI models to predict neural responses has been well studied and accepted by the field of neuroAI for over a decade. The demonstrated improvements in prediction accuracy, generalization, and semantic mapping are largely attributable to the richer temporal and multimodal representations provided by the VALOR model, not a novel neural modeling framework per se. As such, the work may be viewed as an incremental application of recent advances in multimodal AI to a well-established neural encoding pipeline, rather than a conceptual advance in modeling neural mechanisms.
Several key claims are overstated or lack sufficient justification:
(1) Lines 95-96: The authors claim that "cortical areas share a common space," citing references [22-24]. However, these references primarily support the notion that different modalities or representations can be aligned in a common embedding space from a modeling perspective, rather than providing direct evidence that cortical areas themselves are aligned in a shared neural representational space.
(2) The authors discuss semantic annotation as if it is still a critical component of encoding models. However, recent advances in AI-based encoding methods rely on features derived from large-scale pretrained models (e.g., CLIP, GPT), which automatically capture semantic structure without requiring explicit annotation. While the manuscript does not systematically address this transition, it is important to clarify that the use of such pretrained models is now standard in the field and should not be positioned as an innovation of the present work. Additionally, the citation of Huth et al. (2012, Neuron) to justify the use of WordNet-based annotation omits the important methodological shift in Huth et al. (2016, Nature), which moved away from manual semantic labeling altogether.
Since the 2012 dataset is used primarily to enable comparison in Study 3, the emphasis should not be placed on reiterating the disadvantages of semantic annotation, which have already been addressed in prior work. Instead, the manuscript's strength lies in its direct comparison between data-driven feature representations and semantic annotation based on WordNet categories. The authors should place greater emphasis on analyzing and discussing the differences revealed by these two approaches, rather than focusing mainly on the general advantage of automated semantic mapping.
(3) The authors use subject-specific encoding models trained on the HCP dataset to predict group-level mean responses in an independent in-house dataset. While this analysis is framed as testing model generalization, it is important to clarify that it is not assessing traditional out-of-distribution (OOD) generalization, where the same subject is tested on novel stimuli, but rather evaluating which encoding model's feature space contains more stimulus-specific and cross-subject-consistent information that can transfer across datasets.
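A minimal sketch of this cross-dataset evaluation is given below, with hypothetical arrays standing in for the HCP and SFM data; the per-subject fitting and the correlation step are assumptions, not the authors' exact procedure.

```python
# Minimal sketch of the cross-dataset evaluation described above; all arrays and the
# per-subject fitting choices are assumptions, not the authors' exact procedure.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_hcp_trs, n_sfm_trs, n_features, n_voxels, n_subjects = 2000, 400, 512, 800, 10

X_hcp = rng.standard_normal((n_hcp_trs, n_features))            # HCP movie features
X_sfm = rng.standard_normal((n_sfm_trs, n_features))            # SFM movie features
Y_hcp = rng.standard_normal((n_subjects, n_hcp_trs, n_voxels))  # per-subject HCP responses
Y_sfm_group = rng.standard_normal((n_sfm_trs, n_voxels))        # group-mean SFM responses

def zscore(a):
    return (a - a.mean(0)) / a.std(0)

# Fit one encoding model per HCP subject, then test each on the group-mean SFM responses.
transfer_r = np.empty((n_subjects, n_voxels))
for s in range(n_subjects):
    model = Ridge(alpha=100.0).fit(X_hcp, Y_hcp[s])
    transfer_r[s] = (zscore(model.predict(X_sfm)) * zscore(Y_sfm_group)).mean(0)

# High transfer scores reflect stimulus-driven, cross-subject-consistent structure in the
# feature space, not within-subject out-of-distribution generalization.
print(f"mean cross-dataset accuracy: {transfer_r.mean():.3f}")
```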
Within this setup, the finding that VALOR outperforms CLIP, AlexNet, and WordNet is somewhat expected. VALOR encodes rich spatiotemporal information from videos, making it more aligned with movie-based neural responses. CLIP and AlexNet are static image-based models and thus lack temporal context, while WordNet only provides coarse categorical labels with no stimulus-specific detail. Therefore, the results primarily reflect the advantage of temporally aware features in capturing shared neural dynamics, rather than revealing surprising model generalization. A direct comparison to pure video-based models, such as Video Swin Transformers or other more recent video models, would help strengthen the argument.
Moreover, while WordNet-based encoding models perform reasonably well within-subject in the HCP dataset, their generalization to group-level responses in the Short Fun Movies (SFM) dataset is markedly poorer. This could indicate that these models capture a considerable amount of subject-specific variance, which fails to translate to consistent group-level activity. This observation highlights the importance of distinguishing between encoding models that capture stimulus-driven representations and those that overfit to individual heterogeneities.