The time course of visuo-semantic representations in the human brain is captured by combining vision and language models

  1. Department of Education and Psychology, Freie Universität Berlin, Berlin, Germany
  2. Institute of Cognitive Neurology and Dementia Research, Otto-von-Guericke-Universität Magdeburg, Magdeburg, Germany
  3. German Center for Neurodegenerative Diseases (DZNE), Otto-von-Guericke-Universität Magdeburg, Magdeburg, Germany

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.


Editors

  • Reviewing Editor
    Xilin Zhang
    South China Normal University, Guangzhou, China
  • Senior Editor
    Yanchao Bi
    Peking University, Beijing, China

Reviewer #1 (Public review):

Summary:

The authors provide a compelling case that the unique variance explained by LLMs is different from (and later than) the unique variance explained by DNNs. The study characterises when, and to some extent where, these differences occur, and, for LLMs, why. The authors also probe what in the sentences drives the brain alignment.

Strengths:

(1) The study is timely.

(2) The dataset and results are robust.

(3) There is a compelling separation between the unique responses related to LLMs and DNNs.

(4) The paper is well-written.

Weaknesses:

The authors could explore further what the overlap between the LLM and DNN represents and, more generally, how this relates to untrained networks.

Reviewer #2 (Public review):

Summary:

This study investigates the temporal dynamics of visuo-semantic processing in the human brain, leveraging both deep neural networks (DNNs) and large language models (LLMs). By developing encoding models based on vision DNNs, LLMs, and their fusion, the authors demonstrate that vision DNNs preferentially account for early, broadband EEG responses, while LLMs capture later, low-frequency signals and more detailed visuo-semantic information. The authors further show that parietal cortex responses during visuo-semantic processing can be partially accounted for by language features, highlighting the role of higher-level areas in encoding abstract semantic information.

Strengths:

The study leverages a very large EEG dataset with tens of thousands of stimulus presentations, which provides an unusually strong foundation for benchmarking a variety of vision DNNs and LLMs. This scale not only increases statistical power but also allows robust comparison across model architectures, ensuring that the conclusions are not idiosyncratic to a particular dataset or stimulus set.

By using high-density EEG, the authors are able to capture the fine-grained temporal dynamics of visuo-semantic processing, going beyond the coarse temporal resolution of fMRI-based studies. This enables the authors to disentangle early perceptual encoding from later semantic integration, and to characterize how different model types map onto these stages of brain activity. The temporal dimension provides a particularly valuable complement to previous fMRI-based model-to-brain alignment studies.

The encoding models convincingly show that vision DNNs and LLMs play complementary roles in predicting neural responses. The vision DNNs explain earlier broadband responses related to perceptual processing, while LLMs capture later, lower-frequency signals that reflect higher-order semantic integration. This dual contribution provides new mechanistic insights into how visual and semantic information unfold over time in the brain, and highlights the utility of combining unimodal models rather than relying on multimodal networks alone.
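
For readers less familiar with this analysis style, the "unique" and "shared" contributions discussed here are typically quantified by variance partitioning over the fused and single-model encoding fits; a generic formulation (the notation is illustrative and not taken from the paper) is:

    R^2_{\mathrm{shared}} = R^2_{\mathrm{vision}} + R^2_{\mathrm{LLM}} - R^2_{\mathrm{fusion}}
    R^2_{\mathrm{unique,\,vision}} = R^2_{\mathrm{fusion}} - R^2_{\mathrm{LLM}}
    R^2_{\mathrm{unique,\,LLM}} = R^2_{\mathrm{fusion}} - R^2_{\mathrm{vision}}

where each R^2 is the cross-validated prediction accuracy of the corresponding encoding model at a given channel and time point.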

Weaknesses:

(1) The experimental design is insufficiently described, particularly regarding whether participants were engaged in a behavioral task or simply passively viewing images. Task demands are known to strongly influence neural coding and representations, and without this information, it is difficult to interpret the nature of the EEG responses reported.

(2) The description of the encoding model lacks precision and formalization. It is not entirely clear what exactly is being predicted, how the model weights are structured across time points, or what the dimensionality of the inputs and outputs is. A more formal mathematical formulation would improve clarity and reproducibility (one possible formulation is sketched after this list).

(3) The selected vision DNNs (CORnet-S, ResNet, AlexNet, MoCo) have substantially lower ImageNet classification accuracies than current state-of-the-art models, with gaps of at least 10%. Referring to these models collectively as "vision DNNs" may overstate their representational adequacy. This performance gap raises concerns about whether the chosen models can fully capture the visual and semantic features needed for comparison with brain data. Clarification of the rationale for choosing these particular networks, and discussion of how this limitation might affect the conclusions, is needed.

(4) The analytic framework treats "vision" and "language" as strictly separate representational domains. However, semantics are known to emerge in many state-of-the-art visual models, with different layers spanning a gradient from low-level visual features to higher-level semantic representations. Some visual layers may be closer to LLM-derived representations than others. By not examining this finer-grained representational structure within vision DNNs, the study may oversimplify the distinction between vision- and language-based contributions (see the layer-wise comparison sketch after this list).

(5) The study uses static images, which restricts the scope of the findings to relatively constrained visual semantics. This limitation may explain why nouns and adjectives improved predictions over vision DNNs, but verbs did not. Verbs often require dynamic information about actions or events, which static images cannot convey.
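
Regarding point (2), one possible formalization, assuming a standard per-time-point ridge regression from PCA-reduced model features to EEG channel responses (the notation below is illustrative, not the authors'):

    \hat{W}_t = \arg\min_{W} \; \lVert Y_t - X W \rVert_F^2 + \lambda \lVert W \rVert_F^2,
    \qquad X \in \mathbb{R}^{n \times d}, \quad Y_t \in \mathbb{R}^{n \times c},

where n is the number of images, d the dimensionality of the (PCA-reduced) model features, c the number of EEG channels, and a separate weight matrix is estimated for each time point t; stating n, d, c, and the regularization scheme explicitly would resolve the ambiguities above.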
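
Regarding point (4), a minimal sketch of how the layer-wise gradient within a vision DNN could be related to the LLM embedding space, assuming precomputed activations and standard representational similarity analysis (all names below are illustrative, not the authors' pipeline):

    # Minimal sketch: how similar is each vision-DNN layer to the LLM embedding space?
    from scipy.spatial.distance import pdist
    from scipy.stats import spearmanr

    def rdm(features):
        # Condensed representational dissimilarity matrix: 1 - Pearson r between image pairs.
        return pdist(features, metric="correlation")

    def layer_to_llm_similarity(dnn_layers, llm_embeddings):
        # dnn_layers: dict mapping layer name -> (n_images, n_features) activations (assumed inputs).
        # llm_embeddings: (n_images, n_features) caption embeddings from the LLM.
        llm_rdm = rdm(llm_embeddings)
        return {name: spearmanr(rdm(acts), llm_rdm).correlation
                for name, acts in dnn_layers.items()}

Layers whose similarity to the LLM space is already high would suggest that part of the "language" contribution could, in principle, be carried by late vision-DNN layers.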

Reviewer #3 (Public review):

Summary:

Rong et al. compare EEG image responses from a large-scale dataset to state-of-the-art vision and language models, as well as to their fusion. They find that the fusion of models provides the best predictivity, with early contributions from vision models and later predictivity from language models. The paper has several strengths: high temporal resolution data (though at the expense of spatial resolution), a detailed comparison of the alignment (and differences) between vision and language model embeddings, and a comparison of the "fusion" of different DNN models.

Despite the paper's strengths, it is not clear what is at stake with these findings or how they advance our knowledge beyond other recent studies showing vision versus language model predictions of visual cortex responses with fMRI.

Strengths:

The authors use a large-scale EEG dataset and a comprehensive modeling approach. The methods are sound and involve multiple model comparisons. In particular, the disentangling of vision and language model features is something that has been largely ignored in prior related studies.

Weaknesses:

(1) The authors state their main hypothesis (lines 48-51) that human neural responses to visual stimulation are better modelled by combining representations from a vision DNN and an LLM than by the representations from either of the two components alone, and that the vision DNN and LLM components would uniquely predict earlier and later stages of visual processing, respectively.

While they confirm this hypothesis in largely compelling ways, it is not clear whether these results tell us something about the brain beyond how to build the most predictive model.

In particular, why do language models offer advantages over vision models, and what does this tell us about human visual processing? In several places, the discussion of advantages for the language model felt somewhat trivial and did not seem to advance our understanding of human vision, e.g., "responses for visual stimulation encode detailed information about objects and their properties" (lines 266-270) and "LLM representations capture detailed visuo-semantic information about the stimulus images" (line 293).

(2) It is not clear what the high temporal resolution EEG data tell us that the whole-brain fMRI data do not. The latency results seem to be largely in line with fMRI findings, where the early visual cortex is better predicted by vision models and the language model is better in later/more anterior regions. In addition, it would help to discuss whether the EEG signals are likely to be restricted to the visual cortex, or whether the LLM predictivity could reflect downstream processing captured by whole-brain EEG signals.

Relatedly, it would help the authors to expand on the implications of the frequency analysis.

(3) While the authors test many combinations of vision and language models and show that their "fusion" advantages are largely robust to these changes, it is still hard to ignore the vast differences between vision and language models in terms of architecture and how they are trained. Two studies (Wang et al., 2023, and Conwell et al., 2024) have now shown that, when properly controlling for architecture and dataset, there is little to no advantage of language alignment in predicting visual cortex responses. It would help for the authors both to discuss this aspect of the prior literature and to address its implications for their own findings (related to point 1 about what, if anything, is "special" about language models).

(4) Model features: it would help to state the dimensionality of the input embeddings for each model and how much variance is retained after the PCA step. I wonder how sensitive the findings are to this choice of dimensionality reduction, and whether an approach that selects the optimal model layer (in a cross-validated way) would show less of a difference between vision and language models (I realize this is not feasible with models like GPT-3). A sketch of such a check is given after this list.

(5) To better understand the fusion advantage, it would help to examine the results for a pair of vision models and for a pair of language models. Can a similar advantage be found by combining models from the same modality? (A sketch of this control follows below.)
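
For point (4), a minimal sketch of the kind of check that would address this, assuming the PCA step is applied per model before encoding (function and variable names are illustrative):

    # Minimal sketch: report the variance retained by the PCA step for each model's embeddings.
    from sklearn.decomposition import PCA

    def retained_variance(embeddings, n_components=100):
        # embeddings: (n_images, n_features) array of model features (assumed input).
        pca = PCA(n_components=n_components).fit(embeddings)
        return pca.explained_variance_ratio_.sum()

Reporting this number for every vision DNN and LLM used, and repeating the encoding analysis over a range of component counts, would show whether the vision/language difference is sensitive to this choice.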
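
For point (5), a minimal sketch of the suggested control, assuming feature concatenation followed by cross-validated ridge encoding at a single time point (the function is illustrative, not the authors' code):

    # Minimal sketch: does concatenating two models from the SAME modality also boost encoding accuracy?
    import numpy as np
    from sklearn.linear_model import RidgeCV
    from sklearn.model_selection import cross_val_score

    def fusion_score(features_a, features_b, eeg_t):
        # features_a, features_b: (n_images, d_a) and (n_images, d_b) model embeddings.
        # eeg_t: (n_images, n_channels) EEG responses at one time point.
        X = np.hstack([features_a, features_b])
        model = RidgeCV(alphas=np.logspace(-2, 5, 8))
        return cross_val_score(model, X, eeg_t, cv=5, scoring="r2").mean()

    # Compare, e.g.:
    #   fusion_score(vision_A, vision_B, eeg_t)  # two vision DNNs
    #   fusion_score(llm_A, llm_B, eeg_t)        # two LLMs
    #   fusion_score(vision_A, llm_A, eeg_t)     # cross-modal fusion, as in the paper

If same-modality pairs show a comparable boost, the "fusion" advantage would say more about feature diversity than about language per se.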
