Modality-Agnostic Decoding of Vision and Language from fMRI

Mitja Nikolaus; Milad Mozafari; Isabelle Berry; Nicholas Asher; Leila Reddy; Rullen VanRullen

doi:10.7554/eLife.107933.1

eLife Assessment

This manuscript introduces a potentially valuable large-scale fMRI dataset pairing vision and language, and employs rigorous decoding analyses to investigate how the brain represents visual, linguistic, and imagined content. The current manuscript blurs the line between a resource paper and a theoretical contribution, and the evidence for truly modality-agnostic representations remains incomplete at this stage. Clarifying the conceptual aims and strengthening both the dataset technicality and the quantitative analyses would improve the manuscript's significance for the fields of cognitive neuroscience and multimodal AI.

https://doi.org/10.7554/eLife.107933.1.sa3

Significance of findings

valuable: Findings that have theoretical or practical implications for a subfield

landmark
fundamental
important
valuable
useful

Strength of evidence

incomplete: Main claims are only partially supported

exceptional
compelling
convincing
solid
incomplete
inadequate

During the peer-review process the editor and reviewers write an eLife assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate). Learn more about eLife assessments

Abstract

Humans perform tasks involving the manipulation of inputs regardless of how these signals are perceived by the brain, thanks to representations that are agnostic to the stimulus modality. Investigating such modality-agnostic representations requires experimental datasets with multiple modalities of presentation. In this paper, we introduce and analyze SemReps-8K, a new large-scale fMRI dataset of 6 subjects watching both images and short text descriptions of such images, as well as conditions during which the subjects were imagining visual scenes. The multimodal nature of this dataset enables the development of modality-agnostic decoders, trained to predict which stimulus a subject is seeing, irrespective of the modality in which the stimulus is presented. Further, we performed a searchlight analysis revealing that large areas of the brain contain modality-agnostic representations. Such areas are also particularly suitable for decoding visual scenes from the mental imagery condition. The dataset will be made publicly available.

Introduction

Several regions in the human brain have developed a high degree of specialization for particular lower-level perceptive as well as higher-level cognitive functions (Kanwisher, 2010). For many higher-level functions, it is crucial to be able to manipulate inputs regardless of the modality in which a stimulus was perceived by the brain. Such manipulations can be performed thanks to representations that are abstracted away from particularities of specific modalities, and are therefore modality-agnostic. A range of theories have been developed to explain how and where in the human brain such abstract representations are created (Damasio, 1989; Binder et al., 2009; Martin, 2016; Barsalou, 2016; Ralph et al., 2017).

In order to study how modality-agnostic information is represented in the brain and to exploit it for modality-agnostic decoding, large multimodal neuroimaging datasets with well-controlled stimuli across modalities are required. While large-scale datasets exist for vision (Huth et al., 2012; Chang et al., 2019; Allen et al., 2022), language (Brennan and Hale, 2019; Nastase et al., 2021; Schoffelen et al., 2019; Tang et al., 2023a), and video (naturalistic movies) (Aliko et al., 2020; Visconti di Oleggio Castello et al., 2020; Boyle et al., 2020), none of these contain a controlled set of equivalent stimuli that are presented separately in both modalities. For instance, the different modalities in movies (vision and language) are complementary but do not always carry the same semantics. Further, they are not presented separately but simultaneously, impeding a study of the respective activity pattern caused by each modality in isolation.

Here, we present SemReps-8K, a new large-scale multimodal fMRI dataset of 6 subjects each viewing more than 8,000 stimuli which are presented separately in one of two modalities, as images of visual scenes or as descriptive captions of such images. In addition, the dataset also contains 3 imagery conditions for each subject, where they had to imagine an visual scene based on a caption description they had received before the start of the fMRI experiment. We exploit this new data to develop decoders that are specifically trained to leverage modality-agnostic patterns in the brain. Such modality-agnostic decoders are trained on brain imaging data from multiple modalities, which we demonstrate here for the case of vision and language. In contrast to modality-specific decoders that can be applied only in the modality that they were trained on, modality-agnostic decoders can be applied to decode stimuli from multiple modalities, even without knowing a priori the modality the stimulus was presented in.

We find that modality-agnostic decoders trained on this dataset perform on par with their respective modality-specific counterparts for decoding images, despite the additional challenge of uncertainty about the stimulus modality. For decoding captions, the modality-agnostic decoders even outperform the respective modality-specific decoders (because the former, but not the latter, can leverage the additional training data from the other image modality).

Additionally, we use this novel kind of decoders for a searchlight analysis to localize regions with modality-agnostic representations in the brain. Previous studies that aimed to localize modality-agnostic patterns were based on limited and rather simple stimulus sets and did not always agree on the exact location and extent of such regions (e.g. Vandenberghe et al., 1996; Shinkareva et al., 2011; Devereux et al., 2013; Fairhall and Caramazza, 2013; Jung et al., 2018). We design a searchlight analysis based on a combination of modality-agnostic decoders and cross-decoding. The results reveal that modality-agnostic patterns can be found in a widespread left-lateralized network across the brain, encompassing virtually all regions that have been proposed previously.

Finally, we find that modality-agnostic decoders trained only on data with perceptual input also generalize to conditions during which the subjects were performing mental imagery. There is a large overlap in the areas that we identified as modality-agnostic and those that are suitable for decoding mental imagery.

Related Work

Modality-agnostic representations

Decades of neuro-anatomical research (e.g. based on clinical lesions) and electrophysiology in non-human and human primates, as well as modern experiments leveraging recent brain imaging techniques have provided evidence that the activation patterns in certain brain areas are modality-specific; for example, the occipital cortex responds predominantly to visual stimulation (Felleman and Van Essen, 1991; Sereno et al., 1995; Grill-Spector and Malach, 2004), and a commonly left-lateralized network responds to language processing tasks (Zola-Morgan, 1995; Fedorenko et al., 2010, 2011; Friederici, 2017; Brennan, 2022).

More recent research has started to focus on higher-level regions that respond with modality-agnostic patterns, i.e., patterns that are abstracted away from any modality-specific information. A modality-agnostic region responds with similar patterns to input stimuli of the same meaning, even if they are presented in different modalities (e.g. the word “cat” and picture of a cat). Such regions have also been described as abstract/conceptual (Binder, 2016), modality-invariant (Man et al., 2012), modality-independent (Dirani and Pylkkänen, 2024), supramodal (Sanchez et al., 2020), or amodal (Fairhall and Caramazza, 2013) (although see distinction made in Barsalou, 2016).

Several theories and frameworks on how the brain forms modality-agnostic representations from modality-specific inputs have been proposed. The convergence zones view proposes that information coming from modality-specific sensory cortices is integrated in multiple convergence zones that are distributed across the cortex, predominantly in temporal and parietal lobes (Damasio, 1989; Tranel et al., 1997; Meyer and Damasio, 2009). These convergence zones are organized hierarchically, learned associations are used to create abstractions from lower-level to higher-level feature representations (Simmons and Barsalou, 2003; Meyer and Damasio, 2009). A perceived stimulus first causes activity in the related low-level modality-specific region (e.g. visual cortex), subsequently higher-level convergence zones serve as relays that cause associated activity in other regions of the brain (e.g. the language network) (Meyer and Damasio, 2009; Kiefer and Pulvermüller, 2012). According to Binder, 2016, the most high-level convergence zones can become so abstract that they are representing amodal symbols.

The GRAPES framework (Grounding Representations in Action, Perception, and Emotion Systems) also suggests that representations are distributed across temporal and parietal areas of the cortex. More specifically, they are hypothesized to be situated in areas connected to the perception and manipulation of the environment, as well as in the language system (Martin, 2009, 2016). According to this theory, conceptual knowledge is organized in domains: For example, semantic information related to object form and object motion is represented within specific visual processing systems, regardless of the stimulus modality, and both for perception as well as imagination.

The hub-and-spoke theory states that cross-modal interactions are mediated by a single modality-agnostic hub, located in the anterior temporal lobes (Rogers et al., 2004; Lambon Ralph et al., 2006; Patterson and Lambon Ralph, 2016; Ralph et al., 2017). The hub contains a “continuous distributed representation space that expresses conceptual similarities among items even though its dimensions are not independently interpretable” (Frisby et al., 2023, p. 262). The spokes form the links between the hubs and the modality-specific association cortices. Most importantly, semantic representations are not solely based in the hub, for a given concept all spokes that are linked to modalities in which the concept can be experienced do contribute to the semantic representation. This explains why selective damage to spokes can cause category-specific deficits (Pobric et al., 2010).

This is conceptually similar to some aspects of the Global Workspace Theory, which assumes both a multimodal convergence of inputs towards a specific (network of) region(s), and the possibility of flexibly recruiting unimodal regions into this Global Workspace (Baars, 1993, 2005).

While these and other theories partly disagree on how modality-agnostic information is represented in the brain, they agree that such information is distributed across the cortex, and possibly overlapping with the semantic network (Binder et al., 2009; Huth et al., 2012; Andrews et al., 2014; Zwaan, 2016).

Decoding of vision and language from fMRI

Early approaches of brain decoding focused on identifying and reconstructing limited sets of simple visual stimuli (Haxby et al., 2001; Cox and Savoy, 2003; Kay et al., 2008; Naselaris et al., 2009; Nishimoto et al., 2011). Soon after, attempts to decode linguistic stimuli could identify single words and short paragraphs with the help of models trained to predict features extracted from word embeddings (Pereira et al., 2018).

More recently, large-scale open source fMRI datasets for both vision and language have become available (Chang et al., 2019; Allen et al., 2022; Schoffelen et al., 2019; Tang et al., 2023b) and allowed for the training of decoding models for a larger range and more complex naturalistic stimuli with the help of features extracted from deep learning models. For example, modality-specific decoders for vision can be trained by mapping the brain activity of subjects viewing naturalistic images to feature representation spaces of computational models of the same modality (i.e. vision models) (Shen et al., 2019; Beliy et al., 2019; Lin et al., 2022; Takagi and Nishimoto, 2023; Ozcelik and VanRullen, 2023). Moreover, a range of studies provided evidence that certain representations can transfer between vision and language by evaluating decoders in a modality that they were not trained on (Shinkareva et al., 2011; Man et al., 2012; Fairhall and Caramazza, 2013; Simanova et al., 2014; Jung et al., 2018). The performance in such cross-modal decoding evaluations always lags behind when compared to within-modality decoding. One explanation is that modality-specific decoders are not explicitly encouraged to pick up on modality-agnostic features during training, and modality-specific features do not transfer to other modalities.

To address this limitation, we here propose to directly train modality-agnostic decoders, i.e. models that are exposed to multiple stimulus modalities during training in order to make it more likely that they are leveraging representations that are modality-agnostic. Training this kind of decoder is enabled by the multimodal nature of our fMRI dataset: The stimuli are taken from COCO, a multimodal dataset of images with associated descriptive captions (Lin et al., 2014). During the experiment the subjects are exposed to stimuli in both modalities (images and captions) in separate trials. Crucially, we can map the brain activity of each trial (e.g. the subject viewing an image) to modality-agnostic features extracted from both modalities (the image and the corresponding caption) when training the decoder models. After training, a single modality-agnostic decoder can be used to decode stimuli from multiple modalities, leveraging representations that are common to all modalities.

Decoding of mental imagery

Apart from decoding perceived stimuli, it is also possible to decode representations when subjects were performing mental imagery, without being exposed to any perceptual input. Different theories on mental imagery processes emphasize either the role of the early visual areas (Kosslyn et al., 1999; Pearson, 2019) or the role of the high-level visual areas in the ventral temporal cortex and frontoparietal networks (Spagna et al., 2021; Hajhajate et al., 2022; Liu et al., 2025). There is evidence for both kinds of theories in the form of neuroimaging studies that used decoding to identify stimuli during mental imagery. Some of these found relevant patterns in the early visual cortex (Albers et al., 2013; Naselaris et al., 2015), others highlighted the role of higher-level areas in the ventral visual processing stream (Stokes et al., 2009; Reddy et al., 2010; Lee et al., 2012; VanRullen and Reddy, 2019; Boccia et al., 2019) as well as the precuneus and the intraparietal sulcus (Johnson and Johnson, 2014). These discrepancies can possibly be explained by differences in experimental design: For example, the early visual cortex might only become involved if the task requires the imagination of high-resolution details, which are represented in lower levels of the visual processing hierarchy (Kosslyn and Thompson, 2003).

Crucially, it has been shown that decoders trained exclusively on trials with perceptual input can generalize to imagery trials (Stokes et al., 2009; Reddy et al., 2010; Lee et al., 2012; Johnson and Johnson, 2014; Naselaris et al., 2015), providing evidence that representations formed during perception overlap to some degree with representations formed during mental imagery (Dijkstra et al., 2019).

In our study, we explored to what extent these findings hold true for more varied and complex stimuli. Following previous approaches, we use decoders trained exclusively on trials where subjects were viewing images and captions, and evaluated them on their ability to decode the imagery trials. We additionally hypothesized that mental imagery should be at least as conceptual as it is sensory and therefore primarily recruit modality-agnostic representations. Consequently, modality-agnostic decoders should be ideally suited to decode mental imagery and outperform modality-specific decoders on that task.

Methods for localizing modality-agnostic regions

The first evidence for the existence of modality-agnostic regions came from observations of patients with lesions in particular cortical regions which lead to deficits in the retrieval and use of knowledge across modalities (Warrington and Shallice, 1984; Warrington and Mccarthy, 1987; Gainotti, 2000; Damasio et al., 2004). Semantic impairments across modalities have also been observed in patients with the neurodegenerative disorder semantic dementia (Warrington, 1975; Snowden et al., 1989; Jefferies et al., 2009).

In early work exploring the possible locations of modality-agnostic regions in healthy subjects, brain activity was recorded using imaging techniques while they were presented with a range of concepts in two modalities (e.g. words and pictures). Regions that were active during semantic processing of stimuli in the first modality were compared to regions that were active during semantic processing of stimuli in the second modality. The conjunction of these regions was proposed to be modality-agnostic (Vandenberghe et al., 1996; Moore and Price, 1999; Bright et al., 2004).

While this methodology allows for the identification of candidate regions in which semantic processing of multiple modalities occurs, it can not be used to probe the information represented in these regions. In order to compare the information content (i.e. multivariate patterns) of brain regions, researchers have developed Representational Similarity Analysis (RSA, Kriegeskorte et al., 2008) as well as encoding and decoding analyses (Naselaris et al., 2011). More specifically, RSA has been used to find modality-agnostic regions by comparing activation patterns of a candidate region when subjects are viewing stimuli from different modalities (Devereux et al., 2013; Handjaras et al., 2016; Liuzzi et al., 2017). This comparison is performed in an indirect way, by measuring the correlation of dissimilarity matrices of activation patterns. In turn, cross-decoding analysis can be leveraged to identify modality-agnostic regions by training a classifier to predict the category of a stimulus in a given modality, and then evaluating its performance to predict the category of stimuli that were presented in another modality (Shinkareva et al., 2011; Man et al., 2012; Fairhall and Caramazza, 2013; Simanova et al., 2014; Jung et al., 2018). However, all these studies relied on a predefined set of stimulus categories, and can therefore not be easily extended to more realistic and complex stimuli, as we perceive them in our everyday life.

We summarize candidates for modality-agnostic regions that have been identified by previous studies in Appendix 3. This overview reveals substantial disagreement regarding the possible locations of modality-agnostic patterns in the brain. For example, Fairhall and Caramazza, 2013 found modality-agnostic representations in the left ventral temporal cortex (fusiform, parahippocampal, and perirhinal cortex), middle and inferior temporal gyrus, angular gyrus, parts of the prefrontal cortex as well as the precuneus. Shinkareva et al., 2011 found a larger network of left-lateralized regions, including additionally the left superior temporal, inferior parietal, supramarginal, inferior and inferior occipital, precentral and postcentral gyrus, supplementary motor area, intraparietal sulcus, cuneus, posterior cingulum as well as the right fusiform gyrus and the superior parietal gyrus, paracentral lobule on both hemispheres. In contrast, Jung et al., 2018 found modality-agnostic representations only in the right prefrontal cortex. These diverging results can probably be explained by the limited number of stimuli as well as the use of artificially constructed stimuli in certain studies.

Recent advances in machine learning have enabled another generation of fMRI analyses based on large-scale naturalistic datasets. Here, we present a new multimodal dataset of subjects viewing both images and text. Most importantly, the dataset contains a large number of naturalistic stimuli in the form of complex visual scenes and full sentence descriptions of the same type of complex scenes, instead of pictures of single objects and words as commonly used in previous studies. This data enables the development of modality-agnostic decoders that are explicitly trained to leverage features that are shared across modalities. Further, we use this data to localize modality-agnostic regions in the brain by applying decoders in a multimodal searchlight analysis.

Methods

fMRI Experiment

Six subjects (2 female, age between 20 and 50 years, all right-handed and fluent English speakers) participated in the experiment after providing informed consent. The study was performed in accordance with French national ethical regulations (Comité de Protection des Personnes, ID 2019-A01920-57). We collected functional MRI data using a 3T Philips ACHIEVA scanner (gradient echo pulse sequence, TR=2s, TE=30ms, 46 slices with a 32-channel head coil, slice thickness=3mm with 0.2mm gap, in-plane voxel dimensions 3×3mm). At the start of each session, we further acquired high-resolution anatomical images for each subject (voxel size=1mm³, TR=8.13ms, TE=3.74ms, 170 sagittal slices).

Scanning was spanned over 10 sessions (except for sub-01: 11 sessions), each consisting of 13 to 16 runs during which the subjects were presented with 86 stimuli. Each run started and ended with an 8s fixation period. The stimulus type varied randomly inside each run between images and captions. Each stimulus was presented for 2.5 seconds at the center of the screen (visual angle: 14.6 degrees; captions were displayed in white on a dark gray background (font: “Consolas”), the inter-stimulus interval was 1s. Every 10 stimuli there was a fixation trial that lasted for 2.5s. Every 5min there was a longer fixation trial for 16s.

Subjects performed a one-back matching task: They were instructed to press a button when-ever the stimulus matched the immediately preceding one (cf. Figure 1 Panel A). In case the previous stimulus was of the same modality (e.g. two captions in a row), the subjects were instructed to press a button if the stimuli matched exactly. In the cross-modal case (e.g. an image followed by a caption), the button had to be pressed if the caption was a valid description of the image, and vice versa. Positive one-back trials occurred on average every 10 stimuli.

Panel A: Setup of the main fMRI experiment.
Subjects were seeing images and captions in random alternation. Whenever the current stimulus matched the previous stimulus, the subjects were instructed to press a button (one-back matching task). Images and captions for illustration only; actual size and stimuli as described in the text. Panel B: Number of distinct training stimuli (excluding trials that were one-back targets or during which the subject pressed the response button). There was an additional set of 140 stimuli (70 images and 70 captions) used for testing. Panel C: Setup of the fMRI experiment for the imagery trials. Subjects were instructed to remember 3 image descriptions with corresponding indices (numbers 1 to 3). One of these indices was displayed during the instruction phase, followed by a fixation phase, and then the subjects were imagining the visual scene for 10s.

Images and captions were taken from the training and validation sets of the COCO dataset (Lin et al., 2014). This dataset contains 5 matching captions for each image, of which we only considered the shortest one in order to fit on the screen and to ensure a comparable length for all captions. Spelling errors were corrected manually. As our training set, a random subset of images and another random subset of captions were selected for each subject. All these stimuli were presented only a single time. Information on the number of training stimuli for each subject is shown in Figure 1 Panel B. Additionally, a shared subset of 140 stimuli (70 images and 70 captions) was presented repeatedly to each subject in order to reduce noise, serving as our test set (on average: 26 times, min: 22, max: 31). Contrary to the training stimuli which were randomly selected from the COCO dataset, the 70 test stimuli were chosen by hand to avoid including multiple scenes that could match the same semantic description. The 70 chosen images as well as their 70 corresponding captions constituted the test set. These stimuli were inserted randomly between the training stimuli.

Note that for each stimulus presented to the subject (e.g. an image), we also have access to the corresponding stimulus in the other modality (the corresponding caption from the COCO dataset), allowing us to estimate model features based on both modalities (vision model features extracted from the image and language model features extracted from the corresponding caption) as well as multimodal features extracted from both the image and the caption.

In addition to these perceptual trials, there were 3 imagery trials for each subject (see also Figure 1 Panel C). Prior to the first fMRI scanning session, each subject was presented with a set of 20 captions (manually selected to be diverse and easy to visualize) that were not part of the perceptual trials, and they selected 3 captions for which they felt comfortable imagining a corresponding image. Then, they learned a mapping of each caption to a number (1, 2, and 3) so that they could be instructed to perform mental imagery of a specific stimulus, without having to present them with the caption again. The imagery trials occurred every second run, either at the beginning or the end of the run, so that each of the 3 imagery conditions were repeated on average 26 times (min: 23, max: 29). At the start of the imagery trial, the imagery instruction number was presented for 2s, then there was a 1s fixation period followed by the actual imagery period during which a light gray box was depicted for 10s on a dark gray background (the same background that was also used for trials with perceptual input). The light gray box was meant to represent the area in which the mental image should be “projected”. At the end of the experiment, the subjects drew sketches of the images they had been imagining during the imagery trials.

fMRI Preprocessing

Preprocessing of the fMRI data was performed using SPM12 (Ashburner et al., 2014) via nipype (Gorgolewski et al., 2011). We applied slice time correction and realignment for each subject. Each session was coregistered with an anatomical scan of the respective subject’s first session (down-sampled to 2mm³). We created and applied explicit gray matter masks for each subject based on their anatomical scans using a maximally lenient threshold (probability>0).

In order to obtain beta-values for each stimulus, for each subject we fit a GLM (using SPM12) on data from all sessions. We included regressors for train images, train captions, test images, test captions, imagery trials, fixations, blank screens, button presses, and one-back target trials. One-back target trials as well as trials in which the participant pressed the button were excluded in the calculation of all training and test stimulus betas. As output of these GLMs we obtained beta-values for each training and test caption and image as well as the imagery trials.

Finally, we transformed the volume-space data to surface space Freesurfer (Fischl, 2012). We used trilinear interpolation and the fsaverage template in the highest possible resolution (163,842 vertices on each hemisphere) as target.

Modality-Agnostic Decoders

The multimodal nature of our dataset allowed for the training of modality-agnostic decoders. We trained decoders by fitting ridge regression models that take fMRI beta-values as input and predict latent representations extracted from a pretrained deep learning model. Further details on decoder training can be found in Figure 2 as well as Appendix 1.

While modality-specific decoders are trained only on brain imaging data of a single modality, modality-agnostic decoders are trained on brain imaging data from multiple modalities and therefore allow for decoding of stimuli irrespective of their modality.

More specifically, in our case the modality-specific decoders are trained on fMRI beta-values from one stimulus modality, e.g. when subjects were watching images (cf. Figure 2 panel A). Conversely, modality-agnostic decoders are trained jointly using fMRI data from both stimulus modalities (images and captions; cf. Figure 2 panel B). For all decoders, the features that serve as regression targets can either be unimodal (e.g. extracted from images using a vision model) or multimodal (e.g. extracted from both stimulus modalities using a multimodal model, cf. Figure 2 panel C).

We considered features extracted from a range of vision, language, and multimodal models: For vision features, we considered ResNet (He et al., 2016), ViT (Dosovitskiy et al., 2020), and DI-NOv2 (Oquab et al., 2023); for language features BERT (Devlin et al., 2019), GPT2 (Radford et al., 2019), Llama2 (Touvron et al., 2023), mistral and mixtral (Jiang et al., 2023). Regarding multimodal features, we extracted features from VisualBERT (Li et al., 2019), BridgeTower (Xu et al., 2023), ViLT (Kim et al., 2021), CLIP (Radford et al., 2021), ImageBind (Girdhar et al., 2023), Flava (Singh et al., 2022), Blip2 (Li et al., 2023), SigLip (Zhai et al., 2023), and Paligemma2 (Steiner et al., 2024). In order to estimate the effect of model training, we further extracted features from a randomly initialized ImageBind model as a baseline. Further details on feature extraction can be found in Appendix 1. All decoders were evaluated on the held-out test data (140 stimuli, 70 captions and 70 images) using pairwise accuracy calculated using cosine distance. Prior to calculating the pairwise accuracy, the model predictions for all stimuli were standardized to have mean of 0 and standard deviation of 1. In the case of imagery decoding, the model predictions were standardized separately. In the case of cross-modal decoding (e.g. mapping an image stimulus into the latent space of a language model), a trial was counted as correct if the caption corresponding to the image (according to the ground-truth in COCO) was closest.

Figure 3 provides an overview on the evaluation metrics. A modality-specific decoder for images can be evaluated on its ability to decode images (Panel A, top) and in a cross-decoding setup for captions (Panel B, bottom). In the same way, we can compute the respective evaluation metrics for modality-specific decoders trained on captions. For the case of modality-agnostic decoders, we evaluate performance for decoding both images and captions using the same single decoder that is trained on both modalities (Panel C).

Results

Modality-Agnostic Decoders

We first compared the performance of modality-specific and modality-agnostic decoders that are trained on the whole brain fMRI data based on different unimodal and multimodal features. The average pairwise accuracy scores are presented in Figure 4. Figure 5 presents pairwise accuracy scores separately for decoding images and for decoding captions. Results for individual subjects can be found in Appendix 5.

Decoding accuracy for decoding captions (top) and for decoding images (bottom).
The bars indicate modality-agnostic decoding accuracy. The crosses (×) in the top row and the dots (·) in the bottom row indicate within-modality decoding scores. The dots (·) in the top row indicate cross-decoding scores for images, the crosses (×) in the bottom row indicate cross-decoding scores for captions (see also Figure 3).

When analyzing the average decoding accuracy (Figure 4), we find that modality-agnostic decoders perform better than modality-specific decoders, irrespective of the features that the decoders were trained on. This high performance (which can be attributed to the large training dataset used by modality-agnostic decoders) is achieved despite the additional challenge of not knowing the modality of the stimulus the subject was seeing.

Further, we observed that modality-agnostic decoders based on the best multimodal features (imagebind: 85.71% ± 2.58%) do not perform substantially better than decoders based on the best language features (GPT2-large: 85.31%±2.83%) and only slightly better than decoders trained on the best vision features (Dino-giant: 82.02%±2.43%). This result suggests that high-performing modality-agnostic decoders do not necessarily need to rely on multimodal features; features extracted from language models can lead to equally high performance. When comparing the different architecture types of models for multimodal feature extraction (dual stream vs. single stream with early fusion vs. single stream with late fusion; cf. Panel C in Figure 2), we only observed a slight performance disadvantage for single-stream models with early fusion (Mean accuracy values for dual stream models: 85.04%; for single stream models with early fusion: 81.01%; for single stream models with late fusion 83.63%). We performed a repeated measures ANOVA (grouping the data by subject), comparing the decoding accuracy values of modality-agnostic decoders based on different families of multimodal features. The only significant effect was: model_family_single_stream_early_fusion ∶ β = −0.04, SE = 0.011, p < 1e − 3.

When analyzing the performance specifically for decoding images (Figure 5, top), we find that modality-agnostic decoders perform as well as the modality-specific decoders trained on images (crosses in top row are at the same level as the bars in Figure 5; we find no statistically significant difference in their performances. We performed a repeated measures ANOVA (grouping the data by subject), comparing the image decoding accuracy values of modality-agnostic decoders with those of modality-specific decoders trained on images. The resulting p-value for the effect of the decoder_type was p = 0.73.) Further, modality-agnostic decoders even outperform modality-specific decoders trained on captions for decoding captions (dots in the bottom row are lower than bars in Figure 5). We performed a repeated measures ANOVA (grouping the data by subject), comparing the caption decoding accuracy values of modality-agnostic decoders with those of modality-specific decoders trained on captions. The result was: decoder_type∶ β = 0.036, SE = 0.006, p < 1 ⋅ 10⁻⁸. In other words, even if we know that a brain pattern was recorded in response to the subject reading a caption, we are more likely to decode it accurately if we choose to apply a decoder trained using both modalities, than if we apply the appropriate decoder, trained only on captions.

Furthermore, we found that the cross-modal decoding performance for decoding visual stimuli (images) using decoders trained on linguistic stimuli (captions) is higher than the cross-modal decoding performance in the other direction, corroborating similar results from Tang et al., 2023a on movies and audio books (dots in top row are higher than crosses in bottom row in Figure 5).

Qualitative Decoding Results

To obtain a better understanding of the decoding performance of the modality-agnostic decoders, we inspected the decoding results for 5 randomly selected test stimuli. We created a large candidate set of 41,118 stimuli by combining the test stimuli and the training stimuli from all subjects. For each stimulus, we ranked these candidate set stimuli based on their similarity to the predicted feature vector. As the test stimuli were shared among all subjects, we could average the prediction feature vectors across subjects to obtain the best decoding results.

Figure 6 presents the results for decoding images using a modality-agnostic decoder trained on ImageBind features. We display the target stimulus along with the top-5 ranked test stimuli. We can observe some clear success cases (the train in the first row) but also failure cases (the teddy bear decoded as pizza). For the other stimuli, some aspects such as the high-level semantic class (e.g. vehicle, animal) are correctly decoded: For the cars on the highway (last row), the top-ranked images depict trains, which are also vehicles. For the dog (2nd row), the top images contain cats of similar colors. Regarding the giraffe (3rd row), the model appears to have picked up on the fact that there was a body of water depicted in the image.

Decoding examples for image decoding using a modality-agnostic decoder.
The first column shows the image the subject was seeing and the 5 following columns show the candidate stimuli with highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Note that these qualitative results are not directly comparable with previous work on retrieval or reconstruction using the NSD dataset (Allen et al., 2022; Lin et al., 2022; Takagi and Nishimoto, 2023; Ozcelik and VanRullen, 2023), as our data was collected on a 3T MRI scanner with lower signal-to-noise-ratio than NSD’s 7T MRI scanner.

The ranking results for decoding captions are depicted in Figure 7. The results are somewhat similar to the image decoding results, stimuli that were decoded successfully when presented as image such as the train are also decoded successfully when presented as caption; cases that were failures in the case of image decoding (e.g. the teddy bear) also fail here. However, the top-ranked stimuli for the dog (2nd row) do not always contain animals (the decoder seems to have picked up on the presence of a vehicle in the caption), but the cars on the highway (last row) get decoded rather successfully.

Decoding examples for caption decoding using a modality-agnostic decoder.
For details see caption of Figure 6. All images were taken from the CoCo dataset (Lin et al., 2014).

We additionally provide qualitative results for modality-specific decoders in Appendix 2. These results generally reflect the observations from the quantitative results: Modality-agnostic decoders perform similarly to modality-specific decoders evaluated in a within-modality decoding setup, but substantially better than modality-specific decoders when evaluated in cross-decoding setups.

Modality-Agnostic Regions

To provide insight into the spatial organization of modality-agnostic representations in the brain, we performed a surface-based searchlight analysis.

Modality-agnostic regions should contain patterns that generalize between stimulus modalities. Therefore, such regions should allow for decoding of stimuli in both modalities using a decoder that is trained to pick up on modality-agnostic features, i.e. the decoding performance for images and captions of a modality-agnostic decoder should both be above chance. However, as a modality-agnostic decoder is trained on stimuli from both modalities, it could have learned to leverage certain features to project stimuli from one modality and different features to project stimuli from the other modality. We added two conditions to control that the representations directly transfer between the modalities by additionally training two modality-specific decoders and evaluating them according to their cross-decoding performance, i.e. we require that their decoding performance in the modality they were not trained on is above chance. These four conditions are summarized at the top of Figure 8.

Searchlight method to identify modality-agnostic ROIs.
The top plots show performance (pairwise accuracy averaged over subjects) of modality-agnostic decoders for decoding images (top left) and decoding captions (top right). In the second row, we display cross-decoding performances: On the left, modality-specific decoders trained on captions are evaluated on images. On the right, modality-specific decoders trained on images are evaluated on captions. We identified modality-agnostic ROIs as clusters in which all 4 decoding accuracies are above chance by taking the minimum of the respective t-values at each location, then performed TFCE to calculate cluster values. The plot only shows left medial views of the brain to illustrate the method, different views of all resulting clusters are shown in Figure 9.

We used ImageBind features for these searchlight analyses as they led to the highest decoding performance when using the whole brain data. The decoders were trained based on the surface projection of the fMRI beta-values. For each vertex, we defined a searchlight with a fixed size by selecting the 750 closest vertices, corresponding to an average radius of ∼ 9.4mm. Details on how this size was selected are outlined in Appendix 4.

We trained and evaluated a modality-agnostic decoder and modality-specific decoders for both modalities on the beta-values for each searchlight location and each subject, providing us with a decoding accuracy scores for each location on the cortex. Then we performed t-tests to identify locations in which the decoding performance is above chance (acc > 0.5). We aggregated all 4 comparisons by taking the minimum of the 4 t-values at each spatial location. Finally, we performed a threshold-free cluster analysis (TFCE, Smith and Nichols, 2009) to identify modality-agnostic ROIs (Figure 8, bottom). We used the default hyperparameters of ℎ = 2 and e = 1 for surface-based TFCE (Jenkinson et al., 2012).

To estimate the statistical significance of the resulting clusters we performed a permutation test. For each subject, we evaluated the decoders 100 times with shuffed labels to create a surrogate distribution. Then, we sampled 10,000 permutations of the 6 subjects’ surrogate distributions and calculated group-level statistics (TFCE values) for each of them. Based on this null distribution, we calculated p-values for each cluster. To control for multiple comparisons across space, we took the maximum TFCE score across vertices for each permutation (Smith and Nichols, 2009).

The results of the surface-based searchlight analysis are presented in Figure 9. The analysis revealed that modality-agnostic patterns are actually widespread across the brain, especially on the left hemisphere. Peak cluster values were found in the left supramarginal gyrus, inferior parietal gyrus and posterior superior temporal sulcus. Regions belonging to the precuneus, isthmus of the cingulate gyrus, parahippocampus, middle temporal gyrus, inferior temporal gyrus and fusiform gyrus also showed high cluster values.

Searchlight results for modality-agnostic regions.
Maps thresholded at TFCE value of 1508, which is the significance threshold value for which p < 10⁻⁴ based on the permutation testing. Regions with highest cluster values are outlined and annotated based on the Desikan-Killiany atlas (Desikan et al., 2006).

Imagery Decoding

Finally, we evaluated the ability of decoders trained on the fMRI data with perceptual input to decode stimuli during the imagery conditions.

For a modality-agnostic decoder trained on the whole-brain data, the imagery pairwise decoding accuracy reaches 84.48% (averaged across subjects and model features) when using the 3 imagery stimuli as the candidate set. Note that we used the ground-truth caption and corresponding image from COCO in this candidate set, and not the sketches drawn by the subjects. When the whole test set is added to the candidate set (in total: 73 stimuli), the average pairwise accuracy drops to 72.47%. This substantial drop in performance is most likely explained by the fact that the predicted features for the imagery trials were standardized using only 3 stimuli, and this transformation emphasized differences that enabled distinguishing the 3 imagery trials but do not generalize to the larger test set. We also attempted decoding without standardization of the predicted feature vectors, but this led to much lower performance.

As expected, we also found that modality-agnostic decoders are better suited for imagery decoding than modality-specific decoders. We compared the imagery decoding accuracy of both decoder types taking into account the results for all features and all subjects. To this end, we performed two repeated measures ANOVAs (grouping the data by subject), once comparing the accuracy values of modality-agnostic decoders with those of modality-specific decoders trained on images, and once comparing modality-agnostic decoders to modality-specific decoders trained on captions. The average decoding accuracies were 69.42% for a modality-specific decoder trained on images and 70.02% for a modality-specific decoder trained on captions (vs. 72.47% for a modality-agnostic decoder, as mentioned above). In both comparisons, the accuracy values for the two decoder types were significantly different (when comparing modality-agnostic decoders to modality-specific decoders trained on images: decoder_type ∶ β = 0.03, SE = 0.011, p < 0.01; and when comparing to modality-specific decoders trained on captions: decoder_type ∶ β = 0.024, SE = 0.012, p < 0.04).

Appendix 6 presents qualitative decoding results for the imagery trials for each subject as well as the sketches of the mental images drawn at the end of the experiment. As expected, the results are worse than those for perceived stimuli, but for several subjects it was possible to decode some major semantic concepts.

We further computed the imagery decoding accuracy during the searchlight analysis. Figure 10 shows the result clusters for decoding imagery (using the whole test set + the 3 imagery trials as potential candidates).

Searchlight results for imagery decoding.
Maps thresholded at TFCE values that surpass the significance threshold of p < 10⁻⁴ based on a permutation test. Maps thresholded at TFCE value of 3897, which is the significance threshold value for which p < 10⁻⁴ based on the permutation testing We used the pairwise accuracy for imagery decoding using the large candidate set of 73 stimuli. We outlined the same regions as in Figure 9 to facilitate comparison.

We observe that many regions that were found to contain modality-agnostic patterns (cf. Figure 9) are also regions in which decoding of mental imagery is possible.

One main difference is that the imagery decoding clusters appear to be less left-lateralized than the modality-agnostic region clusters (peak cluster values can be found both on the right inferior parietal cortex and bilaterally in the precuneus). To estimate overlap of the regions allowing for imagery decoding and modality-agnostic regions we calculated the correlation between the TFCE values that were used for identifying modality-agnostic regions (Figure 9) and the TFCE values for imagery decoding (Figure 10). The Pearson correlation score for the left hemisphere is 0.41 (p < 1e − 8), and for the right hemisphere 0.62 (p < 1e − 8). Importantly, these correlation scores are substantially higher when compared to the correlation with decoding accuracy of modality-specific decoders: The correlation between the TFCE values for imagery decoding and TFCE values for image decoding of a modality-specific decoder trained on images is 0.28 on the left hemisphere and 0.40 on the right hemisphere. When using TFCE values based on the caption decoding accuracy of a modality-specific decoder trained on captions we obtain 0.19 on the left hemisphere and 0.45 on the right hemisphere.

Discussion

In this work, we introduced a new large-scale multimodal fMRI dataset that enables the development of models for modality-agnostic decoding of visual and linguistic stimuli using a single model. These modality-agnostic decoders were specifically trained to pick up on modality-agnostic patterns, enabling a performance increase over modality-specific decoders when decoding linguistic stimuli in the form of captions.

According to a range of theories, modality-agnostic representations are tightly linked to (lexical-) semantic representations (Simmons and Barsalou, 2003; Binder et al., 2009; Meschke and Gallant, 2024). Most importantly, a range of studies that aimed to identify brain regions linked to semantic/conceptual representations by asking subjects to perform tasks that require semantic processing of words found evidence for such regions that overlap to a high degree with the regions identified in our study (Fernandino et al., 2016; Martin et al., 2018; Carota et al., 2021; Fernandino et al., 2022; Tong et al., 2022). A strong link between these systems could also explain our result that modality-agnostic decoders based on unimodal representations from language models are performing as well as decoders based on multimodal representations (cf. Figure 4), as well as the partial left-lateralization of the identified modality-agnostic regions.

In a second analysis, we additionally leveraged our dataset to localize modality-agnostic regions in the brain by searching for areas in which decoding of both stimulus modalities is possible using modality-agnostic decoders as well as in a cross-decoding setup. This approach lead to the identification of a large network involving temporal, parietal, and frontal regions, and peak cluster values on the left hemisphere (cf. Figure 9). All areas with high cluster values confirm findings from previous studies: The left precuneus (Shinkareva et al., 2011; Popham et al., 2021; Handjaras et al., 2016), posterior cingulate/ retrosplenial cortex (Fairhall and Caramazza, 2013; Handjaras et al., 2016), supramarginal gyrus (Shinkareva et al., 2011), inferior parietal cortex (Man et al., 2012; Vandenberghe et al., 1996; Shinkareva et al., 2011; Devereux et al., 2013; Popham et al., 2021; Simanova et al., 2014; Handjaras et al., 2016), superior temporal sulcus (Man et al., 2012), middle temporal gyrus (Vandenberghe et al., 1996; Shinkareva et al., 2011; Fairhall and Caramazza, 2013; Devereux et al., 2013; Handjaras et al., 2016), inferior temporal gyrus (Vandenberghe et al., 1996; Shinkareva et al., 2011; Fairhall and Caramazza, 2013; Simanova et al., 2014; Handjaras et al., 2016), fusiform gyrus (Vandenberghe et al., 1996; Moore and Price, 1999; Bright et al., 2004; Shinkareva et al., 2011; Fairhall and Caramazza, 2013; Simanova et al., 2014), and parahippocampus (Vandenberghe et al., 1996). However, previous studies have led to contradicting results regarding the locality of modality-agnostic regions (they were identifying varying subsets of these regions; see also Appendix 3), probably due to the limited number and artificial nature of stimuli employed. Our method identified almost all of the previously proposed regions as regions with modality-agnostic patterns, highlighting the advantage of this large multimodal dataset in which subjects are viewing photographs of complex natural scenes and reading full English sentences. The left superior occipital gyrus was not identified in our study, but in previous studies by Vandenberghe et al., 1996; Shinkareva et al., 2011. However, we found that a major part of the left superior occipital sulcus represents modality-agnostic information. Further, Jung et al., 2018 found modality-agnostic patterns in the right superior frontal gyrus. One major difference between their study and ours is that they used auditory input as a second modality instead of text. Further work is required to investigate to what extent the modality-agnostic regions identified in our work generalize to all modalities.

The fact that the presence of modality-agnostic patterns is positively correlated with the imagery decoding performance in different locations provides further evidence that the identified patterns are truly modality-agnostic. We further found that decoders trained exclusively on data for which participants were exposed to perceptual input do generalize to imagery trials, confirming previous findings that were based on more limited stimulus sets (Stokes et al., 2009; Reddy et al., 2010; Lee et al., 2012; Johnson and Johnson, 2014; Naselaris et al., 2015). Regarding the representations involved in mental imagery, we found that modality-agnostic decoders outperform modality-specific decoders in terms of imagery decoding. This finding can be seen as support for the involvement of modality-agnostic representations in mental imagery, as modality-agnostic decoders were trained explicitly to pick up on such patterns.

The findings of our searchlight analysis for imagery decoding suggest that mental imagery indeed involves a large network of regions across both hemispheres of the cerebral cortex. This includes high-level visual areas, parietal areas such as the precuneus and inferior parietal cortex and several frontal regions, but also parts of the early visual cortex. Results are highly similar on both hemispheres, highlighting the involvement of large-scale bilateral brain networks during mental imagery of complex scenes.

While there are lesion studies on hemispheric asymmetries that suggest that regions in the left hemisphere are crucial for mental imagery (Farah, 1984; Bartolomeo, 2002), a more recent review that additionally considers evidence from neuroimaging and direct cortical stimulation studies suggests that frontoparietal networks in both hemispheres are involved in mental imagery, and that lateralization patterns can be found in the temporal lobes (Liu et al., 2022). Such lateralization was found to depend on the nature of the imagined items, the imagination of objects and words involving the left inferior temporal cortex while the imagination of faces and people was found to be more right-lateralized and the imagination of complex scenes (as in our study) leads to significant activity in both hemispheres (O’Craven and Kanwisher, 2000; Steel et al., 2021; Spagna et al., 2021). Crucially, in more recent decoding studies, results were either observed bilaterally, or the analyses did not target hemispheric asymmetries (Reddy et al., 2010; Lee et al., 2012). But see Stokes et al., 2009 in which perception-to-imagery generalization of single letters was left-lateralized. Our work shows for the first time results of a searchlight analysis of imagery decoding of complex visual scenes. Above-chance decoding is possible on both hemispheres, with the highest decoding accuracies in the precuneus and the right inferior parietal cortex and the superior temporal sulcus. Future investigations with larger sets of imagined scenes could address the question whether lateralization patterns depend on the nature of the imagined objects.

It remains an open question whether the activation patterns in the modality-agnostic regions identified in our study relate to abstract concepts or to lower-level features that are shared between the two modalities. Binder, 2016 puts this dichotomy into question, considering that “there is no absolute demarcation between embodied/perceptual and abstract/conceptual representation in the brain.” (p. 1098). The author argues for a hierarchical system in which representational patterns become increasingly abstract, creating a continuum between actual experiential information up to higher-level conceptual information (see also Andrews et al., 2014).

According to the results of our searchlight analysis, the anterior temporal lobes are not among the regions with the highest probability of being modality-agnostic, contradicting the hypothesis of the hub-and-spoke theory that these areas are the major semantic hub in the brain. However, MRI signals from these regions have a lower signal-to-noise ratio with standard fMRI pulse sequences (Devlin et al., 2000; Embleton et al., 2010). A more targeted study with an adapted fMRI protocol would be required to shed light on the nature of patterns in these regions. More generally, the hub-and-spoke theory also puts emphasis on the role of spokes for the formation of conceptual representations (Pobric et al., 2010; Ralph et al., 2017). Future work could be aimed at testing the hub-and-spoke theory proposal that features in hierarchically lower level representation spaces of the spokes are not directly relatable to features in the representation space of the hubs: Object representations in the spokes are based on interpretable features (e.g. shape, color, affordances of an object) and get translated into another representational format that is representing conceptual similarities (but its dimensions do not directly map to interpretable features) in the semantic hub (Frisby et al., 2023). To test this hypothesis, modality-agnostic representations in the anterior temporal lobes (measured with targeted fMRI pulse sequences) could be compared to representations in candidate regions for modality-specific spokes using RSA.

The modality-agnostic regions we found in the searchlight analysis can also be seen as candidates for convergence zones, in which increasingly abstract representations are formed (Damasio, 1989; Tranel et al., 1997; Meyer and Damasio, 2009). To obtain further insight into the hierarchical organization of these zones, future work could take advantage of the improved temporal resolution of other brain imaging techniques such as MEG to explore in which areas modality-agnostic patterns are formed first, and how they are being transformed when spreading to higher-level areas of the brain (Dirani and Pylkkänen, 2024; Benchetrit et al., 2024).

In line with the GRAPES framework, (Martin, 2009, 2016), we found that modality-agnostic representations are distributed across temporal and parietal areas. To test the related hypothesis that conceptual information is organized in domains, we plan to use RSA to understand which kind of semantic information is represented in the different modality-agnostic regions identified.

Finally, our results can be interpreted with respect to the Global Workspace Theory. All modality-agnostic regions are good candidate regions for a global workspace. They could, however, also be part of modality-specific modules that get activated in a modality-agnostic fashion through a “broadcast” operation as a stimulus is perceived consciously (Baars, 1993, 2005). To distinguish these two cases, an experimental manipulation of attention could be used: according to Global Workspace Theory, attention is required for information to enter the workspace, but not for the workspace signals to reach other brain regions via broadcast. In the future, we plan to investigate how modality-agnostic patterns are modulated by attention, by analyzing additional test sessions from the same subjects in which they were instructed in specific runs to pay attention to only one of the modalities. These sessions will be released as part of another future dataset and publication.

To conclude, the results from our searchlight analysis so far are in line with all major theories on modality-agnostic representations that were considered. As this dataset will be shared publicly, more targeted investigations can be performed by the research community in order to adjudicate between different theories.

Acknowledgements

This research was funded by grants from the French Agence Nationale de la Recherche (ANR: AI-REPS grant number ANR-18-CE37-0007-01 and ANITI grant number ANR-19-PI3A-0004) as well as the European Union (ERC Advanced grant GLoW, 101096017). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

We thank the Inserm/UPS UMR1214 Technical Platform for their help in setting up and for the acquisitions of the MRI sequences.

Appendix 1 Feature extraction details

For each target stimulus (image or caption), our database also contained an equivalent stimulus in the other modality (caption or image). In this way, we could extract model features from the corresponding image for vision models, the corresponding caption for language models, and an multimodal representation of both image and caption for the multimodal models. We used publicly available pretrained models implemented in the HuggingFace Transformers library (Wolf et al., 2020) or from their respective authors’ repositories.

Model versions for unimodal models are as indicated in Figure 4. For multimodal models, the exact version for CLIP was clip-vit-large-patch14, for ViLT vilt-b32-mlm, for Visual-BERT visualbert-nlvr2-coco-pre, for Imagebind imagebind_huge, for Bridgetower bridgetower-large-itm-mlm-itc, for Flava flava-full, for SigLip siglip-so400m-patch14-384, and for Paligemma2 paligemma2-3b-pt-224.

We extracted language features from all models by averaging the outputs for each token, as this was established as common practice for the extraction of sentence embeddings from Transformer-based language models (e.g. Krasnowska-Kieraś and Wróblewska, 2019; Reimers and Gurevych, 2019).

For Transformer-based vision models, we compared representations extracted by averaging the outputs for each patch with representations extracted from [CLS] tokens in Table 1. We found that for almost all models, the mean features allow for higher decoding accuracies. For all experiments reported in the main paper we therefore only considered this method.

Feature comparison for vision models.
Pairwise accuracy for modality-agnostic decoders based on vision features extracted by averaging the last hidden states of all patches (“vision_features_mean”) compared to when using features extracted from [CLS] tokens (“vision_features_cls”). The method leading to the best decoding performance for each model is highlighted in bold.

For multimodal models, we also compared a range of techniques for feature extraction. The results are shown in Table 2.

Feature comparison for multimodal models.
The features are either based on the [CLS] tokens from the fused representations (“fused_cls”), averaging over the fused tokens (“fused_mean”), or averaging over tokens from intermediate vision and language stream outputs (“avg”). For the last case, for some models the vision and language features can also be either based on [CLS] or based on averaging over all tokens/patches. The method leading to the best decoding performance for each model is highlighted in bold.

For dual-stream multimodal models, we averaged the vision and language features to create the final multimodal feature representation. For single-stream multimodal features we compared using representations extracted by averaging the outputs for each token with representations extracted from [CLS] token and found that the averaged output leads to better performance in almost all cases (see Table 2). Further, some single-stream models (Flava, Paligemma2 and BLIP2) allow for feature extraction based on intermediate vision and language representations in addition to a direct extraction of multimodal features (based on fused representations from the multimodal stream). We found that averaging features from these intermediate vision and language representations leads to better performance (see Table 2). For all results reported in the main paper we used the feature extraction method leading to best performance for each model.

Decoder training details

The decoders were linear ridge-regression models as implemented in the scikit-learn library (Pedregosa et al., 2011). All training data was standardized to have mean of 0 and standard deviation of 1. The test data was standardized using the mean and standard deviation of the training data. The regularization hyperparameter α was optimized using 5-fold cross validation on the training set (values considered: α ∊ {1e3, 1e4, 1e5, 1e6, 1e7}). Afterwards, a final model was trained using the best α on the whole training set.

Appendix 2 Qualitative Decoding Results for Modality-Specific Decoders

In this section we present qualitative decoding results for modality-specific decoders for the same examples as presented in Section Qualitative Decoding Results for modality-agnostic decoders. The results show that modality-agnostic decoders are as good as modality-specific decoders when evaluated in a within-modality decoding setup (Figures 1 and 3) but substantially better than modality-specific decoders when evaluated in a cross-decoding setup (Figures 2 and 4).

Decoding examples for image decoding using a modality-specific decoder trained on images (within-modality decoding).
The first column shows the image the subject was seeing and the 5 following columns show the candidate stimuli with highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Decoding examples for caption decoding using a modality-specific decoder trained on images (cross-modality decoding).
For details see caption of Figure 1 All images were taken from the CoCo dataset (Lin et al., 2014).

Decoding examples for caption decoding using a modality-specific decoder trained on captions (within-modality decoding).
For details see caption of Figure 1 All images were taken from the CoCo dataset (Lin et al., 2014).

Decoding examples for image decoding using a modality-specific decoder trained on captions (cross-modality decoding).
For details see caption of Figure 1 All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 3 Candidates for Modality-agnostic regions

Table 1 shows candidates for modality-agnostic regions that were identified by previous work. We only considered fMRI experiments involving multiple stimulus modalities. Many other studies relied on unimodal stimuli and semantic tasks to identify modality-agnostic/ conceptual regions, these are however less directly comparable to our setup.

Candidates for modality-agnostic regions as identified by previous work.
All there regions were also found in out analysis, except for the 2 regions marked with an asterisk (*).

Appendix 4 Searchlight size

In order to optimize the size of the searchlight for this analysis, we first ran a searchlight analysis with a fixed radius of 10mm using a modality-agnostic decoder on the data from the first subject (sub-01). Due to the shape of the cortex this leads of searchlights that contain varying numbers of vertices (on average: 897.4; max: 1580; min: 399). By observing the decoding scores as function of the number of vertices we find that performance peaks at 750 vertices (cf. Table 1). The final searchlight analyses was therefore performed with a searchlight of a fixed number of 750 vertices. The average radius with this number of vertices was 9.41mm (max: 13.65mm).

Average decoding scores (for images and captions) by number of vertices.
Scores calculated based on the results of a searchlight analysis with a radius of 10mm. The accuracy values were grouped into bins based on the number of vertices that the searchlight was based on.

Appendix 5 Per-subject results

Results for individual subjects can be found in Figure 1. Among all subjects, we found similar converging results for decoding accuracies when comparing models, feature modalities, and modality-agnostic with modality-specific decoders.

Pairwise accuracy per subject.
For details refer to Figure 5

Appendix 6 Qualitative Imagery Decoding Results

The following figures 1-6 present qualitative decoding results for the imagery conditions using a modality-agnostic decoder. We present the results separately for each subject as each subject chose an individual set of 3 stimuli to perform mental imagery on. In each plot, the leftmost column shows the caption that was used as initial instruction as well as the subject’s sketch that they drew at the end of the experiment.

We used the same large candidate set of 41K stimuli as for the qualitative decoding results for the other conditions (see also Section Qualitative Decoding Results).

Overall, we found that the decoding quality for imagery stimuli lags behind that for trials with perceived stimuli. This was expected and confirms the quantitative results reported in Section Imagery Decoding. Still, in several cases some of the concepts are decoded correctly (e.g. a women in the first row of Figure 1; winter sports in the second row of Figure 3; laptops/screens and multiple people in the third row of Figure 3).

Imagery decoding for Subject 1 using a modality-agnostic decoder.
The first column shows the caption that was used to stimulate imagery and on top of it the sketch of their mental image that the subjects draw at the end of the experiment. The 5 following columns show the candidate stimuli with highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 2 using a modality-agnostic decoder.
For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 3 using a modality-agnostic decoder.
For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 4 using a modality-agnostic decoder.
For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 5 using a modality-agnostic decoder.
For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 6 using a modality-agnostic decoder. For further details refer to the caption of Figure 1.
All images were taken from the CoCo dataset (Lin et al., 2014).

Additional information

Funding

Agence Nationale de la Recherche (ANR-18-CE37-0007-01)

Agence Nationale de la Recherche (ANR-19-PI3A-0004)

European Research Council

https://doi.org/10.3030/101096017

References

1.
1. Albers A
2. Kok P
3. Toni I
4. Dijkerman HC
5. de Lange F
2013Shared Representations for Working Memory and Mental Imagery in Early Visual CortexCurrent Biology 23:1427–1431https://doi.org/10.1016/j.cub.2013.05.065 Google Scholar
2.
1. Aliko S
2. Huang J
3. Gheorghiu F
4. Meliss S
5. Skipper JI
2020A naturalistic neuroimaging database for understanding the brain using ecological stimuliScientific Data 7:347https://doi.org/10.1038/s41597-020-00680-2 Google Scholar
3.
1. Allen EJ
2. St-Yves G
3. Wu Y
4. Breedlove JL
5. Prince JS
6. Dowdle LT
7. Nau M
8. Caron B
9. Pestilli F
10. Charest I
11. Hutchinson JB
12. Naselaris T
13. Kay K
2022A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligenceNature Neuroscience 25:116–126https://doi.org/10.1038/s41593-021-00962-x Google Scholar
4.
1. Andrews M
2. Frank S
3. Vigliocco G
2014Reconciling Embodied and Distributional Accounts of Meaning in LanguageTopics in Cognitive Science 6:359–370https://doi.org/10.1111/tops.12096 Google Scholar
5.
1. Ashburner J
2. Barnes G
3. Chen CC
4. Daunizeau J
5. Flandin G
6. Friston K
7. Kiebel S
8. Kilner J
9. Litvak V
10. Moran R
2014SPM12. Wellcome Trust Centre for NeuroimagingGoogle Scholar
6.
1. Baars BJ
1993A cognitive theory of consciousnessCambridge University Press Google Scholar
7.
1. Baars BJ
2005Global workspace theory of consciousness: toward a cognitive neuroscience of human experienceProgress in Brain Research 150:45–53https://doi.org/10.1016/S0079-6123(05)50004-9 Google Scholar
8.
1. Barsalou LW
2016On Staying Grounded and Avoiding Quixotic Dead EndsPsychonomic Bulletin & Review 23:1122–1142Google Scholar
9.
1. Bartolomeo P
2002The Relationship Between Visual Perception and Visual Mental Imagery: A Reappraisal of the Neuropsychological EvidenceCortex 38:357–378https://doi.org/10.1016/S0010-9452(08)70665-8 Google Scholar
10.
1. Beliy R
2. Gaziv G
3. Hoogi A
4. Strappini F
5. Golan T
6. Irani M
2019From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRIIn: NeurIPS Google Scholar
11.
1. Benchetrit Y
2. Banville H
3. King JR
2024Brain decoding: toward real-time reconstruction of visual perceptionIn: The Twelfth International Conference on Learning Representations Google Scholar
12.
1. Binder JR
2016In defense of abstract conceptual representationsPsychonomic Bulletin & Review 23:1096–1108https://doi.org/10.3758/s13423-015-0909-1 Google Scholar
13.
1. Binder JR
2. Desai RH
3. Graves WW
4. Conant LL
2009Where Is the Semantic System? A Critical Review and Meta-Analysis of 120 Functional Neuroimaging StudiesCerebral Cortex 19:2767–2796Google Scholar
14.
1. Boccia M
2. Sulpizio V
3. Teghil A
4. Palermo L
5. Piccardi L
6. Galati G
7. Guariglia C
2019The dynamic contribution of the high-level visual cortex to imagery and perceptionHuman Brain Mapping 40:2449–2463https://doi.org/10.1002/hbm.24535 Google Scholar
15.
1. Boyle JA
2. Pinsard B
3. Boukhdhir A
4. Belleville S
5. Brambatti S
6. Chen J
7. Cohen-Adad J
8. Cyr A
9. Fuente Rainville P
10. Bellec P
2020The Courtois project on neuronal modelling-first data releaseIn: 26th annual meeting of the organization for human brain mapping Google Scholar
16.
1. Brennan JR
2022Language and the brain: a slim guide to neurolinguisticsOxford University Press Google Scholar
17.
1. Brennan JR
2. Hale JT
2019Hierarchical structure guides rapid linguistic predictions during naturalistic listeningPLOS One 14:e0207741https://doi.org/10.1371/journal.pone.0207741 Google Scholar
18.
1. Bright P
2. Moss H
3. Tyler LK
2004Unitary vs multiple semantics: PET studies of word and picture processingBrain and Language 89:417–432Google Scholar
19.
1. Carota F
2. Nili H
3. Pulvermüller F
4. Kriegeskorte N
2021Distinct fronto-temporal substrates of distributional and taxonomic similarity among words: evidence from RSA of BOLD signalsNeuroImage 224:117408Google Scholar
20.
1. Chang N
2. Pyles JA
3. Marcus A
4. Gupta A
5. Tarr MJ
6. Aminoff EM.
2019BOLD5000, a public fMRI dataset while viewing 5000 visual imagesScientific Data 6:49https://doi.org/10.1038/s41597-019-0052-3 Google Scholar
21.
1. Cox DD
2. Savoy RL
2003Functional magnetic resonance imaging (fMRI) “brain reading”: detecting and classifying distributed patterns of fMRI activity in human visual cortexNeuroImage 19:261–270https://doi.org/10.1016/S1053-8119(03)00049-1 Google Scholar
22.
1. Damasio AR
1989The Brain Binds Entities and Events by Multiregional Activation from Convergence ZonesNeural Computation 1:123–132Google Scholar
23.
1. Damasio H
2. Tranel D
3. Grabowski T
4. Adolphs R
5. Damasio A
2004Neural systems behind word and concept retrievalCognition 92:179–229https://doi.org/10.1016/j.cognition.2002.07.001 Google Scholar
24.
1. Desikan RS
2. Ségonne F
3. Fischl B
4. Quinn BT
5. Dickerson BC
6. Blacker D
7. Buckner RL
8. Dale AM
9. Maguire RP
10. Hyman BT
11. Albert MS
12. Killiany RJ
2006An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interestNeuroImage 31:968–980Google Scholar
25.
1. Devereux BJ
2. Clarke A
3. Marouchos A
4. Tyler LK
2013Representational Similarity Analysis Reveals Commonalities and Differences in the Semantic Processing of Words and ObjectsThe Journal of Neuroscience 33Google Scholar
26.
1. Devlin J
2. Chang MW
3. Lee K
4. Toutanova K
2019BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingIn: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) Minneapolis, Minnesota: Association for Computational Linguistics pp. 4171–4186https://doi.org/10.18653/v1/N19-1423 Google Scholar
27.
1. Devlin JT
2. Russell RP
3. Davis MH
4. Price CJ
5. Wilson J
6. Moss HE
7. Matthews PM
8. Tyler LK
2000Susceptibility-Induced Loss of Signal: Comparing PET and fMRI on a Semantic TaskNeuroImage 11:589–600https://doi.org/10.1006/nimg.2000.0595 Google Scholar
28.
1. Dijkstra N
2. Bosch SE
3. Van Gerven MAJ
2019Shared Neural Mechanisms of Visual Perception and ImageryTrends in Cognitive Sciences 23:423–434https://doi.org/10.1016/j.tics.2019.02.004 Google Scholar
29.
1. Dirani J
2. Pylkkänen L
2024MEG Evidence That Modality-Independent Conceptual Representations Contain Semantic and Visual FeaturesJournal of Neuroscience 44https://doi.org/10.1523/JNEUROSCI.0326-24.2024 Google Scholar
30.
1. Dosovitskiy A
2. Beyer L
3. Kolesnikov A
4. Weissenborn D
5. Zhai X
6. Unterthiner T
7. Dehghani M
8. Minderer M
9. Heigold G
10. Gelly S
11. Uszkoreit J
12. Houlsby N
2020An Image is Worth 16×16 Words: Transformers for Image Recognition at ScaleIn: International Conference on Learning Representations Google Scholar
31.
1. Embleton KV
2. Haroon HA
3. Morris DM
4. Ralph MAL
5. Parker GJM
2010Distortion correction for diffusion-weighted MRI tractography and fMRI in the temporal lobesHuman Brain Mapping 31:1570–1587https://doi.org/10.1002/hbm.20959 Google Scholar
32.
1. Fairhall SL
2. Caramazza A
2013Brain Regions That Represent Amodal Conceptual KnowledgeJournal of Neuroscience 33:10552–10558Google Scholar
33.
1. Farah MJ
1984The neurological basis of mental imagery: A componential analysisCognition 18:245–272https://doi.org/10.1016/0010-0277(84)90026-X Google Scholar
34.
1. Fedorenko E
2. Behr MK
3. Kanwisher N
2011Functional specificity for high-level linguistic processing in the human brainProceedings of the National Academy of Sciences 108:16428–16433https://doi.org/10.1073/pnas.1112937108 Google Scholar
35.
1. Fedorenko E
2. Hsieh PJ
3. Nieto-Castañón A
4. Whitfield-Gabrieli S
5. Kanwisher N
2010New Method for fMRI Investigations of Language: Defining ROIs Functionally in Individual SubjectsJournal of Neurophysiology 104:1177–1194https://doi.org/10.1152/jn.00032.2010 Google Scholar
36.
1. Felleman DJ
2. Van Essen DC
1991Distributed hierarchical processing in the primate cerebral cortexCerebral cortex 1:1–47https://doi.org/10.1093/cercor/1.1.1-a Google Scholar
37.
1. Fernandino L
2. Binder JR
3. Desai RH
4. Pendl SL
5. Humphries CJ
6. Gross WL
7. Conant LL
8. Seidenberg MS
2016Concept Representation Reflects Multimodal Abstraction: A Framework for Embodied SemanticsCerebral Cortex 26:2018–2034Google Scholar
38.
1. Fernandino L
2. Tong JQ
3. Conant LL
4. Humphries CJ
5. Binder JR
2022Decoding the information structure underlying the neural representation of conceptsProceedings of the National Academy of Sciences 119Google Scholar
39.
1. Fischl B.
2012FreeSurferNeuroImage 62:774–781https://doi.org/10.1016/j.neuroimage.2012.01.021 Google Scholar
40.
1. Friederici AD
2017Language in Our Brain: The Origins of a Uniquely Human CapacityThe MIT Press https://doi.org/10.7551/mitpress/11173.001.0001 Google Scholar
41.
1. Frisby SL
2. Halai AD
3. Cox CR
4. Lambon Ralph MA
5. Rogers TT
2023Decoding semantic representations in mind and brainTrends in Cognitive Sciences 27:258–281https://doi.org/10.1016/j.tics.2022.12.006 Google Scholar
42.
1. Gainotti G
2000What the locus of brain lesion tells us about the nature of the cognitive defect underlying category-specific disorders: a reviewCortex; a Journal Devoted to the Study of the Nervous System and Behavior 36:539–559https://doi.org/10.1016/s0010-9452(08)70537-9 Google Scholar
43.
1. Girdhar R
2. El-Nouby A
3. Liu Z
4. Singh M
5. Alwala KV
6. Joulin A
7. Misra I
2023ImageBind: One Embedding Space To Bind Them AllIn: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 15180–15190Google Scholar
44.
1. Gorgolewski K
2. Burns CD
3. Madison C
4. Clark D
5. Halchenko YO
6. Waskom ML
7. Ghosh SS
2011Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in pythonFrontiers in neuroinformatics 5:12318Google Scholar
45.
1. Grill-Spector K
2. Malach R
2004The Human Visual CortexAnnual Review of Neuroscience 27:649–677https://doi.org/10.1146/annurev.neuro.27.070203.144220 Google Scholar
46.
1. Hajhajate D
2. Kaufmann BC
3. Liu J
4. Siuda-Krzywicka K
5. Bartolomeo P
2022The connectional anatomy of visual mental imagery: evidence from a patient with left occipito-temporal damageBrain Structure & Function 227:3075–3083https://doi.org/10.1007/s00429-022-02505-x Google Scholar
47.
1. Handjaras G
2. Ricciardi E
3. Leo A
4. Lenci A
5. Cecchetti L
6. Cosottini M
7. Marotta G
8. Pietrini P
2016How concepts are encoded in the human brain: A modality independent, category-based cortical organization of semantic knowledgeNeuroImage 135:232–242Google Scholar
48.
1. Haxby JV
2. Gobbini MI
3. Furey ML
4. Ishai A
5. Schouten JL
6. Pietrini P
2001Distributed and Overlapping Representations of Faces and Objects in Ventral Temporal CortexScience 293:2425–2430https://doi.org/10.1126/science.1063736 Google Scholar
49.
1. He K
2. Zhang X
3. Ren S
4. Sun J
2016Deep Residual Learning for Image RecognitionIn: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp. 770–778Google Scholar
50.
1. Huth AG
2. Nishimoto S
3. Vu AT
4. Gallant JL
2012A continuous semantic space describes the representation of thousands of object and action categories across the human brainNeuron 76:1210–1224https://doi.org/10.1016/j.neuron.2012.10.014 Google Scholar
51.
1. Jefferies E
2. Patterson K
3. Jones RW
4. Ralph MAL
2009Comprehension of concrete and abstract words in semantic dementiaNeuropsychology 23:492https://doi.org/10.1037/a0015452 Google Scholar
52.
1. Jenkinson M
2. Beckmann CF
3. Behrens TEJ
4. Woolrich MW
5. Smith SM.
2012FSLNeuroImage 62:782–790https://doi.org/10.1016/j.neuroimage.2011.09.015 Google Scholar
53.
1. Jiang AQ
2. Sablayrolles A
3. Mensch A
4. Bamford C
5. Chaplot DS
6. Ddl Casas
7. Bressand F
8. Lengyel G
9. Lample G
10. Saulnier L
11. Lavaud LR
12. Lachaux MA
13. Stock P
14. Scao TL
15. Lavril T
16. Wang T
17. Lacroix T
18. Sayed WE
19. Mistral 7B
2023Mistral 7BarXiv https://doi.org/10.48550/arXiv.2310.06825 Google Scholar
54.
1. Johnson MR
2. Johnson MK
2014Decoding individual natural scene representations during perception and imageryFrontiers in Human Neuroscience 8https://doi.org/10.3389/fnhum.2014.00059 Google Scholar
55.
1. Jung Y
2. Larsen B
3. Walther DB
2018Modality-Independent Coding of Scene Categories in Prefrontal CortexJournal of Neuroscience 38:5969–5981Google Scholar
56.
1. Kanwisher N
2010Functional specificity in the human brain: A window into the functional architecture of the mindProceedings of the National Academy of Sciences 107:11163–11170https://doi.org/10.1073/pnas.1005062107 Google Scholar
57.
1. Kay KN
2. Naselaris T
3. Prenger RJ
4. Gallant JL
2008Identifying natural images from human brain activityNature 452:352–355https://doi.org/10.1038/nature06713 Google Scholar
58.
1. Kiefer M
2. Pulvermüller F
2012Conceptual representations in mind and brain: Theoretical developments, current evidence and future directionsCortex 48:805–825https://doi.org/10.1016/j.cortex.2011.04.006 Google Scholar
59.
1. Kim W
2. Son B
3. Kim I
2021ViLT: Vision-and-Language Transformer Without Convolution or Region SupervisionIn: Proceedings of the 38th International Conference on Machine Learning pp. 5583–5594https://proceedings.mlr.press/v139/kim21k.html Google Scholar
60.
1. Kosslyn SM
2. Pascual-Leone A
3. Felician O
4. Camposano S
5. Keenan JP L W
6. Thompson Ganis G
7. Sukel KE
8. Alpert NM
1999The Role of Area 17 in Visual Imagery: Convergent Evidence from PET and rTMSScience 284:167–170https://doi.org/10.1126/science.284.5411.167 Google Scholar
61.
1. Kosslyn SM
2. Thompson WL
2003When is early visual cortex activated during visual mental imagery?Psychological Bulletin 129:723–746https://doi.org/10.1037/0033-2909.129.5.723 Google Scholar
62.
1. Krasnowska-Kieraś K
2. Wróblewska A
2019Empirical Linguistic Study of Sentence EmbeddingsIn: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics Florence, Italy: Association for Computational Linguistics pp. 5729–5739https://doi.org/10.18653/v1/P19-1573 Google Scholar
63.
1. Kriegeskorte N
2. Mur M
3. Bandettini PA
2008Representational similarity analysis-connecting the branches of systems neuroscienceFrontiers in Systems Neuroscience 2https://doi.org/10.3389/neuro.06.004.2008 Google Scholar
64.
1. Lambon Ralph MA
2. Lowe C
3. Rogers TT
2006Neural basis of category-specific semantic deficits for living things: evidence from semantic dementia, HSVE and a neural network modelBrain 130:1127–1137https://doi.org/10.1093/brain/awm025 Google Scholar
65.
1. Lee SH
2. Kravitz DJ
3. Baker CI
2012Disentangling visual imagery and perception of real-world objectsNeuroImage 59:4064–4073https://doi.org/10.1016/j.neuroimage.2011.10.055 Google Scholar
66.
1. Li J
2. Li D
3. Savarese S
4. Hoi S
2023BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language ModelsIn: International Conference on Machine Learning Google Scholar
67.
1. Li LH
2. Yatskar M
3. Yin D
4. Hsieh CJ
5. Chang KW
2019VisualBERT: A Simple and Performant Baseline for Vision and LanguagearXiv:190803557 [cs] http://arxiv.org/abs/1908.03557
68.
1. Lin S
2. Sprague T
3. Singh AK
2022Mind Reader: Reconstructing complex images from brain activitiesAdvances in Neural Information Processing Systems 35:29624–29636Google Scholar
69.
1. Lin TY
2. Maire M
3. Belongie S
4. Hays J
5. Perona P
6. Ramanan D
7. Dollár P
8. Zitnick CL
2014Microsoft COCO: Common Objects in ContextComputer Vision – ECCV 2014 8693:740–755Google Scholar
70.
1. Liu J
2. Spagna A
3. Bartolomeo P
2022Hemispheric asymmetries in visual mental imageryBrain Structure & Function 227:697–708https://doi.org/10.1007/s00429-021-02277-w Google Scholar
71.
1. Liu J
2. Zhan M
3. Hajhajate D
4. Spagna A
5. Dehaene S
6. Cohen L
7. Bartolomeo P
2025Visual mental imagery in typical imagers and in aphantasia: A millimeter-scale 7-T fMRI studyCortex 185:113–132https://doi.org/10.1016/j.cortex.2025.01.013 Google Scholar
72.
1. Liuzzi AG
2. Bruffaerts R
3. Peeters R
4. Adamczuk K
5. Keuleers E
6. De Deyne S
7. Storms G
8. Dupont P
9. Vandenberghe R
2017Cross-modal representation of spoken and written word meaning in left pars triangularisNeuroImage 150:292–307https://doi.org/10.1016/j.neuroimage.2017.02.032 Google Scholar
73.
1. Man K
2. Kaplan JT
3. Damasio A
4. Meyer K
2012Sight and Sound Converge to Form Modality-Invariant Representations in Temporoparietal CortexJournal of Neuroscience 32:16629–16636Google Scholar
74.
1. Martin A
2009Circuits in Mind: The Neural Foundations for Object Concepts
In:
1. Gazzaniga MS
, editors. The Cognitive Neurosciences The MIT Press
https://doi.org/10.7551/mitpress/8029.003.0091 Google Scholar
75.
1. Martin A
2016GRAPES—Grounding representations in action, perception, and emotion systems: How object properties and categories are represented in the human brainPsychonomic Bulletin & Review 23:979–990Google Scholar
76.
1. Martin CB
2. Douglas D
3. Newsome RN
4. Man LL
5. Barense MD
2018Integrative and distinctive coding of visual and conceptual object features in the ventral visual streameLife 7Google Scholar
77.
1. Meschke E
2. Gallant J
2024Mapping Multimodal Conceptual Representations within the Lexical-Semantic Brain SystemCCN Google Scholar
78.
1. Meyer K
2. Damasio A
2009Convergence and divergence in a neural architecture for recognition and memoryTrends in Neurosciences 32:376–382https://doi.org/10.1016/j.tins.2009.04.002 Google Scholar
79.
1. Moore CJ
2. Price CJ
1999Three Distinct Ventral Occipitotemporal Regions for Reading and Object NamingNeuroImage 10:181–192Google Scholar
80.
1. Naselaris T
2. Kay KN
3. Nishimoto S
4. Gallant JL.
2011Encoding and decoding in fMRINeuroImage 56:400–410https://doi.org/10.1016/j.neuroimage.2010.07.073 Google Scholar
81.
1. Naselaris T
2. Olman CA
3. Stansbury DE
4. Ugurbil K
5. Gallant JL
2015A voxel-wise encoding model for early visual areas decodes mental images of remembered scenesNeuroImage 105:215–228https://doi.org/10.1016/j.neuroimage.2014.10.018 Google Scholar
82.
1. Naselaris T
2. Prenger RJ
3. Kay KN
4. Oliver M
5. Gallant JL
2009Bayesian Reconstruction of Natural Images from Human Brain ActivityNeuron 63:902–915https://doi.org/10.1016/j.neuron.2009.09.006 Google Scholar
83.
1. Nastase SA
2. Liu YF
3. Hillman H
4. Zadbood A
5. Hasenfratz L
6. Keshavarzian N
7. Chen J
8. Honey CJ
9. Yeshurun Y
10. Regev M
11. Nguyen M
12. Chang CHC
13. Baldassano C
14. Lositsky O
15. Simony E
16. Chow MA
17. Leong YC
18. Brooks PP
19. Micciche E
20. Choe G
21. et al.
2021The “Narratives” fMRI dataset for evaluating models of naturalistic language comprehensionScientific Data 8:250https://doi.org/10.1038/s41597-021-01033-3 Google Scholar
84.
1. Nishimoto S
2. Vu AT
3. Naselaris T
4. Benjamini Y
5. Yu B
6. Gallant JL
2011Reconstructing Visual Experiences from Brain Activity Evoked by Natural MoviesCurrent Biology 21:1641–1646https://doi.org/10.1016/j.cub.2011.08.031 Google Scholar
85.
1. O’Craven KM
2. Kanwisher N
2000Mental Imagery of Faces and Places Activates Corresponding Stimulus-Specific Brain RegionsJournal of Cognitive Neuroscience 12:1013–1023https://doi.org/10.1162/08989290051137549 Google Scholar
86.
1. Visconti di Oleggio Castello M
2. Chauhan V
3. Jiahui G
4. Gobbini MI
2020An fMRI dataset in response to “The Grand Budapest Hotel”, a socially-rich, naturalistic movieScientific Data 7:383https://doi.org/10.1038/s41597-020-00735-4 Google Scholar
87.
1. Oquab M
2. Darcet T
3. Moutakanni T
4. Vo H
5. Szafraniec M
6. Khalidov V
7. Fernandez P
8. Haziza D
9. Massa F
10. El-Nouby A
11. Assran M
12. Ballas N
13. Galuba W
14. Howes R
15. Huang PY
16. Li SW
17. Misra I
18. Rabbat M
19. Sharma V
20. Synnaeve G
21. et al.
2023DINOv2: Learning Robust Visual Features without SupervisionarXiv http://arxiv.org/abs/2304.07193 Google Scholar
88.
1. Ozcelik F
2. VanRullen R
2023Natural scene reconstruction from fMRI signals using generative latent diffusionarXiv http://arxiv.org/abs/2303.05334 Google Scholar
89.
1. Patterson K
2. Lambon Ralph MA
2016The Hub-and-Spoke Hypothesis of Semantic MemoryNeurobiology of Language Elsevier :765–775https://doi.org/10.1016/B978-0-12-407794-2.00061-4 Google Scholar
90.
1. Pearson J
2019The human imagination: the cognitive neuroscience of visual mental imageryNature Reviews Neuroscience 20:624–634https://doi.org/10.1038/s41583-019-0202-9 Google Scholar
91.
1. Pedregosa F
2. Varoquaux G
3. Gramfort A
4. Michel V
5. Thirion B
6. Grisel O
7. Blondel M
8. Prettenhofer P
9. Weiss R
2011Dubourg VJournal of machine Learning research 12:2825–2830Google Scholar
92.
1. Pereira F
2. Lou B
3. Pritchett B
4. Ritter S
5. Gershman SJ
6. Kanwisher N
7. Botvinick M
8. Fedorenko E
2018Toward a universal decoder of linguistic meaning from brain activationNature Communications 9:963https://doi.org/10.1038/s41467-018-03068-4 Google Scholar
93.
1. Pobric G
2. Jefferies E
3. Ralph MAL
2010Category-Specific versus Category-General Semantic Impairment Induced by Transcranial Magnetic StimulationCurrent Biology 20:964–968https://doi.org/10.1016/j.cub.2010.03.070 Google Scholar
94.
1. Popham SF
2. Huth AG
3. Bilenko NY
4. Deniz F
5. Gao JS
6. Nunez-Elizalde AO
7. Gallant JL
2021Visual and linguistic semantic representations are aligned at the border of human visual cortexNature Neuroscience 24:1628–1636Google Scholar
95.
1. Radford A
2. Kim JW
3. Hallacy C
4. Ramesh A
5. Goh G
6. Agarwal S
7. Sastry G
8. Askell A
9. Mishkin P
10. Clark J
11. Krueger G
12. Sutskever I
2021Learning Transferable Visual Models From Natural Language SupervisionIn: Proceedings of the 38th International Conference on Machine Learning :16Google Scholar
96.
1. Radford A
2. Wu J
3. Child R
4. Luan D
5. Amodei D
6. Sutskever I
2019Language models are unsupervised multitask learnersOpenAI blog 1:9Google Scholar
97.
1. Ralph MAL
2. Jefferies E
3. Patterson K
4. Rogers TT
2017The neural and computational bases of semantic cognitionNature Reviews Neuroscience 18:42–55Google Scholar
98.
1. Reddy L
2. Tsuchiya N
3. Serre T
2010Reading the mind’s eye: decoding category information during mental imageryNeuroImage 50:818–825https://doi.org/10.1016/j.neuroimage.2009.11.084 Google Scholar
99.
1. Reimers N
2. Gurevych I
2019Sentence-BERT: Sentence Embeddings using Siamese BERT-NetworksIn: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) pp. 3980–3990https://doi.org/10.18653/v1/D19-1410 Google Scholar
100.
1. Rogers TT
2. Lambon Ralph MA
3. Garrard P
4. Bozeat S
5. McClelland JL
6. Hodges JR
7. Patterson K
2004Structure and Deterioration of Semantic Memory: A Neuropsychological and Computational InvestigationPsychological Review 111:205–235https://doi.org/10.1037/0033-295X.111.1.205 Google Scholar
101.
1. Sanchez G
2. Hartmann T
3. Fuscà M
4. Demarchi G
5. Weisz N
2020Decoding across sensory modalities reveals common supramodal signatures of conscious perceptionProceedings of the National Academy of Sciences 117:7437–7446https://doi.org/10.1073/pnas.1912584117 Google Scholar
102.
1. Schoffelen JM
2. Oostenveld R
3. Lam NHL
4. Uddén J
5. Hultén A
6. Hagoort P
2019A 204-subject multimodal neuroimaging dataset to study language processingScientific Data 6:17https://doi.org/10.1038/s41597-019-0020-y Google Scholar
103.
1. Sereno MI
2. Dale AM
3. Reppas JB
4. Kwong KK
5. Belliveau JW
6. Brady TJ
7. Rosen BR
8. Tootell RBH
1995Borders of Multiple Visual Areas in Humans Revealed by Functional Magnetic Resonance ImagingScience 268:889–893https://doi.org/10.1126/science.7754376 Google Scholar
104.
1. Shen G
2. Horikawa T
3. Majima K
4. Kamitani Y
2019Deep image reconstruction from human brain activityPLOS Computational Biology 15:e1006633https://doi.org/10.1371/journal.pcbi.1006633 Google Scholar
105.
1. Shinkareva SV
2. Malave VL
3. Mason RA
4. Mitchell TM
5. Just MA
2011Commonality of neural representations of words and picturesNeuroImage 54:2418–2425Google Scholar
106.
1. Simanova I
2. Hagoort P
3. Oostenveld R
4. van Gerven MAJ
2014Modality-Independent Decoding of Semantic Information from the Human BrainCerebral Cortex 24:426–434Google Scholar
107.
1. Simmons WK
2. Barsalou LW
2003The similarity-in-topography principle: Reconciling theories of conceptual deficitsCognitive Neuropsychology 20:451–486Google Scholar
108.
1. Singh A
2. Hu R
3. Goswami V
4. Couairon G
5. Galuba W
6. Rohrbach M
7. Kiela D
2022. FLAVA: A Foundational Language and Vision Alignment ModelIn: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 15638–15650Google Scholar
109.
1. Smith SM
2. Nichols TE
2009Threshold-free cluster enhancement: Addressing problems of smoothing, threshold dependence and localisation in cluster inferenceNeuroImage 44:83–98Google Scholar
110.
1. Snowden J
2. Goulding PJ
3. Neary D
1989Semantic dementia: A form of circumscribed cerebral atrophyBehavioural Neurology 2:124043https://doi.org/10.1155/1989/124043 Google Scholar
111.
1. Spagna A
2. Hajhajate D
3. Liu J
4. Bartolomeo P
2021Visual mental imagery engages the left fusiform gyrus, but not the early visual cortex: A meta-analysis of neuroimaging evidenceNeuroscience & Biobehavioral Reviews 122:201–217https://doi.org/10.1016/j.neubiorev.2020.12.029 Google Scholar
112.
1. Steel A
2. Billings MM
3. Silson EH
4. Robertson CE
2021A network linking scene perception and spatial memory systems in posterior cerebral cortexNature Communications 12:2632https://doi.org/10.1038/s41467-021-22848-z Google Scholar
113.
1. Steiner A
2. Pinto AS
3. Tschannen M
4. Keysers D
5. Wang X
6. Bitton Y
7. Gritsenko A
8. Minderer M
9. Sherbondy A
10. Long S
11. Qin S
12. Ingle R
13. Bugliarello E
14. Kazemzadeh S
15. Mesnard T
16. Alabdulmohsin I
17. Beyer L
18. Zhai X
2024PaliGemma 2: A Family of Versatile VLMs for TransferarXiv https://doi.org/10.48550/arXiv.2412.03555 Google Scholar
114.
1. Stokes M
2. Thompson R
3. Cusack R
4. Duncan J
2009Top-Down Activation of Shape-Specific Population Codes in Visual Cortex during Mental ImageryThe Journal of Neuroscience 29:1565–1572https://doi.org/10.1523/JNEUROSCI.4657-08.2009 Google Scholar
115.
1. Takagi Y
2. Nishimoto S
2023High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain ActivityIn: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 14453–14463Google Scholar
116.
1. Tang J
2. Du M
2023Vo VA. Brain encoding models based on multimodal transformers can transfer across language and visionNeurIPS Google Scholar
117.
1. Tang J
2. LeBel A
3. Jain S
4. Huth AG
2023Semantic reconstruction of continuous language from non-invasive brain recordingsNature Neuroscience 26:858–866https://doi.org/10.1038/s41593-023-01304-9 Google Scholar
118.
1. Tong J
2. Binder JR
3. Humphries C
4. Mazurchuk S
5. Conant LL
6. Fernandino L
2022A Distributed Network for Multimodal Experiential Representation of ConceptsJournal of Neuroscience 42:7121–7130Google Scholar
119.
1. Touvron H
2. Martin L
3. Stone K
4. Albert P
5. Almahairi A
6. Babaei Y
7. Bashlykov N
8. Batra S
9. Bhargava P
10. Bhosale S
11. Bikel D
12. Blecher L
13. Ferrer CC
14. Chen M
15. Cucurull G
16. Esiobu D
17. Fernandes J
18. Fu J
19. Fu W
20. Fuller B
21. et al.
2023Llama 2: Open Foundation and Fine-Tuned Chat ModelsarXiv https://doi.org/10.48550/arXiv.2307.09288 Google Scholar
120.
1. Tranel D
2. Damasio H
3. Damasio AR
1997A neural basis for the retrieval of conceptual knowledgeNeuropsychologia 35:1319–1327https://doi.org/10.1016/S0028-3932(97)00085-7 Google Scholar
121.
1. Vandenberghe R
2. Price C
3. Wise R
4. Josephs O
5. Frackowiak RS
1996Functional anatomy of a common semantic system for words and picturesNature :383Google Scholar
122.
1. VanRullen R
2. Reddy L
2019Reconstructing faces from fMRI patterns using deep generative neural networksCommunications Biology 2:1–10https://doi.org/10.1038/s42003-019-0438-y Google Scholar
123.
1. Warrington EK
1975The Selective Impairment of Semantic MemoryQuarterly Journal of Experimental Psychology 27:635–657https://doi.org/10.1080/14640747508400525 Google Scholar
124.
1. Warrington EK
2. Mccarthy RA
1987Categories of knowledge: Further fractionations and an attempted integrationBrain: a Journal of Neurology Google Scholar
125.
1. Warrington EK
2. Shallice T
1984Category Specific Semantic ImpairmentsBrain: a Journal of Neurology Google Scholar
126.
1. Wolf T
2. Debut L
3. Sanh V
4. Chaumond J
5. Delangue C
6. Moi A
7. Cistac P
8. Rault T
9. Louf R
10. Funtowicz M
2020Transformers: State-of-the-art natural language processingIn: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations pp. 38–45Google Scholar
127.
1. Xu X
2. Wu C
3. Rosenman S
4. Lal V
5. Che W
6. Duan N
2023BridgeTower: Building Bridges between Encoders in Vision-Language Representation LearningIn: Proceedings of the AAAI Conference on Artificial Intelligence pp. 10637–10647https://doi.org/10.1609/aaai.v37i9.26263 Google Scholar
128.
1. Zhai X
2. Mustafa B
3. Kolesnikov A
4. Beyer L
2023Sigmoid Loss for Language Image Pre-TrainingIn: 2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 11941–11952https://doi.org/10.1109/ICCV51070.2023.01100 Google Scholar
129.
1. Zola-Morgan S
1995Localization of Brain Function: The Legacy of Franz Joseph Gall (1758-1828)Annual Review of Neuroscience 18:359–383https://doi.org/10.1146/annurev.ne.18.030195.002043 Google Scholar
130.
1. Zwaan RA
2016Situation models, mental simulations, and abstract concepts in discourse comprehensionPsychonomic Bulletin & Review 23:1028–1034https://doi.org/10.3758/s13423-015-0864-x Google Scholar

Article and author information

Author information

Mitja Nikolaus
Université de Toulouse, Toulouse, France
ORCID iD: 0000-0001-5609-6628
- For correspondence: mitja.nikolaus@cnrs.fr
- These authors contributed equally to this work
Milad Mozafari
Torus AI, Toulouse, Toulouse, France;
ORCID iD: 0000-0002-4521-1640
- These authors contributed equally to this work
Isabelle Berry
Université de Toulouse, Toulouse, France
Nicholas Asher
Université de Toulouse, Toulouse, France
Leila Reddy
Université de Toulouse, Toulouse, France
ORCID iD: 0000-0002-7078-1055
Rullen VanRullen
Université de Toulouse, Toulouse, France

Author Notes

Competing interests: No competing interests declared

Version history

Preprint posted: June 8, 2025
Sent for peer review: June 14, 2025
Reviewed Preprint version 1: September 3, 2025

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.107933. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

views: 276
downloads: 11
citations: 0

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Significance of findings

Strength of evidence

Abstract

Introduction

Related Work

Modality-agnostic representations

Decoding of vision and language from fMRI

Decoding of mental imagery

Methods for localizing modality-agnostic regions

Methods

fMRI Experiment

Panel A: Setup of the main fMRI experiment.

fMRI Preprocessing

Modality-Agnostic Decoders

Training of modality-specific and modality-agnostic decoders.

Evaluation of modality-specific and modality-agnostic decoders.

Results

Modality-Agnostic Decoders

Average decoding scores for modality-agnostic decoders (bars), compared to modality-specific decoders trained on data from subjects viewing captions (·) or on data from subjects viewing images (×).

Decoding accuracy for decoding captions (top) and for decoding images (bottom).

Qualitative Decoding Results

Decoding examples for image decoding using a modality-agnostic decoder.

Decoding examples for caption decoding using a modality-agnostic decoder.

Modality-Agnostic Regions

Searchlight method to identify modality-agnostic ROIs.

Searchlight results for modality-agnostic regions.

Imagery Decoding

Searchlight results for imagery decoding.

Discussion

Acknowledgements

Appendix 1 Feature extraction details

Feature comparison for vision models.

Feature comparison for multimodal models.

Decoder training details

Appendix 2 Qualitative Decoding Results for Modality-Specific Decoders

Decoding examples for image decoding using a modality-specific decoder trained on images (within-modality decoding).

Decoding examples for caption decoding using a modality-specific decoder trained on images (cross-modality decoding).

Decoding examples for caption decoding using a modality-specific decoder trained on captions (within-modality decoding).

Decoding examples for image decoding using a modality-specific decoder trained on captions (cross-modality decoding).

Appendix 3 Candidates for Modality-agnostic regions

Candidates for modality-agnostic regions as identified by previous work.

Appendix 4 Searchlight size

Average decoding scores (for images and captions) by number of vertices.

Appendix 5 Per-subject results

Pairwise accuracy per subject.

Appendix 6 Qualitative Imagery Decoding Results

Imagery decoding for Subject 1 using a modality-agnostic decoder.

Imagery decoding for Subject 2 using a modality-agnostic decoder.

Imagery decoding for Subject 3 using a modality-agnostic decoder.

Imagery decoding for Subject 4 using a modality-agnostic decoder.

Imagery decoding for Subject 5 using a modality-agnostic decoder.

Imagery decoding for Subject 6 using a modality-agnostic decoder. For further details refer to the caption of Figure 1.

Additional information

Funding

References

Article and author information

Author information

Mitja Nikolaus†

Milad Mozafari†

Isabelle Berry

Nicholas Asher

Leila Reddy

Rullen VanRullen

Author Notes

Version history

Cite all versions

Copyright

Metrics

Mitja Nikolaus

Milad Mozafari