Modality-agnostic decoding of vision and language from fMRI

  1. Mitja Nikolaus (corresponding author)
  2. Milad Mozafari
  3. Isabelle Berry
  4. Nicholas Asher
  5. Leila Reddy
  6. Rufin VanRullen
  1. Université de Toulouse, CNRS, CerCo, France
  2. Torus AI, France
  3. Université de Toulouse, IRIT, France
24 figures, 4 tables and 1 additional file

Figures

Setup of the fMRI experiment.

(A) Setup of the main fMRI experiment. Subjects viewed images and captions in random alternation. Whenever the current stimulus matched the previous stimulus, subjects were instructed to press a button (one-back matching task). Images and captions are for illustration only; actual size and stimuli are as described in the text. (B) Number of distinct training stimuli (excluding trials that were one-back targets or during which the subject pressed the response button). The number of overlapping stimuli indicates how many stimuli were presented both as caption and as image. An additional set of 140 stimuli (70 images and 70 captions) was used for testing. (C) Setup of the fMRI experiment for the imagery trials. Subjects were instructed to remember three image descriptions with corresponding indices (numbers 1–3). One of these indices was displayed during the instruction phase, followed by a fixation phase, after which subjects imagined the visual scene for 10 s.

Training of modality-specific and modality-agnostic decoders.

(A) Modality-specific decoders are trained on fMRI data of one modality (e.g. subjects viewing images) by mapping it to features extracted from the same stimuli. (B) Modality-agnostic decoders are trained jointly on fMRI data of both modalities (subjects viewing images and captions). (C) To train decoders, features can be either extracted unimodally from the corresponding images or captions, or by creating multimodal features based on both modalities. For example, to train a modality-agnostic decoder based on features from a unimodal language model, we map the fMRI data of subjects viewing captions to features extracted from the respective captions using this language model, as well as the fMRI data of subjects viewing images to features extracted by the language model from the corresponding captions. We can also train modality-specific decoders on features from another modality, for example, by mapping fMRI data of subjects viewing images to features extracted from the corresponding captions using a language model.
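As a concrete illustration of the difference between the two training schemes, here is a minimal sketch using closed-form ridge regression on synthetic data. The array shapes, the regularization strength, and the `fit_ridge` helper are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

def fit_ridge(X, Y, lam=1.0):
    """Closed-form ridge regression mapping fMRI patterns X (n_trials, n_voxels)
    to stimulus features Y (n_trials, n_dims)."""
    n_vox = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_vox), X.T @ Y)

# Hypothetical data: fMRI responses and extracted stimulus features.
rng = np.random.default_rng(0)
fmri_images, feat_images = rng.normal(size=(100, 50)), rng.normal(size=(100, 8))
fmri_captions, feat_captions = rng.normal(size=(100, 50)), rng.normal(size=(100, 8))

# Modality-specific decoder: trained on one modality only (panel A).
W_img = fit_ridge(fmri_images, feat_images)

# Modality-agnostic decoder: stack trials from both modalities and fit jointly (panel B).
X_joint = np.vstack([fmri_images, fmri_captions])
Y_joint = np.vstack([feat_images, feat_captions])
W_agnostic = fit_ridge(X_joint, Y_joint)

pred = fmri_images @ W_agnostic  # predicted features for image trials
```

Either decoder can be paired with unimodal or multimodal target features (panel C) simply by swapping the `Y` matrices.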

Evaluation of modality-specific and modality-agnostic decoders.

The matrices display cosine similarity scores between features extracted from the candidate stimuli and features predicted by the decoder. The evaluation metric is pairwise accuracy, which is calculated row-wise: For a given matrix row, we compare the similarity score of the target stimulus on the diagonal (in green) with the similarity scores of all other candidate stimuli (in red). (A) Within-modality decoding metrics of modality-specific decoders. To compute within-modality accuracy for image decoding, a modality-specific decoder trained on images is evaluated on all stimuli that were presented as images. To compute within-modality accuracy for caption decoding, a modality-specific decoder trained on captions is evaluated on all caption stimuli. (B) Cross-modality decoding metrics of modality-specific decoders. To compute cross-modality accuracy for image decoding, a modality-specific decoder trained on captions is evaluated on all stimuli that were presented as images. To compute cross-modality accuracy for caption decoding, a modality-specific decoder trained on images is evaluated on all caption stimuli. (C) Metrics for modality-agnostic decoders. To compute modality-agnostic accuracy for image decoding, a modality-agnostic decoder is evaluated on all stimuli that were presented as images. The same decoder is evaluated on caption stimuli to compute modality-agnostic accuracy for caption decoding. Here, we show feature extraction based on unimodal features for modality-specific decoders and based on multimodal features for the modality-agnostic decoder. In practice, the feature extraction can be unimodal or multimodal for any decoder type (see also Figure 2).
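The row-wise pairwise-accuracy metric described above can be sketched as follows; this is a minimal NumPy illustration on synthetic data, and the `pairwise_accuracy` helper name is ours, not from the paper:

```python
import numpy as np

def pairwise_accuracy(pred, true):
    """Row-wise pairwise accuracy: for each trial, the fraction of non-target
    candidates whose features are less similar to the prediction than the
    target's features are. pred, true: (n_trials, n_dims). Chance is 0.5."""
    # Cosine similarity matrix: predictions (rows) vs. candidate stimuli (cols).
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    t = true / np.linalg.norm(true, axis=1, keepdims=True)
    sim = p @ t.T
    diag = np.diag(sim)
    # Compare each diagonal (target) score with the off-diagonal candidates;
    # the diagonal itself never counts as a win (sim[i, i] < diag[i] is False).
    wins = sim < diag[:, None]
    n = sim.shape[0]
    return wins.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
true = rng.normal(size=(50, 8))
acc_perfect = pairwise_accuracy(true, true)                   # exact predictions
acc_random = pairwise_accuracy(rng.normal(size=(50, 8)), true)  # ~chance level
```

Perfect predictions yield an accuracy of 1.0, while unrelated predictions hover around the 0.5 chance level.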

Average decoding scores for modality-agnostic decoders (green), compared to modality-specific decoders trained on data from subjects viewing images (orange) or on data from subjects viewing captions (purple).

The metric is pairwise accuracy (see also Figure 3). Error bars indicate 95% confidence intervals calculated using bootstrapping. Chance performance is at 0.5.

Decoding accuracy for decoding images (top) and for decoding captions (bottom).

The orange bars in the top row indicate within-modality decoding scores for images, the purple bars in the bottom row indicate within-modality decoding scores for captions. The purple bars in the top row indicate cross-decoding scores for images, the orange bars in the bottom row indicate cross-decoding scores for captions (see also Figure 3). Error bars indicate 95% confidence intervals calculated using bootstrapping. Chance performance is at 0.5.

Decoding examples for image decoding using a modality-agnostic decoder.

The first column shows the image the subject was seeing and the five following columns show the candidate stimuli with highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Decoding examples for caption decoding using a modality-agnostic decoder.

For details, see caption of Figure 6. All images were taken from the CoCo dataset (Lin et al., 2014).

Searchlight method to identify modality-invariant ROIs.

The top plots show the performance (pairwise accuracy averaged over subjects) of modality-agnostic decoders for decoding images (top left) and decoding captions (top right). The second row displays cross-decoding performances: on the left, modality-specific decoders trained on captions are evaluated on images; on the right, modality-specific decoders trained on images are evaluated on captions. We identified modality-invariant ROIs as clusters in which all four decoding accuracies are above chance, by taking the minimum of the respective t-values at each location and then performing threshold-free cluster enhancement (TFCE) to calculate cluster values. The plot shows only left medial views of the brain to illustrate the method; different views of all resulting clusters are shown in Figure 9.
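The conjunction step (requiring all four decoding accuracies to be above chance at a location by keeping the minimum t-value there) can be sketched as below. The synthetic t-maps are purely illustrative, and the subsequent TFCE and permutation-testing steps are omitted:

```python
import numpy as np

# Hypothetical per-vertex t-value maps for the four decoding accuracies
# (modality-agnostic image/caption decoding and the two cross-decoding maps).
rng = np.random.default_rng(0)
t_maps = rng.normal(loc=1.0, size=(4, 1000))  # (n_maps, n_vertices)

# Min-statistic conjunction: a vertex qualifies as modality-invariant only if
# ALL four maps exceed chance there, so keep the minimum t-value per vertex.
t_conj = t_maps.min(axis=0)

# TFCE and permutation-based thresholding (not shown) would then be applied
# to t_conj to identify significant clusters.
```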

Searchlight results for modality-invariant regions.

Maps are thresholded at a threshold-free cluster enhancement (TFCE) value of 1508, the significance threshold corresponding to p<10⁻⁴ based on permutation testing. Regions with the highest cluster values are outlined and annotated based on the Desikan-Killiany atlas (Desikan et al., 2006).

Searchlight results for imagery decoding.

Maps are thresholded at a threshold-free cluster enhancement (TFCE) value of 3897, the significance threshold corresponding to p<10⁻⁴ based on permutation testing. We used the pairwise accuracy for imagery decoding with the large candidate set of 73 stimuli. We outlined the same regions as in Figure 9 to facilitate comparison.

Appendix 1—figure 1
Head motion estimates for each subject.

The plots show the realignment parameters as computed by SPM12 (spm_realign) for estimating within-modality rigid-body alignment. We multiplied the rotation parameters pitch, roll, and yaw (originally in radians) by 50 to allow interpretation in terms of millimeters of displacement on a circle of diameter 10 cm (approximately the mean distance from the cerebral cortex to the center of the head) (Power et al., 2012). The translucent error bands show 95% confidence intervals calculated using bootstrapping over all frames/runs from a session.

Appendix 1—figure 2
Frame-wise displacement for each subject.

The measure indicates how much the head changed position from one frame to the next. We calculated frame-wise displacement as the sum of the absolute values of the derivatives of the six realignment parameters. The translucent error bands show 95% confidence intervals calculated using bootstrapping over all frames/runs from a session.
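The frame-wise displacement computation described here (sum of absolute frame-to-frame derivatives of the six realignment parameters, with rotations converted to millimeters on a 50 mm radius) can be sketched as follows; the helper name is ours:

```python
import numpy as np

def framewise_displacement(params, radius=50.0):
    """Frame-wise displacement following Power et al. (2012): the sum of the
    absolute frame-to-frame changes of the six realignment parameters, with
    the three rotations (in radians) converted to mm of arc displacement on a
    sphere of the given radius. params: (n_frames, 6) = [x, y, z, pitch, roll, yaw]."""
    p = params.astype(float).copy()
    p[:, 3:] *= radius                # radians -> mm of arc displacement
    deriv = np.diff(p, axis=0)        # frame-to-frame differences
    return np.abs(deriv).sum(axis=1)  # one FD value per frame transition
```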

Appendix 1—figure 3
Mutual information between the anatomical scan of the first session and functional data for each session as an indicator of intersession alignment.

All values were normalized based on the mutual information from the functional data from the first session. The raw mutual information scores for the alignment during the first session are: subject 1: 0.49; subject 2: 0.55; subject 3: 0.55; subject 4: 0.53; subject 5: 0.57; subject 6: 0.57.

Appendix 3—figure 1
Decoding examples for image decoding using a modality-specific decoder trained on images (within-modality decoding).

The first column shows the image the subject was seeing and the five following columns show the candidate stimuli with highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 3—figure 2
Decoding examples for caption decoding using a modality-specific decoder trained on images (cross-modality decoding).

For details, see caption of Appendix 3—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 3—figure 3
Decoding examples for caption decoding using a modality-specific decoder trained on captions (within-modality decoding).

For details, see caption of Appendix 3—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 3—figure 4
Decoding examples for image decoding using a modality-specific decoder trained on captions (cross-modality decoding).

For details, see caption of Appendix 3—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 6—figure 1
Pairwise accuracy per subject.

For details, refer to Figure 5.

Appendix 7—figure 1
Imagery decoding for Subject 1 using a modality-agnostic decoder.

The first column shows the caption that was used to stimulate imagery and, above it, the sketch of their mental image that the subject drew at the end of the experiment. The five following columns show the candidate stimuli with the highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 7—figure 2
Imagery decoding for Subject 2 using a modality-agnostic decoder.

For further details, refer to the caption of Appendix 7—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 7—figure 3
Imagery decoding for Subject 3 using a modality-agnostic decoder.

For further details, refer to the caption of Appendix 7—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 7—figure 4
Imagery decoding for Subject 4 using a modality-agnostic decoder.

For further details, refer to the caption of Appendix 7—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 7—figure 5
Imagery decoding for Subject 5 using a modality-agnostic decoder.

For further details, refer to the caption of Appendix 7—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 7—figure 6
Imagery decoding for Subject 6 using a modality-agnostic decoder.

For further details, refer to the caption of Appendix 7—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Tables

Appendix 2—table 1
Feature comparison for vision models.

Pairwise accuracy for modality-agnostic decoders based on vision features extracted by averaging the last hidden states of all patches (‘vision_features_mean’), compared to features extracted from the [CLS] token (‘vision_features_cls’). The method leading to the best decoding performance for each model is highlighted in bold.

Model | Vision features | Pairwise accuracy
dino-base | vision_features_cls | 0.763
dino-base | vision_features_mean | 0.819
dino-giant | vision_features_cls | 0.750
dino-giant | vision_features_mean | 0.820
dino-large | vision_features_cls | 0.754
dino-large | vision_features_mean | 0.816
vit-b-16 | vision_features_cls | 0.788
vit-b-16 | vision_features_mean | 0.772
vit-h-14 | vision_features_cls | 0.785
vit-h-14 | vision_features_mean | 0.804
vit-l-16 | vision_features_cls | 0.788
vit-l-16 | vision_features_mean | 0.796
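The two pooling strategies compared above ([CLS] token versus mean over patch tokens) can be illustrated on a synthetic ViT-style hidden-state array; the shapes and variable names here are illustrative assumptions, not tied to any particular model:

```python
import numpy as np

# Hypothetical last hidden states of a ViT-style vision model:
# (n_tokens, dim), where token 0 is the [CLS] token and the rest are patches
# (e.g. 196 patches + [CLS] for a 224 px image with 16 px patches).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(197, 768))

cls_features = hidden[0]            # 'vision_features_cls': the [CLS] token
mean_features = hidden[1:].mean(0)  # 'vision_features_mean': patch average
```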
Appendix 2—table 2
Feature comparison for multimodal models.

The features are either based on the [CLS] tokens of the fused representations (‘fused_cls’), on averaging over the fused tokens (‘fused_mean’), or on averaging over tokens from intermediate vision and language stream outputs (‘avg’). In the latter case, for some models, the vision and language features themselves can be based either on [CLS] tokens or on averaging over all tokens/patches. The method leading to the best decoding performance for each model is highlighted in bold.

Model | Features | Vision features | Language features | Pairwise accuracy
blip2 | avg | vision_features_cls | lang_features_cls | 0.851
blip2 | fused_cls | vision_features_cls | lang_features_cls | 0.710
blip2 | fused_mean | vision_features_cls | lang_features_cls | 0.740
bridgetower | fused_cls | | | 0.815
bridgetower | fused_mean | | | 0.788
clip | avg | vision_features_cls | lang_features_cls | 0.842
flava | avg | vision_features_cls | lang_features_cls | 0.842
flava | fused_cls | vision_features_cls | lang_features_cls | 0.752
flava | fused_mean | vision_features_cls | lang_features_cls | 0.772
imagebind | avg | vision_features_cls | lang_features_cls | 0.857
paligemma2 | avg | vision_features_cls | lang_features_mean | 0.829
paligemma2 | avg | vision_features_mean | lang_features_mean | 0.848
paligemma2 | fused_mean | vision_features_mean | lang_features_mean | 0.828
siglip | avg | vision_features_cls | lang_features_cls | 0.852
siglip | avg | vision_features_mean | lang_features_cls | 0.823
vilt | fused_cls | | | 0.759
vilt | fused_mean | | | 0.839
visualbert | fused_cls | | | 0.639
visualbert | fused_mean | | | 0.743
Appendix 4—table 1
Candidates for modality-invariant regions, as identified by previous work.

All these regions were also found in our analysis, except for the two regions marked with an asterisk (*).

Hemisphere | Region | Studies that identified the region as modality-invariant
Left | Superior occipital gyrus* | Vandenberghe et al., 1996; Shinkareva et al., 2011
Left | Middle occipital gyrus | Shinkareva et al., 2011
Left | Inferior occipital gyrus | Shinkareva et al., 2011; Simanova et al., 2014
Left | Superior temporal gyrus | Shinkareva et al., 2011
Left | Superior temporal sulcus | Man et al., 2012
Left | Middle temporal gyrus | Vandenberghe et al., 1996; Shinkareva et al., 2011; Fairhall and Caramazza, 2013; Devereux et al., 2013; Simanova et al., 2014; Handjaras et al., 2016
Left | Inferior temporal gyrus | Vandenberghe et al., 1996; Shinkareva et al., 2011; Fairhall and Caramazza, 2013; Simanova et al., 2014; Handjaras et al., 2016
Left | Fusiform gyrus | Vandenberghe et al., 1996; Moore and Price, 1999; Bright et al., 2004; Shinkareva et al., 2011; Fairhall and Caramazza, 2013; Simanova et al., 2014
Left | Hippocampus | Vandenberghe et al., 1996
Left | Parahippocampus | Bright et al., 2004; Fairhall and Caramazza, 2013; Handjaras et al., 2016
Left | Perirhinal cortex | Bright et al., 2004; Fairhall and Caramazza, 2013
Left | Superior parietal cortex | Shinkareva et al., 2011
Left | Inferior parietal cortex | Shinkareva et al., 2011; Handjaras et al., 2016
Left | Cingulate gyrus | Moore and Price, 1999; Fairhall and Caramazza, 2013; Handjaras et al., 2016
Left | Cuneus | Shinkareva et al., 2011
Left | Precuneus | Shinkareva et al., 2011; Handjaras et al., 2016; Fairhall and Caramazza, 2013; Popham et al., 2021
Left | Supramarginal gyrus | Shinkareva et al., 2011; Handjaras et al., 2016
Left | Angular gyrus | Shinkareva et al., 2011; Fairhall and Caramazza, 2013; Devereux et al., 2013; Simanova et al., 2014; Handjaras et al., 2016; Popham et al., 2021
Left | Intraparietal sulcus | Shinkareva et al., 2011; Devereux et al., 2013
Left | Temporoparietal junction | Vandenberghe et al., 1996; Handjaras et al., 2016
Left | Superior frontal gyrus | Fairhall and Caramazza, 2013
Left | Middle frontal gyrus | Fairhall and Caramazza, 2013; Handjaras et al., 2016
Left | Inferior frontal gyrus | Vandenberghe et al., 1996; Bright et al., 2004; Simanova et al., 2014; Liuzzi et al., 2017; Handjaras et al., 2016
Left | Precentral gyrus | Shinkareva et al., 2011
Left | Postcentral gyrus | Shinkareva et al., 2011
Left | Paracentral lobule | Shinkareva et al., 2011
Left | Supplementary motor area | Shinkareva et al., 2011
Right | Fusiform gyrus | Shinkareva et al., 2011; Simanova et al., 2014
Right | Superior temporal sulcus | Man et al., 2012
Right | Middle temporal gyrus | Handjaras et al., 2016
Right | Inferior temporal gyrus | Handjaras et al., 2016
Right | Angular gyrus | Handjaras et al., 2016; Popham et al., 2021
Right | Superior parietal cortex | Shinkareva et al., 2011
Right | Precuneus | Shinkareva et al., 2011; Popham et al., 2021; Fairhall and Caramazza, 2013; Handjaras et al., 2016
Right | Cingulate gyrus | Moore and Price, 1999; Fairhall and Caramazza, 2013; Handjaras et al., 2016; Jung et al., 2018
Right | Parahippocampus | Handjaras et al., 2016
Right | Inferior parietal cortex | Handjaras et al., 2016
Right | Paracentral lobule | Shinkareva et al., 2011
Right | Middle frontal gyrus | Simanova et al., 2014; Jung et al., 2018
Right | Superior frontal gyrus* | Jung et al., 2018
Right | Inferior frontal gyrus | Moore and Price, 1999; Simanova et al., 2014
Appendix 5—table 1
Average decoding scores (for images and captions) by number of vertices.

Scores were calculated from the results of a searchlight analysis with a radius of 10 mm. The accuracy values were grouped into bins according to the number of vertices covered by each searchlight.

# Vertices | Pairwise accuracy (mean)
250 | 51.81%
500 | 52.35%
750 | 55.96%
1000 | 55.49%
1250 | 54.60%
1500 | 52.96%

Additional files


  1. Mitja Nikolaus
  2. Milad Mozafari
  3. Isabelle Berry
  4. Nicholas Asher
  5. Leila Reddy
  6. Rufin VanRullen
(2026)
Modality-agnostic decoding of vision and language from fMRI
eLife 14:RP107933.
https://doi.org/10.7554/eLife.107933.3