Modality-agnostic decoding of vision and language from fMRI

  1. Mitja Nikolaus (corresponding author)
  2. Milad Mozafari
  3. Isabelle Berry
  4. Nicholas Asher
  5. Leila Reddy
  6. Rufin VanRullen
  1. Université de Toulouse, CNRS, CerCo, France
  2. Torus AI, France
  3. Université de Toulouse, IRIT, France
24 figures, 4 tables and 1 additional file

Figures

Setup of the fMRI experiment.

(A) Setup of the main fMRI experiment. Subjects viewed images and captions in random alternation. Whenever the current stimulus matched the previous stimulus, subjects were instructed to press a button (one-back matching task). Images and captions are for illustration only; actual size and stimuli are as described in the text. (B) Number of distinct training stimuli (excluding trials that were one-back targets or during which the subject pressed the response button). The number of overlapping stimuli indicates how many stimuli were presented both as caption and as image. An additional set of 140 stimuli (70 images and 70 captions) was used for testing. (C) Setup of the fMRI experiment for the imagery trials. Subjects were instructed to remember three image descriptions with corresponding indices (numbers 1–3). One of these indices was displayed during the instruction phase, followed by a fixation phase, after which subjects imagined the visual scene for 10 s.

Training of modality-specific and modality-agnostic decoders.

(A) Modality-specific decoders are trained on fMRI data of one modality (e.g. subjects viewing images) by mapping it to features extracted from the same stimuli. (B) Modality-agnostic decoders are trained jointly on fMRI data of both modalities (subjects viewing images and captions). (C) To train decoders, features can be either extracted unimodally from the corresponding images or captions, or by creating multimodal features based on both modalities. For example, to train a modality-agnostic decoder based on features from a unimodal language model, we map the fMRI data of subjects viewing captions to features extracted from the respective captions using this language model, as well as the fMRI data of subjects viewing images to features extracted by the language model from the corresponding captions. We can also train modality-specific decoders on features from another modality, for example, by mapping fMRI data of subjects viewing images to features extracted from the corresponding captions using a language model.
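As a concrete illustration of the difference between the two training schemes, here is a minimal sketch using closed-form ridge regression on synthetic data. The array shapes, the regularization strength, and the `fit_ridge` helper are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

def fit_ridge(X, Y, lam=1.0):
    """Closed-form ridge regression mapping fMRI patterns X (n_trials, n_voxels)
    to stimulus features Y (n_trials, n_dims)."""
    n_vox = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_vox), X.T @ Y)

# Hypothetical data: fMRI responses and extracted stimulus features.
rng = np.random.default_rng(0)
fmri_images, feat_images = rng.normal(size=(100, 50)), rng.normal(size=(100, 8))
fmri_captions, feat_captions = rng.normal(size=(100, 50)), rng.normal(size=(100, 8))

# Modality-specific decoder: trained on one modality only (panel A).
W_img = fit_ridge(fmri_images, feat_images)

# Modality-agnostic decoder: stack trials from both modalities and fit jointly (panel B).
X_joint = np.vstack([fmri_images, fmri_captions])
Y_joint = np.vstack([feat_images, feat_captions])
W_agnostic = fit_ridge(X_joint, Y_joint)

pred = fmri_images @ W_agnostic  # predicted features for image trials
```

Either decoder can be paired with unimodal or multimodal target features (panel C) simply by swapping the `Y` matrices.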

Evaluation of modality-specific and modality-agnostic decoders.

The matrices display cosine similarity scores between features extracted from the candidate stimuli and features predicted by the decoder. The evaluation metric is pairwise accuracy, which is calculated row-wise: For a given matrix row, we compare the similarity score of the target stimulus on the diagonal (in green) with the similarity scores of all other candidate stimuli (in red). (A) Within-modality decoding metrics of modality-specific decoders. To compute within-modality accuracy for image decoding, a modality-specific decoder trained on images is evaluated on all stimuli that were presented as images. To compute within-modality accuracy for caption decoding, a modality-specific decoder trained on captions is evaluated on all caption stimuli. (B) Cross-modality decoding metrics of modality-specific decoders. To compute cross-modality accuracy for image decoding, a modality-specific decoder trained on captions is evaluated on all stimuli that were presented as images. To compute cross-modality accuracy for caption decoding, a modality-specific decoder trained on images is evaluated on all caption stimuli. (C) Metrics for modality-agnostic decoders. To compute modality-agnostic accuracy for image decoding, a modality-agnostic decoder is evaluated on all stimuli that were presented as images. The same decoder is evaluated on caption stimuli to compute modality-agnostic accuracy for caption decoding. Here, we show feature extraction based on unimodal features for modality-specific decoders and based on multimodal features for the modality-agnostic decoder. In practice, the feature extraction can be unimodal or multimodal for any decoder type (see also Figure 2).
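The row-wise pairwise-accuracy metric described above can be sketched as follows; this is a minimal NumPy illustration on synthetic data, and the `pairwise_accuracy` helper name is ours, not from the paper:

```python
import numpy as np

def pairwise_accuracy(pred, true):
    """Row-wise pairwise accuracy: for each trial, the fraction of non-target
    candidates whose features are less similar to the prediction than the
    target's features are. pred, true: (n_trials, n_dims). Chance is 0.5."""
    # Cosine similarity matrix: predictions (rows) vs. candidate stimuli (cols).
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    t = true / np.linalg.norm(true, axis=1, keepdims=True)
    sim = p @ t.T
    diag = np.diag(sim)
    # Compare each diagonal (target) score with the off-diagonal candidates;
    # the diagonal itself never counts as a win (sim[i, i] < diag[i] is False).
    wins = sim < diag[:, None]
    n = sim.shape[0]
    return wins.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
true = rng.normal(size=(50, 8))
acc_perfect = pairwise_accuracy(true, true)                   # exact predictions
acc_random = pairwise_accuracy(rng.normal(size=(50, 8)), true)  # ~chance level
```

Perfect predictions yield an accuracy of 1.0, while unrelated predictions hover around the 0.5 chance level.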

Average decoding scores for modality-agnostic decoders (green), compared to modality-specific decoders trained on data from subjects viewing images (orange) or on data from subjects viewing captions (purple).

The metric is pairwise accuracy (see also Figure 3). Error bars indicate 95% confidence intervals calculated using bootstrapping. Chance performance is at 0.5.

Decoding accuracy for decoding images (top) and for decoding captions (bottom).

The orange bars in the top row indicate within-modality decoding scores for images, the purple bars in the bottom row indicate within-modality decoding scores for captions. The purple bars in the top row indicate cross-decoding scores for images, the orange bars in the bottom row indicate cross-decoding scores for captions (see also Figure 3). Error bars indicate 95% confidence intervals calculated using bootstrapping. Chance performance is at 0.5.

Decoding examples for image decoding using a modality-agnostic decoder.

The first column shows the image the subject was seeing and the five following columns show the candidate stimuli with highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Decoding examples for caption decoding using a modality-agnostic decoder.

For details, see caption of Figure 6. All images were taken from the CoCo dataset (Lin et al., 2014).

Searchlight method to identify modality-invariant ROIs.

The top plots show the performance (pairwise accuracy averaged over subjects) of modality-agnostic decoders for decoding images (top left) and decoding captions (top right). The second row displays cross-decoding performances: on the left, modality-specific decoders trained on captions are evaluated on images; on the right, modality-specific decoders trained on images are evaluated on captions. We identified modality-invariant ROIs as clusters in which all four decoding accuracies are above chance, by taking the minimum of the respective t-values at each location and then performing threshold-free cluster enhancement (TFCE) to calculate cluster values. The plot shows only left medial views of the brain to illustrate the method; different views of all resulting clusters are shown in Figure 9.
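The conjunction step (requiring all four decoding accuracies to be above chance at a location by keeping the minimum t-value there) can be sketched as below. The synthetic t-maps are purely illustrative, and the subsequent TFCE and permutation-testing steps are omitted:

```python
import numpy as np

# Hypothetical per-vertex t-value maps for the four decoding accuracies
# (modality-agnostic image/caption decoding and the two cross-decoding maps).
rng = np.random.default_rng(0)
t_maps = rng.normal(loc=1.0, size=(4, 1000))  # (n_maps, n_vertices)

# Min-statistic conjunction: a vertex qualifies as modality-invariant only if
# ALL four maps exceed chance there, so keep the minimum t-value per vertex.
t_conj = t_maps.min(axis=0)

# TFCE and permutation-based thresholding (not shown) would then be applied
# to t_conj to identify significant clusters.
```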

Searchlight results for modality-invariant regions.

Maps are thresholded at a threshold-free cluster enhancement (TFCE) value of 1508, the significance threshold corresponding to p<10⁻⁴ based on permutation testing. Regions with the highest cluster values are outlined and annotated based on the Desikan-Killiany atlas (Desikan et al., 2006).

Searchlight results for imagery decoding.

Maps are thresholded at a threshold-free cluster enhancement (TFCE) value of 3897, the significance threshold corresponding to p<10⁻⁴ based on permutation testing. We used the pairwise accuracy for imagery decoding with the large candidate set of 73 stimuli. We outlined the same regions as in Figure 9 to facilitate comparison.

Appendix 1—figure 1
Head motion estimates for each subject.

The plots show the realignment parameters as computed by SPM12 (spm_realign) for estimating within-modality rigid-body alignment. We multiplied the rotation parameters pitch, roll, and yaw (originally in radians) by 50 to allow interpretation in terms of millimeters of displacement on a circle of diameter 10 cm (approximately the mean distance from the cerebral cortex to the center of the head) (Power et al., 2012). The translucent error bands show 95% confidence intervals calculated using bootstrapping over all frames/runs from a session.

Appendix 1—figure 2
Frame-wise displacement for each subject.

The measure indicates how much the head changed position from one frame to the next. We calculated frame-wise displacement as the sum of the absolute values of the derivatives of the six realignment parameters. The translucent error bands show 95% confidence intervals calculated using bootstrapping over all frames/runs from a session.
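The frame-wise displacement computation described here (sum of absolute frame-to-frame derivatives of the six realignment parameters, with rotations converted to millimeters on a 50 mm radius) can be sketched as follows; the helper name is ours:

```python
import numpy as np

def framewise_displacement(params, radius=50.0):
    """Frame-wise displacement following Power et al. (2012): the sum of the
    absolute frame-to-frame changes of the six realignment parameters, with
    the three rotations (in radians) converted to mm of arc displacement on a
    sphere of the given radius. params: (n_frames, 6) = [x, y, z, pitch, roll, yaw]."""
    p = params.astype(float).copy()
    p[:, 3:] *= radius                # radians -> mm of arc displacement
    deriv = np.diff(p, axis=0)        # frame-to-frame differences
    return np.abs(deriv).sum(axis=1)  # one FD value per frame transition
```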

Appendix 1—figure 3
Mutual information between the anatomical scan of the first session and functional data for each session as an indicator of intersession alignment.

All values were normalized based on the mutual information from the functional data from the first session. The raw mutual information scores for the alignment during the first session are: subject 1: 0.49; subject 2: 0.55; subject 3: 0.55; subject 4: 0.53; subject 5: 0.57; subject 6: 0.57.

Appendix 3—figure 1
Decoding examples for image decoding using a modality-specific decoder trained on images (within-modality decoding).

The first column shows the image the subject was seeing and the five following columns show the candidate stimuli with highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 3—figure 2
Decoding examples for caption decoding using a modality-specific decoder trained on images (cross-modality decoding).

For details, see caption of Appendix 3—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 3—figure 3
Decoding examples for caption decoding using a modality-specific decoder trained on captions (within-modality decoding).

For details, see caption of Appendix 3—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 3—figure 4
Decoding examples for image decoding using a modality-specific decoder trained on captions (cross-modality decoding).

For details, see caption of Appendix 3—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 6—figure 1
Pairwise accuracy per subject.

For details, refer to Figure 5.

Appendix 7—figure 1
Imagery decoding for Subject 1 using a modality-agnostic decoder.

The first column shows the caption that was used to stimulate imagery and, above it, the sketch of their mental image that the subject drew at the end of the experiment. The five following columns show the candidate stimuli with the highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 7—figure 2
Imagery decoding for Subject 2 using a modality-agnostic decoder.

For further details, refer to the caption of Appendix 7—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 7—figure 3
Imagery decoding for Subject 3 using a modality-agnostic decoder.

For further details, refer to the caption of Appendix 7—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 7—figure 4
Imagery decoding for Subject 4 using a modality-agnostic decoder.

For further details, refer to the caption of Appendix 7—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 7—figure 5
Imagery decoding for Subject 5 using a modality-agnostic decoder.

For further details, refer to the caption of Appendix 7—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Appendix 7—figure 6
Imagery decoding for Subject 6 using a modality-agnostic decoder.

For further details, refer to the caption of Appendix 7—figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Tables

Appendix 2—table 1
Feature comparison for vision models.

Pairwise accuracy for modality-agnostic decoders based on vision features extracted by averaging the last hidden states of all patches (‘vision_features_mean’), compared to features extracted from the [CLS] token (‘vision_features_cls’). The method leading to the best decoding performance for each model is highlighted in bold.

Model | Vision features | Pairwise accuracy
dino-base | vision_features_cls | 0.763
dino-base | vision_features_mean | 0.819
dino-giant | vision_features_cls | 0.750
dino-giant | vision_features_mean | 0.820
dino-large | vision_features_cls | 0.754
dino-large | vision_features_mean | 0.816
vit-b-16 | vision_features_cls | 0.788
vit-b-16 | vision_features_mean | 0.772
vit-h-14 | vision_features_cls | 0.785
vit-h-14 | vision_features_mean | 0.804
vit-l-16 | vision_features_cls | 0.788
vit-l-16 | vision_features_mean | 0.796
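The two pooling strategies compared above ([CLS] token versus mean over patch tokens) can be illustrated on a synthetic ViT-style hidden-state array; the shapes and variable names here are illustrative assumptions, not tied to any particular model:

```python
import numpy as np

# Hypothetical last hidden states of a ViT-style vision model:
# (n_tokens, dim), where token 0 is the [CLS] token and the rest are patches
# (e.g. 196 patches + [CLS] for a 224 px image with 16 px patches).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(197, 768))

cls_features = hidden[0]            # 'vision_features_cls': the [CLS] token
mean_features = hidden[1:].mean(0)  # 'vision_features_mean': patch average
```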
Appendix 2—table 2
Feature comparison for multimodal models.

The features are either based on the [CLS] tokens of the fused representations (‘fused_cls’), on averaging over the fused tokens (‘fused_mean’), or on averaging over tokens from intermediate vision and language stream outputs (‘avg’). In the latter case, for some models, the vision and language features themselves can be based either on [CLS] tokens or on averaging over all tokens/patches. The method leading to the best decoding performance for each model is highlighted in bold.

Model | Features | Vision features | Language features | Pairwise accuracy
blip2 | avg | vision_features_cls | lang_features_cls | 0.851
blip2 | fused_cls | vision_features_cls | lang_features_cls | 0.710
blip2 | fused_mean | vision_features_cls | lang_features_cls | 0.740
bridgetower | fused_cls | | | 0.815
bridgetower | fused_mean | | | 0.788
clip | avg | vision_features_cls | lang_features_cls | 0.842
flava | avg | vision_features_cls | lang_features_cls | 0.842
flava | fused_cls | vision_features_cls | lang_features_cls | 0.752
flava | fused_mean | vision_features_cls | lang_features_cls | 0.772
imagebind | avg | vision_features_cls | lang_features_cls | 0.857
paligemma2 | avg | vision_features_cls | lang_features_mean | 0.829
paligemma2 | avg | vision_features_mean | lang_features_mean | 0.848
paligemma2 | fused_mean | vision_features_mean | lang_features_mean | 0.828
siglip | avg | vision_features_cls | lang_features_cls | 0.852
siglip | avg | vision_features_mean | lang_features_cls | 0.823
vilt | fused_cls | | | 0.759
vilt | fused_mean | | | 0.839
visualbert | fused_cls | | | 0.639
visualbert | fused_mean | | | 0.743
Appendix 4—table 1
Candidates for modality-invariant regions, as identified by previous work.

All these regions were also found in our analysis, except for the two regions marked with an asterisk (*).

Hemisphere | Region | Studies that identified the region as modality-invariant
Left | Superior occipital gyrus* | Vandenberghe et al., 1996; Shinkareva et al., 2011
Left | Middle occipital gyrus | Shinkareva et al., 2011
Left | Inferior occipital gyrus | Shinkareva et al., 2011; Simanova et al., 2014
Left | Superior temporal gyrus | Shinkareva et al., 2011
Left | Superior temporal sulcus | Man et al., 2012
Left | Middle temporal gyrus | Vandenberghe et al., 1996; Shinkareva et al., 2011; Fairhall and Caramazza, 2013; Devereux et al., 2013; Simanova et al., 2014; Handjaras et al., 2016
Left | Inferior temporal gyrus | Vandenberghe et al., 1996; Shinkareva et al., 2011; Fairhall and Caramazza, 2013; Simanova et al., 2014; Handjaras et al., 2016
Left | Fusiform gyrus | Vandenberghe et al., 1996; Moore and Price, 1999; Bright et al., 2004; Shinkareva et al., 2011; Fairhall and Caramazza, 2013; Simanova et al., 2014
Left | Hippocampus | Vandenberghe et al., 1996
Left | Parahippocampus | Bright et al., 2004; Fairhall and Caramazza, 2013; Handjaras et al., 2016
Left | Perirhinal cortex | Bright et al., 2004; Fairhall and Caramazza, 2013
Left | Superior parietal cortex | Shinkareva et al., 2011
Left | Inferior parietal cortex | Shinkareva et al., 2011; Handjaras et al., 2016
Left | Cingulate gyrus | Moore and Price, 1999; Fairhall and Caramazza, 2013; Handjaras et al., 2016
Left | Cuneus | Shinkareva et al., 2011
Left | Precuneus | Shinkareva et al., 2011; Handjaras et al., 2016; Fairhall and Caramazza, 2013; Popham et al., 2021
Left | Supramarginal gyrus | Shinkareva et al., 2011; Handjaras et al., 2016
Left | Angular gyrus | Shinkareva et al., 2011; Fairhall and Caramazza, 2013; Devereux et al., 2013; Simanova et al., 2014; Handjaras et al., 2016; Popham et al., 2021
Left | Intraparietal sulcus | Shinkareva et al., 2011; Devereux et al., 2013
Left | Temporoparietal junction | Vandenberghe et al., 1996; Handjaras et al., 2016
Left | Superior frontal gyrus | Fairhall and Caramazza, 2013
Left | Middle frontal gyrus | Fairhall and Caramazza, 2013; Handjaras et al., 2016
Left | Inferior frontal gyrus | Vandenberghe et al., 1996; Bright et al., 2004; Simanova et al., 2014; Liuzzi et al., 2017; Handjaras et al., 2016
Left | Precentral gyrus | Shinkareva et al., 2011
Left | Postcentral gyrus | Shinkareva et al., 2011
Left | Paracentral lobule | Shinkareva et al., 2011
Left | Supplementary motor area | Shinkareva et al., 2011
Right | Fusiform gyrus | Shinkareva et al., 2011; Simanova et al., 2014
Right | Superior temporal sulcus | Man et al., 2012
Right | Middle temporal gyrus | Handjaras et al., 2016
Right | Inferior temporal gyrus | Handjaras et al., 2016
Right | Angular gyrus | Handjaras et al., 2016; Popham et al., 2021
Right | Superior parietal cortex | Shinkareva et al., 2011
Right | Precuneus | Shinkareva et al., 2011; Popham et al., 2021; Fairhall and Caramazza, 2013; Handjaras et al., 2016
Right | Cingulate gyrus | Moore and Price, 1999; Fairhall and Caramazza, 2013; Handjaras et al., 2016; Jung et al., 2018
Right | Parahippocampus | Handjaras et al., 2016
Right | Inferior parietal cortex | Handjaras et al., 2016
Right | Paracentral lobule | Shinkareva et al., 2011
Right | Middle frontal gyrus | Simanova et al., 2014; Jung et al., 2018
Right | Superior frontal gyrus* | Jung et al., 2018
Right | Inferior frontal gyrus | Moore and Price, 1999; Simanova et al., 2014
Appendix 5—table 1
Average decoding scores (for images and captions) by number of vertices.

Scores were calculated from the results of a searchlight analysis with a radius of 10 mm. The accuracy values were grouped into bins according to the number of vertices covered by each searchlight.

# Vertices | Pairwise accuracy (mean)
250 | 51.81%
500 | 52.35%
750 | 55.96%
1000 | 55.49%
1250 | 54.60%
1500 | 52.96%

Additional files


  1. Mitja Nikolaus
  2. Milad Mozafari
  3. Isabelle Berry
  4. Nicholas Asher
  5. Leila Reddy
  6. Rufin VanRullen
(2026)
Modality-agnostic decoding of vision and language from fMRI
eLife 14:RP107933.
https://doi.org/10.7554/eLife.107933.3