Panel A: Setup of the main fMRI experiment.

Subjects viewed images and captions in random alternation. Whenever the current stimulus matched the previous stimulus, subjects were instructed to press a button (one-back matching task). Images and captions are for illustration only; actual size and stimuli are as described in the text. Panel B: Number of distinct training stimuli (excluding trials that were one-back targets or during which the subject pressed the response button). An additional set of 140 stimuli (70 images and 70 captions) was used for testing. Panel C: Setup of the fMRI experiment for the imagery trials. Subjects were instructed to remember 3 image descriptions with corresponding indices (numbers 1 to 3). One of these indices was displayed during the instruction phase, followed by a fixation phase, after which the subjects imagined the visual scene for 10 s.

Training of modality-specific and modality-agnostic decoders.

Panel A: Modality-specific decoders are trained on fMRI data from one modality (e.g., subjects viewing images) by mapping it to features extracted from the same stimuli. Panel B: Modality-agnostic decoders are trained jointly on fMRI data from both modalities (subjects viewing images and captions). Panel C: To train decoders, features can either be extracted unimodally from the corresponding images or captions, or as multimodal features based on both modalities. For example, to train a modality-agnostic decoder based on features from a unimodal language model, we map the fMRI data of subjects viewing captions to features extracted from the respective captions using this language model, as well as the fMRI data of subjects viewing images to features extracted by the language model from the corresponding captions. We can also train modality-specific decoders on features from another modality, for example by mapping fMRI data of subjects viewing images to features extracted from the corresponding captions using a language model (cf. crosses on orange bars in Figure 4) or using multimodal features (cf. crosses on blue bars in Figure 4).
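
As an illustration, the sketch below contrasts the two training schemes. It assumes a ridge regression decoder (the regression model is not specified in this caption) and uses hypothetical array and file names.

```python
# Sketch of decoder training, assuming a ridge regression decoder.
# Array names and file names are hypothetical.
import numpy as np
from sklearn.linear_model import Ridge

# fMRI responses (n_trials x n_voxels) for the two presentation modalities
X_images   = np.load("fmri_image_trials.npy")    # hypothetical file
X_captions = np.load("fmri_caption_trials.npy")  # hypothetical file

# Stimulus features (n_trials x n_features), extracted by a language model,
# a vision model, or a multimodal model (see Panel C)
Y_images   = np.load("features_for_image_trials.npy")
Y_captions = np.load("features_for_caption_trials.npy")

# Modality-specific decoder: trained on one modality only (Panel A)
decoder_images = Ridge(alpha=1.0).fit(X_images, Y_images)

# Modality-agnostic decoder: trained jointly on both modalities (Panel B)
X_joint = np.concatenate([X_images, X_captions], axis=0)
Y_joint = np.concatenate([Y_images, Y_captions], axis=0)
decoder_agnostic = Ridge(alpha=1.0).fit(X_joint, Y_joint)
```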

Evaluation of modality-specific and modality-agnostic decoders.

The matrices display cosine similarity scores between features extracted from the candidate stimuli and features predicted by the decoder. The evaluation metric is pairwise accuracy, which is calculated row-wise: For a given matrix row, we compare the similarity score of the target stimulus on the diagonal (in green) with the similarity scores of all other candidate stimuli (in red). Panel A: Within-modality decoding metrics of modality-specific decoders. To compute within-modality accuracy for image decoding, a modality-specific decoder trained on images is evaluated on all stimuli that were presented as images. To compute within-modality accuracy for caption decoding, a modality-specific decoder trained on captions is evaluated on all caption stimuli. Panel B: Cross-modality decoding metrics of modality-specific decoders. To compute cross-modality accuracy for image decoding, a modality-specific decoder trained on captions is evaluated on all stimuli that were presented as images. To compute cross-modality accuracy for caption decoding, a modality-specific decoder trained on images is evaluated on all caption stimuli. Panel C: Metrics for modality-agnostic decoders. To compute modality-agnostic accuracy for image decoding, a modality-agnostic decoder is evaluated on all stimuli that were presented as images. The same decoder is evaluated on caption stimuli to compute modality-agnostic accuracy for caption decoding. Here we show feature extraction based on unimodal features for the modality-specific decoders and on multimodal features for the modality-agnostic decoder; in practice, the feature extraction can be unimodal or multimodal for any decoder type (see also Figure 2).
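
A minimal sketch of this row-wise pairwise accuracy computation is given below; the array names are hypothetical and the actual implementation may differ.

```python
# Pairwise accuracy from predicted and extracted candidate features.
# predicted, candidates: arrays of shape (n_stimuli, n_features), where row i
# of `predicted` is the decoder output for the stimulus described by row i of
# `candidates`.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def pairwise_accuracy(predicted, candidates):
    sim = cosine_similarity(predicted, candidates)  # similarity matrix as in the figure
    diag = np.diag(sim)                             # target scores (green, on the diagonal)
    # For each row, count how many non-target candidates (red) the target beats.
    wins = (diag[:, None] > sim).sum(axis=1)        # the diagonal never "beats" itself
    n = sim.shape[0]
    return wins.sum() / (n * (n - 1))               # fraction of pairs won; chance = 0.5
```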

Average decoding scores for modality-agnostic decoders (bars), compared to modality-specific decoders trained on data from subjects viewing captions (·) or on data from subjects viewing images (×).

The metric is pairwise accuracy (see also Figure 3). Error bars indicate 95% confidence intervals for modality-agnostic decoders. Chance performance is at 0.5.

Decoding accuracy for decoding captions (top) and for decoding images (bottom).

The bars indicate modality-agnostic decoding accuracy. The crosses (×) in the top row and the dots (·) in the bottom row indicate within-modality decoding scores. The dots (·) in the top row indicate cross-decoding scores for images, and the crosses (×) in the bottom row indicate cross-decoding scores for captions (see also Figure 3).

Decoding examples for image decoding using a modality-agnostic decoder.

The first column shows the image the subject saw, and the following 5 columns show the candidate stimuli with the highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Decoding examples for caption decoding using a modality-agnostic decoder.

For details see caption of Figure 6. All images were taken from the CoCo dataset (Lin et al., 2014).

Searchlight method to identify modality-agnostic ROIs.

The top plots show the performance (pairwise accuracy averaged over subjects) of modality-agnostic decoders for decoding images (top left) and decoding captions (top right). In the second row, we display cross-decoding performances: On the left, modality-specific decoders trained on captions are evaluated on images. On the right, modality-specific decoders trained on images are evaluated on captions. We identified modality-agnostic ROIs as clusters in which all 4 decoding accuracies are above chance, by taking the minimum of the respective t-values at each location and then performing TFCE to calculate cluster values. The plot shows only left medial views of the brain to illustrate the method; different views of all resulting clusters are shown in Figure 9.
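
The conjunction step can be sketched as follows, with hypothetical variable names; the subsequent TFCE and permutation steps are omitted here.

```python
# Per-location minimum t-value across the four decoding accuracy maps
# (modality-agnostic and cross-decoding, for images and for captions).
# A location only receives a high value if all four accuracies exceed chance.
import numpy as np
from scipy import stats

def min_t_map(accuracy_maps, chance=0.5):
    """accuracy_maps: list of four arrays, each of shape (n_subjects, n_locations)."""
    t_maps = [stats.ttest_1samp(acc, popmean=chance, axis=0).statistic
              for acc in accuracy_maps]
    return np.minimum.reduce(t_maps)  # minimum over the four t-maps at each location
```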

Searchlight results for modality-agnostic regions.

Maps are thresholded at a TFCE value of 1508, the significance threshold corresponding to p < 10⁻⁴ based on permutation testing. Regions with the highest cluster values are outlined and annotated based on the Desikan-Killiany atlas (Desikan et al., 2006).

Searchlight results for imagery decoding.

Maps are thresholded at a TFCE value of 3897, the significance threshold corresponding to p < 10⁻⁴ based on permutation testing. We used pairwise accuracy for imagery decoding with the large candidate set of 73 stimuli. We outline the same regions as in Figure 9 to facilitate comparison.

Feature comparison for vision models.

Pairwise accuracy for modality-agnostic decoders based on vision features extracted by averaging the last hidden states of all patches (“vision_features_mean”) compared to features extracted from the [CLS] token (“vision_features_cls”). The method leading to the best decoding performance for each model is highlighted in bold.
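
For illustration, both variants can be extracted as sketched below for a generic ViT-style model from Hugging Face transformers; the model name and image file are assumed examples, not necessarily those used in the table.

```python
# Two ways of turning ViT hidden states into a single feature vector per image.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_name = "google/vit-base-patch16-224-in21k"   # assumed example model
processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

image = Image.open("example.jpg").convert("RGB")   # hypothetical image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state     # (1, 1 + n_patches, dim)

vision_features_cls  = hidden[:, 0]                # [CLS] token
vision_features_mean = hidden[:, 1:].mean(dim=1)   # mean over all patch tokens
```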

Feature comparison for multimodal models.

The features are based either on the [CLS] tokens of the fused representations (“fused_cls”), on averaging over the fused tokens (“fused_mean”), or on averaging over tokens from the intermediate vision and language stream outputs (“avg”). In the last case, for some models the vision and language features can themselves be based either on the [CLS] token or on averaging over all tokens/patches. The method leading to the best decoding performance for each model is highlighted in bold.
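
A generic sketch of the three variants is given below, with placeholder tensors standing in for model-specific outputs; in particular, how the pooled vision and language streams are combined for “avg” (concatenation in this sketch) is an assumption.

```python
# Placeholder hidden states; shapes are illustrative only.
import torch

fused_tokens    = torch.randn(1, 40, 768)   # fused multimodal token states
vision_tokens   = torch.randn(1, 197, 768)  # intermediate vision-stream outputs
language_tokens = torch.randn(1, 20, 768)   # intermediate language-stream outputs

fused_cls  = fused_tokens[:, 0]             # [CLS] token of the fused representation
fused_mean = fused_tokens.mean(dim=1)       # mean over all fused tokens

# "avg": pool each stream (here by averaging over its tokens/patches; a
# [CLS]-based variant would take index 0 of each stream instead), then combine.
avg = torch.cat([vision_tokens.mean(dim=1), language_tokens.mean(dim=1)], dim=-1)
```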

Decoding examples for image decoding using a modality-specific decoder trained on images (within-modality decoding).

The first column shows the image the subject saw, and the following 5 columns show the candidate stimuli with the highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Decoding examples for caption decoding using a modality-specific decoder trained on images (cross-modality decoding).

For details see the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Decoding examples for caption decoding using a modality-specific decoder trained on captions (within-modality decoding).

For details see the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Decoding examples for image decoding using a modality-specific decoder trained on captions (cross-modality decoding).

For details see the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Candidates for modality-agnostic regions as identified by previous work.

All of these regions were also found in our analysis, except for the 2 regions marked with an asterisk (*).

Average decoding scores (for images and captions) by number of vertices.

Scores were calculated based on the results of a searchlight analysis with a radius of 10 mm. The accuracy values were grouped into bins according to the number of vertices included in each searchlight.
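
A minimal sketch of this binning, with hypothetical arrays and an assumed bin width of 10 vertices:

```python
import numpy as np

accuracies = np.load("searchlight_accuracies.npy")   # (n_locations,), hypothetical
n_vertices = np.load("searchlight_n_vertices.npy")   # (n_locations,), hypothetical

bin_edges = np.arange(0, n_vertices.max() + 10, 10)  # assumed bin width of 10 vertices
bin_idx = np.digitize(n_vertices, bin_edges)
mean_per_bin = {left_edge: accuracies[bin_idx == i].mean()
                for i, left_edge in enumerate(bin_edges, start=1)
                if np.any(bin_idx == i)}
```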

Pairwise accuracy per subject.

For details refer to Figure 5.

Imagery decoding for Subject 1 using a modality-agnostic decoder.

The first column shows the caption that was used to stimulate imagery and, above it, the sketch of the mental image that the subject drew at the end of the experiment. The following 5 columns show the candidate stimuli with the highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 2 using a modality-agnostic decoder.

For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 3 using a modality-agnostic decoder.

For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 4 using a modality-agnostic decoder.

For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 5 using a modality-agnostic decoder.

For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 6 using a modality-agnostic decoder.

For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).