Figures and data

Panel A: Setup of the main fMRI experiment. Subjects viewed images and captions in random alternation and were instructed to press a button whenever the current stimulus matched the previous one (one-back matching task). Images and captions are for illustration only; actual size and stimuli are as described in the text. Panel B: Number of distinct training stimuli (excluding trials that were one-back targets or during which the subject pressed the response button). The number of overlapping stimuli indicates how many stimuli were presented both as caption and as image. An additional set of 140 stimuli (70 images and 70 captions) was used for testing. Panel C: Setup of the fMRI experiment for the imagery trials. Subjects were instructed to remember 3 image descriptions with corresponding indices (numbers 1 to 3). One of these indices was displayed during the instruction phase, followed by a fixation phase, after which subjects imagined the corresponding visual scene for 10 s.

Training of modality-specific and modality-agnostic decoders.
Panel A: Modality-specific decoders are trained on fMRI data of one modality (e.g., subjects viewing images) by mapping it to features extracted from the same stimuli. Panel B: Modality-agnostic decoders are trained jointly on fMRI data of both modalities (subjects viewing images and captions). Panel C: To train decoders, features can be extracted either unimodally from the corresponding images or captions, or multimodally based on both modalities. For example, to train a modality-agnostic decoder based on features from a unimodal language model, we map the fMRI data of subjects viewing captions to features extracted from the respective captions with this language model, and the fMRI data of subjects viewing images to features extracted by the same language model from the corresponding captions. We can also train modality-specific decoders on features from another modality, for example by mapping fMRI data of subjects viewing images to features extracted from the corresponding captions using a language model (cf. crosses on orange bars in Figure 4) or using multimodal features (cf. crosses on blue bars in Figure 4).
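
A minimal sketch of these two training setups, assuming a linear (ridge) regression from fMRI responses to stimulus features; the regression method, regularization strength, and array names are illustrative assumptions, not necessarily the exact procedure used here:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical arrays for illustration:
#   X_img, X_cap: fMRI responses (n_trials x n_voxels) for image / caption trials
#   Y_img, Y_cap: stimulus features (n_trials x n_features) for those same trials
rng = np.random.default_rng(0)
X_img, X_cap = rng.standard_normal((200, 5000)), rng.standard_normal((200, 5000))
Y_img, Y_cap = rng.standard_normal((200, 768)), rng.standard_normal((200, 768))

# Modality-specific decoder: trained on one modality only (here: image trials).
dec_specific = Ridge(alpha=1.0).fit(X_img, Y_img)

# Modality-agnostic decoder: trained jointly on both modalities.
X_both = np.vstack([X_img, X_cap])
Y_both = np.vstack([Y_img, Y_cap])
dec_agnostic = Ridge(alpha=1.0).fit(X_both, Y_both)

# At test time, either decoder predicts feature vectors from held-out fMRI data.
Y_pred = dec_agnostic.predict(X_img[:10])
```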

Evaluation of modality-specific and modality-agnostic decoders.
The matrices display cosine similarity scores between features extracted from the candidate stimuli and features predicted by the decoder. The evaluation metric is pairwise accuracy, which is calculated row-wise: for a given matrix row, we compare the similarity score of the target stimulus on the diagonal (in green) with the similarity scores of all other candidate stimuli (in red). Panel A: Within-modality decoding metrics of modality-specific decoders. To compute within-modality accuracy for image decoding, a modality-specific decoder trained on images is evaluated on all stimuli that were presented as images. To compute within-modality accuracy for caption decoding, a modality-specific decoder trained on captions is evaluated on all caption stimuli. Panel B: Cross-modality decoding metrics of modality-specific decoders. To compute cross-modality accuracy for image decoding, a modality-specific decoder trained on captions is evaluated on all stimuli that were presented as images. To compute cross-modality accuracy for caption decoding, a modality-specific decoder trained on images is evaluated on all caption stimuli. Panel C: Metrics for modality-agnostic decoders. To compute modality-agnostic accuracy for image decoding, a modality-agnostic decoder is evaluated on all stimuli that were presented as images. The same decoder is evaluated on caption stimuli to compute modality-agnostic accuracy for caption decoding. Here we show feature extraction based on unimodal features for the modality-specific decoders and on multimodal features for the modality-agnostic decoder; in practice, the feature extraction can be unimodal or multimodal for any decoder type (see also Figure 2).
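
A minimal sketch of this row-wise pairwise-accuracy computation, assuming a precomputed similarity matrix with the targets on the diagonal (function and variable names are illustrative):

```python
import numpy as np

def pairwise_accuracy(sim: np.ndarray) -> float:
    """sim[i, j] = cosine similarity between the features predicted for test
    trial i and the features extracted from candidate stimulus j; the target
    of trial i sits on the diagonal (j == i)."""
    n = sim.shape[0]
    wins = 0
    comparisons = 0
    for i in range(n):
        target = sim[i, i]
        others = np.delete(sim[i], i)        # all non-target candidates in the row
        wins += np.sum(target > others)      # pairs in which the target wins
        comparisons += others.size
    return wins / comparisons                # chance level is 0.5

# Example with a random 4 x 4 similarity matrix
rng = np.random.default_rng(0)
print(pairwise_accuracy(rng.uniform(-1, 1, size=(4, 4))))
```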

Average decoding scores for modality-agnostic decoders (green), compared to modality-specific decoders trained on data from subjects viewing images (orange) or on data from subjects viewing captions (purple).
The metric is pairwise accuracy (see also Figure 3). Error bars indicate 95% confidence intervals calculated using bootstrapping. Chance performance is at 0.5.
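
A minimal sketch of such a bootstrap confidence interval, assuming per-comparison accuracy scores and a percentile interval; the number of resamples and the interval construction are illustrative assumptions:

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-comparison accuracy scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    means = [rng.choice(scores, size=scores.size, replace=True).mean()
             for _ in range(n_boot)]
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Example: binary outcomes of pairwise comparisons (1 = target won, 0 = lost)
outcomes = np.random.default_rng(1).integers(0, 2, size=200)
print(bootstrap_ci(outcomes))
```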

Decoding accuracy for decoding images (top) and for decoding captions (bottom).
The orange bars in the top row indicate within-modality decoding scores for images; the purple bars in the bottom row indicate within-modality decoding scores for captions. The purple bars in the top row indicate cross-decoding scores for images; the orange bars in the bottom row indicate cross-decoding scores for captions (see also Figure 3). Error bars indicate 95% confidence intervals calculated using bootstrapping. Chance performance is at 0.5.

Decoding examples for image decoding using a modality-agnostic decoder.
The first column shows the image the subject saw; the five following columns show the candidate stimuli with the highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Decoding examples for caption decoding using a modality-agnostic decoder.
For details see caption of Figure 6. All images were taken from the CoCo dataset (Lin et al., 2014).

Searchlight method to identify modality-invariant ROIs.
The top plots show performance (pairwise accuracy averaged over subjects) of modality-agnostic decoders for decoding images (top left) and decoding captions (top right). The second row displays cross-decoding performance: on the left, modality-specific decoders trained on captions are evaluated on images; on the right, modality-specific decoders trained on images are evaluated on captions. We identified modality-invariant ROIs as clusters in which all four decoding accuracies are above chance by taking the minimum of the respective t-values at each location and then applying TFCE to compute cluster values. The plot only shows left medial views of the brain to illustrate the method; different views of all resulting clusters are shown in Figure 9.
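
A minimal sketch of the conjunction step (minimum t-value across the four decoding maps at each searchlight location), using hypothetical arrays; the subsequent TFCE and permutation-based thresholding would be run on this minimum map with an existing neuroimaging toolbox and are not shown here:

```python
import numpy as np

# Hypothetical one-sample t-maps (n_maps x n_vertices) testing accuracy > chance for:
# modality-agnostic image decoding, modality-agnostic caption decoding,
# cross-decoding of images, and cross-decoding of captions.
rng = np.random.default_rng(0)
t_maps = rng.standard_normal((4, 20484))

# Conjunction: a vertex counts as modality-invariant only if all four tests are
# positive, so we keep the weakest (minimum) t-value at each vertex.
t_min = t_maps.min(axis=0)
```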

Searchlight results for modality-invariant regions.
Maps thresholded at a TFCE value of 1508, which corresponds to the significance threshold of p < 10⁻⁴ based on permutation testing. Regions with the highest cluster values are outlined and annotated based on the Desikan-Killiany atlas (Desikan et al., 2006).

Searchlight results for imagery decoding.
Maps thresholded at a TFCE value of 3897, which corresponds to the significance threshold of p < 10⁻⁴ based on permutation testing. We used the pairwise accuracy for imagery decoding with the large candidate set of 73 stimuli. We outlined the same regions as in Figure 9 to facilitate comparison.

Head motion estimates for each subject.
The plots show the realignment parameters as computed by SPM12 (spm_realign) for estimating within-modality rigid-body alignment. We multiplied the rotation parameters pitch, roll, and yaw (originally in radians) by 50 to allow interpretation in terms of millimeters of displacement on a sphere of radius 50 mm, which is approximately the mean distance from the cerebral cortex to the center of the head (Power et al., 2012). The translucent error bands show 95% confidence intervals calculated using bootstrapping over all frames/runs from a session.
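
The conversion itself is the arc length on a sphere of radius 50 mm; a minimal sketch with a hypothetical parameter array:

```python
import numpy as np

HEAD_RADIUS_MM = 50.0  # approx. distance from cortex to head center (Power et al., 2012)

# Hypothetical realignment parameters (n_frames x 6):
# columns 0-2 are x, y, z translations in mm; columns 3-5 are pitch, roll, yaw in radians.
params = np.zeros((300, 6))
params_mm = params.copy()
params_mm[:, 3:] *= HEAD_RADIUS_MM  # arc length = radius * angle, i.e. radians -> mm
```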

Framewise displacement for each subject.
The measure indicates how much the head changed position from one frame to the next. We calculated framewise displacement as the sum of the absolute values of the derivatives of the six realignment parameters. The translucent error bands show 95% confidence intervals calculated using bootstrapping over all frames/runs from a session.
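
A minimal sketch of this framewise-displacement computation, assuming the realignment parameters have already been converted to millimeters as above and approximating the derivatives by backward differences:

```python
import numpy as np

def framewise_displacement(params_mm: np.ndarray) -> np.ndarray:
    """params_mm: (n_frames, 6) realignment parameters, all in mm.
    FD at frame t is the sum of absolute frame-to-frame differences."""
    diffs = np.diff(params_mm, axis=0)       # backward differences between frames
    fd = np.abs(diffs).sum(axis=1)
    return np.concatenate([[0.0], fd])       # FD of the first frame set to 0
```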

Mutual information between the anatomical scan of the first session and functional data for each session as an indicator of intersession alignment.
All values were normalized based on the mutual information from the functional data from the first session. The raw mutual information scores for the alignment during the first session are: subject 1: 0.49; subject 2: 0.55; subject 3: 0.55; subject 4: 0.53; subject 5: 0.57; subject 6: 0.57.

Feature comparison for vision models.
Pairwise accuracy for modality-agnostic decoders based on vision features extracted by averaging the last hidden states of all patches (“vision_features_mean”) compared to features extracted from the [CLS] token (“vision_features_cls”). The method leading to the best decoding performance for each model is highlighted in bold.
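
A minimal sketch of the two pooling strategies for a Hugging Face ViT-style vision model; the checkpoint name and image path are placeholders, and the vision models actually compared here may differ:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

checkpoint = "google/vit-base-patch16-224-in21k"   # placeholder checkpoint
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

image = Image.open("example.jpg")                   # placeholder image path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (1, 1 + n_patches, d)

vision_features_cls = hidden[:, 0, :]                # [CLS] token
vision_features_mean = hidden[:, 1:, :].mean(dim=1)  # mean over patch tokens
```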

Feature comparison for multimodal models.
The features are based either on the [CLS] tokens of the fused representations (“fused_cls”), on averaging over the fused tokens (“fused_mean”), or on averaging over tokens from intermediate vision- and language-stream outputs (“avg”). In the last case, for some models the vision and language features can in turn be based either on the [CLS] tokens or on averaging over all tokens/patches. The method leading to the best decoding performance for each model is highlighted in bold.

Decoding examples for image decoding using a modality-specific decoder trained on captions (cross-modality decoding).
The first column shows the image the subject saw; the five following columns show the candidate stimuli with the highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Decoding examples for caption decoding using a modality-specific decoder trained on images (cross-modality decoding).
For details see caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Decoding examples for caption decoding using a modality-specific decoder trained on captions (within-modality decoding).
For details see caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Decoding examples for image decoding using a modality-specific decoder trained on images (within-modality decoding).
For details see caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Candidates for modality-invariant regions as identified by previous work.
All these regions were also found in our analysis, except for the 2 regions marked with an asterisk (*).

Average decoding scores (for images and captions) by number of vertices.
Scores were calculated based on the results of a searchlight analysis with a radius of 10 mm. The accuracy values were grouped into bins according to the number of vertices included in the searchlight.

Pairwise accuracy per subject.
For details refer to Figure 5.

Imagery decoding for Subject 1 using a modality-agnostic decoder.
The first column shows the caption that was used to stimulate imagery, with the sketch of the mental image that the subject drew at the end of the experiment displayed above it. The five following columns show the candidate stimuli with the highest similarity to the predicted features, in descending order. We display both the image and the caption of the candidate stimuli because the decoder is based on multimodal features that are extracted from both modalities. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 2 using a modality-agnostic decoder.
For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 3 using a modality-agnostic decoder.
For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 4 using a modality-agnostic decoder.
For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 5 using a modality-agnostic decoder.
For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).

Imagery decoding for Subject 6 using a modality-agnostic decoder.
For further details refer to the caption of Figure 1. All images were taken from the CoCo dataset (Lin et al., 2014).