Combining a vision DNN with an LLM improves the prediction of neural responses to visual stimulation.

A, Extraction of vision DNN and LLM representations for each of the 16,740 stimulus images. To obtain the vision DNN representations, we fed each image to the vision DNN and extracted its activations. To obtain the LLM representations, for each image we generated five text descriptions using GPT-4V, independently fed these descriptions to the LLM, and averaged the resulting five embedding instances. B, Scatterplot of pairwise cosine similarities between pairs of stimulus representations of either the vision DNN or the LLM. The gray line indicates the linear fit between the vision DNN and LLM pairwise similarities. Inset images are illustrative examples of image pair outliers. C, Encoding model training pipeline. We trained encoding models to predict empirically recorded EEG responses based on representations from vision DNNs, LLMs, and their combination. This resulted in three types of encoding models: vision, language, and fusion. D, Encoding model testing pipeline. We used the trained encoding models to predict EEG responses to the test stimulus images, and compared (Pearson’s r) these predictions to the corresponding empirically recorded EEG responses, resulting in prediction accuracy time courses. E, Prediction accuracy (Pearson’s r) time course for the vision, language, and fusion encoding models. The prediction accuracies are averaged across all participants and EEG channels. In gray is the area between the noise ceiling lower and upper bounds. F, Difference in prediction accuracy between the fusion and the vision or language encoding model. E-F, The black dashed vertical lines indicate the onset of stimulus presentation, and the black dashed horizontal lines indicate the chance level of no experimental effect. Rows of asterisks at the bottom of the plots indicate significant time points (one-sided t-test, p < 0.05, FDR corrected across 180 time points, N = 10 participants). G, Partial correlations between the recorded EEG test responses and the predicted EEG test responses from the fusion encoding model, controlling for the variance explained by the predicted EEG test responses from either the language encoding model (thus isolating the unique contribution of the vision DNN), or from the vision encoding model (thus isolating the unique contribution of the LLM). H, EEG topography of partial correlation results, indicating the unique prediction accuracy contribution of the vision DNN. I, EEG topography of partial correlation results, indicating the unique prediction accuracy contribution of the LLM. H-I, The highlighted black dots indicate significant channels (one-sided t-test, p < 0.05, FDR-corrected across 63 channels and 180 time points, N = 10 participants).
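
For readers who want to follow the general logic of panels C-D, the sketch below shows one way to fit and evaluate a linear encoding model in Python. It is a minimal sketch, not the authors' exact implementation: the regression variant (ordinary least squares here, rather than a regularized alternative), the array names, and the shapes are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_and_score(feats_train, eeg_train, feats_test, eeg_test):
    """Fit a linear encoding model and score it with Pearson's r.

    feats_*: stimulus representations, shape (n_images, n_features)
    eeg_*:   EEG responses, shape (n_images, n_channels, n_times)
    Returns an (n_channels, n_times) array of prediction accuracies.
    """
    n_test, n_channels, n_times = eeg_test.shape
    # One linear mapping from stimulus features to all channel x time responses
    reg = LinearRegression().fit(feats_train, eeg_train.reshape(len(feats_train), -1))
    eeg_pred = reg.predict(feats_test).reshape(n_test, n_channels, n_times)
    # Pearson's r between predicted and recorded responses, per channel and time point
    r = np.empty((n_channels, n_times))
    for c in range(n_channels):
        for t in range(n_times):
            r[c, t] = np.corrcoef(eeg_pred[:, c, t], eeg_test[:, c, t])[0, 1]
    return r

# Fusion model: concatenate the vision DNN features with the caption-averaged LLM
# embeddings (hypothetical arrays vision_feats and llm_feats):
# fusion_feats = np.concatenate([vision_feats, llm_feats], axis=1)
```

Feeding the vision DNN features, the averaged LLM embeddings, or their concatenation through the same pipeline yields the vision, language, and fusion encoding models, respectively.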

Factors determining the prediction performance of the fusion encoding model.

A, Scatterplot of pairwise cosine similarities between pairs of stimulus representations of either the vision DNN or the LLM. The stimuli were the 25 stimulus images that most benefited from the fusion compared to the vision encoding model in terms of EEG prediction accuracy (color-coded in red), and the 25 stimulus images that least benefited (color-coded in blue). The red and blue lines indicate the linear fit between the vision DNN and LLM pairwise similarities for the 25 most benefiting and 25 least benefiting images, respectively. B, Visualization of the 25 stimulus images that most or least benefited from the fusion compared to the vision encoding model in terms of EEG prediction accuracy. C, Prediction accuracy (Pearson’s r) time course for the vision encoding model, and the fusion encoding models trained on full descriptions, object category labels, and descriptions randomly assigned to stimulus images. D, Prediction accuracy improvement over the vision encoding model for fusion encoding models trained on full descriptions, object category labels, and image descriptions randomly assigned to stimulus images. E, Prediction accuracy time course for the vision encoding model, and the fusion encoding models trained on full descriptions and on parts of speech (nouns, adjectives, and verbs from the full descriptions). F, Prediction accuracy improvement over the vision encoding model for fusion encoding models trained on full descriptions, nouns, adjectives, and verbs. C-F, The black dashed vertical lines indicate the onset of stimulus presentation, and the black dashed horizontal lines indicate the chance level of no experimental effect. Rows of asterisks at the bottom of the plots indicate significant time points (one-sided t-test, p < 0.05, FDR corrected across 180 time points, N = 10 participants). In gray is the area between the noise ceiling lower and upper bounds. G, Comparison of stimulus image object category labels (both human-annotated and DNN-generated) and nouns from the full image descriptions (excluding the nouns that overlap with the object category labels), for two illustrative examples.
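
As a companion to the similarity scatterplots (panel A here and Fig. 1B), the snippet below sketches how pairwise cosine similarities between stimulus representations, and the linear fit between the vision DNN and LLM similarities, could be computed. The feature arrays and their shapes are placeholder assumptions, not the study's actual representations.

```python
import numpy as np

def pairwise_cosine(feats):
    """Cosine similarity for every unordered pair of stimulus representations.

    feats: array of shape (n_stimuli, n_features).
    """
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    return sim[np.triu_indices(len(feats), k=1)]  # keep each pair once

# Placeholder random features standing in for the real vision DNN and LLM representations
rng = np.random.default_rng(0)
vision_feats = rng.standard_normal((200, 512))
llm_feats = rng.standard_normal((200, 768))

vis_sim = pairwise_cosine(vision_feats)
llm_sim = pairwise_cosine(llm_feats)
slope, intercept = np.polyfit(vis_sim, llm_sim, deg=1)  # linear fit, as in the scatterplots
```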

Model comparisons and generalizability analysis.

A, Prediction accuracy time course for the fusion encoding model, and the encoding models trained on representations from multimodal DNNs (CLIP and VisualBERT). B, Difference in prediction accuracy between the fusion encoding model, and the encoding models trained on representations from multimodal DNNs (CLIP and VisualBERT). C, Prediction accuracy time course for fusion encoding models trained using different vision DNNs. D, Prediction accuracy improvement over the vision encoding model for fusion encoding models trained using different vision DNNs. E, Prediction accuracy time course for fusion encoding models trained using different LLMs. F, Prediction accuracy improvement over the vision encoding model for fusion encoding models trained using different LLMs. A-F, The black dashed vertical lines indicate the onset of stimulus presentation, and the black dashed horizontal lines indicate the chance level of no experimental effect. Rows of asterisks at the bottom of the plots indicate significant time points (one-sided t-test, p < 0.05, FDR corrected across 180 time points, N = 10 participants). In gray is the area between the noise ceiling lower and upper bounds.

Time-frequency resolved analysis and results.

A, We decomposed the EEG responses into the time-frequency domain using Morlet wavelets. B-D, Prediction accuracy (Pearson’s r) of the EEG time-frequency data of the vision (B), language (C), and fusion (D) encoding models. E, Difference in prediction accuracy between the fusion and the language encoding models, which isolated the effect of the vision DNN. F, Difference in prediction accuracy between the fusion and the vision encoding models, which isolated the effect of the LLM. B-F, The prediction accuracies are averaged across all participants and EEG channels. The gray dashed lines indicate latency and frequency peaks of prediction accuracy with 95% confidence intervals. Cyan contour lines delineate clusters of significant effects (one-sided t-test, p < 0.05, FDR-corrected across 30 frequency points and 180 time points, N = 10 participants). G, EEG topography of partial correlation results, indicating the unique prediction accuracy contribution of the vision DNN. H, EEG topography of partial correlation results, indicating the unique prediction accuracy contribution of the LLM. G-H, The highlighted black dots indicate significant channels (one-sided t-test, p < 0.05, FDR-corrected across 63 channels and 180 time points, N = 10 participants).
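
A minimal sketch of such a Morlet wavelet decomposition using MNE-Python's tfr_array_morlet is shown below; the sampling rate, epoch length, and cycles-per-frequency values are illustrative assumptions, with 30 frequency points chosen to match the statistics reported above.

```python
import numpy as np
from mne.time_frequency import tfr_array_morlet

# Placeholder epochs standing in for the recorded EEG: (n_images, n_channels, n_times)
rng = np.random.default_rng(0)
eeg_epochs = rng.standard_normal((200, 63, 250))

sfreq = 250.0                 # assumed sampling rate (Hz)
freqs = np.arange(1, 31)      # 30 frequency points
n_cycles = freqs / 2.0        # assumed cycles per frequency

power = tfr_array_morlet(eeg_epochs, sfreq=sfreq, freqs=freqs,
                         n_cycles=n_cycles, output='power')
# power has shape (n_images, n_channels, n_freqs, n_times); encoding models can then
# be fit and evaluated independently at each frequency x time point.
```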

Encoding models’ single-participant prediction accuracy (correlation)

A, Prediction accuracy (Pearson’s r) time courses for the vision, language, and fusion encoding models at the single-participant level. In gray is the area between the noise ceiling lower and upper bounds. B, Difference in prediction accuracy between the fusion and the vision or language encoding model at the single-participant level. A-B, The black dashed vertical lines indicate the onset of stimulus presentation, and the black dashed horizontal lines indicate the chance level of no experimental effect.

Encoding models’ prediction accuracy (pairwise decoding)

The rationale of this analysis was to test whether a classifier trained on the recorded EEG responses generalizes to the predicted EEG responses from the encoding models. This is a complementary way (to the correlation analysis, see Fig. 1E-F, Suppl. Fig. 1) to assess the similarity between the recorded and predicted EEG responses, and hence the encoding models’ predictive power. We performed the pairwise decoding analysis using the recorded and predicted EEG responses for the 200 test images. We started by averaging 40 recorded EEG image condition repetitions (we used the other 40 repetitions to estimate the noise ceiling; see the “Noise ceiling calculation” paragraph in the Methods section) into 10 pseudo-trials of 4 repeats each. Next, we used the pseudo-trials to train linear support vector machines (SVMs) to perform binary classification between each pair of the 200 recorded EEG image conditions using their EEG channel activity. We then tested the trained classifiers on the corresponding pairs of predicted EEG image conditions. We performed the pairwise decoding analysis independently for each EEG time point, and then averaged decoding accuracy scores across image condition pairs, obtaining a time course of decoding accuracies. A, Pairwise decoding accuracy time course between the recorded and predicted EEG responses from the vision, language, and fusion encoding models. The decoding accuracy using the vision encoding model peaks at 105 ms (100-110 ms); the decoding accuracy using the language encoding model peaks at 190 ms (110-200 ms); the decoding accuracy using the fusion encoding model peaks at 105 ms (100-110 ms). There is no peak-to-peak latency difference between the fusion and vision encoding models (0 ms (0-5 ms), p > 0.05). The peak-to-peak latency difference between the fusion and language encoding models is 85 ms (0-95 ms, p = 0.06). B, Difference in decoding accuracy between the fusion encoding model and the vision or language encoding models. The difference between the fusion and vision encoding models peaks at 365 ms (210-365 ms); the difference between the fusion and language encoding models peaks at 95 ms (90-100 ms). Their peak-to-peak latency difference is 270 ms (115-270 ms, p < 10⁻⁴). A-B, Colored dots below plots indicate significant points (one-sided t-test, p < 0.05, FDR-corrected for 180 time points, N = 10 participants). C, Decoding accuracy time course for the vision, language, and fusion encoding models at the single-participant level. D, Difference in decoding accuracy between the fusion encoding model and the vision or language encoding models at the single-participant level. A and C, In gray is the area between the noise ceiling lower and upper bounds. A-D, The decoding accuracies are baseline-corrected by subtracting the 50% chance level. The black dashed vertical lines indicate the onset of stimulus presentation, and the black dashed horizontal lines indicate the chance level of no experimental effect.
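
The sketch below illustrates the pseudo-trial averaging and pairwise SVM decoding logic described above for a single time point; it is a simplified illustration under assumed array shapes, not the exact analysis code.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def make_pseudo_trials(recorded, n_pseudo=10):
    """Average groups of repetitions into pseudo-trials.

    recorded: (n_conditions, n_repetitions, n_channels, n_times)
    returns:  (n_conditions, n_pseudo, n_channels, n_times)
    """
    splits = np.array_split(np.arange(recorded.shape[1]), n_pseudo)
    return np.stack([recorded[:, idx].mean(axis=1) for idx in splits], axis=1)

def pairwise_decoding(pseudo, predicted, t):
    """Train linear SVMs on recorded pseudo-trials, test on predicted responses.

    pseudo:    (n_conditions, n_pseudo, n_channels, n_times) recorded pseudo-trials
    predicted: (n_conditions, n_channels, n_times) encoding-model predictions
    t:         time point index
    returns:   decoding accuracy averaged across condition pairs
    """
    n_cond, n_pseudo = pseudo.shape[:2]
    accuracies = []
    for i, j in combinations(range(n_cond), 2):
        X_train = np.concatenate([pseudo[i, :, :, t], pseudo[j, :, :, t]])
        y_train = np.concatenate([np.zeros(n_pseudo), np.ones(n_pseudo)])
        clf = SVC(kernel='linear').fit(X_train, y_train)
        X_test = np.stack([predicted[i, :, t], predicted[j, :, t]])
        accuracies.append(clf.score(X_test, [0, 1]))
    return float(np.mean(accuracies))
```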

Encoding models’ prediction accuracy (partial correlation)

Partial correlation time courses between the recorded EEG test responses and the predicted EEG test responses from the fusion encoding model, controlling for the variance explained by the predicted EEG test responses from either the language encoding model (thus isolating the unique prediction accuracy contribution of the vision DNN) or the vision encoding model (thus isolating the unique prediction accuracy contribution of the LLM). The black dashed vertical lines indicate the onset of stimulus presentation, and the black dashed horizontal lines indicate the chance level of no experimental effect. In gray is the area between the noise ceiling lower and upper bounds. Colored dots below plots indicate significant points (one-sided t-test, p < 0.05, FDR-corrected for 180 time points, N = 10 participants).
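
One common way to compute such a partial correlation is to regress the control model's predictions out of both the recorded responses and the fusion model's predictions, and then correlate the residuals. The sketch below follows that residualization approach; it is a plausible implementation, not necessarily the authors' exact procedure.

```python
import numpy as np

def partial_corr(recorded, fusion_pred, control_pred):
    """Partial correlation between recorded and fusion-predicted responses,
    controlling for the predictions of the control (vision or language) model.

    All inputs are 1D arrays over test images, for one channel and time point.
    """
    def residualize(a, covar):
        design = np.column_stack([np.ones_like(covar), covar])
        beta, *_ = np.linalg.lstsq(design, a, rcond=None)
        return a - design @ beta
    res_recorded = residualize(recorded, control_pred)
    res_fusion = residualize(fusion_pred, control_pred)
    return np.corrcoef(res_recorded, res_fusion)[0, 1]
```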

Encoding models’ prediction accuracy (variance partitioning)

We used variance partitioning to assess the variance of EEG responses uniquely predicted by either the vision DNN or the LLM. We trained the vision, language, and fusion encoding models, computed their explained variance (R²) using the test split, and adjusted the resulting explained variance scores using the formula adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1), where R² is the unadjusted coefficient of determination, n is the number of test split samples, and k is the number of predictors in the model. To compute the unique variance explained by the vision encoding model we subtracted the adjusted R² of the language encoding model from the adjusted R² of the fusion encoding model. Similarly, to compute the unique variance explained by the language encoding model we subtracted the adjusted R² of the vision encoding model from the adjusted R² of the fusion encoding model. A, Variance explained by the vision, language, and fusion encoding models. B, Unique variance explained by the vision and language encoding models. A-B, The black dashed vertical lines indicate the onset of stimulus presentation, and the black dashed horizontal lines indicate the chance level of no experimental effect. Colored dots below plots indicate significant points (one-sided t-test, p < 0.05, FDR-corrected for 180 time points, N = 10 participants).
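
A minimal sketch of the adjusted-R² computation and the unique-variance subtraction described above is given below; the function and variable names are illustrative, and the predictor counts k would correspond to the number of features in each encoding model.

```python
import numpy as np
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, k):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1),
    with n test samples and k predictors."""
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Unique variance explained by the vision DNN (for one channel and time point),
# using hypothetical arrays eeg_test, pred_fusion, pred_language:
# unique_vision = adjusted_r2(eeg_test, pred_fusion, k_fusion) \
#               - adjusted_r2(eeg_test, pred_language, k_language)
# The unique variance explained by the LLM is computed analogously, subtracting the
# vision model's adjusted R² from the fusion model's adjusted R².
```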

Topography plots of encoding models’ prediction accuracy (correlation)

A-C, Prediction accuracy (Pearson’s r) topoplots over time, averaged across participants. A, Vision encoding model. B, Language encoding model. C, Fusion encoding model. D, Difference in prediction accuracy between the fusion and the language encoding models, which isolated the effect of the vision DNN. E, Difference in prediction accuracy between the fusion and the vision encoding models, which isolated the effect of the LLM. A-E, The highlighted black dots indicate significant channels (one-sided t-test, p < 0.05, FDR-corrected across 180 time points, N = 10 participants).
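
For context, topographies like these can be drawn by mapping one value per EEG channel onto a standard montage; the sketch below uses MNE-Python's plot_topomap with a 64-channel BioSemi montage as a stand-in for the study's 63-channel layout, and random placeholder values instead of real prediction accuracies.

```python
import numpy as np
import mne
import matplotlib.pyplot as plt

# Standard 64-channel montage standing in for the actual 63-channel EEG layout
montage = mne.channels.make_standard_montage('biosemi64')
info = mne.create_info(montage.ch_names, sfreq=100.0, ch_types='eeg')
info.set_montage(montage)

# Placeholder per-channel prediction accuracies at one time point
rng = np.random.default_rng(0)
accuracy = rng.uniform(0.0, 0.3, size=len(montage.ch_names))

fig, ax = plt.subplots()
mne.viz.plot_topomap(accuracy, info, axes=ax, show=False)
ax.set_title("Prediction accuracy (Pearson's r)")
plt.show()
```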

Topography plots of encoding models’ prediction accuracy (variance partitioning)

A, Variance of the recorded EEG responses explained by the vision encoding model. B, Variance of the recorded EEG responses explained by the language encoding model. C, Unique variance of the recorded EEG responses explained by the fusion encoding model over the language encoding model (thus isolating the effect of the vision DNN). D, Unique variance of the recorded EEG responses explained by the fusion encoding model over the vision encoding model (thus isolating the effect of the LLM). A-D, Results are averaged across participants. The black dashed vertical lines indicate the onset of stimulus presentation, and the black dashed horizontal lines indicate the chance level of no experimental effect. Colored dots below plots indicate significant points (one-sided t-test, p < 0.05, FDR-corrected for 180 time points, N = 10 participants). The highlighted black dots indicate significant channels (one-sided t-test, p < 0.05, FDR-corrected across 63 channels and 180 time points, N = 10 participants).

Prediction accuracy of fusion encoding models trained using different linguistic input

A, Difference in prediction accuracy (Pearson’s r) between the fusion encoding model trained on full image descriptions, and the fusion encoding models trained on object category labels, nouns, adjectives, verbs, and image descriptions randomly assigned to stimulus images. B, Difference in prediction accuracy between the fusion encoding model trained on the nouns from the full image descriptions and the fusion encoding model trained on object category labels. A-B, The black dashed vertical lines indicate the onset of stimulus presentation, and the black dashed horizontal lines indicate the chance level of no experimental effect. Rows of asterisks at the bottom of the plots indicate significant time points (one-sided t-test, p < 0.05, FDR corrected across 180 time points, N = 10 participants).

Time-frequency EEG prediction accuracy (partial correlation)

A, Partial correlation between the recorded EEG test responses and the predicted EEG test responses from the fusion encoding model, controlling for the variance explained by the predicted EEG test responses from the language encoding model (thus isolating the effect of the vision DNN). B, Partial correlation between the recorded EEG test responses and the predicted EEG test responses from the fusion encoding model, controlling for the variance explained by the predicted EEG test responses from the vision encoding model (thus isolating the effect of the LLM). A-B, The dashed gray lines indicate latency and frequency peaks of prediction accuracy. Cyan contour lines delineate significant effects (one-sided t-test, p < 0.05, FDR corrected across all time and frequency points, N = 10 participants).

Time-frequency EEG prediction accuracy (variance partitioning)

A, Variance of the recorded EEG responses explained by the vision encoding model. B, Variance of the recorded EEG responses explained by the language encoding model. C, Unique variance of the recorded EEG responses explained by the fusion encoding model over the language encoding model (thus isolating the effect of the vision DNN). D, Unique variance of the recorded EEG responses explained by the fusion encoding model over the vision encoding model (thus isolating the effect of the LLM). A-D, The dashed gray lines indicate latency and frequency peaks of prediction accuracy. Cyan contour lines delineate significant effects (one-sided t-test, p < 0.05, FDR corrected across all time and frequency points, N = 10 participants).

Time-frequency EEG prediction accuracy (partial correlation; unique contribution of the vision DNN)

EEG topography of partial correlation results, indicating the unique prediction accuracy contribution of the vision DNN. The highlighted black dots indicate significant channels (one-sided t-test, p < 0.05, FDR-corrected across 63 channels and 180 time points, N = 10 participants).

Time-frequency EEG prediction accuracy (partial correlation; unique contribution of the LLM)

EEG topography of partial correlation results, indicating the unique prediction accuracy contribution of the LLM. The highlighted black dots indicate significant channels (one-sided t-test, p < 0.05, FDR-corrected across 63 channels and 180 time points, N = 10 participants).

Time-frequency EEG prediction accuracy (variance partitioning; unique variance explained by the vision DNN)

EEG topography of variance partitioning results, indicating the variance uniquely explained by the vision DNN. We performed the variance partitioning analysis using the same rationale detailed in Suppl. Fig. 4. The highlighted black dots indicate significant channels (one-sided t-test, p < 0.05, FDR-corrected across 63 channels and 180 time points, N = 10 participants).

Time-frequency EEG prediction accuracy (variance partitioning; unique variance explained by the LLM)

EEG topography of variance partitioning results, indicating the variance uniquely explained by the LLM. We performed the variance partitioning analysis using the same rationale detailed in Suppl. Fig. 4. The highlighted black dots indicate significant channels (one-sided t-test, p < 0.05, FDR-corrected across 63 channels and 180 time points, N = 10 participants).

Time-frequency EEG prediction accuracy (subtraction; improvement of the vision DNN over the LLM)

EEG topography of the difference between the fusion and language encoding models’ prediction accuracies, indicating the improvement of the vision DNN over the LLM. The highlighted black dots indicate significant channels (one-sided t-test, p < 0.05, FDR-corrected across 63 channels and 180 time points, N = 10 participants).

Time-frequency EEG prediction accuracy (subtraction; improvement of the LLM over the vision DNN)

EEG topography of the difference between the fusion and vision encoding models’ prediction accuracies, indicating the improvement of the LLM over the vision DNN. The highlighted black dots indicate significant channels (one-sided t-test, p < 0.05, FDR-corrected across 63 channels and 180 time points, N = 10 participants).

Bootstrapped onset and peak latency

Bootstrapped onset and peak latency comparison across different encoding model types

Bootstrapped peak latency and peak frequency of prediction accuracy in the EEG time-frequency domain