Neuroscience

When word order matters: human brains represent sentence meaning differently from large language models

James Fodor
Carsten Murawski
Shinsuke Suzuki

The Centre for Brain, Mind and Markets, Faculty of Business and Economics, The University of Melbourne, Melbourne, Australia
Faculty of Social Data Science, Hitotsubashi University, Tokyo, Japan

https://doi.org/10.7554/eLife.108442.2

Open access

Figures and data

Summary of study methods for constructing stimuli, computing model representations, and collecting fMRI and behavioural data.
a) We constructed 108 handcrafted sentences, designed to enable systematic variation in sentence meaning while controlling for lexical similarity. Here we show the corresponding 108×108 design matrix, colour-coded by the type of each sentence pair. Sentence pairs in the six blocks along the diagonal are the primary pairs of interest in this study. b) All sentences were encoded using each of the four computational models of sentence meaning examined in this study. c) We then computed representational similarity matrices of the 108 stimuli for each of the four models. More similar sentence pairs are shown in blue, and less similar pairs in red. d) Study pipeline for the fMRI experiment, in which participants were presented with one sentence at a time for 2-7 seconds, depending on sentence length. Multiple-choice comprehension questions were interspersed randomly to assess attention. After scanning, the data were processed and brain activity patterns were used to compute a neural representational similarity matrix for each participant. Correlations were then computed between the model and brain RSA matrices. e) Study pipeline for the behavioural experiment, in which online participants were each shown 112 sentence pairs and asked to rate their semantic similarity. Ratings were averaged over participants to compute a similarity matrix. The correlation was then computed between the model and behavioural RSA matrices.
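The core of panels c and d (building a similarity matrix per representation, then correlating model and brain matrices) can be illustrated with a minimal sketch. The caption does not state which similarity measure or correlation coefficient was used, so cosine similarity and Pearson correlation are assumptions here, and all function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise similarity matrix for a list of stimulus vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

def upper_triangle(matrix):
    """Flatten the entries above the diagonal so each pair is counted once."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rsa_correlation(model_vectors, brain_vectors):
    """Correlate the pairwise similarities of two representations."""
    model_sims = upper_triangle(similarity_matrix(model_vectors))
    brain_sims = upper_triangle(similarity_matrix(brain_vectors))
    return pearson(model_sims, brain_sims)

# Toy data (made up): four "sentences" with 3-d model embeddings
# and 3-d voxel patterns. The study itself uses 108 sentences.
toy_model = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3], [0.0, 1.0, 0.5], [0.1, 0.8, 0.9]]
toy_brain = [[0.5, 0.1, 0.0], [0.4, 0.2, 0.1], [0.1, 0.6, 0.3], [0.0, 0.5, 0.6]]

print(round(rsa_correlation(toy_model, toy_brain), 3))
```

With 108 stimuli the flattened upper triangle would contain 108 × 107 / 2 = 5,778 pairwise similarities per matrix.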

Explanation of the process of constructing sentences used in the study.
Added or altered elements in the second sentence in each pair are italicised. The final two columns represent approximate relative similarities intended for each sentence pair type, though there will be variation due to the precise details of each sentence.

Model correlations with brain activity for all sentence pairs and the block-diagonal subset of sentence pairs.
Partial correlations between RSA matrices of five computational models (Random, Mean, Transformer, VerbNet-CN, and AMR-Smatch) and the brain RSA matrix, controlling for differences in sentence length. ‘Human’ refers to behavioural ratings. Blue bars show correlations computed over all stable voxels (excluding visual regions V1-V4), while green bars show correlations for stable voxels in the language network. Notation for statistical significance: * for p<0.05, ** for p<0.01, and *** for p<0.001, with Bonferroni correction for three independent comparisons. a) Partial correlations for each individual participant shown as blue dots, with the simple average over individual correlations shown as a bar. b) Partial correlations computed using the group-averaged RSA matrix. Error bars show 95% confidence intervals calculated by bootstrap resampling over participants. c) Scatterplots showing the relationship between model similarities (horizontal axis) and group-average neural similarities (vertical axis) for all four computational models. Each dot corresponds to a single pairwise similarity, scaled to between 0 and 1 for visualisation. While all sentence pairs are shown for comparison, regression lines (red) are computed over the block-diagonal pairs only.
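As an illustration of what "controlling for differences in sentence length" can mean, the sketch below computes a first-order partial correlation between model and brain similarities, with the covariate taken to be the absolute difference in length for each sentence pair. The exact covariate definition and the use of Pearson correlation are assumptions, not details given in the caption; the function names and toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def length_differences(lengths):
    """Absolute length difference for every sentence pair (upper-triangle order)."""
    n = len(lengths)
    return [abs(lengths[i] - lengths[j]) for i in range(n) for j in range(i + 1, n)]

# Toy data (made up): pairwise similarities for four sentences (six pairs).
model_sims = [0.9, 0.4, 0.2, 0.5, 0.3, 0.8]
brain_sims = [0.30, 0.12, 0.05, 0.15, 0.10, 0.22]
length_diffs = length_differences([5, 7, 6, 10])  # sentence lengths in words

print(round(partial_correlation(model_sims, brain_sims, length_diffs), 3))
```

When the covariate is uncorrelated with both similarity vectors, the partial correlation reduces to the ordinary correlation, which is a quick sanity check on any implementation.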

Average correlations between RSA matrices of each layer of Llama 3 and brain RSA matrix of each participant.

Mean-CN (CN) and VerbNet-CN hybrid (VN) models are also shown for comparison. a) Partial correlations for each individual participant shown as blue dots, with the simple average over individual correlations shown as a bar. b) Partial correlations computed using the group-averaged RSA matrix. Error bars show 95% confidence intervals calculated by bootstrap resampling over participants. c) RSA matrices for the Mean-CN model along with selected layers of the Llama 3 model, computed controlling for differences in sentence length.
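The bootstrap over participants behind the error bars in panel b can be sketched as follows. This is only an illustration under stated assumptions: it uses a percentile interval and a plain Pearson correlation rather than the partial correlation, neither of which is specified in the caption, and the function names, seed, and toy data are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def group_average(vectors):
    """Element-wise mean across participants' similarity vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def bootstrap_ci(model_sims, participant_sims, n_boot=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the model vs group-average correlation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(participant_sims)
    stats = []
    for _ in range(n_boot):
        # Resample participants with replacement, then re-average and re-correlate.
        sample = [participant_sims[rng.randrange(n)] for _ in range(n)]
        stats.append(pearson(model_sims, group_average(sample)))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy data (made up): six pairwise similarities for a model and four participants.
model_sims = [0.9, 0.4, 0.2, 0.5, 0.3, 0.8]
participants = [
    [0.30, 0.12, 0.05, 0.15, 0.10, 0.22],
    [0.25, 0.18, 0.02, 0.20, 0.05, 0.28],
    [0.35, 0.10, 0.08, 0.11, 0.14, 0.20],
    [0.20, 0.15, 0.10, 0.18, 0.07, 0.25],
]

low, high = bootstrap_ci(model_sims, participants)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Resampling whole participants, rather than individual sentence pairs, keeps each participant's similarity vector intact, so the interval reflects variability across people.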

Comparison of sentence representations and model correlations across brain regions.
a) RSA matrices for various cortical regions, computed controlling for differences in sentence length. b) RSA matrices for various cortical regions, computed controlling for differences in sentence length and minimum sentence length. c) Searchlight RSA for the VerbNet-CN model using a radius of 8 mm, showing cortical regions of interest, with those that are part of the language network underlined. RSA correlations are thresholded at z=2. d) Partial correlations controlling for differences in sentence length by cortical region, with each individual participant shown as blue dots, and the simple average over individual correlations shown as a bar. e) Partial correlations controlling for differences in sentence length computed using the group-averaged RSA matrix, shown by cortical region. Error bars show 95% confidence intervals calculated by bootstrap resampling over participants.

Comparison of model correlations across subregions of the language network.
a) Regions within the language network. b) Partial correlations controlling for differences in sentence length, shown by language network region, with each individual participant shown as blue dots, and the simple average over individual correlations shown as a bar. c) Partial correlations controlling for differences in sentence length computed using the group-averaged RSA matrix, shown by language network region. Error bars show 95% confidence intervals calculated by bootstrap resampling over participants.

Behavioural ratings of sentence similarity show similar results to fMRI results, but with higher absolute correlations.
a) (Left) Average correlations between RSA matrices of four computational models and human-rated similarities using all sentence pairs. (Right) Average correlations between RSA matrices of four computational models and human-rated similarities using only diagonal sentence pairs. b) Scatterplots showing the relationship between model similarities (horizontal axis) and human-rated similarities (vertical axis) for all four computational models. Each dot corresponds to a single pairwise similarity, scaled to between 0 and 1 for visualisation. The 45-degree line (black) shows a hypothetical line of perfect fit between model and human similarities.

Summary of models of sentence meaning analysed in this study.
