Figures and data

Experiment and performance.
(A) Experimental procedure for Experiments 1-3. In each trial, participants saw a question before reading a passage. After reading the passage, they chose the answer to the question from 4 options. (B) Accuracy of question answering for humans and computational models. Transformer_p and transformer_f denote pre-trained and fine-tuned transformer-based language models, respectively. The question type is color coded and an example question is shown for each type. (C) Time spent on reading each passage. The box plot shows the mean (horizontal lines inside the box), 25th and 75th percentiles (box boundaries), and 25th/75th percentiles ± 1.5 × interquartile range (whiskers). (D) Illustration of the training process for transformer-based models. The pre-training process aims to learn general statistical regularities in a language based on large corpora, while the fine-tuning process trains models to perform the reading comprehension task.

Human attention distribution and computational models.
(A) Examples of human attention distribution, quantified by the word reading time. The histograms on the right show the mean reading time on each line, for both human data and model predictions. (B) The general architecture of the 12-layer transformer-based models. The model input consists of all words in the passage and an integrated option. The output of the model relies on the node CLS_12, which is used to calculate a score reflecting how likely an option is to be the correct answer. The CLS node is a weighted sum of the vectorial representations of all words and tokens, and the attention weight for each word in the passage, i.e., α, is the DNN attention analyzed in this study.
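
The weighted sum described in (B) can be illustrated with a minimal sketch. The caption only states that the CLS node is a weighted sum with attention weights α; the scaled dot-product scoring and softmax normalization below are assumptions taken from the standard transformer formulation, and the function names `softmax` and `attention_weighted_sum` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weighted_sum(query, keys, values):
    """Return (alpha, output) for a single attention head.

    Assumption: scores are scaled dot products between the CLS query and
    each token key, as in a standard transformer. `alpha` holds one
    attention weight per token; `output` is the alpha-weighted sum of the
    token value vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    alpha = softmax(scores)
    dim = len(values[0])
    output = [sum(a * v[i] for a, v in zip(alpha, values)) for i in range(dim)]
    return alpha, output
```

In this sketch, `alpha` plays the role of the per-word attention weight that is compared with human word reading time.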

Model word reading time in Experiment 1.
(A, B) Prediction of the word reading time based on the attention weights of DNN models, text features, or question relevance. The predictive accuracy is the correlation coefficient between the predicted word reading time and the actual word reading time. Prediction accuracy significantly higher than chance is denoted by stars above each bar. **P < 0.01. (C) Relationship between the word reading time and line index. The word reading time is longer near the beginning of a passage and the effect is stronger for global questions than for local questions. (D) Relationship between the word reading time and question relevance. Line 0 refers to the line with the highest question relevance. The word reading time is longer for the question-relevant line. Color indicates the question type. The shaded area indicates one standard error of the mean (SEM) across participants.

Factors influencing attention distribution in different processing stages for humans and DNNs.
(A) Human attention in early and late reading stages is differentially modulated by text features and question relevance. The early and late stages are separately characterized by gaze duration, i.e., duration for the first reading of a word, and counts of rereading, respectively. **P < 0.01; ***P < 0.001. (B) DNN attention weights in different layers are also differentially modulated by text features and question relevance. Each attention head is separately modeled and averaged within each layer, and the results are further averaged across the 3 transformer-based models. Shallow layers of both fine-tuned and pre-trained models are more sensitive to text features. Deep layers of fine-tuned models are sensitive to question relevance.

Influence of task and language proficiency on word reading time.
(A, B) Prediction of the word reading time using attention weights of DNN models, text features, and question relevance for all 4 experiments. Prediction accuracy significantly higher than chance is marked by stars of the same color as the bar. Significant differences between experiments are denoted by black stars. *P < 0.05; **P < 0.01; ***P < 0.001. (C, D) Passage beginning and question relevance effects for all 4 experiments. The shaded area indicates one SEM across participants.

Illustration of the word-level heuristic models and the RNN-based SAR model.
(A) The orthographic and semantic models calculate the word-wise similarities between all words in the integrated option and all words in the passage, forming a similarity matrix. The similarity measures used in the orthographic and semantic models are the edit distance and the correlation between word embeddings, respectively. For each option, the similarity matrix is averaged across all rows and all columns to form a scalar decision score. The option with the largest decision score is chosen as the answer. (B) The SAR model uses bi-directional RNNs to encode contextual information. A vectorial representation for the passage is created using the weighted sum of the vectorial representation of each word, and the weight on each word, i.e., the attention weight, is calculated based on its similarity to the vectorial representation of the question. The summarized passage representation and the option representation are used to form the decision score with a bilinear dot layer.
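
The orthographic heuristic in (A) can be sketched as follows. The caption does not say how the edit distance is turned into a similarity, so the normalization `1 - distance / longest word length` is an assumption, and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def orthographic_similarity(a, b):
    """Assumed mapping from edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def decision_score(option_words, passage_words):
    """Average the option-by-passage similarity matrix into one scalar."""
    matrix = [[orthographic_similarity(o, p) for p in passage_words]
              for o in option_words]
    total = sum(sum(row) for row in matrix)
    return total / (len(option_words) * len(passage_words))


def choose_option(options, passage_words):
    """Return the index of the option with the largest decision score."""
    scores = [decision_score(opt, passage_words) for opt in options]
    return max(range(len(scores)), key=scores.__getitem__)
```

The semantic model would follow the same structure, with `orthographic_similarity` replaced by a correlation between word embeddings.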

Question answering accuracy for individual transformer-based models. Human results and other computational models are also plotted for comparison.

Transformer-based models can explain word reading time even when the influences of text features and question relevance are regressed out.
(A) Prediction of the raw word reading time using the attention weights of individual transformer-based models. Results from other computational models are also plotted for comparison. (B) Prediction of the residual word reading time when basic text features, i.e., layout and word features, are regressed out. (C) Prediction of the residual word reading time when both basic text features and question relevance are regressed out. Prediction accuracy significantly higher than chance is denoted by stars of the same color as the bar. **P < 0.01.
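
The idea of regressing out a feature before testing a predictor can be sketched as below. This is a deliberately simplified illustration with a single regressor and ordinary least squares; the study's actual regression setup (multiple features, multiple attention heads) is not specified in this caption, and the function names are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def residualize(y, x):
    """Remove the least-squares linear contribution of x from y.

    Simplification: one regressor only. The returned residuals are
    uncorrelated with x by construction.
    """
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]
```

Under these assumptions, `pearson(residualize(reading_time, text_feature), attention)` measures how much reading-time variance the attention weights capture beyond the regressed-out feature.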

Properties of the question relevance of words.
(A) Question relevance as a function of the word position within a passage. For global but not local questions, question-relevant words concentrate near the beginning of a passage. (B) Decay of mean question relevance across lines. The question relevance is averaged within each line, and all lines in a passage are sorted by mean question relevance in descending order. Therefore, line 1 is the line with the highest question relevance, and line 2 is the line with the 2nd highest question relevance. For both global and local questions, the mean question relevance sharply decreases over lines. (C) The mean word length, in number of letters, for words with question relevance greater or smaller than 0.1. Words of higher relevance are generally longer. (D) Percentage of content words among words with higher or lower question relevance. Question-relevant words are more often content words.

Passage beginning effects (A) and question relevance effects (B) in early and late reading stages. The passage beginning effect differs between global and local questions mainly in the late reading stage, as reflected by the counts of rereading. The question relevance effect is also only reliably observed in the late reading stage.

Factors influencing attention weights in each layer of DNNs for local questions.
Similar results are observed for all 3 models: The sensitivity to text features decreases from shallow to deep layers, while the sensitivity to question relevance increases across layers.

Factors influencing attention weights in each layer of DNNs for global questions.
Similar results are observed for all 3 models: The sensitivity to text features decreases from shallow to deep layers, while the sensitivity to question relevance increases across layers.

Passage-beginning and question-relevance effects in 4 experiments.
The passage beginning effect was quantified by the ratio between the mean word reading time on the first 3 lines of a passage and the mean word reading time on other lines. The question relevance effect was quantified by the ratio between the mean word reading time on the line that was most relevant to the question and the mean word reading time on lines that were more than 5 lines away. See Fig. 1 for the explanation of the box plots. **P < 0.01; ***P < 0.001.
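
The two ratios defined above can be computed as in the following minimal sketch. It assumes that each passage is given as a list of lines, each line being a list of word reading times, and that one relevance value per line is available; the data layout and function names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


def passage_beginning_effect(lines, n_first=3):
    """Mean word reading time on the first lines divided by that on the rest.

    `lines` is a list of lines; each line is a list of word reading times.
    Words are pooled across lines before averaging.
    """
    first = [t for line in lines[:n_first] for t in line]
    rest = [t for line in lines[n_first:] for t in line]
    return mean(first) / mean(rest)


def question_relevance_effect(lines, line_relevance, min_distance=5):
    """Mean word reading time on the most relevant line divided by that on
    lines more than `min_distance` lines away from it."""
    best = max(range(len(line_relevance)), key=line_relevance.__getitem__)
    far = [t for i, line in enumerate(lines)
           if abs(i - best) > min_distance for t in line]
    return mean(lines[best]) / mean(far)
```

A ratio above 1 indicates longer reading times on the first lines or on the most question-relevant line, respectively.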

P-values for the model prediction of word reading time.

P-values for the prediction of word reading time using text or task-related features.

P-values for the prediction of early and late eye tracking measures using text or task-related features.

P-values for the prediction of word reading time for all 4 experiments.

P-values for the comparisons between experiments.