Human attention during goal-directed reading comprehension relies on task optimization

  1. Jiajie Zou
  2. Yuran Zhang
  3. Jialu Li
  4. Xing Tian
  5. Nai Ding  Is a corresponding author
  1. Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Sciences, Zhejiang University, China
  2. Nanhu Brain-computer Interface Institute, China
  3. Division of Arts and Sciences, New York University Shanghai, China

Abstract

The computational principles underlying attention allocation in complex goal-directed tasks remain elusive. Goal-directed reading, that is, reading a passage to answer a question in mind, is a common real-world task that strongly engages attention. Here, we investigate what computational models can explain attention distribution in this complex task. We show that the reading time on each word is predicted by the attention weights in transformer-based deep neural networks (DNNs) optimized to perform the same reading task. Eye tracking further reveals that readers separately attend to basic text features and question-relevant information during first-pass reading and rereading, respectively. Similarly, text features and question relevance separately modulate attention weights in shallow and deep DNN layers. Furthermore, when readers scan a passage without a question in mind, their reading time is predicted by DNNs optimized for a word prediction task. Therefore, we offer a computational account of how task optimization modulates attention distribution during real-world reading.

eLife assessment

This study provides a valuable contribution to the study of eye-movements in reading, revealing that attention-weights from a deep neural network show a statistically reliable fit to the word-level reading patterns of humans. Its evidence is convincing and strengthens a line of research arguing that attention in reading reflects task optimization. The work would be of interest to psychologists, neuroscientists, and machine learning researchers.

https://doi.org/10.7554/eLife.87197.3.sa0

Introduction

Attention profoundly influences information processing in the brain (Posner and Petersen, 1990; Treisman and Gelade, 1980; Rayner, 1998), and a large number of studies have been devoted to studying the neural mechanisms of attention. From the perspective of David Marr, the attention mechanism can be studied from three levels, that is, the computational, algorithmic, and implementational levels (Marr, 1982). At the computational level, attention is traditionally viewed as a mechanism to allocate limited central processing resources (Kahneman, 1973; Franconeri et al., 2013; Lennie, 2003; Carrasco, 2011; Borji and Itti, 2012). More recent studies, however, propose that attention is a mechanism to optimize task performance, even in conditions where the processing resource is not clearly constrained (Dayan et al., 2000; Gottlieb et al., 2014; Legge et al., 2002; Liu and Reichle, 2010; Najemnik and Geisler, 2005). The optimization hypothesis can explain the attention distribution in a range of well-controlled learning and decision-making tasks (Najemnik and Geisler, 2005; Navalpakkam et al., 2010), but is rarely tested in complex processing tasks for which the optimal strategy is not obvious. Therefore, the computational principles that underlie the allocation of human attention during complex tasks remain elusive. Nevertheless, complex tasks are critical conditions to test whether the attention mechanisms abstracted from simpler tasks can truly explain real-world attention behaviors.

Reading is one of the most common and most sophisticated human behaviors (Li et al., 2022; Gagl et al., 2022), and it is strongly regulated by attention: Since readers can only recognize a couple of words within one fixation, they have to overtly shift their fixation to read a line of text (Rayner, 1998). Thus, eye movements serve as an overt expression of attention allocation during reading (Rayner, 1998; Clifton et al., 2016). Computational modeling of the eye movements has mostly focused on normal reading of single sentences. At the computational level, it has been proposed that the eye movements are programmed to, for example, minimize the number of eye movements (Legge et al., 2002). At the algorithmic and implementational level, models such as the E-Z reader (Reichle et al., 2003) can accurately predict the eye movement trajectory with high temporal and spatial resolution. Everyday reading behavior, however, often engages reading of a multiline passage and generally has a clear goal, for example, information retrieval or inference generation (White et al., 2010). Few models, however, have considered how the reading goal modulates reading behaviors. Here, we address this question by analyzing how readers allocate attention when reading a passage to answer a specific question in mind. The question may require, for example, information retrieval, inference generation, or text summarization (Figure 1). We investigate whether the task optimization hypothesis can explain the attention distribution in such goal-directed reading tasks.

Figure 1 with 2 supplements see all
Experiment and performance.

(A) Experimental procedure for Experiments 1–3. In each trial, participants saw a question before reading a passage. After reading the passage, they chose the answer to the question from four options. (B) Accuracy of question answering for humans and computational models. The question type is color coded and an example question is shown for each type. trans_pre: pre-trained transformer-based models; trans_fine: transformer-based models fine-tuned on the goal-directed reading task. (C) Time spent on reading each passage. The box plot shows the mean (horizontal lines inside the box), 25th and 75th percentiles (box boundaries), and 25th/75th percentiles ±1.5× interquartile range (whiskers) across participants (N = 25). (D) Illustration of the training process for transformer-based models. The pre-training process aims to learn general statistical regularities in a language based on large corpora, while the fine-tuning process trains models to perform the reading comprehension task.

Finding an optimal solution for the goal-directed reading task, however, is computationally challenging since the information related to question answering is sparsely located in a passage and their orthographic forms may not be predictable. Recent advances in DNN models, however, provide a potential tool to solve this computational problem since DNN models equipped with attention mechanisms have approached and even surpassed mean human performance on goal-directed reading tasks (Lan et al., 2020; Liu et al., 2019). Attention in DNN also functions as a mechanism to selectively extract useful information, and therefore, attention may potentially serve a conceptually similar role in DNN. Furthermore, recent studies have provided strong evidence that task-optimized DNN can indeed explain the neural response properties in a range of visual and language processing tasks (Yamins et al., 2014; Kell et al., 2018; Goldstein et al., 2022; Schrimpf et al., 2021; Hasson et al., 2020; Donhauser and Baillet, 2020; Rabovsky et al., 2018; Heilbron et al., 2022). Therefore, although the DNN attention mechanism certainly deviates from the human attention mechanism in terms of its algorithms and implementation, we employ it to probe the computational-level principle underlying human attention distribution during real-world goal-directed reading.

Here, we investigated what computational principles could generate human-like attention distribution during a goal-directed reading task. We employed DNNs to derive a set of attention weights that are optimized for the goal-directed reading task and tested whether such optimal weights could explain human attention measured by eye tracking. Furthermore, since both human and DNN processing is hierarchical, we also investigated whether the human attention distribution during different processing stages, which are characterized through different eye-tracking measures, and the DNN attention weights in different layers may be differentially influenced by visual features, text properties, and the top-down task. Additionally, we recruited both native and non-native readers to probe how language proficiency contributed to the computational optimality of attention distribution.

Results

Experiment 1: Task and performance

In Experiment 1, the participants (N = 25 for each question) first read a question and then read a passage based on which the question should be answered (Figure 1A). After reading the passage, the participants chose from four options which option was the most suitable answer to the question. In total, 800 question/passage pairs were adapted from the RACE dataset (Lai et al., 2017), a collection of English reading comprehension questions designed for Chinese high school students who learn English as a second language. The questions fell into six types (Figure 1B and C): three types of questions required attention to details, for example, retrieving a fact or generate inference based on a fact, which were referred to as local questions. The other three types of questions concerned the general understanding of a passage, for example, summarizing the main idea or identifying the purpose of writing, which were referred to as global questions. None of the question directly appeared in the passage, and the longest string that overlapped in the passage and question was 1.8 ± 1.5 words on average.

Participants in Experiment 1 were Chinese college or graduate students who had relatively high English proficiency. The participants correctly answered 77.94% questions on average and the accuracy was comparable across the six types of questions (Figure 1B). We employed computational models to analyze what kinds of computations were required to answer the questions. The simplest heuristic model chose the option that best matched the passage orthographically (Figure 1—figure supplement 1A). This orthographic model achieved 25.6% accuracy (Figure 1B). Another simple heuristic model only considered word-level semantic matching between the passage and option, and achieved 27.3% accuracy (Figure 1B). The low accuracy of the two models indicated that the reading comprehension questions could not be answered by word-level orthographic or semantic matching.

Next, we evaluated the performance of four context-dependent DNN models, that is, Stanford Attentive Reader (SAR) (Chen et al., 2016), BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020), and RoBERTa (Liu et al., 2019), which could integrate information across words to build passage-level semantic representations. The SAR used the bidirectional recurrent neural network (RNN) to integrate contextual information (Figure 1—figure supplement 1B) and achieved 47.6% accuracy. The other three models, that is, BERT, ALBERT, and RoBERTa, were transformer-based models that were trained in two steps, that is, pre-training and fine-tuning (Figure 1D). Since the three models had similar structures, we averaged the performance over the three models (see Figure 1—figure supplement 2 for the results of individual models). The model performance on the reading task was 37.08 and 73%, respectively, after pre-training and fine-tuning (Figure 1B).

Computational models of human attention distribution

In Experiment 1, participants were allowed to read each passage for 2 min. Nevertheless, to encourage the participants to develop an effective reading strategy, the monetary reward the participant received decreased as they spent more time reading the passage (see ‘Materials and methods’ for details). The results showed that the participants spent, on average, 0.7 ± 0.2 min reading each passage (Figure 1C), corresponding to a reading speed of 457 ± 142 words/min when divided by the number of words per passage. The speed was almost twice the normal reading speed for native readers (Rayner, 1998), indicating a specialized reading strategy for the task.

Next, we employed eye tracking to quantify how the readers allocated their attention to achieve effective reading and analyze which computational models could explain the reading time on each word, that is, the total fixation duration on each word during passage reading. In other words, we probed into what kind of computational principles could generate human-like attention distribution during goal-directed reading. A simple heuristic strategy was to attend to words that were orthographically or semantically similar to the words in the question (Figure 1—figure supplement 1A). The predictions of the heuristic models were not highly correlated with the human word reading time, and the predictive power, that is, the Pearson correlation coefficient between the predicted and real word reading time, was around 0.2 (Figure 3—figure supplement 1A).

The DNN models analyzed here, that is, SAR, BERT, ALBERT, and RoBERTa, all employed the attention mechanism to integrate over context to find optimal question answering strategies. Roughly speaking, the attention mechanism applied a weighted integration across all input words to generate a passage-level representation and decide whether an option was correct or not, and the weight on each word was referred to as the attention weight (see Figure 1—figure supplement 1B and Figure 2B for illustrations about the attention mechanisms in the SAR and transformer-based models, respectively). When the attention weights of the SAR were used to predict the human word reading time, the predictive power was about 0.1 (Figure 3A, Supplementary file 1a).

Human attention distribution and computational models.

(A) Examples of human attention distribution, quantified by the word reading time. The histograms on the right showed the mean reading time on each line for both human data and model predictions. trans_pre: pre-trained transformer-based models; trans_fine: transformer-based models fine-tuned on the goal-directed reading task. (B) The general architecture of the 12-layer transformer-based models. The model input consists of all words in the passage and an integrated option. Output of the model relies on the node CLS (Legge et al., 2002), which is used to calculate a score reflecting how likely an option is the correct answer. The CLS node is a weighted sum of the vectorial representations of all words and tokens, and the attention weight for each word in the passage, that is, α, is the deep neural network (DNN) attention analyzed in this study.

Figure 3 with 3 supplements see all
Model word reading time in Experiment 1.

(A, B) Predict the word reading time based on the attention weights of deep neural network (DNN) models, text features, or question relevance. The predictive power is the correlation coefficient between the predicted word reading time and the actual word reading time. Predictive power significantly higher than chance is denoted by stars on the top of each bar. **p<0.01. trans_rand: transformer-base models with randomized parameters; trans_pre: pre-trained transformer-based models; trans_fine: transformer-based models fine-tuned on the goal-directed reading task. (C) Relationship between the word reading time and line index. The word reading time is longer near the beginning of a passage and the effect is stronger for global questions than local questions. (D) Relationship between the word reading time and question relevance. Line 0 refers to the line with the highest question relevance. The word reading time is higher for the question-relevant line. Color indicates the question type. The shade area indicates 1 standard error of the mean (SEM) across participants (N = 25).

In contrast to assigning a single weight on a word, the transformer-based model employed a multihead attention mechanism: Each of the 12 layers had 12 parallel attention modules, that is, heads. Consequently, each word had 144 attention weights (12 layers × 12 heads), which were used to model the word reading time of humans based on linear regression. Since the attention weights of three transformer-based models showed comparable power to predict human word reading time, we reported the predictive power averaged over models (see Figure 3—figure supplement 1A for the results of individual models). The attention weights of randomly initialized transformer-based models could predict the human word reading time and the predictive power, which was around 0.3, was significantly higher than the chance level and the SAR (Figure 3A, Supplementary file 1a). The attention weights of pre-trained transformer-based models could also predict the human word reading time, the predictive power was around 0.5, significantly higher than the predictive power of heuristic models, the SAR, and randomly initialized transformer-based models (Figure 3A, Supplementary file 1a). The predictive power was further boosted for local but not global questions when the models were fine-tuned to perform the goal-directed reading task (Figure 3A, Supplementary file 1a). The weights assigned to attention heads in the linear regression are shown in Figure 3—figure supplement 2. For the fine-tuned models, we also predict the human word reading time using an unweighted averaged of the 144 attention heads and the predictive power was 0.3, significantly higher than that achieved by the attention weights of SAR (p=4 × 10–5, bootstrap). These results suggested that the human attention distribution was consistent with the attention weights in transformer-based models that were optimized to perform the same goal-directed reading task.

Factors influencing human word reading time

The attention weights in transformer-based DNN models could predict the human word reading time. Nevertheless, it remained unclear whether such predictions were purely driven by basic text features that were known to modulate word reading time. Therefore, in the following, we first analyzed how basic text features modulated the word reading time during the goal-directed reading task, and then checked whether transformer-based DNNs could capture additional properties of the word reading time that could not be explained by basic text features.

Here, we further decomposed text features into visual layout features, that is, position of a word on the screen, and word features, for example, word length, frequency, and surprisal. Layout features were features that were mostly induced by line changes, which could be extracted without recognizing the words, while word features were finer-grained features that could only be extracted when the word or neighboring words were fixated. Linear regression analyses revealed layout features could significantly predict the word reading time (Figure 3B, Supplementary file 1b). Furthermore, the predictive power was higher for global than local questions (p=4 × 10–5, bootstrap, false discovery rate [FDR] corrected for comparisons across three features, i.e., layout features, word features, and question relevance), suggesting a question-type-specific reading strategy. Word features could also significantly predict human reading time, even when the influence of layout features was regressed out. Additionally, a linear mixed effect model revealed significant fixed effects for question type and all text/task-related features, as well as significant interactions between question type and these text/task-related features (Supplementary file 1c; Pinheiro and Bates, 2006; Kuznetsova et al., 2017).

The predictive power of the layout and word features, however, was lower than the predictive power of attention weights of transformer-based models (p=4 × 10–5, bootstrap, FDR corrected for comparisons across two features, i.e., layout and word features). When the layout and word features were regressed out, the residual word reading time was still significantly predicted by the attention weights in transformer-based models (Figure 3—figure supplement 1B, predictive power about 0.3). This result indicated that what the transformer-based models extracted were more than basic text features. Next, we analyzed whether the transformer-based models, as well as the human word reading time, were sensitive to task-related features. To characterize the relevance of each word to the question answering task, we asked another group of participants to annotate which words contributed most to question answering. The annotated question relevance could significantly predict word reading time, even when the influences of layout and word features were regressed out (Figure 3B, Supplementary file 1b). When the question relevance was also regressed out, the residual word reading time was still significantly predicted by the attention weights in transformer-based models (Figure 3—figure supplement 1C, p=0.003, bootstrap, FDR corrected for comparisons across 12 models × 6 question types), but the predictive power dropped to about 0.2. Furthermore, a linear mixed effect model also revealed that more than 85% of the DNN attention heads contribute to the prediction of human reading time when considering text features and question relevance as covariates (Supplementary file 1c). These results demonstrated that the DNN attention weights provided additional information about the human word reading time than the text-related and task-related features analyzed here.

Further analyses revealed two properties of the distribution of question-relevant words. First, for local questions, the question-relevant words were roughly uniformly distributed in the passage, while for global questions, the question-relevant words tended to be near the passage beginning (Figure 3—figure supplement 3A). The eye-tracking data showed that readers also spent more time reading the passage beginning for global than local questions (Figure 3C), explaining why layout features more strongly influenced the answering of global than local questions. Second, few lines in the passage were question relevant (Figure 3—figure supplement 3B), and the eye-tracking data showed that readers spent more time reading the line with the highest question relevance (Figure 3D), confirming the influence of question relevance on word reading time.

Attention in different processing stages for humans and DNNs

Next, we investigated whether humans and DNNs attended to different features in different processing stages. The early stage of human reading was indexed by the gaze duration, that is, duration of first-pass reading of a word, and the later stage was indexed by the counts of rereading. Results showed the influence of layout features increased from early to late reading stages for global but not local questions (Figure 4A, Supplementary file 1d). Consequently, the passage beginning effect differed between global and local questions only for the late reading stage (Figure 4—figure supplement 1A). The influence of word features did not strongly change between reading stages, while the influence of question relevance significantly increased from early to late reading stages (Figure 4A, Figure 4—figure supplement 1B). These results suggested that attention to basic text features developed early, while the influence of task mainly influenced late reading processes.

Figure 4 with 3 supplements see all
Factors influencing attention distribution in different processing stages for humans and deep neural networks (DNNs).

(A) Human attention in early and late reading stages is differentially modulated by text features and question relevance. The early and late stages are separately characterized by gaze duration, that is, duration for the first reading of a word, and counts of rereading, respectively. **p<0.01; ***p<0.001. (B) DNN attention weights in different layers are also differentially modulated by text features and question relevance. Each attention head is separately modeled and averaged within each layer, and the results are further averaged across the three transformer-based models. Shallow layers of both fine-tuned and pre-trained models are more sensitive to text features. Deep layers of fine-tuned models are sensitive to question relevance. trans_rand: transformer-base models with randomized parameters; trans_pre: pre-trained transformer-based models; trans_fine: transformer-based models fine-tuned on the goal-directed reading task.

In the following, we further investigated whether transformer-based DNN attended to different features in different layers, which represented different processing stages. This analysis did not include layout features that were not available to the models. The attention weights in shallow layers were sensitive to word features in randomized, pre-trained, and fine-tuned models (Figure 4B and C). Only in the fine-tuned models, however, the attention weights in deep layers were sensitive to question relevance (see Figure 4—figure supplements 2 and 3 for results of individual models). Therefore, the shallow and deep layers separately evolved text-based and goal-directed attention, and goal-directed attention was induced by fine-tuning on the task.

Experiment 2: Question type specificity of the reading strategy

In Experiment 1, different types of questions were presented in blocks which encouraged the participants to develop question type-specific reading strategies. Next, we ran Experiment 2, in which questions from different types were mixed and presented in a randomized order, to test whether the participants developed question type-specific strategies in Experiment 1. Since it was time consuming to measure the response to all 800 questions, we randomly selected 96 questions for Experiment 2 (16 questions per type). In Experiment 2, the reading speed was on average 298 ± 123 words/min, lower than the speed in Experiment 1 (p=6 × 10–4, bootstrap, FDR corrected for the comparisons across four experiments), but still much faster than normal reading speed (Rayner, 1998).

The word reading time was better predicted by fine-tuned than pre-trained transformer-based models (Figure 5A, Supplementary file 1e). For the influence of text and task-related features, compared to Experiment 1, the predictive power in Experiment 2 was higher for layout and word features, but lower for question relevance (Figure 5B, Supplementary file 1f). For local questions, consistent with Experiment 1, the effects of question relevance significantly increased from early to late processing stages that are separately indexed by gaze duration and counts of rereading (Figure 5—figure supplement 1A, Supplementary file 1d). The passage beginning effect was higher for global than local questions (Figure 5C, second column, p=2 × 10–4, bootstrap, FDR corrected for the comparisons across four experiments), but the difference was smaller than in Experiment 1 (Figure 5C, Figure 5—figure supplement 2A, p=2 × 10–4, bootstrap, FDR corrected for the comparisons across four experiments). The question relevance effect was also smaller in Experiment 2 than Experiment 1 (Figure 5D, Figure 5—figure supplement 2B, p=2 × 10–4, bootstrap, FDR corrected for the comparisons across four experiments). All these results indicated that the readers developed question type-specific strategies in Experiment 1, which led to faster reading speed and stronger task modulation of word reading time.

Figure 5 with 2 supplements see all
Influence of task and language proficiency on word reading time.

(A, B) Predict the word reading time using attention weights of deep neural network (DNN) models, text features, and question relevance for all four experiments. Predictive power significantly higher than chance is marked by stars of the same color as the bar. Significant differences between experiments are denoted by black stars. trans_pre: pre-trained transformer-based models; trans_fine: transformer-based models fine-tuned on the goal-directed reading task. *p<0.05; **p<0.01; ***p<0.001. (C, D) Passage beginning and question relevance effects for all four experiments. The shade area indicates 1 SEM across participants (N = 25 for Exp 1; N = 20 for Exps 2-4).

Experiment 3: Effect of language proficiency

Experiments 1 and 2 recruited L2 readers. To investigate how language proficiency influenced task modulation of attention and the optimality of attention distribution, we ran Experiment 3, which was the same as Experiment 2 except that the participants were native English readers. In Experiment 3, the reading speed was on average 506 ± 155 words/min, higher than that in Experiment 2 (p=6 × 10–4, bootstrap, FDR corrected for the comparisons across four experiments). The question answering accuracy was comparable to L2 readers (Figure 1B).

The word reading time for native readers was slightly better predicted by fine-tuned than pre-trained transformer-based models (Figure 5A, Supplementary file 1e). For the influence of text and task-related features, compared to Experiment 2, the predictive power in Experiment 3 was higher for word features, but lower for layout features and question relevance (Supplementary file 1f). For local questions, the layout effect was more salient for gaze duration than for counts of rereading. In contrast, the effect of word-related features and task relevance was more salient for counts of rereading than gaze duration (Figure 5—figure supplement 1B, Supplementary file 1d). The passage beginning effect was higher for global than local questions, but the difference was smaller than in Experiment 2 (Figure 5C, Figure 5—figure supplement 2A, p = 2 × 10–4, bootstrap, FDR corrected for the comparisons across four experiments). The question relevance effect was also smaller for Experiment 3 than Experiment 2 (Figure 5D, Figure 5—figure supplement 2B, p=2 × 10–4, bootstrap, FDR corrected for the comparisons across four experiments). These results showed that the word reading time of native readers was significantly modulated by the task, but the effect was weaker than that on L2 readers.

Experiment 4: General-purpose reading

In the goal-directed reading task, participants read a passage to answer a question that they knew in advance, and the eye-tracking results revealed that participants spent more time reading question-relevant words. Question-relevant words, however, were generally longer content words (Figure 3—figure supplement 3C and D) that were often associated with longer reading time even without a task (Rayner, 1998). Therefore, to validate the question relevance effect, we ran Experiment 4 in which the participants read the passages without knowing the question to answer. The experiment used the same 96 questions as in Experiments 2 and 3, but adopted a different experimental procedure: participants previewed a passage before reading the question and were allowed to read the passage again to answer the question. We analyzed the reading pattern during passage preview, which was referred to as general-purpose reading.

The participants were given 1.5 min to preview the passage, and the reading speed was on average 225 ± 40 words/min, lower than that in Experiments 1–3 (p=6 × 10–4, bootstrap, FDR corrected for comparisons across four experiments). Before question answering, they were given another 0.5 min to reread the passage, but on average they spent only 0.04 min on rereading it. During passage preview, the word reading time was similarly predicted by the pre-trained and fine-tuned transformer-based models (Figure 5A, Supplementary file 1e). Furthermore, the word reading time was significantly predicted by layout and word features, but not question relevance (Figure 5B, Supplementary file 1e). Both the early and late processing stages of human reading were significantly affected by layout and word features, and the effects were larger for the late processing stage indexed by counts of rereading (Figure 5—figure supplement 1C, Supplementary file 1d). The passage beginning effect was not significantly different between local and global questions (Figure 5C, fourth column, p=0.994, bootstrap, FDR corrected for comparisons across four experiments), and the question relevance effect was significantly smaller than the question relevance effect in Experiments 1–3 (Figure 5D, Figure 5—figure supplement 2B, p=2 × 10–4, bootstrap, FDR corrected for comparisons across four experiments). These results confirmed that the question relevance effects observed during goal-directed reading were indeed task dependent.

Discussion

Attention is a crucial mechanism to regulate information processing in the brain, and it has been hypothesized that a common computational role of attention is to optimize task performance. Previous support for the hypothesis mostly comes from tasks for which the optimal strategy can be easily derived. The current study, however, considers a real-world reading task in which the participants have to actively sample a passage to answer a question that cannot be answered by simple word-level orthographic or semantic matching. In this challenging task, it is demonstrated that human attention distribution can be explained by the attention weights in transformer-based DNN models that are optimized to perform the same reading task but blind to the human eye-tracking data. Furthermore, when participants scan a passage without knowing the question to answer, their attention distribution can also be explained by transformer-based DNN models that are optimized to predict a word based on the context.

Furthermore, we demonstrate that both humans and transformer-based DNN models achieve task-optimal attention distribution in multiple steps: For humans, basic text features strongly modulate the duration of the first reading of a word, while the question relevance of a word only modulates how many times the word is reread, especially for high-proficiency L2 readers compared to native readers. Similarly, the DNN models do not yield a single attention distribution, and instead they generate multiple attention distributions, that is, heads, for each layer. Here, we demonstrate that basic text features mainly modulate the attention weights in shallow layers, while the question relevance of a word modulates the attention weights in deep layers, reflecting hierarchical control of attention to optimize task performance. The attention weights in both the shallow and deep layers of DNN contribute to the explanation of human word reading time (Figure 3—figure supplement 2).

Computational models of attention

A large number of computational models of attention have been proposed. According to Marr’s three levels of analysis (Marr, 1982), some models investigate the computational goal of attention (Dayan et al., 2000; Legge et al., 2002) and some models provide an algorithmic implementation of how different factors modulate attention (Reichle et al., 2003; Itti et al., 1998). Computationally, it has been hypothesized that attention can be interpreted as a mechanism to optimize learning and decision-making, and empirical evidence has been provided that the brain allocates attention among different information sources to optimally reduce the uncertainty of a decision (Dayan et al., 2000; Gottlieb et al., 2014; Legge et al., 2002). The current study provides critical support to this hypothesis in a real-world task that engages multiple forms of attention, for example, attention to visual layout features, attention to word features, and attention to question-relevant information. These different forms of attention, which separately modulate different eye-tracking measures (Figure 4A), jointly achieve an attention distribution that is optimal for question answering.

The transformer-based DNN models analyzed here are optimized in two steps, that is, pre-training and fine-tuning. The results show that pre-training leads to text-based attention that can well explain general-purpose reading in Experiment 4, while the fine-tuning process leads to goal-directed attention in Experiments 1–3 (Figures 4B and 5A). Pre-training is also achieved through task optimization, and the pre-training task used in all the three models analyzed here is to predict a word based on the context. The purpose of the word prediction task is to let models learn the general statistical regularity in a language based on large corpora, which is crucial for model performance on downstream tasks (Lan et al., 2020; Liu et al., 2019; Devlin et al., 2019), and this process can naturally introduce the sensitivity to word surprisal, that is, how unpredictable a word is given the context. Previous eye-tracking studies have suggested that the predictability of words, that is, surprisal, can modulate reading time (Hale, 2016), and neuroscientific studies have also indicated that the cortical responses to language converge with the representations in pre-trained DNN models (Goldstein et al., 2022; Schrimpf et al., 2021). The results here further demonstrate that the DNN optimized for the word prediction task can evolve attention properties consistent with the human reading process. Additionally, the tokenization process in DNN can also contribute to the similarity between human and DNN attention distributions: DNN first separates words into tokens (e.g., ‘tokenization’ is separated into ‘token’ and ‘ization’). Tokens are units that are learned based on co-occurrence of letters and is not strictly linked to any linguistically defined units. Since longer words tend to be separated into more tokens, that is, fragments of frequently co-occurred letters, longer words receive more attention even if the model pay uniform attention to each of its input, that is, a token.

A separate class of models investigates which factors shape human attention distribution. A large number of models are proposed to predict bottom-up visual saliency (Tatler et al., 2011; Borji et al., 2013), and recently DNN models are also employed to model top-down visual attention. It is shown that, through either implicit (Anderson et al., 2018; Xu et al., 2015) or explicit training (Das et al., 2017), DNNs can predict which parts of a picture relate to a verbal phrase, a task similar to goal-directed visual search (Wolfe and Horowitz, 2017). The current study distinguishes from these studies in that the DNN model is not trained to predict human attention. Instead, the DNN models naturally generate human-like attention distribution when trained to perform the same task that humans perform, suggesting that task optimization is a potential cause for human attention distribution during reading.

Models for human reading and human attention to question-relevant information

How human readers allocate attention during reading is an extensively studied topic, mostly based on studies that instruct readers to read a sentence in a normal manner, not aimed to extract a specific kind of information (Clifton et al., 2016). Previous eye-tracking studies have shown that the readers fixate longer upon, for example, longer words, words of lower-frequency, words that are less predictable based on the context, and words at the beginning of a line (Rayner, 1998). A number of models, for example, the E-Z reader (Reichle et al., 2003) and SWIFT (Engbert et al., 2005), have been proposed to predict the eye movements during reading based on basic oculomotor properties or lexical processing (Reichle et al., 2003). Some models also view reading as an optimization process that minimizes the time or the number of saccades required to read a sentence (Legge et al., 2002; Liu and Reichle, 2010). These models can generate fine-grained predictions, for example, which letter in a word will be fixated first, for the reading of simple sentences, but have only been occasionally tested for complex sentences or multiline texts (Mancheva et al., 2015) or to characterize different reading tasks, for example, z-string reading and visual searching (Reichle et al., 2012).

When readers read a passage to answer a question that can be answered using a word-matching strategy (Hermann et al., 2015), a recent study has demonstrated that the specific reading goal modulates the word reading time and the effect can be modeled using an RNN model (Hahn and Keller, 2023). Here, we focus on questions that cannot be answered using a word-matching strategy (Figure 1B) and demonstrate that, for these challenging questions, attention is still modulated by the reading goal but the attention modulation cannot be explained by a word-matching model (Figure 3—figure supplement 1). Instead, the attention effect is better captured by transformer models than an advanced RNN model, that is, the SAR (Figure 3A). Combining the current study and the study by Hahn and Keller, 2023, it is possible that the word reading time during a general-purpose reading task can be explained by a word prediction task, the word reading time during a simple goal-directed reading task that can be solved by word matching can be modeled by an RNN model, while the word reading time during a more complex goal-directed reading task involving inference is better modeled using a transformer model. The current study also further demonstrates that elongated reading time on task-relevant words is caused by counts of rereading and further studies are required to establish whether earlier eye movement measures can be modulated by, for example, a word-matching task. In addition, future studies can potentially integrate classic eye movement models with DNNs to explain the dynamic eye movement trajectory, possibly with a letter-based spatial resolution.

When human readers read a passage with a particular goal or perspective, previous studies have revealed inconsistent results about whether the readers spent more time reading task-relevant sentences (Yeari et al., 2015; Grabe, 1979; Kaakinen et al., 2002). To explain the inconsistent results, it has been proposed that the question relevance effect weakens for readers with a higher working memory and when readers read a familiar topic (Kaakinen et al., 2003). Similarly, here, we demonstrate that non-native readers indeed spend more time reading question-relevant information than native readers do (Figure 5D, Figure 5—figure supplement 2B). Therefore, it is possible that when readers are more skilled and when the passage is relatively easy to read, their processing is so efficient so that they do not need extra time to encode task-relevant information and may rely on covert attention to prioritize the processing of task-relevant information.

DNN attention to question-relevant information

A number of studies have investigated whether the DNN attention weights are interpretable, but the conclusions are mixed: some studies find that the DNN attention weights are positively correlated with the importance of each word (Yang et al., 2016; Lin et al., 2017), while other studies fail to find such correlation (Serrano and Smith, 2019; Jain and Wallace, 2019). The inconsistent results are potentially caused by the lack of gold standard to evaluate the contribution of each word to a task. A few recent studies have used the human word reading time as the criterion to quantify word importance, but these studies do not reach consistent conclusions either. Some studies find that the attention weights in the last layer of transformer-based DNN models better correlates with human word reading time than basic word frequency measures (Bolotova et al., 2020), and integrating human word reading time into DNN can slightly improve task performance (Malmaud et al., 2020). Other studies, however, find no meaningful correlation between the attention weights in transformer-based DNNs and human word reading time (Sood et al., 2020).

The current results provide a potential explanation for the discrepancy in the literature: The last layer of transformer-based DNNs is tuned to task relevant information (Figure 4B), but the influence of task relevance on word reading time is rather weak for native readers (Figure 5B). Consequently, the correlation between the last-layer DNN attention weights and human reading time may not be robust. The current results demonstrate that the reading time of both native and non-native readers are reliably modulated by basic text features, which can be modeled by the attention weights in shallower DNN layers.

Finally, the current study demonstrates that transformer-based DNN models can automatically generate human-like attention in the absence of any prior knowledge about the properties of the human reading process. Simpler models that fail to explain human performance also fail to predict human attention distribution. It remains possible, however, different models can solve the same computational problem using distinct algorithms, and only some algorithms generate human-like attention distribution. In other words, human-like attention distribution may not be a unique solution to optimize the goal-directed reading task. Sharing similar attention distribution with humans, however, provides a way to interpret the attention weights in computational models. From this perspective, the dataset and methods developed here provide an effective probe to test the biological plausibility of NLP models that can be easily applied to test whether a model evolves human-like attention distribution.

Materials and methods

Participants

Totally, 162 participants took part in this study (19–30 years old, mean age, 22.5 years; 84 females). All participants had normal or corrected-to-normal vision. Experiment 1 had 102 participants. Experiments 2–4 had 20 participants. No participant took part in more than one experiment. Additional 17 participants were recruited but failed to pass the calibration process for eye tracking and therefore did not participant in the reading experiments.

In Experiments 1, 2, and 4, participants were native Chinese readers. They were college students or graduate students from Zhejiang University, and were thus above the level required to answer high-school-level reading comprehension questions. English proficiency levels were further guaranteed by the following criterion for screening participants: a minimum score of 6 on IELTS, 80 on TOEFL, or 425 on CET6 (The National College English Test (CET) is a national English test system developed to examine the English proficiency of college students in China. The CET includes tests of two levels: a lower level test CET4 and a higher level test CET6.). In Experiment 3, participants were native English readers. The experimental procedures were approved by the Research Ethics Committee of the College of Medicine, Zhejiang University (2019-047). The participants provided written consent and were paid.

Experimental materials

Request a detailed protocol

The reading materials were selected and adapted from the large-scale RACE dataset, a collection of reading comprehension questions in English exams for middle and high schools in China (Lai et al., 2017). We selected 800 high-school-level questions from the test set of RACE and each question was associated with a distinct passage (117–456 words per passage). All questions were multiple-choice questions with four alternatives including only one correct option among them. The questions fell into six types, that is, Cause (N = 200), Fact (N = 200), Inference (N = 120), Theme (N = 100), Title (N = 100), and Purpose (N = 80). The Cause, Fact, and Inference questions concerned the location, extraction, and comprehension of specific information from a passage, and were referred to as local questions. Questions of Theme, Title, and Purpose tested the understanding of a passage as a whole and were referred to as global questions.

In a separate online experiment, we acquired annotations about the relevance of each word to the question answering task. For each passage, a participant was allowed to annotate up to five key words that were considered relevant to answering the corresponding question. Each passage was annotated by N participants (N ≥ 26), producing N versions of annotated key words. Each version of annotation was then validated by a separate participant. In the validation procedure, the participant was required to answer the question solely based on the key words of a specific annotation version; if the person could not derive the correct answer, this version of annotation was discarded. The percentage of questions correctly answered in the validation procedure was 75.9 and 67.6% for local and global questions, respectively. If M versions of annotation passed the validation procedure and a word was annotated in K versions, the question relevance of the word was K/M. More details about the question types and the annotation procedures could be found in Zou et al., 2021.

Experimental procedures

Experiment 1

Request a detailed protocol

Experiment 1 included all 800 passages, and different question types were separately tested in different sessions, hence six sessions in total. Each session included 25 participants and one participant could participate in multiple sessions. Before each session, participants were familiarized with five questions that were not used in the formal session. During the formal session, questions were presented in a randomized order. Considering the quantities of questions, for Cause and Fact questions, the session was carried out in three separate days (one third questions on each day), and for other question types, the session was carried out in two separate days (50% of questions on each day).

The experiment procedure in Experiment 1 is illustrated in Figure 1A. In each trial, participants first read a question, pressed the space bar to read the corresponding passage, pressed the space bar again to read the question coupled with four options, and chose the correct answer. The time limit for passage reading was 120 s. To encourage the participants to read as quickly as possible, the bonus they received for a specific question would decrease linearly from 1.5 to 0.5 RMB over time. They did not receive any bonus for the question, however, if they gave a wrong answer. Furthermore, before answering the comprehension question, the participants reported whether they were confident about that they could correctly answer the question (yes or no). Participants selected yes for 90.47% of questions (89.62 and 92.04% for local and global questions, respectively). After answering the question, they also rated their confidence about their answer on the scale of 1–4 (low to high). The mean confidence rating was 3.25 (3.28 and 3.18 for local and global question, respectively), suggesting that the participants were confident about their answers.

Experiments 2 and 3

Request a detailed protocol

Experiments 2 and 3 included 96 reading passages and questions that were randomly selected from the questions used in Experiment 1 and included 16 questions for each question type. The six types of questions were mixed and presented in a randomized order. The trial structure, as well as the familiarization procedure, in Experiments 2 and 3 was identical to that in Experiment 1. Experiments 2 and 3 were identical except that Experiment 2 recruited high-proficiency L2 readers while Experiment 3 recruited native English readers.

Experiment 4

Request a detailed protocol

Experiment 4 included the 96 questions presented in Experiments 2 and 3, which were presented in a randomized order. The trial structure in Experiment 4 is similar to that in Experiments 1–3, except that a 90 s passage preview stage was introduced at the beginning of each trial. During passage preview, participants had no prior information of the relevant question. The participants could press the space bar to terminate the preview and to read a question. Then, participants read the passage again with a time limit of 30 s, before proceeding to answer the question. The payment method was similar to Experiment 2, and the bonus was calculated based on the duration of second-pass passage reading.

Stimulus presentation and eye tracking

Request a detailed protocol

The text was presented using the bold Courier New font, and each letter occupied 14 × 27 pixels. We set the maximum number of letters on each line to 120 and used double space. We separated paragraphs by indenting the first line of each new paragraph. Participants sat about 880 mm from a monitor, at which each letter horizontally subtended approximately 0.25° of visual angle.

Eye-tracking data were recorded from the left eye with 500 Hz sampling rate (Eyelink Portable Duo, SR Research). The experiment stimuli were presented on a 24-inch monitor (1920 × 1080 resolution; 60 Hz refresh rate) and administered using MATLAB Psychtoolbox (Brainard, 1997). Each experiment started with a 13-point calibration and validation of eye tracker, and the validation error was required to be below 0.5° of visual angle. Furthermore, before each trial, a 1-point validation was applied, and if the calibration error was higher than 0.5° of visual angle, a recalibration was carried out. Head movements were minimized using a chin and forehead rest.

Word-level reading comprehension models

Request a detailed protocol

The orthographic and semantic models probed whether the reading comprehension questions could be answered based on word-level orthographic or semantic information. Both models calculated the similarity between each content word in the passage and each content word in an option, and averaged the word-by-word similarity across all words in the passage and all words in the option (Figure 1—figure supplement 1A). The option with the highest mean similarity value was chosen as the answer. For the orthographic model, similarity was quantified using the edit distance (Levenshtein, 1966). For the semantic model, similarity was quantified by the correlation between vectorial representations of word meaning, that is, the glove model (Pennington et al., 2014). Performance of the models remained similar if the answer was chosen based on the maximal word-by-word similarity, instead of the mean similarity.

RNN-based reading comprehension models

Request a detailed protocol

The SAR was a classical RNN-based model for the reading comprehension task (Chen et al., 2016). In contrast to the word-level models, the SAR was context sensitive and employed bidirectional RNNs to integrate information across words (Figure 1—figure supplement 1B). Independent bidirectional RNNs were employed to build a vectorial representation for the question and each option. An additional bidirectional RNN was applied to construct a vectorial representation for each word in the passage, and a passage representation was built by a weighted sum of the representations of individual words in the passage. The weight on each word, that is, the attention weight, captured the similarity between the representation of the word and the question representation using a bilinear function. Finally, based on the passage representation and each option representation, a bilinear dot layer calculated the possibility that the option was the correct answer.

Transformer-based reading comprehension models

Request a detailed protocol

We tested three popular transformer-based DNN models, that is, BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020), and RoBERTa (Liu et al., 2019), which were all reported to reach high performance on the reading comprehension task. ALBERT and RoBERTa were both adapted from BERT and had the same basic structure. RoBERTa differed from BERT in its pre-training procedure (Liu et al., 2019) while ALBERT applied factorized embedding parameterization and cross-layer parameter sharing to reduce memory consumption (Lan et al., 2020). Following previous studies (Lan et al., 2020; Liu et al., 2019), each option was independently processed. For the ith option (i = 1, 2, 3, or 4), the question and the option were concatenated to form an integrated option. As shown in the left panel of Figure 2B, for the ith option, the input to models was the following sequence:

CLSi, P1, P2, ..., PN, Si,1, Oi,1, Oi,2, ..., Oi,M, Si,2,

where CLSi, Si,1, and Si,2 denote special tokens separating different components of the input. P1, P2, …, PN denote all the N words of a passage, and Oi,1, Oi,2, …, Oi,M denote all the M words in the ith integrated option. Each of the token was represented by a vector. The vectorial representation was updated in each layer, and in the following the output of the lth layer is denoted as a superscript, for example, CLSil. Following previous studies (Lan et al., 2020; Liu et al., 2019), we calculated a score for each option, which indicated the possibility that the option was the correct answer. The score was calculated by first applying a linear transform to the final representation of the CLS token, that is,

si=ΦCLSi12,

where CLSi12 is the final output representation of CLS and Φ is a vector learned from data. The score was independently calculated for each option and then normalized using the following equation:

scorei=exp(si)i=14exp(si)

The answer to a question was determined as the option with the highest score, and all the models were trained to maximize the logarithmic score of the correct option. The transformer-based models were trained in two steps (Figure 1D). The pre-training process aimed to learn general statistical regularities in a language based on large corpora, that is, BooksCorpus (Zhu et al., 2015) and English Wikipedia, while the fine-tuning process trained models to perform the reading comprehension task based on RACE dataset. All models were implemented based on HuggingFace (Wolf et al., 2020), and all hyperparameters for fine-tuning were adopted from previous studies (Lan et al., 2020; Liu et al., 2019; Zhang et al., 2020; Ran et al., 2019; see Supplementary file 1g).

Attention in transformer-based models

Request a detailed protocol

The transformer-based models we applied had 12 layers, and each layer had 12 parallel attention heads. Each attention head calculated an attention weight between any pair of inputs, including words and special tokens. The vectorial representation of each input was then updated by the weighted sum of the vectorial representations of all inputs (Vaswani et al., 2017). Since only the CLS token was directly related to question answering, here we restrained the analysis to the attention weights that were used to calculate the vectorial representation of CLS (Figure 2B, right panel). In the hth head, the vectorial representation of CLS was computed using the following equations. For the sake of clarity, we did not distinguish the input words and special tokens and simply denoted them as Xi.

CLSh = i=1N+M+3αiVi = αCLSVCLS + n=1NαPnVPn + αS1VS1+m=1MαOmVOm+αS2VS2,
αi=exp(QCLSKiT)i=1N+M+3exp(QCLSKiT),
Vi = XiWV + bV, Ki = XiWK + bK, QCLS = XCLSWQ + bQ, 

where WV, WQ, WK, bV, bQ, and bK are parameters to learn from the data, and αi is the attention weight between CLS and Xi. The attention weight between CLS and the nth word in the passage, that is, αPn, was compared to human attention. Here, we only considered the attention weights associated with the correct option. Additionally, DNNs used byte-pair tokenization which split some words into multiple tokens. We converted the token-level attention weights to word-level attention weights by summing the attention weights over tokens within a word (Bolotova et al., 2020; Clark et al., 2019).

Eye-tracking measures

Request a detailed protocol

We analyzed eye movements during passage reading in Experiments 1–3, and the passage preview in Experiment 4. For each word, the total fixation time, gaze duration, and run counts was extracted using the SR Research Data Viewer software. The total fixation time of a word is referred to as the word reading time. The gaze duration was how long a word was fixated before the gaze moved to other words, reflected first-pass processing of a word. To characterize late processing of a word, we further calculated the counts of rereading, which were defined as the run counts minus 1. Words that were not reread were excluded from the analysis of counts of rereading. Each of the eye-tracking measure was averaged across all participants who correctly answered the question.

Regression models

Request a detailed protocol

We employed linear regression to analyze how well each model, as well as each set of text/task-related features, could explain human attention measured by eye tracking. In all regression analyses, each regressor and the eye-tracking measure were normalized within each passage by taking the z-score. The predictive power, that is, the Pearson correlation coefficient between the predicted eye-tracking measure and the actual eye-tracking measure, was calculated based on fivefold cross-validation.

For the SAR, each word had one attention weight, which was used as the regressor. For transformer-based models, since each model contained 12 layers and each layer contained 12 attention heads, altogether there were 144 regressors. Text features included layout features and word features. The layout features concerned the visual position of text, including the coordinate of the leftmost pixel of a word, ordinal paragraph number of a word in a passage, ordinal line number of a word in a paragraph, and ordinal line number of a word in a passage. The word features included word length, logarithmic word frequency estimated based on the BookCorpus (Zhu et al., 2015) and English Wikipedia using SRILM (Stolcke, 2002), and word surprisal estimated from GPT-2 Medium (Radford et al., 2018). The task-related feature referred to the question relevance annotated by another group of participants (see ‘Experimental materials’ for details).

Additionally, we also applied linear regression to probe how DNN attention was affected by text features and question relevance. Since information of lines and paragraphs was not available to DNNs, the layout features only included the ordinal position of a word in a sentence, ordinal position of a word in a passage, and ordinal sentence number of a word in this analysis.

Linear mixed effect model

Request a detailed protocol

To characterize the influences of different factors on human word reading time, we employed linear mixed effects models (Pinheiro and Bates, 2006) implemented in the lmerTest package (Kuznetsova et al., 2017) of R. For the baseline model, we treated the type of questions (local vs. global; local = baseline) and all text/task-related features as fixed factors, and considered the interaction between the type of questions and these text/task-related features. We included participants and items (i.e., questions) as random factors, each with associated random intercepts. The formulation of the baseline model was reading-time ~ ParagraphNumber * QuestionType + LineNumberInPassage * QuestionType + LeftMostPixel * QuestionType + LineNumberInParagraph * QuestionType + LogWordFreq * QuestionType + WordLength * QuestionType + Surprisal * QuestionType + QuestionRelevance * QuestionType + (1 | Participant) + (1 | question). Additionally, starting from the baseline model, we augmented the baseline model by adding DNN attention as additional fixed factors. This augmentation facilitated an examination of whether DNN attention demonstrated a statistically significant contribution to the prediction of human word reading time. Notably, the DNN attention was derived from diverse sources, including SAR, randomized BERT, pre-trained BERT, and fine-tuned BERT.

Statistical tests

Request a detailed protocol

In the regression analysis, we employed a one-sided permutation test to test whether a set of features could statistically significantly predict an eye- tracking measure. A total of 500 chance-level predictive power was calculated by predicting the eye-tracking measure shuffled across all words within a passage: the eye-tracking measure to predict was shuffled but the features were not. The procedure was repeated 500 times, creating 500 chance-level predictive power. If the actual correlation was smaller than N out of the 500 chance-level correlation, the significance level was (N + 1)/501.

When comparing the responses to local and global questions, the three types of local/global questions were pooled. The comparison between local and global questions, as well as the comparison between experiments, was based on bias-corrected and accelerated bootstrap (Efron and Tibshirani, 1994). For example, to test whether the predictive power differed between the two types of questions, all global questions were resampled with replacement 50,000 times and each time the predictive power was calculated based on the resampled questions, resulting in 50,000 resampled predictive power. If the predictive power for local questions was greater (or smaller) than N out of the 50,000 resampled predictive power for global questions, the significance level of their difference was 2(N + 1)/50,001. When multiple comparisons were performed, the p-value was further adjusted using the FDR correction.

Data availability

All eye tracking data and code are available at (https://github.com/jiajiezou/TOA copy archived at jiajiezou, 2023).

References

    1. Borji A
    2. Itti L
    (2012) State-of-the-art in visual attention modeling
    IEEE Transactions on Pattern Analysis and Machine Intelligence 35:185–207.
    https://doi.org/10.1109/TPAMI.2012.89
    1. Brainard DH
    (1997)
    The psychophysics toolbox
    Spatial Vision 10:433–436.
  1. Conference
    1. Chen D
    2. Bolton J
    3. Manning CD
    (2016) A thorough examination of the cnn/daily mail reading comprehension task
    Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
    https://doi.org/10.18653/v1/P16-1223
  2. Conference
    1. Devlin J
    2. Chang MW
    3. Lee K
    4. Toutanova K
    (2019)
    Bert: Pre-training of deep bidirectional transformers for language understanding
    2019 Conference of the North American Chapter of the Association for Computational Linguistics; 2019: ACL.
  3. Conference
    1. Hermann KM
    2. Kocisky T
    3. Grefenstette E
    4. Espeholt L
    5. Kay W
    6. Suleyman M
    (2015)
    Teaching machines to read and comprehend
    Advances in Neural Information Processing Systems.
    1. Itti L
    2. Koch C
    3. Niebur E
    (1998) A model of saliency-based visual attention for rapid scene analysis
    IEEE Transactions on Pattern Analysis and Machine Intelligence 20:1254–1259.
    https://doi.org/10.1109/34.730558
  4. Conference
    1. Jain S
    2. Wallace BC
    (2019)
    Attention is not Explanation
    2019 Conference of the North American Chapter of the Association for Computational Linguistics.
  5. Book
    1. Kahneman D
    (1973)
    Attention and Effort
    Citeseer.
  6. Conference
    1. Lai G
    2. Xie Q
    3. Liu H
    4. Yang Y
    5. Hovy E
    (2017) RACE: Large-scale reading comprehension dataset from examinations
    Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
    https://doi.org/10.18653/v1/D17-1082
  7. Conference
    1. Lan Z
    2. Chen M
    3. Goodman S
    4. Gimpel K
    5. Sharma P
    6. Soricut R
    (2020)
    Albert: a lite bert for self-supervised learning of language representations
    International Conference on Learning Representations.
  8. Book
    1. Levenshtein VI
    (1966)
    Binary Codes Capable of Correcting Deletions, Insertions, and Reversals
    Soviet physics doklady.
  9. Conference
    1. Lin Z
    2. Feng M
    3. Yu M
    4. Xiang B
    5. Zhou B
    (2017)
    A structured self-attentive sentence embedding
    International Conference on Learning Representations.
  10. Conference
    1. Liu Y
    2. Reichle E
    (2010)
    The emergence of adaptive eye movements in reading
    Annual Conference of the Cognitive Science Society.
  11. Book
    1. Marr D
    (1982)
    Vision: A Computational Investigation into the Human Representation and Processing of Visual Information
    henry holt and co.
  12. Conference
    1. Pennington J
    2. Socher R
    3. Manning C
    (2014) Glove: Global Vectors for Word Representation
    Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP.
    https://doi.org/10.3115/v1/D14-1162
  13. Book
    1. Pinheiro J
    2. Bates D
    (2006)
    Mixed-Effects Models in S and S-PLUS
    Springer science & business media.
  14. Conference
    1. Serrano S
    2. Smith NA
    (2019) Is Attention Interpretable?
    Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
    https://doi.org/10.18653/v1/P19-1282
  15. Conference
    1. Stolcke A
    (2002) SRILM - an extensible language modeling toolkit
    7th International Conference on Spoken Language Processing (ICSLP 2002.
    https://doi.org/10.21437/ICSLP.2002-303
  16. Conference
    1. Vaswani A
    2. Shazeer N
    3. Parmar N
    4. Uszkoreit J
    5. Jones L
    6. Gomez AN
    (2017)
    Attention is all you need
    Advances in Neural Information Processing Systems.
  17. Conference
    1. Xu K
    2. Ba J
    3. Kiros R
    4. Cho K
    5. Courville A
    6. Salakhudinov R
    (2015)
    Show, attend and tell: Neural image caption generation with visual attention
    32nd International Conference on Machine Learning.
  18. Conference
    1. Yang Z
    2. Yang D
    3. Dyer C
    4. He X
    5. Smola A
    6. Hovy E
    (2016) Hierarchical Attention Networks for Document Classification
    Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics.
    https://doi.org/10.18653/v1/N16-1174

Peer review

Reviewer #1 (Public Review):

This manuscript describes a set of four passage-reading experiments which are paired with computational modeling to evaluate how task-optimization might modulate attention during reading. Broadly, participants show faster reading and modulated eye-movement patterns of short passages when given a preview of a question they will be asked. The attention weights of a Transformer-based neural network (BERT and variants) show a statistically reliable fit to these reading patterns above-and-beyond text- and semantic-similarity baseline metrics, as well as a recurrent-network-based baseline. Reading strategies are modulated when questions are not previewed, and when participants are L1 versus L2 readers, and these patterns are also statistically tracked by the same transformer-based network.

Strengths:

- Task-optimization is a key notion in current models of reading and the current effort provides a computationally rigorous account of how such task effects might be modeled

- Multiple experiments provide reasonable effort towards generalization across readers and different reading scenarios

- Use of RNN-based baseline, text-based features, and semantic features provides a useful baseline for comparing Transformer-based models like BERT

Weaknesses:

- Generalization across neural network models may be limited (models differ in size, training data etc.); it is thus not always clear which specific model characteristics support their fit to human reading patterns.

https://doi.org/10.7554/eLife.87197.3.sa1

Reviewer #2 (Public Review):

In this study, researchers aim to understand the computational principles behind attention allocation in goal-directed reading tasks. They explore how deep neural networks (DNNs) optimized for reading tasks can predict reading time and attention distribution. The findings show that attention weights in transformer-based DNNs predict reading time for each word. Eye tracking reveals that readers focus on basic text features and question-relevant information during initial reading and rereading, respectively. Attention weights in shallow and deep DNN layers are separately influenced by text features and question relevance. Additionally, when readers read without a specific question in mind, DNNs optimized for word prediction tasks can predict their reading time. Based on these findings, the authors suggests that attention in real-world reading can be understood as a result of task optimization.

Strengths of the Methods and Results:

The present study employed stimuli consisting of paragraphs read by middle and high school students, covering a wide range of diverse topics. This choice ensured that the reading experience for participants remained natural, ultimately enhancing the ecological validity of the findings and conclusions.

In Experiments 1-3, participants were instructed to read questions before the text, while in Experiment 4 participants were instructed to read questions after the text. This deliberate manipulation allowed the paper to assess how different reading task conditions influence reading and eye movements.

Weaknesses of the Methods and Results:

While the study benefits from several strengths, it is important to acknowledge its limitations. Notably, recent months have seen significant advancements in Deep Neural Network (DNN) models, including the development of models such as GPT-3.5 and GPT-4, which have demonstrated remarkable capabilities in tasks resembling human cognition, like Theory of Mind. However, as the code for these cutting-edge models was not publicly accessible, they were unable to evaluate whether the attention mechanisms in the most up-to-date DNN models could provide improved predictions for human eye-movement data. This constraint represents a limitation in the investigation.

The methods and data presented in this study are valuable for gaining insights into the psychological mechanisms of reading. Moreover, the data provided in this paper may prove instrumental in enhancing the performance of future DNN models.

https://doi.org/10.7554/eLife.87197.3.sa2

Reviewer #3 (Public Review):

This paper presents several eyetracking experiments measuring task-directed reading behavior where subjects read texts and answered questions. It then models the measured reading times using attention patterns derived from deep-neural network models from the natural language processing literature. Results are taken to support the theoretical claim that human reading reflects task-optimized attention allocation.

Strengths:

(1) The paper leverages modern machine learning to model a high-level behavioral task (reading comprehension). While the claim that human attention reflects optimal behavior is not new, the paper considers a substantially more high-level task in comparison to prior work. The paper leverages recent models from the NLP literature which are known to provide strong performance on such question-answering tasks, and is methodologically well grounded in the NLP literature.

(2) The modeling uses text- and question-based features in addition to DNNs, specifically evaluates relevant effects, and compares vanilla pretrained and task-finetuned models. This makes the results more transparent and helps assess the contributions of task optimization. In particular, besides fine-tuned DNNs, the role of the task is further established by directly modeling the question relevance of each word. Specifically, the claim that human reading is predicted better by task-optimized attention distributions rests on (i) a role of question relevance in influencing reading in Expts 1-2 but not 4, and (ii) the fact that fine-tuned DNNs improve prediction of gaze in Expts 1-2 but not 4.

(3) The paper conducts experiments on both L2 and L1 speakers.

Weaknesses:

(1) Under the hypothesis advanced, human reading should adapt rationally to task demands. Indeed, Experiment 1 tests questions from different types in blocks (local and global), and the paper provides evidence that this encourages the development of question-type-specific reading strategies -- indeed, this specifically motivates Experiment 2, and is confirmed indirectly in the comparison of the effects found in the two experiments ("all these results indicated that the readers developed question-type-specific strategies in Experiment 1"). On the other hand, finetuning the model on one of the two types does not seem to reproduce this differential behavior, in the sense that fit to reading data is not improved. In this sense, the model seems to have limited abilities in reproducing the observed task dependence of human reading.

The results support the conclusions well, with the weakness described above a limitation of the modeling approach chosen.

The data are likely to be useful as a benchmark in further modeling of eye-movements, an area of interest to computational research on psycholinguistics.

The modeling results contribute to theoretical understanding of human reading behavior, and strengthens a line of research arguing that it reflects task-adaptive behavior.

https://doi.org/10.7554/eLife.87197.3.sa3

Author response

The following is the authors’ response to the original reviews.

Reviewer #1:

This manuscript describes a set of four passage-reading experiments which are paired with computational modeling to evaluate how task-optimization might modulate attention during reading. Broadly, participants show faster reading and modulated eye-movement patterns of short passages when given a preview of a question they will be asked. The attention weights of a Transformerbased neural network (BERT and variants) show a statistically reliable fit to these reading patterns above-and-beyond text- and semantic-similarity baseline metrics, as well as a recurrent-networkbased baseline. Reading strategies are modulated when questions are not previewed, and when participants are L1 versus L2 readers, and these patterns are also statistically tracked by the same transformer-based network.

I should note that I served as a reviewer on an earlier version of this manuscript at a different venue. I had an overall positive view of the paper at that point, and the same opinion holds here as well.

Strengths:

  • Task-optimization is a key notion in current models of reading and the current effort provides a computationally rigorous account of how such task effects might be modeled

  • Multiple experiments provide reasonable effort towards generalization across readers and different reading scenarios

  • Use of RNN-based baseline, text-based features, and semantic features provides a useful baseline for comparing Transformer-based models like BERT

Thank you for the accurate summary and positive evaluation.

Weaknesses:

1. Generalization across neural network models seems, to me, somewhat limited: The transformerbased models differ from baseline models in numerous ways (model size, training data, scoring algorithm); it is thus not clear what properties of these models necessarily supports their fit to human reading patterns.

Thank you for the insightful comment. To dissociate the effect of model architecture and the effect of training data, we have now compared the attention weights across three transformer-based models that have the same architecture but different training data/task: randomized (with all model parameters being randomized), pretrained, and fine-tuned models. Remarkably, even without training on any data, the attention weights in randomly initialized models exhibited significant similarity to human attention patterns (Figure. 3A). The predictive power of randomly initialized transformer-based models outperformed that of the SAR model. Through subsequent pre-training and fine-tuning, the predictive capacity of the models was further elevated. Therefore, both model architecture and the training data/task contribute to human-like attention distribution in the transformer models. We have now reported this result:

“The attention weights of randomly initialized transformer-based models could predict the human word reading time and the predictive power, which was around 0.3, was significantly higher than the chance level and the SAR (Fig. 3A, Table S1). The attention weights of pre-trained transformerbased models could also predict the human word reading time, and the predictive power was around 0.5, significantly higher than the predictive power of heuristic models, the SAR, and randomly initialized transformer-based models (Fig. 3A, Table S1). The predictive power was further boosted for local but not global questions when the models were fine-tuned to perform the goal-directed reading task (Fig. 3A, Table S1).”

In addition, we reported how training influenced the sensitivity of attention weights to text features and question relevance. As shown in Figure 4AB, attention in the randomized models were sensitive to text features across all layers. After pretraining, the models exhibited increased sensitivity to text features in the shallow layers, and decreased sensitivity to text features in deep layers. Subsequent finetuning on the reading comprehension task further attenuates the encoding of text features in deep layers but strengthens the sensitivity to task-relevant information.

1. Inferential statistics are based on a series of linear regressions, but these differ markedly in model size (BERT models involve 144 attention-based regressor, while the RNN-based model uses just 1 attention-based regressor). How are improvements in model fit balanced against changes in model size?

Thank you for pointing out this issue. The performance of linear regressions was evaluated based on 5-fold cross-validation, and the performance we reported was the performance on the test set. To match the number of parameters, we have now predicted human attention using the average of all heads. The predictive power of the average head was still significantly higher than the predictive power of the SAR model. We have now reported this result in our revised manuscript:

“For the fine-tuned models, we also predict the human word reading time using an unweighted averaged of the 144 attention heads and the predictive power was 0.3, significantly higher than that achieved by the attention weights of SAR (P = 4 × 10-5, bootstrap).”

Also, it was not clear to me how participant-level variance was accounted for in the modeling effort (mixed-effects regression?) These questions may well be easily remedied by more complete reporting.

In the previous manuscript, the word reading time was averaged across participants, and we did not consider the variance between participants. We have now analyzed eye movements of each participant and used the linear mixed effects model to test how different factors affected human word reading time to account for participantslevel and item-level variances.

“Furthermore, a linear mixed effect model also revealed that more than 85% of the DNN attention heads contribute to the prediction of human reading time when considering text features and question relevance as covariates (Supplementary Results).”

“Supplementary MethodsTo characterize the influences of different factors on human word reading time, we employed linear mixed effects models [5] implemented in the lmerTest package [6] of R. For the baseline model, we treated the type of questions (local vs. global; local = baseline) and all text/task-related features as fixed factors, and considered the interaction between the type of questions and these text/taskrelated features. We included participants and items (i.e., questions) as random factors, each with associated random intercepts…”

Supplementary ResultsThe baseline mixed model revealed significant fixed effects for question type and all text/task-related features, as well as significant interactions between question type and these text/task-related features (Table S7). Upon involving SAR attention, we observed a statistically significant fixed effect associated with SAR attention. When involving attention weights of randomly initialized BERT, the mixed model revealed that most attention heads exhibited significant fixed effects, suggesting their contributions to the prediction of human word reading time. A broader range of attention heads showed significant fixed effects for both pre-trained and fine-tuned BERT.

1. Experiment 1 was paired with a relatively comprehensive discussion of how attention weights mapped to reading times, but the same sort of analysis was not reported for Exps 2-4; this seems like a missed opportunity given the broader interest in testing how reading strategies might change across the different parameters of the four experiments.

Thank you for the valuable suggestion. We have now also characterized how different reading measures, e.g., gaze duration and counts or rereading, were affected by text and task-related features in Experiments 2-4.

For Experiment 2:“For local questions, consistent with Experiment 1, the effects of question relevance significantly increased from early to late processing stages that are separately indexed by gaze duration and counts of rereading (Fig. S9A, Table S3).”

For Experiment 3:“For local questions, the layout effect was more salient for gaze duration than for counts of rereading. In contrast, the effect of word-related features and task relevance was more salient for counts of rereading than gaze duration (Fig. S9B, Table S3).”

For Experiment 4:“Both the early and late processing stages of human reading were significantly affected by layout and word features, and the effects were larger for the late processing stage indexed by counts of rereading (Fig. S9C, Table S3).”

1. Comparison of predictive power of BERT weights to human annotations of text relevance is limited: The annotation task asked participants to chose the 5 "most relevant" words for a given question; if >5 words carried utility in answering a question, this would not be captured by the annotation. It seems to me that the improvement of BERT over human annotations discussed around page 10-11 could well be due to this arbitrary limitation of the annotations.

Thank you for the insightful comment. We only allowed a participant to label 5 words since we wanted the participant to only label the most important information. As the reviewer pointed out, five words may not be enough. However, this problem is alleviated by having >26 annotators per question. Although each participant can label up to 5 words, pooling the results across >26 annotators results in nonzero relevance rating for an average 21.1 words for local questions and 26.1 words for global question. More important, as was outlined in Experimental Materials, we asked additional participants to answer questions based on only 5 annotated keywords. The accuracy for question answering were 75.9% for global questions and 67.6% for local questions, which was close to the accuracy achieved when the complete passage was present (Fig. 1B), suggesting that even 5 keywords could support question answering.

1. Abstract ln 35: This concluding sentence didn't really capture the key contribution of the paper which, at least from my perspective, was something closer to "we offer a computational account of how task optimization modulates attention during reading"

p 4 ln 66: I think this sentence does a good job capturing the main contributions of this paper

Thanks for your suggestion. We have modified our conclusion in Abstract accordingly.

1. p 4 ln 81: "therefore is conceptually similar" maybe "may serve a conceptually similar role"

We have rewritten the sentence.

“Attention in DNN also functions as a mechanism to selectively extract useful information, and therefore attention may potentially serve a conceptually similar role in DNN.”

1. p. 7 ln 140: "disproportional to the reading time" I didn't understand this sentence

Sorry for the confusion and we have rewritten the sentence.

“In Experiment 1, participants were allowed to read each passage for 2 minutes. Nevertheless, to encourage the participants to develop an effective reading strategy, the monetary reward the participant received decreased as they spent more time reading the passage (see Materials and Methods for details).”

1. p 8 ln 151: This was another sentence that helped solidify the main research contributions for me; I wonder if this framing could be promoted earlier?

Thank you for the suggestion and we have moved the sentence to Introduction.

1. p. 33: I may be missing something here, but I didn't follow the reasoning behind quantifying model fit against eye-tracking measures using accuracy in a permutation test. Models are assessed in terms of the proportion of random shuffles that show a greater statistical correlation. Does that mean that an accuracy value like 0.3 (p. 10 ln 208) means that 0.7 random permutations of word order led to higher correlations between attention weights and RT? Given that RT is continuous, I wonder if a measure of model fit such as RMSE or even R^2 could be more interpretable.

We have now realized that the term “prediction accuracy” was not clearly defined and have caused confusion. Therefore, in the revised manuscript, we have replaced this term with “predictive power”. Additionally, we have now introduced a clear definition of “prediction power” at its first mention in Result:

“…the predictive power, i.e., the Pearson correlation coefficient between the predicted and real word reading time, was around 0.2”

The permutation test was used to test if the predictive power is above chance. Specifically, if the predictive power is higher than the 95 percentile of the chancelevel predictive power estimated using permutations, the significant level (i.e., the p value) is 0.05. We have explained this in Statistical tests.

1. p. 33: FDR-based multiple comparisons are noted several times, but wasn't clear to me what the comparison set is for any given test; more details would be helpful (e.g. X comparisons were conducted across passages/model-variants/whatever)

Sorry for missing this important information. We have now mentioned which comparisons are corrected,

“…Furthermore, the predictive power was higher for global than local questions (P = 4 × 10-5, bootstrap, FDR corrected for comparisons across 3 features, i.e., layout features, word features, and question relevance)…”

Reviewer #2:

In this study, researchers aim to understand the computational principles behind attention allocation in goal-directed reading tasks. They explore how deep neural networks (DNNs) optimized for reading tasks can predict reading time and attention distribution. The findings show that attention weights in transformer-based DNNs predict reading time for each word. Eye tracking reveals that readers focus on basic text features and question-relevant information during initial reading and rereading, respectively. Attention weights in shallow and deep DNN layers are separately influenced by text features and question relevance. Additionally, when readers read without a specific question in mind, DNNs optimized for word prediction tasks can predict their reading time. Based on these findings, the authors suggest that attention in real-world reading can be understood as a result of task optimization.

The research question pursued by the study is interesting and important. The manuscript was well written and enjoyable to read. However, I do have some concerns.

We thank the reviewer for the accurate summary and positive evaluation.

1. In the first paragraph of the manuscript, it appears that the purpose of the study was to test the optimization hypothesis in natural tasks. However, the cited papers mainly focus on covert visual attention, while the present study primarily focuses on overt attention (eye movements). It is crucial to clearly distinguish between these two types of attention and state that the study mainly focuses on overt attention at the beginning of the manuscript.

Thank you for pointing out this issue. We have explicitly mentioned that we focus on overt attention in the current study. Furthermore, we have also discussed that native readers may rely more on covert attention so that they do not need to spend more time overtly fixating at the task relevant words.

In Introduction:

“Reading is one of the most common and most sophisticated human behaviors [16, 17], and it is strongly regulated by attention: Since readers can only recognize a couple of words within one fixation, they have to overtly shift their fixation to read a line of text [3]. Thus, eye movements serve as an overt expression of attention allocation during reading [3, 18].”

In Discussion:

“Therefore, it is possible that when readers are more skilled and when the passage is relatively easy to read, their processing is so efficient so that they do not need extra time to encode task-relevant information and may rely on covert attention to prioritize the processing of task-relevant information.”

1. The manuscript correctly describes attention in DNN as a mechanism to selectively extract useful information. However, eye-movement measures such as gaze duration and total reading time are primarily influenced by the time needed to process words. Therefore, there is a doubt whether the argument stating that attention in DNN is conceptually similar to the human attention mechanism at the computational level is correct. It is strongly suggested that the authors thoroughly discuss whether these concepts describe the same or different things.

Thank you for bringing up this very important issue and we have added discussions about why human and DNN may generate similar attention distributions. For example, we found that both DNN and human attention distributions are modulated by task relevance and word properties, which include word length, word frequency, and word surprisal. The influence of task relevance is relatively straightforward since both human readers and DNN should rely more on task relevant words to answer questions. The influence of word properties is less apparent for models than for human readers and we have added discussions:

For DNN’s sensitivity to word surprisal:

“The transformer-based DNN models analyzed here are optimized in two steps, i.e., pre-training and fine-tuning. The results show that pre-training leads to text-based attention that can well explain general-purpose reading in Experiment 4, while the fine-tuning process leads to goal-directed attention in Experiments 1-3 (Fig. 4B & Fig. 5A). Pre-training is also achieved through task optimization, and the pre-training task used in all the three models analyzed here is to predict a word based on the context. The purpose of the word prediction task is to let models learn the general statistical regularity in a language based on large corpora, which is crucial for model performance on downstream tasks [21, 22, 33], and this process can naturally introduce the sensitivity to word surprisal, i.e., how unpredictable a word is given the context.”

For DNN’s sensitivity to word length:

“Additionally, the tokenization process in DNN can also contribute to the similarity between human and DNN attention distributions: DNN first separates words into tokens (e.g., “tokenization” is separated into “token” and “ization”). Tokens are units that are learned based on co-occurrence of letters, and is not strictly linked to any linguistically defined units. Since longer words tend to be separated into more tokens, i.e., fragments of frequently co-occurred letters, longer words receive more attention even if the model pay uniform attention to each of its input, i.e., a token.”

1. When reporting how reading time was predicted by attention weights, the authors used "prediction accuracy." While this measure is useful for comparing different models, it is less informative for readers to understand the quality of the prediction. It would be more helpful if the results of regression models were also reported.

Sorry for the confusion. The prediction accuracy was defined as the correlation coefficient between the predicted and actual eye-tracking measures. We have now realized that the term “prediction accuracy” might have caused confusion. Therefore, in the revised manuscript, we have replaced this term with “predictive power”. Additionally, we have now introduced a clear definition of “prediction power” at its first mention in Result:

“…the predictive power, i.e., the Pearson correlation coefficient between the predicted and real word reading time, was around 0.2”

1. The motivations of Experiments 2 and 3 could be better described. In their current form, it is challenging to understand how these experiments contribute to understanding the major research question of the study.

Thank you for pointing out this issue. In Experiments 1, different types of questions were presented in separate blocks, and all the participants were L2 reader. Therefore, we conducted Experiments 2 and 3 to examine how reading behaviors were modulated when different types of questions were presented in a mixed manner, or when participants were L1 readers. We have now clarified the motivations:

“In Experiment 1, different types of questions were presented in blocks which encouraged the participants to develop question-type-specific reading strategies. Next, we ran Experiment 2, in which questions from different types were mixed and presented in a randomized order, to test whether the participants developed question-type-specific strategies in Experiment 1.”

“Experiments 1 and 2 recruited L2 readers. To investigate how language proficiency influenced task modulation of attention and the optimality of attention distribution, we ran Experiment 3, which was the same as Experiment 2 except that the participants were native English readers.”

Reviewer #3:

This paper presents several eyetracking experiments measuring task-directed reading behavior where subjects read texts and answered questions.

It then models the measured reading times using attention patterns derived from deep-neural network models from the natural language processing literature.

Results are taken to support the theoretical claim that human reading reflects task-optimized attention allocation.

STRENGTHS:

1. The paper leverages modern machine learning to model a high-level behavioral task (reading comprehension). While the claim that human attention reflects optimal behavior is not new, the paper considers a substantially more high-level task in comparison to prior work. The paper leverages recent models from the NLP literature which are known to provide strong performance on such question-answering tasks, and is methodologically well grounded in the NLP literature.

1. The modeling uses text- and question-based features in addition to DNNs, specifically evaluates relevant effects, and compares vanilla pretrained and task-finetuned models. This makes the results more transparent and helps assess the contributions of task optimization. In particular, besides finetuned DNNs, the role of the task is further established by directly modeling the question relevance of each word. Specifically, the claim that human reading is predicted better by task-optimized attention distributions rests on (i) a role of question relevance in influencing reading in Expts 1-2 but not 4, and (ii) the fact that fine-tuned DNNs improve prediction of gaze in Expts 1-2 but not 4.

1. The paper conducts experiments on both L2 and L1 speakers.

We thank the reviewer for the accurate summary and positive evaluation.

WEAKNESSES:

1. The paper aims to show that human gaze is predicted the the DNN-derived task-optimal attention distribution, but the paper does not actually derive a task-optimal attention distribution. Rather, the DNNs are used to extract 144 different attention distributions, which are then put into a regression with coefficients fitted to predict human attention. As a consequence, the model has 144 free parameters without apparent a-priori constraint or theoretical interpretation. In this sense, there is a slight mismatch between what the modeling aims to establish and what it actually does.

Regarding Weakness (1): This weakness should be made explicit, at least by rephrasing line 90. The authors could also evaluate whether there is either a specific attention head, or one specific linear combination (e.g. a simple average of all heads) that predicts the human data well.

Thank you for pointing out this issue. One the one hand, we have now also predicted human attention using the average of all heads, i.e., the simple average suggested by the reviewer. The predictive power of the average head was still significantly higher than the predictive power of the SAR model. We have now reported this result in our revised manuscript.

“For the fine-tuned models, we also predict the human word reading time using an unweighted averaged of the 144 attention heads and the predictive power was 0.3, significantly higher than that achieved by the attention weights of SAR (P = 4 × 10-5, bootstrap).”

On the other hand, since different attention weights may contribute differently to the prediction of human reading time, we have now also reported the weights assigned to individual attention head during the original regression analysis (Fig. S4). It was observed that the weight was highly distributed across attention head and was not dominated by a single head.

Even more importantly, we have now rephrased the statement in line 90 of the previous manuscript:

“We employed DNNs to derive a set of attention weights that are optimized for the goal-directed reading task, and tested whether such optimal weights could explain human attention measured by eye tracking.”

Furthermore, in Discussion, we mentioned that:

“Furthermore, we demonstrate that both humans and transformer-based DNN models achieve taskoptimal attention distribution in multiple steps… Similarly, the DNN models do not yield a single attention distribution, and instead it generates multiple attention distributions, i.e., heads, for each layer. Here, we demonstrate that basic text features mainly modulate the attention weights in shallow layers, while the question relevance of a word modulates the attention weights in deep layers, reflecting hierarchical control of attention to optimize task performance. The attention weights in both the shallow and deep layers of DNN contribute to the explanation of human word reading time (Fig. S4).”

1. While Experiment 1 tests questions from different types in blocks, and the paper mentions that this might encourage the development of question-type-specific reading strategies -- indeed, this specifically motivates Experiment 2, and is confirmed indirectly in the comparison of the effects found in the two experiments ("all these results indicated that the readers developed question-typespecific strategies in Experiment 1") -- the paper seems to miss the opportunity to also test whether DNNs fine-tuned for each of the question-types predict specifically the reading times on the respective question types in Experiment 1. Testing not only whether DNN-derived features can differentially predict normal reading vs targeted reading, but also different targeted reading tasks, would be a strong test of the approach.

Regarding Weakness (2): results after finetuning for each question type could be reported.

Thank you for the valuable suggestion. We have now fine-tuned the models separately based on global and local questions. The detailed fine-tuning parameters employed in the fine-tuning process were presented in Author response table 1.

Author response table 1
The hyperparameter for fine-tuning DNN models with specific question type.
fold of cross-
validation
training :
validation : test
learning ratetraining epochstraining batch
size
weigh decayearly stopping
patience
1070%:20%:10%1e-510160.12

The fine-tuning process yielded a slight reduction in loss (i.e., the negative logarithmic score of the correct option) on the validation set. Specifically, for BERT, the loss decreased from 1.08 to 0.96; for ALBERT, it decreased from 1.16 to 0.76; for RoBERTa, it went down from 0.68 to 0.54. Nevertheless, the fine-tuning process did not improve the prediction of reading time (Author response image 1). A likely reason is that the number of global and local questions for training is limited (local questions: 520; global questions: 280), and similar questions also exist in RACE dataset that is used for the original fine tuning (sample size: 87,866). Therefore, a small number of questions can significantly change the reading strategy of human readers but using these questions to effectively fine-tune a model seems to be a more challenging task.

Author response image 1
Fine-tuning based on local and global questions does not significantly modulate the prediction of human reading time.

Lighter-color symbols show the results for the 3 BERT-family models (i.e., BERT, ALBERT, and RoBERTa) and the darker-color symbols show the average over the 3 BERT-family models. trans_fine: model fine-tuned based on the RACE dataset; trans_local: models additionally fine-tuned using local questions; trans_global: models additionally fine-tuned using global questions.

1. The paper compares the DNN-derived features to word-related features such as frequency and surprisal and reports that the DNN features are predictive even when the others are regressed out (Figure S3). However, these features are operationalized in a way that puts them at an unfair disadvantage when compared to the DNNs: word frequency is estimated from the BNC corpus; surprisal is derived from the same corpus and derived using a trigram model. The BNC corpus contains 100 Million words, whereas BERT was trained on several Billions of words. Relatedly, trigram models are now far surpassed by DNN-based language models. Specifically, it is known that such models do not fit human eyetracking reading times as well as modern DNN-based models (e.g., Figure 2 Dundee in: Wilcox et al, On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior, CogSci 2020). This means that the predictive power of the word-related features is likely to be underestimated and that some residual predictive power is contained in the DNNs, which may implicitly compute quantities related to frequency and surprisal, but were trained on more data. In order to establish that the DNN models are predictive over and above word-related features, and to reliably quantify the predictive power gained by this, the authors could draw on (1) frequency estimated from the corpora used for BERT (BookCorpus + Wikipedia), (2) either train a strong DNN language model, or simply estimate surprisal from a strong off-the-shelf model such as GPT-2.

This concern does not fundamentally cast doubt on the conclusions, since the authors found a clear effect of the task relevance of individual words, which by definition is not contained in those baseline models. However, Figure S3 -- specifically Figure S3C -- is likely to inflate the contribution of the DNN model over and above the text-based features.

Thank you for pointing out these issues. Following the valuable suggestion of the reviewer, we have now (1) computed word frequencies based on BookCorpus and Wikipedia and (2) calculated word surprisal using GPT-2.

“The word features included word length, logarithmic word frequency estimated based on the BookCorpus [62] and English Wikipedia using SRILM [68], and word surprisal estimated from GPT-2 Medium [69].”

These recalculated word frequency and surprisal are correlated with the original measures (word frequency: 0.98; surprisal: 0.59), and the updated results are also closely aligned with those reported in the previous manuscript.

Others:

1. How does the statistical modeling take into account that measures are repeated both within the items (same texts read by different subjects) and within the subjects (some subject read multiple texts)? I only see the items-level repetition be addressed in line 715-721 in comparing between local and global questions, but not elsewhere. The standard approach in the literature on human reading times (e.g. the Wilcox et al paper mentioned above, or ref. 44) is to use mixed-effects regression with appropriate random effects for items and subjects. The same question applies to the calculation of chance accuracy (line 702-709), which is done by shuffling words within a passage. Relatedly, how exactly was cross-validation (line 681) calculated? On the level of subjects, individual words, trials, texts, ...?

Thank you for raising up this issue. In the previous manuscript, the word reading time was averaged across participants. The cross-validation was conducted on the level of texts (i.e., passages). Following the valuable suggestion, we have now separately analyzed each participant and applied the linear mixed effects models.

“Furthermore, a linear mixed effect model also revealed that more than 85% of the DNN attention heads contribute to the prediction of human reading time when considering text features and question relevance as covariates (Supplementary Results).”

“Supplementary MethodsTo characterize the influences of different factors on human word reading time, we employed linear mixed effects models [5] implemented in the lmerTest package [6] of R. For the baseline model, we treated the type of questions (local vs. global; local = baseline) and all text/task-related features as fixed factors, and considered the interaction between the type of questions and these text/taskrelated features. We included participants and items (i.e., questions) as random factors, each with associated random intercepts…”

Supplementary ResultsThe baseline mixed model revealed significant fixed effects for question type and all text/task-related features, as well as significant interactions between question type and these text/task-related features (Table S7). Upon involving SAR attention, we observed a statistically significant fixed effect associated with SAR attention. When involving attention weights of randomly initialized BERT, the mixed model revealed that most attention heads exhibited significant fixed effects, suggesting their contributions to the prediction of human word reading time. A broader range of attention heads showed significant fixed effects for both pre-trained and fine-tuned BERT.

1. I could not find any statement about code availability (only about data availability). Will the source code and statistical analysis code also be made available?

We have added the code availability statement.

“The code is now available at https://github.com/jiajiezou/TOA.”

1. The theoretical claim, and some basic features of the research, are quite similar to other recent work (Hahn and Keller, Modeling task effects in human reading with neural network-based attention, Cognition, 2023; cited with very little discussion as ref 44), which also considered task-directed reading in a question-answering task and derived task-optimized attention distributions. There are various differences, and the paper under consideration has both weaknesses and strengths when compared to that existing work -- e.g., that paper derived a single attention distribution from task optimization, but the paper under consideration provides more detailed qualitative analysis of the task effects, uses questions requiring more high-level reasoning, and uses more state-of-the-art DNNs.

Thanks for bringing up this issue. We have now incorporated a more comprehensive discussion that compare the current study with the recent work conducted by Hahn and Keller:

“When readers read a passage to answer a question that can be answered using a word-matching strategy [45], a recent study has demonstrated that the specific reading goal modulates the word reading time and the effect can be modeled using a RNN model [46]. Here, we focus on questions that cannot be answered using a word-matching strategy (Fig. 1B) and demonstrate that, for these challenging questions, attention is still modulated by the reading goal but the attention modulation cannot be explained by a word-matching model (Fig. S3). Instead, the attention effect is better captured by transformer models than an advanced RNN model, i.e., the SAR (Fig. 3A). Combining the current study and the study by Hahn et al. [46], it is possible that the word reading time during a general-purpose reading task can be explained by a word prediction task, the word reading time during a simple goal-directed reading task that can be solved by word matching can be modeled by a RNN model, while the word reading time during a more complex goal-directed reading task involving inference is better modeled using a transformer model. The current study also further demonstrates that elongated reading time on task-relevant words is caused by counts of rereading and further studies are required to establish whether earlier eye movement measures can be modulated by, e.g., a word matching task.”

1. In Materials&Methods, line 599-636, specifically when "pretraining" is mentioned (line 632), it should be mentioned what datasets these DNNs were pretrained on.

We have now mentioned this in the revised manuscript:

“The pre-training process aimed to learn general statistical regularities in a language based on large corpora, i.e., BooksCorpus [62] and English Wikipedia…”

https://doi.org/10.7554/eLife.87197.3.sa4

Article and author information

Author details

  1. Jiajie Zou

    1. Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Sciences, Zhejiang University, Hangzhou, China
    2. Nanhu Brain-computer Interface Institute, Hangzhou, China
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0009-0009-9054-117X
  2. Yuran Zhang

    Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Sciences, Zhejiang University, Hangzhou, China
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  3. Jialu Li

    Division of Arts and Sciences, New York University Shanghai, Shanghai, China
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  4. Xing Tian

    Division of Arts and Sciences, New York University Shanghai, Shanghai, China
    Contribution
    Supervision, Project administration
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-1629-6304
  5. Nai Ding

    1. Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Sciences, Zhejiang University, Hangzhou, China
    2. Nanhu Brain-computer Interface Institute, Hangzhou, China
    Contribution
    Conceptualization, Formal analysis, Supervision, Funding acquisition, Writing – original draft, Project administration, Writing – review and editing
    For correspondence
    ding_nai@zju.edu.cn
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-3428-2723

Funding

STI2030-Major Project (2021ZD0204105)

  • Nai Ding

National Natural Science Foundation of China (32222035)

  • Nai Ding

National Natural Science Foundation of China (32300856)

  • Jiajie Zou

Major Scientific Project of Zhejiang Laboratory (2019KB0AC02)

  • Nai Ding

Fundamental Research Funds for the Central Universities (226-2023-00091)

  • Nai Ding

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank David Poeppel, Yunyi Pan, and Erik D Reichle for valuable comments on earlier versions of this manuscript; Jonathan Simon, Bingjiang Lyu, and members of the Ding lab for thoughtful discussions and feedback; Qian Chu, Yuhan Lu, Anqi Dai, Zhonghua Tang, and Yan Chen for assistance with experiments. This work was supported by STI2030-Major Project 2021ZD0204105, National Natural Science Foundation of China 32222035, National Natural Science Foundation of China 32300856, Major Scientific Research Project of Zhejiang Lab 2019KB0AC02, and Fundamental Research Funds for the Central Universities 226-2023-00091.

Ethics

Human subjects: The experimental procedures were approved by the Research Ethics Committee of the College of Medicine, Zhejiang University (2019-047). The participants provided written consent and were paid.

Senior Editor

  1. Floris P de Lange, Donders Institute for Brain, Cognition and Behaviour, Netherlands

Reviewing Editor

  1. Hang Zhang, Peking University, China

Version history

  1. Sent for peer review: April 16, 2023
  2. Preprint posted: April 25, 2023 (view preprint)
  3. Preprint posted: June 23, 2023 (view preprint)
  4. Preprint posted: October 31, 2023 (view preprint)
  5. Version of Record published: November 30, 2023 (version 1)

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.87197. This DOI represents all versions, and will always resolve to the latest one.

Copyright

© 2023, Zou et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 171
    Page views
  • 42
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Jiajie Zou
  2. Yuran Zhang
  3. Jialu Li
  4. Xing Tian
  5. Nai Ding
(2023)
Human attention during goal-directed reading comprehension relies on task optimization
eLife 12:RP87197.
https://doi.org/10.7554/eLife.87197.3

Share this article

https://doi.org/10.7554/eLife.87197

Further reading

    1. Neuroscience
    Eyal Y Kimchi, Anthony Burgos-Robles ... Kay M Tye
    Research Article

    Basal forebrain cholinergic neurons modulate how organisms process and respond to environmental stimuli through impacts on arousal, attention, and memory. It is unknown, however, whether basal forebrain cholinergic neurons are directly involved in conditioned behavior, independent of secondary roles in the processing of external stimuli. Using fluorescent imaging, we found that cholinergic neurons are active during behavioral responding for a reward – even prior to reward delivery and in the absence of discrete stimuli. Photostimulation of basal forebrain cholinergic neurons, or their terminals in the basolateral amygdala (BLA), selectively promoted conditioned responding (licking), but not unconditioned behavior nor innate motor outputs. In vivo electrophysiological recordings during cholinergic photostimulation revealed reward-contingency-dependent suppression of BLA neural activity, but not prefrontal cortex. Finally, ex vivo experiments demonstrated that photostimulation of cholinergic terminals suppressed BLA projection neuron activity via monosynaptic muscarinic receptor signaling, while also facilitating firing in BLA GABAergic interneurons. Taken together, we show that the neural and behavioral effects of basal forebrain cholinergic activation are modulated by reward contingency in a target-specific manner.

    1. Neuroscience
    Olgerta Asko, Alejandro Omar Blenkmann ... Anne-Kristin Solbakk
    Research Article Updated

    Orbitofrontal cortex (OFC) is classically linked to inhibitory control, emotion regulation, and reward processing. Recent perspectives propose that the OFC also generates predictions about perceptual events, actions, and their outcomes. We tested the role of the OFC in detecting violations of prediction at two levels of abstraction (i.e., hierarchical predictive processing) by studying the event-related potentials (ERPs) of patients with focal OFC lesions (n = 12) and healthy controls (n = 14) while they detected deviant sequences of tones in a local–global paradigm. The structural regularities of the tones were controlled at two hierarchical levels by rules defined at a local (i.e., between tones within sequences) and at a global (i.e., between sequences) level. In OFC patients, ERPs elicited by standard tones were unaffected at both local and global levels compared to controls. However, patients showed an attenuated mismatch negativity (MMN) and P3a to local prediction violation, as well as a diminished MMN followed by a delayed P3a to the combined local and global level prediction violation. The subsequent P3b component to conditions involving violations of prediction at the level of global rules was preserved in the OFC group. Comparable effects were absent in patients with lesions restricted to the lateral PFC, which lends a degree of anatomical specificity to the altered predictive processing resulting from OFC lesion. Overall, the altered magnitudes and time courses of MMN/P3a responses after lesions to the OFC indicate that the neural correlates of detection of auditory regularity violation are impacted at two hierarchical levels of rule abstraction.