Prediction tendency, eye movements, and attention in a unified framework of neural speech tracking

  1. Paris-Lodron-University of Salzburg, Department of Psychology, Centre for Cognitive Neuroscience, Salzburg, Austria
  2. Department of Experimental Psychology, University College London, United Kingdom
  3. Wellcome Centre for Human Neuroimaging, University College London, United Kingdom
  4. Neuroscience Institute, Christian Doppler University Hospital, Paracelsus Medical University, Salzburg, Austria

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Roberto Bottini
    University of Trento, Trento, Italy
  • Senior Editor
    Barbara Shinn-Cunningham
    Carnegie Mellon University, Pittsburgh, United States of America

Reviewer #1 (Public review):

Summary:

This study aimed to replicate two previous findings: (1) a link between prediction tendencies and neural speech tracking, and (2) that eye movements track speech. The main findings were replicated, which supports the robustness of these results. The authors also investigated interactions between prediction tendencies and ocular speech tracking, but the data did not reveal clear relationships. The authors propose a framework that integrates the findings of the study and describes how eye movements and prediction tendencies shape perception.

Strengths:

This is a well-written paper that addresses interesting research questions, bringing together two subfields that are usually studied in isolation: auditory speech and eye movements. The authors aimed to replicate findings from two of their previous studies, which was overall successful and speaks to the robustness of the findings. The overall approach is convincing, the methods and analyses appear thorough, and the results are compelling.

Weaknesses:

Linking the new study to the previous ones could have been done in more detail, and the extent to which the results were replicated could have been discussed more thoroughly.

Eye movement behavior could have been presented in more detail and the authors could have attempted to understand whether there is a particular component in eye movement behavior (e.g., microsaccades) that drives the observed effects.

Reviewer #2 (Public review):

Summary

Schubert et al. recorded MEG and eye-tracking activity while participants listened to stories in single-speaker or multi-speaker speech. In a separate task, MEG was recorded while the same participants listened to four types of pure tones in either structured (75% predictable) or random (25%) sequences. The MEG data from this task were used to quantify individual 'prediction tendency': the degree to which the neural signal is modulated by whether or not a repeated tone was (un)predictable, given the context. In a replication of earlier work, this prediction tendency was found to correlate with 'neural speech tracking' during the main task. Neural speech tracking was quantified as the multivariate relationship between MEG activity and the speech amplitude envelope. Prediction tendency did not correlate with 'ocular speech tracking' during the main task. Neural speech tracking was further modulated by local semantic violations in the speech material, and by whether or not a distracting speaker was present. The authors suggest that part of the neural speech tracking is mediated by ocular speech tracking. Story comprehension was negatively related to ocular speech tracking.
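For readers unfamiliar with this class of analysis, the envelope-tracking metric discussed above is typically obtained from a lagged encoding model (a temporal response function, TRF): the speech envelope at several time lags is regressed onto the neural signal, and the Pearson correlation between the model's prediction and the recorded signal serves as the tracking score. The sketch below is a minimal, illustrative version using ridge regression on synthetic data; the function names (`lagged_design`, `trf_tracking`) and the toy signals are assumptions for illustration, not the authors' actual pipeline (which operates on MEG sensor/source data with cross-validation).

```python
import numpy as np

def lagged_design(x, lags):
    """Build a time-lagged design matrix from a 1-D stimulus (e.g. speech envelope)."""
    n = len(x)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        # shift the stimulus forward by `lag` samples (causal lags only here)
        X[lag:, j] = x[:n - lag]
    return X

def trf_tracking(envelope, signal, lags, alpha=1.0):
    """Fit a ridge-regularised TRF and return the encoding-model fit (Pearson r)."""
    X = lagged_design(envelope, lags)
    # ridge solution: w = (X'X + alpha*I)^{-1} X'y
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ signal)
    prediction = X @ w
    return np.corrcoef(prediction, signal)[0, 1]

# toy example: a "neural" channel that follows the envelope at a 10-sample lag, plus noise
rng = np.random.default_rng(0)
env = rng.standard_normal(5000)
neural = np.roll(env, 10) + 0.5 * rng.standard_normal(5000)
r = trf_tracking(env, neural, lags=range(30))  # high r: the model recovers the lag
```

In real analyses the fit is evaluated on held-out data to avoid overfitting, and the same logic applies whether the predicted signal is an MEG channel (neural tracking) or a gaze coordinate (ocular tracking).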

Strengths

This is an ambitious study, and the authors' attempt to integrate the many reported findings related to prediction and attention in one framework is laudable. The data acquisition and analyses appear to be done with great attention to methodological detail (perhaps even with too much focus on detail; see below). Furthermore, the experimental paradigm used is more naturalistic than in previous similar setups (i.e. stories instead of sentences).

Weaknesses

For many of the key variables and analysis choices (e.g. neural/ocular speech tracking, prediction tendency, mediation) it is not directly clear how these relate to the theoretical entities under study, and why they were quantified in this particular way. Relatedly, while the analysis pipeline is outlined in much detail, an overarching rationale and important intermediate results are often missing, which makes it difficult to judge the strength of the evidence presented. Furthermore, some analysis choices appear rather ad-hoc and should be made uniform and/or better motivated.

Reviewer #3 (Public review):

Summary:

In this paper, the authors measured neural activity (using MEG) and eye gaze while individuals listened to speech from either one or two speakers, which sometimes contained semantic incongruencies.

The stated aim is to replicate two previous findings by this group: (1) that there is "ocular speech tracking" (that eye movements track the audio of the speech), and (2) that individual differences in the neural response to tones that are predictable vs. not predictable in their pitch are linked to the neural response to speech. In addition, here they try to link the above two effects to each other, and to link "attention, prediction, and active sensing".

Strengths:

This is an ambitious project that tackles an important issue and combines different sources of data (neural data, eye movements, individual differences in another task) in order to obtain a comprehensive "model" of the involvement of eye movements in sensory processing.

The authors use many adequate methods and sophisticated data-analysis tools (including MEG source analysis and multivariate statistical models) in order to achieve this.

Weaknesses:

Although I sympathize with the goal of the paper and agree that this is an interesting and important theoretical avenue to pursue, I am unfortunately not convinced by the results and find that many of the claims are very weakly substantiated in the actual data.

Since most of the analyses presented here are derivations of statistical models and very little actual data is presented, I found it very difficult to assess the reliability and validity of the results, as they currently stand. I would be happy to see a thoroughly revised version, where much more of the data is presented, as well as control analyses and rigorous and well-documented statistical testing (including addressing multiple comparisons).

These are the main points of concern that I have regarding the paper, in its current format.

(1) Prediction tendencies - assessed by listening to sequences of rhythmic tones, where the pitch was either "predictable" (i.e., followed a fixed pattern, with 25% repetition) or "unpredictable" (no particular order to the sounds). This is a very specific operationalization of prediction, which is a general term that can operate along many different dimensions. Why was this specific design selected? Is there a theoretical reason to believe that this type of prediction is also relevant to "semantic" predictions or other predictive aspects of speech processing?

(2) On the same point - I was disappointed that the results of "prediction tendencies" were not reported in full, but only used later on to assess correlations with other metrics. Even though this is a "replication" of previous work, one would like to fully understand the results from this independent study. On that note, I would also appreciate a more detailed explanation of the method used to derive the "prediction tendency" metric (e.g., what portion of the MEG signal is used? Why use a pre-stimulus and not a post-stimulus time window? How is the response affected by the 3 Hz steady-state response that it is riding on? How are signals integrated across channels? Can we get a sense of what this "tendency" looks like in the actual neural signal, rather than just a single number derived per participant (an illustration is provided in Figure 1, but it would be nice to see the actual data)? How is this measure verified statistically? What is its distribution across the sample? Ideally, we would want enough information for others to be able to replicate this finding).

(3) Semantic violations - half the nouns ending sentences were replaced to create incongruent endings. Can you provide more detail about this - e.g., how were the words selected? How were the recordings matched (e.g., could the replacements be detected due to audio editing)? What are the "lexically identical controls" that are mentioned? Also, is there any behavioral data on how this affected listeners? Having so many incongruent sentences might be annoying or change the nature of listening. Were participants told in advance about these?

(4) TRF in multi-speaker condition: was a univariate or multivariate model used? Since the single-speaker condition only contains one speech stimulus - can we know if univariate and multivariate models are directly comparable (in terms of variance explained)? Was any comparison to permutations done for this analysis to assess noise/chance levels?

(5) TRF analysis at the word level: from my experience, 2-second segments are insufficient for deriving meaningful TRFs (see for example the recent work by Mesik & Wojtczak). Can you please give further details about how the analysis of the response to semantic violations was conducted? What was the model trained on (the full speech or just the 2-second-long segments)? Is there a particular advantage to TRFs here, relative - say - to ERPs (one would expect a relatively nice N400 response, no)? In general, it would be nice to see the TRF results on their own (and not just the modulation effects).

(6) Another related point that I did not quite understand - is the dependent measure used for the regression model of "neural speech envelope tracking" the r-value derived just from the 2-second-long epochs, or from the entire speech stimulus? The text mentions the "effect of neural speech tracking" - but it is not clear whether this refers to the single-speaker vs. two-speaker conditions or to the prediction manipulation. Or is it different in the different analyses? Please spell out exactly what metric was used in each analysis.

Author response:

We appreciate all the reviewers for their encouraging comments and thoughtful feedback. We are confident that we can incorporate many of the suggestions to provide a clearer overall picture in the revised manuscript. In particular, we agree with the reviewers' concern that some of our methodological decisions, including our choice of metrics, require further clarification. We will focus on revising the methods section to make these decisions more transparent and to address any misunderstandings related to the analysis.

We also value the request to include more data, such as intermediate results and additional control analyses. We will carefully assess which results to include in the main manuscript and which to provide in an extended supplementary section.

To offer a more detailed understanding of our quantification of "prediction tendency," we refer to our previous work (Schubert et al., 2023, 2024), where we elaborate on our analytical choices in great detail and provide additional control analyses (e.g., ensuring that the relationship with speech tracking is not driven by participants' signal-to-noise ratio; Schubert et al., 2023).

Additionally, we would like to clarify that the aim of this manuscript is not to analyze viewing behavior in depth but to replicate the general finding of ocular speech tracking, as presented in Gehmacher et al. (2024). A thorough investigation of specific ocular contributions (e.g., microsaccades or blinks) would require a separate research question and distinct analysis approaches, given the binary nature of such events.

Nevertheless, we share the reviewers' interest in independent results from the current study, and we plan to carefully select and present the most relevant findings in the revised manuscript.
