Quantification of individual prediction tendency and the multi-speaker paradigm.

A) Participants passively listened to sequences of pure tones in different conditions of entropy (ordered vs. random). Four tones of different fundamental frequencies were presented at a fixed stimulation rate of 3 Hz; their transitional probabilities varied according to the respective condition. B) Expected classifier decision values contrasting the brain’s prestimulus tendency to predict a forward transition (ordered vs. random). The purple shaded area represents values that were considered as prediction tendency. C) Exemplary excerpt of a tone sequence in the ordered condition. An LDA classifier was trained on forward-transition trials of the ordered condition (75% probability) and tested on all repetition trials to decode sound frequency from brain activity across time. D) Participants either attended to a story in clear speech (0-distractor condition) or to a target speaker presented simultaneously with a distractor (blue; 1-distractor condition). E) The speech envelope was used to estimate neural and ocular speech tracking in the respective conditions with temporal response functions (TRFs). F) The last noun of some sentences was randomly replaced with an improbable candidate to measure the effect of envelope encoding on the processing of semantic violations. Adapted from Schubert et al., 2023.
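As an illustration of the decoding step in C): one classifier is trained per time point on forward-transition trials and tested on repetition trials. This is a minimal sketch assuming epoched data arrays and scikit-learn; the variable names (X_forward, y_forward, X_repetition) are hypothetical placeholders, not the authors’ code.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical inputs:
# X_forward:    ordered-condition forward-transition epochs, shape (n_trials, n_channels, n_times)
# y_forward:    sound-frequency labels (one of four tones) for those trials
# X_repetition: repetition-trial epochs to test on, same shape convention

def decode_across_time(X_train, y_train, X_test):
    """Train one LDA per time point and return decision values for the test trials."""
    n_times = X_train.shape[2]
    decisions = []
    for t in range(n_times):
        clf = LinearDiscriminantAnalysis()
        clf.fit(X_train[:, :, t], y_train)
        decisions.append(clf.decision_function(X_test[:, :, t]))
    return np.stack(decisions, axis=-1)  # (n_test_trials, n_classes, n_times)
```

Training a separate classifier at each time point is what makes prestimulus decision values inspectable, since the resulting decision-value time courses extend before tone onset.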

List of all ebooks and short stories that were used as the basis for the audio material.

Individual prediction tendency.

A) Time-resolved contrasted classifier decision: forward > repetition, for ordered and random repetition trials. Classifier tendencies showing frequency-specific prediction for the tones with the highest probability (forward transitions) can be found even before stimulus onset, but only in an ordered context (shaded areas always indicate 95% confidence intervals). Using the summed difference across the pre-stimulus time window, one prediction tendency value was extracted per subject. B) Distribution of prediction tendency values across subjects (N = 29).
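The summed-difference readout in A) reduces each subject’s contrast time course to a single number. A minimal numpy sketch, assuming per-subject decision-value time courses (averaged over trials) are already available; all names are illustrative:

```python
import numpy as np

# d_forward, d_repetition: decision values for the expected tone, averaged
# over trials, shape (n_times,); times: seconds relative to stimulus onset
def prediction_tendency(d_forward, d_repetition, times):
    """Sum the forward > repetition contrast across the prestimulus window."""
    prestim = times < 0                  # restrict to pre-stimulus samples
    contrast = d_forward - d_repetition  # frequency-specific prediction signal
    return contrast[prestim].sum()       # one value per subject
```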

Neural speech tracking is related to prediction tendency and word surprisal, independent of selective attention.

A) Envelope (x) – response (y) relationships are estimated using deconvolution (boosting). The TRF (filter kernel, h) models how the brain processes the envelope over time. This filter is used to predict neural responses via convolution. Predicted responses are correlated with actual neural activity to evaluate model fit and the TRF’s ability to capture response dynamics. The correlation coefficients from these models are then used as dependent variables in Bayesian regression models. (Panel adapted from Gehmacher et al., 2024b.) B) Temporal response functions (TRFs) depict the time-resolved neural tracking of the speech envelope for the single-speaker and multi-speaker target conditions, shown here as absolute values averaged across channels. Solid lines represent the group average. Shaded areas represent 95% confidence intervals. C–H) The beta weights shown in the sensor plots are derived from the Bayesian regression models described in A). For panel C, this statistical model is based on the correlation coefficients computed from the TRF models (further details can be found in the Methods section). C) In the single-speaker condition, neural tracking of the speech envelope was significant for widespread areas, most pronounced over auditory processing regions. D) The condition effect indicates a decrease in neural speech tracking with increasing noise (1 distractor). E) Stronger prediction tendency was associated with increased neural speech tracking over left frontal areas. F) However, there was no interaction between prediction tendency and conditions of selective attention. G) Increased neural tracking of semantic violations was observed over left temporal areas. H) There was no interaction between word surprisal and speaker condition, suggesting a representation of surprising words independent of background noise. Marked sensors indicate ‘significant’ clusters, defined as at least two neighboring channels showing a significant result. N = 29.
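To make the pipeline in A) concrete: the analysis used boosting for deconvolution, but the chain envelope → TRF → predicted response → correlation can be sketched with ridge regression standing in as the deconvolution step. This is an illustrative simplification, not the implementation; all names and the lag range are assumptions.

```python
import numpy as np

def lagged_design_matrix(envelope, lags):
    """Time-lagged copies of the speech envelope (one column per lag, in samples)."""
    n = len(envelope)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = envelope[: n - lag]
        else:
            X[:lag, j] = envelope[-lag:]
    return X

def fit_trf(envelope, response, lags, alpha=1.0):
    """Estimate a TRF (filter kernel h) and its encoding accuracy r."""
    X = lagged_design_matrix(envelope, lags)
    # Ridge solution as a stand-in for boosting (the paper's deconvolution method)
    h = np.linalg.solve(X.T @ X + alpha * np.eye(len(lags)), X.T @ response)
    predicted = X @ h                            # convolution step
    r = np.corrcoef(predicted, response)[0, 1]   # model fit (encoding accuracy)
    return h, r
```

The per-subject, per-condition correlation coefficients r are the quantities that then enter the Bayesian regression models as dependent variables.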

Ocular speech tracking is dependent on selective attention.

A) Vertical eye movements ‘significantly’ track attended clear speech, but not in the multi-speaker condition. Temporal profiles of this effect show a downward pattern (negative TRF weights). B) Horizontal eye movements ‘significantly’ track attended speech in the multi-speaker condition. Temporal profiles of this effect show a leftward-to-rightward (negative to positive TRF weights) pattern. Statistics were performed using Bayesian regression models. A ‘*’ within posterior distributions indicates a significant difference from zero (i.e. the 94% HDI does not include zero). Shaded areas in TRF weights represent 95% confidence intervals. N = 29.
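The ‘*’ criterion (94% HDI excluding zero) can be illustrated with a small helper that finds the narrowest interval containing 94% of the posterior samples; this is a generic sketch, not the packaged routine used in the paper.

```python
import numpy as np

def hdi(samples, prob=0.94):
    """Narrowest interval containing `prob` of the posterior samples."""
    s = np.sort(np.asarray(samples))
    k = int(np.floor(prob * len(s)))   # number of steps spanning `prob` mass
    widths = s[k:] - s[: len(s) - k]   # widths of all candidate intervals
    i = np.argmin(widths)              # pick the narrowest one
    return s[i], s[i + k]

def excludes_zero(samples, prob=0.94):
    lo, hi = hdi(samples, prob)
    return not (lo <= 0.0 <= hi)       # True -> '*' in the figure
```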

Model summary statistics for ocular speech tracking depending on condition and prediction tendency.

Model summary statistics for ocular speech tracking depending on word type and condition.

Ocular speech tracking and selective attention to speech share underlying neural computations.

A) Vertical eye movements significantly mediate neural clear-speech tracking throughout the time lags from –0.3 to 0.7 s for principal component 1 (PC1) over right-lateralized auditory regions. This mediation effect propagates to more left-lateralized auditory areas over later time lags for PC2 and PC3. B) Horizontal eye movements similarly contribute to neural speech tracking of a target in the multi-speaker condition over right-lateralized auditory processing regions for PC1, also with significant anticipatory contributions and a clear peak at ∼0.18 s. PC2 shows a clear left-lateralization, extending not only over auditory but also parietal areas almost entirely throughout the time window of interest, with a clear anticipatory effect starting at –0.3 s. For PC3, a small anticipatory cluster remained at ∼–0.2 s, again over mostly left-lateralized auditory regions. Colour bars represent PCA weights for the group-averaged mediation effect. Shaded areas on time-resolved model weights represent regions of practical equivalence (ROPE) according to Kruschke (2018). Solid lines show ‘significant’ clusters where at least two neighbouring time points showed a significant mediation effect. Statistics were performed using Bayesian regression models. N = 29.
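The mediation logic amounts to comparing the envelope’s total effect on a neural component with its direct effect once the eye-movement signal is partialled out. A minimal single-lag, least-squares sketch (the actual analysis used Bayesian models across time lags and PCA-derived components; all names are illustrative):

```python
import numpy as np

def ols_coef(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# x: speech envelope, m: eye-movement mediator, y: neural component (all 1-D, standardized)
def mediation_effect(x, m, y):
    """Indirect (mediated) effect of x on y via m: total minus direct effect."""
    c = ols_coef(x[:, None], y)[1]                     # total effect of envelope
    c_prime = ols_coef(np.column_stack([x, m]), y)[1]  # direct effect, m controlled
    return c - c_prime                                 # portion carried by eye movements
```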

Ocular, but not neural, speech tracking is related to semantic speech comprehension.

A) There was no significant relationship between neural speech tracking (10% of sensors with the strongest encoding effect) and comprehension; however, a condition effect indicated that comprehension was generally decreased in the multi-speaker condition. B & C) A ‘significant’ negative relationship between comprehension and both vertical and horizontal ocular speech tracking shows that participants with weaker comprehension increasingly engaged in ocular speech tracking. Statistics were performed using Bayesian regression models. Shaded areas represent 94% HDIs. N = 29.

Model summary statistics for comprehension depending on ocular speech tracking and condition.

Model summary statistics for rated difficulty depending on ocular speech tracking and condition.

A schematic illustration of the framework:

A speech percept is processed in representational (feature-) spacetime (X- and Y-axes) at different levels of the cognitive hierarchy (Z-axis), which ranges from highly selective to more associative representations (without the notion of an “apex” in this hierarchy). The temporal and spatial characteristics of a representation depend on where and when (in the brain) it is probed. Anticipatory predictions (purple) help to interpret auditory information at different levels in parallel, with high feature-specificity but low temporal precision. These anticipatory predictions reflect to some extent individual tendencies (and differences) that generalise across listening situations. In contrast, active ocular sensing (green) increases temporal precision already at lower stages of the auditory system to facilitate bottom-up processing at specific timescales (similar to neural oscillations). It does not necessarily convey feature-specific information, but is more likely used to boost (or filter) information around relevant time windows. Our results suggest that this mechanism is driven by selective attention (blue) rather than predictive assumptions.