Neuroscience

Linguistic processing of task-irrelevant speech at a cocktail party

Paz Har-shai Yahav and Elana Zion Golumbic (corresponding authors)
The Gonda Center for Multidisciplinary Brain Research, Bar Ilan University, Israel
Research Article
Cite this article as: eLife 2021;10:e65096 doi: 10.7554/eLife.65096

Abstract

Paying attention to one speaker in a noisy place can be extremely difficult, because to-be-attended and task-irrelevant speech compete for processing resources. We tested whether this competition is restricted to acoustic-phonetic interference or if it extends to competition for linguistic processing as well. Neural activity was recorded using Magnetoencephalography as human participants were instructed to attend to natural speech presented to one ear, and task-irrelevant stimuli were presented to the other. Task-irrelevant stimuli consisted either of random sequences of syllables, or syllables structured to form coherent sentences, using hierarchical frequency-tagging. We find that the phrasal structure of structured task-irrelevant stimuli was represented in the neural response in left inferior frontal and posterior parietal regions, indicating that selective attention does not fully eliminate linguistic processing of task-irrelevant speech. Additionally, neural tracking of to-be-attended speech in left inferior frontal regions was enhanced when competing with structured task-irrelevant stimuli, suggesting inherent competition between them for linguistic processing.

eLife digest

We are all familiar with the difficulty of trying to pay attention to a person speaking in a noisy environment, often known as the ‘cocktail party problem’. This can be especially challenging when the background noise we are trying to filter out is another conversation that we can understand. To avoid being distracted in these kinds of situations, we rely on selective attention, the cognitive process that allows us to attend to one stimulus and to ignore other, irrelevant sensory information. How the brain processes the sounds in our environment and prioritizes them is still not clear.

One of the central questions is whether we can take in information from several speakers at the same time or whether we can only understand speech from one speaker at a time. Neuroimaging techniques can shed light on this matter by measuring brain activity while participants listen to competing speech stimuli, helping researchers understand how this information is processed by the brain.

Now, Har-Shai Yahav and Zion Golumbic have measured the brain activity of 30 participants as they listened to two speech streams in their native language, Hebrew. Each stream was presented to a different ear, and participants tried to focus their attention on only one of the speakers. Participants always had to attend to natural speech, while the sound they had to ignore could be either natural speech or unintelligible syllable sequences. Brain activity was recorded using magnetoencephalography, a non-invasive technique that measures the magnetic fields generated by the electrical activity of neurons in the brain.

The results showed that unattended speech activated brain areas related to both hearing and language. Thus, unattended speech was processed not only at the acoustic level (as any other type of sound would be), but also at the linguistic level. In addition, the brain response to the attended speech in brain regions related to language was stronger when the competing sound was natural speech compared to random syllables. This suggests that the two speech inputs compete for the same processing resources, which may explain why we find it difficult to stay focused in a conversation when there are other people talking in the background.

This study contributes to our understanding of how the brain processes multiple auditory inputs at once. In addition, it highlights the fact that selective attention is a dynamic process of balancing the cognitive resources allocated to competing information, rather than an all-or-none process. A potential application of these findings could be the design of smart devices that help individuals focus their attention in noisy environments.

Introduction

The seminal speech-shadowing experiments conducted in the 1950s and 1960s set the stage for studying one of the primary cognitive challenges encountered in daily life: how do our perceptual and linguistic systems deal effectively with competing speech inputs? (Cherry, 1953; Broadbent, 1958; Treisman, 1960). Over the past decades, a wealth of behavioral and neural evidence has accumulated showing that when only one speech-stream is behaviorally relevant, auditory and linguistic resources are devoted to its preferential encoding at the expense of other task-irrelevant input. Consequently, this so-called ‘attended’ message can be repeated, comprehended, and remembered, whereas very little of competing task-irrelevant speech is explicitly recalled (Glucksberg, 1970; Ambler et al., 1976; Neely and LeCompte, 1999; Oswald et al., 2000; Brungart et al., 2001). This attentional selection is accompanied by attenuation of speech-tracking for unattended speech in auditory regions (Mesgarani and Chang, 2012; Horton et al., 2013; Zion Golumbic et al., 2013a; O'Sullivan et al., 2015; Fiedler et al., 2019; Teoh and Lalor, 2019) as well as language-related regions (Zion Golumbic et al., 2013b), particularly for linguistic features of the speech (Brodbeck et al., 2018a; Broderick et al., 2018; Ding et al., 2018; Brodbeck et al., 2020a). However, as demonstrated even in the earliest studies, the content of task-irrelevant speech is probably not fully suppressed and can affect listener behavior in a variety of ways (Moray, 1959; Bryden, 1964; Yates, 1965). Indeed, despite decades of research, the extent to which concurrent speech streams are processed and the nature of the competition for resources between ‘attended’ and ‘task-irrelevant’ input in multi-speaker contexts are still highly debated (Kahneman, 1973; Driver, 2001; Lachter et al., 2004; Bronkhorst, 2015).

Fueling this debate are often-conflicting empirical findings regarding whether or not task-irrelevant speech is processed for semantic and linguistic content. Many studies fail to find behavioral or neural evidence for processing task-irrelevant speech beyond its acoustic features (Carlyon et al., 2001; Lachter et al., 2004; Ding et al., 2018). However, others are able to demonstrate that at least some linguistic information is gleaned from task-irrelevant speech. For example, task-irrelevant speech is more distracting than non-speech or incomprehensible distractors (Rhebergen et al., 2005; Iyer et al., 2010; Best et al., 2012; Gallun and Diedesch, 2013; Carey et al., 2014; Kilman et al., 2014; Swaminathan et al., 2015; Kidd et al., 2016), and there are also indications of implicit processing of the semantic content of task-irrelevant speech, manifested through priming effects or memory intrusions (Tun et al., 2002; Dupoux et al., 2003; Rivenez et al., 2006; Beaman et al., 2007; Carey et al., 2014; Aydelott et al., 2015; Schepman et al., 2016). Known as the ‘Irrelevant Sound Effect’ (ISE), these effects are not always accompanied by explicit recall or recognition (Lewis, 1970; Bentin et al., 1995; Röer et al., 2017a), although in some cases task-irrelevant words, such as one’s own name, may also ‘break into consciousness’ (Cherry, 1953; Treisman, 1960; Wood and Cowan, 1995; Conway et al., 2001).

Behavioral findings indicating linguistic processing of task-irrelevant speech have been interpreted in two opposing ways. Proponents of Late-Selection attention theories understand them as reflecting the system’s capability to apply linguistic processing to more than one speech stream in parallel, albeit mostly pre-consciously (Deutsch and Deutsch, 1963; Parmentier, 2008; Parmentier et al., 2018; Vachon et al., 2020). However, others maintain an Early-Selection perspective, namely, that only one speech stream can be processed linguistically due to inherent processing bottlenecks, but that listeners may shift their attention between concurrent streams, giving rise to occasional (conscious or pre-conscious) intrusions from task-irrelevant speech (Cooke, 2006; Vestergaard et al., 2011; Fogerty et al., 2018). Adjudicating between these two explanations experimentally is difficult, due to the largely indirect nature of the operationalizations used to assess linguistic processing of task-irrelevant speech. Moreover, much of the empirical evidence fueling this debate focuses on detection of individual ‘task-irrelevant’ words, effects that can be easily explained either by parallel processing or by attention-shifts, due to their short duration.

In an attempt to broaden this conversation, here we use objective neural measures to evaluate the level of processing applied to task-irrelevant speech. Using a previously established technique of hierarchical frequency-tagging (Ding et al., 2016; Makov et al., 2017), we are able to go beyond the question of detecting individual words and probe whether linguistic processes that require integration over longer periods of time – such as syntactic structure building – are applied to task-irrelevant speech. To study this, we recorded brain activity using Magnetoencephalography (MEG) during a dichotic listening selective-attention experiment. Participants were instructed to attend to narratives of natural speech presented to one ear, and to ignore speech input from the other ear (Figure 1). Task-irrelevant stimuli consisted of sequences of syllables, presented at a constant rate (4 Hz), with their order manipulated to create either linguistically Structured or Non-Structured sequences. Specifically, in the Non-Structured condition syllables were presented in a completely random order, whereas in the Structured condition syllables were ordered to form coherent words, phrases, and sentences. In keeping with the frequency-tagging approach, each of these linguistic levels is associated with a different frequency (words - 2 Hz, phrases - 1 Hz, sentences - 0.5 Hz). By structuring task-irrelevant speech in this way, the two conditions were perfectly controlled for low-level acoustic attributes that contribute to energetic masking (e.g. loudness, pitch, and fine-structure), as well as for the presence of recognizable acoustic-phonetic units, which has been proposed to contribute to phonetic interference during speech-on-speech masking (Rhebergen et al., 2005; Shinn-Cunningham, 2008; Kidd et al., 2016). Rather, the only difference between the conditions was in the order of the syllables, which either did or did not form linguistic structures.
Consequently, if the neural signal shows peaks at frequencies associated with linguistic features of Structured task-irrelevant speech, as has been reported previously when these types of stimuli are attended or presented without competition (Ding et al., 2016; Ding et al., 2018; Makov et al., 2017), this would provide evidence that integration-based processes operating on longer time-scales are applied to task-irrelevant speech, for identifying longer linguistic units composed of several syllables. In addition, we also tested whether the neural encoding of the to-be-attended speech itself was affected by the linguistic structure of task-irrelevant speech, which could highlight the source of potential tradeoffs or competition for resources when presented with competing speech (Zion Golumbic et al., 2013b; O'Sullivan et al., 2015; Fiedler et al., 2019; Teoh and Lalor, 2019).

Figure 1 with 2 supplements
Illustration of the Dichotic Listening Paradigm.

(a) Participants were instructed to attend to the right or left ear (counterbalanced) and to ignore the other. The to-be-attended stimulus was always natural Hebrew speech, and multiple choice questions about its content were asked at the end of each trial. The task-irrelevant ear was always presented with hierarchical frequency-tagging stimuli in two conditions: Structured and Non-Structured. (b) Example of intelligible (Structured) speech composed of 250 ms syllables in which 4 levels of information are differentiated based on their rate: acoustic/syllabic, word, phrasal, and sentential rates (at 4, 2, 1, and 0.5 Hz, respectively). Translation of the two Hebrew sentences in the example: ‘Small puppies want a hug’ and ‘A taxi driver turned on the meter.’ Control stimuli were Non-Structured syllable sequences with the same syllabic rate of 4 Hz. (c) Representative sound wave of a single sentence (2 s). Sound intensity fluctuates at the rate of 4 Hz. (d) Modulation spectra of the speech envelopes in each condition. Panels b and c are reproduced from Figure 1A and Figure 1B of Makov et al., 2017. Panel d has been adapted from Figure 1C of Makov et al., 2017.

Materials and methods

Participants

We recorded MEG from 30 native Hebrew speakers (18 females, 12 males). Participants were adult volunteers, with ages ranging between 18 and 34 (M = 24.8, SD = 4.2), and all were right-handed. Sample size was determined a priori, based on a previous study from our group using a similar paradigm and electrophysiological measures (Makov et al., 2017), where significant effects were found in a sample of n = 21 participants. Exclusion criteria included: non-native Hebrew speakers, a history of neurological disorders or ADHD (based on self-report), or the existence of metal implants (which would disrupt MEG recordings). The study was approved by the IRB committee at Bar-Ilan University and all participants provided their written consent for participation prior to the experiment.

Natural speech (to-be-attended)


Natural speech stimuli were narratives from publicly available Hebrew podcasts and short audio stories (duration: 44.53 ± 3.23 s). These speech materials were chosen from an existing database in the lab that had been used in previous studies and for which the behavioral task had already been validated (see Experimental Procedure). The stimuli originally consisted of narratives in both female and male voices. However, since it is known that selective attention to speech is highly influenced by whether the competing voices are of the same/different sex (Brungart et al., 2001; Rivenez et al., 2006; Ding and Simon, 2012), and since the task-irrelevant stimuli were recorded only in a male voice (see below), we transformed narratives that were originally recorded in a female voice to a male voice (change-gender function in Praat; Boersma, 2011, http://www.praat.org). To ensure that the gender change did not affect the naturalness of the speech and to check for abnormalities in the materials, we conducted a short survey among 10 native Hebrew speakers. They all agreed that the speech sounded natural and normal. Sound intensity was equated across all narratives. Stimuli examples are available at: https://osf.io/e93qa. These natural speech narratives served as the to-be-attended stimuli in the experiment. For each participant, they were randomly paired with task-irrelevant speech (regardless of condition), to avoid material-specific effects.

Frequency-tagged speech (task-irrelevant)


A bank of individually recorded Hebrew syllables was used to create two sets of isochronous speech sequences. Single syllables were recorded in random order by a male actor, and remaining prosodic cues were removed using pitch normalization in Praat. Additional sound editing was performed to adjust the length of each syllable to precisely 250 ms, either by truncation or by silence padding at the end (original mean duration 243.6 ± 64.3 ms, range 168–397 ms). In cases of truncation, a fade-out was applied to the last 25 ms to avoid clicks. Sound intensity was then manually equated across all syllables.

These syllables were concatenated to create long sequences using custom-written scripts in MATLAB (The MathWorks; code available at https://osf.io/e93qa), equated in length to those of the natural speech segments (44.53 ± 3.23 s). Sequences could either be Non-Structured, with syllables presented in a fully random order without creating meaningful linguistic units, or linguistically Structured. Structured sequences were identical to those used in a previous study from our group (Makov et al., 2017), and were formed as follows: every two syllables formed a word, every two words formed a phrase, and every two phrases formed a sentence. Because syllables were grouped hierarchically into linguistic constituents with no additional acoustic gaps inserted between them, different linguistic hierarchies are associated with fixed periodicities throughout the stimuli (syllables at 4 Hz, words at 2 Hz, phrases at 1 Hz, and sentences at 0.5 Hz; Figure 1b). Structured stimuli also contained no prosodic cues or other low-level acoustic indications of boundaries between linguistic structures, nor did Structured sentences include rhymes, passive verb forms, or arousing semantic content. See Supplementary Material for more information on the construction of Structured and Non-Structured stimuli.
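For illustration, this concatenation scheme can be sketched as follows. This is a minimal sketch, not the actual stimulus-generation code (which is available at https://osf.io/e93qa): the syllable labels, sampling rate, and function names are our own placeholders; only the 250 ms syllable duration and the 2:1 hierarchical grouping come from the text.

```python
import numpy as np

FS = 44100                     # assumed audio sampling rate (hypothetical)
SYLL_SAMPLES = int(FS * 0.25)  # every syllable is exactly 250 ms -> 4 Hz

def build_sequence(syllable_bank, order):
    """Concatenate fixed-length syllables with no gaps, so that words
    (2 syllables), phrases (2 words), and sentences (2 phrases) recur at
    fixed rates of 2, 1, and 0.5 Hz, respectively."""
    chunks = []
    for name in order:
        syl = np.asarray(syllable_bank[name], dtype=float)
        if len(syl) < SYLL_SAMPLES:                      # pad with silence
            syl = np.pad(syl, (0, SYLL_SAMPLES - len(syl)))
        else:                                            # or truncate
            syl = syl[:SYLL_SAMPLES]
        chunks.append(syl)
    return np.concatenate(chunks)

# One Structured sentence: 8 syllables = 4 words = 2 phrases = 1 sentence.
# Labels are placeholders, not the actual recorded materials.
structured_order = ['s1', 's2', 's3', 's4', 's5', 's6', 's7', 's8']
# A Non-Structured control uses the same bank in randomized order:
nonstructured_order = list(np.random.default_rng(0).permutation(structured_order))
```

Because the two conditions draw on the same syllable bank, they are acoustically matched; only the ordering differs.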

The modulation spectrum of both types of task-irrelevant stimuli is shown in Figure 1d. It was calculated using a procedure analogous to the spectral analysis performed on the MEG data, in order to ensure maximal comparability between the spectrum of the stimuli and the spectrum of the neural response. Specifically, (1) the broadband envelope of each sequence was extracted by taking the root-mean-square of the audio (10 ms smoothing window); (2) the envelope was segmented into 8 s long segments, which was identical to the segmentation of the MEG data; (3) a fast Fourier transform (FFT) was applied to each segment; and (4) the resulting spectra were averaged across segments. As expected, both stimuli contained a prominent peak at 4 Hz, corresponding to the syllable-rate. The Structured stimuli also contain a smaller peak at 2 Hz, which corresponds to the word-rate. This is an undesirable side-effect of the frequency-tagging approach, since ideally these stimuli should not contain any energy at frequencies other than the syllable-rate. As shown in the Supplementary Material, the 2 Hz peak in the modulation spectrum reflects the fact that a consistently different subset of syllables occurs at each position within the sequence (e.g. at the beginning/end of words; for similar acoustic effects when using frequency-tagging of multi-syllable words see Luo and Ding, 2020). A similar 2 Hz peak is not observed in the Non-Structured condition, where the syllables are randomly positioned throughout all stimuli. Given this difference in the modulation spectrum, if we were to observe a 2 Hz peak in the neural response to Structured vs. Non-Structured stimuli, this would not necessarily provide conclusive evidence for linguistic ‘word-level’ encoding (although see Makov et al., 2017 and the Supplementary Material for a way to control for this).
As it happened, in the current dataset we did not see a 2 Hz peak in the neural response in either condition (see Results), therefore this caveat did not affect the interpretability of the data in this specific instance. Importantly, neither the Structured nor the Non-Structured stimuli contained peaks at frequencies corresponding to other linguistic levels (1 Hz and 0.5 Hz), hence comparison of neural responses at these frequencies remained experimentally valid. Stimuli examples are available at: https://osf.io/e93qa.
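The four-step envelope analysis described above can be sketched in Python (a minimal reimplementation for illustration; parameter values follow the text, function names are ours):

```python
import numpy as np

def modulation_spectrum(audio, fs, seg_dur=8.0, smooth=0.010):
    """Modulation spectrum of a speech envelope, mirroring the MEG analysis:
    (1) RMS envelope (10 ms window), (2) 8 s segments, (3) FFT, (4) average."""
    win = max(1, int(fs * smooth))
    # (1) broadband envelope via sliding root-mean-square
    env = np.sqrt(np.convolve(audio ** 2, np.ones(win) / win, mode='same'))
    # (2) segment the envelope exactly as the MEG data were segmented
    seg_len = int(fs * seg_dur)
    n_seg = len(env) // seg_len
    segs = env[:n_seg * seg_len].reshape(n_seg, seg_len)
    # (3) FFT magnitude per segment, (4) averaged across segments
    spectrum = np.abs(np.fft.rfft(segs, axis=1)).mean(axis=0)
    freqs = np.fft.rfftfreq(seg_len, d=1 / fs)
    return freqs, spectrum
```

With 8 s segments the frequency resolution is 0.125 Hz, so the 4, 2, 1, and 0.5 Hz rates of interest each fall on an exact frequency bin.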

Experimental procedure


The experiment used a dichotic listening paradigm, in which participants were instructed to attend to a narrative of natural Hebrew speech presented to one ear, and to ignore input from the other ear, where either Structured or Non-Structured task-irrelevant speech was presented. The experiment included a total of 30 trials (44.53 ± 3.23 s), and participants were informed at the beginning of each trial which ear to attend to, which was counterbalanced across trials. The sound intensity of the task-irrelevant stimuli increased gradually during the first three seconds of each trial, to avoid inadvertent hints regarding word boundaries, and the start of each trial was excluded from data analysis. After each trial, participants answered four multiple choice questions about the content of the narrative they were supposed to attend to (three answers per question; chance level = 0.33). Some of the questions required recollection of specific details (e.g. ‘what color was her hat?’), and some addressed the ‘gist’ of the narrative (e.g. ‘why was she sad?’). The average accuracy rate of each participant (% questions answered correctly) was calculated across all questions and narratives, separately for trials in the Structured and Non-Structured condition.

This task was chosen as a way to motivate and guide participants to direct attention toward the to-be-attended narrative and to provide verification that they indeed listened to it. At the same time, we recognize that this task is not highly sensitive for gauging the full extent of processing of the narrative, for two reasons: (1) its sparse sampling of behavior (four questions for a 45 s narrative); and (2) accuracy is affected by additional cognitive factors besides attention, such as short-term memory, engagement, and deductive reasoning. Indeed, behavioral screening of this task showed that performance was far from perfect even when participants listened to these narratives in a single-speaker context (i.e., without additional competing speech; average accuracy rate 0.83 ± 0.08; n = 10). Hence, we did not expect performance on this task to necessarily reflect participants’ internal attentional state. At the same time, this task is instrumental in guiding participants' selective attention toward the designated speaker, allowing us to analyze their neural activity during uninterrupted listening to continuous speech, which was the primary goal of the current study.

Additional tasks


Based on previous experience, the Structured speech materials are not immediately recognizable as Hebrew speech and require some familiarization. Hence, to ensure that all participants could identify these as speech, and in order to avoid any perceptual learning effects during the main experiment, they underwent a familiarization stage prior to the start of the main experiment, inside the MEG. In this stage, participants heard sequences of 8 isochronous syllables, which were either Structured – forming a single sentence – or Non-Structured. After each trial participants were asked to repeat the sequence out loud. The familiarization stage continued until participants correctly repeated five stimuli of each type. Structured and Non-Structured stimuli were presented in random order.

At the end of the experiment, a short auditory localizer task was performed. The localizer consisted of 200 ms long tones at five frequencies: 400, 550, 700, 850, and 1000 Hz. The tones were presented with random ISIs: 500, 700, 1000, 1200, 1400, 1600, 1800, and 2000 ms. Participants listened to the tones passively and were instructed only to focus on the fixation mark in the center of the screen.

MEG data acquisition


MEG recordings were conducted with a whole-head, 248-channel magnetometer array (4D Neuroimaging, Magnes 3600 WH) in a magnetically shielded room at the Electromagnetic Brain Imaging Unit, Bar-Ilan University. A series of magnetometer and gradiometer reference coils located above the signal coils were used to record and subtract environmental noise. The location of the head with respect to the sensors was determined by measuring the magnetic field produced by small currents delivered to five head coils attached to the scalp. Before the experimental session, the position of the head coils was digitized in relation to three anatomical landmarks (left and right preauricular points and nasion). The data were acquired at a sampling rate of 1017.3 Hz and an online 0.1–200 Hz band-pass filter was applied. To enable removal of line-noise and vibration artifacts from the MEG recordings, 50 Hz power-line fluctuations were recorded directly from the power line, and vibrations were recorded using a set of accelerometers attached to the sensor array.

MEG preprocessing


Preprocessing was performed in MATLAB (The MathWorks) using the FieldTrip toolbox (http://www.fieldtriptoolbox.org). Outlier trials were identified manually by visual inspection and excluded from analysis. Using independent component analysis (ICA), we removed components reflecting eye movements (EOG), heartbeats, and building vibrations. The clean data were then segmented into 8 s long segments, corresponding to four sentences in the Structured condition. Critically, these segments were perfectly aligned such that they all started at the onset of a syllable, which in the Structured condition was also the onset of a sentence.

Source estimation


Source estimation was performed in Python (http://www.python.org) using the MNE-python platform (Gramfort et al., 2013; Gramfort et al., 2014). Source modeling was performed on the pre-processed MEG data, by computing Minimum-Norm Estimates (MNEs). In order to calculate the forward solution, and constrain source locations to the cortical surface, we constructed a Boundary Element Model (BEM) for each participant. BEM was calculated using the participants’ head shape and location relative to the MEG sensors, which was co-registered to an MRI template (FreeSurfer; surfer.nmr.mgh.harvard.edu). Then, the cortical surface of each participant was decimated to 8194 source locations per hemisphere with at least 5 mm spacing between adjacent locations. A noise covariance matrix was estimated using the inter-trial intervals in the localizer task (see Additional Tasks), that is, periods when no auditory stimuli were presented. Then, an inverse operator was computed based on the forward solution and the noise covariance matrix, and was used to estimate the activity at each source location. For visualizing the current estimates on the cortical surface, we used dynamic Statistical Parametric Map (dSPM), which is an F-statistic calculated at each voxel and indicating the relationship between MNE amplitude estimations and the noise covariance (Dale et al., 2000). Finally, individual cortical surfaces were morphed onto a common brain, with 10,242 dipoles per hemisphere (Fischl et al., 1999), in order to compensate for inter-subject differences.

Behavioral data analysis


The behavioral score was calculated as the average of correct responses across trials (four multiple-choice questions per narrative) for each participant. In order to verify that participants understood and completed the task, we performed a t-test comparing accuracy rates to chance level (i.e. 0.33). Then, to test whether performance was affected by the type of task-irrelevant speech presented, we performed a paired t-test between the accuracy rates in the two conditions. We additionally performed a median-split analysis of the behavioral scores across participants, based on their neural response to task-irrelevant speech (specifically the phrase-level response; see MEG data analysis), to test for possible interactions between performance on the to-be-attended speech and linguistic neural representation of task-irrelevant speech.
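These three tests can be sketched as follows (an illustrative reimplementation using SciPy; function and variable names are ours, not the original analysis code):

```python
import numpy as np
from scipy import stats

CHANCE = 1 / 3   # three answer options per question

def behavioral_tests(acc_structured, acc_nonstructured, itpc_1hz):
    """acc_*: per-participant accuracy in each condition;
    itpc_1hz: per-participant phrase-level (1 Hz) ITPC, used for the split."""
    acc_s = np.asarray(acc_structured)
    acc_ns = np.asarray(acc_nonstructured)
    # (1) overall accuracy vs. chance level
    vs_chance = stats.ttest_1samp((acc_s + acc_ns) / 2, CHANCE)
    # (2) Structured vs. Non-Structured (paired, within participant)
    between = stats.ttest_rel(acc_s, acc_ns)
    # (3) median split on the neural response: compare accuracy between
    #     participants with high vs. low phrase-level ITPC
    high = np.asarray(itpc_1hz) > np.median(itpc_1hz)
    split = stats.ttest_ind(acc_s[high], acc_s[~high])
    return vs_chance, between, split
```

Each returned object carries the `statistic` and `pvalue` reported in the Results.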

MEG data analysis

Spectral analysis

Scalp-level

Inter-trial phase coherence (ITPC) was calculated on the clean and segmented data. To this end, we applied a Fast Fourier Transform (FFT) to individual 8 s long segments and extracted the phase component at each frequency (from 0.1 to 15 Hz, with a 0.125 Hz step). The normalized (z-scored) ITPC at each sensor was calculated using the Matlab circ_rtest function (circular statistics toolbox; Berens, 2009). This was performed separately for the Structured and Non-Structured conditions. In order to determine which frequencies had significant ITPC, we performed a t-test between each frequency bin and the surrounding frequencies (averaging two bins on each side), separately for each condition (Nozaradan et al., 2018). In addition, we directly compared the ITPC spectra between the two conditions using a permutation test. In each permutation, the labels of the two conditions were randomly switched in half of the participants, and a paired t-test was performed. This was repeated 1000 times, creating a null-distribution of t-values for this paired comparison, separately for each of the frequencies of interest. The t-values of the real comparisons were evaluated relative to this null-distribution, and were considered significant if they fell within the top 5% (one-way comparison, given our a priori prediction that peaks in the Structured condition would be higher than in the Non-Structured condition). This procedure was performed on the average ITPC across all sensors, to avoid the need to correct for multiple comparisons, and focused specifically on four frequencies of interest (FOI): 4, 2, 1, and 0.5 Hz, which correspond to the four linguistic levels present in the Structured stimuli (syllables, words, phrases, and sentences, respectively).
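A compact sketch of the ITPC computation and the label-switching permutation test (an illustration, not the original Matlab code; the z-score follows the Rayleigh statistic computed by circ_rtest, and flipping each participant's labels with probability 0.5 is one standard implementation of the described permutation):

```python
import numpy as np
from scipy import stats

def itpc_spectrum(segments, fs):
    """Inter-trial phase coherence for one sensor.
    segments: (n_segments, n_samples) array of 8 s data epochs."""
    n_seg, n_samp = segments.shape
    phases = np.angle(np.fft.rfft(segments, axis=1))   # phase per segment/freq
    r = np.abs(np.mean(np.exp(1j * phases), axis=0))   # resultant vector length
    z = n_seg * r ** 2                                 # Rayleigh z (as in circ_rtest)
    freqs = np.fft.rfftfreq(n_samp, d=1 / fs)
    return freqs, z

def permutation_ttest(cond_a, cond_b, n_perm=1000, seed=0):
    """Null distribution of paired t-values obtained by randomly switching
    condition labels within participants (half flipped on average)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(cond_a), np.asarray(cond_b)
    t_obs = stats.ttest_rel(a, b).statistic
    null = np.empty(n_perm)
    for i in range(n_perm):
        flip = rng.random(len(a)) < 0.5
        null[i] = stats.ttest_rel(np.where(flip, b, a),
                                  np.where(flip, a, b)).statistic
    return t_obs, np.mean(null >= t_obs)   # one-sided p, as in the text
```

Phase-locked activity at a tagged frequency yields a resultant length near 1, and hence a large Rayleigh z, across segments.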

Source-level

Spectral analysis of source-level data was similar to that performed at the sensor level. ITPC was calculated for each frequency between 0.1 and 15 Hz (0.125 Hz step) and a t-test between the ITPC at each frequency relative to the surrounding frequencies (two bins from each side) was performed in order to validate the response peaks at the FOIs. Then, statistical comparison of responses at the source-level in the Structured and Non-Structured conditions focused only on the peaks that showed a significant difference between conditions at the scalp-level (in this case, the peak at 1 Hz). In order to determine which brain-regions contributed to this effect, we used 22 pre-defined ROIs in each hemisphere, identified based on multi-modal anatomical and functional parcellation (Glasser et al., 2016; Supplementary Neuroanatomical Results [table 1, page 180] and see Figure 3c). We calculated the mean ITPC value in each ROI, across participants and conditions, and tested for significant differences between them using a permutation test, which also corrected for multiple comparisons. As in the scalp-data, the permutation test was based on randomly switching the labels between conditions for half of the participants and conducting a paired t-test within each ROI. In each permutation we identified ROIs that passed a statistical threshold for a paired t-test (p<0.05) and the sum of their t-values was used as a global statistic. This procedure was repeated 1000 times, creating a null-distribution for this global statistic. A similar procedure was applied to the real data, and if the global statistic (sum of t-values in the ROIs that passed an uncorrected threshold of p<0.05) fell within the top 5% of the null-distribution, the entire pattern could be considered statistically significant. This procedure was conducted separately within each hemisphere.
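The ROI-level permutation procedure and its global statistic can be sketched as follows (illustrative only; it assumes ITPC values have already been extracted per participant and ROI, and uses the same per-participant label flip as above):

```python
import numpy as np
from scipy import stats

def global_stat(a, b, alpha=0.05):
    """Sum of paired t-values over ROIs passing an uncorrected p < alpha.
    a, b: (n_participants, n_rois) ITPC per condition."""
    t, p = stats.ttest_rel(a, b, axis=0)
    return t[p < alpha].sum() if np.any(p < alpha) else 0.0

def roi_permutation_test(a, b, n_perm=1000, seed=0):
    """Compare the observed global statistic against a null distribution
    built by switching condition labels per participant; evaluating the
    summed statistic corrects for the multiple ROIs tested."""
    rng = np.random.default_rng(seed)
    obs = global_stat(a, b)
    null = np.empty(n_perm)
    for i in range(n_perm):
        flip = rng.random(a.shape[0]) < 0.5
        a2 = np.where(flip[:, None], b, a)
        b2 = np.where(flip[:, None], a, b)
        null[i] = global_stat(a2, b2)
    return obs, np.mean(null >= obs)
```

If the observed sum of supra-threshold t-values falls in the top 5% of the null distribution, the pattern of ROI differences is considered significant as a whole.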

Speech-tracking analysis


Speech-tracking analysis was performed in order to estimate the neural response to the natural speech that served as the to-be-attended stimulus. To this end, we estimated the Temporal Response Function (TRF), which is a linear transfer-function expressing the relationship between features of the presented speech stimulus s(t) and the recorded neural response r(t). TRFs were estimated using normalized reverse correlation as implemented in the STRFpak Matlab toolbox (strfpak.berkeley.edu) and adapted for MEG data (Zion Golumbic et al., 2013a). Tolerance and sparseness factors were determined using a jackknife cross-validation procedure, to minimize effects of over-fitting. In this procedure, given a total of N trials, a TRF is estimated between s(t) and r(t) derived from N-1 trials, and this estimate is used to predict the neural response to the left-out stimulus. The tolerance and sparseness factors that best predicted the actual recorded neural signal (predictive power estimated using Pearson’s correlation) were selected based on scalp-level TRF analysis (collapsed across conditions), and these were also used when repeating the analysis on the source-level data (David et al., 2007). The predictive power of the TRF model was also evaluated statistically by comparing it to a null-distribution obtained from repeating the procedure on 1000 permutations of mismatched s*(t) and r*(t).
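The study estimated TRFs with normalized reverse correlation (STRFpak). As an illustration of the same lagged linear model and leave-one-out evaluation, here is a sketch using ridge regression, a closely related regularized estimator; function names, the regularization parameter, and the lag window are our own choices:

```python
import numpy as np

def estimate_trf(stim, resp, fs, tmin=-0.1, tmax=0.4, lam=100.0):
    """Estimate a TRF: weights w(lag) such that r(t) ~ sum_l w(l) * s(t - l).
    Ridge-regularized least squares over time-lagged copies of the stimulus
    (np.roll wraps at the edges, which is acceptable for this sketch)."""
    lags = np.arange(int(tmin * fs), int(tmax * fs) + 1)
    X = np.stack([np.roll(stim, l) for l in lags], axis=1)
    w = np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ resp)
    return lags / fs, w

def predictive_power(stims, resps, fs, **kw):
    """Leave-one-trial-out cross-validation: fit on N-1 trials and correlate
    the predicted response with the held-out trial (Pearson's r)."""
    rs = []
    for i in range(len(stims)):
        s_tr = np.concatenate([s for j, s in enumerate(stims) if j != i])
        r_tr = np.concatenate([r for j, r in enumerate(resps) if j != i])
        lag_times, w = estimate_trf(s_tr, r_tr, fs, **kw)
        lags = np.round(lag_times * fs).astype(int)
        X = np.stack([np.roll(stims[i], l) for l in lags], axis=1)
        rs.append(np.corrcoef(X @ w, resps[i])[0, 1])
    return float(np.mean(rs))
```

In the actual analysis, predictive power was additionally compared against a null distribution from 1000 mismatched stimulus-response pairings.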

TRFs to the to-be-attended natural speech were estimated separately for trials in which the task-irrelevant speech was Structured vs. Non-Structured. TRFs in these two conditions were then compared statistically to evaluate the effect of the type of task-irrelevant stimulus on neural encoding of to-be-attended speech. For scalp-level TRFs, we used a spatial-temporal clustering permutation test to identify the time-windows where the TRFs differed significantly between conditions (fieldtrip toolbox; first-level stat p<0.05, cluster corrected). We then turned to the source-level TRFs to further test which brain regions showed a significant difference between conditions, by estimating TRFs in the same 22 pre-defined ROIs in each hemisphere used above. TRFs in the Structured vs. Non-Structured conditions were compared using t-tests in 20 ms long windows (focusing only on the 70–180 ms time window, which was found to be significant in the scalp-level analysis) and corrected for multiple comparisons using a spatial-temporal clustering permutation test.
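A one-dimensional (time-only) version of this cluster-based permutation logic can be sketched as follows; sign-flipping the paired differences and scoring clusters by their summed t-mass is the standard FieldTrip-style approach, but the thresholds and simulated data here are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def cluster_masses(t_vals, thresh):
    """Summed |t| within each contiguous run of supra-threshold samples."""
    masses, run = [], 0.0
    for t in t_vals:
        if abs(t) > thresh:
            run += abs(t)
        elif run:
            masses.append(run)
            run = 0.0
    if run:
        masses.append(run)
    return masses

def temporal_cluster_test(cond_a, cond_b, n_perm=1000, alpha=0.05, seed=0):
    """Cluster-based permutation test on (n_subjects, n_time) TRFs:
    paired t-test at each sample, clusters of contiguous supra-threshold
    samples, and a null distribution of maximum cluster masses obtained
    by randomly sign-flipping the paired differences."""
    rng = np.random.default_rng(seed)
    diff = cond_a - cond_b
    n_subj = diff.shape[0]
    thresh = stats.t.ppf(1 - alpha / 2, df=n_subj - 1)
    t_obs = stats.ttest_rel(cond_a, cond_b, axis=0).statistic
    obs_masses = cluster_masses(t_obs, thresh)
    null = np.zeros(n_perm)
    for i in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=n_subj)[:, None]
        d = diff * signs
        t_perm = d.mean(0) / (d.std(0, ddof=1) / np.sqrt(n_subj))
        m = cluster_masses(t_perm, thresh)
        null[i] = max(m) if m else 0.0
    # corrected p-value for each observed cluster
    return [float((null >= m).mean()) for m in obs_masses]
```

A cluster whose mass exceeds the 95th percentile of the null distribution of maximum masses is considered significant after correction.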

Results

Data from one participant were excluded from all analyses due to technical problems during MEG recording. Six additional participants were removed only from the source estimation analysis due to technical issues. The full behavioral and neural data are available at: https://osf.io/e93qa.

Behavioral results

Participants' response accuracy on the comprehension questions about the to-be-attended narratives was significantly above chance (M = 0.715, SD = 0.15; t(28)=26.67, p<0.001). There were no significant differences in behavior as a function of whether task-irrelevant speech was Structured or Non-Structured (t(28)=−0.31, p=0.75; Figure 2). Additionally, to test for possible interactions between answering questions about the to-be-attended speech and linguistic neural representation of task-irrelevant speech, we performed a median-split analysis of the behavioral scores across participants. Specifically, we used the magnitude of the 1 Hz ITPC in the Structured condition (averaged across all sensors) to split the sample into two groups, with high and low 1 Hz responses. We performed a between-group t-test on the behavioral results in the Structured condition, and also on the difference between conditions (Structured – Non-Structured). Neither test showed significant differences in performance between participants whose 1 Hz ITPC was above vs. below the median (Structured condition: t(27) = −1.07, p=0.29; Structured – Non-Structured: t(27) = −1.04, p=0.15). Similar null-results were obtained when the median-split was based on the source-level data.
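The median-split comparison reduces to a few lines; this sketch assumes per-participant arrays, and the function name and toy data are ours.

```python
import numpy as np
from scipy import stats

def median_split_ttest(itpc_1hz, accuracy):
    """Split participants into high/low groups by their 1 Hz ITPC (e.g.
    in the Structured condition) and compare behavioral accuracy between
    the groups with an independent-samples t-test."""
    itpc_1hz = np.asarray(itpc_1hz, dtype=float)
    accuracy = np.asarray(accuracy, dtype=float)
    high = itpc_1hz > np.median(itpc_1hz)
    t, p = stats.ttest_ind(accuracy[high], accuracy[~high])
    return float(t), float(p)
```

In the actual data this test was run both on Structured-condition accuracy and on the Structured – Non-Structured accuracy difference, yielding null results in both cases.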

Behavioral results.

Mean accuracy across all participants for both Structured and Non-Structured conditions. Lines represent individual results.

Hierarchical frequency-tagged responses to task-irrelevant speech

Scalp-level spectra of the Inter-trial phase coherence (ITPC) showed a significant peak at the syllabic-rate (4 Hz) in response to both Structured and Non-Structured hierarchical frequency-tagged speech, with the four-pole scalp-distribution common to MEG-recorded auditory responses (Figure 3a) (p<10^−9; large effect size, Cohen's d > 1.5 in both). As expected, there was no significant difference between the Structured and Non-Structured conditions in the 4 Hz response (p=0.899). Importantly, we also observed a significant peak at 1 Hz in the Structured condition (p<0.003; moderate effect size, Cohen's d = 0.6), but not in the Non-Structured condition (p=0.88). Comparison of the 1 Hz ITPC between these conditions also confirmed a significant difference between them (p=0.045; moderate effect size, Cohen's d = 0.57). The scalp-distribution of the 1 Hz peak did not conform to the typical auditory response topography, suggesting different neural generators. No other significant peaks were observed at any other frequencies, including the 2 Hz word-level and 0.5 Hz sentence-level rates, nor did the responses at these rates differ significantly between conditions.

Neural tracking of linguistic structures in task-irrelevant speech.

(a) Top panel shows the ITPC spectrum at the scalp-level (average across all sensors) in response to Structured (purple) and Non-Structured (green) task-irrelevant speech. ITPC values are z-score normalized, as implemented in the circ_rtest function (see Materials and methods). Shaded areas indicate SEM across participants (n = 29). The asterisk represents a statistically significant difference (p<0.05) between conditions, indicating a significant response at 1 Hz for Structured task-irrelevant speech, which corresponds to the phrase-level. Bottom panel shows the scalp-topography of ITPC at 4 Hz and 1 Hz in the two conditions. (b) ITPC spectrum at the source-level, averaged across all voxels in each hemisphere. Shaded highlights denote SEM across participants (n = 23). The asterisk represents a statistically significant difference (p<0.05) between conditions, indicating a significant response at 1 Hz for Structured task-irrelevant speech in the left, but not right, hemisphere. (c) Source-level map on a central inflated brain depicting the ROIs in the left hemisphere where significant differences in ITPC at 1 Hz were found for Structured vs. Non-Structured task-irrelevant speech. Black lines indicate the parcellation into the 22 ROIs used for source-level analysis. (d) Source-level maps showing localization of the syllabic-rate response (4 Hz) in both conditions.

In order to determine the neural source of the 1 Hz peak in the Structured condition, we repeated the spectral analysis in source-space. An inverse solution was applied to individual trials and the ITPC was calculated in each voxel. As shown in Figure 3b, the source-level ITPC spectra, averaged over each hemisphere separately, are qualitatively similar to those observed at the scalp-level. The only significant peaks were at 4 Hz in both conditions (p<10^−8; large effect size, Cohen's d > 1.7 in both conditions and both hemispheres) and at 1 Hz in the Structured condition (left hemisphere p=0.052, Cohen's d = 0.43; right hemisphere p<0.007, Cohen's d = 0.6), but not in the Non-Structured condition. Statistical comparison of the 1 Hz peak between conditions revealed a significant difference between the Structured and Non-Structured condition over the left hemisphere (p=0.026, Cohen's d = 0.57), but not over the right hemisphere (p=0.278).

Figure 3c shows the source-level distribution within the left hemisphere of the difference in 1 Hz ITPC between the Structured and Non-Structured condition. The effect was observed primarily in frontal and parietal regions. Statistical testing evaluating the difference between conditions was performed in 22 pre-determined ROIs per hemisphere, using a permutation test. This indicated significant effects in several ROIs in the left hemisphere including the inferior-frontal cortex and superior parietal cortex, as well as the mid-cingulate and portions of the middle and superior occipital gyrus (cluster-corrected p=0.002). No ROIs survived multiple-comparison correction in the right hemisphere (cluster-corrected p=0.132), although some ROIs in the right cingulate were significant at an uncorrected level (p<0.05).

The 4 Hz peak was localized, as expected, to bilateral auditory cortex and did not differ significantly across conditions in either hemisphere (left hemisphere: p=0.155, right hemisphere: p=0.346). We therefore did not conduct a more fine-grained analysis of different ROIs. As in the scalp-level data, no peaks were observed at 2 Hz, and there was no significant difference between conditions at this frequency (left hemisphere: p=0.963, right hemisphere: p=0.755).

Speech tracking of to-be-attended speech

Speech tracking analysis of responses to the to-be-attended narrative yielded robust TRFs (scalp-level predictive power r = 0.1, p<0.01 vs. permutations). The TRF time-course featured two main peaks, one at ~80 ms and the other at ~140 ms, in line with previous TRF estimations (Akram et al., 2017; Fiedler et al., 2019; Brodbeck et al., 2020b). Both the scalp-level and source-level TRF analyses indicated that TRFs were predominantly auditory, showing the common four-pole distribution at the scalp level (Figure 4a), and at the source level were localized primarily to auditory cortex (superior temporal gyrus and sulcus; STG/STS) as well as left insula/IFG (Figure 4b; 140 ms).

Speech tracking of to-be-attended speech.

(a) top, TRF examples from two sensors, showing the positive and negative poles of the TRF over the left hemisphere. Shaded highlights denote SEM across participants. Black line indicates the time points where the difference between conditions was significant (spatio-temporal cluster corrected). bottom, TRF examples from source-level ROIs in left Auditory Cortex and in the left Inferior Frontal/Insula region. (b) topographies (top) and source estimations (bottom) for each condition at M1 peak (140 ms). (c) left, topography of the difference between conditions (Structured – Non-Structured) at the M1 peak (140 ms). Asterisks indicate the MEG channels where this difference was significant (cluster corrected). right, significant cluster at the source-level.

When comparing the TRFs to the to-be-attended speech as a function of whether the competing task-irrelevant stimulus was Structured vs. Non-Structured, some interesting differences emerged. A spatial-temporal clustering permutation test on the scalp-level TRFs revealed significant differences between the conditions between 70 and 180 ms (p<0.05; cluster corrected), including both the early and late TRF peaks, at a large number of sensors, primarily on the left (Figure 4a and c). Specifically, TRF responses to the to-be-attended speech were enhanced when the task-irrelevant stimulus was Structured vs. Non-Structured. The effect was observed at the scalp level with opposite polarity in frontal vs. medial sensors, and was localized at the source level to a single cluster in the left inferior-frontal cortex, which included portions of the insula and orbito-frontal cortex (Figure 4c; spatial-temporal clustering p=0.0058).

Discussion

In this MEG study, frequency-tagged hierarchical speech was used to probe the degree to which linguistic processing is applied to task-irrelevant speech, and how this interacts with processing speech that is presumably in the focus of attention (‘to-be-attended’). As expected, we observed obligatory acoustic representation of task-irrelevant speech, regardless of whether it was Structured or Non-Structured, which manifested as a 4 Hz syllable-level response localized to bilateral auditory regions in the STG/STS. Critically, for Structured task-irrelevant speech, we also found evidence for neural tracking of the phrasal structure, with a 1 Hz peak localized primarily to left inferior-frontal cortex and left posterior-parietal cortex. These regions are not associated with low-level processing, but rather play an important role in speech processing and higher order executive functions (Dronkers et al., 2004; Humphries et al., 2006; Linden, 2007; Corbetta et al., 2008; Edin et al., 2009; Hahn et al., 2018). Additionally, we found that the speech-tracking response to the to-be-attended speech in left inferior frontal cortex was affected by whether task-irrelevant speech was linguistically Structured or not. These results contribute to ongoing debates regarding the nature of the competition for processing resources during speech-on-speech masking, demonstrating that linguistic processes requiring integration of input over relatively long timescales can indeed be applied to task-irrelevant speech.

The debate surrounding linguistic processing of task-irrelevant speech

Top-down attention is an extremely effective process by which the perceptual and neural representations of task-relevant speech are enhanced at the expense of task-irrelevant stimuli, and competing speech in particular (Horton et al., 2013; Zion Golumbic et al., 2013b; O'Sullivan et al., 2015; Fiedler et al., 2019; Teoh and Lalor, 2019). However, the question still stands: what degree of linguistic processing is applied to task-irrelevant speech? One prominent position is that attention is required for linguistic processing, and therefore speech that is outside the focus of attention is not processed beyond its sensory attributes (Lachter et al., 2004; Brodbeck et al., 2018a; Ding et al., 2018). However, several lines of evidence suggest that linguistic features of task-irrelevant speech can be processed as well, at least under certain circumstances. For example, task-irrelevant speech is more disruptive to task performance if it is intelligible, as compared to unintelligible noise-vocoded or rotated speech (Marrone et al., 2008; Iyer et al., 2010; Best et al., 2012; Gallun and Diedesch, 2013; Swaminathan et al., 2015; Kidd et al., 2016) or a foreign language (Freyman et al., 2001; Rhebergen et al., 2005; Cooke et al., 2008; Calandruccio et al., 2010; Francart et al., 2011). This effect, referred to as informational masking, is often attributed to the detection of familiar acoustic-phonetic features in task-irrelevant speech, which can lead to competition for phonological processing ('phonological interference') (Durlach et al., 2003; Drullman and Bronkhorst, 2004; Kidd et al., 2008; Shinn-Cunningham, 2008; Rosen et al., 2013). However, the phenomenon of informational masking alone is insufficient for determining the extent to which task-irrelevant speech is processed beyond identification of phonological units.

Other lines of investigation have focused more directly on whether task-irrelevant speech is represented at the semantic level. Findings that individual words from a task-irrelevant source are occasionally detected and recalled, such as one’s own name (Cherry, 1953; Wood and Cowan, 1995; Conway et al., 2001; Röer et al., 2017b; Röer et al., 2017a), have been taken as evidence that task-irrelevant inputs can be semantically processed, although the information may not be consciously available. Along similar lines, a wealth of studies demonstrates the ‘Irrelevant Sound Effect’ (ISE), showing that the semantic content of task-irrelevant input affects performance on a main task, mainly through priming effects and interference with short-term memory (Lewis, 1970; Bentin et al., 1995; Surprenant et al., 1999; Dupoux et al., 2003; Beaman, 2004; Rivenez et al., 2006; Beaman et al., 2007; Rämä et al., 2012; Aydelott et al., 2015; Schepman et al., 2016; Vachon et al., 2020). However, an important caveat precludes interpreting these findings as clear-cut evidence for semantic processing of task-irrelevant speech: since these studies primarily involve presentation of arbitrary lists of words (mostly nouns), usually at a relatively slow rate, an alternative explanation is that the ISE is simply a result of occasional shifts of attention toward task-irrelevant stimuli (Carlyon, 2004; Lachter et al., 2004). Similarly, the effects of informational masking discussed above can also be attributed to a similar notion of perceptual glimpsing, that is, gleaning bits of the task-irrelevant speech in the short ‘gaps’ in the speech that is to-be-attended (Cooke, 2006; Kidd et al., 2016; Fogerty et al., 2018).
These claims – that effects of task-irrelevant speech are not due to parallel processing but reflect shifts of attention – are extremely difficult to reject empirically, as they would require insight into the listeners’ internal state of attention, which at present is not easy to operationalize.

Phrase-level response to task-irrelevant speech

In an attempt to broach the larger question of processing task-irrelevant speech, the current study takes a different approach by focusing not on detection of single words, but on linguistic processes that operate over longer timescales. To this end, the stimuli used here, in both the to-be-attended and the task-irrelevant ear, were continuous speech rather than word-lists; processing continuous speech requires accumulating and integrating information over time, which is strikingly different from the point-process nature of listening to word-lists (Fedorenko et al., 2016). Using continuous speech is also more representative of the type of stimuli encountered naturally in the real world (Hill and Miller, 2010; Risko et al., 2016; Matusz et al., 2019; Shavit-Cohen and Zion Golumbic, 2019). In addition, by employing hierarchical frequency-tagging, we were able to obtain objective and direct indications of which levels of information were detected within task-irrelevant speech. Indeed, using this approach we were able to identify a phrase-level response for Structured task-irrelevant speech, which serves as a positive indication that these stimuli are indeed processed in a manner sufficient for identifying the boundaries of syntactic structures.

An important question is whether the phrase-level response observed for task-irrelevant speech can be explained by attention shifts. Admittedly, in the current design participants could shift their attention between streams in an uncontrolled fashion, allowing them to ‘glimpse’ portions of the task-irrelevant speech, integrate and comprehend (portions of) it. Indeed, this is one of the reasons we refrain from referring to the task-irrelevant stream as ‘unattended’: since we have no principled way to empirically observe the internal loci or spread of attention, we chose to focus on its behavioral relevance rather than make assumptions regarding the participants’ attentional state. Despite the inherent ambiguity regarding the underlying dynamics of attention, the fact that here we observe a phrase-level response for task-irrelevant speech is a direct indication that phonetic-acoustic information from this stream was decoded and integrated over time, allowing the formation of long-scale representations of phrasal boundaries. If this is a result of internal ‘glimpsing’, it would imply either (a) that the underlying hierarchical speech structure was detected and used to guide ‘glimpses’ in a rhythmic manner to the points in time that are most informative; or (b) that ‘glimpses’ occur irregularly, but that sufficient information is gleaned through them and stored in working memory to allow consistent detection of phrasal boundaries in task-irrelevant speech. Both of these options imply a sophisticated multiplexed encoding scheme for successful processing of concurrent speech, one that relies on precise temporal control and working-memory storage. Another possibility, of course, is that there is no need for attention shifts and that the system has sufficient capacity to process task-irrelevant speech in parallel to focusing primarily on the to-be-attended stream.
As mentioned above, the current data cannot provide insight into which of these listening-strategies underlies the generation of the observed phrase-level response to task-irrelevant speech. However, we hope that future studies will gain empirical access into the dynamic of listeners’ internal attentional state and help shed light on this pivotal issue.

The current study is similar in design to another recent study by Ding et al., 2018, in which Structured frequency-tagged speech was presented as a task-irrelevant stimulus. In contrast to the results reported here, they did not find significant peaks at any linguistic-related frequencies in the neural response to task-irrelevant speech. In an attempt to resolve this discrepancy, it is important to note that these two studies differ in an important way – in the listening effort that was required of participants in order to understand the to-be-attended speech. While in the current experiment to-be-attended speech was presented in its natural form, mimicking the listening effort of real-life speech processing, in the study by Ding et al., 2018 to-be-attended speech was time-compressed by a factor of 2.5 and naturally occurring gaps were removed, making the comprehension task substantially more effortful (Nourski et al., 2009; Müller et al., 2019). The Load Theory of Attention proposes that the allocation of processing resources among competing inputs can vary as a function of the perceptual traits and cognitive load imposed by the task (Lavie et al., 2004; Murphy et al., 2017). Accordingly, it is plausible that these divergent results are due to the extreme difference in the perceptual load and listening effort in the two studies. Specifically, if understanding the to-be-attended speech imposes relatively low perceptual and cognitive load, then sufficient resources may be available to additionally process aspects of task-irrelevant speech, but this might not be the case as the task becomes more difficult and perceptually demanding (Wild et al., 2012; Gagné et al., 2017; Peelle, 2018).

More broadly, the comparison between these two studies invites re-framing of the question regarding the type and level of linguistic processing applied to task-irrelevant speech, and propels us to think about this issue not as a yes-or-no dichotomy, but as a more flexible process that depends on the specific context (Brodbeck et al., 2020b). The current results provide a non-trivial positive example of task-irrelevant speech being processed beyond its acoustic attributes, in an experimental context that closely emulates the perceptual and cognitive load encountered in real life (despite the admittedly unnatural nature of the task-irrelevant speech). At the same time, they do not imply that this is always the case, as is evident from the diverse results reported in the literature regarding processing of task-irrelevant speech, discussed at length above. Rather, they invite adopting a more flexible perspective on processing bottlenecks within the speech processing system, one that takes into consideration the perceptual and cognitive load imposed in a given context, in line with the load theory of attention (Mattys et al., 2012; Lavie et al., 2014; Fairnie et al., 2016; Gagné et al., 2017; Peelle, 2018). Supporting this perspective, others have also observed that the level of processing applied to task-irrelevant stimuli can be affected by task demands (Hohlfeld and Sommer, 2005; Pulvermüller et al., 2008). Moreover, individual differences in attentional abilities, and particularly in the ability to process concurrent speech, have been partially attributed to working-memory capacity, a trait associated with the availability of more cognitive resources (Beaman et al., 2007; Forster and Lavie, 2008; Naveh-Benjamin et al., 2014; Lambez et al., 2020; but cf. Elliott and Briganti, 2012).
As cognitive neuroscience research increasingly moves toward studying speech processing and attention in real-life circumstances, a critical challenge will be to systematically map out the perceptual and cognitive factors that contribute to, or hinder, the ability to glean meaningful information from stimuli that are outside the primary focus of attention.

The brain regions where phrase-level response is observed

The phrase-level neural response to task-irrelevant Structured speech was localized primarily to two left-lateralized clusters: one in the left anterior fronto-temporal cortex and the other in left posterior-parietal cortex. The fronto-temporal cluster, which included the IFG and insula, is known to play an important role in speech processing (Dronkers et al., 2004; Humphries et al., 2006; Brodbeck et al., 2018b; Blank and Fedorenko, 2020). The left IFG and insula are particularly associated with linguistic processes that require integration over longer periods of time, such as syntactic structure building and semantic integration of meaning (Fedorenko et al., 2016; Matchin et al., 2017; Schell et al., 2017), and are also recruited when speech comprehension requires effort, such as for degraded or noise-vocoded speech (Davis and Johnsrude, 2003; Obleser and Kotz, 2010; Davis et al., 2011; Hervais-Adelman et al., 2012). Accordingly, observing a phrase-level response to task-irrelevant speech in these regions is in line with their functional involvement in processing speech under adverse conditions.

With regard to the left posterior-parietal cluster, the interpretation for why a phrase-level response is observed there is less straightforward. Although some portions of the parietal cortex are involved in speech processing, these are typically more inferior than the cluster found here (Hickok and Poeppel, 2007; Smirnov et al., 2014). However, both the posterior-parietal cortex and inferior frontal gyrus play an important role in verbal working-memory (Todd and Marois, 2004; Postle et al., 2006; Linden, 2007; McNab and Klingberg, 2008; Edin et al., 2009; Østby et al., 2011; Rottschy et al., 2012; Gazzaley and Nobre, 2012; Ma et al., 2012; Meyer et al., 2014; Meyer et al., 2015; Yue et al., 2019; Fedorenko and Blank, 2020). Detecting the phrasal structure of task-irrelevant speech, while focusing primarily on processing the to-be-attended narratives, likely requires substantial working-memory for integrating chunks of information over time. Indeed, attention and working-memory are tightly linked constructs (McNab and Klingberg, 2008; Gazzaley and Nobre, 2012; Vandierendonck, 2014), and as mentioned above, the ability to control and maintain attention is often associated with individual working-memory capacity (Cowan et al., 2005; Beaman et al., 2007; Forster and Lavie, 2008; Naveh-Benjamin et al., 2014; Lambez et al., 2020). Therefore, one possible interpretation for the presence of a phrase-level response to task-irrelevant speech in the left posterior-parietal cortex and inferior frontal regions, is their role in forming and maintaining a representation of task-irrelevant stimuli in working-memory, perhaps as a means for monitoring the environment for potentially important events.

Why no word-level response?

Although in the current study we found a significant neural response to task-irrelevant speech at the phrase-rate, we did not see peaks at the word- or the sentence-rate. Regarding the sentence-level response, it is difficult to determine whether the lack of an observable peak indicates that the stimuli were not parsed into sentences, or whether this null-result is due to the technical difficulty of obtaining reliable peaks at low frequencies (0.5 Hz) given the 1/f noise-structure of neurophysiological recordings (Pritchard, 1992; Miller et al., 2009). Hence, this remains an open question for future studies. The lack of a word-level response at 2 Hz for Structured task-irrelevant stimuli was indeed surprising, since in previous studies using the same stimuli in a single-speaker context we observed a prominent peak at both the word- and the phrase-rate (Makov et al., 2017). Although we do not know for sure why the 2 Hz peak was not observed when this speech was presented as task-irrelevant concurrently with another narrative, we can offer some speculations for this null-result: One possibility is that the task-irrelevant speech was indeed parsed into words as well, but that the neural signature of 2 Hz parsing was not observable due to interference from the acoustic contributions at 2 Hz (see Supplementary Materials and Luo and Ding, 2020). However, another possibility is that the lack of a word-level response for task-irrelevant speech indicates that it does not undergo full lexical analysis. Counter to the linear intuition that syntactic structuring depends on identifying individual lexemes, there is substantial evidence that lexical and syntactic processes are separable and dissociable cognitive processes that rely on partially different neural substrates (Friederici and Kotz, 2003; Hagoort, 2003; Humphries et al., 2006; Nelson et al., 2017; Schell et al., 2017; Pylkkänen, 2019; Morgan et al., 2020).
Indeed, a recent frequency-tagging study showed that syntactic phrasal structure can be identified (generating a phrase-level peak in the neural spectrum) even in the complete absence of lexical information (Getz et al., 2018). Hence, it is possible that when speech is task-irrelevant and does not receive full attention, it is processed only partially, and that although phrasal boundaries are consistently detected, task-irrelevant speech does not undergo full lexical analysis. This matter regarding the depth of lexical processing of task-irrelevant speech, and its interaction with syntactic analysis, remains to be further explored in future research.

Task-irrelevant influence on processing to-be-attended speech

Besides analyzing the frequency-tagged neural signatures associated with encoding the task-irrelevant stimuli, we also examined how the neural encoding of to-be-attended speech was affected by the type of task-irrelevant speech it was paired with. In line with previous MEG studies, the speech-tracking response (estimated using TRFs) was localized to auditory temporal regions bilaterally and left inferior frontal regions (Ding and Simon, 2012; Zion Golumbic et al., 2013a; Puvvada and Simon, 2017). The speech-tracking response in auditory regions was similar in both conditions; however, the response in left inferior-frontal cortex was modulated by the type of task-irrelevant speech presented, and was enhanced when task-irrelevant speech was Structured vs. when it was Non-Structured. This pattern highlights the nature of the competition for resources triggered by concurrent stimuli. When the task-irrelevant stimulus was Non-Structured, even though it was comprised of individual phonetic-acoustic units, it did not contain meaningful linguistic information and therefore did not require syntactic and semantic resources. However, Structured task-irrelevant speech poses more of a competition, since it constitutes fully intelligible speech. Indeed, it is well established that intelligible task-irrelevant speech causes more competition, and is therefore more distracting, than non-intelligible speech (Rhebergen et al., 2005; Iyer et al., 2010; Best et al., 2012; Gallun and Diedesch, 2013; Carey et al., 2014; Kilman et al., 2014; Swaminathan et al., 2015; Kidd et al., 2016). A recent EEG study found that responses to both target and distractor speech are enhanced when the distractor is intelligible vs. unintelligible (Olguin et al., 2018), although this may depend on the specific type of stimulus used (Rimmele et al., 2015).
However, in most studies it is difficult to ascertain the level(s) of processing at which competition between the inputs occurs, and many effects can be explained by variation in the acoustic nature of maskers (Ding and Simon, 2014). The current study is unique in that all low-level features of the Structured and Non-Structured speech stimuli were perfectly controlled, allowing us to demonstrate that interference goes beyond the phonetic-acoustic level and also occurs at higher linguistic levels. The finding that the speech-tracking response to the to-be-attended narratives is enhanced when competing with Structured task-irrelevant speech, specifically in left inferior-frontal brain regions where we also observed tracking of the phrase-structure of task-irrelevant speech, pinpoints the locus of this competition to these dedicated speech-processing regions, above and beyond any sensory-level competition (Davis et al., 2011; Brouwer et al., 2012; Hervais-Adelman et al., 2012). Specifically, it suggests that the enhanced speech-tracking response in IFG reflects the investment of additional listening effort for comprehending the task-relevant speech (Vandenberghe et al., 2002; Gagné et al., 2017; Peelle, 2018).

Given that the neural response to the to-be-attended speech was modulated by the type of competition it faced, why was this not mirrored in the current behavioral results as well? In the current study participants achieved similar accuracy rates on the comprehension questions regardless of whether the natural narratives were paired with Structured or Non-Structured stimuli in the task-irrelevant ear, and there was no significant correlation between the neural effects and performance. We attribute the lack of a behavioral effect primarily to the insensitivity of the behavioral measures used here, which consisted of four multiple-choice questions after each 45 s long narrative. Although numerous previous studies have been able to demonstrate behavioral ‘intrusions’ of task-irrelevant stimuli on performance of an attended task, these have been shown using more constrained experimental paradigms, which have the advantage of probing behavior at a finer scale but are substantially less ecological (e.g. memory-recall for short lists of words or priming effects; Tun et al., 2002; Dupoux et al., 2003; Rivenez et al., 2006; Rivenez et al., 2008; Carey et al., 2014; Aydelott et al., 2015). In moving toward studying speech processing and attention under more ecological circumstances, using natural continuous speech, we face an experimental challenge: obtaining sufficiently sensitive behavioral measures without disrupting listening with an ongoing task (e.g. target detection) or encroaching too much on working memory. This challenge is shared by many previous studies similar to ours, and is one of the main motivations for turning directly to the brain and studying neural activity during uninterrupted listening to continuous speech, rather than relying on sparse behavioral indications (Ding et al., 2016; Makov et al., 2017; Brodbeck et al., 2018a; Broderick et al., 2018; Broderick et al., 2019; Donhauser and Baillet, 2020).

Conclusions

The current study contributes to ongoing efforts to understand how the brain deals with the abundance of auditory inputs in our environment. Our results indicate that even though top-down attention effectively enables listeners to focus on a particular task-relevant source of input (speech in this case), this prioritization can be affected by the nature of task-irrelevant sounds. Specifically, we find that when the latter constitute meaningful speech, left fronto-temporal speech-processing regions are engaged in processing both stimuli, potentially leading to competition for resources and more effortful listening. Additional brain regions, such as the PPC, are also engaged in representing some aspects of the linguistic structure of task-irrelevant speech, which we interpret as maintaining a representation of what goes on in the ‘rest of the environment’, in case something important arises. Importantly, similar interactions between the structure of task-irrelevant sounds and responses to the to-be-attended sounds have been previously demonstrated for non-verbal stimuli as well (Makov and Zion Golumbic, 2020). Together, this highlights the fact that attentional selection is not an all-or-none process, but rather a dynamic process of balancing the resources allocated to competing inputs, one that is highly affected by the specific perceptual, cognitive and environmental aspects of a given task.

Appendix 1

Supplementary materials

The modulation spectrum of the Structured speech stimuli used in this study featured a prominent peak at 4Hz, corresponding to the syllable-rate, and an additional smaller peak at 2Hz. Since the stimuli were built by concatenating 250-ms long syllables, while taking care not to introduce any additional acoustic events that would create other rhythmic regularities (such as systematic gaps between words; Buiatti et al., 2009), we hypothesized that the 2Hz peak may be related to the order of the syllables within the Structured sequences. Specifically, since our Structured stimuli were composed of bi-syllabic Hebrew words, there may be a systematic difference in the envelope-shape of syllables at the beginning vs. end of words. For example, in the materials used here, it was indeed more common to start a word with a CV syllable than to end with one (Figure 1—figure supplement 1 and 2). This, in turn, could lead to subtle yet systematic differences in the envelope-shape at even vs. odd positions in the stimulus, particularly after averaging across sentences/trials, resulting in a 2Hz peak in the modulation spectrum. A recent study by Luo and Ding, 2020 nicely demonstrates that an acoustic-driven 2Hz peak can be induced simply by amplifying every second syllable in multi-syllable words.
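This mechanism can be illustrated with a short simulation (a hedged sketch in Python, not the analysis code used in the study): a train of identical 250-ms syllable envelopes contains energy only at 4Hz and its harmonics, whereas making every second syllable differ in amplitude, a crude stand-in for the word-initial vs. word-final envelope-shape asymmetry, introduces an additional 2Hz component.

```python
import numpy as np

fs = 1000                                   # Hz, envelope sampling rate
t = np.arange(int(fs * 0.25)) / fs
syllable = np.sin(np.pi * t / 0.25)         # smooth 250-ms envelope 'bump'
n_syll = 32                                 # one 8-second epoch of syllables

# Identical syllables: energy only at the 4Hz syllable rate (and harmonics)
uniform = np.tile(syllable, n_syll)

# Alternating amplitudes (mimicking a systematic difference between
# word-initial and word-final syllables): this alone adds a 2Hz component
gains = np.tile([1.0, 0.7], n_syll // 2)
alternating = np.concatenate([g * syllable for g in gains])

freqs = np.fft.rfftfreq(len(uniform), d=1.0 / fs)
spec_u = np.abs(np.fft.rfft(uniform)) / len(uniform)
spec_a = np.abs(np.fft.rfft(alternating)) / len(alternating)
i2, i4 = np.argmin(np.abs(freqs - 2)), np.argmin(np.abs(freqs - 4))
# spec_u[i2] is numerically zero, while spec_a[i2] is clearly non-zero;
# both spectra retain a strong 4Hz syllable-rate peak
```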

To better understand the origin of the 2Hz peak in our Structured stimuli we ran several simulations, testing how the order of the syllables within the sequence affected the modulation spectrum. First, we created Position-Controlled Stimuli (Figure 1—figure supplement 2b), which were pseudo-sentences comprised of the same syllables as the Structured speech, but ordered in a manner that did not create linguistic meaning. Importantly, randomization was performed such that each syllable maintained the position it had in the original sentences. For example, if the syllable /gu/ was the first in original Sentence #1 and the syllable /buk/ was the last, then in the Position-Controlled stimuli these two syllables would still appear first or last, respectively, but would no longer be part of the same pseudo-sentence (see concrete examples in Figure 1—figure supplement 2b). In this manner, the average audio-envelope across sentences remains identical to the Structured materials, but the stimuli no longer carry linguistic meaning.

The same procedure was used for calculating the modulation spectrum as was applied to the original Structured stimuli (see Materials and methods for full details). Briefly, a total of 52 pseudo-sentences were constructed and randomly concatenated to form 50 sequences (56 s long). To stay faithful to the analysis procedure applied later to the MEG data, sequences were then divided into 8-second epochs, the envelope was extracted and an FFT was applied to each epoch. The modulation spectrum is the result of averaging the spectra across all epochs. We find that, indeed, the modulation spectra of the original Structured materials and the Position-Controlled materials are essentially identical, both containing similar peaks at 4Hz and 2Hz. This supports our hypothesis that the 2Hz peak stemmed from a natural asymmetry in the type of syllables that occur in start vs. end positions of bi-syllabic words (at least in Hebrew). We note that this type of Position-Controlled stimulus was used by us as the Non-Structured stimulus in a previous study (Makov et al., 2017), and it is likely a more optimal choice of control stimulus for future studies, as it allows differences at 2Hz in the neural response between Structured and Position-Controlled Non-Structured stimuli to be attributed more confidently to linguistic, rather than acoustic, effects.
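The epoching-and-averaging procedure described above can be sketched as follows. This is an illustrative Python reimplementation operating on a synthetic envelope, not the original analysis code; the function name and the toy stimulus are our own.

```python
import numpy as np

def modulation_spectrum(envelope, fs, epoch_dur=8.0):
    """Average the amplitude spectrum over non-overlapping epochs,
    mirroring the 8-second epoching applied to the MEG data."""
    n = int(fs * epoch_dur)
    n_epochs = len(envelope) // n
    epochs = envelope[: n_epochs * n].reshape(n_epochs, n)
    epochs = epochs - epochs.mean(axis=1, keepdims=True)  # remove per-epoch DC
    spectra = np.abs(np.fft.rfft(epochs, axis=1)) / n
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, spectra.mean(axis=0)

# Toy 56-second sequence with a 4Hz syllable-rate envelope
fs = 200
t = np.arange(int(56 * fs)) / fs
envelope = np.clip(np.sin(2 * np.pi * 4 * t), 0.0, None)  # half-rectified 4Hz
freqs, spec = modulation_spectrum(envelope, fs)
peak_freq = freqs[np.argmax(spec)]  # the syllable rate dominates the spectrum
```

With 8-second epochs the frequency resolution is 0.125 Hz, so the 0.5, 1, 2 and 4Hz rates of interest all fall on exact FFT bins.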

We next ran two additional simulations to determine what form of syllable randomization eliminates the 2Hz peak. We found that when creating pseudo-sentences where syllables are completely randomized and not constrained by position, the 2Hz peak is substantially reduced (Figure 1—figure supplement 2c). However, in this case we used a Fixed set of 52 pseudo-sentences to form sequences of stimuli. When we further relaxed this constraint, and allowed different randomizations in different sequences (Non-Fixed Randomized stimuli; Figure 1—figure supplement 2d), the 2Hz peak was completely eliminated. The latter are akin to the Non-Structured stimuli used in the current experiment, and hence the Non-Structured stimuli were, in fact, not fully controlled at the acoustic level for 2Hz modulations.

That said, since we did not in fact see any peak in the neural data at 2Hz, and the effect we did find was at 1Hz (which is controlled across conditions), this caveat does not affect the validity of the results reported here. Future studies using this hierarchical frequency-tagging approach should take care to equate the modulation spectrum of experimental and control stimuli on this dimension as well (as was done by Makov et al., 2017).

Data availability

The full MEG data and examples of the stimuli are now available on the Open Science Framework repository (https://osf.io/e93qa).

The following data set was generated:
    Har-shai Yahav P, Zion Golumbic E (2021) Linguistic processing of task-irrelevant speech at a Cocktail Party. Open Science Framework, ID e93qa.

References

  1. Boersma P (2011) Praat: Doing Phonetics by Computer [Computer Program].
  2. Broadbent DE (1958) Perception and Communication. London: Pergamon Press.
  3. Bryden MP (1964) The manipulation of strategies of report in dichotic listening. Canadian Journal of Psychology/Revue Canadienne De Psychologie 18:126–138. https://doi.org/10.1037/h0083290
  4. Cooke M (2006) A glimpsing model of speech perception in noise. The Journal of the Acoustical Society of America 119:1562–1573. https://doi.org/10.1121/1.2166600
  5. Kahneman D (1973) Attention and Effort. Prentice-Hall.
  6. Rivenez M, Darwin CJ, Guillaume A (2006) Processing unattended speech. The Journal of the Acoustical Society of America 119:4027–4040. https://doi.org/10.1121/1.2190162
  7. Treisman AM (1960) Contextual cues in selective listening. Quarterly Journal of Experimental Psychology 12:242–248. https://doi.org/10.1080/17470216008416732
  8. Yates AJ (1965) Delayed auditory feedback and shadowing. Quarterly Journal of Experimental Psychology 17:125–131. https://doi.org/10.1080/17470216508416421

Decision letter

  1. Barbara G Shinn-Cunningham
    Senior and Reviewing Editor; Carnegie Mellon University, United States
  2. Phillip E Gander
    Reviewer; University of Iowa, United States
  3. Ross K Maddox
    Reviewer; University of Rochester, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

Using a clever adaptation of a classic experimental paradigm, the authors assessed linguistic processing of task-irrelevant speech. By structuring the stimulus so that phonemes, words, phrases and sentences have nested but identifiable discrete rates, the authors were able to identify neural processing corresponding to each stimulus organizational level from magnetoencephalography responses. This study, which will interest those studying attention, language, or the organization of the auditory system, reveals linguistic processing of irrelevant speech at the phrasal level, which is an unexpected and intriguing result.

Decision letter after peer review:

Thank you for submitting your article "Linguistic processing of task-irrelevant speech at a Cocktail Party" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by Barbara Shinn-Cunningham as the Senior and Reviewing Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Phillip E Gander (Reviewer #1); Ross K Maddox (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1. It is surprising that your results do not replicate previous results showing peaks related to sentence- and word-level frequencies. Ideally, some other control experiment would be performed to directly address this issue. If such controls cannot be performed, the claims of the paper must be toned down.

2. The main effect is a change in a small peak that appears at 1 Hz, the frequency of the phrasal structure of the to-be-ignored speech. However, the reviewers raised a number of issues about the reliability of this peak measurement. Further analyses are needed to convince readers that the peak is real, especially given that the measured spectra are noisy: for example, a quantification of effect sizes, a quantification of the modulation spectrum of the natural, to-be-attended speech (and a consideration of how this interacts with the manipulated spectrum of the to-be-ignored speech), or related analyses.

3. Related to point 2, please clarify exactly how the ITPC was normalized and how this relates to the noise floor in your measures (see Reviewer 3's remarks).

4. To strengthen your claims, please explore whether individual subject data are correlated when you compare response accuracy in the structured condition with the strength of the phrasal-level ITPC. If there is no significant relationship, please offer some interpretation as to why there is not.

5. Lapses of attention that are not especially prolonged could contribute to or even explain some of your results. While you mention this and are careful with your language, your Discussion should consider this issue more fully and fairly.

6. Please discuss how "natural" the manipulated speech stimuli were, and whether this differed across conditions (or is different in important ways compared to previous studies). Do you believe any such differences explain what you observed?

In resubmitting your paper, please also consider the many suggestions and questions that the reviewers raise in their full reviews, below.

Reviewer #1:

The present study sought to better characterize how listeners deal with competing speech streams from multiple talkers, that is, whether unattended speech in a multi talker environment competes for exclusively lower-level acoustic/phonetic resources or whether it competes for higher-level linguistic processing resources as well. The authors recorded MEG data and used hierarchical frequency tagging in an unattended speech stream presented to one ear while listeners were instructed to attend to stories presented in the other ear. The study found that when the irrelevant speech contained structured (linguistic) content, an increase in power at the phrasal level (1 Hz) was observed, but not at the word level (2 Hz) or the sentence level (0.5 Hz). This suggests that some syntactic information in the unattended speech stream is represented in cortical activity, and that there may be a disconnect between lexical (word level) processing and syntactic processing. Source analyses of the difference between conditions indicated activity in left inferior frontal and left posterior parietal cortices. Analysis of the source activity underlying the linear transformation of the stimulus and response revealed activation in left inferior frontal (and nearby) cortex. Implications for the underlying mechanisms (whether attentional shift or parallel processing) are discussed. The results have important implications for the debate on the type and amount of representation that occurs to unattended speech streams.

The authors utilize clever tools which arguably provided a unique means to address the main research question, i.e., they used hierarchical frequency tagging for the distractor speech, which allowed them to assess linguistic representations at different levels (syllable-, word-, phrase-, and sentence-level). This technique enabled the authors to make claims about what level of language hierarchy the stimuli are being processed, depending on the observed frequency modulation in neural activity. These stimuli were presented during MEG recording, which let the authors assess changes in neurophysiological processing in near real time – essential for research on spoken language. Source analyses of these data provided information on the potential neural mechanisms involved in this processing. The authors also assessed a temporal response function (TRF) based on the speech envelope to determine the brain regions involved at these different levels for linguistic analysis of the distractor speech.

1. Speech manipulation:

In general, it is unclear what predictions to make regarding the frequency tagging of the unattended distractor speech. On the one hand, the imposed artificial rhythmicity (necessary for the frequency tagging approach) may make it easier for listeners to ignore the speech stream, and thus seeing an effect at higher-level frequency tags may be of greater note, although not entirely plausible. On the other hand, having the syllables presented at a consistent rate may make it easier for listeners to parse words and phrasal units because they know precisely when in time a word/phrase/sentence boundary is going to occur, allowing listeners to check on the irrelevant speech stream at predictable times. For both the frequency tagging and TRF electrophysiological results, the task-irrelevant structured speech enhancement could be interpreted as an infiltration of this information in the neural signal (as the authors suggest), but because the behavioral results are not different this latter interpretation is not easily supported. This pattern of results is difficult to interpret.

2. Behavioral Results:

Importantly, no behavioral difference in accuracy was observed between the two irrelevant speech conditions (structured vs. non-structured), which makes it difficult to interpret what impact the structured irrelevant speech had on attentive listening. If the structured speech truly "infiltrates" or "competes" for linguistic processing resources, the reader would expect a decrease in task accuracy in the structured condition. This behavioral pattern has been observed in other studies. This calls into question the face validity of the stimuli and task being used.

3. Attention:

• In this study, activation of posterior parietal cortex was found, which could be indicative of a strong attentional manipulation and suggest that the task was in fact quite attentionally demanding for subjects to perform. This may align with the lack of a behavioral difference between structured and non-structured irrelevant stimuli. Perhaps subjects attempted to divide their attention, which may have been possible between speech that was natural and speech that was rather artificial. The current results may align with a recent proposal that inferior frontal activity may be distinguished by language-selective and domain-general patterns.

4. Lack of word level response:

• A major concern is that the results do not seem to replicate from an earlier study with the same structured stimuli, i.e., the effects were seen for sentence and word level frequency tagging. As the authors discuss, it seems difficult to understand how a phrasal level of effect could be obtained without word-level processing, and so a response at the word level is expected.

5. Familiarization phase:

The study included a phase of familiarization with the stimuli, to get participants to understand the artificial speech. However, it would seem that it is much easier for listeners to report back on structured rather than unstructured stimuli. This is relevant to understanding any potential differences between the two conditions. It is unclear whether any quantification of performance/understanding was made at this phase. If there is no difference in the familiarization phase, this might explain why there was no difference in behavior between the two conditions during the actual task. Or, if there is a difference at the familiarization phase (i.e. structured sequences are more easily repeated back than non-structured sequences), this might help explain the neural result at 1 Hz, given that some higher level of processing must have occurred for the structured speech (such as "chunking" into words/phrasal units).

6. Speech manipulation:

• To assist interpretation of the pattern of results it would be helpful to know something about the listening experience to the stimuli. Was there any check on abnormality of the attended speech that was changed in gender, i.e., how natural did this sound? Why was the gender manipulation imposed, instead of initially recording the stimuli in a male voice?

• Similarly, how natural-sounding was the task-irrelevant structured speech?

7. Behavioral Results:

• It is difficult to reconcile why there are no significant differences in response accuracy between non-structured and structured conditions. How do the authors interpret this discrepancy alongside the observed neural differences across these two conditions? Is it possible this is reflective of the stimulus, i.e., that the structured speech is so artificial that it is not intrusive on the attended signal (thus no difference in behavior)? It seems less plausible that the non-structured speech was as intrusive as the structured speech, given the cited literature. These issues relate to the speech manipulation.

• Did the authors check on task behavior before running the experiment?

• How does this relate to the irrelevant stimuli used in the experiment in Makov et al. 2017? Is it possible this is important? Can the authors comment?

• It would be helpful to understand why no behavioral result occurred, and whether this was related to the speech manipulation specifically by obtaining behavioral results with other versions of the task.

• There is a rather large range in response accuracy across subjects. One analysis that might strengthen the argument that the increase in 1 Hz ITPC truly reflects phrasal-level information would be to look for a correlation between response accuracy in the structured condition and 1 Hz ITPC. One might predict that listeners who show lower behavioral accuracy would show greater 1 Hz ITPC (i.e. greater interference from the linguistic content of the unattended structured speech).

8. Attention:

• An attentional load account is interesting, but load is not directly manipulated in this experiment. And it is difficult to reconcile the claim that the Ding experiment was less natural than the current one, as both were transformations of time.

• Fedorenko and Blank, 2020 propose an account of inferior frontal mechanisms that may relate to the present pattern of results; in particular, there may be a stronger manipulation of attention (domain general) than of linguistic processes.

9. Lack of word level response:

• Again, this might be relevant with respect to potential differences from previous versions of the stimuli. If this is the case, it might have important implications for only a 1Hz effect being found.

• If SNR is an issue it calls into question the current results in addition to the possibility that it was an underestimation.

10. Familiarization phase:

• If they exist, what are the behavioral data / accuracy between conditions (structured v. non-structured) from the familiarization phase? Or did subjects comment on any differences?

11. ROIs:

• A representation or list of the ROIs might be helpful. It is unclear from Figure 1 if the whole cortical mantle is included in the 22 ROIs. In addition these ROI boundaries look considerably larger than the 180 in each hemisphere from Glasser et al. 2016. Please clarify.

Reviewer #2:

This paper by Har-shai Yahav and Zion Golumbic investigates the coding of higher level linguistic information in task-irrelevant speech. The experiment uses a clever design, where the task-irrelevant speech is structured hierarchically so that the syllable, word, and sentence levels can be ascertained separately in the frequency domain. This is then contrasted with a scrambled condition. The to-be-attended speech is naturally uttered and the response is analyzed using the temporal response function. The authors report that the task-irrelevant speech is processed at the sentence level in the left fronto-temporal area and posterior parietal cortex, in a manner very different from the acoustical encoding of syllables. They also find that the to-be-attended speech responses are smaller when the distractor speech is not scrambled, and that this difference shows up in exactly the same fronto-temporal area – a very cool result.

This is a great paper. It is exceptionally well written from start to finish. The experimental design is clever, and the results were analyzed with great care and are clearly described.

The only issue I had with the results is that the possibility (or likelihood, in my estimation) that the subjects are occasionally letting their attention drift to the task-irrelevant speech rather than processing in parallel can't be rejected. To be fair, the authors include a nice discussion of this very issue and are careful with the language around task-relevance and attended/unattended stimuli. It is indeed tough to pull apart. The second paragraph on page 18 states "if attention shifts occur irregularly, the emergence of a phrase-rate peak in the neural response would indicate that bits of 'glimpsed' information are integrated over a prolonged period of time." I agree with the math behind this, but I think it would only take occasional lapses lasting 2 or 3 seconds to get the observed results, and I don't consider that "prolonged." It is, however, much longer than a word, so nicely rejects the idea of single-word intrusions.

Reviewer #3:

The use of frequency tagging to analyze continuous processing at phonemic, word, phrasal and sentence-levels offers a unique insight into neural locking at higher-levels. While the approach is novel, there are major concerns regarding the technical details and interpretation of results to support phrase-level responses to structured speech distractors.

– Is the peak at 1Hz real and can it be attributed solely to the structured distractor?

* The study did not comment on the spectral profile of the "attended" speech: how much low-frequency modulation energy is actually attributable to the prosodic structure of the attended sentences? To what extent does the interplay of the attended utterance and the distractor shape the modulation dynamics of the stimulus (even dichotically)?

* How is the ITPC normalized? Figure 2 speaks of a normalization but it is not clear how? The peak at 1Hz appears extremely weak and no more significant (visually) than other peaks – say around 3Hz and also 2.5Hz in the case of non-structured speech? Can the authors report on the regions in modulation space that showed any significant deviations? What about effect size of the 1Hz peak relative to these other regions?

* It is hard to understand where the noise floor lies in this analysis; this floor will rotate with the permutation test performed in the analysis of the ITPC and may not be fully accounted for. This issue depends on what the chosen normalization procedure is. The same interpretation put forth by the authors regarding the lack of a 0.5Hz peak due to noise still raises the question of how to interpret the observed 1Hz peak.

– Control of attention during task performance

* The authors present a very elegant analysis of possible alternative accounts of the results, but they acknowledge that possible attention switches, even if irregular, could result in accumulated information that could emerge as a small neurally-locked response at the phrase-level. As the authors indicate, designing an experiment that fully controls for such switches is a real feat. That being said, additional analyses could shed some light on variations in attentional state and their effect on the observed results, for instance, analysis of behavioral data across different trials (this wouldn't be conclusive, but could be informative).

* This issue is further compounded by the fact that a rather similar study (Ding et al.) did not report any phrasal-level processing, though there are design differences. The authors suggest differences in attentional load as a possible explanation and provide a very appealing account or reinterpretation of the literature based on a continuous model of processing based on task demands. While theoretically interesting, it is not clear whether any of the current data supports such account. Again, maybe a correlation between neural responses and behavioral performance in specific trials could shed some light or strengthen this claim.

– What is the statistic shown for the behavioral results? Is this for the multiple choice question? Then what is the t-test on?

– Beyond inter-trial phase coherence, can the authors comment on actual power-locked responses at the same corresponding rates?

In line with concerns regarding the interpretation of the experimental findings, some control experiment appears critical to establish a causal link between the observed neural processing of the 1Hz rhythm and phrasal processing of the distractor. What would the authors expect from a similarly structured but unintelligible distractor, in an unfamiliar language or even reversed?

https://doi.org/10.7554/eLife.65096.sa1

Author response

Essential revisions:

1. It is surprising that your results do not replicate previous results showing peaks related to sentence- and word-level frequencies. Ideally, some other control experiment would be performed to directly address this issue. If such controls cannot be performed, the claims of the paper must be toned down.

Indeed, the lack of a response at the word-level was not expected. All the reviewers point this out, and we too grapple with this issue. We also expected, based on previous findings, that if linguistic responses were observed for task-irrelevant speech, they would be observed at both the word- and phrase-levels. Indeed, this is the case when participants listen to these stimuli from a single speaker, without competition (e.g., Ding et al. 2016, Makov et al. 2017). As detailed below in our response to reviewers #1 and #3 and in the revised Methods and Results sections, we conducted additional analyses of the data to confirm the validity of the 1Hz peak (despite the lack of a 2Hz peak), and this result still stands. Since these are the data, which we now share in full on OSF (https://osf.io/e93qa), we offer some thoughts as to why the word-level peak is not observed when frequency-tagged speech is the ‘task-irrelevant’ stimulus in a two-speaker paradigm (as opposed to a single, attended speaker). For a more elaborate response see the section “why no word-level response” in our revised paper (Discussion, p. 23), and our detailed response to reviewers #1 and #3 below.

2. The main effect is a change in a small peak that appears at 1 Hz, the frequency of the phrasal structure of the to-be-ignored speech. However, the reviewers raised a number of issues about the reliability of this peak measurement. Further analyses are needed to convince readers that the peak is real, especially given that the measured spectra are noisy: for example, a quantification of effect sizes, a quantification of the modulation spectrum of the natural, to-be-attended speech (and a consideration of how this interacts with the manipulated spectrum of the to-be-ignored speech), or related analyses.

We understand the reviewers’ concerns about the validity of the 1Hz peak, and we have several responses to this comment:

1. We have added a new analysis to independently assess the significance of the 1Hz peak. In this analysis, we compared ITPC at each frequency to the average ITPC in the surrounding frequencies (2 bins from each side) using a t-test. This approach addresses the concern of a frequency-specific noise-floor, for example due to the inherent 1/f noise structure (see similar implementations, e.g., Mouraux et al., 2011; Retter and Rossion, 2016; Nozaradan et al., 2017; 2018). Results of this analysis confirmed that the only significant ITPC peaks were at 4Hz (in both conditions) and at 1Hz (in the Structured condition only), further validating the robustness of the 1Hz phrase-level response. Note that the other peaks that stand out visually (at 2.5Hz and 3Hz) were not statistically significant in this analysis. The new analysis is now described in the Methods (p. 11) and Results sections (p. 14).

2. Although the magnitude of the 1Hz peak is smaller than that of the 4Hz peak, this is to be expected, since the 4Hz peak is a direct consequence of the acoustic input whereas the 1Hz peak putatively reflects linguistic representation and/or chunking. Similar ratios between the acoustic and linguistic peaks have been observed in previous studies using this frequency-tagging approach (e.g., Ding et al. 2016, Makov et al. 2017). Therefore, the smaller relative magnitude of the 1Hz peak does not invalidate it. We have now added quantification of all effect sizes to the Results section.

3. The reviewers suggest that, perhaps, the acoustics of the task-relevant stream could have contributed to the observed 1Hz ITPC peak, and that the peak might not be solely attributable to the Structured stimuli. However, this is not the case, for two main reasons:

a. In our response to reviewer #3 we show the modulation spectrum of the narratives used as task-relevant speech. As can be seen, the spectrum does not contain a distinct peak at 1Hz.

b. The narratives used as task-relevant speech were fully randomized between the Structured and Non-Structured conditions and between participants. Therefore, if these stimuli had contributed to the 1Hz peak, this should have been observed in both conditions. We have now clarified this point in the methods (p.6).
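The neighbor-bin comparison described in point 1 above can be sketched as follows. This is an illustrative Python version with made-up data, not the actual MEG analysis; the function name and the array shapes are our own assumptions.

```python
import numpy as np

def neighbor_bin_t(itpc, target_bin, n_neighbors=2):
    """t statistic of ITPC at a target frequency bin against the mean of
    the surrounding bins (n_neighbors on each side), across participants.
    itpc: array of shape (n_subjects, n_freq_bins)."""
    sides = (list(range(target_bin - n_neighbors, target_bin)) +
             list(range(target_bin + 1, target_bin + n_neighbors + 1)))
    diff = itpc[:, target_bin] - itpc[:, sides].mean(axis=1)
    return diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

# Synthetic example: flat spectra plus a genuine peak at bin 10
rng = np.random.default_rng(0)
itpc = rng.normal(1.0, 0.05, size=(29, 50))  # 29 participants, 50 bins
itpc[:, 10] += 0.5                           # inject a peak
t_peak = neighbor_bin_t(itpc, target_bin=10)
t_flat = neighbor_bin_t(itpc, target_bin=30)  # control bin, no peak
```

Because each bin is compared only to its immediate neighbors, a 1/f-shaped noise floor largely cancels out of the difference.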

3. Related to point 2, please clarify exactly how the ITPC was normalized and how this relates to the noise floor in your measures (see Reviewer 3's remarks).

The normalization used for the ITPC was a z-score of the resultant phase-locking factor, as implemented in the circ_rtest function in the Matlab circular statistics toolbox (Berens, 2009). We have now clarified this in the methods section (p. 11) and the caption of Figure 3 (p. 14).
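For concreteness, the quantities involved can be written out as follows (an illustrative Python sketch; whether it matches the authors' MATLAB pipeline in every detail is an assumption, but circ_rtest computes the Rayleigh statistic z = n·R² from the resultant length R):

```python
import numpy as np

def itpc_and_rayleigh_z(phases):
    """phases: single-trial phase angles (radians) at one frequency.
    Returns the resultant length R (the raw phase-locking factor) and the
    Rayleigh z statistic, z = n * R**2, the normalization computed by
    circ_rtest in the MATLAB circular statistics toolbox (Berens, 2009)."""
    phases = np.asarray(phases, float)
    n = len(phases)
    R = np.abs(np.mean(np.exp(1j * phases)))  # inter-trial phase coherence
    return R, n * R ** 2

# Perfectly phase-locked trials: R = 1, z = n
R_locked, z_locked = itpc_and_rayleigh_z(np.full(40, 0.7))
# Uniformly distributed phases: R (and z) near zero
R_flat, z_flat = itpc_and_rayleigh_z(
    np.linspace(0, 2 * np.pi, 40, endpoint=False))
```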

4. To strengthen your claims, please explore whether individual subject data are correlated when you compare response accuracy in the structured condition with the strength of the phrasal-level ITPC. If there is no significant relationship, please offer some interpretation as to why there is not.

To test whether the 1Hz ITPC peak was stronger in participants who performed poorly, we added a new analysis. We split the participants into two groups according to the median of the 1Hz ITPC value in the Structured condition and tested whether there was a significant difference in the behavioral scores of the two groups. However, this median-split analysis did not reveal any significant effects that would indicate a ‘trade-off’ between linguistic representation of task-irrelevant speech and performance on the attended task. That said, as we elaborate in our response to reviewer #1/critique #2 below and in the revised methods and discussion, the task used here (answering 4 multiple-choice questions after each 45-second long narrative) was not sufficiently sensitive to adequately capture behavioral consequences of task-irrelevant stimuli. Therefore, in our opinion, this null-effect should not necessarily be taken as an indication that both types of task-irrelevant stimuli were similarly distracting. We have added this new analysis to the paper (p. 13-14).
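Schematically, this median-split procedure reduces to the following (an illustrative sketch assuming one 1Hz ITPC value and one mean accuracy score per participant; the function name and data layout are our own, not the original analysis code):

```python
import numpy as np
from scipy import stats

def median_split_test(itpc_1hz, accuracy):
    """Split participants by the median of their 1Hz ITPC (Structured
    condition) and compare behavioral accuracy between the two groups
    with an independent-samples t-test.

    itpc_1hz, accuracy : 1-D arrays, one value per participant.
    Returns (t, p, n_high, n_low).
    """
    itpc_1hz = np.asarray(itpc_1hz)
    accuracy = np.asarray(accuracy)
    high = itpc_1hz > np.median(itpc_1hz)  # above-median ITPC group
    t, p = stats.ttest_ind(accuracy[high], accuracy[~high])
    return t, p, int(high.sum()), int((~high).sum())
```

A significant negative t here would have indicated the hypothesized trade-off (stronger phrase-level tracking of task-irrelevant speech, worse attended-task performance); as reported above, no such effect was found.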

5. Lapses of attention that are not especially prolonged could contribute to or even explain some of your results. While you mention this and are careful with your language, your Discussions should consider this issue more fully and fairly.

Indeed, as the reviewers point out, one possible interpretation of the current results is that the strict rhythmic/hierarchical structure of the task-irrelevant speech-stimuli enabled participants to employ a ‘rhythmic-glimpsing listening strategy’. In this strategy, the task-irrelevant stream could be sampled (‘glimpsed’) at the most informative points in time, allowing participants to glean sufficient linguistic information to process the task-irrelevant speech. We agree (and discuss in the paper; pp. 20) that the phrase-rate peak observed here in the Structured condition could, potentially, reflect this type of ‘glimpsing’ strategy (even though we cannot validate or disprove this possibility in the current study, since we do not have direct insight into participants’ listening strategy).

However, even if this is the case, employing a ‘rhythmic glimpsing strategy’ is in-and-of-itself an indication that the linguistic structure of the task-irrelevant stream was detected and parsed correctly. This is because, given the hierarchical design of our Structured stimuli, for participants to "know precisely when in time a word/phrase/sentence boundary is going to occur" they must integrate over several syllables to identify the underlying linguistic structure. This is not trivial, since the linguistic structure does not follow directly from the acoustics. Also, since we gradually ramped up the volume of the task-irrelevant speech at the start of each trial, determining where each sentence/phrase starts can only be achieved based on linguistic processing. Therefore, finding a 1Hz peak for the Structured but not for the Non-Structured task-irrelevant speech, which share the same acoustic rhythm and were constructed from the same syllable units, serves as an indication that the underlying linguistic structure of task-irrelevant speech was indeed detected. We have elaborated on this point in the revised discussion (pp. 20).

6. Please discuss how "natural" the manipulated speech stimuli were, and whether this differed across conditions (or is different in important ways compared to previous studies). Do you believe any such differences explain what you observed?

This is an important point, since our ultimate goal is to understand attention to speech under natural, real-life conditions. We agree that the current study is not fully ‘natural’ for several reasons (e.g., the structured nature of the stimuli, their arbitrary content, the dichotic presentation, etc.). The revised discussion now addresses this point more extensively, focusing on the generalization of our findings to more “natural” circumstances, comparison with previous studies, and a call for future studies to continue in this direction and systematically explore these questions under increasingly natural conditions (p. 21-22 and p. 25-26; see also specific responses to the reviewers below):

“As cognitive neuroscience research increasingly moves towards studying speech processing and attention in real-life circumstances, a critical challenge will be to systematically map out the perceptual and cognitive factors that contribute to, or hinder, the ability to glean meaningful information from stimuli that are outside the primary focus of attention.”

In resubmitting your paper, please also consider the many suggestions and questions that the reviewers raise in their full reviews, below.

Reviewer #1:

[…] 1. Speech manipulation:

In general, it is unclear what predictions to make regarding the frequency tagging of the unattended distractor speech. On the one hand, the imposed artificial rhythmicity (necessary for the frequency tagging approach) may make it easier for listeners to ignore the speech stream, and thus seeing an effect at higher-level frequency tags may be of greater note, although not entirely plausible. On the other hand, having the syllables presented at a consistent rate may make it easier for listeners to parse words and phrasal units because they know precisely when in time a word/phrase/sentence boundary is going to occur, allowing listeners to check on the irrelevant speech stream at predictable times. For both the frequency tagging and TRF electrophysiological results, the task-irrelevant structured speech enhancement could be interpreted as an infiltration of this information in the neural signal (as the authors suggest), but because the behavioral results are not different this latter interpretation is not easily supported. This pattern of results is difficult to interpret.

Although the current results provide an indication of linguistic parsing of task-irrelevant speech, the reviewer raises an important point regarding the generalizability of our results to natural speech. Specifically, they ask whether the strict rhythmicity of the frequency-tagged stimuli used here might have made them easier or harder to "ignore", relative to natural speech that lacks this precise temporal structure.

We have several responses to this comment:

1. Indeed, as the reviewer points out, one possible interpretation of the current results is that the strict rhythmic/hierarchical structure of the task-irrelevant speech-stimuli enabled participants to employ a ‘rhythmic-glimpsing listening strategy’. In this strategy, the task-irrelevant stream could be sampled (‘glimpsed’) at the most informative points in time, allowing participants to glean sufficient linguistic information to process the task-irrelevant speech. We agree (and discuss in the paper; pp. 20) that the phrase-rate peak observed here in the Structured condition could, potentially, reflect this type of ‘glimpsing’ strategy (even though we cannot validate or disprove this possibility in the current study, since we do not have direct insight into participants’ listening strategy).

2. However, even if this is the case, employing a ‘rhythmic glimpsing strategy’ is in-and-of-itself an indication that the linguistic structure of the task-irrelevant stream was detected and parsed correctly. This is because, given the hierarchical design of our Structured stimuli, for participants to "know precisely when in time a word/phrase/sentence boundary is going to occur" they must integrate over several syllables to identify the underlying linguistic structure. This is not trivial, since the linguistic structure does not follow directly from the acoustics. Also, since we gradually ramped up the volume of the task-irrelevant speech at the start of each trial, determining where each sentence/phrase starts can only be achieved based on linguistic processing. Therefore, finding a 1Hz peak for the Structured but not for the Non-Structured task-irrelevant speech, which share the same acoustic rhythm and were constructed from the same syllable units, serves as an indication that the underlying linguistic structure of task-irrelevant speech was indeed detected.

We have elaborated on this point in the revised discussion, which now reads (pp. 20):

“An important question to ask is whether the phrase-level response observed for task irrelevant speech can be explained by attention shifts? […] However, we hope that future studies will gain empirical access into the dynamic of listeners’ internal attentional state and help shed light on this pivotal issue.”

3. Another important point raised by the reviewer is that the artificial rhythmicity of the current stimuli might actually have made them easier to ignore than natural, less rhythmic speech. Indeed, as the reviewer points out, previous studies using simple tones suggest that when task-irrelevant tones are isochronous, this can make them easier to ignore relative to non-isochronous tones (e.g., Makov et al. 2020). The current study was not designed to address this particular aspect, since we do not compare whether isochronous speech is easier to ignore than non-isochronous speech. However, given the monotonous and artificial nature of the current speech-stimuli, we speculate that they are probably easier to ‘tune out’ than natural speech, which contains potential attention-grabbing events as well as prosodic cues. A recent study by Aubanel and Schwartz (2020) also suggests that isochrony plays a reduced role in speech perception relative to simple tones. Importantly, though, regarding the reviewer’s concern: the isochronous 4Hz nature of the speech-stimuli is not sufficient to explain the main 1Hz effect reported here, since both the Structured and Non-Structured stimuli were similarly isochronous. Therefore, although we agree that this speech is “unnatural”, it is nonetheless processed and parsed for linguistic content.

4. Taking a broader perspective on the reviewer’s comment regarding the generalization of our findings to more “natural” circumstances:

One of the main take-home messages of this paper is that we should not think about whether or not task-irrelevant speech is processed linguistically as a binary yes/no question. Rather, the ability to process task-irrelevant speech likely depends on both the acoustic properties of the stimuli and the cognitive demands of the task. Within this framework, our current findings nicely demonstrate one set of circumstances in which the system has the capacity to process more than one speech stream (be it through glimpsing or parallel processing, as discussed above). This demonstration has important theoretical implications, indicating that we should not assume an inherent “bottleneck” for processing two competing speech-streams, as suggested by some models.

Clearly, additional research is required to fully map out the factors contributing to this effect (rhythmicity being one of them, perhaps). However, it is highly unlikely that the system is ONLY capable of doing this when stimuli are strictly rhythmic. We look forward to conducting systematic follow-up studies into the specific role of rhythm in attention to speech, as well as other acoustic and cognitive factors. We now elaborate on this point in the discussion (p. 21):

“More broadly, the comparison between these two studies invites re-framing of the question regarding the type / level of linguistic processing applied to task-irrelevant speech, and propels us to think about this issue not as a yes-or-no dichotomy, but perhaps as a more flexible process that depends on the specific context (Brodbeck et al. 2020b). [….] As cognitive neuroscience research increasingly moves towards studying speech processing and attention in real-life circumstances, a critical challenge will be to systematically map out the perceptual and cognitive factors that contribute to, or hinder, the ability to glean meaningful information from stimuli that are outside the primary focus of attention.”

2. Behavioral Results:

Importantly, no behavioral difference in accuracy was observed between the two irrelevant speech conditions (structured vs. non-structured), which makes it difficult to interpret what impact the structured irrelevant speech had on attentive listening. If the structured speech truly "infiltrates" or "competes" for linguistic processing resources, the reader would assume a decrease in task accuracy in the structured condition. This behavioral pattern has been observed in other studies. This calls into question the face validity of the stimuli and task being used.

In the current study, attentive-listening behavior was probed by asking participants four multiple-choice questions after each narrative (3-possible answers per question; chance level = 0.33). Indeed, we did not find any differences in accuracy on these questions when the natural-narrative was paired with Structured vs. Non-Structured task-irrelevant speech. Moreover, a new analysis aimed at testing whether behavioral performance was modulated by the strength of the 1Hz ITPC response did not reveal any significant effects (see behavioral results, pp. 14). However, we were not surprised by the lack of a behavioral effect, nor do we believe that this null-effect diminishes the significance of the neural effect observed here. This is because the task was not designed with the intent of demonstrating behavioral disturbance-effects by task-irrelevant speech, and indeed does not have sufficient sensitivity to do so for several reasons:

a. Asking only four questions on a ~45-second long narrative is an extremely poor measure for probing a participant’s understanding of the entire narrative.

b. Answering these questions correctly or incorrectly is not necessarily a direct indication of how much attention was devoted to the narrative, or of lapses of attention. Rather, performance is likely influenced by other cognitive processes. For example, questions requiring recollection of specific details (e.g., “what color was her hat?”) rely not only on attention but also on working-memory. Conversely, questions addressing the ‘gist’ of the narrative (e.g., “why was she sad?”) can be answered correctly based on logical deduction even if there were occasional attention-lapses. In other words, good performance on this task does not necessarily indicate “perfect” attention to the narrative, just as making mistakes on this task does not necessarily reflect “lapses” of attention.

c. Supporting our claim of the coarseness of this task: Prior to this experiment, we conducted a behavioral pilot study, aimed at obtaining a baseline-measure for performance when participants listen to these narratives in a single-speaker context (without additional competing speech). In that study, the average accuracy rate was 83% (n=10), indicating that even without the presence of competing speech, performance on this task is not perfect.

So why did we choose this task?

As we discuss in the paper, and the reviewer correctly points out, there have been numerous behavioral studies demonstrating behavioral ‘intrusions’ of task-irrelevant stimuli on performance of an attended task. However, these effects require substantially more sensitive behavioral tasks, where behavior is probed at a finer scale. Some prominent examples are short-term memory tasks for lists of words (Tun et al. 2002), semantic priming (Dupoux et al. 2003; Aydelott et al. 2015) or target-detection tasks (Rivenez et al. 2006, 2008; Carey et al. 2014). However, these types of tasks are quite artificial and are not suitable for studying attention to continuous natural speech, particularly if we do not want to disrupt listening with an ongoing (secondary) task and/or artificial manipulation of the speech. Therefore, many studies similar to ours employ ‘gross’ comprehension metrics that are, by definition, inadequate for capturing the complexity of attentive-listening behavior.

Our main motivation for choosing this task was a) to motivate and guide participants to direct attention towards the to-be-attended narrative and b) to verify that they indeed listened to it. This is critical in order to assert that participants indeed attended to the correct stream.

However, given the admitted insensitivity of this measure, we do not believe that the null effects on behavior can be interpreted in any meaningful way. In fact, the lack of good behavioral metrics for studying attention to continuous speech, and the difficulty of determining the ‘ground truth’ of attention, is precisely the motivation for this study. We believe that ongoing neural metrics provide a better indication of how continuous speech is processed than sporadic assessments of comprehension/memory.

We now elaborate on this point in the paper in the methods section (p. 9):

“This task was chosen as a way to motivate and guide participants to direct attention towards the to-be-attended narrative and provide verification that indeed they listened to it. […] At the same time, this task is instrumental in guiding participants' selective attention toward the designated speaker, allowing us to analyze their neural activity during uninterrupted listening to continuous speech, which was the primary goal of the current study.”

And in the discussion (p. 25):

“Since the neural response to the to-be-attended speech was modulated by the type of competition it faced, then why was this not mirrored in the current behavioral results as well? […] This is a challenge shared by many previous studies similar to ours, and is one of the main motivations for turning directly to the brain and studying neural activity during uninterrupted listening to continuous speech, rather than relying on sparse behavioral indications (Ding et al. 2016; Makov et al. 2017; Broderick et al. 2018, 2019; Donhauser and Baillet 2019; Brodbeck et al. 2020).”

3. Attention:

• In this study, activation of posterior parietal cortex was found, which could be indicative of a strong attentional manipulation and of the task being in fact quite attentionally demanding for subjects to perform. This may align with the lack of behavioral difference between structured and non-structured irrelevant stimuli. Perhaps subjects attempted to divide their attention, which may have been possible between speech that was natural and speech that was rather artificial. The current results may align with a recent proposal that inferior frontal activity may be distinguished by language-selective and domain-general patterns.

We agree with the reviewer that processing aspects of the task-irrelevant speech, in addition to following the to-be-attended narrative, may impose a higher demand on working-memory (a form of ‘divided’ attention), and that this might be reflected in the activation of PPC and inferior frontal regions in this condition. We have now expanded on this in our discussion, which reads (Discussion p. 22-23):

“Both the posterior-parietal cortex and inferior frontal gyrus play an important role in verbal working-memory (Todd and Marois 2004; Postle et al. 2006; Linden 2007; McNab and Klingberg 2008; Edin et al. 2009; Østby et al. 2011; Gazzaley and Nobre 2012; Ma et al. 2012; Rottschy et al. 2012; Meyer et al. 2014, 2015; Yue et al. 2019, Fedorenko and Blank 2020). […] Therefore, one possible interpretation for the presence of a phrase-level response to task-irrelevant speech in the left posterior-parietal cortex and inferior frontal regions, is their role in forming and maintaining a representation of task-irrelevant stimuli in working-memory, perhaps as a means for monitoring the environment for potentially important events.”

4. Lack of word level response:

• A major concern is that the results do not seem to replicate from an earlier study with the same structured stimuli, i.e., the effects were seen for sentence and word level frequency tagging. As the authors discuss, it seems difficult to understand how a phrasal level of effect could be obtained without word-level processing, and so a response at the word level is expected.

Indeed, the lack of a response at the word-level was not expected. We too expected that if linguistic responses were observed for task-irrelevant speech, they would be observed at both the word- and phrase-level (as was observed in previous single-speaker studies using these stimuli; e.g., Makov et al. 2017). All the reviewers point this out, and we too try to grapple with this issue in our discussion. As detailed above, we conducted additional analyses of the data to confirm the validity of the 1Hz peak (and the lack of a 2Hz peak), and this result still stands. Since these are the data, which we now share in full on OSF (https://osf.io/e93qa), we offer our thoughts as to why there is only a phrase-level peak but not a word-level peak when frequency-tagged speech is the ‘task-irrelevant’ stimulus in a two-speaker paradigm. See section “why no word-level response” in our revised paper (Discussion, p. 23):

“Why no word-level response?

Although in the current study we found significant neural response to task-irrelevant speech at the phrasal-rate, we did not see peaks at the word- or at the sentence-rate. […] This matter regarding the depth of lexical processing of task-irrelevant speech, and its interaction with syntactic analysis, remains to be further explored in future research.”

We understand that the lack of the expected 2Hz response can raise doubts regarding the validity of the 1Hz response as well. To address this concern and further verify that the peak observed at 1Hz is “real”, we have now added a new analysis to independently assess the significance of the 1Hz peak relative to the frequency-specific noise-floor. To this end, we compared ITPC at each frequency to the average ITPC in the surrounding frequencies (2 bins from each side) using a t-test. This approach accounts for potential frequency-specific variations in SNR, for example due to the inherent 1/f noise-structure (Mouraux et al., 2011; Retter and Rossion, 2016; Nozaradan et al., 2017, 2018). This analysis confirmed that the only significant ITPC peaks were at 4Hz (in both conditions) and at 1Hz (in the Structured condition only), further validating the robustness of the 1Hz phrase-level response. The new analysis is now described in the methods (pp. 11) and Results sections (pp. 14) which reads:

“Scalp-level spectra of the Inter-trial phase coherence (ITPC) showed a significant peak at the syllabic-rate (4Hz) in response to both Structured and Non-Structured hierarchical frequency tagged speech, with a 4-pole scalp-distribution common to MEG recorded auditory responses (Figure 3a) (p< 10^-9; large effect size, Cohen's d > 1.5 in both). […] Comparison of the 1Hz ITPC between these conditions also confirmed a significant difference between them (p=0.045; moderate effect size, Cohen's d = 0.57).”

5. Familiarization phase:

The study included a phase of familiarization with the stimuli, to get participants to understand the artificial speech. However it would seem that it is much easier for listeners to report back on structured rather than unstructured stimuli. This is relevant to understanding any potential differences between the two conditions. It is unclear if any quantification was made of performance/understanding at this phase. If there is no difference in the familiarization phase, this might explain why there was no difference in behavior during the actual task between the two conditions. Or, if there is a difference at the familiarization phase (i.e. structured sequences are more easily repeated back than non-structured sequences), this might help explain the neural data result at 1 Hz, given that some higher level of processing must have occurred for the structured speech (such as "chunking" into words/phrasal units).

We would like to clarify the need for the familiarization stage. The speech materials used in the current study are not immediately recognizable as Hebrew speech. Therefore, we conducted a familiarization stage prior to the main experiment. Without the familiarization stage, we could not be sure whether participants were aware of the higher-level chunking of the Structured stimuli. Moreover, we wanted to avoid any perceptual-learning effects during the main experiment. We now elaborate more extensively on the familiarization procedure in the methods section (pp. 9).

Indeed, as the reviewer anticipated, repeating the Structured stimuli during the familiarization stage was easier than repeating the Non-Structured stimuli, since the latter did not map onto known lexical units. However, we don’t fully understand the reviewer's concern regarding the familiarization task and whether it ‘explains’ the neural results. As we see it, the 1Hz neural response reflects the identification of phrasal boundaries, which was only possible for the Structured stimuli. The fact that participants were (implicitly) made aware of the underlying structure does not trivialize this effect. Rather, our results indicate that the continuous stream of input was correctly parsed and the underlying linguistic structure encoded despite being task-irrelevant.

6. Speech manipulation:

• To assist interpretation of the pattern of results it would be helpful to know something about the listening experience to the stimuli. Was there any check on abnormality of the attended speech that was changed in gender, i.e., how natural did this sound? Why was the gender manipulation imposed, instead of initially recording the stimuli in a male voice?

The natural speech materials were chosen from a pre-existing database of actor-recorded short stories that have been used in previous studies in the lab. These were originally recorded by both female and male speakers. However, our frequency-tagged stimuli were recorded only in a male voice. Since it is known that selective attention to speech is highly influenced by whether the voices are of the same/different sex, we opted to use only male voices. Therefore, we used the voice-change transformation on narratives that were originally spoken by a female actor. To ensure that the gender change did not affect the naturalness of the speech and to check for abnormalities in the materials, we conducted a short survey among 10 native Hebrew speakers. They all agreed that the speech sounded natural and normal. Examples of the stimuli are now available at: https://osf.io/e93qa. We have clarified this in the methods section (pp. 6), which reads:

“Natural speech stimuli were narratives from publicly available Hebrew podcasts and short audio stories (duration: 44.53±3.23 seconds). […] They all agreed that the speech sounded natural and normal.”

• Similarly, how natural-sounding was the task-irrelevant structured speech?

The task-irrelevant speech is composed of individually-recorded syllables, concatenated at a fixed rate. Although the syllabic rate (4Hz) is akin to that of natural speech, the extreme rhythmicity imposed here is not natural and requires some perceptual adaptation. As mentioned above, the frequency-tagged stimuli are not immediately recognizable as Hebrew speech, which is the main reason for conducting a familiarization stage prior to the main experiment. After a few minutes of familiarization, these stimuli still sound unnaturally rhythmic, but are fully intelligible as Hebrew. Examples of the stimuli (presented separately and dichotically) are now available at: https://osf.io/e93qa.

7. Behavioral Results:

• It is difficult to reconcile why there are no significant differences in response accuracy between non-structured and structured conditions. How do the authors interpret this discrepancy alongside the observed neural differences across these two conditions? Is it possible this is reflective of the stimulus, i.e., that the structured speech is so artificial that it is not intrusive on the attended signal (thus no difference in behavior)? It seems less plausible that the non-structured speech was equally as intrusive as the structured speech, given the cited literature. These issues relate to the speech manipulation.

• Did the authors check on task behavior before running the experiment?

• It would be helpful to understand why no behavioral result occurred, and whether this was related to the speech manipulation specifically by obtaining behavioral results with other versions of the task.

See our extensive answer above regarding the behavioral tasks (including results from behavioral screening of the task) in response to critique #2.

• How does this relate to the irrelevant stimuli were used in the experiment in Makov et al. 2017. Is it possible this is important, can the authors comment?

Yes. The Structured stimuli were identical to those used in our previous study by Makov et al., 2017. However, in that study only a single-speech stimulus was presented and selective attention was not manipulated. We now mention this explicitly in the methods section.

• There is a rather large range in response accuracy across subjects. One analysis that might strengthen the argument that the increase in 1 Hz ITPC truly reflects phrasal level information would be to look for a correlation between response accuracy in the structured condition with 1 Hz ITPC. One might predict that listeners which show lower behavioral accuracy would show greater 1 Hz ITPC (i.e. greater interference from linguistic content of the unattended structured speech).

Thank you for this suggestion. To test whether the 1Hz ITPC peak was stronger in participants who performed poorly, we added a new analysis. We split the participants into two groups according to the median of the 1Hz ITPC value in the Structured condition and tested whether there was a significant difference in the behavioral scores of the two groups. However, this median-split analysis did not reveal any significant effects that would indicate a ‘trade-off’ between linguistic representation of task-irrelevant speech and performance on the attended task. That said, as elaborated in our response to critique #2 above, the task was not sufficiently sensitive to adequately capture behavioral consequences of task-irrelevant stimuli. Therefore, in our opinion, this null-effect should not be taken as an indication that both types of task-irrelevant stimuli were similarly distracting.

We have added this new analysis to the paper, which reads (p. 13-14):

“Additionally, to test for possible interactions between answering questions about the to-be-attended speech and linguistic neural representation of task-irrelevant speech, we performed a median-split analysis of the behavioral scores across participants. […] Neither test showed significant differences in performance between participants whose 1Hz ITPC was above vs. below the median [Structured condition: t(27) = -1.07, p=0.29; Structured – Non-Structured: t(27) = -1.04, p=0.15]. Similar null-results were obtained when the median-split was based on the source-level data.”

8. Attention:

• An attentional load account is interesting, but load is not directly manipulated in this experiment. And it is difficult to reconcile the claim that the Ding experiment was less natural than the current one, as both were transformations of time.

Thank you for this opportunity to clarify our comparison between the current study and the study by Ding et al.: While both studies used similar frequency-tagged speech as task-irrelevant stimuli, the to-be-attended stimuli used in our study were natural speech, whereas Ding et al. used speech that was compressed by a factor of 2.5 and from which all naturally-occurring gaps were artificially removed. It is well known that understanding time-compressed speech is substantially more difficult than understanding natural-paced speech. Hence, in order to perform the task (answer questions about the to-be-attended narrative), participants in the study by Ding et al. needed to invest substantially more listening effort than participants in the current study.

We offer the perspective of “Attentional Load theory” as a way to account for the discrepancy in results between these two studies, and also as a way to think more broadly about the discrepancies reported throughout the literature regarding processing of task-irrelevant stimuli. Specifically, we proposed that because of the more-demanding stimuli/task used by Ding et al., insufficient resources were available to also encode task-irrelevant stimuli, which is in line with “Attentional Load Theory” (Lavie et al. 2004). We also agree that additional studies are required in order to fully test this explanation and systematically manipulate “load” in a within-experiment design. We have revised our discussion on this point (pp. 21):

“While in the current experiment to-be-attended speech was presented in its natural form, mimicking the listening effort of real-life speech-processing, in the study by Ding et al. (Ding et al. 2018) to-be-attended speech was time-compressed by a factor of 2.5 and naturally occurring gaps were removed, making the comprehension task substantially more effortful (Nourski et al. 2009; Müller et al. 2019). […] Specifically, if understanding the to-be-attended speech imposes relatively low perceptual and cognitive load, then sufficient resources may be available to additionally process aspects of task-irrelevant speech, but that this might not be the case as the task becomes more difficult and perceptually demanding (Wild et al. 2012; Gagné et al. 2017; Peelle 2018).”

• Federenko and Blank, 2020 propose an account of inferior frontal mechanisms that may relate to the present pattern of results regarding, in particular that there may be a stronger manipulation of attention (domain general) than linguistic processes.

We thank the reviewer for pointing us to this highly-relevant work, and now discuss the potential involvement of IFG in domain-general processes such as working-memory and attention, besides its role in speech processing (pp. 23).

9. Lack of word level response:

• Again it might be relevant with respect to potential stimulus differences to previous versions of the stimuli. If this is the case then it might have important implications for only a 1Hz effect being found.

• If SNR is an issue it calls into question the current results in addition to the possibility that it was an underestimation.

See our extensive answer to critique #4 above, where we address the lack of a word-level peak and include a new analysis validating the robustness of the 1Hz ITPC peak.

10. Familiarization phase:

• If they exist, what are the behavioral data / accuracy between conditions (structured v. non-structured) from the familiarization phase? Or did subjects comment on any differences?

See our answer above regarding the familiarization task, in response to critique #5.

11. ROIs:

• A representation or list of the ROIs might be helpful. It is unclear from Figure 1 if the whole cortical mantle is included in the 22 ROIs. In addition these ROI boundaries look considerably larger than the 180 in each hemisphere from Glasser et al. 2016. Please clarify.

We apologize; this was not clear enough in our original submission. Indeed, Glasser et al., 2016, identified 180 ROIs in each hemisphere. However, in their Supplementary Neuroanatomical Results (Table 1, p. 180) these are grouped into 22 larger ROIs, which is extremely useful for data simplification and reduction of multiple comparisons. This coarser ROI division is also more suitable for MEG data, given its reduced spatial resolution (compared to fMRI). The 22 ROIs are delineated in black in Figure 3, and we have now clarified this in the methods section (p. 12).

Reviewer #2:

[…] The only issue I had with the results is that the possibility (or likelihood, in my estimation) that the subjects are occasionally letting their attention drift to the task-irrelevant speech rather than processing in parallel can't be rejected. To be fair, the authors include a nice discussion of this very issue and are careful with the language around task-relevance and attended/unattended stimuli. It is indeed tough to pull apart. The second paragraph on page 18 states "if attention shifts occur irregularly, the emergence of a phrase-rate peak in the neural response would indicate that bits of 'glimpsed' information are integrated over a prolonged period of time." I agree with the math behind this, but I think it would only take occasional lapses lasting 2 or 3 seconds to get the observed results, and I don't consider that "prolonged." It is, however, much longer than a word, so nicely rejects the idea of single-word intrusions.

We fully agree with the reviewer that the current results do not necessarily imply parallel processing of competing speech and could be brought about by occasional (rhythmic or nonrhythmic) shifts of attention between streams (“multiplexed listening”). We now elaborate on our previous discussion of this issue, emphasizing two main points (p. 19-20; also, please see our extensive response to reviewer #1/critique #1):

1. As the reviewer points out, regardless of the underlying mechanism, the current results indicate temporal integration of content and structure-building processing, that go beyond momentary “intrusions” of single words.

2. One of our main take-home messages from the current study is that selective-attention does not mean exclusive-attention, and that individuals are capable of gleaning linguistic components of task-irrelevant speech (be it through parallel processing or through multiplexed-listening). This invites us to re-examine some of the assumptions regarding what individuals are actually doing when we instruct them to pay selective attention.

Reviewer #3:

The use of frequency tagging to analyze continuous processing at phonemic, word, phrasal and sentence-levels offers a unique insight into neural locking at higher-levels. While the approach is novel, there are major concerns regarding the technical details and interpretation of results to support phrase-level responses to structured speech distractors.

– Is the peak at 1Hz real and can it be attributed solely to the structured distractor?

* The study did not comment on the spectral profile of the "attended" speech, and how much low modulation energy is actually attributed to the prosodic structure of attended sentences? To what extent does the interplay of the attended utterance and distractor shapes the modulation dynamics of the stimulus (even dichotically)?

The reviewer suggests that, perhaps, the acoustics of the task-relevant stream could have contributed to the observed 1Hz ITPC peak, and that the peak might not solely be attributed to the Structured stimuli.

There are two reasons why this is not the case.

1. In Author response image 1 we show the modulation spectrum of the narratives used as task-relevant speech. As can be seen, the spectrum does not contain a distinct peak at 1Hz.

2. The narratives used as task-relevant speech were fully randomized between the Structured and Non-Structured conditions and between participants. Therefore, if these stimuli had contributed to the 1Hz peak, this should have been observed in both conditions. We have now clarified this point in the methods (p.6) which reads “For each participant they were randomly paired with task-irrelevant speech (regardless of condition), to avoid material-specific effects.”

Author response image 1

* How is the ITPC normalized? Figure 2 speaks of a normalization but it is not clear how? The peak at 1Hz appears extremely weak and no more significant (visually) than other peaks – say around 3Hz and also 2.5Hz in the case of non-structured speech? Can the authors report on the regions in modulation space that showed any significant deviations? What about effect size of the 1Hz peak relative to these other regions?

* It is hard to understand where the noise floor is in this analysis – this floor will rotate with the permutation test analysis performed in the analysis of the ITPC and may not be fully accounted for. This issue depends on what the chosen normalization procedure is. The same interpretation put forth by the author regarding a lack of a 0.5Hz peak due to noise still raises the question of interpreting the observed 1Hz peak?

We understand the reviewer’s concerns about the validity of the 1Hz peak, and have taken several steps to address them:

1. We have added a new analysis to independently assess the significance of the 1Hz peak. In this analysis, we compared ITPC at each frequency to the average ITPC in the surrounding frequencies (2 bins from each side) using a t-test. This approach addresses the concern of a frequency-specific noise floor, for example due to the inherent 1/f noise structure (see similar implementations e.g., in Mouraux et al., 2011; Retter and Rossion, 2016; Nozaradan et al. 2017, 2018). Results of this analysis confirmed that the only significant ITPC peaks were at 4Hz (in both conditions) and at 1Hz (in the Structured condition only), further validating the robustness of the 1Hz phrase-level response. Note that the other peaks that stand out visually (at 2.5Hz and 3Hz) were not statistically significant in this analysis. The new analysis is now described in the methods (pp. 11) and Results sections (pp. 14), which read:

“Scalp-level spectra of the Inter-trial phase coherence (ITPC) showed a significant peak at the syllabic-rate (4Hz) in response to both Structured and Non-Structured hierarchical frequency-tagged speech, with a 4-pole scalp-distribution common to MEG recorded auditory responses (Figure 3a) (p< 10^-9; large effect size, Cohen's d > 1.5 in both). […] Comparison of the 1Hz ITPC between these conditions also confirmed a significant difference between them (p=0.045; moderate effect size, Cohen's d = 0.57).”

2. Although, as the reviewer points out, the magnitude of the 1Hz peak is smaller than that of the 4Hz peak, this is to be expected, since the 4Hz peak is a direct consequence of the acoustic input whereas the 1Hz peak putatively reflects linguistic representation and/or chunking. Similar ratios between the acoustic and linguistic peaks have been observed in previous studies using this frequency-tagging approach (e.g. Ding et al. 2016, Makov et al. 2017). Therefore, the smaller relative magnitude of the 1Hz peak does not invalidate it. We have now added quantification of all effect sizes to the Results section.

3. The normalization used for the ITPC was a z-score of the resultant phase-locking factor, as implemented in the circ_rtest function in the Matlab circular statistics toolbox (Berens, 2009). We have now clarified this in the methods section (p. 11) and the caption of Figure 3 (p. 14).
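To make the neighbor-bin procedure in point 1 concrete, a minimal sketch follows. Python is used for illustration; the array shapes, names, and the choice of a paired t-test across participants are our assumptions about how such a test is typically implemented, not a copy of the actual analysis code:

```python
import numpy as np
from scipy import stats

def peak_vs_neighbors(itpc, n_neighbors=2):
    """Test each frequency bin's ITPC against the mean of its neighbors.

    itpc : array of shape (n_participants, n_freqs)
    For every bin with n_neighbors valid bins on each side, a paired t-test
    compares each participant's ITPC at that bin to the mean of the
    surrounding bins, addressing the frequency-specific (1/f) noise floor.
    Returns per-bin t-values and p-values (NaN at the untestable edges).
    """
    itpc = np.asarray(itpc, dtype=float)
    n_subj, n_freq = itpc.shape
    t_vals = np.full(n_freq, np.nan)
    p_vals = np.full(n_freq, np.nan)
    for f in range(n_neighbors, n_freq - n_neighbors):
        neighbors = np.concatenate([itpc[:, f - n_neighbors:f],
                                    itpc[:, f + 1:f + 1 + n_neighbors]], axis=1)
        # paired comparison: each participant's bin value vs. their own
        # neighbor mean, so between-subject ITPC differences cancel out
        t, p = stats.ttest_rel(itpc[:, f], neighbors.mean(axis=1))
        t_vals[f], p_vals[f] = t, p
    return t_vals, p_vals
```

A genuine peak yields a large positive t-value at its bin, while broadband 1/f changes affect the bin and its neighbors alike and therefore cancel.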

2. Control of attention during task performance

* The authors present a very elegant analysis of possible alternative accounts of the results, but they acknowledge that possible attention switches, even if irregular, could result in accumulated information that could emerge as a small neurally-locked response at the phrase-level. As indicated by the authors, the entire experimental design to fully control for such switches is a real feat. That being said, additional analyses could shed some light on variations of attentional state and their effect on observed results. For instance, analysis of behavioral data across different trials (wouldn't be conclusive, but could be informative).

Indeed, we agree that occasional ‘attention switches’ can be one explanation for the results observed here. We now elaborate on our previous discussion of this point in the revised paper (discussion, pp. 20; and see also our extensive response to reviewer #1/critique #1 on this point).

To address the reviewer’s suggestion of looking more closely at the behavioral results, and testing whether the emergence of a phrase-level response to task-irrelevant speech came at the ‘expense’ of processing the to-be-attended speech, we now include a new median-split analysis. Rather than conducting a per-trial analysis (which suffers from low SNR), we split the participants into two groups according to the median of the 1Hz ITPC value in the Structured condition and tested whether there was a significant difference in the behavioral scores of the two groups. However, this median-split analysis did not reveal any significant effects that would indicate a ‘trade-off’ between linguistic representation of task-irrelevant speech and performance on the attended task. We have added this new analysis to the paper, which reads (p. 13-14):

“Additionally, to test for possible interactions between answering questions about the to-be attended speech and linguistic neural representation of task-irrelevant speech, we performed a median-split analysis of the behavioral scores across participants. […] Neither test showed significant differences in performance between participants whose 1Hz ITPC was above vs. below the median [Structured condition: t(27) = -1.07, p=0.29; Structured – Non-Structured: t(27) = -1.04, p=0.15]. Similar null-results were obtained when the median-split was based on the source level data.”

That said, as elaborated in our response to reviewer #1/critique #2 above, and in the revised methods (p. 9), we do not believe the task used here is sufficiently sensitive for capturing behavioral consequences of task-irrelevant stimuli. The lack of good behavioral measures is, in fact, part of the challenge of studying attention to continuous speech and part of the motivation for turning towards neural metrics rather than relying on behavioral measures, as we now discuss explicitly (p. 25):

“In moving towards studying speech processing and attention under more ecological circumstances, using natural continuous speech, we face an experimental challenge of obtaining sufficiently sensitive behavior measures without disrupting listening with an ongoing task (e.g., target detection) or encroaching too much on working-memory. This is a challenge shared by many previous studies similar to ours, and is one of the main motivations for turning directly to the brain and studying neural activity during uninterrupted listening to continuous speech, rather than relying on sparse behavioral indications (Ding et al. 2016; Makov et al. 2017; Broderick et al. 2018, 2019; Donhauser and Baillet 2019; Brodbeck et al. 2020).”

Therefore, although we wholeheartedly agree with the reviewer’s comment that “the experimental design to fully control for such switches is a real feat”, and we DO hope to be able to better address this question in future (neural) studies, in our opinion the current behavioral data cannot reliably shed light on this matter.

* This issue is further compounded by the fact that a rather similar study (Ding et al.) did not report any phrasal-level processing, though there are design differences. The authors suggest differences in attentional load as a possible explanation and provide a very appealing account or reinterpretation of the literature based on a continuous model of processing based on task demands. While theoretically interesting, it is not clear whether any of the current data supports such account. Again, maybe a correlation between neural responses and behavioral performance in specific trials could shed some light or strengthen this claim.

Here too, we refer the reviewer to our response to reviewer #1/critique #8, regarding the comparison between the current results and the study by Ding et al.

We offer the perspective of “Attentional Load theory” as a way to account for the discrepancy in results between these two studies, given that they differ substantially in the listening effort required for performing the task. We agree with the reviewer that, at the moment, this is mostly a theoretical speculation (particularly given the null-behavioral effects discussed above), even though it does converge with several other lines of empirical testing (e.g. the effects of working-memory capacity on attentional abilities). It also gives rise to specific hypotheses which can (and should) be tested in follow-up studies, which is why we think it is beneficial to include this as a discussion point in our paper (p. 21-22):

“the comparison between these two studies invites re-framing of the question regarding the type / level of linguistic processing applied to task-irrelevant speech, and propels us to think about this issue not as a yes-or-no dichotomy, but perhaps as a more flexible process that depends on the specific context (Brodbeck et al. 2020b). […] As cognitive neuroscience research increasingly moves towards studying speech processing and attention in real-life circumstances, a critical challenge will be to systematically map out the perceptual and cognitive factors that contribute to, or hinder, the ability to glean meaningful information from stimuli that are outside the primary focus of attention.”

– What is the statistic shown for the behavioral results? Is this for the multiple choice question? Then what is the t-test on?

We are happy to clarify. After each narrative, participants were asked 4 multiple-choice questions (with three potential answers; chance level = 0.33). Their answers to each question were coded as ‘correct’ or ‘incorrect’, and then the average accuracy rate (% questions answered correctly) was calculated across all questions and all narratives, for each participant. This was done separately for trials in the Structured and Non-Structured conditions.

The t-test presented in Figure 2 tests for differences in average accuracy rate between conditions. Additionally, we performed a t-test comparing the accuracy rate averaged across all conditions vs. chance rate (0.33), to establish that participants performed better than chance (guessing).

We have now clarified this in the methods section (p. 9):

“The average accuracy rate of each participant (% questions answered correctly) was calculated across all questions and narratives, separately for trials in the Structured and Non-Structured condition.”

– Beyond inter-trial phase coherence, can the authors comment on actual power-locked responses at the same corresponding rates?

Thank you for this comment. Yes, we did look at the evoked power spectrum in addition to the ITPC (Author response image 2). As is evident, the power spectrum also contained a clearly visible peak at 1Hz in the Structured condition (but not in the Non-Structured condition). However, statistical testing of this peak vs. its surrounding neighbors was not significant. We attribute the discrepancy between the power spectrum and the ITPC spectrum to the 1/f power-law noise that primarily affects the former (attributed to non-specific/spontaneous neural activity; e.g. Voytek et al. 2015).

Author response image 2

Similar patterns have also been observed in other studies using frequency-tagged stimuli, and ITPC seems to be a ‘cleaner’ measure for bringing out frequency-specific phase-locked responses in the neural signal (see similar discussion in Makov et al. 2017). For this reason, in the manuscript we only report the ITPC results.
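For clarity, both measures can be derived from the same per-trial Fourier coefficients. The sketch below (Python, with hypothetical names; not the actual pipeline) illustrates the distinction: ITPC normalizes each trial's spectrum to unit magnitude and so quantifies only phase consistency across trials, whereas evoked power retains amplitude and hence the 1/f background:

```python
import numpy as np

def spectra_from_trials(trials, fs):
    """Compute evoked power and inter-trial phase coherence (ITPC) spectra.

    trials : array (n_trials, n_samples) of a single channel's responses
    fs     : sampling rate in Hz
    ITPC is the length of the mean unit phase vector across trials (0..1);
    evoked power is the squared magnitude of the trial-averaged spectrum.
    """
    trials = np.asarray(trials, dtype=float)
    spec = np.fft.rfft(trials, axis=1)                   # per-trial complex spectra
    freqs = np.fft.rfftfreq(trials.shape[1], d=1.0 / fs)
    itpc = np.abs(np.mean(spec / np.abs(spec), axis=0))  # phase consistency only
    evoked_power = np.abs(spec.mean(axis=0)) ** 2        # amplitude retained
    return freqs, itpc, evoked_power
```

Because the amplitude is divided out before averaging, a weak but phase-locked response can produce a clear ITPC peak even when it is buried under broadband power, consistent with the pattern in Author response image 2.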

In line with concerns regarding interpretation of the experimental findings, some control experiment appears critical to establish a causal link between the observed neural tracking of the 1Hz rhythm and phrasal processing of the distractor. What do the authors expect from a similarly structured but unintelligible distractor, in an unfamiliar language or even reversed?

Multiple previous studies using hierarchical frequency-tagging of speech have clearly shown that this response is specific to contexts in which individuals understand the speech. For example, both Ding et al. 2016 and Makov et al. 2017 showed that the phrase-level response is only produced for speech in a familiar language, but not for an unfamiliar language or a random concatenation of syllables. This pattern of results has been replicated across several languages so far (e.g., English, Chinese, Hebrew, Dutch) as well as for newly learned pseudo-languages (Henin et al. 2021). Accordingly, the current study builds on these previous findings in attributing the 1Hz peak to encoding of the phrase-structure of speech.

https://doi.org/10.7554/eLife.65096.sa2

Article and author information

Author details

  1. Paz Har-shai Yahav

    The Gonda Center for Multidisciplinary Brain Research, Bar Ilan University, Ramat Gan, Israel
    Contribution
    Data curation, Formal analysis, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    pazhs10@gmail.com
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-3666-3338
  2. Elana Zion Golumbic

    The Gonda Center for Multidisciplinary Brain Research, Bar Ilan University, Ramat Gan, Israel
    Contribution
    Conceptualization, Resources, Data curation, Formal analysis, Supervision, Funding acquisition, Investigation, Methodology, Writing - original draft, Project administration, Writing - review and editing
    For correspondence
    elana.zion-golumbic@biu.ac.il
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-8831-3188

Funding

Israel Science Foundation (2339/20)

  • Elana Zion Golumbic

United States - Israel Binational Science Foundation (2015385)

  • Elana Zion Golumbic

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was funded by Binational Science Foundation (BSF) grant # 2015385 and ISF grant #2339/20. We would like to thank Dr. Nai Ding for helpful comments on a previous version of this paper.

Ethics

Human subjects: The study was approved by the IRB of Bar Ilan University on 1.1.2017, under the protocol titled "Linking brain activity to selective attention at a 'Cocktail Party' " (approval duration: 2 years). All participants provided written informed consent prior to the start of the experiment.

Senior and Reviewing Editor

  1. Barbara G Shinn-Cunningham, Carnegie Mellon University, United States

Reviewers

  1. Phillip E Gander, University of Iowa, United States
  2. Ross K Maddox, University of Rochester, United States

Publication history

  1. Received: November 22, 2020
  2. Accepted: April 26, 2021
  3. Accepted Manuscript published: May 4, 2021 (version 1)
  4. Version of Record published: May 28, 2021 (version 2)

Copyright

© 2021, Har-shai Yahav and Zion Golumbic

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

