Introduction

Human speech encompasses elements at multiple levels, from phonemes to syllables, words, phrases, sentences and paragraphs. These elements unfold over distinct timescales: phonemes occur over tens of milliseconds, whereas paragraphs span several minutes. How units at these different timescales are encoded in the brain during speech comprehension remains a challenge to understand. In the visual system, it is well established that there is a hierarchical organization in which neurons in early visual areas have smaller receptive fields, while neurons in higher-level visual areas receive inputs from lower-level neurons and have larger receptive fields (Hubel & Wiesel, 1962; 1965). This organizing principle is theorized to be mirrored in the auditory system, where a hierarchy of temporal receptive windows (TRWs) extends from primary sensory regions to higher perceptual and cognitive areas (Hasson et al., 2008; Honey et al., 2012; Lerner et al., 2011; Murray et al., 2014). Under this assumption, neurons in lower-level sensory regions, such as the core auditory cortex, support rapid processing of ever-changing auditory and phonemic information, whereas neurons in higher cognitive regions, with their extended temporal receptive windows, process information at the sentence or discourse level.

Recent functional magnetic resonance imaging (fMRI) studies have provided evidence that different levels of linguistic units are encoded in different cortical regions (Blank & Fedorenko, 2020; Chang et al., 2022; Hasson et al., 2008; Lerner et al., 2011; Schmitt et al., 2021). For example, Schmitt et al. (2021) used artificial neural networks to predict the next word in a story across five stacked timescales. By correlating model predictions with the brain activity of participants listening to the story in an fMRI scanner, they discerned a hierarchical progression along the temporoparietal pathway, implicating the bilateral primary auditory cortex in processing words over shorter durations and the inferior parietal cortex in processing paragraph-length units over extended periods. Studies using electroencephalography (EEG), magnetoencephalography (MEG) and electrocorticography (ECoG) have also revealed synchronous neural responses to different linguistic units at different timescales (e.g., Ding et al., 2015; Ding & Simon, 2012; Honey et al., 2012; Luo & Poeppel, 2007). Notably, Ding et al. (2015) showed that the MEG-derived cortical response spectrum concurrently tracked the timecourses of abstract linguistic structures at the word, phrase and sentence levels.

Although there is a growing consensus on the hierarchical encoding of linguistic units in the brain, the neural representations of these units in a multi-talker setting remain less explored. In the classic “cocktail party” situation in which multiple speakers talk simultaneously (Cherry, 1953), listeners must separate a speech signal from a cacophony of other sounds (McDermott, 2009). Studies have shown that listeners with normal hearing can selectively attend to a chosen speaker in the presence of two competing speakers (Brungart, 2001; Shinn-Cunningham, 2008), resulting in enhanced neural responses to the attended speech stream (Brodbeck et al., 2018; Ding & Simon, 2012; Mesgarani & Chang, 2012; O’Sullivan et al., 2015; Zion Golumbic et al., 2013). For example, Ding and Simon (2012) showed that neural responses were selectively phase-locked to the broadband envelope of the attended speech stream in the posterior auditory cortex. Furthermore, when the intensity of the attended and unattended speakers is separately varied, the neural representation of the attended speech stream adapts only to the intensity of the attended speaker. Zion Golumbic et al. (2013) further suggested that the neural representation appears to be more “selective” in higher perceptual and cognitive brain regions such that there is no detectable tracking of ignored speech.

Research on this selective entrainment to attended speech has primarily focused on the low-level acoustic properties of speech, largely ignoring higher-level linguistic units beyond the phonemic level. Brodbeck et al. (2018) were the first to simultaneously compare the neural responses to the acoustic envelopes, phonemes and words in two competing speech streams. They found that although the acoustic envelopes of both the attended and unattended speech could be decoded from brain activity in the temporal cortex, only the phonemes and words of the attended speech elicited significant responses. However, as their study examined only two linguistic units, a complete model of how the brain tracks linguistic units, from phonemes to sentences, in two competing speech streams is still lacking.

In this study, we investigate the neural underpinnings of processing diverse linguistic units in the context of competing speech streams among listeners with normal and impaired hearing. We included hearing-impaired listeners to examine how hierarchically organized linguistic units in competing speech streams impact comprehension abilities. The experimental design consisted of a multi-talker condition and a single-talker condition. In the multi-talker condition, participants listened to mixed speech from a female and a male speaker narrating simultaneously. Before each trial, instructions on the screen indicated which speaker to attend to. In the single-talker condition, the male and female speech streams were presented separately (see Figure 1A for the experimental procedure). We employed a hierarchical multiscale Long Short-Term Memory network (HM-LSTM; Chung et al., 2017) to dissect the linguistic information of the stimuli at the phoneme, syllable, word, phrase and sentence levels (the detailed model architecture is described in the “Hierarchical multiscale LSTM model” section in Materials and Methods). We then performed ridge regressions using these linguistic units as regressors, time-locked to the offset of each sentence at various latencies (see Figure 1B for the analysis pipeline). By examining how the model's alignment with brain activity differs across linguistic levels between listeners with normal and impaired hearing, we aim to identify the specific levels of linguistic processing that pose comprehension challenges for hearing-impaired individuals.

Methods and behavioral results. A. Experimental procedure. The experimental task consisted of a multi-talker condition followed by a single-talker condition. In the multi-talker condition, the mixed speech was presented twice, with the female and male speakers narrating simultaneously. Before each trial, instructions appeared in the center of the screen indicating which talker to attend to (e.g., “Attend female”). In the single-talker condition, the male and female speech streams were presented sequentially. B. Analysis pipeline. Hidden-layer activity of the HM-LSTM model, representing each level of linguistic units for each sentence, was extracted and aligned with the EEG data, time-locked to the offset of each sentence at nine different latencies.

Results

Behavioral results

A total of 41 participants (21 females, mean age=24.1 years, SD=2.1 years) with extended high frequency (EHF) hearing loss and 33 participants (13 females, mean age=22.94 years, SD=2.36 years) with normal hearing were included in the study. EHF hearing loss refers to hearing loss at frequencies above 8 kHz. It is considered a major cause of hidden hearing loss, which cannot be detected by standard audiometry (Bharadwaj et al., 2019). Although the phonetic information required for speech perception in quiet conditions lies below 6 kHz, ample evidence suggests that salient information in higher-frequency regions may also affect speech intelligibility (e.g., Apoux & Bacon, 2004; Badri et al., 2011; Collins et al., 1981; Levy et al., 2015). Unlike age-related hearing loss, EHF hearing loss is commonly found in young adults who use earbuds and headphones for prolonged periods and are exposed to high noise levels during recreational activities (Motlagh Zadeh et al., 2019). Consequently, this demographic is a suitable comparison group for their age-matched peers with normal hearing. EHF hearing loss was diagnosed using the pure tone audiometry (PTA) test, with thresholds measured at frequencies above 8 kHz. As shown in Figure 2A, starting at 10 kHz, participants with EHF hearing loss exhibited significantly higher hearing thresholds (M=6.42 dB, SD=7 dB) than those with normal hearing (M=3.3 dB, SD=4.9 dB), as confirmed by an independent two-sample one-tailed t-test (t(72)=2, p=0.02).

Behavioral results.

A. PTA results for participants with normal hearing and EHF hearing loss. Starting at 10 kHz, participants with EHF hearing loss have significantly higher hearing thresholds (M=6.42 dB, SD=7 dB) compared to normal-hearing participants (M=3.3 dB, SD=4.9 dB; t=2, p=0.02). B. Distribution of self-rated intelligibility scores for mixed and single-talker speech across the two listener groups. * indicates p < .05, ** indicates p < .01 and *** indicates p < .001.

Figure 2B illustrates the distribution of intelligibility ratings for both mixed and single-talker speech across the two listener groups. The average intelligibility ratings for both mixed and single-talker speech were significantly higher for normal-hearing participants (mixed: M=3.89, SD=0.83; single-talker: M=4.64, SD=0.56) compared to hearing-impaired participants (mixed: M=3.38, SD=1.04; single-talker: M=4.09, SD=0.89), as shown by independent two-sample one-tailed t-tests (mixed: t(72)=2.11, p=0.02; single: t(72)=2.81, p=0.003). Additionally, paired two-sample one-tailed t-tests indicated significantly higher intelligibility scores for single-talker speech compared to mixed speech within both listener groups (normal-hearing: t(32)=4.58, p<0.0001; hearing-impaired: t(40)=4.28, p=0.0001). These behavioral results confirm that mixed speech presents greater comprehension challenges for both groups, with hearing-impaired participants experiencing more difficulty in understanding both types of speech compared to those with normal hearing.
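As an illustration of the group and within-group comparisons reported above, the following is a minimal sketch using scipy. The rating arrays are random placeholders matched to the reported group sizes, means and standard deviations, not the actual participant data (which are available at https://osf.io/fjv5n/).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
nh_mixed  = rng.normal(3.89, 0.83, 33)   # normal-hearing, mixed speech
nh_single = rng.normal(4.64, 0.56, 33)   # normal-hearing, single-talker speech
hi_mixed  = rng.normal(3.38, 1.04, 41)   # hearing-impaired, mixed speech
hi_single = rng.normal(4.09, 0.89, 41)   # hearing-impaired, single-talker speech

# Independent two-sample, one-tailed test (normal-hearing > hearing-impaired)
t_grp, p_grp = stats.ttest_ind(nh_mixed, hi_mixed, alternative='greater')

# Paired two-sample, one-tailed test (single-talker > mixed) within one group
t_cond, p_cond = stats.ttest_rel(nh_single, nh_mixed, alternative='greater')

print(f"group: t={t_grp:.2f}, p={p_grp:.3f}; condition: t={t_cond:.2f}, p={p_cond:.4f}")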

HM-LSTM model performance

To simultaneously estimate linguistic content at the phoneme, syllable, word, phrase and sentence levels, we extended the HM-LSTM model originally developed by Chung et al. (2017) to include not just the word and phrasal levels but also the sub-lexical phoneme and syllable levels. The inputs to the model were the vector representations of the phonemes in two sentences, and the output was the classification of whether the second sentence follows the first (see Figure 3A and the “Hierarchical multiscale LSTM model” section in Materials and Methods for the detailed model architecture). Unlike Transformer-based language models, which capture only word-, sentence- or paragraph-level information, our HM-LSTM model can disentangle the informational content associated with the phonemic and syllabic levels. We trained the model on the WenetSpeech corpus (Zhang et al., 2021); after 130 epochs, it achieved an accuracy of 0.87 on the training data and 0.83 on our speech stimuli, which comprise 570 sentence pairs. We then extracted activity from the trained model's four hidden layers for each sentence in our stimuli to represent information at the phoneme, syllable, word and phrase levels, with sentence-level information represented by the last unit of the fourth layer. We computed the correlations among the activations at the five levels of the HM-LSTM model and observed no strong correlations, suggesting that different layers of the model captured distinct patterns in the stimuli (see Figure 3B).
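One simple way to compute such a cross-level correlation matrix is sketched below. The activation arrays are random placeholders standing in for the trained HM-LSTM's hidden-layer output (the sentence count and layer width here are illustrative), and correlating the flattened per-level representations is an assumption about how the matrix in Figure 3B could be built.

import numpy as np

levels = ['phoneme', 'syllable', 'word', 'phrase', 'sentence']
rng = np.random.default_rng(0)
acts = {lvl: rng.normal(size=(570, 2048)) for lvl in levels}  # placeholder for model output

# Pairwise Pearson correlations between the flattened level representations
corr = np.array([[np.corrcoef(acts[a].ravel(), acts[b].ravel())[0, 1]
                  for b in levels] for a in levels])
print(np.round(corr, 2))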

The HM-LSTM model architecture and hidden-layer activity for the stimulus sentences and the four-syllable Chinese sentences with identical vowels. A. The HM-LSTM model architecture. The model includes four hidden layers, corresponding to phoneme-, syllable-, word- and phrase-level information. Sentence-level information was represented by the last unit of the 4th layer. The inputs to the model were the vector representations of the phonemes in two sentences, and the output was the classification of whether the second sentence follows the first. B. Correlation matrix of the HM-LSTM model's hidden-layer activity for the sentences in the experimental stimuli. C. Scatter plot of hidden-layer activity at the five linguistic levels for each of the 20 four-syllable sentences after MDS.

To verify that the hidden-layer activity indeed reflects information at the corresponding linguistic levels, we constructed a test dataset comprising 20 four-syllable sentences in which all syllables contain the same vowel, such as “mā ma mà mă” (mother scolds horse) and “shū shu shŭ shù” (uncle counts numbers). Table S1 in the Supplementary Materials lists all the four-syllable sentences with identical vowels. We hypothesized that, for these same-vowel sentences, activity in the phoneme and syllable layers would be more similar than activity in the other layers. The results confirmed our hypothesis: hidden-layer activity for same-vowel sentences exhibited much more similar distributions at the phoneme and syllable levels than at the word, phrase and sentence levels. Figure 3C displays the scatter plot of the model activity at the five linguistic levels for each of the 20 four-syllable sentences after dimensionality reduction using multidimensional scaling (MDS). The plot reveals that model representations at the phoneme and syllable levels are more dispersed for each sentence, while representations at the higher linguistic levels—word, phrase, and sentence—are more centralized. Additionally, similar phonemes tend to cluster together across the phoneme and syllable layers, indicating that the model captures a greater amount of information at these levels when the phonemes within the sentences are similar.
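A plot of this kind could be produced as sketched below. The activations are again random placeholders for the HM-LSTM hidden-layer output, and applying MDS separately to each level is an assumption about how the reduction in Figure 3C was carried out.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

levels = ['phoneme', 'syllable', 'word', 'phrase', 'sentence']
rng = np.random.default_rng(0)
acts = {lvl: rng.normal(size=(20, 2048)) for lvl in levels}  # 20 same-vowel test sentences

fig, ax = plt.subplots()
for lvl in levels:
    xy = MDS(n_components=2, random_state=0).fit_transform(acts[lvl])  # reduce to 2-D
    ax.scatter(xy[:, 0], xy[:, 1], label=lvl, alpha=0.6)
ax.legend()
plt.show()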

Regression results for single-talker speech versus attended speech

To examine the differences in neural activity between single- and dual-talker speech, we first compared the model fit of the acoustic and linguistic features for single-talker speech against the attended speech in the mixed-speech condition across both listener groups (see “Ridge regression at different time latencies” and “Spatiotemporal clustering analysis” in Materials and Methods for analysis details). Figure 4 shows the sensors and time windows where acoustic and linguistic features predicted the EEG data in the left temporal region significantly better during single-talker speech than during attended speech. For hearing-impaired participants, acoustic features showed a better model fit in the single-talker condition than in the mixed-speech condition from -100 to 100 ms around sentence offsets (t=1.4, Cohen’s d=1.5, p=0.002). However, no significant differences in model fit between single-talker and attended speech were observed for normal-hearing participants. Group comparisons revealed a significant difference in the model fit between the two conditions from -100 to 50 ms around sentence offsets (t=1.43, Cohen’s d=1.28, p=0.011).

Significant sensor and time window for the model fit to the EEG data for the acoustic and linguistic features extracted from the HM-LSTM model between single-talker and attended speech across the two listener groups. A. Significant sensors showing higher model fit for single-talker speech compared to the attended speech at the acoustic, phoneme, and syllable levels for the two listener groups and their contrast. B. Timecourses of mean model fit in the significant clusters where normal-hearing participants showed higher model fit at the acoustic, phoneme, and syllable levels than hearing-impaired participants. Shaded regions indicate significant time windows. * denotes p<0.05, ** denotes p<0.01 and *** denotes p<0.001.

For the linguistic features, both the phoneme and syllable layers of the HM-LSTM model were more predictive of the EEG data in single-talker speech than in attended speech among hearing-impaired participants in the left temporal regions (phoneme: t=1.9, Cohen’s d=0.49, p=0.004; syllable: t=1.9, Cohen’s d=0.37, p=0.002). The significant effect occurred from approximately 0-100 ms after sentence offsets for phonemes and 50-150 ms for syllables. No significant differences in model fit were observed between the two conditions for participants with normal hearing. Comparisons between groups revealed significant differences in the contrast maps from 0-100 ms after sentence offsets for phonemes (t=2.39, Cohen’s d=0.72, p=0.004) and from 50-150 ms after sentence offsets for syllables (t=2.11, Cohen’s d=0.78, p=0.001). The model fit to the EEG data for the higher-level linguistic features—words, phrases, and sentences—did not show any significant differences between single-talker and attended speech in either listener group. This suggests that both normal-hearing and hearing-impaired participants are able to extract information at the word, phrase, and sentence levels from the attended speech in dual-talker scenarios, much as they do when listening to a single talker.

Regression results for single-talker versus unattended speech

We also compared the model fit for single-talker speech and the unattended speech in the mixed-speech condition. As shown in Figure 5, the acoustic features showed a better model fit in the single-talker condition than in the mixed-speech condition from -100 to 50 ms around sentence offsets for hearing-impaired listeners (t=2.05, Cohen’s d=1.1, p<0.001) and from -100 to 50 ms for normal-hearing listeners (t=2.61, Cohen’s d=0.23, p<0.001). No group difference was observed in the contrast of the model fit between the two conditions.

Significant sensor and time window for the model fit to the EEG data for the acoustic and linguistic features between the single-talker and unattended speech in the mixed speech condition across the two listener groups. A. Significant sensors showing higher model fit for the single-talker speech compared to the unattended speech at the acoustic and linguistic levels for the two listener groups and their contrast. B. Timecourses of mean model fit in the significant clusters. Shaded regions indicate significant time windows. * denotes p<0.05, ** denotes p<0.01 and *** denotes p<0.001.

All five linguistic features were more predictive of the EEG data in single-talker speech than in unattended speech for both hearing-impaired participants (phoneme: t=1.72, Cohen’s d=0.79, p<0.001; syllable: t=1.94, Cohen’s d=0.9, p<0.001; word: t=2.91, Cohen’s d=1.08, p<0.001; phrase: t=1.4, Cohen’s d=0.61, p=0.041; sentence: t=1.67, Cohen’s d=1.01, p=0.023) and normal-hearing participants (phoneme: t=1.99, Cohen’s d=0.31, p=0.02; syllable: t=1.78, Cohen’s d=0.8, p<0.001; word: t=2.85, Cohen’s d=1.55, p=0.001; phrase: t=1.74, Cohen’s d=1.4, p<0.001; sentence: t=1.86, Cohen’s d=0.81, p=0.046). The significant effects occurred progressively later from the phoneme to the sentence level for both hearing-impaired participants (phoneme: -100 to 100 ms; syllable: 0-200 ms; word: 0-250 ms; phrase: 200-300 ms; sentence: 200-300 ms) and normal-hearing participants (phoneme: -50 to 100 ms; syllable: 0-200 ms; word: 50-250 ms; phrase: 100-300 ms; sentence: 200-300 ms). No significant group differences in model fit were observed between the two conditions at any linguistic level.

Regression results for attended versus unattended speech

Figure 6 depicts the model fit of the acoustic and linguistic predictors to the EEG data for both attended and unattended speech while two speakers narrated simultaneously. For normal-hearing participants, acoustic features demonstrated a better model fit for attended speech than for unattended speech from -100 ms to sentence offset (t=3.21, Cohen’s d=1.34, p=0.02). For hearing-impaired participants, no significant differences were observed in this measure. A permutation two-sample t-test confirmed that this attended-versus-unattended contrast differed significantly between normal-hearing and hearing-impaired participants in the left temporal region from -100 to -50 ms before sentence offsets (t=2.24, Cohen’s d=1.01, p=0.02).

Significant sensor and time window for the model fit to the EEG data for the acoustic and linguistic features between the attended and unattended speech in the mixed speech condition across the two listener groups. A. Significant sensors showing higher model fit for the attended speech compared to the unattended speech at the acoustic and linguistic levels for the two listener groups and their contrast. B. Timecourses of mean model fit in the significant clusters where normal-hearing participants showed higher model fit than hearing-impaired participants. Shaded regions indicate significant time windows. * denotes p<0.05, ** denotes p<0.01 and *** denotes p<0.001.

Both the phoneme and syllable features predicted attended speech significantly better than unattended speech among normal-hearing participants (phoneme: t=1.58, Cohen’s d=0.46, p=0.0006; syllable: t=1.05, Cohen’s d=1.02, p=0.0001). The significant time window for phonemes, from -100 to 250 ms around sentence offsets, was earlier than that for syllables, which was from 0 to 250 ms. No significant differences were observed in the hearing-impaired group. The contrast maps differed significantly between the two groups during the 0-100 ms window for phonemes (t=2.28, Cohen’s d=1.32, p=0.026) and the 0-150 ms window for syllables (t=2.64, Cohen’s d=1.04, p=0.022).

The word- and phrase-level features were significantly more effective at predicting EEG responses for attended speech than for unattended speech in both normal-hearing (word: t=2.59, Cohen’s d=1.14, p=0.002; phrase: t=1.77, Cohen’s d=0.68, p=0.027) and hearing-impaired listeners (word: t=3.61, Cohen’s d=1.59, p=0.001; phrase: t=1.87, Cohen’s d=0.71, p=0.004). The significant time windows for word processing were 150-250 ms for hearing-impaired listeners and 150-200 ms for normal-hearing listeners; for phrase processing, they were 150-300 ms for hearing-impaired listeners and 250-300 ms for normal-hearing listeners. No significant group differences were observed in the model fit of words and phrases to the EEG data for attended versus unattended speech. Surprisingly, we found a significantly better model fit for sentence-level features in attended speech for normal-hearing participants (t=1.52, Cohen’s d=0.98, p=0.003) but not for hearing-impaired participants, and the contrast between the two groups was significant (t=1.7, Cohen’s d=1.27, p<0.001), suggesting that hearing-impaired participants also struggle to track information at longer temporal scales in multi-talker scenarios.

Discussion

Speech comprehension in a multi-talker environment is especially challenging for listeners with impaired hearing (Fuglsang et al., 2020). Consequently, exploring the neural underpinnings of multi-talker speech comprehension in hearing-impaired listeners could yield valuable insights into the challenges faced by both normal-hearing and hearing-impaired individuals in this scenario. Studies have reported abnormally enhanced responses to fluctuations in the acoustic envelope in the central auditory system of older listeners (e.g., Goossens et al., 2016; Parthasarathy et al., 2019; Presacco et al., 2016) and listeners with peripheral hearing loss (Goossens et al., 2018; Millman et al., 2017). As older listeners also show impaired suppression of task-irrelevant sensory information due to reduced cortical inhibitory control (Gazzaley et al., 2005, 2008), it is possible that impaired speech comprehension in a cocktail party situation arises from an attentional deficit linked to aging (Du et al., 2016; Presacco et al., 2016). However, younger listeners with EHF hearing loss also report difficulty understanding speech in multi-talker environments (Motlagh Zadeh et al., 2019). It remains unknown what information is lost during multi-talker speech perception, and how the hierarchically organized linguistic units in competing speech streams affect the comprehension ability of people with impaired hearing.

In this study, we show that for normal-hearing listeners, the acoustic and linguistic features extracted from an HM-LSTM model significantly predict EEG responses during both single-talker speech and attended speech in the context of two speakers talking simultaneously. Although their intelligibility ratings for mixed speech are lower than for single-talker speech, the comparable model fits suggest that normal-hearing listeners remain capable of tracking linguistic information at these levels in a cocktail party scenario, albeit with potentially greater effort. The model fit of the EEG data for attended speech is significantly higher than that for unattended speech across all levels. This aligns with previous research on “selective auditory attention,” which demonstrates that individuals can focus on specific auditory stimuli for processing while effectively filtering out background noise (Brodbeck et al., 2018; Brungart, 2001; Ding & Simon, 2012; Mesgarani & Chang, 2012; O’Sullivan et al., 2015; Shinn-Cunningham, 2008; Zion Golumbic et al., 2013). Expanding on prior research showing that phonemes and words of attended speech can be decoded from the left temporal cortex of normal-hearing participants (Brodbeck et al., 2018), our results demonstrate that linguistic units across all hierarchical levels can be tracked in the neural signals.

For listeners with hearing impairments, the model fit for attended speech is significantly poorer at the acoustic, phoneme, and syllable levels than that for single-talker speech. Additionally, there is no significant difference in model fit at the acoustic, phoneme, and syllable levels between attended and unattended speech when two speakers are talking simultaneously. However, the model fits for the word and phrase features do not differ between single-talker and attended speech, and are significantly higher for attended than for unattended speech. These findings suggest that hearing-impaired listeners may encounter difficulties in processing information at shorter temporal scales, including the dynamic amplitude envelope and spectrotemporal details of speech, as well as phoneme- and syllable-level content. This is expected, as our EHF hearing loss participants all exhibit higher hearing thresholds at frequencies above 8 kHz. Although these frequencies exceed those necessary for phonetic information in quiet environments, which lie below 6 kHz, they may still impact the ability to process auditory information at faster temporal scales more than at slower ones. Surprisingly, hearing-impaired listeners did not demonstrate an improved model fit for sentence features of the attended speech compared to the unattended speech, indicating that their ability to process information at longer temporal scales is also compromised. One possible explanation is that hearing-impaired listeners struggle to extract low-level information from competing speech streams. Such a disruption in bottom-up processing could impede their ability to discern sentence boundaries effectively, which in turn hampers their ability to benefit from top-down information processing.

The hierarchical temporal receptive window (TRW) hypothesis proposes that linguistic units at shorter temporal scales, such as phonemes, are encoded in the core auditory cortex, while information of longer duration is processed in higher perceptual and cognitive regions, such as the anterior temporal or posterior temporal and parietal regions (Hasson et al., 2008; Honey et al., 2012; Lerner et al., 2011; Murray et al., 2014). Given the limited spatial resolution of EEG, we could not directly compare the spatial localization of these units at different temporal scales; however, we did observe an increasing latency in the significant model fit across linguistic levels. Specifically, the significant time windows for the acoustic and phoneme features occurred around -100 to 100 ms relative to sentence offsets, those for syllables and words around 0-200 ms, and those for phrases and sentences around 200-300 ms. These progressively later effects from lower to higher linguistic levels suggest that these units may indeed be represented in brain regions with increasingly longer TRWs.

Our hierarchical linguistic contents were extracted using the HM-LSTM model adapted from Chung et al. (2017). This model was adopted by Schmitt et al. (2021) to show that a “surprisal hierarchy” based on the hidden-layer activity correlated with fMRI blood oxygen level-dependent (BOLD) signals along the temporoparietal pathway during naturalistic listening. Although their research question differs from ours, their results suggest that the model effectively captures information at different linguistic levels. Our testing results further confirmed that the model representations at the phoneme and syllable levels differ from those at the higher linguistic levels when the phonemes within the sentences are similar. Compared to the increasingly popular “model-brain alignment” studies that typically use Transformer architectures (e.g., Caucheteux & King, 2022; Goldstein et al., 2022; Schrimpf et al., 2021), our HM-LSTM model is considerably smaller in parameter count and does not match the capabilities of state-of-the-art large language models (LLMs) in downstream natural language processing (NLP) tasks such as question answering, text summarization, and translation. However, our model incorporates phonemic- and syllabic-level representations, which are absent in LLMs that operate at the sub-word level. This feature could provide unique insights into how the entire hierarchy of linguistic units is processed in the brain.

It is important to note that we do not assert any similarity between the model's internal mechanisms and the brain's mechanisms for processing linguistic units at different levels. Instead, we use the model to disentangle the linguistic contents associated with these levels. This approach has proven successful in elucidating language processing in the brain, despite notable dissimilarities between model architectures and the neural architecture of the brain. For example, Nelson et al. (2017) correlated syntactic processing under different parsing strategies with intracranial electrophysiological signals and found that the left-corner and bottom-up strategies fit the left temporal data better than the most eager top-down strategy; Goldstein et al. (2022) and Caucheteux and King (2022) also showed that the human brain and deep learning language models share computational principles when processing the same natural narrative.

In summary, our findings show that linguistic units extracted from a hierarchical language model better explain the EEG responses of normal-hearing listeners for attended speech, as opposed to unattended speech, when two speakers are talking simultaneously. However, hearing-impaired listeners exhibited poorer model fits at the acoustic, phoneme, syllable, and sentence levels, although their model fits at the word and phrase levels were not significantly affected. These results suggest that processing information at both shorter and longer temporal scales is especially challenging for hearing-impaired listeners when attending to a chosen speaker in a cocktail party situation. As such, these findings connect basic research on speech comprehension with clinical studies on hearing loss, especially hidden hearing loss, a global issue that is increasingly common among young adults.

Materials and Methods

Participants

A total of 51 participants (26 females, mean age=24 years, SD=2.12 years) with EHF hearing loss and 51 normal-hearing participants (26 females, mean age=22.92 years, SD=2.14 years) took part in the experiment. Twenty-eight participants (18 females, mean age=23.55, SD=2.18) were removed from the analyses due to excessive motion, drowsiness or inability to complete the experiment, resulting in 41 participants (21 females, mean age=24.1, SD=2.1) with EHF hearing loss and 33 participants (13 females, mean age=22.94, SD=2.36) with normal hearing. All participants were right-handed native Mandarin speakers currently studying in Shanghai for an undergraduate or graduate degree, with no self-reported neurological disorders. EHF hearing loss was diagnosed using the PTA test, with thresholds measured at frequencies above 8 kHz. The PTA was performed by experienced audiological technicians using an audiometer (Madsen Astera, GN Otometrics, Denmark) with headphones (HDA-300, Sennheiser, Germany) in a soundproof booth with background noise below 25 dB(A), as described previously (Wang et al., 2021). Air-conduction audiometric thresholds for both ears at frequencies of 0.5, 1, 2, 3, 4, 6, 8, 10, 12.5, 14 and 16 kHz were measured in 5-dB steps in accordance with ISO 8253-1:2010.

Stimuli

Our experimental stimuli were two excerpts from the Chinese translation of “The Little Prince” (available at http://www.xiaowangzi.org/), previously used in fMRI studies in which participants with normal hearing listened to the book in its entirety (Li et al., 2022). This material has been enriched with detailed linguistic annotations, from the lexical to the syntactic and discourse levels, using advanced natural language processing tools; such rich annotation is critical for modeling hierarchical linguistic structures in our study. The two excerpts were narrated by one male and one female computer-synthesized voice developed by the Institute of Automation, Chinese Academy of Sciences. The synthesized speech (available at https://osf.io/fjv5n/) is comparable to human narration, as confirmed by participants’ post-experiment assessment of its naturalness. Additionally, using computer-synthesized voices instead of human-narrated speech alleviates the potential imbalance in voice intensity and speaking rate between female and male narrators. The two excerpts were matched in length (approximately 10 minutes) and mean amplitude (approximately 65 dB), and were mixed digitally into a single channel so that presentation did not differ between the left and right ears.

Experimental procedure

The experimental task consisted of a multi-talker condition and a single-talker condition. In the multi-talker condition, the mixed speech was presented twice, with the female and male speakers narrating simultaneously. Before each trial, instructions appeared in the center of the screen indicating which talker to attend to (e.g., “Attend Female”). In the single-talker condition, the male and female speech streams were presented separately (see Figure 1A for the experimental procedure). The presentation order of the four conditions was randomized, and breaks were given between trials. Stimuli were presented using insert earphones (ER-3C, Etymotic Research, United States) at a comfortable volume of approximately 65 dB SPL. Participants were instructed to maintain visual fixation on a crosshair centered on the computer screen and to minimize eye blinking and all other motor activity for the duration of each trial. The whole experiment lasted about 65 minutes, and participants rated the intelligibility of the multi-talker and single-talker speech on a 5-point Likert scale after the experiment. The experiment was conducted at the Department of Otolaryngology-Head and Neck Surgery, Shanghai Ninth People’s Hospital, affiliated with the School of Medicine at Shanghai Jiao Tong University. The experimental procedures were approved by the Ethics Committee of the Ninth People’s Hospital affiliated with Shanghai Jiao Tong University School of Medicine (SH9H-2019-T33-2). All participants provided written informed consent prior to the experiment and were paid for their participation.

Acoustic features of the speech stimuli

The acoustic features included the broadband envelope and the log-mel spectrogram of each single-talker speech stream. The amplitude envelope of the speech signal was extracted using the Hilbert transform. The 129-dimensional spectrogram and the 1-dimensional envelope were concatenated to form a 130-dimensional acoustic feature vector every 10 ms of the speech stimuli.
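A minimal sketch of this feature construction is given below, assuming 16-kHz audio. A plain log spectrogram with 129 frequency bins (nperseg=256) stands in for the log-mel spectrogram, and the audio array and frame parameters are illustrative placeholders.

import numpy as np
from scipy.signal import hilbert, spectrogram

fs = 16000
wav = np.random.randn(fs * 10)                      # placeholder for a 10-s audio chunk

envelope = np.abs(hilbert(wav))                     # broadband envelope via Hilbert transform

hop = int(0.01 * fs)                                # 10-ms hop
f, t, spec = spectrogram(wav, fs=fs, nperseg=256, noverlap=256 - hop)
log_spec = np.log(spec + 1e-10)                     # (129, n_frames)

env_frames = np.interp(t, np.arange(len(wav)) / fs, envelope)    # envelope at frame times
acoustic = np.vstack([log_spec, env_frames[None, :]]).T          # (n_frames, 130)
print(acoustic.shape)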

Hierarchical multiscale LSTM model

We extended the original HM-LSTM model developed by Chung et al. (2017) to include not just the word and phrasal levels but also the sub-lexical phoneme and syllable levels. The inputs to the model were the vector representations of the phonemes in two sentences and the output of the model was the classification result of whether the second sentence follows the first sentence. We trained our model on 10,000 sentence pairs from the WenetSpeech corpus (Zhang et al., 2021), a collection that features over 10,000 hours of labeled Mandarin Chinese speech sourced from YouTube and podcasts. We used 1024 units for the input embedding and 2048 units for each HM-LSTM layer.
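For orientation, the sketch below shows a simplified stand-in with the stated dimensions: a four-layer recurrent encoder over 1024-dimensional phoneme embeddings with 2048 hidden units per layer and a binary next-sentence classifier. It deliberately omits the boundary detectors and multiscale update rules that define the actual HM-LSTM of Chung et al. (2017); the vocabulary size and input shapes are illustrative.

import torch
import torch.nn as nn

class SentencePairClassifier(nn.Module):
    def __init__(self, n_phonemes, emb_dim=1024, hidden_dim=2048, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, emb_dim)
        # One LSTM per level so that per-level hidden activity can be read out later
        self.layers = nn.ModuleList(
            [nn.LSTM(emb_dim if i == 0 else hidden_dim, hidden_dim, batch_first=True)
             for i in range(n_layers)]
        )
        self.classifier = nn.Linear(hidden_dim, 2)      # does sentence 2 follow sentence 1?

    def forward(self, phoneme_ids):                     # (batch, seq_len) phoneme indices
        x = self.embed(phoneme_ids)
        per_level = []
        for lstm in self.layers:
            x, _ = lstm(x)
            per_level.append(x)                         # hidden activity at each level
        return self.classifier(x[:, -1]), per_level     # classify from the last top-layer unit

model = SentencePairClassifier(n_phonemes=100)          # vocabulary size is illustrative
logits, states = model(torch.randint(0, 100, (2, 30)))  # two sentence pairs of 30 phonemes each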

EEG recording and preprocessing

EEG was recorded using a standard 64-channel actiCAP mounted according to the international 10-20 system, referenced to the nose (Brain Vision Recorder, Brain Products). The ground electrode was placed on the forehead. EEG signals were recorded between 0.016 and 80 Hz at a sampling rate of 500 Hz, with impedances kept below 20 kΩ. The recordings were band-pass filtered between 0.1 and 45 Hz using a linear-phase finite impulse response (FIR) filter, and independent component analysis (ICA) was applied to remove eye-blink artifacts. The EEG data were then segmented into epochs spanning 500 ms before stimulus onset to 10 minutes after stimulus onset and downsampled to 100 Hz.
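A condensed sketch of these steps with MNE is shown below; the file name, number of ICA components, and the index of the excluded blink component are placeholders, not values from the actual pipeline.

import mne

raw = mne.io.read_raw_brainvision('sub-01.vhdr', preload=True)   # 64-ch actiCAP recording
raw.filter(l_freq=0.1, h_freq=45.0, method='fir', phase='zero')  # linear-phase FIR band-pass

ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(raw)
ica.exclude = [0]           # index of the eye-blink component (identified by inspection)
ica.apply(raw)

# Epoch from 500 ms before stimulus onset to the end of the ~10-min section,
# then downsample to 100 Hz
events, event_id = mne.events_from_annotations(raw)
epochs = mne.Epochs(raw, events, event_id, tmin=-0.5, tmax=600.0,
                    baseline=None, preload=True)
epochs.resample(100)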

Ridge regression at different time latencies

For each subject, we modeled the EEG responses at each sensor in the single-talker and mixed-talker conditions with our acoustic and linguistic features using ridge regression (see Figure 1B). For each sentence in the speech stimuli, we used principal component analysis (PCA) to reduce the 2048-dimensional hidden-layer activations at each linguistic level to the first 150 principal components. The 150-dimensional vectors for the five linguistic levels were then entered into ridge regressions against the EEG signals, time-locked to the offset of each sentence in the stimuli. To assess the temporal progression of the regression outcomes, we conducted the analysis at nine sequential time points, ranging from 100 ms before to 300 ms after sentence offset in 50-ms steps. The same regression procedure was applied to the 130-dimensional acoustic features. Both the EEG data and the regressors were z-scored before regression.
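The sketch below illustrates this procedure for one linguistic level. The EEG epochs and hidden-layer activations are random placeholders, and the ridge penalty (alpha) is illustrative, since the regularization setting is not specified here.

import numpy as np
from scipy.stats import zscore
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_sent, n_sensors = 570, 64
layer_acts = rng.normal(size=(n_sent, 2048))      # hidden-layer activity per sentence (placeholder)
eeg = rng.normal(size=(n_sent, n_sensors, 41))    # -100 to 300 ms around offset at 100 Hz

# Reduce the activations to 150 principal components, then z-score the regressors
X = zscore(PCA(n_components=150).fit_transform(layer_acts), axis=0)

latencies_ms = np.arange(-100, 301, 50)           # nine latencies in 50-ms steps
r2 = np.zeros((len(latencies_ms), n_sensors))
for i, lat in enumerate(latencies_ms):
    t_idx = (lat + 100) // 10                     # sample index at 100 Hz (10-ms bins)
    for ch in range(n_sensors):
        y = zscore(eeg[:, ch, t_idx])
        r2[i, ch] = Ridge(alpha=1.0).fit(X, y).score(X, y)   # coefficient of determination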

Spatiotemporal clustering analysis

The timecourses of the coefficient of determination (R2) from the regression results at the nine time points for each sensor, corresponding to the five linguistic regressors and the one acoustic regressor for each subject, underwent spatiotemporal cluster analyses to determine their statistical significance at the group level. To assess the differences in model fit between single-talker speech and attended speech in the mixed speech for both participants with EHF hearing loss and those with normal hearing, we conducted one-tailed one-sample t-tests on the z-transformed R2 timecourses for the two speech types. Additionally, we used one-tailed two-sample t-tests to explore between-group differences in the contrasts of the R2 timecourses for these speech types. We repeated these procedures 10,000 times, replacing the observed t-values with shuffled t-values for each participant to generate a null distribution of t-values for each sensor. Sensors whose t-values were in the top 5th percentile of the null distribution were deemed significant (sensor-wise significance p < 0.05). The same method was applied to analyze the contrasts between attended and unattended speech in the mixed-speech condition, both within and between groups. All analyses were performed using custom Python code, making heavy use of the mne (v.1.6.1), torch (v2.2.0) and scipy (v1.12.0) packages.
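A sketch of this group-level permutation step is given below. The per-subject R2 contrasts are random placeholders, the sensor adjacency is derived from a standard easycap montage rather than the actual recording layout, and MNE's built-in spatio_temporal_cluster_1samp_test stands in for the custom permutation procedure described above.

import numpy as np
import mne
from mne.channels import find_ch_adjacency
from mne.stats import spatio_temporal_cluster_1samp_test

montage = mne.channels.make_standard_montage('easycap-M1')     # stand-in sensor layout
info = mne.create_info(montage.ch_names, sfreq=100.0, ch_types='eeg')
info.set_montage(montage)
adjacency, ch_names = find_ch_adjacency(info, ch_type='eeg')

rng = np.random.default_rng(0)
n_subjects, n_latencies = 33, 9
r2_diff = rng.normal(size=(n_subjects, n_latencies, len(ch_names)))  # z-transformed R2 contrast

t_obs, clusters, cluster_pv, _ = spatio_temporal_cluster_1samp_test(
    r2_diff, adjacency=adjacency, n_permutations=10000, tail=1)      # one-tailed test
print('significant clusters:', np.where(cluster_pv < 0.05)[0])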

Data and code availability

All data and codes are available at https://osf.io/fjv5n/.

Supplementary

All four-syllable Chinese sentences with the same vowels.