Multi-talker speech comprehension at different temporal scales in listeners with normal and impaired hearing

  1. Jixing Li (corresponding author)
  2. Qixuan Wang
  3. Qian Zhou
  4. Lu Yang
  5. Yutong Shen
  6. Shujian Huang
  7. Shaonan Wang
  8. Liina Pylkkänen
  9. Zhiwu Huang (corresponding author)
  1. Department of Linguistics and Translation, City University of Hong Kong, Hong Kong
  2. Department of Facial Plastic and Reconstructive Surgery, Eye and ENT Hospital, Fudan University, China
  3. ENT Institute, Eye and ENT Hospital, Fudan University, China
  4. Department of Otolaryngology-Head and Neck Surgery, Shanghai Ninth People’s Hospital, Shanghai Jiao Tong University School of Medicine, China
  5. Department of Computer Science and Technology, Nanjing University, China
  6. Institute of Automation, Chinese Academy of Sciences, China
  7. Department of Linguistics, Department of Psychology, New York University, United States
  8. College of Health Science and Technology, Shanghai Jiao Tong University School of Medicine, China
6 figures, 1 video, 1 table and 1 additional file

Figures

Figure 1 with 1 supplement
Methods and behavioral results.

(A) Experimental procedure. The experimental task consisted of a multi-talker condition followed by a single-talker condition. In the multi-talker condition, the mixed speech was presented twice, with the female and male speakers narrating simultaneously. Before each trial, instructions appeared in the center of the screen indicating which talker to attend to (e.g. ‘Attend female’). In the single-talker condition, the male and female speech was presented sequentially. (B) Analysis pipeline. Hidden-layer activity of the hierarchical multiscale Long Short-Term Memory network (HM-LSTM) model, which represents each level of linguistic units for each sentence, was extracted and aligned with the EEG data, time-locked to the offset of each sentence at nine different latencies.
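The alignment step described in panel B can be sketched as one ridge regression per latency. The snippet below is a minimal illustration under stated assumptions, not the authors' code: the array shapes, the regularization strength `alpha`, the latency grid in samples, and the function names `ridge_r2` and `fit_by_latency` are all hypothetical, synthetic data stand in for the HM-LSTM features and the EEG, and R² is computed in-sample for brevity.

```python
import numpy as np

def ridge_r2(X, y, alpha=1.0):
    """Closed-form ridge regression on centered data; returns in-sample R^2."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_feat = Xc.shape[1]
    # Solve (X'X + alpha*I) w = X'y for the ridge weights.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_feat), Xc.T @ yc)
    resid = yc - Xc @ w
    return 1.0 - (resid @ resid) / (yc @ yc)

def fit_by_latency(features, eeg, offsets, latencies, alpha=1.0):
    """Model fit at each latency relative to sentence offset.

    features:  (n_sentences, n_dims) sentence-level model features
    eeg:       (n_samples,) signal from a single sensor
    offsets:   (n_sentences,) sample index of each sentence offset
    latencies: iterable of shifts (in samples) relative to the offsets
    """
    scores = []
    for lag in latencies:
        y = eeg[offsets + lag]  # EEG sampled at offset + lag for every sentence
        scores.append(ridge_r2(features, y, alpha))
    return np.array(scores)

# Synthetic demonstration (all values are made up for illustration).
rng = np.random.default_rng(0)
n_sent, n_dims = 200, 8
features = rng.standard_normal((n_sent, n_dims))
offsets = np.arange(n_sent) * 100 + 50
eeg = rng.standard_normal(n_sent * 100 + 200)
true_w = rng.standard_normal(n_dims)
eeg[offsets + 20] += features @ true_w   # plant a response 20 samples after offset

latencies = np.arange(-40, 50, 10)       # nine hypothetical latencies
r2 = fit_by_latency(features, eeg, offsets, latencies)
best_latency = int(latencies[np.argmax(r2)])
```

In this toy setup the fit peaks at the latency where the response was planted, which is the kind of latency profile the time-course panels in the later figures summarize.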

Figure 1—figure supplement 1
Correlation matrices of regression outcomes for the five linguistic predictors between the electroencephalogram (EEG) data from delta, theta, and all frequency bands.

Figure 2
Behavioral results.

(A) Pure tone audiometry (PTA) results for participants with normal hearing and extended high-frequency (EHF) hearing loss. Starting at 10 kHz, participants with EHF hearing loss have significantly higher hearing thresholds (M = 6.42 dB, SD = 7 dB) compared to normal-hearing participants (M = 3.3 dB, SD = 4.9 dB; t = 2, p = 0.02). (B) Distribution of self-rated intelligibility scores for mixed- and single-talker speech across the two listener groups. * indicates p<0.05, ** indicates p<0.01, and *** indicates p<0.001.

Figure 3
The hierarchical multiscale Long Short-Term Memory network (HM-LSTM) model architecture and hidden-layer activity for the stimulus sentences and the four-syllable Chinese sentences with the same vowels.

(A) The HM-LSTM model architecture. The model includes four hidden layers, corresponding to the phoneme-, syllable-, word-, and phrase-level information. Sentence-level information was represented by the last unit of the fourth layer. The inputs to the model were the vector representations of the phonemes in two sentences, and the output of the model was the classification result of whether the second sentence follows the first sentence. (B) Correlation matrix for the HM-LSTM model’s hidden-layer activity for the sentences in the experimental stimuli. (C) Scatter plot of hidden-layer activity at the five linguistic levels for each of the 20 four-syllable sentences after multidimensional scaling (MDS).
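The projection in panel C can be illustrated with a small multidimensional scaling routine. This is a sketch under assumptions rather than the authors' procedure: the caption does not say which MDS variant was used, so classical (Torgerson) MDS on Euclidean distances is swapped in here, and the hidden-layer size of 64, the random stand-in data, and the function name `classical_mds` are hypothetical.

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Classical (Torgerson) MDS from Euclidean distances between rows of X."""
    # Squared pairwise Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)
    # Double-center the squared distances to obtain a Gram matrix.
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ d2 @ J
    # Keep the top eigenvectors, scaled by the square roots of their eigenvalues.
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_components]
    top_vals = np.clip(vals[order], 0, None)
    return vecs[:, order] * np.sqrt(top_vals)

# Stand-in for hidden-layer activity: 20 sentences x 64 units (hypothetical size).
rng = np.random.default_rng(0)
hidden = rng.standard_normal((20, 64))
coords = classical_mds(hidden, n_components=2)  # one 2-D point per sentence
```

Running the routine once per linguistic level would give one set of 2-D coordinates per level, which is the layout a scatter plot like panel C needs.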

Figure 4
Significant sensors and time windows for the model fit to the electroencephalogram (EEG) data for the acoustic and linguistic features extracted from the hierarchical multiscale Long Short-Term Memory network (HM-LSTM) model between single-talker and attended speech across the two listener groups.

(A) Significant sensors showing higher model fit for single-talker speech compared to the attended speech at the acoustic, phoneme, and syllable levels for the two listener groups and their contrast. (B) Time courses of mean model fit in the significant clusters where normal-hearing participants showed higher model fit at the acoustic, phoneme, and syllable levels than hearing-impaired participants. The coefficient of determination (R²) was z-transformed. Shaded regions indicate significant time windows. * denotes p<0.05, ** denotes p<0.01, and *** denotes p<0.001.

Figure 5
Significant sensors and time windows for the model fit to the electroencephalogram (EEG) data for the acoustic and linguistic features between the single-talker and unattended speech in the mixed speech condition across the two listener groups.

(A) Significant sensors showing higher model fit for the single-talker speech compared to the unattended speech at the acoustic and linguistic levels for the two listener groups and their contrast. (B) Time courses of mean model fit in the significant clusters, with the significant time windows for within-group comparisons. The coefficient of determination (R²) was z-transformed. Shaded regions indicate significant time windows. * denotes p<0.05, ** denotes p<0.01, and *** denotes p<0.001.

Figure 6 with 2 supplements
Significant sensors and time windows for the model fit to the electroencephalogram (EEG) data for the acoustic and linguistic features between the attended and unattended speech in the mixed speech condition across the two listener groups.

(A) Significant sensors showing higher model fit for the attended speech compared to the unattended speech at the acoustic and linguistic levels for the two listener groups and their contrast. (B) Time courses of mean model fit in the significant clusters where normal-hearing participants showed higher model fit than hearing-impaired participants. The coefficient of determination (R²) was z-transformed. Shaded regions indicate significant time windows. * denotes p<0.05, ** denotes p<0.01, and *** denotes p<0.001.

Figure 6—figure supplement 1
Contrast of temporal response function (TRF) weights to the electroencephalogram (EEG) data of attended and unattended speech for the five linguistic predictors.

(A) Significant sensors showing higher model fit for the attended speech compared to the unattended speech at the acoustic and linguistic levels for the two listener groups and their contrast. Shaded regions indicate significant time windows. * denotes p<0.05, ** denotes p<0.01, and *** denotes p<0.001.

Figure 6—figure supplement 2
Temporal response function (TRF) weights to the electroencephalogram (EEG) data of attended and unattended speech envelope.

Shaded regions indicate significant time windows. * denotes p<0.05, ** denotes p<0.01, and *** denotes p<0.001.

Videos

Video 1
Explanation of the ridge regression methods versus the multivariate temporal response function (mTRF) analyses.

Explainer videos are not peer reviewed.

Tables

Table 1
All four-syllable Chinese sentences with the same vowels.

nǎinaimǎimài
ʃə̌nʃənʃə̌nʃə̀n
gūŋguŋtʃūŋdùŋ
mèimeiméilèi
tɕiə̀utɕiəutɕʰiə́utɕiə̀u
jiɕíɕīɕì
tàitaibǎipāi
wa
ʃū
fu
ba
ma
buɔ́buɔbuɔ̄guɔ̌
ʃūʃuʃǔʃù
gu
lǎʊlaʊnáʊmāʊ
puɔ́puɔ́duɔ̌guɔ̄
tɕiětɕietɕiétɕiē
diɕǐ

Additional files

Cite this article

  1. Jixing Li
  2. Qixuan Wang
  3. Qian Zhou
  4. Lu Yang
  5. Yutong Shen
  6. Shujian Huang
  7. Shaonan Wang
  8. Liina Pylkkänen
  9. Zhiwu Huang
(2026)
Multi-talker speech comprehension at different temporal scales in listeners with normal and impaired hearing
eLife 13:RP100056.
https://doi.org/10.7554/eLife.100056.4