Research Article

Neuroscience

Multi-talker speech comprehension at different temporal scales in listeners with normal and impaired hearing

Department of Linguistics and Translation, City University of Hong Kong, Hong Kong
Department of Facial Plastic and Reconstructive Surgery, Eye and ENT Hospital, Fudan University, China
ENT institute, Eye and ENT Hospital, Fudan University, China
Department of Otolaryngology-Head and Neck Surgery, Shanghai Ninth People’s Hospital, Shanghai Jiao Tong University School of Medicine, China
Department of Computer Science and Technology, Nanjing University, China
Institute of Automation, Chinese Academy of Sciences, China
Department of Linguistics, Department of Psychology, New York University, United States
College of Health Science and Technology, Shanghai Jiao Tong University School of Medicine, China

May 26, 2026

https://doi.org/10.7554/eLife.100056.4

Open access

6 figures, 1 video, 1 table and 1 additional file

Figures

Figure 1 with 1 supplement

Methods and behavioral results.

(A) Experimental procedure. The experimental task consisted of a multi-talker condition followed by a single-talker condition. In the multi-talker condition, the mixed speech was presented twice, with the female and male speakers narrating simultaneously. Before each trial, instructions appeared in the center of the screen indicating which talker to attend to (e.g. ‘Attend female’). In the single-talker condition, the male and female speech was presented sequentially. (B) Analysis pipeline. Hidden-layer activity of the hierarchical multiscale Long Short-Term Memory network (HM-LSTM) model, which represents each level of linguistic units for each sentence, was extracted and aligned with the electroencephalogram (EEG) data, time-locked to the offset of each sentence at nine different latencies.
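The alignment step in panel B can be illustrated with a minimal sketch, assuming one feature vector per sentence and a single EEG sensor. It is not the authors' pipeline: the closed-form ridge solver, the penalty `alpha`, the latency values, and all data below are hypothetical and synthetic, chosen only to show how a model fit (R²) can be computed at several latencies after each sentence offset.

```python
import numpy as np

def ridge_r2(X, y, alpha=1.0):
    """Fit ridge regression in closed form and return the in-sample R^2."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Closed-form ridge solution: (X'X + alpha * I)^-1 X'y
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    resid = yc - Xc @ beta
    return 1.0 - (resid @ resid) / (yc @ yc)

def fit_by_latency(features, eeg, offsets, latencies):
    """R^2 of the feature-to-EEG fit at each latency after sentence offset.

    features:  (n_sentences, n_dims), one vector per sentence
    eeg:       (n_samples,), time series of one sensor
    offsets:   (n_sentences,), sample index of each sentence offset
    latencies: sample shifts relative to the offset (hypothetical values)
    """
    return np.array([ridge_r2(features, eeg[offsets + lag]) for lag in latencies])

# Synthetic example: the "neural" response to each sentence is planted
# 20 samples after the sentence offset.
rng = np.random.default_rng(0)
n_sentences, n_dims = 200, 5
features = rng.standard_normal((n_sentences, n_dims))
offsets = 100 + 100 * np.arange(n_sentences)
eeg = 0.1 * rng.standard_normal(offsets[-1] + 200)
true_lag = 20
eeg[offsets + true_lag] += features @ np.ones(n_dims)

latencies = np.arange(0, 90, 10)  # nine illustrative latencies, in samples
r2 = fit_by_latency(features, eeg, offsets, latencies)
best_latency = latencies[np.argmax(r2)]
```

In this toy setting the fit peaks at the planted latency and stays near zero elsewhere, which is the kind of latency-resolved profile the time-course panels in the later figures summarize.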

Figure 1—figure supplement 1

Correlation matrices of regression outcomes for the five linguistic predictors between the electroencephalogram (EEG) data from delta, theta, and all frequency bands.

Figure 2

Behavioral results.

(A) Pure tone audiometry (PTA) results for participants with normal hearing and extended high-frequency (EHF) hearing loss. Starting at 10 kHz, participants with EHF hearing loss have significantly higher hearing thresholds (M = 6.42 dB, SD = 7 dB) compared to normal-hearing participants (M = 3.3 dB, SD = 4.9 dB; t = 2, p = 0.02). (B) Distribution of self-rated intelligibility scores for mixed- and single-talker speech across the two listener groups. * indicates p<0.05, ** indicates p<0.01, and *** indicates p<0.001.

Figure 3

The hierarchical multiscale Long Short-Term Memory network (HM-LSTM) model architecture and hidden-layer activity for the stimulus sentences and the four-word Chinese sentences with the same vowels.

(A) The HM-LSTM model architecture. The model includes four hidden layers, corresponding to the phoneme-, syllable-, word-, and phrase-level information. Sentence-level information was represented by the last unit of the fourth layer. The inputs to the model were the vector representations of the phonemes in two sentences, and the output of the model was the classification result of whether the second sentence follows the first sentence. (B) Correlation matrix for the HM-LSTM model’s hidden-layer activity for the sentences in the experimental stimuli. (C) Scatter plot of hidden-layer activity at the five linguistic levels for each of the 20 four-syllable sentences after multidimensional scaling (MDS).
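The projection in panel C can be sketched as follows. The caption does not say which MDS variant was used, so this example assumes classical (Torgerson) MDS on Euclidean distances; the 64-unit hidden size and the random activity matrix are hypothetical placeholders, not the model's actual activity.

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Classical (Torgerson) MDS from Euclidean distances between rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared pairwise distances
    n = d2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ d2 @ J                         # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_components]  # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0, None))

# Hypothetical hidden-layer activity: 20 sentences x 64 hidden units.
rng = np.random.default_rng(0)
activity = rng.standard_normal((20, 64))
coords = classical_mds(activity, n_components=2)  # one 2-D point per sentence
```

Each row of `coords` can then be drawn as one point in a scatter plot, with sentences whose activity vectors are close in the original space landing near each other.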

Figure 4

Significant sensor and time window for the model fit to the electroencephalogram (EEG) data for the acoustic and linguistic features extracted from the hierarchical multiscale Long Short-Term Memory network (HM-LSTM) model between single-talker and attended speech across the two listener groups.

(A) Significant sensors showing higher model fit for single-talker speech compared to the attended speech at the acoustic, phoneme, and syllable levels for the two listener groups and their contrast. (B) Time courses of mean model fit in the significant clusters where normal-hearing participants showed higher model fit at the acoustic, phoneme, and syllable levels than hearing-impaired participants. The coefficient of determination (R²) was z-transformed. Shaded regions indicate significant time windows. * denotes p<0.05, ** denotes p<0.01, and *** denotes p<0.001.

Figure 5

Significant sensor and time window for the model fit to the electroencephalogram (EEG) data for the acoustic and linguistic features between the single-talker and unattended speech in the mixed speech condition across the two listener groups.

(A) Significant sensors showing higher model fit for the single-talker speech compared to the unattended speech at the acoustic and linguistic levels for the two listener groups and their contrast. (B) Time courses of mean model fit in the significant clusters, with the significant time windows for within-group comparisons. The coefficient of determination (R²) was z-transformed. Shaded regions indicate significant time windows. * denotes p<0.05, ** denotes p<0.01, and *** denotes p<0.001.

Figure 6 with 2 supplements

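Significant sensor and time window for the model fit to the electroencephalogram (EEG) data for the acoustic and linguistic features between the attended and unattended speech in the mixed speech condition across the two listener groups.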

Figure 6—figure supplement 1

Contrast of temporal response function (TRF) weights to the electroencephalogram (EEG) data of attended and unattended speech for the five linguistic predictors.

(A) Significant sensors showing higher model fit for the attended speech compared to the unattended speech at the acoustic and linguistic levels for the two listener groups and their contrast. Shaded regions indicate significant time windows. * denotes p<0.05, ** denotes p<0.01, and *** denotes p<0.001.

Figure 6—figure supplement 2

Temporal response function (TRF) weights to the electroencephalogram (EEG) data of attended and unattended speech envelope.

Shaded regions indicate significant time windows. * denotes p<0.05, ** denotes p<0.01, and *** denotes p<0.001.
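For readers unfamiliar with TRF weights, the following is a minimal sketch of how a temporal response function can be estimated from a speech envelope, assuming a lagged design matrix and ridge regularization. It is not the authors' mTRF implementation: the function names, the lag range, the penalty `alpha`, and the synthetic envelope and response are all hypothetical.

```python
import numpy as np

def lagged_design(stim, lags):
    """Stack time-shifted copies of a 1-D stimulus, one column per lag (in samples)."""
    n = len(stim)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stim[:n - lag]   # stimulus delayed by `lag` samples
        else:
            X[:n + lag, j] = stim[-lag:]  # stimulus advanced by |lag| samples
    return X

def estimate_trf(stim, eeg, lags, alpha=1.0):
    """Ridge estimate of TRF weights, one weight per lag."""
    X = lagged_design(stim, lags)
    return np.linalg.solve(X.T @ X + alpha * np.eye(len(lags)), X.T @ eeg)

# Synthetic example: the response is the envelope convolved with a known kernel.
rng = np.random.default_rng(0)
envelope = rng.standard_normal(5000)
true_trf = np.array([0.0, 0.5, 1.0, 0.5, 0.0, -0.3, 0.0, 0.0])
lags = np.arange(len(true_trf))
response = np.convolve(envelope, true_trf)[:len(envelope)]
response = response + 0.1 * rng.standard_normal(len(envelope))

trf = estimate_trf(envelope, response, lags)
```

The recovered `trf` approximates the planted kernel, with one weight per lag; weights of this kind, computed per sensor and condition, are what a TRF time-course plot displays.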

Videos

Video 1

Explanation of the ridge regression methods versus the multivariate temporal response function (mTRF) analyses.

Explainer videos are not peer reviewed.

Tables

Table 1

All four-syllable Chinese sentences with the same vowels.

nǎi	nai	mǎi	mài
ʃə̌n	ʃən	ʃə̌n	ʃə̀n
gūŋ	guŋ	tʃūŋ	dùŋ
mèi	mei	méi	lèi
tɕiə̀u	tɕiəu	tɕʰiə́u	tɕiə̀u
jí	ji	jí	jì
dì	ɕí	ɕī	ɕì
tài	tai	bǎi	pāi
wá	wa	wā	wā
ʃū	fù	sù	dú
gū	fu	kǔ	dú
bà	ba	dá	kǎ
mā	ma	mà	mǎ
buɔ́	buɔ	buɔ̄	guɔ̌
ʃū	ʃu	ʃǔ	ʃù
gū	gu	bǔ	bù
lǎʊ	laʊ	náʊ	māʊ
puɔ́	puɔ́	duɔ̌	guɔ̄
tɕiě	tɕie	tɕié	tɕiē
dì	di	ɕǐ	dì

Additional files

MDAR checklist: https://cdn.elifesciences.org/articles/100056/elife-100056-mdarchecklist1-v1.pdf

Cite this article

Jixing Li, Qixuan Wang, Qian Zhou, Lu Yang, Yutong Shen, Shujian Huang, Shaonan Wang, Liina Pylkkänen, Zhiwu Huang (2026) Multi-talker speech comprehension at different temporal scales in listeners with normal and impaired hearing. eLife 13:RP100056. https://doi.org/10.7554/eLife.100056.4

Figures

Methods and behavioral results.

Correlation matrices of regression outcomes for the five linguistic predictors between the electroencephalogram (EEG) data from delta, theta, and all frequency bands.

Behavioral results.

The hierarchical multiscale Long Short-Term Memory network (HM-LSTM) model architecture and hidden-layer activity for the stimulus sentences and the four-word Chinese sentences with the same vowels.

Significant sensor and time window for the model fit to the electroencephalogram (EEG) data for the acoustic and linguistic features extracted from the hierarchical multiscale Long Short-Term Memory network (HM-LSTM) model between single-talker and attended speech across the two listener groups.

Significant sensor and time window for the model fit to the electroencephalogram (EEG) data for the acoustic and linguistic features between the single-talker and unattended speech in the mixed speech condition across the two listener groups.

Significant sensor and time window for the model fit to the electroencephalogram (EEG) data for the acoustic and linguistic features between the attended and unattended speech in the mixed speech condition across the two listener groups.

Contrast of temporal response function (TRF) weights to the electroencephalogram (EEG) data of attended and unattended speech for the five linguistic predictors.

Temporal response function (TRF) weights to the electroencephalogram (EEG) data of attended and unattended speech envelope.

Videos

Explanation of the ridge regression methods versus the multivariate temporal response function (mTRF) analyses.

Tables

All four-syllable Chinese sentences with the same vowels.

Additional files

MDAR checklist
