Segmentation of speech and application of topic modelling.

a, Perceptually driven segmentation of speech. Continuous speech was segmented into phrases or sentences in a perceptually relevant manner. Using the Syllable Nuclei script in Praat, we obtained acoustic speech chunks with the following threshold parameters: minimum pause (silence) duration of 0.25 s, silence threshold of −25 dB, and minimum speech-chunk length of 1 s. We detected 129 speech chunks on average, with a mean duration of 3.37 s, across the seven continuous talks used in the study (also see Supplementary Table 1). A snapshot from one example talk is shown in the figure. Each row depicts the raw speech signal, the spectrogram with intensity (in yellow) and pitch (in blue), the number of syllables, and annotations. b, Text preprocessing and topic modelling. Left column: An example of segmented speech chunks from a representative talk is shown. In this talk, 135 speech chunks were obtained. Middle column: Annotations of the spoken speech materials were preprocessed using the Python library spaCy in three steps: tokenization, lemmatization and removal of stop words. Right column: Schematic illustration of the topic modelling algorithm, Latent Dirichlet Allocation (LDA). A fixed number of topics was specified in advance (4 in the current study), and a bigram model was used in the topic model to optimally capture topic messages. c, Extracted topic keywords. From the LDA model, the documents (speech chunks in the current study) assigned to topic t and the words with high probability for topic t are obtained as outputs (document-topic matrix). The most common words with the highest probability are shown for each topic. d, Distribution of topic probability across speech chunks in a representative talk. Vectors of topic probabilities for each speech chunk are depicted as a stacked bar chart (topic mixture) in which the x- and y-axes depict topic probability and the identity of speech chunks, respectively.
Color-coded bars in each row represent the probability distribution across the 4 topics in a given speech chunk. In this sample talk, topic 3 has the highest probability across all speech chunks, indicating that it provides the representative topic keywords for this talk. e, t-SNE visualization of topics, colored and sized by topic probability. The document-topic matrix from the LDA model was subjected to t-SNE dimensionality reduction and visualized as a scatterpie chart in the two-dimensional embedded space. Each data point, drawn as a scatterpie, represents one speech chunk, which is clustered into the topic for which it has the highest probability; the scatterpie also shows the chunk's probability distribution over topics. f, Speech chunks sorted according to the topic probability of the representative topic (topic 3 in this example talk). The speech chunks shown in b were sorted from highest to lowest probability for topic 3. To investigate the neural processing of topic representation, we split these speech chunks into high vs. low topic probability conditions. g, Segmentation and allocation of the MEG and speech signals into topic probability conditions. The corresponding brain data at sensor and source level and the auditory speech envelope were likewise split into high vs. low topic probability conditions, resulting in trial-based epochs.
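The three preprocessing steps in b (tokenization, lemmatization, stop-word removal), followed by the bigram formation used for the LDA model, can be sketched as follows. The study used spaCy; this minimal pure-Python stand-in, with an illustrative stop-word list and a toy lemma table, only illustrates the shape of the pipeline:

```python
# Simplified stand-in for the spaCy preprocessing pipeline described in b.
# The stop-word list and lemma table below are illustrative assumptions,
# not the resources actually used in the study.
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "was", "in"}
LEMMAS = {"talking": "talk", "ideas": "idea"}  # toy lemma table

def preprocess(chunk: str) -> list[str]:
    tokens = re.findall(r"[a-z']+", chunk.lower())      # tokenization
    kept = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [LEMMAS.get(t, t) for t in kept]             # lemmatization (toy)

def bigrams(tokens: list[str]) -> list[str]:
    # Bigram terms, as used in the study's LDA model.
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

chunk = "The speaker was talking about new ideas in science"
toks = preprocess(chunk)
print(toks)           # ['speaker', 'talk', 'about', 'new', 'idea', 'science']
print(bigrams(toks))
```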

Mean-field power analysis at sensor level.

a, Mean-field power from 0.25 s to 0.5 s time-locked to the onset of speech chunks was averaged over left auditory sensors (inset) and compared between all combinations of topic probability (high, low) and attention (attended, unattended) via paired t-tests as follows: attended, high vs. attended, low: t43 = 2.21, p = 0.03; attended, high vs. unattended, high: t43 = 3.21, p = 0.002; attended, high vs. unattended, low: t43 = 4.35, p < 0.0001; attended, low vs. unattended, high: t43 = 1.51, p = 0.14; attended, low vs. unattended, low: t43 = 2.91, p = 0.005; unattended, high vs. unattended, low: t43 = 1.44, p = 0.16; attended vs. unattended talk, pooling across high and low topic probability: t43 = 4.17, p = 0.0001; high vs. low, pooling across attended and unattended: t43 = 2.62, p = 0.01. b-d, Temporally unfolded mean-field power (root mean square) averaged over the same left auditory sensors from −0.05 s to 1 s time-locked to speech chunk onset is displayed for all individual conditions (b) and for the combined effects of topic probability (c) and attention (d).
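The RMS mean-field power measure in a and b-d can be sketched as follows; the array shapes, sampling rate and simulated data are illustrative assumptions, not the study's actual recording parameters:

```python
# Sketch of the sensor-level measure: root-mean-square "mean-field power"
# across a sensor selection, averaged within a post-onset window.
import numpy as np

def mean_field_power(epochs: np.ndarray, sfreq: float,
                     tmin: float, tmax: float, t0: float = 0.05) -> np.ndarray:
    """epochs: (n_trials, n_sensors, n_times), time axis starting at -t0 s.
    Returns one value per trial: RMS across sensors, averaged over tmin..tmax."""
    i0 = int(round((tmin + t0) * sfreq))
    i1 = int(round((tmax + t0) * sfreq))
    rms = np.sqrt((epochs ** 2).mean(axis=1))  # RMS across sensors -> (trials, times)
    return rms[:, i0:i1].mean(axis=1)          # average within the window

rng = np.random.default_rng(0)
epochs = rng.standard_normal((10, 20, 105))    # 10 trials, 20 sensors, ~-0.05..1 s @ 100 Hz
power = mean_field_power(epochs, sfreq=100.0, tmin=0.25, tmax=0.5)
print(power.shape)  # (10,)
```

In the study, the per-trial values obtained this way for each condition would then enter the paired t-tests reported above.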

Receptive field model estimation analysis for topic probability and attention effects. a, Schematic flowchart for the model estimation.

The stimulus feature-neural response mapping was modelled bidirectionally using encoding and decoding models (also known as forward and backward models). Encoding and decoding model analyses were performed separately for the high and low topic probability conditions as well as for the attended and unattended talks. Figure adapted from Crosse et al. (2016). b, Decoding model performance. Model prediction accuracy for reconstructing the speech envelope in each condition was quantified as the correlation coefficient (r-value) between the original and reconstructed speech envelopes, which was then compared between the high and low topic probability conditions separately for the attended and unattended talks. Prediction accuracy was significantly higher for speech with high than with low topic probability in both the attended (between 1st and 2nd columns: attended, high vs. attended, low; t43 = 11.43, p = 1.27e-14) and the unattended talk (between 3rd and 4th columns: unattended, high vs. unattended, low; t43 = 29.98, p = 1.88e-30), with a greater difference for the unattended than the attended talk. Interestingly, prediction accuracy was higher for the high topic probability condition in the unattended talk than for the low topic probability condition in the attended talk (between 2nd and 3rd columns: unattended, high vs. attended, low; t43 = 9.62, p = 2.74e-12). However, no significant difference was observed between the high topic probability conditions (between 1st and 3rd columns: attended, high vs. unattended, high; t43 = 0.80, p = 0.43). All statistical comparisons were performed via two-tailed paired t-tests. Dots and lines represent individual results.
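The decoding-accuracy comparison can be sketched as follows: a Pearson r between the original and reconstructed envelopes per condition, then a paired t statistic across participants for high vs. low topic probability. All data here are simulated, and the reconstruction itself (the backward model of Crosse et al., 2016) is not implemented:

```python
# Sketch of the decoding model performance metric and its comparison.
# Envelopes and per-participant scores are simulated for illustration.
import numpy as np

def reconstruction_accuracy(orig: np.ndarray, recon: np.ndarray) -> float:
    """Pearson r between original and reconstructed speech envelopes."""
    return float(np.corrcoef(orig, recon)[0, 1])

def paired_t(a: np.ndarray, b: np.ndarray) -> float:
    """Paired t statistic (two-tailed p would come from the t distribution)."""
    d = a - b
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(d.size)))

rng = np.random.default_rng(1)
env = rng.standard_normal(500)                     # "original" envelope
r_high = reconstruction_accuracy(env, env + 0.5 * rng.standard_normal(500))
r_low = reconstruction_accuracy(env, env + 2.0 * rng.standard_normal(500))
print(r_high > r_low)                              # better reconstruction -> higher r

high = 0.10 + 0.02 * rng.standard_normal(44)       # 44 participants, as in the study
low = 0.06 + 0.02 * rng.standard_normal(44)
print(paired_t(high, low) > 0)                     # high > low in this simulation
```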

Encoding model weights mapped onto source space.

Encoding model coefficients were mapped onto source space via the dSPM method and statistically compared between the high and low topic probability conditions using a cluster-level spatio-temporal permutation test (p < 0.05; two-tailed; 1024 permutations). Summary clusters (averaged across all significant temporal clusters) are shown. a, attended, high vs. attended, low. b, unattended, high vs. unattended, low. c, unattended, high vs. attended, low. T-values in significant clusters are scaled according to the duration spanned by the cluster (for more details, see Statistical test in Materials and Methods).
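The logic of the cluster-level permutation test can be sketched in one dimension: compute paired t-values, threshold them, sum |t| within contiguous suprathreshold clusters, and compare each observed cluster mass against a null distribution of maximal cluster masses from sign-flip permutations. The study used MNE-Python's spatio-temporal variant over source space; this toy version over a single axis, with simulated data, only illustrates the principle:

```python
# Toy 1-D sign-flip cluster permutation test on paired differences.
# Thresholds, data and cluster definition are illustrative assumptions.
import numpy as np

def tvals(diff: np.ndarray) -> np.ndarray:        # diff: (n_subjects, n_points)
    return diff.mean(0) / (diff.std(0, ddof=1) / np.sqrt(diff.shape[0]))

def clusters(t: np.ndarray, thr: float):
    """Contiguous runs where |t| > thr; returns a list of (mass, slice)."""
    out, start = [], None
    for i, above in enumerate(np.abs(t) > thr):
        if above and start is None:
            start = i
        elif not above and start is not None:
            out.append((np.abs(t[start:i]).sum(), slice(start, i)))
            start = None
    if start is not None:
        out.append((np.abs(t[start:]).sum(), slice(start, len(t))))
    return out

def cluster_perm_test(diff: np.ndarray, thr: float = 2.0,
                      n_perm: int = 1024, seed: int = 0):
    rng = np.random.default_rng(seed)
    obs = clusters(tvals(diff), thr)
    null = np.zeros(n_perm)
    for k in range(n_perm):                       # sign-flip permutations
        flips = rng.choice([-1.0, 1.0], size=(diff.shape[0], 1))
        null[k] = max((m for m, _ in clusters(tvals(diff * flips), thr)),
                      default=0.0)
    return [(sl, float((null >= m).mean())) for m, sl in obs]  # cluster p-values

rng = np.random.default_rng(3)
diff = rng.standard_normal((20, 60))              # 20 subjects, 60 points
diff[:, 20:30] += 1.0                             # injected effect
results = cluster_perm_test(diff)
for sl, p in results:
    print(sl, round(p, 3))
```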

Causal effects of attended and unattended speech on speech comprehension. a, Salient unattended speech negatively mediates attended speech comprehension.

Mediation analysis was performed to test the hypothesis that attention to semantically salient unattended speech negatively mediates (i.e., suppresses) attended speech comprehension. In the mediation model, encoding model coefficients of attended speech chunks with high topic probability, encoding model coefficients of unattended speech chunks with high topic probability, and speech comprehension accuracy for attended speech were used as the predictor (X, independent), mediator (M) and target (Y, dependent) variables, respectively. The encoding model coefficients from 0 to 0.5 s with respect to speech chunk onset were averaged within each region of the PALS-B12 Brodmann atlas in MNE-Python for each individual. With 5000 bootstrap iterations, significant negative indirect effects were identified for left BA41 (path ab: β = -2.03, p = 0.03, 95% CI = -5.53, -0.29), left BA4 (path ab: β = -3.82, p = 0.02, 95% CI = -11.12, -0.71), left BA6 (path ab: β = -3.51, p = 0.009, 95% CI = -10.84, -0.93) and right BA9 (path ab: β = -6.43, p = 0.03, 95% CI = -13.44, -0.95). b, Increased sensitivity to attended speech in these regions enhances speech comprehension. A sensitivity index, analogous to d-prime, defined as the difference between the Z-transformed model coefficients of attended speech with high topic probability and of unattended speech with high topic probability (Z (attended high) – Z (unattended high)), was computed for each of the 4 regions and averaged. The sensitivity index was significantly correlated with speech comprehension accuracy across participants (Spearman rank correlation: r = 0.47, p = 0.001), supporting the hypothesis that participants with increased sensitivity to semantically salient attended speech in these regions show better speech comprehension.
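The bootstrapped indirect (a × b) effect for a single region can be sketched as follows: regress M on X (path a), regress Y on X and M (path b), and bootstrap the a × b product to obtain a percentile confidence interval. Variable names and simulated data are illustrative, with X, M and Y standing for the predictor, mediator and target variables described above:

```python
# Sketch of a bootstrapped indirect effect (path a*b) for one region.
# Simulated data are constructed so that M suppresses Y, mirroring the
# reported negative indirect effects; values are not the study's results.
import numpy as np

def slope(y, predictors):
    """OLS coefficients of y on the given predictors (intercept dropped)."""
    X1 = np.column_stack([np.ones(len(y))] + list(predictors))
    return np.linalg.lstsq(X1, y, rcond=None)[0][1:]

def indirect_effect(x, m, y):
    a = slope(m, [x])[0]        # path a: X -> M
    b = slope(y, [x, m])[1]     # path b: M -> Y, controlling for X
    return a * b

def bootstrap_ci(x, m, y, n_boot=5000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    ab = np.array([indirect_effect(x[idx], m[idx], y[idx])
                   for idx in (rng.integers(0, n, n) for _ in range(n_boot))])
    return np.percentile(ab, [2.5, 97.5])         # percentile 95% CI

rng = np.random.default_rng(7)
n = 44                                            # participants, as in the study
x = rng.standard_normal(n)
m = 0.8 * x + 0.3 * rng.standard_normal(n)        # X increases M
y = 0.5 * x - 0.9 * m + 0.3 * rng.standard_normal(n)  # M suppresses Y
lo, hi = bootstrap_ci(x, m, y)
print(lo < 0 and hi < 0)                          # CI excludes zero -> suppression
```

A CI entirely below zero corresponds to the significant negative indirect effects reported for left BA41, BA4, BA6 and right BA9.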