
The human auditory brainstem response to running speech reveals a subcortical mechanism for selective attention

  1. Antonio Elia Forte
  2. Octave Etard
  3. Tobias Reichenbach (corresponding author)
  1. Imperial College London, United Kingdom
Short Report
Cite as: eLife 2017;6:e27203 doi: 10.7554/eLife.27203

Abstract

Humans excel at selectively listening to a target speaker in background noise such as competing voices. While the encoding of speech in the auditory cortex is modulated by selective attention, it remains debated whether such modulation occurs already in subcortical auditory structures. Investigating the contribution of the human brainstem to attention has, in particular, been hindered by the tiny amplitude of the brainstem response. Its measurement normally requires a large number of repetitions of the same short sound stimuli, which may lead to a loss of attention and to neural adaptation. Here we develop a mathematical method to measure the auditory brainstem response to running speech, an acoustic stimulus that does not repeat and that has a high ecological validity. We employ this method to assess the brainstem's activity when a subject listens to one of two competing speakers, and show that the brainstem response is consistently modulated by attention.

https://doi.org/10.7554/eLife.27203.001

Introduction

It is well known that selective attention to one of several competing acoustic signals affects the encoding of sound in the auditory cortex (Shinn-Cunningham, 2008; Hackley et al., 1990; Choi et al., 2013; Fritz et al., 2007b; Hillyard et al., 1973; Womelsdorf and Fries, 2007; Fritz et al., 2007a; Näätänen et al., 2001). Because extensive auditory centrifugal pathways carry information from central to more peripheral levels of the auditory system (Winer, 2006; Pickels, 1988; Song et al., 2008; Bajo et al., 2010), neural activity in the subcortical structures may contribute to attention as well. Previous attempts to determine an attentional modulation from recording the auditory brainstem response through scalp electrodes have, however, yielded highly inconclusive results.

In particular, one investigation found that selective attention alters the brainstem's response to the fundamental frequency of a speech signal (Galbraith et al., 1998), while another study concluded that this response is modulated in an unsystematic but subject-specific manner (Lehmann and Schönwiesner, 2014) and a third recent experiment did not find a significant attentional effect (Varghese et al., 2015). Results on the effects of attention on the auditory-brainstem response to short clicks or pure tones are similarly inconclusive (Brix, 1984; Gregory et al., 1989; Hoormann et al., 2000; Galbraith et al., 2003). These inconsistencies may result from a main experimental limitation in these studies: because the brainstem response is tiny, its measurement requires hundred- to thousandfold repetition of the same sound. The large number of repetitions may lead to difficulties for subjects in sustaining selective attention, to adaptation in the nervous system, and to a reduction in efferent feedback (Lasky, 1997; Kumar Neupane et al., 2014).

To overcome this limitation, we develop here a method to measure the auditory brainstem's response to natural running speech that does not repeat. We then use this method to assess the modulation of the auditory brainstem response to one of two competing speakers by selective attention.

Results

Assessing the brainstem's response to continuous non-repetitive speech does not allow averaging over many repeated presentations of the same sound. Instead, we sought to quantify the brainstem's response to the fundamental frequency of speech. Neuronal activity in the brainstem, and in particular in the inferior colliculus, can indeed phase lock to the periodicity of voiced speech (Skoe and Kraus, 2010). The fundamental frequency of running speech varies over time, however, which complicates a direct read-out of the evoked brainstem response.

To overcome this difficulty, we employed empirical mode decomposition (EMD) of the speech stimuli to identify an empirical mode that, at each time instant, oscillates at the fundamental frequency of the speech signal (Huang and Pan, 2006) (Materials and methods). This mode is a nonlinear and nonstationary oscillation with a temporally-varying amplitude and frequency that we refer to as the 'fundamental waveform' of the speech stimulus (Figure 1a).

Figure 1. The brainstem response to running speech.

(a) Speech (black) contains voiced parts with irregular oscillations at a time-varying fundamental frequency and higher harmonics. We extract a fundamental waveform (red) that oscillates in a nonlinear and nonstationary manner at the fundamental frequency. (b) The autocorrelation of the fundamental waveform (red) peaks when the delay vanishes and oscillates at the average fundamental frequency. The cross-correlation of the fundamental waveform with its Hilbert transform (blue) can be seen as an imaginary part of the autocorrelation. The amplitude of the resulting complex cross-correlation (black) shows a life-time of a few ms. (c) The correlation of the speech-evoked brainstem response, recorded from one subject, to the fundamental waveform of the speech signal (red) as well as to its Hilbert transform (blue) can serve as real and imaginary parts of a complex correlation function. Its amplitude (black) peaks at a latency of 9 ms. The latency of the correlation is not altered by the processing of the speech signal or of the neural recording, and contains neither a stimulus artifact nor the cochlear microphonic (Figure 1—figure supplement 1).

https://doi.org/10.7554/eLife.27203.002

We then recorded the brainstem response to running non-repetitive speech stimuli of several minutes in duration from human volunteers through scalp electrodes. We cross-correlated the obtained recording with the fundamental waveform of the speech signal (Figure 1c). Because the brainstem response may occur at a phase that is different from that of the fundamental waveform, we also correlated the neural signal to the Hilbert transform of the fundamental waveform, which has a phase shift of 90°. The two correlations can be viewed as the real and imaginary part of a complex correlation function that can trace the brainstem response at any phase shift. The amplitude of the complex correlation then indicates the strength of the brainstem response.

Our statistical analysis showed that the peak amplitude of the complex correlation was significantly different from the noise in fourteen out of sixteen subjects (p<0.05, Materials and methods). The peak occurred at a mean latency of 9.3 ± 0.7 ms, verifying that the measured neural response resulted from the brainstem and not from the cerebral cortex (Hashimoto et al., 1981; Picton et al., 1981). Moreover, the latency agreed with that found previously regarding the brainstem's response to short repeated speech stimuli (Skoe and Kraus, 2010). The average value of the correlation at the peak was 0.015 ± 0.003. We checked that the response did not contain a stimulus artifact or a contribution from the cochlear microphonic, and that the latency of the response was not affected by the processing of the speech signal or of the neural response (Materials and methods; Figure 1—figure supplement 1). We also verified that the response did not contain further peaks at higher latencies, showing that the neural signal did not contain a measurable contribution from the cortex.

We further considered a highly simplistic model of the auditory brainstem response in which a burst of neural spikes occurred at each cycle of the fundamental waveform, at a fixed phase φ, and was then shifted in time by a certain delay τ (Figure 2a,b; Materials and methods). Adding noise that is realistic for neural recordings from scalp electrodes, and computing the complex correlation of the obtained signal with the fundamental waveform, showed a peak in the complex correlation at the time delay τ (Figure 2c). The phase φ of the fundamental waveform at which the neural bursts occurred was obtained from the phase of the complex correlation at the time delay τ. This demonstrated that the brainstem's response to continuous speech could be reliably extracted through the developed method. The response can be characterized through the latency and amplitude of the correlation's peak.

Figure 2. Simplistic model of the auditory brainstem response to continuous speech.

(a) The fundamental waveform (red) as well as its Hilbert transform (blue) oscillate with a varying amplitude and frequency. (b) We model a simplistic brainstem response in which bursts of neural spikes occur at each cycle of the fundamental waveform, at a phase of ¼ π rad (black dots). Furthermore, all neural bursts are shifted by a temporal delay of 8 ms. (c) When adding realistic noise as emerges from scalp recordings, and then computing the complex correlation with the fundamental waveform as performed for the actual brainstem recording, we find a peak at the modelled delay of 8 ms. The modelled phase of ¼ π rad is obtained as the inverse phase of the complex correlation at that latency.

https://doi.org/10.7554/eLife.27203.004

Armed with the ability to quantify the brainstem's response to running non-repetitive speech, we sought to investigate if this neural activity is affected by selective attention. Employing a well-established paradigm of attention to one of two speakers (Ding and Simon, 2012), we presented volunteers diotically with two concurrent speech streams of equal intensity, one by a male and another by a female voice. For parts of the speech presentation subjects attended the male voice and ignored the female voice, and vice versa for the remaining parts.

We quantified the brainstem's response to both the male and the female voice by extracting the fundamental waveforms of both speech signals and correlating the neural recording separately to both. We found that the latency of the response was unaffected by attention: the response to the unattended speaker occurred 0.8 ± 0.5 ms later than that to the attended speaker, which was not statistically significant (p=0.2; average over the responses to the male and the female voice as well as all subjects).

In contrast, all subjects showed a larger response of the auditory brainstem, at the peak latency, to the male voice when attending rather than ignoring it (Figure 3a). The difference in the responses was statistically significant in nine of the fourteen subjects (p<0.05). The brainstem's response to the attended female speaker similarly exceeded that to the unattended female voice in all but one subject, with eight subjects showing a statistically-significant difference (p<0.05; Figure 3b). The ratio of the brainstem's response to attended and to ignored speech, averaged over all subjects, was 1.5 ± 0.1 and 1.6 ± 0.2 for the male and for the female speaker, respectively. Both ratios were significantly different from unity (p<0.001, male voice; p<0.01, female voice). The male and the female voice elicited a comparable attentional modulation: the difference between the corresponding ratios was not significant (p=0.7). The magnitude of the brainstem's response was hence significantly enhanced through attention, and consistently so across subjects and speakers.

Figure 3. Modulation of the brainstem response to speech by selective attention.

(a) The brainstem's response to the male speaker is larger for each subject when attending the speaker (dark blue) than when ignoring it (light blue). The average ratio of the brainstem responses to the attended and to the ignored male speaker is significantly larger than 1 (black, mean and standard error of the mean). (b) With the exception of subject 13, the neural response to the female voice is also larger when subjects attend to it (dark red) instead of ignoring it (light red). The average ratio of the brainstem responses to the attended and to the ignored female speaker is significantly larger than one as well (black, mean and standard error of the mean).

https://doi.org/10.7554/eLife.27203.005

The auditory brainstem response to short speech stimuli has a low-pass nature: the amplitude of the response declines with increasing frequency (Musacchia et al., 2007; Skoe and Kraus, 2010). To determine whether this relation also held for the brainstem response to continuous speech that we measured here, and how it may be affected by attention, we computed the correlation between the amplitude of the brainstem response to short speech segments and the fundamental frequency of the segments (Materials and methods). We found a small but statistically significant negative correlation between the amplitude of the brainstem response and the fundamental frequency, both for the neural response to a single speaker as well as for the response to the attended and ignored speech signal of two competing speakers, evidencing the low-pass nature of the brainstem response (Figure 3—figure supplement 1). The correlations between the amplitude and the frequency did not differ significantly between the different conditions.

The brainstem response to short clicks in noise can have a delay that becomes longer with increasing noise level, while the amplitude of the brainstem response declines, presumably reflecting varying contributions of auditory-nerve fibers with different spontaneous rates and from different cochlear locations (Mehraei et al., 2016). However, for the brainstem response to continuous speech that we measured here, we did not find a statistically-significant correlation between amplitude and latency (Materials and methods).

Discussion

Our results show that the human auditory brainstem response to continuous speech is larger when attending than when ignoring a speech signal, and consistently so across different subjects and speakers. In particular, the strength of the phase locking of the neural activity to the pitch structure of speech is larger for an attended than for an unattended speech stream. In contrast, we did not observe a difference in the latency of this activity.

The fundamental waveform of speech that we have obtained from EMD has a temporally varying frequency and amplitude and is therefore not a simple component of Fourier analysis. While it may be obtained from short-time Fourier transform or wavelet analysis, both methods suffer from an inherently limited time-frequency resolution that makes them inferior to the EMD analysis (Huang and Pan, 2006).

Because we have employed a diotic stimulus presentation in which the same acoustical stimulus was presented to each ear, the attentional modulation cannot result from a general modulation of the brainstem's activity to acoustic stimuli between the two hemispheres. Moreover, although the fundamental frequencies of the two competing speakers differ at most time points, their spectra largely overlap. The attentional modulation can therefore not result from a broad-band modulation of the neural activity either. Instead, the attentional effect must result from a modulation of the brainstem's response to the specific pitch structure of a speech stimulus.

The brainstem response to the pitch of continuous speech that we have measured can reflect a response both to the fundamental frequency of speech as well as to higher harmonics. Indeed, previous studies have found that the brainstem responds at the fundamental frequency of a speech stimulus even when that frequency itself is removed from the acoustic signal (Galbraith and Doan, 1995), or when it cancels out due to presentation of stimuli with opposite polarities and averaging of the obtained responses (Aiken and Picton, 2008). The attentional modulation of the brainstem response can thus reflect a modulation of the response to the fundamental frequency itself or to higher harmonics. Moreover, attentional modulation of higher harmonics may depend on frequency as shown recently in recordings of otoacoustic emissions from the inner ear (Maison et al., 2001).

The attentional modulation of the brainstem's response to the pitch of a speaker may result from an enhancement of the neural response to an attended speech signal, from the suppression of the response to an ignored speech stimulus, or from both. Further investigation into this issue may compare brainstem responses to speech when attending to the acoustical signal and when attending to a visual stimulus (Woods et al., 1992; Karns and Knight, 2009; Saupe et al., 2009).

The response at the fundamental frequency of speech can result from multiple sites in the brainstem (Chandrasekaran and Kraus, 2010). However, we observed a single peak with a width of a few ms in the correlation of the neural signal to the fundamental waveform of speech. The brainstem response to running speech that we have measured here can therefore only reflect neural sources whose latencies vary by a few ms or less from the peak latency. The neural delay of about 9 ms as well as the similarity of the speech-evoked brainstem response to the frequency-following response suggest that the main neural source may be in the inferior colliculus (Sohmer et al., 1977). The attentional effect that we have observed may then result from the multiple feedback loops between the inferior colliculus, the medial geniculate body and the auditory cortex (Huffman and Henson, 1990).

Our study provides a mathematical methodology to analyse the brainstem response to complex, real-world stimuli such as speech. Since our method does not require artificial and repeated stimuli, it fosters sustained attention and avoids potential neural adaptation. This method can therefore pave the way to further explore how the brainstem contributes to the processing of complex real-world acoustic environments. It may also be relevant for better understanding and diagnosing the recently discovered cochlear neuropathy or 'hidden hearing loss' (Kujawa and Liberman, 2009). Because the latter alters the brainstem's activity (Schaette and McAlpine, 2011; Mehraei et al., 2016), assessing the auditory brainstem response to speech as well as its modulation by attention may further clarify the origin, prevalence and consequences of such poorly understood supra-threshold hearing loss.

Materials and methods

Participants

Sixteen healthy adult volunteers aged 18 to 32, eight of whom were female, participated in the study. All subjects were native English speakers and had no history of hearing or neurological impairments. All participants had pure-tone hearing thresholds better than 20 dB hearing level in both ears at octave frequencies between 250 Hz and 8 kHz. Each subject provided written informed consent. All experimental procedures were approved by the Imperial College Research Ethics Committee.

Auditory brainstem recordings to running speech

Samples of continuous speech from a male and a female speaker were obtained from publicly available audiobooks (https://librivox.org). All samples had a duration of at least two minutes and ten seconds; some were slightly longer to end upon completion of a sentence. To construct speech samples with two competing speakers, samples from the male and from the female speaker were normalized to the same root-mean-square amplitude and then superimposed.

Participants were placed in a comfortable chair in an acoustically and electrically insulated room (IAC Acoustics, UK). A personal computer outside the room controlled audio presentation and data acquisition. Speech stimuli were presented at a sampling frequency of 44.1 kHz through a high-performance sound card (Xonar Essence STX, Asus, USA). Stimuli were delivered diotically through insert earphones (ER-3C, Etymotic, USA) at a level of 76 dB(A) SPL (A-weighted frequency response). Sound intensity was calibrated with an ear simulator (Type 4157, Brüel and Kjaer, Denmark). All subjects reported that the stimulus level was comfortable.

The response from the auditory brainstem was measured through five passive Ag/AgCl electrodes (Multitrode, BrainProducts, Germany). Two electrodes were positioned at the cranial vertex (Cz), two further electrodes were placed on the left and right mastoid processes, and the remaining electrode was positioned on the forehead to measure the ground. The impedance between each electrode and the skin was reduced to below 5 kΩ using abrasive electrolyte-gel (Abralyt HiCl, Easycap, Germany). The electrode on the left mastoid, at the cranial vertex and the ground electrode were connected to a bipolar amplifier with low-level noise and a gain of 50 (EP-PreAmp, BrainProducts, Germany). The remaining two electrodes were connected to a second identical bipolar amplifier. The output from both bipolar amplifiers was fed into an integrated amplifier (actiCHamp, BrainProducts, Germany) where it was low-pass filtered through a hardware anti-aliasing filter with a corner frequency of 4.9 kHz and sampled at 25 kHz. The audio signals were measured by the integrated amplifier as well through an acoustic adapter (Acoustical Stimulator Adapter and StimTrak, BrainProducts, Germany). The electrophysiological data were acquired through PyCorder (BrainProducts, Germany). The simultaneous measurement of the audio signal and the brainstem response from the integrated amplifier was employed to temporally align both signals to a precision of less than 40 μs, the inverse of the sampling rate (25 kHz).

Experimental design

In the first part of the experiment, each volunteer listened to four speech samples of the female speaker only. Comprehension questions were asked at the end of each part in order to verify the subject's attention to the story.

The second part of the experiment employed eight samples of speech that contained both a male and a female voice. During the presentation of the first four samples, subjects were asked to attend either the male or the female speaker. Volunteers were then presented with the next four speech samples and asked to attend to the speaker that they had ignored earlier. Whether the subject was asked to attend first to the male or to the female voice was determined randomly for every subject. Comprehension questions were asked after each sample.

Computation of the fundamental waveform of speech

The fundamental waveform of each speech sample with a single speaker was computed through a custom-written Matlab program (code available on Github; Forte, 2017; a copy is archived at https://github.com/elifesciences-publications/fundamental_waveforms_extraction). The fundamental waveform of a speech sample with two speakers followed from the two corresponding samples with a single speaker only.

First, each speech signal was downsampled to 8820 Hz, low-pass filtered at 1,500 Hz (linear-phase FIR filter, transition band 1500–1650 Hz, stopband attenuation −80 dB, passband ripple 1 dB, order 296) and time-shifted to compensate for the filter delay. Silent parts between words were identified by computing the envelope of the speech signal. Each part where the envelope was less than 10% of the maximal value found in the speech was considered silent, and the speech signal there was set to zero.
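The silence-gating step can be sketched as follows. This is a minimal illustration, not the authors' Matlab code: the text does not state how the envelope was computed, so the sketch assumes the magnitude of the analytic signal, smoothed by a moving average whose 20 ms width (`smooth_ms`) is a hypothetical choice. Only the 10% threshold comes from the text. The function names are illustrative.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal of a real sequence, built in the frequency domain."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)

def gate_silence(speech, fs, threshold=0.10, smooth_ms=20.0):
    """Zero the samples whose envelope is below `threshold` times its maximum.

    The envelope definition and the smoothing width are assumptions.
    Returns the gated signal and a boolean mask of the silent samples.
    """
    speech = np.asarray(speech, dtype=float)
    envelope = np.abs(analytic_signal(speech))
    width = max(1, int(round(smooth_ms * 1e-3 * fs)))
    envelope = np.convolve(envelope, np.ones(width) / width, mode="same")
    silent = envelope < threshold * envelope.max()
    return np.where(silent, 0.0, speech), silent
```

The returned mask can be reused later to mark the pauses that delimit the voiced segments.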

Second, the instantaneous fundamental frequency of the voiced parts of the speech signal was detected through the autocorrelation method, employing rectangular windows of 50 ms duration with a successive overlap of 49 ms. Speech segments that yielded a fundamental frequency outside the range of 60 Hz to 400 Hz, or in which the fundamental frequency varied by more than 10 Hz between two successive windows were considered voiceless. The speech segments that corresponded to voiced speech, as well as their fundamental frequency, were thus obtained. The fundamental frequency of each segment was interpolated through a cubic spline, and varied between 100 and 300 Hz in each segment. Note that this method yields the fundamental frequency but not by itself the fundamental wavemode.
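A minimal sketch of such an autocorrelation-based pitch tracker is given below. It uses the window length, overlap, frequency range and 10 Hz continuity criterion stated above, but it simplifies in two ways that are assumptions of this sketch: the autocorrelation peak is searched only inside the 60 to 400 Hz range (rather than detected and then rejected), and the 1 ms hop is rounded to a whole number of samples. The cubic-spline interpolation is not shown, and the function names are illustrative.

```python
import numpy as np

def f0_autocorrelation(frame, fs, fmin=60.0, fmax=400.0):
    """Fundamental frequency of one frame from the peak of its autocorrelation."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    n = len(frame)
    acf = np.correlate(frame, frame, mode="full")[n - 1:]  # lags 0 .. n-1
    if acf[0] <= 0.0:
        return np.nan  # silent frame, no pitch
    lag_min = int(np.floor(fs / fmax))
    lag_max = min(int(np.ceil(fs / fmin)), n - 1)
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return fs / lag

def track_f0(speech, fs, window_s=0.050, hop_s=0.001, max_jump_hz=10.0):
    """Frame-wise f0 with rectangular windows and a voicing mask."""
    speech = np.asarray(speech, dtype=float)
    win = int(round(window_s * fs))
    hop = max(1, int(round(hop_s * fs)))
    starts = range(0, len(speech) - win + 1, hop)
    f0 = np.array([f0_autocorrelation(speech[s:s + win], fs) for s in starts])
    voiced = np.isfinite(f0)
    # A jump of more than max_jump_hz between successive windows marks the frame voiceless.
    voiced[1:] &= np.abs(np.diff(f0)) <= max_jump_hz
    return f0, voiced
```

Because the lag is an integer number of samples, the frequency resolution of this sketch is limited to `fs / lag` steps; at 8820 Hz this is a few Hz near 150 Hz.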

Third, the voiced speech segments were analysed through the Hilbert-Huang transform. The latter is an adaptive signal-processing method based on empirical basis functions and can thus be better suited than Fourier analysis for analysing nonlinear and nonstationary signals such as speech (Huang and Pan, 2006). The transform consists of two parts. First, empirical mode decomposition extracts intrinsic mode functions (IMFs) that satisfy two properties: (i) the numbers of extrema and zero crossings are either equal or differ by one; (ii) the mean of the upper and lower envelope vanishes. The signal follows as the linear superposition of the IMFs. Second, the Hilbert spectrum of each IMF is determined, which yields, in particular, the mode's instantaneous frequency. This analysis was performed for each short segment of voiced speech, that is, for each part of voiced speech that was preceded and followed by a pause or voiceless speech.

Fourth, the fundamental frequency of each short speech segment was compared to the instantaneous frequencies of the segment's IMFs at each individual time point. All IMFs with an instantaneous frequency that differed by less than 20% from the segment's fundamental frequency were determined, and the IMF with the largest amplitude was therefrom selected as the fundamental wavemode of that segment and at that time point (Huang and Pan, 2006). If no IMF had an instantaneous frequency within 20% of the fundamental frequency, or if a speech segment was unvoiced, that time point was assigned a fundamental waveform of zero. The fundamental waveforms obtained at the different time points were combined through cosine crossfading functions with a window width of 10 ms to obtain the fundamental waveform of the speech signal. The Hilbert transform of that fundamental waveform was computed as well.
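The per-time-point selection rule can be sketched as below, assuming that the IMFs and their instantaneous frequencies and amplitudes have already been computed by an EMD and Hilbert-spectrum routine that is not shown here. The function name and the list-of-lists layout (one inner list per IMF) are illustrative choices, and the cosine crossfading between selections is omitted.

```python
import math

def select_fundamental(imfs, inst_freq, inst_amp, f0, tolerance=0.20):
    """Pick, at every time point, the largest-amplitude IMF near f0.

    imfs, inst_freq, inst_amp: lists with one sequence per IMF.
    f0: fundamental frequency per time point, NaN where unvoiced.
    Returns the selected waveform and the index of the chosen IMF (-1 if none).
    """
    n_samples = len(f0)
    waveform = [0.0] * n_samples
    chosen = [-1] * n_samples
    for t in range(n_samples):
        if math.isnan(f0[t]):
            continue  # unvoiced: waveform stays zero
        best_amp = -math.inf
        for k in range(len(imfs)):
            close = abs(inst_freq[k][t] - f0[t]) < tolerance * f0[t]
            if close and inst_amp[k][t] > best_amp:
                best_amp = inst_amp[k][t]
                chosen[t] = k
        if chosen[t] >= 0:
            waveform[t] = imfs[chosen[t]][t]
    return waveform, chosen
```

Time points at which no IMF lies within the tolerance keep a value of zero, matching the rule stated above.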

To control for latency changes in the acoustic signal induced by the subsequent processing steps, and in particular by the involved frequency filtering, the cross-correlation between the original speech signal and the fundamental waveform as well as with its Hilbert transform was computed (Figure 1—figure supplement 1a). The cross-correlations show that the fundamental waveform has no latency change and no phase difference with respect to the original speech stimulus.

Analysis of the auditory-brainstem response

The brainstem responses from the two measurement channels were averaged. A frequency-domain regression technique (CleanLine, EEGLAB) was used to attenuate noise from the power line in the brainstem recording. Moreover, because a voltage amplitude above 20 mV cannot result from the brainstem but represents artefacts such as spurious muscle activity, the signal was set to zero during episodes of such high voltage. The electrophysiological recording was then band-pass filtered between 100 and 300 Hz since the fundamental frequency of the speech was in that range (high-pass filter: linear-phase FIR filter, transition band from 90 to 100 Hz, stopband attenuation −80 dB, passband ripple 1 dB, order 6862; low-pass filter: linear-phase FIR filter, transition band 300–360 Hz, stopband attenuation −80 dB, passband ripple 1 dB, order 1054). In particular, the high-pass filter eliminated neural signals from the cerebral cortex that occur predominantly below 100 Hz. To avoid transient activity at the beginning of each speech sample, the first ten seconds of each brainstem recording in response to a speech sample were discarded. The following two minutes of data were divided into 40 epochs of a duration of 3 s each, and any remaining data were discarded.
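The artefact rejection and epoching can be sketched as follows. This is a minimal illustration with hypothetical function names. It assumes that artefacts are zeroed sample by sample, which the text does not specify, and it leaves out the power-line removal and the FIR filtering.

```python
def zero_artifacts(recording, threshold):
    """Set every sample whose magnitude exceeds `threshold` to zero."""
    return [0.0 if abs(v) > threshold else v for v in recording]

def split_into_epochs(recording, fs, skip_s=10.0, epoch_s=3.0, n_epochs=40):
    """Discard the first `skip_s` seconds, then cut `n_epochs` consecutive epochs."""
    start = round(skip_s * fs)
    length = round(epoch_s * fs)
    stop = start + n_epochs * length
    if len(recording) < stop:
        raise ValueError("recording is too short for the requested epochs")
    # Samples after `stop` are dropped.
    return [recording[start + i * length:start + (i + 1) * length]
            for i in range(n_epochs)]
```

With the 25 kHz sampling rate of the recordings, each epoch holds 75,000 samples, and 130 s of data are needed per speech sample.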

The processing of the neural signal did not induce a latency. This was confirmed by computing the cross-correlation between the processed neural response and the original signal, demonstrating a maximum correlation at zero temporal delay (Figure 1—figure supplement 1b).

As set out above, the first part of the experiment measured the brainstem response to running speech without background noise. For each subject and each epoch, the cross-correlation of the brainstem response with the corresponding segment of the fundamental waveform as well as with its Hilbert transform were computed. A delay of 1 ms of the acoustic signal produced by the earphones was taken into account. The two cross-correlation functions were interpreted as the real and the imaginary part of a complex correlation function. For each individual subject, the average of the complex cross-correlation over all epochs was then computed, and the latency at which the amplitude peaked was determined.
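A minimal sketch of the complex correlation for a single epoch is shown below. Several details are assumptions of this sketch rather than statements from the text: the correlation is normalised by the norms of the two signals, a positive latency means that the response lags the stimulus, and the imaginary part is taken as the plain correlation with the Hilbert transform, so the sign of the resulting phase depends on this convention. The function names are illustrative.

```python
import numpy as np

def hilbert_transform(x):
    """Hilbert transform of a real sequence (imaginary part of its analytic signal)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights).imag

def complex_correlation(response, waveform, fs, max_lag_s=0.03):
    """Complex correlation of a response with a waveform and its Hilbert transform.

    Both inputs must have the same length. Returns latencies (s) and complex values.
    """
    response = np.asarray(response, dtype=float)
    waveform = np.asarray(waveform, dtype=float)
    quadrature = hilbert_transform(waveform)
    n = len(waveform)
    max_lag = int(round(max_lag_s * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    norm = np.linalg.norm(response) * np.linalg.norm(waveform)
    corr = np.empty(len(lags), dtype=complex)
    for i, lag in enumerate(lags):
        if lag >= 0:
            r, f, h = response[lag:], waveform[:n - lag], quadrature[:n - lag]
        else:
            r, f, h = response[:n + lag], waveform[-lag:], quadrature[-lag:]
        corr[i] = (np.dot(r, f) + 1j * np.dot(r, h)) / norm
    return lags / fs, corr

def peak_latency(latencies, corr):
    """Latency at which the amplitude of the complex correlation is largest."""
    return latencies[np.argmax(np.abs(corr))]
```

Averaging over epochs then amounts to taking the mean of the complex arrays before calling `peak_latency`, and the 1 ms earphone delay can be accounted for by subtracting it from the returned latencies.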

The obtained latencies of about 9 ms affirmed that the signal resulted from the auditory brainstem and not from the cerebral cortex (Hashimoto et al., 1981). The latency also evidenced that the signal resulted neither from stimulus artifacts nor from the cochlear microphonic, which would occur at or near zero delay (Skoe and Kraus, 2010). As an additional control, the brainstem response was recorded when the earphones were near the ear, but not inserted into the ear canal, so the subject could not hear the speech signals. The recording then did not yield a measurable brainstem response (Figure 1—figure supplement 1c). Two presentations of the same speech stimulus, but with opposite polarities, were employed as well, and the neural response to both presentations was averaged before computing the correlation to the fundamental waveform. The correlation was identical to that obtained by a single stimulus presentation, demonstrating the absence of a stimulus artifact and of the cochlear microphonic (Figure 1—figure supplement 1d).

To further verify our analysis, we considered a highly simplistic model of the brainstem response. In particular, we constructed a simplified hypothetical brainstem response to speech of a duration of 10 min in which neural spikes occur in bursts (Figure 2a,b). We described each burst by a Gaussian distribution with a width of 1 ms. Each cycle of the fundamental waveform triggered a burst, which was centered at a fixed phase φ of the waveform and shifted by a certain time delay τ. We then added an actual neural recording from a trial without sound presentation, representing realistic noise, to the simulated brainstem response, at a signal-to-noise ratio of −20 dB. Processing this simulated brainstem response through the methods described above yielded a complex correlation that peaked at the imposed time delay τ (Figure 2c). The phase of the complex correlation was the inverse of the phase φ of the fundamental waveform at which the modelled brainstem response occurred; the inverse of the phase appeared due to the complex conjugate in the definition of the cross-correlation. This confirmed the validity of our methodology and showed that our processing of the neural data did not alter the temporal delay.
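The logic of this control can be illustrated with a small simulation. The sketch below departs from the procedure above in several labelled ways: a synthetic frequency-modulated tone stands in for the fundamental waveform of speech, white Gaussian noise replaces the actual neural recording, the duration is 60 s rather than 10 min, the 1 ms burst width is taken as the Gaussian standard deviation, and no filtering is applied. It also correlates with the analytic waveform without complex conjugation, so the recovered phase is +φ here rather than its inverse. All names and default values are hypothetical.

```python
import numpy as np

def simulate_and_recover(phi=np.pi / 4, tau=0.008, fs=5000, duration=60.0,
                         snr_db=-20.0, seed=0):
    """Simulate bursts at phase phi delayed by tau, then recover the delay.

    Returns (peak latency in s, phase at the peak, lags in s, complex correlation).
    """
    rng = np.random.default_rng(seed)
    n = int(duration * fs)
    t = np.arange(n) / fs
    # Stand-in fundamental waveform: phase theta with a slowly varying frequency.
    f_inst = 150.0 + 60.0 * np.sin(2 * np.pi * 0.7 * t)
    theta = 2 * np.pi * np.cumsum(f_inst) / fs
    analytic = np.exp(1j * theta)  # cos(theta) + i sin(theta)

    # One burst per cycle, centred where theta equals phi (mod 2 pi), shifted by tau.
    k_first = int(np.ceil((theta[0] - phi) / (2 * np.pi)))
    k_last = int(np.floor((theta[-1] - phi) / (2 * np.pi)))
    targets = phi + 2 * np.pi * np.arange(k_first, k_last + 1)
    centres = np.interp(targets, theta, t) + tau
    idx = np.round(centres * fs).astype(int)
    idx = idx[idx < n]
    impulses = np.zeros(n)
    np.add.at(impulses, idx, 1.0)

    sigma = 0.001  # burst width in s (assumed to be the standard deviation)
    half = int(4 * sigma * fs)
    kernel = np.exp(-0.5 * (np.arange(-half, half + 1) / (sigma * fs)) ** 2)
    clean = np.convolve(impulses, kernel, mode="same")
    clean -= clean.mean()

    # White noise scaled to the requested signal-to-noise ratio.
    noise_std = np.sqrt(clean.var() * 10 ** (-snr_db / 10))
    recording = clean + noise_std * rng.standard_normal(n)

    max_lag = int(0.02 * fs)
    lags = np.arange(max_lag + 1)
    corr = np.array([np.dot(recording[lag:], analytic[:n - lag]) for lag in lags])
    peak = int(np.argmax(np.abs(corr)))
    return lags[peak] / fs, float(np.angle(corr[peak])), lags / fs, corr
```

Calling `simulate_and_recover()` with the defaults imposes a delay of 8 ms and a phase of ¼ π rad, the values used in Figure 2.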

To determine whether the peak in the cross-correlation obtained from a given subject was significant, the values of the complex cross-correlation from the individual epochs, and at the peak latency, were analysed. Because each correlation value is an average of many measurements, it follows from the Central Limit Theorem that the complex correlations from the different epochs exhibit a two-dimensional normal distribution with a mean of zero if the measurements are randomly distributed. A one-sample Hotelling’s T-squared test was therefore used to assess the significance of the complex correlation at the peak latency. Two subjects who did not show a significant correlation (p>0.05) were not included in the further analysis.
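For a two-dimensional sample, the one-sample Hotelling's T-squared test against a zero mean can be written compactly. The sketch below treats the real and imaginary parts of the per-epoch correlation values as the two dimensions. It uses the standard conversion of T² to an F statistic with (2, n − 2) degrees of freedom, whose upper tail probability has the closed form (1 + 2F/(n − 2))^(−(n − 2)/2). The function name is illustrative.

```python
import math

def hotelling_t2_zero_mean(values):
    """One-sample Hotelling's T-squared test that complex values have zero mean.

    Returns (T2, F statistic, p-value). Requires at least three values.
    """
    n = len(values)
    if n < 3:
        raise ValueError("at least three values are required")
    xs = [v.real for v in values]
    ys = [v.imag for v in values]
    mx = sum(xs) / n
    my = sum(ys) / n
    # Unbiased sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Quadratic form mean^T S^-1 mean for the 2x2 case.
    quad = (syy * mx * mx - 2 * sxy * mx * my + sxx * my * my) / det
    t2 = n * quad
    f_stat = (n - 2) / (2 * (n - 1)) * t2
    p_value = (1 + 2 * f_stat / (n - 2)) ** (-(n - 2) / 2)
    return t2, f_stat, p_value
```

As a hand-checkable case, the four values 1+1j, 1−1j, 3+1j and 3−1j have mean 2 and a diagonal covariance of 4/3, giving T² = 12, F = 4 and p = 0.2.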

The population mean and standard error of the mean of the latency were computed from the latencies of the individual subjects.

The brainstem responses to competing speakers were then analysed for each individual subject. For each epoch, the complex cross-correlation between the brainstem response and the fundamental waveform was computed, both for the fundamental waveform of the attended and for that of the unattended speaker. The corresponding complex correlation functions were averaged across epochs, and the amplitudes as well as latencies of the peaks were determined.

Statistical significance of the difference in latency of the brainstem responses to the attended and the unattended speaker, obtained from the eight samples, was tested by computing the population mean as well as the standard error of the mean for the differences in latencies obtained from individual subjects. A two-tailed Student's t-test was employed to test if the difference was significantly different from zero.

To control for differences in the voice of the male and the female speaker, differences in amplitude of the brainstem response to the attended and ignored male speaker were determined separately from differences in the amplitude of the brainstem response to the attended and ignored female speaker. The amplitudes of the complex cross-correlations, at the peak latencies, were computed for all epochs. A two-sample Student’s t-test was then employed to test for a significant difference between the amplitude in response to the attended and the ignored speaker.

The amplitude of the brainstem response to speech can vary widely between subjects (3), due to variations in, for example, anatomy and scalp conductivity. The ratios of the amplitudes of the brainstem responses to attended and ignored speech, rather than the differences, were thus computed for each individual. The population mean and standard error of the mean were obtained from these ratios. A one-tailed Student's t-test assessed whether the population average of the ratio was significantly larger than unity. A two-tailed two-sample Student's t-test was employed to assess whether the ratios obtained from the responses to the male and to the female speaker were significantly different.
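The ratio-based comparison can be sketched as follows. The amplitude values are invented for illustration and are not data from the study; they merely mimic subjects whose overall response amplitudes differ severalfold, which is the situation in which ratios are more informative than differences.

```python
import numpy as np
from scipy import stats

# Illustrative per-subject amplitudes in arbitrary units (not data from the study).
ignored = np.array([1.00, 2.00, 0.50, 3.00, 1.50, 0.80, 2.50, 1.20])
attended = np.array([1.10, 2.50, 0.525, 3.90, 1.725, 0.96, 2.70, 1.464])

# Ratios remove the subject-specific overall scale.
ratios = attended / ignored
mean_ratio = ratios.mean()
sem_ratio = stats.sem(ratios)   # standard error of the mean

# One-tailed test of whether the mean ratio exceeds unity.
result = stats.ttest_1samp(ratios, popmean=1.0, alternative="greater")
p_one_tailed = result.pvalue
```

The `alternative` argument of `ttest_1samp` requires a reasonably recent SciPy version; with older versions the one-tailed p-value can be derived from the two-tailed one.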

We assessed the correlation between the amplitude of the brainstem response and the frequency of the fundamental waveform by dividing each speech signal into 160 segments, and by dividing the corresponding neural recording analogously. We then computed the complex correlation of each segment of the neural data with the corresponding segment of the fundamental waveform, and obtained the amplitude at the peak latency. We further computed the average fundamental frequency of each segment. The correlation between amplitude and frequency was tested for statistical significance through a one-tailed Student's t-test. It was found to be significant for the brainstem response to a single speaker as well as for the response to the attended and the ignored speaker in the two-speaker stimuli (p<0.05). However, none of the differences between the obtained correlation coefficients were statistically significant (two-tailed Student's t-test, p>0.05).

We also computed the correlation between the amplitude and the latency of the brainstem response from speech segments of 120 s in duration; such long segments were required to obtain a reliable assessment of the latency. We performed this analysis for the auditory brainstem response to a single speaker as well as for those to the attended and ignored speaker when subjects were presented with two competing speakers. However, none of these correlations were statistically significant (two-tailed Student's t-test, p>0.05).

References

  29. JO Pickels (1988). An Introduction to the Physiology of Hearing. Bingley: Emerald Group Publishing.
  30. TW Picton, DR Stapells, KB Campbell (1981). Auditory evoked potentials from the human cochlea and brainstem. The Journal of Otolaryngology. Supplement 9:1–41.

Decision letter

  1. Barbara G Shinn-Cunningham
    Reviewing Editor; Boston University, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "The human auditory brainstem response to running speech reveals a subcortical mechanism for selective attention" for consideration by eLife. Your article has been favorably evaluated by Andrew King (Senior Editor) and three reviewers, one of whom, Barbara G Shinn-Cunningham (Reviewer #1), is a member of our Board of Reviewing Editors. The following individual involved in review of your submission has agreed to reveal their identity: Steve Aiken (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

The results of this study are very intriguing, showing differences in auditory steady-state responses (ASSRs) that are specific to which of two speech streams a listener is attending. Although the correlations between the stimulus and response are small, the peak of the cross-correlation function was significant in most of the subjects, as was the effect of attention. The writing is generally clear and concise.

All three reviewers thought that the conclusion that brainstem-generated responses are modulated by attentional focus, if properly justified, is novel and noteworthy. While many aspects of this paper are intriguing, there are questions that affect the interpretation. These technical issues and concerns must be addressed adequately for the study to be acceptable for publication in eLife.

Essential revisions:

1) For any response like this, including the more-common FFR (with a steady-state constant-frequency acoustic signal), observations are a mixture of IC responses and other responses (perhaps in thalamus – but also lower responses such as the cochlear microphonic). For the same relative delays and magnitudes of the responses, these different responses will add in different phases, depending on their frequency. Unlike with the FFR, here, the frequency is changing from moment to moment. This will lead to different cancellation / summation at different frequencies that likely result in different peak delays. Only if there is a single truly dominant source in the mixture will the peak delay be at a fixed delay independent of frequency.

Attention effects in FFRs have been suggested to be due to the involvement of the cortex (the work of Emma Holmes presented at ARO 2017). While the delay of 10 ms relative to the stimulus calls into question a dominant role for the cortex in the present study (see above), it is also possible that the attention effect is mediated through olivocochlear inhibition.

The stats show that attention is changing the observed responses. The question is just where these responses are coming from. Given that the observed response is a mixture, further analysis is warranted to tease apart whether the effects are due to a single dominant source with a fixed delay that is modulated by attention, or whether higher-level sources, which are modulated by attention, cause different summation / cancellation effects depending on attentional focus.

2) The latency reported, 10.3ms, is greater than is usually attributed to a brainstem response. The latency of the largest peak of the click-evoked ABR (usually attributed to inferior colliculus) is usually assumed to be ~5ms or somewhat greater for lower-frequency stimuli (Don and Eggermont, 1978).

Often, the delay in an ASSR is longer (presumably in part because it is a mixture of neural sources as noted above). The authors say that their estimate agrees with that reported by Skoe and Kraus (2010) for speech, but the value reported in that paper is actually 7-8ms. Figure 1 of that paper, where that value is mentioned, illustrates it as the amount by which the response precedes the stimulus (!), so that source itself may have some methodological problems.

Because the latency is the primary fact used to conclude that the induced attentional changes are from brainstem, this issue is very important to consider and discuss.

3) The latency is calculated as a cross-correlation between electrode signals and a "fundamental waveform" defined as a "nonlinear oscillation" derived from the speech by empirical mode decomposition (EMD). EMD is attractive but not very well defined or theoretically grounded; as far as we can tell, it just extracts an approximation of the fundamental Fourier component. The Hilbert transform is calculated, and both the waveform and its Hilbert transform are cross-correlated with the EEG to obtain a "complex correlation function", the amplitude of which peaks at a latency of 10.3ms. The rationale for introducing this Hilbert component is not clear, as it would seem more straightforward to correlate simply with the speech waveform (or its "fundamental waveform"). The "amplitude of the complex CC" has a wider peak than the raw CC.

Please explain and justify the analysis more clearly.

4) An accurate estimate of latency is crucial for saying that the response reflects the brainstem. Temporal alignment between audio and EEG may be affected by acoustic delay in the earphones (not specified, possibly ~1ms for ER 3), as well as the signal processing of the inputs and of the brain measures.

Audio is down-sampled (interpolation filter unspecified), filtered by a FIR of order 296 (IR temporal extent 33.4 ms), time-shifted to "compensate for delay" of the FIR, processed by the EMD algorithm, and finally by the Hilbert transform. The Hilbert transform is presumably performed by applying an STFT to a window of unspecified duration. It involves a 90-degree phase shift that translates (for the quasi-sinusoidal fundamental wave) to a frequency-dependent time shift of up to 2.5ms at 100Hz.

On the EEG side, the signal is processed by a frequency-domain method (ClearLine) to attenuate 50Hz and (presumably) harmonics. The possibility that this might affect the fundamental waveform (its time-varying frequency falls in this range) is not discussed. The EEG is filtered by a cascade of FIR filters of order 6862 and 1054 (IR lengths 274ms and 42ms) before correlation with the audio-based signal. There are clearly many stages at which a latency mismatch could arise, and the fact that this is not acknowledged or addressed (for example by calibration) is troubling.

The peak value of the cross-correlation function shown in Figure 1C, 0.05, seems rather high given that the ABR is supposed to have very low SNR. Similar values (0.05 – 0.1) have been reported for cross-correlation with filtered cortical responses that are supposed to have a much better SNR. The shape of the function is also somewhat intriguing because it is approximately symmetrical and extends to negative lag (i.e., there is a noncausal relationship between the input and the neural response). This suggests that it is largely determined by temporal smearing in the processing, for example due to convolution with the various unspecified filter kernels.

Please carefully outline how the acoustical and neural signal processing affects the estimated latency of the neural responses.

5) From the description of methods, it would seem that the stimulus is always presented with the same polarity. Electrode signals measured in that case are likely to be dominated by cochlear microphonic or possibly even cross-talk from the earphone drivers or cables. There is no reason to believe that they are from brainstem, except possibly latency (and as discussed above, it is unclear how good an indicator that is of brainstem activity).

Please include some calibration or analysis to verify that electrical artifact is not a significant factor in your findings.

6) While the speech presentation levels were relatively high, individual high-frequency harmonics would be relatively low in level given the low-pass characteristic of speech signals. Attentional modulation of cochlear gain has been shown to be frequency-specific (e.g., Maison et al., Psychophysiol 2001) and certainly could be specific to a harmonic complex.

It would be useful to include this point in the Discussion.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for submitting your article "The human auditory brainstem response to running speech reveals a subcortical mechanism for selective attention" for consideration by eLife. Your article has been favorably evaluated by Andrew King (Senior Editor) and three reviewers, one of whom, Barbara G Shinn-Cunningham (Reviewer #1), is a member of our Board of Reviewing Editors. The following individual involved in review of your submission has agreed to reveal their identity: Steve Aiken (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission. It is not normally eLife policy to allow a paper to go through multiple rounds of reviews, so this will be the last opportunity to revise the manuscript before a final decision is made.

Summary:

This paper provides evidence for attentional modulation in neural responses measured in response to running speech. The approach is relatively novel and the findings interesting.

The revision has gone some way to addressing the concerns raised by the reviewers. However, several questions still need to be answered. The reviewers also provide suggestions for how to strengthen the argument. The controls added here would definitely strengthen the paper.

Essential revisions:

1) The paper continues to over-emphasize that the measured responses are from the brainstem. The evidence shows a clear attentional effect; however, the claim that the observed effects are absolutely from the brainstem is still problematic. If cortical contributions cannot be ruled out, it would be presumptuous to conclude that this is an attentional effect in the brainstem.

The authors should make their case more persuasively. One approach to enhance this argument is to conduct further analysis of their present data. If cortical contributions are indeed responsible for the attentional effects, one would expect to find a positive correlation between peak latency and amplitude in the attended conditions. One might also expect such effects to be stronger for segments with lower fundamental frequencies (e.g., < 200 Hz) given cortical phase-locking limits.

The authors should test for these possibilities. Of interest would be any relationship between peak cross-correlation latency and peak cross-correlation amplitude for attended streams, and between peak cross-correlation amplitude and segment fundamental frequency (also for attended vs. unattended streams). Such an analysis might lead to a more nuanced understanding of the data or conversely add weight to the conclusions.

2) The cross-correlation functions (between speech and "fundamental wave" and between raw and processed EEG) address earlier concerns about the effect of processing on latency estimates. Still, an end-to-end calibration would be even more convincing. This does not require recording new data, but rather running simulated data through the processing pipeline: (a) formulate a simple speech-to-neural response model based on the conclusions the authors believe follow from their results, (b) add background EEG signal (e.g., from recorded data shifted in time), (c) run the analysis, (d) check whether latencies conform to what is expected based on the model. Given the importance of the timing of the responses to the argument that the effects are subcortical, the extra effort is worthwhile: the EEG processing pipeline involves many convolutional stages that the authors still do not fully characterize (are all filters zero-phase?).

https://doi.org/10.7554/eLife.27203.008

Author response

Essential revisions:

1) For any response like this, including the more-common FFR (with a steady-state constant-frequency acoustic signal), observations are a mixture of IC responses and other responses (perhaps in thalamus – but also lower responses such as the cochlear microphonic). For the same relative delays and magnitudes of the responses, these different responses will add in different phases, depending on their frequency. Unlike with the FFR, here, the frequency is changing from moment to moment. This will lead to different cancellation / summation at different frequencies that likely result in different peak delays. Only if there is a single truly dominant source in the mixture will the peak delay be at a fixed delay independent of frequency.

Attention effects in FFRs have been suggested to be due to the involvement of the cortex (the work of Emma Holmes presented at ARO 2017). While the delay of 10 ms relative to the stimulus calls into question a dominant role for the cortex in the present study (see above), it is also possible that the attention effect is mediated through olivocochlear inhibition.

The stats show that attention is changing the observed responses. The question is just where these responses are coming from. Given that the observed response is a mixture, further analysis is warranted to tease apart whether the effects are due to a single dominant source with a fixed delay that is modulated by attention, or whether higher-level sources, which are modulated by attention, cause different summation / cancellation effects depending on attentional focus.

We have now added additional controls for a potential stimulus artifact as well as for a contribution from the cochlear microphonic. In particular, we have measured the brainstem response to a speech signal as well as to the same signal with reversed polarity. The averaged neural response shows the same peak in its complex correlation to speech as the response to a single speech presentation. This evidences the absence of a stimulus artifact as well as of a measurable cochlear microphonic.

The brainstem response at the fundamental frequency of speech can indeed result from multiple sites in the brainstem. However, because we observe a single peak with a width of a few ms in the correlation of the neural response to the fundamental waveform of speech, the brainstem response to speech that we describe here cannot reflect sources whose latencies vary by more than a few ms from the mean latency. The latency of about 9 ms confirms the absence of a cortical contribution and hints at a neural origin in the inferior colliculus. The inferior colliculus is indeed connected to the auditory cortex through multiple segregated feedback loops, involving the olivocochlear efferent system, that likely mediate the attentional modulation that we report here.

These important issues are now discussed at different places in the revised manuscript, namely in the Results section, in a new paragraph in the Discussion section, and in three new paragraphs in the Materials and methods section. The additional data showing the control for a stimulus artifact and the cochlear microphonic is laid out in a new supplement to Figure 1.

2) The latency reported, 10.3ms, is greater than is usually attributed to a brainstem response. The latency of the largest peak of the click-evoked ABR (usually attributed to inferior colliculus) is usually assumed to be ~5ms or somewhat greater for lower-frequency stimuli (Don and Eggermont, 1978).

Often, the delay in an ASSR is longer (presumably in part because it is a mixture of neural sources as noted above). The authors say that their estimate agrees with that reported by Skoe and Kraus (2010) for speech, but the value reported in that paper is actually 7-8ms. Figure 1 of that paper, where that value is mentioned, illustrates it as the amount by which the response precedes the stimulus (!), so that source itself may have some methodological problems.

Because the latency is the primary fact used to conclude that the induced attentional changes are from brainstem, this issue is very important to consider and discuss.

Figure 1 of the publication by Skoe and Kraus (2010) indeed reports a delay of 7-8 ms. However, this is only one example of a measured response, and the authors state in the same publication that the measured latencies occur at a delay of 6 – 10 ms (page 15). As pointed out by the reviewers below, the earphones that we have employed introduce a delay of 1 ms that we had not accounted for earlier. We have now corrected our measurements for this delay, and the latencies that we obtain are now on average 9.3 ms, which is consistent with the results summarized in the review by Skoe and Kraus (2010). We have modified the manuscript and Discussion to reflect this important point.

Regarding the latency reported in Figure 1 of the publication by Skoe and Kraus (2010), the acoustic stimulus has been shifted to a later time by 7 – 8 ms with respect to the neural response to maximize the visual coherence between both signals. The neural response reported in this Figure therefore occurs 7 – 8 ms after the acoustic stimulus.

3) The latency is calculated as a cross-correlation between electrode signals and a "fundamental waveform" defined as a "nonlinear oscillation" derived from the speech by empirical mode decomposition (EMD). EMD is attractive but not very well defined or theoretically grounded; as far as we can tell, it just extracts an approximation of the fundamental Fourier component. The Hilbert transform is calculated, and both the waveform and its Hilbert transform are cross-correlated with the EEG to obtain a "complex correlation function", the amplitude of which peaks at a latency of 10.3ms. The rationale for introducing this Hilbert component is not clear, as it would seem more straightforward to correlate simply with the speech waveform (or its "fundamental waveform"). The "amplitude of the complex CC" has a wider peak than the raw CC.

Please explain and justify the analysis more clearly.

The fundamental frequency of running speech varies with time. Moreover, the amplitude of the voiced parts of speech also varies greatly, between zero for voiceless speech and a maximal value for voiced speech. While these variations may be ignored when investigating responses to short repeating stimuli, where the frequency fluctuations are limited due to the short duration and the amplitude may be approximately constant, the variations cannot be ignored in running speech where the fundamental frequency may vary over an octave and the amplitude varies widely. Fourier analysis decomposes a signal into sinusoidal oscillations at different, but constant, frequency and amplitude. A single component extracted from Fourier analysis can thus not represent the fundamental waveform. Furthermore, classical time-frequency analysis methods such as the Short Time Fourier Transform and wavelet-based methods are not well suited for this task either due to their inherently limited time-frequency resolution (Huang and Pan 2006). EMD, in contrast, decomposes a signal into empirical modes that are nonlinear and non-stationary oscillations with time-varying frequency and amplitude with no inherent limitation to time-frequency resolution. A previous study by Huang and Pan (2006) has shown that one of these modes oscillates at the fundamental frequency. We therefore identify this mode with the fundamental waveform and employ EMD to extract it. As the reviewers state, a precise theoretical grounding of EMD is still lacking, but it is a well-defined constructive procedure that is increasingly used for analyzing nonlinear and non-stationary oscillations. We now comment further on this issue in the Discussion section of our revised manuscript.

As the reviewers remark, the correlation of the neural response to the fundamental waveform has a narrower peak than the amplitude of the complex correlation. However, this narrower correlation is partly due to phase mismatch, and not to a lower correlation per se. Indeed, the negative correlation at a few ms before and after the peak shows that the brainstem response is then in antiphase, but still correlated to, the fundamental waveform. More generally, the brainstem response can occur at an arbitrary phase with respect to the fundamental waveform. We therefore compute the correlation of the neural signal to the Hilbert transform of the fundamental waveform as well. The Hilbert transform is a version of the fundamental waveform that is phase shifted by 90˚. By interpreting both correlations as the real and imaginary part of a complex correlation, we can track a brainstem response at an arbitrary phase delay. This analysis is reminiscent of Fourier analysis at a particular frequency where the correlation of a signal to a cosine and a sine function are interpreted as real and imaginary part of a complex Fourier coefficient, the phase of which represents the phase delay of the signal with respect to the cosine function. We now explain this important point in the Results section.
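The phase argument in this paragraph can be illustrated with a small numerical sketch. The frequency-modulated sinusoid standing in for the fundamental waveform, the sampling rate and the three phase offsets are assumptions chosen for illustration. The toy "response" is the same oscillation shifted by a phase ψ; its correlation with the waveform alone depends strongly on ψ, whereas the amplitude of the complex correlation does not.

```python
import numpy as np
from scipy.signal import hilbert

fs = 10_000                              # sampling rate in Hz (assumed)
t = np.arange(2 * fs) / fs
f0 = 150.0 + 50.0 * np.sin(2 * np.pi * 0.5 * t)   # time-varying frequency (assumed)
phase = 2 * np.pi * np.cumsum(f0) / fs
fundamental = np.cos(phase)
analytic = hilbert(fundamental)          # waveform + i * (its Hilbert transform)

results = {}
for psi in (0.0, np.pi / 2, np.pi):
    response = np.cos(phase - psi)       # toy response at phase offset psi
    # np.vdot conjugates its first argument, giving a complex correlation.
    c = np.vdot(analytic, response) / response.size
    results[psi] = (c.real, abs(c))      # (real correlation, complex amplitude)
```

For ψ = π/2 the real part is close to zero even though the response is fully locked to the waveform, while the amplitude stays near 0.5 for all three offsets.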

4) An accurate estimate of latency is crucial for saying that the response reflects the brainstem. Temporal alignment between audio and EEG may be affected by acoustic delay in the earphones (not specified, possibly ~1ms for ER 3), as well as the signal processing of the inputs and of the brain measures.

Audio is down-sampled (interpolation filter unspecified), filtered by a FIR of order 296 (IR temporal extent 33.4 ms), time-shifted to "compensate for delay" of the FIR, processed by the EMD algorithm, and finally by the Hilbert transform. The Hilbert transform is presumably performed by applying an STFT to a window of unspecified duration. It involves a 90-degree phase shift that translates (for the quasi-sinusoidal fundamental wave) to a frequency-dependent time shift of up to 2.5ms at 100Hz.

On the EEG side, the signal is processed by a frequency-domain method (ClearLine) to attenuate 50Hz and (presumably) harmonics. The possibility that this might affect the fundamental waveform (its time-varying frequency falls in this range) is not discussed. The EEG is filtered by a cascade of FIR filters of order 6862 and 1054 (IR lengths 274ms and 42ms) before correlation with the audio-based signal. There are clearly many stages at which a latency mismatch could arise, and the fact that this is not acknowledged or addressed (for example by calibration) is troubling.

The peak value of the cross-correlation function shown in Figure 1C, 0.05, seems rather high given that the ABR is supposed to have very low SNR. Similar values (0.05 – 0.1) have been reported for cross-correlation with filtered cortical responses that are supposed to have a much better SNR. The shape of the function is also somewhat intriguing because it is approximately symmetrical and extends to negative lag (i.e., there is a noncausal relationship between the input and the neural response). This suggests that it is largely determined by temporal smearing in the processing, for example due to convolution with the various unspecified filter kernels.

Please carefully outline how the acoustical and neural signal processing affects the estimated latency of the neural responses.

We are grateful to the reviewers for pointing out the acoustic delay in the earphones. Manufacturer specifications state that the earphones indeed introduce a delay of 1 ms. As explained above, we have now adjusted for this delay, and the average delay of the brainstem response to speech that we have measured is now 9.3 ms. The figures and the text of our revised manuscript have been modified accordingly.

We have taken care that the various steps involved in the processing of the acoustic and the neural signals do not affect the latencies. In particular, the down-sampling of the audio signal was carried out using the Matlab resample function, which resamples an input sequence by applying an antialiasing FIR lowpass filter and compensating for the delay introduced by the filter. The Hilbert transform did not involve an STFT, and the other frequency filters were time-compensated to not produce a latency shift either.

Due to the importance of potential latency shifts introduced by signal processing, we have investigated this issue further through computing cross-correlations of the processed and the unprocessed signals. The obtained data are presented in the novel supplement to Figure 1, and are discussed in the Results section as well as in two new paragraphs in the Materials and methods section. Our analysis shows that there is no latency shift between the computed fundamental waveform and the original speech signal, and neither is there a latency shift between the processed and the original neural recording.
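A check of this kind can be sketched in Python with SciPy, which is a different toolchain from the Matlab processing described above. The sampling rate, the pass band and the white-noise test signal are assumptions; the filter order of 296 is borrowed from the reviewers' comment purely as an example. The sketch filters a signal with a linear-phase FIR filter, once without and once with delay compensation, and locates the peak of the cross-correlation with the unprocessed signal.

```python
import numpy as np
from scipy.signal import correlate, correlation_lags, firwin, lfilter

rng = np.random.default_rng(1)
fs = 8820                                  # sampling rate in Hz (assumed)
x = rng.standard_normal(30 * fs)           # stand-in for an unprocessed signal

numtaps = 297                              # FIR filter of order 296
taps = firwin(numtaps, [100.0, 300.0], pass_zero=False, fs=fs)  # band is assumed
delay = (numtaps - 1) // 2                 # group delay of a linear-phase FIR, in samples

# Uncompensated filtering delays the output by `delay` samples.
raw = lfilter(taps, 1.0, x)

# Compensation: pad the input, filter, then discard the first `delay` samples.
padded = np.concatenate([x, np.zeros(delay)])
compensated = lfilter(taps, 1.0, padded)[delay:]

def peak_lag(processed, original):
    """Lag (in samples) at which the processed signal best matches the original."""
    c = correlate(processed, original, mode="full", method="fft")
    lags = correlation_lags(processed.size, original.size, mode="full")
    return lags[np.argmax(c)]

lag_raw = peak_lag(raw, x)                  # expected near `delay`
lag_compensated = peak_lag(compensated, x)  # expected near zero
```

A peak at zero lag for the compensated signal indicates that the filtering stage does not shift the latency, which is the property the cross-correlations in the supplement are meant to establish.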

The brainstem response is indeed much weaker than cortical responses. We therefore require long recordings of 10 minutes in duration to achieve a good SNR. Nonetheless, the average of the correlation across the different subjects is only 0.015 ± 0.003 as we now report in the manuscript. Moreover, we employed passive electrodes in combination with a specialized bipolar preamplifier with a very high common-mode rejection; this helps to achieve a high SNR. Measurements of cortical responses often employ active electrodes that have a lower SNR than passive electrodes as well as a main amplifier with a lower common-mode rejection than our specialized bipolar preamplifier, which decreases the SNR of the so-obtained cortical responses further.

As the reviewers remark, the peak of the correlation is approximately symmetric and extends to negative delays. The shape of the peak indeed reflects the autocorrelation of the speech signal (Figure 1B), which is symmetric. The peak of the complex correlation of the neural response to the stimulus is wider than that of the autocorrelation since it is not only affected by the self-similarity of the fundamental waveform at short times, but also by that of the neural response as well as by noise. The causality between the acoustic signal and the neural response is established by the positive delay of the peak of the correlation. The extension of the lower-delay part of this peak to negative times does, in contrast, not reflect a violation of causality. Indeed, the peak in the autocorrelation of the fundamental waveform extends to negative times as well, but it is clearly the lag at the peak (0 ms) that determines the causal relationship.

5) From the description of methods, it would seem that the stimulus is always presented with the same polarity. Electrode signals measured in that case are likely to be dominated by cochlear microphonic or possibly even cross-talk from the earphone drivers or cables. There is no reason to believe that they are from brainstem, except possibly latency (and as discussed above, it is unclear how good an indicator that is of brainstem activity).

Please include some calibration or analysis to verify that electrical artifact is not a significant factor in your findings.

We agree that special care needs to be taken to ensure that the measured response is free from stimulus artifacts.

The influence of stimulus artifacts or of the cochlear microphonic is unlikely since both would manifest at zero latency, where we do not observe a measurable response. However, we have now carried out two additional tests as suggested by the reviewers. The results are reported in a new supplement to Figure 1 and are discussed in the Results section as well as in a new paragraph in the Materials and methods section.

First, we have recorded brainstem responses when the earphones were not inside the ear canal of a subject but close to the ear, so that the subject could not hear the stimulus. We then do not obtain a measurable brainstem response to speech, evidencing the absence of stimulus artifacts. Second, we have played the same story to a subject twice, the second time with inverted polarity. We have then averaged the neural response to both stimulus presentations and performed the correlation analysis on the average. This procedure removes putative stimulus artifacts as well as the cochlear microphonic. We obtain the same correlation as when we analyze the response to a single stimulus, demonstrating that the response reflects neither a stimulus artifact nor the cochlear microphonic.
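The logic of the polarity-inversion control can be shown with a toy example. The recording model below is an assumption made for illustration: the artifact is taken to follow the stimulus polarity, while the neural component is modelled as a delayed, polarity-invariant function of the stimulus. Real responses may contain polarity-dependent parts that such averaging would also reduce.

```python
import numpy as np

rng = np.random.default_rng(2)
stimulus = rng.standard_normal(10_000)     # stand-in for the speech signal

def record(stim, delay=90, artifact_gain=0.5):
    """Toy recording: polarity-invariant neural part plus polarity-following artifact."""
    neural = np.roll(np.abs(stim), delay)  # unchanged when the stimulus is inverted
    artifact = artifact_gain * stim        # flips sign with the stimulus
    return neural + artifact

rec_original = record(stimulus)
rec_inverted = record(-stimulus)
average = 0.5 * (rec_original + rec_inverted)   # the artifact terms cancel

# Zero-lag correlation with the stimulus, where an artifact would show up.
artifact_single = np.dot(rec_original, stimulus) / stimulus.size
artifact_average = np.dot(average, stimulus) / stimulus.size
```

In this toy model the single recording has a clear zero-lag correlation with the stimulus, whereas the average of the two polarities retains only the neural part.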

6) While the speech presentation levels were relatively high, individual high-frequency harmonics would be relatively low in level given the low-pass characteristic of speech signals. Attentional modulation of cochlear gain has been shown to be frequency-specific (e.g., Maison et al., Psychophysiol 2001) and certainly could be specific to a harmonic complex.

It would be useful to include this point in the Discussion.

The response at the fundamental frequency that we have measured can indeed result from higher harmonics as we had already discussed in our manuscript. As the reviewer remarks, attentional modulation may depend on the frequency of the harmonics. We have now included this valuable point in the Discussion section of our revised manuscript.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Essential revisions:

1) The paper continues to over-emphasize that the measured responses are from the brainstem. The evidence shows a clear attentional effect; however, the claim that the observed effects are absolutely from the brainstem is still problematic. If cortical contributions cannot be ruled out, it would be presumptuous to conclude that this is an attentional effect in the brainstem.

The authors should make their case more persuasively. One approach to enhance this argument is to conduct further analysis of their present data. If cortical contributions are indeed responsible for the attentional effects, one would expect to find a positive correlation between peak latency and amplitude in the attended conditions. One might also expect such effects to be stronger for segments with lower fundamental frequencies (e.g., < 200 Hz) given cortical phase-locking limits.

The authors should test for these possibilities. Of interest would be any relationship between peak cross-correlation latency and peak cross-correlation amplitude for attended streams, and between peak cross-correlation amplitude and segment fundamental frequency (also for attended vs. unattended streams). Such an analysis might lead to a more nuanced understanding of the data or conversely add weight to the conclusions.

Cortical contributions can be ruled out due to their longer latency. Hashimoto et al. (1981), for instance, employed recordings in neurosurgical patients from different parts of the brainstem as well as the cortex during sound stimulation. They found that the cortex did not contribute to any scalp-recorded potential with a delay of 10 ms or less. Picton et al. (1981) found that any auditory-evoked scalp-recorded potential with a latency of 15 ms or less originates from the cochlea or the brainstem. The neural response to the fundamental frequency of speech that we measure here has a latency of 9.3 ms which shows that this response does not contain a cortical contribution.

Our obtained latency is now further validated through a toy model of the brainstem response that we have included in our revised manuscript as set out below. Moreover, we have now included the new panel (E) to Figure 1—figure supplement 1 that shows the correlation between the neural signal and the complex fundamental waveform up to a latency of 700 ms, evidencing that there is no peak except for the one at 9 ms, and therefore no measurable contribution from the cerebral cortex. We now also point out that we assess the brainstem response through the amplitude of the complex cross-correlation at its peak. Even if there were a small cortical contribution to the neural response, it would occur at a higher latency and would thus not affect our results regarding attention.

We now make this case more persuasively in our revised manuscript and cite the two important references described above. We now write:

"The peak occurred at a mean latency of 9.3 ± 0.7 ms, verifying that the measured neural response resulted from the brainstem and not from the cerebral cortex (Hashimoto et al. 1981; Picton et al. 1981)."

We have thereby added the new references to Hashimoto et al. (1981) as well as to Picton et al. (1981).

We have also followed the valuable suggestion to investigate correlations between amplitude and latency of the brainstem response. We find, however, no statistically significant correlation. In particular, we have analysed the correlation between amplitude and latency of short segments of the recordings and of the corresponding speech signal. We have performed this analysis for several conditions, namely for the brainstem response to a single speaker as well as for the neural response to both the attended and the ignored speaker of the two-speaker stimulus. For each speech stimulus we have also determined the parts with a fundamental frequency that is lower than the mean frequency, as well as the parts with a fundamental frequency above the mean, and considered each condition separately. However, in none of these conditions did we find a statistically significant correlation between the amplitude and the latency.

We would like to point out that this does not necessarily mean that there is no correlation between amplitude and latency, but only that we were not able to find such a correlation from our data. Indeed, a recent study by Mehraei et al. (J. Neurosci. 2016) finds a correlation between amplitude and latency of wave V of the brainstem response to clicks in noise when the noise level is varied. This correlation presumably arises from the varying contribution of auditory-nerve fibers with different thresholds and from different parts of the cochlea. Since speech contains varying contributions other than the fundamental waveform, it is therefore conceivable that a similar correlation exists in the brainstem response to continuous speech that we describe, but that the effect is too small to emerge from our data. Conversely, this implies that even if such an effect were found, it alone would not allow one to differentiate between brainstem and cortical contributions to the observed response.

We have also followed the important suggestion to investigate a correlation between the amplitude of the brainstem response and the fundamental frequency of the speech signal. We have done this by considering short speech segments and the corresponding neural data. This analysis has been performed for the brainstem response to a single speaker as well as for that to the attended and the ignored speaker of the two-speaker stimulus. We find small negative and statistically significant correlations between amplitude and fundamental frequency in all conditions. This concurs with previous results that have shown a low-pass nature of the brainstem response (e.g. Musacchia et al. (2007)). We find, however, no statistically significant difference between the correlation coefficients in the different conditions. In particular, the correlation obtained from the attended speaker is not significantly larger than that obtained from the ignored speaker.
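A segment-wise correlation analysis of this kind can be sketched as below. This is a hedged illustration rather than the code used for the manuscript: the per-segment values are synthetic placeholders, the helper names `split_into_segments`, `pearson_r` and `t_statistic` are hypothetical, and only the count of 160 segments is taken from the Materials and methods. The t statistic shown is the standard one for a Pearson coefficient, t = r·sqrt((n − 2)/(1 − r²)); converting it to a p-value is left to a statistics library.

```python
import math
import random

def split_into_segments(signal, n_segments):
    """Cut a signal into n_segments consecutive pieces of equal length."""
    seg_len = len(signal) // n_segments
    return [signal[k * seg_len:(k + 1) * seg_len] for k in range(n_segments)]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Synthetic stand-ins for per-segment measurements (assumed values, not data):
# a mean fundamental frequency and a response amplitude for each segment.
rng = random.Random(0)
n_segments = 160
f0_per_segment = [rng.uniform(100.0, 200.0) for _ in range(n_segments)]
amplitude_per_segment = [1.0 - 0.002 * f + rng.gauss(0.0, 0.1) for f in f0_per_segment]

r = pearson_r(f0_per_segment, amplitude_per_segment)
t = t_statistic(r, n_segments)
print("r =", r, "t =", t)
```

In practice `split_into_segments` would be applied to both the speech signal and the neural recording so that each segment yields one amplitude, one latency and one mean fundamental frequency.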

We now describe these findings in detail in our revised manuscript, and are convinced that they will trigger further studies into the important issues of the precise neural mechanisms that govern the brainstem response to continuous speech.

In particular, we have added the new Figure 3—figure supplement 1 in which we show the correlation between the amplitude of the brainstem response and the fundamental frequency of the speech signal.

In the Results section, we write:

"The auditory brainstem response to short speech stimuli has a low-pass nature: the amplitude of the response declines with increasing frequency (Musacchia et al. 2007; Skoe & Kraus 2010). […] However, for the brainstem response to continuous speech that we measured here, we did not find a statistically-significant correlation between amplitude and latency (Materials and methods).”

In the Materials and methods section, we write:

"We assessed the correlation between the amplitude and the frequency of the fundamental waveform by dividing each speech signal into 160 segments (3 s duration each), and by dividing the corresponding neural recording analogously. […] However, none of these correlations were statistically significant (two-tailed Student's t-test, p > 0.05).”

2) The cross-correlation functions (between speech and "fundamental wave" and between raw and processed EEG) address earlier concerns about the effect of processing on latency estimates. Still, an end-to-end calibration would be even more convincing. This does not require recording new data, but rather running simulated data through the processing pipeline: (a) formulate a simple speech-to-neural response model based on the conclusions the authors believe follow from their results, (b) add background EEG signal (e.g., from recorded data shifted in time), (c) run the analysis, (d) check whether latencies conform to what is expected based on the model. Given the importance of the timing of the responses to the argument that the effects are subcortical, the extra effort is worthwhile: the EEG processing pipeline involves many convolutional stages that the authors still do not fully characterize (are all filters zero-phase?).

We agree that a simulated brainstem response, processed through our pipeline and analysed with our methodology, will strengthen our argument in two ways. First, it will show that our methodology is able to accurately extract both phase shift and time delay of the brainstem signal. Second, it will provide a verification that our processing, such as through the involved filtering, does not introduce an additional time delay.

We have therefore followed this very valuable suggestion and considered a toy model for the brainstem response. In a highly simplistic fashion, the brainstem response is thereby modelled as a series of bursts of spikes. Each cycle of the fundamental waveform triggers a burst that occurs at a fixed phase of the fundamental waveform. All bursts are then shifted by a certain fixed time delay. We further added noise that is realistic for scalp recordings, and then processed the resulting simulated response through our signal-processing pipeline. To clarify, all the filters that we used to process the data are linear-phase FIR filters, and we systematically compensated for their delay.

We find a complex correlation that performs as expected: the peak of the amplitude occurs at the time delay of the modelled response, and the phase yields the phase of the fundamental waveform at which the bursts occur. This verifies that our methodology is indeed able to compute both phase shift and time delay of the brainstem response. It also verifies that the processing that we employ for the neural recording does not yield an additional time delay.
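The behaviour described above can be reproduced in a minimal sketch. It is not the model from the revised manuscript and makes several simplifying assumptions that should be read as such: each burst is smoothed into one cycle of a cosine that peaks at phase φ, the complex fundamental waveform is idealised as exp(iθ), the fundamental frequency drifts sinusoidally between 100 and 200 Hz, the noise is weak white Gaussian noise rather than realistic scalp noise, and the sign convention of `complex_correlation` is chosen so that the phase at the peak equals φ. All names and parameter values are hypothetical.

```python
import cmath
import math
import random

fs = 1000            # sampling rate in Hz (assumed)
n_samples = 20000    # 20 s of signal
delay = 9            # imposed delay in samples (9 ms at 1 kHz)
phi = 0.8            # phase of the waveform at which the response peaks (rad)

# Phase of a fundamental waveform whose frequency drifts between 100 and 200 Hz.
theta, phase_acc = [], 0.0
for n in range(n_samples):
    f0 = 150.0 + 50.0 * math.sin(2 * math.pi * n / 500)
    phase_acc += 2 * math.pi * f0 / fs
    theta.append(phase_acc)

# Idealised complex fundamental waveform (assumption: exp(i * theta)).
waveform = [cmath.exp(1j * th) for th in theta]

# Smoothed "bursts": maximal when the delayed waveform phase equals phi, plus noise.
rng = random.Random(1)
response = [
    (math.cos(theta[n - delay] - phi) if n >= delay else 0.0) + rng.gauss(0.0, 0.2)
    for n in range(n_samples)
]

def complex_correlation(lag):
    """Normalised sum of response[n + lag] * waveform[n] over the valid samples."""
    start, stop = max(0, -lag), min(n_samples, n_samples - lag)
    total = sum(response[n + lag] * waveform[n] for n in range(start, stop))
    return 2 * total / (stop - start)

lags = range(-10, 31)
corr = {lag: complex_correlation(lag) for lag in lags}
peak_lag = max(lags, key=lambda lag: abs(corr[lag]))
peak_phase = cmath.phase(corr[peak_lag])

print("peak lag (samples):", peak_lag)
print("phase at peak (rad):", peak_phase)
```

Because the fundamental frequency varies over time, contributions from different cycles only add coherently at the true delay, which is why the amplitude peaks there. The sketch also shows why an accurate latency matters: evaluating the phase one sample away from the peak shifts it by roughly one sample's worth of the fundamental's phase advance.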

We now describe this simplistic model and the important findings that follow from it in our revised manuscript. In particular, in the Results section, we now write:

"We further considered a highly simplistic model of the auditory brainstem response in which a burst of neural spikes occurred at each cycle of the fundamental waveform, at a fixed phase φ, and was then shifted in time by a certain delay τ (Figure 2A, B; Materials and methods). […] This demonstrated that the brainstem's response to continuous speech could be reliably extracted through the developed method."

In the Materials and methods section, we elaborate further:

"To further verify our analysis we considered a highly simplistic model of the brainstem response. […] This confirmed the validity of our methodology as well as that our processing of the neural data did not alter the temporal delay.”

Our simplistic brainstem response, together with the resulting complex correlation to the fundamental waveform, is illustrated in the new Figure 2.

We now also clarify in our revised manuscript that we employed linear-phase FIR filters, and that we systematically compensated for their delay (subsection “Computation of the fundamental waveform of speech”, second paragraph and subsection “Analysis of the auditory-brainstem response”, first paragraph).
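Delay compensation for a linear-phase FIR filter can be sketched as follows. This is a generic illustration, not the filters used in the study: a symmetric FIR filter with an odd number of taps N delays every frequency component by (N − 1)/2 samples, so shifting the output back by that amount restores the original timing. The tap values and the helper names `fir_filter` and `compensate_delay` are assumptions made for the example.

```python
def fir_filter(signal, taps):
    """Causal FIR filtering: y[n] = sum_k taps[k] * x[n - k]."""
    return [
        sum(taps[k] * signal[n - k] for k in range(len(taps)) if n - k >= 0)
        for n in range(len(signal))
    ]

def compensate_delay(filtered, n_taps):
    """Shift the output back by the group delay (n_taps - 1) / 2 of a symmetric FIR filter."""
    delay = (n_taps - 1) // 2
    return filtered[delay:] + [0.0] * delay

# Symmetric (hence linear-phase) low-pass kernel with 7 taps; values are illustrative.
taps = [t / 16 for t in (1, 2, 3, 4, 3, 2, 1)]

# Test signal: a single impulse at sample 50.
impulse = [0.0] * 101
impulse[50] = 1.0

filtered = fir_filter(impulse, taps)
aligned = compensate_delay(filtered, len(taps))

print("peak before compensation:", filtered.index(max(filtered)))
print("peak after compensation:", aligned.index(max(aligned)))
```

The filtered impulse peaks three samples late, and the compensated output peaks at the original sample, so a latency measured after such compensation is not inflated by the filtering.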

https://doi.org/10.7554/eLife.27203.009

Article and author information

Author details

  1. Antonio Elia Forte

    Department of Bioengineering, Centre for Neurotechnology, Imperial College London, London, United Kingdom
    Contribution
    Software, Formal analysis, Validation, Investigation, Visualization, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
  2. Octave Etard

    Department of Bioengineering, Centre for Neurotechnology, Imperial College London, London, United Kingdom
    Contribution
    Formal analysis, Validation, Methodology, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
  3. Tobias Reichenbach

    Department of Bioengineering, Centre for Neurotechnology, Imperial College London, London, United Kingdom
    Contribution
    Conceptualization, Resources, Data curation, Supervision, Funding acquisition, Investigation, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    reichenbach@imperial.ac.uk
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-3367-3511

Funding

Engineering and Physical Sciences Research Council (EP/M026728/1)

  • Tobias Reichenbach

National Science Foundation (PHY-1125915)

  • Tobias Reichenbach

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Steve Bell, Karolina Kluk-de Kort, Patrick Naylor, David Simpson and Malcolm Slaney for discussion as well as for comments on the manuscript. This research was supported by EPSRC grant EP/M026728/1 to TR as well as in part by the National Science Foundation under Grant No. NSF PHY-1125915.

Ethics

Human subjects: All subjects provided written informed consent. All experimental procedures were approved by the Imperial College Research Ethics Committee.

Reviewing Editor

  1. Barbara G Shinn-Cunningham, Reviewing Editor, Boston University, United States

Publication history

  1. Received: March 27, 2017
  2. Accepted: September 14, 2017
  3. Version of Record published: October 10, 2017 (version 1)

Copyright

© 2017, Forte et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,349
    Page views
  • 216
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
