Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Read more about eLife’s peer review process.

Editors
- Reviewing Editor: Björn Herrmann, Baycrest Hospital, Toronto, Canada
- Senior Editor: Huan Luo, Peking University, Beijing, China
Reviewer #1 (Public review):
Summary:
This article investigates the application of commonly employed analytic methods in electrophysiological neuroscience to the speech envelope taken from 17 different languages' audio corpora. The findings indicate that features observed in speech-brain tracking responses, specifically theta and gamma oscillations, as well as their phase-amplitude coupling, are actually present within the speech envelope itself. This suggests that the neural data recorded in response to speech primarily reflects an evoked response to the temporal statistical properties of the envelope, rather than an inherent neural mechanism. Data from 18 individuals with epilepsy listening to French speech further support this interpretation: theta and gamma oscillations, along with their phase-amplitude coupling, are absent at rest and are linearly driven by the acoustic envelope during speech perception.
Strengths:
I find these results very interesting and convincing, with a strong take-home message: we should exercise caution when interpreting observed theta/gamma activity and the associated phase-amplitude coupling during speech comprehension tasks.
Weaknesses:
I mostly have comments on clarifications regarding the methods, specifically on the criteria for language exclusion, and on the statistical testing and reporting.
(1) Clarification is needed regarding the rationale for the number of languages analysed: initially, 17 languages were considered, six were excluded due to the absence of PAC in the high gamma range, yet the analysis was ultimately conducted on only nine languages, not eleven. Could you please explain this discrepancy?
(2) Considering the six languages that did not exhibit any statistically significant high-frequency PAC, do you have potential reasons for this result? Might it be related to the fundamental frequency (F0) of the speakers' voices? If six languages out of seventeen do not show PAC, can we argue that this feature is universal across languages?
(3) How is inter-subject variability addressed within the SEEG analysis? The authors report the percentage of SEEG independent components showing significant effects in power spectral changes, PAC, and other measures, but it is unclear whether these components are consistent across participants or whether only a few participants drive the effect. It would be helpful to report how many participants are retained for each selection of SEEG-ICs in the article. Currently, the statistical testing of the SEEG-ICs also appears to assume independent samples. It would be helpful to include group-level statistical tests across subjects, for instance by performing mixed-effects models and including participant as a random factor.
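The unit-of-analysis issue raised in point (3) can be illustrated concretely. Below is a minimal numpy/scipy sketch (entirely hypothetical data and effect sizes, not the authors' pipeline; a full mixed-effects model could instead be fit with, e.g., statsmodels' MixedLM) showing the difference between pooling SEEG-ICs as if independent and aggregating within participants first:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical setup: each of 18 participants contributes a variable
# number of SEEG ICs; treating ICs as independent samples inflates
# the effective sample size.
n_subjects = 18
ics_per_subject = rng.integers(3, 12, size=n_subjects)

# Simulated per-IC effect sizes (e.g., PAC increase vs. rest): a true
# group-level effect of 0.5 plus between-subject and within-subject noise.
subject_means = 0.5 + 0.3 * rng.standard_normal(n_subjects)
ic_effects = [m + 0.2 * rng.standard_normal(k)
              for m, k in zip(subject_means, ics_per_subject)]

# Questionable unit of analysis: pool all ICs as if independent samples.
pooled = np.concatenate(ic_effects)
t_ic, p_ic = stats.ttest_1samp(pooled, 0.0)

# Participant as the unit of analysis: aggregate within subject first,
# then test across the 18 subject means.
per_subject = np.array([e.mean() for e in ic_effects])
t_subj, p_subj = stats.ttest_1samp(per_subject, 0.0)
```

With a genuine effect both tests reject the null, but only the per-subject test has degrees of freedom that match the number of participants, which is the point of the requested group-level analysis.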
Reviewer #2 (Public review):
Summary:
This paper nicely demonstrates that "speech tracking" in the auditory cortex extends all the way up to 100-150 Hz. Specifically, the study asks whether the fluctuations in sound amplitude found in speech at various time scales relate to fluctuations at similar time scales in intracranial recordings in auditory brain areas. First, it analyzes amplitude fluctuations in the speech of 17 different languages, and characterizes fluctuations due to syllabic rate (2-6 Hz), vocalic features (30-50 Hz), and fundamental frequency (100-150 Hz, in male speakers). It then analyzes neural activity recorded while participants listen to male and female speakers of French. By measuring changes in the power spectrum relative to rest, it links the sound amplitude fluctuations to fluctuations in neural activity in the same frequency bands, referring to them as "theta", "low-gamma", and "high-gamma". Using Granger "causality", it clearly shows that the neural fluctuations can be predicted linearly from the sound fluctuations. Using a cross-frequency coupling measure, they further show that, in the neural dynamics, high-gamma fluctuations precede theta fluctuations.
Strengths:
(1) The analysis of neural activity (Figure 2) is a very compelling account of how the theta, low-gamma, and high-gamma activity observed in neural recordings closely follows the properties of the acoustic speech signal itself.
(2) This includes phase amplitude coupling, a property that I had not previously seen described for the speech signal itself, and is here nicely demonstrated in Figure 1.
(3) The Granger "causality" analysis makes a compelling case that neural fluctuations in these frequency bands are driven by the stimulus itself.
(4) The finding in Figure 4 that female fundamental emerges at half the frequency in the neural activity is, to my knowledge, an entirely novel observation, not just in speech but in amplitude modulated sounds in general. This non-linear phenomenon is very interesting and prompts a host of interesting questions for future research: Does this happen only for voiced speech, does it depend on the harmonic stack of speech, or can it be produced with a single AM frequency? Are there preferred frequencies for this phenomenon?
(5) The cross-frequency coupling measure shows a number of directed effects in the neural signal that seem to counter the predominant view in neuroscience, namely that the phase of slower fluctuations "organizes" or "drives" the power of faster fluctuations, e.g. theta→gamma coupling. Here the direction is reversed, as gamma→theta coupling, and this is not a property of the sound itself. This, too, should lead to a number of follow-up studies (although there are some potential confounds here).
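The phase-amplitude coupling highlighted in strength (2) is straightforward to compute directly on any signal, acoustic or neural. The sketch below (all functions and the synthetic signal are illustrative, not the authors' pipeline) implements a Canolty-style normalized mean-vector-length PAC measure and applies it to a toy amplitude-modulated signal:

```python
import numpy as np

def analytic(sig):
    """FFT-based analytic signal (equivalent to scipy.signal.hilbert)."""
    n = len(sig)
    spec = np.fft.fft(sig)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[1:n // 2] = 2.0
        h[n // 2] = 1.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spec * h)

def bandpass(sig, fs, lo, hi):
    """Crude FFT-mask band-pass filter, adequate for a demo."""
    freqs = np.fft.rfftfreq(len(sig), 1 / fs)
    spec = np.fft.rfft(sig)
    spec[(freqs < lo) | (freqs > hi)] = 0
    return np.fft.irfft(spec, len(sig))

def pac_mvl(sig, fs, phase_band, amp_band):
    """Normalized mean-vector-length PAC (Canolty-style)."""
    phase = np.angle(analytic(bandpass(sig, fs, *phase_band)))
    amp = np.abs(analytic(bandpass(sig, fs, *amp_band)))
    return np.abs(np.mean(amp * np.exp(1j * phase))) / amp.mean()

# Toy signals: a 40 Hz "gamma" whose amplitude is (or is not)
# modulated by the phase of a 5 Hz "theta".
fs = 1000
t = np.arange(0, 10, 1 / fs)
theta = np.sin(2 * np.pi * 5 * t)
coupled = theta + (1 + theta) * np.sin(2 * np.pi * 40 * t)
uncoupled = theta + np.sin(2 * np.pi * 40 * t)

pac_coupled = pac_mvl(coupled, fs, (3, 7), (30, 50))
pac_uncoupled = pac_mvl(uncoupled, fs, (3, 7), (30, 50))
```

Applied to a speech envelope instead of the toy signal, the same measure would quantify the theta-gamma nesting the reviewers describe.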
Weaknesses:
(1) The claim that different frequency bands are processed in different locations, referred to in the abstract as "multiplexing", is less well supported. The neural analysis is performed on independent components that are spatially distributed, making this claim less transparent than it would be with more direct treatments of electrode location, such as bipolar referencing.
(2) The writing in the Introduction and Results section obscures the source of sound amplitude fluctuations at different timescales. Instead, it treats these fluctuations as some sort of discovery. This is strange because the abstract and discussion are fairly accurate on this point - namely, the fluctuations are all due to well-known properties of speech. The descriptions are accurate, although I would put it slightly differently: fluctuations below 6 Hz are due to the varying lengths of sentences and words, fluctuations at 25-50 Hz reflect the well-established stationary periods of the vocal tract, and those at 100-150 Hz reflect the vibration of the vocal cords in male speakers.
(3) The problem of guiding the analysis of sound by notions from neural signals is most glaring when they restrict their analysis to less than 150Hz, which leaves out female-voiced speech.
(4) Along with this, there is a heavy emphasis on notions of "rhythms" and "oscillations" when clearly, aside from the vocal cords, there is no evidence for rhythmic fluctuations. Any reasonable definition of a rhythm would need at least 2 or 3 cycles of a repeated pattern. A spectral "peak" for the sound envelope is shown at 5Hz. But this is not indicative of a regular rhythm. Instead, the peak appears to be an artifact of displaying power per octave rather than power spectral density. A peak in a power per octave is not a reliable indicator of a coherent oscillation, and the speech envelope does not exhibit a clear 5Hz rhythm. Unfortunately, prior literature has not been clear on this. It would be more accurate if the word "rhythm" were replaced with "fluctuation" and/or "activity" for the case of speech envelope and neural activity, respectively.
(5) The Introduction also omits the literature on neural responses to amplitude-modulated sounds, which extends to at least 200 Hz. So the findings here on "high-gamma" are well in line with prior literature.
(6) The fact that the neural analysis was cut off at 150 Hz is, to me, a missed opportunity to test whether neural speech tracking extends all the way up to the typical female fundamental of around 200 Hz.
(7) The gamma→theta effects reported here could be confounded by a simple longer delay in the analysis of theta. In fact, Figure S5 confirms that delay. It is unclear whether the CFD metric captures anything more than a temporal delay between the two signals. The term "functionally interconnected" in the abstract is a bit of a stretch; it may be essentially delayed correlation.
(8) There is a minor concern with the claim that low-gamma drives theta amplitude. While statistics on this are reported, the corresponding figure may be suggesting an alpha-harmonic instead of theta (Figure 5c).
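The display artifact described in point (4) - that plotting power per octave rather than power spectral density can manufacture a "peak" from a monotonically decreasing spectrum - is easy to demonstrate numerically. The PSD shape below is hypothetical, chosen only to mimic a 1/f-like envelope spectrum with a high-frequency roll-off:

```python
import numpy as np

# Hypothetical monotonically decreasing PSD: a 1/f^0.5 trend with a
# gentle exponential roll-off. There is no oscillatory peak anywhere.
f = np.linspace(0.5, 64, 2000)
psd = f ** -0.5 * np.exp(-f / 20)

# "Power per octave" weights the PSD by frequency, because the octave
# around f spans a band of width proportional to f. This weighting can
# create an interior maximum where the PSD itself has none.
per_octave = f * psd * np.log(2)

f_peak_psd = f[np.argmax(psd)]         # maximum at the lowest frequency
f_peak_oct = f[np.argmax(per_octave)]  # interior "peak" near 10 Hz
```

Analytically, the per-octave curve here peaks where d/df (0.5 ln f - f/20) = 0, i.e., at 10 Hz, even though the PSD decreases everywhere - exactly the kind of spurious ~5 Hz "peak" the reviewer warns about.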
Reviewer #3 (Public review):
Summary:
This manuscript investigates whether the theta-gamma phase-amplitude coupling in the human auditory cortex serves as an intrinsically generated neural mechanism for hierarchically parsing speech. By analyzing speech corpora across 17 languages alongside human intracranial EEG recordings, the authors demonstrate that these nested oscillatory dynamics are inherent, robust acoustic properties embedded within the speech envelope itself. Consequently, they claim that rather than generating parsing windows internally, the early auditory cortex acts as a temporal demultiplexer that segregates syllabic, vocalic, and pitch features into distinct, stimulus-driven neural channels. Furthermore, the study presents evidence for a reversed functional directionality, wherein fast-varying gamma activity drives the phase alignment of slower theta rhythms, fundamentally reframing auditory PAC as a stimulus-evoked alignment to a highly structured external signal rather than an endogenous cognitive parsing tool.
Strengths:
(1) The authors demonstrated robust theta-gamma acoustic structure across languages. They analyzed the acoustic speech envelope across 17 typologically distinct languages. This establishes that the nested theta-gamma acoustic structure is a universal feature of human speech, rather than an artifact of one language's specific phonology.
(2) The use of time-resolved, high-SNR intracranial recordings is a critical strength of this study. This approach provides the precise spatiotemporal fidelity required to confidently separate and delineate multiplexed high-frequency dynamics, particularly the low- and high-gamma bands, that are essential for accurate speech decoding but are typically attenuated or lost in non-invasive scalp recordings.
(3) The authors move beyond standard correlational PAC metrics by employing a suite of converging analyses, including the isolation of true oscillations from aperiodic noise and the directional index. Together, these metrics demonstrate that auditory PAC is a stimulus-evoked alignment to a highly structured external speech signal, rather than an intrinsically generated top-down parsing mechanism.
Weaknesses:
(1) A major methodological concern is the use of ICA across SEEG electrode shafts to define distinct neural sources (SEEG-ICs). SEEG electrodes traverse complex macroanatomy, including multiple cortical layers, sulcal banks, and white matter. By constructing components derived from weights across the entire electrode, and subsequently localizing each component solely to the contact with the maximal contribution, the authors risk generating biologically implausible signals. Such an approach potentially mixes true localized cortical gray matter activity with deep structure or white matter signals. Given that a central claim of this manuscript is the spatial and functional segregation of theta and gamma neural populations, the authors could consider further validating these core findings (such as the gamma-to-theta directionality) using single-channel or bipolar-referenced data.
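The bipolar-referencing validation suggested here is cheap to implement. A toy numpy sketch (hypothetical shaft geometry and signal model, purely illustrative) shows why it is attractive: differencing adjacent contacts cancels a component shared across the shaft (reference activity or volume conduction) while preserving local gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
n_contacts, n_samples = 10, 5000

# Hypothetical SEEG shaft: a large shared signal (common reference /
# volume-conducted activity) plus small contact-specific local activity.
shared = rng.standard_normal(n_samples)
local = 0.1 * rng.standard_normal((n_contacts, n_samples))
monopolar = shared[None, :] + local

# Bipolar re-referencing: subtract each contact from its neighbor along
# the shaft. The shared component cancels exactly; only local activity
# (the difference of adjacent local signals) remains.
bipolar = np.diff(monopolar, axis=0)
```

In this toy model every monopolar channel is dominated by the shared signal, whereas the bipolar channels are essentially uncorrelated with it - which is why bipolar derivations give a more spatially interpretable signal than ICA components localized to a single peak contact.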
(2) Another methodological concern is the use of GC to evaluate the directional causality between speech and neural signal. As noted in Bastos & Schoffelen (2015) and indeed acknowledged by the authors' own citation of Nolte et al. (2010), Granger Causality is highly sensitive to SNR imbalances and filtering artifacts. Given the inherent SNR disparity between a cleanly extracted acoustic envelope and noisy SEEG data, coupled with the known distortions introduced by distinct filtering pipelines (Barnett & Seth, 2011), the GC results may reflect methodological artifacts rather than true physiological driving.
(3) The third concern is the study's exclusive reliance on linear metrics applied to the envelopes of band-filtered speech and neural signals, e.g., linear Granger Causality and cross-correlations. The human auditory system is an inherently non-linear dynamical system. Complex acoustic features, such as rapid spectrotemporal transitions or dynamic pitch trajectories, often drive non-linear neural responses and complex phase-locking behaviors. While the linear models provide strong interpretable results, by restricting their connectivity and directionality metrics to linear autoregressive models, the authors may be missing substantial non-linear interactions, or conversely, forcing a linear fit onto non-linear data, which can distort estimations of causality and temporal lags. The authors should consider explicitly addressing this limitation in their discussion. Ideally, they should validate their core directional claims on a subset of the data using an information-theoretic, non-linear metric (e.g., Transfer Entropy or Mutual Information), or apply linear methods to nonlinearly abstracted features (e.g., phonemic, syllabic, intonational-level features), to ensure their linear assumptions are not masking or misrepresenting the true underlying dynamics.
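The non-linear validation suggested in point (3) need not be elaborate. Below is a minimal binned transfer-entropy estimator in numpy (illustrative only: the discretization, the toy driven system, and all names are hypothetical, and a real application would need bias correction, lag selection, and surrogate testing):

```python
import numpy as np

def joint_entropy(*seqs):
    """Shannon entropy (bits) of the joint distribution of discrete sequences."""
    arr = np.stack(seqs, axis=1)
    _, counts = np.unique(arr, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def transfer_entropy(x, y, bins=3):
    """Binned TE(X -> Y) = H(Y_t+1 | Y_t) - H(Y_t+1 | Y_t, X_t)."""
    def disc(v):
        # Equal-occupancy binning into `bins` discrete states.
        edges = np.quantile(v, np.linspace(0, 1, bins + 1)[1:-1])
        return np.digitize(v, edges)
    xd, yd = disc(x), disc(y)
    yn, yc, xc = yd[1:], yd[:-1], xd[:-1]
    return (joint_entropy(yn, yc) - joint_entropy(yc)
            - joint_entropy(yn, yc, xc) + joint_entropy(yc, xc))

# Toy directed system: y is driven by x with a one-sample lag,
# while x evolves independently.
rng = np.random.default_rng(0)
x = rng.standard_normal(20000)
y = np.zeros_like(x)
for t in range(len(x) - 1):
    y[t + 1] = 0.4 * y[t] + 0.8 * x[t] + 0.1 * rng.standard_normal()

te_xy = transfer_entropy(x, y)  # driver -> driven: substantial
te_yx = transfer_entropy(y, x)  # reverse direction: near zero
```

Unlike linear Granger causality, this estimator makes no autoregressive assumption, so agreement between the two on a data subset would strengthen the directional claims; disagreement would flag the linearity concern raised above.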