Introduction

Theta and gamma oscillations in the human auditory cortex have been proposed to reflect speech encoding and parsing (Giraud & Poeppel, 2012). However, the origin and function of these rhythms remain heavily debated (Atanasova et al., 2025; Gourévitch et al., 2020; Lalor & Nidiffer, 2025). A prominent framework proposes that cross-frequency coupling —where gamma (∼25–150 Hz) amplitude is locked to the phase of theta (∼2–8 Hz; phase-amplitude coupling, PAC)— enables hierarchical parsing of continuous speech by coordinating representations of linguistic units at different timescales (Baroni et al., 2020; Hovsepyan et al., 2020; Hyafil, Fontolan, et al., 2015; Kösem et al., 2016; Mai et al., 2016; Meyer et al., 2020; Morillon et al., 2010). However, an important ambiguity in this framework has remained unaddressed: are these neural dynamics intrinsic oscillatory mechanisms for remapping speech into internal representations, or do they mirror the rhythmic structure of the speech signal itself? The question matters because speech is not just any environmental signal, but one shaped by humans, through both evolution and development, to be efficiently produced and perceived.

Research over the past two decades has extensively explored the auditory theta neural timescale (2–8 Hz) and its putative role in speech perception (Doelling et al., 2014; Luo & Poeppel, 2007; Oganian & Chang, 2019; Park et al., 2015; Peelle et al., 2013; Pefkou et al., 2017; Schmidt et al., 2023). This frequency range aligns with the primary temporal modulation in speech, approximating the syllabic rate (Coupé et al., 2019; Ding et al., 2017; Giroud et al., 2023, 2024; Greenberg et al., 2003; Kendall, 2013; Pellegrino et al., 2011; Varnet et al., 2017). Owing to speech variability, it also encompasses suprasyllabic prominences—such as stress, accentuation, intonation contours or prosodic phrase boundaries (Tilsen & Arvaniti, 2013; Varnet et al., 2017)—usually related to the neural delta rhythm (Chalas et al., 2024; Ghitza, 2017; Keitel et al., 2017; Rimmele et al., 2021). During speech listening, the phase of neural theta activity flexibly tracks the primary speech rhythm, which is believed to aid in segmenting continuous speech into syllabic units (Ding & Simon, 2014; Giraud & Poeppel, 2012; Keitel et al., 2018; Poeppel & Assaneo, 2020). A spiking network with weak intrinsic theta activity can signal syllable boundaries robustly across noise levels and speech rates (Hyafil, Fontolan, et al., 2015). Behavioral studies show that this tracking is crucial for comprehension (Giroud et al., 2023), which remains possible as long as the rhythm stays around the theta range (Ghitza, 2014; Lubinus et al., 2023). Thus, speech tracking within the boundaries of the theta neural scale is considered necessary, though not sufficient, for comprehension (Ahissar et al., 2001; Etard & Reichenbach, 2019; Kösem et al., 2023; Molinaro & Lizarazu, 2018; Pefkou et al., 2017).

In the auditory cortex, gamma activity has been associated with processing phonetic features and other fine-grained acoustic properties (Keshishian et al., 2023; Kösem et al., 2016; Kulasingham et al., 2020; Marchesotti et al., 2020; Mesgarani et al., 2014; Nourski et al., 2015; Teng & Poeppel, 2020). However, exploration of auditory gamma dynamics remains limited compared to theta (Baroni et al., 2020; Giraud et al., 2007; Giroud et al., 2020; Gross et al., 2013; Morillon et al., 2012). Detecting higher-frequency activity (>25 Hz) with standard surface electroencephalography (EEG) is challenging due to low signal-to-noise ratio (SNR) and limited spatial sensitivity, often requiring MEG or intracranial EEG for more accurate measurements (Baillet, 2017; Mercier et al., 2022). Additionally, low-gamma (∼25–50 Hz) and high-gamma (>50 Hz; i.e., high-frequency activity; Leszczyński et al., 2020; Mukamel et al., 2005; Quyen et al., 2010) signals are often conflated. High-gamma reflects broadband activity linked to multi-unit firing, whereas low-gamma exhibits narrowband oscillatory properties (Buzsáki et al., 2012).

Current models propose that low-gamma activity in the auditory cortex enables fine-grained sampling of spectro-temporal details required for phonetic processing (Ghitza, 2011; Giraud & Poeppel, 2012). This “fixed sampling” hypothesis posits that phonetic features are best encoded through an endogenous oscillatory process operating at a stable rate (∼25–50 Hz, in the low-gamma range), independently of the input dynamics. This mechanism is thought to work in tandem with theta-based syllabic parsing, via PAC, enabling efficient dual-timescale speech processing (Ghitza, 2011; Giraud & Poeppel, 2012; Hyafil, Fontolan, et al., 2015). In this framework, theta cycles adapt to the speech rhythm, while low-gamma activity represents a relatively steady sampling process optimized for extracting phonetic features within the syllabic structure (Doelling et al., 2014; Ghitza, 2013; Hyafil, Fontolan, et al., 2015).

While computational models support this dual-timescale framework (Baroni et al., 2020; Hovsepyan et al., 2020; Hyafil, Fontolan, et al., 2015), experimental evidence remains limited. Findings of intrinsic low-gamma activity in the (left) auditory cortex are mixed, with a few studies reporting it (Giraud et al., 2007; Morillon et al., 2010), while others do not (Groppe et al., 2013; Hillebrand et al., 2016; Keitel & Gross, 2016; López-Madrona, Trébuchon, Bénar, et al., 2024). Furthermore, no experimental study has clearly identified which acoustic or linguistic features engage these neural auditory oscillations, nor established how they are encoded through PAC.

Here we aim to experimentally address the origin and function of theta and gamma activity in speech processing by characterizing the quasi-rhythmic properties of speech acoustics and early auditory cortical activity, and assessing the functional relationship between them. Similar to the theta timescale —first identified as a neural rhythm during natural speech processing (Luo & Poeppel, 2007) and later recognized as mirroring the primary acoustic speech rhythm itself (Doelling et al., 2014; Howard & Poeppel, 2012; Oganian et al., 2023; Y. Wang et al., 2025)— we hypothesize that low- and high-gamma activity may also correspond to acoustic properties of speech, driving neural responses at these frequencies. Specifically, we ask: (i) Do theta and gamma dynamics in the auditory cortex correspond to intrinsic neural rhythms, or are they driven by the structure of the speech signal? (ii) Does theta-gamma coupling (PAC) in the auditory cortex also mirror the phase and frequency of speech acoustics? and (iii) If so, is this coupling a neural mechanism for hierarchical segmentation of speech?

We analyzed speech corpora from 17 languages as well as intracerebral (stereotaxic EEG, SEEG) recordings from the auditory cortex of 18 epilepsy patients, both at rest and while they listened to a naturalistic French narrative. Our findings reveal that (i) theta and gamma dynamics and their interaction (PAC) are inherent features of the speech envelope; (ii) PAC is a robust and specific acoustic signature of speech envelope across languages; (iii) these acoustic rhythms reflect syllabic rate (2–6 Hz), vowel features (30–50 Hz), and fundamental frequency (100–150 Hz); (iv) alpha-like (6–9 Hz) oscillations dominate the human auditory cortex at rest, whereas theta and gamma dynamics are absent; (v) theta and gamma responses during speech perception are directly and linearly driven by speech acoustics; (vi) these responses arise from distinct neural populations within overlapping regions, reflecting functional segregation; and (vii) directed connectivity reveals a gamma-to-theta influence, challenging the classical theta-to-gamma framework.

Thus, the early auditory cortex appears to act as a temporal demultiplexer, preserving the hierarchical temporal structure of speech for downstream linguistic processing.

Results

Theta-gamma phase-amplitude coupling (PAC) is a distinctive acoustic signature of speech across languages

We first performed a corpus analysis of 17 languages spanning multiple typologically different language families (mean duration per language = 5.85 hours; Table S1; see Methods). Across languages, the power spectrum of the speech envelope showed a prominent peak around 5 Hz (between 4 and 6.5 Hz across languages; Figure 1a), consistent with previous reports (Chang et al., 2025; Ding et al., 2017; Giroud et al., 2023, 2024; Varnet et al., 2017). At higher frequencies, it exhibited an exponential decay that was consistent across languages, with individual variations between 80 and 180 Hz, likely reflecting speakers’ fundamental frequencies (f0; Figure 1a). Notably, there was no power increase in the low-gamma range (30–50 Hz), but the exponential decay flattened in the 50–150 Hz range (Figure 1a). To estimate whether the speech envelope exhibits cross-frequency interactions, we computed phase-amplitude coupling (PAC; Figure 1b; Figure S1). We identified significant coupling between the primary speech rate (i.e., the phase at ∼2–6 Hz) and amplitudes at ∼30–50 Hz and ∼100–150 Hz (p<.05, surrogate tests based on temporal shifting of the amplitude signal), corresponding to theta, low-gamma, and high-gamma bands (Figure 1c). This pattern was present across individual languages but was absent in musical and environmental sounds (Figure S2), suggesting that it reflects universal features specific to speech. In addition, coupling in the 100–150 Hz range varied across languages, and the amplitude frequency exhibiting maximal (and significant) PAC correlated with the f0 of each speaker (Pearson correlation across languages; N=11; 6 languages showed no significant high-frequency (80–150 Hz) PAC: R=0.92, df=9, p<.001; Figure 1d).

Acoustic structure of the speech envelope across 17 languages.

a) Power spectrum between 2 and 12 Hz (top) and between 12 and 180 Hz (bottom) of the envelope of naturalistic, discourse-level speech, displayed for each language (gray lines) and averaged across languages (blue line). b) Significant phase amplitude coupling (PAC) comodulogram estimated across languages (N = 17), showing coupling between the phase (2-12 Hz) and amplitude (20-150 Hz) of the speech envelope (p<.05, surrogate tests). c) Example of speech envelope segment exhibiting strong PAC. From top to bottom: raw envelope, the same signal filtered in the 100–150 Hz, 30–50 Hz, and 2–6 Hz bands. The dashed line indicates the peak of the slow-frequency cycle, aligned with increased amplitude in the higher-frequency bands. d) Correlation across languages between the median fundamental frequency (f0) and the amplitude frequency exhibiting maximal and significant PAC (between 80 and 150 Hz). e) Averaged time-frequency response of the speech envelope aligned to consonant (left) or vocalic (right) onsets, extracted from 9 languages. f) Maximum of 30-50 Hz activity (N=9 languages) after phoneme onset for consonants (blue) and vowels (yellow). Black lines represent individual languages (paired t-test, *** p<.001). g) Example phrase exhibiting strong 30–50 Hz activity after vocalic onset. From top to bottom: sentence in French, phonetic transcription with onset-aligned phonemes, waveform, envelope, and time-frequency representation of the envelope. Arrows indicate fluctuations in the envelope generating the 30–50 Hz activity. h) Significant PAC comodulogram of the speech envelope of the 10-minute French stories used during neural recordings, uttered by (left) a male voice (median f0: 139 Hz) and (right) a female voice (median f0: 189 Hz; p<.05, surrogate tests).

To obtain further insights into the origin of the 30–50 Hz activity, we analyzed the time-frequency response of the speech envelope locked to phoneme onsets in 9 languages (Figure 1e; Table S1; see Methods). The analysis revealed a 30–50 Hz response following vocalic onsets —but not consonant onsets (paired t-test, maximum of 30–50 Hz activity after phoneme onset for consonants vs vowels: t(8)=5.15, p<.001; Figure 1f)— emerging ∼15 ms after vowel onset. Importantly, this activity reflected a transient amplitude increase, lasting ∼25 ms, rather than sustained oscillations (Figure 1g).

We next analyzed the two audio sets used during the SEEG recordings, each consisting of a 10-minute French story narrated by either a male or a female voice (see Methods). In both cases, the speech envelope showed consistent, significant coupling between the 30–50 Hz amplitude and the primary speech rate (∼5 Hz; p<.05, surrogate tests; Figure 1h). The male voice showed additional coupling in the 100–150 Hz range, consistent with its f0 (median f0: male voice: 139 Hz; female voice: 189 Hz). In both recordings, gamma-like components were maximal at the peak phase (∼0°) of the slow dynamics, indicating alignment with the maximum intensity of the primary speech rhythm (Figure S3).

The auditory cortex is dominated by alpha oscillations at rest, while theta and gamma dynamics mirror the speech envelope

SEEG activity was recorded from the auditory cortex of 18 epileptic patients (Figures 2a-b) during rest and passive listening to a 10-minute story (male voice; see Methods). Recordings covered medial, primary (TE1.0, TE1.2), and lateral higher-level (Superior Temporal Gyrus (STG); BA41/A42, BA22) auditory areas (Fan et al., 2016). Data from 23 electrodes (5 bilateral implantations; 305 channels) were analyzed using independent component analysis (ICA) to separate neural sources (SEEG-ICs; Figures 2c-d; Herreras et al., 2015; López-Madrona, Trébuchon, Bénar, et al., 2024; López-Madrona, Trébuchon, Mindruta, et al., 2024; Michelmann et al., 2018).

Identification of oscillatory and speech related neural sources.

a-e) Methods. a) Example of cerebral MRI scan (3D T1-weighted): axial cross section with reconstruction of the SEEG electrode position for patient #6. The location of each contact is represented with white rectangles. b) Example of monopolar recordings during resting state. c) Spatial profile of two SEEG-ICs (independent components) across the electrode shaft, representing their contribution to each contact. Both components have clear peaks (ICA weights) along the electrode, indicating the origin of the neural activity around these contacts. d) SEEG-IC traces during the same time period as panel (b). The first component (SEEG-IC 1) captures high-amplitude low-frequency oscillations from the contacts close to the surface, while SEEG-IC 2 is characterized by irregular activity, with no clear oscillatory pattern. e) Schematic of the approach to detect oscillations for SEEG-IC 1. The power spectrum is modeled as a combination of an aperiodic fit and oscillatory peaks. The relative power (yellow lines) of each oscillatory peak is measured as the difference between oscillatory (gray) and aperiodic (dashed) fits. f-g) Resting state. f) Relative power of all significant oscillatory peaks identified across patients during resting state. Points represent oscillatory peaks (N=507). g) Histogram of frequencies for all the oscillatory peaks in panel (f). h-m) Speech processing. h) Normalized power spectra (speech-rest)/rest, averaged across all components exhibiting significant effect (N=112). For all panels: bold lines represent power ratios significantly different from zero at the group level (p<.05, t-tests against zero, FDR-corrected). Error bars are s.e.m. i) Left: example of SEEG-IC with a significant Granger Causality (GC) from the speech envelope, but not in the opposite direction (black line represents statistical threshold: p=.05, surrogate tests). Right: Normalized power spectra across all components with a significant GC from the speech envelope (N=60). 
j) Left: example of one SEEG-IC with a significant phase-amplitude coupling (PAC) between the phase of theta and the gamma amplitude (p<.05, surrogate tests). Right: Normalized power spectra across all components with a significant theta-gamma PAC (N=28). k) Venn diagram of SEEG-ICs classified by their outcomes in the different analyses: non-responsive channels (N=193; in grey), significant power changes only (N=49; in blue), significant GC and power changes (N=35; in green), significant PAC, GC and power changes (N=25; in orange), and significant PAC and power changes (N=3; in orange). l) Histogram of oscillatory peaks during speech processing, for the different groups of SEEG-ICs identified in panel (k). Each group was considered independently, excluding those contained within smaller circles (e.g., the blue histogram represents the SEEG-ICs with significant power changes but no GC or PAC). m) Normalized power spectra for the different groups of SEEG-ICs. Aperiodic activity was corrected, with different fits between 1–28 Hz and 28–150 Hz (vertical black line).

Spectral analysis of SEEG-ICs at rest indicated a combination of aperiodic background activity and oscillatory components (Donoghue et al., 2020; Figure 2e). Most SEEG-ICs exhibited prominent alpha-like oscillations (6–9 Hz, N=191, Figures 2f-g; López-Madrona et al., 2024), along with their harmonics (12–18 Hz, N=151). Delta activity appeared in a subset of the data (∼2 Hz, N=21), and remaining peaks outside these ranges were sparse and of low power. Notably, resting-state activity showed no oscillatory peaks at theta or low-gamma rates.
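The peak-detection logic sketched in Figure 2e (an aperiodic fit plus oscillatory peaks) can be approximated as follows. This is a simplified stand-in for spectral parameterization (Donoghue et al., 2020), which fits Gaussian peaks iteratively rather than thresholding residuals; the Welch settings, frequency range, and `min_rel_power` threshold are placeholder values.

```python
import numpy as np
from scipy.signal import welch, find_peaks

def oscillatory_peaks(x, fs, fmin=1.0, fmax=30.0, min_rel_power=0.3):
    """Return (frequency, relative log-power) of peaks rising above an
    aperiodic 1/f-like background, fit as a line in log-log space."""
    f, pxx = welch(x, fs, nperseg=4 * fs)
    sel = (f >= fmin) & (f <= fmax)
    logf, logp = np.log10(f[sel]), np.log10(pxx[sel])
    # aperiodic fit: a straight line in log-log coordinates
    slope, intercept = np.polyfit(logf, logp, 1)
    residual = logp - (slope * logf + intercept)
    # oscillatory peaks: local maxima of power above the aperiodic fit
    idx, _ = find_peaks(residual, height=min_rel_power)
    return [(f[sel][i], residual[i]) for i in idx]
```

Applied to each resting SEEG-IC, such a routine would return the kind of peak distribution summarized in Figures 2f-g.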

To characterize neural responses to speech, we applied three complementary analyses. First, we examined spectral power changes during speech processing compared to rest. Significant power increases were observed in the delta/theta (1–6 Hz) and broadband gamma (30–150 Hz) ranges across ∼37% of SEEG-ICs (N=112), alongside suppression in the alpha/beta band (6–20 Hz; p<.05, (speech-rest)/rest ratio: individual permutation tests, FDR-corrected; Figure 2h; non-normalized spectra in Figure S4).

Second, we computed speech-brain (temporal) Granger Causality (GC) and found significant GC from the speech envelope in 60 SEEG-ICs, meaning speech significantly predicted brain activity (speech-driven SEEG-IC), with none showing the opposite pattern (p<.05, surrogate tests based on temporally shifted speech; Figures 2i and S4). All these speech-driven sources were also characterized by a significant power increase (p<.05, individual permutation tests), particularly in the delta/theta (1–6 Hz) and broadband gamma (30–150 Hz) ranges. Additionally, 37/60 SEEG-ICs showed reduced alpha/beta power (6–20 Hz; p<.05, individual permutation tests).
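Conceptually, temporal GC asks whether the past of the speech envelope improves prediction of the neural signal beyond the signal's own past. A minimal sketch with a pair of nested autoregressions, and the temporal-shift surrogate test described above, is given below; the model order and surrogate count are placeholders, not the values used in this study.

```python
import numpy as np

def granger_f(target, source, order=5):
    """F-statistic: do `source` lags improve prediction of `target`
    beyond `target`'s own lags? (bivariate temporal Granger causality)"""
    n = len(target)
    y = target[order:]
    own = np.column_stack([target[order - k : n - k] for k in range(1, order + 1)])
    both = np.column_stack([own] + [source[order - k : n - k] for k in range(1, order + 1)])
    Xr = np.column_stack([np.ones(len(y)), own])    # restricted model
    Xf = np.column_stack([np.ones(len(y)), both])   # full model
    rss_r = np.sum((y - Xr @ np.linalg.lstsq(Xr, y, rcond=None)[0]) ** 2)
    rss_f = np.sum((y - Xf @ np.linalg.lstsq(Xf, y, rcond=None)[0]) ** 2)
    df1, df2 = order, len(y) - Xf.shape[1]
    return ((rss_r - rss_f) / df1) / (rss_f / df2)

def granger_p(target, source, order=5, n_surr=100, seed=0):
    """Significance via surrogates that temporally shift the source signal."""
    rng = np.random.default_rng(seed)
    obs = granger_f(target, source, order)
    surr = np.array([
        granger_f(target,
                  np.roll(source, rng.integers(order + 1, len(source) - order)),
                  order)
        for _ in range(n_surr)
    ])
    return obs, (np.sum(surr >= obs) + 1) / (n_surr + 1)
```

Testing the opposite direction (brain-to-speech) simply swaps `target` and `source`.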

Finally, we analysed neural phase-amplitude coupling (PAC) and observed significant PAC between theta (peaking at ∼4 Hz) and gamma dynamics in 28 SEEG-ICs (p<.05, surrogate tests based on temporal shifting of the amplitude signal; Figures 2j and S4). Two distinct gamma bands were coupled with theta: a high-gamma component (peaking at ∼110 Hz) in all 28 SEEG-ICs, and a low-gamma component (peaking at ∼40 Hz) in 12 of them. The 28 SEEG-ICs also showed increased delta/theta (1–6 Hz) and broadband gamma (30–150 Hz) power (p<.05, individual permutation tests), with 25/28 also showing significant GC from the speech envelope, indicating direct modulation by speech.

Importantly, the three analytic approaches ranged from least to most selective: power change, GC, and PAC (Figure 2k). We thus classified SEEG-ICs into three groups: power changes only (N=49), GC without PAC (N=35), and PAC (N=28). Only 3 SEEG-ICs fell outside this nested structure, showing significant PAC and power changes but no GC (see above). As in the resting condition, all groups exhibited a dominant alpha-like rhythm and its first harmonic (Figure 2l). However, speech elicited additional, group-specific changes in power at other frequencies (Figure 2m). The PAC group was the only one with a significant theta power increase (p<.05, group-level t-tests between speech and rest, FDR-corrected; Figure 2m). All groups exhibited increased high-gamma power (∼110 Hz) and beta suppression (12–20 Hz). High-gamma enhancement peaked in the PAC group, while beta suppression was strongest in the power-changes group and extended into the alpha range (∼8 Hz), the dominant resting-state rhythm (see Figure 2g). Overall, these findings indicate that theta and gamma neural oscillations are directly driven by the speech envelope, inducing PAC between these rhythms in the corresponding neural sources.

Theta and gamma responses are spatially segregated in the auditory cortex

To further characterize how speech drives theta and gamma responses in the auditory cortex, we computed spectral GC between the speech envelope and the speech-driven SEEG-ICs (i.e., with significant temporal GC, N=60; see Figure 2i). Three main speech-driven frequencies emerged: theta (∼2–6 Hz), low-gamma (∼25–50 Hz), and high-gamma (∼100–150 Hz), with SEEG-ICs exhibiting one or more of these dynamics (Figure 3a). Three profiles (groups) of SEEG-ICs were differentiated, with 10 SEEG-ICs modulated only at theta, 26 only at (low and/or high) gamma, and 24 at both frequencies.

Spectral GC and neural PAC profiles.

a) Representative example of the three types of outcomes observed in the spectral GC analysis between the speech envelope and neural activity: theta (2–6 Hz; in blue), gamma (low 30–50 Hz and/or high 100–150 Hz; in yellow), or theta and gamma (low and/or high; in green) significant GC. Shaded areas indicate the significance threshold (surrogate tests). b) Average PAC modulation spectrum for the three groups of SEEG-ICs: theta (left; N=10), gamma (middle; N=26) and theta-gamma (right; N=24). c) PAC between the theta (2–6 Hz) phase and low- (30–50 Hz) or high-gamma (100–150 Hz) for each SEEG-IC group (** p<.01, Tukey’s HSD post-hoc test). d) Position of the 60 SEEG-ICs from the three groups on a 3D surface of the temporal lobe. The location of each SEEG-IC is defined by the contact with the maximal contribution in the spatial IC profile.

This last group of SEEG-ICs was characterized by stronger PAC (Figure 3b), at both low-gamma (one-way ANOVA: F(2,57)=5.4, p=.007; Tukey’s HSD: gamma vs. theta-gamma, p=.006; all other comparisons: p>.16) and high-gamma frequencies (one-way ANOVA: F(2,57)=5.7, p=.005; Tukey’s HSD: gamma vs. theta-gamma, p=.005; all other comparisons p>.10; Figure 3c). This indicates that theta and gamma dynamics —and their interaction (PAC)— are directly driven by the speech envelope.

Finally, as SEEG-ICs correspond to distinct neural sources, these three groups were spatially segregated along the auditory cortex (Figure 3d), mainly laterally, within area 41/42 of the superior temporal gyrus (STG), though no specific region across participants was associated with a particular dynamic (Table S2). This shows that theta and gamma responses occur in overlapping anatomical regions but are generated by distinct neural populations (SEEG-ICs), indicative of a functional segregation of these speech-driven neural dynamics.

High-gamma (100-150 Hz) response is a mixture of frequency-following and evoked activity

While theta and low-gamma activity during speech processing are driven by the speech signal, high-gamma (100–150 Hz) activity exhibits a more complex response profile. It peaked in SEEG-ICs with significant PAC, but was also present in SEEG-ICs showing only power changes —i.e., without significant speech-brain GC (Figure 2m). This indicates that a large portion —but not all— of the activity in this frequency range reflects the frequency-following response to the fundamental frequency of the male voice used in our primary stimulus (f0 = 139 Hz; Figure 1h, left panel). To dissociate these two sources of high-gamma activity, we recorded a second cohort of patients (N=2) during passive listening to the same male-narrated story and another story narrated by a female voice (f0 = 189 Hz; Figure 1h, right panel; see Methods).

We identified three speech-driven SEEG-ICs (N=3) with significant temporal GC in both auditory conditions. Two were classified as theta SEEG-IC, showing consistent modulation at the speech rate across stimuli (Figure 4a, top). The third was a gamma SEEG-IC (Figure 4a, bottom), exhibiting low-gamma modulation in both recordings. High-gamma modulation was also observed, but only for the male voice, consistent with a frequency-following response to f0 (see Figure 1h).

Relation between speech-brain coupling and fundamental frequency.

a) Comparison of neural responses to male and female recordings with representative examples of the two types of outcomes observed in the spectral GC analysis between the speech envelope (left: male voice; right: female voice) and neural activity: theta (2–6 Hz; top; in blue) and gamma (low 30–50 Hz and/or high 100–150 Hz; bottom; in yellow). Shaded areas indicate the significance threshold (surrogate tests). b) PAC modulation spectrum estimated between the phase of the speech envelope (male: left, female: right) and the amplitude of the gamma SEEG-IC from panel (a) (p<.05, surrogate tests).

We next analysed speech-brain PAC in the gamma SEEG-IC (Figure 4b). In both recordings, gamma amplitude was modulated at the speech rate (∼5 Hz), with dominant coupling in the low-gamma band (30–50 Hz) for both male and female voices. High-gamma PAC (100–150 Hz) was also significant in both conditions (p<.05, surrogate tests based on temporal shifting of the SEEG-IC), indicating speech-related high-gamma activity independent of f0. These results show that while part of the high-gamma response reflects evoked, f0-independent neural activity, high-gamma can also be linearly driven by speech acoustics when the fundamental frequency falls within its frequency range.

Gamma to theta PAC in the auditory cortex

While theta and gamma neural dynamics primarily mirror speech acoustics, they are spatially segregated in the auditory cortex (Figure 3d). To investigate directed functional connectivity between the speech-driven SEEG-ICs (i.e., with significant temporal GC, N=60), we computed the cross-frequency directionality (CFD) index, which estimates whether the phase of a slow signal drives the amplitude of a faster one or vice versa. We first computed the CFD on the speech envelope and found no significant directionality (all p>.05, surrogate tests based on temporal shifting of the amplitude signal), indicating that slow (∼2–6 Hz) and fast (∼30–150 Hz) envelope dynamics evolve largely synchronously.

Then, we estimated the speech-brain CFD comodulogram across all speech-driven SEEG-ICs with significant gamma modulation (N=50; see Figure 3a), and confirmed that the main speech rate (∼4 Hz) primarily modulates low-gamma (30-50 Hz; one-sample t-test against zero: t(49)=6.5, p<.001) and high-gamma (100-150 Hz; t(49)=6.8, p<.001) neural activity (Figure 5a). Additionally, analysis of the speech-driven SEEG-ICs with significant theta modulation (N=34; see Figure 3a) showed that neural theta activity (∼4 Hz) is primarily modulated by acoustic dynamics of the speech envelope at 30-50 Hz (t(33)=-8.82, p<.001) and 100-150 Hz (t(33)=-10.57, p<.001, Figure 5b). As slow (∼4 Hz) and fast (∼30–50 and ∼100-150 Hz) dynamics are strongly coupled in the speech envelope, this result corroborates the spectral GC analysis, which already revealed linear, directed speech-brain functional connectivity at theta, low-gamma, and high-gamma frequencies (Figure 3a).

Directed speech-brain and brain-brain PAC.

a) Left: Averaged speech-brain cross-frequency directionality (CFD) comodulogram, estimated between the phase of the speech envelope and the amplitude of speech-driven gamma SEEG-ICs (N=50). Right: Average CFD between theta phase (2–6 Hz) and low- (30–50 Hz) or high-gamma (100–150 Hz) amplitude. b) Speech-brain CFD estimated between the phase of speech-driven theta SEEG-ICs (N=34) and the amplitude of the speech envelope. c) Inter-source CFD averaged across electrodes (N=15). For each electrode a single CFD map was computed as the average of all its speech-driven pairs. d) Intra-source CFD averaged across electrodes (N=16). Points represent (a,b) individual SEEG-ICs or (c,d) electrodes. Positive CFD values indicate phase-to-amplitude directionality; negative values indicate the reverse. One-sample t-test against zero: ** p<.01, *** p<.001.

We then assessed functional directionality between neural sources by computing CFD between all pairs of speech-driven SEEG-ICs (N=60; see Figure 2i) per electrode (N=15; 8 electrodes were excluded due to having N < 2 significant speech-driven SEEG-ICs). CFD maps were averaged within each electrode shaft, excluding within-source analysis. Surprisingly, gamma amplitude consistently drove theta phase across electrodes, for both low-gamma (t(14)=-3.36, p=.005) and high-gamma (t(14)=-4.17, p<.001) activity (Figure 5c). Finally, intra-source CFD within theta–gamma SEEG-ICs (averaged CFD for each electrode with theta-gamma SEEG-ICs, N=16) revealed the same pattern: gamma amplitude consistently drove theta phase, for both low-gamma (t(15)=-3.57, p=.003) and high-gamma (t(15)=-3.72, p=.002) activity (Figure 5d). Together, these results show that gamma activity—both low and high— drives theta phase, both within and across neural sources, suggesting that auditory theta activity is shaped by gamma activity, rather than the reverse.
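The sign convention above (positive: phase drives amplitude; negative: amplitude drives phase) can be illustrated with a simplified proxy. CFD itself is based on a phase-slope estimate between the low-frequency signal and the high-frequency amplitude envelope; the sketch below instead reads the sign of the lag that maximizes the cross-correlation between the theta-band signal and the gamma amplitude envelope. Bands, filter order, and the lag window are placeholder choices, not the parameters used in this study.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def bandpass(x, lo, hi, fs, order=4):
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def theta_gamma_lag(x, fs, theta=(2, 6), gamma=(30, 50), max_lag_s=0.1):
    """Lag (s) maximizing the correlation between the theta-band signal and
    the gamma amplitude envelope. Negative lag: gamma amplitude leads theta
    (amplitude-to-phase directionality); positive: theta leads gamma."""
    th = bandpass(x, *theta, fs)
    env = np.abs(hilbert(bandpass(x, *gamma, fs)))
    th = (th - th.mean()) / th.std()
    env = (env - env.mean()) / env.std()
    max_lag = int(max_lag_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    cc = [np.mean(th[:len(th) - l] * env[l:]) if l >= 0
          else np.mean(th[-l:] * env[:len(env) + l]) for l in lags]
    return lags[int(np.argmax(cc))] / fs
```

On a synthetic signal in which gamma bursts precede theta peaks, this proxy returns a negative lag, mirroring the negative CFD values reported in Figures 5c-d.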

Finally, we estimated the latency of neural responses by computing frequency-resolved linear speech-brain cross-correlation analyses (Figure S5). For each speech-driven SEEG-IC (N=60), we computed the lag between the speech envelope and (i) the theta-filtered neural activity (2–6 Hz) for theta SEEG-ICs (N=33 SEEG-ICs showing a significant cross-correlation with the speech envelope, see Methods), and (ii) the envelope of low- (30–50 Hz; N=22) or high-gamma (100–150 Hz; N=29) signals for gamma SEEG-ICs. Delays in theta SEEG-ICs were highly variable (range: 8–205 ms; mean ± s.d.: 78.2 ± 63.6 ms). In contrast, delays were more consistent in the gamma bands (low-gamma: 58.8 ± 9.4 ms; high-gamma: 23.2 ± 33.2 ms). High-gamma responses occurred significantly earlier than both theta (one-way ANOVA: F(2,81)=11.82, p<.001; Tukey’s HSD: theta vs. high-gamma, p<.001) and low-gamma (p=.017), with no significant difference between theta and low-gamma responses (p=.26), indicating that higher-frequency activity is processed earliest in the auditory cortex.
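The latency estimate can be sketched as a lagged correlation between the speech envelope and a band-limited neural signal; the sampling rate and maximum lag below are placeholder values.

```python
import numpy as np

def best_lag(envelope, neural, fs, max_lag_s=0.3):
    """Lag (s) at which the speech envelope best correlates with the neural
    signal, assuming the neural response trails the stimulus."""
    x = (envelope - envelope.mean()) / envelope.std()
    y = (neural - neural.mean()) / neural.std()
    max_lag = int(max_lag_s * fs)
    lags = np.arange(0, max_lag + 1)  # neural activity follows the stimulus
    cc = [np.mean(x[:len(x) - l] * y[l:]) for l in lags]
    return lags[int(np.argmax(cc))] / fs, float(np.max(cc))
```

In practice the neural input would be the theta-filtered signal (for theta SEEG-ICs) or the low-/high-gamma amplitude envelope (for gamma SEEG-ICs), yielding one delay per source and band.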

Discussion

Acoustic origin of theta and gamma dynamics in early auditory cortex

This study demonstrates that multiplexed theta–gamma dynamics in the human auditory cortex during speech perception do not reflect an intrinsic neural code of the auditory cortex but mirror acoustic features present in the speech signal (Figure 6). Because speech is a highly evolved human signal, these features are likely matched to the processing capacities of the auditory cortex, without relying on neural oscillations present at rest. Using SEEG recordings, we identified multiple auditory neural sources that are selectively activated by natural speech and remain silent at rest (Figure 2). Each source is driven by frequency-specific components of the speech envelope that approximate distinct linguistic features and are robust across languages: syllabic rate (theta), vowel energy (low-gamma), and/or fundamental frequency (f0; high-gamma; Figures 1 and S1). These acoustic features are specific to speech and absent in musical and environmental sounds (Figure S2), underscoring the unique temporal structure of speech. Crucially, phase-amplitude coupling (PAC) between these frequencies —often interpreted as a mechanism for hierarchical parsing— is also present in the speech signal itself, reflecting its nested acoustic architecture (Figure 1). Each speech-driven source within the same auditory region is feature-selective, indicating functional segregation (i.e., demultiplexing; Figure 3). The sources are functionally connected via directed PAC, with gamma activity —both frequency-following and evoked (Figure 4)— driving theta phase (Figure 5), a reversal of the dominant model. Overall, these results challenge the interpretation of auditory cortical PAC as an endogenous mechanism for speech parsing.

Figure 6. Diagram summarizing the relationship between the speech signal and early auditory cortex dynamics during speech perception.

The green box (Granger Causality) shows brain dynamics that mirror the speech envelope, linearly driven by it, indicated by directional arrows. In the purple box (Power), up/down vertical arrows reflect increases or decreases in power during speech perception compared to rest across all neural sources (SEEG-ICs). The blue box (Cross-Frequency Directionality, CFD) depicts internal connectivity between brain dynamics associated with phase–amplitude coupling (PAC).

From temporal hierarchies to parallel multiplexing

Speech encodes multiple layers of information concurrently (Coupé et al., 2019; Rosen, 1992)—such as speaker identity, pitch (f₀), timbre, prosody, semantics, and syntax. These layers have been shaped by evolution, from the cycles of mandibular oscillation in the articulatory system to the development of motor control circuits in the brain (Fitch, 2000; MacNeilage, 1998). This has resulted in dynamical patterns unique to human speech. While linguistic theory has long described a hierarchical organization of units (features, phonemes, syllables, morphemes, prosodic words, phrases, sentences), whether these are systematically embedded in a dynamically structured acoustic signal has not been quantitatively assessed.

Brain rhythms are also hierarchically organized, aligning excitability across timescales to optimize coding efficiency (Bragin et al., 1995; Buzsáki & Draguhn, 2004; Freeman & Rogers, 2002; Lakatos et al., 2005; Lisman & Idiart, 1995; Pöppel, 1997; Steriade, 2006). In particular, the theta–gamma neural code (Canolty & Knight, 2010; Lisman & Jensen, 2013) —though primarily studied in the hippocampus— proposes a general encoding mechanism, whereby theta oscillations structure the timing of gamma bursts, representing discrete units of information.

Building on the isomorphism between brain rhythms and speech features (e.g., theta/syllables, gamma/phonetic features; Poeppel, 2001, 2003), Giraud & Poeppel (2012) extended this framework to speech, proposing that such correspondence supports the chunking and tracking of speech units in time. The nested alignment of fast (gamma) and slow (theta) cortical rhythms has been widely observed during speech processing (Attaheri et al., 2022; Gross et al., 2013; Hyafil, Fontolan, et al., 2015; Koutsoukos et al., 2013; Lizarazu et al., 2019, 2023; Mai et al., 2016; Meyer et al., 2020; Morillon et al., 2012; J. Wang et al., 2014). Such multiplexed temporal coding has gained support as a biologically plausible mechanism for mapping speech’s multiscale structure onto neural dynamics. By shifting focus from categorical linguistic units toward the temporal dimension (Norman-Haignere et al., 2025), it offers a unified framework in which hierarchically embedded speech units are organized within a shared temporal architecture.

However, the common temporal structure of speech and brain rhythms raises a critical question: does phase-amplitude coupling in the auditory cortex emerge de novo in the early cortical stages of speech processing for re-mapping speech from acoustics to other representational forms suited to neural processing, as proposed in other cognitive functions, or is it evoked by the temporal structure of the acoustic signal itself?

Revisiting the role of PAC in speech processing

This question has not been systematically, quantitatively, and mechanistically examined, and the field has largely assumed a neural origin. However, our findings reveal that PAC also characterizes the speech envelope itself, exposing cross-frequency structure not visible in the power spectrum alone (Figure 1a-b). This suggests that the nested dynamics often attributed to endogenous auditory parsing in fact arise from the stimulus.

While our results are compatible with the multiplexing described by the multi-timescale hypothesis (Baroni et al., 2020; Giraud & Poeppel, 2012; Giroud et al., 2020; Teng et al., 2016, 2017; Teng & Poeppel, 2020), they challenge the assumption that nested brain rhythms reflect endogenously generated parsing mechanisms in early auditory cortex. Here, we show that theta–gamma coupling is explained by the acoustic features of speech, and that the early auditory cortex mirrors a highly structured input rather than generating internal windows for segmentation. This reframes the idea that the auditory cortex operates on multiple timescales: theta, low-gamma, and high-gamma reflect syllabic rate, vowel onsets, and f₀, respectively (Giroud et al., 2024; Kulasingham et al., 2020). These features are already dissociable in the stimulus and processed by distinct but spatially overlapping neural populations, supporting the idea of parallel streams for spectrotemporal parsing (Poeppel, 2003). This challenges the prevailing view that PAC in the auditory cortex supports endogenous hierarchical parsing —as observed in the hippocampus (Bragin et al., 1995; Colgin et al., 2009; López-Madrona et al., 2020)— and instead suggests that it reflects the brain’s alignment to a temporal code already embedded in speech.

An important finding is that we did not detect oscillatory activity at rest, which is consistent with the lack of consensus in the literature regarding intrinsic auditory rhythms. Across studies, proposed resting-state cortical auditory oscillations span a wide frequency range: delta (Armonaite et al., 2022; Frauscher et al., 2018; Groppe et al., 2013; Kalamangalam et al., 2020; Mellem et al., 2017), theta (Giraud et al., 2007; Morillon et al., 2012), beta (Ramkumar et al., 2014), or low-gamma (Giraud et al., 2007; Groppe et al., 2013; Hillebrand et al., 2012; Morillon et al., 2010). Some studies report endogenous activity across nearly all frequency bands (Keitel & Gross, 2016; Lubinus et al., 2021; Neymotin et al., 2022). However, the most consistently reported and replicable finding across studies is a prominent alpha-like peak —often referred to as the auditory “tau” rhythm (Capilla et al., 2022; Frauscher et al., 2018; Groppe et al., 2013; Hillebrand et al., 2012; Kalamangalam et al., 2020; Keitel & Gross, 2016; Lehtelä et al., 1997; López-Madrona, Trébuchon, Bénar, et al., 2024; Mahjoory et al., 2020; Niedermeyer, 1990; Ramkumar et al., 2014; Tiihonen et al., 1991)— in line with our own observations (Figure 2). This variability, along with the methodological difficulty of disentangling true oscillatory peaks from broadband 1/f aperiodic activity (Donoghue et al., 2020), casts doubt on the notion that theta and gamma reflect stable intrinsic rhythms in the human auditory system.

However, the absence of prominent oscillatory activity at rest does not imply a lack of resonance properties. Instead, latent capacities of the auditory cortex may manifest as a preferential sensitivity to specific frequencies during auditory stimulation. This is the case of auditory steady-state responses (ASSRs), in which cortical circuits preferentially entrain to periodic auditory inputs typically around 40 Hz (Arnal et al., 2019; Galambos et al., 1981; Mäkelä & Hari, 1987; Picton et al., 2003). ASSRs are not just a series of discrete evoked potentials, but rather a response to the periodicity of the stimulus, where the phase of neural activity locks to the stimulus modulation frequency. They illustrate the resonance properties of auditory circuits —namely, the frequency preferences of neural populations shaped by neural connectivity and synaptic time constants (Hutcheon & Yarom, 2000). Thus, resonance differs from spontaneous oscillations in that it does not reflect the system’s intrinsic dynamics, but rather its capacity to preferentially respond to external input at specific frequencies (Giroud et al., 2020; Teng et al., 2016, 2017; Teng & Poeppel, 2020, Lehongre et al., 2011).

In this context, the observed mirroring of both low- and high-gamma speech features by auditory cortex aligns with findings from the frequency-following response (FFR) literature. FFRs —traditionally measured in subcortical structures but increasingly documented cortically— reflect phase-locked responses to periodic stimulus features such as the fundamental frequency or high-rate temporal modulations (Bidelman, 2018; Coffey et al., 2019; Nourski, 2017). Importantly, intracranial and high-resolution imaging data demonstrate that human auditory cortex can phase-lock to acoustic frequencies up to ∼130 Hz (Brugge et al., 2009; Nourski & Brugge, 2011; Steinschneider et al., 2013), overlapping with the gamma-band responses we identify here. This supports the interpretation that both low- and high-gamma PAC in the auditory cortex reflect frequency-following responses to acoustic modulations, rather than internally generated sampling mechanisms (Lerousseau et al., 2021). Moreover, the distinct spatial patterns of gamma and theta SEEG sources are consistent with the spatial organization of FFRs and temporal envelope tracking described in auditory cortex (Nourski, 2017), suggesting that different neuronal ensembles are tuned to resonate to specific spectrotemporal features of speech.

More broadly, phase–amplitude coupling is a well-established organizational principle in the brain, supporting the compression and multiplexing of information across timescales (Akam & Kullmann, 2012; Canolty et al., 2006; Hyafil, Fontolan, et al., 2015). Previous hypotheses have adapted this mechanism to the context of speech processing. Accordingly, PAC would align perception with the richly structured input while operating within the temporal resolution and bandwidth constraints of cortical circuits. By aligning multiscale acoustic features into dedicated oscillatory neural channels, PAC may enhance both perceptual resolution and neural efficiency — a principle that underpins computational models designed to match neural auditory dynamics (Baroni et al., 2020; Dogonasheva et al., 2025; Giraud & Poeppel, 2012; Hovsepyan et al., 2020, 2023; Hyafil, Fontolan, et al., 2015). Our findings question this view as we show that PAC is present in the structure of the speech stimulus itself. Auditory neural PAC may thus not be a mechanism to transform the acoustic structure of speech into linguistic units, but may represent nested temporal regularities already embedded in the signal. This aligns with theoretical work suggesting that the rhythmic structure of speech arises from the biomechanical and cognitive constraints of speech production (O’Dell & Nieminen, 1999; Tilsen, 2016). Rather than generating hierarchical parsing schemes de novo, the auditory cortex may inherit a structured input shaped by the audio-phonological dynamics, mapping it onto internal representations that are both perceivable and reproducible.

We conjecture that PAC reflects the alignment of distributed neural populations to shared temporal anchors. The auditory cortex may thus function as a temporal ‘demultiplexer’, selectively amplifying and routing speech features across oscillatory channels tuned to biologically relevant timescales. This perspective invites a re-evaluation of neural coupling in speech processing. Rather than implementing linguistic parsing, theta–gamma PAC may serve as a biologically efficient mechanism for synchronizing perception to a highly evolved, structured signal. Future work should identify how and where this temporal scaffolding supports the formation of abstract linguistic representations —likely beyond the auditory cortex.

Directionality of PAC: from gamma to theta

Traditional models of PAC assume that slower rhythms, such as theta, modulate the amplitude of faster gamma activity, creating periodic “windows” of heightened excitability for sensory encoding (Giraud & Poeppel, 2012; Gross et al., 2013; Hyafil, Fontolan, et al., 2015; Lakatos et al., 2008; Lisman & Jensen, 2013; Lizarazu et al., 2019; Mai et al., 2016). In this framework, theta phase gates gamma activity, supporting hierarchical parsing. However, this directionality has rarely been empirically tested, and emerging work has challenged its universality (Baroni et al., 2020; Chalkiadakis et al., 2025; Jiang et al., 2015, 2020; López-Madrona et al., 2020).

Our results provide evidence that low- and high-gamma activity systematically precedes and modulates theta phase, both across distinct SEEG sources and within single neural populations (Figure 5). This gamma-to-theta directionality suggests that fast spectrotemporal features in speech—particularly in the low- and high-gamma ranges—drive early neural responses that in turn align the phase of theta oscillations to incoming input. In this view, gamma responses reflect rapid sensory encoding, while theta dynamics serve to temporally coordinate slower integrative processes, such as syllabic integration.

This finding is consistent with recent proposals of a reversed hierarchy in auditory processing (Baroni et al., 2020; Kösem & van Wassenhove, 2017; Xu et al., 2023), where high-gamma power anticipates theta rhythms and better tracks fast-varying speech features. Our data support this temporal asymmetry, with high-gamma exhibiting shorter latencies than theta (Figure S5), and decoding analyses pointing to a primary auditory origin for gamma responses (Xu et al., 2023). While an alternative explanation could be that theta and gamma are independently entrained by the same stimulus with differing delays, the presence of gamma-to-theta directionality within the same SEEG-ICs (i.e., the same neural source) argues against this account (Hyafil, Giraud, et al., 2015). Taken together, these results call for a re-evaluation of the mechanisms underlying theta–gamma PAC in speech processing, shifting focus from theta-based parsing or segmentation to gamma-driven alignment.

Roles of gamma-nested dynamics

The 30–50 Hz component of the speech envelope is closely linked to phonemic features and is essential for speech intelligibility (Chait et al., 2015; Shannon et al., 1995). Our findings show that the auditory cortex linearly follows this frequency range (Figure 2a). However, gamma dynamics have also been associated with higher-level linguistic processing and the transformation of acoustic input into linguistic meaning (Di Liberto et al., 2015; Mesgarani et al., 2014). Additionally, while 30–50 Hz energy in the speech envelope is phase-locked to the syllabic rate—a pattern conserved across languages (Figure 4)—the corresponding PAC in the brain is stronger for intelligible than reversed speech (Gross et al., 2013). These findings raise the question of how much gamma activity reflects acoustic encoding versus cognitive processing in the early auditory cortex.

Our data confirm that at least two gamma regimes are at play in auditory cortex during speech processing. The first, captured by our recordings, reflects an early stage of acoustic encoding (Xu et al., 2023). It mirrors low-level features such as phoneme energy and f₀ in a linear, stimulus-driven fashion, and it couples with theta phase at the syllabic rate (Figures 3b and 4). This gamma activity mirrors input dynamics: increasing or decreasing speech rate shifts the gamma frequency proportionally (Lizarazu et al., 2019). In turn, this gamma input may reset the phase of theta oscillations, aligning slower rhythms with syllabic structure. Although this low-gamma regime has an acoustic origin (Howard & Poeppel, 2010), some evidence suggests it may also contribute to speech comprehension. The auditory cortex exhibits a preferential response for 30–50 Hz signals (Teng & Poeppel, 2020), and gamma and theta components often arise from distinct neural populations (Figure 2a). Moreover, gamma activity may be modulated by top-down influences, including delta–gamma coupling (Fontolan et al., 2014). During degraded or unintelligible speech, gamma amplitude and its modulation of theta phase are both reduced (Gross et al., 2013; Lizarazu et al., 2023), impairing theta tracking of the input (Etard & Reichenbach, 2019; Vanthornhout et al., 2018). In children with autism, theta-gamma coupling is absent and replaced by an atypical beta-gamma coupling, showing that neural coupling phenomena are under neuromodulatory influence (X. Wang et al., 2023).

Beyond this early encoding stage, a second gamma regime may operate in higher-order cortical areas (Di Liberto et al., 2015; Forseth et al., 2020; Hamilton et al., 2018). This activity, which may not couple to theta, could reflect internally generated activity that supports higher-level linguistic or semantic operations. Thus, gamma activity in speech processing likely spans a functional continuum —from early sensory tracking to abstract computation— with the presence or absence of PAC distinguishing these stages.

Conclusion

Our results challenge the prevailing view that theta–gamma phase–amplitude coupling (PAC) in the auditory cortex reflects endogenously generated speech processing mechanisms. Instead, we show that these dynamics largely mirror the temporal structure of the speech signal itself. Theta, low-gamma, and high-gamma activity mirror distinct acoustic features — syllabic rate, vowel-related energy, and f₀— embedded in the speech envelope. This coupling is exogenous, robust across languages, and could be shaped by audio-motor constraints on speech production. This points to a likely universal, biologically rooted mechanism that underpins the human-specific communication niche that is speech.

Crucially, gamma activity leads theta phase, suggesting an active role in aligning slower rhythms with fast-changing acoustic inputs. We demonstrate that these rhythms are processed by distinct but interconnected cortical populations, implying demultiplexing and functional coordination rather than passive responses. While PAC may support speech comprehension, our findings emphasize the need to methodically disentangle intrinsic neural activity from stimulus-driven dynamics. Altogether, these results call for a revised framework: PAC in the auditory cortex mostly reflects the brain’s response to an already-structured signal, not its internal construction.

Methods

Audio corpus

We analyzed audio recordings from three corpora of naturalistic sounds: speech, music, and environmental sounds. The speech corpus is composed of verses of the New Testament from 17 different languages (www.faithcomesbyhearing.com). We selected versions of the recordings that contained only plain speech, without any sound effects (Table S1).

The music corpus comes from the “GTZAN Dataset – Music Genre Classification” (GTZAN Dataset – Music Genre Classification, n.d.), “the MNIST of sounds”: a collection of 10 musical genres comprising blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock. Each genre contains 100 audio files, each with a length of 30 s, sampled at 22.05 kHz.

The corpus of environmental sounds comes from the “Making Sense of Sound” classification challenge (Kroos et al., 2019), containing a broad range of sounds from everyday life. This corpus is composed of five categories: nature, human, music, effects, and urban, with 300 sounds in each category. All samples were 5 s long, sampled at 44.1 kHz, and peak-normalized.

The audio files of each corpus were converted from multi-channel to single-channel and resampled to 16 kHz. Each file was then cut into a 3-s-long segment, and its amplitude envelope was extracted. The amplitude envelope corresponds to the slow overall amplitude fluctuations of the signal over time. It was computed with custom MATLAB scripts developed for speech signal analysis (Gilbert & Lorenzi, 2006). The sound signal was decomposed into 32 narrow frequency bands using a cochlear model, and the absolute value of the Hilbert transform was computed for each of these narrowband signals. The broadband temporal amplitude envelope resulted from the summation of these band-pass-filtered signals and was downsampled to 512 Hz.
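The envelope-extraction pipeline above (narrowband decomposition, Hilbert magnitude, summation, downsampling) can be sketched as follows. This is not the Gilbert & Lorenzi MATLAB code: brick-wall FFT bands stand in for the cochlear filterbank, and the resampling is a plain linear interpolation; both are illustrative simplifications.

```python
import numpy as np

def broadband_envelope(x, fs, n_bands=32, fmin=80.0, fmax=7800.0, fs_out=512):
    """Sum of narrowband Hilbert envelopes, resampled to fs_out.
    Brick-wall FFT bands stand in for the cochlear filterbank."""
    X = np.fft.fft(x)
    freqs = np.fft.fftfreq(x.size, 1.0 / fs)
    edges = np.geomspace(fmin, fmax, n_bands + 1)
    env = np.zeros(x.size)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Analytic signal of one band: keep positive frequencies in [lo, hi)
        mask = (freqs >= lo) & (freqs < hi)
        env += np.abs(np.fft.ifft(2 * X * mask))
    # Resample the (slowly varying) summed envelope to fs_out
    t_out = np.arange(0, x.size / fs, 1.0 / fs_out)
    return np.interp(t_out, np.arange(x.size) / fs, env)

# Example: a 3 s, 440 Hz tone at 16 kHz yields a flat envelope of ~1 at 512 Hz
fs = 16000
t = np.arange(3 * fs) / fs
env = broadband_envelope(np.sin(2 * np.pi * 440 * t), fs)
```

The band edges (80–7800 Hz, logarithmically spaced) are assumed values chosen for the example.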

For 9 languages of the speech corpus, we obtained the precise timing of each phoneme present in the audio files (Table S1). To do so, we first obtained the transcript of each audio file using the speech-to-text model whisper-large-v3 from OpenAI, available through the Hugging Face platform (Radford et al., 2022). We then used the WebMAUS online tool (Kisler et al., 2017) to force-align the sounds and transcripts, obtaining for each audio file a TextGrid file with the timestamps of each phoneme. For languages available in the MaSS corpus (Zanon Boito et al., 2020), the TextGrid files were directly retrieved without the need for the previous steps.

For each language and each phoneme, the corresponding amplitude envelope time series were segmented into epochs spanning from 1 second before to 1 second after the phoneme of interest. Epochs corresponding to the same phoneme were then grouped together and subjected to time-frequency decomposition. Distinct phonemes were further categorized based on whether they belonged to the vowel or consonant group.

Participants

Two cohorts of patients with pharmacoresistant epilepsy underwent stereo-electroencephalographic (SEEG) recordings during their presurgical evaluation at Hôpital de la Timone (Marseille, France). The first cohort included 18 patients (10 females), while the second comprised 2 patients (1 female). Unless otherwise stated, all analyses were performed on the first cohort. All patients were French native speakers and the neuropsychological assessments confirmed that they had intact language functions and normal hearing. None of the patients had their epileptogenic zones located in the auditory areas, as determined by experienced epileptologists. The study received approval from the Institutional Review Board of the French Institute of Health (IRB00003888) in accordance with the Declaration of Helsinki. All patients provided written informed consent before the experimental session. Participation was voluntary, and none of the patients were involved in a clinical trial.

Recordings

SEEG recordings were performed using depth electrodes implanted stereotactically (Talairach et al., 1992; Alcis, Besançon, France, and Dixi Medical, Chaudefontaine, France). All patients had one electrode located in the auditory cortex, implanted orthogonally to the cortical surface, targeting Heschl’s gyrus and the planum temporale (Figure 1a). The first cohort consisted of 13 unilateral implantations (10 left, 3 right) and 5 bilateral implantations, resulting in a total of 23 analyzed electrodes (N=23). The second cohort had N=3 electrodes implanted in the region of interest, including one patient with a bilateral implantation. The electrodes were 0.8 mm in diameter and had between 8 and 18 contacts, each 2 mm long and separated from each other by 1.5 mm. The recording reference and ground were chosen by the clinical staff as two consecutive SEEG contacts on the same shaft, both located in the white matter and/or at a distance from any epileptic activity. Patients were recorded either in an insulated Faraday cage or in their bedroom. In the Faraday cage, they lay comfortably in a chair, the room was sound-attenuated, and data were recorded using a 256-channel amplifier (Brain Products), sampled at 1 kHz and high-pass filtered at 0.016 Hz. In the bedroom, data were recorded using a 256-channel Natus amplifier (DeltaMed system), sampled at 512 Hz and high-pass filtered at 0.16 Hz. To precisely localize the channels, a procedure similar to the one used in the iELVis and FieldTrip toolboxes was applied (Groppe et al., 2017; Stolk et al., 2018). First, we manually identified the location of each channel centroid on the post-implant CT scan using the Gardel software (Medina Villalon et al., 2018).
Second, we performed volumetric segmentation and cortical reconstruction on the pre-implant MRI with the FreeSurfer image analysis suite (documented and freely available for download online at http://surfer.nmr.mgh.harvard.edu/). Segmentation of the pre-implant MRI with SPM12 provided both the tissue probability maps (i.e., gray matter, white matter, and cerebrospinal fluid [CSF] probabilities) and the indexed binary representations (i.e., gray matter, white matter, CSF, bone, or soft tissue). This information allowed us to reject electrodes not located in the brain. Third, the post-implant CT scan was coregistered to the pre-implant MRI via a rigid affine transformation, and the pre-implant MRI was registered to MNI152 space via a linear and a non-linear transformation using SPM12 methods (Penny et al., 2007), through the FieldTrip toolbox (Oostenveld et al., 2011). Fourth, applying the corresponding transformations, we mapped channel locations onto the pre-implant MRI brain, which was labeled using the volume-based Human Brainnetome Atlas (Fan et al., 2016).

Experimental paradigm

For the first cohort, SEEG activity was recorded during rest and a passive listening task. The resting condition consisted of three three-minute intervals, during which patients remained awake without specific instructions to open or close their eyes. The task consisted of passively listening to a ∼10-minute-long story (577 s, La sorcière de la rue Mouffetard) while SEEG activity was recorded. The fundamental frequency was measured using Praat’s autocorrelation algorithm (male voice, median F0 = 139 Hz). The second cohort of two patients performed the same tasks and additionally listened to a second story (613 s, Le Petit Prince) with a higher fundamental frequency (female voice, median F0 = 189 Hz).
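For illustration, the autocorrelation principle behind Praat's pitch tracker can be sketched on a synthetic voiced frame; the frame length, harmonic structure, and 75–300 Hz search range are assumptions for the example, not Praat's actual settings.

```python
import numpy as np

fs = 16000
f0 = 139.0  # Hz, close to the male narrator's median F0
t = np.arange(int(0.05 * fs)) / fs  # one 50 ms analysis frame

# Harmonic-rich synthetic frame standing in for voiced speech
frame = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 6))
frame -= frame.mean()

# Autocorrelation peak within a plausible pitch range (75-300 Hz)
ac = np.correlate(frame, frame, mode="full")[frame.size - 1:]
lo, hi = int(fs / 300), int(fs / 75)
best_lag = lo + int(np.argmax(ac[lo:hi]))
f0_est = fs / best_lag  # close to 139 Hz
```

In practice, Praat additionally applies windowing and normalization corrections before peak picking.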

The experiment was done either in an insulated Faraday cage or in the patient’s bedroom. In the former, a sound Blaster X-Fi Xtreme Audio, an amplifier Yamaha P2040 and Yamaha loudspeakers (NS 10M) were used for sound presentation. In the latter, stimuli were presented using a Sennheiser HD 25 headphone set. Sound stimuli were presented at ∼75 dBA, with 44.1 kHz sample rate and 16 bits resolution.

To analyze the brain activity in response to the speech, we computed the envelope of the audio signal as main feature, following the same approach as for the audio corpus.

Independent Component Analysis

ICA aims to solve the ‘cocktail party’ problem by separating N statistically independent sources that have been mixed on M recording contacts (Herreras et al., 2022; López-Madrona, Trébuchon, Bénar, et al., 2024; López-Madrona, Trébuchon, Mindruta, et al., 2024; Makarova et al., 2025). It assumes immobility of the neural sources in space, i.e., that the contribution of one source to the recording contacts is the same throughout the recording session. Each recorded signal um(t) is thus modeled as the sum of N independent sources (sn(t)) multiplied by a constant factor (Vmn):

um(t) = Σn Vmn sn(t)     (1)

where um(t) is the SEEG data, Vmn the ICA weights or spatial profile of each source, M the number of contacts, N the number of sources, and sn(t) the obtained independent components (“SEEG-ICs”).

For each electrode, we obtained as many components as available contacts (N=M), without any prior dimension reduction (Artoni et al., 2018). We used EEGLAB (Delorme & Makeig, 2004) to compute ICA using the infomax algorithm (Bell & Sejnowski, 1995). From all the components, we manually rejected all the SEEG-ICs related to the reference or remote sources. They were characterized by a flat spatial profile with a similar contribution to all the contacts of the electrode (López-Madrona, Trébuchon, Bénar, et al., 2024). All the other SEEG-ICs were considered as putative neural sources. All the SEEG-ICs were z-scored.

Spectral analysis

Power spectra were estimated using the multitaper method (Thomson, 1982), with a frequency resolution of 0.25 Hz. To characterize the power of each source, we applied the fooof approach (Donoghue et al., 2020), which separates the periodic and aperiodic (1/f-like) components of the spectra. This method allows for the analysis of oscillatory power independently of 1/f-related changes (Voytek et al., 2015). The aperiodic component was modeled using a Lorentzian function, with offset, knee, and exponent (curvature) as key parameters. Then, fooof identified oscillatory peaks exceeding the aperiodic component and fit them individually using Gaussian functions, yielding estimates of power, center frequency, and bandwidth for each detected oscillation.

For the analysis in Figure 2e-g, we applied this approach to identify significant oscillatory peaks, indicating the presence of oscillations in the time course. The fooof fit was performed in the 1-50 Hz range, with a minimum peak bandwidth of 0.5 Hz (twice the frequency resolution) and a minimum amplitude threshold of twice the standard deviation of the aperiodic-removed power spectrum. If multiple peaks were detected with overlapping bandwidths, they were merged, retaining the frequency and relative power of the highest peak.
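The core of the fooof procedure, removing an aperiodic fit and thresholding the residual for peaks, can be sketched in a few lines. This toy version fits the aperiodic component as a straight line in log-log space (no knee) on a synthetic spectrum with one alpha-range peak; the real fooof fit is iterative and parameterized as described above.

```python
import numpy as np

freqs = np.arange(1.0, 50.25, 0.25)  # 1-50 Hz at 0.25 Hz resolution
# Toy spectrum: 1/f aperiodic component plus one ~10 Hz ("tau"-like) peak
psd = freqs ** -1.5 + 0.5 * np.exp(-((freqs - 10.0) ** 2) / (2 * 1.0 ** 2))

# Aperiodic fit: a straight line in log-log space (no knee, for simplicity)
coef = np.polyfit(np.log10(freqs), np.log10(psd), 1)
aperiodic = np.polyval(coef, np.log10(freqs))

# Flattened spectrum and a 2-SD peak threshold
residual = np.log10(psd) - aperiodic
thresh = 2 * residual.std()
peak_freq = freqs[int(np.argmax(residual))]  # close to 10 Hz
```

fooof additionally fits each detected peak with a Gaussian, yielding its power, center frequency, and bandwidth.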

Since the aperiodic activity during auditory stimulation differs from the resting state (López-Madrona, Trébuchon, Bénar, et al., 2024), we removed this component from both conditions using the fooof approach (Donoghue et al., 2020). To account for differences in the 1/f distribution between rest and auditory stimulation (Figure 2l), we recomputed the fooof fit separately for two frequency bands: 1-28 Hz and 28-150 Hz (Podvalny et al., 2015). A distinct aperiodic component was fitted for each SEEG-IC and frequency band and subsequently removed from the power spectra.

Statistical test for individual analysis

Recordings (rest and listening task) were segmented into 10-second windows, and power spectra were computed for each window. The power spectrum for each task was then averaged, and the ratio was calculated as (speech – rest) / rest, where “speech” refers to the listening condition. The p-value for each frequency was obtained using a permutation test (N = 1001 permutations), randomly reassigning “rest” or “listening task” labels while maintaining the same proportion in each iteration. Multiple comparisons were corrected using False Discovery Rate (FDR).
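A minimal version of this label-permutation test at a single frequency might look as follows; the window counts and effect size are made up for the example, and the FDR step across frequencies is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
n_perm = 1001

# Toy per-window power values at one frequency (10 windows per condition)
speech = rng.normal(1.6, 0.2, size=10)  # listening condition
rest = rng.normal(1.0, 0.2, size=10)
obs = (speech.mean() - rest.mean()) / rest.mean()  # observed (speech - rest) / rest

pooled = np.concatenate([speech, rest])
null = np.empty(n_perm)
for i in range(n_perm):
    perm = rng.permutation(pooled)  # randomly reassign condition labels
    null[i] = (perm[:10].mean() - perm[10:].mean()) / perm[10:].mean()

# Two-sided permutation p-value
p = (np.sum(np.abs(null) >= abs(obs)) + 1) / (n_perm + 1)
```

The resulting per-frequency p-values would then be corrected with FDR as described above.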

Statistical test for group analysis

A single power spectrum was averaged for each SEEG-IC and condition, and the ratio was computed as described above. A t-test was performed at each frequency to determine whether the ratio differed significantly from zero, with FDR correction for multiple comparisons.

Time-frequency analysis

The time-frequency representation of each envelope was computed using Morlet wavelet transforms with 7 cycles per wavelet. Each frequency band was normalized separately using a z-score computed over the entire duration of the audio recording. Time-frequency maps were then segmented into ‘trials’, time-locked to phoneme onsets, and categorized into two conditions: vowels and consonants. For each language, we averaged the time-frequency responses across phonemes within each condition. To assess differences in 30–50 Hz activity between conditions, we extracted the maximum power in this frequency range following phoneme onset for each language and condition. Group-level statistical comparisons across languages were performed using a paired t-test.
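A bare-bones Morlet decomposition with 7 cycles per wavelet and per-band z-scoring over the whole recording, as described above, can be sketched in numpy; the burst stimulus, the band grid, and the wavelet support (±4 sigma) are illustrative choices.

```python
import numpy as np

def morlet_power(x, fs, freqs, n_cycles=7):
    """Time-frequency power via complex Morlet wavelets (n_cycles per wavelet),
    z-scored per frequency band over the whole recording."""
    power = np.empty((len(freqs), x.size))
    for i, f in enumerate(freqs):
        sigma_t = n_cycles / (2 * np.pi * f)  # temporal width of the wavelet
        tw = np.arange(-4 * sigma_t, 4 * sigma_t, 1.0 / fs)
        wavelet = np.exp(2j * np.pi * f * tw - tw ** 2 / (2 * sigma_t ** 2))
        wavelet /= np.abs(wavelet).sum()
        power[i] = np.abs(np.convolve(x, wavelet, mode="same")) ** 2
    mu = power.mean(axis=1, keepdims=True)
    sd = power.std(axis=1, keepdims=True)
    return (power - mu) / sd

fs = 512
t = np.arange(4 * fs) / fs
x = np.zeros(t.size)
x[fs:2 * fs] = np.sin(2 * np.pi * 40 * t[fs:2 * fs])  # 1 s burst at 40 Hz
tfr = morlet_power(x, fs, freqs=np.arange(30, 51, 2))
peak_sample = int(np.argmax(tfr[5]))  # time of max power in the 40 Hz band
```

In the actual analysis, the z-scored maps are further epoched around phoneme onsets before averaging.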

Granger Causality

To estimate whether an SEEG-IC was related to the speech signal, we computed the Granger Causality (GC) between them (Bressler & Seth, 2011; Granger, 1969; López-Madrona et al., 2019). GC assumes that, if there is a linear relation between x(t) and y(t) (e.g., between the speech envelope and the neural activity, respectively), then the past activity of x (x(t-τ)) may predict the dynamics of y(t). It is based on autoregressive models, where x(t) and y(t) are modeled by their own past and a residual. Briefly, if there is a directionality from x(t) to y(t), the inclusion of the previous values of x(t) in the model of y(t) reduces the error of the prediction. The GC in the temporal domain from x(t) to y(t) is then determined by the F-statistic:

where and are the residual covariance matrices of the autoregressive models of y(t) before and after including the past values of x(t) respectively.

While equation (2) refers to the GC in the temporal domain, it is also possible to decompose it by frequency, estimating a spectral version of the GC (Geweke, 1982). The advantage of the spectral GC is that it identifies which frequencies of y(t) can be explained by the same oscillations in x(t).

We used the MVGC toolbox (Barnett & Seth, 2014) to compute the unconditional GC between the speech envelope and each SEEG-IC independently (i.e., only two signals in a single model). We downsampled the signals to 360 Hz and used a sliding window of 10 seconds, with an overlap of 80% and a model order of 15. The GC associated with each pair of signals was the average across time windows. Statistical significance was obtained with a surrogate data analysis (n=1001) by shifting the speech time course by a random value, breaking the temporal relationship between the two signals. We computed the GC for each surrogate and approximated the surrogate values with a Gaussian distribution. The significance threshold was set at the 95th percentile of this distribution (p = .05).
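
The surrogate thresholding step can be sketched as follows (Python, illustrative; `gc_fun` stands in for the MVGC pairwise estimate, which is not reimplemented here):

```python
import numpy as np
from scipy.stats import norm

def surrogate_threshold(envelope, neural, gc_fun, n_surr=1001, alpha=0.05, seed=0):
    """95th-percentile threshold of a Gaussian fitted to shift-surrogate GC values."""
    rng = np.random.default_rng(seed)
    surr = np.empty(n_surr)
    for i in range(n_surr):
        shift = rng.integers(1, len(envelope))       # random circular shift
        surr[i] = gc_fun(np.roll(envelope, shift), neural)
    mu, sigma = norm.fit(surr)                       # approximate the null as Gaussian
    return norm.ppf(1 - alpha, loc=mu, scale=sigma)  # significance threshold
```
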

An SEEG-IC was considered speech-driven if it exhibited significant temporal GC from the envelope (Figure 2i). We then analyzed the spectral content of the GC and categorized SEEG-ICs into three groups (Figure 3a): theta SEEG-ICs, if significance was found only at theta frequencies (2-6 Hz); gamma SEEG-ICs, if significance was limited to the low- and/or high-gamma bands (30-50 Hz and 100-150 Hz, respectively); and theta-gamma SEEG-ICs, if significance was present both in the theta range and in either the low- or high-gamma band.

Phase-Amplitude Coupling

To evaluate the non-linear interactions between theta (2-6 Hz) and gamma (30-150 Hz) oscillations, we computed the phase-amplitude coupling (PAC). The PAC measures whether the changes in amplitude of a fast signal (e.g. gamma activity) occur at specific phases of a slower oscillation (e.g. theta). We used a modified version of the modulation index (MI) (Canolty et al., 2006; Tort et al., 2008) to mitigate the errors induced when estimating the phase of oscillations that are not purely sinusoidal (López-Madrona et al., 2020).

For each frequency considered as reference for the phase, we combined a narrowband filter to detect the zero-crossing points (which correspond to the ascending and descending slopes of the oscillation, or the phases π/2 and 3π/2, respectively) and a broadband filter to find the trough and the peak (phases 0 and π, respectively; Cole & Voytek, 2019). We labelled this signal xθφ(t). For each frequency considered as reference for the amplitude, we bandpass filtered the signal at the frequency of interest with a bandwidth of 20 Hz (Aru et al., 2015) and computed the envelope using the Hilbert transform. This signal was labelled xγA(t).
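
This waveform-based phase estimate can be sketched as follows (Python, illustrative; it assumes the four landmarks per cycle have already been detected and interpolates the phase linearly between them):

```python
import numpy as np

def waveform_phase(anchor_idx):
    """anchor_idx: sorted sample indices of successive trough, ascending
    zero-crossing, peak and descending zero-crossing landmarks (repeating);
    returns a phase value for every sample between the first and last anchor."""
    anchors = np.arange(len(anchor_idx)) * np.pi / 2  # 0, pi/2, pi, 3pi/2, 2pi, ...
    samples = np.arange(anchor_idx[-1] + 1)
    return np.interp(samples, anchor_idx, anchors) % (2 * np.pi)
```

Because the landmarks are taken from the actual waveform, phase advances faster in short cycle segments and slower in long ones, which is the point of this estimator for non-sinusoidal oscillations.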

To compute the MI, we divided each cycle of xθφ(t) into N bins (16 in this work). As the phase was defined by 4 points (the trough, the peak and the two zero-crossing points), we split each cycle into its 4 corresponding segments and further divided them into N/4 bins (López-Madrona et al., 2020). Note that, as the signal is not sinusoidal, the size of the bins may differ within the same cycle. We then computed the mean amplitude of xγA(t) at each bin, calling $\langle x_{\gamma A} \rangle_{\varphi}(j)$ the amplitude at phase bin j. From these values, we can calculate the entropy H, defined by:

$$H = -\sum_{j=1}^{N} p_j \log p_j$$

where N was set to 16, and $p_j$ is given by

$$p_j = \frac{\langle x_{\gamma A} \rangle_{\varphi}(j)}{\sum_{k=1}^{N} \langle x_{\gamma A} \rangle_{\varphi}(k)}$$

The value of the MI was obtained by normalizing H by the maximum entropy (Hmax), given by the uniform distribution $p_j = 1/N$ (i.e., $H_{max} = \log N$):

$$MI = \frac{H_{max} - H}{H_{max}}$$

where an MI close to 0 indicates a lack of PAC and larger values reflect stronger coupling between the two signals.
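
The MI computation can be sketched as follows (Python, illustrative; for simplicity this sketch uses uniform phase bins rather than the cycle-by-cycle segmentation described above):

```python
import numpy as np

def modulation_index(theta_phase, gamma_amp, n_bins=16):
    """Tort-style MI: entropy of the phase-binned amplitude distribution,
    normalized by the maximum entropy log(N)."""
    bins = np.linspace(0, 2 * np.pi, n_bins + 1)
    mean_amp = np.empty(n_bins)
    for j in range(n_bins):
        mask = (theta_phase >= bins[j]) & (theta_phase < bins[j + 1])
        mean_amp[j] = gamma_amp[mask].mean()
    p = mean_amp / mean_amp.sum()                 # normalized amplitude distribution
    h = -np.sum(p * np.log(p))                    # entropy H
    return (np.log(n_bins) - h) / np.log(n_bins)  # MI = (Hmax - H) / Hmax
```
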

The PAC can be computed between any pair of time-courses, either using the same signal as reference for the phase and the amplitude (e.g., PAC within the speech envelope or within an SEEG-IC), or using different signals (e.g., coupling between the phase of the speech envelope and the gamma amplitude of an SEEG-IC, or inter-source PAC between different SEEG-ICs). For each pair of analyzed signals, a comodulogram was obtained as the MI map between given ranges of frequencies for the phase and amplitude references. Statistical significance was assessed using a surrogate analysis (n=1001 surrogates) by shifting xγA(t) with respect to xθφ(t). This breaks the temporal alignment between both oscillations, so the MI obtained can be considered “by chance”. To correct the comodulogram for multiple comparisons, we used a cluster-based approach (Cohen, 2014).

To obtain a single PAC value for low- and high-gamma per SEEG-IC, we selected the maximum comodulogram value within the 2-6 Hz range for the phase and the 30-50 Hz range for the amplitude of low-gamma, or the 100-150 Hz range for high-gamma.

Cross-Frequency Directionality

While the PAC gives an estimation of the degree of coupling, it is also possible to measure the directionality between the slow signal and the amplitude of the fast one (Arinyo-i-Prats et al., 2024; Dupré la Tour et al., 2017; Jiang et al., 2015). We used the cross-frequency directionality (CFD) index (Jiang et al., 2015) to infer whether one of the signals was significantly preceding the other. The CFD, based on the phase-slope index (PSI; Nolte et al., 2008), assumes that if the oscillation of one signal x(t) at a certain frequency is driving a second one y(t) with a time delay, then the phase difference between them at that specific delay will change consistently with the frequency of the signals.

Calling x(t) the original signal, $x_{\gamma A}(t)$ the power envelope of the signal at a gamma frequency, and X and $X_{\gamma A}$ their respective Fourier transforms, the CFD can be defined by:

$$\mathrm{CFD}(f_j) = \mathrm{Im}\left( \sum_{f \in [f_j,\, f_j + \beta]} C^{*}(f)\, C(f + \Delta f) \right)$$

where

$$C(f) = \frac{\sum_{s=1}^{S} X^{(s)}(f)\, X_{\gamma A}^{(s)*}(f)}{\sqrt{\sum_{s=1}^{S} |X^{(s)}(f)|^{2} \sum_{s=1}^{S} |X_{\gamma A}^{(s)}(f)|^{2}}}$$

is the complex coherency, $f_j$ is the theta frequency under study, S is the number of segments in which the signal has been divided, and β is the bandwidth over which the phase slope is measured, fixed at 2 Hz, i.e., 4 times the frequency resolution (Δf = 0.5 Hz).

This methodology overcomes some limitations of classical approaches such as GC, where strong differences in signal-to-noise ratio between the two components may strongly bias the results (Jiang et al., 2015; Nolte et al., 2010).
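
The phase-slope core of the CFD can be sketched as follows (Python, illustrative; it assumes the complex coherency between the slow signal and the gamma envelope has already been estimated on a uniform frequency grid):

```python
import numpy as np

def phase_slope_index(coherency, freqs, f_lo, f_hi):
    """Im of the summed products conj(C(f)) * C(f + Δf) over the band
    [f_lo, f_hi); positive values indicate the slow signal leads."""
    band = (freqs >= f_lo) & (freqs < f_hi)
    idx = np.flatnonzero(band)[:-1]  # consecutive (f, f + Δf) pairs in the band
    return np.imag(np.sum(np.conj(coherency[idx]) * coherency[idx + 1]))
```

A consistent positive phase slope across the band (phase lag growing linearly with frequency) yields a positive index; the opposite delay flips the sign.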

We computed the CFD using four different combinations of signals: (i) between the phase of the speech envelope and the amplitude of gamma and theta-gamma SEEG-ICs (see Granger Causality section), (ii) between the phase of theta and theta-gamma SEEG-ICs and the amplitude of the speech envelope, (iii) using the phase and amplitude of the same theta-gamma SEEG-IC (intra-source CFD), and (iv) between the phase and amplitude of different SEEG-ICs within the same patient (inter-source CFD). To avoid patients with multiple speech-driven SEEG-ICs biasing the results for intra-source and inter-source CFD, we averaged the comodulograms within each electrode, resulting in a single CFD map per electrode.

The statistical significance of the CFD was computed following a pixel-based surrogate approach (Cohen, 2014). In this case, for each surrogate comodulogram (n=1001 surrogates), we kept both the maximal and minimal values of the CFD, as the results can be either positive or negative. We then fitted two Gaussian distributions and computed the two significance thresholds (one for the positive and one for the negative values) as their 97.5th percentiles.

To obtain a single CFD value for low- and high-gamma per SEEG-IC, we selected the absolute maximum value within the 2-6 Hz range for the phase and the 30-50 Hz range for the amplitude of low-gamma, or the 100-150 Hz range for high-gamma.

Delay speech-brain

The delay between the auditory stimulus and the brain response was estimated as the lag of the cross-correlation between the speech envelope and each SEEG-IC. Separate delay values were computed for theta, low-gamma, and high-gamma by considering only SEEG-ICs with significant GC from the speech envelope at the respective frequency bands. For theta SEEG-ICs, signals were bandpass filtered between 2-6 Hz. For low- and high-gamma SEEG-ICs, signals were filtered at 30-50 Hz and 100-150 Hz, respectively, and their envelopes were extracted using the Hilbert transform (Xu et al., 2023). Both the speech envelope and the filtered SEEG-IC signals were segmented into 10-second windows with a 5-second overlap. Cross-correlation was computed within a maximum lag of 500 ms, and the correlation values were averaged across windows. The maximum correlation value and its corresponding lag were retained.
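
The windowed cross-correlation can be sketched as follows (Python, illustrative names; positive delays indicate that the neural signal follows the speech envelope):

```python
import numpy as np

def xcorr_delay(env, neural, fs, win_s=10.0, step_s=5.0, max_lag_s=0.5):
    """Average normalized cross-correlation over overlapping windows and
    return the lag (s) and value of its maximum within +/- max_lag_s."""
    win, step = int(win_s * fs), int(step_s * fs)
    max_lag = int(max_lag_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    acc = np.zeros(len(lags))
    n_win = 0
    for start in range(0, len(env) - win + 1, step):
        e = env[start:start + win] - env[start:start + win].mean()
        b = neural[start:start + win] - neural[start:start + win].mean()
        full = np.correlate(b, e, mode="full")   # positive lag: brain after speech
        mid = win - 1                            # index of zero lag
        norm = np.sqrt(np.sum(e ** 2) * np.sum(b ** 2))
        acc += full[mid - max_lag: mid + max_lag + 1] / norm
        n_win += 1
    acc /= n_win
    best = np.argmax(acc)
    return lags[best] / fs, acc[best]            # delay in seconds, peak correlation
```
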

To assess significance, a surrogate approach was applied (n = 1001 surrogates) by randomly shifting the brain signal relative to the speech envelope before segmentation and repeating the analysis. Only the maximum correlation value from each iteration was retained, regardless of lag. A delay was considered reliable if the original maximum correlation exceeded 95% of the surrogate values (p < .05), and its corresponding lag was taken as the estimated neural response delay to the speech signal.

Supplementary material

Description of the speech corpus of 17 languages.

Anatomical location of the SEEG-ICs.

STG: Superior Temporal Gyrus; A22c: caudal area 22; A22r: rostral area 22; STS: Superior Temporal Sulcus; rpSTS: rostroposterior STS. Brainnetome atlas.

PAC of the speech envelope across 17 languages.

PAC comodulogram between the phase and the amplitude of the speech envelope of naturalistic, discourse-level speech (p<.05, surrogate tests).

PAC of the envelope of various naturalistic musical and environmental sounds.

PAC comodulogram between the phase and the amplitude of the envelope of (top) 10 musical genres and (bottom) 5 categories of environmental sounds.

Preferred phase of PAC.

Averaged normalized amplitude distribution across two 4 Hz cycles of the speech envelope of the French male voice. The black line represents a reference for the phase signal.

Non-normalized power spectrum in rest and speech.

Power spectra during rest and speech for all components with: (left) significant changes in power between conditions; (middle) significant GC from the speech envelope; (right) significant phase-amplitude coupling (PAC) between the phase of theta and the gamma amplitude.

Temporal lag between the speech envelope and auditory activity.

Delay estimated using cross-correlations between the speech envelope and speech-driven SEEG-ICs with spectral GC at theta (N=33 SEEG-ICs showing a significant cross-correlation with the speech envelope, see Methods), low-gamma (N=22), or high-gamma (N=29) frequencies. (*/*** p < .05/.001, Tukey’s HSD post-hoc test). Points represent individual SEEG-ICs. Box plots show the median (center line), interquartile range (box), and full data range excluding outliers (whiskers); outliers are marked with red crosses.

Data availability

The raw data investigated in the current manuscript is privileged patient data. Because of this, the conditions of our ethics approval do not permit public archiving of anonymised study data. Readers seeking access to the data should contact Dr. Benjamin Morillon (benjamin.morillon@univ-amu.fr). Access will be granted to named individuals in accordance with ethical procedures governing the reuse of clinical data, including completion of a formal data sharing agreement. All the code used in this work is available at https://github.com/jeremygrd/Phase-Amplitude-Coupling-PAC-speech-Amplitude-Envelope/tree/main (acoustic analysis) and https://github.com/VictorLopezMadrona/Matlab_analysis_neuroscience (brain analysis).

Acknowledgements

Research supported by grants ANR-16-CONV-0002 (ILCB) and the Excellence Initiative of Aix-Marseille University (A*MIDEX AMX-19-IET-004). B.M. is supported by ANR-20-CE28-0007, Fondation Pour l’Audition (FPA RD-2022-09), and co-funded by the European Union (ERC, SPEEDY, ERC-CoG-101043344). L.L. is supported by ANR21-CE28-0015-01.

Additional information

Author contributions

BM and VLM designed research; MM and AT acquired data; VLM, JG, MM and BLG analyzed data; VLM, MM, LL, DP, ALG, LHA and BM interpreted results; VLM, BM and LHA wrote the first draft of the paper; all authors contributed to the final writing of the manuscript.

Funding

Agence Nationale de la Recherche (ANR) (ANR-16-CONV-0002)

  • Víctor J López-Madrona

Agence Nationale de la Recherche (ANR) (ANR-20-CE28-0007)

  • Benjamin Morillon

Agence Nationale de la Recherche (ANR) (ANR21-CE28-0015-01)

  • Leonardo Lancia

EC | European Research Council (ERC)

https://doi.org/10.3030/101043344

  • Benjamin Morillon

Fondation Pour l'Audition (FPA) (FPA RD-2022-09)

  • Benjamin Morillon

Excellence Initiative of Aix-Marseille University (A*MIDEX AMX-19-IET-004)

  • Benjamin Morillon