
Speech encoding by coupled cortical theta and gamma oscillations

  1. Alexandre Hyafil (corresponding author)
  2. Lorenzo Fontolan
  3. Claire Kabdebon
  4. Boris Gutkin
  5. Anne-Lise Giraud
  1. Ecole Normale Supérieure, France
  2. University of Geneva, Switzerland
  3. National Research University Higher School, Russia
Research Article
Cite as: eLife 2015;4:e06213 doi: 10.7554/eLife.06213

Abstract

Many environmental stimuli present a quasi-rhythmic structure at different timescales that the brain needs to decompose and integrate. Cortical oscillations have been proposed as instruments of sensory de-multiplexing, i.e., the parallel processing of different frequency streams in sensory signals. Yet their causal role in such a process has never been demonstrated. Here, we used a neural microcircuit model to address whether coupled theta–gamma oscillations, as observed in human auditory cortex, could underpin the multiscale sensory analysis of speech. We show that, in continuous speech, theta oscillations can flexibly track the syllabic rhythm and temporally organize the phoneme-level response of gamma neurons into a code that enables syllable identification. The tracking of slow speech fluctuations by theta oscillations, and its coupling to gamma-spiking activity both appeared as critical features for accurate speech encoding. These results demonstrate that cortical oscillations can be a key instrument of speech de-multiplexing, parsing, and encoding.

https://doi.org/10.7554/eLife.06213.001

eLife digest

Some people speak twice as fast as others, while people with different accents pronounce the same words in different ways. However, despite these differences between speakers, humans can usually follow spoken language with remarkable ease.

The different elements of speech have different frequencies: the typical frequency for syllables, for example, is about four syllables per second in speech. Phonemes, which are the smallest elements of speech, appear at a higher frequency. However, these elements are all transmitted at the same time, so the brain needs to be able to process them simultaneously.

The auditory cortex, the part of the brain that processes sound, produces various ‘waves’ of electrical activity, and these waves also have a characteristic frequency (which is the number of bursts of neural activity per second). One type of brain wave, called the theta rhythm, has a frequency of three to eight bursts per second, which is similar to the typical frequency of syllables in speech, and the frequency of another brain wave, the gamma rhythm, is similar to the frequency of phonemes. It has been suggested that these two brain waves may have a central role in our ability to follow speech, but to date there has been no direct evidence to support this theory.

Hyafil et al. have now used computer models of neural oscillations to explore this theory. Their simulations show that, as predicted, the theta rhythm tracks the syllables in spoken language, while the gamma rhythm encodes the specific features of each phoneme. Moreover, the two rhythms work together to establish the sequence of phonemes that makes up each syllable. These findings will support the development of improved speech recognition technologies.

https://doi.org/10.7554/eLife.06213.002

Introduction

The physical complexity of biological and environmental signals poses a fundamental problem to the sensory systems. Sensory signals are often made of different rhythmic streams organized at multiple timescales, which need to be processed in parallel and recombined to achieve unified perception. Speech constitutes an example of such physical complexity, in which different rhythms index linguistic representations of different granularities, from phonemes to syllables and words (Rosen, 1992; Zion Golumbic et al., 2012). Before meaning can be extracted from continuous speech, two critical pre-processing steps need to be carried out: a de-multiplexing step, i.e., the parallel analysis of each constitutive rhythm, and a parsing step, i.e., the discretization of the acoustic signal into linguistically relevant chunks that can be individually processed (Stevens, 2002; Poeppel, 2003; Ghitza, 2011). While parsing is presumably modulated in a top-down way, by knowing a priori through developmental learning (Ngon et al., 2013) where linguistic boundaries should lie, it is likely largely guided by speech acoustic dynamics. It has recently been proposed that speech de-multiplexing and parsing could both be handled in a bottom-up way by the combined action of auditory cortical oscillations in distinct frequency ranges, enabling parallel computations at syllabic and phonemic timescales (Ghitza, 2011; Giraud and Poeppel, 2012). Intrinsic coupling across cortical oscillations of distinct frequencies, as observed in electrophysiological recordings of auditory cortex (Lakatos et al., 2005; Fontolan et al., 2014), could enable the hierarchical combination of syllabic- and phonemic-scale computations, subsequently restoring the natural arrangement of phonemes within syllables (Giraud and Poeppel, 2012).

The most pronounced energy fluctuations in speech occur at about 4 Hz (Zion Golumbic et al., 2012) and can serve as an acoustic guide for signalling the syllabic rhythm (Mermelstein, 1975). Since the syllabic rate coincides with the auditory cortex theta rhythm (3–8 Hz), syllable boundaries could be viably signalled by a given phase in the theta cycle. The relevance of speech tracking by the theta neural rhythm (Henry et al., 2014) is highlighted by experimental data showing that speech intelligibility depends on the degree of phase-locking of the theta-range neural activity in auditory cortex (Ahissar et al., 2001; Luo and Poeppel, 2007; Peelle et al., 2013; Gross et al., 2013). By analogy with the spatial and mnemonic oscillatory processes that take place in the hippocampus (Jensen and Lisman, 1996; Lisman and Jensen, 2013; Lever et al., 2014), the theta oscillation may orchestrate gamma neural activity to facilitate its subsequent decoding (Canolty et al., 2007): the phase of theta-paced neural activity could regulate faster neural activity in the low-gamma range (>30 Hz) involved in linguistic coding of phonemic details (Ghitza, 2011; Giraud and Poeppel, 2012). The control of gamma by theta oscillations could hence both modulate the excitability of gamma neurons to devote more processing power to the informative parts of syllabic sound patterns, and constitute a reference time frame aligned on syllabic contours for interpreting gamma-based phonemic processing (Shamir et al., 2009; Ghitza, 2011; Kayser et al., 2012; Panzeri et al., 2014).

Compelling as this hypothesis may sound, direct evidence for neural mechanisms linking speech constituents and oscillatory components is still lacking. One way to address a causal role of oscillations in speech processing is computational modelling, as it permits a direct test of the efficiency of cross-coupled theta and gamma oscillations as an instrument of speech de-multiplexing, parsing, and encoding. Previous models of speech processing involved only gamma oscillations in the context of isolated speech segments (Shamir et al., 2009) or did not involve neural oscillations at all (Gütig and Sompolinsky, 2009; Yildiz et al., 2013). On the other hand, previous models of cross-frequency coupled oscillations did not address sensory functions such as parsing and de-multiplexing (Jensen and Lisman, 1996; Tort et al., 2007). Here, we examined how a biophysically inspired model of coupled theta and gamma neural oscillations can process continuous speech (spoken sentences). Specifically, we determined: (i) whether theta oscillations are able to accurately parse speech into syllables; (ii) whether the syllable-related theta signal may serve as a reference time frame to improve gamma-based decoding of continuous speech; (iii) whether this decoding requires theta to modulate the activity of the gamma network. To address the last two points, we compared the speech decoding performance of the model with that of two control versions of the network, in which we removed either the neural connection that lets speech fluctuations entrain the theta neurons or the link that couples the theta neurons to the gamma neurons.

Results

Model architecture and spontaneous behaviour

The model proposed here (Figure 1A) is inspired by cortical architecture (Douglas and Martin, 2004; da Costa and Martin, 2010) and function (Lakatos et al., 2007) as well as by previous biophysical models of cross-frequency coupled oscillation generation (Tort et al., 2007; Kopell et al., 2010; Vierling-Claassen et al., 2010). We used the well-documented Pyramidal Interneuron Gamma (PING) model for implementing a gamma network: bursts of inhibitory neurons immediately follow bursts of excitatory neurons (Jadi and Sejnowski, 2014), creating the overall spiking rhythm. Given that gamma and theta oscillations are both locally present in superficial cortical layers (Lakatos et al., 2005), we assume similar local generation mechanisms for theta and gamma with a direct connection between them. Direct evidence for a local generation of theta oscillations in auditory cortex is still scarce (Ainsworth et al., 2011) and we cannot completely rule out that they might spread from remote generators (e.g., in the hippocampus; Tort et al., 2007; Kopell et al., 2010). Yet, we built the case for local generation from the following facts: (1) neocortical (somatosensory) theta oscillations are observed in vitro (Fanselow et al., 2008), (2) MEG, EEG, and combined EEG/fMRI recordings in humans show that theta activity phase-locks to the speech amplitude envelope in A1 and immediate association cortex, but not beyond (Ahissar et al., 2001; Luo and Poeppel, 2007; Cogan and Poeppel, 2011; Morillon et al., 2012), and (3) theta phase-locking to speech is not accompanied by a power increase, arguing for a phase restructuring of a local oscillation (Luo and Poeppel, 2007). We assumed a similar generation mechanism for theta and gamma oscillations, with slower excitatory and inhibitory synaptic time constants for theta (Kopell et al., 2010; Vierling-Claassen et al., 2010).
The distinct dynamics for the two modules reflect the diversity of inhibitory synaptic timescales observed experimentally, with Martinotti cells displaying slow synaptic inhibition (Ti neurons), and basket cells showing faster inhibition decay (Gi neurons) (Silberberg and Markram, 2007). We refer to the theta network as Pyramidal Interneuron Theta (PINTH), by analogy with PING. The full model is hence composed of a theta-generating module with interconnected spiking excitatory (Te) and inhibitory (Ti) neurons that spontaneously synchronize at theta frequency (6–8 Hz) through slow decaying inhibition; and of a gamma-generating module with excitatory (Ge) and inhibitory (Gi) neurons that burst at a faster rate (25–45 Hz) synchronized by fast decaying inhibition (PING; Figure 1B) (Börgers and Kopell, 2005). The firing pattern of our simulated neurons is sparse and weakly synchronous at rest, consistent with the low spiking rate of cortical neurons (Brunel and Wang, 2003) (Figure 1—figure supplement 1D). Unlike the classical 50–80 Hz PING seen in in vitro preparations of rat auditory cortex (Ainsworth et al., 2011), our network produced a lower gamma frequency around 30 Hz, as observed in human auditory cortex in response to speech (Nourski et al., 2009; Pasley et al., 2012).

Figure 1 (with 1 supplement)
Network architecture and dynamics.

(A) Architecture of the full model. Te excitatory neurons (n = 10) and Ti inhibitory neurons (n = 10) form the PINTH loop generating theta oscillations. Ge excitatory neurons (n = 32) and Gi inhibitory neurons (n = 32) form the PING loop generating gamma oscillations. Te neurons receive non-specific projections from all auditory channels, while Ge units receive specific projection from a single auditory channel, preserving tonotopy in the Ge population. PING and PINTH loops are coupled through all-to-all projections from Te to Ge units. (B) Network activity at rest and during speech perception. Raster plot of spikes from representative Ti (dark green), Te (light green), Gi (dark blue), and Ge (light blue). Simulated LFP is shown on top and the auditory spectrogram of the input sentence "Ralph prepared red snapper with fresh lemon sauce for dinner" is shown below. Ge spikes relative to theta burst (red boxes) form the output of the network. Gamma synchrony is visible in Gi spikes. (C) Evoked potential (ERP) and Post-stimulus time histograms (PSTH) of Te and Ge population from 50 simulations of the same sentence: ERP (i.e., simulated LFP averaged over simulations, black line), acoustic envelope of the sentence (red line, filtered at 20 Hz), PSTH for theta (green line) and gamma (blue line) neurons. Vertical bars show scale of 10 spikes for both PSTH. The theta network phase-locks to speech slow fluctuations and entrains the gamma network through the theta–gamma connection. (D) Theta/gamma phase-amplitude coupling in Ge spiking activity. Top panel: LFP gamma envelope follows LFP theta phase in single trials. Bottom-Left panel: LFP phase-amplitude coupling (measured by Modulation Index) for pairs of frequencies during rest, showing peak in theta–gamma pairs. 
Bottom-right panel: MI phase-amplitude coupling at the spiking level for the intact model and a control model with no theta–gamma connection (red arrow on A panel), during rest (blue bars) and speech presentation (brown bars).

https://doi.org/10.7554/eLife.06213.003

At rest the PINTH population activity synchronizes at the theta timescale, and the PING population at the gamma time scale. Both the Te and Ge populations receive projections from a ‘subcortical’ module that mimics the nonlinear filtering of acoustic input by subcortical structures, which primarily includes a signal decomposition into 32 auditory channels (Chi et al., 2005). Individual excitatory neurons in the theta module received channel-averaged input while those in the gamma module received frequency selective input. Such a differential selectivity was motivated by experimental observations from intracranial recordings (Morillon et al., 2012; Fontolan et al., 2014) suggesting that unlike the gamma one, the theta response does not depend on the input spectrum. It also mirrors the dissociation in primate auditory cortex between a population of 'stereotyped' neurons responding very rapidly and non-selectively to any acoustic stimulus (putatively Te neurons) and a population of 'modulated' neurons responding selectively to specific spectro-temporal features (putatively Ge neurons) (Brasselet et al., 2012). Each Ge neuron receives input from one specific channel, preserving the auditory tonotopy, so that the whole Ge population represents the rich spectral structure of the stimulus. Each Te neuron receives input from all the channels, i.e., the Te population conveys a widely tuned temporal signal capturing slow stimulus fluctuations. Importantly, the two oscillating modules are connected through all-to-all connections from Te neurons to Ge neurons allowing the theta oscillations to control the activity of the faster gamma oscillations. This structure enables syllable boundary detection (through the theta module) to constrain the decoding of faster phonemic information. 
The output of the network is taken from the Ge neurons as we assume that the Ge neurons provide the input to higher-level cortical structures performing operations like phoneme categorization and providing access to lexicon. Accordingly, in the model the Ge neurons receive more spectral details about speech than the Te neurons (Figure 1B). Ge spiking is then referenced with respect to timing of theta spikes, and submitted to decoding algorithms.
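The differential input selectivity described above can be illustrated with a minimal sketch. It assumes the subcortical stage has already produced a 32-channel auditory spectrogram, and it uses unit gains and an identity channel-to-neuron mapping; the function name `project_inputs` and these weights are illustrative assumptions, not the parameters of the published model.

```python
import numpy as np

N_CHANNELS = 32  # auditory channels produced by the subcortical stage
N_TE = 10        # theta excitatory neurons (Figure 1A)
N_GE = 32        # gamma excitatory neurons (Figure 1A)

def project_inputs(channels, n_te=N_TE):
    """Map a (channels x time) spectrogram onto Te and Ge input drives.

    Hypothetical helper: unit gains are an assumption made for illustration.
    """
    channels = np.asarray(channels, dtype=float)
    # Te neurons: each one receives the same channel-averaged (broadband) signal.
    te_drive = np.tile(channels.mean(axis=0), (n_te, 1))
    # Ge neurons: one channel per neuron, so tonotopy is preserved.
    ge_drive = channels.copy()
    return te_drive, ge_drive

# Toy spectrogram standing in for real subcortical output (random values).
rng = np.random.default_rng(0)
spectrogram = rng.random((N_CHANNELS, 1000))
te_drive, ge_drive = project_inputs(spectrogram)
```

In this sketch the Te drive carries only the slow broadband envelope, while the Ge drive keeps the full spectral detail, which mirrors the division of labour the text attributes to the two modules.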

Model dynamics in response to natural sentences

We first explored the dynamic behaviour of the model. As expected from its architecture and biophysical parameters (see ‘Materials and methods’), the neural network produced activity in theta (6–8 Hz) and low gamma (25–45 Hz) ranges, both at rest and during speech presentation. Consistent with experimental observations (Luo and Poeppel, 2007) there was no notable increase in theta spiking during speech presentation, but sentence onsets induced a phase-locking of theta oscillations as shown by the post-stimulus time histograms of theta neurons, which was further enhanced by all edges in the speech envelope. Consequently, the resulting global evoked activity followed the acoustic envelope of the speech signal (Figure 1C) (Abrams et al., 2008). The local field potential (LFP) indexes the global synaptic activity over the network (excitatory neurons of both networks) and its dynamics closely followed spiking dynamics. Unlike the LFP theta power pattern, the LFP theta phase pattern was robust across repetitions of the same sentence (Figure 1—figure supplement 1A,C,E), replicating LFP behaviour from the primate auditory cortex (Kayser et al., 2009), and human MEG data (Luo and Poeppel, 2007; Luo et al., 2010). In line with other empirical data from human auditory cortex (Nourski et al., 2009) gamma oscillations followed the onset of sentences (Figure 1C). Owing to the feed-forward connection from the theta to the gamma sub-circuits, the gamma amplitude was coupled to the theta phase both at rest and during speech (Figure 1D). The coupling was visible both in the spiking (Figure 1—figure supplement 1B) and LFP signal (Figure 1D). Critically, this coupling disappeared when the theta/gamma connection was removed, showing that a common input to Te and Ge cells is not sufficient to couple the two oscillations.
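Phase-amplitude coupling of this kind is commonly quantified with a Modulation Index. The sketch below assumes the widely used entropy-based definition (mean gamma amplitude per theta-phase bin, compared with a uniform distribution); whether the article uses exactly this variant, and the choice of 18 phase bins, are assumptions. The theta phase and gamma amplitude are taken as given inputs.

```python
import numpy as np

def modulation_index(phase, amplitude, n_bins=18):
    """Entropy-based phase-amplitude coupling index in [0, 1].

    phase: theta phase per sample, in radians within [-pi, pi].
    amplitude: gamma amplitude envelope per sample (non-negative).
    """
    phase = np.asarray(phase, dtype=float)
    amplitude = np.asarray(amplitude, dtype=float)
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)
    # Assign each sample to a phase bin; phase == pi goes to the last bin.
    idx = np.clip(np.digitize(phase, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=amplitude, minlength=n_bins)
    mean_amp = np.divide(sums, counts, out=np.zeros(n_bins), where=counts > 0)
    # Normalise mean amplitudes into a distribution over phase bins.
    p = mean_amp / mean_amp.sum()
    nonzero = p > 0
    entropy = -np.sum(p[nonzero] * np.log(p[nonzero]))
    # 0 when amplitude is flat across phase, larger when it is phase-dependent.
    return (np.log(n_bins) - entropy) / np.log(n_bins)

# Toy signals: a 6 Hz theta phase, with and without amplitude modulation.
t = np.arange(0, 10, 0.001)
theta_phase = np.angle(np.exp(1j * 2 * np.pi * 6 * t))
mi_coupled = modulation_index(theta_phase, 1 + 0.8 * np.cos(theta_phase))
mi_flat = modulation_index(theta_phase, np.ones_like(t))
```

With these toy signals the index is close to zero for the flat envelope and clearly positive for the phase-modulated one, which is the qualitative contrast reported between the uncoupled control and the intact model.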

Syllable boundary detection by theta oscillations

Before testing the speech decoding properties of the model, we explored whether syllable boundaries could reliably be detected at the cortical level by a theta network (see ‘Materials and methods’). This first study was based on a corpus consisting of 4620 phonetically labelled English sentences (TIMIT Linguistic Data Consortium, 1993). The acoustic analysis of these sentences confirmed a correspondence between the dominant peak of the speech modulation spectrum and the mean syllabic rate (3–6 Hz) (Figure 2—figure supplement 1A), whereby syllabic boundaries correspond to troughs in the slow fluctuations of speech (Peelle et al., 2013). The theta network in the model (Figure 2—figure supplement 1B) was explicitly designed to exploit such regularities and infer syllable boundaries. When presenting sentences to the theta module, we observed a consistent theta burst within 50 ms following syllable onset followed by a locking of theta oscillations to theta-range acoustic fluctuations in the speech signal (Figure 2—figure supplement 1C,D). More importantly, neuronal theta bursts closely aligned to the timing of syllable boundaries in the presented sentences (Figure 2A). We compared the performance of the theta network to that of two alternative models also able to predict syllable boundaries: a simple linear-nonlinear acoustic boundary detector (Figure 2—figure supplement 1E) and the Mermelstein algorithm, a state-of-the-art model which, unlike the model developed here, only permits ‘off-line’ syllable boundary detection (Mermelstein, 1975). The theta network performed better than both the linear model and the Mermelstein algorithm (Figure 2B, all p-values < 10⁻¹²). Similar to results from behavioural studies of human perception (Miller et al., 1984; Nourski et al., 2009; Mukamel et al., 2011) the theta network could adapt to different speech rates.
The model performed better than the other algorithms, with a syllabic alignment accuracy remaining well above chance levels (p < 10⁻¹²) in the twofold and threefold time compression conditions (Figure 2B).

Figure 2 (with 1 supplement)
Theta entrainment by syllabic structure.

(A) Theta spikes align to syllable boundaries. Top graph shows the activity of the theta network at rest and in response to a sentence, including the LFP traces displaying strong theta oscillations, and raster plots for spikes in the Ti (light green) and Te (dark green) populations. Theta bursts align well to the syllable boundaries obtained from labelled data (vertical black lines shown on top of auditory spectrogram in graph below). (B) Performance of different algorithms in predicting syllable onsets: Syllable alignment score indexes how well theta bursts aligned onto syllable boundaries for each sentence in the corpus, and the score was averaged over the 3620 sentences in the test data set (error bars: standard error). Results compare Mermelstein algorithm (grey bar), linear-nonlinear predictor (LN, pink) and theta network (green), both for normal speed speech (compression factor 1) and compressed speech (compression factors 2 and 3). Performance was assessed on a different subsample of sentences than those used for parameter fitting.

https://doi.org/10.7554/eLife.06213.005

This first study demonstrates that theta activity provides a reliable, syllable-based, internal time reference that the neural system could use when reading out the activity of gamma neurons.

Decoding of simple temporal stimuli from output spike patterns

Our next step was to test whether the theta-based syllable chunks of output spike trains (Ge neurons) for the different input types could be properly classified. We first quantified the model's ability to encode stimuli designed as simple temporal patterns. We used 50 ms sawtooth stimuli whose shape was parametrically varied by changing the peak position (Figure 3A), with an interstimulus interval between 50 and 250 ms. This toy set of stimuli was previously used in a gamma-based speech encoding model and argued to represent idealized formant transitions (Shamir et al., 2009). We extracted spike patterns from all the Ge (output) neurons from 20 ms before each sawtooth onset to 20 ms after its offset. This procedure is referred to as ‘stimulus timing’ since it uses the stimulus onset as time reference. Using a clustering method (see ‘Materials and methods’), we observed that the identity of the presented sawtooth could be decoded from the output spike patterns (Figure 3A) with over 60% accuracy (Figure 3C, light grey bar). We also computed the decoding performance when we used an internal time reference provided by the theta timing rather than by the stimulus timing. When spike patterns were analysed within a window defined by two successive theta bursts (Figure 3C, dark grey bar), sawtooth decoding was still possible and even relatively well preserved (mean decoding rate of 41.7%). Noise in the theta module allows the alignment of theta bursts to stimulus onset and thus improves detection performance by enabling consistent theta chunking of spike patterns.
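The theta-timing procedure, in which Ge spikes are cut into windows bounded by successive theta bursts and re-timed relative to the window start, can be sketched as follows. The sketch takes theta burst times as given; how a burst is detected from Te spiking is not specified here, and the function name `chunk_by_theta` is a hypothetical label.

```python
import numpy as np

def chunk_by_theta(ge_spike_times, ge_neuron_ids, theta_burst_times):
    """Split Ge spikes into theta chunks.

    Returns one (relative_times, neuron_ids) pair per window between two
    consecutive theta bursts; spike times are re-referenced to chunk onset.
    """
    spikes = np.asarray(ge_spike_times, dtype=float)
    ids = np.asarray(ge_neuron_ids)
    bursts = np.sort(np.asarray(theta_burst_times, dtype=float))
    chunks = []
    for start, stop in zip(bursts[:-1], bursts[1:]):
        in_chunk = (spikes >= start) & (spikes < stop)
        chunks.append((spikes[in_chunk] - start, ids[in_chunk]))
    return chunks

# Toy example (times in seconds): three theta bursts define two chunks.
bursts = [0.00, 0.15, 0.32]
spike_times = [0.02, 0.10, 0.16, 0.30, 0.40]
neuron_ids = [3, 7, 3, 12, 5]
chunks = chunk_by_theta(spike_times, neuron_ids, bursts)
```

Spikes falling after the last burst (here the one at 0.40 s) belong to no complete chunk and are dropped in this sketch; the resulting chunks are the units that a classifier would then label.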

Figure 3
Sawtooth classification.

(A) Gamma spiking patterns in response to simple stimuli. The model was presented with 50 ms sawtooth stimuli, where peak timing was parameterized between 0 (peak at onset) and 1 (peak at offset). Spiking is shown for different Ge neurons (y axis) in windows phase-locked to theta bursts (−20 to +70 ms around the burst, x-axis). Neural patterns are plotted below the corresponding sawtooths. (B) Simulated networks. The analysis was performed on simulated data from three distinct networks: ‘Undriven-theta model’ (no speech input to Te units, top), ‘Uncoupled theta/gamma model’ (no projection from Te to Ge units, middle), full intact model (bottom). (C) Classification performance using stimulus vs. theta timing for the three simulated networks. The stimulus timing (light bars) is obtained by extracting Ge spikes in a fixed-size window locked to the onset of the external stimulus; the theta timing (dark bars) is obtained by extracting Ge spikes in a window defined by consecutive theta bursts (theta chunk, see Figure 3A). Classification was repeated 10 times for each network and neural code, and mean values and standard deviation were extracted. Average expected chance level is 10%. (D) Stimulus detection performance, for the intact and control models. Rest neural patterns were discriminated against any of the 10 neural patterns defined by the 10 distinct temporal shapes. (E) Confusion matrices for stimulus- and theta-timing and the two control models (using theta-timing code). The colour of each cell represents the number of trials where a stimulus parameter was associated with a decoded parameter (blue: low numbers; red: high numbers). Values on the diagonal represent correct decoding.

https://doi.org/10.7554/eLife.06213.007

We then compared the decoding performance from the full model with that of two control models: one in which the theta module was not driven by the stimulus (undriven theta model) and one in which the theta module was not connected with the gamma module (uncoupled theta/gamma model) (Figure 3B, green and blue). Decoding performance of both control models, as revealed by the mean performance (Figure 3C) and confusion matrices (Figure 3E), was degraded for either neural code (theta onset and stimulus timing, all p-values < 10⁻⁹). The details of the raw confusion matrices show that the temporal patterns are decoded correctly or as a neighbouring temporal shape only in the intact version of the model (Figure 3E). Furthermore, the intact model achieved better signal vs rest discrimination than the two control models, notably avoiding false alarms (Figure 3D). In summary, these analyses show that gamma-spiking neurons within theta bursts provide a reliable internal code for characterizing simple temporal patterns, and that this ability is granted by the time-locking of theta neurons (Te units) to the stimulus and the modulation they exert on the fast-scale output (Ge) units.

Continuous speech encoding by model output spike patterns

The overarching goal of this theoretical work was to assess whether coupled cortical oscillations can achieve on-line speech decoding from a continuous signal. We therefore set out to classify syllables from natural sentences. To decode Ge spiking, we used similar procedures as for the encoding/decoding of simple temporal patterns. Output Ge spikes were parsed into spike patterns based on the theta chunks, and the decoding analysis was used to recover syllable identity (Figure 4A). To evaluate the importance of the precise spike timing of gamma neurons, we compared decoding (see ‘Materials and methods’) using spike patterns (i.e., spikes labelled with their precise timing w.r.t. chunk onset) vs those obtained from plain spike counts (i.e., unlabelled spikes). When using spike patterns, syllable decoding reached a high level of accuracy in the intact model: 58% of syllables were correctly classified within a set of 10 possible (randomly chosen) syllables (Figure 4B). Syllable decoding dropped when using spike counts instead of spike patterns (p < 10⁻¹²). Critically, decoding was poor in both control models (undriven theta and uncoupled theta/gamma) using either spike counts or spike patterns (significantly lower than decoding using spike patterns in the full model, all p-values < 10⁻¹², and non-significantly higher than decoding using spike counts in the full model, all p-values > 0.08 uncorrected).

Figure 4 (with 1 supplement)
Continuous speech parsing and syllable classification.

(A) Decoding scheme. Output spike patterns were built by extracting Ge spikes occurring within time windows defined by consecutive theta bursts (red boxes) during speech processing simulations. Each output pattern was then labelled with the corresponding syllable (grey bars). (B) Syllable decoding average performance for uncompressed speech. Performance for the three simulated models (Figure 3B) using two possible neural codes: spike count and spike pattern. (C) Syllable decoding average performance across speakers, using the spike pattern code. Syllable decoding was optimal when syllable duration was within the 100–300 ms range, i.e., corresponded to the duration of one theta cycle. The intact model performed better than the two controls irrespective of syllable duration range. Chance level is 10%. Colour code same as B. (D) Syllable decoding performance for compressed speech for the intact model using the spike pattern code (same speaker, as in B). Compression ranges from 1 (uncompressed) to 3. Average chance level is 10% (horizontal line in the right plot).

https://doi.org/10.7554/eLife.06213.008

We also explored the model performance for encoding syllables spoken by different speakers. We used a similar decoding procedure as above, but here the classifier was trained on different speakers pronouncing the same two sentences. Theta chunks were classified into syllables based on the network response to the two sentences uttered by 99 other speakers. The material included sentences spoken by 462 speakers of various ethnic and geographical origins, showing a marked heterogeneity in phonemic realization and syllable durations (as labelled by phoneticians). The syllable duration distribution was skewed with the median at 200 ms and tail values ranging from a few ms to over 800 ms (Figure 4—figure supplement 1A). Given that theta activity is meant to operate in a 3–9 Hz range, i.e., integrate speech chunks of about 100–300 ms (Ghitza, 2011, 2014), we did not expect the model to perform equally well along the whole syllable duration range. Accordingly, decoding accuracy was not uniform across the whole syllable duration range. When decoding from spike patterns, the intact model allowed 24% accuracy (chance level at 10%). It showed a peak in performance in the range in which it is expected to operate, i.e., for syllable durations between 100 and 300 ms. Given the cross-speaker phonemic variability, such a performance is fairly good. Critically, the intact model outperformed control models both within the 100 to 300 ms range (p < 0.001), and throughout the whole syllable duration span (p < 0.001). These analyses overall show that the model can flexibly track syllables within a physiological operating window, and that syllable decoding relies on the integrity of the model architecture.

Lastly, we tested more directly the resilience of the spike pattern code to speech temporal compression and found that decoding performance, while degraded, remained above chance for compression rates of 2 and 3 (Figure 4D), mimicking human decoding performance (Ahissar et al., 2001). Altogether, the decoding of syllables from continuous speech showed that coupled theta and gamma oscillations provide a viable instrument for syllable parsing and decoding, and that its performance relies on the coupling between the two oscillation networks.

Encoding properties of model neurons

We finally assessed the physiological plausibility of the model by comparing the encoding properties of the simulated neurons, without further parameter fitting, with those of neurons recorded from primate auditory cortex (Kayser et al., 2009; 2012). The first analysis of neural encoding properties consisted of comparing the ability to classify neural codes from the model into arbitrary speech segments of fixed duration (as opposed to classification into syllables as in the previous section). We simulated data using natural speech and studied the spiking activity of Ge neurons by implementing the same methods of analysis as in the original experiment. We extracted fixed-size windows of spike pattern activity for individual Ge neurons, and assessed neural encoding characteristics using different neural codes. Speech encoding was first evaluated using a nearest-mean classifier and then using mutual information techniques (Kayser et al., 2009).
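A nearest-mean classifier assigns each neural pattern to the stimulus whose average training pattern is closest. The sketch below is a generic version that assumes Euclidean distance and a simple split into training and test trials; the distance measure, the cross-validation scheme, and the synthetic data are assumptions made for illustration rather than details taken from the study.

```python
import numpy as np

def nearest_mean_fit(patterns, labels):
    """Compute one mean pattern per stimulus class."""
    patterns = np.asarray(patterns, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    means = np.stack([patterns[labels == c].mean(axis=0) for c in classes])
    return classes, means

def nearest_mean_predict(patterns, classes, means):
    """Assign each pattern to the class with the closest mean (Euclidean)."""
    patterns = np.asarray(patterns, dtype=float)
    dist = ((patterns[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

# Synthetic data: 10 stimuli, each with a distinct 8-bin template plus noise.
rng = np.random.default_rng(0)
templates = (np.arange(10)[:, None] + np.arange(8)[None, :]) % 10
labels = np.repeat(np.arange(10), 50)
trials = templates[labels] + rng.normal(0.0, 0.2, size=(500, 8))

train = np.arange(500) % 2 == 0          # even trials train, odd trials test
classes, means = nearest_mean_fit(trials[train], labels[train])
predicted = nearest_mean_predict(trials[~train], classes, means)
accuracy = np.mean(predicted == labels[~train])
```

With 10 equiprobable stimuli the chance level of such a classifier is 10%, which matches the chance level quoted for the decoding analyses in this article.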

Classifier analysis

In this analysis, neural patterns were classified not into syllables as above or into any linguistic constituent but into arbitrary segments of speech, allowing for a-theoretical insight into the encoding properties of neurons. We extracted a subset of 25 sentences from the TIMIT corpus and exposed the network to 50 presentations of each sentence from the subset. We defined 10 stimuli as 10 distinct windows of a given size (from 80 to 480 ms) randomly extracted from the 25 sentences, and then assessed the capacity to decode the identity of a stimulus from the activity of individual Ge neurons within that window (Kayser et al., 2012). Three different codes were used (Figure 5A): a simple spike count, used as the reference code; a time-partitioned code, where spikes were assigned to one of 8 bins of equal duration within the temporal window; and a phase-partitioned code, where spikes were labelled with the phase of LFP theta at the time of the spike (the spikes were then assigned to one of 8 bins according to their phase).
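The three codes can be written down compactly for a single neuron and a single window. In this sketch, spike times are expressed relative to window onset, and the LFP theta phase at each spike is supplied as an input; how that phase is extracted from the LFP (for example by band-pass filtering and a Hilbert transform) is an assumption left outside the sketch, as are the function names.

```python
import numpy as np

def spike_count_code(spike_times):
    """Reference code: total number of spikes in the window."""
    return np.array([len(spike_times)])

def time_partitioned_code(spike_times, window, n_bins=8):
    """Spike counts in n_bins equal-duration bins spanning the window."""
    counts, _ = np.histogram(spike_times, bins=n_bins, range=(0.0, window))
    return counts

def phase_partitioned_code(spike_phases, n_bins=8):
    """Spike counts in n_bins equal-width bins of LFP theta phase."""
    counts, _ = np.histogram(spike_phases, bins=n_bins, range=(-np.pi, np.pi))
    return counts

# One neuron, one 80 ms window (times in seconds, phases in radians).
spike_times = np.array([0.005, 0.012, 0.055, 0.071, 0.075])
spike_phases = np.array([-3.0, -1.0, 0.1, 0.2, 3.0])

count_code = spike_count_code(spike_times)
time_code = time_partitioned_code(spike_times, window=0.080)
phase_code = phase_partitioned_code(spike_phases)
```

The time-partitioned code needs an external clock locked to the stimulus, whereas the phase-partitioned code relies only on the network's own theta rhythm, which is why comparing the two tests whether theta can serve as an internal time reference.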

Figure 5 with 1 supplement
Comparison with encoding properties of auditory cortical neurons.

(A) Neural codes. Stimulus decoding was performed on patterns of Ge spikes chunked in fixed-size windows (the figure illustrates the pattern for one neuron extracted from one window). Spike count consisted of counting all spikes for each neuron within the window. The time-partitioned code was obtained by dividing the window into N equal-size bins (vertical grey bars) and counting spikes within each bin. The phase-partitioned code was obtained by binning LFP phase into N bins (depicted by the four colours in the top graph) and assigning each spike to the corresponding phase bin. (B) Spike pattern decoding. (Left) Decoding performance across Ge neurons for the intact model using N = 8 bins for each code: spike count (black curve), time-partitioned (blue curve), and phase-partitioned codes (green curve). (Right) Data from the original experiment. Adapted from Kayser et al., 2012. (C) Mutual information (MI). (Left) Mean MI between stimulus and individual output neuron activity during sentence processing in the intact model for spike count (black curve), time-partitioned (blue line), combined count and phase-partitioned (green line) and combined time- and phase-partitioned codes (red line). (Right) Comparison with experimental data from auditory cortex neurons (adapted from Kayser et al., 2009).

https://doi.org/10.7554/eLife.06213.010

We observed that for 80 to 240 ms windows (within one theta cycle), decoding was almost as good for the phase-partitioned code as for the time-partitioned code (Figure 5B, left). In other words, stimulus decoding using theta timing was nearly as good as when using stimulus timing. Performance using the spike count was considerably lower (p < 10^−12 for all 6 window sizes). Overall, there was a qualitative and even quantitative match between the results from simulated data and the original experimental results (Figure 5B, right). When we removed either the input-to-theta (undriven theta model) or the theta-to-gamma connection (uncoupled theta/gamma model) in the network, the performance of the phase-partitioned code dropped to just above that of the spike count code (Figure 5—figure supplement 1A; significantly lower increase in decoding performance using phase-partitioned instead of spike count code compared to full model, p < 10^−12 for all 6 window sizes and both control models), and the simulations no longer predicted the experimental results. Finally, experimental data and simulations from the intact model also matched when we investigated the dependence of decoding accuracy on the number of bins, which was not the case for any of the control models (Figure 5—figure supplement 1B).

Mutual information (MI) analysis

MI between the input (acoustic stimulus) and the output (neural pattern) provides an alternative measure for how well stimuli are encoded in the output pattern (see ‘Materials and methods’). We used the same simulation data as for the classification procedure, but the sentences were subdivided into shorter chunks using a non-overlapping time window (length T: 8–48 ms) (Kayser et al., 2009). We compared the MI between the stimulus and neural activity in individual Ge neurons as a function of the length of the stimulus window, using four neural codes: spike count, time-partitioned code, phase-partitioned code combined with spike count and finally combined phase- and time-partitioned codes. These codes are qualitatively equivalent to the decoding strategies used in the previous classifier analysis. Figure 5C shows that taking into account the spike phase boosts the MI carried by the spike count code or the time-partitioned code alone (p < 10^−12 for all 6 window sizes). In other words, spike phase provided additional rather than redundant information to more traditional codes. The gain provided by spike phase increased when enlarging the window and when combined with either spike count or spike pattern (spike count vs time-partitioned code; spike count and phase-partitioned code vs time- and phase-partitioned code). These results replicate the original experimental data from monkey auditory cortex (Kayser et al., 2009). Such a pattern was not reproduced using any of the control models (Figure 5—figure supplement 1C). These results hence show that in addition to enhancing the reliability of the spike phase code, the theta–gamma connection enhanced the temporal precision of Ge neuron spiking in response to speech stimuli.
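To make the quantity concrete, the sketch below computes a naive plug-in estimate of the MI between discrete stimulus labels and discrete response words (for instance, tuples of binned spike counts). It is only an illustration under our own assumptions: the function name is hypothetical, and the estimator omits the limited-sampling bias correction that analyses of this kind normally require.

```python
import math
from collections import Counter

def plugin_mutual_information(stimuli, responses):
    """Plug-in estimate (in bits) of I(S;R) from paired discrete samples.

    `stimuli` and `responses` are equal-length sequences of hashable items.
    """
    n = len(stimuli)
    n_s = Counter(stimuli)                  # marginal counts of stimuli
    n_r = Counter(responses)                # marginal counts of responses
    n_sr = Counter(zip(stimuli, responses)) # joint counts
    mi = 0.0
    for (s, r), joint in n_sr.items():
        # p(s,r) * log2( p(s,r) / (p(s) p(r)) ), written with raw counts
        mi += (joint / n) * math.log2(joint * n / (n_s[s] * n_r[r]))
    return mi
```

With two equiprobable stimuli, a response that identifies the stimulus perfectly yields 1 bit, and a response that is independent of the stimulus yields 0 bits.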

Critically, results from both classifier and mutual information analyses demonstrate that the full network architecture of the model provides an efficient way of boosting the encoding capacity of neurons in a way that bears remarkable similarities to actual neurons from primate auditory cortex.

Discussion

Like most complex natural patterns, speech contains rhythmic activity at different scales that conveys different and sometimes non-independent categories of information. Using a biophysically inspired model of auditory cortex function, we show that cortical theta–gamma cross-frequency coupling provides a means of using the timing of syllables to orchestrate the readout of speech-induced gamma activity. The current modelling data demonstrate that theta bursts generated by a theta (PINTH) network can predict ‘on-line’ syllable boundaries at least as accurately as state-of-the-art offline syllable detection algorithms. Syllable boundary detection by a theta network hence provides an endogenous time reference for speech decoding. Our simulated data further show that a gamma biophysical network, receiving a spectral decomposition of speech as input, can take advantage of the theta time reference to encode fast phonemic information. The central result of our work is that the gamma network could efficiently encode temporal patterns (from simple sawtooths to natural speech), as long as it was entrained by the theta rhythm driven by syllable boundaries. The proposed theta/gamma network displayed sophisticated spectral and encoding properties that compared both qualitatively and quantitatively to existing neurophysiological evidence including cross-frequency coupling properties (Schroeder and Lakatos, 2009) and theta-referenced stimulus encoding (Kayser et al., 2009; 2012). The projections from the Te to Ge neurons endowed the network with phase-amplitude and phase-frequency coupling between gamma and theta oscillations, at both the spike and the LFP levels (Jensen and Colgin, 2007). This closely reproduces the theta/gamma phase-amplitude coupling observed from intracortical recordings (Giraud and Poeppel, 2012; Lakatos et al., 2005). 
Importantly, due to the dissociation of excitatory populations, we obtained denser gamma spiking immediately after the theta burst evoked by the syllable onset. This validates a critical point of the theta/gamma parsing system, namely that a more in-depth encoding is carried out by the auditory cortex during the early phase of syllables, when more information needs to be extracted (Schroeder and Lakatos, 2009; Giraud and Poeppel, 2012).

The human auditory system, like other sensory systems, is able to produce invariant responses to different physical presentations of the same input. Importantly, it is relatively insensitive to the speed at which speech is being produced. Speech can double in speed from one speaker to another and yet remain intelligible up to an artificial compression factor of 3 (Ahissar et al., 2001). In the current model, theta bursts could still signal syllable boundaries when speech was compressed by a factor of 2, and this alignment deteriorated for higher compression factors. Syllable decoding was significantly degraded for compressed speech, yet remained twice as accurate as chance. Our network is purely bottom-up and does not include high-level linguistic processes and representations, which in all likelihood play an important role in speech perception (Davis et al., 2011; Peelle et al., 2013; Gagnepain et al., 2012): its relative resilience to speech compression is thus a fairly good performance. A previous model (Gütig and Sompolinsky, 2009) proposed a neural code that was robust to speech warping, based on the notion that individual neurons correct for speech rate by their overall level of activity. While this model achieved very good speech categorization performance, it relied on extremely precise spiking behaviour (neurons spiked only once, when their associated channel reached a certain threshold), for which neurophysiological evidence is scarce. Another model developed by Hopfield proposes that a low gamma external current provides encoding neurons with reliable timing and dynamical memory spanning up to 200 ms, a long enough window to integrate information over a full syllable (Hopfield, 2004). 
The utility of gamma oscillations for precise spiking is arguably similar in both Hopfield's model and ours, whereas the syllable integration process is irregularly ensured by intermittent traces of recent (∼200 ms) neural activity in Hopfield's, and in ours by regularly spaced theta bursts that are locked to the speech signal. The advantage of our model is that integration over long speech segments is permanently enabled by the phase of output spikes with respect to the ongoing theta oscillation. Our approach shows that accurate encoding can be achieved using a system that does not require explicit memory processes, and in which the temporal integration buffer is only emulated by a slow neural oscillator aligned to speech dynamics.

In the current combined theta/gamma model, theta oscillations do not only act as a syllable-scale integration buffer, but also as a precise neural timer. Because syllabic contours are reflected in the slow modulations of speech, the theta oscillator can flexibly entrain to them (3–7 Hz, Figure 2—figure supplement 1A) and signal syllable boundaries. The spiking behaviour of theta neurons parallels experimental observations that a subset of neurons in A1 respond to the onset of naturalistic sounds (Fishbach et al., 2001; Phillips et al., 2002; Wang et al., 2008), providing an endogenous time reference that serves as a landmark to decode from other neurons (Kayser et al., 2012; Brasselet et al., 2012; Panzeri and Diamond, 2010; Panzeri et al., 2014). This parallels the dissociation between Ge and Te units in our model: while Ge units are channel specific, Te units cover the whole acoustic spectrum, which allows them to respond quickly and reliably to the onset of all auditory stimuli (Brasselet et al., 2012). In the model, however, theta neurons did not only discharge at stimulus onset but at regular landmarks along the speech signal, the syllable boundaries (Zhou and Wang, 2010). These neurons, hence, tie together the fast neural activity of gamma excitatory neurons into strings of linguistically relevant chunks (syllables), acting like punctuation in written language (Lisman and Buzsáki, 2008). This mechanism for segmentation is conceptually similar to the segmentation of neural codes by theta oscillations in the hippocampus during spatial navigation (Gupta et al., 2012).

From an evolutionary viewpoint, because the theta rhythm is neither auditory- nor human-specific, it might have been incorporated as a speech-parsing tool in the course of language evolution. Likewise, human language presumably optimized the length of its main constituents, syllables, to the parsing capacity of the auditory cortex. As a result, syllables have the ideal temporal format to interface with, e.g., hippocampal memory processes, or with motor routines reflecting other types of rhythmic mechanical constraints, e.g., the natural motion rate of the jaw (4 Hz) (Lieberman, 1985).

Although conceptually promising, syllable tracking and speech encoding by a theta/gamma network, as proposed here, also show some limitations. While our current model is purely bottom-up, top-down predictions play a significant role in guiding speech perception (Arnal and Giraud, 2012; Gagnepain et al., 2012; Poeppel et al., 2008) presumably across different frequency channels and processing timescales (Wang, 2010; Bastos et al., 2012; Fontolan et al., 2014). How these predictions interplay with theta- and gamma-parsing activity remains unclear (Lee et al., 2013). Experimental findings suggest that theta activity might be at the interface of bottom-up and top-down processes (Peelle et al., 2013). Theta auditory activity is better synchronized to speech modulations when speech is intelligible, irrespective of its temporal or spectral structure (Luo and Poeppel, 2007; Peelle et al., 2013). In the present model, theta activity bears an intrinsic temporal predictive function: it is driven by speech modulations, but is also resilient enough to syllable length variations to stay tuned to the global statistics of speech (average syllable duration). The model performed well above chance level when decoding syllables from a new speaker, showing flexibility in syllable tracking within a 3 to 9 Hz range. A natural follow-up of this work will hence be to explore how the intrinsic dynamics of theta and gamma activity interact not only with sensory input but also with linguistic top-down signals, e.g., word, sentence level predictions (Gagnepain et al., 2012), and even cross-modal predictions (Arnal et al., 2009). The trade-off between the autonomous functioning of theta and gamma oscillatory activity on one hand and their entrainment to sensory input on the other hand is at the core of future experimental and theoretical challenges.

In conclusion, our model provides direct evidence that theta/gamma coupled oscillations can be a viable instrument to de-multiplex speech, and by extension to analyse complex sensory scenes at different timescales in parallel. By tying the gamma-organized spiking to the syllable boundaries, theta activity allows for decoding individual syllables in continuous speech streams. The model demonstrates the computational value of neural oscillations for parsing sensory stimuli based on their temporal properties and offers new perspectives for syllable-based automatic speech recognition (Wu et al., 1997) and brain-machine interfaces using oscillation-based neuromorphic algorithms.

Materials and methods

Architecture of the full model

The model is composed of 4 types of cells: theta inhibitory neurons (Ti, 10 neurons), theta excitatory cells (Te, 10 neurons), gamma inhibitory neurons (Gi, 32 neurons), and gamma excitatory neurons (Ge, 32 neurons) also called output neurons. All neurons were modeled as leaky integrate-and-fire neurons, where the dynamics of the membrane potential Vi of the neurons followed:

$$C \frac{dV_i}{dt} = g_L (V_L - V_i) + I_i^{SYN}(t) + I_i^{INP}(t) + I_i^{DC} + \eta(t),$$

where C is the membrane capacitance; gL and VL are the conductance and equilibrium potential of the leak current; ISYN, IINP and IDC are the synaptic, input and constant currents, respectively; η(t) is a Gaussian noise term of variance σi.

Whenever Vi reached the threshold potential VTHR, the neuron emitted a spike and Vi was reset to VRESET.
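The integrate-and-fire rule can be sketched for a single isolated neuron as follows. This is a simplified Python illustration, not the simulation code used for the paper: the synaptic, input and noise terms are omitted, the function name is hypothetical, and the default values (in particular a leak conductance of 0.1 and consistent ms-based units) are assumptions loosely based on Table 1. Only the Euler scheme and the 0.005 ms step follow the description in the text.

```python
def simulate_lif(i_dc, t_max_ms=200.0, dt_ms=0.005, c=1.0, g_l=0.1,
                 v_l=-67.0, v_thr=-40.0, v_reset=-87.0):
    """Euler integration of C dV/dt = g_L (V_L - V) + I_DC.

    Synaptic input, external input and noise are left out on purpose.
    Returns the list of spike times in ms.
    """
    v = v_l                      # start at the leak equilibrium potential
    spikes = []
    for step in range(int(round(t_max_ms / dt_ms))):
        v += dt_ms * (g_l * (v_l - v) + i_dc) / c
        if v >= v_thr:           # threshold crossing: emit a spike and reset
            spikes.append((step + 1) * dt_ms)
            v = v_reset
    return spikes
```

Under these assumed values, a constant current of 3 drives the membrane towards −37 mV, above threshold, so the neuron fires regularly, whereas a constant current of 1 leaves it at −57 mV and silent unless it receives additional input.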

ISYN is the sum of all synaptic currents from all projecting neurons in the network:

$$I_i^{SYN}(t) = \sum_j g_{ij}\, s_{ij}(t) \left( V_j^{SYN} - V_i(t) \right),$$

where gij is the synaptic conductance of the j-to-i synapse, sij(t) is the corresponding activation variable, and VSYN is the equilibrium potential of the synaptic current (0 mV for excitatory neurons, −80 mV for inhibitory neurons). The activation variable sij(t) varies as follows:

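$$\frac{dx_j^{R}}{dt} = -\frac{x_j^{R}}{\tau_j^{R}} + \delta(t - t_j^{SPK}),$$
$$\frac{ds_{ij}}{dt} = -\frac{s_{ij}}{\tau_j^{D}} + x_j^{R},$$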

where τjR and τjD are the time constants for synaptic rise and synaptic decay, respectively.

The connectivity among the cells is the following:

  1. Te and Ti are reciprocally connected with all-to-all connections, generating the PINTH rhythm. There were also all-to-all connections within Ti cells.

  2. Ge and Gi are also reciprocally connected with all-to-all connections, generating the PING rhythm.

  3. Te projected with all-to-all connections to Ge cells, enabling cross-frequency coupling.

Input current IiINP(t) is non-null only for Te and Ge cells and follows the equation:

$$I_i^{INP}(t) = \sum_c \omega_{ci}\, x_c(t),$$

where xc(t) is the signal from channel c and ωci is the weight of the projection from channel c to unit i.

Input to Te units is computed by filtering the auditory spectrogram with an optimized 2D spectro-temporal kernel (see section LN model below). The LFP signal was simulated by summing the absolute values of all synaptic currents to all excitatory cells (both Ge and Te), as in Mazzoni et al. (2008). All simulations were run in Matlab. Differential equations were solved using the Euler method with a time step of 0.005 ms. Values for all parameters are provided in Tables 1 and 2.

Table 1

Full network parameter set

https://doi.org/10.7554/eLife.06213.012
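Parameter | C | V_THR | V_RESET | V_K | V_L | g_L | g_Ge,Gi | g_Gi,Ge | g_Te,Ge
Value | 1 F/cm2 | −40 mV | −87 mV | −100 mV | −67 mV | 0.1 | 5/N_Ge | 5/N_Gi | 0.3/N_Te

Parameter | τ_Ge^R | τ_Te^R | τ_Gi^R | τ_Ti^R | τ_Ge^D | τ_Gi^D | I_Ge^DC | I_Gi^DC
Value | 0.2 ms | 4 ms | 0.5 ms | 5 ms | 2 ms | 20 ms | 3 | 1
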
Table 2

Optimal parameters for the LN model

https://doi.org/10.7554/eLife.06213.013
Parameter | tspnext | τIh | DC
Value | 0.0748 | 1.433 | 0.4672

Stimuli

We used oral recordings of English sentences produced by male and female speakers from the TIMIT database (Linguistic Data Consortium, 1993). The sentences were first processed through a model of subcortical auditory processing (Chi et al., 2005). The model decomposes the auditory input into 128 channels of different frequency bands, reproducing the cochlear filterbank (http://www.isr.umd.edu/Labs/NSL/Software.htm). The frequency-decomposed signals undergo a series of nonlinear filters reflecting the computations taking place in the auditory nerve and other subcortical nuclei. We then reduced the number of channels from 128 to 32 by averaging the signal of each group of four consecutive channels, and used these 32 channels as input to the network. Each channel projected onto a distinct Ge cell (i.e., specific connections, ωci=0.25δ(c,i)). As for Te input, each channel was convolved with the temporal filter and projected to all Te cells (all-to-all connections). Such a convolution can be implemented by a population of relay neurons that transmit their input with a certain delay, here between 0 and 50 ms.
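The channel reduction and the one-to-one projection onto Ge cells can be sketched as follows. This is a minimal Python illustration with hypothetical function names; the spectrogram is assumed to be stored as a list of channels, each holding its samples over time.

```python
def reduce_channels(spectrogram, group=4):
    """Average each run of `group` consecutive channels (e.g. 128 -> 32)."""
    if len(spectrogram) % group != 0:
        raise ValueError("number of channels must be a multiple of group")
    reduced = []
    for start in range(0, len(spectrogram), group):
        block = spectrogram[start:start + group]
        # zip(*block) iterates over time, yielding one sample per channel
        reduced.append([sum(samples) / group for samples in zip(*block)])
    return reduced

def ge_input_current(channels, weight=0.25):
    """Input to Ge cells with one-to-one weights w_ci = weight * delta(c, i)."""
    return [[weight * x for x in channel] for channel in channels]
```

Because the weight matrix is diagonal, the sum over channels in the input current reduces to a simple scaling of the channel matched to each Ge cell.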

Phoneme identity and boundaries have been labelled by phoneticians in every sentence of the corpus. We used the Tsylb2 program (Fisher, 1996) that automatically syllabifies phonetic transcriptions (Kahn, 1976) to merge these sequences of phonemes into sequences of syllables according to English grammar rules and thus get a timing for syllable boundaries.

To address the resilience of the model to speech compression, we produced compressed sentences by applying a pitch-synchronous overlap-and-add (PSOLA) procedure implemented in PRAAT, a speech analysis and modification software (http://www.fon.hum.uva.nl/praat/). The procedure retains all spectral properties of the original speech data throughout the compression process. The same precortical filters were then applied as for uncompressed data before feeding into the network.

Syllable boundary prediction algorithms

Syllable boundary-triggered averages (STAs) were computed as follows: for each syllable boundary (syllable onsets excluding the first of each sentence), we extracted a 700 ms window of the corresponding signal locked to the syllable boundary and averaged over all syllable boundaries. STAs were computed for the speech envelope and for each channel of the Chi et al. (2005) model.

Predictive models

We compared the performance of four distinct families of models to predict the timing of syllable boundaries based on speech envelope or speech audiogram: the Mermelstein algorithm, a Linear–Nonlinear (LN) model (a simplified integration-to-threshold algorithm), the entrained theta neural oscillator and a purely rhythmic control model. The four algorithms are presented in the sections below.

Mermelstein algorithm

The Mermelstein algorithm is a standard algorithm that predicts syllable boundaries by identifying troughs in the power of the speech signal (Mermelstein, 1975; Villing et al., 2004). The predicted boundaries are computed according to the following steps. First, extract the power of the speech signal in the 500–4000 Hz range (grossly corresponding to formants) and low-pass filter at 40 Hz to remove fast fluctuations, defining a so-called loudness function. Second, for each sentence, compute the convex hull of the loudness signal and extract the maximum of the difference between the loudness signal and its convex hull. If that difference exceeds a certain threshold Tmin and if the peak intensity of the interval is no more than Pmax smaller than the peak intensity of the whole sentence, then that time of maximal difference is defined as a predicted boundary and the same procedure is applied recursively to the intervals to the left and right of that boundary. Parameters Tmin and Pmax were optimized to yield minimum prediction distance (see below), yielding Tmin = 0.152 dB and Pmax = 15.85 dB.

Note that this algorithm cannot be run online since the convex hull at a given time depends on the future value of speech power. Thus syllable boundaries can only be predicted after a certain delay, which makes it impractical for online speech comprehension as occurring in the human brain.
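The recursive step can be sketched as follows, working on a loudness function that has already been computed (in dB) on a regular time grid. This is a minimal Python illustration rather than the implementation used here: the function names are hypothetical, and we assume that the hull is the smallest curve lying above the loudness that rises monotonically to the peak of the interval and falls monotonically afterwards.

```python
def hull(loudness):
    """Unimodal hull: non-decreasing up to the peak, non-increasing after it."""
    peak = max(range(len(loudness)), key=loudness.__getitem__)
    h = list(loudness)
    for i in range(1, peak + 1):               # running maximum from the left
        h[i] = max(h[i], h[i - 1])
    for i in range(len(h) - 2, peak - 1, -1):  # running maximum from the right
        h[i] = max(h[i], h[i + 1])
    return h

def predict_boundaries(loudness, t_min, p_max, lo=0, hi=None, sentence_peak=None):
    """Return sorted sample indices of predicted boundaries within [lo, hi]."""
    if hi is None:
        hi = len(loudness) - 1
    if sentence_peak is None:
        sentence_peak = max(loudness)
    segment = loudness[lo:hi + 1]
    if len(segment) < 3:
        return []
    diffs = [hv - lv for hv, lv in zip(hull(segment), segment)]
    k = max(range(len(diffs)), key=diffs.__getitem__)
    interior = 0 < k < len(segment) - 1        # guards against endless recursion
    loud_enough = max(segment) >= sentence_peak - p_max
    if not interior or diffs[k] <= t_min or not loud_enough:
        return []
    b = lo + k                                 # deepest dip below the hull
    return (predict_boundaries(loudness, t_min, p_max, lo, b, sentence_peak)
            + [b]
            + predict_boundaries(loudness, t_min, p_max, b, hi, sentence_peak))
```

The sketch makes the offline nature of the method explicit: the hull of an interval, and therefore every boundary inside it, depends on loudness values up to the end of that interval.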

LN model and variations

To evaluate the capacity of a simplified neural system to predict syllable boundaries, we trained a generalized linear point process model on the syllable data set. The model (Figure 2—figure supplement 1D) does not incorporate full neural dynamics but simply comprises a linear stimulus kernel followed by a nonlinear function. The process issues a ‘spike’ or ‘syllable boundary signal’ whenever the output reaches a certain threshold (Pillow et al., 2008). This signal is fed back into the nonlinear function (another kernel Ih is used here): this negative feedback loop implements a relative refractory period. This model is a generalization of the Linear–Nonlinear Poisson model, hence we refer to it simply as the LN model. We used the 32 auditory channels as input to the model and trained it to maximize its syllable boundary prediction performance.

We looked for a linear filter that is separable in its temporal and spectral components. We first computed the Spike Triggered Average (or rather ‘Syllable Boundary Triggered Average’) for all 32 channels from 600 ms to 0 ms prior to the actual boundary in 10 ms time steps. Yet the STA provides the optimal estimate for the linear kernel in an LN model only when the stimulus consists of uncorrelated white noise (Chichilnisky, 2001). To get the optimal values outside of the white noise condition, we looked for the separable filter H that yields the best prediction of the output, i.e., that minimizes $\langle |Y(t) - \hat{Y}(t|H)|^2 \rangle$, where:

  • Y(t) is a binary output equal to 1 if there is a syllabic boundary in the 10 ms interval, 0 otherwise,

  • H is a separable spectro-temporal filter, i.e., H(ω, u) = S(ω)T(u) for all orders u and all frequencies ω; S and T are, respectively, the spectral and temporal components of filter H.

  • $\hat{Y}(t|H) = \sum_{u,\omega} H(\omega,u)\, X(\omega, t-u)$, where X(ω,t) is the value of auditory channel ω at time step t.

Optimal solutions of the system verify:

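$$\sum_u T(u)\, R(\omega,u) = \sum_{u,v,\xi} S(\xi)\, T(u)\, T(v)\, M(\omega,\xi,u,v) \quad \forall \omega,$$
$$\sum_\omega S(\omega)\, R(\omega,u) = \sum_{v,\omega,\xi} S(\omega)\, S(\xi)\, T(v)\, M(\omega,\xi,u,v) \quad \forall u,$$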

where $R(\omega,u) = \langle Y(t)\, X(\omega,t-u) \rangle_t$ (i.e., R is the Spike Triggered Average) and M is the covariance tensor for X, i.e., $M(\omega,\xi,u,v) = \mathrm{cov}(X(\omega,t-u), X(\xi,t-v))$.

Solutions to T and S for that system of equations can be approximated numerically using the following iterative procedure:

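$$S_0(\omega) = 1 \;\; \forall \omega, \qquad T_0(u) = 1 \;\; \forall u,$$
$$S_{n+1} = \left( \frac{T_0 R}{\sum_{u,v} T_n(u)\, T_n(v)\, M(u,v,\cdot,\cdot)} \right)^{T}, \qquad T_{n+1} = \frac{R\, S_0}{\sum_{\omega,\xi} S_{n+1}(\omega)\, S_{n+1}(\xi)\, M(\omega,\xi,\cdot,\cdot)},$$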

and then stopping when the resulting square error $\left\| R\, S_0 - \sum_{\omega,\xi} S_{n+1}(\omega)\, S_{n+1}(\xi)\, T_n(v)\, M(\omega,\xi,\cdot,v) \right\|_u^2$ goes below a minimum value (we used a threshold of $10^{-4}$). The first 6 components (i.e., time bins) of the temporal kernel (i.e., 0–50 ms) were also used for input convolution in the theta model. We did not integrate further components (60–400 ms) since their weight was much lower and their implementation by relay neurons seemed less realistic.

To retrieve the optimal value for all parameters of the model, we used the GLM Matlab toolbox developed in the Pillow lab (http://pillowlab.cps.utexas.edu/code_GLM.html), using as input the one-dimensional signal $U(t) = \sum_{\omega} S(\omega)\, X(\omega,t)$. Other parameters of the LN model, including the self-inhibition temporal kernel Ih, were optimized using the gradient descent implemented in the toolbox. This method provides an estimate for a stochastic generalized LN model, whereas we were interested in assessing the performance of a deterministic LN model. We therefore ran a deterministic model with the same parameters as the stochastic model plus one new free parameter, tspnext, describing the normalized time to next spike (in the stochastic model, that time is drawn from an exponential distribution). The value of tspnext was optimized using the same minimization procedure used for other models (see Optimisation section below). Two other parameters were also optimized again, since this procedure minimized a different score than the GLM toolbox score: the time scale of self-inhibition τIh and the constant input to the model DC (Table 2).

We made one last modification to this LN model. We optimized the model such that it would maximally fire not at the time of syllable boundaries but 10 ms posterior to that time (de facto, we simply slid the STA window by 10 ms). This provides a delayed signal but likely more reliable since it can use more information (notably the rebound in the auditory spectrogram that is present right after a syllable boundary).

Theta model

The theta model is composed of the Te and Ti cells from the full network model described above, with the exact same parameter set. 11 parameters were optimized in the full model, 10 in the control model (see values in Table 3).

Table 3

Optimal parameters for the theta model

https://doi.org/10.7554/eLife.06213.014
ParsσTeσTi=σGe=σGiτTeDτTiDITeextITeDCτTiDCgTi,TigTi,TegTeL
Value0.282 A ms/cm22.028 A ms/cm224.330.36151.250.08510.4320.2070.264

Control model

The control model was used to provide a baseline for assessing the performance of other models. Under these control conditions, predicted syllable boundaries were generated rhythmically at a fixed time interval, irrespective of the stimulus. The rate of the rhythmic process was varied from 1 Hz to 15 Hz in 0.5 Hz intervals. This control model yielded better performance than another control model consisting of a homogeneous Poisson process. It thus provides a more stringent control for estimating the efficiency of other algorithms.

Model performance evaluation

We evaluated how well syllable boundaries predicted by any model matched the boundaries derived from labelled speech data. As an evaluation metric, we used a point process distance that is commonly used to compare spike trains (Victor and Purpura, 1997). Shift cost was set to 20 s^−1 (in other words, a predicted and an actual boundary could be matched if they were no more than 50 ms apart).
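A distance of this family can be computed by dynamic programming, as in the sketch below. This is a minimal Python illustration with a hypothetical function name, assuming the usual cost convention in which adding or deleting an event costs 1 and shifting an event by Δt costs the shift cost times |Δt|; it is not the toolbox implementation, and the normalization applied in the analyses is not reproduced.

```python
def victor_purpura_distance(train_a, train_b, shift_cost):
    """Edit distance between two sorted event trains (times in seconds).

    Adding or deleting an event costs 1; moving an event by dt costs
    shift_cost * |dt|.
    """
    n, m = len(train_a), len(train_b)
    prev = list(range(m + 1))      # empty train vs first j events of train_b
    for i in range(1, n + 1):
        curr = [i] + [0.0] * m     # first i events of train_a vs empty train
        for j in range(1, m + 1):
            shift = shift_cost * abs(train_a[i - 1] - train_b[j - 1])
            curr[j] = min(prev[j] + 1,         # delete an event of train_a
                          curr[j - 1] + 1,     # add an event of train_b
                          prev[j - 1] + shift) # shift one event onto the other
        prev = curr
    return prev[m]
```

With a shift cost of 20 s^−1, shifting a predicted boundary by 50 ms costs exactly as much as one insertion or deletion, so nearby boundaries are cheaply matched while distant ones are effectively counted as a miss plus a false detection.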

To draw comparison between different models, for each level of compression, we computed the (non-normalized) distance measure for the theta model summed over all sentences in the test data set, as well as the average number of predicted boundaries per sentence. We then matched the theta model to a control rhythmic model with the same predicted syllabic rate, and computed the difference between the non-normalized distance for the theta model and for that matched rhythmic model.

Optimisation

We optimized the parameters from all models to get the minimal normalized point process distance between predicted and actual boundaries in each sentence. Optimization was performed with a derivative-free simplex search (function fminsearch in Matlab, which implements the Nelder–Mead algorithm rather than a gradient descent) and repeated from many initial points to avoid retaining a local minimum. Although both the theta model and the control model are intrinsically stochastic, the sample size was large enough for the objective function over the entire sample to be nearly deterministic, allowing for convergence of the search algorithm. The list of optimized parameters for each type of model is provided in the related model sections above. We split the entire TIMIT TRAIN data set (4620 sentences) into two data sets: a first data set of 1000 sentences was used to compute optimal parameters; final assessment of an algorithm's performance with its optimal parameters was done on a separate set of 3620 sentences.

Analysis of model behaviour

LFP spectral analysis

Simulated LFP was downsampled to 1000 Hz before applying a time-frequency decomposition using a complex Morlet wavelet transform, with all frequencies between 2 and 100 Hz with a 0.5 Hz precision. Coherence between stimulus and LFP signal was then computed for each time point t and each frequency f over 100 simulations using 100 distinct sentences sen, using the formula from Mitra and Pesaran (1999). Synchronized bursts of the PING or PINTH were detected using spike timings in Gi and Ti populations since spikes of inhibitory neurons were more synchronized than those of excitatory neurons. Synchronous bursts of spikes were detected within a given population whenever more than 10% of neurons in the population spiked within a 6 ms interval (15 ms for Ti cells).
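The burst criterion can be sketched with a sliding window over the pooled spikes of a population. This is a minimal Python illustration under our own assumptions: the function name is hypothetical, spikes are given as (time in ms, neuron index) pairs, and overlapping supra-threshold windows are merged into a single burst whose time is taken as that of its first spike.

```python
def detect_bursts(spikes, n_neurons, window_ms=6.0, min_fraction=0.10):
    """Return onset times of synchronous bursts in a population.

    `spikes` is a list of (time_ms, neuron_id) pairs. A burst is declared when
    more than `min_fraction` of the neurons spike within `window_ms`.
    """
    spikes = sorted(spikes)
    bursts = []
    start = 0
    last_burst_end = float("-inf")
    for end in range(len(spikes)):
        # shrink the window so that it spans at most window_ms
        while spikes[end][0] - spikes[start][0] > window_ms:
            start += 1
        active = {neuron for _, neuron in spikes[start:end + 1]}
        if len(active) > min_fraction * n_neurons:
            if spikes[start][0] > last_burst_end:  # not part of the previous burst
                bursts.append(spikes[start][0])
            last_burst_end = spikes[end][0]
    return bursts
```

Counting distinct neurons rather than spikes prevents a single neuron firing repeatedly from being mistaken for a population burst.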

Cross-frequency coupling

We computed cross-frequency coupling from 50 simulations of the model, each with a different TIMIT sentence preceded by 1000–1500 ms rest.

For the LFP phase-amplitude coupling, we extracted phase and amplitude from all frequencies from 2 Hz to 70 Hz in 1 Hz intervals, and computed the Modulation Index for all pairs of frequencies (Tort et al., 2010). Data from all trials were concatenated beforehand (separately for spontaneous and speech-related activity). To compute the Modulation Index, in each condition, signal amplitude values x(famp,t,sen) were binned in N = 18 different bins according to the simultaneous phase of x(fphase,t,sen). For spike phase-amplitude coupling, we defined spike gamma amplitude as the number of Gi neurons spiking at a given gamma burst, and the spike theta phase was defined by linear interpolation from −π for a theta spike burst to +π for the subsequent theta burst.
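For one pair of frequencies, the Modulation Index can be sketched as follows, taking as input the instantaneous phase of the slow signal and the simultaneous amplitude of the fast signal. This is a minimal Python illustration with a hypothetical function name; it assumes that phases lie in [−π, π) and that the index is the Kullback–Leibler divergence between the normalized mean amplitude per phase bin and the uniform distribution, divided by log N.

```python
import math

def modulation_index(phases, amplitudes, n_bins=18):
    """Normalized KL divergence of the phase-binned amplitude from uniform."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for phi, amp in zip(phases, amplitudes):
        wrapped = (phi + math.pi) % (2 * math.pi)  # map the phase to [0, 2*pi)
        k = min(int(wrapped / (2 * math.pi) * n_bins), n_bins - 1)
        sums[k] += amp
        counts[k] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    total = sum(means)
    if total <= 0:
        raise ValueError("amplitudes must have a positive mean")
    p = [m / total for m in means]                 # amplitude distribution over phase
    entropy = -sum(q * math.log(q) for q in p if q > 0)
    return (math.log(n_bins) - entropy) / math.log(n_bins)
```

The index is 0 when the amplitude is the same at all phases and 1 when all the amplitude is concentrated in a single phase bin.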

Simple temporal patterns decoding

We first explored the model's performance using simple sawtooth signals (Shamir et al., 2009), representing prototypical realizations of formant transitions in a given frequency band. Each stimulus consisted of a rising component between 0 and 1, followed by a decay component from 1 back to 0. The overall length of the sawtooth was 50 ms, and the relative position of the maximal point tMAX between the starting point tSTART and end point tEND was defined by a variable a = (tMAX − tSTART)/(tEND − tSTART).
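Such a stimulus can be generated as in the sketch below. This is a minimal Python illustration: the function name and the 1 ms sampling step are assumptions, and the rise and decay are taken to be linear.

```python
def sawtooth(a, duration_ms=50.0, dt_ms=1.0):
    """Sawtooth rising from 0 to 1 over a fraction `a` of its duration, then decaying to 0.

    Returns samples at t = 0, dt_ms, ..., duration_ms (relative to t_START).
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    n = int(round(duration_ms / dt_ms))
    t_max = a * duration_ms               # time of the peak relative to onset
    samples = []
    for k in range(n + 1):
        t = k * dt_ms
        if t < t_max:
            samples.append(t / t_max)     # rising component
        elif t_max < duration_ms:
            samples.append((duration_ms - t) / (duration_ms - t_max))  # decay
        else:
            samples.append(1.0)           # a == 1: the peak sits at the very end
    return samples
```

Values of a close to 0 give a sharp attack followed by a slow decay, and values close to 1 give the mirror-image shape, so that a alone controls the temporal pattern to be decoded.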

The input connectivity had to be slightly modified since sawtooths are one-dimensional signals in contrast to the multi-dimensional channel signals that we have to use for speech stimuli: for Te units, we used ITeEXT = 20; and for the connections to Ge units in line with the original model (Shamir et al., 2009), we used different input levels across the population, ranging from 0.125 to 4 in 0.125 intervals. The rest of the model remained unchanged.

We simulated the response of the network to a series of 500 sawtooths with parameter a taking one of 10 equally spaced values within the [0 1] interval. Interstimulus interval varied randomly between 50 and 250 ms.

We compared the model's performance for different neural codes. For the ‘stimulus timing’ code (see ‘Results’ section), we extracted the spike pattern of output (Ge) neurons between 20 ms before and 70 ms after each sawtooth onset. We computed the distance between all output spike patterns using a spike train distance measure (Victor and Purpura, 1997), implemented in the Spike Train Analysis Toolkit (http://neuroanalysis.org/toolkit/). We used a shift cost of 200 s^−1 corresponding to a timing resolution of 5 ms. We decoded the peak parameter using the simple leave-one-out clustering procedure of the STA toolkit, using a clustering exponent of −10. By comparing the ‘decoded parameter’, i.e., the parameter corresponding to the closest cluster, to the input sawtooth parameter, we built confusion matrices and computed decoding performance.

In the ‘theta-timing’ code, we extracted the spike pattern of output neurons in windows starting 20 ms before a theta burst and finishing 20 ms after the next theta burst (‘theta chunks’, Figure 4A). Spike times within each chunk were referenced with respect to the onset of the window. Each spike pattern was labelled with the corresponding value of the stimulus if the theta burst occurred during the presentation of the stimulus, or with the label ‘rest’ if the theta burst occurred during an interstimulus interval. The same decoding analysis was applied on such internally referenced neural patterns, yielding an 11 × 11 confusion matrix (10 stimulus shapes and rest). Detection theory measures (hits, misses, false alarms, and correct rejections) were computed by summing values in blocks of the confusion matrix (of size 10 × 10, 10 × 1, 1 × 10, and 1 × 1, respectively). A classification confusion matrix was obtained by removing the last row and last column of that confusion matrix.
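The block sums over the confusion matrix can be sketched as follows. The layout is an assumption of this sketch: rows are true labels, columns are decoded labels, and the last row and column hold the ‘rest’ class. The function names are hypothetical.

```python
def detection_counts(confusion):
    """Sum an (n+1) x (n+1) confusion matrix into detection-theory counts.

    Assumed layout: rows are true labels, columns are decoded labels, and
    the last row/column is the 'rest' class.
    """
    n = len(confusion) - 1
    # stimulus decoded as some stimulus (n x n block)
    hits = sum(confusion[i][j] for i in range(n) for j in range(n))
    # stimulus decoded as rest (n x 1 block)
    misses = sum(confusion[i][n] for i in range(n))
    # rest decoded as some stimulus (1 x n block)
    false_alarms = sum(confusion[n][j] for j in range(n))
    # rest decoded as rest (1 x 1 block)
    correct_rejections = confusion[n][n]
    return hits, misses, false_alarms, correct_rejections


def classification_matrix(confusion):
    """Drop the last row and column, keeping stimulus-vs-stimulus entries."""
    return [row[:-1] for row in confusion[:-1]]
```

Note that, under this definition, a hit only requires that a stimulus chunk be decoded as some stimulus; whether the correct shape was decoded is assessed separately from the reduced classification matrix.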

We ran the same decoding analysis on three variants of the network: the full network; a control model where Te units do not receive the sawtooth input (undriven theta network); and another control where theta–gamma connections were removed (uncoupled theta–gamma network).

Syllable decoding from sentences

The classification procedure was similar for syllable decoding, where we tried to decode the identity of syllables within a continuous stream of speech (full sentences) from the activity of output neurons. We stimulated the network by presenting 25 sentences from the TIMIT corpus repeated 100 times each. We extracted theta chunks of Ge spike patterns as explained previously. Each chunk was labelled with the identity of the syllable being presented at the time of the first theta burst of the chunk. We randomly selected 10 syllables from the whole set of syllables within the 25 sentences. As in some cases there were several consecutive theta chunks corresponding to the same syllable, we equated the total number of theta chunks per syllable by randomly selecting 100 theta chunks labelled with each of the 10 syllables. Syllable classification of theta-chunked Ge spike patterns was performed using two different neural codes. For the spike pattern code, we applied the same procedure as for sawtooth classification, using a smaller value of spike shift cost corresponding to a timing resolution of 60 ms. For the spike count code, we measured the number of spikes emitted by each Ge neuron within a theta chunk. We then ran a simple nearest-mean classification procedure to decode the syllable identity corresponding to each theta chunk from the spike counts of all Ge neurons (see ‘Classification analysis’ below). Both methods relied on a leave-one-out procedure, in which each chunk is identified by a decoder trained on all chunks but the to-be-decoded one. Decoding was repeated 200 times, each time using a different set of 10 random syllables, and the analysis was performed over all three variants of the network.

For syllable classification across speakers, we used the two sentences from the TIMIT corpus that have been recorded for each of the 462 speakers ('She had your dark suit in greasy wash water all year' and 'Don't ask me to carry an oily rag like that') and trained the network to classify syllables based on the neural output from other speakers, thus testing generalization across speakers. Pronunciation varies widely across speakers, as attested by the variability of the phoneme chains labelled by phoneticians, but the two sentences could nonetheless be parsed into 25 syllables overall for each speaker. We simulated the network presenting these 924 sentences and used the theta-chunked output to decode syllable identity. The method used was very similar to the syllable decoding analysis, where we classified theta-chunked neural patterns into one of 10 possible syllables (drawn randomly from the set of 25 syllables), with the only difference that here the classifier was based on theta chunks coming from different speakers. The classification was repeated 100 times for different subsets of syllables.

Neural encoding properties: classification analysis

The first analysis of neural encoding properties consisted in comparing the ability to classify neural codes from the model into arbitrary speech segments (as opposed to syllables as in the previous section). The methods, as detailed below, were inspired by the decoding of neural auditory cortical activity recorded in monkeys in response to naturalistic sounds (Kayser et al., 2012). We simulated the network by presenting 25 different sentences from the TIMIT corpus repeated 50 times each. For a given window size (ranging from 80 to 480 ms in 80 ms intervals), we randomly extracted 10 windows (defined as stimuli) from the overall set of 25 sentences. We then retrieved stimulus identity based on the activity of a neuron that was randomly drawn from the Ge population using three different neural codes. In the spike count code, we counted the number of spikes emitted by that neuron within each window. In the time-partitioned code, we divided each window into N equally sized bins and computed the number of spikes for each of the N bins separately. In the phase-partitioned code, we divided the window based on theta-phase rather than time intervals: each spike was labelled with the phase of the theta oscillation at the corresponding spike time, and we computed the number of spikes falling into each of the N subdivisions of the [−π;π] interval.
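The two partitioned codes differ only in the variable used for binning, as the following sketch illustrates. The function names and argument conventions (spike times and window bounds in the same time unit, phases in radians) are assumptions of this sketch; the spike count code is simply the sum of either vector.

```python
import math


def time_partitioned_code(spike_times, window_start, window_length, n_bins):
    """Count spikes in n_bins equal time bins of one stimulus window."""
    counts = [0] * n_bins
    for t in spike_times:
        rel = (t - window_start) / window_length
        if 0.0 <= rel < 1.0:  # ignore spikes outside the window
            counts[min(int(rel * n_bins), n_bins - 1)] += 1
    return counts


def phase_partitioned_code(spike_phases, n_bins):
    """Count spikes in n_bins equal subdivisions of the [-pi, pi] interval.

    spike_phases holds the theta phase (radians) at each spike time.
    """
    counts = [0] * n_bins
    for phase in spike_phases:
        rel = (phase + math.pi) / (2.0 * math.pi)
        # a phase of exactly +pi is assigned to the last bin
        counts[min(int(rel * n_bins), n_bins - 1)] += 1
    return counts
```

Because the phase bins follow the theta oscillation rather than the clock, two presentations whose theta cycles differ slightly in duration can still yield comparable phase-partitioned vectors.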

We then used a nearest-mean template matching procedure to decode the stimuli. To classify each stimulus exemplar using each neural code, we averaged the vectors over all presentations of each stimulus using a leave-one-out procedure; we then computed the Euclidean distance from the current vector to each of the 10 stimulus-averaged templates. Finally, we ‘decoded’ the neural code by assigning it to the stimulus class with minimal distance to template. A more detailed explanation of the procedure is provided in the original experiment article (Kayser et al., 2012). The procedure was repeated 1000 times, each time with a different set of 10 random stimuli, and performed for the 3 variants of the network.
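A minimal sketch of nearest-mean template matching with leave-one-out is given below. The data layout (`responses[s][k]` is the response vector for presentation k of stimulus s) and the function name are assumptions of this sketch, and each stimulus needs at least two presentations so that a template remains once the tested trial is left out.

```python
import math


def nearest_mean_loo_accuracy(responses):
    """Leave-one-out nearest-mean decoding accuracy.

    responses[s][k] is the response vector for trial k of stimulus s.
    Returns the fraction of trials decoded as their own stimulus.
    """
    n_correct = 0
    n_total = 0
    for s_true, trials in enumerate(responses):
        for k_test, test_vector in enumerate(trials):
            best_label, best_dist = None, math.inf
            for s, candidate_trials in enumerate(responses):
                # Leave the tested trial out of its own stimulus template.
                used = [v for k, v in enumerate(candidate_trials)
                        if not (s == s_true and k == k_test)]
                template = [sum(col) / len(used) for col in zip(*used)]
                dist = math.dist(test_vector, template)  # Euclidean distance
                if dist < best_dist:
                    best_label, best_dist = s, dist
            n_correct += int(best_label == s_true)
            n_total += 1
    return n_correct / n_total
```

Leaving the tested trial out of its own template matters: without it, the trial would pull its template towards itself and inflate the decoding accuracy.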

Neural encoding properties: mutual information analysis

We complemented the stimulus classification with a similar analysis using mutual information between the acoustic ‘stimulus’ and the response of individual Ge neurons to further characterize the encoding properties of the network. Mutual Information (MI) estimates the reduction of uncertainty about the acoustic ‘stimulus’ that is obtained from the knowledge of a single trial of neural response. The data set was identical to the one previously used for stimulus classification analysis, where each stimulus was again segmented into non-overlapping windows of length T (here 8 to 48 ms) (Kayser et al., 2009; de Ruyter van Steveninck et al., 1997).
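As an illustration of the quantity being estimated, the direct ("plug-in") estimate of MI from paired stimulus and response labels can be sketched as follows. This sketch applies no correction for limited-sampling bias, and the function name and data layout are assumptions; for partitioned codes, each response label could be a tuple of bin counts.

```python
import math
from collections import Counter


def mutual_information_bits(stimuli, responses):
    """Plug-in estimate of I(S;R) in bits from paired trial observations.

    stimuli[k] and responses[k] are hashable labels for trial k. No
    correction for limited-sampling bias is applied here.
    """
    n = len(stimuli)
    joint = Counter(zip(stimuli, responses))
    count_s = Counter(stimuli)
    count_r = Counter(responses)
    info = 0.0
    for (s, r), count in joint.items():
        # p(s, r) / (p(s) p(r)) = count * n / (count_s * count_r)
        info += (count / n) * math.log2(count * n / (count_s[s] * count_r[r]))
    return info
```

Two equiprobable stimuli that always evoke distinct responses give 1 bit, and responses that are independent of the stimulus give 0 bits. With few trials per stimulus the plug-in estimate is biased upwards, which is why bias corrections are needed in practice.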

Mutual Information was computed for the same neural codes as in Kayser et al. (2009). We used the Spike count code and Time-partitioned code as described above (for the Time-partitioned code the size of the bins was kept constant at 8 ms; the number of bins in a window hence increased with window size). As slow LFP phase was more reliable over sentence repetitions than power, we combined spike count and LFP theta phase to get a Spike count & Phase-partitioned code (Montemurro et al., 2008). For this code, the phase of slow LFP was divided into N = 4 bins, and the firing rate in each window was labelled according to the phase at which the first spike occurred. Finally, we explored the influence of slow LFP phase on MI when combined with temporal spiking patterns. Thus, in the Time- & Phase-partitioned code, spikes carry two distinct tags: the first refers to the position of the spike inside one of the four subdivisions of the stimulus window, and the second indicates the phase of the underlying LFP at the moment of the spike occurrence.

We corrected for sampling bias (Kayser et al., 2009) first by using a shuffling method (Panzeri et al., 2007), then the quadratic extrapolation method (Strong et al., 1998). We further reduced the residual bias using a bootstrapping technique (200 resampled data sets) (Montemurro et al., 2008).

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
    Auditory edge detection: a neural model for physiological and psychoacoustical responses to amplitude transients
    1. A Fishbach
    2. I Nelken
    3. Y Yeshurun
    (2001)
    Journal of Neurophysiology 85:2303–2323.
  20. 20
    tsylb2
    1. WM Fisher
    (1996)
    National Institute of Standards and Technology, http://www.nist.gov/speech/tools.
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
    Syllable-based generalizations in English phonology
    1. D Kahn
    (1976)
    Syllable-based generalizations in English phonology, http://seas3.elte.hu/szigetva/courses/syllable/kahn76-pres-szpsyllableracz.pdf.
  35. 35
    Analysis of slow (theta) oscillations as a potential temporal reference frame for information coding in sensory cortices
    1. C Kayser
    2. RA Ince
    3. S Panzeri
    (2012)
    PLOS Computational Biology 8:e1002717.
    https://doi.org/10.1371/journal.pcbi.1002717
  36. 36
  37. 37
    Gamma and theta rhythms in biophysical models of hippocampal circuits
    1. NJ Kopell
    2. C Börgers
    3. DD Pervouchine
    4. P Malerba
    (2010)
    In: V Cutsuridis, editors. Hippocampal microcircuits. Springer. pp. 423–457.
  38. 38
  39. 39
  40. 40
    Top-down beta rhythms support selective attention via interlaminar interaction: a model
    1. JH Lee
    2. MA Whittington
    3. NJ Kopell
    (2013)
    PLOS Computational Biology 9:e1003164.
    https://doi.org/10.1371/journal.pcbi.1003164
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
    Auditory cortex tracks both auditory and visual stimulus dynamics using low-frequency neuronal phase modulation
    1. H Luo
    2. Z Liu
    3. D Poeppel
    (2010)
    PLOS Biology 8:e1000445.
    https://doi.org/10.1371/journal.pbio.1000445
  47. 47
  48. 48
    Encoding of naturalistic stimuli by local field potential spectra in networks of excitatory and inhibitory neurons
    1. A Mazzoni
    2. S Panzeri
    3. NK Logothetis
    4. N Brunel
    (2008)
    PLOS Computational Biology 4:e1000239.
    https://doi.org/10.1371/journal.pcbi.1000239
  49. 49
    Automatic segmentation of speech into syllabic units
    1. P Mermelstein
    (1975)
    The Journal of the Acoustical Society of America 58:880–883.
    https://doi.org/10.1121/1.380738
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
    Phase-based measures of cross-frequency coupling in brain electrical dynamics under general anesthesia
    1. EA Mukamel
    2. KF Wong
    3. MJ Prerau
    4. EN Brown
    5. PL Purdon
    (2011)
    Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2011:1981–1984.
    https://doi.org/10.1109/IEMBS.2011.6090558
  55. 55
  56. 56
  57. 57
  58. 58
    Reading spike timing without a clock: intrinsic decoding of spike trains
    1. S Panzeri
    2. RA Ince
    3. ME Diamond
    4. C Kayser
    (2014)
    Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 369:20120467.
    https://doi.org/10.1098/rstb.2012.0467
  59. 59
  60. 60
    Reconstructing speech from human auditory cortex
    1. BN Pasley
    2. SV David
    3. N Mesgarani
    (2012)
    PLOS Biology 10:e1001251.
    https://doi.org/10.1371/journal.pbio.1001251
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
    Speech perception at the interface of neurobiology and linguistics
    1. D Poeppel
    2. WJ Idsardi
    3. V van Wassenhove
    (2008)
    Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 363:1071–1086.
    https://doi.org/10.1098/rstb.2007.2160
  66. 66
    Temporal information in speech: acoustic, auditory and linguistic aspects
    1. S Rosen
    (1992)
    Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 336:367–373.
    https://doi.org/10.1098/rstb.1992.0070
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
  76. 76
    Automatic blind syllable segmentation for continuous speech
    1. R Villing
    2. J Timoney
    3. T Ward
    4. J Costello
    (2004)
    Electronic Engineering.
  77. 77
  78. 78
  79. 79
    Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on
    1. SL Wu
    2. ML Shire
    3. S Greenberg
    4. N Morgan
    (1997)
    IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-97) 2:987–990.
    http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=596105
  80. 80
    From birdsong to human speech recognition: bayesian inference on a hierarchy of nonlinear dynamical systems
    1. IB Yildiz
    2. K von Kriegstein
    3. SJ Kiebel
    (2013)
    PLOS Computational Biology 9:e1003219.
    https://doi.org/10.1371/journal.pcbi.1003219
  81. 81
  82. 82

Decision letter

  1. Hiram Brownell
    Reviewing Editor; Boston College, United States

eLife posts the editorial decision letter and author response on a selection of the published articles (subject to the approval of the authors). An edited version of the letter sent to the authors after peer review is shown, indicating the substantive concerns or comments; minor concerns are not usually shown. Reviewers have the opportunity to discuss the decision before the letter is sent (see review process). Similarly, the author response typically shows only responses to the major concerns raised by the reviewers.

Thank you for sending your work entitled "Speech encoding by coupled cortical theta and gamma oscillations" for consideration at eLife. Your article has been favorably evaluated by Eve Marder (Senior editor), three reviewers, and a member of our Board of Reviewing Editors (Hiram Brownell).

The Reviewing editor and the reviewers discussed their comments before we reached this decision, and the Reviewing editor has assembled the following comments to help you prepare a revised submission.

The study breaks new ground in that it shows that a simple, biologically plausible model of coupled theta and gamma oscillators performs better at speech decoding than other methods, including the authors' own circuit with uncoupled oscillators. While the reviewers value the potential contribution of this work, they were clear that they felt unable to fairly evaluate the suitability of the paper for publication without substantially more information. The major points to be addressed in a revision are presented below. Please note that presenting the additional information will not necessarily make the paper suitable for publication. With the requested additional information (e.g., results of statistical tests) presented as part of a revised submission, the reviewers will carry out a new review and arrive at a decision.

1) A general area of concern is the presentation of the model and the motivation for its specific architecture. Much of the eLife readership will not be familiar with the modeling literature.

1A) One specific issue is the motivation for the differential connectivity of theta- and gamma-associated neurons to the input signal.

1B) In the subsection headed “Model architecture and spontaneous behaviour” some of the aspects of the model appear somewhat ad-hoc: how realistic is it to assume that gamma and theta are generated by the same mechanism? How realistic is the connectivity structure between neurons? To preempt any concerns in this direction, one suggestion is to state explicitly and together the ways in which the model is based on neural data (connectivity structure, time constants, etc.) and the ways it is not. Some of this content is already stated in various places.

2) The connection of this work to prior literature could be improved.

2A) Given previous work on coupled oscillators and entrainment of oscillators, it should be made clearer what was known before the current work was carried out and what is different and surprising about the results of the current work.

2B) The two main aims of the study could be better developed to set the stage for the rest of the manuscript. The first one (that speech constituents can be tracked with oscillations) has been shown with real data many times. A suggested rephrasing could include the modeling aspect. The second aim could be stated more clearly. (It is unclear what is meant by 'shape efficient neural code'.)

3) The description of the stimulus items used for training and testing the model is a major aspect of the paper needing additional detail. How do stimuli differ from each other? Since the test of decoding is a comparison of neural activity in response to test stimuli compared to that activity in response to trained stimuli, it is essential to know how different were the test and trained stimuli. In the subsection headed “Syllable decoding”, "25 sentences… repeated 100 times each" suggest identical stimuli were used. Was each sentence spoken by a different speaker? If not, especially if some test stimuli were identical to training stimuli, then trial-to-trial differences could be due to only internal noise in the circuit, which could always be lowered to improve performance in any circuit. Thus, relevant questions include: were different exemplars of syllables spoken by different speakers? Or did they appear in different phrases or words but spoken by the same speaker?

4) More information is needed regarding support for the model.

4A) It was not clear what data supported the network architecture, specifically, that the Ge neurons receive input from only one specific auditory channel, whereas the Te neurons receive input from all channels. Are the theta- and gamma-associated neurons anatomically dissociable?

4B) A major issue raised consistently in the reviews is that throughout the paper, the specific statistical tests applied were opaque or missing. Examples include p values (or Bayes factors, etc.) to back up many of the comparisons across conditions, syllable boundary detection (is the presented model's performance significantly better than the others?), and decoding of simple stimuli (again, is performance significantly better for the intact model?). What is the chance level of decoding for the conditions shown in Figure 3C and Figure 4B? These are not necessarily what might be expected, depending on the number of samples available. Although the approaches seem generally reasonable, statistical support must be reported (incorporating, if applicable, correction for multiple comparisons).

5) Some additional comment on phase-reset is warranted. For example, Figure 2 nicely shows PINTH activity (LFP and theta-associated activity) occurring in a (quasi-)regular fashion before the presentation of the speech stimulus. The degree to which the incoming speech signal systematically alters the phase of this activity is unclear: from Figure 2, it appears to happen instantaneously, with no missed/extra spikes. Such performance seems intriguing but potentially unrealistic. How does the phase following speech presentation compare with that prior to speech presentation?

6) Please check figures and figure captions. For example, there are captions for Figure 2 C, D and E, but these panels are missing in Figure 2. (Note also a similar mismatch for Figure 2–figure supplement 1.)

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for submitting your work entitled "Speech encoding by coupled cortical theta and gamma oscillations" for peer review at eLife. Your submission has been favorably evaluated by Eve Marder (Senior editor), a Reviewing editor, and three reviewers.

The reviewers have discussed the reviews with one another and the Reviewing editor has drafted this decision to help you prepare a revised submission.

Summary: The paper is greatly improved and needs only relatively small revisions prior to publication.

Essential revisions:

1) Please address the following in the paper itself. The comment is in response to a point made in the authors' cover letter.

A system without noise would always perform at 100% on any classification test if given repeats of certain items in training, and then tested on those same items. Your point that noise is needed to optimize detection of stimulus onset time is important and should be included in the paper (that is, for tests in which stimulus onset time is not provided to the model).

2) Given the authors' intent to take into account the constraint of neural noise, additional information as to how well this is done should be provided. At a minimum, in the spiking models, please provide the CV of the spike trains and the Fano factors for responses to identical stimuli. For the LFP signals, also provide an indicator of the trial-to-trial variability in response to identical stimuli: e.g., when stimuli are identical, CV of heights of particular LFP peaks, or of time intervals between specific peaks and troughs.

https://doi.org/10.7554/eLife.06213.015

Author response

1) A general area of concern is the presentation of the model and the motivation for its specific architecture. Much of the eLife readership will not be familiar with the modeling literature.

1A) One specific issue is the motivation for the differential connectivity of theta- and gamma-associated neurons to the input signal.

We thank the reviewers for bringing up this important point. Our rationale was as follows. Human intracortical data obtained in primary auditory cortex (Morillon et al., 2012) indicate that the cerebro/acoustic coherence in the theta range does not depend on the input frequency. In the gamma range, however, coherence is stronger at specific stimulus frequencies (see Author response image 1). This suggests that theta oscillations inside auditory cortex are broadly responsive to the whole hearing spectrum, while local gamma generators are finely tuned to specific frequencies. This claim also makes sense from the point of view of multiplexing, i.e., tracking broadband and finely tuned signals simultaneously at two distinct frequencies (Panzeri et al. 2010; Gross et al. 2013). This was our main motivation for feeding a broadband signal to individual theta neurons and a more fine-grained, spectrally complex input to individual gamma neurons. This is now specified in the Results section of the manuscript together with the information requested in Point 3:

"Such a differential selectivity was motivated by experimental observations from intracranial recordings (Fontolan et al. 2014; Morillon et al. 2012) suggesting that, unlike the gamma response, the theta response does not depend on the input spectrum".

Author response image 1
Human stereotactic EEG recording in auditory cortex.

A: the location of the electrode shaft; B: acoustic stimulus (one sentence); C: time-frequency representation of the cortical response in primary auditory cortex; D: spectrum of the stimulus in relation with E: the cross-correlation between stimulus (Hilbert transform of B) and cortical response for 33 different frequency bands (corresponding to 33 cochlear filters) spanning the speech audio spectrum up to 8 kHz. Note that the cross-correlation is broad-band in the theta range, but frequency specific in the gamma range, with peaks of correlation corresponding roughly to the energy peaks (speech formants). The data are partly published in Morillon et al., 2012 and Giraud and Poeppel, 2012, and partly original.

https://doi.org/10.7554/eLife.06213.017

1B) In the subsection headed “Model architecture and spontaneous behaviour” some of the aspects of the model appear somewhat ad-hoc: how realistic is it to assume that gamma and theta are generated by the same mechanism? How realistic is the connectivity structure between neurons? To preempt any concerns in this direction, one suggestion is to state explicitly and together the ways in which the model is based on neural data (connectivity structure, time constants, etc.) and the ways it is not. Some of this content is already stated in various places.

Thank you for this suggestion. We have now updated the manuscript and put all the information regarding experimental support for the model architecture in the "Model architecture and spontaneous behaviour" subsection of the Results section. We believe that this change, motivated by the reviewers' comments, makes it easier to disentangle the assumptions that were directly constrained by experimental evidence from those that were put forward for future experimental verification.

Regarding the generation of theta oscillations, we state (now in the section “Model architecture and spontaneous behaviour”) that in the absence of conclusive evidence about the underlying mechanisms, we chose the most parsimonious option: a mechanism similar to gamma generation (PING) based on the attested presence of inhibitory neurons with slower time constants (our Ti neurons), which were involved in theta generation in previous models of neocortical oscillations (Vierling-Claassen et al., 2010; Compte et al., J Neuroscience 2008).

When developing our model, we did consider alternative architectures but concluded that they were not as compelling. The most obvious example, a three-population model with Gi units, Ti units and a single population of pyramidal neurons (as in the Tort et al. PNAS 2005 model of hippocampus), was not appropriate as it did not permit a differential frequency channel tuning for the inputs to the gamma excitatory and the theta excitatory neurons (since they represent the same population). We decided to avoid a lengthy discussion about alternative models in the manuscript, as it would probably be unappealing to most readers. Overall we believe these changes should add to the clarity of the manuscript.

2) The connection of this work to prior literature could be improved.

2A) Given previous work on coupled oscillators and entrainment of oscillators, it should be made clearer what was known before the current work was carried out and what is different and surprising about the results of the current work.

We agree with this remark and have now thoroughly rewritten the Introduction to make a more explicit connection between the previous literature and the present study. To the best of our knowledge, until now coupled cross-frequency oscillations were only used to model the functions of the hippocampus and prefrontal cortex (Jensen and Lisman 1996; Tort et al. 2005; etc.). These models were not envisaged to account for the sampling function of sensory areas. In particular, the functional interaction of an intrinsic oscillation with a sensory signal bearing pseudo-rhythmic modulations such as speech has, to the best of our knowledge, never been explored before. While supported by some experimental evidence, the role of coupled oscillations in speech processing has only been formulated from a neurophysiological perspective (Giraud and Poeppel, 2012) or in phenomenological terms (Ghitza, Front Psychology 2010).

2B) The two main aims of the study could be better developed to set the stage for the rest of the manuscript. The first one (that speech constituents can be tracked with oscillations) has been shown with real data many times. A suggested rephrasing could include the modeling aspect. The second aim could be stated more clearly. (It is unclear what is meant by 'shape efficient neural code'.)

We respectfully disagree with the statement about our first aim. A number of experiments show that neural oscillations in auditory cortex track the slow fluctuations of speech, but none of them has explicitly tied oscillations to specific linguistic constituents, as we do here with respect to theta oscillations and syllables. Thus, the efficiency of biologically plausible coupled theta and gamma oscillators in speech encoding has never been evaluated before. Also, given the large variability of the syllabic rhythm in natural speech, it was not clear that neural oscillations could accurately track the syllabic rate. Finally, no experimental data have ever proven causality with respect to the hypothesized functions. It is precisely because causality cannot easily be established in human recordings (it would require a specific interference with the theta rhythm, e.g., with optogenetics) that we initiated this modelling work.

The new Introduction paragraph now clarifies the current state of knowledge, and specifies the three specific aims of the study. Note that for the readability of the manuscript, the changes are not tracked in the Introduction.

3) The description of the stimulus items used for training and testing the model is a major aspect of the paper needing additional detail. How do stimuli differ from each other? Since the test of decoding is a comparison of neural activity in response to test stimuli compared to that activity in response to trained stimuli, it is essential to know how different were the test and trained stimuli. In the subsection headed “Syllable decoding”, "25 sentences…repeated 100 times each" suggest identical stimuli were used. Was each sentence spoken by a different speaker? If not, especially if some test stimuli were identical to training stimuli, then trial-to-trial differences could be due to only internal noise in the circuit, which could always be lowered to improve performance in any circuit. Thus, relevant questions include: were different exemplars of syllables spoken by different speakers? Or did they appear in different phrases or words but spoken by the same speaker?

As correctly pointed out by the reviewers, we used identical stimuli for syllable classification in the analysis where we aimed at comparing both different model versions and different neural codes. To thoroughly assess the different versions of the model and explore the added value of reading out gamma spiking as a function of theta phase, some consistency in syllable length was required. We believe that the differential results are both valid and important. To address the reviewers’ concern, we now additionally present decoding results when the classifier was trained on 2 sentences uttered by 462 different speakers, where the training material was hence different than the tested material. Despite the fact that the new dataset involved a wide range of phonemic realizations (as labelled by phoneticians) due to accents and pronunciation variants, classification reached 20-24% (depending on model version), which remained well above chance level (10%). Decoding accuracy using the intact model was significantly better than either control model, and importantly the full model performed optimally when the syllable duration was within the 100-300 ms range, i.e. roughly one theta cycle.

Details about this control analysis have been included in the Results section and we added a corresponding figure (Figure 4C and Figure 4–figure supplement 1).

The point about reducing noise in the system makes sense from an algorithmic point of view but less so from a neuroscience perspective. Neuronal noise is a neural constraint requiring robust computations. Moreover, while reducing noise in the gamma network may arguably enhance performance, it should be noted that the level of noise in the theta module was optimized for syllabic boundary detection; thus, noisy theta is expected to allow better classification performance than noiseless theta.

We thank the reviewers for this remark and we believe the across-speaker decoding results help to strengthen the main claims of the manuscript.

4) More information is needed regarding support for the model.

4A) It was not clear what data supported the network architecture – specifically, that the Ge neurons receive input from only one specific auditory channel, whereas the Te neurons receive input from all channels. Are the theta- and gamma-associated neurons anatomically dissociable?

Regarding the different patterns of input to the Te and Ge populations, please refer to our reply to point 1A. Regarding the issue of anatomic dissociability, there is unfortunately very little experimental evidence to build from. We draw a putative link between our two subpopulations of theta and gamma neurons and the two subclasses of stereotyped and modulated neurons found in primate auditory cortex by Brasselet and colleagues (Brasselet et al., 2012). The authors report the existence of two subclasses of neurons in primate auditory cortex corresponding to our two populations of theta and gamma neurons (see Discussion):

A subclass of 'stereotyped' neurons responding very rapidly and non-selectively to any acoustic stimulus, presumably receiving input from the whole acoustic spectrum, which we model as theta neurons, signalling boundaries in the speech signal.

A subclass of 'modulated' neurons responding more slowly and selectively to some specific spectro-temporal features, indicating a narrower receptive field, which we model as gamma excitatory neurons.

The authors did not report a spatial dissociation between the two subclasses, leaving the possibility open that theta and gamma circuits do indeed coexist in one place. This point has been added to the Results section of the manuscript:

"It also mirrored the dissociation in primate auditory cortex between a population of 'stereotyped' neurons responding very rapidly and non-selectively to any acoustic stimulus (putatively Te neurons) and a population of 'modulated' neurons responding selectively to specific spectro-temporal features (putatively Ge neurons) (Brasselet et al. 2012)."

4B) A major issue raised consistently in the reviews is that throughout the paper, the specific statistical tests applied were opaque or missing. Examples include p values (or Bayes factors, etc.) to back up many of the comparisons across conditions, syllable boundary detection (is the presented model's performance significantly better than the others?), and decoding of simple stimuli (again, is performance significantly better for the intact model?). What is the chance level of decoding for the conditions shown in Figure 3C and Figure 4B? These are not necessarily what might be expected, depending on the number of samples available. Although the approaches seem generally reasonable, statistical support must be reported (incorporating, if applicable, correction for multiple comparisons).

We thank the reviewers for this comment, which urged us to provide more statistical details. We have now included results from statistical testing at all steps of the analyses. Statistical effects were very strong in all cases and are reported both below and in the main manuscript. For the case of chance levels, given that we use 10 classes for all classification analyses, the average expected chance level would be 10%. Based on Combrisson and Jerbi, the threshold for testing at the 5% level would be 12.2% and 11.6% for the sawtooth (500 samples) and syllable (1000 samples) classification analyses, respectively. Importantly here, the classification procedure was repeated many times (10 times for sawtooth, 200 times for syllables) with a different set of samples. Hence the distribution of classification scores could be compared through a simple t-test against its expected mean value for the chance level (10%).
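As an illustration of the chance-level reasoning above, the following minimal sketch computes a binomial significance threshold for a classifier with a given number of samples and classes, together with a one-sample t statistic against chance. It assumes the threshold is the (1 - alpha) quantile of a binomial chance distribution; the function names `chance_threshold` and `t_statistic` are hypothetical and are not taken from the manuscript.

```python
from math import exp, lgamma, log, sqrt
from statistics import mean, stdev


def chance_threshold(n_samples, n_classes, alpha=0.05):
    """(1 - alpha) quantile of the binomial chance distribution, as a fraction of samples."""
    p = 1.0 / n_classes  # probability of a correct guess under chance
    cdf = 0.0
    for k in range(n_samples + 1):
        # log of the binomial probability mass at k, computed in log space for stability
        log_pmf = (lgamma(n_samples + 1) - lgamma(k + 1) - lgamma(n_samples - k + 1)
                   + k * log(p) + (n_samples - k) * log(1.0 - p))
        cdf += exp(log_pmf)
        if cdf >= 1.0 - alpha:
            return k / n_samples
    return 1.0


def t_statistic(scores, chance=0.1):
    """One-sample t statistic of repeated classification scores against the chance level."""
    return (mean(scores) - chance) / (stdev(scores) / sqrt(len(scores)))
```

With 10 classes, `chance_threshold(500, 10)` and `chance_threshold(1000, 10)` should land on or very near the 12.2% and 11.6% figures quoted above. The t statistic would be applied to the scores obtained across repetitions of the classification procedure.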

Details of the statistical results are as follows:

Spike phase-amplitude coupling: coupling was significant for the full model both for rest and speech (p < 10⁻⁹ for both);

Syllable boundary detection performance (in the subsection headed “Syllable boundary detection by theta oscillations”): the theta network performed significantly better than the Mermelstein and LN algorithms and than chance level for uncompressed and compressed speech (all p-values < 10⁻¹², tested over 3620 sentences);

Sawtooth decoding (in the subsection headed “Decoding of simple temporal stimuli from output spike patterns”): decoding using the full model was significantly better than using either of the two control networks, using either stimulus timing or theta timing (all p-values < 10⁻⁹, tested over 10 repetitions of the classification procedure);

Syllable decoding (in the subsection headed “Continuous speech encoding by model output spike patterns”): in the full model, decoding was lower using spike count than using spike patterns (p < 10⁻¹² over 200 repetitions); decoding using any of the control models and either neural code (spike count/spike patterns) was significantly lower than using spike patterns for the full model (all p-values < 10⁻¹²), and none was significantly better than using spike counts for the full model (all p-values > 0.08, uncorrected);

Classifier analysis (in the subsection headed “Classifier analysis”): decoding using spike counts was significantly lower than using spike patterns (p < 10⁻¹² over 200 repetitions for all 6 window sizes); the increase in decoding performance using phase-partitioned code rather than spike count was significantly reduced in both control models and all 6 window sizes compared to the full model (all p-values < 10⁻¹²);

Mutual Information (in the subsection headed “Mutual information”): in the full model, MI was significantly larger when adding phase-partitioned code on top of both spike count and time-partitioned code, i.e., with vs. without phase information (p-values over 32 Ge neurons: p < 10⁻¹² for both).
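To make the distinction between the spike count code and the phase-partitioned code concrete, the following minimal sketch builds both feature vectors from the theta phase at which each spike occurred. It is an illustration only: the function names, the input layout and the default number of phase bins are assumptions and are not taken from the manuscript.

```python
from math import pi


def spike_count_code(spike_phases):
    """Spike-count code: one number per neuron, ignoring when the spikes occurred."""
    return [len(phases) for phases in spike_phases]


def phase_partitioned_code(spike_phases, n_bins=8):
    """Phase-partitioned code: per-neuron spike counts within each theta-phase bin.

    spike_phases[i] holds the theta phase (in radians) at each spike of neuron i.
    n_bins is an illustrative choice, not a value taken from the text.
    """
    width = 2 * pi / n_bins
    code = []
    for phases in spike_phases:
        counts = [0] * n_bins
        for phase in phases:
            # wrap the phase into [0, 2*pi) and find its bin
            counts[int((phase % (2 * pi)) / width) % n_bins] += 1
        code.extend(counts)
    return code
```

Two responses with the same number of spikes but at different theta phases are indistinguishable in the spike count code yet yield different phase-partitioned vectors, which is the additional information the classifier can exploit.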

5) Some additional comment on phase-reset is warranted. For example, Figure 2 nicely shows PINTH activity (LFP and theta-associated activity) occurring in a (quasi-)regular fashion before the presentation of the speech stimulus. The degree to which the incoming speech signal systematically alters the phase of this activity is unclear: from Figure 2, it appears to happen instantaneously, with no missed/extra spikes. Such performance seems intriguing but potentially unrealistic. How does the phase following speech presentation compare with that prior to speech presentation?

The strong speech input to Te neurons provides a somewhat hard reset of theta phase at sentence onset. Unlike in the weak coupling situation, the theta phase after the onset is virtually independent of the phase prior to speech onset. This mechanism enables very rapid and strong phase-locking of theta to speech throughout the sentence. Gross et al. (PLoS Biology, 2013) showed that phase-locking to speech edges occurs in auditory cortex within a few hundred milliseconds. To better illustrate this important property, we added a figure (Figure 1–figure supplement 1C) that shows theta phase concentration for multiple presentations of the same sentence: there is a rapid transition within a few hundred ms (e.g. a theta cycle or less) from uniform phase distribution before sentence onset to very strong phase-locking. The figure is referenced in the Results section “Model dynamics in response to natural sentences”.
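One common way to quantify phase concentration across repeated presentations is the mean resultant length of the phases at a given time point. The sketch below assumes that measure; the text does not specify which measure was used, and the function name `phase_concentration` is hypothetical.

```python
from math import cos, hypot, sin


def phase_concentration(phases):
    """Mean resultant length of phases (radians): near 0 if uniform, 1 if identical."""
    n = len(phases)
    mean_cos = sum(cos(p) for p in phases) / n
    mean_sin = sum(sin(p) for p in phases) / n
    return hypot(mean_cos, mean_sin)
```

Applied to the theta phases collected across trials at each time point, this value would be expected to stay low before sentence onset, when phases are spread uniformly, and to approach 1 once theta is reset and phase-locked to the speech input.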

6) Please check figures and figure captions. For example, there are captions for Figure 2 C, D and E, but these panels are missing in Figure 2. (Note also a similar mismatch for Figure 2–figure supplement 1.)

This was an error. This has been corrected for Figure 2 and Figure 2–figure supplement 1. The legends for two panels of Figure 5–figure supplement 1 have also been correctly reordered.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

1) Please address the following in the paper itself. The comment is in response to a point made in the authors' cover letter.

A system without noise would always perform at 100% on any classification test if given repeats of certain items in training, and then tested on those same items. Your point that noise is needed to optimize detection of stimulus onset time is important and should be included in the paper (that is, for tests in which stimulus onset time is not provided to the model).

We agree with reviewers that this point should be made explicit in the manuscript. We have now added in the Results section:

"Noise in the theta module allows the alignment of theta bursts to stimulus onset and thus improves detection performance by enabling consistent theta chunking of spike patterns."

2) Given the authors' intent to take into account the constraint of neural noise, additional information as to how well this is done should be provided. At a minimum, in the spiking models, please provide the CV of the spike trains and the Fano factors for responses to identical stimuli. For the LFP signals, also provide an indicator of the trial-to-trial variability in response to identical stimuli: e.g., when stimuli are identical, CV of heights of particular LFP peaks, or of time intervals between specific peaks and troughs.

We have added (in Figure 1–figure supplement 1) spike train CV and spike count Fano factors for the different types of neurons in response to speech as well as the standard deviation of LFP in response to speech. The latter shows a great decrease in LFP variability at sentence onset that is mostly due to the phase-locking of theta and gamma oscillations. We added reference to both new panels in the Results section.
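For reference, the two spiking variability measures mentioned here follow standard definitions, which can be sketched as follows. The function names are hypothetical, and the use of sample (rather than population) variance is an assumption made for this illustration.

```python
from statistics import mean, stdev, variance


def isi_cv(spike_times):
    """Coefficient of variation of inter-spike intervals for one spike train."""
    isis = [t1 - t0 for t0, t1 in zip(spike_times, spike_times[1:])]
    return stdev(isis) / mean(isis)


def fano_factor(spike_counts):
    """Variance-to-mean ratio of spike counts across repeats of the same stimulus."""
    return variance(spike_counts) / mean(spike_counts)
```

A perfectly regular spike train gives a CV of 0 and a Poisson process gives a CV and a Fano factor close to 1, so both measures indicate how far the model's spiking lies from these two reference cases.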

https://doi.org/10.7554/eLife.06213.016

Article and author information

Author details

  1. Alexandre Hyafil

    INSERM U960, Group for Neural Theory, Département d'Etudes Cognitives, Ecole Normale Supérieure, Paris, France
    Contribution
    AH, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article
    For correspondence
    alexandre.hyafil@gmail.com
    Competing interests
    The authors declare that no competing interests exist.
  2. Lorenzo Fontolan

    1. INSERM U960, Group for Neural Theory, Département d'Etudes Cognitives, Ecole Normale Supérieure, Paris, France
    2. Department of Neuroscience, University of Geneva, Geneva, Switzerland
    Contribution
    LF, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
  3. Claire Kabdebon

    INSERM U960, Group for Neural Theory, Département d'Etudes Cognitives, Ecole Normale Supérieure, Paris, France
    Contribution
    CK, Acquisition of data, Analysis and interpretation of data
    Competing interests
    The authors declare that no competing interests exist.
  4. Boris Gutkin

    1. INSERM U960, Group for Neural Theory, Département d'Etudes Cognitives, Ecole Normale Supérieure, Paris, France
    2. Centre for Cognition and Decision Making, National Research University Higher School, Moscow, Russia
    Contribution
    BG, Conception and design, Analysis and interpretation of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
  5. Anne-Lise Giraud

    Department of Neuroscience, University of Geneva, Geneva, Switzerland
    Contribution
    A-LG, Conception and design, Analysis and interpretation of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.

Funding

European Research Council (ERC) (CompusLang 260347)

  • Lorenzo Fontolan

Schweizerische Nationalfonds zur Förderung der Wissenschaftlichen Forschung (320030-149319)

  • Anne-Lise Giraud

Agence Nationale de la Recherche

  • Boris Gutkin

Centre National de la Recherche Scientifique

  • Anne-Lise Giraud

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was funded by the European Research Council (Compuslang project; Grant agreement 260347), the Swiss National Fund (grant 320030-149319), the Agence Nationale de la Recherche, and the CNRS. We warmly thank Oded Ghitza for stimulating discussions, Maoz Shamir and Andy Brughera for sharing elements of code with us, Adrien Wohrer for help with the mathematical analysis and Jean-Paul Haton for his input from the perspective of automatic speech recognition. BSG gratefully acknowledges partial support from the National Research University Higher School of Economics.

Reviewing Editor

  1. Hiram Brownell, Reviewing Editor, Boston College, United States

Publication history

  1. Received: December 23, 2014
  2. Accepted: May 28, 2015
  3. Accepted Manuscript published: May 29, 2015 (version 1)
  4. Accepted Manuscript updated: June 5, 2015 (version 2)
  5. Version of Record published: June 25, 2015 (version 3)

Copyright

© 2015, Hyafil et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 2,185
    Page views
  • 653
    Downloads
  • 22
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
