(A) Acoustic analysis of TIMIT corpus. Left panel: speech modulation frequency increases with syllabic rate. All 4620 sentences of the TIMIT corpus (Test data set) were sorted into quartiles according to syllabic rate (i.e., number of syllables per second). Speech envelope spectrum (with 1/f correction) was averaged over all sentences within each quartile, and the four averages are plotted. Colour bars on top of the graphs represent the syllabic rate range for all four quartiles, showing a correspondence between the modal frequency and the syllabic rate over the corpus. Middle panel: average channel spectrum. Spectrum was taken for each 128 auditory channels of the Chi and colleagues pre-cortical auditory model (Chi et al., 2005), averaged over all sentences in the corpus. All channels show a clear peak in the same 4–8 Hz range, showing that the theta modulation is very present in the input to auditory cortex. Right panel: syllable onset corresponds to a dip in spectrogram. Average of auditory spectrogram channels of sentences phase-locked to syllable onsets. t = 0 (green line) corresponds to syllable onset. Red colours correspond to high value, blue colours to low values. Dip at syllable onset is particularly pronounced over medium frequencies corresponding to formants. Auditory channels were averaged over all syllable onsets over the entire corpus (4620 sentences). This plot shows the connection between syllable boundaries and fluctuations of auditory channels that the auditory cortex may take advantage of in order to predict syllable boundaries. (B) Theta network model. Left panel: the architecture of the theta model is the same as the full model network without the PING component. Speech data are decomposed into auditory channels as in the LN model and projected non-specifically onto 10 Te excitatory neurons. The Te population interacts reciprocally with 10 Ti inhibitory neurons, generating theta oscillations. Theta bursts provide the model prediction for syllable boundary timing. (C) Te neurons burst at speech onset: Te neurons provide onset-signalling neurons that respond non-specifically to the onset of all sentences. The spikes from one Te neuron were collected over presentation of 500 distinct sentences, and then referenced in time with respect to sentence onset. Here, sentence onset was defined as the time when speech envelope first reached a given threshold (1000 a.u.). Spikes counts are then averaged in 20 ms bins, showing that this neuron displays a strong activity peak 0–60 ms after sentence onset. A secondary burst occurs around 200 ms after onset, as present in the example neuron shown in Brasselet et al., 2012. (D) Model of linear-nonlinear (LN) predictor of syllable boundaries. Auditory channels are filtered, summed, and passed through a nonlinear function: the output determines the expected probability of syllable onset. A negative feedback loop prevents repeated onset at close timings. Values for filters, nonlinear function, and feedback loops are optimized through fitting to a sub-sample of sentences. (E) Stimulus-network coherence. Theta phase (4–8 Hz) was extracted from both the simulated LFP and speech input. Coherence at each data point was computed as the Phase-Locking Value of the phase difference computed from 100 simulations with a distinct sentence. Coherence established in the 0–200 ms following sentence onset to a stable high coherence value of about 0.4.