An oscillating computational model can track pseudo-rhythmic speech by using linguistic predictions

  1. Sanne ten Oever  Is a corresponding author
  2. Andrea E Martin
  1. Language and Computation in Neural Systems group, Max Planck Institute for Psycholinguistics, Netherlands
  2. Donders Centre for Cognitive Neuroimaging, Radboud University, Netherlands
  3. Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Netherlands

Abstract

Neuronal oscillations putatively track speech in order to optimize sensory processing. However, it is unclear how isochronous brain oscillations can track pseudo-rhythmic speech input. Here we propose that oscillations can track pseudo-rhythmic speech when considering that speech time is dependent on content-based predictions flowing from internal language models. We show that temporal dynamics of speech are dependent on the predictability of words in a sentence. A computational model including oscillations, feedback, and inhibition is able to track pseudo-rhythmic speech input. As the model processes speech, it generates temporal phase codes, which are a candidate mechanism for carrying information forward in time. The model is optimally sensitive to the natural temporal speech dynamics and can explain empirical data on temporal speech illusions. Our results suggest that speech tracking does not have to rely only on the acoustics but could also exploit ongoing interactions between oscillations and constraints flowing from internal language models.

Introduction

Speech is a biological signal that is characterized by a plethora of temporal information. The temporal relationship between subsequent speech units allows for the online tracking of speech in order to optimize processing at relevant moments in time (Jones and Boltz, 1989; Large and Jones, 1999; Giraud and Poeppel, 2012; Ghitza and Greenberg, 2009; Ding et al., 2017; Arvaniti, 2009; Poeppel, 2003). Neural oscillations are a putative index of such tracking (Giraud and Poeppel, 2012; Schroeder and Lakatos, 2009). The existing evidence for neural tracking of the speech envelope is consistent with such a functional interpretation (Luo et al., 2013; Keitel et al., 2018). In these accounts, the most excitable optimal phase of an oscillation is aligned with the most informative time point within a rhythmic input stream (Schroeder and Lakatos, 2009; Lakatos et al., 2008; Henry and Obleser, 2012; Herrmann et al., 2013; Obleser and Kayser, 2019). However, the range of onset time differences between speech units seems more variable than fixed oscillations can account for (Rimmele et al., 2018; Nolan and Jeon, 2014; Jadoul et al., 2016). As such, it remains an open question how oscillations can track a signal that is at best only pseudo-rhythmic (Nolan and Jeon, 2014).

Oscillatory accounts tend to focus on prediction in the sense of predicting ‘when’, rather than predicting ‘what’: oscillations function to align the optimal moment of processing given that timing is predictable in a rhythmic input structure. If rhythmicity in the input stream is violated, oscillations must be modulated to retain optimal alignment to incoming information. This can be achieved through phase resets (Rimmele et al., 2018; Meyer, 2018), direct coupling of the acoustics to oscillations (Poeppel and Assaneo, 2020), or the use of many oscillators at different frequencies (Large and Jones, 1999). However, the optimal or effective time of processing stimulus input might not only depend on when you predict something to occur, but also on what stimulus is actually being processed (Ten Oever et al., 2013; Martin, 2016; Rosen, 1992; Deen et al., 2017).

What and when are not independent, and certainly not from the brain’s-eye-view. If continuous input arrives to a node in an oscillatory network, the exact phase at which this node reaches threshold activation does not only depend on the strength of the input, but also on how sensitive this node was to begin with. Sensitivity of a node in a language network (or any neural network) is naturally affected by predictions in the what domain generated by an internal language model (Martin, 2020; Marslen-Wilson, 1987; Lau et al., 2008; Nieuwland, 2019). We define an internal language model as the individually acquired statistical and structural knowledge of language stored in the brain. A virtue of such an internal language model is that it can predict the most likely future input based on the currently presented speech information. If a language model creates strong predictions, we call it a strong model. In contrast, a weak model creates no or few predictions about future input (note that the strength of individual predictions depends not only on the capability of the system to create a prediction, but also on the available information). If a node represents a speech unit that is likely to be spoken next, a strong internal language model will sensitize this node and it will therefore be active earlier, that is, on a less excitable phase of the oscillation. In the domain of working memory, this type of phase precession has been shown in rat hippocampus (O'Keefe and Recce, 1993; Malhotra et al., 2012) and more recently in human electroencephalography (Bahramisharif et al., 2018). In speech, phase of activation and perceived content are also associated (Ten Oever and Sack, 2015; Kayser et al., 2016; Di Liberto et al., 2015; Ten Oever et al., 2016; Thézé et al., 2020), and phase has been implicated in the tracking of higher-level linguistic structure (Meyer, 2018; Brennan and Martin, 2020; Kaufeld et al., 2020a). However, the direct link between phase and the predictability flowing from a language model has yet to be established.

The time of speaking/speed of processing is not only a consequence of how predictable a speech unit is within a stream, but also a cue for the interpretation of this unit. For example, phoneme categorization depends on timing (e.g., voice onsets, the difference between voiced and unvoiced phonemes), and there are timing constraints on syllable durations (e.g., the theta syllable; Poeppel and Assaneo, 2020; Ghitza, 2013) that affect intelligibility (Ghitza, 2012). Even the delay between mouth movements and speech audio can influence syllabic categorization (Ten Oever et al., 2013). Most oscillatory models use oscillations for parsing, but not as a temporal code for content (Panzeri et al., 2015; Kayser et al., 2009; Mehta et al., 2002; Lisman and Jensen, 2013). However, the time or phase of presentation does influence content perception. This is evident from two temporal speech phenomena. In the first phenomenon, the interpretation of an ambiguous short /ɑ/ or long /a:/ vowel depends on speech rate (in Dutch; Reinisch and Sjerps, 2013; Kösem et al., 2018; Bosker and Reinisch, 2015). Specifically, when speech rates are fast the stimulus is interpreted as a long vowel, and vice versa for slow rates. However, modulating the entrainment rate effectively changes the phase at which the target stimulus – which is presented at a constant speech rate – arrives (but this could not be confirmed in Bosker and Kösem, 2017). The second phenomenon shows a direct phase-dependency of content (Ten Oever and Sack, 2015; Ten Oever et al., 2016). Ambiguous /da/-/ga/ stimuli will be interpreted as a /da/ on one phase and a /ga/ on another phase. This was confirmed in both an EEG and a behavioral study. An oscillatory theory of speech tracking should account for how temporal properties in the input stream can alter what is perceived.

In the speech production literature, there is strong evidence that the onset time (as well as duration) of an uttered word is modulated by the frequency of that word in the language (O'Malley and Besner, 2008; Monsell, 1991; Monsell et al., 1989; Powers, 1998; Piantadosi, 2014), showing that internal language models modulate the access to or sensitivity of a word node (Martin, 2020; Hagoort, 2017). This word-frequency effect relates to the access to a single word. However, it is likely that during ongoing speech internal language models use the full context to estimate upcoming words (Beattie and Butterworth, 1979; Pluymaekers et al., 2005a; Lehiste, 1972). If so, the predictability of a word in context should provide additional modulations of speech time. Therefore, we predict that words with a high predictability in the producer’s language model should be uttered relatively early. In this way, word-to-word onset times map onto the predictability level of that word within the internal model. Thus, not only does the processing time depend on the predictability of a word (faster processing for predictable words; see Gwilliams et al., 2020; Deacon et al., 1995, and Aubanel and Schwartz, 2020 showing that speech time in noise matters), but so does the production time (earlier uttering of predicted words).

Language comprehension involves the mapping of speech units from a producer’s internal model to the speech units of the receiver’s internal model. In other words, one will only understand what someone else is writing or saying if one’s language model is sufficiently similar to the speaker’s (and if we speak in Dutch, fewer people will understand us). If the producer’s and receiver’s internal language models have roughly matching top-down constraints, they should similarly influence the speed of processing (either in production or perception; Figure 1A–C). Therefore, if predictable words arrive earlier (due to high predictability in the producer’s internal model), the receiver also expects the content of this word to match one of the more predictable ones from their own internal model (Figure 1C). Thus, the phase of arrival depends on the internal model of the producer and the expected phase of arrival depends on the internal model of the receiver (Figure 1D). If this is true, pseudo-rhythmicity is fully natural to the brain, and it provides a means to use time or arrival phase as a content indicator. It also allows the receiver to be sensitive to less predictable words when they arrive relatively late. Current oscillatory models of speech parsing do not integrate the constraints flowing from an internal linguistic model into the temporal structure of the brain response. It is therefore an open question whether the oscillatory model the brain employs is actually attuned to the temporal variations in natural speech.

Proposed interaction between speech timing and internal linguistic models.

(A) Isochronous production and expectation when there is a weak internal model (even distribution of node activation). All speech units arrive around the most excitable phase. (B) When the internal model of the producer does not align with the model of the receiver, temporal alignment fails and optimal communication breaks down. (C) When both producer and receiver have a strong internal model, speech is non-isochronous and not aligned to the most excitable phase, but fully expected by the brain. (D) Expected time is a constrained distribution in which the center can be shifted due to linguistic constraints.

Here, we propose that neural oscillations can track pseudo-rhythmic speech by taking into account that speech timing is a function of linguistic constraints. As such, we need to demonstrate that speech statistics are influenced by linguistic constraints as well as show how oscillations can be sensitive to this property of speech. We approach this hypothesis as follows: First, we demonstrate that timing in natural speech depends on linguistic predictions (temporal speech properties). Then, we model how oscillations can be sensitive to these linguistic predictions (modeling speech tracking). Finally, we validate that this model is optimally sensitive to the natural temporal properties in speech and displays temporal speech illusions (model validation). Our results reveal that tracking of speech needs to be viewed as an interaction between ongoing oscillations and constraints flowing from an internal language model (Martin, 2016; Martin, 2020). In this way, oscillations do not have to shift their phase after every speech unit and can remain at a relatively stable frequency as long as the internal model of the speaker matches the internal model of the perceiver.

Results

Temporal speech properties

Word frequency influences word duration

To extract the temporal properties of naturally spoken speech we used the Corpus Gesproken Nederlands (CGN; [Version 2.0.3; 2014]). This corpus consists of elaborate annotations of over 900 hr of spoken Dutch and Flemish. We focus here on the subset of the data for which onset and offset timings were manually annotated at the word level in Dutch. Cleaning of the data included removing all dashes and backslashes. Only words were included that were part of a Dutch word2vec embedding (github.com/coosto/dutch-word-embeddings; Nieuwenhuijse, 2018; needed for later modeling) and that had a frequency of at least 10 in the corpus. All other words were replaced with an <unknown> label. This resulted in 574,726 annotated words with 3096 unique words. Two thousand and forty-eight of these words were recognized in the Dutch Wordforms database in CELEX (Version 3.1), which was used to extract the word frequency as well as the number of syllables per word (later needed to fit a regression model). Mean word duration was 0.392 s, with an average standard deviation of 0.094 s (Figure 2—figure supplement 1). By splitting up the data in sequences of 10 sequential words, we could extract the average word, syllable, and character rate (Figure 2—figure supplements 2 and 3). The reported rates fall within the generally reported ranges for syllables (5.2 Hz) and words (3.7 Hz; Ding et al., 2017; Pellegrino and Coupé, 2011).

We predict that knowledge about the language statistics influences the duration of speech units. Specifically, we predict that more prevalent words will on average have a shorter duration (also reported in Monsell et al., 1989). In Figure 2A, the durations of several mono- and bi-syllabic words are listed with their word frequency. From these examples, it seems that words with a higher word frequency generally have a shorter duration. To test this statistically we entered word frequency in an ordinary least squares regression with number of syllables as a control variable. Both number of syllables (coefficient = 0.1008, t(2843) = 75.47, p<0.001) and word frequency (coefficient = −0.022, t(2843) = −13.94, p<0.001) significantly influenced the duration of the word. Adding an interaction term did not significantly improve the model (F(1,2843) = 1.320, p=0.251; Figure 2B,C). The effect is so strong that words with a low frequency can last three times as long as high-frequency words (even within mono-syllabic words). This indicates that word frequency could be an important part of an internal model that influences word duration.
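
To make this analysis concrete, the regression could be set up as in the following minimal sketch; it is not the authors' analysis code and assumes a pandas DataFrame with hypothetical columns duration (s), n_syllables, and log_frequency:

```python
# Minimal sketch of the duration regression, assuming a DataFrame `words` with
# hypothetical columns: 'duration' (mean word duration in s), 'n_syllables',
# and 'log_frequency' (log word frequency from CELEX).
import pandas as pd
import statsmodels.formula.api as smf

def fit_duration_models(words: pd.DataFrame):
    # Ordinary least squares: duration predicted by syllable count and word frequency.
    main = smf.ols("duration ~ n_syllables + log_frequency", data=words).fit()
    # Model with an interaction term; the F-test checks whether it adds anything.
    inter = smf.ols("duration ~ n_syllables * log_frequency", data=words).fit()
    f_stat, p_value, df_diff = inter.compare_f_test(main)
    return main, (f_stat, p_value, df_diff)
```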

Figure 2 with 3 supplements
Word frequency modulates word duration.

(A) Examples of mono- and bi-syllabic words of different word frequencies in brackets (van=from, zijn=be, snel=fast, stem=voice, hebben=have, eten=eating, volgend=next, toekomst=future). Text in the graph indicates the mean word duration. (B) Relationship between word frequency and duration. Darker colors mean more values. (C) Same as (B) but separately for mono- and bi-syllabic words. (D) Relationship between number of characters and word duration. The longer the word, the longer the duration (left). The increase in word duration does not amount to a fixed duration per character, as the duration expressed as a rate increases with word length (right). (E) Same as (D) but for number of syllables. Red dots indicate the mean.

The previous analysis prompted us to expand on the relationship between word duration and the length of the words. Obviously, there is a strong correlation between word length and mean word duration (number of characters: ρ = 0.824, p<0.001; number of syllables: ρ = 0.808, p<0.001; for number of syllables already shown above; Figure 2D,E). In contrast, this correlation is present, but much lower, for the standard deviation of word duration (number of characters: ρ = 0.269, p<0.001; number of syllables: ρ = 0.292, p<0.001). Finding a strong correlation does not imply that for every unit increase in word length, the duration of the word also increases by the same amount, i.e., bi-syllabic words do not necessarily have to last twice as long as mono-syllabic words. Therefore, we recalculated word duration to a rate unit considering the number of syllables/characters of the word. Thus, a 250 ms mono- versus bi-syllabic word would have a rate of 4 versus 8 Hz, respectively. Then we correlated the character/syllabic rate with word length. If word duration increased proportionally with character/syllable count, there should be no correlation. We found that the syllabic rate varies between 3 and 8 Hz as previously reported (Figure 2E, right; Ding et al., 2017; Pellegrino and Coupé, 2011). However, the more syllables there are in a word, the higher this rate (ρ = 0.676, p<0.001). This increase was less strong for the character rate (ρ = 0.499, p<0.001; Figure 2D, right).
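
The rate re-coding described above can be sketched as follows (illustrative, with assumed variable names):

```python
# Convert word duration to a per-unit rate and test whether the rate still
# correlates with word length (Spearman's rho), as in the analysis above.
import numpy as np
from scipy.stats import spearmanr

def length_vs_rate(duration_s: np.ndarray, n_units: np.ndarray):
    # Rate in Hz: a 250 ms mono-syllabic word -> 4 Hz; a 250 ms bi-syllabic word -> 8 Hz.
    rate_hz = n_units / duration_s
    # If duration scaled proportionally with length, rate and length would be uncorrelated.
    rho, p = spearmanr(n_units, rate_hz)
    return rho, p
```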

These results show that the syllabic/character rate depends on the number of characters/syllables within a word and is not an independent temporal unit (Ghitza, 2013). This effect is easy to explain when assuming that the prediction strength of an internal model influences word duration: transitional probabilities of syllables are simply more constrained within a word than across words (Thompson and Newport, 2007). This reduces the time it takes to utter/perceive any syllable that comes later in a word. In the current model, we focus on words (based on the availability of the word2vec embedding used to calculate contextual predictabilities with an RNN) instead of syllables, so we will not test this prediction for syllables; instead, we investigate the effect of transitional probabilities and other statistical regularities flowing from internal models across words (see the next section and Jadoul et al., 2016 for statistical regularities in syllabic processing).

Word-by-word predictability predicts word onset differences

The brain’s internal model likely provides predictions about what linguistic features and representations, and possibly which specific units, such as words, to expect next when listening to ongoing speech (Martin, 2016; Martin, 2020). As such, it is also expected that word-by-word onset delays are shorter for words that fit the internal model (i.e., those that are expected; Beattie and Butterworth, 1979). To investigate this possibility, we created a simplified version of an internal model predicting the next word using recurrent neural networks (RNNs). We trained an RNN to predict the next word from ongoing sentences (Figure 3A). The model consisted of an embedding layer (pretrained; github.com/coosto/dutch-word-embeddings), a recurrent layer with a tanh activation function, and a dense output layer with a softmax activation. To prevent overfitting, we added a 0.2 dropout to the recurrent layers and the output layer. An Adam optimizer was used at a 0.001 learning rate and a batch size of 32. We investigated four different recurrent layers (GRU and LSTM at either 128 or 300 units; see Figure 3—figure supplement 1). The final model we use here includes an LSTM with 300 units. Input data consisted of 10 sequential words (label encoded) within the corpus (of a single speaker; shifting the sentences by one word at a time), and the output consisted of a single word. A maximum of four unknown-labeled words (words not included in the word2vec estimations; four was chosen as it is <50% of the words) was allowed in the input, but not in the output. Validation consisted of a randomly chosen 2% of the data.
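
A hedged sketch of such an RNN in Keras is given below; it is not the authors' code, and the exact layer ordering and the frozen (non-trainable) embedding are assumptions:

```python
# Sketch of the next-word RNN (assumed details). `embedding_matrix` is taken to
# hold the pretrained Dutch word2vec vectors (github.com/coosto/dutch-word-embeddings),
# indexed by the label encoding of the vocabulary.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_next_word_model(embedding_matrix: np.ndarray) -> keras.Model:
    vocab_size, embedding_dim = embedding_matrix.shape
    model = keras.Sequential([
        keras.Input(shape=(10,), dtype="int32"),            # 10 label-encoded context words
        layers.Embedding(vocab_size, embedding_dim,
                         weights=[embedding_matrix],
                         trainable=False),                   # pretrained embedding (frozen here)
        layers.LSTM(300, activation="tanh", dropout=0.2),    # final model; GRU/128 units were also tested
        layers.Dropout(0.2),
        layers.Dense(vocab_size, activation="softmax"),      # probability distribution over the next word
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy")
    return model

# Training would then follow the parameters reported above, e.g.:
# model.fit(X_context, y_next_word, batch_size=32, epochs=100, validation_split=0.02)
```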

Figure 3 with 2 supplements
RNN output influences word onset differences.

(A) Sequences of 10 words were entered into an RNN in order to predict the content of the next word. Three examples are provided of input data with the label (bold word) and the probability output for three different words. The regression model showed a relation between the duration of the last word in the sequence and the predictability of the next word, such that words were systematically shorter when the next word was more predictable according to the RNN output (illustrated here with the shorter black boxes). (B) Regression line estimated at the mean value of word duration and bigram. (C) Scatterplot of prediction and onset difference for data within ± 0.5 standard deviation of word duration and bigram. Note that for (B) and (C), the axes are linear on the transformed values. (D) Regression line for the correlation between the logarithm of the variance of the prediction and theta power. (E) Non-transformed distribution of the variance of the predictions (within a sentence). Translation of the sentences in (A) from top to bottom: ‘... that it has for me and while you have no answer [on]’, ‘... the only real hope for us humans is a firm and [sure]’, ‘... a couple of glass doors in front and then it would not have been [in]’.

The output of the RNN reflects a probability distribution in which the values sum up to one and each word has its own predicted value (Figure 3A; see Figure 3—figure supplement 2 for differences across words and sentence position). As such, we can extract the predicted value of the uttered word and relate the RNN prediction to the stimulus onset delay relative to the previous word. We entered the word prediction in a regression with the stimulus onset difference between the current word in the sentence and the previous word (i.e., onset difference of words) as the dependent variable. We added the control variables bigram (using the NLTK toolbox based on the training data only), frequency of the previous word, syllable rate (rate of the full sentence input), and mean duration of the previous word (all variables that can account for part of the variance that affects the duration of the last word). We only used the test data (7361 sentences in total; after excluding all items for which the previous word (W-1) was not present in CELEX, 4837 sentences remained). Many of the variables were skewed to the right; therefore, we transformed the data accordingly (see Table 1; results were robust to changes in these transformations).

Table 1
Summary of regression model for logarithm of onset difference of words.
Variable | Trans | B | β | SE | t | p | VIF
Intercept | x | 0.9719 | | 0.049 | 19.764 | <0.001 |
RNN prediction | x^(1/6) | −0.3370 | −0.0862 | 0.047 | −7.163 | <0.001 | 1.5
Bigram | log(x) | −0.0118 | −0.0316 | 0.005 | −2.424 | 0.015 | 1.8
Word frequency W-1 | x | 0.0049 | 0.0076 | 0.009 | 0.546 | 0.585 | 2.0
Mean duration W-1 | log(x) | 1.1206 | 0.7003 | 0.022 | 50.326 | <0.001 | 2.0
Syllable rate | x | −0.1033 | −0.2245 | 0.004 | −23.014 | <0.001 | 1.0
  1. Model R2 = 0.542. Trans = transformation, W-1 = previous word, B = unstandardized coefficient, β = standardized coefficient, SE = standard error, t = t value, p = p value, VIF = variance inflation factor.

All predictors except the word frequency of the previous word showed a significant effect (Table 1). The variance explained by word frequency was likely captured by the mean duration of the previous word, which is correlated with word frequency. The RNN predictor could capture more variance than the bigram model, suggesting that word duration is modulated by the level of predictability within a fuller context than just the conditional probability of the current word given the previous word (Figure 3B,C). Importantly, it was necessary to use the trained RNN model as a predictor; entering the RNN predictions after the first training cycle (of a total of 100) did not result in a significant predictor (t(4837) = −1.191, p=0.234). Also, adding the predictor word frequency of the current word did not add significant information to the model (F(1, 4830) = 0.2048, p=0.651). These results suggest that words are systematically lengthened (or pauses are added; however, the same predictors remain significant when excluding sentences containing pauses) when the next word is not strongly predicted by the internal model. We also investigated whether RNN predictions have an influence on the duration of the word that has to be uttered, but found no effect on the duration (Supporting Table 1).
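
The regression summarized in Table 1 can be sketched as follows (assumed column names; not the authors' code), with each predictor transformed as listed in the table:

```python
# Onset-difference regression with the Table 1 transformations applied.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_onset_model(df: pd.DataFrame):
    d = pd.DataFrame({
        "log_onset_diff": np.log(df["onset_diff"]),         # word-to-word onset difference (s)
        "rnn_prediction": df["rnn_prediction"] ** (1 / 6),   # RNN probability of the current word
        "log_bigram":     np.log(df["bigram"]),              # conditional probability given W-1
        "word_freq_w1":   df["word_freq_w1"],                # frequency of the previous word
        "log_dur_w1":     np.log(df["mean_dur_w1"]),         # mean duration of the previous word
        "syllable_rate":  df["syllable_rate"],               # syllable rate of the sentence
    })
    formula = ("log_onset_diff ~ rnn_prediction + log_bigram + "
               "word_freq_w1 + log_dur_w1 + syllable_rate")
    return smf.ols(formula, data=d).fit()
```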

Sentence isochrony depends on prediction variance

In the previous section, we investigated word-to-word onsets, but did not investigate how this influences the temporal properties of a full sentence. In a regular sentence, predictability values change from word to word. Based on the previous results, it is expected that, overall, sentences with a more stable predictability level (sequential words being equally predictable) should be more isochronous than sentences in which the predictability shifts from high to low. This prediction is based on the observation that when predictions are equal, the expected shift is the same, while for varying predictions, temporal shifts vary (Figure 3B,C).

To test this hypothesis, we extracted the RNN predictions for 10 subsequent words. We then computed the variance of the prediction across those 10 words and extracted the word onsets themselves. We created a time course in which word onsets were set to 1 (at a sampling rate of 100 Hz). Then we performed a fast Fourier transform (FFT) and extracted z-transformed power values over a 0–15 Hz interval. The maximum power value within the theta range (3–8 Hz) was extracted. These max z-scores were correlated with the log transform of the variance (to normalize the skewed variance distribution; Figure 3E). We found a weak, but significant negative correlation (r = −0.062, p<0.001; Figure 3D), in line with our hypothesis. This suggests that the more variable the predictions within a sentence, the lower the peak power value is. When we repeated the analysis on the envelope, we did not find a significant effect.
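
A minimal sketch of this isochrony read-out (illustrative variable names, not the authors' code) is:

```python
# Build a 100 Hz pulse train of word onsets, z-score the FFT power over 0-15 Hz,
# and keep the peak within the 3-8 Hz theta range.
import numpy as np

FS = 100  # sampling rate (Hz)

def theta_peak_z(onsets_s: np.ndarray) -> float:
    n = int(np.ceil(onsets_s.max() * FS)) + 1
    pulse = np.zeros(n)
    pulse[(onsets_s * FS).astype(int)] = 1.0              # word onsets set to 1
    freqs = np.fft.rfftfreq(n, d=1 / FS)
    power = np.abs(np.fft.rfft(pulse)) ** 2
    band = freqs <= 15
    z = (power[band] - power[band].mean()) / power[band].std()
    theta = (freqs[band] >= 3) & (freqs[band] <= 8)
    return z[theta].max()                                 # peak z-scored theta power

# Across sentences, this peak would then be correlated with
# np.log(np.var(rnn_predictions_within_sentence)).
```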

Materials and methods

Speech Tracking in a Model Constrained Oscillatory Network


In order to investigate how much of these duration effects can be explained using an oscillator model, we created the model Speech Tracking in a Model Constrained Oscillatory Network (STiMCON). STiMCON in its current form is not exhaustive; however, it can show to what extent an oscillating network can cope with asynchronies by using its own internal model, illustrating how the brain’s language model and speech timing interact (Guest and Martin, 2021). The current model is capable of explaining how top-down predictions can influence processing time, as well as providing an explanation for two known temporal illusions in speech.

STiMCON consists of a network of semantic nodes in which the activation A at each level l of the model is governed by:

(1) $A_{l,T} = C_{l-1}^{l} \cdot A_{l-1,T} + C_{l+1}^{l} \cdot A_{l+1,T} + \mathrm{inhib}(T_a) + \mathrm{osc}(T)$

in which C represents the connectivity patterns between different hierarchical levels, T the time in the sentence, and Ta the vector of times (in milliseconds) since each individual node last reached suprathreshold activation, which enters the inhibition function. The inhibition function is a gate function:

(2) $\mathrm{inhib}(T_a) = \begin{cases} -3 \cdot \mathrm{BaseInhib} & 0 \le T_a < 20 \\ 3 \cdot \mathrm{BaseInhib} & 20 \le T_a < 100 \\ \mathrm{BaseInhib} & T_a \ge 100 \end{cases}$

in which BaseInhib is a constant for the base level of inhibition (a negative value, set to −0.2). As such, nodes are inhibited by default; as soon as they get activated above threshold (activation threshold set at 1), Ta is reset to zero. The node then has suprathreshold activation, which after 20 ms gives way to increased inhibition, until the base level of inhibition is restored. These values are set to reflect early excitation and longer lasting inhibition, and are only loosely related to neurophysiological time scales. The oscillation is a constant oscillator:

(3) $\mathrm{osc}(T) = A_m e^{2\pi i \omega T + i\varphi}$

in which Am is the amplitude of the oscillator, ω the frequency, and φ the phase offset. As such, we assume a stable oscillator that is already aligned to the average speech rate (see Rimmele et al., 2018; Poeppel and Assaneo, 2020 for phase alignment models). The model used for the current simulation has one input layer (the l−1 level) and a single layer of semantic word nodes (the l level) that receives feedback from a higher-level layer (the l+1 level). As such, only the word (l) level is modeled according to Equations 1–3; the other levels form fixed input and feedback connection patterns. Even though the feedback influences the activity at the word level, it does not cause a phase reset, as the phase of the oscillation does not change in response to this feedback.
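
A schematic implementation of Equations 1–3 for the word (l) level could look as follows; this is a sketch under the parameter values stated in the text (BaseInhib = −0.2, threshold = 1, 4 Hz oscillation), not the authors' released code, and it uses a real-valued sine in place of the complex exponential of Equation 3:

```python
# Schematic STiMCON word-level update. Time is in milliseconds; `t_since_act_ms`
# is T_a, the time since each node last crossed threshold (np.inf if never).
import numpy as np

BASE_INHIB = -0.2   # base level of inhibition (negative)
THRESHOLD = 1.0     # activation threshold

def osc(t_ms: float, freq_hz: float = 4.0, amp: float = 1.0, phi: float = 0.0) -> float:
    # Equation 3 (real-valued projection of the constant oscillator).
    return amp * np.sin(2 * np.pi * freq_hz * (t_ms / 1000.0) + phi)

def inhib(t_since_act_ms: np.ndarray) -> np.ndarray:
    # Equation 2: brief release right after threshold crossing, then stronger
    # inhibition, then a return to the base level of inhibition.
    out = np.full_like(t_since_act_ms, BASE_INHIB, dtype=float)
    out[(t_since_act_ms >= 0) & (t_since_act_ms < 20)] = -3 * BASE_INHIB
    out[(t_since_act_ms >= 20) & (t_since_act_ms < 100)] = 3 * BASE_INHIB
    return out

def word_level_activation(bottom_up, feedback, t_since_act_ms, t_ms):
    # Equation 1: the connectivity-weighted l-1 and l+1 terms are passed in as
    # `bottom_up` and `feedback`, then inhibition and the oscillation are added.
    return bottom_up + feedback + inhib(t_since_act_ms) + osc(t_ms)
```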

Language models influence time of activation


To illustrate how STiMCON can explain how processing time depends on the prediction of internal language models, we instantiated a language model that had only seen three sentences and five words presented at different probabilities (I eat cake at 0.5 probability, I eat nice cake at 0.3 probability, I eat very nice cake at 0.2 probability; Table 2). While in the brain the predictions should add up to 1, we can assume that the remaining probability is spread across a large number of word nodes of the full language model and is therefore negligible. This language model serves as the feedback arriving from the l+1 level to the l level. The l level consists of five nodes that each represent one of the words and receive proportional feedback from l+1 according to Table 2, with a delay of 0.9/ω seconds (0.9 of a cycle), which then decays at 0.01 unit per millisecond and influences the l level at a proportion of 1.5. The 0.9-cycle delay was chosen because we hypothesized that onset time would be loosely predicted around one oscillatory cycle, but to be prepared for input arriving slightly earlier (which of course happens for predictable stimuli), we set it to 0.9 times the length of the cycle. The decay is needed and set such that the feedback continues for around a full theta cycle. The proportion was set empirically to ensure that strong feedback caused suprathreshold activation at the active node. The feedback is only initiated when suprathreshold activation arises due to l−1-level bottom-up input. Each word at the l−1-level input is modeled as a linearly increasing input to the individual nodes lasting 125 ms (half a cycle, ramping from 0 to 1 arbitrary units). As such, the input is not the acoustic input itself but rather reflects a linear increase representing the increasing confidence that a word corresponds to the specific node. φ is set such that the peak of a 4 Hz oscillation aligns to the peak of the sensory input of the first word. Sensory input is presented at a base stimulus onset asynchrony of 250 ms (i.e., 4 Hz).

Table 2
Example of a language model.

This model has seen three sentences at different probabilities. Rows represent the prediction for the next word, e.g., /I/ predicts /eat/ at a probability of 1, but after /eat/ there is a wider distribution.

     | I | Eat | Very | Nice | Cake
I    | 0 | 1   | 0    | 0    | 0
eat  | 0 | 0   | 0.2  | 0.3  | 0.5
very | 0 | 0   | 0    | 1    | 0
nice | 0 | 0   | 0    | 0    | 1
cake | 0 | 0   | 0    | 0    | 0

When we present this model with different sensory inputs at an isochronous rhythm of 4 Hz, it is evident that the timing at which different nodes reach activation depends on the level of feedback that is provided (Figure 4). For example, while the /I/-node needs a while to get activated after the initial sensory input, the /eat/-node is activated earlier as it is pre-activated due to feedback. After presenting /eat/, the feedback arrives at three different nodes and the activation timing depends on the stimulus that is presented (earlier activation for /cake/ compared to /very/).
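
For concreteness, the feedback matrix of Table 2 written out in code (rows: currently active word node, columns: predicted next word) would be:

```python
# The language model of Table 2 as a numpy array, usable as the l+1 -> l
# feedback weights in the sketch above.
import numpy as np

WORDS = ["I", "eat", "very", "nice", "cake"]
FEEDBACK = np.array([
    #  I    eat   very  nice  cake
    [0.0, 1.0, 0.0, 0.0, 0.0],   # after /I/
    [0.0, 0.0, 0.2, 0.3, 0.5],   # after /eat/
    [0.0, 0.0, 0.0, 1.0, 0.0],   # after /very/
    [0.0, 0.0, 0.0, 0.0, 1.0],   # after /nice/
    [0.0, 0.0, 0.0, 0.0, 0.0],   # after /cake/
])
```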

Model output for different sentences.

For the suprathreshold activation, dark red indicates activation that included input from l+1 as well as l−1; orange indicates activation due to l+1 input. Feedback at different strengths causes phase-dependent activation (left). Suprathreshold activation is reached earlier when a highly predicted stimulus (right) arrives, compared to a mid-level predicted stimulus (middle).

Time of presentation influences processing efficiency


To investigate how the time of presentation influences the processing efficiency, we presented the model with /I eat XXX/ in which the last word was varied in content (Figure 5A; either /I/, /very/, /nice/, or /cake/), intensity (linearly ranging from 0 to 1), and onset delay (ranging between −125 and +125 ms relative to isochronous presentation). We extracted the time at which the node matching the stimulus presentation reached activation threshold first (relative to stimulus onset and relative to isochronous presentation).

Model output on processing efficiency.

(A) Input given to the model. Sensory input is varied in intensity and timing. We extract the time of activation relative to stimulus onset (supra-time) and relative to isochrony onset. (B) Time of presentation influences efficiency. Outcome variable is the time at which the node reached threshold activation (supra-time). The dashed line is presented to ease comparison between the four content types. White indicates that threshold is never reached. (C) Same as (B), but estimated at a threshold of 0.53 showing that oscillations regulate feedforward timing. Panel (C) shows that the earlier the stimuli are presented (on a weaker point of the ongoing oscillation), the longer it takes until supra-threshold activation is reached. This figure shows that timing relative to the ongoing oscillation is regulated such that the stimulus activation timing is closer to isochronous. Line discontinuities are a consequence of stimuli never reaching threshold for a specific node.

Figure 5B shows the output. When there is no feedback (i.e., at the first word /I/ presentation), a classical efficiency map can be found in which processing is most optimal (possible at the lowest stimulus intensities) at isochronous presentation (in phase with the stimulus rate) and then drops off to either side. For nodes that receive feedback, input processing is possible at earlier times relative to isochronous presentation and varies parametrically with prediction strength (earlier for /cake/ at 0.5 probability than for /very/ at 0.2 probability). Additionally, the activation function is asymmetric. This is a consequence of the interaction between the supra-activation caused by the feedback and the sensory input. As soon as supra-activation is reached due to the feedback, sensory input at any intensity will reach supra-activity (thus at early stages of the linearly increasing confidence of the input). This is why, for the /very/ stimulus, activation is still reached at later delays compared to /nice/ and /cake/, as the /very/-node reaches supra-activation due to feedback at a later time point. In regular circumstances, we would of course always want to process speech, even when it arrives at a less excitable phase. Note, however, that the current stimulus intensities were picked to exactly extract the threshold responses. When we increase our intensity range above 2.1, nodes will always get activated, even on the least excitable phase of the oscillation.

When we investigate timing differences in stimulus presentation, it is important to also consider what this means for timing in the brain. Above, we showed that the amount of prediction can influence timing in our model. It is also evident that the earlier a stimulus was presented, the more time it took (relative to the stimulus) for the nodes to reach threshold (more yellow colors for earlier delays). This is a consequence of the oscillation still being at a relatively low excitability point at stimulus onset for stimuli that are presented early in the cycle. However, when we translate these activation threshold timings to the timing of the ongoing oscillation, the variation is strongly reduced (Figure 5C). A stimulus timing that varies over 130 ms (e.g., from −59 to +72 ms in the /cake/ line; excluding the non-linear section of the line) leads to only 19 ms of variation in the first suprathreshold response of the model (translating to a reduction from 53% to 8% of the cycle of the ongoing oscillation, i.e., a 1:6.9 ratio). This means that within this model (and any oscillating model) the activation of nodes is robust to some timing variation in the environment. This effect seemed weaker when no prediction was present (for the /I/ stimulus this ratio was around 1:3.5; note that when determining the /cake/ range using the full line, the ratio would be 1:3.4).

Top-down interactions can provide rhythmic processing for non-isochronous stimulus input


The previous simulation demonstrates that oscillations provide a temporal filter and that processing at the word layer can actually be closer to isochronous than what can be extracted from the stimulus input alone. Next, we investigated whether, depending on changes in top-down prediction, processing within the model would be more or less rhythmic. To do this, we presented the model with stimulus input of 10 sequential words at a base rate of 4 Hz, with constant (Figure 6A; low at 0 and high at 0.8 predictability) or alternating word-to-word predictability. For the alternating conditions, word-to-word predictability alternated either from low to high (sequences in which words are predicted at 0 or 0.8 predictability, respectively) or from high to low. For this simulation, we used Gaussian sensory input (with a standard deviation of 42 ms, aligning the mean to the peak of the ongoing oscillation; see Figure 6—figure supplement 1 for output with linear sensory input). Then, we varied the onset time of the odd words in the sequence (shifting from −100 up to +100 ms) and the stimulus intensity (from 0.2 to 1.5). We extracted the overall activity of the model and computed the FFT of the created time course (using a Hanning taper, only including data from 0.5 to 2.5 s to exclude the onset responses). From this FFT, we extracted the peak activation at the stimulation rate of 4 Hz.
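
The spectral read-out used here can be sketched as follows (illustrative names, not the authors' code):

```python
# Given the summed model activity sampled at `fs` Hz, apply a Hanning taper to
# the 0.5-2.5 s window (excluding onset responses) and read out the spectral
# magnitude at the 4 Hz stimulation rate.
import numpy as np

def magnitude_at_4hz(activity: np.ndarray, fs: float = 1000.0) -> float:
    seg = activity[int(0.5 * fs):int(2.5 * fs)]
    seg = seg * np.hanning(seg.size)                    # Hanning taper
    freqs = np.fft.rfftfreq(seg.size, d=1 / fs)
    spectrum = np.abs(np.fft.rfft(seg))
    return spectrum[np.argmin(np.abs(freqs - 4.0))]     # peak at the stimulation rate
```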

Figure 6 with 2 supplements
Model output on rhythmicity.

(A) We presented the model with repeating (A, B) stimuli with varying internal models. We extracted the power spectra and peak activity at various odd stimulus offsets and stimulus intensities. (B) Strength of 4 Hz power depends on predictability in the stream. When predictability is alternated between low and high, activation is more rhythmic when the predictable odd stimulus arrives earlier and vice versa. (C) Power across different internal models at intensity of 0.8 and 1.0 (different visualization than B). (D) Magnitude spectra at three different odd word offsets at 1.0 intensity. To more clearly illustrate the differences, the magnitude to the power of 20 is plotted.

The first thing that is evident is that the model with no content predictions has the lowest overall power, but the strongest 4 Hz response around isochronous presentation (odd word offset of 0 ms) at high stimulus intensities (Figure 6B–D), closely following the acoustic input. Adding overall high predictability increases the power, but also here the power seems symmetric around a zero offset. The spectra of the alternating predictability conditions look different. For the low-to-high predictability condition, the curve seems to be shifted to the left, such that 4 Hz power is strongest when the predictable odd stimulus is shifted to an earlier time point (low–high condition). This is reversed for the high–low condition. At middle stimulus intensities, there is a specific temporal specificity window at which the 4 Hz power is particularly strong. This window is earlier for the low–high than the high–low alternation (Figure 6C,D and Figure 6—figure supplement 2). The effect only occurs at specific middle-intensity combinations, as at high intensities the stimulus dominates the response and at low intensities the stimulus does not reach threshold activation. These results show that even though the stimulus input is non-isochronous, the interaction with the internal model can still create a potentially isochronous structure in the brain (see Meyer et al., 2019; Meyer et al., 2020). Note that the direction in which the brain response is more isochronous matches the natural onset delays in speech (shorter onset delays for more predictable stimuli).

Model validation

STiMCON’s sinusoidal modulation of RNN predictions is optimally sensitive to natural onset delays


Next, we aimed to investigate whether STiMCON would be optimally sensitive to speech input timings found naturally in speech. Therefore, we fitted STiMCON’s expected word-to-word onset differences to the word-to-word onset differences we found in the CGN. At a stable level of input intensity and inhibition, the only aspect that changes the timing of the interaction between top-down predictions and bottom-up input within STiMCON is the ongoing oscillation. Considering that we only want to model, for individual words, how much the prediction ($C_{l+1}^{l} \cdot A_{l+1,T}$) influences the expected timing, we can set the contribution of the other factors in Equation (1) to zero, leaving only the relative contribution of the prediction:

(4) $C_{l+1}^{l} \cdot A_{l+1,T} = \text{top-down influence} = \mathrm{osc}(T)$

We can solve this formula to obtain the expected relative time shift (T) in processing as a consequence of the strength of the prediction (ignoring that the exact timing will also depend on the strength of the input and the inhibition):

(5) $\text{relative time shift} = \frac{1}{2\pi\omega}\left(\arcsin\left(\frac{C_{l+1}^{l} \cdot A_{l+1,T}}{A_m}\right) - \varphi\right)$

ω was set to the syllable rate of each sentence, and Am and φ were systematically varied. We fitted a linear model between STiMCON’s expected time shift and the actual word-to-word onset differences. This model was similar to the model described in the section ‘Word-by-word predictability predicts word onset differences’ and included the predictors syllable rate and mean duration of the previous word. However, as we were interested in how well non-transformed data match the natural onset timings, we did not perform any normalization besides Equation (5). As this might violate some of the assumptions of the ordinary least squares fit, we estimated model performance by repeating the regression 1000 times, fitting it on 90% of the data (only including the test data from the RNN) and extracting R2 from the remaining 10%.
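
A sketch of this Equation (5) read-out and the cross-validated fit is given below (assumed data layout and column names; not the authors' code):

```python
# Map RNN predictions to STiMCON's expected relative time shift (Equation 5)
# for a given oscillation amplitude and phase offset, then repeatedly fit a
# linear model on 90% of the data and score R^2 on the held-out 10%.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def relative_time_shift(prediction, syll_rate_hz, amp, phi):
    # Equation 5; the prediction/amplitude ratio is clipped to the arcsin domain here.
    ratio = np.clip(prediction / amp, -1.0, 1.0)
    return (np.arcsin(ratio) - phi) / (2 * np.pi * syll_rate_hz)

def cross_validated_r2(df: pd.DataFrame, amp: float, phi: float, n_rep: int = 1000) -> float:
    d = df.copy()
    d["expected_shift"] = relative_time_shift(d["rnn_prediction"], d["syllable_rate"], amp, phi)
    scores = []
    for _ in range(n_rep):
        test = d.sample(frac=0.1)
        train = d.drop(test.index)
        fit = smf.ols("onset_diff ~ expected_shift + syllable_rate + mean_dur_w1",
                      data=train).fit()
        resid = test["onset_diff"] - fit.predict(test)
        ss_tot = ((test["onset_diff"] - test["onset_diff"].mean()) ** 2).sum()
        scores.append(1 - (resid ** 2).sum() / ss_tot)
    return float(np.mean(scores))
```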

Results show a modulation of the R2 depending on the amplitude and phase offset of the oscillation (Figure 7A). This was stronger than for a model in which the transformation in Equation (5) was not applied (R2 for a model with no transformation was 0.389). This suggests that STiMCON’s expected time shifts match the actual word-by-word onset differences. This was even more strongly so for specific oscillatory alignments (around a −0.25π offset), suggesting that an optimal alignment phase relative to the ongoing oscillation is needed for optimal tracking (Giraud and Poeppel, 2012; Schroeder and Lakatos, 2009). Interestingly, the optimal transformation seemed to automatically alter a highly skewed prediction distribution (Figure 7B) toward a more normal distribution of relative time shifts (Figure 7C). Note that the current prediction only operated on the word node (for which we have the RNN predictions), while full temporal shifts are probably better explained by word, syllabic, and phrasal predictions combined.

Fit between real and expected time shift dependent on predictability.

(A) Phase offset and amplitude of the oscillation modulate the fit to the word-to-word onset durations. (B) Histogram of the predictions created by the deep neural net. (C) Histogram of the relative time shift transformation at phase of −0.15π and amplitude of 1.5.

STiMCON can explain perceptual effects in speech processing


Due to the differential feedback strength and the inhibition after suprathreshold feedback stimulation, STiMCON is more sensitive to less predictable stimuli at phases later in the oscillatory cycle. This property can explain two illusions that have been reported in the literature, specifically, the observation that the interpretation of ambiguous input depends on the phase of presentation (Ten Oever and Sack, 2015; Kayser et al., 2016; Ten Oever et al., 2020) and on speech rate (Bosker and Reinisch, 2015). The only assumption that has to be made is that there is an uneven base prediction balance between the ways the ambiguous stimulus can be interpreted.

The empirical data we aim to model comprise an experiment in which ambiguous syllables that could be interpreted as either /da/ or /ga/ were presented (Ten Oever and Sack, 2015). In one of the experiments in this study, broadband stimuli were presented at specific rates to entrain ongoing oscillations. After the last entrainment stimulus, an ambiguous /daga/ stimulus was presented at different delays (covering two cycles of the presentation rate in 12 different steps), putatively reflecting different oscillatory phases. Depending on the delay of stimulation, participants perceived either /da/ or /ga/, suggesting that phase modulates the percept. Besides this behavioral experiment, the authors also demonstrated that the same temporal dynamics were present in ongoing EEG data, showing that the phase of ongoing oscillations at the onset of ambiguous stimulus presentation determined the percept (Ten Oever and Sack, 2015).

To illustrate that STiMCON is capable of showing a phase- (or delay-) dependent effect, we use an internal language model similar to our original model (Table 2). The model consists of four nodes (N1, N2, Nda, and Nga). N1 and N2 represent nodes responsive to two stimuli, S1 and S2, that function as entrainment stimuli. N1 activation predicts a second unspecific stimulus (S2), represented by N2, at a predictability of 1. N2 activation predicts either /da/ or /ga/ at 0.2 and 0.1 probability, respectively. This uneven prediction of /da/ and /ga/ is justified as /da/ is more prevalent in the Dutch language than /ga/ (Zuidema, 2010), and it thus has a higher predicted level of occurrence. Then, we present STiMCON (same parameters as before) with /S1 S2 XXX/. XXX is varied to have different proportions of the stimuli /da/ and /ga/ (ranging from 0% /da/ to 100% /da/ in 12 steps; these reflect relative proportions that sum up to one, such that at 30% the intensity of /da/ would be at most 0.3 and of /ga/ 0.7), and its onset is varied relative to the second-to-last word. We extract the time at which a node reaches suprathreshold activity after stimulus onset. If both nodes were active at the same time, the node with the highest total activation was chosen. Results showed that for some ambiguous stimuli, the delay determines which node is activated first, modulating the ultimate percept of the participant (Figure 8A; also see Figure 8—figure supplement 1A). The same type of simulation can explain how speech rate can influence perception (Figure 8—figure supplement 1B; but see Bosker and Kösem, 2017).

Figure 8 with 1 supplement
Results for /daga/ illusions.

(A) Modulations due to ambiguous input at different times. Illustration of the node that is active first. Different proportions of the /da/ stimulus show activation timing modulations at different delays. (B) Summary of the model and the parameters altered for the empirical fits in (C) and (D). (C) R2 for the grid search fit of the full model using the first active node as outcome variable, and of models without inhibition (no inhib), without uneven feedback (no fb), or without an oscillation (no os). The right panel shows the fit of the full model on the rectified behavioral data of Ten Oever and Sack, 2015. Blue crosses indicate rectified data and red lines indicate the fit. (D) Same as (C) but using the average activity instead of the first active node. Removing the oscillation results in an R2 less than 0.

To further scrutinize this effect, we fitted our model to the behavioral data of Ten Oever and Sack, 2015. As we used an iterative approach in the simulations of the model, we optimized the model using a grid search. We varied the proportion of the stimulus being /da/ versus /ga/ (ranging between 10:5:80%), the onset time of the feedback (0.1:0.1:1.0 cycle), the speed of the feedback decay (0:0.01:0.1), and a temporal offset of the final sound to account for the time it takes to interpret a specific ambiguous syllable (ranging between −0.05:0.01:0.05 s). Our first outcome variable was the node that showed the first suprathreshold activation (Nda = 1, Nga = 0). If both nodes were active at the same time, the node with the highest total activation was chosen. If both nodes had equal activation or never reached threshold activation, we coded the outcome as 0.5 (i.e., fully ambiguous). These outcomes were fitted to the behavioral data of the 6.25 Hz and 10 Hz presentation rates (the two rates showing a significant modulation of the percept). These data were normalized to have a range between 0 and 1 to account for the model outcomes being discrete (0, 0.5, or 1). As a second outcome measure, we also extracted the relative activity of the /da/ and /ga/ nodes by subtracting their activity and dividing by the summed activity. The activity was calculated as the average activity over a window of 500 ms after stimulus onset, and the final time course was normalized between 0 and 1.
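
Schematically, the grid search could be implemented as below; here `simulate` stands in for running STiMCON with a given parameter set and returning the coded outcome per delay, and is passed in as a callable rather than spelled out:

```python
# Grid search over the four varied parameters (MATLAB-style start:step:stop
# ranges as stated above), maximizing R^2 against the normalized behavioral data.
import itertools
import numpy as np

DA_PROPORTIONS  = np.arange(0.10, 0.80 + 1e-9, 0.05)   # proportion of /da/ in the mixture
FEEDBACK_ONSETS = np.arange(0.1, 1.0 + 1e-9, 0.1)      # feedback onset (cycles)
FEEDBACK_DECAYS = np.arange(0.0, 0.10 + 1e-9, 0.01)    # feedback decay per ms
SOUND_OFFSETS   = np.arange(-0.05, 0.05 + 1e-9, 0.01)  # temporal offset of the final sound (s)

def grid_search(behavior: np.ndarray, delays: np.ndarray, simulate):
    best_params, best_r2 = None, -np.inf
    for params in itertools.product(DA_PROPORTIONS, FEEDBACK_ONSETS,
                                    FEEDBACK_DECAYS, SOUND_OFFSETS):
        predicted = simulate(delays, *params)            # coded 0 / 0.5 / 1 per delay
        ss_res = np.sum((behavior - predicted) ** 2)
        ss_tot = np.sum((behavior - behavior.mean()) ** 2)
        r2 = 1 - ss_res / ss_tot
        if r2 > best_r2:
            best_params, best_r2 = params, r2
    return best_params, best_r2
```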

For the first-node activation analysis, we found that our model could fit the data with an average explained variance of 43% (30% and 58% for 6.25 Hz and 10 Hz, respectively; Figure 8C,D). For the average activity analysis, we found a fit with 83% explained variance. Compared to the original sine fit, this explained variance was higher for the average activation analysis (40% for a three-parameter sine fit [amplitude, phase offset, and mean]). Note that for the first-node activation analysis, our fit cannot account for variance ranging between 0–0.5 and 0.5–1, while the sine fit can. If we correct for this (by setting the sine fit to the closest 0, 0.5, or 1 value and doing a grid search to optimize the fit), the average fit of the sine is 21%. Comparing the fits of the rectified sine versus the first-node activation reveals an average Akaike information criterion of −27.0 for the model and −24.1 for the sine fit. For the average activation analysis, this was −41.5 versus −27.8, respectively. Overall, this suggests that STiMCON fits the data better than a fixed-frequency sine does. This is likely a consequence of the sine fit not being able to explain the dampening of the oscillatory effect at later delays (i.e., the perception bias is stronger for shorter compared to longer delays).

Finally, we investigated the relevance of the three key features of our model for this fit: inhibition, feedback, and oscillations (Figure 8B). We repeated the grid search fit but set either the inhibition to zero, the feedback matrix equal for /da/ and /ga/ (both 0.15), or the oscillation amplitude to zero. For both outcome measures, the full model showed the best performance. Without the oscillation, the models could not even fit better than the mean of the data (R2 < 0). Removing the feedback had a negative influence on both outcome measures, dropping the performance. Removing the inhibition reduced performance for both outcome measures, but more strongly for the average activation than for the first-active-node measure. This suggests that all features (with inhibition potentially to a lesser extent) are required to model the data, indicating that oscillatory tracking depends on linguistic constraints flowing from the internal language model.

Discussion

In the current paper, we combined an oscillating computational model with a proxy for linguistic knowledge, an internal language model, in order to investigate the model’s processing capacity for onset timing differences in natural speech. We show that word-to-word speech onset differences in natural speech are indeed related to predictions flowing from the internal language model (estimated through an RNN). Fixed oscillations aligned to the mean speech rate are robust against natural temporal variations and even optimized for temporal variations that match the predictions flowing from the internal model. Strikingly, when the pseudo-rhythmicity in speech matched the predictions of the internal model, responses were more rhythmic for this matched pseudo-rhythmic input than for isochronous speech input. Our model is optimally sensitive to natural speech variations, can explain phase-dependent speech categorization behavior (Ten Oever and Sack, 2015; Thézé et al., 2020; Reinisch and Sjerps, 2013; Ten Oever et al., 2020), and naturally comprises a neural phase code (Panzeri et al., 2015; Mehta et al., 2002; Lisman and Jensen, 2013). These results show that part of the pseudo-rhythmicity of speech is expected by the brain, which is even optimized to process speech in this manner, but only when it follows the internal model.

Speech timing is variable, and in order to understand how the brain tracks this pseudo-rhythmic signal, we need a better understanding of how this variability arises. Here, we isolated one of the components explaining speech time variation, namely, the constraints that are posed by an internal language model. This goes beyond extracting the average speech rate (Ding et al., 2017; Poeppel and Assaneo, 2020; Pellegrino and Coupé, 2011) and might be key to understanding how a predictive brain uses temporal cues. We show that speech timing depends on the predictions made by an internal language model, even when those predictions are highly reduced to be as simple as word predictability. While syllables generally follow a theta rhythm, there is a systematic increase in syllabic rate as soon as more syllables are in a word. This is likely a consequence of the higher cloze probability of syllables within a word, which reduces the onset differences of the later uttered syllables (Thompson and Newport, 2007). However, an oscillatory model constrained by an internal language model is not only sensitive to these temporal variations, it is actually capable of processing them optimally.

The oscillatory model we pose here has three components: oscillations, feedback, and inhibition. The oscillations allow for the parsing of speech and provide windows in which information is processed (Giraud and Poeppel, 2012; Ghitza, 2012; Peelle and Davis, 2012; Martin and Doumas, 2017). Importantly, the oscillation acts as a temporal filter, such that the activation time of any incoming signal will be confined to the highly excitable window and is thereby relatively robust against small temporal variations (Figure 5C). The feedback allows for differential activation time dependent on the sensory input (Figure 5B). As a consequence, the model is more sensitive to more predictable speech input, which is therefore active earlier in the duty cycle (this also means that oscillations are less robust against temporal variations when the feedback is very strong). The inhibition allows the network to be more sensitive to less predictable speech units when they arrive later (the more predictable nodes get inhibited at some point in the oscillation; best illustrated by the simulation in Figure 8A). In this way, speech is ordered along the duty cycle according to its predictability (Lisman and Jensen, 2013; Jensen et al., 2012). The feedback in combination with an oscillatory model can explain speech rate and phase-dependent content effects. Moreover, it is an automatic temporal code that can use time of activation as a cue for content (Mehta et al., 2002). Note that previously we have interpreted the /daga/ phase-dependent effect as a mapping of differences between natural audio-visual onset delays of the two syllabic types onto oscillatory phase (Ten Oever et al., 2013; Ten Oever and Sack, 2015). However, the current interpretation is not mutually exclusive with this delay-to-phase mapping, as audio-visual delays could be bigger for less frequent syllables. The three components of the model are common brain mechanisms (Malhotra et al., 2012; Mehta et al., 2002; Buzsáki and Draguhn, 2004; Bastos et al., 2012; Michalareas et al., 2016; Lisman, 2005) and follow many previously proposed organizational principles (e.g., temporal coding and parsing of information). While we implement these components at an abstract level (not veridical to the exact parameters of neuronal interactions), they illustrate how oscillations, feedback, and inhibition interact to optimize sensitivity to natural pseudo-rhythmic speech.

The current model is not exhaustive and does not provide a complete explanation of all the details of speech processing in the brain. For example, it is likely that the primary auditory cortex is still mostly modulated by the acoustic pseudo-rhythmic input and that only later brain areas follow more closely the constraints posed by the language model of the brain. Moreover, we focus here on the word level, while many tracking studies have shown the importance of syllabic temporal structure (Giraud and Poeppel, 2012; Ghitza, 2012; Luo and Poeppel, 2007) as well as the role of higher-order linguistic temporal dynamics (Meyer et al., 2019; Kaufeld et al., 2020b). It is likely that predictive mechanisms operate on these higher linguistic levels as well as on the syllabic level. It is known, for example, that syllables are shortened when the following syllabic content is known compared to when syllables are produced in isolation (Pluymaekers et al., 2005a; Lehiste, 1972). Interactions also occur, as syllables that are part of more frequent words are generally shortened (Pluymaekers et al., 2005b). Therefore, more hierarchical levels need to be added to the current model (which is possible following Equation (1)). Moreover, the current model does not allow for phase or frequency shifts. This was intentional, in order to investigate how much a fixed oscillator could explain. We show that onset times matching the predictions from the internal model can be explained by a fixed oscillator processing pseudo-rhythmic input. However, when the internal model and the onset timings do not match, phase and/or frequency shifts are still required and need to be incorporated (see e.g. Rimmele et al., 2018; Poeppel and Assaneo, 2020).

We aimed to show that a stable oscillator can be sensitive to temporal pseudo-rhythmicities when these shifts match predictions from an internal linguistic model (causing higher sensitivity to the corresponding nodes). In this way, we show that temporal dynamics in speech and the brain cannot be isolated from processing the content of speech. This is in contrast with other models that try to explain how the brain deals with pseudo-rhythmicity in speech (Giraud and Poeppel, 2012; Rimmele et al., 2018; Doelling et al., 2019). While some of these models discuss that higher-level linguistic processing can modulate the timing of ongoing oscillations (Rimmele et al., 2018), they typically do not consider that in the speech signal itself the content or predictability of a word relates to the timing of that word. Phase resetting models typically deal with pseudo-rhythmicity by shifting the phase of ongoing oscillations in response to a word that is offset relative to the mean frequency of the input (Giraud and Poeppel, 2012; Doelling et al., 2019). We believe that this cannot explain how the brain uses what/when dependencies in the environment to infer the content of a word (e.g., a later word is likely a less predictable word). Our current model does not explain how the brain can actually entrain to an average speech rate. This is much better described in dynamical systems theories, in which entrainment to the average rate is a consequence of the coupling strength between internal oscillations and speech acoustics (Doelling et al., 2019; Assaneo et al., 2021). However, these models do not take top-down predictive processing into account. Therefore, the best way forward is likely to combine coupling between brain oscillations and speech acoustics (Poeppel and Assaneo, 2020) with coupling of brain oscillations to the brain activity patterns of internal models (Cumin and Unsworth, 2007).

In the current paper, we use an RNN to represent the internal model of the brain. However, it is unlikely that the RNN captures the full complexity of the language model in the brain. The decades-long debate about the origin of a language model in the brain remains ongoing and controversial. Using the RNN as a proxy for our internal language model makes the tacit assumption that language is fundamentally statistical or associative in nature, and does not posit the derivation or generation of knowledge of grammar from the input (Chater, 2001; McClelland and Elman, 1986). In contrast, our brain could also store knowledge of language that functions as a set of fundamental interpretation principles guiding our understanding of language input (Martin, 2016; Martin, 2020; Hagoort, 2017; Martin and Doumas, 2017; Friederici, 2011). Knowledge of language and linguistic structure could be acquired through an internal self-supervised comparison process extracted from environmental invariants and statistical regularities in the stimulus input (Martin and Doumas, 2019; Doumas et al., 2008; Doumas and Martin, 2018). Future research should investigate which type of language model better accounts for the temporal variations found in speech.
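As a concrete illustration of what we mean by an RNN-based proxy, the sketch below shows a next-word prediction network of the kind used here. Layer sizes, the embedding dimensionality, and the (commented) training call are placeholder assumptions rather than the configuration used for the reported analyses.

```python
import tensorflow as tf

# Schematic next-word prediction model: given the preceding words, it outputs a
# probability distribution over the vocabulary for the upcoming word. All
# hyperparameters below are placeholders, not the settings used in the paper.
VOCAB_SIZE = 3096   # number of unique words in the corpus sample
EMBED_DIM = 300     # word-embedding dimensionality (assumed)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# x: integer-coded word contexts, y: index of the following word (hypothetical data)
# model.fit(x, y, epochs=100)
# next_word_probs = model.predict(x_new)   # predictability of each candidate word
```

The output probabilities of such a network are what the oscillatory model treats as word-level predictability.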

A natural feature of our model is that time can act as a cue for content, implemented as a phase code (Lisman and Jensen, 2013; Jensen et al., 2012). This code unfolds as an ordered list of predictability strengths under the internal model. This idea diverges from the idea that entrainment should align the most excitable phase of the oscillation with the highest energy in the acoustics (Giraud and Poeppel, 2012; Rimmele et al., 2018). Instead, this type of phase coding could increase the brain's representational space to separate information content (Lisman and Jensen, 2013; Panzeri et al., 2001). We predict that if speech nodes have a different base activity, ambiguous stimulus interpretation should depend on the time/phase of presentation (see Ten Oever and Sack, 2015; Ten Oever et al., 2020). Indeed, we could model two temporal speech illusions (Figure 8, Figure 8—figure supplement 1). There have also been null results regarding the influence of phase on ambiguous stimulus interpretation (Bosker and Kösem, 2017; Kösem et al., 2016). For the speech rate effect, no obvious phase effect was reported when the time of presentation was modified with a neutral entrainer (summed sinusoids with random phases; Bosker and Kösem, 2017). A second null result relates to a study in which participants were specifically instructed to maintain a specific perception in different blocks, which likely increases pre-activation and thereby shifts the phase of activation (Kösem et al., 2016). Future studies need to investigate the use of temporal/phase codes to disambiguate speech input and should specifically incorporate predictions in their design.

The temporal dynamics of speech signals need to be integrated with the temporal dynamics of brain signals. However, it is unnecessary (and unlikely) that the exact durations in speech match the exact durations of brain processes. Temporal expansion or compression of stimulus input occurs regularly in the brain (Eagleman et al., 2005; Pariyadath and Eagleman, 2007), and this temporal morphing also maps onto duration (Eagleman, 2008; Terao et al., 2008; Ulrich et al., 2006) or order illusions (Vroomen and Keetels, 2010). Our model predicts more rhythmic responses for non-isochronous speech that matches the internal model. The perceived rhythmicity of speech could therefore also be an illusion generated by a rhythmic signal somewhere in the brain.

When investigating the pseudo-rhythmicity of speech, it is important to identify situations in which speech is actually more isochronous. Two examples are the production of lists (Jefferson, 1990) and infant-directed speech (Fernald, 2000). In both examples, a strong internal predictive language model is lacking on the producer's or on the receiver's side, respectively. Infant-directed speech also illustrates that a producer might proactively adapt their speech rhythm to align better with the predictions from the receiver's internal model (Figure 9B; similar to when you are speaking to somebody who is just learning a new language). Other examples in which speech is more isochronous are poetry, emotional conversation (Hawkins, 2014), and noisy situations (Bosker and Cooke, 2018). While speculative, it is conceivable that in these circumstances one puts more weight on a different level of the hierarchy than the internal linguistic model. In the case of poetry and emotional conversation, an emotional route might get more weight in processing. In noisy situations, stimulus input has to pass the first hierarchical level of the primary auditory cortex, which effectively gets more weight than the internal model.

Predictions of the model.

(A) Acoustic signals will be more isochronous when a producer has a weak versus a strong internal model (top right). When the producer's strong model matches the receiver's model, the brain response will be more isochronous for less isochronous acoustic input. (B) When a producer realizes that the receiver's model is weak, they might adapt their model, and thereby their speech timing, to match the receiver's expectations.

Conclusions

We argued that pseudo-rhythmicity in speech is in part a consequence of top-down predictions flowing from an internal model of language. This pseudo-rhythmicity is created by a speaker and expected by a receiver if they have overlapping internal language models. Oscillatory tracking of this signal does not need to be hampered by the pseudo-rhythmicity, but can use temporal variations as a cue to extract content information, since the phase of activation parametrically relates to the likelihood of an input relative to the internal model. Brain responses can even be more isochronous to pseudo-rhythmic than to isochronous speech if the temporal variations follow the delays imposed by the internal model. This account provides various testable predictions, which we list in Table 3 and Figure 9. We believe that by integrating neuroscientific explanations of speech tracking with linguistic models of language processing (Martin, 2016; Martin, 2020), we can better explain temporal speech dynamics. This will ultimately aid our understanding of language in the brain and provide a means to improve the temporal properties of speech synthesis.

Table 3
Predictions from the current model.
When there is a flat constraint distribution over an utterance (e.g., when probabilities are uniform over the utterance), the acoustics of speech should naturally be more isochronous (Figures 9A and 3D,E).
If speech timing matches the internal language model, brain responses should be more isochronous even if the acoustics are not (Figure 9A).
The more similar the internal language models of two speakers, the more effectively they can ‘entrain’ each other’s brains.
If speakers suspect their listener to have a flatter constraint distribution than their own (e.g., the environment is noisy, or the speakers are in a second-language context), they adjust to that distribution by speaking more isochronously (Figure 9B).
One adjusts the weight of the constraint distribution at a given hierarchical level when needed. For example, when there is noise, speakers adjust to the rhythm of the primary auditory cortex instead of higher-order language models. As a consequence, they speak more isochronously.
The theoretical account provides various predictions that are listed in this table.

Code availability statement

Code for the creation of the main figures is available on GitHub (Ten Oever & Martin, 2021; copy archived at swh:1:rev:873a2bf5c79fe2f828e72e14ef74db409d387854).

Data availability

The data used relate to the Corpus Gesproken Nederlands (Spoken Dutch Corpus). Information about this dataset can be found here: http://lands.let.ru.nl/cgn/. Access to the dataset can be requested here: https://taalmaterialen.ivdnt.org/download/tstc-corpus-gesproken-nederlands/. Data regarding the simulations in Figure 8 are based on data from Ten Oever & Sack (2015). As these data belong to a closed database owned by Maastricht University, they are not openly available. However, the data are available upon request without any restrictions via sanne.tenoever@mpi.nl or datamanagement-fpn@maastrichtuniversity.nl.

References

  1. Bosker HR, Reinisch E. (2015) Normalization for Speechrate in Native and Nonnative Speech. In: 18th International Congress of Phonetic Sciences (ICPhS 2015). International Phonetic Association.
  2. Chater M. (2001) Connectionist Psycholinguistics. Greenwood Publishing Group.
  3. Hawkins S. (2014) Situational influences on rhythmicity in speech, music, and their interaction. Philosophical Transactions of the Royal Society B: Biological Sciences 369:20130398. https://doi.org/10.1098/rstb.2013.0398
  4. Lehiste I. (1972) The timing of utterances and linguistic boundaries. The Journal of the Acoustical Society of America 51:2018–2024. https://doi.org/10.1121/1.1913062
  5. Monsell S. (1991) The Nature and Locus of Word Frequency Effects in Reading. Routledge.
  6. Nolan F, Jeon H-S. (2014) Speech rhythm: a metaphor? Philosophical Transactions of the Royal Society B: Biological Sciences 369:20130396. https://doi.org/10.1098/rstb.2013.0396
  7. Powers DM. (1998) Applications and explanations of Zipf’s law. In: New Methods in Language Processing and Computational Natural Language Learning.
  8. Rosen S. (1992) Temporal information in speech: acoustic, auditory and linguistic aspects. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 336:367–373. https://doi.org/10.1098/rstb.1992.0070
  9. Zuidema W. (2010) A Syllable Frequency List for Dutch. Taalportaal.

Decision letter

  1. Andrew J King
    Senior Editor; University of Oxford, United Kingdom
  2. Anne Kösem
    Reviewing Editor; Lyon Neuroscience Research Center, France
  3. Anne Kösem
    Reviewer; Lyon Neuroscience Research Center, France
  4. Johanna Rimmele
    Reviewer; Max-Planck-Institute for Empirical Aesthetics, Germany
  5. Keith Doelling
    Reviewer

Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Acceptance summary:

The manuscript is of broad interest to readers in the field of speech recognition and neural oscillations. The authors provide a computational model which, in addition to feedforward acoustic input, incorporates linguistic predictions as feedback, allowing a fixed oscillator to process non-isochronous speech. The model is tested extensively by applying it to a linguistic corpus, EEG and behavioral data. The article gives new insights to the ongoing debate about the role of neural oscillations and predictability in speech recognition.

Decision letter after peer review:

Thank you for submitting your article "Oscillatory tracking of pseudo-rhythmic speech is constrained by linguistic predictions" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Anne Kösem as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Andrew King as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Johanna Rimmele (Reviewer #2); Keith Doelling (Reviewer #3).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

All reviewers had a very positive assessment of the manuscript. They find the described work highly novel and interesting. However, they also find the manuscript quite dense and complex, and suggest some clarifications in the description of the model and in the methods.

Please find a list of the recommendations below.

Reviewer #1 (Recommendations for the authors):

1. First of all, I think that the concept of "internal language model" should be defined and described in more detail in the introduction. What does it mean when it is weak and when it is strong, for the producer and for the receiver?

2. It is still not fully clear to me what kind of information the internal oscillation is entraining to in the model. From Figure 1, it seems that the oscillation is driven by the acoustics only, but the phase of processing of linguistic units depends on their predictability.

3. If acoustic information arrives at the most excitable phase of the neural oscillation (as described in figure 1), and if predictability makes words arrive earlier, does it entail that more predictable words arrive at less excitable phases of the neural oscillation? What would be the computational advantage of this mechanism?

4. What is "stimulus intensity" in figure 5? Does it reflect volume or SNR?

5. Similarly what is "amplitude" in Figure 6?

6. L. 376-L. 439 "N2 activation predicts either da or ga at 0.2 and 0.1 probability respectively." Please explain why the probabilities are not equal in the model, same for l. 445 "intensity of /da/ would be at max 0.3. and of /ga/ 0.7".

7. Table 3: I feel that the first prediction is not actually a prediction, but a result of the article, as the first data shows that "The more predictable a word, the earlier this word is uttered."

8. Table 3 and Figure 8A: I think that the second prediction that "When there is a flat constraint distribution over an utterance (e.g., when probabilities are uniform over the utterance) the acoustics of speech should naturally be more rhythmic (Figure 8A)." could be tested with the current data. In the speech corpus, are sentences with lower linguistic constraints more rhythmic?

9. Table 3: "If speech timing matches the internal language model, brain responses should be more rhythmic even if the acoustics are not (Figure 8A)." What do the authors mean by "more rhythmic"? Does it mean the brain follows more accurately the acoustics? Does it generate stronger internal rhythms that are distinct from the acoustic temporal structure?

10. Figure 5 C: "Strength of 4 Hz power": what is the frequency bandwidth?

11. Figure 5 D: " Slice of D", Slice of C?

12. L 442: "propotions" -> "proportions".

13. Abstract: "Our results reveal that speech tracking does not only rely on the input acoustics but instead entails an interaction between oscillations and constraints flowing from internal language model " I think this claim is too strong, considering that the article does not present direct electrophysiological evidence.

Reviewer #2 (Recommendations for the authors):

1. In the model the predictability is computed at the word-level, while the oscillator operates at the syllable level. The authors show different duration effects for syllables within words, likely related to predictability. Is there any consequence of this mismatch of scales?

2. Furthermore, could the authors clarify whether or not and how they think the model mechanism is different from top-down phase reset (e.g. l. 41). It seems that the excitability cycle at the intermediate word-level is shifted from being aligned to the 4 Hz oscillator though the linguistic feedback from layer l+1. Would that indicate a phase resetting at the word-level layer through the feedback?

3. The model shows how linguistic predictability can affect neuronal excitability in an oscillatory model, allowing to improve the processing of non-isochronous speech. I do not fully understand the claim that the linguistic predictability makes the processing (at the word-level) more isochronous, and why such isochronicity is crucial.

4. The authors showed that word frequency affects the duration of a word. Now the RNN model relates the predictability of a word (output) to the duration of the previous word W-1 (l. 187). Didn't one expect from Figure 1B that the duration of the actually predicted word is affected? How are these two effects related?

5. Title: is "constrained" the right word here, rather "modulated"? As we can process non-predictable speech.

6. See l. 129: "In this way, oscillations do not have to shift their phase after every speech unit and can remain at a relatively stable frequency as long as the internal model of the speaker matches the internal model of the perceiver." It seems to me that in the model the authors introduce, the phase-shifting still occurs. Even though the oscillator component is fixed, the activation threshold fluctuations at the word-level are "shifted" due to the feedback. So there is no feedforward phase-reset, however, a phase-reset due to feedback?

7. l. 219: why was bigrams added as control variable?

8. l. 233 in l. 142 it says that only 2848 words were present in CELEX. Where the 4837 sentences consisting of the 2848 words?

9. Figure 2 D,E the labeling with ρ and p is confusing, I'd at least state consistently both, so one sees the difference.

10. Table 1 legend: could you add why the specific transformations were performed?

11. l. 204: the β coefficient is rather small compared to the duration of W-1 effect. The dependent variable onset-to-onset should be strongly correlated with the W-1 duration. I wonder if this is a problem?

12. l. 249: what is meant with "after the first epoch"?

13. l. 254: how local were these lengthening effects? Did the predictability based on the trained RNN strongly vary across words or rather vary on a larger scale i.e. full sentences being less predictable than others?

14. l. 268: Could you explain where the constants are coming from: like the 20 and 100 ms windows for inhibition and the values -0.2 and -3. The function inhibit(ta) is not clear to me. What is the output when Ta is 0 versus 1?

15. Figure 4: the legend is very short, adding some description what the figure illustrates would make it easier to follow. The small differences in early/late activation are hard to see, particularly for the 4th row. Maybe it would help to add lines?

16. Figure 5 B: could you clarify the effect at late stim times relative to isochronous, i.e. why the supra time relative to isochronous decreases for highly predictable stimuli. I assume this is to the inhibition function?

17. How is the connectivity between layers defined? Is it symmetric for feedforward and feedback?

18. l. 294/l. 205: "with a delay of 0.9*ω seconds, which then decays at 0.01 unit per millisecond and influences the l-level at a proportion of 1.5." where are the constants coming from?

19. l. 347: "the processing itself can actually be closer to isochronous than what can be solely extracted from the stimulus". This refers to Figure 5 D I assume. Did you directly compare the acoustics and the model output with respect to isochrony?

20. l. 437-438: I am not fully understanding these choices: why is N1 represented by N2? Why is the probability of da and ga uneaven, and why are there nodes for da and ga (Nda, Nga) plus a node N2 which predicts both with different probability?

21. Figure 5: why is the power of the high-high predictable condition the lowest. Is this an artifact of the oscillator in the model being fixed at 4 Hz or related to the inhibition function? High-high should like low-low result in rather regular, but faster acoustics?

22. l. 600: "The perceived rhythmicity" In my view speech has been suggested to be quasi-rhythmic, as (1) some consistency in syllable duration has been observed within/across languages, and (2) as (quasi-)rhythmicity seemed a requirement to explain how segmentation of speech based on oscillations could work in the absence of simple segmentation cues (i.e. pauses between syllables). While one can ask when something is "rhythmic enough" to be called rhythmic, I don't understand why this is related to "perceived rhythmicity".

23. l. 604: interesting thought!

Reviewer #3 (Recommendations for the authors):

1. An important question is how the authors relate these findings to the Giraud and Poeppel, 2012 proposal which really focuses on the syllable. Would you alter the hypothesis to focus on the word level? Or remain at the syllable level and speed up and low down the oscillator depending on the predictability of each word? It would be interesting to hear the authors thoughts on how to manage the juxtaposition of syllable and word processing in this framework.

2. The authors describe the STiMCON model as having an oscillator with frequency set to the average stimulus rate of the sentence. But how an oscillator can achieve this on its own (without the hand of its overloads) is unclear particularly given a pseudo-rhythmic input. The authors freely accept this limitation. However, it is worth noting that the ability for an oscillator mechanism to do this under pseudorhythmic context is more complicated than it might seem, particularly once we include that the stimulus rate might change from the beginning to the end of a sentence and across an entire discourse.

3. The analysis of the naturalistic dataset shows a nice correlation between the estimated time shifts predicted by the model and the true naturalistic deviations. However, I find it surprising that there is so little deviation across the parameters of the oscillator (Figure 6A). What should we take from the fact that an oscillator aligned in anti-phase with the stimulus (which would presumably show the phase code only at stimulus offsets) still shows a near equal correlation with true timing deviations? Furthermore, while the R2 shows that the predictions of the model co-vary with the true values, I'm curious to know how accurately they are predicted overall (in terms of mean squared error for example). Does the model account for deviations from rhythmicity of the right magnitude?

4. Lastly, it is unclear to what extent the oscillator is necessary to find this relative time shift. A model comparison between the predictions of the STiMCON and the RNN predictions on their own (à la Figure 3) would help to show how much the addition of the oscillation improves our predictions. Perhaps this is what is meant by the "non-transformed R2" but this is unclear.

5. Figure 7 shows a striking result demonstrating how the model can be used to explain an interesting finding that phase of an oscillation can bias perception towards da or ga. The initial papers consider this result to be explained by delays in onset between visual and auditory stimuli whereas this result explains it in terms of the statistical likelihood each syllable. It is a nice reframing which helps me to better understand the previous result.

6. The authors show that syllable lengths are determined in part by the predictability of the word it is a part of. While the authors have reasonably restricted themselves to a single hierarchical level, the point invites the question as to whether all hierarchical levels are governed by similar processes. Should syllables accelerate from beginning to end of a word? Or in more or less predictable phrases?

7. Figure 5 shows how an oscillator mechanism can force pseudo-rhythmic stimuli into a more rhythmic code. The authors note that this can be done either by slowing responses to early stimuli and quickening responses to later ones, or by dropping (nodes don't reach threshold) stimuli too far outside the range of the oscillation. The first is an interesting mechanism, the second is potentially detrimental to processing (although it could be used as a means for filtering out noise). The authors should make clear how much deviation is required to invoke the dropping out mechanism and how this threshold relates to the naturalistic case. This would give the reader a clearer view of the flexibility of this model.

8. I found Figure 5 very difficult to understand and had to read and read it multiple times to feel like I could get a handle on it. I struggled to get a handle on why supra time was shorter and shorter the later the stimulus was activated. It should reverse at some point as the phase goes back into lower excitability, right? The current wording is very unclear on this point. In addition, the low-high, high-low analysis is unclear because the nature of the stimuli is unclear. I think an added figure panel to show how these stimuli are generated and manipulated would go a long way here.

9. The prediction of behavioral data in Figure 7 is striking but the methods could be improved. Currently, the authors bin the output of the model to be 0, 0.5 or 1 which requires some maneuvering to effectively compare it with the sinewave model. They could instead use a continuous measure (either lag of activation between da and ga, or activation difference) as a feature in a logistic regression to predict the human subject behavior.

10. I'm not sure but I think there is a typo in line 383-384. The parameter for feedback should read Cl+1→l * Al+1,T. Note the + sign instead of the -. Or I have misunderstood something important.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Tracking of pseudo-rhythmic speech is modulated by linguistic predictions in an oscillating computational model" for further consideration by eLife. Your revised article has been evaluated by Andrew King (Senior Editor) and a Reviewing Editor.

The manuscript has been greatly improved, and only these issues need to be addressed, as outlined below:

Reviewer #2 (Recommendations for the authors):

I want to thank the authors for the great effort revising the manuscript. The manuscript has much improved. I only have some final small comments.

Detailed comments

l. 273-275: In my opinion: This is because the oscillator is set as a rigid oscillator in the model that is not affected by the word level layer activation; however, as the authors already discuss this topic, this is just a comment.

l. 344: "the processing itself" I'd specify: "the processing at the word layer".

l. 557/558: Rimmele et al., (2018) do discuss that besides the motor system, predictions from higher-level linguistic processing might affect auditory cortex neuronal oscillations through phase resetting. Top-down predictions affecting auditory cortex oscillations is one of the main claims of the paper. Thus, this paper seems not a good example for proposals that exclude when-to-what interactions. In my view the claims are rather consistent with the ones proposed here, although Rimmele et al., do not detail the mechanism and differ from the current proposal in that they suggest phase resetting. Could you clarify?

l 584 ff.: "This idea diverges from the idea that entrainment should per definition occur on the most excitable phase of the oscillation [3,15]." Maybe rephrase: "This idea diverges from the idea that entrainment should align the most excitable phase of the oscillation with the highest energy in the acoustics [3,15]."

l. 431: "The model consists of four nodes (N1, N2, Nda, and Nga) at which N1 activation predicts a second unspecific stimulus (S2) represented by N2 at a predictability of 1. N2 activation predicts either da or ga at 0.2 and 0.1 probability respectively."

This is still hard to understand for me. E.g. What is S2, is this either da or ga, wouldn't their probability have to add up to 1?

Wording

l. 175/176: sth is wrong with the sentence.

l. 544: "higher and syllabic"? (sounds like sth is wrong in the wording)

l. 546: "within more frequency" (more frequent or higher frequency?)

https://doi.org/10.7554/eLife.68066.sa1

Author response

Essential revisions:

All reviewers had a very positive assessment of the manuscript. They find the described work highly novel and interesting. However, they also find the manuscript quite dense and complex, and suggest some clarifications in the description of the model and in the methods.

Please find a list of the recommendations below.

Reviewer #1 (Recommendations for the authors):

1. First of all, I think that the concept of "internal language model" should be defined and described in more detail in the introduction. What does it mean when it is weak and when it is strong, for the producer and for the receiver?

We define internal language model as the individually acquired statistical and structural knowledge of language stored in the brain. A virtue of such an internal language model is that it can predict the most likely future input based on the currently presented speech information. If a language model creates strong predictions, we call it a strong model. In contrast, a weak model creates no or little predictions about future input (note that the strength of individual predictions depends not only on the capability of the system to create a prediction, but also on the available information). If a node represents a speech unit that is likely to be spoken next, a strong internal language model will sensitize this node and it will therefore be active earlier, that is, on a less excitable phase of the oscillation.

The above explanation has been included in the introduction.

2. It is still not fully clear to me what kind of information the internal oscillation is entraining to in the model. From Figure 1, it seems that the oscillation is driven by the acoustics only, but the phase of processing of linguistic units depends on their predictability.

The entrainment proper is indeed still to the acoustics. However, compared to other models in which oscillations are very strongly coupled to the acoustic envelope by aligning the most excitable phase as a consequence of the acoustic phase shifts (Doelling et al., 2019) or proactively dependent on temporal predictions (Rimmele et al., 2018), we propose that it is not necessary to change the phase of the ongoing oscillation in response to the acoustics to optimally process pseudo-rhythmic speech. As such, we view the model as weakly coupled relative to more strongly coupled oscillator models. The phase of the oscillation does not need to change with every phase shift in the acoustics. We propose that the model can entrain to the average speech rate; however, we acknowledge in the updated manuscript that we do not answer how this can be done. We have added a discussion on this point in the discussion which now reads: “We aimed to show that a stable oscillator can be sensitive to temporal pseudo-rhythmicities when these shifts match predictions from an internal linguistic model (causing higher sensitivity to these nodes). In this way we show that temporal dynamics in speech and the brain cannot be isolated from processing the content of speech. This is in contrast with other models that try to explain how the brain deals with pseudo-rhythmicity in speech (Giraud and Poeppel, 2012, Rimmele et al., 2018, Doelling et al., 2019), which typically do not take into account that the content of a word can influence the timing. Phase resetting models can only deal with pseudo-rhythmicity by shifting the phase of ongoing oscillations in response to a word that is off-set to the mean frequency of the input (Giraud and Poeppel, 2012, Doelling et al., 2019). We believe that this goes beyond the temporal and content information the brain can extract from the environment which has what/when interactions. However, our current model does not have an explanation of how the brain can actually entrain to an average speech rate. This is much better described in dynamical systems theories in which this is a consequence of the coupling strength between internal oscillations and speech acoustics (Doelling et al., 2019, Assaneo et al., 2021). However, these models do not take top-down predictive processing into account. Therefore, the best way forward is likely to extend coupling between brain oscillations and speech acoustics (Poeppel and Assaneo, 2020) with the coupling of brain oscillations to brain activity patterns of internal models (Cumin and Unsworth, 2007).”

3. If acoustic information arrives at the most excitable phase of the neural oscillation (as described in figure 1), and if predictability makes words arrive earlier, does it entail that more predictable words arrive at less excitable phases of the neural oscillation? What would be the computational advantage of this mechanism?

Indeed, more predictable words arrive at a less excitable phase of the oscillation. This is an automatic consequence of the statistics in the environment in which more predictable words are uttered earlier. The brain could utilize these statistical patterns (we can infer that earlier uttered words are more predictable). If we indeed do this, temporal information in the brain contains content information. But how would the brain code for this? We propose that this can be done by phase-of-firing, such that the phase at which neurons are active is relevant for the content of the item to be processed. Computationally it would be advantageous to be able to separate different representations in the brain by virtue of phase coding. As such, you don’t only have a spatial code of information (which neuron is active), but also a temporal code of information (when is a neuron active). This leads to a redundancy of coding which has a lot of computational advantages in a noisy world (e.g. Barlow, 2001).

The idea of phase coding has been proposed in the past (Jensen and Lisman, 2005; Kayser et al., 2009; Panzeri et al., 2001; Hopfield, 1995). Indeed, there is evidence that time or phase of firing contains content information (Siegel et al., 2012; O’Keefe and Recce, 1993). In some of these theoretical accounts, the first active node actually provides more information than the less-specific activation occurring later (Hopfield, 1995). One might say that a neuron that is already active at a low excitable phase contains more information about the relevance of the activation than many neurons that are going to be active at a very excitable phase. Indeed, neurons at rest are found to be active at excitable phases (Haegens et al., 2011). In a similar vein, it has also been suggested that α power/phase modulates not the sensitivity, but mostly the bias to detect something (Iemi et al., 2016), suggesting that highly excitable points do not improve sensitivity, but merely the tendency to report that something was perceived. In sum, the computational advantage of this model is that information about time relates to information about content by virtue of a phase code. To make this clear we have added a section in the discussion. It now reads: “A natural feature of our model is that time can act as a cue for content implemented as a phase code (Jensen et al., 2012, Lisman and Jensen, 2013). This code unravels as an ordered list of predictability strength of the internal model. This idea diverges from the idea that entrainment should per definition occur on the most excitable phase of the oscillation (Giraud and Poeppel, 2012, Rimmele et al., 2018). Instead, this type of phase coding could increase the brain representational space to separate information content (Panzeri, Petersen et al. 2001, Lisman and Jensen, 2013). We predict that if speech nodes have a different base activity, ambiguous stimulus interpretation should be dependent on the time/phase of presentation (see (Ten Oever and Sack, 2015, Ten Oever et al., 2020))”.

4. What is "stimulus intensity" in figure 5? Does it reflect volume or SNR?

We have added this information in the figure legend. It reflects the overall amplitude of the input in the model. See Figure 5A. We now consistently refer to intensity when referring to the amplitude of the input.

5. Similarly what is "amplitude" in Figure 6?

Amplitude refers to the amplitude of the sinus in the model. This information is now in the figure legend.

6. L. 376-L. 439 "N2 activation predicts either da or ga at 0.2 and 0.1 probability respectively." Please explain why the probabilities are not equal in the model, same for l. 445 "intensity of /da/ would be at max 0.3. and of /ga/ 0.7".

We regret that this was unclear. Phase coding of information only occurs when the internal model of STiMCON has different probabilities for predicting the content of the next word. Otherwise, the nodes will be active at the same time. The assumption that /da/ and /ga/ have different probabilities is reasonable, as the /d/ and /g/ consonants have a different overall proportion in the Dutch language (with /d/ being more frequent than /g/). As such, we would expect the overall /d/ representation in the brain to be active at lower thresholds than the /g/ representation. We have clarified this now in the manuscript: “N2 activation predicts either da or ga at 0.2 and 0.1 probability respectively. This uneven prediction of /da/ and /ga/ is justified as /da/ is more prevalent in the Dutch language than /ga/ (Zuidema, 2010) and it thus has a higher predicted level of occurring.”

l. 445 refers to the input we gave to the model. We gave the model the input for a fully unambiguous /da/, a fully unambiguous /ga/, and morphs in between. This was to demonstrate the behavior of the model and to show that only for ambiguous stimulation would we find a phase code of information.

7. Table 3: I feel that the first prediction is not actually a prediction, but a result of the article, as the first data shows that "The more predictable a word, the earlier this word is uttered."

We agree and have removed it from the table.

8. Table 3 and Figure 8A: I think that the second prediction that "When there is a flat constraint distribution over an utterance (e.g., when probabilities are uniform over the utterance) the acoustics of speech should naturally be more rhythmic (Figure 8A)." could be tested with the current data. In the speech corpus, are sentences with lower linguistic constraints more rhythmic?

We thank the reviewer for the interesting suggestion. We indeed have the data to investigate whether acoustics are more rhythmic when they are less predictable across the sentence. To investigate this question, we extracted the RNN prediction for 10 subsequent words. Then we extracted the variance of the prediction across those 10 words as well as the word onsets themselves. We created a time course in which word onsets were set to 1 (at a sampling rate of 100 Hz). Then we performed an FFT and extracted z-transformed power values over a 0-15 Hz interval. The power at the maximum power value within the theta range (3-8 Hz) was extracted. These max z-scores were correlated with the log transform of the variance (to normalize the skewed variance distribution; Figure 3E). We found a weak, but significant negative correlation (r = -0.062, p < 0.001; Figure 3D), in line with our hypothesis. This suggests that the more variable the predictions within a sentence, the lower the peak power value is. When we repeated the analysis on the envelope, we did not find a significant effect. We have added this analysis to the main manuscript and added two panels to Figure 2.
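In rough outline, the analysis pipeline looks as follows (a sketch with hypothetical variable names; implementation details such as padding and the exact z-scoring window may differ from what we actually ran):

```python
import numpy as np
from scipy.stats import zscore, spearmanr

FS = 100  # sampling rate (Hz) of the word-onset pulse train

def theta_peak_power(onset_times_s):
    """Z-scored spectral peak in the 3-8 Hz band for a train of word onsets."""
    n = int(np.ceil(onset_times_s.max() * FS)) + 1
    pulses = np.zeros(n)
    pulses[np.round(onset_times_s * FS).astype(int)] = 1.0   # onsets set to 1
    freqs = np.fft.rfftfreq(n, d=1 / FS)
    power = np.abs(np.fft.rfft(pulses)) ** 2
    keep = freqs <= 15                      # z-score within the 0-15 Hz interval
    z = zscore(power[keep])
    theta = (freqs[keep] >= 3) & (freqs[keep] <= 8)
    return z[theta].max()

# For every 10-word stretch (hypothetical arrays `onsets` and `rnn_predictions`):
# theta_peaks.append(theta_peak_power(onsets))
# prediction_variance.append(np.var(rnn_predictions))
# rho, p = spearmanr(np.log(prediction_variance), theta_peaks)
```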

We do believe that a full answer to this question requires more experimental work and therefore keep it as a prediction in the table.

9. Table 3: "If speech timing matches the internal language model, brain responses should be more rhythmic even if the acoustics are not (Figure 8A)." What do the authors mean by "more rhythmic"? Does it mean the brain follows more accurately the acoustics? Does it generate stronger internal rhythms that are distinct from the acoustic temporal structure?

This would refer to the brain having a stronger isochronous response for non-isochronous than isochronous acoustics. This is based on the results in figure 6 (previous figure 5). In figure 6 we show that when the internal model predicts the next word at alternatingly high or low predictabilities, the model’s response is not most isochronous (strongest 4 Hz response) when the acoustics are isochronous, but rather when the acoustics are shifted in line with the internal model (more predictable words occurring earlier). We would expect the same in the brain’s responses. We think the wording rhythmic is not correct in this context and should rather refer to isochronous. We have updated the text to now refer to isochrony.

10. Figure 5 C: "Strength of 4 Hz power": what is the frequency bandwidth?

This is the peak activity. We clarified this now in the text. You can see from (current) Figure 6A+D that the response overall is also very peaky (by nature of the stimulation and the isochrony of the oscillation that we enter in the model).

11. Figure 5 D: " Slice of D", Slice of C?

This is the peak activity. We clarified this now in the text. You can see from (current) Figure 6A+D that the response overall is also very peaky (by nature of the stimulation and the isochrony of the oscillation that we enter in the model).

12. L 442: "propotions » -> proportions.

Adjusted accordingly.

13. Abstract: "Our results reveal that speech tracking does not only rely on the input acoustics but instead entails an interaction between oscillations and constraints flowing from internal language model " I think this claim is too strong, considering that the article does not present direct electrophysiological evidence.

We agree and regret this strong claim. We have updated the text and it now reads: “Our results suggest that speech tracking does not have to rely only on the acoustics but could also entail an interaction between oscillations and constraints flowing from internal language models.”

Reviewer #2 (Recommendations for the authors):

1. In the model the predictability is computed at the word-level, while the oscillator operates at the syllable level. The authors show different duration effects for syllables within words, likely related to predictability. Is there any consequence of this mismatch of scales?

The current model does indeed operate on the word level, while oscillatory models operate on the syllabic level. We do not claim by this that predictions per se only work on the word level. On the contrary, we believe that ultimately syllabic-level predictions as well as higher-level linguistic predictions influence speech processing. Therefore, our model is incomplete, but it serves the purpose of demonstrating how internal language models can influence speech timing as well as perceptual tracking.

Our choice of the word level was mostly practical. We chose in the current manuscript to start with word-level prediction as this is the starting point commonly available and applied for RNNs. RNNs often work on the word level and not on the syllabic level. For example, this allowed us to use highly trained word-level embeddings as a starting point for our LSTM. We are not aware of pre-trained syllabic embeddings that could achieve the same thing. As mentioned above, the temporal shift in STiMCON would also be predicted based on syllabic prediction. Therefore, the only results that are really affected by this choice are the results of figure 6 (or current figure 7). Predicting the temporal shift would likely have benefitted from also adding predictions of temporal shifts based on syllabic and higher-order linguistic predictions.

We now add a note about the level of processing that might affect the results of current figure 7 as well as a paragraph in the discussion that mentions that the model should have multiple levels operating at different levels of the linguistic hierarchy.

The result section now reads: “Note that the current prediction only operated on the word node (to which we have the RNN predictions), while full temporal shifts are probably better explained by word, syllabic and phrasal predictions.”

The discussion now reads: “The current model is not exhaustive and does not provide a complete explanation of all the details of speech processing in the brain. For example, it is likely that the primary auditory cortex is still mostly modulated by the acoustic pseudo-rhythmic input and only later brain areas follow more closely the constraints posed by the language model of the brain. Moreover, we now focus on the word level, while many tracking studies have shown the importance of syllabic temporal structure (Luo and Poeppel, 2007, Ghitza, 2012, Giraud and Poeppel, 2012) as well as the role of higher order linguistic temporal dynamics (Meyer, Sun et al. 2019, Kaufeld et al., 2020).”

2. Furthermore, could the authors clarify whether or not and how they think the model mechanism is different from top-down phase reset (e.g. l. 41). It seems that the excitability cycle at the intermediate word-level is shifted from being aligned to the 4 Hz oscillator though the linguistic feedback from layer l+1. Would that indicate a phase resetting at the word-level layer through the feedback?

The model is different as it does not assume that the top-down influence needs to reflect a phase reset. A phase reset would indicate a shift in the ongoing phase of the oscillator (Winfree, 2001). However, the feedback in our model does not shift the phase of the oscillator. The phase of the oscillator remains stable; only the phase at which a node in the model is active changes. The feedback is implemented as a constant input that decays over time; it does not actively interfere with the phase of the oscillator at the word level. We propose here that if the internal models of the producer and receiver perfectly align, the oscillator can remain entrained to the average speaker rate without shifting phase at every word, and can instead take the linguistic predictions into account.

We don’t claim phase resetting does not occur. We have acknowledged that our model does not account for this (yet) in the discussion. However, in our model phase resetting would be required when the predictions of perceiver and receiver don’t match through which the expected timing don’t match. To clarify the difference with phase reset models we have added a section in the discussion. It now reads: “We aimed to show that a stable oscillator can be sensitive to temporal pseudo-rhythmicities when these shifts match predictions from an internal linguistic model (causing higher sensitivity to these nodes). In this way we show that temporal dynamics in speech and the brain cannot be isolated from processing the content of speech. This is contrast with other models that try to explain how the brain deals with pseudo-rhythmicity in speech (Giraud and Poeppel 2012, Rimmele et al., 2018, Doelling et al., 2019), which typically do not take into account that the content of a word can influence the timing. Phase resetting models can only deal with pseudo-rhythmicity by shifting the phase of ongoing oscillations in response to a word that is off-set to the mean frequency of the input (Giraud and Poeppel, 2012, Doelling et al., 2019). We believe that this goes beyond the temporal and content information the brain can extract from the environment which has what/when interactions. However, our current model does not have an explanation of how the brain can actually entrain to an average speech rate. This is much better described in dynamical systems theories in which this is a consequence of the coupling strength between internal oscillations and speech acoustics (Doelling et al., 2019, Assaneo et al., 2021). However, these models do not take top-down predictive processing into account. Therefore, the best way forward is likely to extend coupling between brain oscillations and speech acoustics (Poeppel and Assaneo, 2020) with the coupling of brain oscillations to brain activity patterns of internal models (Cumin and Unsworth, 2007).”

3. The model shows how linguistic predictability can affect neuronal excitability in an oscillatory model, allowing to improve the processing of non-isochronous speech. I do not fully understand the claim that the linguistic predictability makes the processing (at the word-level) more isochronous, and why such isochronicity is crucial.

The main point of figure 5 (current figure 6) is to show that acoustic time and brain time do not necessarily have the same relation. We show that isochronously presented acoustic input does not need to lead to the most isochronous brain responses. Why does the model behave this way? Effectively, the linguistic feedback changes the time at which different word nodes are active and thereby decouples activation time from the acoustic input. This creates an interesting dynamic that causes the model’s response to show a stronger 4 Hz component for non-isochronous acoustic input. This dynamic was not necessarily predicted by us in advance, but is what came out of the model. The exact dynamics can be seen in Figure 6—figure supplement 2. It seems that due to the interactions between acoustic input and feedback, higher peak activation at a 4 Hz rhythm is reached at an acoustic offset that matches the linguistic predictions (earlier presentation when words are more predictable).

We don’t necessarily believe isochronicity is crucial for the brain to operate and it can deviate from isochronicity. But isochronicity provides two clear benefits: (1) strong temporal predictions and (2) higher processing efficiency (Schroeder and Lakatos, 2009). If it is possible for the brain to maintain a stable oscillator, this would increase the efficiency relative to changing its phase after the presentation of every word. If the brain needs to shift its phase after every word, we even wonder what the computational benefit of an oscillator is. If simply the acoustics are followed, why would we need an oscillator at all?

We claim that the brain shows a more isochronous response when the linguistic predictions match the temporal shifts (earlier onsets for more predictable words). If this is not the case the isochrony is also lower. As said above, this provides the benefit that there is higher processing efficiency. But in reality, the predictions of a receiver will not always match the exact timing of the production of the producer. Therefore, some phase shifts are likely still needed (not modelled here).

4. The authors showed that word frequency affects the duration of a word. Now the RNN model relates the predictability of a word (output) to the duration of the previous word W-1 (l. 187). Didn't one expect from Figure 1B that the duration of the actually predicted word is affected? How are these two effects related?

We expect that when words are more easily accessible, they are uttered earlier. This is why we used the RNN predictions to predict the onset of the next word (see prediction in Figure 1B). However, based on previous literature it could indeed also be expected that word duration itself is affected by predictability (Lehiste, 1972, Pluymaekers et al., 2005). However, most of these effects have been shown for the mean duration of words, not for whether the duration of a word within a sentence is modulated by the sentence context beyond what can be explained by word frequency. Indeed, we replicate the findings for individual words showing that word frequency (likely affecting the accessibility of a word) affects the mean duration of that word. To test whether this extends to predictability within the context of a sentence, we reran our linear regression model, but used word duration as our dependent variable. As other control variables we included the bigram, the frequency of the current word, the mean duration of the current word and the syllable rate. Note that effects will only be significant if they are stronger than can be expected based on the mean duration. Results can be found in Table 1.
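As a sketch, the respecified regression has roughly the following form (the synthetic data frame and its column names are hypothetical stand-ins for the actual corpus variables, not our analysis code):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in data: in the real analysis each row is a word token
# from the corpus with its measured duration and control variables.
rng = np.random.default_rng(0)
words_df = pd.DataFrame({
    "word_duration": rng.gamma(2.0, 0.15, 500),      # seconds (synthetic)
    "rnn_prediction": rng.uniform(0, 1, 500),
    "bigram": rng.uniform(0, 1, 500),
    "word_frequency": rng.lognormal(3, 1, 500),
    "mean_word_duration": rng.gamma(2.0, 0.15, 500),
    "syllable_rate": rng.normal(4, 0.5, 500),
})

# Control regression with word duration as the dependent variable
fit = smf.ols(
    "word_duration ~ rnn_prediction + bigram + word_frequency"
    " + mean_word_duration + syllable_rate",
    data=words_df,
).fit()
print(fit.summary())
```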

We found no effect of the RNN prediction on the overall duration of a word (above what can be explained by the other control factors). Other factors did show effects, such as word frequency and the bigram (p < 0.001). Interestingly and unexpectedly, these resulted in longer durations, not shorter ones (positive t-values). At this moment we do not have an explanation for this effect, but it could be that this lengthening is a consequence of the earlier onset of the individual word, by which a speaker tries to keep to the average rate. Alternatively, it is possible that this positive relation is a consequence of the predictability of the following word, such that words get shorter if the word after them is more predictable (as we showed in the manuscript). However, in this case we would also expect the RNN prediction to be significant. We now report these effects in the supplementary materials, but do not have a direct explanation for the direction of this effect.

5. Title: is "constrained" the right word here, rather "modulated"? As we can process non-predictable speech.

This is fair and we changed it to modulated. Indeed, we can perfectly well process non-predictable speech as well.

6. See l. 129: "In this way, oscillations do not have to shift their phase after every speech unit and can remain at a relatively stable frequency as long as the internal model of the speaker matches the internal model of the perceiver." It seems to me that in the model the authors introduce, the phase-shifting still occurs. Even though the oscillator component is fixed, the activation threshold fluctuations at the word-level are "shifted" due to the feedback. So there is no feedforward phase-reset, however, a phase-reset due to feedback?

This is fair and we changed it to modulated. Indeed, we can perfectly well process non-predictable speech as well.

7. l. 219: why was bigrams added as control variable?

We regret that this was unclear. We were interested in whether a bigger context (as captured by the RNN) provides more information than a bigram, which only captures the statistics involving the immediately preceding word.

8. l. 233 in l. 142 it says that only 2848 words were present in CELEX. Where the 4837 sentences consisting of the 2848 words?

We did two rounds of checking our dictionary of words. In the first round we investigated whether the words were present in the word2vec embeddings (otherwise we couldn’t use them for the RNN). If not, they were marked with an <unknown> label. The RNN was run with all these words. This refers to the 4837 sentences. In total there were 3096 unique words.

For the regression analyses, we further wanted to extract parameters of the individual words. Thus, we investigated whether the words were present in CELEX. This was the case for 2848 of the 3096 words. For the regression analyses we only included words for which we could estimate the relevant parameters, that is, when the W-1 word was in CELEX (for the analysis in point 4, this concerned the current word). For the regression we therefore did not include all the sentences that went into the RNN.

9. Figure 2 D,E the labeling with ρ and p is confusing, I'd at least state consistently both, so one sees the difference.

We now also report on the p-value in the figure.

10. Table 1 legend: could you add why the specific transformations were performed?

The transformations were performed to ensure that our factors were close to normal before entering them into the regression analyses. This information is now in the main manuscript. We would like to note that our analysis is robust against changes in the transformations: if we do not perform any transformation, the same results hold.

11. l. 204: the β coefficient is rather small compared to the duration of W-1 effect. The dependent variable onset-to-onset should be strongly correlated with the W-1 duration. I wonder if this is a problem?

Indeed, word duration has the strongest relation to the onset-to-onset difference (as is of course intuitive, but also evident from the β coefficient). To capture this variance, we added this variable to the regression analyses. When performing a regression analysis it is useful to include factors that explain variance in the model: this ensures that the factor of interest (here the RNN prediction) only captures variance that cannot be attributed to variance already explained by the other factors. Therefore, we do not see this as a problem, but as an intended and expected effect.

12. l. 249: what is meant with "after the first epoch"?

We have clarified this. RNNs are normally trained in several steps (or epochs). After every epoch the weights in the model are adjusted to reduce the overall error of the fit. But this term might be specific to Keras rather than to machine learning in general. We now state: “entering the RNN predictions after the first training cycle (of a total of 100)”.
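
To illustrate what “after the first training cycle” means in practice, here is a minimal Keras-style sketch (the architecture, sizes, and random data below are placeholders, not the actual training pipeline):

import numpy as np
from tensorflow import keras

vocab_size, emb_dim, seq_len = 3096, 300, 9           # placeholder sizes
x_train = np.random.rand(64, seq_len, emb_dim)        # placeholder embedded sentences
y_train = keras.utils.to_categorical(
    np.random.randint(vocab_size, size=64), vocab_size)

model = keras.Sequential([
    keras.layers.LSTM(128, input_shape=(seq_len, emb_dim)),
    keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

model.fit(x_train, y_train, epochs=1, verbose=0)      # first training cycle (epoch)
first_epoch_pred = model.predict(x_train)             # predictions used as the early baseline

model.fit(x_train, y_train, epochs=99, verbose=0)     # continue to 100 cycles in total
final_pred = model.predict(x_train)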

13. l. 254: how local were these lengthening effects? Did the predictability based on the trained RNN strongly vary across words or rather vary on a larger scale i.e. full sentences being less predictable than others?

To answer this question, we investigated for individual words and sentence positions whether the RNN prediction was generally higher. Indeed, for some words the average prediction was higher, both for the word that was predicted and for the last word before the prediction. With respect to sentence position, words very early in the sentence had a lower overall prediction than words later in the sentence. We have added a supporting figure to the manuscript.

14. l. 268: Could you explain where the constants are coming from: like the 20 and 100 ms windows for inhibition and the values -0.2 and -3. The function inhibit(ta) is not clear to me. What is the output when Ta is 0 versus 1?

The inhibition function reflects the excitation and inhibition of individual nodes in the model and is always relative to the time of activation of that node (Ta). When Ta is less than 20 ms, the node is excited (-3 times the inhibition factor); when Ta is between 20 and 100 ms there is strong inhibition; and after that, there is a base inhibition on the node. So for Ta = 0 versus Ta = 1 the output is in both cases -3*BaseInhibition. The point of this activation function is to have a nonlinear activation, loosely resembling the nonlinear activation pattern in the brain.

The values reflect rather early excitation (20 ms) and longer-lasting inhibition (100 ms). We acknowledge that these numbers are only loosely related to neurophysiological time scales and are of course highly dependent on the region of interest and its local and distant connections. However, the exact timing is not critical to the main outcomes of the model (phase coding and increased temporal sensitivity for more predictable input) as long as there is excitation followed by inhibition. We have added the rationale behind our parameter choice to the manuscript.
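
To make the piecewise structure explicit, a minimal sketch of such an inhibition function is given below; the base value of -0.2 and the -3 factor follow the description above, while the strong-inhibition level between 20 and 100 ms is a placeholder:

# Sketch of the piecewise inhibition function (times in ms); STRONG_GAIN is a
# placeholder illustrating the structure, not the exact value in the model.
BASE_INHIBITION = -0.2   # ongoing base inhibition on every node
STRONG_GAIN = 10         # placeholder strength of the post-activation inhibition

def inhibit(t_a):
    """Inhibition relative to the node's activation time t_a (in ms)."""
    if t_a < 20:
        return -3 * BASE_INHIBITION           # sign flips: net excitation
    elif t_a < 100:
        return STRONG_GAIN * BASE_INHIBITION  # strong, longer-lasting inhibition
    else:
        return BASE_INHIBITION                # back to base inhibition

# inhibit(0) == inhibit(1) == -3 * BASE_INHIBITION == 0.6 (both excitatory)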

15. Figure 4: the legend is very short, adding some description what the figure illustrates would make it easier to follow. The small differences in early/late activation are hard to see, particularly for the 4th row. Maybe it would help to add lines?

We have clarified the description and have added a zoom-in on the relevant early/late activations for the critical word including relevant lines. We hope that this has improved the readability of this figure.

16. Figure 5 B: could you clarify the effect at late stim times relative to isochronous, i.e. why the supra time relative to isochronous decreases for highly predictable stimuli. I assume this is due to the inhibition function?

This is indeed related to the inhibition function. As soon as enough activation reaches a node, the node will reach suprathreshold activation due to the feedback (when Ta < 20 ms). While the node is at suprathreshold activation, even very weak sensory input will push the node to suprathreshold activation attributable to the sensory input. See Author response image 1.

Author response image 1

Here the ‘nice’ node is already activated due to the feedback (the suprathreshold activation panel is orange); then the sensory input arrives, and the node immediately reaches an activation of 2 (red color: activation due to sensory input). This never happens for stimuli that have no feedback (the ‘I’ node) and happens later for nodes that receive weaker feedback (later for ‘very’ compared to ‘cake’). Moreover, after the feedback-driven suprathreshold period ends, the inhibition sets in, reducing the activation, and the nodes never (at least at the intensities used for the simulation) reach threshold for later stimulus times during inhibition.

17. How is the connectivity between layers defined? Is it symmetric for feedforward and feedback?

In the current model the feedforward connections only link nodes representing the same items (i.e. an ‘I’ L-1 (stimulus level) node connects to an ‘I’ L node, which connects to an ‘I’ L+1 node). Only the feedback nodes (L+1) are fully connected to the active level (L), with connection strengths defined by the internal model (Table 2).
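
Schematically, this connectivity can be written as follows (an illustrative sketch; the node labels are example words from the simulation and the feedback weights are placeholders standing in for the internal model in Table 2):

import numpy as np

nodes = ["I", "very", "nice", "cake"]   # example word nodes from the simulation
n = len(nodes)

# Feedforward (L-1 -> L and L -> L+1): identity connectivity, i.e. each node
# only connects to the node representing the same item at the next level.
W_ff = np.eye(n)

# Feedback (L+1 -> L): fully connected, with strengths given by the internal
# language model (Table 2); the numbers below are placeholders.
W_fb = np.array([
    #  I    very  nice  cake    <- L node receiving feedback
    [0.0,  0.3,  0.1,  0.1],  # predictions made after "I"
    [0.0,  0.0,  0.7,  0.2],  # after "very"
    [0.0,  0.0,  0.0,  0.9],  # after "nice"
    [0.0,  0.0,  0.0,  0.0],  # after "cake"
])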

18. l. 294/l. 205: "with a delay of 0.9*ω seconds, which then decays at 0.01 unit per millisecond and influences the l-level at a proportion of 1.5." where are the constants coming from?

These are informed choices. The delay of 0.9*ω was chosen because we hypothesized that onset time would be loosely predicted around one oscillatory cycle; to be prepared for input arriving slightly earlier (which of course happens for predictable stimuli), we set it to 0.9 times the length of the cycle. The decay is needed and was set such that the feedback continues for about a full theta cycle. The proportion was set empirically to ensure that strong feedback could cause suprathreshold activation at the active node. We added this explanation to the manuscript.
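
A sketch of the resulting feedback time course, under the assumption of a linear decay (the 0.9*ω delay, 0.01/ms decay, and 1.5 gain are the constants quoted above; the 4 Hz frequency and the activation strength are placeholders):

# Sketch of the feedback kernel (times in ms); assumes a linear decay.
def feedback_to_L(t_since_activation_ms, prediction_strength, freq_hz=4.0):
    cycle_ms = 1000.0 / freq_hz
    delay_ms = 0.9 * cycle_ms   # feedback arrives just before one full cycle
    if t_since_activation_ms < delay_ms:
        return 0.0
    # After the delay the feedback decays at 0.01 units per millisecond...
    decayed = max(prediction_strength - 0.01 * (t_since_activation_ms - delay_ms), 0.0)
    # ...and drives the L-level nodes at a proportion of 1.5.
    return 1.5 * decayed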

19. l. 347: "the processing itself can actually be closer to isochronous than what can be solely extracted from the stimulus". This refers to Figure 5 D I assume. Did you directly compare the acoustics and the model output with respect to isochrony?

We can compare the relative distribution (i.e. at which delay the peak is strongest), but not the absolute values, as the stimulus intensity and the activation are not on the same unit scale. To facilitate this comparison, we now also show the power of the stimulus input distribution across stimulus intensities and delays (Figure 6B). It is evident that the stimulus has a symmetrical 4 Hz power spectrum which is strongest at a delay of 0 (isochrony). This is not surprising, as we defined it this way a priori.

20. l. 437-438: I am not fully understanding these choices: why is N1 represented by N2? Why is the probability of da and ga uneven, and why are there nodes for da and ga (Nda, Nga) plus a node N2 which predicts both with different probability?

N1 is not represented by N2; we regret the confusion. We meant to state that N1 predicts N2 (and S2 is represented by N2). We have rephrased this sentence to clarify it.

The probabilities of /da/ and /ga/ are uneven because the model only shows phase coding of information when the internal model of STiMCON assigns different probabilities to the possible next words; otherwise the nodes would become active at the same time. The assumption that /da/ and /ga/ have different probabilities is reasonable, as the /d/ and /g/ consonants have different overall proportions in the Dutch language (with /d/ being more frequent than /g/). As such, we would expect the overall /d/ representation in the brain to become active at a lower threshold than the /g/ representation. We have clarified this in the manuscript. It now reads: “N2 activation predicts either da or ga at 0.2 and 0.1 probability respectively. This uneven prediction of /da/ and /ga/ is justified as /da/ is more prevalent in the Dutch language than /ga/ [64] and it thus has a higher predicted level of occurring.”
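
For clarity, the prediction structure of this simulation can be written down compactly (a sketch; the probabilities are the ones quoted above):

# Internal model of the /da/-/ga/ simulation: predicting node -> {predicted node: probability}.
internal_model = {
    "N1":  {"N2": 1.0},                # N1 predicts S2 (represented by N2) with certainty
    "N2":  {"Nda": 0.2, "Nga": 0.1},   # N2 predicts /da/ more strongly than /ga/
    "Nda": {},
    "Nga": {},
}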

21. Figure 5: why is the power of the high-high predictable condition the lowest. Is this an artifact of the oscillator in the model being fixed at 4 Hz or related to the inhibition function? High-high should like low-low result in rather regular, but faster acoustics?

We thank the reviewer and regret an unfortunate mistake: the labels for the low-low and high-high conditions were reversed. So, in fact, the high-high condition has the stronger activation. We have updated the labels and text accordingly.

Regarding the reviewer’s interpretation: we do indeed predict that in natural situations the acoustics would be slightly faster for high-high. However, in the simulations in Figure 5 (current Figure 6) the speed of the acoustics is never modulated, but fixed by us in the simulation. Therefore, only the response of the model to varying internal models is estimated.

22. l. 600: "The perceived rhythmicity" In my view speech has been suggested to be quasi-rhythmic, as (1) some consistency in syllable duration has been observed within/across languages, and (2) as (quasi-)rhythmicity seemed a requirement to explain how segmentation of speech based on oscillations could work in the absence of simple segmentation cues (i.e. pauses between syllables). While one can ask when something is "rhythmic enough" to be called rhythmic, I don't understand why this is related to "perceived rhythmicity".

We regret our terminology; we mean perceived isochronicity. Indeed, rhythmicity can occur without isochrony (Obleser, Henry and Lakatos, 2017), and we do not intend to claim that natural speech timing has no rhythm, or that deviations from isochrony do not reflect rhythm. We merely meant that when the brain response is more isochronous (relative to the acoustic stimulation), the percept is likely also more isochronous.

23. l. 604: interesting thought!

We hope to pursue these ideas in the future.

Reviewer #3 (Recommendations for the authors):

1. An important question is how the authors relate these findings to the Giraud and Poeppel, 2012 proposal which really focuses on the syllable. Would you alter the hypothesis to focus on the word level? Or remain at the syllable level and speed up and slow down the oscillator depending on the predictability of each word? It would be interesting to hear the authors' thoughts on how to manage the juxtaposition of syllable and word processing in this framework.

The current model does indeed operate on the word level, while oscillatory models typically operate on the syllabic level. We do not claim that predictions per se only work at the word level. On the contrary, we believe that ultimately syllabic-level predictions as well as higher-level linguistic predictions influence speech processing. Therefore, our model is incomplete, but it serves the purpose of demonstrating how internal language models can influence speech timing as well as perceptual tracking.

Our choice of the word level was mostly practical. We chose to start with word-level prediction in the current manuscript because this is the starting point commonly available and applied for RNNs: RNNs often work at the word level and not at the syllabic level. For example, this allowed us to use highly trained word-level embeddings as a starting point for our LSTM; we are not aware of pre-trained syllabic embeddings that could achieve the same thing. As mentioned above, the temporal shift in STiMCON would also be predicted based on syllabic prediction. Therefore, the only results really affected by this choice are the results of Figure 6 (current Figure 7). Predicting the temporal shift would likely have benefitted from also including predictions of temporal shifts based on syllabic and higher-order linguistic predictions.

We would predict that linguistic inference involves integrating knowledge at different hierarchical levels, including predictions about which syllable comes next based on syllabic regularities, word-to-word regularities, and syntactic context. Extending the model to include predictions at these different levels is definitely on our to-do list. To make this clear to the reader, we now add a note about the level of processing that might affect the results of Figure 7, as well as a paragraph in the discussion stating that the model should have multiple levels operating at different levels of the linguistic hierarchy.

The result section now reads: “Note that the current prediction only operated on the word node (to which we have the RNN predictions), while full temporal shifts are probably better explained by word, syllabic and phrasal predictions.”

The discussion now reads: “The current model is not exhaustive and does not provide a complete explanation of all the details of speech processing in the brain. For example, it is likely that the primary auditory cortex is still mostly modulated by the acoustic pseudo-rhythmic input and only later brain areas follow more closely the constraints posed by the language model of the brain. Moreover, we now focus on the word level, while many tracking studies have shown the importance of syllabic temporal structure (Luo and Poeppel, 2007, Ghitza, 2012, Giraud and Poeppel, 2012) as well as the role of higher order linguistic temporal dynamics (Meyer et al., 2019, Kaufeld et al., 2020).”

2. The authors describe the STiMCON model as having an oscillator with frequency set to the average stimulus rate of the sentence. But how an oscillator can achieve this on its own (without the hand of its overlords) is unclear, particularly given a pseudo-rhythmic input. The authors freely accept this limitation. However, it is worth noting that the ability for an oscillator mechanism to do this under a pseudo-rhythmic context is more complicated than it might seem, particularly once we include that the stimulus rate might change from the beginning to the end of a sentence and across an entire discourse.

This is a clear limitation, but to be fair it is also a limitation of any entrainment proposal, and one that is not often addressed (Giraud and Poeppel, 2012; Ghitza, 2012). The reviewer himself has done great work investigating oscillatory entrainment based on ideas from dynamical systems and weakly coupled oscillators (Doelling et al., 2019). These simulations show that oscillations can shift their frequency when the coupling to the acoustics is strong enough. If the coupling is very strong, this could even happen from word to word (as the intrinsic oscillator in that case follows the phase of the acoustics almost one-to-one). However, our contribution here is that none of these models take top-down influences of word predictability on these dynamics into account. Moreover, it is difficult to see how these models can explain how temporal information can inform the content of a word (Ten Oever et al., 2013; Ten Oever and Sack, 2015), specifically, how syllable identity depends on oscillatory phase (Ten Oever and Sack, 2015). We added a section to the discussion clarifying this limitation and our contribution relating to this type of research.

It now reads: “We aimed to show that a stable oscillator can be sensitive to temporal pseudo-rhythmicities when these shifts match predictions from an internal linguistic model (causing higher sensitivity to these nodes). In this way we show that temporal dynamics in speech and the brain cannot be isolated from processing the content of speech. This is in contrast with other models that try to explain how the brain deals with pseudo-rhythmicity in speech (Giraud and Poeppel, 2012, Rimmele et al., 2018, Doelling et al., 2019), which typically do not take into account that the content of a word can influence the timing. Phase resetting models can only deal with pseudo-rhythmicity by shifting the phase of ongoing oscillations in response to a word that is off-set to the mean frequency of the input (Giraud and Poeppel, 2012, Doelling et al., 2019). We believe that this goes beyond the temporal and content information the brain can extract from the environment, which has what/when interactions. However, our current model does not have an explanation of how the brain can actually entrain to an average speech rate. This is much better captured by dynamical systems theories in which this is a consequence of the coupling strength between internal oscillations and speech acoustics (Doelling et al., 2019, Assaneo et al., 2021). However, these models do not take top-down predictive processing into account. Therefore, the best way forward is likely to extend coupling between brain oscillations and speech acoustics (Poeppel and Assaneo, 2020) with the coupling of brain oscillations to brain activity patterns of internal models (Cumin and Unsworth, 2007).”

3. The analysis of the naturalistic dataset shows a nice correlation between the estimated time shifts predicted by the model and the true naturalistic deviations. However, I find it surprising that there is so little deviation across the parameters of the oscillator (Figure 6A). What should we take from the fact that an oscillator aligned in anti-phase with the stimulus (which would presumably show the phase code only at stimulus offsets) still shows a near equal correlation with true timing deviations. Furthermore, while the R2 shows that the predictions of the model co-vary with the true values, I'm curious to know how accurately they are predicted overall (in terms of mean squared error for example). Does the model account for deviations from rhythmicity of the right magnitude?

We agree that the differences in R2 across the parameters of the oscillation might seem slightly underwhelming. This is likely due to the nature of our fitting. We fit the same predictor (the RNN prediction), but apply different transformations to it (all of which preserve the order of the prediction parameter). Therefore, the difference between the models can be viewed as a difference in the transformation applied to the data (comparable to the difference between, for example, a log transform and an arcsine transformation). In general, OLS is robust to slight variations in such transformations and will therefore still fit with a similar explained variance, with only slight variations. We believe that these small differences are meaningful, but they should be viewed as a relative comparison rather than an absolute evaluation of the different oscillatory parameters. Indeed, this is also why we set all other parameters of the model to zero (equation 4) and do not actually simulate stimulus processing, but merely the relative expected shift. We agree with the reviewer that it is unlikely that processing in an anti-phase manner would lead to good performance; the other figures demonstrate that this is not the case (e.g. processing at anti-phase is not even possible at low stimulus intensities in Figure 5). For this specific comparison we chose the OLS with the transformation shift, and not a brute-force fit as in Figure 8, because (1) we can enter control variables to account for variance we need to control for in the natural dataset, and (2) on this large dataset this results in much more efficient code.

To further answer the reviewer’s question, we extracted the mean squared error of the model (Author response image 2); see Figure 7 for the original R2 for comparison. It is evident that the MSE is directly related to the R2 of the model, which is not surprising: the lower the error variance, the higher the explained variance. Perhaps we misunderstood what the reviewer was asking for, but we do not see a direct added benefit of including the MSE value.

Author response image 2
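
For reference, the tight link between the two measures follows directly from the standard OLS definitions: R2 = 1 - SSres/SStot and MSE = SSres/n, hence MSE = (SStot/n)*(1 - R2). Since the dependent variable (and therefore SStot) is identical across the model variants, ranking them by MSE or by R2 necessarily gives the same ordering.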

4. Lastly, it is unclear to what extent the oscillator is necessary to find this relative time shift. A model comparison between the predictions of the STiMCON and the RNN predictions on their own (à la Figure 3) would help to show how much the addition of the oscillation improves our predictions. Perhaps this is what is meant by the "non-transformed R2" but this is unclear.

We regret that this was unclear. With the non-transformed R2 we indeed meant a model with the same factors and same permutations as the models shown in Figure 6, but without performing the transformation described in equation 5. This has now been clarified in the manuscript. It now reads: “Results show a modulation of the R2 dependent on the amplitude and phase offset of the oscillation (Figure 7A). This was stronger than for a model in which the transformation in equation (5) was not applied (R2 there was 0.389).”

5. Figure 7 shows a striking result demonstrating how the model can be used to explain an interesting finding that phase of an oscillation can bias perception towards da or ga. The initial papers consider this result to be explained by delays in onset between visual and auditory stimuli whereas this result explains it in terms of the statistical likelihood of each syllable. It is a nice reframing which helps me to better understand the previous result.

We agree that this argument is more parsimonious than our original interpretation. Of course, the two are not mutually exclusive (AV delays could also be bigger for less likely syllables). We have mentioned this now in the result section. It now reads: “Note that previously we have interpreted the /daga/ phase-dependent effect as a mapping of differences between natural audio-visual onset delays of the two syllabic types onto oscillatory phase (Ten Oever et al., 2013, Ten Oever and Sack, 2015). However, the current interpretation is not mutually exclusive with this delay-to-phase mapping as audio-visual delays could be bigger for less frequent syllables.”

6. The authors show that syllable lengths are determined in part by the predictability of the word it is a part of. While the authors have reasonably restricted themselves to a single hierarchical level, the point invites the question as to whether all hierarchical levels are governed by similar processes. Should syllables accelerate from beginning to end of a word? Or in more or less predictable phrases?

We would predict that different levels operate on similar principles. However, there has not been a great deal of research on syllabic predictions and the lengthening or shortening of syllables. In Figure 2E we show that words with more syllables are indeed shorter than would be expected from summing the average syllabic durations of mono-syllabic words. However, this does not have to be because later syllables are shortened, and note that this is not directly what we would predict. We predict that the more predictable the next syllable, the shorter the syllable-to-syllable onsets. So if the next word is very unpredictable, the last syllable would actually be longer; indeed, it is often found that the last syllable of a word is lengthened (Lindblom, 1968). Other studies show that for individual phonemes word frequency can affect the duration of the syllable (Pluymaekers et al., 2005), but also that higher predictability within the word, or based on the next word, can induce shortening (Pluymaekers et al., 2005b). In linguistic studies it has also been found that initial syllables of a word are shortened when words are longer (Lehiste, 1972). While this initially seems to go against our prediction that higher predictability (within a word) leads to shortening, note that these were bi- or tri-syllabic words; for these words we would predict that the first syllable is shortened when the next syllable is predictable, which is the case when a producer can access the full word. In sum, whenever the transitional probability to the next syllable is weaker (across rather than within words), this can lead to (relative) lengthening (as predicted from Figure 3 and Table 2). This needs to be investigated across syllabic, word, and phrasal levels; to fully answer this question we would need to investigate the transitional probabilities of the syllables in the corpus. We have added a section on this point in the discussion. It now reads: “It is likely that predictive mechanisms also operate on these higher and syllabic levels. It is known for example that syllables are shortened when the following syllabic content is known versus producing syllables in isolation (Lehiste, 1972, Pluymaekers et al., 2005). Interactions also occur as syllables within more frequency words are generally shortened (Pluymaekers et al., 2005). Therefore, more hierarchical levels need to be added to the current model (but this is possible following equation (1)).”

7. Figure 5 shows how an oscillator mechanism can force pseudo-rhythmic stimuli into a more rhythmic code. The authors note that this can be done either by slowing responses to early stimuli and quickening responses to later ones, or by dropping (nodes don't reach threshold) stimuli too far outside the range of the oscillation. The first is an interesting mechanism, the second is potentially detrimental to processing (although it could be used as a means for filtering out noise). The authors should make clear how much deviation is required to invoke the dropping out mechanism and how this threshold relates to the naturalistic case. This would give the reader a clearer view of the flexibility of this model.

The dropping mechanism only occurs when the stimulus intensity is not high enough. For this specific demonstration we kept the stimulus intensity low in order to demonstrate the change in sensitivity based on feedback. In the current implementation of the model an intensity of 2.1 would be sufficient to always reach activation, even at the least excitable point of the oscillation. We have now clarified this in the manuscript. It now reads: “In regular circumstances we would of course always want to process speech, also when it arrives at a less excitable phase. Note however, that the current stimulus intensities were picked to exactly extract the threshold responses. When we increase our intensity range above 2.1 nodes will always get activated even on the lowest excitable phase of the oscillation.”

8. I found Figure 5 very difficult to understand and had to read and re-read it multiple times to feel like I could get a handle on it. I struggled to get a handle on why supra time was shorter and shorter the later the stimulus was activated. It should reverse at some point as the phase goes back into lower excitability, right? The current wording is very unclear on this point. In addition, the low-high, high-low analysis is unclear because the nature of the stimuli is unclear. I think an added figure panel to show how these stimuli are generated and manipulated would go a long way here.

9. The prediction of behavioral data in Figure 7 is striking but the methods could be improved. Currently, the authors bin the output of the model to be 0, 0.5 or 1 which requires some maneuvering to effectively compare it with the sinewave model. They could instead use a continuous measure (either lag of activation between da and ga, or activation difference) as a feature in a logistic regression to predict the human subject behavior.

We have split the figure into two figures, one relating to Figure 5A+B and the other to C-E. In this way we could add clarifying panels showing what the model is doing exactly (Figure 5A and Figure 6A). We hope this is now clearer.

Regarding the clarification points.

1) I struggled to get a handle on why supra time was shorter and shorter the later the stimulus was activated. It should reverse at some point as the phase goes back into lower excitability, right?

Indeed, at some point the supra time should reverse; however, we did not go that far out, as by that time the feedback would have faded and there would be no difference among the different stimuli.

2) The low-high, high-low analysis is unclear because the nature of the stimuli is unclear.

The nature of the stimuli is now clarified in panel 6A. The stimuli are all the same, but they vary in the underlying internal model: for low-high the stimuli alternate between a highly predicted and a non-predicted stimulus, and vice versa for high-low.

10. I'm not sure but I think there is a typo in line 383-384. The parameter for feedback should read Cl+1→l * Al+1,T. Note the + sign instead of the -. Or I have misunderstood something important.

The lag analysis is difficult because often one of the nodes is not activated at all, and we would then have to assign an arbitrary value to its latency. However, we can look at the mean activation of the nodes to obtain a continuous variable. We therefore repeated the analysis using the relative activation between the /da/ and /ga/ nodes over an interval of 500 ms post-stimulus. The results of this analysis are shown in Figure 8D. Firstly, the explained variance increases up to 83% compared to the analysis using the first active node as outcome measure. For the rest the pattern is very similar, except that for the mean activation the inhibition function was more important for the fit than when using the first active node as outcome. We have decided to keep both types of fitting in the manuscript, as we are not yet sure which is the more relevant neuronal feature for identifying the stimulus: the first time the node is active, or the average activation? But as both analyses point in the same direction, we are confident that all features of the model are important for fitting the data.
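
As a small illustration of this continuous outcome measure (a sketch; the array names and sampling step are assumptions):

import numpy as np

def relative_da_ga_activation(act_da, act_ga, dt_ms=1.0, window_ms=500.0):
    """Mean activation difference between the /da/ and /ga/ nodes over the
    first 500 ms post-stimulus (activation traces sampled every dt_ms)."""
    n = int(window_ms / dt_ms)
    return float(np.mean(act_da[:n]) - np.mean(act_ga[:n]))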

References

Assaneo, M. F., et al. (2021). "Speaking rhythmically can shape hearing." Nature human behaviour 5(1): 71-82.

Cumin, D. and C. Unsworth (2007). "Generalising the Kuramoto model for the study of neuronal synchronisation in the brain." Physica D: Nonlinear Phenomena 226(2): 181-196.

Doelling, K. B., et al. (2019). "An oscillator model better predicts cortical entrainment to music." Proceedings of the National Academy of Sciences 116(20): 10113-10121.

Ghitza, O. (2012). "On the role of theta-driven syllabic parsing in decoding speech: intelligibility of speech with a manipulated modulation spectrum." Frontiers in Psychology 3.

Giraud, A. L. and D. Poeppel (2012). "Cortical oscillations and speech processing: emerging computational principles and operations." Nature Neuroscience 15(4): 511-517.

Jensen, O., et al. (2012). "An oscillatory mechanism for prioritizing salient unattended stimuli." Trends in Cognitive Sciences 16(4): 200-206.

Kaufeld, G., et al. (2020). "Linguistic structure and meaning organize neural oscillations into a content-specific hierarchy." Journal of Neuroscience 40(49): 9467-9475.

Lehiste, I. (1972). "The timing of utterances and linguistic boundaries." The Journal of the Acoustical Society of America 51.

Lisman, J. E. and O. Jensen (2013). "The theta-gamma neural code." Neuron 77(6): 1002-1016.

Luo, H. and D. Poeppel (2007). "Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex." Neuron 54(6): 1001-1010.

Meyer, L., et al. (2019). "Synchronous, but not entrained: Exogenous and endogenous cortical rhythms of speech and language processing." Language, Cognition and Neuroscience: 1-11.

Panzeri, S., et al. (2001). "The role of spike timing in the coding of stimulus location in rat somatosensory cortex." Neuron 29(3): 769-777.

Pluymaekers, M., et al. (2005). "Articulatory planning is continuous and sensitive to informational redundancy." Phonetica 62(2-4): 146-159.

Pluymaekers, M., et al. (2005). "Lexical frequency and acoustic reduction in spoken Dutch." The Journal of the Acoustical Society of America 118(4): 2561-2569.

Poeppel, D. and M. F. Assaneo (2020). "Speech rhythms and their neural foundations." Nature Reviews Neuroscience: 1-13.

Rimmele, J. M., et al. (2018). "Proactive sensing of periodic and aperiodic auditory patterns." Trends in Cognitive Sciences 22(10): 870-882.

Schroeder, C. E. and P. Lakatos (2009). "Low-frequency neuronal oscillations as instruments of sensory selection." Trends in Neurosciences 32(1): 9-18.

Ten Oever, S., et al. (2020). "Phase-coded oscillatory ordering promotes the separation of closely matched representations to optimize perceptual discrimination." iScience: 101282.

Ten Oever, S. and A. T. Sack (2015). "Oscillatory phase shapes syllable perception." Proceedings of the National Academy of Sciences 112(52): 15833-15837.

Ten Oever, S., et al. (2013). "Audio-visual onset differences are used to determine syllable identity for ambiguous audio-visual stimulus pairs." Frontiers in Psychology 4.

Winfree, A. T. (2001). The geometry of biological time, Springer Science and Business Media.

Zuidema, W. (2010). "A syllable frequency list for Dutch."

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Reviewer #2 (Recommendations for the authors):

I want to thank the authors for the great effort revising the manuscript. The manuscript has much improved. I only have some final small comments.

Detailed comments

l. 273-275: In my opinion: This is because the oscillator is set as a rigid oscillator in the model that is not affected by the word level layer activation; however, as the authors already discuss this topic, this is just a comment.

l. 344: "the processing itself" I'd specify: "the processing at the word layer".

We have changed the phrasing accordingly.

l. 557/558: Rimmele et al., (2018) do discuss that besides the motor system, predictions from higher-level linguistic processing might affect auditory cortex neuronal oscillations through phase resetting. Top-down predictions affecting auditory cortex oscillations is one of the main claims of the paper. Thus, this paper seems not a good example for proposals that exclude when-to-what interactions. In my view the claims are rather consistent with the ones proposed here, although Rimmele et al., do not detail the mechanism and differ from the current proposal in that they suggest phase resetting. Could you clarify?

We regret that it seemed as if we overlooked that Rimmele et al. (2018) discuss that linguistic predictions can influence auditory cortex; we believe this is a misunderstanding. While Rimmele et al. (2018) indeed discuss how timing in the brain can change due to top-down linguistic predictions, they do not discuss that the timing in the speech input itself also depends on the content and thus itself relates to linguistic predictions. One core question for us is how the brain can extract the statistical information from what-when dependencies in the environment. For example, it is unclear how phase resetting would account for inferring that a less predictable word typically occurs later. This is a core difference between those models and the current model. We realize that the current phrasing was rather ambiguous about whether the when-what dependencies are in the brain or in the stimulus statistics, and the argument was not complete. We have rephrased this part of the manuscript. It now reads: “While some of these models discuss that higher-level linguistic processing can modulate the timing of ongoing oscillations [15], they typically do not consider that in the speech signal itself the content or predictability of a word relates to the timing of this word. Phase resetting models typically deal with pseudo-rhythmicity by shifting the phase of ongoing oscillations in response to a word that is off-set to the mean frequency of the input [3, 77]. We believe that this cannot explain how the brain uses what/when dependencies present in the environment to infer the content of the word (e.g. later words are likely a less predictable word).”

l 584 ff.: "This idea diverges from the idea that entrainment should per definition occur on the most excitable phase of the oscillation [3,15]." Maybe rephrase: "This idea diverges from the idea that entrainment should align the most excitable phase of the oscillation with the highest energy in the acoustics [3,15]."

We have changed the phrasing accordingly.

l. 431: "The model consists of four nodes (N1, N2, Nda, and Nga) at which N1 activation predicts a second unspecific stimulus (S2) represented by N2 at a predictability of 1. N2 activation predicts either da or ga at 0.2 and 0.1 probability respectively."

This is still hard to understand for me. E.g. What is S2, is this either da or ga, wouldn't their probability have to add up to 1?

N1 and N2 represent nodes that are responsive to two stimuli, S1 and S2 (just as Nda and Nga are responsive to the stimuli /da/ and /ga/). S1 and S2 are two unspecific stimuli that are included in the simulation as entrainment stimuli. (The manuscript now reads: “N1 and N2 represent nodes responsive to two stimuli S1 and S2 that function as entrainment stimuli.”)

Indeed, in the brain it would make sense that the predictions add up to 1. However, we here only model a small proportion of all the possible word nodes in the brain. To clarify this, we added the following: “While in the brain the prediction should add up to 1, we can assume that the remaining probability is spread across a big number of word nodes of the full language model and is therefore negligible.”

Wording

l. 175/176: sth is wrong with the sentence.

l. 544: "higher and syllabic"? (sounds like sth is wrong in the wording)

l. 546: "within more frequency" (more frequent or higher frequency?)

We have updated the wording of these sentences accordingly.

https://doi.org/10.7554/eLife.68066.sa2

Article and author information

Author details

  1. Sanne ten Oever

    1. Language and Computation in Neural Systems group, Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
    2. Donders Centre for Cognitive Neuroimaging, Radboud University, Nijmegen, Netherlands
    3. Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, Netherlands
    Contribution
    Conceptualization, Data curation, Formal analysis, Visualization, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    sanne.tenoever@mpi.nl
    Competing interests
    No competing interests declared
ORCID iD: 0000-0001-7547-5842
  2. Andrea E Martin

    1. Language and Computation in Neural Systems group, Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
    2. Donders Centre for Cognitive Neuroimaging, Radboud University, Nijmegen, Netherlands
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Validation, Writing - review and editing
    Competing interests
    No competing interests declared
ORCID iD: 0000-0002-3395-7234

Funding

Max Planck Society (Max Planck Research Group)

  • Andrea E Martin

Nederlandse Organisatie voor Wetenschappelijk Onderzoek (016.Vidi.188.029)

  • Andrea E Martin

Max Planck Society (Lise Meitner Research Group)

  • Andrea E Martin

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

AEM was supported by the Max Planck Research Group and Lise Meitner Research Group ‘Language and Computation in Neural Systems’ from the Max Planck Society, and by the Netherlands Organization for Scientific Research (grant 016.Vidi.188.029 to AEM). Figures 1 and 9 were created in collaboration with scientific illustrator Jan-Karen Campbell (http://www.jankaren.com).

Senior Editor

  1. Andrew J King, University of Oxford, United Kingdom

Reviewing Editor

  1. Anne Kösem, Lyon Neuroscience Research Center, France

Reviewers

  1. Anne Kösem, Lyon Neuroscience Research Center, France
  2. Johanna Rimmele, Max-Planck-Institute for Empirical Aesthetics, Germany
  3. Keith Doelling

Version history

  1. Preprint posted: December 7, 2020
  2. Received: March 3, 2021
  3. Accepted: July 16, 2021
  4. Version of Record published: August 2, 2021 (version 1)
  5. Version of Record updated: August 3, 2021 (version 2)

Copyright

© 2021, ten Oever and Martin

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

  1. Sanne ten Oever
  2. Andrea E Martin
(2021)
An oscillating computational model can track pseudo-rhythmic speech by using linguistic predictions
eLife 10:e68066.
https://doi.org/10.7554/eLife.68066
