Cortical activity during naturalistic music listening reflects short-range predictions based on long-term experience

  1. Pius Kern
  2. Micha Heilbron
  3. Floris P de Lange
  4. Eelke Spaak  Is a corresponding author
  1. Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, Netherlands

Abstract

Expectations shape our experience of music. However, the internal model upon which listeners form melodic expectations is still debated. Do expectations stem from Gestalt-like principles or statistical learning? If the latter, does long-term experience play an important role, or are short-term regularities sufficient? And finally, what length of context informs contextual expectations? To answer these questions, we presented human listeners with diverse naturalistic compositions from Western classical music, while recording neural activity using MEG. We quantified note-level melodic surprise and uncertainty using various computational models of music, including a state-of-the-art transformer neural network. A time-resolved regression analysis revealed that neural activity over fronto-temporal sensors tracked melodic surprise particularly around 200ms and 300–500ms after note onset. This neural surprise response was dissociated from sensory-acoustic and adaptation effects. Neural surprise was best predicted by computational models that incorporated long-term statistical learning—rather than by simple, Gestalt-like principles. Yet, intriguingly, the surprise reflected primarily short-range musical contexts of less than ten notes. We present a full replication of our novel MEG results in an openly available EEG dataset. Together, these results elucidate the internal model that shapes melodic predictions during naturalistic music listening.

Editor's evaluation

This study models the predictions a listener makes in music in two ways: how different model algorithms compare in their performance at predicting the upcoming notes in a melody, and how well they predict listeners' brain responses to these notes. The study will be important as it implements and compares three contemporary models of music prediction. In a set of convincing analyses, the authors find that musical melodies are best predicted by models taking into account long-term experience of musical melodies, whereas brain responses are best predicted by applying these models to only a few most recent notes.

https://doi.org/10.7554/eLife.80935.sa0

Introduction

The second movement of Haydn’s symphony No. 94 begins with a string section creating the expectation of a gentle and soft piece, which is suddenly interrupted by a tutti fortissimo chord. This startling motif earned the composition the nickname ‘Surprise symphony’. All music, in fact, plays with listeners’ expectations to evoke musical enjoyment and emotions, albeit often in more subtle ways (Huron, 2006; Juslin and Västfjäll, 2008; Meyer, 1957; Salimpoor et al., 2015). A central element of music which induces musical expectations is melody, the linear sequence of notes alternating in pitch. Within a musical piece and style, such as Western classical music, certain melodic patterns appear more frequently than others, establishing a musical syntax (Krumhansl, 2015; Patel, 2003; Rohrmeier et al., 2011). Human listeners have been proposed to continuously form predictions on how the melody will continue based on these regularities (Koelsch et al., 2019; Meyer, 1957; Tillmann et al., 2014; Vuust et al., 2022).

In support of prediction-based processing of music, it has been shown that listeners are sensitive to melodic surprise. Behaviorally, higher surprise notes are rated as more unexpected (Krumhansl and Kessler, 1982; Marmel et al., 2008; Marmel et al., 2010; Pearce et al., 2010; Schmuckler, 1989) and impair performance, for example in dissonance detection tasks (Pearce et al., 2010; Sears et al., 2019). Listeners continue melodic primes with low-surprise notes in musical cloze tasks (Carlsen, 1981; Morgan et al., 2019; Schmuckler, 1989). Neural activity tracks melodic surprise (Di Liberto et al., 2020) and high-surprise notes elicit electrophysiological signatures indicative of surprise processing, in particular the mismatch negativity (Brattico et al., 2006; Mencke et al., 2021; Näätänen et al., 2007; Quiroga-Martinez et al., 2020) and P3 component (Quiroga-Martinez et al., 2020) (for a review see Koelsch et al., 2019), but also the P2 component (Omigie et al., 2013), a late negative activity around 400ms (Miranda and Ullman, 2007; Pearce et al., 2010), and oscillatory activity (Omigie et al., 2019; Pearce et al., 2010). Despite this extensive body of neural and behavioral evidence on the effects of melodic expectations in music perception, the form and content of the internal model generating these expectations remain unclear. Furthermore, the evidence stems primarily from studying the processing of relatively artificial stimuli, and how these findings extend to a more naturalistic setting is unknown.

We set out to answer three related open questions regarding the nature of melodic expectations, as reflected in neural activity. First, are expectations best explained by a small set of Gestalt-like principles (Krumhansl, 2015; Narmour, 1990; Narmour, 1992; Temperley, 2008; Temperley, 2014), or are they better captured by statistical learning (Pearce, 2005; Pearce and Wiggins, 2012; Rohrmeier and Koelsch, 2012)? According to Gestalt-like models, expectations stem from relatively simple rules also found in music theory, for example that intervals between subsequent notes tend to be small. From a statistical learning perspective, in contrast, listeners acquire internal predictive models, capturing potentially similar or different principles, through exposure to music. Overall, statistical learning models have proven slightly better fits for musical data (Temperley, 2014) and for human listeners’ expectations assessed behaviorally (Morgan et al., 2019; Pearce and Wiggins, 2006; Temperley, 2014), but the two types of models have rarely been directly compared. Second, if statistical learning drives melodic expectations, does this rely on long-term exposure to music, or might it better reflect the local statistical structure of a given musical piece? Finally, independent of whether melodic expectations are informed by short- or long-term experience, we ask how much temporal context melodic expectations take into account; that is, whether they are based on short- or longer-range context. On the one hand, the brain might use as much temporal context as possible in order to predict optimally. On the other hand, the range of echoic memory is limited and temporal integration windows are relatively short, especially in sensory areas (Hasson et al., 2008; Honey et al., 2012; Himberger et al., 2018). Therefore, melodic expectations could be based on shorter-range context than would be statistically optimal.
To address this question, we derived model-based probabilistic estimates of expectations using the Music Transformer (Huang et al., 2018). This is a state-of-the-art neural network model that can take long-range (and variable) context into account much more effectively than the n-gram models previously used to model melodic expectations, since transformer models process blocks of (musical) context as a whole, instead of focusing on (note) sequences of variable, yet limited, length.

In the current project, we approached this set of questions as follows (Figure 1). First, we operationalized different sources of melodic expectations by simulating different predictive architectures: the Probabilistic Model of Melody Perception (Temperley, 2008; Temperley, 2014), which is a Gestalt-like model; the Information Dynamics of Music (IDyOM) model, an n-gram based statistical learning model (Pearce, 2005; Pearce and Wiggins, 2012); and the aforementioned Music Transformer. We compared the different computational models’ predictive performance on music data to establish them as different hypotheses about the sources of melodic expectations. We then used time-resolved regression analysis to analyze a newly acquired MEG dataset, obtained while participants (n=35) listened to diverse, naturalistic musical stimuli. This allowed us to disentangle the contributions of different sources of expectations, as well as different lengths of contextual information, to the neural signature of surprise processing that is so central to our experience of music. To preview our results: we found that melodic surprise strongly modulates the evoked response, and that this effect goes beyond basic acoustic features and simple repetition effects, confirming that brain responses are shaped by melodic expectations also during naturalistic music listening. Critically, we found that neural melodic surprise is best captured by long-term statistical learning; yet, intriguingly, depends primarily on short-range musical context. In particular, we observed a striking dissociation at a context window of about ten notes: models taking longer-range context into account become better at predicting music, but worse at predicting neural activity. Superior temporal cortical sources most strongly contributed to the surprise signature, primarily around 200ms and 300–500ms after note onset.
Finally, we present a full replication of our findings in an independent openly available EEG dataset (Di Liberto et al., 2020).

Overview of the research paradigm.

Listeners undergoing EEG (data from Di Liberto et al., 2020) or MEG measurement (novel data acquired for the current study) were presented with naturalistic music synthesized from MIDI files. To model melodic expectations, we calculated note-level surprise and uncertainty estimates via three computational models reflecting different internal models of expectations. We estimated the regression evoked response or temporal response function (TRF) for different features using time-resolved linear regression on the M|EEG data, while controlling for low-level acoustic factors.

Results

Music analysis

We quantified the note-level surprise and uncertainty using different computational models of music, which were hypothesized to capture different sources of melodic expectation (see Materials and methods for details). The Probabilistic Model of Melody Perception (Temperley) (Temperley, 2008; Temperley, 2014) rests on a few principles derived from musicology and thus represents Gestalt-like perception (Morgan et al., 2019). The Information Dynamics of Music (IDyOM) model (Pearce and Wiggins, 2012) captures expectations from statistical learning, either based on short-term regularities in the current musical piece (IDyOM stm), long-term exposure to music (IDyOM ltm), or a combination of the former two (IDyOM both). The Music Transformer (MT) (Huang et al., 2018) is a state-of-the-art neural network model, which also reflects long-term statistical learning but is more sensitive to longer-range structure. In a first step, we aimed to establish the different models as distinct hypotheses about the sources of melodic expectations. We examined how well the models predicted music data and to what extent their predictions improved when the amount of available context increased.

IDyOM stm and Music Transformer show superior melodic prediction

First, we tested how well the different computational models predicted the musical stimuli presented in the MEG study (Figure 2). Specifically, we quantified the accuracy with which the models predicted upcoming notes, given a certain number of previous notes as context information. While all models performed well above chance level accuracy (1/128=0.8%), the IDyOM stm (median accuracy across compositions: 57.9%), IDyOM both (53.5%), and Music Transformer (54.8%) models performed considerably better than the Temperley (19.3%) and IDyOM ltm (27.3%) models, in terms of median accuracy across compositions (Figure 2A left). This pattern was confirmed in terms of the models’ note-level surprise, a continuous measure of predictive performance for which lower values indicate a better ability to predict the next note given the context (median surprise across compositions: Temperley = 2.18, IDyOM stm = 1.12, IDyOM ltm = 2.23, IDyOM both = 1.46, MT = 1.15, Figure 2A middle). The median surprise is closely related to the cross-entropy loss, which can be defined as the mean surprise across all notes (Temperley = 2.7, IDyOM stm = 2, ltm = 2.47, both = 1.86, Music Transformer = 1.81). Furthermore, the uncertainty, defined as the entropy of the probability distribution at each time point, characterizes each model’s (inverse) confidence in its predictions (maximum uncertainty = 4.85, i.e., ln(128), given a uniform probability distribution over the 128 pitches). The Music Transformer model formed predictions more confidently than the other models, whereas the Temperley model displayed the highest uncertainty (median uncertainty across compositions: Temperley = 2.65, IDyOM stm = 2.23, ltm = 2.49, both = 2.28, MT = 1.69, Figure 2A right). Within the IDyOM class, the stm model consistently showed lower uncertainty compared to the ltm model, presumably reflecting a greater consistency of melodic patterns within versus across compositions.
As a result, the both model was driven by the stm model, since it combines the ltm and stm components weighted by their uncertainty (mean stm weight = 0.72, mean ltm weight = 0.18).
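The surprise and uncertainty measures above follow directly from each model's predictive distribution over the 128 MIDI pitches. A minimal sketch (the function name is illustrative; natural-log units are assumed, since the stated maximum uncertainty of 4.85 matches ln(128)):

```python
import numpy as np

def surprise_and_uncertainty(p, next_pitch):
    """Note-level surprise and uncertainty from a model's predictive
    distribution p over the 128 MIDI pitches (natural-log units)."""
    p = np.asarray(p, dtype=float)
    surprise = -np.log(p[next_pitch])              # -ln p(pitch | context)
    uncertainty = -np.sum(p * np.log(p + 1e-12))   # Shannon entropy
    return surprise, uncertainty

# A uniform (chance-level) model is maximally uncertain and, for any
# outcome, maximally surprised: both values equal ln(128), about 4.85.
p_uniform = np.full(128, 1 / 128)
s, u = surprise_and_uncertainty(p_uniform, next_pitch=60)
```

Under this formulation, the cross-entropy loss reported above is simply the mean surprise across all notes in a composition.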

Model performance on the musical stimuli used in the MEG study.

(A) Comparison of music model performance in predicting upcoming note pitch, as composition-level accuracy (left; higher is better), median surprise across notes (middle; lower is better), and median uncertainty across notes (right). Context length for each model is the best performing one across the range shown in (B). Vertical bars: single compositions, circle: median, thick line: quartiles, thin line: quartiles ±1.5 × interquartile range. (B) Accuracy of note pitch predictions (median across 19 compositions) as a function of context length and model class (same color code as (A)). Dots represent maximum for each model class. (C) Correlations between the surprise estimates from the best models. (For similar results for the musical stimuli used in the EEG study, see Appendix 1—figure 2).

Music Transformer utilizes long-range musical structure

Next, we examined to what extent the different models utilize long-range structure in musical compositions or rely on short-range regularities by systematically varying the context length k (above we considered each model at its optimal context length, defined by the maximum accuracy). The Music Transformer model proved to be the only model for which the predictive accuracy increased considerably as the context length increased, from about 9.17% (k=1) up to 54.82% (k=350) (Figure 2B). The IDyOM models’ performance, in contrast, plateaued early at context lengths between three and five notes (optimal k: stm: 25, ltm: 4, both: 3), reflecting the well-known sparsity issue of n-gram models (Jurafsky and Martin, 2000). Although the Temperley model benefited slightly from additional musical context, the increment was small and the accuracy was lower compared to the other models across all context lengths (5.58% at k=1 to 19.25% at k=25).
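The context-length manipulation can be sketched as follows; the `predict` interface and the toy repeat-last model are purely illustrative stand-ins for the actual computational models:

```python
import numpy as np

def accuracy_vs_context(predict, notes, ks):
    """For each context length k, predict every note from its (at most) k
    preceding notes and score the fraction of correct argmax predictions."""
    results = {}
    for k in ks:
        correct = 0
        for i in range(1, len(notes)):
            context = notes[max(0, i - k):i]   # truncate to k most recent notes
            p = predict(context)               # distribution over 128 pitches
            correct += int(np.argmax(p) == notes[i])
        results[k] = correct / (len(notes) - 1)
    return results

# Toy model: predict that the most recent pitch will simply repeat.
def repeat_last(context):
    p = np.full(128, 1e-6)
    p[context[-1]] = 1.0
    return p / p.sum()

notes = [60, 60, 62, 62, 62, 64]
acc = accuracy_vs_context(repeat_last, notes, ks=[1, 2, 4])
# acc[1] == 0.6: three of the five predicted notes repeat their predecessor
```

A model that exploits longer context (like the Music Transformer) would show accuracy increasing with k; the toy model, like the n-gram models near their plateau, does not.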

Computational models capture distinct sources of musical expectation

To further evaluate the differences between models, we tested how strongly their surprise estimates were correlated across all notes in the stimulus set (Figure 2C). Since the IDyOM stm model dominated the both model, the two were correlated most strongly (r=0.87). The lowest correlations occurred between the IDyOM stm on the one hand and the IDyOM ltm (r=0.24) and Temperley model (r=0.22) on the other hand. Given that all estimates quantified surprise, positive correlations of medium to large size were expected. More importantly, the models appeared to pick up substantial unique variance, in line with the differences in predictive performance explored above.

Taken together, these results established that the computational models of music capture different sources of melodic expectation. Only the Music Transformer model was able to exploit long-range structure in music to facilitate predictions of note pitch. Yet, short-range regularities in the current musical piece alone already enabled accurate melodic predictions: the IDyOM stm model performed remarkably well, even compared to the much more sophisticated Music Transformer. We confirmed these results on the musical stimuli from the EEG study (Appendix 1—figure 2).

M|EEG analysis

We used a time-resolved linear regression approach (see Materials and methods for details) to analyse listeners’ M|EEG data. By comparing different regression models, we asked (1) whether there is evidence for the neural processing of melodic surprise and uncertainty during naturalistic music listening and (2) which sources of melodic expectations, represented by the different computational models, best capture this processing. We quantified the performance of each regression model in explaining the MEG data by computing the correlation r between predicted and observed neural data. Importantly, we estimated r using fivefold cross-validation, thereby ruling out any trivial increase in predictive performance due to an increase in the number of regressors (i.e. free parameters).
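The core of this approach can be sketched with ordinary least squares (a simplification; the actual estimation details are in Materials and methods). Features are expanded into time-lagged copies, so the fitted coefficients form a temporal response function (TRF) per feature, and performance is the correlation between predicted and held-out data:

```python
import numpy as np

def lagged_design(features, n_lags):
    """Expand a (time x features) matrix into time-lagged copies: one
    column per feature-lag pair, so coefficients form a TRF per feature."""
    n_times, n_feat = features.shape
    X = np.zeros((n_times, n_feat * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * n_feat:(lag + 1) * n_feat] = features[:n_times - lag]
    return X

def cross_validated_r(X, y, n_folds=5):
    """Fit on training folds, correlate prediction with held-out data."""
    idx = np.arange(len(y))
    rs = []
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        y_hat = X[test] @ beta
        rs.append(np.corrcoef(y_hat, y[test])[0, 1])
    return float(np.mean(rs))
```

Because r is computed on held-out folds, adding uninformative regressors cannot trivially inflate it.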

The simplest model, the Onset model, contained a single regressor coding note onsets in binary fashion. Unsurprisingly, this model significantly explained variance in the recorded MEG data (mean r across participants = 0.12, SD = 0.03; one-sample t-test versus zero, t34=25.42, p=1.06e-23, d=4.36, Figure 3A top left), confirming that our regression approach worked properly. The Baseline model included the note onset regressor, and additionally a set of regressors to account for sensory-acoustic features, such as loudness, sound type, pitch class (low/high), as well as note repetitions to account for sensory adaptation (Auksztulewicz and Friston, 2016; Todorovic and de Lange, 2012). The Baseline model explained additional variance beyond the Onset model (ΔrBaseline-Onset=0.013, SD = 0.006; paired-sample t-test, t34=12.07, p=7.58e-14, d=2.07, Figure 3A bottom left), showing that differences in acoustic features and repetition further modulated neural activity elicited by notes.
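Note-level regressors of this kind can be coded as impulses at note onsets; a minimal illustration (the exact regressor coding follows Materials and methods):

```python
import numpy as np

def impulse_regressor(onset_times, values, sfreq, n_samples):
    """Place each note's feature value at its onset sample, zeros elsewhere.
    A binary onset regressor uses values of 1; loudness, repetition, or
    surprise regressors use the corresponding per-note feature values."""
    x = np.zeros(n_samples)
    for t, v in zip(onset_times, values):
        x[int(round(t * sfreq))] = v
    return x

# E.g., a binary onset regressor for notes at 0.5 s and 1.2 s, sampled at 100 Hz:
onsets = impulse_regressor([0.5, 1.2], [1, 1], sfreq=100, n_samples=300)
```

Stacking such columns and lagging them yields the design matrix for the time-resolved regression.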

Model performance on MEG data from 35 listeners.

(A) Cross-validated r for the Onset only model (top left). Difference in cross-validated r between the Baseline model including acoustic regressors and the Onset model (bottom left). Difference in cross-validated r between models including surprise estimates from different model classes (color-coded) and the Baseline model (right). Vertical bars: participants; box plot as in Figure 2. (B) Comparison between the best surprise models from each model class as a function of context length. Lines: mean across participants, shaded area: 95% CI. (C) Predictive performance of the Music Transformer (MT) on the MEG data (left y-axis, dark, mean across participants) and the music data from the MEG study (right y-axis, light, median across compositions).

Long-term statistical learning best explains listeners’ melodic surprise

We next investigated to which degree the surprise estimates from the different computational models of music could explain unique variance in the neural data, over and above that already explained by the Baseline model. All models performed significantly better than the Baseline model, providing evidence for tracking of neural surprise during naturalistic music listening (Temperley: ΔrSurprise-Baseline=0.002, SD = 0.001, paired-sample t-test, t34=8.76, p=2.42e-09, d=1.5; IDyOM stm: ΔrSurprise-Baseline=0.001, SD = 0.001, t34=5.66, p=9.39e-06, d=0.97; IDyOM ltm: ΔrSurprise-Baseline=0.003, SD = 0.002, t34=12.74, p=2.51e-13, d=2.19; IDyOM both: ΔrSurprise-Baseline=0.002, SD = 0.001, t34=8.77, p=2.42e-09, d=1.5; and Music Transformer: ΔrSurprise-Baseline=0.004, SD = 0.002, t34=10.82, p=1.79e-11, d=1.86, corrected for multiple comparisons using the Bonferroni-Holm method) (Figure 3A right). Importantly, the Music Transformer and IDyOM ltm model significantly outperformed the other models (paired-sample t-test, MT-Temperley: t34=7.56, p=5.33e-08, d=1.30; MT-IDyOM stm: t34=9.51, p=4.12e-10, d=1.63; MT-IDyOM both: t34=8.87, p=2.07e-09, d=1.52), with no statistically significant difference between the two (paired-sample t-test, t34=1.634, p=0.225), whereas the IDyOM stm model performed worst. This contrasts with the music analysis, where the IDyOM stm model performed considerably better than the IDyOM ltm model. These observations suggest that listeners’ melodic surprise is better explained by musical enculturation (i.e., exposure to large amounts of music across the lifetime), modeled as statistical learning on a large corpus of music (IDyOM ltm and MT), rather than by statistical regularities within the current musical piece alone (IDyOM stm) or Gestalt-like rules (Temperley).

Short-range musical context shapes listeners’ melodic surprise

We again systematically varied the context length k to probe which context length best captures listeners’ melodic surprise (above we again considered each model at its optimal context length, defined by the maximum ΔrSurprise-Baseline averaged across participants). The Temperley and IDyOM models’ incremental predictive contributions were only marginally influenced by context length, with early peaks for the IDyOM stm (k=1) and ltm (k=2) and later peaks for the both (k=75) and Temperley models (k=10) (Figure 3B). This roughly constant level of performance was expected based on the music analysis, since these models mainly relied on short-range context and their estimates of surprise were almost constant. In contrast, we reported above that the Music Transformer model extracts long-range structure in music, with music-predictive performance increasing up to context lengths of 350 notes. Strikingly, however, surprise estimates from the MT predicted MEG data best at a context length of nine notes and decreased for larger context lengths, eventually falling below the level achieved with shorter contexts (<10 notes) (Figure 3C).

Together, these findings suggest that listeners’ long-term experience (IDyOM ltm and MT) better captures neural correlates of melodic surprise than short-term statistical regularities (IDyOM stm). Yet, melodic expectations based on statistical learning might not rest on long-range temporal structure, but rather on shorter time scales of between 5 and 10 notes. These results were replicated on the EEG data (Figure 4).

Model performance on EEG data from 20 listeners.

All panels as in Figure 3, but applied to the EEG data and its musical stimuli.

Spatiotemporal neural characteristics of melodic surprise

To elucidate the spatiotemporal neural characteristics of naturalistic music listening, we further examined the temporal response functions (TRFs; or ‘regression evoked responses’) from the best model (MEG: MT at k=8, Figure 5; EEG: MT at k=7, Figure 6). Each TRF combines the time-lagged coefficients for one regressor. The resulting time course describes how the feature of interest modulates neural activity over time. Here, we focused on note onset, the repetition of notes, and melodic surprise. The TRFs were roughly constant around zero in the baseline period (−0.2 to 0 s relative to note onset) and showed a clear modulation time-locked to note onset (Figures 5 and 6). This confirmed that the deconvolution of different features and the temporal alignment in the time-resolved regression worked well. Note that the MEG data were transformed to combined planar gradients to yield interpretable topographies (Bastiaansen and Knösche, 2000), and therefore did not contain information about the polarity. While we reflect on the sign of modulations in the TRFs below, these judgements were based on inspection of the axial gradiometer MEG results (not shown) and confirmed on the EEG data (Figure 6).

Temporal response functions (TRFs, left column) and spatial topographies at four time periods (right column) for the best model on the MEG data.

(A): Note onset regressor. (B): Note repetition regressor. (C): Surprise regressor from the Music Transformer with a context length of eight notes. TRF plots: Grey horizontal bars: time points at which at least one channel in the ROI was significant. Lines: mean across participants and channels. Shaded area: 95% CI across participants.

All panels as in Figure 5, but applied to the EEG data and its musical stimuli.

The TRF for the note onset regressor reflects the average neural response evoked by a note. The effect was temporally extended from note onset up to 0.8 s (MEG) and 1 s (EEG) and clustered around bilateral fronto-temporal MEG sensors (MEG: cluster-based permutation test p=0.035, Figure 5A; EEG: p=5e-04, Figure 6A). The time course resembled a P1-N1-P2 complex, typically found in ERP studies on auditory processing (Picton, 2013; Pratt, 2011), with a first positive peak at about 75ms (P1) and a second positive peak at about 200ms (P2). This was followed by a more sustained negative deflection between 300 and 600ms. We inspected the note repetition regressors to account for the repetition suppression effect, as a potential confound of melodic expectations (Todorovic et al., 2011; Todorovic and de Lange, 2012). We observed a negative deflection at temporal sensors peaking at about 200ms, reflecting lower neural activity for repeated versus non-repeated notes (MEG: p=5e-04, Figure 5B; EEG: p=0.008, Figure 6B). This extends the well-known auditory repetition suppression effect (Grill-Spector et al., 2006; Todorovic and de Lange, 2012) to the setting of naturalistic music listening. Finally, the TRF of the surprise regressor indicates how the level of model-based surprise modulates neural activity over and above simple repetition. A fronto-temporal cluster of MEG sensors exhibited a positive peak at about 200ms and a sustained negative deflection between 300 and 600ms (MEG: p=5e-04, Figure 5C; EEG: p=0.004, Figure 6C). The increased activity for more surprising notes is consistent with expectation suppression effects (Todorovic and de Lange, 2012). We ruled out that the late negativity effect was an artifact arising from a negative correlation between surprise estimates of subsequent notes, since these temporal autocorrelations were consistently found to be positive. 
The surprise estimates from the Temperley and IDyOM models yielded similar, although slightly weaker, spatiotemporal patterns in the MEG and EEG data (Appendix 1—figures 3 and 4); given the cross-model correlations, this indicates that all models captured melodic surprise.
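The autocorrelation check mentioned above amounts to correlating the surprise series with itself shifted by one note; a minimal sketch:

```python
import numpy as np

def lag1_autocorrelation(surprise):
    """Correlation between surprise at note n and at note n+1. A negative
    value could produce sign-flip artifacts in the TRF; the consistently
    positive values observed rule this out."""
    s = np.asarray(surprise, dtype=float)
    return float(np.corrcoef(s[:-1], s[1:])[0, 1])
```

For example, a smoothly varying surprise series yields a positive value, whereas a strictly alternating one yields a negative value.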

Melodic processing is associated with superior temporal and Heschl’s gyri

To further shed light on the spatial profile of melody and surprise processing, we estimated the dominant neural sources corresponding to the peak TRF deflection (180–240ms post note onset) using equivalent current dipole (ECD) modeling of the MEG data (with one, two, or three dipoles per hemisphere, selected by comparing adjusted r2). These simple models provided a good fit to the sensor-level TRF maps, indicated by the substantial amount of variance explained (mean adjusted r2 across participants = 0.98/0.98/0.97 for Onset/Repetition/Surprise regressors, SD = 0.013/0.011/0.020). We show the density of fitted dipole locations in Figure 7. The TRF peak deflection for the Onset regressor was best explained by sources in bilateral Heschl’s gyri (Figure 7, top). The peak deflections for the Repetition and Surprise regressors were best explained by slightly more lateral sources encompassing both bilateral Heschl’s gyri as well as bilateral superior temporal gyri (see Figure 7 for exact MNI coordinates of density peaks).

Source-level results for the MEG TRF data.

Volumetric density of estimated dipole locations across participants in the time window of interest identified in Figure 5 (180–240ms), projected on the average Montreal Neurological Institute (MNI) template brain. MNI coordinates are given for the density maxima with anatomical labels from the Automated Anatomical Labeling atlas.

No evidence for neural tracking of melodic uncertainty

Besides surprise, melodic expectations can be characterized by their note-level uncertainty. Estimates of surprise and uncertainty were positively correlated across different computational models (e.g. MT with a context of eight notes: r=0.21) (Figure 8A). Surprisingly, the addition of uncertainty and its interaction with surprise did not further improve but rather reduced models’ cross-validated predictive performance on listeners’ MEG data compared to surprise alone (MT Surprise: ΔrSurprise-Baseline=0.004, SD = 0.002; +Uncertainty: ΔrUncertainty-Baseline=0.003, SD = 0.002, paired-sample t-test compared to Surprise, t34=–9.57, p=1.42e-10, d=–1.64; +Interaction S×U: ΔrSxU-Baseline=0.002, SD = 0.002, t34=–13.81, p=1.66e-14, d=–2.37) (Figure 8B). This result holds true for the other computational models of music and for the EEG data. Therefore, we do not further examine the TRFs here.
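Testing for uncertainty and its interaction with surprise amounts to appending additional columns to the design matrix; a sketch under the assumption of z-scored regressors (the actual regressor coding follows Materials and methods):

```python
import numpy as np

def expectation_regressors(surprise, uncertainty):
    """Columns for surprise, uncertainty, and their interaction (S×U),
    z-scored so the interaction term is interpretable."""
    def z(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std()
    s, u = z(surprise), z(uncertainty)
    return np.column_stack([s, u, s * u])
```

Comparing cross-validated r with and without the last two columns implements the model comparison reported above.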

Results for melodic uncertainty.

(A) Relationship between and distribution of surprise and uncertainty estimates from the Music Transformer (context length of eight notes). (B) Cross-validated predictive performance for the Baseline + surprise model (top), and for models with an added uncertainty regressor (middle) and the interaction between surprise and uncertainty (S×U, bottom). Adding uncertainty and/or the S×U interaction did not improve but rather worsened predictive performance on the MEG data.

Discussion

In the present study, we investigated the nature of melodic expectations during naturalistic music listening. We used a range of computational models to calculate melodic surprise and uncertainty under different internal models. Through time-resolved regression on human listeners’ M|EEG activity, we gauged which model could most accurately predict neural indices of melodic surprise. In general, melodic surprise enhanced neural responses, particularly around 200ms and between 300 and 500ms after note onset. This was dissociated from sensory-acoustic and repetition suppression effects, supporting expectation-based models of music perception. In a comparison of computational models of musical expectation, surprise estimates generated by internal models based on long-term statistical learning best captured neural surprise responses, highlighting extensive experience with music as a key source of melodic expectations. Strikingly, this effect appeared to be driven by short-range musical context of up to 10 notes instead of longer-range structure. This provides an important window into the nature and content of melodic expectations during naturalistic music listening.

Expectations are widely considered a hallmark of music listening (Huron, 2006; Koelsch et al., 2019; Krumhansl, 2015; Meyer, 1957; Tillmann et al., 2014; Vuust et al., 2022), which resonates with the predictive coding framework of perception and cognition (Clark, 2013; de Lange et al., 2018; Friston, 2010). Here, we tested the role of melodic expectations during naturalistic music listening, for which neural evidence has been scarce. We quantified note-level surprise and uncertainty as markers of melodic expectations and examined their effect on neural music processing using time-resolved regression. Importantly, our analyses focused on disentangling different sources of melodic expectations, as well as elucidating the length of temporal context that the brain takes into account when predicting which note will follow. This represents a critical innovation over earlier related work (Di Liberto et al., 2020), from which conclusions were necessarily limited to establishing that the brain predicts something during music listening, whereas we begin to unravel what it is that is being predicted. Furthermore, our use of diverse naturalistic musical stimuli and MEG allows for a broader generalization of our conclusions than was previously possible. Of course, the stimuli do not yet fully reflect the richness of real-world music, as for example the MIDI velocity (i.e. loudness) was held constant and only monophonic compositions were presented. Monophony was a technical limitation given the application of the Temperley and IDyOM models. The reported performance of the Music Transformer, which supports fully polyphonic music, opens new avenues for future work studying the neural basis of music processing in settings even closer to fully naturalistic listening.

A key signature of predictive auditory processing is the neural response to unexpected events, also called the prediction error response (Clark, 2013; Friston, 2010; Heilbron and Chait, 2018). The degree to which notes violate melodic expectations can be quantified as the melodic surprise. Across different computational models of music, we found that melodic surprise explained M|EEG data from human listeners beyond sensory-acoustic factors and beyond simple repetition effects. We thereby generalize previous behavioral and neural evidence for listeners’ sensitivity to unexpected notes to a naturalistic setting (for reviews see Koelsch et al., 2019; Rohrmeier and Koelsch, 2012; Tillmann et al., 2014; Zatorre and Salimpoor, 2013).

While the role of expectations in music processing is well established, there is an ongoing debate about the nature of these musical expectations (Bigand et al., 2014; Collins et al., 2014; Rohrmeier and Koelsch, 2012). It has been claimed that these stem from a small set of general, Gestalt-like, principles (Krumhansl, 2015; Temperley, 2008; Temperley, 2014). Alternatively, they may reflect the outcome of a statistical learning process (Pearce, 2005; Pearce and Wiggins, 2012; Rohrmeier and Koelsch, 2012), which, in turn, could reflect either short- or long-range regularities. For the first time, we present neural evidence that weighs in on these questions. We simulated note-level expectations from different predictive architectures of music, which reflected distinct sources of melodic expectations: Gestalt-like principles (Temperley model), short-term statistical learning during the present composition (IDyOM stm) or statistical learning through long-term exposure to music (IDyOM ltm, Music Transformer).

As a first core result, we found that long-term statistical learning (Music Transformer and IDyOM ltm) captured neural surprise processing better than short-term regularities or Gestalt principles. Our results thus stress the role of long-term exposure to music as a central source of neural melodic expectations. The human auditory system exhibits a remarkable sensitivity in detecting and learning statistical regularities in sound (Saffran et al., 1999; Skerritt-Davis and Elhilali, 2018). This capacity has been corroborated in statistical learning paradigms using behavioral (Barascud et al., 2016; Bianco et al., 2020), eye-tracking (Milne et al., 2021; Zhao et al., 2019), and neuroimaging techniques (Barascud et al., 2016; Moldwin et al., 2017; Pesnot Lerousseau and Schön, 2021). Furthermore, humans have extraordinary implicit memory for auditory patterns (Agres et al., 2018; Bianco et al., 2020). It has therefore been proposed that listeners learn the statistical regularities embedded in music through mere exposure (Pearce, 2018; Rohrmeier et al., 2011; Rohrmeier and Rebuschat, 2012).

Short-term regularities and Gestalt principles also significantly predicted neural variance and might constitute concurrent, though weaker, sources of melodic expectations (Rohrmeier and Koelsch, 2012). Gestalt principles, specifically, have been shown to adequately model listeners’ melodic expectations in behavioral studies (Cuddy and Lunney, 1995; Morgan et al., 2019; Pearce and Wiggins, 2006; Temperley, 2014). One shortcoming of Gestalt-like models, however, is that they leave unresolved how Gestalt rules emerge, assuming either innate principles (Narmour, 1990) or being agnostic to this question (Temperley, 2008). We propose that the well-established statistical learning framework can account for Gestalt-like principles. If such principles, for example pitch proximity, indeed fit a certain musical style, they must be reflected in its statistical regularities. Music theoretical research has indeed shown that statistical learning based on bigrams can recover music theoretical Gestalt principles (Rodriguez Zivic et al., 2013), even across different (musical) cultures (Savage et al., 2015). This further supports the role of statistical learning in musical expectations.

As a second core result, strikingly, we found that neural activity was best explained by those surprise estimates taking into account only relatively short-range musical context. Even though extracting the patterns upon which expectations are based requires long-term exposure (previous paragraph), the relevant context length of these patterns for predicting upcoming notes turned out to be short, around 7–8 notes. In contrast, for modeling music itself (i.e. independently of neural activity), the music transformer performed monotonically better with increasing context length, up to hundreds of notes. This pattern of results is very unlike similar studies in language processing, where models that perform best at next word prediction and can take the most context into account (i.e. transformers) also perform best at predicting behavioral and brain responses, and predictions demonstrably take long-term context into account (Goodkind and Bicknell, 2018; Heilbron et al., 2021; Schmitt et al., 2021; Schrimpf et al., 2021; Wilcox et al., 2020). A cautious hypothesis is that musical motifs, groups of about 2–10 notes, are highly generalizable within a musical style compared to longer range structure (Krumhansl, 2015). Motifs might thus drive statistical learning and melodic predictions, while other temporal scales contribute concurrently (Maheu et al., 2019). However, several alternative explanations are possible, between which we cannot adjudicate, based on our data. First, the length of ten notes roughly corresponds to the limit of auditory short-term memory at about 2–4 s (Thaut, 2014), which might constrain predictive sequence processing. Second, our analysis is only sensitive to time-locked note-level responses and those signals measured by M|EEG, whereas long-range musical structure might have different effects on neural processing (Krumhansl, 2015; Rohrmeier and Koelsch, 2012), in particular slower effects that are less precisely linked to note onsets. 
Third, the music transformer’s modeling of long-range structure might differ from how human listeners process temporally extended or hierarchical structure.

Our approach of using temporal response function (TRF, or ‘regression evoked response’, rERP) analysis allowed us to investigate the spatiotemporal characteristics of continuously unfolding neural surprise processing. Melodic surprise modulated neural activity evoked by notes over fronto-temporal sensors with a positive peak at about 200ms, corresponding to a modulation of the P2 component (Picton, 2013; Pratt, 2011). Source modeling suggests superior temporal and Heschl’s gyri as likely sources of this neural response (although we note that MEG’s spatial resolution is limited and the exact localization of surprise responses within auditory cortex requires further research). Surprising notes elicited stronger neural responses, in line with previous reports by Di Liberto et al., 2020. This finding is furthermore consistent with the more general effect of expectation suppression, the phenomenon that expected stimuli evoke weaker neural responses (Auksztulewicz and Friston, 2016; Garrido et al., 2009; Todorovic and de Lange, 2012; Wacongne et al., 2011) through gain modulation (Quiroga-Martinez et al., 2021). In line with predictive coding, the brain might hence be predicting upcoming notes in order to explain away predicted sensory input, thereby leading to enhanced responses to surprising (i.e., not yet fully explainable) input.

Additionally, we found a sustained late negativity correlating with melodic surprise, which some studies have labeled a musical N400 or N500 (Calma-Roddin and Drury, 2020; Koelsch et al., 2000; Miranda and Ullman, 2007; Painter and Koelsch, 2011; Pearce et al., 2010). Similar to its linguistic counterpart (Kutas and Federmeier, 2011), the N400 has been interpreted as an index of predictive music processing. The literature has furthermore frequently emphasised the mismatch negativity (MMN) (Näätänen et al., 2007) and P3 component in predictive music processing (Koelsch et al., 2019), neither of which we observe for melodic surprise here. However, the MMN is typically found for deviants occurring in a stream of standard tones, such as in oddball paradigms, while the P3 is usually observed in the context of an explicit behavioral task (Koelsch et al., 2019). In our study, listeners were listening passively to maximize the naturalistic setting, which could account for the absence of these components. Importantly, our results go beyond previous research by analysing the influence of melodic surprise in a continuous fashion, instead of focusing on deviants.

As a final novel contribution, we demonstrate the usefulness of a state-of-the-art deep learning model, the Music Transformer (MT) (Huang et al., 2018), for the study of music cognition. The network predicted music and neural data at least on par with the IDyOM model, an n-gram model which is currently a highly popular model of musical expectations (Pearce and Wiggins, 2012). We are likely severely underestimating the relative predictive power of the MT, since we constrained our stimuli to monophonic music in the present study. Monophonic music is the only type of music the other models (IDyOM, Temperley) are able to process, so this restriction was a technical necessity. The MT, in contrast, supports fully polyphonic music. This opens up new avenues for future work to study neural music processing in even more naturalistic settings.

To conclude, by using computational models to capture different hypotheses about the nature and source of melodic expectations and linking these to neural data recorded during naturalistic listening, we found that these expectations have their origin in long-term exposure to the statistical structure of music. Yet, strikingly, as listeners continuously exploit this long-term knowledge during listening, they do so primarily on the basis of short-range context. Our findings thereby elucidate the individual voices making up the ‘surprise symphony’ of music perception.

Materials and methods

Data and code availability

Request a detailed protocol

The (anonymized, de-identified) MEG and music data are available from the Donders Repository (https://data.donders.ru.nl/) under CC-BY-4.0 license. The persistent identifier for the data is https://doi.org/10.34973/5qxw-nn97. The experiment and analysis code is also available from the Donders Repository.

Participants

We recruited 35 healthy participants (19 female; 32 right-handed; age: 18–30 years, mean = 23.8, SD = 3.05) via the research participation system at Radboud University. The sample size was chosen to achieve a power of ≥80% for detecting a medium effect size (d=0.5) with a two-sided paired t-test at an α level of 0.05. All participants reported normal hearing. The study was approved under the general ethical approval for the Donders Centre for Cognitive Neuroimaging (Imaging Human Cognition, CMO2014/288) by the local ethics committee (CMO Arnhem-Nijmegen, Radboud University Medical Centre). Participants provided written informed consent before the experiment and received monetary compensation.

Procedure

Request a detailed protocol

Participants listened to music, while their neural activity was recorded using magnetoencephalography (MEG) (Figure 1). Participants started each musical stimulus with a button press and could take short breaks in between stimuli. Participants were instructed to fixate a dot displayed at the centre of a screen (~85 cm viewing distance) in order to reduce head and eye movements. Besides that, participants were only asked to listen attentively to the music and remain still. These minimal instructions were intended to maximize the naturalistic character of the study. Initially, three test runs (~10 s each) were completed, in which three short audio snippets from different compositions (not used in the main experiment) were presented. This was intended to make listeners familiar with the procedure and the different sounds, as well as to adjust the volume to a comfortable level.

Musical stimuli

Request a detailed protocol

We selected 19 compositions (duration: total = 43 min, median across stimuli = 134 s, median absolute deviation (MAD, Leys et al., 2013) = 39 s; note events: total = 9824, median = 448, MAD = 204) from Western classical music (see Appendix 1—table 1). We chose this genre, since (a) participants recruited from the Nijmegen area were assumed to be somewhat familiar with it, (b) it entails relatively complex melodies and long-term structure allowing us to sample a broad range of surprise and uncertainty estimates, (c) many digital music files and corpora in MIDI format are publicly available, and (d) these included monophonic pieces. Monophonic refers to only one note being played at a time, that is, the piece contains only a melody; polyphonic music, in contrast, additionally includes chords and/or parallel voices. The constraint to monophonic compositions was necessary to enable the application of the Temperley and IDyOM model, which cannot parse polyphonic music. Based on the available databases, the selection aimed to cover various musical periods (1711–1951), composers, tempi (60–176 bpm), and key signatures, roughly matching the statistics of the training corpus for the music models (see below). The median note duration was about 161ms (MAD across all notes = 35ms, min = 20ms, max = 4498ms), with a median inter-note onset interval of 200ms (MAD across all notes = 50ms, min = 22ms, max = 2550ms).

We used the Musescore 3 software to synthesize and export the digital MIDI files as wav audio files (sampling rate = 44.1 kHz). This ensured accurate control over the note timing compared to live or studio recordings, facilitating time-locked analyses. Synthesis via one of three fluidsynth virtual instruments (piano, oboe, flute) preserved the natural character of the music. The MIDI velocity, corresponding to loudness (termed ‘velocity’ in MIDI terms because it refers to the velocity with which one could strike a piano key), was set to 100 for all notes, since most files were missing velocity information; the volume was thus held roughly constant across notes.

Stimulus presentation

Request a detailed protocol

The experiment was run on a Windows computer using Matlab 2018b (The MathWorks) and the Psychophysics Toolbox (Brainard, 1997). The music was presented binaurally via ear tubes (Doc’s Promolds NonVent with #13 thick prebent 1.85 mm ID tubes, Audine Healthcare, in combination with Etymotic ER3A earphones) at a sampling rate of 44.1 kHz. The volume was adjusted to a comfortable level for each participant during the initial three test runs. To ensure equivalent acoustic input in both ears, the right audio channel from potentially stereo recordings was duplicated, resulting in mono audio presentation. After participants initiated a run by a button press, the wav file was first loaded into the sound card buffer to ensure accurate timing. Once the file was fully loaded, the visual fixation cross appeared at the centre of the screen and after 1.5–2.5 s (random uniform distribution) the music started. The order of compositions was randomized across participants.

MEG data acquisition

Request a detailed protocol

Neural activity was recorded on a 275-channel axial gradiometer MEG system (VSM/CTF Systems) in a magnetically shielded room, while the participant was seated. Eight malfunctioning channels were disabled during the recording or removed during preprocessing, leaving 267 MEG channels in the recorded data. We monitored the head position via three fiducial coils (left and right ear, nasion). When the head movement exceeded 5 mm, in between listening periods, the head position was shown to the participant, and they were instructed to reposition themselves (Stolk et al., 2013). All data were low-pass filtered online at 300 Hz and digitized at a sampling rate of 1200 Hz.

Further data acquisition

Request a detailed protocol

For source analysis, the head shape and the location of the three fiducial coils were measured using a Polhemus 3D tracking device. T1-weighted anatomical MRI scans were acquired on a 3T MRI system (Siemens) after the MEG session if these were not already available from the local database (MP-RAGE sequence with a GRAPPA acceleration factor of 2, TR = 2.3 s, TE = 3.03ms, voxel size 1 mm isotropic, 192 transversal slices, 8 ° flip angle). Additionally, during the MEG session, eye position, pupil diameter and blinks were recorded using an Eyelink 1000 eye tracker (SR Research) and digitized at a sampling rate of 1200 Hz. After the experiment, participants completed a questionnaire including a validated measure of musicality, the Goldsmith Musical Sophistication Index (Müllensiefen et al., 2014). The eye tracking and questionnaire data were not analysed here.

EEG dataset

Request a detailed protocol

In addition, we analysed an open data set from a recently published study (Di Liberto et al., 2020) including EEG recordings from 20 participants (10 musicians, 10 non-musicians) listening to music. The musical stimuli were 10 violin compositions by J. S. Bach synthesized using a piano sound (duration: total = 27 min, median = 161.5 s, MAD = 18.5 s; note events: total = 7839, median = 631, MAD = 276.5; see Appendix 1—table 1), each presented three times in pseudo-randomized order (total listening time = 80 min). The median note duration was 145ms (MAD across all notes = 32ms, min = 70ms, max = 2571ms), with a median inter-note onset interval of 150ms (MAD across all notes = 30ms, min = 74ms, max = 2571ms). EEG was acquired using a 64-electrode BioSemi Active Two system and digitized at a sampling rate of 512 Hz.

Music analysis

Request a detailed protocol

We used three types of computational models of music to investigate human listeners’ melodic expectations: the Temperley model (Temperley, 2008; Temperley, 2014), the IDyOM model (Pearce and Wiggins, 2012), and the Music Transformer (Huang et al., 2018). Based on their differences in computational architecture, we used these models to operationalize different sources of melodic expectations. All models take MIDI data as input, specifically note pitch values $X$ ranging discretely from 0 to 127 (8.18–12543.85 Hz; middle C = 60, ~264 Hz). The models output a probability distribution for the next note pitch at time point $t$, $X_t$, given a musical context of $k$ preceding consecutive note pitches:

$$P(X_t \mid x_{t-k}^{t-1}), \quad \text{where } X \in \{0, \dots, 127\},\ k > 0,\ t \geq 0.$$

For the first note in each composition, we assumed a uniform distribution across pitches ($P(X_0 = x) = 1/128$). Based on these probability distributions, we computed the surprise $S$ of an observed note pitch $x_t$ given the musical context as

$$S(x_t) = -\log_e P(x_t \mid x_{t-k}^{t-1}).$$

Likewise, the uncertainty U associated with predicting the next note pitch was defined as the entropy of the probability distribution across all notes in the alphabet:

$$U_t = -\sum_{x=0}^{127} P(X_t = x \mid x_{t-k}^{t-1}) \log_e P(X_t = x \mid x_{t-k}^{t-1}).$$
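As a concrete illustration, both quantities can be computed directly from a model’s predictive distribution. This is a minimal sketch in Python; the function names and the use of NumPy are ours, not part of the original analysis code:

```python
import numpy as np

ALPHABET_SIZE = 128  # MIDI pitch values 0..127

def surprise(p_next, observed_pitch):
    """Surprise of the observed pitch: -log_e P(x_t | context)."""
    return -np.log(p_next[observed_pitch])

def uncertainty(p_next):
    """Entropy of the predictive distribution across all 128 pitches."""
    p = p_next[p_next > 0]  # skip zero-probability pitches (0 * log 0 = 0)
    return -np.sum(p * np.log(p))

# First note of a composition: uniform distribution across pitches
p_first = np.full(ALPHABET_SIZE, 1.0 / ALPHABET_SIZE)
```

Under the uniform distribution assumed for the first note, both surprise and uncertainty equal log 128 (about 4.85 nats), the maximum for a 128-symbol alphabet.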

Training corpora

Request a detailed protocol

All models were trained on the Monophonic Corpus of Complete Compositions (MCCC) (https://osf.io/dg7ms/), which consists of 623 monophonic pieces (Note events: total = 500,000, median = 654, MAD = 309). The corpus spans multiple musical periods and composers and matches the statistics of the musical stimuli used in the MEG and EEG study regarding the distribution of note pitch and pitch interval (Appendix 1—figure 1) as well as the proportion of major key pieces (MCCC: ~81%, MusicMEG: ~74%, but MusicEEG: 20%). Furthermore, the Maestro corpus V3 (Hawthorne et al., 2019, https://magenta.tensorflow.org/datasets/maestro), which comprises 1276 polyphonic compositions collected from human piano performances (Duration: total = 200 h, note events: total = 7 million), was used for the initial training of the Music Transformer (see below).

Probabilistic Model of Melody Perception | Temperley

Request a detailed protocol

The Probabilistic Model of Melody Perception (Temperley, 2008; Temperley, 2014) is a Bayesian model based on three interpretable principles established in musicology. Therefore, it has been coined a Gestalt-model (Morgan et al., 2019). The three principles are modeled by probability distributions (discretized for integer pitch values), whose free parameters were estimated, in line with previous literature, based on the MCCC:

  1. Pitches $x_t$ cluster in a narrow range around a central pitch $c$ (central pitch tendency):

    $$x_t \sim \mathcal{N}(c, v_r), \quad \text{where } c \sim \mathcal{N}(c_0, \mathrm{var}_{c_0}).$$

    The parameters $c_0$ and $\mathrm{var}_{c_0}$ were set to the mean and variance of compositions’ mean pitch in the training corpus ($c_0 = 72$, $\mathrm{var}_{c_0} = 34.4$). The variance of the central pitch profile $v_r$ was set to the variance of each melody’s first note around its mean ($v_r = 83.2$).

  2. Pitches tend to be close to the previous pitch $x_{t-1}$; in other words, pitch intervals tend to be small (pitch proximity):

    $$x_t \sim \mathcal{N}(x_{t-1}, v_x)$$

    The variance of the pitch proximity profile $v_x$ was estimated as the variance of pitches around $x_{t-1}$, considering only notes where $x_{t-1} = c$ ($v_x = 18.2$).

  3. Depending on the key, certain pitches occur more frequently given their scale degree (the position of a pitch relative to the tonic of the key). This key profile is modeled as the probability of a scale degree conditioned on the key (12 major and 12 minor keys), spread out across several octaves, and weighted by the probability of major and minor keys ($p_{\mathrm{maj}} = .81$).

The final model multiplicatively combines these distributions to give the probability of the next note pitch given the context. The C code was provided by David Temperley in personal communication and adapted to output probabilities for all possible pitch values X. Specific choices in principles 1–3 above were made in accordance with earlier work (Morgan et al., 2019; Temperley, 2008; Temperley, 2014).
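The multiplicative combination of the three principles can be sketched as follows. This is an illustrative simplification, not Temperley’s original C implementation: the function names are ours, and the key profile is treated as a fixed, precomputed length-128 vector rather than being marginalized over 24 keys:

```python
import numpy as np

PITCHES = np.arange(128)  # MIDI pitch alphabet

def gaussian_profile(mean, var):
    """Discretized Gaussian over integer MIDI pitches, normalized to sum to 1."""
    p = np.exp(-(PITCHES - mean) ** 2 / (2 * var))
    return p / p.sum()

def temperley_next_note(prev_pitch, central_pitch, key_profile,
                        v_r=83.2, v_x=18.2):
    """Multiplicatively combine the three principles and renormalize."""
    p = (gaussian_profile(central_pitch, v_r)   # 1. central pitch tendency
         * gaussian_profile(prev_pitch, v_x)    # 2. pitch proximity
         * key_profile)                         # 3. key profile (length 128)
    return p / p.sum()
```

Because the product of two Gaussians is again Gaussian with a precision-weighted mean, the combined prediction is pulled between the previous pitch and the central pitch, with the tighter proximity profile ($v_x = 18.2$) dominating over the broader central pitch profile ($v_r = 83.2$).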

Information Dynamics of Music model | IDyOM

Request a detailed protocol

The Information Dynamics of Music (IDyOM) model is an unsupervised statistical learning model, specifically a variable order Markov model (Pearce, 2005; Pearce and Wiggins, 2012). Based on n-grams and the alphabet $X$, the probability of a note pitch $x$ at time point $t$, $x_t$, given a context sequence of length $k$, $x_{t-k}^{t-1}$, is defined as the relative n-gram frequency of the continuation compared to the context:

$$P(x_t \mid x_{t-k}^{t-1}) = \frac{\mathrm{count}(x_{t-k}^{t})}{\mathrm{count}(x_{t-k}^{t-1})}.$$

The probabilities are computed for every possible n-gram length up to a bound k and combined through interpolated smoothing. The context length was, therefore, manipulated via the n-gram order bound. The model can operate on multiple musical features, called viewpoints. Here, we use pitch (in IDyOM terminology cpitch) to predict pitch, in line with the other models.
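The core relative-frequency estimate can be sketched as follows, without the interpolated smoothing and escape probabilities that IDyOM layers on top. The function name and the simple boundary handling (only contexts followed by another note are counted) are our assumptions:

```python
def ngram_probability(context, continuation, corpus, k):
    """Relative n-gram frequency: count(context + continuation) / count(context).

    `corpus` is a list of pitch sequences; `context` is truncated to its
    last k pitches. Returns None for unseen contexts (where real IDyOM
    would back off to shorter n-grams via smoothing)."""
    context = tuple(context[-k:])
    num = den = 0
    for piece in corpus:
        for i in range(len(piece) - k):
            if tuple(piece[i:i + k]) == context:
                den += 1
                if piece[i + k] == continuation:
                    num += 1
    return num / den if den else None
```

For example, in the toy corpus `[[60, 62, 64, 62, 60, 62, 64]]` with `k = 1`, the pitch 62 is followed by 64 in two of its three occurrences, giving a probability of 2/3.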

The IDyOM model class entails three different subtypes: a short-term model (stm), a long-term model (ltm), and a combination of the former two (both). The IDyOM stm model rests solely on the recent context in the current composition. As such, it approximates online statistical learning of short-term regularities in the present piece. The IDyOM ltm model, on the other hand, is trained on a corpus, reflecting musical enculturation, that is (implicit) statistical learning through long-term exposure to music. The IDyOM both model combines the stm and ltm model weighted by their entropy at each note.

Music Transformer

Request a detailed protocol

The Music Transformer (MT) (Huang et al., 2018) is a state-of-the-art neural network model that was developed to generate music with improved long-range coherence. To this end, it takes advantage of a Transformer architecture (Vaswani et al., 2017) and relative self-attention (Shaw et al., 2018), which better capture long-range structure in sequences than for example n-gram models. The MT is the only model used here that can process polyphonic music. This is possible due to a representation scheme that comprises four event types (note onset, note offset, velocity, and time-shift events) for encoding and decoding MIDI data. The note onset values are equivalent to pitch values and were used to derive probability distributions. Our custom scripts were based on an open adaptation for PyTorch (https://github.com/gwinndr/MusicTransformer-Pytorch; Gwinn et al., 2022).

The Music Transformer was initially trained on the polyphonic Maestro corpus for 300 epochs using the training parameters from the original paper (learning rate = 0.1, batch size = 2, number of layers = 6, number of attention heads = 6, dropout rate = 0.1, Huang et al., 2018). The training progress was monitored based on the cross-entropy loss on the training data (80%) and test data (20%) (Figure 9A). The cross-entropy loss is defined as the average surprise across all notes. The model is, thus, trained to minimize the surprise for upcoming notes. The minimal loss we achieved (1.97) was comparable to the original paper (1.835). The divergence between the loss curves for the training and test set indicated some overfitting starting from about epoch 50, albeit without a noticeable decrease in test performance. Therefore, we selected the weights at epoch 150 to ensure stable weights without severe overfitting.

Training (A) and fine-tuning (B) of the Music Transformer on the Maestro corpus and MCCC, respectively.

Cross-entropy loss (average surprise across all notes) on the test (dark) and training (light) data as a function of training epoch.

In order to adjust the model to monophonic music, we finetuned the pretrained Music Transformer on the MCCC for 100 epochs using the same training parameters (Figure 9B). Again, the training progress was evaluated based on the cross-entropy loss and the weights were selected based on the minimal loss. While the loss started at a considerably lower level on this monophonic dataset (0.78), it continued to decrease until epoch 21 (0.59) and then quickly started to increase, indicating overfitting on the training data. Therefore, the weights from epoch 21 were selected for further analyses.

Music model comparison

Request a detailed protocol

We compared the models’ predictive performance on music data as a function of model class and context length. Thereby, we aimed to scrutinize the hypothesis that the models reflect different sources of melodic expectations. We used the musical stimuli from the MEG and EEG study as test sets and assessed the accuracy, median surprise and uncertainty across compositions.

M|EEG analysis

Preprocessing

Request a detailed protocol

The MEG data were preprocessed in Matlab 2018b using FieldTrip (Oostenveld et al., 2011). We loaded the raw data separately for each composition including about 3 s pre- and post-stimulus periods. Based on the reference sensors of the CTF MEG system, we denoised the recorded MEG data using third-order gradient correction, after which the per-channel mean across time was subtracted. We then segmented the continuous data in 1 s segments. Using the semi-automatic routines in FieldTrip, we marked noisy segments according to outlying variance, such as MEG SQUID jumps, eye blinks or eye movements (based on the unfiltered data) or muscle artifacts (based on the data filtered between 110 and 130 Hz). After removal of noisy segments, the data were downsampled to 400 Hz. Independent component analysis (ICA) was then performed on the combined data from all compositions for each participant to identify components that reflected artifacts from cardiac activity, residual eye movements or blinks. Finally, we reloaded the data without segmentation, removed bad ICA components and downsampled the data to 60 Hz for subsequent analyses.

A similar preprocessing pipeline was used for the EEG data. Here, the data were re-referenced using the linked mastoids. Bad channels were identified via visual inspection and replaced through interpolation after removal of bad ICA components.

TRF analysis

Request a detailed protocol

We performed time-resolved linear regression on the M|EEG data to investigate the neural signatures of melodic surprise and uncertainty (Figure 1), using the regression evoked response technique (‘rERP’, Smith and Kutas, 2015). This approach allowed us to deconvolve the responses to different features and subsequent notes and correct for their temporal overlap. The preprocessed M|EEG data were loaded and band-pass filtered between 0.5 and 8 Hz (bidirectional FIR filter). All features of interest were modeled as impulse regressors with one value per note, either binary ($x \in \{0, 1\}$) or continuous ($x \in \mathbb{R}$). The M|EEG channel data and continuous regressors were z-scored. We constructed a time-expanded regression matrix $M$, which contained time-shifted versions of each regressor column-wise (tmin = –0.2 s, tmax = 1 s relative to note onsets, 73 columns per regressor given the sampling rate of 60 Hz). After removal of bad time points identified during M|EEG preprocessing, we estimated the regression weights $\hat{\beta}$ using ordinary least squares (OLS) regression:

$$\hat{\beta} = (M^\top M)^{-1} M^\top y.$$

Collectively, the weights form a response function known as the regression evoked response or temporal response function (TRF; Crosse et al., 2016; Ding and Simon, 2012). The TRF depicts how a feature modulates neural activity across time. Here, the units are arbitrary, since both binary and z-scored continuous regressors were included. Model estimation was performed using custom Python code built on the MNE rERP implementation (Gramfort et al., 2013; Smith and Kutas, 2015). Previous similar work has used ridge-regularized regression, rather than OLS (Di Liberto et al., 2020). We instead opted to use OLS, since the risk for overfitting was low given the sparse design matrices and low correlations between the time-shifted regressors. To make sure this did not unduly influence our results, we also implemented ridge-regularized regression with the optimal cost hyperparameter alpha estimated via nested cross-validation. OLS (alpha = 0) was always among the best-fitting models and any increase in predictive performance for alpha >0 for some participants was negligible. Results for this control analysis are shown for the best fitting model for the MEG and EEG data in Appendix 1—figures 5 and 6, respectively. In the rest of the manuscript we thus report the results from the OLS regression.
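The time-expansion and OLS steps can be sketched as follows. This is a simplified illustration (the actual analysis used the MNE rERP implementation); the function names are ours, and we use `np.linalg.lstsq` rather than the explicit normal-equation inverse for numerical stability:

```python
import numpy as np

def time_expand(impulses, lags):
    """Build the time-expanded design matrix: one column per (regressor, lag).

    `impulses`: array (n_times, n_regressors) with one value per note onset;
    `lags`: list of sample offsets, e.g. range(-12, 61) for -0.2..1 s at 60 Hz."""
    n_times, n_reg = impulses.shape
    M = np.zeros((n_times, n_reg * len(lags)))
    for r in range(n_reg):
        for j, lag in enumerate(lags):
            col = np.roll(impulses[:, r], lag)
            if lag > 0:
                col[:lag] = 0   # zero the samples wrapped around by roll
            elif lag < 0:
                col[lag:] = 0
            M[:, r * len(lags) + j] = col
    return M

def fit_trf(M, y):
    """Ordinary least squares fit of the TRF weights."""
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    return beta
```

With well-separated impulses, the estimated weights exactly recover the response kernel used to generate the signal, which is the deconvolution property the rERP approach relies on.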

Models and regressors

Request a detailed protocol

The Onset model contained a binary regressor, which coded for note onsets and was included in all other models too. The Baseline model added a set of regressors to control for acoustic properties of the music and other potential confounds. Binary regressors were added to code for (1) very high pitch notes (>90% quantile), (2) very low pitch notes (<10% quantile), since extreme pitch values go along with differences in perceived loudness, timbre, and other acoustic features; (3) the first note in each composition (i.e. composition onset); (4) repeated notes, to account for the repetition suppression effect and separate it from the surprise response. Since the MEG experiment used stimuli generated by different musical instruments, we additionally controlled for the type of sound, by including binary regressors for oboe and flute sounds. This was done since the different sounds have different acoustic properties, such as a lower attack time for piano sounds and longer sustain for oboe or flute sounds. For computing continuous acoustic regressors, we downsampled the audio signal to 22.05 kHz. We computed the mean for each variable of interest across the note duration to derive a single value for each note and create impulse regressors. The root-mean-square value (RMS) of the audio signal captures differences in (perceived) loudness. Flatness, defined as the ratio between the geometric and the arithmetic mean of the acoustic signal, controlled for differences in timbre. The variance of the broad-band envelope represented acoustic edges (McDermott and Simoncelli, 2011). The broad-band envelope was derived by (a) filtering the downsampled audio signal through a gammatone filter bank (64 logarithmically spaced filter bands ranging between 50 and 8000 Hz), which simulates human auditory processing; (b) taking the absolute value of the Hilbert transform of the 64 band signals; (c) averaging across bands (Zuk et al., 2021). 
The baseline regressors were also included in all of the following models. The main models of interest added note-level surprise, uncertainty, and/or their interaction from the different computational models of music, varying the model class and context length.
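At the note level, surprise and uncertainty reduce to two quantities computed from a model's predictive distribution over the next pitch. A minimal sketch (illustrative only; the actual estimates came from the Temperley, IDyOM, and Music Transformer models):

```python
import math

def surprise_and_uncertainty(p):
    """p: predictive distribution over candidate next pitches (values sum to 1).
    Surprise is -log2 p(observed note); uncertainty is the Shannon entropy of p."""
    def surprise(note):
        return -math.log2(p[note])
    uncertainty = -sum(q * math.log2(q) for q in p.values() if q > 0)
    return surprise, uncertainty

# Toy distribution: the model most strongly expects C4
p = {"C4": 0.5, "D4": 0.25, "E4": 0.25}
surprise, uncertainty = surprise_and_uncertainty(p)
# surprise("D4") -> 2.0 bits; uncertainty -> 1.5 bits
```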

Model comparison

We applied a fivefold cross-validation scheme (train: 80%, test: 20%, time window: 0–0.6 s) (Varoquaux et al., 2017) to compare the regression models’ predictive performance on the M|EEG data. We computed the correlation between the predicted and recorded neural signal across time for each fold and channel on the held-out data. To increase the sensitivity of subsequent analyses, we selected the channels most responsive to musical notes for each participant, according to the cross-validated performance of the Onset model (>2/3 quantile). The threshold was determined through visual inspection of the spatial topographies but did not affect the main results. The overall model performance was then determined as the median across folds and the mean across selected channels. Since predictive performance was assessed on unseen held-out data, this approach controlled for overfitting the neural data and for differences in the number of regressors and free model parameters. For statistical inference, we computed one-sample or paired t-tests with multiple-comparison correction (Bonferroni-Holm method).
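The evaluation scheme can be sketched as follows. This is an illustrative NumPy reconstruction with contiguous folds; the function names and implementation details are ours, not the published code.

```python
import numpy as np

def crossval_performance(X, Y, n_folds=5):
    """Fivefold cross-validated encoding performance.
    X: (n_samples, n_regressors) time-expanded design matrix.
    Y: (n_samples, n_channels) M/EEG data.
    Returns r: (n_folds, n_channels), the correlation between predicted
    and recorded signal on the held-out portion of each fold."""
    idx = np.arange(len(X))
    folds = np.array_split(idx, n_folds)  # contiguous 80/20 train/test splits
    r = np.zeros((n_folds, Y.shape[1]))
    for k, test in enumerate(folds):
        train = np.setdiff1d(idx, test)
        beta, *_ = np.linalg.lstsq(X[train], Y[train], rcond=None)  # plain OLS
        pred = X[test] @ beta
        for c in range(Y.shape[1]):
            r[k, c] = np.corrcoef(pred[:, c], Y[test][:, c])[0, 1]
    return r

def overall_performance(r_model, r_onset, q=2 / 3):
    # Select the channels most responsive to notes via the Onset model
    # (> 2/3 quantile of the per-channel score), then take the median
    # across folds and the mean across the selected channels
    chan_score = np.median(r_onset, axis=0)
    selected = chan_score > np.quantile(chan_score, q)
    return np.median(r_model[:, selected], axis=0).mean()
```

Because the performance metric is computed only on held-out samples, a model with more regressors gains nothing unless those regressors genuinely generalize.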

Cluster-based statistics

For visualizations and cluster-based statistics, we transformed the regression coefficients from the axial MEG data to a planar representation using FieldTrip (Bastiaansen and Knösche, 2000). The regression coefficients estimated on the axial gradient data were linearly transformed to planar gradient data, after which the resulting synthetic horizontal and vertical planar gradient components were non-linearly combined into a single magnitude per original MEG sensor. For the planar-transformed coefficients, we selected the most responsive channels according to the coefficients of the note onset regressor in the Onset model (>5/6 quantile, time window: 0–0.6 s). The threshold was determined through visual inspection of the spatial topographies but did not affect the main results. We then used cluster-based permutation tests (Maris and Oostenveld, 2007) to identify significant spatio-temporally clustered effects relative to the baseline time window (−0.2–0 s; 2000 permutations). Using threshold-free cluster enhancement (TFCE; Smith and Nichols, 2009), we further determined significant time points at which at least one selected channel showed a significant effect. Mass-univariate testing was done via one-sample t-tests on the baseline-corrected M|EEG data with ‘hat’ variance adjustment (σ=1e−3) (Ridgway et al., 2012).
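The planar combination and the ‘hat’-adjusted mass-univariate test can be sketched as below. This is an illustrative reconstruction; the actual analysis used FieldTrip and the cited methods, and the exact form of the variance adjustment here follows common implementations of Ridgway et al. (2012) rather than the study's own code.

```python
import numpy as np

def combine_planar(horizontal, vertical):
    # Non-linear combination of synthetic horizontal and vertical planar
    # gradient components into a single magnitude per original sensor
    return np.hypot(horizontal, vertical)

def ttest_1samp_hat(data, sigma=1e-3):
    """One-sample t-test with 'hat' variance adjustment.
    data: (n_participants, n_channels, n_times) baseline-corrected values.
    Small per-point variance estimates are inflated towards the maximum
    variance, stabilising t-values where noise estimates are near zero."""
    n = data.shape[0]
    mean = data.mean(axis=0)
    var = data.var(axis=0, ddof=1)
    var = var + sigma * var.max()  # the 'hat' regularization step
    return mean / np.sqrt(var / n)
```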

Source analysis

To localize the neural sources associated with the different regressors, we used equivalent current dipole (ECD) modeling. Individuals’ anatomical MRI scans were realigned to CTF space based on the headshape data and the fiducial coil locations, using a semi-automatic procedure in FieldTrip. The lead field was computed using a single-shell volume conduction model (Nolte, 2003). Based on individuals’ time-averaged axial gradient TRF data in the main time window of interest (180–240ms), we used a non-linear fitting algorithm to estimate the dipole configuration that best explained the observed sensor maps (FieldTrip’s ft_dipolefitting). We compared three models, with one to three dipoles per hemisphere. As the final solution per participant, we chose the one with the largest adjusted-R² score in explaining the observed sensor topography (thereby adjusting for the additional 12 free parameters introduced by each extra dipole: 2 hemispheres times x/y/z/dx/dy/dz). As the starting point for the search, we roughly specified bilateral primary auditory cortex (MNI coordinates x/y/z [48, −28, 10] mm (R), [−40, −28, 6] mm (L); Anderson et al., 2011; Kiviniemi et al., 2009), with a small random jitter (normally distributed with SD = 1 mm) to prevent exact overlap in the starting positions of multiple dipoles. Note that the initial dipole location has a negligible effect on the final solution if the data are well explained by the final fit model; this was the case for our data (see Results). For visualization, we estimated the (volumetric) density of best-fit dipole locations across participants and projected this onto the average MNI brain template, separately for each regressor.
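The model selection between the one-, two-, and three-dipole-per-hemisphere fits can be sketched as follows (illustrative only; FieldTrip’s ft_dipolefitting performs the actual non-linear fit):

```python
import numpy as np

def adjusted_r2(observed, predicted, n_params):
    """Adjusted R^2 of a dipole model's predicted sensor topography.
    Penalizes the extra free parameters of additional dipoles
    (12 per added bilateral pair: 2 hemispheres x position/orientation x 3)."""
    n = observed.size
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1)

def select_dipole_model(observed, fits):
    """fits: list of (predicted_topography, n_params) for the candidate models.
    Returns the index of the model with the largest adjusted R^2."""
    scores = [adjusted_r2(observed, pred, p) for pred, p in fits]
    return int(np.argmax(scores))
```

With equal residual error, the model with fewer free parameters always wins; an extra dipole pair is only selected if it improves the fit enough to offset its 12-parameter penalty.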

Appendix 1

Appendix 1—table 1
Overview of the musical stimuli presented in the MEG (top) and EEG study (bottom).
Music (MEG)

Composer | Composition | Year | Key | Time signature | Tempo (bpm) | Duration (sec) | Notes | Sound
Benjamin Britten | Metamorphoses Op. 49, II. Phaeton | 1951 | C maj | 4/4 | 110 | 95 | 384 | Oboe
Benjamin Britten | Metamorphoses Op. 49, III. Niobe | 1951 | Db maj | 4/4 | 60 | 101 | 171 | Oboe
Benjamin Britten | Metamorphoses Op. 49, IV. Bacchus | 1951 | F maj | 4/4 | 100 | 114 | 448 | Oboe
César Franck | Violin Sonata IV. Allegretto poco mosso | 1886 | A maj | 4/4 | 150 | 175 | 458 | Flute
Carl Philipp Emanuel Bach | Sonata for Solo Flute, Wq.132/H.564 III. | 1763 | A min | 3/8 | 98 | 275 | 1358 | Flute
Ernesto Köhler | Flute Exercises Op. 33a, V. Allegretto | 1880 | G maj | 4/4 | 124 | 140 | 443 | Flute
Ernesto Köhler | Flute Exercises Op. 33b, VI. Presto | 1880 | D min | 6/8 | 176 | 134 | 664 | Piano
Georg Friedrich Händel | Flute Sonata Op. 1 No. 5, HWV 363b, IV. Bourrée | 1711 | G maj | 4/4 | 132 | 84 | 244 | Oboe
Georg Friedrich Händel | Flute Sonata Op. 1 No. 3, HWV 379, IV. Allegro | 1711 | E min | 3/8 | 96 | 143 | 736 | Piano
Joseph Haydn | Little Serenade | 1785 | F maj | 3/4 | 92 | 81 | 160 | Oboe
Johann Sebastian Bach | Flute Partita BWV 1013, II. Courante | 1723 | A min | 3/4 | 64 | 176 | 669 | Flute
Johann Sebastian Bach | Flute Partita BWV 1013, IV. Bourrée angloise | 1723 | A min | 2/4 | 62 | 138 | 412 | Oboe
Johann Sebastian Bach | Violin Concerto BWV 1042, I. Allegro | 1718 | E maj | 2/2 | 100 | 122 | 698 | Piano
Johann Sebastian Bach | Violin Concerto BWV 1042, III. Allegro Assai | 1718 | E maj | 3/8 | 92 | 80 | 413 | Piano
Ludwig van Beethoven | Sonatina (Anh. 5 No. 1) | 1807 | G maj | 4/4 | 128 | 210 | 624 | Flute
Muzio Clementi | Sonatina Op. 36 No. 5, III. Rondo | 1797 | G maj | 2/4 | 112 | 187 | 915 | Piano
Modest Mussorgsky | Pictures at an Exhibition - Promenade | 1874 | Bb maj | 5/4 | 80 | 106 | 179 | Oboe
Pyotr Ilyich Tchaikovsky | The Nutcracker Suite - Russian Dance Trepak | 1892 | G maj | 2/4 | 120 | 78 | 396 | Piano
Wolfgang Amadeus Mozart | The Magic Flute K620, Papageno’s Aria | 1791 | F maj | 2/4 | 72 | 150 | 452 | Flute
Total | | | | | | 2589 | 9824 |
Music (EEG)

Composer | Composition | Year | Key | Time signature | Tempo (bpm) | Duration (sec) | Notes | Sound
Johann Sebastian Bach | Flute Partita BWV 1013, I. Allemande | 1723 | A min | 4/4 | 100 | 158 | 1022 | Piano
Johann Sebastian Bach | Flute Partita BWV 1013, II. Corrente | 1723 | A min | 3/4 | 100 | 154 | 891 | Piano
Johann Sebastian Bach | Flute Partita BWV 1013, III. Sarabande | 1723 | A min | 3/4 | 70 | 120 | 301 | Piano
Johann Sebastian Bach | Flute Partita BWV 1013, IV. Bourree | 1723 | A min | 2/4 | 80 | 135 | 529 | Piano
Johann Sebastian Bach | Violin Partita BWV 1004, I. Allemande | 1723 | D min | 4/4 | 47 | 165 | 540 | Piano
Johann Sebastian Bach | Violin Sonata BWV 1001, IV. Presto | 1720 | G min | 3/8 | 125 | 199 | 1604 | Piano
Johann Sebastian Bach | Violin Partita BWV 1002, I. Allemande | 1720 | Bb min | 4/4 | 50 | 173 | 620 | Piano
Johann Sebastian Bach | Violin Partita BWV 1004, IV. Gigue | 1723 | D min | 12/8 | 120 | 182 | 1352 | Piano
Johann Sebastian Bach | Violin Partita BWV 1006, II. Loure | 1720 | E maj | 6/4 | 80 | 134 | 338 | Piano
Johann Sebastian Bach | Violin Partita BWV 1006, III. Gavotte | 1720 | E maj | 4/4 | 140 | 178 | 642 | Piano
Total | | | | | | 1598 | 7839 |
Appendix 1—figure 1
Comparison of the pitch (left) and pitch interval distributions (right) for the music data from the MEG study (top), EEG study (middle), and MCCC corpus (bottom).
Appendix 1—figure 2
Model performance on the musical stimuli used in the EEG study.

(A) Comparison of music model performance in predicting upcoming note pitch, as composition-level accuracy (left; higher is better), median surprise across notes (middle; lower is better), and median uncertainty across notes (right). Context length for each model is the best performing one across the range shown in (B). Vertical bars: single compositions, circle: median, thick line: quartiles, thin line: quartiles ±1.5 × interquartile range. (B) Accuracy of note pitch predictions (median across 10 compositions) as a function of context length and model class (same color code as (A)). Dots represent maximum for each model class. (C) Correlations between the surprise estimates from the best models.

Appendix 1—figure 3
Comparison of the MEG TRFs and spatial topographies for the surprise estimates from the best models of each model class.
Appendix 1—figure 4
Comparison of the EEG TRFs and spatial topographies for the surprise estimates from the best models of each model class.
Appendix 1—figure 5
Comparison of the predictive performance on the MEG data using ridge-regularized regression, with the optimal cost hyperparameter alpha estimated using nested cross-validation.

Results are shown for the best-performing model (MT, context length of 8 notes). Each line represents one participant. Lower panel: raw predictive performance (r). Upper panel: predictive performance expressed as percentage of a participant’s maximum.

Appendix 1—figure 6
Comparison of the predictive performance on the EEG data using ridge-regularized regression, with the optimal cost hyperparameter alpha estimated using nested cross-validation.

Results are shown for the best-performing model (MT, context length of 7 notes). Each line represents one participant. Lower panel: raw predictive performance (r). Upper panel: predictive performance expressed as percentage of a participant’s maximum.

Data availability

All data have been deposited into the Donders Repository under CC-BY-4.0 license, under identifier https://doi.org/10.34973/5qxw-nn97.

The following data sets were generated
    1. Kern P
    2. Heilbron M
    3. de Lange FP
    4. Spaak E
    (2022) Donders Repository
    Tracking predictions in naturalistic music listening using MEG and computational models of music.
    https://doi.org/10.34973/5qxw-nn97
The following previously published data sets were used
    1. DiLiberto et al
    (2020) Dryad Digital Repository
    Cortical encoding of melodic expectations in human temporal cortex.
    https://doi.org/10.5061/dryad.g1jwstqmh

References

  1. Conference
    1. Goodkind A
    2. Bicknell K
    (2018) Predictive power of word surprisal for reading times is a linear function of language model quality
    Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018). pp. 10–18.
    https://doi.org/10.18653/v1/W18-0102
  2. Book
    1. Jurafsky D
    2. Martin JH
    (2000)
    Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition
    Prentice Hall PTR.
  3. Book
    1. Meyer LB
    (1957)
    Emotion and Meaning in Music
    University of Chicago Press.
  4. Book
    1. Narmour E
    (1990)
    The Analysis and Cognition of Basic Melodic Structures: The Implication-Realization Model
    University of Chicago Press.
  5. Book
    1. Narmour E
    (1992)
    The Analysis and Cognition of Melodic Complexity: The Implication-Realization Model
    University of Chicago Press.
  6. Book
    1. Pearce MT
    (2005)
    The Construction and Evaluation of Statistical Models of Melodic Structure in Music Perception and Composition
    Doctoral thesis, City University London.
  7. Book
    1. Pratt H
    (2011) Sensory ERP Components
    The Oxford Handbook of Event-Related Potential Components.
    https://doi.org/10.1093/oxfordhb/9780195374148.013.0050
  8. Conference
    1. Shaw P
    2. Uszkoreit J
    3. Vaswani A
    (2018) Self-Attention with Relative Position Representations
    Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
    https://doi.org/10.18653/v1/N18-2074
  9. Book
    1. Thaut MH
    (2014)
    Musical echoic memory training (MEM)
    In: Thaut MH, editors. Handbook of Neurologic Music Therapy. Oxford University Press. pp. 311–313.
  10. Book
    1. Vaswani A
    2. Shazeer N
    3. Parmar N
    4. Uszkoreit J
    5. Jones L
    6. Gomez AN
    7. Kaiser Ł
    8. Polosukhin I
    (2017)
    Attention is all you need
    In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, editors. Advances in Neural Information Processing Systems. Curran Associates, Inc. pp. 5998–6008.

Decision letter

  1. Jonas Obleser
    Reviewing Editor; University of Lübeck, Germany
  2. Christian Büchel
    Senior Editor; University Medical Center Hamburg-Eppendorf, Germany
  3. William Sedley
    Reviewer; Newcastle University, United Kingdom
  4. Keith Doelling
    Reviewer

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Cortical activity during naturalistic music listening reflects short-range predictions based on long-term experience" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Christian Büchel as the Senior Editor. The following individuals involved in the review of your submission have agreed to reveal their identity: William Sedley (Reviewer #2); Keith Doelling (Reviewer #3).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1) The one non-standard feature of the analysis is the lack of regularization (e.g., ridge). The authors should perform the analysis using regularization (via nested cross-validation) and test if the predictions are improved (if they have already done this, they should report the results).

2) The authors make a distinction between "Gestalt-like principles" and "statistical learning" but they never define what is meant by this distinction. The Temperley model encodes a variety of important statistics of Western music, including statistics such as keys that are unlikely to reflect generic Gestalt principles. The Temperley model builds in some additional structure such as the notion of a key, which the n-gram and transformer models must learn from scratch. In general, the models being compared differ in so many ways that it is hard to conclude much about what is driving the observed differences in prediction accuracy, particularly given the small effect sizes. The context manipulation is more controlled, and the fact that neural prediction accuracy dissociates from the model performance is potentially interesting. However, we were not confident that the authors have a good neural index of surprise for the reasons described above, and this might limit the conclusions that can be drawn from this manipulation in the manuscript as is.

3) The authors may overstate the advancement of the Music Transformer with the present stimuli, as its increase in performance requires a considerably longer context than the other models. Secondly, the Baseline model, to which the other models are compared, does not contain any pitch information on which these models operate on. As such, it's unclear if the advancements of these models come from being based on new information or the operations it performs on this information as claimed.

4) Source analysis: See below in Rev #1 and Rev #3 for concerns over the results and interpretation of the source localisation.

Reviewer #1 (Recommendations for the authors):

The one non-standard feature of the analysis is the lack of regularization (e.g., ridge). The authors should perform the analysis using regularization (via nested cross-validation) and test if the predictions are improved (if they have already done this, they should report the results).

Source localization analysis with MEG or EEG is highly uncertain. The conclusion that the results are coming from "fronto-temporal" areas doesn't strike me as adding much value; I'd be inclined to remove this analysis from the manuscript. Minimally, the authors should note that source localization is highly uncertain.

The authors should report the earphones/tubes used to present sounds.

Reviewer #3 (Recommendations for the authors):

– Figure 2: Music Transformer is presented as the state-of-the-art model throughout the paper, with the main advantage of grasping regularities on longer time scales. Yet the computational results in figure 2 tend to show that it does not bring much in terms of musical predictions compared to IDyOM. MT also needs a way larger context length to reach the same accuracy. It has a lower uncertainty, but this feature did not improve the results of the neural analysis. This point could be discussed to better understand why MT is a better model of human musical expectations and surprise.

– The source analysis is a bit puzzling to me. The results show that every feature (including Note Onset and Note Repetition) are localized in Broca's area, frontally located. Wouldn't the authors expect that a feature as fundamental as Note Onset should be clearly (or at least partially) localized in the primary auditory cortex? Because these results localize frontally, it is hard to fully trust the MT surprise results as well. Can the authors provide some form of sanity check to provide further clarity on these results? Perhaps the source localizing the M100 to show the analysis pipeline is not biased towards frontal areas? Do the authors expect note onsets in continuous music to be represented in higher-order areas than the representation of a single tone? Figure 7 and source-level results: the authors do not discuss the fact that the results are mainly found in a frontal area. It looks like there is no effect in the auditory cortex, which is surprising and should be discussed.

– The dipole fit shows a high correlation with behavior but it is not compared with any alternatives. Can the authors use some model comparison methods (e.g. AIC or BIC) to show that a single dipole per hemisphere is a better fit than two or three?

– Line 252: the order in which the transformation from axial to planar gradients is applied with respect to other processing steps (e.g., z-scoring of the MEG data and TRF fitting) is not clear. Is it the MEG data that was transformed as suggested here, or is the transformation applied to the TRF coefficients (as explained in line 710), with TRFs trained on axial gradiometers? This has downstream consequences that lead to a lack of clarity in the results. For example, the MEG data shows positive values in the repetition kernel which if the transformation was applied before the TRF would imply that repetition increases activity rather than reduces it as would be expected. From this, I infer that it is the TRF kernels that were transformed. I recognize that the authors can infer results regarding directionality in the EEG data but this analysis choice results in an unnecessary loss of information in the MEG analysis. Please clarify the methods and if the gradiometer transformation is performed on the kernels, I would recommend re-running the analysis with gradiometer transformation first.

– The Baseline model claims to include all relevant acoustic information but it does not include the note pitches themselves. As these pitches are provided to the model, it is difficult to tell whether the effects of surprise are related to model output predictions, or how well they represent their input. If the benefits from the models rely truly on their predictive aspect then including pitch information in the baseline model would be an important control. The authors have done their analysis in a way that fits what previous work has done but I think it is a good time to correct this error.

– I wonder if the authors could discuss a bit more about the self-attention component of the Music Transformer model. I'm not familiar with the model's inner workings but I am intrigued by this dichotomy of a larger context to improve musical predictions and a shorter context to improve neural predictions. I wonder if a part of this dissociation has to do with the self-attention feature of Transformers, that a larger context is needed to have a range of motifs to draw from but the size of the motifs that the model attends to should be of a certain size to fit neural predictions.

– Figure 3C: it could be interesting to show the same figure for IDyOM ltm. Given that context length does not impact ltm as much as MT, we could obtain different results. Since IDyOM ltm gets similar results to MT on the MEG data (figure 3A, no significant difference), it is thus hard to tell if the influence of context length comes from brain processing or the way MT works.

– Figure 8: Adding the uncertainty estimates did not improve the model's predictive performance compared to surprise alone, but what about TRFs trained on uncertainty without surprise? Without this result, it is hard to understand why the surprise was chosen over uncertainty.

– The sentence from lines 272 to 274 is not clear. In particular, the late negativity effect seems to be present in EEG data only, and it is thus hard to understand why a negative correlation between surprise estimates of subsequent notes would have such an effect in EEG and not MEG. Moreover, the same late negativity effect can be seen on the TRF of note onset but is not discussed.

– Some of the choices for the Temperley Model seem to unnecessarily oversimplify and restrict its potential performance. In the second principle, the model only assesses the relationships between neighboring notes when the preceding note is the tonal centroid. It would seem more prudent to include all notes to collect further data. In the third principle, the model marginalizes major and minor scales by weighting probabilities of each profile by the frequency of major and minor pieces in the database. Presumably, listeners can identify the minor or major key of the current piece (at least implicitly). Why not select the model for each piece, outright?

– Stimulus: Many small details make it so that the stimuli are not so naturalistic (MIDI velocity set to 100, monophonic, mono channel…). This results in a more controlled experiment, but the claim that it expands melodic expectation findings to naturalistic music listening is a bit bold.

– Line 478: authors refer to "original compositions" which may give the impression that the pieces were written for the experiment. From the rest of the text, I don't believe this to be true.

– Formula line 553: the first probability of Xt (on the left) should also be a conditional probability of Xt given previous values of x. This is the entropy of the probability distribution estimated by the model.

– Line 667: the study by Di Liberto from which the EEG data come uses a ridge regression (ridge regularization). Is there a reason to use a non-regularized regression in this case? This should be discussed in the methods.

https://doi.org/10.7554/eLife.80935.sa1

Author response

Essential revisions:

1) The one non-standard feature of the analysis is the lack of regularization (e.g., ridge). The authors should perform the analysis using regularization (via nested cross-validation) and test if the predictions are improved (if they have already done this, they should report the results).

Our motivation for the ‘plain’ ordinary least squares (OLS) was, firstly, that we use the regression ERP/F modelling framework (Smith and Kutas, 2015). This means that our design matrices were very sparse, with little correlation between the time-shifted regressors and hence a comparatively low risk of overfitting. Moreover, the model comparisons were always performed in a cross-validated fashion (thus any potential overfitting would reduce, rather than artificially inflate, model performance). However, we appreciate that we hereby deviate from earlier similar work, which used ridge-regularized regression.

We therefore now also implemented ridge-regularized regression, with the optimal cost hyperparameter α estimated using nested cross-validation. The results for this are shown in Appendix—Figure 5 (one curve per participant; regression of best-performing model (MT, context length 8) on MEG data):

This clearly demonstrates that the OLS we used previously (α = 0) is always among the best-fitting models. Even for those participants that show an increase in cross-validated r for non-zero regularization, these increases are negligible. We therefore report the same OLS models in the manuscript as before, and have now added the above figure to the supplemental information.

This also holds for the EEG data. Appendix—figure 6 shows the results for the best-performing model (MT, context length 7).

2) The authors make a distinction between "Gestalt-like principles" and "statistical learning" but they never define what is meant by this distinction. The Temperley model encodes a variety of important statistics of Western music, including statistics such as keys that are unlikely to reflect generic Gestalt principles. The Temperley model builds in some additional structure such as the notion of a key, which the n-gram and transformer models must learn from scratch. In general, the models being compared differ in so many ways that it is hard to conclude much about what is driving the observed differences in prediction accuracy, particularly given the small effect sizes. The context manipulation is more controlled, and the fact that neural prediction accuracy dissociates from the model performance is potentially interesting. However, we were not confident that the authors have a good neural index of surprise for the reasons described above, and this might limit the conclusions that can be drawn from this manipulation in the manuscript as is.

First of all, we would like to apologize for any lack of clarity regarding the distinction between Gestalt-like and statistical models. We take Gestalt-like models to be those that explain music perception as following a restricted set of rules, such as that adjacent notes tend to be close in pitch. In contrast, as the reviewer correctly points out, statistical learning models have no such a priori principles and must learn similar or other principles from scratch. Importantly, the distinction between these two classes of models is not one we make for the first time in the context of music perception. Gestalt-like models have a long tradition in musicology and the study of music cognition, dating back to Meyer (1957). The Implication-Realization model developed by Eugene Narmour (Narmour, 1990, 1992; Schellenberg, 1997) is another example of a rule-based theory of music listening, and it influenced the model by David Temperley, which we applied in the present study as the most recent influential Gestalt model of melodic expectations. Concurrently with the development of Gestalt-like models, a second strand of research framed music listening in terms of information theory and statistical learning (Bharucha, 1987; Cohen, 1962; Conklin and Witten, 1995; Pearce and Wiggins, 2012). Previous work has made the same distinction and compared models of music along the same axis (Krumhansl, 2015; Morgan et al., 2019a; Temperley, 2014). We have updated the manuscript to elaborate on this distinction and highlight that it is not uncommon.

Second, we emphasize that we compare the models directly in terms of their predictive performance both of upcoming musical notes and of neural responses. This predictive performance is not dependent on the internal details of any particular model; e.g. in principle it would be possible to include a “human expert” model where we ask professional composers to predict upcoming notes given a previous context. Because of this independence of the relevant comparison metric on model details, we believe comparing the models is justified. Again, this is in line with previously published work in music (Morgan et al., 2019a), language, (Heilbron et al., 2022; Schmitt et al., 2021; Wilcox et al., 2020), and other domains (Planton et al., 2021). Such work compares different models in how well they align with human statistical expectations by assessing how well different models explain predictability/surprise effects in behavioral and/or brain responses.

Third, regarding the doubts on the neural index of surprise used: we respond to this concern below, after reviewer 1’s first point to which the present comment refers (the referred-to comment was not included in the “Essential revisions” here).

3) The authors may overstate the advancement of the Music Transformer with the present stimuli, as its increase in performance requires a considerably longer context than the other models.

We do not believe we have overstated the advance presented by the Music Transformer, for the following reasons. First, we appreciate that from the music analysis (Figure 2b and Figure A3b), it seems as if the Music Transformer requires much longer context to reach only slightly higher predictive performance on the musical stimuli. Note, however, that this only applies to the comparison between the MT and the IDyOM-stm and IDyOM-both (which subsumes IDyOM-stm) models, but not to the comparison between the MT and IDyOM-ltm or the Temperley model. The MT and IDyOM-stm (and therefore IDyOM-both) handle context information rather differently, possibly leading to the wrong impression that predictive performance for the MT requires a lot more ‘data’. We go into these differences in more detail below.

Second, and importantly, the distinctive empirical contribution of our study is not the superiority of the MT over the other models per se, but the (neural and predictive) performance differences among model classes: statistical learning (IDyOM/MT) versus Gestalt/rule-based (Temperley), and the dependence of performance on context lengths. For these comparisons, the MT is a very useful tool because it efficiently tracks hierarchical structure in longer musical contexts (Huang et al., 2018). We furthermore demonstrate that it works at least as well as the previous state-of-the-art statistical model (IDyOM), yet may process a much larger class of music (i.e., polyphonic music; not yet explored).

Regarding the first point: there are small technical differences in the way previous context is used by IDyOM-stm and the Music Transformer (MT). IDyOM-stm is an n-gram model, predicting the probability of an upcoming note x_t given a previous context {x_(t-1), …, x_(t-n)}. The context parameter we varied here governs the maximum length n of the n-grams that IDyOM can take into account to make its predictions. Importantly, IDyOM-stm is an on-line statistical learning model: it updates the relative probabilities p(x_t | x_(t-1), …, x_(t-n)) as it is making predictions and parsing the ongoing composition. So while for any given note IDyOM-stm will only directly take into account the preceding n notes, the underlying statistical model against which it interprets those n context notes can depend on all the n-grams and notes that preceded them. Because of this property, IDyOM-stm is in effect “learning” from the current ongoing composition and can therefore indirectly leverage more information than the strict limit of n-grams considered. (It could be said that IDyOM-stm is ‘peeking’ at the test set to some extent, and therefore its predictive performance may be slightly overestimated.) Importantly, this type of on-line updating in IDyOM-stm still precludes the learning of any hierarchical structure encompassing context lengths longer than n (which is, for our purposes, an essential difference with the MT).

The MT model performs no on-line learning. Instead, the model only takes into account the strict n context notes that it is provided with when asked for an upcoming prediction. Critically, the transformer architecture enables the MT to make hierarchical predictions on those n notes, which depend on the musical corpus it was trained on.
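This difference can be illustrated with a toy bigram model (hypothetical and purely illustrative; neither IDyOM’s variable-order implementation nor the MT): with on-line count updating, surprise for a repeating pattern decreases over the course of a composition even though the direct context is a single note; with parameters frozen (as for the MT after corpus training, here caricatured as a model with fixed counts), the same model’s surprise stays flat.

```python
import math
from collections import defaultdict

class OnlineBigram:
    """Toy on-line bigram pitch model. Counts are updated while parsing,
    so predictions late in a piece reflect notes far beyond the
    single-note direct context."""

    def __init__(self, alphabet_size=128):
        self.V = alphabet_size  # MIDI pitch alphabet
        self.counts = defaultdict(lambda: defaultdict(int))

    def prob(self, context, note):
        ctx = context[-1] if context else None
        total = sum(self.counts[ctx].values())
        return (self.counts[ctx][note] + 1) / (total + self.V)  # add-one smoothing

    def surprise(self, context, note):
        return -math.log2(self.prob(context, note))

    def update(self, context, note):
        self.counts[context[-1] if context else None][note] += 1

melody = [60, 62, 60, 62, 60, 62, 60]  # alternating C4/D4

online = OnlineBigram()
s_online = []
for i, note in enumerate(melody):
    s_online.append(online.surprise(melody[:i], note))
    online.update(melody[:i], note)

frozen = OnlineBigram()  # never updated: caricature of fixed parameters
s_frozen = [frozen.surprise(melody[:i], note) for i, note in enumerate(melody)]
# s_online decreases as the alternation is learned on-line; s_frozen stays flat
```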

Finally, regarding the second point: We further note that the dependence of predictive performance on context length is quite different between predicting music (Figure 2) and predicting neural responses (Figures 3 and 4). For predicting upcoming musical notes, the MT indeed required considerably larger context lengths to outperform IDyOM (stm and both), likely in part for the reasons described above. In contrast, when the resulting surprise scores were used to predict neural responses, the context length at which the MT peaked was of the same order as that of IDyOM.

Secondly, the Baseline model, to which the other models are compared, does not contain any of the pitch information on which these models operate. As such, it is unclear whether the improvements of these models come from the information they receive or from the operations they perform on this information, as claimed.

We apologize for not being clear enough here. Importantly, none of the models compared contained any exact pitch information. We only used surprisal (and uncertainty) collapsed across the entire distribution of possible upcoming notes as a regressor for the MEG and EEG data, which by itself cannot be traced back to any particular pitch. Furthermore, we already did include a confound regressor encoding low versus high pitch, which, critically, was included identically in all the models (including the Baseline). We have updated the manuscript to emphasize this point.

4) Source analysis: See below in Rev #1 and Rev #3 for concerns over the results and interpretation of the source localisation.

In response to the reviewers’ comments, we revised the source analysis considerably. Previously, we had only investigated the later peaks, for which we had no strong expectations, and hence the frontal peak did not appear particularly suspect. However, as a sanity check and as suggested by reviewer 3, we source localized the earliest peak of the Onset TRF, expecting a clear bilateral peak in early auditory cortex. Instead, this peak was also localized to frontal cortex, which raised our suspicions that something was wrong in our source analysis pipeline. We indeed identified a bug: for all participants, the pipeline used the forward model computed for one participant, rather than each individual participant’s forward model. If this one participant happened to sit somewhat further to the front or back of the MEG helmet than the mean, and/or have a different head size than the mean, this would introduce a consistent spatial bias, which could explain the previous absence of a peak in auditory cortex. We apologize for this error in our code, and are intensely grateful to the reviewers for their well-founded suspicion.

We have now re-run the source analysis using the correct forward models, and additionally compared three models with one to three dipoles per hemisphere, as suggested by reviewer 3. As the final solution per participant, we use the model with the largest adjusted-r2 score in explaining the observed sensor topography. For the majority of participants, the three-dipole model fits best, although the gain in adjusted-r2 over the one- and two-dipole models is modest:

Author response image 1

(Even for the one-dipole model, these adjusted-r2 scores are higher than the previously reported r2, since the correct forward models are now used.) Using this corrected and updated pipeline, we now find consistent peaks in auditory cortex for all three regressors, also in the later time window of interest, and have updated the manuscript accordingly. Additionally, in accordance with reviewer 1’s suggestion, we now reflect on the limits of MEG’s spatial resolution in the Discussion.

Reviewer #1 (Recommendations for the authors):

The one non-standard feature of the analysis is the lack of regularization (e.g., ridge). The authors should perform the analysis using regularization (via nested cross-validation) and test if the predictions are improved (if they have already done this, they should report the results).

See our response in “Essential revisions” above.

Source localization analysis with MEG or EEG is highly uncertain. The conclusion that the results are coming from "fronto-temporal" areas doesn't strike me as adding much value; I'd be inclined to remove this analysis from the manuscript. Minimally, the authors should note that source localization is highly uncertain.

See our response in “Essential revisions” above.

The authors should report the earphones/tubes used to present sounds.

The earmolds used to present sounds were Doc’s Promolds NonVent with #13 thick prebent 1.85 mm ID tubes, from Audine Healthcare, in combination with Etymotic ER3A earphones. We have now added this information to the manuscript.

Reviewer #3 (Recommendations for the authors):

– Figure 2: Music Transformer is presented as the state-of-the-art model throughout the paper, with the main advantage of grasping regularities on longer time scales. Yet the computational results in figure 2 tend to show that it does not bring much in terms of musical predictions compared to IDyOM. MT also needs a way larger context length to reach the same accuracy. It has a lower uncertainty, but this feature did not improve the results of the neural analysis. This point could be discussed to better understand why MT is a better model of human musical expectations and surprise.

See our response in “Essential revisions” above.

– The source analysis is a bit puzzling to me. The results show that every feature (including Note Onset and Note Repetition) are localized in Broca's area, frontally located. Wouldn't the authors expect that a feature as fundamental as Note Onset should be clearly (or at least partially) localized in the primary auditory cortex? Because these results localize frontally, it is hard to fully trust the MT surprise results as well. Can the authors provide some form of sanity check to provide further clarity on these results? Perhaps the source localizing the M100 to show the analysis pipeline is not biased towards frontal areas? Do the authors expect note onsets in continuous music to be represented in higher-order areas than the representation of a single tone? Figure 7 and source-level results: the authors do not discuss the fact that the results are mainly found in a frontal area. It looks like there is no effect in the auditory cortex, which is surprising and should be discussed.

See our response in “Essential revisions” above.

– The dipole fit shows a high correlation with behavior but it is not compared with any alternatives. Can the authors use some model comparison methods (e.g. AIC or BIC) to show that a single dipole per hemisphere is a better fit than two or three?

See our response in “Essential revisions” above.

– Line 252: the order in which the transformation from axial to planar gradients is applied with respect to other processing steps (e.g., z-scoring of the MEG data and TRF fitting) is not clear. Is it the MEG data that was transformed as suggested here, or is the transformation applied to TRFs coefficients (as explained in line 710), with TRFs trained on axial gradiometers? This has downstream consequences that lead to a lack of clarity in the results. For example, the MEG data shows positive values in the repetition kernel which if the transformation was applied before the TRF would imply that repetition increases activity rather than reduces it as would be expected. From this, I infer that it is the TRF kernels that were transformed. I recognize that the authors can infer results regarding directionality in the EEG data but this analysis choice results in an unnecessary loss of information in the MEG analysis. Please clarify the methods and if the gradiometer transformation is performed on the kernels, I would recommend re-running the analysis with gradiometer transformation first.

Indeed, we estimated the TRFs on the original axial gradient data and subsequently (1) (linearly) transformed those axial TRFs to planar gradient data and (2) (nonlinearly) combined the resulting synthetic horizontal and vertical planar gradient components to a single magnitude per original MEG sensor. This has, first of all, the advantage that we can perform source analysis straightforwardly on the axial-gradient TRFs (analogous to an axial-gradient ERF). Most importantly, however, this order of operations prevents the amplification of noise that would result from executing the non-linear combination step (2) on the continuous, raw MEG data. For this latter reason, this order of operations is the de facto standard in ERF research (as well as in other recent studies employing TRFs or “regression ERFs”).

Estimating the TRFs on the non-combined synthetic planar gradient data (so after step 1, but before step 2) would not suffer from this noise amplification issue. However, since step 1 is linear, this would yield exactly the same results as the current order of operations, while doubling the computational cost of the regression.
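The equivalence of the two orders of operations can be illustrated with a toy example (our own sketch, not the actual analysis pipeline): because ordinary least-squares TRF estimation is linear in the measured data, applying a linear sensor transform before or after estimation yields identical coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))   # design matrix (time-lagged regressors)
Y = rng.standard_normal((1000, 8))   # simulated axial-gradiometer data
T = rng.standard_normal((8, 16))     # linear axial -> planar transform (step 1)

# route A: estimate TRFs on axial data, then transform the coefficients
beta_axial = np.linalg.lstsq(X, Y, rcond=None)[0]
route_a = beta_axial @ T

# route B: transform the continuous data first, then estimate TRFs
route_b = np.linalg.lstsq(X, Y @ T, rcond=None)[0]

assert np.allclose(route_a, route_b)
# only combining the horizontal and vertical planar components (step 2)
# is non-linear, which is why only its position in the pipeline matters
```

This is why estimating on non-combined synthetic planar data would give exactly the same result while doubling the computational cost.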

We apologize for being unclear on the exact order of operations regarding the planar gradient transformation in the original manuscript and have now clarified this.

– The Baseline model claims to include all relevant acoustic information but it does not include the note pitches themselves. As these pitches are provided to the model, it is difficult to tell whether the effects of surprise are related to model output predictions, or how well they represent their input. If the benefits from the models rely truly on their predictive aspect then including pitch information in the baseline model would be an important control. The authors have done their analysis in a way that fits what previous work has done but I think it is a good time to correct this error.

See our response in “Essential revisions” above.

– I wonder if the authors could discuss a bit more about the self-attention component of the Music Transformer model. I'm not familiar with the model's inner workings but I am intrigued by this dichotomy of a larger context to improve musical predictions and a shorter context to improve neural predictions. I wonder if a part of this dissociation has to do with the self-attention feature of Transformers, that a larger context is needed to have a range of motifs to draw from but the size of the motifs that the model attends to should be of a certain size to fit neural predictions.

We were similarly intrigued by this dissociation; however, we believe the most likely explanation is neural in origin, rather than reflecting the self-attention mechanism. This mechanism indeed allows the transformer to “attend” to preceding notes in various ways: a prediction might, for example, be based on a sequence of several preceding notes, while also taking into account several notes (or motifs) from many notes ago, but nothing in between. We thus agree that large contexts allow the MT to “have a range of motifs to draw from”. This is a good way to put it, since the model itself ‘decides’ which part of the context to ‘draw from’.

However, we do not think this can explain the dissociation. Instead, we observe that increasing the context ‘seen’ by the MT steadily improves its predictions (Figure 2b), and we suggest that at some point (over 5-10 notes of context), the MT expectations become more sophisticated than the expectations reflected in the MEG signal. This does not mean they are more sophisticated than human listeners’ expectations: humans can clearly track and appreciate patterns in music much longer and richer than the MT can. However, it seems that such high-level, long-range, likely hierarchical expectations are not driving the surprise effects in the evoked responses, which instead seem to reflect more low-level predictions over shorter temporal scales. The neural processing of the highest-level, longest-range predictions is likely not time-locked to the onset of musical notes, which precludes these being detected with the techniques used in the present study.

Finally, another reason to believe that the dissociation between context-versus-music-prediction and context-versus-brain-response is not driven by the specific details of the MT is that the same dissociation is observed for the IDyOM models. There, the pattern is much less clear because the musical prediction performance plateaus early, such that the musical predictions never become ‘much smarter’ than the neural predictions. The same pattern is nonetheless observed: musical prediction continues to improve for longer contexts than the neural prediction.

– Figure 3C: it could be interesting to show the same figure for IDyOM ltm. Given that context length does not impact ltm as much as MT, we could obtain different results. Since IDyOM ltm gets similar results to MT on the MEG data (figure 3A, no significant difference), it is thus hard to tell if the influence of context length comes from brain processing or the way MT works.

In Author response image 2 we show the predictive performance of the IDyOM ltm model on MEG versus music data, analogous to Figure 3C for the Music Transformer. We believe the MT to be the more sensitive measure of context dependency, for the reasons outlined earlier, and have therefore decided not to add this plot to the manuscript. Note that, even though this particular plot is not included, the exact same traces appear as part of Figures 2B and 3B, so the information is nonetheless present in the manuscript.

Author response image 2

– Figure 8: Adding the uncertainty estimates did not improve the model's predictive performance compared to surprise alone, but what about TRFs trained on uncertainty without surprise? Without this result, it is hard to understand why the surprise was chosen over uncertainty.

We did not investigate regression models using only the uncertainty regressor for two reasons. First, we were a priori primarily interested in the neural response to surprise, rather than uncertainty. Surprise is a much more direct index of content-based expectations and their violation than (unspecific) uncertainty, and since our theoretical interest is in content-based expectations, we focused on the former.

Second, we did explore the effect of uncertainty as a secondary interest, but found that adding uncertainty to the regression model not only did not improve the cross-validated performance, but actually worsened it (Figure 8B). Surprise and uncertainty were modestly correlated (Figure 8A), and therefore the most likely interpretation of this drop in cross-validated performance is that uncertainty truly does not explain additional neural variance. (That is, any neural variance it would explain on its own is likely due to its correlation with surprise; if it were to capture unique neural variance by itself, then the performance of the joint model would be at least as high as the model featuring only surprise.) For this a posteriori reason, in addition to the a priori reason already formulated, we did not further explore regression models featuring only uncertainty.
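The conceptual distinction between the two quantities can be made concrete with a toy example (our own illustration, not taken from the analysis code): uncertainty (entropy) depends only on the predictive distribution, whereas surprise additionally depends on which note actually occurs.

```python
import math

def surprise_and_entropy(dist, outcome):
    """Surprise: -log2 p(actual note). Entropy: expected surprise,
    which depends only on the distribution, not on the outcome."""
    s = -math.log2(dist[outcome])
    h = -sum(p * math.log2(p) for p in dist.values() if p > 0)
    return s, h

# a hypothetical predictive distribution over three candidate pitches
dist = {"C4": 0.7, "D4": 0.2, "G4": 0.1}

s_expected, h1 = surprise_and_entropy(dist, "C4")    # ~0.51 bits
s_unexpected, h2 = surprise_and_entropy(dist, "G4")  # ~3.32 bits
# identical uncertainty in both cases (h1 == h2 ~ 1.16 bits),
# yet very different surprise: only surprise indexes the violation
# of a content-based expectation
```

This is why surprise, not uncertainty, is the more direct index of content-based expectations and their violation.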

– The sentence from lines 272 to 274 is not clear. In particular, the late negativity effect seems to be present in EEG data only, and it is thus hard to understand why a negative correlation between surprise estimates of subsequent notes would have such an effect in EEG and not MEG. Moreover, the same late negativity effect can be seen on the TRF of note onset but is not discussed.

We apologize for the unclarity here. We emphasize that any judgements regarding the polarity of the effects are based on the EEG data (Figure 6), as well as inspection of the axial-gradient MEG data (not shown). This was already made explicit in the manuscript around lines 259-261. We have now rephrased the relevant passage. We hope that this should alleviate any worry of a discrepancy between the MEG and EEG results.

Regarding the same late negativity for the Onset regressor: we do already note the presence of this deflection (line 268), but since our interest is in the modulation of the neural response by surprise (and, to a lesser extent, by repetitions, which are related to surprise), we do not discuss it in further detail. Note that the presence of this deflection in the Onset TRF does not lessen the importance of its presence in the Surprise TRF – the latter remains an indication that this particular peak is modulated by musical surprise.

– Some of the choices for the Temperley Model seem to unnecessarily oversimplify and restrict its potential performance. In the second principle, the model only assesses the relationships between neighboring notes when the preceding note is the tonal centroid. It would seem more prudent to include all notes to collect further data. In the third principle, the model marginalizes major and minor scales by weighting probabilities of each profile by the frequency of major and minor pieces in the database. Presumably, listeners can identify the minor or major key of the current piece (at least implicitly). Why not select the model for each piece, outright?

We used the Temperley model as David Temperley and others have formulated and applied it in previous research. While some of these choices could indeed be debated, we aimed to use the model in line with the existing literature, which has demonstrated the capabilities of the model in a variety of tasks, such as pitch prediction, key finding, or explaining behavioural ratings of melodic surprise (Morgan et al., 2019b; Temperley, 2008, 2014). We have now explicitly mentioned that the specifics in the three principles were chosen in accordance with earlier work.

– Stimulus: Many small details make it so that the stimuli are not so naturalistic (MIDI velocity set to 100, monophonic, mono channel…). This results in a more controlled experiment, but the claim that it expands melodic expectation findings to naturalistic music listening is a bit bold.

We agree, and do not wish to make the claim that the stimuli we used are representative of the full breadth of music that humans may encounter in everyday life. However, we do maintain that these stimuli are considerably closer to naturalistic music than much previous work on the neural basis of (the role of expectations in) music processing. It could be argued that the most severe limitation to a broad claim of ‘naturalistic’ is the use of strictly monophonic music. This was a technical necessity given two of the three model classes (IDyOM, Temperley). An important contribution of our work is to demonstrate that a different model (the MusicTransformer) performs at least as well as the previous state-of-the-art. Critically, the MT supports the processing of polyphonic music, and our work thus paves the way for future studies investigating neural expectations in music more representative of that encountered in daily life. In accordance with the reviewer’s suggestion, we have now nuanced our claim of ‘naturalistic’ in the Discussion.

– Line 478: authors refer to "original compositions" which may give the impression that the pieces were written for the experiment. From the rest of the text, I don't believe this to be true.

The reviewer is correct; this is now fixed.

– Formula line 553: the first probability of Xt (on the left) should also be a conditional probability of Xt given previous values of x. This is the entropy of the probability distribution estimated by the model.

Fixed.
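For reference, the corrected expression (our reconstruction of the intended formula; here \(\mathcal{X}\) denotes the set of candidate notes) conditions both probabilities on the preceding context:

```latex
H(x_t) = -\sum_{x \in \mathcal{X}}
  p\left(x_t = x \mid x_{t-1}, \ldots, x_{t-n}\right)
  \log_2 p\left(x_t = x \mid x_{t-1}, \ldots, x_{t-n}\right)
```

That is, the entropy of the conditional predictive distribution estimated by the model, as the reviewer notes.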

– Line 667: the study by Di Liberto from which the EEG data come uses a ridge regression (ridge regularization). Is there a reason to use a non-regularized regression in this case? This should be discussed in the methods.

See our response in “Essential revisions” above.

References

Bharucha, J. J. (1987). Music cognition and perceptual facilitation: A connectionist framework. Music Perception, 5, 1–30. https://doi.org/10.2307/40285384

Broderick, M. P., Anderson, A. J., Di Liberto, G. M., Crosse, M. J., and Lalor, E. C. (2018). Electrophysiological Correlates of Semantic Dissimilarity Reflect the Comprehension of Natural, Narrative Speech. Current Biology, 28(5), 803-809.e3. https://doi.org/10.1016/j.cub.2018.01.080

Cohen, J. E. (1962). Information theory and music. Behavioral Science, 7(2), 137–163. https://doi.org/10.1002/bs.3830070202

Conklin, D., and Witten, I. H. (1995). Multiple viewpoint systems for music prediction. Journal of New Music Research, 24(1), 51–73. https://doi.org/10.1080/09298219508570672

Heilbron, M., Armeni, K., Schoffelen, J.-M., Hagoort, P., and de Lange, F. P. (2022). A hierarchy of linguistic predictions during natural language comprehension. Proceedings of the National Academy of Sciences, 119(32), e2201968119. https://doi.org/10.1073/pnas.2201968119

Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A. M., Hoffman, M. D., Dinculescu, M., and Eck, D. (2018). Music Transformer. ArXiv:1809.04281 [Cs, Eess, Stat]. http://arxiv.org/abs/1809.04281

Krumhansl, C. L. (2015). Statistics, Structure, and Style in Music. Music Perception, 33(1), 20–31. https://doi.org/10.1525/mp.2015.33.1.20

Di Liberto, G. M., Pelofi, C., Shamma, S., and de Cheveigné, A. (2020). Musical expertise enhances the cortical tracking of the acoustic envelope during naturalistic music listening. Acoustical Science and Technology, 41(1), 361–364. https://doi.org/10.1250/ast.41.361

Meyer, L. B. (1957). Emotion and Meaning in Music. University of Chicago Press.

Morgan, E., Fogel, A., Nair, A., and Patel, A. D. (2019a). Statistical learning and Gestalt-like principles predict melodic expectations. Cognition, 189, 23–34. https://doi.org/10.1016/j.cognition.2018.12.015

Morgan, E., Fogel, A., Nair, A., and Patel, A. D. (2019b). Statistical learning and Gestalt-like principles predict melodic expectations. Cognition, 189, 23–34. https://doi.org/10.1016/j.cognition.2018.12.015

Narmour, E. (1990). The analysis and cognition of basic melodic structures: The implication-realization model (pp. xiv, 485). University of Chicago Press.

Narmour, E. (1992). The Analysis and Cognition of Melodic Complexity: The Implication-Realization Model. University of Chicago Press.

Pearce, M. T., and Wiggins, G. A. (2012). Auditory Expectation: The Information Dynamics of Music Perception and Cognition. Topics in Cognitive Science, 4(4), 625–652. https://doi.org/10.1111/j.1756-8765.2012.01214.x

Planton, S., Kerkoerle, T. van, Abbih, L., Maheu, M., Meyniel, F., Sigman, M., Wang, L., Figueira, S., Romano, S., and Dehaene, S. (2021). A theory of memory for binary sequences: Evidence for a mental compression algorithm in humans. PLOS Computational Biology, 17(1), e1008598. https://doi.org/10.1371/journal.pcbi.1008598

Schellenberg, E. G. (1997). Simplifying the Implication-Realization Model of Melodic Expectancy. Music Perception: An Interdisciplinary Journal, 14(3), 295–318. JSTOR. https://doi.org/10.2307/40285723

Schmitt, L.-M., Erb, J., Tune, S., Rysop, A. U., Hartwigsen, G., and Obleser, J. (2021). Predicting speech from a cortical hierarchy of event-based time scales. Science Advances, 7(49), eabi6070. https://doi.org/10.1126/sciadv.abi6070

Smith, N. J., and Kutas, M. (2015). Regression-based estimation of ERP waveforms: I. The rERP framework. Psychophysiology, 52(2), 157–168. https://doi.org/10.1111/psyp.12317

Temperley, D. (2008). A Probabilistic Model of Melody Perception. Cognitive Science, 32(2), 418–444. https://doi.org/10.1080/03640210701864089

Temperley, D. (2014). Probabilistic Models of Melodic Interval. Music Perception, 32(1), 85–99. https://doi.org/10.1525/mp.2014.32.1.85

Wilcox, E. G., Gauthier, J., Hu, J., Qian, P., and Levy, R. (2020). On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior (arXiv:2006.01912). arXiv. https://doi.org/10.48550/arXiv.2006.01912

https://doi.org/10.7554/eLife.80935.sa2

Article and author information

Author details

  1. Pius Kern

    Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, Netherlands
    Contribution
    Conceptualization, Data curation, Formal analysis, Visualization, Methodology, Writing - original draft, Project administration, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-4796-1864
  2. Micha Heilbron

    Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, Netherlands
    Contribution
    Conceptualization, Supervision, Investigation, Methodology, Project administration
    Competing interests
    No competing interests declared
  3. Floris P de Lange

    Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, Netherlands
    Contribution
    Conceptualization, Supervision, Funding acquisition, Project administration, Writing – review and editing
    Competing interests
    Senior editor, eLife
    ORCID: 0000-0002-6730-1452
  4. Eelke Spaak

    Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, Netherlands
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Project administration, Writing – review and editing
    For correspondence
    eelke.spaak@donders.ru.nl
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-2018-3364

Funding

Nederlandse Organisatie voor Wetenschappelijk Onderzoek (016.Veni.198.065)

  • Eelke Spaak

European Research Council (101000942)

  • Floris P de Lange

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank David Temperley for providing the code for his model and Marcus Pearce for discussions on the IDyOM model. This work was supported by The Netherlands Organisation for Scientific Research (NWO Veni grant 016.Veni.198.065 awarded to ES) and the European Research Council (ERC Consolidator grant SURPRISE # 101000942 awarded to FPdL).

Ethics

Human subjects: The study was approved under the general ethical approval for the Donders Centre for Cognitive Neuroimaging (Imaging Human Cognition, CMO2014/288) by the local ethics committee (CMO Arnhem-Nijmegen, Radboud University Medical Centre). Participants provided written informed consent before the experiment and received monetary compensation.

Senior Editor

  1. Christian Büchel, University Medical Center Hamburg-Eppendorf, Germany

Reviewing Editor

  1. Jonas Obleser, University of Lübeck, Germany

Reviewers

  1. William Sedley, Newcastle University, United Kingdom
  2. Keith Doelling

Publication history

  1. Received: June 9, 2022
  2. Preprint posted: June 10, 2022 (view preprint)
  3. Accepted: December 22, 2022
  4. Accepted Manuscript published: December 23, 2022 (version 1)
  5. Version of Record published: January 12, 2023 (version 2)

Copyright

© 2022, Kern et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


Cite this article

Pius Kern, Micha Heilbron, Floris P de Lange, Eelke Spaak (2022) Cortical activity during naturalistic music listening reflects short-range predictions based on long-term experience. eLife 11:e80935. https://doi.org/10.7554/eLife.80935