Abstract
Humans seamlessly process multi-voice music into a coherent perceptual whole. Yet the neural strategies supporting this experience remain unclear. One fundamental component of this process is the formation of melody, a core structural element of music. Previous work on monophonic listening has provided strong evidence for the neurophysiological basis of melody processing, for example indicating predictive processing as a foundational mechanism underlying melody encoding. However, considerable uncertainty remains about how melodies are formed during polyphonic music listening, as existing theories (e.g., divided attention, figure–ground model, stream integration) fail to unify the full range of empirical findings. Here, we combined behavioral measures with non-invasive electroencephalography (EEG) to probe spontaneous attentional bias and melodic expectation while participants listened to two-voice classical excerpts. Our uninstructed listening paradigm eliminated a major experimental constraint, creating a more ecologically valid setting. We found that attention bias was significantly influenced by both the high-voice superiority effect and intrinsic melodic statistics. We then employed transformer-based models to generate next-note expectation profiles and test competing theories of polyphonic perception. Drawing on our findings, we propose a weighted-integration framework in which attentional bias dynamically calibrates the degree of integration of the competing streams. In doing so, the proposed framework reconciles previous divergent accounts by showing that, even under free-listening conditions, melodies emerge through an attention-guided statistical integration mechanism.
Highlights
EEG can be used to track spontaneous attention during the uninstructed listening of polyphonic music.
Behavioural and neural data indicate that spontaneous attention is influenced by both high-voice superiority and melodic contour.
Attention bias impacts the neural encoding of the polyphonic streams, with strongest effects within 200 ms after note onset.
Strong attention bias leads to melodic expectations consistent with a monophonic music transformer, in line with a Figure-ground model. Weak attention bias leads to melodic expectations consistent with a Stream Integration model.
We propose a bi-directional influence between attention and prediction mechanisms, with horizontal statistics impacting attention (i.e., salience), and attention impacting melody extraction.
Introduction
Music from many cultures and traditions, including Western tonal music, typically leads to the perception of melody, which is a monophonic sequence of notes that a listener would sing, whistle, or hum back to identify the music (1, 2). While certain types of music are monophonic by nature (e.g., Sean-nós singing, nursery rhymes), leaving little doubt as to what the melody is, music typically involves multiple sound streams, or voices. A variety of factors, some acoustic (e.g., the particular instrument used, pitch distance) and others musical (e.g., meaningful melodic progressions), have been identified as key contributors to the perception of distinct sound streams in music. However, there remains considerable uncertainty about how those streams are processed by our brain into a coherent whole; and while melody has been proposed as a key structural element of music (3), the neural strategies underpinning its emergence from multi-stream compositions remain unclear.
While many theories have been proposed, one point of agreement is that attention mechanisms play a central role in polyphonic music perception. Much of the debate has centred on whether our brains are capable of dividing auditory attention across multiple auditory streams. In contrast with the widely studied multi-talker speech scenario, where attention can only be directed to one speaker or conversation at a time (4, 5), music is built so that distinct streams have strong relationships (e.g., harmonic) that allow them to be perceived as coherent elements of a unified percept. Theories of polyphonic music perception have been formulated, for example explaining this phenomenon as the result of divided attention between the streams (6, 7). A second view proposes that the attentional focus is directed to a single “foreground” stream, while its harmonic relationship with the “background” is also processed (figure-ground model (8, 9)). As music unfolds, listeners may switch their attention to different streams, changing what is regarded as foreground and background. A third view proposes that our brains merge multiple music streams into a single complex melody, which then becomes the focus of attention (integration model (9)).
Tailored experiments have been used to test the validity of these models, often involving tasks such as error detection, in which participants are asked to click a button when they hear anomalies in pitch, rhythm, or harmony (10). That work has produced a wealth of knowledge on what our brains “can” do. However, these tailored tasks come at a cost: the imposed task may elicit neural processing strategies that differ from those used during uninstructed listening. For example, error detection tasks may include the explicit instruction to focus on one or all streams, leading to processing strategies (e.g., selective attention behaviour, rapid attention switching) that might or might not be employed during uninstructed listening. As such, while that work can certainly inform us on what our brains are capable of doing, it is less clear whether it reflects typical neural functioning during uninstructed listening.
Here, we investigate how the human brain processes two-stream polyphonic music during uninstructed listening by combining a behavioural task with non-invasive electroencephalography (EEG) measurements. Participants were presented with classical music pieces synthesised with a high-quality virtual piano instrument, where each of the two voices consisted of a meaningful melody that would stand on its own (Fig. 1). In that condition, the listener’s attention was expected to focus primarily on the high-pitch stream due to: (1) the high-voice superiority effect (i.e., a bias towards the high-pitch stream; (11)); and (2) the melody itself, as the more attractive statistics are often placed in the higher-pitch stream (11-14).

Experimental design.
(A) Participants were presented with polyphonic music made of two monophonic streams (PolyOrig condition), with control stimuli where the average pitch heights of the two streams were inverted (PolyInv), and with the corresponding monophonic streams in isolation (Monophonic). All stimuli were synthesised from MIDI scores using high-quality virtual piano sounds. (B) Each experimental session was organised into two blocks: a listening block, where EEG signals were recorded from participants during uninstructed music listening, followed by a melody identification block. The music pieces were selected randomly from any condition, and a given piece was presented only once, in only one of the three conditions, without repetitions. In the melody identification block, participants heard short polyphonic snippets from the same pieces, and were asked to sing back the melody and to indicate whether it was the high- or low-pitch stream on a five-point scale. This provided a behavioural measurement of attention bias. (C) Attention bias was also measured from the neural signal. Attention decoding models were built on the monophonic condition by fitting backward temporal response functions (TRF) on each participant to reconstruct the sound envelope from the EEG signal. TRF models were then applied to the polyphonic conditions to decode the attended melody, resulting in a neural measurement of attention bias.
To disentangle these two factors, alongside this main experimental condition involving the original polyphonic music (PolyOrig), our experiment included a control polyphonic condition where the average pitch heights of the two streams were inverted (PolyInv). Finally, we also included a monophonic condition (Monophonic), comprising a random selection of the individual voices from both polyphonic conditions, which was used to train the models for the attention decoding analysis.
Where is the attention?
First, we decode the attention bias in the listeners’ brains in the polyphonic listening conditions. In doing so, we answer the question: where is the attended melody? To that end, we derive two attention bias scores. The first score is behavioural, and reflects which of the two sound streams the listener sang back during a listen-and-repeat task in a dedicated block (behavioural attention bias score; Figure 1). The second score is derived from the EEG signal through an attention decoding procedure (neural attention bias score). Recent methodological developments provide tools to carry out that decoding from EEG signals based on regression methods (15). Specifically, we fit attention decoders on the EEG data recorded during the monophonic condition, which contained melodies selected from both high- and low-pitch streams and from both main and support streams. That way, the resulting decoders were unbiased with regard to pitch height and voice statistics, and could then be applied to decode attention in the polyphonic conditions.
We hypothesised that both music streams would be robustly encoded in the human cortex during polyphonic music listening, primarily reflecting the cortical encoding of low-level acoustic properties (e.g., sound envelope). With regard to auditory attention, the high-voice superiority effect and melody statistics are the two possible contributors to salience – timbre was not a factor in this experiment, as the sound stimuli were synthesised using MIDI sounds from a single instrument (see Methods). In the PolyOrig condition, both contributors were expected to primarily steer attention toward the high-pitch stream by construction. Without over-generalising, the melodic material characterising the main motif or primary thematic element, which is distinguished by its more intriguing and expressive features, is often placed in the more exposed higher registers. As a matter of common compositional practice, the main motif is supported by a harmonic and rhythmic structure constructed to enhance the attention on the main thematic material.
We designed the PolyInv condition to disentangle the two key contributors to salience – high-voice superiority and melody statistics – by placing the main motif in the low-pitch voice. The pieces in this experiment, primarily double-counterpoint compositions, were selected specifically because the inversion of voices does not compromise their structural integrity. For example, this technique is a well-established practice in fugues. The bass (low-pitch) line, even if less ornamented than the main (high-pitch) melody, provides a fundamental harmonic structure. This harmonic foundation ensures that when the lines are inverted, the resulting music retains coherence and stability, even in the absence of a third voice to fully define the harmonic context. With this premise, three hypotheses were explored for the PolyInv condition: Hp0) high-voice superiority is the main contributor to attention bias, which would lead to a high-pitch bias; Hp1) the two contributors counteract each other, leading to a small (or no) attentional bias; and Hp2) melody statistics are the main contributor, meaning that PolyInv would lead to an inverted attention bias compared with PolyOrig, as the pitch lines of the two voices were swapped (Fig. 2A).

Behavioural and neural indices of attention bias during uninstructed listening of polyphonic music.
(A) Illustration of the hypothesised attention bias scenarios for polyphonic music listening. A high-pitch bias was expected for PolyOrig by design. A dominant high-voice superiority effect or dominant motif attractiveness were expected to lead to a strong attention bias toward the high (Hp0) and low (Hp2) pitch streams respectively, while a substantial reduction in attention bias would reflect a comparable contribution of the two factors (Hp1). (B) Behavioural result. The behavioural attention bias metric (mean ± SEM; ***p<0.001) was derived from the subjective reporting in the melody identification block. Subjective ratings indicated which stream was perceived as the main melody, from value -2 (low pitch) to value 2 (high pitch). (C) EEG decoding analysis. (Left) Envelope reconstruction correlations for the decoding analysis are reported (mean ± SEM; *p<0.05, ***p<0.001) for individual streams (high and low pitch) and conditions (PolyOrig and PolyInv). Colours refer to the motif (red: main melody; blue: support stream). Note that colours are inverted in the two conditions, reflecting the pitch inversion. (Right) Envelope reconstruction correlations for individual participants. (D) Neural attention bias index, obtained by subtracting the high and low pitch reconstruction correlations in (C) within each condition (Δenvelope reconstruction; mean ± SEM; **p<0.01). (E) Forward TRF model weights at channel Cz, providing insights into the temporal dynamics of the neural response to the two streams. Lines and shaded areas indicate the mean and SEM over participants respectively. Thick black lines indicate time points with statistically significant differences across conditions.
What is the melody?
After establishing the attention bias for each participant and music piece, we present a second analysis to determine how the perceived melody is built by our brains. To that end, we rely on the known sensitivity of EEG signals to melodic expectations (10, 16-19). A music note may be more or less expected based on its prior context. Monophonic music listening leaves little doubt as to what that local context is, and models have been built that estimate next-note expectations, for example based on variable-order Markov chains (20) and deep-learning architectures (21-23). Here, we use state-of-the-art transformer-based computational models of music to generate numerical hypotheses for our brains’ next-note expectations, matching different theories of polyphonic music perception. We then relate these simulations to the neural signals recorded during the polyphonic conditions, determining which music processing strategy is most neurophysiologically plausible for uninstructed listening, and how attention bias relates to the selected strategy.
We considered a divided attention model (6) and an integration model (9). The divided attention model would predict distinct melodic surprise responses for the two polyphonic streams, with equal strength. Measuring different strengths in this two-stream listening scenario would instead be compatible with a figure-ground model (7, 9), where the listening focus is directed to one particular stream (horizontal melody), which may alternate as the music unfolds, while the other is processed as a function of the attended stream (vertical harmony). The other possibility that we considered was the integration model, according to which next-note expectations would be based on a single melody combining the two streams. Stream integration is in line with previous findings (9, 24) and compatible with the intuition that our vocal tract can only produce monophonic sequences. An auditory-motor neural pathway, condensing the auditory input into monophonic vocal motor commands, would be in line with the definition of melody itself (i.e., a monophonic sequence that a listener would hum back after hearing a music piece). Note that integration does not preclude the processing of other properties, such as vertical harmony, for example via a distinct neural pathway.
With that premise, our analysis aims to arbitrate among those three models. Measuring differences in the melodic expectation encoding of the two streams in PolyOrig would argue against the divided attention model. With regard to PolyInv, we had distinct hypotheses for the other two models. Specifically, the figure-ground model would predict PolyInv to impair integration, if anything, due to the alterations in vertical harmony produced by the pitch inversion. The integration model, instead, would predict increased integration in PolyInv, where the attention bias was expected to fall in between the two streams (Hp1 in Fig. 2A).
Results
Neural signals were recorded with EEG from 31 participants (16 males) during the uninstructed listening of polyphonic compositions and their monophonic components (listening block; Fig. 1A). In the second part of the experiment, participants undertook a melody identification task in a dedicated block, where short polyphonic music segments (∼4-8 seconds) were presented and participants were asked to sing back the melody, and to indicate whether they sang the high- or low-pitch stream on a five-point scale (see Methods). The analyses that follow integrate behavioural (Fig. 1B) and neural data (Fig. 1C) to address the following key questions: (a) Where was the focus of attention, and what factors influenced it? (b) What melody is encoded in the human brain during uninstructed listening to polyphonic music?
Where is the attention? Investigating attention bias and its contributors using behavioural and neural measurements
In PolyOrig, the listeners’ attention was expected to focus on the high pitch stream by construction. PolyInv was built to disentangle the high-voice superiority effect and the motif attractiveness. Measuring a strong attention bias toward the high-pitch stream in PolyInv would reflect a dominance of the high-voice superiority effect (Hp0). Conversely, an attention bias toward the low-pitch stream in PolyInv would indicate a dominance of the motif attractiveness (Hp2). Our expectation was that both factors would contribute to the attention bias after the pitch inversion, thus leading to a reduced attention bias compared with PolyOrig (Hp1; Fig. 2A).
Behavioural measurements of attention bias confirmed the high-pitch attention bias in the PolyOrig condition (t-test, p = 3.5 × 10⁻⁹, d = 1.88). The PolyInv condition also showed a high-pitch bias (t-test, p = 6.6 × 10⁻⁴, d = 0.80). Crucially, while the attention bias remained toward the high-pitch voice, its magnitude decreased with the voice inversion (paired t-test: p = 1.2 × 10⁻⁴, d = 0.95; Fig. 2B), in line with Hp1.
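As an illustration of this behavioural analysis, a minimal MATLAB sketch is given below; the variable names and the recoding of the five-point ratings are assumptions made for illustration, not the exact analysis code used in this study.

```matlab
% Minimal sketch (Statistics and Machine Learning Toolbox): behavioural attention
% bias. scoresOrig and scoresInv are hypothetical [nParticipants x nSnippets]
% matrices of ratings already recoded to the -2 (only the low-pitch stream) to
% +2 (only the high-pitch stream) scale used in Fig. 2B.

biasOrig = mean(scoresOrig, 2);          % per-participant bias in PolyOrig
biasInv  = mean(scoresInv,  2);          % per-participant bias in PolyInv

[~, pOrig] = ttest(biasOrig);            % one-sample t-test against zero (no bias)
[~, pInv ] = ttest(biasInv);
[~, pDiff] = ttest(biasOrig, biasInv);   % paired t-test: PolyOrig vs PolyInv

dOrig = mean(biasOrig) / std(biasOrig);                       % Cohen's d, one-sample
dDiff = mean(biasOrig - biasInv) / std(biasOrig - biasInv);   % Cohen's d, paired
```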
Next, we tested whether the neural data pointed to the same hypothesis. Neural measurements of attention bias were derived via an attention decoding analysis. Attention decoders were fit on the low-frequency EEG data (1-8 Hz) from the monophonic condition using lagged ridge regression, a methodology referred to as the backward temporal response function (TRF; (25-27)). The model fit identifies the linear combination of all EEG channels that produces an optimal reconstruction of the sound envelope. Since the monophonic listening condition only involved one music stream, participants could only attend to that stream, meaning that the resulting TRF model serves as an attention decoder. Sound envelope reconstructions were derived by applying that model to the EEG data from the polyphonic conditions. Pearson’s correlations were then calculated between the reconstructed signal and the envelope of each of the two polyphonic streams. The neural attention bias index was derived as the difference between those correlations (Fig. 1C).
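The sketch below illustrates this decoding step under simplifying assumptions: it presumes the mTRF-Toolbox (25) is available on the MATLAB path and uses hypothetical variable names for the preprocessed data; lag range and regularisation value are illustrative.

```matlab
% Minimal sketch of the attention decoding step, assuming the mTRF-Toolbox (25)
% is installed. Variable names (eegMono, envMono, eegPoly, envHigh, envLow) are
% hypothetical placeholders for preprocessed single-trial data ([time x channels]
% EEG and [time x 1] envelopes) at fs = 125 Hz.

fs = 125; tmin = 0; tmax = 250; lambda = 1e2;   % decoder lags (ms) and ridge value (illustrative)

% 1) Fit a backward decoder (direction = -1) on the monophonic condition, where
%    the single stream is by definition the attended one.
decoder = mTRFtrain(envMono, eegMono, fs, -1, tmin, tmax, lambda);

% 2) Reconstruct the envelope from a polyphonic trial and correlate the
%    reconstruction with the envelope of each stream.
recon = mTRFpredict(envHigh, eegPoly, decoder);   % reconstruction depends only on the EEG
rHigh = corr(recon, envHigh);
rLow  = corr(recon, envLow);

% 3) Neural attention bias index: difference between the two correlations.
neuralBias = rHigh - rLow;    % > 0 indicates a bias toward the high-pitch stream
```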
Envelope reconstruction correlations showed a statistically significant attention bias (two-way ANOVA; main effect of stream: F(1,30) = 36.9, p = 1.1 × 10⁻⁶; Fig. 2C), no main effect of condition (PolyOrig vs. PolyInv: F(1,30) = 1.1, p = 0.296), and a statistically significant stream × condition interaction (F(1,30) = 8.8, p = 0.006; Fig. 2D). Post hoc tests confirmed, in line with the behavioural results and Hp1, a neural attention bias toward the high pitch stream in both PolyOrig (paired t-test: p = 2.4 × 10⁻⁶, d = 1.04) and PolyInv (p = 0.025, d = 0.42; Fig. 2C), and that the voice inversion led to a lower attention bias in PolyInv (paired t-test: p = 0.006, d = 0.53).
Further analyses were run to determine the impact of attention on the temporal unfolding of music stream encoding. Multivariate forward envelope TRFs were fit to build an optimal linear mapping from the envelopes of the two input streams to the corresponding low-frequency EEG recording. In PolyOrig, statistically significant differences between the TRF weights of the two streams were measured at the representative EEG channel Cz (FDR-corrected paired t-tests, p < 0.05; Fig. 2E), with the effects emerging for all TRF components within the first 400 ms after stimulus onset. In PolyInv, only a very short temporal cluster of significance emerged instead (∼80 ms). Results were comparable at neighbouring channels such as Fz, while weaker effects emerged in more occipital scalp areas, such as Pz (not shown).
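A minimal sketch of this pointwise comparison is shown below, assuming hypothetical matrices of participant-level TRF weights at Cz and the availability of the fdr_bh helper mentioned in the Methods.

```matlab
% Minimal sketch: pointwise comparison of the forward TRF weights for the two
% streams at channel Cz. wHigh and wLow are hypothetical [nParticipants x nLags]
% matrices of TRF weights, and fdr_bh (MATLAB File Exchange) is assumed to be
% available, as in the Methods.

nLags = size(wHigh, 2);
pvals = zeros(1, nLags);
for lag = 1:nLags
    [~, pvals(lag)] = ttest(wHigh(:, lag), wLow(:, lag));   % paired t-test at each lag
end

h = fdr_bh(pvals, 0.05);   % Benjamini-Hochberg correction across lags
sigLags = find(h);         % lags with a significant difference between streams
```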
What is the melody? Investigating melody formation by probing melodic prediction mechanisms in the listener’s brain
When asked to sing back a polyphonic piece, listeners produce a melody that is either one of the streams (segregation) or a mixture of the two (integration). The analysis that follows aimed to determine what the listeners’ brains regard as the melody. To that end, we relied on measurements of melodic expectations, which were previously validated in the context of monophonic music listening (16, 28, 29). Three models in the literature would predict different outcomes for this analysis: (a) the divided attention model predicts the simultaneous encoding of melodic expectations calculated on the two streams treated as independent monophonic sequences; (b) the figure-ground model would predict the encoding of melodic expectations for the main motif; and (c) the integration model would correspond to the encoding of melodic expectations for a melody that combines the two streams. Here, estimates of melodic expectations (surprise and entropy) for each melody stream were derived from an Anticipatory Music Transformer (23).
Forward TRFs were fit to predict the low-frequency EEG signals (1-8 Hz) based on a multivariate feature set including acoustic features (A: note onset, envelope, envelope derivative) and melodic expectation features (M: note onsets amplitude-modulated by pitch surprise and note onsets amplitude-modulated by pitch entropy, for all streams in the stimulus). Melodic expectation encoding was quantified as the gain in variance explained by including M in the model (i.e., r_AM − r_A).
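For illustration, the sketch below shows one way the A and AM feature sets could be assembled and the gain estimated with cross-validated forward TRFs; all variable names, lag ranges, and regularisation values are illustrative assumptions rather than the exact pipeline used here.

```matlab
% Minimal sketch: build the A and AM feature sets and estimate the gain in EEG
% prediction correlation (r_AM - r_A) with the mTRF-Toolbox. All names are
% hypothetical: eegTrials, env, envDiff, onsetIdx, surprise and entropy are cell
% arrays with one entry per trial, sampled (or indexed) at fs = 125 Hz.

fs = 125;
nTrials = numel(eegTrials);
A = cell(1, nTrials); AM = cell(1, nTrials);
for k = 1:nTrials
    n = size(eegTrials{k}, 1);
    ons  = zeros(n, 1); ons(onsetIdx{k})  = 1;             % note-onset impulse train
    onsS = zeros(n, 1); onsS(onsetIdx{k}) = surprise{k};   % onsets scaled by pitch surprise
    onsH = zeros(n, 1); onsH(onsetIdx{k}) = entropy{k};    % onsets scaled by pitch entropy
    A{k}  = [ons, env{k}, envDiff{k}];                     % acoustic-only feature set (A)
    AM{k} = [A{k}, onsS, onsH];                            % acoustics + melodic expectations (AM)
end

% Cross-validated forward TRFs (0-350 ms lags, one illustrative ridge value)
cvA  = mTRFcrossval(A,  eegTrials, fs, 1, 0, 350, 1e3);
cvAM = mTRFcrossval(AM, eegTrials, fs, 1, 0, 350, 1e3);

gain = mean(cvAM.r(:)) - mean(cvA.r(:));   % melodic expectation encoding, r_AM - r_A
```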
As a first step, we obtained a neural measurement of melodic expectation encoding in the monophonic condition, aiming to replicate previous findings in the literature (16, 30). The mapping between stimulus features and EEG is multivariate-to-multivariate, making methods such as the TRF (which is multivariate-to-univariate) suboptimal. Here we used the more appropriate Canonical Correlation Analysis (CCA) methodology instead, with match-vs-mismatch classification metrics for the evaluation (Fig. 3A; (31, 32)), where higher classification scores indicate a stronger encoding of a given stimulus feature-set. The match-vs-mismatch evaluation is a randomised procedure; here, tests were run on the mean across participants, with 250 repetitions of that procedure serving as observations. A statistically significant cortical encoding of melodic expectations was measured in the monophonic condition (AM > A, Wilcoxon rank sum: p < 10⁻³⁰; Fig. 3B, top), consistent with previous EEG, MEG, and intracranial EEG work with Music Transformers and Markov models (16).
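The sketch below illustrates the core of the match-vs-mismatch procedure under simplified assumptions (a single held-out segment pair and a fixed number of CCA components); it is a sketch of the general technique, not the exact evaluation pipeline used in this study.

```matlab
% Minimal sketch of the CCA match-vs-mismatch evaluation (canoncorr from the
% Statistics and Machine Learning Toolbox). Xtrain/Ytrain are time-lagged
% stimulus features and EEG from training data; Xtest/Ytest are a held-out
% matched pair, and Xmis is a stimulus segment of the same length taken from a
% different piece. All variable names are hypothetical.

nComp = 5;                                  % number of CCA components to score
[Wx, Wy] = canoncorr(Xtrain, Ytrain);       % CCA transforms fit on training data

% Summed |correlation| between stimulus and EEG canonical components
score = @(X, Y) sum(abs(diag(corr((X - mean(X)) * Wx(:, 1:nComp), ...
                                  (Y - mean(Y)) * Wy(:, 1:nComp)))));

matchScore    = score(Xtest, Ytest);        % EEG paired with its true stimulus
mismatchScore = score(Xmis,  Ytest);        % same EEG paired with a mismatched stimulus
isCorrect = matchScore > mismatchScore;     % one binary outcome of the procedure

% Repeating this over many segment pairs and averaging isCorrect yields the
% match-vs-mismatch classification accuracy reported in Fig. 3.
```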

Low-frequency cortical encoding of melodic expectation during polyphonic music listening.
(A) Schematics of the analysis method. Canonical Correlation Analysis (CCA) was run to study the stimulus-EEG relationship. Match-vs-mismatch classification scores were derived to quantify the strength of the neural encoding of a given stimulus feature-set. (B) The analysis was run separately for acoustic-only features (A) and acoustic+melodic expectation features (AM) in each of the three experimental conditions. The distributions in the figure indicate repetitions of the match-vs-mismatch procedure. (C) The gain in match-vs-mismatch classification after including the melodic expectation features, ΔClassification, is compared across conditions and models, indicating whether a monophonic or a polyphonic account of melodic expectations best fits the neural data. In the box plot, the bottom and top edges mark the 25th and 75th percentiles respectively, while the mid-line indicates the median value (***p<0.001).
Next, we measured the cortical encoding of melodic expectation in the polyphonic conditions using the monophonic music transformer (Fig. 3A, B). A statistically significant encoding of melodic expectations emerged in both PolyOrig and PolyInv (two-way ANOVA; main effect of melodic expectations: F(1,249) = 512.1, p < 10⁻³⁰; one-tailed post hoc Wilcoxon rank sum tests in PolyOrig and PolyInv: p < 10⁻³⁰ and p = 1.6 × 10⁻¹⁸ respectively), with a significant main effect of condition (F(1,249) = 7998.3, p < 10⁻³⁰). This analysis did not find a condition × expectations interaction (F(1,249) = 0.1, p = 0.765).
Finally, we tested whether melodic expectations built with a polyphonic transformer represented the EEG signal better than those built with a monophonic model (Fig. 3C). A two-way ANOVA indicated main effects of model (monophonic vs. polyphonic: F(1,249) = 710.4, p < 10⁻³⁰) and condition (PolyOrig vs. PolyInv: F(1,249) = 283.8, p < 10⁻³⁰), as well as a statistically significant interaction effect (model × condition: F(1,249) = 4995.8, p < 10⁻³⁰). Interestingly, post hoc tests indicated that the EEG data related more strongly with the monophonic model in the PolyOrig condition and with the polyphonic model in the PolyInv condition (two-tailed post hoc Wilcoxon rank sum tests, p < 10⁻³⁰).
Discussion
This study investigated the neural processing of melody during polyphonic music listening. Previous research indicated that listeners can detect changes or errors in simultaneous music streams in some contexts (6, 8) but not others (9), challenging the possibility that our brains can truly divide attention between simultaneous melodic streams. Theories like the figure-ground model and stream integration (9) have been proposed as perceptual strategies that can compensate for the difficulty of truly dividing attention. While that work offers valuable insights into what our brains “can” do, the tasks used in that research alter the listening experience, raising doubts as to whether those models apply to uninstructed listening. Here, we fill that gap by measuring the spontaneous attention bias in the listeners’ brains during the uninstructed listening of polyphonic music. First, we found that salience is affected by both the high-voice superiority effect and the statistical properties of the melodic lines (Fig. 2), providing a quantitative approach to disentangle the two effects. Second, we provide evidence for a weighted integration strategy, with the attention bias altering the rate of stream integration (Fig. 3). Altogether, these results shed new light on the processing of melody during the uninstructed listening of polyphonic music, pointing to an inter-relationship between attention and statistical processing mechanisms, with attention orchestrating the way melody is formed.
Investigating the contributors to attention during uninstructed polyphonic music listening
Attention is potentially the key mechanism for understanding how melody is processed during polyphonic music listening. To probe it, we designed a paradigm where the attention focus would change across conditions, without instructing the participants on where to direct their attentional focus. We opted for double-counterpoint compositions involving two monophonic piano melodic lines, primarily from J.S. Bach (Table 1). As discussed by P.A. Scholes, such compositions involve melodic lines that “move in apparent independence and freedom though fitting together harmonically” (33), creating a scenario where both streams present attractive melodic progressions. As a counterexample, consider polyphonic music with two streams where the support stream has infrequent notes and little melodic variation. That scenario would make the pitch-line inversion less effective, as the new high-pitch stream would not stand on its own as the main melody.

Composers and titles of musical pieces used as stimuli.
Previous studies demonstrated that the focus of selective auditory attention can be reliably decoded from EEG signals during sustained attention tasks (27) and, more recently, by studying instructed attention switches (34, 35). While that work was initially carried out on speech listening tasks in multi-talker scenarios (27, 36, 37), EEG attention decoding was also shown to be effective for polyphonic music during instructed selective attention tasks (15). Hausfeld and colleagues (38) went one step further by comparing an instructed selective attention task, where attention was steered to a given music stream, with a divided attention task, where participants were asked to focus on two music streams simultaneously. EEG attention decoding showed a preference toward the attended stream in the former task, while no preference was measured in the latter. Altogether, those results strongly support the reliability of decoding attention from EEG. However, that work focused on instructed listening, thus not informing us on how polyphonic music listening unfolds naturally. Other design choices further complicate the interpretation of those results, such as the use of different instruments for the distinct streams, which introduces another possible bias that, although interesting, was considered unnecessary for answering the research question of the present investigation. For that reason, our experiment only included virtual piano emulations generated from MIDI scores.
Our attention decoding analysis indicates that both pitch height and melody statistics contribute to music salience. The present study teases apart two key factors influencing uninstructed attention by utilising two conditions involving the same type of music, with the sole difference that the pitch lines were inverted. Using behavioural (Fig. 2B) and EEG decoding indices (Fig. 2D), we found that listeners steer their attention toward the high-pitch stream in both conditions, even though the bias was substantially weaker in PolyInv. These results, together with the forward TRF results (Fig. 2E), are in line with our initial hypothesis that two key contributors are at play, and that the high-voice superiority effect is, in this case, stronger than the bias due to melody attractiveness. It should be noted that our EEG results refer to the average attention bias throughout a piece, and that the behavioural index was only derived on a subsample of the music material. As such, that result would be compatible with different attention dynamics, such as a weaker attention bias throughout the piece, an increase in attention switching between the voices, or their integration. Another limitation was the focus on a single music culture and style, a design choice that allowed us to compare attention bias across two conditions that were minimally different. Indeed, different cultures are characterised by unique musical features. For example, some traditions employ distinct systems of note organisation, such as makams, which incorporate additional pitches such as quarter tones that are not commonly found in Western music, except in certain avant-garde applications. Likewise, the inclusion of different musical styles would introduce varying melodic features, as they are constructed using diverse techniques. Future research should consider generalising the results of this research using a more diverse set of stimuli that can capture these interesting variabilities across music style and culture, as well as investigating the impact of other related factors, such as subjective biases (e.g., personal taste) and musical training.
This experiment involved two-stream stimuli where the main melody was primarily on the high pitch line, while both streams could stand on their own. The pitch inversion in PolyInv allowed us to study the neural encoding of music under different attention biases, while also disentangling the effects of high-voice superiority and melody statistics. Switching pitch lines results in an inversion of the voicing, which affects the perception of the sound but not the harmonic context. Working with only two melodic streams facilitates their interchangeability, avoiding harmonic issues. Although a chord is typically defined by three notes, the absence of a third voice does not compromise harmonic clarity here. In particular, many of these melodic elements, especially within the fugue, are constructed using the double-counterpoint technique, which guarantees the interchange of voices without disrupting the structural integrity of the composition. Although the bass line is often less ornamented than the main melody, it provides essential harmonic grounding. This foundation ensures that even when the lines are inverted, the resulting music retains coherence and stability, despite the absence of a third voice to fully articulate the harmonic framework. As such, while generalising the results of this study to other music styles requires further work, our choice of double-counterpoint pieces enabled manipulations that were key for testing our hypotheses.
Stream integration mechanisms contribute to melody formation
Numerous studies have reported that neural responses to monophonic melodies reflect the expectedness of each note, leading to patterns of neural activation that are compatible with a Bayesian view of the human brain, fitting frameworks such as predictive processing ((39); but see (21)). However, the relevance of those findings to the uninstructed listening of polyphonic music remains unclear. Listeners can identify a music piece by singing back its melody, indicating that part of polyphonic music processing involves transforming the incoming sound into a monophonic melody. This melody might correspond to a given stream or result from the integration of two or more streams. Previous work proposed theories on how that melody might be extracted. Here, we worked under the assumption that our brains process the resulting melody according to the predictive processing framework, where melodic expectations have been shown to be measurable with EEG (16). Using music transformers, we built numerical hypotheses on those melodic expectations, in one case assuming the processing of each monophonic stream separately (monophonic model), and in the other an integration of the two polyphonic streams (polyphonic model; Fig. 3).
In contrast with existing theories, our results suggest that the human brain processes multiple music streams via a dynamic strategy. Specifically, a strong attention bias toward a dominant stream led to melody processing consistent with a monophonic processing of that stream. A weak attention bias, instead, led to an integration of the two streams into a single melody, as evidenced by a stronger alignment with the polyphonic model of melodic expectations. In light of this result, we propose a weighted integration model, where attention modulates the rate of stream integration. The resulting integrated melody would then be the input of a statistical predictive process, which is what was measured in Figure 3. It should be noted that our results only speak to melody formation, and they do not address the important issue of how the support stream is processed in the presence of a strong attention bias. While we contend that the figure-ground model could be a valid possibility, our data neither address that question nor exclude other possibilities, such as the divided attention model or one of its variations.
With the premise that attention might orchestrate music stream integration, one might observe that PolyOrig is more closely related to everyday listening of, for example, Western music, which typically involves a dominant melody. As a result, everyday music listening might be consistent with how the corresponding melodies would be processed in a monophonic context. This is a tantalising possibility, as it would make previous findings on monophonic melody perception directly relevant to more typical, everyday multi-stream music listening. It should also be noted that our results involved averaging attention bias measurements throughout entire music pieces, while that bias likely changes within each piece. As such, the rate of integration might change dynamically at a finer grain than assessed here, calling for further investigation.
Our findings complement the past literature on simultaneous pitch sequences. Previous work using an oddball paradigm measured mismatch-negativity responses to deviant notes in both high- and low-pitch streams (40). That work had constraints that made the listening task different from the typical music listening experience, such as relying on the processing of deviant notes, using very short stimuli, and having isochronous note timing. Nonetheless, that study brought forward important data supporting the view that the melody predictions built by our brains are compared with two simultaneous streams, leading to deviant responses (i.e., surprises) in either of them. Our result goes beyond that in several ways, especially by determining what melodic context is used by our brains to build note expectations. In doing so, these data confirm that melodic expectations can be measured for two streams simultaneously, and they determine that stream integration can occur naturally when the attention bias is weak.
Conclusions
Additional research is necessary to determine whether the inter-relationship between attention and melody formation is specific to this music style or reflects a general mechanism. Our methodology could be replicated on two-stream polyphonic music from other styles and cultures, with one caveat: the choice of an appropriate computational model of music must take into account the music culture or style of the stimuli and listener. That might constitute a limiting factor for under-represented music styles, even though the rapid developments in music transformer research are promising in that regard and may close that gap in the near future. Further work should also explore factors such as repetition, musical training, and music preference, which are expected to be particularly important for uninstructed listening tasks. Repetition, for example, may alter the attention bias and has been put forward as a possible way of disentangling the figure-ground and integration accounts of polyphonic music processing (9). With regard to musical training, while our results did not show any effects, the definition of musical training in this sample was quite varied. Future work aiming to shed light on that specific phenomenon should consider larger and more homogeneous samples.
In sum, we demonstrate that polyphonic music processing can be studied with EEG using an uninstructed listening task. Our findings include novel insights into the inter-relationship between attention and the statistical processing of melody. Based on these results, we propose an extension of the stream integration model of polyphonic music listening, where attention bias modulates the engagement of integration strategies. We also speculate that stream integration might be a consequence of an auditory-motor neural pathway condensing the auditory input into monophonic motor instructions that can be produced via our vocal tract. More work is necessary to validate and extend our findings and speculations, for example by considering more diverse sets of stimuli, music styles, and participant cohorts. Furthermore, while our study encapsulated melodic expectations into a single metric, our brains have been shown to encode expectations in relation to different attributes (16). And while this study only focussed on pitch and timing properties when studying expectations, it is possible that different results would emerge in relation to distinct properties. For example, it might be the case that expectations for low-pitch stream melodies are more strongly tied to rhythm, while pitch contour could be more important for high-pitch streams, in line with previous results from basic auditory physiology research (41). Additional research with tailored tasks and stimuli should be explored to tackle those questions.
Methods
Participants
Forty participants were recruited for this study. All participants finished the study; however, nine were excluded due to technical issues with the EEG recording, leading to a dataset of 31 participants overall (16 males, aged between 19 and 30, mean = 23.2, std = 2.3), 8 of whom had received formal musical training or had at least 2 years of professional musical experience. All participants gave their written informed consent to participate in this study. The study was undertaken in accordance with the Declaration of Helsinki and was approved by the ethics committee of the School of Psychology of Trinity College Dublin. Data collection was carried out between September 2022 and January 2024.
Experimental design
Testing took place in a dark room. The experiment was organised into two experimental blocks: a listening block and a melody identification block (Figure 1), collecting EEG and behavioural measurements respectively. First, participants were presented with monophonic and polyphonic music pieces, with the sole instruction of listening to the music while looking at a fixation cross and minimising motor movements. After the listening block, the second part of the experiment involved a melody identification task, aided by a listen-and-repeat exercise. Specifically, participants were presented with short (4-8 seconds) segments of polyphonic music, extracted from previously presented PolyOrig or PolyInv pieces. After hearing a segment, participants were asked to sing it back. Then, they were presented with the two melodies as separate monophonic pieces and asked: “Which of the 2 melodies were you attempting to sing? 1: only the first melody; 2: mostly the first melody; 3: an equal mix of both; 4: mostly the second melody; 5: only the second melody”. EEG signals were not recorded during the second block.
Neurobs Presentation software (http://www.neurobs.com) was used to implement the experiment, which was carried out in a single session for each participant. Audio stimuli were presented at a sampling rate of 44,100 Hz through Sennheiser HD 280 Pro headphones. EEG data were acquired simultaneously at a sampling rate of 250 Hz from twenty-four electrode positions using an mBrainTrain Smarting wireless system.
Stimuli
Stimuli consisted of MIDI versions of 32 classical music pieces from a corpus of Bach and other Western composers (Table 1). The original pieces were ∼150 s long snippets of polyphonic music. Minor manual corrections were applied to ensure that each piece was the combination of two monophonic streams, with no more than two notes co-occurring at any given time. These two streams were then separated into the high-pitch stream and the low-pitch stream (corresponding to the main and support stream respectively). The original polyphonic pieces (PolyOrig) were then manipulated to generate stimuli for the other two conditions: PolyInv and Monophonic. For PolyInv, the two streams were pitch-shifted (typically by one octave) to invert the pitch lines, ensuring that all notes in the low-pitched stream remained below those of the high-pitched stream. This ensured that the high- and low-pitch streams were clearly distinguishable in terms of pitch range (42). This manipulation produced the PolyInv stimuli, where the melody originally built to be the main stream was now on the low pitch line. The monophonic stimuli consisted of the individual streams extracted from PolyOrig and PolyInv. Finally, the musical integrity of the resulting pieces was verified by a music expert and composer (I.C.S.).
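A minimal sketch of this inversion step is shown below, operating on hypothetical vectors of MIDI note numbers rather than on the full MIDI scores.

```matlab
% Minimal sketch of the pitch-line inversion used to create PolyInv. highPitches
% and lowPitches are hypothetical vectors of MIDI note numbers for the two
% streams of one piece; streams are shifted by whole octaves (12 semitones)
% until the new low-pitch stream lies entirely below the new high-pitch stream.

newHigh = lowPitches;    % the original support (low) stream becomes the high stream
newLow  = highPitches;   % the original main (high) stream becomes the low stream

while max(newLow) >= min(newHigh)
    newHigh = newHigh + 12;   % shift up by one octave
    newLow  = newLow  - 12;   % shift down by one octave
end
% In practice a single octave shift per stream was typically sufficient (see text).
```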
EEG data preprocessing
EEG and stimulus data were anonymised after collection and stored using the Continuous-event Neural Data (CND) format. EEG data were preprocessed and analysed offline in MATLAB R2023a, using custom code built on the analysis scripts of the CNSP open science initiative (https://github.com/CNSP-Workshop/CNSP-resources). EEG signals were band-pass filtered between 1 and 8 Hz using high-pass and low-pass Butterworth zero-phase filters of order two, implemented with the filtfilt function. EEG signals were then downsampled to 125 Hz. EEG channels with a variance exceeding three times that of the surrounding ones were replaced by an estimate calculated using spherical spline interpolation. Finally, EEG data were re-referenced to the global average of all channels.
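The sketch below summarises this preprocessing chain under simplifying assumptions; the noisy-channel criterion is approximated with the median channel variance, and the spherical spline interpolation itself is not shown.

```matlab
% Minimal sketch of the preprocessing chain (Signal Processing Toolbox).
% eegRaw is a hypothetical [time x channels] matrix sampled at fsIn = 250 Hz.

fsIn = 250; fsOut = 125;

% Band-pass 1-8 Hz with second-order Butterworth filters, applied forward and
% backward (zero-phase) with filtfilt
[bHi, aHi] = butter(2, 1 / (fsIn / 2), 'high');
[bLo, aLo] = butter(2, 8 / (fsIn / 2), 'low');
eeg = filtfilt(bHi, aHi, eegRaw);
eeg = filtfilt(bLo, aLo, eeg);

% Downsample to 125 Hz
eeg = resample(eeg, fsOut, fsIn);

% Flag noisy channels (here against the median channel variance, as a simple
% proxy for the neighbour-based criterion in the text); spherical spline
% interpolation of the flagged channels would be done with a dedicated toolbox
chanVar  = var(eeg);
badChans = find(chanVar > 3 * median(chanVar));

% Re-reference to the global average of all channels
eeg = eeg - mean(eeg, 2);
```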
Analytic procedure
A system identification approach was used to compute the channel-specific mapping between music features and EEG responses. This method, referred to as the temporal response function (TRF), models the neural response at time t and channel η as a linear convolution of the stimulus property s(t) with an unknown, channel-specific filter w(τ, η), plus a residual term:

$$ r(t, \eta) = \sum_{\tau} w(\tau, \eta)\, s(t - \tau) + \varepsilon(t, \eta) $$

Here, r(t, η) is the instantaneous neural response, and ε(t, η) is the residual error not explained by the model.
The TRF w(τ, η) was estimated using regularised linear regression (ridge regression) to minimise overfitting. The solution is given by:

$$ w = (S^{\mathsf{T}} S + \lambda I)^{-1} S^{\mathsf{T}} r $$

where S is the lagged time-series matrix of the stimulus features, λ is the ridge parameter, and I is the identity matrix.
Model performance is evaluated using leave-one-out cross-validation across trials. The quality of a prediction is quantified by calculating Pearson’s correlation between the pre-processed recorded signals and the corresponding predictions at each scalp electrode.
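For illustration, a minimal base-MATLAB sketch of the lagged design matrix, the ridge solution above, and the correlation-based evaluation is given below; variable names, lag range, and the regularisation value are assumptions made for illustration.

```matlab
% Minimal base-MATLAB sketch of the forward TRF estimation and evaluation
% described above (corr requires the Statistics and Machine Learning Toolbox).
% stimTrain/stimTest are hypothetical [time x 1] stimulus vectors and
% eegTrain/eegTest are [time x channels] EEG matrices; lags span 0-350 ms at
% fs = 125 Hz, and lambda is a single illustrative ridge value.

fs = 125; lags = 0:round(0.350 * fs); lambda = 1e3;

% Lagged (time x lags) stimulus matrix S
makeLagged = @(s) cell2mat(arrayfun(@(L) [zeros(L, 1); s(1:end - L)], lags, ...
                                    'UniformOutput', false));
S = makeLagged(stimTrain);

% Ridge solution w = (S'S + lambda*I)^-1 * S'r, one column of weights per channel
w = (S' * S + lambda * eye(numel(lags))) \ (S' * eegTrain);

% Predict held-out EEG and evaluate with Pearson's correlation at each channel
pred = makeLagged(stimTest) * w;
r = diag(corr(pred, eegTest));
```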
Conversely, the backward TRF is a linear filter that can be fit to decode stimulus features from the neural recording. To estimate spontaneous attention, a backward TRF model is fit on the monophonic condition, where only one stream is played at a time. Note that the monophonic stimuli in the experiment were randomly selected from all streams in any of the polyphonic conditions, constructing a balanced training set and leading to a decoder that reconstructs the envelope of the attended melody. That decoder was then applied to the polyphonic conditions, producing envelope reconstructions that best correlate with the attended stream, which is then used to compare the decoding accuracies of the two streams the participants actually heard.
Stimuli feature extraction
Acoustic and melodic features were extracted from the audio files. Acoustic features included the sound envelope (Env), which was extracted from the Hilbert transform of the acoustic waveform, and the half-wave rectified envelope derivative (Env’), which was calculated on the Hilbert envelope at the original sampling rate (44,100 Hz), before downsampling both Env and Env’. Melodic features were derived using the Anticipatory Music Transformer (AMT), a generative pre-trained transformer model trained on Western music. Transformer architectures internally build a probability distribution for each upcoming note based on long-term memory (the training set) and short-term memory (the context). The melodic features used were the surprise and entropy of the pitch of each note (S, H). Surprise is the information content of the note:

$$ S(n_i) = -\log_2 P(n_i \mid n_{i-1}, n_{i-2}, \ldots) $$

where n_i is the current note and n_{i-1}, n_{i-2}, … are the previous notes. The entropy represents the uncertainty at the time of the new note, calculated as the Shannon entropy over the distribution of the set of possible notes N:

$$ H_i = -\sum_{n \in N} P(n \mid n_{i-1}, n_{i-2}, \ldots) \log_2 P(n \mid n_{i-1}, n_{i-2}, \ldots) $$
All features were downsampled to 125 Hz for analysis.
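As an illustration of how these note-level features are obtained from a model's predictive distribution, a minimal sketch is given below; the probability vector is a hypothetical stand-in for the transformer's output.

```matlab
% Minimal sketch: surprise and entropy for one note, given the model's
% predictive distribution over the possible next pitches. p is a hypothetical
% probability vector (summing to 1) over the pitch alphabet N, and pitchIdx is
% the index of the pitch that actually occurred.

surprise = -log2(p(pitchIdx));          % information content of the observed note
entropy  = -sum(p .* log2(p + eps));    % Shannon entropy of the predictive distribution

% Repeating this at every note and placing the values at the corresponding
% note-onset samples yields the S and H regressors used in the analyses above.
```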
Statistical analysis
All statistical analyses were carried out in MATLAB R2023a. Analyses directly comparing conditions were performed using repeated-measures two-way ANOVAs with the ranova function. One-sample t-tests were used for post hoc tests. Correction for multiple comparisons was applied where necessary via the Benjamini-Hochberg procedure, using the fdr_bh function. Effect sizes are reported using Cohen’s d.
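A minimal sketch of this statistical pipeline, applied to the envelope reconstruction correlations as an example, is given below; the data layout and variable names are illustrative assumptions.

```matlab
% Minimal sketch of the 2x2 repeated-measures ANOVA (stream x condition) on the
% envelope reconstruction correlations, using fitrm/ranova (Statistics and
% Machine Learning Toolbox). r is a hypothetical [nParticipants x 4] matrix with
% columns ordered as: high/PolyOrig, low/PolyOrig, high/PolyInv, low/PolyInv.

data   = array2table(r, 'VariableNames', {'HighOrig', 'LowOrig', 'HighInv', 'LowInv'});
within = table(categorical({'high'; 'low'; 'high'; 'low'}), ...
               categorical({'orig'; 'orig'; 'inv'; 'inv'}), ...
               'VariableNames', {'Stream', 'Condition'});

rm  = fitrm(data, 'HighOrig-LowInv ~ 1', 'WithinDesign', within);
tbl = ranova(rm, 'WithinModel', 'Stream*Condition');   % main effects and interaction

% Post hoc paired comparisons, corrected with the Benjamini-Hochberg procedure
[~, p1] = ttest(r(:, 1), r(:, 2));              % high vs low pitch in PolyOrig
[~, p2] = ttest(r(:, 3), r(:, 4));              % high vs low pitch in PolyInv
[h, ~, ~, pAdj] = fdr_bh([p1, p2], 0.05);       % assumes the fdr_bh helper is available
```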
Data Availability
Analysis code and data (both EEG signals and stimuli) will be made publicly available in a standardised format (Continuous-event Neural Data (43)) at the time of publication via the OSF repository.
Acknowledgements
This research was conducted with the financial support of Research Ireland at ADAPT, the Research Ireland Centre for AI-Driven Digital Content Technology at Trinity College Dublin and University College Dublin [13/RC/2106_P2]. I.C.S. was supported by a Government of Ireland Postgraduate Scholarship (Irish Research Council). For the purpose of Open Access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. We thank the Cognition and Natural Sensory Processing (CNSP) initiative, which provided the blueprint for the analysis code and data standardisation guidelines used in this work. We thank Asena Akkaya and Amirhossein Chalehchaleh for their help with data collection.
References
1. Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing 20:1759–70.
2. Melody transcription from music audio: Approaches and evaluation. IEEE Transactions on Audio, Speech, and Language Processing 15:1247–56.
3. Music Cognition. London: Academic Press.
4. Distinct neural encoding of glimpsed and masked speech in multitalker situations. PLoS Biology 21:e3002128.
5. Get the gist of the story: Neural map of topic keywords in multi-speaker environment. bioRxiv.
6. Listening to polyphonic music. Psychology of Music 18:163–70.
7. The perception of interleaved melodies. Cognitive Psychology 5:322–37.
8. Attending to two melodies at once: the effect of key relatedness. Psychology of Music 9:39–43.
9. Divided attention in music. International Journal of Psychology 35:270–8.
10. Change detection in multi-voice music: the role of musical structure, musical training, and task demands. Journal of Experimental Psychology: Human Perception and Performance 28:367.
11. Explaining the high voice superiority effect in polyphonic music: Evidence from cortical evoked potentials and peripheral auditory models. Hearing Research 308:60–70.
12. The high-voice superiority effect in polyphonic music is influenced by experience: A comparison of musicians who play soprano-range compared with bass-range instruments. Psychomusicology: Music, Mind, and Brain 22:97.
13. Early development of polyphonic sound encoding and the high voice superiority effect. Neuropsychologia 57:50–8.
14. Development of simultaneous pitch encoding: infants show a high voice superiority effect. Cerebral Cortex 23:660–9.
15. EEG-based decoding of auditory attention to a target instrument in polyphonic music. In: 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
16. Cortical encoding of melodic expectations in human temporal cortex. eLife 9:e51784. https://doi.org/10.7554/eLife.51784
17. Statistical learning and probabilistic prediction in music cognition: mechanisms of stylistic enculturation. Annals of the New York Academy of Sciences 1423:378–95.
18. Long-term implicit memory for sequential auditory patterns in humans. eLife 9:e56073. https://doi.org/10.7554/eLife.56073
19. The neurobiological basis of musical expectations. In: Hallam S, Cross I, Thaut MH (Eds.).
20. Auditory Expectation: The Information Dynamics of Music Perception and Cognition. Topics in Cognitive Science 4:625–52.
21. Multi-stream predictions in human auditory cortex during natural music listening. bioRxiv.
22. This time with feeling: Learning expressive musical performance. Neural Computing and Applications 32:955–67.
23. Anticipatory music transformer. arXiv.
24. Assessing top-down and bottom-up contributions to auditory stream segregation and integration with polyphonic music. Frontiers in Neuroscience 12:121.
25. The multivariate temporal response function (mTRF) toolbox: A MATLAB toolbox for relating neural signals to continuous stimuli. Frontiers in Human Neuroscience 10.
26. Linear Modeling of Neurophysiological Responses to Speech and Other Continuous Stimuli: Methodological Considerations for Applied Research. Frontiers in Neuroscience 15:705621.
27. Attentional Selection in a Cocktail Party Environment Can Be Decoded from Single-Trial EEG. Cerebral Cortex.
28. The Music of Silence: Part II: Music Listening Induces Imagery Responses. The Journal of Neuroscience 41:7449.
29. The Music of Silence: Part I: Responses to Musical Imagery Accurately Encode Melodic Expectations and Acoustics. Journal of Neuroscience.
30. Cortical activity during naturalistic music listening reflects short-range predictions based on long-term experience. eLife 11:e80935. https://doi.org/10.7554/eLife.80935
31. Multiway canonical correlation analysis of brain data. NeuroImage 186:728–40.
32. Low-frequency cortical responses to natural speech reflect probabilistic phonotactics. NeuroImage 196:237–47.
33. In true polyphonic music every line develops its own metric pattern without even a regularity of accented beats among voices; the uniformity is brought about in the harmony alone. London: Oxford Univ. Press. p. 742.
34. Real-time control of a hearing instrument with EEG-based attention decoding. Journal of Neural Engineering 22:016027.
35. EEG alpha and pupil diameter reflect endogenous auditory attention switching and listening effort. European Journal of Neuroscience 55:1262–77.
36. Decoding the auditory brain with canonical component analysis. NeuroImage 172:206–16.
37. Noise-robust cortical tracking of attended speech in real-world acoustic scenes. NeuroImage 156:435–44.
38. Modulating cortical instrument representations during auditory stream segregation and integration with polyphonic music. Frontiers in Neuroscience 15:635937.
39. Surfing Uncertainty: Prediction, Action, and the Embodied Mind. Oxford University Press.
40. Automatic encoding of polyphonic melodies in musicians and nonmusicians. Journal of Cognitive Neuroscience 17:1578–92.
41. Superior time perception for lower musical pitch explains why bass-ranged instruments lay down musical rhythms. Proceedings of the National Academy of Sciences 111:10383–8.
42. The Avoidance of Part-Crossing in Polyphonic Music: Perceptual Evidence and Musical Practice. Music Perception: An Interdisciplinary Journal 9:93–103.
43. A standardised open science framework for sharing and re-analysing neural data acquired to continuous stimuli. Neurons, Behavior, Data Analysis, and Theory:1–25.
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.108767. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2025, Winchester et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.