Neuroscience

Auditory-motor surprisal reveals learning across multiple timescales during exploration and production

Haiqin Zhang
Giorgia Cantisani
Shihab Shamma

Laboratoire des systèmes perceptifs, CNRS, École Normale Supérieure, PSL University, Paris, France
Sciences et Technologies de la Musique et du Son Lab, CNRS, IRCAM, Sorbonne Université, Paris, France
Neural Systems Laboratory, University of Maryland College Park, College Park, United States

https://doi.org/10.7554/eLife.111080.1

Open access

Figures and data

Schematic of the MirrorNet model proposed by Shamma et al. (7) in relation to the neural substrates of a sound production task.
The forward pathway (decoder) maps the motor to the auditory regions, generating predictions of the sounds corresponding to the actions; the inverse pathway (encoder) maps auditory to motor regions, translating an intended sound into its corresponding motor commands. Predictions are informed by the previous context, i.e., the previous actions in the motor sequence and the previous tokens in the auditory one.

Experimental paradigm and hypotheses.
A. Example of keys pressed by a participant over time. Vertical lines mark individual strokes, while the colored bar shows the active key-pitch map (green: inverted, yellow: shifted-inverted, red: normal). The small keyboards show key-pitch assignments and their pitch distribution on a standard piano keyboard, for each map. B. Hypotheses about surprisal differences between first and other keystrokes after a map change. The null hypothesis assumes no effect of the map change; the ‘first vs. others’ hypothesis predicts higher surprisal for the first keystroke than for later ones. C. An experimental session consisted of three main blocks, each lasting approximately 30 minutes: pre-training, training, and post-training. The pre- and post-training blocks were identical, each including three 10-minute tasks: passive listening, mute playing, and variable-map playing. D. During training, participants imitated 4-note melodies by playing them back on the keyboard, in two blocks of increasing difficulty. E. Training was evaluated note-by-note against the target melody, ignoring rhythm and timing. F. Mean imitation scores for each block show learning over time (Wilcoxon signed-rank test, p = 0.002). Lines denote individual participants.

Neural signatures of auditory-motor surprisal
A. Grand-average ERPs of firsts and others at Fz. The dashed line marks 100 ms, where responses differ significantly. B. Topography of N100 amplitude differences between firsts and others (Δ_f−o), masked to significant channels (FDR-corrected Wilcoxon signed-rank test). C. Bootstrapped Δ_f−o distribution (two 100-sample resamples per iteration, one from firsts and one from others, over 1000 iterations) versus a null distribution where both samples are drawn from the pool of other keystrokes only. Horizontal bars indicate the CI95. D. Δ_f−o difference waves. Bars indicate regions of interest identified by the cluster-based permutation test (p < 0.05). E. Distribution of N100 Δ_f−o in the playing and listening experiments, compared with the null distribution; bars mark significant clusters from the permutation cluster test (p < 0.05). F. Surprisal of heard notes as calculated by cross-validation using IDyOM: the scatter plot shows an excerpt from one recording, the histogram shows surprisal distributions over all recordings for all firsts and a size-matched random sample of others (p < 0.001).
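
The bootstrap comparison in panel C can be illustrated with a minimal sketch. It is not the authors' code: the function names, the synthetic amplitudes, and the choice to resample with replacement are all assumptions made for illustration. Only the 100-sample resamples, the 1000 iterations, and the others-only null come from the caption.

```python
import random
from statistics import mean


def bootstrap_delta(firsts, others, n_samples=100, n_iter=1000, seed=0):
    """Return (observed, null) lists of bootstrapped mean differences.

    Assumption: resampling is done with replacement.
    """
    rng = random.Random(seed)
    observed, null = [], []
    for _ in range(n_iter):
        # Observed: one resample from firsts, one from others.
        f = rng.choices(firsts, k=n_samples)
        o = rng.choices(others, k=n_samples)
        observed.append(mean(f) - mean(o))
        # Null: both resamples come from the pool of others only.
        o1 = rng.choices(others, k=n_samples)
        o2 = rng.choices(others, k=n_samples)
        null.append(mean(o1) - mean(o2))
    return observed, null


def ci95(values):
    """Percentile-based 95% interval of a list of values."""
    s = sorted(values)
    return s[int(0.025 * (len(s) - 1))], s[int(0.975 * (len(s) - 1))]


# Hypothetical N100 amplitudes (in microvolts); NOT data from the study.
data_rng = random.Random(0)
firsts = [data_rng.gauss(-3.0, 1.0) for _ in range(200)]
others = [data_rng.gauss(-2.0, 1.0) for _ in range(2000)]

observed, null = bootstrap_delta(firsts, others)
obs_lo, obs_hi = ci95(observed)
null_lo, null_hi = ci95(null)
print("observed CI95:", (obs_lo, obs_hi))
print("null CI95:", (null_lo, null_hi))
```

In this toy setting the observed interval sits below zero while the null interval straddles it, which is the pattern the horizontal CI95 bars are meant to convey.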

The influence of short-term context
A. Hypothesis: surprisal is modulated by the number of keystrokes in the preceding map. B. Bar plot showing mean difference in N100 amplitude between first and other keystrokes (Δ_f−o) as a function of the number of keystrokes in the previous map. Independent samples t-test for comparisons between firsts with different numbers of previous keystrokes (bottom stars); one-sample t-test for comparisons between firsts and others (top stars). C. Hypothesis: surprisal is modulated by the number of keystrokes since the first keystroke in the current map. D. Δ_f−o sorted by the number of keystrokes since the map change. Independent samples t-test.

Disentangling auditory and motor components of playing responses
A. Grand average ERPs to note onsets at Fz for passive listening, mute playing, and variable map playing. B. The auditory decoder reconstructs note onsets from passive listening EEG; the motor decoder reconstructs key presses from mute playing EEG. C. Both decoders are then applied to the playing EEG. D. Example of ground truth and reconstructed onsets. E. Reconstruction amplitudes at first and other keystrokes are averaged and compared in pre- and post-training. Lines represent participants; bars show medians and quartiles. Wilcoxon signed-rank test, p < 0.001.

Effects of targeted and sustained training.
A. Grand-average ERPs of firsts and others at Fz before and after training. Mean with shading representing SEM; dashed lines mark time-points of interest (P50 and N100). B. Δ_f−o in pre- and post-training at Fz. C. Amplitudes of N100 (minimum over 80-120 ms) and P50 (maximum over 20-60 ms) averaged across centro-frontal channels. D. Topomap of Δ_f−o at N100 and P50 in pre- and post-training, as well as the difference between the two (Δ_f−o,post − Δ_f−o,pre). Colours are masked to show channels with significant Δ_f−o (FDR-corrected Wilcoxon signed-rank test, only significant channels are displayed, significance threshold at p = 0.05). E. N100 and P50 Δ_f−o bootstrapped distributions as a function of training and map type against the null distribution. F. Subject mean N100 and P50 Δ_f−o as a function of training and map type (p = 0.00281, Wilcoxon signed-rank test). G. Correlation between training score and Δ_f−o at N100 in pre-training (R = 0.477, p = 0.0451, Pearson’s correlation) and post-training (not significant). Points represent individual participants.
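
The peak definitions in panel C (N100 as the minimum over 80-120 ms, P50 as the maximum over 20-60 ms) can be written down as a short sketch. The function names and the toy waveforms are hypothetical, and the sketch handles a single averaged waveform rather than the centro-frontal channel average described in the caption.

```python
def window_extreme(times_ms, erp, start_ms, end_ms, kind):
    """Minimum or maximum of an ERP inside [start_ms, end_ms]."""
    window = [v for t, v in zip(times_ms, erp) if start_ms <= t <= end_ms]
    if not window:
        raise ValueError("no samples fall inside the requested window")
    return min(window) if kind == "min" else max(window)


def n100_p50(times_ms, erp):
    """Peak amplitudes using the windows stated in the caption."""
    return {
        "N100": window_extreme(times_ms, erp, 80, 120, "min"),
        "P50": window_extreme(times_ms, erp, 20, 60, "max"),
    }


# Toy waveforms sampled every 10 ms; the values are illustrative only.
times = list(range(0, 201, 10))


def make_erp(p50, n100):
    return [p50 if t == 50 else n100 if t == 100 else 0.0 for t in times]


peaks_first = n100_p50(times, make_erp(p50=1.0, n100=-3.0))
peaks_other = n100_p50(times, make_erp(p50=1.5, n100=-2.0))

# Difference between firsts and others at each component.
delta_f_o = {k: peaks_first[k] - peaks_other[k] for k in peaks_first}
print(delta_f_o)
```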

Key-pitch map assignment over time during the full 10-minute task.

Time points with significant differences between the amplitude of the first and others at Fz, Cz, Pz, and Iz channels.
* p < 0.05, Wilcoxon signed-rank test, without FDR correction.

First keystrokes of the map versus the keystrokes immediately after the map change when entering each of the key-pitch maps, at timepoints of interest identified in Figure S2: 50 and 370 ms.
For analysis at 100 ms, see Fig. 3.

Grand average note onset ERPs during the passive listening, mute playing, and variable map playing tasks in all channels, in pre-training only.
Colours represent EEG channels.

ERPs aligned to the note onsets during the passive listening, mute playing, and variable map playing tasks in the FCz channel, showing differences between pre- and post-training.
Lines represent the grand average over all subjects, shading represents SEM.

Note onset ERPs sorted by map, including all firsts and others, separated by pre- and post-training.
Lines represent the grand average over all subjects, shading represents SEM.

A. Example of the keystrokes included when analyzing entry into a map, here the INV map: INV/first and INV/other keystrokes. B. Difference waves (first - other ERPs) sorted by the map entered, pre- and post-training.

Correlation between training score and Δ_f−o at P50.
Columns show the active key-pitch map at the time of the keystroke. Rows show correlations in pre-training, post-training, and the difference between the two (post-pre). Dotted lines show linear regression. p > 0.05 for all panels.

Reconstruction accuracy of inverse TRF decoder.
A-B. Average window around note onset times in reconstructed stimulus using listening and motor decoders, respectively. C-D. Correlation between ground truth (sparse vector with time of note onsets set to 1) and reconstructed stimulus when the ground truth is shuffled versus unshuffled, for listening and motor decoders, respectively. ** p < 0.01, **** p < 0.0001, Pearson’s correlation.
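
The shuffled-versus-unshuffled comparison in panels C-D can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the "reconstruction" here is simply the ground truth plus Gaussian noise, standing in for the output of a trained decoder, and all names and parameter values are hypothetical.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


rng = random.Random(1)
n = 500

# Ground truth: sparse vector with note-onset samples set to 1.
truth = [0.0] * n
for i in range(10, n, 25):
    truth[i] = 1.0

# Stand-in for a decoder output (assumption): truth plus noise.
recon = [t + rng.gauss(0.0, 0.3) for t in truth]

# Unshuffled correlation versus correlation with a shuffled ground truth.
r_unshuffled = pearson_r(truth, recon)
shuffled = truth[:]
rng.shuffle(shuffled)
r_shuffled = pearson_r(shuffled, recon)
print(r_unshuffled, r_shuffled)
```

Shuffling preserves the number of onsets but destroys their timing, so a decoder that tracks onset times should correlate better with the unshuffled vector.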

Comparison of weighted sum of auditory and motor ERPs with playing ERP.
A. Grand average ERPs for the auditory-only condition (locked to note onset), motor-only condition (locked to keystrokes), the variable map playing ERP, and the optimized weighted sum of auditory and motor ERPs. Weights were determined by least squares optimization. B. Root mean square error (RMSE) between the playing ERP and the optimized weighted sum of auditory and motor ERPs across different lags applied to the motor ERP relative to the auditory ERP. The lowest RMSE occurs at zero lag.
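
The weighted-sum fit and lag sweep described above can be sketched with a two-regressor least-squares solve. The waveforms below are synthetic, and the playing ERP is built as an exact mixture of the other two at zero lag, so the minimum at zero lag is true by construction here and says nothing about the real data. The zero-padded shift and all names are assumptions.

```python
from math import exp, pi, sin, sqrt


def shift(x, lag):
    """Delay x by `lag` samples (negative = advance), zero-padding the gap."""
    if lag > 0:
        return [0.0] * lag + x[:-lag]
    if lag < 0:
        return x[-lag:] + [0.0] * (-lag)
    return list(x)


def fit_weights(aud, mot, play):
    """Least-squares (w_a, w_m) for play ~ w_a*aud + w_m*mot (normal equations)."""
    aa = sum(a * a for a in aud)
    mm = sum(m * m for m in mot)
    am = sum(a * m for a, m in zip(aud, mot))
    ap = sum(a * p for a, p in zip(aud, play))
    mp = sum(m * p for m, p in zip(mot, play))
    det = aa * mm - am * am
    if det == 0:
        raise ValueError("auditory and motor ERPs are collinear")
    return (ap * mm - mp * am) / det, (mp * aa - ap * am) / det


def rmse(aud, mot, play, w_a, w_m):
    """Root mean square error of the weighted sum against the playing ERP."""
    resid = [p - w_a * a - w_m * m for a, m, p in zip(aud, mot, play)]
    return sqrt(sum(r * r for r in resid) / len(resid))


# Synthetic ERPs (illustrative shapes only).
n = 60
aud = [sin(2 * pi * t / 20) * exp(-t / 30) for t in range(n)]
mot = [exp(-(((t - 25) / 5) ** 2)) for t in range(n)]
play = [0.8 * a + 0.5 * m for a, m in zip(aud, mot)]

# Refit the weights at each lag of the motor ERP and record the RMSE.
rmse_by_lag = {}
for lag in range(-5, 6):
    shifted = shift(mot, lag)
    w_a, w_m = fit_weights(aud, shifted, play)
    rmse_by_lag[lag] = rmse(aud, shifted, play, w_a, w_m)

best_lag = min(rmse_by_lag, key=rmse_by_lag.get)
w_a0, w_m0 = fit_weights(aud, mot, play)
print(best_lag, w_a0, w_m0)
```

Refitting the weights at every lag keeps the comparison fair: each lag gets its best possible weighted sum before the errors are compared.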
