EEG-based detection of the locus of auditory attention with convolutional neural networks

  1. Servaas Vandecappelle (corresponding author)
  2. Lucas Deckers
  3. Neetha Das
  4. Amir Hossein Ansari
  5. Alexander Bertrand
  6. Tom Francart (corresponding author)
  1. Department of Neurosciences, Experimental Oto-rhino-laryngology, Belgium
  2. Department of Electrical Engineering (ESAT), Stadius Center for Dynamical Systems, Signal Processing and Data Analytics, Belgium
9 figures, 3 tables and 1 additional file

Figures

CNN architecture (windows of T samples).

Input: T time samples of a 64-channel EEG signal, at a sampling rate of 128 Hz. Output: two scalars that determine the attended direction (left/right). The convolution, shown in blue, spans 130 ms of data across all channels. EEG = electroencephalography, CNN = convolutional neural network, ReLU = rectified linear unit, FC = fully connected.
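For a concrete picture of this architecture, the following is a minimal sketch in PyTorch, not the authors' code: the number of convolutional filters (5), the 17-sample kernel (≈130 ms at 128 Hz), the average pooling over time, and the single fully connected layer are assumptions consistent with the caption, chosen so that the network accepts windows of arbitrary length T and outputs two scalars (left/right).

```python
# Minimal sketch of a CNN of this type (illustration, not the authors' code).
# Assumptions: 5 convolutional filters, a 17-sample kernel (~130 ms at 128 Hz),
# average pooling over time, and one fully connected layer with two outputs.
import torch
import torch.nn as nn

class AttentionCNN(nn.Module):
    def __init__(self, n_channels=64, kernel_samples=17, n_filters=5):
        super().__init__()
        # The convolution spans all EEG channels and ~130 ms of time.
        self.conv = nn.Conv2d(1, n_filters, kernel_size=(n_channels, kernel_samples))
        self.relu = nn.ReLU()
        # Collapsing the time axis makes the model independent of the window length T.
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        # Two output scalars that determine the attended direction (left/right).
        self.fc = nn.Linear(n_filters, 2)

    def forward(self, x):
        # x: (batch, 1, 64 channels, T samples)
        h = self.relu(self.conv(x))
        h = self.pool(h).flatten(1)   # (batch, n_filters)
        return self.fc(h)             # (batch, 2)

# Example: a batch of 1 s decision windows (T = 128 samples at 128 Hz).
model = AttentionCNN()
scores = model(torch.randn(8, 1, 64, 128))
```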

Auditory attention detection performance of the CNN for two different window lengths.

Linear decoding model shown as baseline. Blue dots: per-subject results, averaged over two test stories. Gray lines: same subjects. Red triangles: median accuracies. CNN = convolutional neural network.
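The linear baseline is a stimulus-reconstruction (backward) decoder. As a rough illustration of how such a decoder can be trained and used to make left/right decisions, here is a hedged sketch; the ridge regularization, the 32-sample (≈250 ms) lag range, and the Pearson-correlation decision rule are assumptions, not details taken from the article.

```python
# Hedged sketch of a backward (stimulus-reconstruction) linear decoder for
# auditory attention decoding. Ridge regularization, a 0-250 ms lag range
# (32 samples at 128 Hz), and Pearson correlation are assumptions.
import numpy as np

def lagged(eeg, n_lags=32):
    """Stack time-lagged copies of the EEG: (T, C) -> (T, C * n_lags)."""
    T, C = eeg.shape
    X = np.zeros((T, C * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * C:(lag + 1) * C] = eeg[:T - lag]
    return X

def train_decoder(eeg, attended_envelope, ridge=1e2):
    """Least-squares decoder mapping lagged EEG to the attended speech envelope."""
    X = lagged(eeg)
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ attended_envelope)

def decode_window(eeg_window, env_left, env_right, w):
    """Reconstruct the envelope and pick the side with the higher correlation."""
    reconstruction = lagged(eeg_window) @ w
    corr = lambda a, b: np.corrcoef(a, b)[0, 1]
    return "left" if corr(reconstruction, env_left) > corr(reconstruction, env_right) else "right"
```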

Minimal expected switch durations (MESDs) for the CNN and the linear baseline.

Dots: per-subject results, averaged over two test stories. Gray lines: same subjects. Vertical black bars: median MESD. As before, two poorly performing subjects were excluded from the analysis. CNN = convolutional neural network.

Auditory attention detection performance as a function of the decision window length.

Blue dots: per-subject results, averaged over two test stories. Gray lines: same subjects. Red triangles: median accuracies. CNN = convolutional neural network.
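To make the evaluation concrete: accuracy at a given decision window length is obtained by segmenting each test trial into windows of that length and scoring the per-window decisions. A small sketch follows; the non-overlapping segmentation and the `predict` callback (returning "left"/"right") are assumptions.

```python
# Sketch of per-window evaluation: segment each test trial into non-overlapping
# decision windows and score the decisions against the trial label.
import numpy as np

def windowed_accuracy(eeg_trials, labels, predict, window_s, fs=128):
    win = int(window_s * fs)
    correct = total = 0
    for eeg, label in zip(eeg_trials, labels):          # eeg: (T, 64)
        for start in range(0, eeg.shape[0] - win + 1, win):
            correct += predict(eeg[start:start + win]) == label
            total += 1
    return correct / total

# e.g. {w: windowed_accuracy(trials, labels, predict, w) for w in (1, 2, 5, 10)}
```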

Auditory attention detection performance of the CNN when one particular frequency band is removed (left) and when only one band is used (right).

The original results are also shown for reference. Each box plot contains results for all window lengths and for the two test stories.
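A sketch of how the EEG can be split into frequency bands for this kind of analysis is given below; the exact band edges and the zero-phase Butterworth filters are assumptions, not the authors' preprocessing.

```python
# Sketch of a band-pass filterbank for the EEG. The band edges below and the
# zero-phase Butterworth filters are assumptions.
from scipy.signal import butter, sosfiltfilt

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14), "beta": (14, 32)}

def band_filter(eeg, band, fs=128, order=4):
    """Band-pass filter the EEG (shape (T, channels)) to a single frequency band."""
    lo, hi = BANDS[band]
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, eeg, axis=0)

# "Only one band": train/test on band_filter(eeg, "alpha"), etc.
# "One band removed": subtract that band from the signal, or use a band-stop filter.
```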

Grand-average topographic map of the normalized power of convolutional filters.
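As an illustration of how such a map can be produced: compute the per-channel power of the convolutional kernel weights, normalize, and plot on a scalp layout. The sketch below assumes a BioSemi 64-channel montage and MNE-Python for plotting; neither is stated in the caption.

```python
# Sketch of a topographic map of per-channel filter power. The BioSemi 64-channel
# montage and MNE-Python are assumptions; channel names must match the montage.
import numpy as np
import mne

def filter_power_topomap(conv_weights, ch_names, sfreq=128.0):
    # conv_weights: (n_filters, n_channels, kernel_samples)
    power = (conv_weights ** 2).sum(axis=-1).mean(axis=0)   # power per channel
    power /= power.max()                                     # normalize
    info = mne.create_info(ch_names, sfreq, ch_types="eeg")
    info.set_montage(mne.channels.make_standard_montage("biosemi64"))
    mne.viz.plot_topomap(power, info)
```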
Impact of the model validation strategy on the performance of the CNN (decision windows of 1 s).

In Leave-one-story+speaker-out, the training set does not contain examples of the speakers or stories that appear in the test set. In Every trial (unprocessed), the training, validation, and test sets are extracted from every trial (although always disjoint), and no spatial filtering takes place. In Every trial (per-trial MWFs), data is again extracted from every trial, but this time per-trial MWFs are applied. CNN = convolutional neural network, MWF = multichannel Wiener filter.
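For concreteness, a sketch of how leave-one-story+speaker-out folds can be constructed from trial metadata is shown below; the "story"/"speaker" trial tags and the 85/15 train/validation split over the remaining trials (done globally here rather than per story, for brevity) are assumptions.

```python
# Sketch of leave-one-story+speaker-out fold construction from trial metadata.
import random

def story_speaker_folds(trials, held_out_stories=({"story1"}, {"story2"})):
    # trials: list of dicts, e.g. {"story": "story1", "speaker": "spk1", "data": ...}
    folds = []
    for held_out in held_out_stories:
        test = [t for t in trials if t["story"] in held_out]
        test_speakers = {t["speaker"] for t in test}
        rest = [t for t in trials
                if t["story"] not in held_out and t["speaker"] not in test_speakers]
        random.shuffle(rest)
        n_train = int(0.85 * len(rest))
        folds.append({"train": rest[:n_train], "val": rest[n_train:], "test": test})
    return folds
```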

Impact of leaving out the test subject on the accuracy of the CNN model (decision windows of 1 s).

Blue dots: per-subject results, averaged over two test stories. Gray lines: same subjects. Red triangles: median accuracies. CNN = convolutional neural network.

Author response image 1
Grand-average temporal profile of the filters in the convolutional layer.
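One possible way to compute such a profile, shown only as an illustration: sum the squared kernel weights over channels and average over filters and subjects, leaving one value per time lag. The averaging order and the use of squared weights are assumptions.

```python
# Sketch of a grand-average temporal profile of the convolutional kernels.
import numpy as np

def temporal_profile(conv_weights_per_subject):
    # each element: array of shape (n_filters, n_channels, kernel_samples)
    profiles = [(w ** 2).sum(axis=1).mean(axis=0) for w in conv_weights_per_subject]
    return np.mean(profiles, axis=0)   # one value per time lag within the kernel
```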

Tables

Table 1
First eight trials for a random subject.

Trials are numbered according to the order in which they were presented to the subject. The ear to be attended first was chosen at random; after that, the attended ear alternated from trial to trial. Presentation (dichotic/HRTF) was balanced over subjects with respect to the attended ear. Adapted from Das et al., 2016. HRTF = head-related transfer function.

Trial  Left stimulus   Right stimulus  Attended ear  Presentation
1      Story1, part1   Story2, part1   Left          Dichotic
2      Story2, part2   Story1, part2   Right         HRTF
3      Story3, part1   Story4, part1   Left          Dichotic
4      Story4, part2   Story3, part2   Right         HRTF
5      Story2, part1   Story1, part1   Left          Dichotic
6      Story1, part2   Story2, part2   Right         HRTF
7      Story4, part1   Story3, part1   Left          Dichotic
8      Story3, part2   Story4, part2   Right         HRTF
Table 2
Cross-validating over stories and speakers.

With the current dataset, there are only two folds that do not mix stories and speakers across training and test sets. Top: story 1 as test data; stories 2, 3, and 4 as training and validation data (85%/15% split, per story). Bottom: the same, but with a different story and speaker as test data. In both cases, the test story and speaker are completely unseen by the model. The model is trained on the same training set for all subjects and tested on a unique, subject-specific test set.

Story  Speaker  Subject 1  Subject 2  …  Subject 16
1      1        test       test       …  test
2      2        train/val
3      3        train/val
4      3        train/val

Story  Speaker  Subject 1  Subject 2  …  Subject 16
1      1        train/val
2      2        test       test       …  test
3      3        train/val
4      3        train/val
Author response table 1
Leave-one-story-out scheme.

Example of one out of four folds. In this particular fold, the test set consists of story 1, and the training and validation sets consist of stories 2, 3, and 4. The training and validation sets are completely separate from the test set. Per-subject accuracies are based on a subject-specific test set (indicated by the multiple mentions of "test" in the table). The model is trained on the data of all subjects (indicated by the single mention of "train/val").

Story  Subject 1  Subject 2  …  Subject 16
1      test       test       …  test
2      train/val
3      train/val
4      train/val


Cite this article as:
Servaas Vandecappelle, Lucas Deckers, Neetha Das, Amir Hossein Ansari, Alexander Bertrand, Tom Francart (2021) EEG-based detection of the locus of auditory attention with convolutional neural networks. eLife 10:e56481. https://doi.org/10.7554/eLife.56481