EEG decodability of facial expressions and their stereoscopic depth cues in immersive virtual reality

  1. Max Planck Institute for Human Cognitive and Brain Sciences, Department of Neurology, Leipzig, Germany
  2. Humboldt-Universität zu Berlin, Department of Psychology, Berlin, Germany
  3. Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany
  4. Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute (HHI), Department of Artificial Intelligence, Berlin, Germany
  5. Department of Physics and Life Science Imaging Center, Hong Kong Baptist University, Hong Kong, China
  6. Faculty of Education, National University of Malaysia, Bangi, Malaysia

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Ming Meng
    University of Alabama at Birmingham, Birmingham, United States of America
  • Senior Editor
    Andre Marquand
    Radboud University Nijmegen, Nijmegen, Netherlands

Reviewer #1 (Public review):

Summary:

The study by Klotzsche et al. examines whether emotional facial expressions can be decoded from EEG while participants view 3D faces in immersive VR, and whether stereoscopic depth cues affect these neural representations. Participants viewed computer-generated faces (three identities, four emotions) rendered either stereoscopically or monoscopically while performing an emotion recognition task. Time-resolved multivariate decoding revealed above-chance decodability of facial expressions from EEG. Importantly, decoding accuracy did not differ between monoscopic and stereoscopic viewing, indicating that the neural representation of expressions is robust to stereoscopic disparity, at least with respect to the features relevant for expression decoding. However, a separate classifier could distinguish the depth condition (mono vs. stereo) from EEG, i.e., the pattern of neuronal activity differs between conditions, but not in ways relevant for the decoding of emotions. Depth decoding had an early peak and a temporal profile similar to identity decoding, suggesting that early, task-irrelevant visual differences are captured neurally. Cross-decoding further demonstrated that expression decoders trained in one depth condition generalize to the other, supporting the idea of representational invariance. Eye-tracking analyses showed that expressions and identities could be decoded from gaze patterns, but the depth condition could not, and EEG- and gaze-based decoding performances were not correlated across participants. Overall, this work shows that EEG decoding in VR is feasible and sensitive, and suggests that stereoscopic cues are represented in the brain but do not influence the neural processing of facial expressions. This study addresses a relevant question with state-of-the-art experimental and data analysis techniques.
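
For readers less familiar with this approach, the cross-decoding logic described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline; the file name and the metadata columns "expression" and "depth" are hypothetical placeholders for epoched EEG data and trial labels.

```python
import mne
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from mne.decoding import SlidingEstimator

# Hypothetical epochs file; any preprocessed mne.Epochs object would work here.
epochs = mne.read_epochs("sub-01_task-faces-epo.fif")

X = epochs.get_data()                      # (n_epochs, n_channels, n_times)
y = epochs.metadata["expression"].values   # emotion label per trial (hypothetical column)
depth = epochs.metadata["depth"].values    # "mono" or "stereo" per trial (hypothetical column)

# One classifier fitted independently at every time point
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
decoder = SlidingEstimator(clf, scoring="roc_auc_ovr", n_jobs=-1)

# Cross-decoding: train on monoscopic trials, test on stereoscopic trials
decoder.fit(X[depth == "mono"], y[depth == "mono"])
scores = decoder.score(X[depth == "stereo"], y[depth == "stereo"])  # AUC per time point
```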

Strengths:

(1) It combines EEG, stereoscopic and monoscopic presentation of visual stimuli in virtual reality, and advanced data analysis methods to address a timely question.

(2) The figures are of very high quality.

(3) The reference list is appropriate and up to date.

Weaknesses:

(1) The introduction-results-discussion-methods order makes it hard to follow the Results without repeatedly consulting the Methods. Please introduce minimal, critical methodological context at the start of each Results subsection; reserve technical details for Methods/Supplement.

(2) Many Results subsections begin with a crisp question and present rich analyses, but end without a short synthesis. Please add 1-2 sentences that explicitly answer the opening question and state what the analyses demonstrate.

(3) The Results compellingly show that (a) expressions are decodable from EEG and (b) mono vs. stereo trials are decodable from EEG, yet expression decoding is comparable across mono and stereo. It would help to articulate why depth is neurally distinguishable while leaving expression representations unchanged. Consider expanding the discussion of the source-localization results and connecting it more closely to what is already known about the processing of disparity.

Reviewer #2 (Public review):

Summary:

The authors' main aim was to determine the extent to which the emotional expression of face images could be inferred from electrophysiological data under the viewing conditions imposed by immersive virtual reality displays. Further, given that stereoscopic depth cues can be easily manipulated in such displays, the authors wished to investigate whether successful emotion decoding was affected by the presence or absence of these depth cues, and also if the presence/absence of depth cues was itself a property of the viewing experience that could be decoded from neural data.

Overall, the authors use fairly standard approaches to decoding neural data to demonstrate that above-chance results (slightly above the 0.5 chance threshold for their measure of choice) are in general achievable for emotion decoding, decoding the identity of faces from neural data, and decoding the presence/absence of depth cues in an immersive virtual reality display. They further examine the contribution of specific components of the response to visual stimuli with similar outcomes.

Strengths:

The main contribution of the manuscript is methodological. Rather than shedding particular light on the neural mechanisms supporting depth processing or face perception, what is on offer is primarily a straightforward examination of an applied question. With regard to the goal of answering that applied question, I think the paper succeeds. The overall experimental design is not novel, but in this case, that is a good thing. The authors have used relatively unadorned tasks and previous approaches to applying decoding tools to EEG data to see what they can get out of the neural data collected under these viewing conditions. While I would say that there is not a great deal that is especially surprising about these results, the authors do meet the goal they set for themselves.

Weaknesses:

Some of the key weaknesses I see are points that the authors raise themselves in their discussion, particularly with regard to the generalizability of their results. In particular, the 3D faces they have employed here perhaps exhibit a somewhat limited repertoire of emotional expression and do not necessarily cover a representative gamut of emotional face appearances, such as one would encounter in naturalistic settings. Then again, part of the goal of the paper was to examine the decodability of emotional expression in a specific, non-natural viewing environment - a viewing environment in which one could reasonably expect to encounter artificial faces like these. Still, the limitations of the stimuli potentially limit the scope of the conclusions one should draw from the data. I also think that there is a great deal of room for low-level image properties to drive the decoding results for faces, which could have been addressed in a number of ways (matching power spectra, for example, or using an inverted-image control condition). The absence of such control comparisons means that it is difficult to know if this is really a result that reflects face processing or much lower-level image differences that are diagnostic of emotion or identity in this subset of images. Again, to some extent, this is potentially acceptable - if one is mostly interested in whether this result is achievable at all (by hook or by crook), then it is not so important how the goal is met. Then again, one would perhaps like to know if what has been measured here is more a reflection of spatial vision vs. face processing mechanisms.
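
As an illustration of the suggested power-spectrum control (a generic sketch of the standard technique, not something the authors did), Fourier amplitude spectra can be equated across stimulus images while each image's phase is preserved, so that decodable differences can no longer stem from global spectral content:

```python
import numpy as np

def match_amplitude_spectra(images):
    """Equate the Fourier amplitude spectra of equally sized grayscale images,
    keeping each image's phase spectrum. Returns the spectrum-matched images."""
    spectra = [np.fft.fft2(img) for img in images]
    mean_amplitude = np.mean([np.abs(s) for s in spectra], axis=0)
    return [np.real(np.fft.ifft2(mean_amplitude * np.exp(1j * np.angle(s))))
            for s in spectra]
```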

Reviewer #3 (Public review):

Summary:

This study investigates two main questions:

(1) whether brain activity recorded during immersive virtual reality can differentiate facial expressions and stereoscopic depth, and

(2) whether depth cues modulate facial information processing.

The results show that both expression and depth information can be decoded from multivariate EEG recorded in a head-mounted VR setup; however, the decoding of facial expressions does not benefit from depth information.

Strengths:

The study is technically strong and well executed. EEG data are of high quality despite the challenges of recording inside a head-mounted VR system. The work effectively combines stereoscopic stimulus presentation, eye-tracking to monitor gaze behavior, and time-resolved multivariate decoding techniques. Together, these elements provide an exemplary demonstration of how to collect and analyze high-quality EEG data in immersive VR environments.

Weaknesses:

The major limitation concerns the theoretical question of how stereoscopic depth modulates facial expression processing. While previous work has suggested that stereoscopic depth cues can shape natural face perception and has emphasized the importance of binocular information in recognizing facial expressions (lines 95-97), the present study reports a null effect of depth. However, the stimulus configuration used here likely constrained the ability to detect any depth-related effects: all facial stimuli were static, frontal, and presented at a fixed distance. This design leads to near-ceiling behavioral performance and no behavioral effect of depth on expression recognition, which makes the null modulation of depth on expression processing unsurprising and limits the theoretical reach of the study. I suggest either adding more subtle or naturalistic features (such as varied viewing angles and dynamic expressions) to the stimulus set if the authors aim to advance a strong theoretical claim about the role of binocular disparity, or reframing the work as a technical validation of EEG decoding in this context.

Another issue relates to the claim that eye movements cannot explain the EEG decoding results. Removing eye-movement-related artifacts and confounds is a real challenge, as the VR setup tends to encourage viewers to explore the environment freely. However, nearly half of the eye-tracking datasets were lost (usable in only 17 of 33 participants), which substantially weakens the evidence for an EEG-gaze dissociation. Moreover, decoding facial information from only two-dimensional gaze direction seems hardly feasible, given that even with 60 EEG channels the decoding accuracy was modest (AUC ≈ 0.60). Together, these two factors limit the evidential weight of the reported null correlation between neural and eye-data decoding.

The decoding analysis appears to use all 60 EEG channels as input features. I wonder why the authors did not examine more spatially specific channel subsets. Facial expression and depth cues are known to preferentially engage occipito-temporal regions (e.g., N170-related sites), yet the current approach treats all sensors equally. Including all channels may add noise and irrelevant signals to the decoding of facial information. Moreover, using a spatially specific channel subset would align more directly with the subsequent source reconstruction.
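
For reference, restricting a sensor-space decoder to an occipito-temporal subset is straightforward in MNE-Python. The sketch below assumes a hypothetical epochs file and standard 10-20 channel labels; the actual 60-channel montage may use different names.

```python
import mne

# Hypothetical file path; any preprocessed mne.Epochs object would work here.
epochs = mne.read_epochs("sub-01_task-faces-epo.fif")

# Assumed occipito-temporal channel labels; adapt to the actual montage.
ot_channels = ["P7", "P8", "PO7", "PO8", "O1", "O2", "TP9", "TP10"]
epochs_ot = epochs.copy().pick(ot_channels)

# Data restricted to the sensor subset, ready for the same time-resolved decoder
X_ot = epochs_ot.get_data()  # shape: (n_epochs, n_subset_channels, n_times)
```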

Author response:

We thank the reviewers for their thoughtful and constructive comments. We are pleased that they found the study technically strong and the integration of EEG decoding, immersive VR, and eye tracking valuable.

Across all three reviews, several points of clarification emerged. In our revision, we will focus on:

(1) Improving clarity and structure of the manuscript (Reviewer #1).

We will strengthen the flow between the Methods and Results subsections and include explicit concluding statements for the individual Results subsections.

(2) Emphasizing the methodological scope and limitations with respect to the stimulus set and generalizability (Reviewers #2 and #3).

We will further emphasize that a key objective was to establish, for the first time, the methodological feasibility of decoding facial features (especially emotional expressions) under VR conditions, and that our stimulus set (facial expressions that were easy to distinguish) limits (a) the task relevance, and thus possibly the neural integration, of depth information and (b) the generalizability to settings in which expressions are harder to distinguish. We appreciate the suggestion of an inverted-face control to further investigate the extent to which the decoding results were based on low-level features; we do not plan a follow-up experiment at this stage, but we will discuss this limitation more explicitly.

We believe these revisions will substantially strengthen the manuscript and further highlight its methodological focus.
