Abstract
A central challenge for the brain is how to combine separate sources of information from different sensory modalities to optimally represent objects and events in the external world, such as combining someone’s speech and lip movements to better understand them in a noisy environment. At the level of individual neurons, audiovisual stimuli often elicit super-additive interactions, where the neural response is greater than the sum of auditory and visual responses. However, investigations using electroencephalography (EEG) to record brain activity have revealed inconsistent interactions, with studies reporting a mix of super- and sub-additive effects. A possible explanation for this inconsistency is that standard univariate analyses obscure multisensory interactions present in EEG responses by overlooking multivariate changes in activity across the scalp. To address this shortcoming, we investigated EEG responses to audiovisual stimuli using inverted encoding, a population tuning approach that uses multivariate information to characterise feature-specific neural activity. Participants (n=41) completed a spatial localisation task for both unisensory stimuli (auditory clicks, visual flashes) and combined audiovisual stimuli (spatiotemporally congruent clicks and flashes). To assess multivariate changes in EEG activity, we used inverted encoding to recover stimulus location information from event-related potentials (ERPs). Participants localised audiovisual stimuli more accurately than unisensory stimuli alone. For univariate ERP analyses we found an additive multisensory interaction. By contrast, multivariate analyses revealed a super-additive interaction ∼180 ms following stimulus onset, such that the location of audiovisual stimuli was decoded more accurately than that predicted by maximum likelihood estimation. Our results suggest that super-additive integration of audiovisual information is reflected within multivariate patterns of activity rather than univariate evoked responses.
Introduction
We exist in a complex, dynamically changing sensory environment. Vertebrates, including humans, have evolved sensory organs that transduce relevant sources of physical information, such as light and changes in air pressure, into patterns of neural activity that support perception (vision and audition) and adaptive behaviour. Such activity patterns are noisy, and often ambiguous, due to a combination of external (environmental) and internal (transduction) factors. Critically, information from the different sensory modalities can be highly correlated because it is often elicited by a common external source or event. For example, the sight and sound of a hammer hitting a nail produces a single, unified perceptual experience, as does the sight of a person’s lips moving as we hear their voice. To improve the reliability of neural representations, the brain leverages these sensory relationships by combining information in a process referred to as multisensory integration. Such processes heighten perception, e.g., by making it easier to understand a person’s speech in a noisy setting by looking at their lip movements (Sumby & Pollack, 1954).
Multisensory integration of audiovisual cues improves performance across a range of behavioural outcomes, including detection accuracy (Bolognini et al., 2005; Frassinetti et al., 2002; Lovelace et al., 2003), response speed (Arieh & Marks, 2008; Cappe et al., 2009; Colonius & Diederich, 2004; Rach & Diederich, 2006; Senkowski et al., 2011), and saccade speed and accuracy (Corniel et al., 2002; Van Wanrooij et al., 2009). Successful integration requires the constituent stimuli to occur at approximately the same place and time (Leone & McCourt, 2015). The degree to which behavioural performance is improved follows the principles of maximum likelihood estimation (MLE), wherein sensory information from each modality is weighted and integrated according to its relative reliability (Alais & Burr, 2004; Ernst & Banks, 2002; although other processing schemes have also been identified; Rideaux & Welchman, 2018). As such, behavioural performance that matches MLE predictions is often seen as a benchmark of successful, optimal integration of relevant unisensory cues.
The ubiquity of behavioural enhancements for audiovisual stimuli suggests there are fundamental neural mechanisms that facilitate improved precision. Recordings from single multisensory (audiovisual) neurons within the cat superior colliculus have revealed the principle of inverse effectiveness, whereby the enhancement of the response to audiovisual stimuli is proportionally larger when the constituent unisensory stimuli are only weakly stimulating (Corniel et al., 2002; Meredith & Stein, 1983). Depending on the intensity of the integrated stimuli, the neural response can be super-additive (the multisensory response is greater than the sum of the unisensory responses), additive (equal to the sum of the unisensory responses), or sub-additive (less than the sum of the unisensory responses; see Stein & Stanford, 2008). Inverse effectiveness has also been observed in human behavioural experiments, with low-intensity audiovisual stimuli eliciting greater multisensory enhancements in response precision than those of high intensity (Colonius & Diederich, 2004; Corniel et al., 2002; Rach & Diederich, 2006; Rach et al., 2010).
Neuroimaging methods, such as electroencephalography (EEG) and functional magnetic resonance imaging (fMRI), have been used to investigate neural population-level audiovisual integration in humans. These studies have typically applied an additive criterion to quantify multisensory integration, wherein successful integration is marked by a non-linear enhancement of audiovisual responses relative to unisensory responses (Besle et al., 2004). The super- or sub-additive nature of this enhancement, however, is often inconsistent. In fMRI, neural super-additivity in blood-oxygen-level dependent (BOLD) responses to audiovisual stimuli has been found in a variety of regions, primarily the superior temporal sulcus (STS; Calvert et al., 2000; Calvert et al., 2001; Stevenson et al., 2007; Stevenson & James, 2009; Werner & Noppeney, 2010, 2011). However, other studies have failed to replicate audiovisual super-additivity in the STS (Joassin et al., 2011; Porada et al., 2021; Ross et al., 2022; Venezia et al., 2015), or have found sub-additive responses (see Scheliga et al., 2023 for review). As such, some have argued that BOLD responses are not sensitive enough to adequately characterise super-additive audiovisual interactions within populations of neurons (Beauchamp, 2005; James et al., 2012; Laurienti et al., 2005). In EEG, meanwhile, the evoked response to an audiovisual stimulus typically conforms to a sub-additive principle (Cappe et al., 2010; Fort et al., 2002; Giard & Peronnet, 1999; Murray et al., 2016; Puce et al., 2007; Stekelenburg & Vroomen, 2007; Teder-Sälejärvi et al., 2002; Vroomen & Stekelenburg, 2010). However, other studies have found super-additive enhancements to the amplitude of sensory event-related potentials (ERPs) for audiovisual stimuli (Molholm et al., 2002; Talsma et al., 2007), especially when considering the influence of stimulus intensity (Senkowski et al., 2011).
While behavioural outcomes for multisensory stimuli can be predicted by MLE, and single neuron responses follow the principles of inverse effectiveness and super-additivity, among others (Rideaux et al., 2021), how audiovisual super-additivity manifests within populations of neurons is comparatively unclear given the mixed findings from relevant fMRI and EEG studies. This uncertainty may be due to biophysical limitations of human neuroimaging techniques, but it may also be related to the analytic approaches used to study these recordings. In particular, information encoded by the brain can be represented as increased activity in some areas, accompanied by decreased activity in others, so simplifying complex neural responses to the average rise and fall of activity may obscure relevant multivariate patterns of activity evoked by a stimulus.
Inverted encoding is a multivariate analytic method that can reveal how sensory information is encoded within the brain by recovering patterns of neural activity associated with different stimulus features. This method has been successfully used in fMRI, EEG, and magnetoencephalography studies to characterise the neural representations of a range of stimulus features, including colour (Brouwer & Heeger, 2009), spatial location (Bednar & Lalor, 2020; Robinson et al., 2021) and orientation (Brouwer & Heeger, 2011; Harrison et al., 2023; Kok et al., 2017). A multivariate approach may capture potential non-linear enhancements associated with audiovisual responses and thus could reveal super-additive interactions that would otherwise be hidden within the brain’s univariate responses. The sensitivity of inverted encoding analyses to multivariate neural patterns may provide insight into how audiovisual information is processed and integrated at the population level.
In the present study, we investigated neural super-additivity in human audiovisual sensory processing using inverted encoding of EEG responses during a task where participants had to spatially localise visual, auditory, and audiovisual stimuli. In a separate behavioural experiment, we monitored response accuracy to characterise behavioural improvements to audiovisual relative to unisensory stimuli. Although there was no evidence for super-additivity in response to audiovisual stimuli within univariate ERPs, we observed a reliable non-linear enhancement of multivariate decoding performance at ∼180 ms following stimulus onset when auditory and visual stimuli were presented concurrently as opposed to alone. These findings suggest that population-level super-additive multisensory neural responses are present within multivariate patterns of activity rather than univariate evoked responses.
Methods
Participants
Seventy-one human adults were recruited in return for payment. The study was approved by The University of Queensland Human Research Ethics Committee, and informed consent was obtained in all cases. Participants were first required to complete a behavioural session with above chance performance to qualify for the EEG session (see Behavioural session for details). Twenty-nine participants failed to meet this criterion and were excluded from further participation and analyses, along with one participant who failed to complete the EEG session with above chance behavioural accuracy. This left a total of 41 participants (M = 27.21 yrs; min = 20 yrs; max = 64 yrs; 24 females; 41 right-handed). Participants reported no neurological or psychiatric disorders, and had normal visual acuity (assessed using a standard Snellen eye chart).
Materials and procedure
The experiment was split into two separate sessions, with participants first completing a behavioural session followed by an EEG session. Each session had three conditions, in which the presented stimuli were either visual, auditory, or combined audio and visual (audiovisual). The order in which conditions were presented was counterbalanced across participants. Before each task, participants were given instructions and completed two rounds of practice for each condition.
Apparatus
The experiment was conducted in a dark, acoustically and electromagnetically shielded room. For the EEG session, stimuli were presented on a 24-inch ViewPixx monitor (VPixx Technologies Inc., Saint-Bruno, QC) with 1920x1080-pixel resolution and a refresh rate of 144 Hz. Viewing distance was maintained at 54 cm using a chinrest. For the behavioural session, stimuli were presented on a 32-inch Cambridge Research Systems Display++ LCD monitor with 1920x1080-pixel resolution, hardware gamma correction, and a refresh rate of 144 Hz. Viewing distance was maintained at 59.5 cm using a chinrest. Stimuli were generated in MATLAB v2021b (The MathWorks Inc., 2021) using the Psychophysics Toolbox (Brainard, 1997). Auditory stimuli were played through two loudspeakers (25-75 W, 6 Ω) placed on either side of the display. In both sessions, an EyeLink 1000 infrared eye tracker (SR Research Ltd., 2009) recorded gaze direction at a sampling rate of 1000 Hz.
Stimuli
The EEG and behavioural paradigms used the same stimuli within each condition. Visual stimuli were Gaussian blobs (0.2 contrast, 16° diameter) presented for 16 ms on a mid-grey background. Auditory stimuli were 100 ms clicks with a flat 850 Hz tone embedded within a decay envelope (sample rate = 44,100 Hz; volume = 60 dBA SPL, as measured at the ears). Audiovisual stimuli were spatially and temporally matched combinations of the auditory and visual stimuli, with no changes to stimulus properties. To manipulate spatial location, target stimuli were presented at multiple horizontal locations along the display, centred on linearly spaced positions from 15° of visual angle to the left of the display centre to 15° to the right (eight locations for the behavioural session, five for the EEG session). Auditory stimuli were played through two speakers placed equidistantly on either side of the display. The perceived source location of auditory stimuli was manipulated via changes to interaural intensity and timing (Whitworth & Jeffress, 1961; Wightman & Kistler, 1992). Specifically, for stimuli toward the edges of the display, the nearer speaker played the click marginally louder and slightly earlier than the farther speaker, whereas stimuli toward the centre had more uniform volume and smaller delays between speakers.
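As an illustration of this manipulation, the sketch below generates a stereo click whose interaural level and time differences scale linearly with target azimuth. The decay constant, maximum level difference, and maximum delay are illustrative assumptions, not the calibrated values used in the experiment.

% Illustrative stereo click with azimuth-dependent interaural differences.
fs       = 44100;                               % sample rate (Hz)
t        = (0:round(0.1*fs)-1)'/fs;             % 100 ms click
click    = sin(2*pi*850*t) .* exp(-t/0.02);     % 850 Hz tone in a decay envelope (20 ms tau, assumed)

azimuth  = -15;                                 % target location (deg); negative = left
maxITD   = 0.0006;                              % maximum interaural delay (s), assumed
maxILDdB = 6;                                   % maximum interaural level difference (dB), assumed

delaySamp = round(abs(azimuth)/15 * maxITD * fs);
gain      = 10^((abs(azimuth)/15 * maxILDdB) / 20);  % level boost for the nearer ear

lead = [click; zeros(delaySamp, 1)];            % channel that plays first
lag  = [zeros(delaySamp, 1); click];            % channel delayed by delaySamp samples
if azimuth < 0
    stereo = [lead*gain, lag];                  % left ear earlier and louder
else
    stereo = [lag, lead*gain];                  % right ear earlier and louder
end
sound(stereo, fs);                              % play through the two loudspeakers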
Behavioural Session
Prior to data collection, stimulus intensity and timing were manipulated to make visual and auditory stimuli similarly difficult to spatially localise. We employed a two-interval forced choice design to measure participants’ audiovisual localisation sensitivity. Participants were presented with two consecutive stimuli and tasked with indicating, via button press, whether the first or second interval contained the more leftward stimulus. Each trial consisted of a central reference stimulus and a target stimulus presented at one of eight locations along the horizontal azimuth of the display. The presentation order of the reference and target stimuli was randomised across trials. Stimulus modality was either auditory, visual, or audiovisual. Trials were blocked by condition, with short (∼2 min) breaks between conditions (see Figure 1A for an example trial). Each condition consisted of 384 target presentations across the eight locations, i.e., 48 presentations at each location.
EEG Session
In this session, the experimental task was changed slightly from the behavioural session to increase the number of stimulus presentations, as required for inverted encoding analyses of EEG data. Participants viewed and/or listened to a sequence of 20 stimuli, each of which was presented at one of five horizontal locations along the display (selected at random). At the end of each sequence, participants were tasked with indicating, via button press, whether more presentations appeared on the right or the left of the display. To minimise eye movements, participants were asked to fixate on a black dot presented 8° above the display centre (see Figure 1B for an example trial). The task used in the EEG session included the same blocked conditions as in the behavioural session, i.e., visual, auditory, and (congruent) audiovisual stimuli. As the locations of stimuli were selected at random, some sequences had an equal number of presentations on each side of the display, and thus had no correct “left” or “right” response; these trials were not included in the analysis of behavioural performance. Each block consisted of 10 trials, followed by a feedback display indicating the number of trials participants answered correctly. Each condition consisted of 12 blocks, yielding a total of 2400 stimulus presentations per condition.
EEG data pre-processing
EEG data were recorded using a 64-channel BioSemi system at a sampling rate of 1024 Hz, which was down-sampled to 512 Hz during preprocessing. Signals were recorded with reference to the CMS/DRL electrode loop, with bipolar electrodes placed above and below the eye, at the temples, and at each mastoid to monitor for eye-movements and muscle artifacts. EEG preprocessing was undertaken in MATLAB using custom scripts and the EEGLAB toolbox (Delorme & Makeig, 2004). Data were high-pass filtered at 0.25 Hz to remove baseline drifts, and re-referenced according to the average of all 64 channels. Analyses were stimulus locked, with ERP responses segmented into 600 ms epochs from 100 ms before stimulus presentation to 500 ms after stimulus presentation. Bad channels, identified by the clean_artifacts function (Kothe & Makeig, 2013), were reconstructed using spherical interpolation from surrounding channels.
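For orientation, the following is a minimal sketch of the preprocessing steps described above using standard EEGLAB functions. The file name, event code, ordering of steps, clean_artifacts settings, and baseline correction are placeholders and assumptions, not the authors’ custom scripts.

% Illustrative EEGLAB preprocessing pipeline (settings are placeholders).
EEG = pop_biosig('sub01.bdf');                % load BioSemi recording (BIOSIG plugin)
EEG = pop_resample(EEG, 512);                 % down-sample 1024 Hz -> 512 Hz
EEG = pop_eegfiltnew(EEG, 0.25, []);          % 0.25 Hz high-pass to remove drifts

cleaned    = clean_artifacts(EEG, 'BurstCriterion', 'off');   % flag bad channels
goodLabels = {cleaned.chanlocs.labels};
badIdx     = find(~ismember({EEG.chanlocs.labels}, goodLabels));
EEG = eeg_interp(EEG, badIdx, 'spherical');   % rebuild bad channels by spherical interpolation

EEG = pop_reref(EEG, []);                     % average reference across the 64 channels
EEG = pop_epoch(EEG, {'stim'}, [-0.1 0.5]);   % stimulus-locked epochs, -100 to 500 ms
EEG = pop_rmbase(EEG, [-100 0]);              % pre-stimulus baseline correction (assumed)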
Forward model
To describe the neural representations of sensory stimuli, we used an inverted modelling approach to reconstruct the location of stimuli based upon the recorded ERPs (Brouwer & Heeger, 2011; Harrison et al., 2023). Analyses were performed separately for visual, auditory, and audiovisual stimuli. We first created an encoding model that characterised the patterns of activity across the EEG electrodes given the five locations of the presented stimuli. The encoding model was then used to obtain the inverse decoding model that described the transformation from electrode activity to stimulus location. We used a 10-fold cross-validation approach in which 90% of the data were used to obtain the inverse model, with which the remaining 10% of the data were decoded. Cross-validation was repeated 10 times such that all the data were decoded. For the purposes of these analyses, we assumed that EEG electrode noise is isotropic across locations and additive with the signal.
Prior to the neural decoding analyses, we established the sensors that contained the most location information by treating time as the decoding dimension and obtaining the inverse models from each electrode, using 10-fold cross-validation. This revealed that location was primarily represented in posterior electrodes for visual and audiovisual stimuli, and in central electrodes for auditory stimuli. Thus, for all subsequent analyses we only included signals from the central-temporal, parietal-occipital, occipital and inion sensors for computing the inverse model.
The encoding model contained five hypothetical channels, with evenly distributed idealised location preferences spanning −15° to +15° of visual angle relative to the display centre. Each channel consisted of a half-wave rectified sinusoid raised to the fifth power. The channels were arranged such that an idealised tuning curve for each location preference could be expressed as a weighted sum of the five channels. The observed activity for each presentation can be described by the following linear model:

B = WC + E

where B indicates the EEG data (m electrodes x n presentations), W is a weight matrix (m electrodes x 5 channels) that describes the mapping from hypothesized channel activity to activity at the EEG electrodes, C denotes the hypothesized channel activities (5 channels x n presentations), and E indicates the residual errors.
To compute the inverse model we estimated the weights that, when applied to the data, would reconstruct the channel activities with the least error. Due to the correlation between neighbouring electrodes, we took noise covariance into account when computing the model to optimize it for EEG data (Harrison et al., 2023; Kok et al., 2017; Rideaux et al., 2023). We then used the inverse model to reconstruct the stimulus location from the recorded ERP responses.
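For concreteness, the following is a minimal MATLAB sketch of the encoding and decoding steps for a single time point, loosely following the noise-normalised approach of Kok et al. (2017). The variable names (Btrain, Btest, locsTrain), the scaling of the basis functions to the ±15° range, the unregularised covariance estimate, and the argmax read-out are illustrative assumptions rather than the authors’ exact implementation.

% Btrain, Btest: electrodes x trials EEG amplitudes; locsTrain: 1 x trials locations (deg).
nChan   = 5;
centres = linspace(-15, 15, nChan);               % channel preferred locations (deg)

% Map the ±15° stimulus range onto the positive half-cycle of the sinusoid
% (the exact scaling is an assumption of this sketch).
scale = 180 / 30;                                 % degrees of phase per degree of azimuth
basis = @(loc, mu) max(0, cosd((loc - mu) * scale)).^5;

% Hypothetical channel responses C (nChan x nTrials) for the training stimuli
C = zeros(nChan, numel(locsTrain));
for k = 1:nChan
    C(k, :) = basis(locsTrain(:)', centres(k));
end

% Encoding weights W (electrodes x channels): least-squares solution of B = W*C
W = Btrain * C' / (C * C');

% Noise covariance of the residuals, used to optimise the spatial filters
% (shrinkage regularisation omitted for brevity)
E     = Btrain - W * C;
Sigma = cov(E');                                  % electrodes x electrodes

% Inverse model: one noise-normalised spatial filter per channel
V = zeros(size(W));
for k = 1:nChan
    V(:, k) = (Sigma \ W(:, k)) / (W(:, k)' * (Sigma \ W(:, k)));
end

% Reconstruct channel responses for held-out data and take a point estimate
Ctest       = V' * Btest;                         % nChan x nTestTrials
[~, idx]    = max(Ctest, [], 1);
decodedLocs = centres(idx);                       % crude argmax read-out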
To assess how well the forward model captured location information in the neural signal for each modality, two measures of performance were analysed. The first, decoding accuracy, was calculated as the similarity of the decoded location to the presented location, represented in arbitrary units. To test whether a super-additive interaction was present in the multivariate response, an additive benchmark against which to compare the audiovisual response was required; however, it is unclear how the arbitrary units used to represent decoding accuracy translate to a measure of the linear summation of auditory and visual accuracy. As with the behavioural analyses, MLE provides a framework for calculating the estimated optimal sensitivity of the combination of two sensory signals, according to signal detection theory principles. The second measure, decoding sensitivity (d’), was therefore computed so that MLE could be applied. To do so, we omitted trials where stimuli appeared in the centre of the display and grouped the decoder’s reconstructions of stimulus location according to whether stimuli appeared on the left or right side of the display. The proportions of hits and false alarms were derived by comparing the decoded side to the presented side, and were used to calculate d’ for each condition (Stanislaw & Todorov, 1999). The d’ values of the auditory and visual conditions were used to estimate the predicted ‘optimal’ audiovisual sensitivity under MLE. We could then compare the observed audiovisual sensitivity against this prediction; sensitivity exceeding the MLE prediction would indicate a non-linear (super-additive) combination of auditory and visual information. A similar method was previously employed to investigate depth estimation from motion and binocular disparity cues decoded from BOLD responses (Ban et al., 2012).
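Concretely, under the independent-Gaussian-noise assumptions of signal detection theory, the MLE (‘optimal’) prediction for audiovisual sensitivity is the quadratic sum of the two unisensory sensitivities. The sketch below uses illustrative variable names (decodedSide, presentedSide, dprimeA, dprimeV) rather than the authors’ code.

% d' for one condition from the decoder's left/right classifications
% (treating "right" as the signal class; extreme proportions would need a
% standard correction in practice).
hitRate = mean(decodedSide(presentedSide ==  1) ==  1);   % right decoded as right
faRate  = mean(decodedSide(presentedSide == -1) ==  1);   % left decoded as right
dprime  = norminv(hitRate) - norminv(faRate);

% MLE prediction for audiovisual sensitivity from the unisensory d' values
dprimeMLE = sqrt(dprimeA^2 + dprimeV^2);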
To provide an additional ‘additive’ multivariate benchmark, complementary to the MLE prediction, we first matched the EEG data between the unisensory conditions such that the order of presented stimulus locations was the same for the auditory and visual conditions. The auditory and visual condition data were then concatenated across sensors, and inverted encoding analyses were performed on the resulting ‘additive’ (aggregate) audiovisual dataset. This additive condition was designed to represent neural activity evoked by both the auditory and visual conditions, without any non-linear neural interaction, and served as a baseline for the audiovisual condition.
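Schematically, the aggregate dataset can be formed by stacking the two trial-matched unisensory datasets along the electrode dimension before running the same decoding pipeline; a minimal sketch with illustrative variable names:

% epochsA, epochsV: electrodes x time x trials for the auditory and visual
% conditions, re-ordered so that trial i in each condition shares the same location.
assert(isequal(locsA, locsV));                 % matched location sequences
epochsAgg = cat(1, epochsA, epochsV);          % 2*64 "electrodes" x time x trials
% epochsAgg is then decoded with the same inverted encoding pipeline as the
% recorded audiovisual condition, yielding the additive (aggregate) benchmark.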
Statistical analyses
Statistical analyses were performed in MATLAB v2021b. Two metrics of performance were calculated to assess behavioural sensitivity. For the behavioural session we calculated participants’ sensitivity separately for each modality condition by fitting psychometric functions to the proportion of rightward responses at each stimulus location. In the EEG session participants responded to sequences of stimuli rather than individual presentations, so behavioural performance was assessed via d’. We derived d’ in each condition from the average proportions of hits and false alarms for each participant’s performance in discriminating the side of the display on which more stimuli were presented (Stanislaw & Todorov, 1999). A one-sample Kolmogorov-Smirnov test for each condition revealed that all conditions in both sessions violated assumptions of normality. Non-parametric two-sided Wilcoxon signed-rank tests were therefore used to test for significant differences in behavioural performance between conditions.
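As an illustration, a probit (cumulative Gaussian) GLM is one straightforward way to fit the psychometric functions and summarise sensitivity as the slope; the authors’ exact fitting routine is not specified here, so the code below is a hedged sketch with illustrative variable names.

% Behavioural session: probit fit of the proportion of rightward responses
% against target location; a steeper slope indicates higher sensitivity.
% locs: 8 x 1 target locations (deg); nRight, nTotal: responses per location.
b     = glmfit(locs(:), [nRight(:) nTotal(:)], 'binomial', 'link', 'probit');
slope = b(2);                                 % sensitivity index for this condition

% EEG session: d' from the proportions of correct and incorrect "right"
% judgements about which side contained more stimuli.
dprime = norminv(hitRate) - norminv(falseAlarmRate);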
For the neural data, univariate ERPs were calculated by averaging EEG activity across presentations and channels for each stimulus location, from −100 to 500 ms around stimulus onset. To test for significant differences between conditions, paired-samples t-tests were conducted between conditions at each time point (a one-sample t-test was used when comparing decoding accuracy against chance, i.e., zero). A cluster correction was then applied to account for spurious differences across time. Specifically, the summed value of the t statistics associated with each comparison (separately for positive and negative values) was calculated within contiguous temporal clusters of significant time points. We then simulated the null distribution of the maximum summed cluster value using permutation (n = 5000) of the location labels, from which we derived the 95th percentile threshold value. Clusters identified in the data with a summed t value below this threshold were considered spurious and removed.
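To illustrate the logic of the cluster correction, here is a schematic MATLAB sketch for comparing two conditions across time. It uses sign-flipping of participant-level difference waves as the permutation scheme (a simplification of the label permutation described above), does not separate positive and negative clusters, and uses illustrative variable names throughout.

% diffs: participants x timepoints difference between two conditions.
nPerm = 5000;
alpha = 0.05;
[~, p, ~, stats] = ttest(diffs);                      % paired test at each time point
obsClusters = clusterSums(stats.tstat, p < alpha);    % summed t per observed cluster

maxNull = zeros(nPerm, 1);
for iPerm = 1:nPerm
    flips = (rand(size(diffs, 1), 1) > 0.5) * 2 - 1;  % random sign flips per participant
    [~, pP, ~, sP] = ttest(diffs .* flips);
    nullClusters = clusterSums(sP.tstat, pP < alpha);
    maxNull(iPerm) = max([abs(nullClusters), 0]);     % largest null cluster (0 if none)
end
threshold   = prctile(maxNull, 95);
sigClusters = obsClusters(abs(obsClusters) > threshold);

function sums = clusterSums(tvals, sigMask)
% Sum t-values within contiguous runs of significant time points.
    d       = diff([0, sigMask, 0]);
    onsets  = find(d == 1);
    offsets = find(d == -1) - 1;
    sums    = arrayfun(@(a, b) sum(tvals(a:b)), onsets, offsets);
end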
Data availability
The behavioural and EEG data, and the scripts used for analysis and figure creation, are available at https://doi.org/10.17605/OSF.IO/8CDRA.
Results
Behavioural performance
Participants performed well in discriminating stimulus location across all conditions in both the behavioural and EEG sessions (Figure 2). For the behavioural session, the psychometric curves for responses as a function of stimulus location showed stereotypical relationships for the auditory, visual, and audiovisual conditions (Figure 2A). Quantification of behavioural sensitivity (i.e., the steepness of the curves) revealed significantly greater sensitivity for audiovisual stimuli than for auditory stimuli alone (Z = −3.09, p = .002) and visual stimuli alone (Z = −5.28, p = 1.288e-7; Figure 2B). Sensitivity for auditory stimuli was also significantly greater than sensitivity for visual stimuli (Z = 2.02, p = .044). To test for successful integration of stimuli in the audiovisual condition, we calculated the predicted MLE sensitivity from the unisensory auditory and visual results. We found no evidence for a significant difference between the predicted and actual audiovisual sensitivity (Z = −1.54, p = .125).
We repeated these analyses for behavioural performance in the EEG session (Figure 2C). We found a similar pattern of results to those in the behavioural session; sensitivity for audiovisual stimuli was significantly greater than for auditory stimuli (Z = −2.27, p = .023) and visual stimuli alone (Z = −3.52, p = 4.345e-4), but not significantly different from the MLE prediction (Z = −1.07, p = .285). However, sensitivity for auditory stimuli was not significantly different from sensitivity for visual stimuli (Z = 1.12, p = .262). Sensitivity was higher overall in the EEG session than in the behavioural session, likely due to the increased number of stimuli in the EEG task.
Event-related potentials
We plotted the ERPs for the auditory, visual, and audiovisual conditions at each stimulus location from −100 ms to 500 ms around stimulus presentation (Figure 3). For each stimulus location, cluster-corrected t-tests were conducted to assess significant differences in ERP amplitude between the unisensory (auditory and visual) and audiovisual conditions. While auditory ERPs did not significantly differ from audiovisual ERPs, visual ERPs were significantly lower in amplitude than audiovisual ERPs at all stimulus locations (typically from ∼80-130 ms following stimulus presentation).
To test whether the enhancement in response amplitude to audiovisual stimuli was super-additive, we compared this response with the sum of the response amplitudes for visual and auditory conditions, averaged over stimulus location. We found no significant difference between the additive and audiovisual ERPs (Figure 3, bottom right). This result suggests that, using univariate analyses, the audiovisual response was additive and did not show any evidence for super- or sub-additivity.
Inverted encoding results
We next used inverted encoding to calculate the spatial decoding accuracy for auditory, visual, and audiovisual stimuli (Figure 4A). For all conditions, we found that spatial location could be reliably decoded from ∼100-150 ms after stimulus onset. Decoding was consistent across all conditions for most of the epoch, indicating that location information within the neural signal was relatively persistent and robust.
To assess the spatial distribution of location-relevant information in the neural signal, we computed decoding accuracy at each electrode from 150-250 ms post-stimulus presentation (Figure 4B). For auditory stimuli, information was carried primarily by electrodes over bilateral temporal regions, whereas for visual and audiovisual stimuli, the occipital electrodes carried the most information.
Multivariate super-additivity
Although the univariate response did not show evidence for super-additivity, we expected the multivariate measure would be more sensitive to nonlinear audiovisual integration. To test whether a super-additive interaction was present in the multivariate response, we calculated the sensitivity of the decoder in discriminating stimuli presented on the left and right side. The pattern of decoding sensitivity for auditory, visual, and audiovisual stimuli (Figure 5A) was similar to that in decoding accuracy (Figure 4A). Notably, audiovisual sensitivity was significantly greater than sensitivity to auditory and visual stimuli alone, particularly ∼180 ms following stimulus onset. To test whether this enhanced sensitivity reflected super-additivity, we compared decoding sensitivity for audiovisual stimuli with two estimates of linearly combined unisensory stimuli: 1) MLE predicted sensitivity based on auditory and visual sensitivity and 2) aggregate responses of auditory and visual stimuli (Figure 5B). We found that audiovisual sensitivity significantly exceeded both estimates of linear combination (MLE, ∼160-220 ms post-stimulus; aggregate, ∼150-250 ms). These results provide evidence of non-linear audiovisual integration in the multivariate pattern of EEG recordings. Taken together with the ERP results, our findings suggest that super-additive integration of audiovisual information is reflected in multivariate patterns of activity, but not univariate evoked responses.
Neurobehavioural correlations
To test whether neural decoding was related to behaviour, we calculated rank-order correlations (Spearman’s Rho) between the average decoding sensitivity for each participant from 150-250 ms post-stimulus onset and behavioural performance on the EEG task. We found that decoding sensitivity was significantly positively correlated with behavioural sensitivity for audiovisual stimuli (r = .43, p = .003), but not for auditory (r = -.04, p = .608) or visual stimuli (r = .14, p = .170) alone.
Discussion
We tested for super-additivity in multivariate patterns of EEG responses to audiovisual stimuli. Participants judged the location of auditory, visual, and audiovisual stimuli while their brain activity was measured using EEG. As expected, participants’ behavioural responses to audiovisual stimuli were more precise than those to unisensory auditory and visual stimuli. ERP analyses showed that although audiovisual stimuli elicited larger responses than visual stimuli, the overall response followed an additive principle. Critically, our multivariate analyses revealed that decoding sensitivity for audiovisual stimuli exceeded predictions of both MLE and aggregate auditory and visual information, indicating non-linear multisensory enhancement (i.e., super-additivity).
Participants localised audiovisual stimuli more accurately than unisensory in both the behavioural and EEG sessions. This behavioural facilitation in response to audiovisual stimuli is well-established within the literature (Bolognini et al., 2005; Frassinetti et al., 2002; Lovelace et al., 2003; Meredith & Stein, 1983; Senkowski et al., 2011). In accordance with theories of optimal cue integration, we found participants’ performance for audiovisual stimuli in both sessions matched that predicted by MLE (Ernst & Banks, 2002). Matching this ‘optimal’ prediction of performance indicates that the auditory and visual cues were successfully integrated when presented together in the audiovisual condition (Fetsch et al., 2013).
Our EEG analyses revealed that for most spatial locations, audiovisual stimuli elicited a significantly greater neural response than exclusively visual stimuli approximately 80-120 ms after stimulus onset. Despite numerically larger ERPs to audiovisual than auditory stimuli, this effect failed to reach significance, most likely due to greater inter-trial variability in the auditory ERPs. Critically, however, the audiovisual ERPs consistently matched the sum of the visual and auditory ERPs. Sub- or super-additive interaction effects in neural responses to multisensory stimuli are a hallmark of successful integration of unisensory cues in ERPs (Besle et al., 2004; Stevenson et al., 2014). An additive ERP in this context does not necessarily imply successful multisensory integration, as the multisensory ‘enhancement’ may simply reflect recording from distinct populations of unisensory neurons responding to the two sensory modalities (Besle et al., 2009). This invites the question of why we see evidence for integration at the behavioural level, but not in the amplitude of neural responses. One explanation could be that the signals measured by EEG simply do not contain evidence of non-linear integration because the super-additive responses are highly spatiotemporally localised and filtered out by the skull before reaching the EEG sensors. Another possibility, however, is that evidence for non-linear integration is only observable within the changing pattern of ERPs across sensors. Indeed, Murray et al. (2016) found that multisensory interactions followed from changes in scalp topography rather than net gains to ERP amplitude.
Our decoding results reveal that not only do audiovisual stimuli elicit more distinguishable patterns of activity than visual and auditory stimuli, but this enhancement exceeds that predicted by both optimal integration and the aggregate combination of auditory and visual responses. Critically, the non-linear enhancement of decoding sensitivity for audiovisual stimuli indicates the presence of an interactive effect for the integration of auditory and visual stimuli that was not evident from the univariate analyses. This indicates super-additive enhancement of the neural representation of integrated audiovisual cues, and supports the interpretation that increased behavioural performance for multisensory stimuli is related to a facilitation of the neural response (Fetsch et al., 2013). This interaction was absent from univariate analyses (Nikbakht et al., 2018), suggesting that the neural facilitation of audiovisual processing is more nuanced than net increased excitation, and may be associated with a complex pattern of excitatory and inhibitory neural activity, e.g., divisive normalization (Ohshiro et al., 2017).
The non-linear neural enhancement in decoding sensitivity for audiovisual stimuli occurred ∼180 ms after stimulus onset, which is later than previously reported audiovisual interactions (<150 ms; Cappe et al., 2010; Fort et al., 2002; Giard & Peronnet, 1999; Molholm et al., 2002; Murray et al., 2016; Senkowski et al., 2011; Talsma et al., 2007; Teder-Sälejärvi et al., 2002). As stimulus characteristics and task requirements are likely to have a significant influence over the timing of multisensory interaction effects in EEG activity (Calvert & Thesen, 2004; De Meo et al., 2015), our use of peripheral spatial locations (whereas previous studies presented stimuli only centrally) may explain the slightly later timing of our audiovisual effect. Indeed, our finding is consistent with previous multivariate studies which found that location information in EEG data, for both visual (Rideaux, 2024; Robinson et al., 2021) and auditory (Bednar & Lalor, 2020) stimuli, is maximal at ∼190 ms following stimulus presentation.
We also found a significant positive correlation between participants’ behavioural judgements in the EEG task and decoding sensitivity for audiovisual stimuli, suggesting that participants who were better at identifying stimulus location may have more distinct patterns of neural activity for audiovisual stimuli. Multisensory stimuli have consistently been found to elicit stronger neural responses than unisensory stimuli (Meredith & Stein, 1983; Puce et al., 2007; Senkowski et al., 2011; Vroomen & Stekelenburg, 2010), which has been associated with behavioural performance (Frens & Van Opstal, 1998; Wang et al., 2008). However, the neuro-behavioural correlation we observed suggests that behavioural facilitation from audiovisual integration is not simply represented by the strength of the neural response, but rather by a reliably distinguishable pattern of activity.
Any experimental design that varies stimulus location needs to consider the potential contribution of eye movements. To reduce eye movements during our study, we had participants fixate on a central dot and removed trials with substantial eye-movements (>3.75°) from the analyses. A re-analysis of the data with a very strict eye-movement criterion (i.e., removing trials with eye movements >1.875°) revealed that the super-additive enhancement in decoding accuracy no longer survived cluster correction, suggesting that our results may be impacted by the consistent motor activity of saccades towards presented stimuli. One piece of evidence against this is that we did not observe significant differences between auditory and audiovisual ERP amplitudes, the latter condition being more likely to drive eye movements. Furthermore, we found that the electrodes with the most location information were in occipital and temporal regions of the scalp, brain areas dedicated to sensory processing, rather than frontal regions, which would be expected if activity was dominated by consistent muscular activity evoked by eye movements. The lack of a super-additive enhancement when using the stricter eye-movement criterion, therefore, is perhaps more likely due to a loss of statistical power.
In summary, here we have shown a non-linear enhancement in the neural representation of audiovisual stimuli relative to unisensory (visual/auditory) stimuli. This enhancement was obscured within univariate ERP analyses focusing exclusively on response amplitude but was revealed through inverted encoding analyses in feature-space, suggesting that super-additive integration of audiovisual information is reflected within multivariate patterns of activity rather than univariate evoked responses. Further research on the multivariate representation of audiovisual integration may shed light on the neural mechanisms that facilitate this non-linear enhancement. In particular, future work may consider the influence of different stimulus features and task requirements on the timing and magnitude of the audiovisual enhancement. How and when auditory and visual information are integrated to enhance multisensory processing remains an open question, with evidence for a complex combination of top-down and bottom-up interactions (Delong & Noppeney, 2021; Keil & Senkowski, 2018; Rohe & Noppeney, 2018). Our study highlights the importance of considering multivariate analyses in multisensory research, and the potential loss of stimulus-relevant neural information when relying solely on univariate responses.
Acknowledgements
We thank R. West for data collection, and D. Lloyd for technical assistance. This work was supported by Australian Research Council (ARC) Discovery Early Career Researcher Awards awarded to RR (DE210100790) and AKR (DE200101159). RR was also supported by a National Health and Medical Research Council (Australia) Investigator Grant (2026318).
References
- The ventriloquist effect results from near-optimal bimodal integration. Current Biology 14:257–262. https://doi.org/10.1016/j.cub.2004.01.029
- Cross-modal interaction between vision and hearing: a speed-accuracy analysis. Attention, Perception, & Psychophysics 70:412–421. https://doi.org/10.3758/pp.70.3.412
- The integration of motion and disparity cues to depth in dorsal visual cortex. Nature Neuroscience 15:636–643. https://doi.org/10.1038/nn.3046
- Statistical criteria in fMRI studies of multisensory integration. Neuroinformatics 3:93–113. https://doi.org/10.1385/NI:3:2:093
- Where is the cocktail party? Decoding locations of attended and unattended moving sound sources using EEG. Neuroimage 205. https://doi.org/10.1016/j.neuroimage.2019.116283
- Electrophysiological (EEG, sEEG, MEG) evidence for multiple audiovisual interactions in the human auditory cortex. Hearing Research 258:143–151. https://doi.org/10.1016/j.heares.2009.06.016
- Interest and validity of the additive model in electrophysiological studies of multisensory interactions. Cognitive Processing 5:189–192. https://doi.org/10.1007/s10339-004-0026-y
- “Acoustical vision” of below threshold stimuli: interaction among spatially converging audiovisual inputs. Experimental Brain Research 160:273–282. https://doi.org/10.1007/s00221-004-2005-z
- The Psychophysics Toolbox. Spatial Vision 10:433–436. https://doi.org/10.1163/156856897X00357
- Decoding and Reconstructing Color from Responses in Human Visual Cortex. The Journal of Neuroscience 29. https://doi.org/10.1523/JNEUROSCI.3577-09.2009
- Cross-orientation suppression in human visual cortex. Journal of Neurophysiology 106:2108–2119. https://doi.org/10.1152/jn.00540.2011
- Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Current Biology 10:649–657. https://doi.org/10.1016/S0960-9822(00)00513-3
- Detection of Audio-Visual Integration Sites in Humans by Application of Electrophysiological Criteria to the BOLD Effect. Neuroimage 14:427–438. https://doi.org/10.1006/nimg.2001.0812
- Multisensory integration: methodological approaches and emerging principles in the human brain. Journal of Physiology-Paris 98:191–205. https://doi.org/10.1016/j.jphysparis.2004.03.018
- Selective integration of auditory-visual looming cues by humans. Neuropsychologia 47:1045–1052. https://doi.org/10.1016/j.neuropsychologia.2008.11.003
- Auditory–Visual Multisensory Interactions in Humans: Timing, Topography, Directionality, and Sources. The Journal of Neuroscience 30. https://doi.org/10.1523/JNEUROSCI.1099-10.2010
- Multisensory Interaction in Saccadic Reaction Time: A Time-Window-of-Integration Model. Journal of Cognitive Neuroscience 16:1000–1009. https://doi.org/10.1162/0898929041502733
- Auditory-Visual Interactions Subserving Goal-Directed Saccades in a Complex Scene. Journal of Neurophysiology 88:438–454. https://doi.org/10.1152/jn.00699.2001
- Top-down control and early multisensory processes: chicken vs. egg. Frontiers in Integrative Neuroscience 9. https://doi.org/10.3389/fnint.2015.00017
- Semantic and spatial congruency mould audiovisual integration depending on perceptual awareness. Scientific Reports 11. https://doi.org/10.1038/s41598-021-90183-w
- EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods 134:9–21. https://doi.org/10.1016/j.jneumeth.2003.10.00
- Humans integrate visual and haptic information in a statistically optimal fashion. Nature 415:429–433. https://doi.org/10.1038/415429a
- Bridging the gap between theories of sensory cue integration and the physiology of multisensory neurons. Nature Reviews Neuroscience 14:429–442. https://doi.org/10.1038/nrn3503
- Dynamics of Cortico-subcortical Cross-modal Operations Involved in Audio-visual Object Detection in Humans. Cerebral Cortex 12:1031–1039. https://doi.org/10.1093/cercor/12.10.1031
- Enhancement of visual perception by crossmodal visuo-auditory interaction. Experimental Brain Research 147:332–343. https://doi.org/10.1007/s00221-002-1262-y
- Visual-auditory interactions modulate saccade-related activity in monkey superior colliculus. Brain Research Bulletin 46:211–224. https://doi.org/10.1016/S0361-9230(98)00007-0
- Auditory-Visual Integration during Multimodal Object Recognition in Humans: A Behavioral and Electrophysiological Study. Journal of Cognitive Neuroscience 11:473–490. https://doi.org/10.1162/089892999563544
- Neural tuning instantiates prior expectations in the human visual system. Nature Communications 14. https://doi.org/10.1038/s41467-023-41027-w
- Inverse Effectiveness and BOLD fMRI. The New Handbook of Multisensory Processing. https://doi.org/10.7551/mitpress/8466.003.0020
- The neural network sustaining the crossmodal processing of human gender from faces and voices: An fMRI study. Neuroimage 54:1654–1661. https://doi.org/10.1016/j.neuroimage.2010.08.073
- Neural Oscillations Orchestrate Multisensory Processing. The Neuroscientist 24:609–626. https://doi.org/10.1177/1073858418755352
- Prior expectations induce prestimulus sensory templates. Proceedings of the National Academy of Sciences 114:10473–10478. https://doi.org/10.1073/pnas.1705652114
- BCILAB: a platform for brain-computer interface development. Journal of Neural Engineering 10. https://doi.org/10.1088/1741-2560/10/5/056014
- On the use of superadditivity as a metric for characterizing multisensory integration in functional neuroimaging studies. Experimental Brain Research 166:289–297. https://doi.org/10.1007/s00221-005-2370-2
- Dissociation of perception and action in audiovisual multisensory integration. European Journal of Neuroscience 42:2915–2922. https://doi.org/10.1111/ejn.13087
- An irrelevant light enhances auditory detection in humans: a psychophysical analysis of multisensory integration in stimulus detection. Cognitive Brain Research 17:447–453. https://doi.org/10.1016/s0926-6410(03)00160-5
- Interactions Among Converging Sensory Inputs in the Superior Colliculus. Science 221:389–391. https://doi.org/10.1126/science.6867718
- Multisensory auditory–visual interactions during early sensory processing in humans: a high-density electrical mapping study. Cognitive Brain Research 14:115–128. https://doi.org/10.1016/S0926-6410(02)00066-6
- The multisensory function of the human primary visual cortex. Neuropsychologia 83:161–169. https://doi.org/10.1016/j.neuropsychologia.2015.08.011
- Supralinear and Supramodal Integration of Visual and Tactile Signals in Rats: Psychophysics and Neuronal Mechanisms. Neuron 97:626–639. https://doi.org/10.1016/j.neuron.2018.01.003
- A Neural Signature of Divisive Normalization at the Level of Multisensory Integration in Primate Cortex. Neuron 95:399–411. https://doi.org/10.1016/j.neuron.2017.06.043
- Trimodal processing of complex stimuli in inferior parietal cortex is modality-independent. Cortex 139:198–210. https://doi.org/10.1016/j.cortex.2021.03.008
- Neural responses elicited to face motion and vocalization pairings. Neuropsychologia 45:93–106. https://doi.org/10.1016/j.neuropsychologia.2006.04.017
- Visual-tactile integration: does stimulus duration influence the relative amount of response enhancement? Experimental Brain Research 173:514–520. https://doi.org/10.1007/s00221-006-0452-4
- On quantifying multisensory interaction effects in reaction time and detection rate. Psychological Research 75:77–94. https://doi.org/10.1007/s00426-010-0289-0
- Task-related modulation of event-related potentials does not reflect changes to sensory representations. bioRxiv. https://doi.org/10.1101/2024.01.20.576485
- How multisensory neurons solve causal inference. Proceedings of the National Academy of Sciences 118. https://doi.org/10.1073/pnas.2106235118
- Proscription supports robust perceptual integration by suppression in human visual cortex. Nature Communications 9. https://doi.org/10.1038/s41467-018-03400-y
- Distinct early and late neural mechanisms regulate feature-specific sensory adaptation in the human visual system. Proceedings of the National Academy of Sciences 120. https://doi.org/10.1073/pnas.2216192120
- Overlapping neural representations for the position of visible and imagined objects. Neurons, Behavior, Data Analysis, and Theory 4. https://doi.org/10.51628/001c.19129
- Reliability-Weighted Integration of Audiovisual Signals Can Be Modulated by Top-down Attention. eNeuro 5. https://doi.org/10.1523/ENEURO.0315-17.2018
- Neural correlates of multisensory enhancement in audiovisual narrative speech perception: A fMRI investigation. Neuroimage 263. https://doi.org/10.1016/j.neuroimage.2022.119598
- Neural correlates of multisensory integration in the human brain: an ALE meta-analysis. Reviews in the Neurosciences 34:223–245. https://doi.org/10.1515/revneuro-2022-0065
- Multisensory interactions in early evoked brain activity follow the principle of inverse effectiveness. Neuroimage 56:2200–2208. https://doi.org/10.1016/j.neuroimage.2011.03.075
- Calculation of signal detection theory measures. Behavior Research Methods, Instruments, & Computers 31:137–149. https://doi.org/10.3758/BF03207704
- Multisensory integration: current issues from the perspective of the single neuron. Nature Reviews Neuroscience 9:255–266. https://doi.org/10.1038/nrn2331
- Neural Correlates of Multisensory Integration of Ecologically Valid Audiovisual Events. Journal of Cognitive Neuroscience 19:1964–1973. https://doi.org/10.1162/jocn.2007.19.12.1964
- Superadditive BOLD activation in superior temporal sulcus with threshold non-speech objects. Experimental Brain Research 179:85–95. https://doi.org/10.1007/s00221-006-0770-6
- Identifying and quantifying multisensory integration: a tutorial review. Brain Topography 27:707–730. https://doi.org/10.1007/s10548-014-0365-7
- Audiovisual integration in human superior temporal sulcus: Inverse effectiveness and the neural processing of speech and object recognition. Neuroimage 44:1210–1223. https://doi.org/10.1016/j.neuroimage.2008.09.034
- Visual Contribution to Speech Intelligibility in Noise. The Journal of the Acoustical Society of America 26:212–215. https://doi.org/10.1121/1.1907309
- Selective Attention and Audiovisual Integration: Is Attending to Both Modalities a Prerequisite for Early Integration? Cerebral Cortex 17:679–690. https://doi.org/10.1093/cercor/bhk016
- An analysis of audio-visual crossmodal integration by means of event-related potential (ERP) recordings. Cognitive Brain Research 14:106–114. https://doi.org/10.1016/S0926-6410(02)00065-4
- MATLAB version: 9.11.0 (R2021b). Natick, Massachusetts: The MathWorks Inc.
- The effect of spatial-temporal audiovisual disparities on saccades in a complex scene. Experimental Brain Research 198:425–437. https://doi.org/10.1007/s00221-009-1815-4
- Multisensory Integration and Audiovisual Speech Perception. Brain Mapping. Academic Press: 565–572. https://doi.org/10.1016/B978-0-12-397025-1.00047-6
- Visual Anticipatory Information Modulates Multisensory Interactions of Artificial Audiovisual Stimuli. Journal of Cognitive Neuroscience 22:1583–1596. https://doi.org/10.1162/jocn.2009.21308
- Visuo-auditory interactions in the primary visual cortex of the behaving monkey: Electrophysiological evidence. BMC Neuroscience 9. https://doi.org/10.1186/1471-2202-9-79
- Superadditive Responses in Superior Temporal Sulcus Predict Audiovisual Benefits in Object Categorization. Cerebral Cortex 20:1829–1842. https://doi.org/10.1093/cercor/bhp248
- The Contributions of Transient and Sustained Response Codes to Audiovisual Integration. Cerebral Cortex 21:920–931. https://doi.org/10.1093/cercor/bhq161
- Time vs Intensity in the Localization of Tones. The Journal of the Acoustical Society of America 33:925–929. https://doi.org/10.1121/1.1908849
- The dominant role of low-frequency interaural time differences in sound localization. The Journal of the Acoustical Society of America 91:1648–1661. https://doi.org/10.1121/1.402445
Copyright
© 2024, Buhmann et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.