Abstract
A central challenge for the brain is how to combine separate sources of information from different sensory modalities to optimally represent objects and events in the external world, such as combining someone’s speech and lip movements to better understand them in a noisy environment. At the level of individual neurons, audiovisual stimuli often elicit super-additive interactions, where the neural response is greater than the sum of auditory and visual responses. However, investigations using electroencephalography (EEG) to record brain activity have revealed inconsistent interactions, with studies reporting a mix of super- and sub-additive effects. A possible explanation for this inconsistency is that standard univariate analyses obscure multisensory interactions present in EEG responses by overlooking multivariate changes in activity across the scalp. To address this shortcoming, we investigated EEG responses to audiovisual stimuli using inverted encoding, a population tuning approach that uses multivariate information to characterise feature-specific neural activity. Participants (n=41) completed a spatial localisation task for both unisensory stimuli (auditory clicks, visual flashes) and combined audiovisual stimuli (spatiotemporally congruent clicks and flashes). To assess multivariate changes in EEG activity, we used inverted encoding to recover stimulus location information from event-related potentials (ERPs). Participants localised audiovisual stimuli more accurately than unisensory stimuli alone. For univariate ERP analyses we found an additive multisensory interaction. By contrast, multivariate analyses revealed a super-additive interaction ∼180 ms following stimulus onset, such that the location of audiovisual stimuli was decoded more accurately than that predicted by maximum likelihood estimation. Our results suggest that super-additive integration of audiovisual information is reflected within multivariate patterns of activity rather than univariate evoked responses.
Introduction
We exist in a complex, dynamically changing sensory environment. Vertebrates, including humans, have evolved sensory organs that transduce relevant sources of physical information, such as light and changes in air pressure, into patterns of neural activity that support perception (vision and audition) and adaptive behaviour. Such activity patterns are noisy, and often ambiguous, due to a combination of external (environmental) and internal (transduction) factors. Critically, information from the different sensory modalities can be highly correlated because it is often elicited by a common external source or event. For example, the sight and sound of a hammer hitting a nail produces a single, unified perceptual experience, as does the sight of a person’s lips moving as we hear their voice. To improve the reliability of neural representations, the brain leverages these sensory relationships by combining information in a process referred to as multisensory integration. Such processes heighten perception, e.g., by making it easier to understand a person’s speech in a noisy setting by looking at their lip movements (Sumby & Pollack, 1954).
Multisensory integration of audiovisual cues improves performance across a range of behavioural outcomes, including detection accuracy (Bolognini et al., 2005; Frassinetti et al., 2002; Lovelace et al., 2003), response speed (Arieh & Marks, 2008; Cappe et al., 2009; Colonius & Diederich, 2004; Rach & Diederich, 2006; Senkowski et al., 2011), and saccade speed and accuracy (Corniel et al., 2002; Van Wanrooij et al., 2009). Successful integration requires the constituent stimuli to occur at approximately the same place and time (Leone & McCourt, 2015). The degree to which behavioural performance is improved follows the principles of maximum likelihood estimation (MLE), wherein sensory information from each modality is weighted and integrated according to its relative reliability (Alais & Burr, 2004; Ernst & Banks, 2002; although other processing schemes have also been identified; Rideaux & Welchman, 2018). As such, behavioural performance that matches MLE predictions is often seen as a benchmark of successful, optimal integration of relevant unisensory cues.
The ubiquity of behavioural enhancements for audiovisual stimuli suggests there are fundamental neural mechanisms that facilitate improved precision. Recordings from single multisensory (audiovisual) neurons within the cat superior colliculus have revealed the principle of inverse effectiveness, whereby the enhancement of the response to audiovisual stimuli is proportionally larger when the constituent unisensory stimuli are only weakly stimulating (Corniel et al., 2002; Meredith & Stein, 1983). Depending on the intensity of the integrated stimuli, the neural response can be super-additive (the multisensory response is greater than the sum of the unisensory responses), additive (equal to the sum of the unisensory responses), or sub-additive (less than the sum of the unisensory responses; see Stein & Stanford, 2008). Inverse effectiveness has also been observed in human behavioural experiments, with low-intensity audiovisual stimuli eliciting greater multisensory enhancements in response precision than those of high intensity (Colonius & Diederich, 2004; Corniel et al., 2002; Rach & Diederich, 2006; Rach et al., 2010).
Neuroimaging methods, such as electroencephalography (EEG) and functional magnetic resonance imaging (fMRI), have been used to investigate neural population-level audiovisual integration in humans. These studies have typically applied an additive criterion to quantify multisensory integration, wherein successful integration is marked by a non-linear enhancement of audiovisual responses relative to unisensory responses (Besle et al., 2004). The super- or sub-additive nature of this enhancement, however, is often inconsistent. In fMRI, neural super-additivity in blood-oxygen-level dependent (BOLD) responses to audiovisual stimuli has been found in a variety of regions, primarily the superior temporal sulcus (STS; Calvert et al., 2000; Calvert et al., 2001; Stevenson et al., 2007; Stevenson & James, 2009; Werner & Noppeney, 2010, 2011). However, other studies have failed to replicate audiovisual super-additivity in the STS (Joassin et al., 2011; Porada et al., 2021; Ross et al., 2022; Venezia et al., 2015), or have found sub-additive responses (see Scheliga et al., 2023 for review). As such, some have argued that BOLD responses are not sensitive enough to adequately characterise super-additive audiovisual interactions within populations of neurons (Beauchamp, 2005; James et al., 2012; Laurienti et al., 2005). In EEG, meanwhile, the evoked response to an audiovisual stimulus typically conforms to a sub-additive principle (Cappe et al., 2010; Fort et al., 2002; Giard & Peronnet, 1999; Murray et al., 2016; Puce et al., 2007; Stekelenburg & Vroomen, 2007; Teder-Sälejärvi et al., 2002; Vroomen & Stekelenburg, 2010). However, other studies have found super-additive enhancements to the amplitude of sensory event-related potentials (ERPs) for audiovisual stimuli (Molholm et al., 2002; Talsma et al., 2007), especially when considering the influence of stimulus intensity (Senkowski et al., 2011).
While behavioural outcomes for multisensory stimuli can be predicted by MLE, and single neuron responses follow the principles of inverse effectiveness and super-additivity, among others (Rideaux et al., 2021), how audiovisual super-additivity manifests within populations of neurons is comparatively unclear given the mixed findings from relevant fMRI and EEG studies. This uncertainty may be due to biophysical limitations of human neuroimaging techniques, but it may also be related to the analytic approaches used to study these recordings. In particular, information encoded by the brain can be represented as increased activity in some areas, accompanied by decreased activity in others, so simplifying complex neural responses to the average rise and fall of activity may obscure relevant multivariate patterns of activity evoked by a stimulus.
Inverted encoding is a multivariate analytic method that can reveal how sensory information is encoded within the brain by recovering patterns of neural activity associated with different stimulus features. This method has been successfully used in fMRI, EEG, and magnetoencephalography studies to characterise the neural representations of a range of stimulus features, including colour (Brouwer & Heeger, 2009), spatial location (Bednar & Lalor, 2020; Robinson et al., 2021) and orientation (Brouwer & Heeger, 2011; Harrison et al., 2023; Kok et al., 2017). A multivariate approach may capture potential non-linear enhancements associated with audiovisual responses and thus could reveal super-additive interactions that would otherwise be hidden within the brain’s univariate responses. The sensitivity of inverted encoding analyses to multivariate neural patterns may provide insight into how audiovisual information is processed and integrated at the population level.
In the present study, we investigated neural super-additivity in human audiovisual sensory processing using inverted encoding of EEG responses during a task where participants had to spatially localise visual, auditory, and audiovisual stimuli. In a separate behavioural experiment, we monitored response accuracy to characterise behavioural improvements to audiovisual relative to unisensory stimuli. Although there was no evidence for super-additivity in response to audiovisual stimuli within univariate ERPs, we observed a reliable non-linear enhancement of multivariate decoding performance at ∼180 ms following stimulus onset when auditory and visual stimuli were presented concurrently as opposed to alone. These findings suggest that population-level super-additive multisensory neural responses are present within multivariate patterns of activity rather than univariate evoked responses.
Methods
Participants
Seventy-one human adults were recruited in return for payment. The study was approved by The University of Queensland Human Research Ethics Committee, and informed consent was obtained in all cases. Participants were first required to complete a behavioural session with above chance performance to qualify for the EEG session (see Behavioural session for details). Twenty-nine participants failed to meet this criterion and were excluded from further participation and analyses, along with one participant who failed to complete the EEG session with above chance behavioural accuracy. This left a total of 41 participants (M = 27.21 yrs; min = 20 yrs; max = 64 yrs; 24 females; 41 right-handed). Participants reported no neurological or psychiatric disorders, and had normal visual acuity (assessed using a standard Snellen eye chart).
Materials and procedure
The experiment was split into two separate sessions, with participants first completing a behavioural session followed by an EEG session. Each session had three conditions, in which the presented stimuli were either visual, auditory, or combined audio and visual (audiovisual). The order in which conditions were presented was counterbalanced across participants. Before each task, participants were given instructions and completed two rounds of practice for each condition.
Apparatus
The experiment was conducted in a dark, acoustically and electromagnetically shielded room. For the EEG session, stimuli were presented on a 24-inch ViewPixx monitor (VPixx Technologies Inc., Saint-Bruno, QC) with 1920x1080-pixel resolution and a refresh rate of 144 Hz. Viewing distance was maintained at 54 cm using a chinrest. For the behavioural session, stimuli were presented on a 32-inch Cambridge Research Systems Display++ LCD monitor with 1920x1080-pixel resolution, hardware gamma correction, and a refresh rate of 144 Hz. Viewing distance was maintained at 59.5 cm using a chinrest. Stimuli were generated in MATLAB v2021b (The MathWorks Inc., 2021) using the Psychophysics Toolbox (Brainard, 1997). Auditory stimuli were played through two loudspeakers (25-75 W, 6 Ω) placed on either side of the display. In both sessions, an EyeLink 1000 infrared eye tracker (SR Research Ltd., 2009) recorded gaze direction at a sampling rate of 1000 Hz.
Stimuli
The EEG and behavioural paradigms used the same stimuli within each condition. Visual stimuli were Gaussian blobs (0.2 contrast, 16° diameter) presented for 16 ms on a mid-grey background. Auditory stimuli were 100 ms clicks with a flat 850 Hz tone embedded within a decay envelope (sample rate = 44,100 Hz; volume = 60 dBA SPL, as measured at the ears). Audiovisual stimuli were spatially and temporally matched combinations of the auditory and visual stimuli, with no changes to stimulus properties. To manipulate spatial location, target stimuli were presented at multiple horizontal locations along the display, centred on linearly spaced positions from 15° of visual angle to the left of the display centre to 15° to the right (eight locations for the behavioural session, five for the EEG session). Auditory stimuli were played through two speakers placed equidistantly on either side of the display. The perceived source location of auditory stimuli was manipulated via changes to interaural intensity and timing (Whitworth & Jeffress, 1961; Wightman & Kistler, 1992). Specifically, for stimuli toward the edges of the display, the nearer speaker played the click marginally louder and slightly earlier than the farther speaker, whereas stimuli toward the centre had more uniform volume and smaller delays between speakers.
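As an illustration of this manipulation, the sketch below generates a stereo click whose interaural level and time differences scale linearly with target azimuth. The decay constant, maximum level difference, and maximum delay are illustrative assumptions, not the calibrated values used in the experiment.

% Illustrative stereo click with azimuth-dependent interaural differences.
fs       = 44100;                               % sample rate (Hz)
t        = (0:round(0.1*fs)-1)'/fs;             % 100 ms click
click    = sin(2*pi*850*t) .* exp(-t/0.02);     % 850 Hz tone in a decay envelope (20 ms tau, assumed)

azimuth  = -15;                                 % target location (deg); negative = left
maxITD   = 0.0006;                              % maximum interaural delay (s), assumed
maxILDdB = 6;                                   % maximum interaural level difference (dB), assumed

delaySamp = round(abs(azimuth)/15 * maxITD * fs);
gain      = 10^((abs(azimuth)/15 * maxILDdB) / 20);  % level boost for the nearer ear

lead = [click; zeros(delaySamp, 1)];            % channel that plays first
lag  = [zeros(delaySamp, 1); click];            % channel delayed by delaySamp samples
if azimuth < 0
    stereo = [lead*gain, lag];                  % left ear earlier and louder
else
    stereo = [lag, lead*gain];                  % right ear earlier and louder
end
sound(stereo, fs);                              % play through the two loudspeakers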
Behavioural Session
Prior to data collection, stimulus intensity and timing were manipulated to make visual and auditory stimuli similarly difficult to spatially localise. We employed a two-interval forced choice design to measure participants’ audiovisual localisation sensitivity. Participants were presented with two consecutive stimuli and tasked with indicating, via button press, whether the first or second interval contained the more leftward stimulus. Each trial consisted of a central reference stimulus and a target stimulus presented at one of eight locations along the horizontal azimuth of the display. The presentation order of the reference and target stimuli was randomised across trials. Stimulus modality was either auditory, visual, or audiovisual. Trials were blocked by condition, with short (∼2 min) breaks between conditions (see Figure 1A for an example trial). Each condition consisted of 384 target presentations across the eight locations, i.e., 48 presentations at each location.
EEG Session
In this session, the experimental task was changed slightly from the behavioural session to increase the number of stimulus presentations, as required for inverted encoding analyses of EEG data. Participants viewed and/or listened to a sequence of 20 stimuli, each of which was presented at one of five horizontal locations along the display (selected at random). At the end of each sequence, participants were tasked with indicating, via button press, whether more presentations appeared on the right or the left of the display. To minimise eye movements, participants were asked to fixate on a black dot presented 8° above the display centre (see Figure 1B for an example trial). The task used in the EEG session included the same blocked conditions as in the behavioural session, i.e., visual, auditory, and (congruent) audiovisual stimuli. As the locations of stimuli were selected at random, some sequences had an equal number of presentations on each side of the display, and thus had no correct “left” or “right” response; these trials were not included in the analysis of behavioural performance. Each block consisted of 10 trials, followed by a feedback display indicating the number of trials participants answered correctly. Each condition consisted of 12 blocks, yielding a total of 2400 stimulus presentations per condition.
EEG data pre-processing
EEG data were recorded using a 64-channel BioSemi system at a sampling rate of 1024 Hz, which was down-sampled to 512 Hz during preprocessing. Signals were recorded with reference to the CMS/DRL electrode loop, with bipolar electrodes placed above and below the eye, at the temples, and at each mastoid to monitor for eye-movements and muscle artifacts. EEG preprocessing was undertaken in MATLAB using custom scripts and the EEGLAB toolbox (Delorme & Makeig, 2004). Data were high-pass filtered at 0.25 Hz to remove baseline drifts, and re-referenced according to the average of all 64 channels. Analyses were stimulus locked, with ERP responses segmented into 600 ms epochs from 100 ms before stimulus presentation to 500 ms after stimulus presentation. Bad channels, identified by the clean_artifacts function (Kothe & Makeig, 2013), were reconstructed using spherical interpolation from surrounding channels.
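For orientation, the following is a minimal sketch of the preprocessing steps described above using standard EEGLAB functions. The file name, event code, ordering of steps, clean_artifacts settings, and baseline correction are placeholders and assumptions, not the authors’ custom scripts.

% Illustrative EEGLAB preprocessing pipeline (settings are placeholders).
EEG = pop_biosig('sub01.bdf');                % load BioSemi recording (BIOSIG plugin)
EEG = pop_resample(EEG, 512);                 % down-sample 1024 Hz -> 512 Hz
EEG = pop_eegfiltnew(EEG, 0.25, []);          % 0.25 Hz high-pass to remove drifts

cleaned    = clean_artifacts(EEG, 'BurstCriterion', 'off');   % flag bad channels
goodLabels = {cleaned.chanlocs.labels};
badIdx     = find(~ismember({EEG.chanlocs.labels}, goodLabels));
EEG = eeg_interp(EEG, badIdx, 'spherical');   % rebuild bad channels by spherical interpolation

EEG = pop_reref(EEG, []);                     % average reference across the 64 channels
EEG = pop_epoch(EEG, {'stim'}, [-0.1 0.5]);   % stimulus-locked epochs, -100 to 500 ms
EEG = pop_rmbase(EEG, [-100 0]);              % pre-stimulus baseline correction (assumed)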
Forward model
To describe the neural representations of sensory stimuli, we used an inverted modelling approach to reconstruct the location of stimuli based upon the recorded ERPs (Brouwer & Heeger, 2011; Harrison et al., 2023). Analyses were performed separately for visual, auditory, and audiovisual stimuli. We first created an encoding model that characterised the patterns of activity across the EEG electrodes given the five locations of the presented stimuli. The encoding model was then used to obtain the inverse decoding model that described the transformation from electrode activity to stimulus location. We used a 10-fold cross-validation approach in which 90% of the data were used to obtain the inverse model, with which the remaining 10% of the data were decoded. Cross-validation was repeated 10 times such that all the data were decoded. For the purposes of these analyses, we assumed that EEG electrode noise is isotropic across locations and additive with the signal.
Prior to the neural decoding analyses, we established the sensors that contained the most location information by treating time as the decoding dimension and obtaining the inverse models from each electrode, using 10-fold cross-validation. This revealed that location was primarily represented in posterior electrodes for visual and audiovisual stimuli, and in central electrodes for auditory stimuli. Thus, for all subsequent analyses we only included signals from the central-temporal, parietal-occipital, occipital and inion sensors for computing the inverse model.
The encoding model contained five hypothetical channels, with evenly distributed idealised location preferences spanning −15° to +15° of visual angle relative to the display centre. Each channel consisted of a half-wave rectified sinusoid raised to the fifth power. The channels were arranged such that an idealised tuning curve for each location preference could be expressed as a weighted sum of the five channels. The observed activity for each presentation can be described by the following linear model:

B = WC + E

where B indicates the EEG data (m electrodes x n presentations), W is a weight matrix (m electrodes x 5 channels) that describes the mapping from hypothesized channel activity to activity at the EEG electrodes, C denotes the hypothesized channel activities (5 channels x n presentations), and E indicates the residual errors.
To compute the inverse model we estimated the weights that, when applied to the data, would reconstruct the channel activities with the least error. Due to the correlation between neighbouring electrodes, we took noise covariance into account when computing the model to optimize it for EEG data (Harrison et al., 2023; Kok et al., 2017; Rideaux et al., 2023). We then used the inverse model to reconstruct the stimulus location from the recorded ERP responses.
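For concreteness, the following is a minimal MATLAB sketch of the encoding and decoding steps for a single time point, loosely following the noise-normalised approach of Kok et al. (2017). The variable names (Btrain, Btest, locsTrain), the scaling of the basis functions to the ±15° range, the unregularised covariance estimate, and the argmax read-out are illustrative assumptions rather than the authors’ exact implementation.

% Btrain, Btest: electrodes x trials EEG amplitudes; locsTrain: 1 x trials locations (deg).
nChan   = 5;
centres = linspace(-15, 15, nChan);               % channel preferred locations (deg)

% Map the ±15° stimulus range onto the positive half-cycle of the sinusoid
% (the exact scaling is an assumption of this sketch).
scale = 180 / 30;                                 % degrees of phase per degree of azimuth
basis = @(loc, mu) max(0, cosd((loc - mu) * scale)).^5;

% Hypothetical channel responses C (nChan x nTrials) for the training stimuli
C = zeros(nChan, numel(locsTrain));
for k = 1:nChan
    C(k, :) = basis(locsTrain(:)', centres(k));
end

% Encoding weights W (electrodes x channels): least-squares solution of B = W*C
W = Btrain * C' / (C * C');

% Noise covariance of the residuals, used to optimise the spatial filters
% (shrinkage regularisation omitted for brevity)
E     = Btrain - W * C;
Sigma = cov(E');                                  % electrodes x electrodes

% Inverse model: one noise-normalised spatial filter per channel
V = zeros(size(W));
for k = 1:nChan
    V(:, k) = (Sigma \ W(:, k)) / (W(:, k)' * (Sigma \ W(:, k)));
end

% Reconstruct channel responses for held-out data and take a point estimate
Ctest       = V' * Btest;                         % nChan x nTestTrials
[~, idx]    = max(Ctest, [], 1);
decodedLocs = centres(idx);                       % crude argmax read-out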
To assess how well the forward model captured location information in the neural signal for each modality, two measures of performance were analysed. The first, decoding accuracy, was calculated as the similarity of the decoded location to the presented location, represented in arbitrary units. To test whether a super-additive interaction was present in the multivariate response, an additive benchmark against which to compare the audiovisual response was required; however, it is unclear how the arbitrary units used to represent decoding accuracy translate to a measure of the linear summation of auditory and visual accuracy. As with the behavioural analyses, MLE provides a framework for calculating the estimated optimal sensitivity of the combination of two sensory signals, according to signal detection theory principles. The second measure, decoding sensitivity (d’), was therefore computed so that MLE could be applied. To do so, we omitted trials where stimuli appeared in the centre of the display and grouped the decoder’s reconstructions of stimulus location according to whether stimuli appeared on the left or right side of the display. The proportions of hits and false alarms were derived by comparing the decoded side to the presented side, and were used to calculate d’ for each condition (Stanislaw & Todorov, 1999). The d’ values of the auditory and visual conditions were used to estimate the predicted ‘optimal’ audiovisual sensitivity under MLE. We could then compare the observed audiovisual sensitivity against this prediction; sensitivity exceeding the MLE prediction would indicate a non-linear (super-additive) combination of auditory and visual information. A similar method was previously employed to investigate depth estimation from motion and binocular disparity cues decoded from BOLD responses (Ban et al., 2012).
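Concretely, under the independent-Gaussian-noise assumptions of signal detection theory, the MLE (‘optimal’) prediction for audiovisual sensitivity is the quadratic sum of the two unisensory sensitivities. The sketch below uses illustrative variable names (decodedSide, presentedSide, dprimeA, dprimeV) rather than the authors’ code.

% d' for one condition from the decoder's left/right classifications
% (treating "right" as the signal class; extreme proportions would need a
% standard correction in practice).
hitRate = mean(decodedSide(presentedSide ==  1) ==  1);   % right decoded as right
faRate  = mean(decodedSide(presentedSide == -1) ==  1);   % left decoded as right
dprime  = norminv(hitRate) - norminv(faRate);

% MLE prediction for audiovisual sensitivity from the unisensory d' values
dprimeMLE = sqrt(dprimeA^2 + dprimeV^2);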
To provide an additional ‘additive’ multivariate benchmark, complementary to the MLE prediction, we first matched the EEG data between the unisensory conditions such that the order of presented stimulus locations was the same for the auditory and visual conditions. The auditory and visual condition data were then concatenated across sensors, and inverted encoding analyses were performed on the resulting ‘additive’ (aggregate) audiovisual dataset. This additive condition was designed to represent neural activity evoked by both the auditory and visual conditions, without any non-linear neural interaction, and served as a baseline for the audiovisual condition.
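Schematically, the aggregate dataset can be formed by stacking the two trial-matched unisensory datasets along the electrode dimension before running the same decoding pipeline; a minimal sketch with illustrative variable names:

% epochsA, epochsV: electrodes x time x trials for the auditory and visual
% conditions, re-ordered so that trial i in each condition shares the same location.
assert(isequal(locsA, locsV));                 % matched location sequences
epochsAgg = cat(1, epochsA, epochsV);          % 2*64 "electrodes" x time x trials
% epochsAgg is then decoded with the same inverted encoding pipeline as the
% recorded audiovisual condition, yielding the additive (aggregate) benchmark.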
Statistical analyses
Statistical analyses were performed in MATLAB v2021b. Two metrics of performance were calculated to assess behavioural sensitivity. For the behavioural session we calculated participants’ sensitivity separately for each modality condition by fitting psychometric functions to the proportion of rightward responses at each stimulus location. In the EEG session participants responded to sequences of stimuli rather than individual presentations, so behavioural performance was assessed via d’. We derived d’ in each condition from the average proportions of hits and false alarms for each participant’s performance in discriminating the side of the display on which more stimuli were presented (Stanislaw & Todorov, 1999). A one-sample Kolmogorov-Smirnov test for each condition revealed that all conditions in both sessions violated assumptions of normality. Non-parametric two-sided Wilcoxon signed-rank tests were therefore used to test for significant differences in behavioural performance between conditions.
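As an illustration, a probit (cumulative Gaussian) GLM is one straightforward way to fit the psychometric functions and summarise sensitivity as the slope; the authors’ exact fitting routine is not specified here, so the code below is a hedged sketch with illustrative variable names.

% Behavioural session: probit fit of the proportion of rightward responses
% against target location; a steeper slope indicates higher sensitivity.
% locs: 8 x 1 target locations (deg); nRight, nTotal: responses per location.
b     = glmfit(locs(:), [nRight(:) nTotal(:)], 'binomial', 'link', 'probit');
slope = b(2);                                 % sensitivity index for this condition

% EEG session: d' from the proportions of correct and incorrect "right"
% judgements about which side contained more stimuli.
dprime = norminv(hitRate) - norminv(falseAlarmRate);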
For the neural data, univariate ERPs were calculated by averaging EEG activity across presentations and channels for each stimulus location, from −100 to 500 ms around stimulus onset. To test for significant differences between conditions, paired-samples t-tests were conducted between conditions at each time point (a one-sample t-test was used when comparing decoding accuracy against chance, i.e., zero). A cluster correction was then applied to account for spurious differences across time. Specifically, the summed value of the t statistics associated with each comparison (separately for positive and negative values) was calculated within contiguous temporal clusters of significant time points. We then simulated the null distribution of the maximum summed cluster value using permutation (n = 5000) of the location labels, from which we derived the 95th percentile threshold value. Clusters identified in the data with a summed t value below this threshold were considered spurious and removed.
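To illustrate the logic of the cluster correction, here is a schematic MATLAB sketch for comparing two conditions across time. It uses sign-flipping of participant-level difference waves as the permutation scheme (a simplification of the label permutation described above), does not separate positive and negative clusters, and uses illustrative variable names throughout.

% diffs: participants x timepoints difference between two conditions.
nPerm = 5000;
alpha = 0.05;
[~, p, ~, stats] = ttest(diffs);                      % paired test at each time point
obsClusters = clusterSums(stats.tstat, p < alpha);    % summed t per observed cluster

maxNull = zeros(nPerm, 1);
for iPerm = 1:nPerm
    flips = (rand(size(diffs, 1), 1) > 0.5) * 2 - 1;  % random sign flips per participant
    [~, pP, ~, sP] = ttest(diffs .* flips);
    nullClusters = clusterSums(sP.tstat, pP < alpha);
    maxNull(iPerm) = max([abs(nullClusters), 0]);     % largest null cluster (0 if none)
end
threshold   = prctile(maxNull, 95);
sigClusters = obsClusters(abs(obsClusters) > threshold);

function sums = clusterSums(tvals, sigMask)
% Sum t-values within contiguous runs of significant time points.
    d       = diff([0, sigMask, 0]);
    onsets  = find(d == 1);
    offsets = find(d == -1) - 1;
    sums    = arrayfun(@(a, b) sum(tvals(a:b)), onsets, offsets);
end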
Data availability
The behavioural and EEG data, and the scripts used for analysis and figure creation, are available at https://doi.org/10.17605/OSF.IO/8CDRA.
Results
Behavioural performance
Participants performed well in discriminating stimulus location across all conditions in both the behavioural and EEG sessions (Figure 2). For the behavioural session, the psychometric curves for responses as a function of stimulus location showed stereotypical relationships for the auditory, visual, and audiovisual conditions (Figure 2A). Quantification of behavioural sensitivity (i.e., the steepness of the curves) revealed significantly greater sensitivity for audiovisual stimuli than for auditory stimuli alone (Z = −3.09, p = .002) and visual stimuli alone (Z = −5.28, p = 1.288e-7; Figure 2B). Sensitivity for auditory stimuli was also significantly greater than sensitivity for visual stimuli (Z = 2.02, p = .044). To test for successful integration of stimuli in the audiovisual condition, we calculated the predicted MLE sensitivity from the unisensory auditory and visual results. We found no evidence for a significant difference between the predicted and actual audiovisual sensitivity (Z = −1.54, p = .125).
We repeated these analyses for behavioural performance in the EEG session (Figure 2C). We found a similar pattern of results to those in the behavioural session; sensitivity for audiovisual stimuli was significantly greater than for auditory stimuli (Z = −2.27, p = .023) and visual stimuli alone (Z = −3.52, p = 4.345e-4), but not significantly different from the MLE prediction (Z = −1.07, p = .285). However, sensitivity for auditory stimuli was not significantly different from sensitivity for visual stimuli (Z = 1.12, p = .262). Sensitivity was higher overall in the EEG session than in the behavioural session, likely due to the increased number of stimuli in the EEG task.
Event-related potentials
We plotted the ERPs for the auditory, visual, and audiovisual conditions at each stimulus location from −100 ms to 500 ms around stimulus presentation (Figure 3). For each stimulus location, cluster-corrected t-tests were conducted to assess significant differences in ERP amplitude between the unisensory (auditory and visual) and audiovisual conditions. While auditory ERPs did not significantly differ from audiovisual ERPs, visual ERPs were significantly lower in amplitude than audiovisual ERPs at all stimulus locations (typically from ∼80-130 ms following stimulus presentation).
To test whether the enhancement in response amplitude to audiovisual stimuli was super-additive, we compared this response with the sum of the response amplitudes for visual and auditory conditions, averaged over stimulus location. We found no significant difference between the additive and audiovisual ERPs (Figure 3, bottom right). This result suggests that, using univariate analyses, the audiovisual response was additive and did not show any evidence for super- or sub-additivity.
Inverted encoding results
We next used inverted encoding to calculate the spatial decoding accuracy for auditory, visual, and audiovisual stimuli (Figure 4A). For all conditions, we found that spatial location could be reliably decoded from ∼100-150 ms after stimulus onset. Decoding was consistent across all conditions for most of the epoch, indicating that location information within the neural signal was relatively persistent and robust.
To assess the spatial distribution of location-relevant information in the neural signal, we computed decoding accuracy at each electrode from 150-250 ms post-stimulus presentation (Figure 4B). For auditory stimuli, information was carried primarily by electrodes over bilateral temporal regions, whereas for visual and audiovisual stimuli, the occipital electrodes carried the most information.
Multivariate super-additivity
Although the univariate response did not show evidence for super-additivity, we expected the multivariate measure would be more sensitive to nonlinear audiovisual integration. To test whether a super-additive interaction was present in the multivariate response, we calculated the sensitivity of the decoder in discriminating stimuli presented on the left and right side. The pattern of decoding sensitivity for auditory, visual, and audiovisual stimuli (Figure 5A) was similar to that in decoding accuracy (Figure 4A). Notably, audiovisual sensitivity was significantly greater than sensitivity to auditory and visual stimuli alone, particularly ∼180 ms following stimulus onset. To test whether this enhanced sensitivity reflected super-additivity, we compared decoding sensitivity for audiovisual stimuli with two estimates of linearly combined unisensory stimuli: 1) MLE predicted sensitivity based on auditory and visual sensitivity and 2) aggregate responses of auditory and visual stimuli (Figure 5B). We found that audiovisual sensitivity significantly exceeded both estimates of linear combination (MLE, ∼160-220 ms post-stimulus; aggregate, ∼150-250 ms). These results provide evidence of non-linear audiovisual integration in the multivariate pattern of EEG recordings. Taken together with the ERP results, our findings suggest that super-additive integration of audiovisual information is reflected in multivariate patterns of activity, but not univariate evoked responses.
Neurobehavioural correlations
To test whether neural decoding was related to behaviour, we calculated rank-order correlations (Spearman’s Rho) between the average decoding sensitivity for each participant from 150-250 ms post-stimulus onset and behavioural performance on the EEG task. We found that decoding sensitivity was significantly positively correlated with behavioural sensitivity for audiovisual stimuli (r = .43, p = .003), but not for auditory (r = -.04, p = .608) or visual stimuli (r = .14, p = .170) alone.
Discussion
We tested for super-additivity in multivariate patterns of EEG responses to audiovisual stimuli. Participants judged the location of auditory, visual, and audiovisual stimuli while their brain activity was measured using EEG. As expected, participants’ behavioural responses to audiovisual stimuli were more precise than those to unisensory auditory and visual stimuli. ERP analyses showed that although audiovisual stimuli elicited larger responses than visual stimuli, the overall response followed an additive principle. Critically, our multivariate analyses revealed that decoding sensitivity for audiovisual stimuli exceeded predictions of both MLE and aggregate auditory and visual information, indicating non-linear multisensory enhancement (i.e., super-additivity).
Participants localised audiovisual stimuli more accurately than unisensory in both the behavioural and EEG sessions. This behavioural facilitation in response to audiovisual stimuli is well-established within the literature (Bolognini et al., 2005; Frassinetti et al., 2002; Lovelace et al., 2003; Meredith & Stein, 1983; Senkowski et al., 2011). In accordance with theories of optimal cue integration, we found participants’ performance for audiovisual stimuli in both sessions matched that predicted by MLE (Ernst & Banks, 2002). Matching this ‘optimal’ prediction of performance indicates that the auditory and visual cues were successfully integrated when presented together in the audiovisual condition (Fetsch et al., 2013).
Our EEG analyses revealed that for most spatial locations, audiovisual stimuli elicited a significantly greater neural response than exclusively visual stimuli approximately 80-120 ms after stimulus onset. Despite numerically larger ERPs to audiovisual than auditory stimuli, this effect failed to reach significance, most likely due to greater inter-trial variability in the auditory ERPs. Critically, however, the audiovisual ERPs consistently matched the sum of the visual and auditory ERPs. Sub- or super-additive interaction effects in neural responses to multisensory stimuli are a hallmark of successful integration of unisensory cues in ERPs (Besle et al., 2004; Stevenson et al., 2014). An additive ERP in this context does not necessarily imply successful multisensory integration, as the multisensory ‘enhancement’ may simply reflect recording from distinct populations of unisensory neurons responding to the two sensory modalities (Besle et al., 2009). This invites the question of why we see evidence for integration at the behavioural level, but not in the amplitude of neural responses. One explanation could be that the signals measured by EEG simply do not contain evidence of non-linear integration because the super-additive responses are highly spatiotemporally localised and filtered out by the skull before reaching the EEG sensors. Another possibility, however, is that evidence for non-linear integration is only observable within the changing pattern of ERPs across sensors. Indeed, Murray et al. (2016) found that multisensory interactions followed from changes in scalp topography rather than net gains to ERP amplitude.
Our decoding results reveal that not only do audiovisual stimuli elicit more distinguishable patterns of activity than visual and auditory stimuli, but this enhancement exceeds that predicted by both optimal integration and the aggregate combination of auditory and visual responses. Critically, the non-linear enhancement of decoding sensitivity for audiovisual stimuli indicates the presence of an interactive effect for the integration of auditory and visual stimuli that was not evident from the univariate analyses. This indicates super-additive enhancement of the neural representation of integrated audiovisual cues, and supports the interpretation that increased behavioural performance for multisensory stimuli is related to a facilitation of the neural response (Fetsch et al., 2013). This interaction was absent from univariate analyses (Nikbakht et al., 2018), suggesting that the neural facilitation of audiovisual processing is more nuanced than net increased excitation, and may be associated with a complex pattern of excitatory and inhibitory neural activity, e.g., divisive normalization (Ohshiro et al., 2017).
The non-linear neural enhancement in decoding sensitivity for audiovisual stimuli occurred ∼180 ms after stimulus onset, which is later than previously reported audiovisual interactions (<150 ms; Cappe et al., 2010; Fort et al., 2002; Giard & Peronnet, 1999; Molholm et al., 2002; Murray et al., 2016; Senkowski et al., 2011; Talsma et al., 2007; Teder-Sälejärvi et al., 2002). As stimulus characteristics and task requirements are likely to have a significant influence over the timing of multisensory interaction effects in EEG activity (Calvert & Thesen, 2004; De Meo et al., 2015), our use of peripheral spatial locations (whereas previous studies presented stimuli only centrally) may explain the slightly later timing of our audiovisual effect. Indeed, our finding is consistent with previous multivariate studies which found that location information in EEG data, for both visual (Rideaux, 2024; Robinson et al., 2021) and auditory (Bednar & Lalor, 2020) stimuli, is maximal at ∼190 ms following stimulus presentation.
We also found a significant positive correlation between participants’ behavioural judgements in the EEG task and decoding sensitivity for audiovisual stimuli, suggesting that participants who were better at identifying stimulus location may have more distinct patterns of neural activity for audiovisual stimuli. Multisensory stimuli have consistently been found to elicit stronger neural responses than unisensory stimuli (Meredith & Stein, 1983; Puce et al., 2007; Senkowski et al., 2011; Vroomen & Stekelenburg, 2010), which has been associated with behavioural performance (Frens & Van Opstal, 1998; Wang et al., 2008). However, the neuro-behavioural correlation we observed suggests that behavioural facilitation from audiovisual integration is not simply represented by the strength of the neural response, but rather by a reliably distinguishable pattern of activity.
Any experimental design that varies stimulus location needs to consider the potential contribution of eye movements. To reduce eye movements during our study, we had participants fixate on a central dot and removed trials with substantial eye-movements (>3.75°) from the analyses. A re-analysis of the data with a very strict eye-movement criterion (i.e., removing trials with eye movements >1.875°) revealed that the super-additive enhancement in decoding accuracy no longer survived cluster correction, suggesting that our results may be impacted by the consistent motor activity of saccades towards presented stimuli. One piece of evidence against this is that we did not observe significant differences between auditory and audiovisual ERP amplitudes, the latter condition being more likely to drive eye movements. Furthermore, we found that the electrodes with the most location information were in occipital and temporal regions of the scalp, brain areas dedicated to sensory processing, rather than frontal regions, which would be expected if activity was dominated by consistent muscular activity evoked by eye movements. The lack of a super-additive enhancement when using the stricter eye-movement criterion, therefore, is perhaps more likely due to a loss of statistical power.
In summary, here we have shown a non-linear enhancement in the neural representation of audiovisual stimuli relative to unisensory (visual/auditory) stimuli. This enhancement was obscured within univariate ERP analyses focusing exclusively on response amplitude but was revealed through inverted encoding analyses in feature-space, suggesting that super-additive integration of audiovisual information is reflected within multivariate patterns of activity rather than univariate evoked responses. Further research on the multivariate representation of audiovisual integration may shed light on the neural mechanisms that facilitate this non-linear enhancement. In particular, future work may consider the influence of different stimulus features and task requirements on the timing and magnitude of the audiovisual enhancement. How and when auditory and visual information are integrated to enhance multisensory processing remains an open question, with evidence for a complex combination of top-down and bottom-up interactions (Delong & Noppeney, 2021; Keil & Senkowski, 2018; Rohe & Noppeney, 2018). Our study highlights the importance of considering multivariate analyses in multisensory research, and the potential loss of stimulus-relevant neural information when relying solely on univariate responses.
Acknowledgements
We thank R. West for data collection, and D. Lloyd for technical assistance. This work was supported by Australian Research Council (ARC) Discovery Early Career Researcher Awards awarded to RR (DE210100790) and AKR (DE200101159). RR was also supported by a National Health and Medical Research Council (Australia) Investigator Grant (2026318).
References
- The ventriloquist effect results from near-optimal bimodal integration. Current Biology 14:257–262. https://doi.org/10.1016/j.cub.2004.01.029
- Cross-modal interaction between vision and hearing: a speed-accuracy analysis. Attention, Perception, & Psychophysics 70:412–421. https://doi.org/10.3758/pp.70.3.412
- The integration of motion and disparity cues to depth in dorsal visual cortex. Nature Neuroscience 15:636–643. https://doi.org/10.1038/nn.3046
- Statistical criteria in fMRI studies of multisensory integration. Neuroinformatics 3:93–113. https://doi.org/10.1385/NI:3:2:093
- Where is the cocktail party? Decoding locations of attended and unattended moving sound sources using EEG. Neuroimage 205. https://doi.org/10.1016/j.neuroimage.2019.116283
- Electrophysiological (EEG, sEEG, MEG) evidence for multiple audiovisual interactions in the human auditory cortex. Hearing Research 258:143–151. https://doi.org/10.1016/j.heares.2009.06.016
- Interest and validity of the additive model in electrophysiological studies of multisensory interactions. Cognitive Processing 5:189–192. https://doi.org/10.1007/s10339-004-0026-y
- “Acoustical vision” of below threshold stimuli: interaction among spatially converging audiovisual inputs. Experimental Brain Research 160:273–282. https://doi.org/10.1007/s00221-004-2005-z
- The Psychophysics Toolbox. Spatial Vision 10:433–436. https://doi.org/10.1163/156856897X00357
- Decoding and Reconstructing Color from Responses in Human Visual Cortex. The Journal of Neuroscience 29. https://doi.org/10.1523/JNEUROSCI.3577-09.2009
- Cross-orientation suppression in human visual cortex. Journal of Neurophysiology 106:2108–2119. https://doi.org/10.1152/jn.00540.2011
- Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Current Biology 10:649–657. https://doi.org/10.1016/S0960-9822(00)00513-3
- Detection of Audio-Visual Integration Sites in Humans by Application of Electrophysiological Criteria to the BOLD Effect. Neuroimage 14:427–438. https://doi.org/10.1006/nimg.2001.0812
- Multisensory integration: methodological approaches and emerging principles in the human brain. Journal of Physiology-Paris 98:191–205. https://doi.org/10.1016/j.jphysparis.2004.03.018
- Selective integration of auditory-visual looming cues by humans. Neuropsychologia 47:1045–1052. https://doi.org/10.1016/j.neuropsychologia.2008.11.003
- Auditory–Visual Multisensory Interactions in Humans: Timing, Topography, Directionality, and Sources. The Journal of Neuroscience 30. https://doi.org/10.1523/JNEUROSCI.1099-10.2010
- Multisensory Interaction in Saccadic Reaction Time: A Time-Window-of-Integration Model. Journal of Cognitive Neuroscience 16:1000–1009. https://doi.org/10.1162/0898929041502733
- Auditory-Visual Interactions Subserving Goal-Directed Saccades in a Complex Scene. Journal of Neurophysiology 88:438–454. https://doi.org/10.1152/jn.00699.2001
- Top-down control and early multisensory processes: chicken vs. egg. Frontiers in Integrative Neuroscience 9. https://doi.org/10.3389/fnint.2015.00017
- Semantic and spatial congruency mould audiovisual integration depending on perceptual awareness. Scientific Reports 11. https://doi.org/10.1038/s41598-021-90183-w
- EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods 134:9–21. https://doi.org/10.1016/j.jneumeth.2003.10.00
- Humans integrate visual and haptic information in a statistically optimal fashion. Nature 415:429–433. https://doi.org/10.1038/415429a
- Bridging the gap between theories of sensory cue integration and the physiology of multisensory neurons. Nature Reviews Neuroscience 14:429–442. https://doi.org/10.1038/nrn3503
- Dynamics of Cortico-subcortical Cross-modal Operations Involved in Audio-visual Object Detection in Humans. Cerebral Cortex 12:1031–1039. https://doi.org/10.1093/cercor/12.10.1031
- Enhancement of visual perception by crossmodal visuo-auditory interaction. Experimental Brain Research 147:332–343. https://doi.org/10.1007/s00221-002-1262-y
- Visual-auditory interactions modulate saccade-related activity in monkey superior colliculus. Brain Research Bulletin 46:211–224. https://doi.org/10.1016/S0361-9230(98)00007-0
- Auditory-Visual Integration during Multimodal Object Recognition in Humans: A Behavioral and Electrophysiological Study. Journal of Cognitive Neuroscience 11:473–490. https://doi.org/10.1162/089892999563544
- Neural tuning instantiates prior expectations in the human visual system. Nature Communications 14. https://doi.org/10.1038/s41467-023-41027-w
- Inverse Effectiveness and BOLD fMRI. The New Handbook of Multisensory Processing. https://doi.org/10.7551/mitpress/8466.003.0020
- The neural network sustaining the crossmodal processing of human gender from faces and voices: An fMRI study. Neuroimage 54:1654–1661. https://doi.org/10.1016/j.neuroimage.2010.08.073
- Neural Oscillations Orchestrate Multisensory Processing. The Neuroscientist 24:609–626. https://doi.org/10.1177/1073858418755352
- Prior expectations induce prestimulus sensory templates. Proceedings of the National Academy of Sciences 114:10473–10478. https://doi.org/10.1073/pnas.1705652114
- BCILAB: a platform for brain-computer interface development. Journal of Neural Engineering 10. https://doi.org/10.1088/1741-2560/10/5/056014
- On the use of superadditivity as a metric for characterizing multisensory integration in functional neuroimaging studies. Experimental Brain Research 166:289–297. https://doi.org/10.1007/s00221-005-2370-2
- Dissociation of perception and action in audiovisual multisensory integration. European Journal of Neuroscience 42:2915–2922. https://doi.org/10.1111/ejn.13087
- An irrelevant light enhances auditory detection in humans: a psychophysical analysis of multisensory integration in stimulus detection. Cognitive Brain Research 17:447–453. https://doi.org/10.1016/s0926-6410(03)00160-5
- Interactions Among Converging Sensory Inputs in the Superior Colliculus. Science 221:389–391. https://doi.org/10.1126/science.6867718
- Multisensory auditory–visual interactions during early sensory processing in humans: a high-density electrical mapping study. Cognitive Brain Research 14:115–128. https://doi.org/10.1016/S0926-6410(02)00066-6
- The multisensory function of the human primary visual cortex. Neuropsychologia 83:161–169. https://doi.org/10.1016/j.neuropsychologia.2015.08.011
- Supralinear and Supramodal Integration of Visual and Tactile Signals in Rats: Psychophysics and Neuronal Mechanisms. Neuron 97:626–639. https://doi.org/10.1016/j.neuron.2018.01.003
- A Neural Signature of Divisive Normalization at the Level of Multisensory Integration in Primate Cortex. Neuron 95:399–411. https://doi.org/10.1016/j.neuron.2017.06.043
- Trimodal processing of complex stimuli in inferior parietal cortex is modality-independent. Cortex 139:198–210. https://doi.org/10.1016/j.cortex.2021.03.008
- Neural responses elicited to face motion and vocalization pairings. Neuropsychologia 45:93–106. https://doi.org/10.1016/j.neuropsychologia.2006.04.017
- Visual-tactile integration: does stimulus duration influence the relative amount of response enhancement? Experimental Brain Research 173:514–520. https://doi.org/10.1007/s00221-006-0452-4
- On quantifying multisensory interaction effects in reaction time and detection rate. Psychological Research 75:77–94. https://doi.org/10.1007/s00426-010-0289-0
- Task-related modulation of event-related potentials does not reflect changes to sensory representations. bioRxiv. https://doi.org/10.1101/2024.01.20.576485
- How multisensory neurons solve causal inference. Proceedings of the National Academy of Sciences 118. https://doi.org/10.1073/pnas.2106235118
- Proscription supports robust perceptual integration by suppression in human visual cortex. Nature Communications 9. https://doi.org/10.1038/s41467-018-03400-y
- Distinct early and late neural mechanisms regulate feature-specific sensory adaptation in the human visual system. Proceedings of the National Academy of Sciences 120. https://doi.org/10.1073/pnas.2216192120
- Overlapping neural representations for the position of visible and imagined objects. Neurons, Behavior, Data Analysis, and Theory 4. https://doi.org/10.51628/001c.19129
- Reliability-Weighted Integration of Audiovisual Signals Can Be Modulated by Top-down Attention. eNeuro 5. https://doi.org/10.1523/ENEURO.0315-17.2018
- Neural correlates of multisensory enhancement in audiovisual narrative speech perception: A fMRI investigation. Neuroimage 263. https://doi.org/10.1016/j.neuroimage.2022.119598
- Neural correlates of multisensory integration in the human brain: an ALE meta-analysis. Reviews in the Neurosciences 34:223–245. https://doi.org/10.1515/revneuro-2022-0065
- Multisensory interactions in early evoked brain activity follow the principle of inverse effectiveness. Neuroimage 56:2200–2208. https://doi.org/10.1016/j.neuroimage.2011.03.075
- Calculation of signal detection theory measures. Behavior Research Methods, Instruments, & Computers 31:137–149. https://doi.org/10.3758/BF03207704
- Multisensory integration: current issues from the perspective of the single neuron. Nature Reviews Neuroscience 9:255–266. https://doi.org/10.1038/nrn2331
- Neural Correlates of Multisensory Integration of Ecologically Valid Audiovisual Events. Journal of Cognitive Neuroscience 19:1964–1973. https://doi.org/10.1162/jocn.2007.19.12.1964
- Superadditive BOLD activation in superior temporal sulcus with threshold non-speech objects. Experimental Brain Research 179:85–95. https://doi.org/10.1007/s00221-006-0770-6
- Identifying and quantifying multisensory integration: a tutorial review. Brain Topography 27:707–730. https://doi.org/10.1007/s10548-014-0365-7
- Audiovisual integration in human superior temporal sulcus: Inverse effectiveness and the neural processing of speech and object recognition. Neuroimage 44:1210–1223. https://doi.org/10.1016/j.neuroimage.2008.09.034
- Visual Contribution to Speech Intelligibility in Noise. The Journal of the Acoustical Society of America 26:212–215. https://doi.org/10.1121/1.1907309
- Selective Attention and Audiovisual Integration: Is Attending to Both Modalities a Prerequisite for Early Integration? Cerebral Cortex 17:679–690. https://doi.org/10.1093/cercor/bhk016
- An analysis of audio-visual crossmodal integration by means of event-related potential (ERP) recordings. Cognitive Brain Research 14:106–114. https://doi.org/10.1016/S0926-6410(02)00065-4
- MATLAB version: 9.11.0 (R2021b). Natick, Massachusetts: The MathWorks Inc.
- The effect of spatial-temporal audiovisual disparities on saccades in a complex scene. Experimental Brain Research 198:425–437. https://doi.org/10.1007/s00221-009-1815-4
- Multisensory Integration and Audiovisual Speech Perception. Brain Mapping. Academic Press: 565–572. https://doi.org/10.1016/B978-0-12-397025-1.00047-6
- Visual Anticipatory Information Modulates Multisensory Interactions of Artificial Audiovisual Stimuli. Journal of Cognitive Neuroscience 22:1583–1596. https://doi.org/10.1162/jocn.2009.21308
- Visuo-auditory interactions in the primary visual cortex of the behaving monkey: Electrophysiological evidence. BMC Neuroscience 9. https://doi.org/10.1186/1471-2202-9-79
- Superadditive Responses in Superior Temporal Sulcus Predict Audiovisual Benefits in Object Categorization. Cerebral Cortex 20:1829–1842. https://doi.org/10.1093/cercor/bhp248
- The Contributions of Transient and Sustained Response Codes to Audiovisual Integration. Cerebral Cortex 21:920–931. https://doi.org/10.1093/cercor/bhq161
- Time vs Intensity in the Localization of Tones. The Journal of the Acoustical Society of America 33:925–929. https://doi.org/10.1121/1.1908849
- The dominant role of low-frequency interaural time differences in sound localization. The Journal of the Acoustical Society of America 91:1648–1661. https://doi.org/10.1121/1.402445
Copyright
© 2024, Buhmann et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.