Abstract
Despite recent progress in multisensory research, the absence of stimulus-computable perceptual models fundamentally limits our understanding of how the brain extracts and combines task-relevant cues from the continuous flow of natural multisensory stimuli. In previous research, we demonstrated that a correlation detector initially proposed for insect motion vision can predict the temporal integration of minimalistic audiovisual signals. Here, we show how a population of such units can process natural audiovisual stimuli and accurately account for human, monkey, and rat behaviour across simulations of 69 classic psychophysical, eye-tracking, and pharmacological experiments. Given only the raw audiovisual stimuli (i.e., real-life footage) as input, our population model could replicate observed responses with an average correlation exceeding 0.97. Despite relying on as few as 0 to 4 free parameters, our population model provides an end-to-end account of audiovisual integration in mammals—from individual pixels and audio samples to behavioural responses. Remarkably, the population response to natural audiovisual scenes generates saliency maps that predict spontaneous gaze direction, Bayesian causal inference, and a variety of previously reported multisensory illusions. This study demonstrates that the integration of audiovisual stimuli, regardless of their complexity, can be accounted for in terms of elementary joint analyses of luminance and sound level. Beyond advancing our understanding of the computational principles underlying multisensory integration in mammals, this model provides a bio-inspired, general-purpose solution for multimodal machine perception.
Introduction
Perception in natural environments is inherently multisensory. For example, during speech perception, the human brain integrates audiovisual information to enhance speech intelligibility, often beyond awareness. A compelling demonstration of this is the McGurk illusion6, where the auditory perception of a syllable is altered by mismatched lip movements. Likewise, audiovisual integration plays a critical role in spatial localization, as illustrated by the ventriloquist illusion8, where perceived sound location shifts toward a synchronous visual stimulus.
Extensive behavioural and neurophysiological findings demonstrate that audiovisual integration occurs when visual and auditory stimuli are presented in close spatiotemporal proximity (i.e., the spatial and temporal determinants of multisensory integration)9–12. When redundant multisensory information is integrated, the resulting percept is more reliable13 and salient16. Various models have successfully described how audiovisual integration unfolds across time and space3,10,17,18, often within a Bayesian Causal Inference framework, where the system determines the probability that visual and auditory stimuli have a common cause and weighs the senses accordingly. This is the case for the detection of spatiotemporal discrepancies across the senses, or susceptibility to phenomena such as the McGurk or ventriloquist illusions10,17,20.
Prevailing theoretical models for multisensory integration, however, are not stimulus computable. That is, rather than directly analysing images and sounds, they operate on abstract and low-dimensional representations of the stimuli3,10,17,18,20. For instance, models of audiovisual time perception estimate subjective synchrony from a parameter of the stimuli (i.e., asynchrony in seconds17,18), not from the actual stimuli (i.e., pixels and audio samples). As a result, these models solve a different task than real observers, as they rely on explicit physical measures that are unavailable to the observers, while disregarding the information that observers actually process. Despite their obvious success in accounting for responses to simple audiovisual stimuli, these models are silent on how perceptual systems extract, process and combine task-relevant information from the continuous stream of audiovisual signals.
This omission is critical: audiovisual perception involves the continuous analysis of images and sounds, hence models that do not operate on the stimuli cannot provide a complete account of perception. Only a few models can process elementary audiovisual stimuli22,23, and none can tackle the complexity of natural audiovisual input. Currently, there are no stimulus-computable models for multisensory perception24 that can take as input natural audiovisual data, like movies. This study addresses this gap by investigating how behaviour consistent with mammalian multisensory perception emerges from low-level analyses of natural auditory and visual signals.
Besides offering an account25 for audiovisual integration in biological systems, an image and sound-computable model for multisensory perception would also provide a timely tool for computer vision. Animals consistently exhibit near-optimal integration of multisensory cues13, a capability that current computer vision systems struggle to achieve when dealing with audiovisual signals26. Bio-inspired solutions are widely recognized as crucial to efficiently handle audiovisual data26. However, in the absence of stimulus-computable models of multisensory perception, inspiration from biology risks remaining superficial, driven more by intuition than by clearly defined theoretical principles. A validated, image- and sound-computable model for audiovisual perception would provide computer vision with a biologically-plausible framework for extracting and combining task-relevant cues from the continuous flow of multimodal sensory input27.
In an image- and sound-computable model, visual and auditory stimuli can be represented as patterns in a three-dimensional space, where x and y are the two spatial dimensions, and t the temporal dimension. An instance of such a three-dimensional diagram for the case of audiovisual speech is shown in Figure 1B (top): moving lips generate patterns of light that vary in synch with the sound. In such a representation, audiovisual correspondence can be detected by a local correlator (i.e., a multiplier) that operates across space, time, and the senses28. In previous studies, we proposed a biologically plausible solution to detect temporal correlation across the senses (Figure 1A)22,29–31. Here, we will illustrate how a population of multisensory correlation detectors can take real-life footage as input and provide a comprehensive bottom-up account for multisensory integration in mammals, encompassing its temporal, spatial and attentional aspects.

The MCD population model.
Panel A: schematic representation of a single MCD unit. The input visual signal represents the intensity of a pixel (mouth area) over time, while the audio input is the soundtrack (the syllable /ba/). The grey soundtrack represents the experimental manipulation of AV lag, obtained by delaying one sense with respect to the other. BPF and LPF indicate band-pass and low-pass temporal filters, respectively. Panel C shows how single-unit responses vary as a function of crossmodal lag. Panel B represents the architecture of the MCD population model. Each visual unit (in blue) receives input from a single pixel, while the auditory unit receives as input the intensity envelope of the soundtrack (mono audio; see Figure 4A for a version of the model capable of receiving spatialized auditory input). Sensory evidence is then integrated over time and space for perceptual decision-making, a process in which the two model responses are weighted, summed, corrupted with additive Gaussian noise, and compared to a criterion to generate a forced-choice response (Panel D).
The present approach posits the existence of elementary processing units, the Multisensory Correlation Detectors (MCD)30, each integrating time-varying input from unimodal transient channels through a set of temporal filters and elementary operations (Figure 1A, see Methods). Each unit returns two outputs, representing the temporal correlation and order of incoming visual and auditory signals (Figure 1C). When arranged in a two-dimensional square lattice (Figure 1B), a population of MCD units is naturally suited to take movies (i.e., dynamic images and sounds) as input, and hence capable of processing any stimulus used in previous studies of audiovisual integration. Given that the aim of this study is to provide an account for multisensory integration in biological systems, the benchmark of our model is to reproduce observers' behaviour in carefully-controlled psychophysical and eye-tracking experiments. Emphasis will be given to studies using natural stimuli which, despite their manifest ecological value, simply cannot be handled by alternative models. Among them, particular attention will be dedicated to experiments involving speech, perhaps the most representative instance of audiovisual perception, and often claimed to be processed via dedicated mechanisms in the human brain.
Results
We tested the performance of our population model on three main aspects of audiovisual integration. The first concerns the temporal determinants of multisensory integration, primarily investigating how subjective audiovisual synchrony and integration depend on the physical lag across the senses. The second addresses the spatial determinants of audiovisual integration, focusing on the combination of visual and acoustic cues for spatial localization. The third one involves audiovisual attention and examines how gaze behaviour is spontaneously attracted to audiovisual stimuli even in the absence of explicit behavioural tasks. While most of the literature on audiovisual psychophysics involves human participants, in recent years monkeys and rats have also been trained to perform the same behavioural tasks. Therefore, to generalize our approach, whenever possible we simulated experiments involving all available animal models.
Temporal determinants of audiovisual integration in humans and rats
Classic experiments on the temporal determinants of audiovisual integration usually manipulate the lag between the senses and assess the perception of synchrony, temporal order, and audiovisual speech integration (as measured in humans with the McGurk illusion) through psychophysical forced-choice tasks32,33. Among them, we obtained both the audiovisual footage and the psychophysical data from 43 experiments in humans that used ecological audiovisual stimuli (real-life recordings of, e.g., speech and performing musicians; Figure 2A, Supplementary Figure 1 and Supplementary Table 1; for the inclusion criteria, see Methods): 27 experiments were simultaneity judgments2,4,5,17,34–38, 10 were temporal order judgments5,39, and 6 assessed the McGurk effect2,35,39.

Natural audiovisual stimuli and psychophysical responses.
Panel A, stimuli (still frame and soundtrack) and psychometric functions for the McGurk illusion2, synchrony judgments4, and temporal order judgments5. In all panels, dots correspond to empirical data, lines to MCD responses; negative lags represent vision first. Panel B left, envelopes of auditory stimuli (clicks) recorded at different distances in a reverberant environment (the Sydney Opera House)7. While the reverberant portion of the sound is identical across distances, the intensity of the direct sound (the onset) decreases with depth. As a result, the centre of mass of the envelopes shifts rightward with increasing distance. The central panel shows empirical and predicted psychometric functions for the various distances. The four curves were fitted using the same decision-making parameters, so that the separation between the curves results purely from the operation of the MCD. The lag at which sound and light appear synchronous (point of subjective synchrony) scales with distance at a rate approximately matching the speed of sound (right panel). The dots in the right panel display the point of subjective synchrony (estimated separately for each curve), while the jagged line is the model prediction. Panel C shows temporal order judgments for clicks and flashes from both rats and human observers11,12. Rats outperform humans at short lags, and vice versa. Panel D, rats' temporal order and synchrony judgments for flashes and clicks of varying intensity14. Note that the three curves in each task were fitted using the same decision-making parameters, so that the MCD alone accounts for the separation between the curves. Panel E, pharmacologically-induced changes in rats' audiovisual time perception. Left: glutamatergic inhibition (MK-801 injection) leads to asymmetric broadening of the psychometric functions for simultaneity judgments19. Right: GABA inhibition (Gabazine injection) abolishes rapid temporal adaptation, so that psychometric curves did not change based on the lag of the previous trials (as they do in controls)21. All pharmacologically-induced changes in audiovisual time perception can be accounted for by changes in the decision-making process, with no need to postulate changes in low-level temporal processing.
For each of these experiments, we can feed the stimuli to the model (Figure 1B,D) and compare the output to the empirical psychometric functions (Equation 10, for details see Methods)22,29–31. Results demonstrate that a population of MCDs can broadly account for audiovisual temporal perception of ecological stimuli and near-perfectly (rho=0.97) reproduce the empirical psychometric functions for simultaneity judgments, temporal order judgments and the McGurk effect (Figure 2A, Supplementary Figure 1). To quantify the impact of the low-level properties of the stimuli on the performance of the model, we ran a permutation test, where psychometric functions were predicted from mismatching stimuli (see Methods). The psychometric curves predicted from the matching stimuli provided a significantly better fit than those predicted from mismatching stimuli (p<0.001, see Supplementary Figure 1K). This demonstrates that our model captures the subtle effects of how individual features affect observed responses, and it highlights the role of low-level stimulus properties in multisensory integration. All analyses performed so far relied on psychometric functions averaged across observers; individual observer analyses are included in the Supplementary Information.
When estimating the perceived timing of audiovisual events, it is important to consider the different propagation speeds of light and sound, which introduce audio lags that are proportional to the observer’s distance from the source (Figure 2B, right). Psychophysical temporal order judgments demonstrate that, to compensate for these lags, humans scale subjective audiovisual synchrony with distance (Figure 2B)7. This result has been interpreted as evidence that humans exploit auditory spatial cues, such as the direct-to-reverberant energy ratio (Figure 2B, left), to estimate the distance of the sound source and adjust subjective synchrony by scaling distance estimates by the speed of sound7. When presented with the same stimuli, our model also predicts the observed shifts in subjective simultaneity (Figure 2B, centre). However, rather than relying on explicit spatial representations and physics simulations, these shifts emerge from elementary audiovisual signal analyses of natural stimuli. Specifically, in reverberant environments, the intensity of the direct portion of a sound increases with source proximity, while the reverberant component remains constant. As a result, the envelopes of sounds originating close to the observers are more front-heavy than distant sounds (Figure 2B, left). These are low-level acoustic features that the lag detector of the MCD is especially sensitive to, thereby providing a computational shortcut to explicit physics simulations.
In recent years, audiovisual timing has also been systematically studied in rats11,14,19,21,40,41, generally using minimalistic stimuli (such as clicks and flashes), and under a variety of manipulations of the stimuli (e.g., loudness) and pharmacological interventions (e.g., GABA and glutamatergic inhibition). Therefore, to further generalize our model to other species, we assessed whether it can also account for rats' behaviour in synchrony and temporal order judgments. Overall, we could tightly replicate rats' behaviour (rho=0.981; see Figure 2C-E, Supplementary Figure 2), including the effect of loudness on observed responses (Figure 2D). Interestingly, the unimodal temporal constants for rats were about 4 times shorter than for humans: this difference in temporal tuning is reflected in a higher sensitivity in rats for short lags (<0.1s), and in humans for longer lags (Figure 2C). This four-fold difference in temporal tuning between rats and humans closely mirrors analogous interspecies differences in physiological rhythms, such as heart rate (~4.7 times faster in rats) and breathing rate (~6.3 times faster in rats)42. While tuning the temporal constants of the model was necessary to account for the difference between humans and rats, this was not necessary to account for pharmacologically-induced changes in audiovisual time perception in rats (Figure 2E, Supplementary Figure 2F-G), which could all be accounted for by changes in the decision-making process (Equation 10). This latter finding suggests that these pharmacological interventions only affect perceptual decision-making, with negligible effects on low-level uni- and multisensory processing.
An asset of a low-level approach is that it allows one to inspect, at the level of individual pixels and frames, the features of the stimuli that determine the response of the model (i.e., the saliency maps). This is illustrated in Figure 3 for the case of audiovisual speech, where model responses cluster mostly around the mouth area and (to a lesser extent) the eyes (Figure 3B). These are the regions where pixels’ luminance changes in synch with the audio track (Figure 3A). When the model responses for the same stimulus are averaged over time, it is possible to display how both population responses vary with audiovisual lag (Figure 3C).

Ecological audiovisual stimuli and model responses.
Panel A displays the frames and soundtrack of a dynamic audiovisual stimulus over time (in this example, video and audio tracks are synchronous, and the actress utters the syllable /ta/). Panel B shows how the dynamic population responses MCDcorr and MCDlag vary across the frames of Panel A. Note how model responses highlight the pixels whose intensity changed with the soundtrack (i.e., the mouth area). The right side of Panel B represents the population read-out process, as implemented for the simulations in Figure 2: the population responses MCDcorr and MCDlag are integrated over space (i.e., pixels) and time (i.e., frames), scaled and weighted by the gain parameters βcorr and βlag, and summed to obtain a single decision variable that is fed to the decision-making stage (see Figure 1D). Panel C represents the time-averaged population responses as a function of audiovisual lag.
Spatial determinants of audiovisual integration in humans and monkeys
Classic experiments on the spatial determinants of audiovisual integration usually require observers to localize the stimuli under systematic manipulations of the discrepancy and reliability (i.e., precision) of the spatial cues3 (Figure 4B). This allows one to assess how unimodal cues are weighted and combined, to give rise to phenomena such as the ventriloquist illusion8. When the spatial discrepancy across the senses is low, observers’ behaviour is well described by Maximum Likelihood Estimation (MLE)3, where unimodal information is combined in a statistically optimal fashion, so as to maximize the precision (reliability) of the multimodal percept (see Equations 11–14, Methods).

Audiovisual integration in space.
Panel A top represents the MCD population model for spatialized audio. Visual and auditory input units receive input from corresponding spatial locations, and feed into spatially-tuned MCD units. The output of each MCD unit is eventually normalized by the total population output, so as to represent the probability distribution of stimulus location over space. The bottom part of Panel A represents the dynamic unimodal and bimodal population responses over time and space (azimuth), and their marginals. When time is marginalized out, a population of MCDs implements integration as predicted by the MLE model. When space is marginalized, the output shows the temporal response function of the model. In this example, visual and auditory stimuli were asynchronously presented from discrepant spatial locations (note how the blue and orange distributions are spatiotemporally offset). Panel B shows a schematic representation of the stimuli used to test the MLE model by Alais and Burr3. Stimuli were presented from different spatial positions, with a parametric manipulation of audiovisual spatial disparity and blob size (i.e., the standard deviation σ of the blob). Panel C shows how the bimodal psychometric functions predicted by the MCD (lines, see Equation 16) and the MLE (dots) models fully overlap. Panel D shows how bimodal bias varies as a function of disparity and visual reliability (see legend on the left). The dots correspond to the empirical data from participant LM, while the lines are the predictions of the MCD model (compare to Figure 2A of Alais and Burr3). Panel E shows how the just noticeable differences (JND, the random localization error) vary as a function of blob size. The blue squares represent the visual JNDs, the purple dots the bimodal JNDs, while the dashed orange line represents the auditory JND. The continuous line shows the JNDs predicted by the MCD population model (compare to Figure 2B of Alais and Burr3). Panel F represents the stimuli and results of the experiment used by Körding and colleagues10 to test the BCI model. Auditory and visual stimuli originate from one of five spatial locations, spanning a range of 20 deg. The plots show the perceived locations of visual (blue) and auditory (orange) stimuli for each combination of audiovisual spatial locations. The dots represent human data, while the lines represent the responses of the MCD population model. Panel G shows the stimuli and results of the experiment of Mohl and colleagues15. The plots on the right display the probability of a single (vs. double) fixation (top: monkeys, bottom: humans). The dots represent the empirical data, while the lines represent the responses of the MCD population model. The remaining panels show the histograms of the fixated locations in bimodal trials: the jagged histograms are the empirical data, while the smooth ones are the model prediction (zero free parameters). The regions of overlap between empirical and predicted histograms are shown in black.
Given that both the MLE and the MCD operate by multiplying unimodal inputs (see Methods), the time-averaged MCD population response (Equation 16) is equivalent to MLE (Figure 4A). This can be illustrated by simulating the experiment of Alais and Burr3 using both models. In this experiment, observers had to report whether a probe audiovisual stimulus appeared left or right of a standard. To assess the weighting behaviour resulting from multisensory integration, they manipulated the spatial reliability of the visual stimuli and the disparity between the senses (Figure 4B). Figure 4C shows that the integrated percept predicted by the two models is statistically indistinguishable. As such, a population of MCDs (Equation 16) can jointly account for the observed bias and precision of the bimodal percept (Figure 4D-E), with zero free parameters.
While fusing audiovisual cues is a sensible solution in the presence of minor spatial discrepancies across the senses, integration eventually breaks down with increasing disparity43: when the spatial (or temporal) conflict is too large, visual and auditory signals may well be unrelated. To account for the breakdown of multisensory integration in the presence of intersensory conflicts, Körding and colleagues proposed the influential Bayesian Causal Inference (BCI) model10, where uni- and bimodal location estimates are weighted based on the probability that the two modalities share a common cause (Equation 17). The BCI model was originally tested in an experiment in which sound and light were simultaneously presented from one of five random locations, and observers had to report the position of both modalities10 (Figure 4F). Results demonstrate that visual and auditory stimuli preferentially bias each other when the discrepancy is low, with the bias progressively declining as the discrepancy increases.
A population of MCDs can also compute the probability that auditory and visual stimuli share a common cause (Figure 1B, D; Equation 20); we can therefore test whether it can implement BCI as well. To do so, we simulated the experiment of Körding and colleagues and fed the stimuli to a population of MCDs (Equations 18–20), which near-perfectly replicated the empirical data (rho=0.99), even slightly outperforming the BCI model (rho=0.97) while relying on fewer free parameters.
To test for the generalizability of these findings to different species and behavioural paradigms, we simulated an analogous experiment15, where monkeys (Macaca mulatta) and humans were instructed to direct their gaze towards spatially scattered audiovisual stimuli (Figure 4G). If observers perceive the stimuli as sharing a common cause, they should make a single fixation, otherwise two: one per modality. Results demonstrate that, in both humans and monkeys, the probability of a single fixation decreases with increasing disparity (Figure 4G, right). This pattern is fully captured by a population of MCDs, which could fit the probability of a single fixation (Equation 21) and from that closely predict gaze directions (Equation 20) in both species with zero free parameters (Figure 4G, see Methods).
Taken together, these simulations show that behaviour consistent with BCI and MLE naturally emerges from a population of MCDs. Unlike BCI and MLE, however, a population of MCDs is both image- and sound-computable, and it also makes explicit the temporal dynamics of the integration process (Figure 4A, bottom, Figure 3B and Figure 5B-C). On one hand, this extends the application of the MCD to complex, dynamic audiovisual stimuli (such as real-life videos, which cannot be handled by MLE or BCI); on the other, it allows for direct comparison of model responses with time-varying neurophysiological measures29.

Audiovisual saliency maps.
Panel A represents a still frame of Coutrot and Guyader’s1 stimuli. The white dots represent gaze direction of the various observers. Panel B represents the MCD population response for the frame in Panel A. The dots represent observed gaze direction (and correspond to the white dots of Panel A). Panel C represents how the MCD response varies over time and azimuth (with elevation marginalized-out). The black solid lines represent the active speaker, while the waveform on the right displays the soundtrack. Note how the MCD response was higher for the active speaker. Panel D shows the distribution of model response at gaze direction (see Panel B) across all frames and observers in the database. Model response was normalized for each frame (Z-scores). The y axis represents the number of frames. The vertical grey line represents the mean.
Attention and audiovisual gaze behaviour
Multisensory stimuli are typically salient, and a vast body of literature demonstrates that spatial attention is commonly attracted to audiovisual stimuli16. This fundamental aspect of multisensory perception is naturally captured by a population of MCDs, whose dynamic response explicitly represents, for each point in time, the regions in space with the highest audiovisual correspondence. Therefore, for a population of MCDs to provide a plausible account of audiovisual integration, such dynamic saliency maps should be able to predict human audiovisual gaze behaviour, in a purely bottom-up fashion and with no free parameters. Figure 5A shows the stimuli and eye-tracking data from the experiment of Coutrot and Guyader1, in which observers passively viewed videos of four persons talking. Panel B shows the same eye-tracking data plotted over the corresponding MCD population response: across 20 observers and 15 videos (for a total of over 16,000 frames), gaze was on average directed towards the locations (i.e., pixels) yielding the top 2% of MCD responses (Figure 5D, Equations 22–23). The tight correspondence between predicted and empirical salience is illustrated in Figure 5C: note how the population response peaks follow the active speaker.
Discussion
This study demonstrates that elementary audiovisual analyses are sufficient to replicate behaviours consistent with multisensory perception in mammals. The proposed image- and sound-computable model, composed of a population of biologically-plausible elementary processing units, offers the first end-to-end account of multisensory perception. Starting exclusively from the raw stimuli (i.e., pixels and audio samples), our model quantitatively reproduced observed behaviour, including multisensory illusions, attention maps, and Bayesian Causal Inference with an average Pearson correlation above 0.97—as tested in a large-scale simulation of 69 classic audiovisual experiments with 7 different behavioural tasks, and involving 534 human observers, 110 rats, and 2 monkeys.
This novel framework has clear advantages over traditional, non-stimulus-computable perceptual models. By directly handling realistic audiovisual stimuli, our population model represents sensory input in a format that closely mirrors what animals encounter in the real world, rather than relying on abstracted or simplified representations that fail to capture the complexity of natural sensory inputs. This makes stimulus-computable models more consistent with how the brain represents and processes bimodal spatiotemporal information24. Moreover, by directly operating on the raw signals, our population model tailors its predictions to the exact stimuli and tasks used in each experiment, thereby capturing subtle effects of how individual features affect observed responses (see Supplementary Figure 1K).
The present approach naturally lends itself to be generalized and tested against a broad range of tasks, stimuli, and responses—as reflected by the unparalleled breadth of the experiments simulated here. Among the perceptual effects emerging from elementary signal processing, one notable example is the scaling of subjective audiovisual synchrony with sound source distance7. As sound travels slower than light, humans compensate for audio delays by adjusting subjective synchrony based on the source's distance scaled by the speed of sound. Although this phenomenon seems to rely on explicit physics modelling, our simulations demonstrate that auditory cues embedded in the envelope (Figure 2B, left) are sufficient to scale subjective audiovisual synchrony. In a similar fashion, our simulations show that phenomena such as the McGurk illusion, the subjective timing of natural audiovisual stimuli, and saliency detection, may emerge from elementary operations performed at pixel level, bypassing the need for more sophisticated analyses such as image segmentation, lip or face-tracking, 3D reconstruction, etc. Elementary, general-purpose operations on natural stimuli can drive complex behaviour, sometimes even in the absence of advanced perceptual and cognitive contributions. Indeed, it is intriguing that a population of MCDs, a computational architecture originally developed for motion vision in insects, can predict speech illusions in humans.
The fact that identical low-level analyses can account for all of the 69 experiments simulated here directly addresses several open questions in multisensory perception. For instance, psychometric functions for speech and non-speech stimuli often differ significantly44. This has been interpreted as evidence that speech may be special and processed via dedicated mechanisms45. However, identical low-level analyses are sufficient to account for all observed responses, regardless of the stimulus type (Figure 2, Supplementary Figures 1–2). This suggests that the differences in psychometric curves across classes of stimuli (e.g., speech vs. non-speech vs. clicks-&-flashes) are due to the low-level features of the stimuli themselves, not how the brain processes them. Similarly, experience and expertise also modulate multisensory perception. For example, audiovisual simultaneity judgments differ significantly between musicians and non-musicians4 (see Supplementary Figure 1C). Likewise, the McGurk illusion39 and subjective audiovisual timing46 vary over the lifespan in humans, and following pharmacological interventions in rats19,21 (see Supplementary Figures 1E,J and 2F-G). Our simulations show that parametric adjustments in the decision-making process are sufficient to account for all these effects, without the need to assume structural or parametric differences in low-level perceptual analyses across observers or conditions.
Although the same model explains responses to multisensory stimuli in humans, rats, and monkeys, the temporal constants vary across species. For example, the model for rats is tuned to temporal frequencies over four times higher than those for humans. This not only explains the differential sensitivity of humans and rats to long and short audiovisual lags, but it also mirrors analogous interspecies differences in physiological rhythms, such as heart and breathing rates42. Previous research has shown that physiological arousal modulates perceptual rhythms within individuals47. It is an open question whether the same association between multisensory temporal tuning and physiological rhythms persists in other mammalian systems. Conversely, no major differences in the model’s spatial tuning were found between humans and macaques, possibly reflecting the close phylogenetic link between the two species.
A compelling result from our simulations is the predictability of gaze direction for naturalistic audiovisual stimuli. Saliency, the property by which some elements in a display stand out and attract observers' attention and gaze, is a popular concept in both cognitive and computer sciences48. In computer vision, saliency models are usually complex and rely on advanced signal processing and semantic knowledge—typically with tens of millions of parameters49. Despite successfully predicting gaze behaviour, current audiovisual saliency models are often computationally expensive, and the resulting maps are hard to interpret and inevitably affected by the datasets used for training50. In contrast, our model detects saliency “out of the box”, without any free parameters, and operating purely at the individual pixel level. The elementary nature of the operations performed by a population of MCDs returns saliency maps that are easy to interpret: salient points are those with high audiovisual correlation. By grounding multisensory integration and saliency detection in biologically plausible computations, our study offers a new tool for machine perception and robotics to handle multimodal inputs in a more human-like way, while also improving system accountability.
This framework also provides a solution for self-supervised and unsupervised audiovisual learning in multimodal machine perception. A key challenge when handling raw audiovisual data is solving the causal inference problem—determining whether signals from different modalities are causally related or not10. Models in machine perception often depend on large, labelled datasets for training. In this context, a biomimetic module that handles saliency maps, audiovisual correspondence detection, and multimodal fusion can drive self-supervised learning through simulated observers, thereby reducing the dependency on labelled data26,27,51. What is more, the simplicity of our population model offers a computationally lightweight solution for systems that require rapid sensory integration, such as real-time audiovisual processing, robotics, and AR/VR technologies.
Although a population of MCDs can explain when phenomena such as the McGurk Illusion occur, it does not explain the process of phoneme categorization that ultimately determines what syllable is perceived20. More generally, it is well known that cognitive and affective factors modulate our responses to multisensory stimuli9. While a purely low-level model does not directly address these issues, the modularity of our approach makes it possible to extend the system to include high-level perceptual, cognitive and affective factors. What is more, although this study focused on audiovisual integration in mammals, the same approach can be naturally extended to other instances of sensory integration (e.g., visuo- and audio-tactile) and animal classes.
Besides simulating behavioural responses, a stimulus-computable approach necessarily makes explicit all the intermediate steps of sensory information processing. This opens the system to inspection at all of its levels, thereby allowing for direct comparisons with neurophysiology29. In insect motion vision, this transparency made it possible for the Hassenstein-Reichardt detector to act as a searchlight to link computation, behavior, and physiology at the scale of individual cells52, ultimately inspiring key advances in motion detection algorithms for modern computer vision. Being based on formally identical computational principles15, the present approach holds the same potential for multisensory perception.
Methods
The MCD population model
The architecture of each MCD unit used here is the same as described in Parise and Ernst30; however, units here receive time-varying visual and auditory input from spatiotopic receptive fields. The input stimuli (s) consist of luminance level and sound amplitude varying over space and time, and are denoted as sm(x, y, t), with x and y representing the spatial coordinates along the horizontal and vertical axes, t the temporal coordinate, and m the modality (video or audio). When the input stimulus is a movie with mono audio, the visual input to each unit is a signal representing the luminance of a single pixel over time, while the auditory input is the amplitude envelope (later we will consider more complex scenarios where auditory stimuli are also spatialized).
Each unit operates independently and detects changes in unimodal signals over time via temporal filtering based on two biphasic impulse response functions that are 90 deg out of phase (i.e., a quadrature pair). A physiologically plausible implementation of this process has been proposed by Adelson and Bergen53 and consists of linear filters of the form f(t) = (t/τbp)^n exp(−t/τbp) [1/n! − (t/τbp)²/(n+2)!].
The phase of the filter is determined by n, which, based on Emerson et al.54, takes the value of 6 for the fast filter and 9 for the slow one. The temporal constant of the filters is determined by the parameter τbp; in humans, its best-fitting value is 0.045 s for vision and 0.0367 s for audition. In rats, the fitted temporal constants for vision and audition are nearly identical, with a value of 0.010 s.
Fast and slow filters are applied to each unimodal input signal and the two resulting signals are squared and then summed. After that, a compressive non-linearity (square root) is applied to the output, so as to constrain it within a reasonable range53. Therefore, the output of each unimodal unit feeding into the correlation detector takes the form umod = √((smod * ffast)² + (smod * fslow)²),
where mod = vid, aud represents the sensory modality and * is the convolution operator.
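A minimal MATLAB sketch of this unimodal stage is given below; the function name, the 1-s filter support and the discretization are illustrative choices rather than the authors' implementation.

```matlab
function u = unimodal_transient(s, tau_bp, fs)
% UNIMODAL_TRANSIENT  Transient (energy) response of one unimodal channel.
%   s      : input signal over time (luminance of one pixel, or the sound envelope)
%   tau_bp : band-pass temporal constant in seconds (e.g., 0.045 for human vision)
%   fs     : sampling rate in Hz (e.g., the video frame rate)
t = (0:1/fs:1)';                                 % 1-s support for the impulse responses
ab = @(n) (t/tau_bp).^n .* exp(-t/tau_bp) .* ...
          (1/factorial(n) - (t/tau_bp).^2/factorial(n+2));
f_fast = ab(6);                                  % fast biphasic filter (n = 6)
f_slow = ab(9);                                  % slow biphasic filter (n = 9)

r_fast = conv(s(:), f_fast); r_fast = r_fast(1:numel(s));   % causal filtering
r_slow = conv(s(:), f_slow); r_slow = r_slow(1:numel(s));

u = sqrt(r_fast.^2 + r_slow.^2);                 % squaring, summing, square-root compression
end
```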
As in the original version22, each MCD consists of two sub-units, in which the unimodal inputs are low-pass filtered and multiplied.
The impulse response of the low-pass filter of each sub-unit is controlled by a single temporal constant, τlp, whose estimated value is 0.180 s for humans and 0.138 s for rats.
The responses of the two sub-units are eventually multiplied to obtain MCDcorr (Equation 6), which represents the local spatiotemporal audiovisual correlation, and subtracted to obtain MCDlag, which describes the relative temporal order of vision and audition.
The outputs MCDcorr (x, y, t) and MCDlag(x, y, t) are the final product of the MCD and represent the outcome of early uni- and multisensory processing.
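The sketch below continues the example above for a single MCD unit. The exponential low-pass impulse response and the pairing of each low-pass-filtered modality with the other (unfiltered) modality are assumptions based on the classic correlator layout; see Parise and Ernst30 for the reference implementation.

```matlab
function [mcd_corr, mcd_lag] = mcd_unit(u_vid, u_aud, tau_lp, fs)
% MCD_UNIT  Correlation and temporal-order outputs of a single MCD unit.
%   u_vid, u_aud : unimodal transient responses (see unimodal_transient above)
%   tau_lp       : low-pass temporal constant in seconds (e.g., 0.180 for humans)
%   fs           : sampling rate in Hz
t = (0:1/fs:1)';
f_lp = exp(-t/tau_lp);                                  % assumed low-pass impulse response

lv = conv(u_vid(:), f_lp); lv = lv(1:numel(u_vid));     % low-pass (delayed) vision
la = conv(u_aud(:), f_lp); la = la(1:numel(u_aud));     % low-pass (delayed) audition

sub1 = lv .* u_aud(:);                                  % sub-unit 1: delayed vision x audition
sub2 = u_vid(:) .* la;                                  % sub-unit 2: vision x delayed audition

mcd_corr = sub1 .* sub2;                                % local audiovisual correlation over time
mcd_lag  = sub2 - sub1;                                 % signed temporal-order (lag) signal
end
```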
The temporal constants of the filters were fitted using the Bayesian Adaptive Direct Search (BADS) algorithm55, set to maximize the correlation between the empirical and predicted psychometric functions for the temporal determinants of multisensory integration. For humans, this included all studies in Supplementary Figure 1 except patient PH; for rats, it included all non-pharmacological studies in Supplementary Figure 2. To minimize the effect of starting parameters, the fitting was performed 200 times using random starting values (from 0.001 to 1.5 s). The parameters estimated using BADS were further refined using the fminsearch algorithm in Matlab. Parameter estimation was computationally intensive, hence the amount of data had to be reduced by rescaling the videos to 15% of the original size (without affecting the frame rate). Besides reducing run-time, this simulates the (Gaussian) spatial pooling occurring in early visual pathways.
Now that the MCD population is defined and its parameters are fully constrained, what remains to be explained is how to read out, from the dynamic population responses, the relevant information that is needed to generate a behavioral response, such as eye movements, button presses, nose- or lick-contacts, etc. While the early sensory processing of each MCD unit is task-independent and operates in a purely bottom-up fashion, the exact nature of the read-out and decision-making process depends on the behavioral task, which ultimately determines how to weigh and combine the dynamic population responses. Given that in the present study we consider different types of behavioral experiments (investigating audiovisual integration through temporal tasks, spatial tasks, and passive observation), the read-out process for each task will be described in separate sections.
Modeling the temporal determinants of multisensory integration
The experiments on the temporal constraints of audiovisual integration considered here rely on psychophysical forced-choice tasks to assess the effects of crossmodal lags on perceived synchrony, temporal order, and the McGurk illusion. In humans, such experiments entail pressing one of two buttons; in rats, nose-poking or licking one of two spouts. In the case of simultaneity judgments, on each trial observers reported whether visual and auditory stimuli appeared synchronous or not (resp = {yes, no}). In temporal order judgments, observers reported which modality came first (resp = {vision first, audition first}). Finally, in the case of the McGurk illusion, observers (humans) had to report what syllable they heard (e.g., “da”, “ga” or “ba”, usually recoded as “fused” and “non-fused” percepts). When plotted against lag, the resulting empirical psychometric functions describe how audiovisual timing affects perceived synchrony, temporal order, and the McGurk illusion. Here, we model these tasks using the same read-out and decision-making process.
To account for observed responses, the dynamic, high-bandwidth population responses must be transformed (compressed) into a single number, representing response probability for each lag. In line with standard procedures56, this was achieved by integrating MCDcorr(x, y, t) and MCDlag(x, y, t) over time and space, so as to obtain two summary decision variables: the space- and time-averaged MCDcorr and MCDlag (Equations 8 and 9).
The temporal window for these analyses consisted of the duration of each stimulus, plus two seconds before and after (during which the audio was silent, and the video displayed a still frame); the exact extension of the temporal window used for the analyses had minimal effect on the results.
Here, Φ represents the cumulative normal, and βcorr and βlag are linear coefficients that weigh and scale the two summary decision variables; their weighted sum, combined with the criterion term βcrit, is passed through Φ to yield the probability of a response at each lag (Equation 10).
With the population model fully constrained (see above), the only free parameters in the present simulations are the ones controlling the decision-making process: βcrit, βcorr, and βlag. These were separately fitted for each experiment using fitglm in Matlab (binomial distribution and probit link function). For the simulations of the effects of distance on audiovisual temporal order judgments in humans7 (Figure 2B), and of the manipulation of loudness in rats14 (Figure 2D), βcrit, βcorr, and βlag were constant across conditions (i.e., distance7 or loudness14). This way, differences in the psychometric functions as a function of the distance7 or loudness14 of the stimuli are fully explained by the MCD. Overall, the model provided an excellent fit to the empirical psychometric functions, and the average Pearson correlation between observed and model responses (weighted by sample size) is 0.981 for humans and 0.994 for rats. Naturally, model-data correlation varied across experiments, largely due to sample size. This can be appreciated when the MCD-data correlation for each experiment is plotted against the number of trials for each lag in a funnel plot (see Supplementary Figure 1I). The number of trials for each lag determines the binomial error for the data points in Supplementary Figures 1 and 2; accordingly, the funnel plot shows that lower MCD-data correlations are more commonly observed for curves based on smaller sample sizes. This shows that the upper limit in the MCD-data correlation is mostly constrained by the reliability of the dataset, rather than by systematic errors in the model.
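Under these assumptions, the decision stage amounts to a probit regression of the observed responses on the two summary decision variables. The sketch below illustrates the fit with MATLAB's fitglm; all numbers are made up purely for illustration.

```matlab
% Probit read-out (Equation 10): one row per lag condition, with the space- and
% time-averaged MCD outputs as predictors of the binomial responses.
mcd_corr = [5.1 4.8 3.9 2.7 1.9 1.2]';        % summary MCDcorr per lag (illustrative)
mcd_lag  = [0.9 0.5 0.1 -0.2 -0.6 -1.1]';     % summary MCDlag per lag (illustrative)
n_yes    = [28 27 22 15 9 4]';                % "synchronous" responses out of 30 trials
n_tot    = 30*ones(6, 1);

tbl = table(mcd_corr, mcd_lag, n_yes);
mdl = fitglm(tbl, 'n_yes ~ mcd_corr + mcd_lag', ...
             'Distribution', 'binomial', 'BinomialSize', n_tot, ...
             'Link', 'probit');               % the intercept plays the role of beta_crit
p_pred = predict(mdl, tbl);                   % model psychometric function
```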
To assess the contribution of the low-level properties of the stimuli to the model's performance, we ran a permutation test, where psychometric curves were generated (Equation 10) using stimuli from different experiments (but with the same manipulation of lag). If the low-level properties of the stimuli play a significant role, the correlation between data and model with permuted stimuli should be lower than with non-permuted stimuli. For this permutation we used the data from simultaneity judgment tasks in humans (Supplementary Figure 1), as with 27 individual experiments there are enough permutations to render the test meaningful. For that, we used the temporal constants of the MCD fitted before, so that each psychometric curve in each permutation had 3 free parameters (βcrit, βcorr, and βlag). The results from 200k permutations demonstrate that the goodness of fit obtained with the original stimuli is superior to that of permuted stimuli. Specifically, the permuted distribution of the mean Pearson correlation between predicted and empirical psychometric curves had a mean of 0.972 (σ = 0.0036), while such a correlation rose to 0.989 when the MCD received the original stimuli.
Modeling the spatial determinants of multisensory integration
Most studies on audiovisual space perception only investigated the horizontal spatial dimension, hence the stimuli can be reduced to sm(x, t) instead of sm(x, y, t), as in the simulations above (see Equation 2). Additionally, based on preliminary observations, the output MCDlag does not seem necessary to account for audiovisual integration in space, hence only the population response MCDcorr (x, t) (see Equation 6) will be considered here.
MCD and MLE – simulation of Alais and Burr (2004)
In its general form, the MLE model can be expressed probabilistically as pMLE(x) ∝ pvid(x) · paud(x), where pvid(x) and paud(x) represent the probability distributions of the unimodal location estimates (i.e., the likelihood functions), and pMLE(x) is the bimodal distribution. When pvid(x) and paud(x) follow a Gaussian distribution, pMLE(x) is also Gaussian, with variance σ²MLE = σ²vid σ²aud / (σ²vid + σ²aud), and with a mean consisting of a weighted average of the unimodal means, x̂MLE = wvid x̂vid + waud x̂aud, where the weights are proportional to the reliability (i.e., the inverse variance) of each modality: wvid = σ²aud / (σ²vid + σ²aud) and waud = 1 − wvid.
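As a worked numerical example of these formulas (arbitrary illustrative values):

```matlab
% MLE combination of two Gaussian location estimates (illustrative numbers).
x_vid = 2;   sd_vid = 2;      % visual estimate (deg) and its standard deviation
x_aud = -2;  sd_aud = 8;      % auditory estimate (deg) and its standard deviation

w_vid  = sd_aud^2/(sd_vid^2 + sd_aud^2);                 % reliability-based visual weight (~0.94)
x_mle  = w_vid*x_vid + (1 - w_vid)*x_aud;                % combined estimate (~1.76 deg)
sd_mle = sqrt(sd_vid^2*sd_aud^2/(sd_vid^2 + sd_aud^2));  % combined SD (~1.94 deg), below both unimodal SDs
```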
This simulation of the study of Alais and Burr (2004) has two main goals. The first is to compare the predictions of the MCD with those of the MLE model; the second is to test whether the MCD model can predict observers' responses.
To simulate the study of Alais and Burr (2004), we first need to generate the stimuli. The visual stimuli consisted of 1-D Gaussian luminance profiles, presented for 10 ms. Their standard deviations, which determined visual spatial reliability, were defined as the standard deviations of the visual psychometric functions. Likewise, the auditory stimuli consisted of a 1-D Gaussian sound intensity profile (with a standard deviation determined by the unimodal auditory psychometric function). Note that the spatial reliability in the MCD model jointly depends on the stimulus and the receptive field of the input units; however, teasing apart the differential effects induced by these two sources of spatial uncertainty is beyond the scope of the present study. Hence, for simplicity, here we injected all spatial uncertainty into the stimulus (see also22,29–31). For this simulation, we only used the data from observer LM of Alais and Burr3, as it is the only participant for which the full dataset is publicly available (the other observers, however, showed similar results).
The stimuli are fed to the model to obtain the population response MCDcorr(x, t) (see Equation 6), which is then marginalized over time (Equation 15).
This provides a distribution of the population response over the horizontal spatial dimension. Finally, a divisive normalization is performed to transform the model response into a probability distribution over space, pMCD(x) (Equation 16).
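The sketch below illustrates this read-out, using Gaussian spatial profiles as stand-ins for the time-averaged unimodal population responses (grid, means and widths are illustrative); it also verifies numerically that the product-and-normalize rule reproduces the analytic MLE prediction.

```matlab
% Spatial read-out of the MCD population: multiplicative interaction of the
% time-averaged unimodal responses, followed by divisive normalization.
x = linspace(-20, 20, 401);              % azimuth grid (deg), illustrative
resp_vid = normpdf(x, 2, 2);             % stand-in for the time-averaged visual response
resp_aud = normpdf(x, -2, 8);            % stand-in for the time-averaged auditory response

mcd_space = resp_vid .* resp_aud;        % spatial profile of the time-marginalized MCDcorr
p_mcd = mcd_space / sum(mcd_space);      % divisive normalization (cf. Equation 16)

% Analytic MLE prediction for the same inputs: identical up to rounding.
w_vid = 8^2/(2^2 + 8^2);
p_mle = normpdf(x, w_vid*2 + (1 - w_vid)*(-2), sqrt(2^2*8^2/(2^2 + 8^2)));
p_mle = p_mle / sum(p_mle);
max(abs(p_mcd - p_mle))                  % ~0: the two distributions overlap
```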
It is important to note that Equations 15 and 16 have no free parameters, and all simulations are now performed with a fully constrained model. To test whether the MCD model can perform audiovisual integration according to the MLE model, we replicated the various conditions run by observer LM, and calculated the bimodal likelihood distributions predicted by the MLE and MCD models (i.e., pMLE(x) and pMCD(x)). These were statistically identical, up to rounding errors. The results of these simulations are plotted in Figure 4C, and displayed as cumulative distributions as in Figure 1 of Alais and Burr3.
Having demonstrated that pMLE(x) = pMCD(x), it follows that the MCD is equally capable of predicting observed responses. Nevertheless, for completeness, we compared the predictions of the MCD to the empirical data: the results demonstrate that, just like the MLE model, a population of MCDs can predict both audiovisual bias (Figure 4D) and just noticeable differences (JND, Figure 4E).
MCD and BCI – simulation of Körding et al. (2007)
To account for the spatial breakdown of multisensory integration, the BCI model operates in a hierarchical fashion: first it estimates the probability that audiovisual stimuli share a common cause (p(C = 1)); next, it weighs and integrates the unimodal and bimodal information (pmod(x) and pMLE(x)) according to this probability (Equation 17).
All the terms of Equation 17 have a homologous representation in an MCD population response: pMLE(x) corresponds to pMCD(x) (see previous section). Likewise, the homologue of pmod(x) can be obtained by marginalizing over time and normalizing the output of the unimodal units (Equation 18; see also Equation 15).
Finally, the MCD homologue of BCI's p(C = 1) can be read out from the population response following the same logic as Equations 8 and 9 (Equation 19).
Here, we found that including a compressive non-linearity (the logarithm of the total MCDcorr response) provided a tighter fit to the empirical data. With pMCD(C = 1) representing the probability that vision and audition share a common cause, and pMCD(x) and pmod(x) the fused and unimodal location estimates, the final location estimate for each modality is obtained by weighting the fused and unimodal estimates by pMCD(C = 1), following the same logic as Equation 17 (Equation 20).
The similarity between Equation 20 and Equation 17 demonstrates the fundamental homology of the BCI and MCD models; what remains to be tested is whether the MCD can also account for the results of Körding and colleagues10. For that, just like the BCI model, the MCD model relies on 4 free parameters: two are shared by both models and represent the spatial uncertainty (i.e., the variance) of the unimodal inputs; the remaining two, βcorr and β0 (Equation 19), map the integrated population response onto the probability of a common cause.
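The sketch below makes this homology concrete, continuing the spatial example above. The probit link from the log-compressed total MCDcorr response to pMCD(C = 1), and the values of βcorr and β0, are assumptions for illustration rather than the fitted model.

```matlab
% From the population response to causal inference and a combined location estimate.
x = linspace(-20, 20, 401);                      % azimuth grid (deg), illustrative
resp_vid = normpdf(x, 2, 2);  resp_aud = normpdf(x, -2, 8);
mcd_space = resp_vid .* resp_aud;
p_mcd = mcd_space / sum(mcd_space);              % fused (MLE-like) spatial estimate
p_vid = resp_vid / sum(resp_vid);                % normalized unimodal responses (cf. Equation 18)
p_aud = resp_aud / sum(resp_aud);

b_corr = 1.5;  b_0 = 1.5;                        % hypothetical read-out parameters
p_common = normcdf(b_corr*log(sum(mcd_space)) + b_0);   % assumed form of Equation 19

% Model averaging (cf. Equations 17 and 20): weight the fused and unimodal
% location estimates by the probability of a common cause.
x_hat_vid = p_common*sum(x.*p_mcd) + (1 - p_common)*sum(x.*p_vid);
x_hat_aud = p_common*sum(x.*p_mcd) + (1 - p_common)*sum(x.*p_aud);
```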
MCD and BCI – simulation of Mohl et al. (2020)
Mohl and colleagues used eye movements to test whether humans and monkeys integrate audiovisual spatial cues according to BCI. The targets consisted of either unimodal or bimodal stimuli. Following the same logic as the simulation of Alais and Burr3 (see above), the unimodal input consisted of impulses with a Gaussian spatial profile (Figure 4A-B), whose variance (i.e., spatial reliability) was derived from the unimodal conditions; the probability of a single fixation was then read out from the population response (Equation 21).
Here, plapse is a free parameter that represents the probability of making the incorrect number of fixations, irrespective of the discrepancy. As in Equation 19, βcorr and β0 are free parameters that transform the dynamic population response into a probability of a common cause (i.e., single fixation). Equation 21 could tightly reproduce the observed probability of a single fixation (Figure 4G, right), and the Pearson correlation between the model and data was 0.995 for monkeys and 0.988 for humans.
With the parameters plapse, βcorr and β0 fitted to the probability of a single fixation (i.e., the probability of a common cause), it is now possible to predict gaze direction (i.e., the perceived location of the stimuli) with zero free parameters. For that, we can use Equation 21 to get the probability of a common cause and predict gaze direction using Equation 20. The distribution of fixations predicted by the MCD closely follows the empirical histograms in both species (Figure 4G), and the correlation between model and data was 0.9 for monkeys and 0.93 for humans. Note that Figure 4G shows only a subset of the 20 conditions tested in the experiment (the same subset of conditions shown in the original paper).
MCD and audiovisual gaze behavior
To test whether the dynamic MCD population response can account for gaze behavior during passive observation of audiovisual footage, a simple solution is to measure whether observers preferentially looked at the locations where the MCDcorr response is maximal. Such an analysis was separately performed for each frame. For that, gaze directions were first low-pass filtered with a Gaussian kernel (σ = 14 pixels) and normalized to probabilities pgaze(x, y). Next, we calculated the average MCD response at gaze direction for each frame (t), obtained by weighing MCDcorr by pgaze (Equation 22).
To assess the MCD response at gaze direction against what would be expected by chance, the response at gaze was expressed, for each frame, as a standardized mean difference (SMD), that is, a Z-score relative to the distribution of MCDcorr across all the pixels of that frame (Equation 23).
Across the over 16,000 frames of the available dataset, the average SMD was 2.03 (Figure 5D). Given that the standardized mean difference serves as a metric for effect size, and that effect sizes surpassing 1.2 are deemed very large57, it is remarkable that the MCD population model can so tightly account for human gaze behavior in a purely bottom-up fashion and without free parameters.
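A sketch of this per-frame analysis is given below; the array sizes, the random response map and the example fixations are placeholders, and imgaussfilt stands in for the Gaussian smoothing of gaze positions.

```matlab
% Per-frame comparison of the MCDcorr map with gaze density (synthetic inputs).
ny = 72; nx = 128; nframes = 100;
mcd_corr_map = rand(ny, nx, nframes);            % placeholder population response
gaze_xy = [40 30; 42 31; 90 50];                 % illustrative fixations (x, y) in pixels

smd = zeros(nframes, 1);
for t = 1:nframes
    g = zeros(ny, nx);                           % delta functions at the fixated pixels
    g(sub2ind([ny nx], gaze_xy(:, 2), gaze_xy(:, 1))) = 1;
    g = imgaussfilt(g, 14);                      % Gaussian kernel, sigma = 14 pixels
    p_gaze = g / sum(g(:));                      % gaze probability map for this frame

    frame = mcd_corr_map(:, :, t);
    resp_at_gaze = sum(frame(:) .* p_gaze(:));   % MCD response at gaze (cf. Equation 22)
    smd(t) = (resp_at_gaze - mean(frame(:))) / std(frame(:));   % per-frame Z-score (cf. Equation 23)
end
disp(mean(smd))                                  % average SMD across frames
```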
Code availability statement
A MATLAB script running the MCD population model will be included as Supplementary Material.
Pre-processing of ecological audiovisual footage
The diversity of the stimuli used in this dataset requires some preprocessing before the stimuli can be fed to the population model. First, all movies were converted to grayscale (scaled between 0 and 1) and the soundtrack was converted to its rms envelope (scaled between 0 and 1), thereby removing chromatic and tonal information. Movies were then padded with two seconds of frozen frames at onset and offset to accommodate the manipulation of the lag. Finally, the luminance of the first still frame was set as baseline and subtracted from all subsequent frames. Along with the padding, this helps minimize transient artefacts induced by the onset of the video.
Video frames were rescaled to 15% of the original size, and the static background was cropped. On a practical side, this made the simulations much faster (which is crucial for parameter estimation); on a theoretical side, such downsampling simulates the Gaussian spatial pooling of luminance across the visual field (unfortunately, the present datasets do not provide sufficient information to convert pixels into visual angles). In a similar fashion, we downsampled the sound envelope to match the frame rate of the video.
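A sketch of such a preprocessing pipeline is given below; the file name, I/O calls and window sizes illustrate one possible MATLAB implementation rather than the authors' code.

```matlab
% Preprocess one audiovisual clip: grayscale video, RMS envelope, resizing, padding.
v = VideoReader('clip.mp4');                     % illustrative file name
[a, fs_aud] = audioread('clip.mp4');

frames = {};
while hasFrame(v)
    f = im2double(rgb2gray(readFrame(v)));       % grayscale, scaled 0-1
    frames{end+1} = imresize(f, 0.15);           %#ok<AGROW> rescale to 15% of original size
end
vid = cat(3, frames{:});

win = max(1, round(fs_aud/v.FrameRate));         % one video frame worth of audio samples
env = sqrt(movmean(mean(a, 2).^2, win));         % RMS envelope of the (mono) soundtrack
env = env / max(env);                            % scale between 0 and 1
env = interp1(linspace(0, 1, numel(env)), env, ...
              linspace(0, 1, size(vid, 3)))';    % downsample to the video frame rate

pad_n = round(2*v.FrameRate);                    % two seconds of padding
vid = cat(3, repmat(vid(:, :, 1), 1, 1, pad_n), vid, ...
             repmat(vid(:, :, end), 1, 1, pad_n));   % frozen frames at onset and offset
env = [zeros(pad_n, 1); env; zeros(pad_n, 1)];       % silent audio padding
vid = vid - vid(:, :, 1);                        % subtract the first frame as baseline
```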
Datasets
To thoroughly compare observed and model behavior, this study requires a large and diverse dataset consisting of both the raw stimuli and observers’ responses. For that, we adopted a convenience sampling and simulated the studies for which both stimuli and responses were available (either in public repositories, shared by the authors, or extracted from published figures). The inclusion criteria depend on what aspect of multisensory integration is being investigated, and they are described below.
For the temporal determinants of multisensory integration in humans, we only included studies that: (1) used real-life audiovisual footage, (2) performed a parametric manipulation of lag, and (3) engaged observers in a forced-choice behavioral task. Forty-three individual experiments met the inclusion criteria (Figure 2A-B and Supplementary Figure 1). These varied in terms of stimuli, observers, and tasks. In terms of stimuli, the dataset consists of responses to 105 unique real-life videos (see Supplementary Figure 1 and Supplementary Table 1). The majority of the videos represented audiovisual speech (possibly the most common stimulus in audiovisual research), but they varied in terms of content (i.e., syllables, words, full sentences, etc.), intelligibility (i.e., sine-wave speech, amplitude-modulated noise, blurred visuals, etc.), composition (i.e., full face, mouth-only, oval frame, etc.), speaker identity, etc. The remaining non-speech stimuli consist of footage of actors playing a piano or a flute. The study from Alais and Carlile7 was included in the dataset because, even if the visual stimuli were minimalistic (blobs), the auditory stimuli consisted of ecological auditory depth cues recorded in a real reverberant environment (the Sydney Opera House; see Supplementary Information for details on the dataset). The dataset contains forced-choice responses to three different tasks: speech categorization (i.e., for the McGurk illusion), simultaneity judgments, and temporal order judgments. In terms of observers, besides the general population, the dataset consists of experimental groups varying in terms of age and musical expertise, and even includes a patient, PH, who reports hearing speech before seeing mouth movements after a lesion in the pons and basal ganglia39. Taken together, the dataset consists of ~1k individual psychometric functions, from 454 unique observers, for a total of ~300k trials; the psychometric curves for each experiment are shown in Figure 2A-B and Supplementary Figure 1. All these simulations are based on psychometric functions averaged across observers; for simulations of individual observers see Supplementary Information and Supplementary Figures 3–5.
For the temporal determinants of multisensory integration in rats, we included studies that performed a parametric manipulation of lag and engaged rats in simultaneity and temporal order judgment tasks. Sixteen individual experiments11,14,19,21,40,41 met the inclusion criteria (Figure 2C-D, Supplementary Figure 2, and Supplementary Table 2): all of them used minimalistic audiovisual stimuli (clicks and flashes) with a parametric manipulation of audiovisual lag. Overall, the dataset consists of ~190 individual psychometric functions from 110 rats (and 10 humans), for a total of ~300k trials.
For the spatial determinants of multisensory integration, to the best of our knowledge there are no available datasets with both stimuli and psychophysical responses. Fortunately, however, the spatial aspects of multisensory integration are often studied with minimalistic audiovisual stimuli (e.g., clicks and blobs), which can be simulated exactly. Audiovisual integration in space is commonly framed in terms of optimal statistical estimation, where the bimodal percept is modelled either through maximum likelihood estimation (MLE) or Bayesian causal inference (BCI). To provide a plausible account of audiovisual integration in space, a population of MCDs should therefore also behave as a Bayesian-optimal estimator. This hypothesis was tested by comparing the population response to human data in the studies that originally tested the MLE and BCI models; hence we simulated the studies of Alais and Burr3 and Körding et al.10. Such simulations allow us to compare the data with our model, and our model with previous ones (MLE and BCI). Given that in these two experiments a population of MCDs behaves just like the MLE and BCI models (with the same number of free parameters or fewer), the current approach can be easily extended to other instances of sensory cue integration previously modelled in terms of optimal statistical estimation. This was tested by simulating the study of Mohl and colleagues15, who used eye movements to assess whether BCI can account for audiovisual integration in monkeys and humans.
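As a point of reference for these comparisons, the textbook forms of the two estimators can be written compactly. The sketch below is a generic implementation of MLE fusion and of a model-averaging BCI location estimate for a single audiovisual trial; the symbols, the zero-mean spatial prior, and the default parameter values are generic assumptions, not the values used in the original studies.

```python
# Generic MLE and BCI location estimates for one audiovisual trial.
# x_v, x_a: noisy visual/auditory measurements; sigma_v, sigma_a: sensory noise SDs;
# sigma_p (spatial prior SD) and p_common are free parameters of the BCI model.
import numpy as np

def mle_fusion(x_v, x_a, sigma_v, sigma_a):
    """Reliability-weighted average (forced fusion)."""
    w_v = sigma_a**2 / (sigma_v**2 + sigma_a**2)
    return w_v * x_v + (1 - w_v) * x_a

def bci_estimate(x_v, x_a, sigma_v, sigma_a, sigma_p=10.0, p_common=0.5):
    """Model-averaging BCI estimate of the auditory location (zero-mean prior)."""
    # Likelihood of the measurements under a common cause vs. independent causes
    var_c = sigma_v**2 * sigma_a**2 + sigma_v**2 * sigma_p**2 + sigma_a**2 * sigma_p**2
    like_c = np.exp(-((x_v - x_a)**2 * sigma_p**2 + x_v**2 * sigma_a**2
                      + x_a**2 * sigma_v**2) / (2 * var_c)) / (2 * np.pi * np.sqrt(var_c))
    var_i = (sigma_v**2 + sigma_p**2) * (sigma_a**2 + sigma_p**2)
    like_i = np.exp(-(x_v**2 / (sigma_v**2 + sigma_p**2)
                      + x_a**2 / (sigma_a**2 + sigma_p**2)) / 2) / (2 * np.pi * np.sqrt(var_i))
    post_c = like_c * p_common / (like_c * p_common + like_i * (1 - p_common))
    # Optimal estimates under each causal structure
    s_hat_c = (x_v / sigma_v**2 + x_a / sigma_a**2) / (1/sigma_v**2 + 1/sigma_a**2 + 1/sigma_p**2)
    s_hat_a = (x_a / sigma_a**2) / (1/sigma_a**2 + 1/sigma_p**2)
    # Model averaging: weight estimates by the posterior probability of a common cause
    return post_c * s_hat_c + (1 - post_c) * s_hat_a
```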
Finally, we tested whether a population of MCDs can predict audiovisual attention and gaze behaviour during passive observation of ecological audiovisual stimuli. The study of Coutrot and Guyader1 provides the ideal testbed for this hypothesis: much like our previous simulations (Figure 2A), they employed audiovisual speech stimuli, recorded indoors, with no camera shake. Specifically, they tracked the eye movements of 20 observers who passively watched 15 videos of a lab meeting (see Figure 5A). Without fitting any parameters, the population response tightly matched the empirical saliency maps (see Figure 5B-D).
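The comparison between model and empirical saliency maps can be illustrated with the sketch below, which smooths fixation counts into a density map and correlates it pixelwise with the model map. The use of pixelwise Pearson correlation and the Gaussian smoothing width are assumptions made for illustration, not necessarily the metric used in the original analysis.

```python
# Sketch of a saliency-map comparison (assumed metric: pixelwise Pearson
# correlation; the smoothing width sigma is an arbitrary illustrative choice).
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_density(fix_rc, shape, sigma=2.0):
    """Turn a list of (row, col) fixations into a smoothed, normalized density map."""
    density = np.zeros(shape)
    for r, c in fix_rc:
        density[int(r), int(c)] += 1
    density = gaussian_filter(density, sigma)
    return density / (density.sum() + 1e-12)

def map_correlation(model_map, empirical_map):
    """Pearson correlation between the two maps (flattened)."""
    return np.corrcoef(model_map.ravel(), empirical_map.ravel())[0, 1]
```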
Supplementary Information
Individual observers’ analyses
Most of the simulations described so far rely on group-level data, where psychometric curves represent the average response across the pool of observers that took part in each experiment. Individual psychometric functions, however, sometimes vary dramatically across observers; hence one might wonder whether the MCD, besides predicting stimulus-driven variations in the psychometric functions, can also capture individual differences. A recent study by Yarrow and colleagues1 directly addressed this question and concluded that models of the Independent Channels family outperform the MCD at fitting individual observers’ responses.
Although it can be easily shown that such a conclusion rested on an incomplete implementation of the MCD (which did not include the MCDlag output), a closer look at the two models applied to the same datasets helps illustrate their fundamental difference and highlights a key drawback of perceptual models that take parameters as input. We therefore first simulated the impulse stimuli used by Yarrow and colleagues1, fed them to the MCD, and used Equation 10 to generate the individual psychometric curves. Given that their stimuli consisted of temporal impulses with no spatiotemporal manipulation, a single MCD unit is sufficient to run these simulations. Overall, the model provided an excellent fit to the original data and tightly captured individual differences: the average Pearson correlation between predicted and empirical psychometric functions across the 57 curves shown in Supplementary Figure S2 is 0.98. Importantly, for these simulations the MCD was fully constrained, and the only free parameters (3 in total) were the linear coefficients of Equation 10, which describe how the output of the MCD is used for perceptual decision-making. For comparison, Independent Channels models achieved an analogous goodness of fit, but they required at least 5 free parameters (depending on the exact implementation, see1).
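Although Equation 10 itself is not reproduced here, the fitting logic can be sketched generically: assume the decision stage is a linear combination of the MCD outputs passed through a sigmoid, with three free coefficients estimated per observer, and goodness of fit summarized as the Pearson correlation between predicted and empirical psychometric curves. The function names and the exact form of the decision rule below are assumptions for illustration, not the equation used in the paper.

```python
# Generic sketch of fitting a 3-parameter decision stage to one observer's data.
# mcd_corr, mcd_lag: model outputs per lag condition (assumed already computed);
# p_emp: empirical proportion of "synchronous" responses per lag; n_trials: trials per lag.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import pearsonr

def predict(params, mcd_corr, mcd_lag):
    b0, b1, b2 = params                       # the three free coefficients
    z = b0 + b1 * mcd_corr + b2 * mcd_lag     # assumed linear decision variable
    return 1.0 / (1.0 + np.exp(-z))           # sigmoid -> response probability

def fit_observer(mcd_corr, mcd_lag, p_emp, n_trials):
    def nll(params):                          # binomial negative log-likelihood
        p = np.clip(predict(params, mcd_corr, mcd_lag), 1e-6, 1 - 1e-6)
        k = np.round(p_emp * n_trials)
        return -np.sum(k * np.log(p) + (n_trials - k) * np.log(1 - p))
    res = minimize(nll, x0=np.zeros(3), method="Nelder-Mead")
    p_pred = predict(res.x, mcd_corr, mcd_lag)
    return res.x, pearsonr(p_pred, p_emp)[0]  # fitted coefficients and goodness of fit
```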
To assess the generalizability of this finding, we additionally simulated the individual psychometric functions from the experiments that informed the architecture of the MCD units used here2. Specifically, Parise and Ernst2 ran two psychophysical studies using minimalistic stimuli that only varied over time. In the first one, auditory and visual stimuli consisted of step increments and/or decrements in intensity. Audiovisual lag was parametrically manipulated using the method of constant stimuli, and observers were required to perform both simultaneity and temporal order judgments. Following the logic described above, we fed the stimuli to the model and used Equation 10 (with 3 free parameters) to simulate human responses. The results demonstrate that the MCD can account for individual differences regardless of the task: the average Pearson correlation between empirical and predicted psychometric curves was 0.97 for the simultaneity judgments and 0.96 for the temporal order judgments (64 individual psychometric curves from 8 observers, for a total of 9600 trials). This generalizes the results of the previous simulation to a different type of stimuli (steps vs. impulses) and extends them to a task, the temporal order judgment, that was not considered by Yarrow and colleagues1 (whose model can only perform simultaneity judgments).
The second study of Parise and Ernst2 consists of simultaneity judgments for periodic audiovisual stimuli defined by a square-wave intensity envelope (Figure S4A). Simultaneity judgments for this type of periodic stimuli are also periodic, with two complete oscillations in perceived simultaneity for each cycle of phase shift between the senses (a phenomenon known as frequency doubling). Once again, using Equation 10 the MCD could account for individual differences in observed behaviour (5 psychometric curves from 5 observers, for a total of 3000 trials) with an average Pearson correlation of 0.93, while relying on just 3 free parameters.
It is important to note that for these simulations, the same MCD model accurately predicted (in a purely bottom-up fashion) bell-shaped simultaneity-judgment curves for non-periodic stimuli and sinusoidal curves for periodic stimuli. Alternative models of audiovisual simultaneity that take lag directly as input enforce bell-shaped psychometric functions, in which perceived synchrony monotonically decreases as we move away from the point of subjective simultaneity. As a result, in the absence of ad-hoc adjustments they necessarily fail to replicate the results of Parise and Ernst2, owing to their inability to generate periodic psychometric functions. Conversely, the MCD is agnostic about the shape of the psychometric function; hence the very same model used to predict the simultaneity judgments of Yarrow and colleagues1 can also predict the empirical psychometric functions of Parise and Ernst2, including individual differences across observers (all while relying on just three free parameters).
Simulation of Alais and Carlile (2005)
For the simulations of Alais and Carlile (2005), the envelope of the auditory stimuli was extracted from the waveforms shown in the figures of the original publication. This was done using WebPlotDigitizer to trace the profile of the waveforms; the digitized points were then interpolated and resampled at 1000 Hz. To preserve the manipulation of the direct-to-reverberant energy ratio, the section of the envelope containing the reverberant signal was identical across the four conditions (i.e., distances), so that what varied across conditions was the initial portion of the signals (the direct waves). All four psychometric functions were fitted simultaneously, so that they relied on just three free parameters: those related to the decision-making process.
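As a rough illustration of the digitization step, the sketch below interpolates a set of traced (time, amplitude) points onto a regular 1000 Hz grid. The assumed input is a two-column array of points; the exact export format of WebPlotDigitizer traces is an assumption, not a documented detail of the original analysis.

```python
# Resample digitized waveform-envelope points onto a regular 1000 Hz grid.
# points: array of shape [N, 2] with traced (time in s, amplitude) pairs.
import numpy as np

def resample_envelope(points, fs=1000):
    t, a = points[:, 0], points[:, 1]
    order = np.argsort(t)                      # ensure a monotonic time axis
    t, a = t[order], a[order]
    t_new = np.arange(t[0], t[-1], 1.0 / fs)   # regular grid at fs Hz
    env = np.interp(t_new, t, a)               # linear interpolation between traced points
    return t_new, np.clip(env, 0, None)        # envelopes are non-negative
```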
Supplementary Figures

Temporal determinants of multisensory integration in humans.
Schematic representation of the stimuli with a manipulation of audiovisual lag (Panel A). Panels B-J show the stimuli, psychophysical data (dots), and model responses (lines) of simulated studies investigating the temporal determinants of multisensory integration in humans using ecological audiovisual stimuli. The colour of the psychometric curves represents the psychophysical task (red = McGurk, black = simultaneity judgments, blue = temporal order judgments). Panel K shows the results of the permutation test: the histogram represents the permuted distribution of the correlation between data and model response. The arrow marks the observed correlation between model and data, which lies 4.7 standard deviations above the mean of the permuted distribution (p<0.001). Panel L shows the funnel plot of the Pearson correlation between model and data across all studies in Panels B-J, plotted against the number of trials used for each level of audiovisual lag. Red diamonds represent the McGurk task, blue diamonds and circles represent temporal order judgments, and black diamonds and circles represent simultaneity judgments. Diamonds represent human data, circles represent rat data. The dashed grey line represents the overall Pearson correlation between empirical and predicted psychometric curves, weighted by the number of trials of each curve. As expected, the correlation decreases with decreasing number of trials.
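For readers who wish to reproduce a test of this kind, the sketch below implements a generic permutation test under stated assumptions: the shuffling scheme (permuting which model-predicted curve is paired with which empirical curve) and the requirement that all curves sample the same number of lags are simplifications for illustration, not necessarily the exact procedure used for Panel K.

```python
# Generic permutation test for the data-vs-model correlation.
# empirical, predicted: lists of equal-length arrays (one psychometric curve each).
import numpy as np

def permutation_test(empirical, predicted, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.concatenate(empirical)
    def corr(pred_list):
        return np.corrcoef(x, np.concatenate(pred_list))[0, 1]
    observed = corr(predicted)
    # Null distribution: shuffle which predicted curve is paired with which data curve
    null = np.array([corr([predicted[i] for i in rng.permutation(len(predicted))])
                     for _ in range(n_perm)])
    z = (observed - null.mean()) / null.std()            # distance in SDs from the null mean
    p = (np.sum(null >= observed) + 1) / (n_perm + 1)    # one-sided permutation p-value
    return observed, z, p
```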

Temporal determinants of multisensory integration in rats.
Schematic representation of the stimuli (clicks and flashes) with a manipulation of audiovisual lag (Panel A). Panels B-J show the psychophysical data (dots) and model responses (lines) of the simulated studies. The colour of the psychometric curves represents the psychophysical task (black = simultaneity judgments, blue = temporal order judgments).

Simultaneity judgment for impulse stimuli in humans, individual observer data (Yarrow et al. 2023).
Panel A represents the baseline condition; Panel B represents the conservative condition (in which observers were instructed to respond “synchronous” only when absolutely sure); Panel C represents the post-test condition (in which observers were instructed to respond as they preferred, as in the baseline condition). Individual plots represent data from a single observer (dots) and model responses (lines), and consist of 135 trials each (7695 trials in total). The bottom-right plot represents the average empirical and predicted psychometric functions.

Simultaneity and temporal order judgments for step stimuli (Parise & Ernst, 2023).
Panel A represents the results of the simultaneity judgments; Panel B represents the data from the temporal order judgments. Each column represents a different combination of audiovisual step polarity (i.e., onset steps in both modalities, offset steps in both modalities, video onset and audio offset, video offset and audio onset). Different rows represent curves from different observers (dots) and model responses (lines). Each curve consists of 150 trials (9600 trials in total). The bottom row (light grey background) represents the average empirical and predicted psychometric functions.

Simultaneity judgment for periodic stimuli (Parise & Ernst, 2023).
Individual plots represent data from a single observer (dots) and model responses (lines). Each curve consists of 600 trials (3000 trials in total). The bottom-right plot represents the average empirical and predicted psychometric functions.
Supplementary Tables

Summary of the experiments simulated in Supplementary Figure 1.
The first column contains the reference of the study; the second column the task (McGurk, Simultaneity Judgment, or Temporal Order Judgment). The third column describes the stimuli: n represents the number of individual instances of the stimuli; “HI” and “LI” in Magnotti and Beauchamp (2013) indicate speech stimuli with High and Low Intelligibility, respectively; “Blur” indicates that the videos were blurred; “Disamb” indicates that ambiguous speech stimuli (i.e., sine-wave speech) were disambiguated by informing the observers of the original speech sound. The fourth column indicates whether visual and acoustic stimuli were congruent. Here, incongruent stimuli refer to the mismatching speech stimuli used in the McGurk task. “SWS” indicates sine-wave speech; “noise” in Ikeda and Morishita (2020) indicates a stimulus similar to sine-wave speech but in which white noise was used instead of pure sinusoidal waves. The fifth column reports the country where the study was performed. The sixth column describes the observers included in the study: “c.s.” indicates convenience sampling (usually undergraduate students); musicians in Lee and Noppeney (2011, 2014) were amateur piano players; Freeman et al. (2013) tested young observers (18-28 years old), a patient, P.H. (67 years old), who after a lesion in the pons and basal ganglia reported hearing speech before seeing the lips move, and a group of age-matched controls (59-74 years old). The seventh column reports the number of observers included in the study. Overall, the full dataset consisted of 986 individual psychometric curves; however, several observers participated in more than one experiment, so that the total number of unique observers was 454. The eighth column reports the number of lags used in the method of constant stimuli. The ninth column reports the number of trials included in the study. The tenth column reports the correlation between empirical and predicted psychometric functions. The bottom row contains descriptive statistics of the dataset.

Summary of the experiments simulated in Supplementary Figure 2.
The first column contains the reference of the study; the second column the task (Simultaneity Judgment or Temporal Order Judgment). The third column describes the stimuli. The fourth column indicates which rats were used as observers. The fifth column reports the number of rats in the study; ‘same’ means that the same rats took part in the experiment in the row above. The sixth column reports the number of lags used in the method of constant stimuli. The seventh column reports the number of trials included in the study (not available for all studies). The eighth column reports the correlation between empirical and predicted psychometric functions. The bottom row contains descriptive statistics of the dataset.
References
- 1. An efficient audiovisual saliency model to predict eye positions when looking at conversations. In: 2015 23rd European Signal Processing Conference (EUSIPCO), IEEE, pp. 1531–1535.
- 2. Temporal window of integration in auditory-visual speech perception. Neuropsychologia 45:598–607.
- 3. The ventriloquist effect results from near-optimal bimodal integration. Current Biology 14:257–262.
- 4. Long-term music training tunes how the brain temporally binds signals from multiple senses. Proceedings of the National Academy of Sciences 108:E1441–E1450.
- 5. Perception of intersensory synchrony in audiovisual speech: Not that special. Cognition 118:75–83.
- 6. Hearing lips and seeing voices. Nature 264:746–748.
- 7. Synchronizing to real events: Subjective audiovisual alignment scales with perceived auditory depth and speed of sound. Proceedings of the National Academy of Sciences 102:2244–2247.
- 8. Vision without inversion of the retinal image. The Psychological Review 4:463–481.
- 9. The New Handbook of Multisensory Processing. Cambridge, MA: MIT Press.
- 10. Causal inference in multisensory perception. PLoS One 2:943.
- 11. Temporal order judgment of multisensory stimuli in rat and human. Frontiers in Behavioral Neuroscience 16:1070452.
- 12. Multisensory integration: current issues from the perspective of the single neuron. Nature Reviews Neuroscience 9:255–266.
- 13. Humans integrate visual and haptic information in a statistically optimal fashion. Nature 415:429–433.
- 14. Behavioral plasticity of audiovisual perception: Rapid recalibration of temporal sensitivity but not perceptual binding following adult-onset hearing loss. Frontiers in Behavioral Neuroscience 12:256.
- 15. Monkeys and humans implement causal inference to simultaneously localize auditory and visual stimuli. Journal of Neurophysiology 124:715–727.
- 16. The multifaceted interplay between attention and multisensory integration. Trends in Cognitive Sciences 14:400–410.
- 17. Causal inference of asynchronous audiovisual speech. Frontiers in Psychology 4:798.
- 18. The best fitting of three contemporary observer models reveals how participants’ strategy influences the window of subjective synchrony. Journal of Experimental Psychology: Human Perception and Performance.
- 19. Past and present experience shifts audiovisual temporal perception in rats. Frontiers in Behavioral Neuroscience 17.
- 20. A causal inference model explains perception of the McGurk effect and other incongruent audiovisual speech. PLoS Computational Biology 13:e1005229.
- 21. An imbalance of excitation and inhibition in the multisensory cortex impairs the temporal acuity of audiovisual processing and perception. Cerebral Cortex 33:9937–9953.
- 22. Correlation detection as a general mechanism for multisensory integration. Nature Communications 7:1–9.
- 23. A biologically inspired neurocomputational model for audiovisual integration and causal inference. European Journal of Neuroscience 46:2481–2498.
- 24. Image-computable ideal observers for tasks with natural stimuli. Annual Review of Vision Science 6:491–517.
- 25. Vision: A computational investigation into the human representation and processing of visual information. MIT Press.
- 26. Increasing Importance of Joint Analysis of Audio and Video in Computer Vision: A Survey. IEEE Access 12:59399–59430.
- 27. Multimodal deep learning. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696.
- 28. When correlation implies causation in multisensory integration. Current Biology 22:46–49.
- 29. Multisensory correlation computations in the human brain identified by a time-resolved encoding model. Nature Communications 13:2489.
- 30. Multisensory integration operates on correlated input from unimodal transients channels. eLife 12:RP90841. https://doi.org/10.7554/eLife.90841.1
- 31. Visual intensity-dependent response latencies predict perceived audio–visual simultaneity. Journal of Mathematical Psychology 100:102471.
- 32. Timing in audiovisual speech perception: A mini review and new psychophysical data. Attention, Perception, & Psychophysics 78:583–601.
- 33. Perception of intersensory synchrony: a tutorial review. Attention, Perception, & Psychophysics 72:871–884.
- 34. Twice upon a time: multiple concurrent temporal recalibrations of audiovisual speech. Psychological Science 22:872–877.
- 35. The recalibration patterns of perceptual synchrony and multisensory integration after exposure to asynchronous speech. Neuroscience Letters 569:148–152.
- 36. How Are Audiovisual Simultaneity Judgments Affected by Multisensory Complexity and Speech Specificity? Multisensory Research 34:49–68.
- 37. Increased sub-clinical levels of autistic traits are associated with reduced multisensory integration of audiovisual speech. Scientific Reports 9:9535.
- 38. Temporal prediction errors in visual and auditory cortices. Current Biology 24:R309–R310.
- 39. Sight and sound out of synch: Fragmentation and renormalisation of audiovisual integration and subjective timing. Cortex 49:2875–2887.
- 40. Audiovisual temporal processing and synchrony perception in the rat. Frontiers in Behavioral Neuroscience 10:246.
- 41. Temporal order processing in rats depends on the training protocol. Journal of Experimental Psychology: Animal Learning and Cognition 49:31.
- 42. How to translate time? The temporal aspect of human and rodent biology. Frontiers in Neurology 8:92.
- 43. Intersensory binding across space and time: a tutorial review. Attention, Perception, & Psychophysics 75:790–811.
- 44. Facilitation of multisensory integration by the “unity effect” reveals that speech is special. Journal of Vision 8:1–11.
- 45. Audio–visual speech perception is special. Cognition 96:B13–B22.
- 46. Multisensory integration of drumming actions: musical expertise affects perceived audiovisual asynchrony. Experimental Brain Research 198:339–352.
- 47. Brief aerobic exercise immediately enhances visual attentional control and perceptual speed. Testing the mediating role of feelings of energy. Acta Psychologica 191:25–31.
- 48. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20:1254–1259.
- 49. A comprehensive survey on video saliency detection with auditory information: the audiovisual consistency perceptual is the key! IEEE Transactions on Circuits and Systems for Video Technology 33:457–477.
- 50. Sanity checks for saliency maps. Advances in Neural Information Processing Systems 31.
- 51. Look, listen and learn. Proceedings of the IEEE International Conference on Computer Vision, pp. 609–617.
- 52. Comprehensive characterization of the major presynaptic elements to the Drosophila OFF motion detector. Neuron 89:829–841.
- 53. Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America 2:284–299.
- 54. Directionally selective complex cells and the computation of motion energy in cat visual cortex. Vision Research 32:203–218.
- 55. Practical Bayesian optimization for model fitting with Bayesian adaptive direct search. Advances in Neural Information Processing Systems 30.
- 56. Noise, multisensory integration, and previous response in perceptual disambiguation. PLoS Computational Biology 13:e1005546.
- 57. New effect size rules of thumb. Journal of Modern Applied Statistical Methods 8:26.
Supplementary References
- 1. The best fitting of three contemporary observer models reveals how participants’ strategy influences the window of subjective synchrony. Journal of Experimental Psychology: Human Perception and Performance.
- 2. Multisensory integration operates on correlated input from unimodal transients channels. eLife 12:RP90841. https://doi.org/10.7554/eLife.90841.1
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.106122. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2025, Cesare V Parise
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.