Introduction

Perception in natural environments is inherently multisensory. For example, during speech perception, the human brain integrates audiovisual information to enhance speech intelligibility, often beyond awareness. A compelling demonstration of this is the McGurk illusion6 (Supplementary Movie 2), where the auditory perception of a syllable is altered by mismatched lip movements. Likewise, audiovisual integration plays a critical role in spatial localization, as illustrated by the ventriloquist illusion8, where perceived sound location shifts toward a synchronous visual stimulus.

Extensive behavioural and neurophysiological findings demonstrate that audiovisual integration occurs when visual and auditory stimuli are presented in close spatiotemporal proximity (i.e., the spatial and temporal determinants of multisensory integration)9,12. When redundant multisensory information is integrated, the resulting percept is more reliable11 and salient14. Various models have successfully described how audiovisual integration unfolds across time and space3,10,15,16, often within a Bayesian Causal Inference framework, in which the system determines the probability that visual and auditory stimuli have a common cause and weighs the senses accordingly. This is the case for the detection of spatiotemporal discrepancies across the senses, and for susceptibility to phenomena such as the McGurk or ventriloquist illusions10,15,18.

Prevailing theoretical models of multisensory integration typically operate at what Marr20 termed the computational level: they describe what the system is trying to achieve (e.g., precise estimates). However, these models are not stimulus-computable. That is, rather than analysing raw auditory and visual input directly, they rely on experimenter-defined, low-dimensional abstractions of the stimuli3,10,15,16,18—such as the asynchrony between sound and image, expressed in seconds15,16, or spatial location3,10. As a result, they solve a fundamentally different task than real perceptual systems, which must infer such properties from the stimuli themselves—from dynamic patterns of pixels and audio samples—without access to ground-truth parameters. From Marr’s perspective, what is missing is an account at the algorithmic level: a concrete description of the stimulus-driven representations and operations that could give rise to the observed computations.

Despite their clear success in accounting for behaviour in simple, controlled conditions, current models remain silent on how perceptual systems extract, process, and combine task-relevant information from the continuous and structured stream of audiovisual signals that real-world perception entails. This omission is critical: audiovisual perception involves the continuous analysis of images and sounds, hence models that do not operate on the stimuli cannot provide a complete account of perception. Only a few models can process elementary audiovisual stimuli23,24, and none can tackle the complexity of natural audiovisual input. Currently, there are no stimulus-computable models for multisensory perception25 that can take as input natural audiovisual data, like movies. This study explores how behaviour consistent with mammalian multisensory perception emerges from low-level analyses of natural auditory and visual signals.

In an image- and sound-computable model, visual and auditory stimuli can be represented as patterns in a three-dimensional space, where x and y are the two spatial dimensions, and t the temporal dimension. An instance of such a three-dimensional diagram for the case of audiovisual speech is shown in Figure 1B (top): moving lips generate patterns of light that vary in sync with the sound. In such a representation, audiovisual correspondence can be detected by a local correlator (i.e., a multiplier) that operates across space, time, and the senses26. In previous studies, we proposed a biologically plausible solution to detect temporal correlation across the senses (Figure 1A)23,27–29. Here, we illustrate how a population of multisensory correlation detectors can take real-life footage as input and provide a comprehensive bottom-up account of multisensory integration in mammals, encompassing its temporal, spatial and attentional aspects.

The MCD population model.

Panel A: schematic representation of a single MCD unit. The input visual signal represents the intensity of a pixel (mouth area) over time, while the audio input is the soundtrack (the syllable /ba/). The grey soundtrack represents the experimental manipulation of AV lag, obtained by delaying one sense with respect to the other. BPF and LPF indicate band-pass and low-pass temporal filters, respectively. Panel C shows how single-unit responses vary as a function of crossmodal lag. Panel B represents the architecture of the MCD population model. Each visual unit (in blue) receives input from a single pixel, while the auditory unit receives as input the intensity envelope of the soundtrack (mono audio; see Figure 4A for a version of the model capable of receiving spatialized auditory input). Sensory evidence is then integrated over time and space for perceptual decision-making, a process in which the two model responses are weighted, summed, corrupted with additive Gaussian noise, and compared to a criterion to generate a forced-choice response (Panel D).

The present approach posits the existence of elementary processing units, the Multisensory Correlation Detectors (MCD)28, each integrating time-varying input from unimodal transient channels through a set of temporal filters and elementary operations (Figure 1A, see Methods). Each unit returns two outputs, representing the temporal correlation and order of incoming visual and auditory signals (Figure 1C). When arranged in a two-dimensional lattice (Figure 1B), a population of MCD units is naturally suited to take movies (i.e., dynamic images and sounds) as input, and hence capable of processing any stimulus used in previous studies of audiovisual integration. Given that the aim of this study is to provide an account of multisensory integration in biological systems, the benchmark of our model is to reproduce observers’ behaviour in carefully controlled psychophysical and eye-tracking experiments. Emphasis will be given to studies using natural stimuli which, despite their manifest ecological value, simply cannot be handled by alternative models. Among them, particular attention will be dedicated to experiments involving speech, perhaps the most representative instance of audiovisual perception, and sometimes claimed to be processed via dedicated mechanisms in the human brain.

Results

We tested the performance of our population model on three main aspects of audiovisual integration. The first concerns the temporal determinants of multisensory integration, primarily investigating how subjective audiovisual synchrony and integration depend on the physical lag across the senses. The second addresses the spatial determinants of audiovisual integration, focusing on the combination of visual and acoustic cues for spatial localization. The third one involves audiovisual attention and examines how gaze behaviour is spontaneously attracted to audiovisual stimuli even in the absence of explicit behavioural tasks. While most of the literature on audiovisual psychophysics involves human participants, in recent years monkeys and rats have also been trained to perform the same behavioural tasks. Therefore, to generalize our approach, whenever possible we simulated experiments involving all available animal models.

Temporal determinants of audiovisual integration in humans and rats

Classic experiments on the temporal determinants of audiovisual integration usually manipulate the lag between the senses and assess the perception of synchrony, temporal order, and audiovisual speech integration (as measured in humans with the McGurk illusion, see Supplementary Movie 2) through psychophysical forced-choice tasks30,31. Among them, we obtained both the audiovisual footage and the psychophysical data from 43 experiments in humans that used ecological audiovisual stimuli (real-life recordings of, e.g., speech and performing musicians; Figure 2A, Supplementary Figure 1 and Supplementary Table 1; for the inclusion criteria, see Methods): 27 experiments were simultaneity judgments2,4,5,15,32–36, 10 were temporal order judgments5,37, and 6 assessed the McGurk effect2,33,37.

Natural audiovisual stimuli and psychophysical responses.

Panel A: stimuli (still frame and soundtrack) and psychometric functions for the McGurk illusion2, synchrony judgments4, and temporal order judgments5. In all panels, dots correspond to empirical data, lines to MCD responses; negative lags represent vision first. Panel B: left, envelopes of auditory stimuli (clicks) recorded at different distances in a reverberant environment (the Sydney Opera House)7. While the reverberant portion of the sound is identical across distances, the intensity of the direct sound (the onset) decreases with distance. As a result, the centre of mass of the envelopes shifts rightward with increasing distance. The central panel shows empirical and predicted psychometric functions for the various distances. The four curves were fitted using the same decision-making parameters, so that the separation between the curves results purely from the operation of the MCD. The lag at which sound and light appear synchronous (point of subjective synchrony) scales with distance at a rate approximately matching the speed of sound (right panel). The dots in the right panel display the point of subjective synchrony (estimated separately for each curve), while the jagged line is the model prediction. Panel C: temporal order judgments for clicks and flashes from both rats and human observers12,13. Rats outperform humans at short lags, and vice versa. Panel D: rats’ temporal order and synchrony judgments for flashes and clicks of varying intensity19. Note that the three curves in each task were fitted using the same decision-making parameters, so that the MCD alone accounts for the separation between the curves. Panel E: pharmacologically-induced changes in rats’ audiovisual time perception. Left: glutamatergic inhibition (MK-801 injection) leads to an asymmetric broadening of the psychometric functions for simultaneity judgments21. Right: GABAergic inhibition (gabazine injection) abolishes rapid temporal adaptation, so that psychometric curves did not change based on the lag of the previous trials (as they do in controls)22. All pharmacologically-induced changes in audiovisual time perception can be accounted for by changes in the decision-making process, with no need to postulate changes in low-level temporal processing.

For each of these experiments, we can feed the stimuli to the model (Figure 1B,D) and compare the output to the empirical psychometric functions (Equation 10; for details see Methods)23,27–29. Results demonstrate that a population of MCDs can broadly account for the audiovisual temporal perception of ecological stimuli, and near-perfectly (rho=0.97) reproduces the empirical psychometric functions for simultaneity judgments, temporal order judgments and the McGurk effect (Figure 2A, Supplementary Figure 1). To quantify the impact of the low-level properties of the stimuli on the performance of the model, we ran a permutation test, where psychometric functions were predicted from mismatching stimuli (see Methods). The psychometric curves predicted from the matching stimuli provided a significantly better fit than those predicted from mismatching stimuli (p<0.001, see Supplementary Figure 1K). This demonstrates that our model captures the subtle effects of individual stimulus features on observed responses, and it highlights the role of low-level stimulus properties in multisensory integration. All analyses performed so far relied on psychometric functions averaged across observers; individual-observer analyses are included in the Supplementary Information.

When estimating the perceived timing of audiovisual events, it is important to consider the different propagation speeds of light and sound, which introduce audio lags that are proportional to the observer’s distance from the source (Figure 2B, right). Psychophysical temporal order judgments demonstrate that, to compensate for these lags, humans scale subjective audiovisual synchrony with distance (Figure 2B)7. This result has been interpreted as evidence that humans exploit auditory spatial cues, such as the direct-to-reverberant energy ratio (Figure 2B, left), to estimate the distance of the sound source and adjust subjective synchrony by scaling distance estimates by the speed of sound7. When presented with the same stimuli, our model also predicts the observed shifts in subjective simultaneity (Figure 2B, centre). However, rather than relying on explicit spatial representations and physics simulations, these shifts emerge from elementary analyses of the natural audiovisual signals. Specifically, in reverberant environments, the intensity of the direct portion of a sound increases with source proximity, while the reverberant component remains constant. As a result, the envelopes of sounds originating close to the observer are more front-heavy than those of distant sounds (Figure 2B, left). These are precisely the low-level acoustic features that the lag detector of the MCD is sensitive to, thereby providing a computational shortcut to explicit physics simulations. A Matlab implementation of this simulation is included as Supplementary Material.
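To make the intuition concrete, the toy sketch below (illustrative values only, not the Supplementary code) builds a click as a distance-attenuated direct component plus a fixed reverberant tail, and shows that the centre of mass of the envelope shifts rightward with distance, which is the cue the MCD lag detector picks up.

```matlab
% Toy illustration (not the Supplementary code): how distance reshapes the
% envelope of a click in a reverberant space. The direct sound is attenuated
% with distance, while the reverberant tail stays constant.
fs = 8000;                              % sample rate (Hz), arbitrary
t  = (0:1/fs:0.5)';                     % 0.5 s time axis
reverb = 0.01 * exp(-t/0.1);            % reverberant tail, identical at all distances
for d = [1 10 30 50]                    % source distances (m), hypothetical values
    direct = zeros(size(t));
    direct(1:round(0.005*fs)) = 1/d^2;  % brief direct portion, inverse-square attenuation
    env = direct + reverb;              % overall intensity envelope
    com = sum(t.*env) / sum(env);       % centre of mass of the envelope (s)
    fprintf('distance %2d m: envelope centre of mass = %.3f s\n', d, com);
end
% The centre of mass shifts rightward with distance: the low-level feature
% the MCD lag detector is sensitive to.
```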

In recent years, audiovisual timing has also been systematically studied in rats12,19,21,22,38,39, generally using minimalistic stimuli (such as clicks and flashes), and under a variety of manipulations of the stimuli (e.g., loudness) and pharmacological interventions (e.g., GABAergic and glutamatergic inhibition). Therefore, to further generalize our model to other species, we assessed whether it can also account for rats’ behaviour in synchrony and temporal order judgments. Overall, we could tightly replicate rats’ behaviour (rho=0.981; see Figure 2C-E, Supplementary Figure 2), including the effect of loudness on observed responses (Figure 2D). Interestingly, the unimodal temporal constants fitted for rats were about 4 times shorter than those for humans: this faster temporal tuning is reflected in a higher sensitivity in rats for short lags (<0.1 s), and in humans for longer lags (Figure 2C). This roughly four-fold difference in temporal tuning between rats and humans closely mirrors analogous interspecies differences in physiological rhythms, such as heart rate (∼4.7 times faster in rats) and breathing rate (∼6.3 times faster in rats)40.

While tuning the temporal constants of the model was necessary to account for the difference between humans and rats, this was not necessary to reproduce pharmacologically-induced changes in audiovisual time perception in rats (Figure 2E, Supplementary Figure 2F-G), which could be accounted for solely by changes in the decision-making process (Equation 10). This suggests that the observed effects can be explained without altering low-level temporal processing. However, this does not imply that such changes did not occur—only that they were not required to reproduce the behavioural data in our simulations. Future studies using richer temporal stimuli—such as temporally modulated sequences that vary in frequency, rhythm, or phase—will be necessary to disentangle sensory and decisional contributions, as these stimuli can more selectively engage low-level temporal processing and better reveal whether perceptual changes arise from early encoding or later interpretive stages.

An asset of a low-level approach is that it allows one to inspect, at the level of individual pixels and frames, the features of the stimuli that determine the response of the model (i.e., the saliency maps). This is illustrated in Figure 3 and Supplementary Movies 1-2 for the case of audiovisual speech, where model responses cluster mostly around the mouth area and (to a lesser extent) the eyes. These are the regions where pixel luminance changes in sync with the audio track.

Ecological audiovisual stimuli and model responses.

Panel A displays the frames and soundtrack of a dynamic audiovisual stimulus over time (in this example, video and audio tracks are synchronous, and the actress utters the syllable /ta/). Panel B shows how the dynamic population responses MCDcorr and MCDlag vary across the frames of Panel A. Note how model responses highlight the pixels whose intensity changed with the soundtrack (i.e., the mouth area). The right side of Panel B represents the population read-out process, as implemented for the simulations in Figure 2: the population responses MCDcorr and MCDlag are integrated over space (i.e., pixels) and time (i.e., frames), scaled and weighted by the gain parameters βcorr and βlag, and summed to obtain a single decision variable that is fed to the decision-making stage (see Figure 1D). Panel C represents the time-averaged population responses MCDcorr and MCDlag as a function of crossmodal lag (the central one corresponds to the time-averaged responses shown in Panel B). Note how the time-averaged MCDcorr peaks at around zero lag and decreases with increasing lag (following the same trend shown in Figure 1C, left), while the polarity of the time-averaged MCDlag changes with the sign of the delay. The psychophysical data corresponding to the stimulus in this figure are shown in Supplementary Figure 1B. See Supplementary Movie 1 for a dynamic representation of the content of this figure.

Spatial determinants of audiovisual integration in humans and monkeys

Classic experiments on the spatial determinants of audiovisual integration usually require observers to localize the stimuli under systematic manipulations of the discrepancy and reliability (i.e., precision) of the spatial cues3 (Figure 4B). This allows one to assess how unimodal cues are weighted and combined, to give rise to phenomena such as the ventriloquist illusion8. When the spatial discrepancy across the senses is low, observers’ behaviour is well described by Maximum Likelihood Estimation (MLE)3, where unimodal information is combined in a statistically optimal fashion, so as to maximize the precision (reliability) of the multimodal percept (see Equations 11–14, Methods).

Audiovisual integration in space.

Panel A, top, represents the MCD population model for spatialized audio. Visual and auditory input units receive input from corresponding spatial locations and feed into spatially-tuned MCD units. The output of each MCD unit is eventually normalized by the total population output, so as to represent the probability distribution of stimulus location over space. The bottom part of Panel A represents the dynamic unimodal and bimodal population responses over time and space (azimuth), and their marginals. When time is marginalized out, a population of MCDs implements integration as predicted by the MLE model. When space is marginalized out, the output shows the temporal response function of the model. In this example, visual and auditory stimuli were asynchronously presented from discrepant spatial locations (note how the blue and orange distributions are spatiotemporally offset). Panel B shows a schematic representation of the stimuli used to test the MLE model by Alais and Burr3. Stimuli were presented from different spatial positions, with a parametric manipulation of audiovisual spatial disparity and blob size (i.e., the standard deviation σ of the blob). Panel C shows how the bimodal psychometric functions predicted by the MCD (lines, see Equation 16) and the MLE (dots) models fully overlap. Panel D shows how bimodal bias varies as a function of disparity and visual reliability (see legend on the left). The dots correspond to the empirical data from participant LM, while the lines are the predictions of the MCD model (compare to Figure 2A of Alais and Burr3). Panel E shows how the just noticeable differences (JND, the random localization error) vary as a function of blob size. The blue squares represent the visual JNDs, the purple dots the bimodal JNDs, and the dashed orange line the auditory JND. The continuous line shows the JNDs predicted by the MCD population model (compare to Figure 2B of Alais and Burr3). Panel F represents the stimuli and results of the experiment used by Körding and colleagues10 to test the BCI model. Auditory and visual stimuli originate from one of five spatial locations, spanning a range of 20 deg. The plots show the perceived locations of visual (blue) and auditory (orange) stimuli for each combination of audiovisual spatial locations. The dots represent human data, while the lines represent the responses of the MCD population model. Panel G shows the stimuli and results of the experiment of Mohl and colleagues17. The plots on the right display the probability of a single (vs. double) fixation (top: monkeys, bottom: humans). The dots represent the empirical data, while the lines represent the responses of the MCD population model. The remaining panels show the histograms of the fixated locations in bimodal trials: the jagged histograms are the empirical data, while the smooth ones are the model prediction (zero free parameters). The regions of overlap between empirical and predicted histograms are shown in black.

Given that both the MLE and the MCD operate by multiplying unimodal inputs (see Methods), the time-averaged MCD population response (Equation 16) is equivalent to MLE (Figure 4A). This can be illustrated by simulating the experiment of Alais and Burr3 using both models. In this experiment, observers had to report whether a probe audiovisual stimulus appeared left or right of a standard. To assess the weighting behaviour resulting from multisensory integration, they manipulated the spatial reliability of the visual stimuli and the disparity between the senses (Figure 4B). Figure 4C shows that the integrated percepts predicted by the two models are statistically indistinguishable. As such, a population of MCDs (Equation 16) can jointly account for the observed bias and precision of the bimodal percept (Figure 4D-E), with zero free parameters (see Supplementary Codes).

While fusing audiovisual cues is a sensible solution in the presence of minor spatial discrepancies across the senses, integration eventually breaks down with increasing disparity41–when the spatial (or temporal) conflict is too large, visual and auditory signals may well be unrelated. To account for the breakdown of multisensory integration in the presence of intersensory conflicts, Körding and colleagues proposed the influential Bayesian Causal Inference (BCI) model10, where uni- and bimodal location estimates are weighted based on the probability that the two modalities share a common cause (Equation 17). The BCI model was originally tested in an experiment in which sound and light were simultaneously presented from one of five random locations, and observers had to report the position of both modalities10 (Figure 4F). Results demonstrate that visual and auditory stimuli bias each other when the discrepancy is low, with the bias progressively declining as the discrepancy increases.

A population of MCDs can also compute the probability that auditory and visual stimuli share a common cause (Figure 1B,D; Equation 20); we can therefore test whether it can also implement BCI. For that, we simulated the experiment of Körding and colleagues and fed the stimuli to a population of MCDs (Equations 18–20), which near-perfectly replicated the empirical data (rho=0.99), even slightly outperforming the BCI model. A Matlab implementation of this simulation is included as Supplementary Material.

To test the generalizability of these findings across species and behavioural paradigms, we simulated an experiment in which monkeys (Macaca mulatta) and humans directed their gaze toward audiovisual stimuli presented at varying spatial disparities (Figure 4G)17. If observers infer a common cause, they tend to make a single fixation; otherwise, two—one for each modality. As expected, the probability of a single fixation decreased with increasing disparity (Figure 4G, right). This pattern was captured by a population of MCDs: MCDcorr values were used to fit the probability of single vs. double saccades as a function of disparity (Equation 21, Figure 4G, right). Critically, using this fit, the model was then able to predict the full distribution of gaze locations (Equation 20, Figure 4G, left) in both species with zero additional free parameters. A Matlab implementation of this simulation is included as Supplementary Material.

Taken together, these simulations show that behaviour consistent with Bayesian Causal Inference (BCI) and Maximum Likelihood Estimation (MLE) naturally emerges from a population of MCDs. Unlike BCI and MLE, however, the MCD population is both image- and sound-computable, and it explicitly represents the spatiotemporal dynamics of the process (Figure 4A, bottom; Figure 3B; Figure 5; Figure 6B-C). On one hand, this enables the model to be applied to complex, dynamic audiovisual stimuli—such as real-life videos—which are beyond the scope of traditional BCI or MLE frameworks. On the other, it permits direct, time-resolved comparisons between model responses and neurophysiological measures27.

MCD and the Ventriloquist Illusion.

The upper panel represents a still frame of a performing ventriloquist. The central panel represents the MCD population response. The lower plot represents the horizontal profile of the MCD response for the same frame. Note how the population response clusters on the location of the dummy, where more pixels were temporally correlated with the soundtrack.

Audiovisual saliency maps.

Panel A represents a still frame of Coutrot and Guyader’s1 stimuli. The white dots represent the gaze direction of the various observers. Panel B represents the MCD population response for the frame in Panel A. The dots represent observed gaze direction (and correspond to the white dots of Panel A). Panel C represents how the MCD response varies over time and azimuth (with elevation marginalized out). The black solid lines represent the active speaker, while the waveform on the right displays the soundtrack. Note how the MCD response was higher for the active speaker. Panel D shows the distribution of the model response at gaze direction (see Panel B) across all frames and observers in the database. The model response was normalized for each frame (Z-scores). The y axis represents the number of frames. The vertical grey line represents the mean. See Supplementary Movie 4 for a dynamic representation of the content of this figure.

As a practical demonstration, we applied the model (Equation 6) to a real-life video of a performing ventriloquist (Figure 5). The population response dynamically tracked the active talker, clustering around the dummy’s face whenever it produced speech.

Spatial orienting and audiovisual gaze behaviour

Multisensory stimuli are typically salient, and a vast body of literature demonstrates that spatial attention is commonly attracted to audiovisual stimuli14. This aspect of multisensory perception is naturally captured by a population of MCDs, whose dynamic response explicitly represents the regions in space with the highest audiovisual correspondence at each point in time. Therefore, for a population of MCDs to provide a plausible account of audiovisual integration, such dynamic saliency maps should be able to predict human audiovisual gaze behaviour, in a purely bottom-up fashion and with no free parameters. Figure 6A shows the stimuli and eye-tracking data from the experiment of Coutrot and Guyader1, in which observers passively viewed videos of four people talking. Panel B shows the same eye-tracking data plotted over the corresponding MCD population response: across 20 observers and 15 videos (for a total of over 16,000 frames), gaze was on average directed towards the locations (i.e., pixels) yielding the top 2% of the MCD response (Figure 6D, Equations 22–23). The tight correspondence between predicted and empirical salience is illustrated in Figure 6C and Supplementary Movie 4: note how population responses peak based on the active speaker.

Discussion

This study demonstrates that elementary audiovisual analyses are sufficient to replicate behaviours consistent with multisensory perception in mammals. The proposed image- and sound-computable model, composed of a population of biologically plausible elementary processing units, provides a stimulus-driven framework for multisensory perception that transforms raw audiovisual input into behavioural predictions. Starting directly from pixels and audio samples, our model closely matched observed behaviour across a wide range of phenomena—including multisensory illusions, spatial orienting, and causal inference—with average correlations above 0.97. This was tested in a large-scale simulation spanning 69 audiovisual experiments, 7 behavioural tasks, and data from 534 humans, 110 rats, and 2 monkeys.

We define a stimulus-computable model as one that receives input directly from the stimulus—such as raw images and sound waveforms—rather than from abstracted descriptors like lag, disparity, or reliability. Framed in Marr’s terms, stimulus-computable models operate at the algorithmic level, specifying how sensory information is represented and processed. This contrasts with computational-level models, such as Bayesian ideal observers, which define the goals of perception (e.g., maximizing reliability3,11) without specifying how those goals are achieved. Rather than competing with such normative accounts, the MCD provides a mechanistic substrate that could plausibly implement them. By operating directly on realistic audiovisual signals, our population model captures the richness of natural sensory input and directly addresses the problem of how biological systems represent and process multisensory information25. This allows the MCD to generate precise, stimulus-specific predictions across tasks, including subtle differences in behavioural outcomes that arise from the structure of individual stimuli (see Supplementary Figure 1K).

The present approach naturally lends itself to being generalized and tested against a broad range of tasks, stimuli, and responses—as reflected by the breadth of the experiments simulated here. Among the perceptual effects emerging from elementary signal processing, one notable example is the scaling of subjective audiovisual synchrony with sound source distance7. As sound travels slower than light, humans compensate for audio delays by adjusting subjective synchrony based on the source’s distance scaled by the speed of sound. Although this phenomenon appears to rely on explicit physics modelling, our simulations demonstrate that auditory cues embedded in the envelope (Figure 2B, left) are sufficient to scale subjective audiovisual synchrony. In a similar fashion, our simulations show that phenomena such as the McGurk illusion, the subjective timing of natural audiovisual stimuli, and saliency detection may emerge from elementary operations performed at the pixel level, bypassing the need for more sophisticated analyses such as image segmentation, lip or face tracking, 3D reconstruction, etc. Elementary, general-purpose operations on natural stimuli can drive complex behaviour, sometimes even in the absence of advanced perceptual and cognitive contributions. Indeed, it is intriguing that a population of MCDs, a computational architecture originally developed for motion vision in insects, can predict speech illusions in humans.

The fact that identical low-level analyses can account for all of the 69 experiments simulated here directly addresses several open questions in multisensory perception. For instance, psychometric functions for speech and non-speech stimuli often differ significantly42. This has been interpreted as evidence that speech may be special and processed via dedicated mechanisms43. However, identical low-level analyses are sufficient to account for all observed responses, regardless of the stimulus type (Figure 2, Supplementary Figures 1–2). This suggests that most of the differences in psychometric curves across classes of stimuli (e.g., speech vs. non-speech vs. clicks-&-flashes) are due to the low-level features of the stimuli themselves, not to how the brain processes them. Similarly, experience and expertise also modulate multisensory perception. For example, audiovisual simultaneity judgments differ significantly between musicians and non-musicians4 (see Supplementary Figure 1C). Likewise, the McGurk illusion37 and subjective audiovisual timing44 vary over the lifespan in humans, and following pharmacological interventions in rats21,22 (see Supplementary Figure 1E,J and Supplementary Figure 2F-G). Our simulations show that adjustments at the decision-making level are sufficient to account for these effects, without requiring structural or parametric changes to low-level perceptual processing across observers or conditions.

Although the same model explains responses to multisensory stimuli in humans, rats, and monkeys, the temporal constants vary across species. For example, the model for rats is tuned to temporal frequencies over four times higher than those for humans. This not only explains the differential sensitivity of humans and rats to long and short audiovisual lags, but it also mirrors analogous interspecies differences in physiological rhythms, such as heart and breathing rates40. Previous research has shown that physiological arousal modulates perceptual rhythms within individuals45. It is an open question whether the same association between multisensory temporal tuning and physiological rhythms persists in other mammalian systems. Conversely, no major differences in the model’s spatial tuning were found between humans and macaques, possibly reflecting the close phylogenetic link between the two species.

How might these computations be implemented neurally? In a recent study27, we identified neural responses in the posterior superior temporal sulcus, superior temporal gyrus, and left superior parietal gyrus that tracked the output of an MCD model during audiovisual temporal tasks. Participants were presented with random sequences of clicks and flashes while performing either a causality judgment or a temporal order judgment task. By applying a time-resolved encoding model to MEG data, we demonstrated that MCD dynamics aligned closely with stimulus-evoked cortical activity. The present study considerably extends the scope of the MCD framework, allowing it to process more naturalistic stimuli and to account for a broader range of behaviours—including cue combination, attentional orienting, and gaze-based decisions. This expansion opens the door to new neurophysiological investigations into the implementation of multisensory integration. For instance, the dynamic, spatially distributed population responses generated by the MCD (see Movies) can be directly compared with neural population activity recorded using techniques such as ECoG, Neuropixels, or high-density fMRI—similar to previous efforts that linked the Bayesian Causal Inference model to neural responses during audiovisual spatial integration46–48. Such comparisons may help bridge algorithmic and implementational levels of analysis, offering concrete hypotheses about how audiovisual correspondence detection and integration are instantiated in the brain.

An informative outcome of our simulations is the model’s ability to predict gaze direction in response to naturalistic audiovisual stimuli. Saliency, the property by which some elements in a display stand out and attract observers’ attention and gaze, is a popular concept in both the cognitive and computer sciences49. In computer vision, saliency models are usually complex and rely on advanced signal processing and semantic knowledge—typically with tens of millions of parameters50. Despite successfully predicting gaze behaviour, current audiovisual saliency models are often computationally expensive, and the resulting maps are hard to interpret and inevitably affected by the datasets used for training51. In contrast, our model detects saliency “out of the box”, without any free parameters, operating purely at the individual pixel level. The elementary nature of the operations performed by a population of MCDs returns saliency maps that are easy to interpret: salient points are those with high audiovisual correlation. By grounding multisensory integration and saliency detection in biologically plausible computations, our study offers a new tool for machine perception and robotics to handle multimodal inputs in a more human-like way, while also improving system accountability.

This framework also provides a solution for self-supervised and unsupervised audiovisual learning in multimodal machine perception. A key challenge when handling raw audiovisual data is solving the causal inference problem—determining whether signals from different modalities are causally related or not10. Models in machine perception often depend on large, labelled datasets for training. In this context, a biomimetic module that handles saliency maps, audiovisual correspondence detection, and multimodal fusion can drive self-supervised learning through simulated observers, thereby reducing the dependency on labelled data52–54. Furthermore, the simplicity of our population-based model provides a computationally efficient alternative for real-time multisensory integration in applications such as robotics, AR/VR, and other low-latency systems.

Although a population of MCDs can explain when phenomena such as the McGurk Illusion occur, it does not explain the process of phoneme categorization that ultimately determines what syllable is perceived18. More generally, it is well known that cognitive and affective factors modulate our responses to multisensory stimuli9. In particular, the model does not currently incorporate linguistic mechanisms or top-down predictive processes, which play a central role in audiovisual speech perception—such as the integration of complementary articulatory features, lexical expectations, or syntactic constraints55–58. While a purely low-level model does not directly address these issues, the modularity of our approach makes it possible to extend the system to include high-level perceptual, cognitive and affective factors. What is more, although this study focused on audiovisual integration in mammals, the same approach can be naturally extended to other instances of sensory integration (e.g., visuo- and audio-tactile) and animal classes. A possible extension of the model for trimodal integration is included in the Supplementary Material.

Besides simulating behavioural responses, a stimulus-computable approach necessarily makes explicit all the intermediate steps of sensory information processing. This opens the system to inspection at all of its levels, thereby allowing for direct comparisons with neurophysiology27. In insect motion vision, this transparency made it possible for the Hassenstein-Reichardt detector to act as a searchlight to link computation, behavior, and physiology at the scale of individual cells59. Being based on formally identical computational principles15, the present approach holds the same potential for multisensory perception.

Methods

The MCD population model

The architecture of each MCD unit used here is the same as described in Parise and Ernst28; however, units here receive time-varying visual and auditory input from spatiotopic receptive fields. The input stimuli (s) consist of luminance level and sound amplitude varying over space and time and are denoted as sm(x, y, t), with x and y representing the spatial coordinates along the horizontal and vertical axes, t the temporal coordinate, and m the modality (video or audio). When the input stimulus is a movie with mono audio, the visual input to each unit is a signal representing the luminance of a single pixel over time, while the auditory input is the amplitude envelope (later we will consider more complex scenarios where auditory stimuli are also spatialized).

Each unit operates independently and detects changes in unimodal signals over time by temporal filtering based on two biphasic impulse response functions that are 90 deg out of phase (i.e., a quadrature pair). A physiologically plausible implementation of this process has been proposed by Adelson and Bergen60 and consists of linear filters of the form:
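(assuming the standard Adelson–Bergen parameterization, with n and τbp as defined below)

f_{bp}(t) = \left(\frac{t}{\tau_{bp}}\right)^{n} e^{-t/\tau_{bp}} \left[\frac{1}{n!} - \frac{(t/\tau_{bp})^{2}}{(n+2)!}\right], \qquad t \ge 0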

The phase of the filter is determined by n, which, based on Emerson et al.61, takes the value of 6 for the fast filter and 9 for the slow one. The temporal constant of the filters is determined by the parameter τbp; in humans, its best-fitting value is 0.045 s for vision and 0.0367 s for audition. In rats, the fitted temporal constants for vision and audition are nearly identical, with a value of 0.010 s.

Fast and slow filters are applied to each unimodal input signal and the two resulting signals are squared and then summed. After that, a compressive non-linearity (square-root) is applied to the output, so as to constrain it within a reasonable range60. Therefore, the output of each unimodal unit feeding into the correlation detector takes the following form

where mod = vid, aud represents the sensory modality and * is the convolution operator.
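A minimal MATLAB sketch of such a transient channel, assuming the filter parameterization given above and discrete-time convolution (the function and variable names are illustrative, not those of the Supplementary code):

```matlab
% Unimodal transient channel (sketch): quadrature band-pass filtering,
% squaring, summing, and a compressive square-root. The filter
% parameterization is the assumed Adelson-Bergen form given above.
function out = transient_channel(s, fs, tau_bp)
    t      = (0:1/fs:1)';                                   % 1 s of impulse response
    ab     = @(n) (t/tau_bp).^n .* exp(-t/tau_bp) .* ...
                  (1/factorial(n) - (t/tau_bp).^2/factorial(n+2));
    f_fast = ab(6);                                         % fast filter, n = 6
    f_slow = ab(9);                                         % slow filter, n = 9
    r_fast = conv(s, f_fast, 'same');                       % band-pass filter the input
    r_slow = conv(s, f_slow, 'same');
    out    = sqrt(r_fast.^2 + r_slow.^2);                   % energy, then compressive non-linearity
end
```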

As in the original version23, each MCD consists of two sub-units, in which the unimodal input is low-pass filtered and multiplied as follows

The impulse response of the low-pass filter of each sub-unit takes the form

where τlp represents the temporal constant, and its estimated value is 0.180 s for humans and 0.138 s for rats.

The responses of the sub-units are eventually multiplied to obtain MCDcorr, which represents the local spatiotemporal audiovisual correlation, and subtracted to obtain MCDlag, which describes the relative temporal order of vision and audition

The outputs MCDcorr (x, y, t) and MCDlag(x, y, t) are the final product of the MCD.
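As a sketch, a single unit can be written as follows, given the transient-channel outputs for one pixel (v) and for the audio envelope (a); the exact sub-unit wiring and the form of the low-pass filter follow Parise and Ernst28 and are treated as assumptions here (an exponential low-pass is used for illustration):

```matlab
% Single MCD unit (sketch): two sub-units, each multiplying one modality
% with a low-pass-filtered copy of the other; their product gives MCDcorr
% and their difference gives MCDlag. The exponential low-pass is an assumption.
function [mcd_corr, mcd_lag] = mcd_unit(v, a, fs, tau_lp)
    t    = (0:1/fs:1)';
    f_lp = exp(-t/tau_lp);  f_lp = f_lp/sum(f_lp);   % low-pass impulse response (assumed form)
    u_va = conv(v, f_lp, 'same') .* a;               % sub-unit 1: low-passed video x audio
    u_av = conv(a, f_lp, 'same') .* v;               % sub-unit 2: low-passed audio x video
    mcd_corr = u_va .* u_av;                         % multiplied: spatiotemporal correlation
    mcd_lag  = u_va - u_av;                          % subtracted: temporal-order signal
end
```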

The temporal constants of the filters were fitted using the Bayesian Adaptive Direct Search (BADS) algorithm62, set to maximize the correlation between the empirical and predicted psychometric functions for the temporal determinants of multisensory integration. For humans, this included all studies in Supplementary Figure 1, except Patient PH; for rats, it included all non-pharmacological studies in Supplementary Figure 2. To minimize the effect of starting parameters, the fitting was performed 200 times using random starting values (from 0.001 to 1.5 s). The parameters estimated using BADS were further refined using the fminsearch algorithm in Matlab. Parameter estimation was computationally intensive, hence the amount of data had to be reduced by rescaling the videos to 15% of their original size (without affecting the frame rate). Besides reducing run-time, this simulates the (Gaussian) spatial pooling occurring in early visual pathways. Details on parameter fitting are provided below.

Now that the MCD population is defined and its parameters are fully constrained, what remains to be explained is how to read out, from the dynamic population responses, the relevant information needed to generate a behavioral response, such as eye movements, button presses, or nose- and lick-contacts. While the MCD units are task-independent and operate in a purely bottom-up fashion, the exact nature of the read-out and decision-making process depends on the behavioral task, which ultimately determines how to weight and combine the dynamic population responses. Given that the present study considers different types of behavioral experiments (investigating audiovisual integration through temporal tasks, spatial tasks, and passive observation), the read-out process for each task is described in a separate section.

Modeling the temporal determinants of multisensory integration

The experiments on the temporal constraints of audiovisual integration considered here rely on psychophysical forced-choice tasks to assess the effects of crossmodal lags on perceived synchrony, temporal order, and the McGurk illusion. In humans, such experiments entail pressing one of two buttons; in rats, nose-poking or licking one of two spouts. In the case of simultaneity judgments, on each trial observers reported whether visual and auditory stimuli appeared synchronous or not (resp = {yes, no}). In temporal order judgments, observers reported which modality came first (resp = {vision first, audition first}). Finally, in the case of the McGurk task, observers (humans) had to report what syllable they heard (e.g., “da”, “ga” or “ba”, usually recoded as “fused” and “non-fused” percepts). When plotted against lag, the resulting empirical psychometric functions describe how audiovisual timing affects perceived synchrony, temporal order, and the McGurk illusion. Here, we model these tasks using the same read-out and decision-making process.

To account for observed responses, the dynamic, high-bandwidth population responses must be transformed (compressed) into a single number, representing response probability for each lag. In line with standard procedures63, this was achieved by integrating MCDcorr(x, y, t) and MCDlag(x, y, t) over time and space, so as to obtain two summary decision variables

\overline{MCD}_{corr} = \iiint MCD_{corr}(x, y, t)\, dx\, dy\, dt

and

\overline{MCD}_{lag} = \iiint MCD_{lag}(x, y, t)\, dx\, dy\, dt

The temporal window for these analyses consisted of the duration of each stimulus, plus two seconds before and after (during which the audio was silent, and the video displayed a still frame); the exact extent of the temporal window used for the analyses had minimal effect on the results. The two summary variables are eventually linearly weighted and transformed into response probabilities through a cumulative normal function as follows

p_{MCD}(resp) = \Phi\left(\beta_{corr}\, \overline{MCD}_{corr} + \beta_{lag}\, \overline{MCD}_{lag} + \beta_{crit}\right)

Here, Φ represents the cumulative normal, βcorr and βlag are linear coefficients that weigh and scale the two summary decision variables, and βcrit is a bias term, which corresponds to the response criterion. Finally, pMCD(resp) is the probability of a response, which is pMCD(Synchronous) for the simultaneity judgment task, pMCD(Audio first) for the temporal order judgment task, and pMCD(Fusion) for the McGurk task. Note how this formulation entails that simultaneity judgments, temporal order judgments, and the McGurk illusion are all simulated using the very same model architecture.

With the population model fully constrained (see above), the only free parameters in the present simulations are those controlling the decision-making process: βcrit, βcorr, and βlag. These were fitted separately for each experiment using Matlab’s fitglm (binomial distribution and probit link function). For the simulations of the effects of distance on audiovisual temporal order judgments in humans7 (Figure 2B) and of the manipulation of loudness in rats19 (Figure 2D), βcrit, βcorr, and βlag were constant across conditions (i.e., distance7 or loudness19). This way, differences in the psychometric functions as a function of the distance7 or loudness19 of the stimuli are fully explained by the MCD. Overall, the model provided a good fit to the empirical psychometric functions, and the average Pearson correlation between empirical and model responses (weighted by sample size) is 0.981 for humans and 0.994 for rats. Naturally, the model-data correlation varied across experiments, largely due to sample size. This can be appreciated when the MCD-data correlation for each experiment is plotted against the number of trials per lag in a funnel plot (see Supplementary Figure 1I). The number of trials per lag determines the binomial error for the data points in Supplementary Figures 1 and 2; accordingly, the funnel plot shows that lower MCD-data correlations are more commonly observed for curves based on smaller sample sizes. This shows that the upper limit on the MCD-data correlation is mostly set by the reliability of the dataset, rather than by systematic errors in the model.
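For a single experiment, the fit can be sketched as follows (the design-matrix layout and variable names are assumptions; the Supplementary code may differ):

```matlab
% Fit the decision-stage parameters (beta_crit, beta_corr, beta_lag) for one
% experiment: a probit GLM of the observed choices on the summary MCD outputs.
X = [mcd_corr_bar(:), mcd_lag_bar(:)];      % one row per lag: summary MCD responses
y = [n_yes(:), n_trials(:)];                % number of "yes" responses and of trials per lag
mdl    = fitglm(X, y, 'Distribution', 'binomial', 'Link', 'probit');
p_pred = predict(mdl, X);                   % predicted psychometric function
```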

To assess the contribution of the low-level properties of the stimuli to the model’s performance, we ran a permutation test, where psychometric curves were generated (Equation 10) using stimuli from different experiments (but with the same manipulation of lag). If the low-level properties of the stimuli play a significant role, the correlation between data and model should be lower with permuted stimuli than with non-permuted stimuli. For this permutation we used the data from the simultaneity judgment tasks on humans (Supplementary Figure 1), as with 27 individual experiments there are enough permutations to render the test meaningful. For that, we used the temporal constants of the MCD fitted before, so that each psychometric curve in each permutation had 3 free parameters (βcrit, βcorr, and βlag). The results from 200k permutations demonstrate that the goodness of fit obtained with the original stimuli is superior to that obtained with permuted stimuli. Specifically, the permuted distribution of the mean Pearson correlation between predicted and empirical psychometric curves had a mean of 0.972 (σ=0.0036), while such a correlation rose to 0.989 when the MCD received the original stimuli.
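A sketch of this procedure (fit_decision_stage is a hypothetical helper wrapping the fitglm-based fit described above and returning the data-model Pearson correlation; mcd_out and data are cell arrays with one entry per simultaneity-judgment experiment):

```matlab
% Permutation test (sketch): refit only the three decision parameters while
% pairing each dataset with the MCD outputs of a different stimulus set.
n_exp  = numel(data);
n_perm = 2e5;
r_true = mean(cellfun(@fit_decision_stage, mcd_out, data));
r_perm = zeros(n_perm, 1);
for i = 1:n_perm
    shuffled  = mcd_out(randperm(n_exp));                    % mismatch stimuli and data
    r_perm(i) = mean(cellfun(@fit_decision_stage, shuffled, data));
end
p_value = mean(r_perm >= r_true);                            % one-tailed permutation p-value
```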

Modeling the spatial determinants of multisensory integration

Most studies on audiovisual space perception only investigated the horizontal spatial dimension, hence the stimuli can be reduced to sm(x, t) instead of sm(x, y, t), as in the simulations above (see Equation 2). Additionally, based on preliminary observations, the output MCDlag does not seem necessary to account for audiovisual integration in space, hence only the population response MCDcorr (x, t) (see Equation 6) will be considered here.

MCD and MLE – simulation of Alais and Burr3

In its general form, the MLE model can be expressed probabilistically as pMLE(x) ∝ pvid(x) · paud(x), where pvid(x) and paud(x) represent the probability distributions of the unimodal location estimates (i.e., the likelihood functions), and pMLE(x) is the bimodal distribution. When pvid(x) and paud(x) follow a Gaussian distribution, pMLE(x) is also Gaussian, with a variance equal to the product divided by the sum of the unimodal variances

\sigma^{2}_{MLE} = \frac{\sigma^{2}_{vid}\, \sigma^{2}_{aud}}{\sigma^{2}_{vid} + \sigma^{2}_{aud}}

and a mean consisting of a weighted average of the unimodal means

\hat{x}_{MLE} = w_{vid}\, \hat{x}_{vid} + w_{aud}\, \hat{x}_{aud}

where

w_{vid} = \frac{1/\sigma^{2}_{vid}}{1/\sigma^{2}_{vid} + 1/\sigma^{2}_{aud}}, \qquad w_{aud} = 1 - w_{vid}

This simulation has two main goals: the first is to compare the predictions of the MCD with those of the MLE model; the second is to test whether the MCD model can predict observers’ responses in the study of Alais and Burr3.

To simulate the study of Alais and Burr3, we first need to generate the stimuli. The visual stimuli consisted of 1-D Gaussian luminance profiles, presented for 10 ms. Their standard deviations, which determined visual spatial reliability, were defined based on the standard deviations of the visual psychometric functions. Likewise, the auditory stimuli consisted of a 1-D Gaussian sound intensity profile (with a standard deviation determined by the unimodal auditory psychometric function). Note that spatial reliability in the MCD model jointly depends on the stimulus and on the receptive field of the input units; however, teasing apart the differential effects induced by these two sources of spatial uncertainty is beyond the scope of the present study. Hence, for simplicity, here we injected all spatial uncertainty into the stimulus (see also23,27–29). For this simulation, we only used the data from observer LM of Alais and Burr3, the only participant for whom the full dataset is publicly available (the other observers, however, showed similar results).

The stimuli are fed to the model to obtain the population response MCDcorr(x, t) (see Equation 6), which is marginalized over time as follows

\overline{MCD}_{corr}(x) = \int MCD_{corr}(x, t)\, dt

This provides a distribution of the population response over the horizontal spatial dimension. Finally, a divisive normalization is performed to transform the model response into a probability distribution

p_{MCD}(x) = \frac{\overline{MCD}_{corr}(x)}{\int \overline{MCD}_{corr}(x')\, dx'}

It is important to note that Equations 15 and 16 have no free parameters, and all simulations are now performed with a fully constrained model. To test whether the MCD model can perform audiovisual integration according to the MLE model, we replicated the various conditions run by observer LM and calculated the bimodal likelihood distributions predicted by the MLE and MCD models (i.e., pMLE(x) and pMCD(x)). These were identical up to rounding error. The results of these simulations are plotted in Figure 4C, and displayed as cumulative distributions as in Figure 1 of Alais and Burr3.
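Because both models multiply the unimodal spatial distributions, the equivalence reduces to a product of Gaussians; a minimal numerical check (with hypothetical σ values) is:

```matlab
% Numerical check (sketch): multiplying two Gaussian spatial distributions
% and normalizing yields the MLE prediction (hypothetical sigma values).
x      = linspace(-20, 20, 1e4);                   % azimuth (deg)
p_vid  = normpdf(x,  2, 1.5);                      % visual distribution
p_aud  = normpdf(x, -4, 6.0);                      % auditory distribution
p_prod = p_vid .* p_aud;
p_prod = p_prod / trapz(x, p_prod);                % multiply, then normalize
s_mle  = sqrt(1.5^2 * 6^2 / (1.5^2 + 6^2));        % MLE sigma (product over sum of variances)
m_mle  = (2/1.5^2 - 4/6^2) / (1/1.5^2 + 1/6^2);    % MLE mean (reliability-weighted average)
p_mle  = normpdf(x, m_mle, s_mle);
max(abs(p_prod - p_mle))                           % ~0, up to numerical error
```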

Once demonstrated that pMLE(x) = pMCD(x), it is clear that the MCD is equally capable of predicting observed responses. However, for completeness, we compared the prediction of the MCD to the empirical data: the results demonstrate that, just like the MLE model, a population of MCDs can predict both audiovisual bias (Figure 4D) and just noticeable differences (JND, Figure 4E). A Matlab implementation of this simulation is included as Supplementary Material.

MCD and BCI – simulation of Körding et al.10

To account for the spatial breakdown of multisensory integration, the BCI model operates in a hierarchical fashion: first, it estimates the probability that audiovisual stimuli share a common cause (p(C = 1)); next, the model weighs and integrates the unimodal and bimodal information (pmod(x) and pMLE(x)) as follows:

All the terms in Equation 17 have a clear analogue in the computations of the MCD population response: pMLE(x) corresponds to pMCD(x) (see previous section). Likewise, the homologue of pmod(x) can be obtained by marginalizing over time and normalizing the output of the unimodal units as follows (see Equation 15)

Finally, the MCD homologue of the BCI’s p(C = 1) can be read out from the population response following the same logic as Equations 8 and 9:
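Assuming a probit read-out as in the temporal simulations, together with the logarithmic compression described below (the exact form of Equation 19 is an assumption here):

p_{MCD}(C = 1) = \Phi\left(\beta_{corr}\, \log \sum_{x,t} MCD_{corr}(x, t) + \beta_{crit}\right)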

Here, we found that including a compressive non-linearity (the logarithm of the total MCDcorr response) provided a tighter fit to the empirical data. With pMCD(C = 1) representing the probability that vision and audition share a common cause, and pMCD(x) and its unimodal counterparts representing the bimodal and unimodal population responses, the MCD model can simulate observed responses as follows:
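One plausible form, mirroring Equation 17 (an assumption; the published Equation 20 may differ in its details), is, for mod = vid, aud:

\hat{p}_{mod}(x) = p_{MCD}(C = 1)\, p_{MCD}(x) + \left(1 - p_{MCD}(C = 1)\right) p_{mod}(x)

where pmod(x) here denotes the normalized unimodal population response obtained above.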

The similarity between Equation 20 and Equation 17 demonstrates the fundamental homology of the BCI and MCD models: what remains to be tested is whether the MCD can also account for the results of Körding and colleagues10. Just like the BCI model, the MCD model relies on 4 free parameters: two are shared by both models and represent the spatial uncertainty (i.e., the variance) of the unimodal visual and auditory inputs. Additionally, the MCD model needs two linear coefficients (slope βcorr and intercept βcrit) to transform the dynamic population response into a probability of a common cause. Conversely, the remaining two parameters of the BCI model correspond to a prior for a common cause and another for a central location, neither of which is necessary to account for observed responses in the present framework. Parameters were fitted using BADS62, set to maximize the Pearson correlation between model and human responses. Overall, the MCD provided an excellent fit to the empirical data (r=0.99), even slightly exceeding the performance of the BCI model while relying on the same degrees of freedom. Given that the fitted value of the slope parameter (βcorr) approached 1, we repeated the fitting while removing βcorr from Equation 19: even with just 3 free parameters (one fewer than the BCI model), the MCD performed in line with the BCI model, and the correlation with the empirical data was 0.98. A Matlab implementation of this simulation is included as Supplementary Material.

MCD and BCI – simulation of Mohl et al.17

Mohl and colleagues used eye movements to test whether humans and monkeys integrate audiovisual spatial cues according to BCI. The targets consisted of either unimodal or bimodal stimuli. Following the same logic as the simulation of Alais and Burr3 (see above), the unimodal input consisted of impulses with a Gaussian spatial profile (Figure 4A-B), whose variances were set equal to the variance of the fixations measured in the unimodal trials (averaged across observers). Although the probability of a single fixation decreases with increasing disparity (Figure 4G, right), observers sometimes failed to make a second fixation even when the stimuli were noticeably far apart (lapses). This was especially true for monkeys, which had a lapse rate of 16%, indicative of low attention and compliance. To account for this, we can modify Equation 19 and obtain the probability of a single fixation as follows:

Here, plapse is a free parameter that represents the probability of making the incorrect number of fixations, irrespective of the discrepancy. As in Equation 19, βcorr and βcrit are free parameters that transform the dynamic population response into a probability of a common cause (i.e., a single fixation). Equation 21 tightly reproduced the observed probability of a single fixation (Figure 4G, right): the Pearson correlation between model and data was 0.995 for monkeys and 0.988 for humans.

With the parameters plapse, βcorr and βcrit fitted to the probability of a single fixation (i.e., the probability of a common cause), it is now possible to predict gaze direction (i.e., the perceived location of the stimuli) with zero free parameters. For that, we can use Equation 21 to obtain the probability of a common cause and predict gaze direction using Equation 20. The distribution of fixations predicted by the MCD closely follows the empirical histograms in both species (Figure 4G), and the correlation between model and data was 0.9 for monkeys and 0.93 for humans. Note that Figure 4G shows only a subset of the 20 conditions tested in the experiment (the same subset of conditions shown in the original paper). A Matlab implementation of this simulation is included as Supplementary Material.

MCD and audiovisual gaze behavior

To test whether the dynamic MCD population response can account for gaze behavior during passive observation of audiovisual footage, a simple solution is to measure whether observers preferentially looked where the MCDcorr response is maximal. This analysis was performed separately for each frame. For that, gaze directions were first low-pass filtered with a Gaussian kernel (σ = 14 pixels) and normalized to probabilities pgaze(x, y). Next, we calculated the average MCD response at gaze direction for each frame t; this was done by weighting MCDcorr by pgaze as follows:
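In symbols, a minimal sketch of this weighting (assuming pgaze is normalized to sum to 1 within each frame) is:

\overline{MCD}_{gaze}(t) = \sum_{x,y} p_{gaze}(x,y,t)\; MCD_{corr}(x,y,t)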

To assess whether the MCD response at gaze is larger than the frame average, we calculated the standardized mean difference (SMD) for each frame as follows:
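That is, a sketch of the per-frame effect size (with \mu_{frame}(t) and \sigma_{frame}(t) denoting the mean and standard deviation of MCD_{corr}(x,y,t) across all pixels of frame t) is:

SMD(t) = \frac{\overline{MCD}_{gaze}(t) - \mu_{frame}(t)}{\sigma_{frame}(t)}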

Across the over 16,000 frames of the available dataset, the average SMD was 2.03 (Figure 6D). Given that the standardized mean difference serves as a metric of effect size, and that effect sizes surpassing 1.2 are deemed very large64, it is remarkable that the MCD population model can so tightly account for human gaze behavior in a purely bottom-up fashion and without free parameters.
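As a concrete illustration, the two steps above can be sketched in a few lines of MATLAB (variable names are ours and purely illustrative; they do not match the released Supplementary code):

% mcd  : MCDcorr response for one frame            [nRows x nCols]
% gaze : map of fixation counts for the same frame [nRows x nCols]
pgaze = imgaussfilt(double(gaze), 14);            % low-pass filter the gaze map (sigma = 14 pixels)
pgaze = pgaze ./ sum(pgaze(:));                   % normalize to a probability map
mcdAtGaze = sum(sum(pgaze .* mcd));               % gaze-weighted average MCD response
smd = (mcdAtGaze - mean(mcd(:))) / std(mcd(:));   % standardized mean difference for this frame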

Code availability statement

A variety of MATLAB scripts running the MCD population model is included as Supplementary Material. These include the code running the MCD on real-life footage (i.e., the code used to model MCD responses for the temporal determinants of multisensory integration, spatial orienting, and Figure 5). Additional scripts simulate the experiment of Alais and Carlile7 and all the simulations in Figure 5.

Pre-processing of ecological audiovisual footage

The diversity of the stimuli used in this dataset requires some preprocessing before the stimuli can be fed to the population model. First, all movies were converted to grayscale (scaled between 0 and 1) and the soundtrack was converted to its rms envelope (scaled between 0 and 1), thereby removing chromatic and tonal information. Movies were then padded with two seconds of frozen frames at onset and offset to accommodate the manipulation of lag. Finally, the luminance of the first still frame was set as baseline and subtracted from all subsequent frames (see the Matlab code in the Supplementary Material). Along with the padding, this helps minimize transient artefacts induced by the onset of the video.

Video frames were resized to 15% of the original size, and the static background was cropped. On a practical side, this made the simulations much faster (which is crucial for parameter estimation); on a theoretical side, such downsampling simulates the Gaussian spatial pooling of luminance across the visual field (unfortunately, the present datasets do not provide sufficient information to convert pixels into visual angles). In a similar fashion, we downsampled the sound envelope to match the frame rate of the video.
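A minimal MATLAB sketch of this preprocessing pipeline (the file name, window size, and variable names are illustrative assumptions; see the Supplementary code for the exact implementation):

% load footage and soundtrack (illustrative file name)
v = VideoReader('stimulus.mp4');
[wave, fsAud] = audioread('stimulus.mp4');

% video: grayscale, scaled to [0 1], resized to 15% of the original size
frames = [];
while hasFrame(v)
    frames = cat(3, frames, imresize(im2double(rgb2gray(readFrame(v))), 0.15));
end

% audio: rms envelope (10-ms window), scaled to [0 1], downsampled to the video frame rate
env = sqrt(movmean(mean(wave, 2).^2, round(0.01 * fsAud)));
env = (env - min(env)) ./ (max(env) - min(env));
env = interp1(linspace(0, 1, numel(env)), env, linspace(0, 1, size(frames, 3)));

% pad with 2 s of frozen frames / silence, then subtract the first frame as baseline
nPad   = round(2 * v.FrameRate);
frames = cat(3, repmat(frames(:,:,1), 1, 1, nPad), frames, repmat(frames(:,:,end), 1, 1, nPad));
frames = frames - frames(:,:,1);
env    = [zeros(1, nPad), env(:).', zeros(1, nPad)];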

Datasets

To thoroughly compare observed and model behavior, this study requires a large and diverse dataset consisting of both the raw stimuli and observers’ responses. For that, we adopted convenience sampling and simulated the studies for which both stimuli and responses were available (either in public repositories, shared by the authors, or extracted from published figures). The inclusion criteria depend on which aspect of multisensory integration is being investigated and are described below.

For the temporal determinants of multisensory integration in humans, we only included studies that: (1) used real-life audiovisual footage, (2) performed a parametric manipulation of lag, and (3) engaged observers in a forced-choice behavioral task. Forty-three individual experiments met the inclusion criteria (Figure 2A-B and Supplementary Figure 1). These varied in terms of stimuli, observers, and tasks. In terms of stimuli, the dataset consists of responses to 105 unique real-life videos (see Supplementary Figure 1 and Supplementary Table 1). The majority of the videos represented audiovisual speech (possibly the most common stimulus in audiovisual research), but they varied in terms of content (i.e., syllables, words, full sentences, etc.), intelligibility (i.e., sine-wave speech, amplitude-modulated noise, blurred visuals, etc.), composition (i.e., full face, mouth-only, oval frame, etc.), speaker identity, etc. The remaining non-speech stimuli consist of footage of actors playing a piano or a flute. The study from Alais and Carlile7 was included in the dataset because, even though the visual stimuli were minimalistic (blobs), the auditory stimuli consisted of ecological auditory depth cues recorded in a real reverberant environment (the Sydney Opera House; see Supplementary Information for details on the dataset). The dataset contains forced-choice responses to three different tasks: speech categorization (i.e., for the McGurk illusion), simultaneity judgments, and temporal order judgments. In terms of observers, besides the general population, the dataset consists of experimental groups varying in terms of age and musical expertise, and it even includes a patient, PH, who reports hearing speech before seeing mouth movements after a lesion in the pons and basal ganglia37. Taken together, the dataset consists of ∼1k individual psychometric functions, from 454 unique observers, for a total of ∼300k trials; the psychometric curves for each experiment are shown in Figure 2A-B and Supplementary Figure 1. All these simulations are based on psychometric functions averaged across observers; for simulations of individual observers, see Supplementary Information and Supplementary Figures 3–5.

For the temporal determinants of multisensory integration in rats, we included studies that performed a parametric manipulation of lag and engaged rats in simultaneity and temporal order judgment tasks. Sixteen individual experiments12,19,21,22,38,39 met the inclusion criteria (Figure 2C-D, Supplementary Figure 2 and Supplementary Table 2): all of them used minimalistic audiovisual stimuli (clicks and flashes) with a parametric manipulation of audiovisual lag. Overall, the dataset consists of ∼190 individual psychometric functions, from 110 rats (and 10 humans), for a total of ∼300k trials.

For the spatial determinants of multisensory integration, to the best of our knowledge there are no available datasets with both stimuli and psychophysical responses. Fortunately, however, the spatial aspects of multisensory integration are often studied with minimalistic audiovisual stimuli (e.g., clicks and blobs), which can be simulated exactly. Audiovisual integration in space is commonly framed in terms of optimal statistical estimation, where the bimodal percept is modelled either through maximum likelihood estimation (MLE) or Bayesian causal inference (BCI). To provide a plausible account of audiovisual integration in space, a population of MCDs should therefore also behave as a Bayesian-optimal estimator. This hypothesis was tested by comparing the population response to human data in the studies that originally tested the MLE and BCI models; hence we simulated the studies of Alais and Burr3 and Körding et al.10. Such simulations allow us to compare the data with our model, and our model with previous ones (MLE and BCI). Given that in these two experiments a population of MCDs behaves just like the MLE and BCI models (with the same number of free parameters or fewer), the current approach can be easily extended to other instances of sensory cue integration previously modelled in terms of optimal statistical estimation. This was tested by simulating the study of Mohl and colleagues17, who used eye movements to assess whether BCI can account for audiovisual integration in monkeys and humans.

Finally, we tested whether a population of MCDs can predict audiovisual attention and gaze behaviour during passive observation of ecological audiovisual stimuli. Coutrot and Guyader1 ran the ideal testbed for this hypothesis: much like our previous simulations (Figure 2A), they employed audiovisual speech stimuli, recorded indoors, with no camera shake. Specifically, they tracked eye movements from 20 observers who passively watched 15 videos of a lab meeting (see Figure 6A). Without fitting any parameters, the population response tightly matched the empirical saliency maps (see Figure 6B-D and Supplementary Movie 3).

MCD temporal filters: parameter estimation and generalizability

The temporal filters determine the temporal tuning of the model and are defined by three parameters: the two temporal constants of the unimodal band-pass filters (τbpA, τbpV; Equation 1) and one bimodal constant (τlp, Equation 5). For humans, these parameter values were estimated by combining data from all 40 experiments shown in Supplementary Figure 1, excluding the data from patient PH. These experiments encompass SJ, TOJ, and McGurk tasks, each involving parametric manipulations of audiovisual lag, as well as tasks using ecological audiovisual stimuli (i.e., real-life footage). For rats, we followed the same approach, combining all psychometric curves shown in Supplementary Figure 2 (excluding pharmacological manipulation experiments), based on stimuli consisting of clicks and flashes in TOJ and SJ tasks.

To simulate each trial, we fed the stimuli into the population model to estimate the internal response variables (Equations 8–9). Response probabilities were then derived from these internal signals as described in Equation 10 (see also ref. 28). This decision stage introduces three additional parameters: βcrit (the criterion, a bias term) and βcorr and βlag (gain parameters, acting as scaling factors). These decision-related parameters were estimated independently for each experiment. An exception was made for the simulations based on Schormans & Allman19 (Supplementary Figure 2E), which manipulated audio intensity across three levels; here, we used a single set of βcrit, βcorr, and βlag values across all conditions.
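While the exact mapping is given by Equation 10, a generic sketch consistent with this description (our notation, assumed form) is a linear combination of the two population read-outs passed through a sigmoidal link \Psi:

p(\text{response}) = \Psi\bigl(\beta_{corr}\,\overline{MCD}_{corr} + \beta_{lag}\,\overline{MCD}_{lag} + \beta_{crit}\bigr)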

Two search algorithms were used for parameter estimation. First, we applied a global optimization method, Bayesian Adaptive Direct Search (BADS62). To minimize the influence of initial values, we ran BADS 200 times with different starting points (from 0.001 to 1.5 s). The best-fitting parameter values were then further refined using a local optimizer (fminsearch in MATLAB). The cost function minimized by both algorithms was 1 minus the Pearson correlation between real and simulated data, averaged across all experiments (each experiment was weighted equally, regardless of trial count or sample size). To assess robustness, we repeated the procedure using a mean squared error (MSE) cost function, which produced a set of parameter values (τbpA = 0.042 s, τbpV = 0.039 s, τlp = 0.156 s) closely aligned with the original estimates (τbpA = 0.045 s, τbpV = 0.036 s, τlp = 0.180 s).
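A schematic MATLAB sketch of this two-stage search (simulateAllExperiments and meanPearson are hypothetical helpers standing in for the full simulation pipeline; bads is the function provided by the BADS toolbox):

% cost: 1 minus the mean Pearson correlation between simulated and observed curves
cost = @(tau) 1 - meanPearson(simulateAllExperiments(tau), observedCurves);

lb = [0.001 0.001 0.001];  ub = [1.5 1.5 1.5];     % bounds on [tauBpA tauBpV tauLp], in seconds
starts = logspace(log10(0.001), log10(1.5), 200);  % 200 different starting points
bestF = Inf;
for i = 1:numel(starts)
    [x, f] = bads(cost, repmat(starts(i), 1, 3), lb, ub);   % global search (BADS)
    if f < bestF, bestF = f; bestX = x; end
end
bestX = fminsearch(cost, bestX);                   % local refinement of the best BADS solution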

This approach to parameter estimation, which aggregates data from a broad range of paradigms (n = 40 for humans, n = 15 for rats), enables us to estimate the temporal constants of the MCD units using a variety of ecological audiovisual stimuli (n = 105 unique stimuli). Given that the stimuli consist of raw audiovisual footage, parameter estimation is computationally demanding: each search iteration can take up to ten minutes. This makes standard techniques such as leave-one-out cross-validation impractical. The generalizability of the estimated temporal constants can nevertheless be tested against new, unseen data.

First, we tested whether the MCD model, with fixed temporal constants, could predict a well-known finding: that perceived audiovisual synchrony varies with distance in reverberant environments (Alais and Carlile7). To do this, we used the previously estimated temporal constants and fitted only a single set of three decision parameters to predict the four psychometric curves in Figure 2B, effectively using fewer than one free parameter per curve. Importantly, no changes were made to the model’s temporal tuning. The MCD fully reproduced the separation of the four curves across conditions. This result is not simply an extension of the model to a new stimulus set; rather, it provides a purely bottom-up account of how perceived synchrony scales with spatial depth, without invoking any explicit computation of distance or higher-level inference mechanisms.

Next, we examined whether the estimated temporal constants, which had so far only been tested on group-averaged data, could also predict individual-level responses. To do this, we simulated individual observer data using previously published datasets: SJ data from Yarrow et al.16, and both TOJ and SJ data from Experiments 1 and 2 of Parise and Ernst28. These simulations allowed us to assess whether the model could generalize from group-level patterns to individual-level behaviour. The results confirmed that the MCD, using a single set of fixed temporal constants across all individuals, successfully predicted both individual and group data. Only the decision parameters (βcrit, βcorr, and βlag) were fitted per observer, showing that temporal tuning itself can be considered constant across observers.

Finally, using the same set of temporal constants, we evaluated whether the population output (Equation 6) could predict gaze behaviour during passive observation of real-life audiovisual footage, this time with zero free parameters. For this, we presented the audiovisual frames to the model (Figure 6A) and computed the MCD response for each pixel and time frame (Figure 6B). Across frames, the model’s output consistently predicted gaze direction: the region of the screen with the highest response systematically attracted participants’ gaze (Figure 6D).

In summary, we have used data from a large pool of experiments comprising 105 unique stimuli to estimate the temporal constants of the MCD model. Once estimated, these constants demonstrated predictive validity across a wide range of behavioural measures (SJ, TOJ, eye-tracking), stimulus types, species (humans and rats), and data granularities (group and individual). They generalized successfully to completely novel behavioural datasets and natural viewing conditions, including eye-tracking during passive video observation, all without any re-tuning of the model’s core temporal filters.

Supplementary Information


Individual observers’ analysis

Most of the simulations described so far rely on group-level data, where psychometric curves represent the average response across the pool of observers that took part in each experiment. Individual psychometric functions, however, sometimes vary dramatically across observers, hence one might wonder whether the MCD, besides predicting stimulus-driven variations in the psychometric functions, can also capture individual differences. A recent study by Yarrow and colleagues16 directly addressed this question and concluded that models of the Independent Channels family outperform the MCD at fitting individual observers’ responses.

Although it can be easily shown that such a conclusion was supported by an incomplete implementation of the MCD (one that did not include the MCDlag output), a closer look at the two models against the same datasets helps illustrate their fundamental difference and highlights a key drawback of perceptual models that take parameters as input. Therefore, we first simulated the impulse stimuli used by Yarrow and colleagues16, fed them to the MCD, and used Equation 10 to generate the individual psychometric curves. Given that their stimuli consisted of temporal impulses with no spatiotemporal manipulation, a single MCD unit is sufficient to run these simulations. Overall, the model provided an excellent fit to the original data and tightly captured individual differences: the average Pearson correlation between predicted and empirical psychometric functions across the 57 curves is 0.98 (see Supplementary Figure S2). Importantly, for these simulations the MCD was fully constrained, and the only free parameters (3 in total) were the linear coefficients of Equation 10, which describe how the output of the MCD is used for perceptual decision-making. For comparison, Independent Channels models achieved an analogous goodness of fit, but they required at least 5 free parameters (depending on the exact implementation, see16).

To assess the generalizability of this finding, we additionally simulated the individual psychometric functions from the experiments that informed the architecture of the MCD units used here28. Specifically, Parise and Ernst28 ran two psychophysical studies using minimalistic stimuli that only varied over time. In the first one, auditory and visual stimuli consisted of step increments and/or decrements in intensity. Audiovisual lag was parametrically manipulated using the method of constant stimuli, and observers were required to perform both simultaneity and temporal order judgments. Following the logic described above, we fed the stimuli to the model and used Equation 10 (with 3 free parameters) to simulate human responses. The results demonstrate that the MCD can account for individual differences regardless of the task: the average Pearson correlation between empirical and predicted psychometric curves was 0.97 for the simultaneity judgments and 0.96 for the temporal order judgments (64 individual psychometric curves from 8 observers, for a total of 9600 trials). This generalizes the results of the previous simulation to a different type of stimuli (steps vs. impulses) and extends them to a task, the temporal order judgment, that was not considered by Yarrow and colleagues16 (whose model can only perform simultaneity judgments).

The second study of Parise and Ernst28 consisted of simultaneity judgments for periodic audiovisual stimuli defined by a square-wave intensity envelope (Figure S4A). Simultaneity judgments for this type of periodic stimuli are also periodic, with two complete oscillations in perceived simultaneity for each cycle of phase shifts between the senses (a phenomenon known as frequency doubling). Once again, using Equation 10 the MCD could account for individual differences in the observed behavior (5 psychometric curves from 5 observers, for a total of 3000 trials), with an average Pearson correlation of 0.93, while relying on just 3 free parameters.

It is important to note that, for these simulations, the same MCD model accurately predicted (in a purely bottom-up fashion) bell-shaped SJ curves for non-periodic stimuli and sinusoidal curves for periodic stimuli. Alternative models of audiovisual simultaneity that directly take lag as input enforce bell-shaped psychometric functions, where perceived synchrony monotonically decreases as one moves away from the point of subjective simultaneity. As a result, in the absence of ad-hoc adjustments they all necessarily fail to replicate the results of Parise and Ernst28, due to their inability to generate periodic psychometric functions. Conversely, the MCD is agnostic regarding the shape of the psychometric functions, hence the very same model used to predict the simultaneity judgments of Yarrow and colleagues16 can also predict the empirical psychometric functions of Parise and Ernst28, including individual differences across observers (all while relying on just three free parameters).

Simulation of Alais and Carlile (2005)

For the simulations of Alais and Carlile29, the envelope of the auditory stimuli was extracted from the waveforms shown in the figures of the original publication. This was done using WebPlotDigitizer to trace the profile of the waveforms; the digitized points were then interpolated and resampled at 1000 Hz. To preserve the manipulation of the direct-to-reverberant ratio, the section of the envelope containing the reverberant signal was identical across the four conditions (i.e., distances), so that what varied across conditions was the initial portion of the signals (the direct waves). All four psychometric functions were fitted simultaneously, so that they relied on just three free parameters: those related to the decision-making process. A Matlab code running this simulation is included as Supplementary Material.
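A minimal sketch of the digitization step in MATLAB (pts is assumed to be an N-by-2 matrix of [time, amplitude] coordinates exported from WebPlotDigitizer; variable names are ours):

[tDig, iu] = unique(pts(:,1));                 % digitized time points (sorted, unique)
t   = tDig(1) : 0.001 : tDig(end);             % 1000 Hz time base
env = interp1(tDig, pts(iu,2), t, 'pchip');    % interpolated and resampled envelope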

Trimodal integration

Much like Bayesian ideal observer models, the present framework can be naturally extended to trimodal integration. As described in Equation 6, the MCDcorr(x, y, t) response is based on the pointwise product of input transients across modalities. In the bimodal case, this corresponds to the product of the auditory and visual transient channels. For three modalities (e.g., auditory, visual, tactile), this generalizes to a trimodal coincidence detector, in which MCD units compute:
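In symbols, a minimal sketch of such a trimodal unit (with \hat{A}, \hat{V} and \hat{T} denoting the spatiotemporally filtered transient channels of the three modalities) is:

MCD^{AVT}_{corr}(x,y,t) \propto \hat{A}(x,y,t)\,\hat{V}(x,y,t)\,\hat{T}(x,y,t)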

This detector responds maximally when transients in all three modalities co-occur in time and space. As in the bimodal case (Figure 4), the trimodal MCD response closely approximates the predictions of the maximum likelihood estimation (MLE) model (Supplementary Figure 6).

MCDlag(x, y, t) (Equation 7) is instead defined via opponency (subtraction) between the two subunits of the MCD, which introduces a directional asymmetry between modalities. This structure makes it fundamentally pairwise. As a result, extending MCDlag(x, y, t) to three modalities would require computing pairwise lag estimates (AV, VT, AT) independently.

Supplementary Figures

Temporal determinants of multisensory integration in humans.

Schematic representation of the stimuli with a manipulation of audiovisual lag (Panel A). Panels B-J show the stimuli, psychophysical data (dots) and model responses (lines) of the simulated studies investigating the temporal determinants of multisensory integration in humans using ecological audiovisual stimuli. The colour of the psychometric curves represents the psychophysical task (red = McGurk, black = simultaneity judgments, blue = temporal order judgments). Panel K shows the results of the permutation test: the histogram represents the permuted distribution of the correlation between data and model response. The arrow represents the observed correlation between model and data, which is 4.7 standard deviations above the mean of the permuted distribution (p<0.001). Panel I represents the funnel plot of the Pearson correlation between model and data across all studies in Panels B-J, plotted against the number of trials used for each level of audiovisual lag. Red diamonds represent the McGurk task, blue diamonds and circles represent temporal order judgments, and black diamonds and circles represent simultaneity judgments. Diamonds represent human data, circles represent rat data. The dashed grey line represents the overall Pearson correlation between empirical and predicted psychometric curves, weighted by the number of trials of each curve. As expected, the correlation decreases with decreasing numbers of trials.

Temporal determinants of multisensory integration in rats.

Schematic representation of the stimuli (clicks and flashes) with a manipulation of audiovisual lag (Panel A). Panels B-J show the psychophysical data (dots) and model responses (lines) of the simulated studies. The colour of the psychometric curves represents the psychophysical task (black = simultaneity judgments, blue = temporal order judgments).

Simultaneity judgment for impulse stimuli in humans, individual observer data16.

Panel A represents the baseline condition; Panel B represents the conservative condition (in which observers were instructed to respond “synchronous” only when absolutely sure); Panel C represents the post-test condition (in which observers were instructed to respond as they preferred, as in the baseline condition). Individual plots represent data from a single observer (dots) and model responses (lines), and consist of 135 trials each (total 7695 trials). The bottom-right plot represents the average empirical and predicted psychometric functions.

Simultaneity and temporal order judgments for step stimuli28.

Panel A represents the results of the simultaneity judgments; Panel B represents the data from the temporal order judgment. Each column represents a different combination of audiovisual step polarity (i.e., onset steps in both modalities, offset steps in both modalities, video onset and audio offset, video offset and audio onset). Different rows represent curves from different observers (dots) and model responses (lines). Each curve consists of 150 trials (total 9600 trials). The bottom row (light grey background) represents the average empirical and predicted psychometric functions.

Simultaneity judgment for periodic stimuli28.

Individual plots represent data from a single observer (dots) and model responses (lines). Each curve consists of 600 trials (total 3000 trials). The bottom-right curve represents the average empirical and predicted psychometric functions.

MCD model for trimodal integration.

Trimodal MCD units combine inputs from three unimodal transient channels via multiplicative interactions. When all three modalities are stimulated in close temporal and spatial proximity, the resulting trimodal population response closely replicates the predictions of maximum likelihood estimation (MLE).

Supplementary Tables

Summary of the experiments simulated in Supplementary Figure 1.

The first column contains the reference of the study, and the second column the task (McGurk, Simultaneity Judgment, or Temporal Order Judgment). The third column describes the stimuli: n represents the number of individual instances of the stimuli; “HI” and “LI” in Magnotti and Beauchamp15 indicate speech stimuli with High and Low Intelligibility, respectively; “Blur” indicates that the videos were blurred; “Disamb” indicates that ambiguous speech stimuli (i.e., sine-wave speech) were disambiguated by informing the observers of the original speech sound. The fourth column indicates whether visual and acoustic stimuli were congruent. Here, incongruent stimuli refer to the mismatching speech stimuli used in the McGurk task. “SWS” indicates sine-wave speech; “noise” in Ikeda and Morishita34 indicates a stimulus similar to sine-wave speech in which white noise was used instead of pure sinusoidal waves. The fifth column reports the country where the study was performed. The sixth column describes the observers included in the study: “c.s.” indicates convenience sampling (usually undergraduate students); musicians in Lee and Noppeney4,36 were amateur piano players; Freeman et al.37 tested young observers (18-28 years old), a patient, P.H. (67 years old), who after a lesion in the pons and basal ganglia reported hearing speech before seeing the lips move, and a group of age-matched controls (59-74 years old). The seventh column reports the number of observers included in the study. Overall, the full dataset consisted of 986 individual psychometric curves; however, several observers participated in more than one experiment, so that the total number of unique observers was 454. The eighth column reports the number of lags used in the method of constant stimuli. The ninth column reports the number of trials included in the study. The tenth column reports the correlation between empirical and predicted psychometric functions. The bottom row contains descriptive statistics of the dataset.

Summary of the experiments simulated in Supplementary Figure 2.

The first column contains the reference of the study, and the second column the task (Simultaneity Judgment or Temporal Order Judgment). The third column describes the stimuli. The fourth column indicates which rats were used as observers. The fifth column reports the number of rats in the study; ‘same’ means that the same rats took part in the experiment in the row above. The sixth column reports the number of lags used in the method of constant stimuli. The seventh column reports the number of trials included in the study (not available for all studies). The eighth column reports the correlation between empirical and predicted psychometric functions. The bottom row contains descriptive statistics of the dataset.

Additional files

Supplementary code.

Supplementary Movie 1. Population response to audiovisual speech stimuli. The top left panel displays the stimulus from van Wassenhove et al.2, where the actress utters the syllable /ta/. The central and right top panels represent the dynamic MCDcorr(x, y, t) and MCDlag(x, y, t) population responses, respectively (Equations 6 and 7). The lower part of the video displays the temporal profile of the stimuli and model responses (averaged over space). The top two lines represent the stimuli: for the visual stimulus, the line represents the root-mean-squared difference of the pixel values from one frame to the next; the line for audio represents the envelope of the stimulus. MCDVid and MCDAud represent the outputs of the unimodal transient channels (averaged over space) that feed into the MCD (Equation 2). The two lower lines represent the MCDcorr and MCDlag responses and correspond to the average of the responses displayed in the central and top-right panels. This movie corresponds to the data displayed in Figure 3A-B. Note how the magnitude of MCDcorr increases as the absolute lag decreases, while the polarity of MCDlag changes depending on which modality came first.

Supplementary Movie 2. The McGurk Illusion – integration of mismatching audiovisual speech. The soundtrack is from a recording where the actress utters the syllable /pa/, whereas in the video she utters /ka/. When the video and sound tracks are approximately synchronous, observers often experience the McGurk illusion, and perceive the syllable /ta/. To experience the illusion, try to recognize what the actress utters as we manipulate audiovisual lag. Note how the MCDcorr population response clusters around the mouth area, and how its magnitude scales with the probability of experiencing the illusion. See Supplementary Movie 1 for details.

Supplementary Movie 3. The ventriloquist illusion. The top panel represents a video of a performing ventriloquist. The voice of the dummy was edited (pitch-shifted) and added in post-production. The second panel represents the dynamic MCDcorr(x, y, t) population response to a blurred version of the video (Equation 6). The third panel shows the distribution of population responses along the horizontal axis (obtained by averaging the upper panel over the vertical dimension). This represents the dynamic, real-life version of the bimodal population response shown in Figure 4A for the case of minimalistic audiovisual stimuli. The lower panel represents the same information as the panel above, displayed as a rolling timeline. For this video, the population response was temporally aligned to the stimuli to compensate for the lags introduced by the temporal filters of the model. Note how the population response spatially follows the active speaker, hence capturing how the sensed location of the audiovisual event is drawn towards the correlated visuals.

Supplementary Movie 4. Audiovisual saliency maps. The top panel represents Movie 1 from Coutrot and Guyader1. The central panel represents MCDcorr in grayscale, while the colorful blobs represent observers’ gaze direction during passive viewing. The lower panel represents how MCDcorr and gaze direction (co)vary over time and azimuth (with elevation marginalized out). The black solid lines represent the active speaker. For the present simulations, movies were converted to grayscale and the upper and lower sections of the videos (which were mostly static) were cropped. Note how gaze is consistently directed towards the regions of the frames displaying the highest audiovisual correlation.
