Correlation detection as a stimulus-computable account for audiovisual perception, causal inference, and saliency maps in mammals
Figures
The Multisensory Correlation Detector (MCD) population model.
(A) Schematic representation of a single MCD unit. The input visual signal represents the intensity of a pixel (mouth area) over time, while the audio input is the soundtrack (the syllable /ba/). The gray soundtrack represents the experimental manipulation of AV lag, obtained by delaying one sense with respect to the other. BPF and LPF indicate band-pass and low-pass temporal filters, respectively. (C) shows how single-unit responses vary as a function of cross-modal lag. (B) represents the architecture of the MCD population model. Each visual unit (in blue) receives input from a single pixel, while the auditory unit receives as input the intensity envelope of the soundtrack (mono audio; see Figure 4A for a version of the model capable of receiving spatialized auditory input). Sensory evidence is then integrated over time and space for perceptual decision-making, a process in which the two model responses are weighted, summed, corrupted with additive Gaussian noise, and compared to a criterion to generate a forced-choice response (D).
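As an illustration of the single-unit computation in Panels A and C, a minimal MATLAB sketch is given below. The first-order exponential filters, time constants, toy inputs, and variable names are illustrative assumptions, not the paper's exact implementation; the actual filter forms, parameter values, and equations are those given in the Methods.

```matlab
% Minimal sketch of a single MCD unit (Panel A); filters, time constants,
% and inputs are illustrative, not the fitted model.
fs = 1000;                          % sampling rate (Hz)
t  = (0:1/fs:1)';                   % 1 s time axis
vis = double(t > 0.30 & t < 0.35);  % toy visual input (pixel intensity over time)
aud = double(t > 0.32 & t < 0.37);  % toy auditory input (sound envelope), ~20 ms lag

% First-order exponential low-pass filter; band-pass approximated as the
% residual of the low-pass (a crude transient channel)
lpf = @(x, tau) filter(1 - exp(-1/(tau*fs)), [1, -exp(-1/(tau*fs))], x);
bpf = @(x, tau) x - lpf(x, tau);

v_t = bpf(vis, 0.05);  a_t = bpf(aud, 0.05);   % unimodal transient (BPF) channels
v_s = lpf(v_t, 0.10);  a_s = lpf(a_t, 0.10);   % slow (LPF) branches

u_va = v_s .* a_t;                  % subunit tuned to vision-leading lags
u_av = a_s .* v_t;                  % subunit tuned to audio-leading lags

MCD_corr = u_va .* u_av;            % correlation read-out (even function of lag, Panel C left)
MCD_lag  = u_va -  u_av;            % lag read-out (odd function of lag; sign tracks temporal order)
```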
Natural audiovisual stimuli and psychophysical responses.
(A) Stimuli (still frame and soundtrack) and psychometric functions for the McGurk illusion (van Wassenhove et al., 2007), synchrony judgments (Lee and Noppeney, 2011), and temporal order judgments (Vroomen and Stekelenburg, 2011). In all panels, dots correspond to empirical data, lines to Multisensory Correlation Detector (MCD) responses; negative lags represent vision first. (B) Stimuli and results of Alais and Carlile, 2005. The left panel displays the envelopes of auditory stimuli (clicks) recorded at different distances in a reverberant environment (the Sydney Opera House). While the reverberant portion of the sound is identical across distances, the intensity of the direct sound (the onset) decreases with depth. As a result, the centre of mass of the envelopes shifts rightward with increasing distance. The central panel shows empirical and predicted psychometric functions for the various distances. The four curves were fitted using the same decision-making parameters, so that the separation between the curves results purely from the operation of the MCD. The lag at which sound and light appear synchronous (point of subjective synchrony) scales with distance at a rate approximately matching the speed of sound (right panel). The dots in the right panel display the point of subjective synchrony (estimated separately for each curve), while the jagged line is the model prediction. (C) shows temporal order judgments for clicks and flashes from both rats and human observers (Mafi et al., 2022). Rats outperform humans at short lags, and vice versa. (D) Rats’ temporal order and synchrony judgments for flashes and clicks of varying intensity (Schormans and Allman, 2018). Note that in the synchrony judgment task only the left flank of the psychometric curve (video-leading lags) was sampled. Importantly, the three curves in each task were fitted using the same decision-making parameters, so that the MCD alone accounts for the separation between the curves. (E) Pharmacologically-induced changes in rats’ audiovisual time perception. Left: Glutamatergic inhibition (MK-801 injection) leads to an asymmetric broadening of the psychometric functions for simultaneity judgments. Right: GABA inhibition (Gabazine injection) abolishes rapid temporal adaptation, so that psychometric curves do not change based on the lag of the previous trials (as they do in controls) (Schormans and Allman, 2023). All pharmacologically-induced changes in audiovisual time perception can be accounted for by changes in the decision-making process, with no need to postulate changes in low-level temporal processing.
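The relation between distance and the point of subjective synchrony in Panel B follows directly from acoustic travel time, as in the short MATLAB calculation below; the listed distances are illustrative, not necessarily those used by Alais and Carlile, 2005.

```matlab
% Expected audiovisual lag from acoustic travel time (speed of sound ~343 m/s);
% e.g., a source 10 m away reaches the ear ~29 ms after the light reaches the eye.
distance    = [5 10 20 40];              % illustrative source distances (m)
acousticLag = distance / 343 * 1000;     % expected lag (ms): ~15, 29, 58, 117
```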
Temporal determinants of multisensory integration in humans.
Schematic representation of the stimuli with a manipulation of audiovisual lag (A). (B-J) show the stimuli, psychophysical data (dots), and model responses (lines) of simulated studies investigating the temporal determinants of multisensory integration in humans using ecological audiovisual stimuli. The colour of the psychometric curves represents the psychophysical task (red = McGurk, black = simultaneity judgments, blue = temporal order judgments). (K) shows the results of the permutation test: the histogram represents the permuted distribution of the correlation between data and model response. The arrow represents the observed correlation between model and data, which lies 4.7 standard deviations above the mean of the permuted distribution (p<0.001). Panel L represents the funnel plot of the Pearson correlation between model and data across all studies in B-J, plotted against the number of trials used for each level of audiovisual lag. Red diamonds represent the McGurk task, blue diamonds and circles represent temporal order judgments, and black diamonds and circles represent simultaneity judgments. Diamonds represent human data, circles represent rat data. The dashed gray line represents the overall Pearson correlation between empirical and predicted psychometric curves, weighted by the number of trials of each curve. As expected, the correlation decreases with decreasing number of trials.
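For readers who want to reproduce the logic of the permutation test in Panel K, a minimal MATLAB sketch follows. The vectors data and model (concatenated empirical and predicted psychometric functions) and the shuffling scheme are assumptions made for illustration, not the exact procedure of the paper.

```matlab
% Sketch of a permutation test on the data-model correlation (Panel K).
% 'data' and 'model' are assumed column vectors concatenating all
% psychometric curves; the actual permutation scheme is in the Methods.
nPerm  = 10000;
c      = corrcoef(data, model);  r_obs = c(1,2);     % observed correlation
r_perm = zeros(nPerm, 1);
for k = 1:nPerm
    c = corrcoef(data, model(randperm(numel(model))));
    r_perm(k) = c(1,2);                              % correlation after shuffling the pairing
end
z = (r_obs - mean(r_perm)) / std(r_perm);            % distance from the null mean (in SDs)
p = mean(r_perm >= r_obs);                           % one-tailed permutation p-value
```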
Temporal determinants of multisensory integration in rats.
Schematic representation of the stimuli (clicks and flashes) with a manipulation of audiovisual lag (A). (B-J) show the psychophysical data (dots) and model responses (lines) of simulated studies. The colour of the psychometric curves represents the psychophysical task (black = simultaneity judgments, blue = temporal order judgments).
Simultaneity judgment for impulse stimuli in humans, individual observer data (Yarrow et al., 2023).
(A) represents the baseline condition; (B) represents the conservative condition (in which observers were instructed to respond ‘synchronous’ only when absolutely sure); (C) represents the post-test condition (in which observers were instructed to respond as they preferred, like in the baseline condition). Individual plots represent data from a single observer (dots) and model responses (lines); each plot consists of 135 trials (total 7695 trials). The bottom-right plot represents the average empirical and predicted psychometric functions.
Simultaneity and temporal order judgments for step stimuli (Parise and Ernst, 2025).
(A) represents the results of the simultaneity judgments; (B) represents the data from the temporal order judgment. Each column represents a different combination of audiovisual step polarity (i.e. onset steps in both modalities, offset steps in both modalities, video onset and audio offset, video offset and audio onset). Different rows represent curves from different observers (dots) and model responses (lines). Each curve consists of 150 trials (total of 9600 trials). The bottom row (light gray background) represents the average empirical and predicted psychometric functions.
Simultaneity judgment for periodic stimuli (Parise and Ernst, 2025).
Individual plots represent data from a single observer (dots) and model responses (lines). Each curve consists of 600 trials (total of 3000 trials). The bottom-right curve represents the average empirical and predicted psychometric functions.
Ecological audiovisual stimuli and model responses.
(A) displays the frames and soundtrack of a dynamic audiovisual stimulus over time (in this example, video and audio tracks are synchronous, and the actress utters the syllable /ta/). (B) shows how the dynamic correlation and lag population responses vary across the frames of Panel A. Note how model responses highlight the pixels whose intensity changed with the soundtrack (i.e. the mouth area). The right side of Panel B represents the population read-out process, as implemented for the simulations in Figure 2: the correlation and lag population responses are integrated over space (i.e. pixels) and time (i.e. frames), scaled and weighted by their respective gain parameters, and summed to obtain a single decision variable that is fed to the decision-making stage (see Figure 1D). (C) represents the time-averaged correlation and lag population responses as a function of cross-modal lag (the central one corresponds to the time-averaged responses shown in B). Note how the correlation response peaks at around zero lag and decreases with increasing lag (following the same trend shown in Figure 1C, left), while the polarity of the lag response changes with the sign of the delay. The psychophysical data corresponding to the stimulus in this figure are shown in Figure 2—figure supplement 1B. See Video 2 for a dynamic representation of the content of this figure.
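The read-out sketched on the right of Panel B can be summarised in a few lines of MATLAB; the array names (MCDcorr, MCDlag, indexed by pixel and frame), gain values, and noise level are illustrative placeholders rather than the paper's fitted parameters (the actual formulation is given by Equations 6–7 and Figure 1D).

```matlab
% Sketch of the population read-out (Panel B, right): integrate the
% correlation and lag population responses over pixels and frames, weight
% them, and sum them into a single decision variable. Names and values
% are illustrative.
corr_ev = mean(MCDcorr(:));                         % average over space and time
lag_ev  = mean(MCDlag(:));
b_corr = 1;  b_lag = 1;  sigma = 0.1;  criterion = 0;   % placeholder parameters
decision_var = b_corr*corr_ev + b_lag*lag_ev;           % weighted sum of the two responses
response = decision_var + sigma*randn > criterion;      % noisy comparison to a criterion
```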
Audiovisual integration in space.
(A) Top represents the Multisensory Correlation Detector (MCD) population model for spatialized audio. Visual and auditory input units receive input from corresponding spatial locations, and feed into spatially-tuned MCD units. The output of each MCD unit is eventually normalized by the total population output, so as to represent the probability distribution of stimulus location over space. The bottom part of Panel A represents the dynamic unimodal and bimodal population responses over time and space (azimuth) and their marginals. When time is marginalized out, a population of MCDs implements integration as predicted by the maximum likelihood estimation (MLE) model. When space is marginalized out, the output shows the temporal response function of the model. In this example, visual and auditory stimuli were asynchronously presented from discrepant spatial locations (note how the blue and orange distributions are spatiotemporally offset). (B) shows a schematic representation of the stimuli used to test the MLE model by Alais and Burr, 2004. Stimuli were presented from different spatial positions, with a parametric manipulation of audiovisual spatial disparity and blob size (i.e. the standard deviation of the blob). (C) shows how the bimodal psychometric functions predicted by the MCD (lines, see Equation 16) and the MLE (dots) models fully overlap. (D) shows how bimodal bias varies as a function of disparity and visual reliability (see legend on the left). The dots correspond to the empirical data from participant LM, while the lines are the predictions of the MCD model (compare to Figure 2A of Alais and Burr, 2004). (E) shows how the just noticeable differences (JNDs, the random localization error) vary as a function of blob size. The blue squares represent the visual JNDs, the purple dots the bimodal JNDs, while the dashed orange line represents the auditory JND. The continuous line shows the JNDs predicted by the MCD population model (compare to Figure 2B of Alais and Burr, 2004). (F) represents the breakdown of integration with spatial disparity. The magnitude of the MCD population output (Equation 8, shown as the area under the curve of the bimodal response) decreases with increasing spatial disparity across the senses. This can then be transformed into a probability of a common cause (Equation 19). (G) represents the stimuli and results of the experiment used by Körding et al., 2007 to test the Bayesian Causal Inference (BCI) model. Auditory and visual stimuli originate from one of five spatial locations, spanning a range of 20°. The plots show the perceived locations of visual (blue) and auditory (orange) stimuli for each combination of audiovisual spatial locations. The dots represent human data, while the lines represent the responses of the MCD population model. (H) shows the stimuli and results of the experiment of Mohl et al., 2020. The plots on the right display the probability of a single (vs. double) fixation (top: monkeys, bottom: humans). The dots represent human data, while the lines represent the responses of the MCD population model. The remaining panels show the histogram of the fixated locations in bimodal trials: the jagged histograms are the empirical data, while the smooth ones are the model prediction (zero free parameters). The regions of overlap between empirical and predicted histograms are shown in black.
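A compact way to see how the spatial population response in Panels A and F yields both a location estimate and a common-cause probability is the MATLAB sketch below; the array MCDcorr_az (azimuth × time), the azimuth vector az, and the logistic mapping are illustrative assumptions, with the actual operations given by Equations 8, 16, and 19.

```matlab
% Sketch of the spatial read-out (Panels A and F). MCDcorr_az is assumed to
% be an azimuth-by-time array of spatially-tuned MCD responses; az is the
% vector of azimuths (deg). The logistic mapping to a common-cause
% probability is illustrative (see Equation 19 for the actual one).
resp_az   = sum(MCDcorr_az, 2);                   % marginalize time at each azimuth
p_space   = resp_az / sum(resp_az);               % divisive normalization -> spatial probability
loc_est   = sum(az(:) .* p_space(:));             % perceived location (MLE-like weighted average)
magnitude = sum(MCDcorr_az(:));                   % total population output (area under the curve)
b0 = 0;  b1 = 1;                                  % placeholder mapping parameters
p_common  = 1 / (1 + exp(-(b0 + b1*magnitude)));  % mapped to a probability of a common cause
```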
Multisensory Correlation Detector (MCD) model for trimodal integration.
Trimodal MCD units combine inputs from three unimodal transient channels via multiplicative interactions. When all three modalities are stimulated in close temporal and spatial proximity, the resulting trimodal population response closely replicates the predictions of maximum likelihood estimation (MLE).
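As a heavily simplified illustration of the trimodal extension, the sketch below multiplies three unimodal transient channels, reusing the toy lpf/bpf filters from the Figure 1 sketch; the input names (vis, aud, and a third modality such as touch) and the single-stage three-way product are assumptions for illustration only.

```matlab
% Toy trimodal MCD interaction: three unimodal transient channels combined
% multiplicatively. vis, aud, tac are assumed input time courses; lpf and
% bpf are the illustrative filters defined in the Figure 1 sketch.
v = lpf(bpf(vis, 0.05), 0.10);
a = lpf(bpf(aud, 0.05), 0.10);
x = lpf(bpf(tac, 0.05), 0.10);
trimodal_resp = v .* a .* x;        % three-way multiplicative interaction
```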
Multisensory Correlation Detector (MCD) and the Ventriloquist Illusion.
The upper panel represents a still frame of a performing ventriloquist. The central panel represents the MCD population response. The lower plot represents the horizontal profile of the MCD response for the same frame. Note how the population response clusters on the location of the dummy, where more pixels are temporally correlated with the soundtrack.
Audiovisual saliency maps.
(A) represents a still frame of the stimuli of Coutrot and Guyader, 2015. The white dots represent the gaze direction of the various observers. (B) represents the Multisensory Correlation Detector (MCD) population response for the frame in Panel A. The dots represent observed gaze direction (and correspond to the white dots of Panel A). (C) represents how the MCD response varies over time and azimuth (with elevation marginalized-out). The black solid lines represent the active speaker, while the waveform on the right displays the soundtrack. Note how the MCD response was higher for the active speaker. (D) shows the distribution of model responses at gaze direction (see Panel B) across all frames and observers in the database. Model responses were normalized for each frame (Z-scores). The y axis represents the number of frames. The vertical gray line represents the mean. See Video 4 for a dynamic representation of the content of this figure.
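The analysis in Panel D (model response at gaze, Z-scored frame by frame) can be sketched along the following lines; the saliency array S and the gaze coordinate arrays are hypothetical names used only for illustration.

```matlab
% Sketch of the analysis in Panel D. S(row, col, frame) is the MCD
% population response (saliency map); gazeRow(frame, observer) and
% gazeCol(frame, observer) are gaze positions in pixel coordinates.
[~, ~, nFrames] = size(S);  nObs = size(gazeRow, 2);
zGaze = nan(nFrames, nObs);
for f = 1:nFrames
    map  = S(:, :, f);
    zmap = (map - mean(map(:))) / std(map(:));   % normalize each frame (Z-scores)
    for o = 1:nObs
        zGaze(f, o) = zmap(round(gazeRow(f, o)), round(gazeCol(f, o)));
    end
end
% The histogram of zGaze(:) corresponds to Panel D; a mean above zero means
% gaze tends to land on pixels with above-average audiovisual correlation.
```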
Videos
The McGurk Illusion – integration of mismatching audiovisual speech.
The soundtrack is from a recording where the actress utters the syllable /pa/, whereas in the video she utters /ka/. When the video and sound tracks are approximately synchronous, observers often experience the McGurk illusion, and perceive the syllable /ta/. To experience the illusion, try to recognize what the actress utters as we manipulate audiovisual lag. Note how the population response clusters around the mouth area, and how its magnitude scales with the probability of experiencing the illusion. See Video 2 for details.
Population response to audiovisual speech stimuli.
The top left panel displays the stimulus from van Wassenhove et al., 2007, where the actress utters the syllable /ta/. The central and right top panels represent the dynamic correlation and lag population responses, respectively (Equations 6 and 7). The lower part of the video displays the temporal profile of the stimuli and model responses (averaged over space). The top two lines represent the stimuli: for the visual stimuli, the line represents the root-mean-squared difference of the pixel values from one frame to the next; the line for audio represents the envelope of the stimulus. The next two lines represent the outputs of the unimodal transient channels (averaged over space) that feed into the MCD (Equation 2). The two lower lines represent the correlation and lag responses and correspond to the averages of the responses displayed in the central and top-right panels. This movie corresponds to the data displayed in Figure 3A–B. Note how the magnitude of the correlation response increases as the absolute lag decreases, while the polarity of the lag response changes depending on which modality came first.
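The two stimulus traces in the lower part of the video can be computed along the following lines; the variable names (V for the grayscale video array, snd and fsAud for the soundtrack) and the 10 ms envelope smoother are assumptions for illustration.

```matlab
% Sketch of the stimulus traces: RMS frame-to-frame pixel change for video
% and a rectified, low-passed envelope for audio. V(row, col, frame) is an
% assumed grayscale video array; snd is the audio waveform sampled at fsAud Hz.
nFrames  = size(V, 3);
visTrace = zeros(nFrames, 1);
for f = 2:nFrames
    d = double(V(:, :, f)) - double(V(:, :, f-1));
    visTrace(f) = sqrt(mean(d(:).^2));               % root-mean-squared pixel difference
end
alpha    = 1 - exp(-1/(0.01*fsAud));                 % ~10 ms exponential smoother
audTrace = filter(alpha, [1, -(1-alpha)], abs(snd)); % amplitude envelope of the soundtrack
```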
The ventriloquist illusion.
The top panel represents a video of a performing ventriloquist. The voice of the dummy was edited (pitch-shifted) and added in post-production. The second panel represents the dynamic population response to a blurred version of the video (Equation 6). The third panel shows the distribution of population responses along the horizontal axis (obtained by averaging the upper panel over the vertical dimension). This represents the dynamic, real-life version of the bimodal population response shown in Figure 4A for the case of minimalistic audiovisual stimuli. The lower panel represents the same information as the panel above, displayed as a rolling timeline. For this video, the population response was temporally aligned to the stimuli to compensate for lags introduced by the temporal filters of the model. Note how the population response spatially follows the active speaker, hence capturing the shift of the sensed location of the audiovisual event towards correlated visual signals.
Audiovisual saliency maps.
The top panel represents Movie 1 from Coutrot and Guyader, 2015. The central panel represents the MCD population response in gray scale, while the colourful blobs represent observers’ gaze direction during passive viewing. The lower panel represents how the model response and gaze direction (co)vary over time and azimuth (with elevation marginalized-out). The black solid lines represent the active speaker. For the present simulations, movies were converted to grayscale and the upper and lower sections of the videos (which were mostly static) were cropped. Note how gaze is consistently directed towards the regions of the frames displaying the highest audiovisual correlation.
Additional files
MDAR checklist
- https://cdn.elifesciences.org/articles/106122/elife-106122-mdarchecklist1-v1.docx
Source code 1
This compressed folder includes Matlab codes and stimuli to simulate the experiments of Alais and Burr, 2004; Alais and Carlile, 2005; Körding et al., 2007; Mohl et al., 2020.
- https://cdn.elifesciences.org/articles/106122/elife-106122-code1-v1.zip
Supplementary file 1
Summary of the experiments simulated in Figure 2—figure supplement 2.
The first column contains the reference of the study, the second column the task (McGurk, Simultaneity Judgment, and Temporal Order Judgment). The third column describes the stimuli: n represents the number of individual instances of the stimuli; ‘HI’ and ‘LI’ in Magnotti and Beauchamp, 2017 indicate speech stimuli with High and Low Intelligibility, respectively. ‘Blur’ indicates that the videos were blurred. ‘Disamb’ indicates that ambiguous speech stimuli (i.e., sine-wave speech) were disambiguated by informing the observers of the original speech sound. The fourth column indicates whether visual and acoustic stimuli were congruent. Here, incongruent stimuli refer to the mismatching speech stimuli used in the McGurk task. ‘SWS’ indicates sine-wave speech; ‘noise’ in Ikeda and Morishita, 2020 indicates a stimulus similar to sine-wave speech but in which white noise was used instead of pure sinusoidal waves. The fifth column represents the country where the study was performed. The sixth column describes the observers included in the study: ‘c.s.’ indicates convenience sampling (usually undergraduate students); musicians in Lee and Noppeney, 2011; Lee and Noppeney, 2014 were amateur piano players; Freeman et al., 2013 tested young observers (18–28 years old), a patient, P.H. (67 years old), who after a lesion in the pons and basal ganglia reported hearing speech before seeing the lips move, and a group of age-matched controls (59–74 years old). The seventh column reports the number of observers included in the study. Overall, the full dataset consisted of 986 individual psychometric curves; however, several observers participated in more than one experiment, so that the total number of unique observers was 454. The eighth column reports the number of lags used in the method of constant stimuli. The ninth column reports the number of trials included in the study. The tenth column reports the correlation between empirical and predicted psychometric functions. The bottom row contains some descriptive statistics of the dataset.
- https://cdn.elifesciences.org/articles/106122/elife-106122-supp1-v1.docx
Supplementary file 2
Summary of the experiments simulated in Figure 2—figure supplement 2.
The first column contains the reference of the study, the second column the task (Simultaneity Judgment and Temporal Order Judgment). The third column describes the stimuli. The fourth column indicates which rats were used as observers. The fifth column reports the number of rats in the study; ‘same’ means that the same rats took part in the experiment in the row above. The sixth column reports the number of lags used in the method of constant stimuli. The seventh column reports the number of trials included in the study (not available for all studies). The eighth column reports the correlation between empirical and predicted psychometric functions. The bottom row contains some descriptive statistics of the dataset.
- https://cdn.elifesciences.org/articles/106122/elife-106122-supp2-v1.docx