The MCD population model.

Panel A: schematic representation of a single MCD unit. The input visual signal represents the intensity of a pixel (mouth area) over time, while the audio input is the soundtrack (the syllable /ba/). The grey soundtrack represents the experimental manipulation of AV lag, obtained by delaying one sense with respect to the other. BPF and LPF indicate band-pass and low-pass temporal filters, respectively. Panel C shows how single-unit responses vary as a function of crossmodal lag. Panel B represents the architecture of the MCD population model. Each visual unit (in blue) receives input from a single pixel, while the auditory unit receives as input the intensity envelope of the soundtrack (mono audio; see Figure 4A for a version of the model capable of receiving spatialized auditory input). Sensory evidence is then integrated over time and space for perceptual decision-making, a process in which the two model responses are weighted, summed, corrupted with additive Gaussian noise, and compared to a criterion to generate a forced-choice response (Panel D).
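To make the single-unit computation concrete, the following is a minimal Python sketch of one MCD unit under simplifying assumptions: first-order exponential low-pass filters stand in for both the BPF and LPF stages, and the time constants (tau_v, tau_a, tau_out) are illustrative placeholders rather than the paper's fitted values.

```python
import numpy as np

def exp_lowpass(x, tau, dt=0.001):
    """First-order exponential low-pass filter (illustrative stand-in
    for the model's temporal filters)."""
    t = np.arange(0, 5 * tau, dt)
    kernel = np.exp(-t / tau)
    kernel /= kernel.sum()
    return np.convolve(x, kernel)[:len(x)]

def mcd_unit(video, audio, dt=0.001, tau_v=0.05, tau_a=0.05, tau_out=0.1):
    """Single MCD unit sketch: filter each input, multiply across the senses
    in two mirror-symmetric subunits, then combine the subunit outputs into
    a correlation response and a lag response."""
    v = exp_lowpass(video, tau_v, dt)          # temporally filtered pixel intensity
    a = exp_lowpass(audio, tau_a, dt)          # temporally filtered sound envelope
    sub_va = exp_lowpass(v, tau_out, dt) * a   # vision-leading subunit
    sub_av = exp_lowpass(a, tau_out, dt) * v   # audition-leading subunit
    mcd_corr = sub_va * sub_av                 # peaks near zero lag (Panel C)
    mcd_lag = sub_va - sub_av                  # sign tracks which sense led
    return mcd_corr, mcd_lag
```

The two mirror-symmetric subunits are what give the correlation output its tuning to crossmodal lag (Panel C) and the lag output its sensitivity to temporal order.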

Natural audiovisual stimuli and psychophysical responses.

Panel A: stimuli (still frame and soundtrack) and psychometric functions for the McGurk illusion2, synchrony judgments4, and temporal order judgments5. In all panels, dots correspond to empirical data, lines to MCD responses; negative lags represent vision first. Panel B, left: envelopes of auditory stimuli (clicks) recorded at different distances in a reverberant environment (the Sydney Opera House)7. While the reverberant portion of the sound is identical across distances, the intensity of the direct sound (the onset) decreases with depth. As a result, the centre of mass of the envelopes shifts rightward with increasing distance. The central panel shows empirical and predicted psychometric functions for the various distances. The four curves were fitted using the same decision-making parameters, so that the separation between the curves results purely from the operation of the MCD. The lag at which sound and light appear synchronous (the point of subjective synchrony) scales with distance at a rate approximately matching the speed of sound (right panel). The dots in the right panel display the point of subjective synchrony (estimated separately for each curve), while the jagged line is the model prediction. Panel C shows temporal order judgments for clicks and flashes from both rats and human observers11,12. Rats outperform humans at short lags, and vice versa. Panel D: rats' temporal order and synchrony judgments for flashes and clicks of varying intensity14. Note that the three curves in each task were fitted using the same decision-making parameters, so that the MCD alone accounts for the separation between the curves. Panel E: pharmacologically induced changes in rats' audiovisual time perception. Left: glutamatergic inhibition (MK-801 injection) leads to an asymmetric broadening of the psychometric functions for simultaneity judgments19. Right: GABA inhibition (Gabazine injection) abolishes rapid temporal adaptation, so that psychometric curves do not change based on the lag of the previous trials (as they do in controls)21. All pharmacologically induced changes in audiovisual time perception can be accounted for by changes in the decision-making process, with no need to postulate changes in low-level temporal processing.
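The claimed scaling of the point of subjective synchrony with distance follows from one line of arithmetic: at the speed of sound in air (~343 m/s), every metre of depth delays the click by roughly 2.9 ms relative to the light. A small sketch, with hypothetical distances (the caption does not list the actual ones):

```python
# Physical audio delay expected at each source distance, assuming the
# speed of sound in air is ~343 m/s; the distances below are illustrative,
# not the ones used in the Sydney Opera House recordings.
speed_of_sound = 343.0  # m/s
for distance_m in (2.0, 8.0, 16.0, 32.0):
    delay_ms = distance_m / speed_of_sound * 1000
    print(f"{distance_m:5.1f} m -> sound lags light by ~{delay_ms:.0f} ms")
```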

Ecological audiovisual stimuli and model responses.

Panel A displays the frames and soundtrack of a dynamic audiovisual stimulus over time (in this example, video and audio tracks are synchronous, and the actress utters the syllable /ta/). Panel B shows how the dynamic population responses MCDcorr and MCDlag vary across the frames of Panel A. Note how model responses highlight the pixels whose intensity changed with the soundtrack (i.e., the mouth area). The right side of Panel B represents the population read-out process, as implemented for the simulations in Figure 2: the population responses MCDcorr and MCDlag are integrated over space (i.e., pixels) and time (i.e., frames), scaled and weighted by the gain parameters βcorr and βlag, and summed to obtain a single decision variable that is fed to the decision-making stage (see Figure 1D). Panel C represents the time-averaged population responses MCDcorr and MCDlag as a function of crossmodal lag (the central one corresponds to the time-averaged responses shown in Panel B). Note how MCDcorr peaks at around zero lag and decreases with increasing lag (following the same trend shown in Figure 1C, left), while the polarity of MCDlag changes with the sign of the delay. The psychophysical data corresponding to the stimulus in this figure are shown in Supplementary Figure 1B.
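The read-out described above maps directly onto a few lines of code. The following sketch follows the caption's recipe (spatiotemporal averaging, gains βcorr and βlag, weighted sum, additive Gaussian noise, criterion); the function and parameter names are illustrative:

```python
import numpy as np

def mcd_readout(mcd_corr, mcd_lag, beta_corr, beta_lag, noise_sd, criterion):
    """Population read-out sketch (Figure 1D / right side of Panel B):
    average the two population responses over pixels and frames, apply
    the gains, sum, add Gaussian decision noise, and threshold to obtain
    a forced-choice response. mcd_corr and mcd_lag are arrays over
    (frames, pixels); the remaining arguments are decision parameters."""
    d = beta_corr * mcd_corr.mean() + beta_lag * mcd_lag.mean()
    d += np.random.normal(0.0, noise_sd)   # additive Gaussian noise
    return d > criterion                   # binary forced-choice response
```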

Audiovisual integration in space.

Panel A, top: the MCD population model for spatialized audio. Visual and auditory input units receive input from corresponding spatial locations and feed into spatially tuned MCD units. The output of each MCD unit is eventually normalized by the total population output, so as to represent the probability distribution of stimulus location over space. The bottom part of Panel A represents the dynamic unimodal and bimodal population responses over time and space (azimuth), and their marginals. When time is marginalized out, a population of MCDs implements integration as predicted by the MLE model. When space is marginalized out, the output shows the temporal response function of the model. In this example, visual and auditory stimuli were asynchronously presented from discrepant spatial locations (note how the blue and orange distributions are spatiotemporally offset). Panel B shows a schematic representation of the stimuli used to test the MLE model by Alais and Burr3. Stimuli were presented from different spatial positions, with a parametric manipulation of audiovisual spatial disparity and blob size (i.e., the standard deviation σ of the blob). Panel C shows how the bimodal psychometric functions predicted by the MCD (lines, see Equation 16) and the MLE (dots) models fully overlap. Panel D shows how bimodal bias varies as a function of disparity and visual reliability (see legend on the left). The dots correspond to the empirical data from participant LM, while the lines are the predictions of the MCD model (compare to Figure 2A of Alais and Burr3). Panel E shows how the just noticeable differences (JNDs, the random localization error) vary as a function of blob size. The blue squares represent the visual JNDs, the purple dots the bimodal JNDs, and the dashed orange line the auditory JND. The continuous line shows the JNDs predicted by the MCD population model (compare to Figure 2B of Alais and Burr3). Panel F represents the stimuli and results of the experiment used by Körding and colleagues10 to test the BCI model. Auditory and visual stimuli originated from one of five spatial locations, spanning a range of 20 deg. The plots show the perceived locations of visual (blue) and auditory (orange) stimuli for each combination of audiovisual spatial locations. The dots represent human data, while the lines represent the responses of the MCD population model. Panel G shows the stimuli and results of the experiment of Mohl and colleagues15. The plots on the right display the probability of a single (vs. double) fixation (top: monkeys; bottom: humans). The dots represent empirical data, while the lines represent the responses of the MCD population model. The remaining panels show the histograms of the fixated locations in bimodal trials: the jagged histograms are the empirical data, while the smooth ones are the model predictions (zero free parameters). The regions of overlap between empirical and predicted histograms are shown in black.
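For reference, the MLE benchmark invoked in Panels C-E is the standard reliability-weighted fusion rule, in which each cue is weighted by its inverse variance; a minimal sketch:

```python
import numpy as np

def mle_prediction(loc_v, loc_a, sigma_v, sigma_a):
    """Standard MLE (reliability-weighted) fusion that the MCD population
    read-out is shown to reproduce: weights are inversely proportional
    to the unimodal variances."""
    w_v = sigma_a**2 / (sigma_v**2 + sigma_a**2)   # visual weight
    w_a = 1.0 - w_v                                # auditory weight
    loc_av = w_v * loc_v + w_a * loc_a             # bimodal location estimate
    sigma_av = np.sqrt((sigma_v**2 * sigma_a**2) /
                       (sigma_v**2 + sigma_a**2))  # bimodal JND scales with this
    return loc_av, sigma_av
```

As the blob size grows, sigma_v increases and the visual weight w_v drops, which is what produces the bias and JND patterns of Panels D and E.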

Audiovisual saliency maps.

Panel A represents a still frame of Coutrot and Guyader's1 stimuli. The white dots represent the gaze directions of the various observers. Panel B represents the MCD population response for the frame in Panel A. The dots represent observed gaze directions (and correspond to the white dots of Panel A). Panel C represents how the MCD response varies over time and azimuth (with elevation marginalized out). The black solid lines represent the active speaker, while the waveform on the right displays the soundtrack. Note how the MCD response was higher for the active speaker. Panel D shows the distribution of model responses at gaze direction (see Panel B) across all frames and observers in the database. Model responses were normalized for each frame (Z-scores). The y axis represents the number of frames. The vertical grey line represents the mean.
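The read-out behind Panel D can be sketched as follows: each frame of the model response is Z-scored, and the value at every observer's gaze position is collected. Variable names and the coordinate convention are assumptions for illustration:

```python
import numpy as np

def gaze_zscore(saliency_frame, gaze_xy):
    """Normalize one frame of the model response to Z-scores, then read
    out the value at each observer's gaze position (Panel D).
    saliency_frame: 2D array (pixels); gaze_xy: (n_observers, 2) array
    of (x, y) pixel coordinates."""
    z = (saliency_frame - saliency_frame.mean()) / saliency_frame.std()
    rows = gaze_xy[:, 1].astype(int)   # y coordinate indexes rows
    cols = gaze_xy[:, 0].astype(int)   # x coordinate indexes columns
    return z[rows, cols]               # one Z-scored response per observer
```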

Temporal determinants of multisensory integration in humans.

Schematic representation of the stimuli with a manipulation of audiovisual lag (Panel A). Panels B-J show the stimuli, psychophysical data (dots), and model responses (lines) of simulated studies investigating the temporal determinants of multisensory integration in humans using ecological audiovisual stimuli. The colour of the psychometric curves represents the psychophysical task (red=McGurk, black=simultaneity judgments, blue=temporal order judgments). Panel K shows the results of the permutation test: the histogram represents the permuted distribution of the correlation between data and model responses. The arrow represents the observed correlation between model and data, which is 4.7 standard deviations above the mean of the permuted distribution (p<0.001). Panel L represents the funnel plot of the Pearson correlation between model and data across all studies in Panels B-J, plotted against the number of trials used for each level of audiovisual lag. Red diamonds represent the McGurk task, blue diamonds and circles represent temporal order judgments, and black diamonds and circles represent simultaneity judgments. Diamonds represent human data, circles represent rat data. The dashed grey line represents the overall Pearson correlation between empirical and predicted psychometric curves, weighted by the number of trials of each curve. As expected, the correlation decreases with decreasing number of trials.
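The permutation test in Panel K can be sketched generically as follows; the exact shuffling scheme used in the paper is not specified in the caption, so shuffling the model-to-data assignment here is an assumption:

```python
import numpy as np

def permutation_test(data, model, n_perm=10000, seed=None):
    """Generic permutation test for the data-model correlation (Panel K):
    compare the observed Pearson correlation to the distribution obtained
    by shuffling the assignment of model responses to data points."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(data, model)[0, 1]
    permuted = np.array([np.corrcoef(data, rng.permutation(model))[0, 1]
                         for _ in range(n_perm)])
    z = (observed - permuted.mean()) / permuted.std()  # e.g. ~4.7 in Panel K
    p = (permuted >= observed).mean()                  # one-sided p value
    return observed, z, p
```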

Temporal determinants of multisensory integration in rats.

Schematic representation of the stimuli (clicks and flashes) with a manipulation of audiovisual lag (Panel A). Panels B-J show the psychophysical data (dots) and model responses (lines) of simulated studies. The colour of the psychometric curves represents the psychophysical task (black=simultaneity judgments, blue=temporal order judgments).

Simultaneity judgment for impulse stimuli in humans, individual observer data (Yarrow et al. 2023).

Panel A represents the baseline condition; Panel B represents the conservative condition (in which observers were instructed to respond “synchronous” only when absolutely sure); Panel C represents the post-test condition (in which observers were instructed to respond as they preferred, as in the baseline condition). Individual plots represent data from a single observer (dots) and model responses (lines), and consist of 135 trials each (total 7695 trials). The bottom-right plot represents the average empirical and predicted psychometric functions.
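Within the decision stage of Figure 1D, one plausible (but assumed, not stated in this caption) reading of the baseline/conservative manipulation is a change in the width of the decision criterion, as in this toy sketch:

```python
import numpy as np

def simultaneity_response(decision_var, criterion, noise_sd, seed=None):
    """Toy simultaneity judgment: respond "synchronous" when the noisy
    decision variable falls within +/- criterion. Under conservative
    instructions (Panel B) the criterion would shrink, so fewer trials
    are judged synchronous; this reading of the manipulation is an
    illustrative assumption, not the paper's fitted mechanism."""
    rng = np.random.default_rng(seed)
    d = decision_var + rng.normal(0.0, noise_sd)
    return abs(d) < criterion
```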

Simultaneity and temporal order judgments for step stimuli (Parise & Ernst, 2023).

Panel A represents the results of the simultaneity judgments; Panel B represents the data from the temporal order judgment. Each column represents a different combination of audiovisual step polarity (i.e., onset steps in both modalities, offset steps in both modalities, video onset and audio offset, video offset and audio onset). Different rows represent curves from different observers (dots) and model responses (lines). Each curve consists of 150 trials (total 9600 trials). The bottom row (light grey background) represents the average empirical and predicted psychometric functions.

Simultaneity judgment for periodic stimuli (Parise & Ernst, 2023).

Individual plots represent data from a single observer (dots) and model responses (lines). Each curve consists of 600 trials (total 3000 trials). The bottom-right curve represents the average empirical and predicted psychometric functions.

Summary of the experiments simulated in Supplementary Figure 1.

The first column contains the reference of the study, the second column the task (McGurk, Simultaneity Judgment, or Temporal Order Judgment). The third column describes the stimuli: n represents the number of individual instances of the stimuli; “HI” and “LI” in Magnotti and Beauchamp (2013) indicate speech stimuli with High and Low Intelligibility, respectively; “Blur” indicates that the videos were blurred; “Disamb” indicates that ambiguous speech stimuli (i.e., sine-wave speech) were disambiguated by informing the observers of the original speech sound. The fourth column indicates whether visual and acoustic stimuli were congruent. Here, incongruent stimuli refer to the mismatching speech stimuli used in the McGurk task. “SWS” indicates sine-wave speech; “noise” in Ikeda and Morishita (2020) indicates a stimulus similar to sine-wave speech but in which white noise was used instead of pure sinusoidal waves. The fifth column represents the country where the study was performed. The sixth column describes the observers included in the study: “c.s.” indicates convenience sampling (usually undergraduate students); musicians in Lee and Noppeney (2011, 2014) were amateur piano players; Freeman et al. (2013) tested young observers (18-28 years old), patient P.H. (67 years old), who after a lesion in the pons and basal ganglia reported hearing speech before seeing the lips move, and a group of age-matched controls (59-74 years old). The seventh column reports the number of observers included in the study. Overall, the full dataset consisted of 986 individual psychometric curves; however, several observers participated in more than one experiment, so that the total number of unique observers was 454. The eighth column reports the number of lags used in the method of constant stimuli. The ninth column reports the number of trials included in the study. The tenth column reports the correlation between empirical and predicted psychometric functions. The bottom row contains some descriptive statistics of the dataset.
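The trial-weighted aggregate correlation (the dashed grey line of the funnel plot, and the kind of summary reported in the bottom row) can be sketched under the assumption that it is a plain weighted mean of the per-study correlations; the paper may instead aggregate via Fisher's z:

```python
import numpy as np

def weighted_mean_correlation(correlations, n_trials):
    """Aggregate per-study data-model correlations, weighting each study
    by its number of trials (assumed to be a plain weighted mean)."""
    r = np.asarray(correlations, dtype=float)
    w = np.asarray(n_trials, dtype=float)
    return np.average(r, weights=w)
```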

Summary of the experiments simulated in Supplementary Figure 2.

The first column contains the reference of the study, the second column the task (Simultaneity Judgment or Temporal Order Judgment). The third column describes the stimuli. The fourth column indicates which rats were used as observers. The fifth column reports the number of rats in the study; ‘same’ means that the same rats took part in the experiment in the row above. The sixth column reports the number of lags used in the method of constant stimuli. The seventh column reports the number of trials included in the study (not available for all studies). The eighth column reports the correlation between empirical and predicted psychometric functions. The bottom row contains some descriptive statistics of the dataset.