1 Introduction

Perception is flexible and should change in response to perceptual error (Bedford, 1999). In a multimodal environment, systematic discrepancies between modalities indicate perceptual errors and thus the need for recalibration. Signals in different modalities from the same event can arrive with different physical and neural delays in the relevant brain areas (Fain, 2019; Pöppel, 1988; Spence & Squire, 2003). Cross-modal temporal recalibration has been considered a compensatory mechanism that attempts to realign the perceived timing between modalities to maintain perceptual synchrony across changes in the perceptual systems and the environment (reviewed by King, 2005; Vroomen and Keetels, 2010). This phenomenon is exemplified by audiovisual temporal recalibration, where consistent exposure to audiovisual asynchrony shifts the audiovisual temporal bias (i.e., point of subjective simultaneity) between auditory and visual stimuli in the direction of the asynchrony to which one has been exposed (Di Luca et al., 2009; Fujisaki et al., 2004; Hanson et al., 2008; Harrar & Harris, 2008; Heron et al., 2007; Keetels & Vroomen, 2007; Navarra et al., 2005; Roach et al., 2011; Tanaka et al., 2011; Vatakis et al., 2007, 2008; Vroomen & Keetels, 2010; Vroomen et al., 2004; Yamamoto et al., 2012).

The classic compensatory view is that recalibration serves to offset the physical and neural latency differences between modalities (Fujisaki et al., 2004), aiming for external accuracy, the agreement between perception and the environment (Zaidel et al., 2011). When external feedback is not available, recalibration may instead target internal consistency between the senses (Burge et al., 2010). In the context of the recalibration of relative timing, both goals, accuracy and consistency, predict similar behavior: the perceptual system will attempt to recalibrate for any amount of asynchrony so as to homeostatically restore physical or perceived synchrony. Here, we formalize this hypothesis as the fixed-update model.

The fixed-update model predicts that the amount of temporal recalibration increases linearly with the asynchrony one is exposed to. However, empirical observations do not fully align with this model. The amount of audiovisual temporal recalibration as a function of adapted stimulus-onset asynchrony (SOA) exhibits two crucial characteristics: nonlinearity and asymmetry. The amount of recalibration is not proportional to the adapted SOA, but instead plateaus at SOAs of approximately 100–300 ms (Fujisaki et al., 2004; Vroomen et al., 2004). The amount of recalibration can also be asymmetrical: the magnitude of recalibration differs when the visual stimulus leads during the exposure phase compared to when the auditory stimulus leads (Fujisaki et al., 2004; O’Donohue et al., 2022; Van der Burg et al., 2013). These observations suggest that while the fixed-update model might capture the general purpose behind temporal recalibration, it falls short of fully capturing the nuanced ways in which the brain adjusts to varying SOAs. This prompts consideration of additional mechanisms to account for the previously observed nonlinearity and asymmetry.

Notably, the fixed-update model overlooks the causal relationship between multimodal stimuli by implicitly assuming that they originate from a single source. However, this is not always the case in a multimodal environment. For instance, in a dubbed movie with a noticeable delay between the video and the audio, this delay can indicate that the sound is provided by a different actor, not the character on screen. To address this challenge, the brain must perform causal inference to determine whether multimodal signals come from a common source and should be integrated, or else kept separate. Indeed, numerous studies support the hypothesis that humans consider the causal structure of cross-modal stimuli when making perceptual decisions (Aller & Noppeney, 2019; Cao et al., 2019; Dokka et al., 2019; Körding et al., 2007; Locke & Landy, 2017; McGovern et al., 2016; Rohe & Noppeney, 2015; Samad et al., 2015; Sato et al., 2007; Wei & Körding, 2009; Wozny et al., 2010). Drawing on this framework, we formulate a causal-inference model of temporal recalibration based on previous models that have successfully predicted visual-auditory (Hong et al., 2021; Sato et al., 2007) and visual-tactile (Badde, Navarro, & Landy, 2020) spatial recalibration.

Although models incorporating causal inference are promising in capturing the observed nonlinearities, they predict an identical amount of temporal recalibration for audiovisual stimulus pairs that have the same SOA but with opposite sign (i.e., lead vs. lag). This suggests that additional factors are required to explain the observed asymmetry. In previous studies, the asymmetry has been attributed to different factors, such as the physical and neural latency differences between sensory signals (O’Donohue et al., 2022; Van der Burg et al., 2013) or more frequent exposure to visual-lead events in natural environments (Fujisaki et al., 2004; Van der Burg et al., 2013). These factors can explain the audiovisual temporal bias that most humans develop through early sensory experience (Badde, Ley, et al., 2020). Yet, this bias would again equally affect the amount of recalibration resulting from the same SOAs on either side of the observer’s bias. In contrast to bias, sensory uncertainty has been shown to affect the degree of cross-modal recalibration in a complex fashion (Badde, Navarro, & Landy, 2020; Hong et al., 2021; van Beers et al., 2002). We hypothesize that different degrees of auditory and visual uncertainty play a critical role in the asymmetry of cross-modal temporal recalibration.

To examine the mechanism underlying cross-modal temporal recalibration, we used a classic three-phase recalibration paradigm, in which participants completed a pre-test, exposure, and post-test. We manipulated the adapter SOA (i.e., the audiovisual asynchrony presented in the exposure phase) across sessions, introducing SOAs up to 0.7 s of either auditory or visual lead. Before and after the exposure phase in each session, we measured participants’ perception of audiovisual relative timing using a temporal-order-judgment (TOJ) task. To preview the empirical results, we confirmed the nonlinearity as well as the idiosyncratic asymmetry of the recalibration effect. To scrutinize the factors that might drive these two main characteristics, we fitted four models to the data, using either causal inference or a fixed update. Although previous empirical evidence challenges the fixed-update model, its relevance should not be discounted without a statistical comparison to alternative models. The causal-inference and the fixed-update models were combined with either modality-specific or modality-independent uncertainty.

Model comparison revealed that causal inference combined with modality-specific uncertainty is essential to accurately capture the nonlinearity and idiosyncratic asymmetry of temporal recalibration. Our results indicate that human observers employ causal-inference-based percepts to recalibrate cross-modal temporal perception. This finding suggests that cross-modal temporal recalibration, typically considered an early-stage, low-level perceptual process, involves higher cognitive functions in the adjustment of perception.

2 Results

2.1 Behavioral results

We adopted a classical three-phase recalibration paradigm in which participants completed a pre-test, an exposure phase, and a post-test in each session. In pre- and post-tests, we measured participants’ perception of audiovisual relative timing using a TOJ task: participants reported the perceived order (“visual first,” “auditory first,” or “simultaneous”) of audiovisual stimulus pairs with varying SOAs (range: from -0.5 to 0.5 s with 15 levels; Figure 1A). In the exposure phase, participants were exposed to a series of audiovisual stimuli with a consistent SOA (250 trials; Figure 1B). To ensure that participants were attentive to the stimuli, they performed an oddball-detection task. Specifically, we inserted oddball stimuli with slightly greater intensity in either one or both modalities (5% of total trials independently sampled for each modality). Participants were instructed to respond whenever they detected such stimuli. The high d′ of oddball-detection performance (auditory d′ = 3.34 ± 0.54, visual d′ = 2.44 ± 0.72) showed that participants paid attention to both modalities (Figure S2). The post-test was almost identical to the pre-test, except that before every temporal-order judgment, there were three top-up exposure trials to maintain the recalibration effect. In total, participants completed nine sessions on separate days. The adapter SOA (range: -0.7 to 0.7 s) varied across but not within sessions.
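
Sensitivity can be computed from per-modality hit and false-alarm rates in the standard yes/no way, d′ = Φ⁻¹(hit rate) − Φ⁻¹(false-alarm rate). A minimal sketch in Python; the rates below are hypothetical, and the standard formula is our assumption about the exact computation:

```python
from scipy.stats import norm

def dprime(hit_rate, fa_rate, eps=1e-3):
    # Standard yes/no sensitivity index; rates clipped away from 0 and 1
    h = min(max(hit_rate, eps), 1 - eps)
    f = min(max(fa_rate, eps), 1 - eps)
    return norm.ppf(h) - norm.ppf(f)

print(dprime(0.95, 0.02))  # hypothetical rates -> d' of about 3.70
```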

Figure 1: Task timing. (A) Temporal-order-judgment task administered in the pre- and post-tests. In each trial, participants made a temporal-order judgment in response to an audiovisual stimulus pair with a varying stimulus-onset asynchrony (SOA). Negative values: auditory lead; positive values: visual lead. (B) Oddball-detection task used in the exposure phase and post-test top-up trials. Participants were repeatedly presented with an audiovisual stimulus pair with a SOA that was fixed within each session but varied across sessions. Occasionally, the intensity of either or both of the auditory and the visual stimuli was increased. Participants were instructed to press a key whenever such an oddball stimulus occurred.

We compared the temporal-order judgments between the pre- and post-tests to examine the amount of audiovisual temporal recalibration induced by the SOA of audiovisual stimuli during the exposure phase. Specifically, we fitted the data from the pre- and post-tests jointly assuming different points of subjective simultaneity (PSS) between the two tests while assuming fixed arrival-latency distributions and fixed response criteria (Figure 2A; see Supplement S1 for an alternative model assuming a shift in the response criteria due to recalibration). The amount of audiovisual temporal recalibration was defined as the difference between the two PSSs. At the group level, we observed a nonlinear pattern of recalibration as a function of the adapter SOA: the amount of recalibration in the direction of the adapter SOA first increased but then plateaued with increasing magnitude of the adapter SOA presented during the exposure phase (Figure 2B). Additionally, we observed an asymmetry between auditory-lead and visual-lead adapter SOAs in the magnitude of recalibration at the group level, with auditory-lead adapter SOAs inducing a greater amount of recalibration (Figure 2B; see Figure S6 for individual participants’ data). To quantify this asymmetry for each participant, we calculated an asymmetry index, defined as the sum of the recalibration effects across all sessions (zero: no evidence for asymmetry; positive values: greater recalibration given visual-lead adapters; negative: greater recalibration given auditory-lead adapters). For each participant, we bootstrapped the temporal-order judgments to obtain a 95% confidence interval for the asymmetry index. All participants’ confidence intervals excluded zero, suggesting that all of them showed audiovisual asymmetry in temporal recalibration (Figure S3).
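
The bootstrap procedure for the asymmetry index can be sketched as follows (Python; fit_pss is a hypothetical helper that refits the psychometric model to one resampled dataset and returns the PSS):

```python
import numpy as np

def bootstrap_asymmetry_ci(sessions, fit_pss, n_boot=1000, rng=None):
    """sessions: list of (pre_data, post_data); each maps test SOA -> responses.
    fit_pss: hypothetical helper returning the PSS for one dataset."""
    rng = rng or np.random.default_rng()

    def resample(data):
        # Resample temporal-order judgments with replacement, per test SOA
        return {soa: rng.choice(r, size=len(r)) for soa, r in data.items()}

    idx = []
    for _ in range(n_boot):
        effects = [fit_pss(resample(post)) - fit_pss(resample(pre))
                   for pre, post in sessions]
        idx.append(sum(effects))            # summed recalibration = asymmetry index
    return np.percentile(idx, [2.5, 97.5])  # 95% bootstrap confidence interval
```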

Figure 2: Behavioral results. (A) The probability of reporting that the auditory stimulus came first, the two arrived at the same time, or the visual stimulus came first as a function of SOA for a representative participant in a single session. The adapter SOA was -0.3 s for this session. Curves: best-fitting functions estimated jointly using the data from the pre-test (dashed) and post-test (solid). Shaded areas: 95% bootstrapped confidence intervals. (B) Mean recalibration effects (shifts in the point of subjective simultaneity from the pre- to the post-test phase) averaged across all participants as a function of adapter SOA. Error bars: ±SEM.

2.2 Modeling results

In the following sections, we describe our models of cross-modal temporal recalibration, first laying out the general assumptions shared by all of them and then explaining the differences between them. Next, we compare the models in terms of performance and illustrate how well they capture the observed data by generating model predictions.

2.2.1 General model assumptions

We formulated four models of cross-modal temporal recalibration (Figure 3). These models share several common assumptions. First, when an auditory and a visual signal are presented, the corresponding neural signals arrive in the relevant brain areas with a variable latency due to internal and external noise. The probability distribution of arrival latency is an exponential distribution (García-Pérez & Alcalá-Quintana, 2012) (Figure 3A). A simple derivation shows that the resulting measurements of SOA follow a double-exponential distribution (Figure 3B). The mode reflects the physical SOA plus the participant’s audiovisual temporal bias. The slopes of the distribution reflect the uncertainties of the arrival latency; the steeper the slope, the less variable the latency, and the less uncertainty a Bayesian observer would have in a single trial. Second, these models define temporal recalibration as accumulating updates of the audiovisual bias after each encounter with a SOA. The accumulated update of the audiovisual bias at the end of the exposure phase is then carried over to the post-test and persists throughout that phase. Lastly, the bias is assumed to be reset to the same initial value in the pre-test across all nine sessions, reflecting the stability of the audiovisual temporal bias across time (Badde, Ley, et al., 2020; Grabot & van Wassenhove, 2017).
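
These assumptions are straightforward to simulate. The following sketch (Python; all parameter values are arbitrary and chosen only for illustration) draws exponentially distributed arrival latencies and shows that the resulting SOA measurements form the peaked, double-exponential distribution described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, for illustration only (all in seconds)
beta_a, beta_v = 0.03, 0.06   # fixed auditory/visual delays
tau_a, tau_v = 0.04, 0.09     # exponential time constants (audition more precise)
s = -0.3                      # physical SOA, s = tA - tV (negative: auditory lead)

n = 100_000
# Arrival latencies: fixed delay + exponentially distributed additional delay
lat_a = beta_a + rng.exponential(tau_a, n)
lat_v = beta_v + rng.exponential(tau_v, n)

# Measured SOA = physical SOA + difference in arrival latencies; its histogram
# is an asymmetric double-exponential peaking at s + (beta_a - beta_v)
m = s + lat_a - lat_v
print(f"mode ≈ {s + beta_a - beta_v:.3f}, sample median = {np.median(m):.3f}")
```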

Figure 3: Illustration of the model for cross-modal temporal recalibration. (A) The probability density function of the arrival latency of the auditory and visual signals relative to the physical onset of each stimulus. (B) The resulting probability density function of the measured SOA, m, before (dashed) and after (solid) recalibration. The measurement distribution peaks at the physical SOA plus an audiovisual temporal bias. Temporal recalibration is modeled as cumulative changes in the audiovisual bias, Δ, across the exposure phase. (C) Two recalibration models. The fixed-update model updates the audiovisual bias so that subsequent measurements of the audiovisual SOA approach zero. The causal-inference model updates the audiovisual bias based on the perceived SOA, ŝ, i.e., taking different causal scenarios into account. The percept ŝ is computed as a weighted average of estimates inferred from the scenarios of a common cause, C = 1, and separate causes, C = 2. α: learning rate. See text for details.

2.2.2 Models of cross-modal temporal recalibration

The four models we tested differed in the mechanism governing the updates of the audiovisual bias during the exposure phase as well as the modality specificity of the arrival latency uncertainty.

We formulated a Bayesian causal-inference model (Körding et al., 2007; McGovern et al., 2016; Sato et al., 2007) to describe the recalibration of the relative timing between cross-modal signals. When an observer is presented with an adapter SOA during the exposure phase, they infer the causal relationship between the auditory and visual stimuli. Specifically, the observer computes two intermediate estimates of the SOA, one for each causal scenario (Figure 3C). In the common-cause scenario, the estimated SOA of the stimuli is smaller in magnitude than the measurement because the measurement is combined with a prior distribution over SOA that reflects synchrony. In the separate-causes scenario, the estimated SOA is approximately equal to the measurement. The two estimates are then averaged, with each one weighted by the inferred probability of the corresponding causal scenario. The audiovisual bias is then updated to reduce the difference between the measurement and the combined estimate of the SOA. In other words, causal inference regulates the recalibration process by shifting the measured SOA to more closely match the percept, which in turn is computed based on the inferred causal structure.

We also considered a fixed-update model. The major distinction between the causal-inference and the fixed-update model is that, according to the latter, the measured SOA is shifted toward zero rather than toward the inferred SOA. Essentially, whenever the observer detects a SOA, they recalibrate by shifting the audiovisual bias in the opposite direction so that the measured SOA will be closer to zero.

We additionally varied a second model element: the uncertainty of arrival latency was either modality-specific or modality-independent. The auditory system typically has higher temporal precision than the visual system. Hence, the arrival latency of visual signals can be more variable than that of auditory signals, resulting in an asymmetrical probability density of the measured SOA (m). A Bayesian observer takes this modality-specific sensory uncertainty into account when deriving an estimate of the SOA (ŝ). However, differences in temporal precision might not be due to modality-specific variability of arrival latency; the auditory and visual systems might instead share a common, modality-independent timing mechanism (Stauffer et al., 2012), predicting modality-independent uncertainty.

2.2.3 Model comparison

We fitted four models to each participant’s data. Each model was constrained jointly by the temporal-order judgments from the pre- and post-tests of all nine sessions. To quantify model performance, we computed the Akaike information criterion (AIC) for each model and each participant (Akaike, 1998). The model with the lowest AIC value was considered the best-fitting model. For all participants, the causal-inference model with modality-specific uncertainty outperformed the other three models. We then computed the AIC values of the other models relative to the best-fitting model, ΔAIC, with higher ΔAIC values indicating stronger evidence for the best-fitting model. The results of model comparison revealed robust evidence for the causal-inference model with modality-specific uncertainty (ΔAIC = 55.68 ± 21.45 for the fixed-update model with modality-specific uncertainty; ΔAIC = 48.71 ± 18.51 for the fixed-update model with modality-independent uncertainty; ΔAIC = 12 ± 5.94 for the causal-inference model with modality-independent uncertainty).
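
The comparison metric is AIC = 2k − 2 ln L̂, where k is the number of free parameters and L̂ is the maximized likelihood; ΔAIC values are taken relative to the minimum. A minimal sketch (Python; the AIC values below are made up for illustration):

```python
def aic(max_log_like, n_params):
    # Akaike information criterion: 2k - 2 ln(maximized likelihood)
    return 2 * n_params - 2 * max_log_like

# Hypothetical AIC values for one participant, for illustration only
aics = {"CI, modality-specific": 980.0, "CI, modality-independent": 992.0,
        "fixed, modality-specific": 1035.7, "fixed, modality-independent": 1028.7}
best = min(aics, key=aics.get)
delta_aic = {model: aics[model] - aics[best] for model in aics}
print(best, delta_aic)
```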

2.2.4 Model prediction

We predicted the recalibration effect per adapter SOA using the estimated parameters based on each of the four models. The nonlinearity in audiovisual temporal recalibration was only captured by models that rely on causal inference during the exposure phase (Figure 4A; see Figure S5 for other variants of the causal-inference model; see Figures S6 and S7 for model predictions of individual participants’ recalibration effects and TOJ responses). On the other hand, the models that assume a fixed update based on the measured SOA were unable to capture the data, as they predict a linear increase of recalibration with greater adapter SOA. We derived the asymmetry index (i.e., the recalibration effect summed across sessions) for the predictions of each model and compared these indices with those computed directly from the data. To capture participants’ idiosyncratic asymmetry in temporal recalibration, a model not only requires modality-specific uncertainty of arrival latency but also needs to incorporate causal inference during the exposure phase (Figure 4B).

Figure 4: Model predictions. (A) Data and model predictions of recalibration as a function of adapter SOA. (B) Model prediction of the asymmetry index, the summed recalibration effect across adapter SOA. Dots: individual participants. Error bars: 68% bootstrapped confidence intervals. Identity line: perfect model prediction.

2.2.5 Model simulation

Simulations with the best-fitting model revealed key factors that determine the degree of nonlinearity and asymmetry of cross-modal temporal recalibration to different adapter SOAs. The prior belief that the auditory and visual stimuli share a common cause plays a crucial role in adjudicating between the common-cause and separate-causes scenarios (Figure 5A). When the observer infers that the audiovisual stimuli share the same cause, they recalibrate by a proportion of the perceived asynchrony no matter how large the measured asynchrony is, just as in the fixed-update model. On the contrary, when the observer infers that the audiovisual stimuli have separate causes, they treat the audiovisual stimuli as independent of each other and do not recalibrate. Estimates of the common-cause prior for all participants range between these two extremes. Thus, all observers weighted the estimates from these two scenarios based on the scenarios’ probability, resulting in the nonlinear pattern of recalibration (see Table S1 for parameter estimates for individual participants).

Figure 5: Model simulations. (A) Effect of the prior probability of a common cause on cross-modal temporal recalibration. (B) Asymmetry due to differing uncertainty of auditory vs. visual arrival latency. Left panel: When auditory uncertainty is smaller than visual uncertainty, reducing auditory uncertainty leads to less recalibration in response to visual-lead adapter SOA. Right panel: when visual uncertainty is smaller than auditory uncertainty, the opposite effect results. Top-left insets: corresponding SOA likelihood functions for a measured SOA of zero as auditory or visual uncertainty is varied.

Differences in arrival-time uncertainty between audition and vision result in an asymmetry of audiovisual temporal recalibration across adapter SOAs (Figure 5B). The amount of recalibration is attenuated when the modality with less uncertainty lags during the exposure phase. When the lagging stimulus is less uncertain, the perceptual system is more likely to attribute the SOA to separate causes and thus recalibrate less. In addition, the initial audiovisual bias does not affect asymmetry, but shifts the recalibration function horizontally and determines the SOA for which no recalibration occurs (Figure S8).

3 Discussion

In this study, we examined audiovisual temporal recalibration by repeatedly exposing participants to various stimulus-onset asynchronies and measuring perceived audiovisual relative timing before and after exposure. To further understand the mechanisms underlying audiovisual temporal recalibration, we assessed the efficacy of different models of the recalibration process in predicting the amount of recalibration as a function of the audiovisual asynchrony to which one is exposed. Our findings indicate that a Bayesian causal-inference model with modality-specific uncertainty best captured the two key features of cross-modal temporal recalibration: the nonlinear increase of recalibration magnitude with increasing adapted audiovisual asynchrony, and the asymmetrical recalibration magnitude between auditory- and visual-lead adapters with the same absolute asynchrony. Our results indicate that human observers employ causal-inference-based percepts to recalibrate cross-modal temporal perception.

In cross-modal recalibration, causal inference effectively serves as a credit-assignment mechanism, evaluating to what extent the source of discrepancy is external (i.e., the stimuli from the two modalities occurred at different times in the world) or internal (i.e., the measurement of asynchrony resulted from internal miscalibration). The perceptual system should correct for errors if they are due to misalignment between the senses. It should not recalibrate if two independent events, such as your TV screen and the neighbors’ conversation, exhibit audiovisual asynchrony. The same principle also applies to other cross-modal domains. The relevance of causal inference extends beyond temporal recalibration, influencing cross-modal spatial recalibration (Badde, Navarro, & Landy, 2020; Hong et al., 2021; Sato et al., 2007; Wozny & Shams, 2011a, 2011b). Similarly, in sensorimotor adaptation, humans correct for motor errors that are more likely due to the motor system, but not for those due to the environment (Berniker & Kording, 2008; Wei & Körding, 2009).

Previous investigations into the mechanisms behind audiovisual temporal recalibration have proposed various models. These models describe recalibration as a selective reduction of response gain for the adapted asynchrony in a population code (Roach et al., 2011), a shift of latency or response criteria (Yarrow et al., 2015), changes in temporal discriminability (Roseboom et al., 2015), or updates of the prior and likelihood function (Sato & Aihara, 2011). However, a common feature of the experimental methods in these studies is that they examine the recalibration process within a relatively narrow range of adapted audiovisual asynchrony. This restriction rests on the assumption that the audiovisual stimuli are perceived as originating from the same source, which holds only for small asynchronies. Our model goes beyond this limitation by incorporating causal inference, which extends the model’s applicability to a wider range of audiovisual asynchronies.

In addition to the nonlinear pattern of temporal recalibration, our results revealed significant asymmetry in how much participants recalibrated to visual- vs. auditory-lead stimuli. The majority of our participants showed larger recalibration effects in response to auditory- than visual-lead asynchrony, in line with previous studies (O’Donohue et al., 2022). Simulation results supported the idea that this asymmetry could be due to less uncertainty in auditory arrival latency, in line with psychophysical studies (reviewed by Stauffer et al., 2012) that found audition has better temporal sensitivity than vision. However, our findings also highlighted individual differences: a few participants showed the opposite pattern, which has also been reported previously (Fujisaki et al., 2004). We speculate that in certain conditions, visual arrival latency might be less variable than auditory latency, for example when the auditory signal is influenced by environmental factors such as echoes. Accordingly, our model explains how temporal uncertainty, based on the precision of the perceptual system and the temporal variability of the physical stimulus, can lead to different directions of asymmetry in audiovisual temporal recalibration.

The principle of causal inference in audiovisual temporal recalibration is likely to extend to rapid cross-modal temporal recalibration, which occurs following exposure to a single, brief audiovisual asynchrony (Van der Burg et al., 2013). However, it is an open question whether modality-specific uncertainty can explain the asymmetry of rapid cross-modal temporal recalibration. The pattern of asymmetry for rapid temporal recalibration differs from that of cumulative recalibration; in rapid recalibration, the recalibration magnitude and the range in which recalibration occurs are larger when vision leads than when audition leads (Van der Burg et al., 2013, 2015). Such findings suggest that the mechanisms behind rapid and cumulative temporal recalibration may differ fundamentally. Supporting this, recent neuroimaging research has revealed distinct underlying neurophysiological processes. Cumulative temporal recalibration induces gradual phase shifts of entrained neural oscillations in the auditory cortex (Kösem et al., 2014), whereas rapid recalibration relies on phase-frequency coupling that happens at a faster time scale (Lennert et al., 2021).

In sum, we found that causal inference with modality-specific uncertainty modulates audiovisual temporal recalibration. This finding suggests that cross-modal temporal recalibration is more complex than a compensatory mechanism for maintaining accuracy or consistency. It relies on causal inference that considers both the sensory and causal uncertainty of multisensory inputs. Cross-modal temporal recalibration is typically viewed as an early-stage, low-level perceptual process. Our findings refine this view, suggesting that it is deeply intertwined with higher cognitive functions.

4 Methods

4.1 Participants

Ten students from New York University (three males; age: 24.4 ± 1.77 years; all right-handed) participated in the experiment. They all reported normal or corrected-to-normal vision. All participants provided informed written consent before the experiment and received $15/hr as monetary compensation. The study was conducted in accordance with the guidelines laid down in the Declaration of Helsinki and approved by the New York University institutional review board. Data from one participant were identified as an outlier and therefore excluded from further data analysis (Figure S4).

4.2 Apparatus and stimuli

Participants completed the experiments in a dark, semi-sound-attenuated room. They were seated 1 m from an acoustically transparent, white screen (1.36 × 1.02 m, 68 × 52° visual angle) and placed their head on a chin rest. An LCD projector (Hitachi CP-X3010N, 1024 × 768 pixels, 60 Hz) was mounted above and behind participants to project visual stimuli onto the screen. The visual stimulus was a high-contrast (36.1 cd/m²) Gaussian blob (SD: 3.6°) on a gray background (10.2 cd/m²). The auditory stimulus was a 500 Hz beep (50 dB SPL) played by a loudspeaker located behind the center of the screen. The visual and auditory stimulus durations were 33.33 ms. We adjusted the timing of audiovisual stimulus presentations and verified the timing using an oscilloscope (PICOSCOPE 2204A).

4.3 Procedure

The experiment consisted of nine sessions, which took place on nine separate days. In each session, participants completed a pre-test, an exposure phase, and a post-test in sequence. The adapter SOA was fixed within a session, but varied across sessions (±700, ±300, ±200, ±100, 0 ms). The intensities of the oddball stimuli were determined prior to the experiment for each participant using an intensity-discrimination task to equate the difficulty of detecting oddball stimuli between participants and across modalities.

4.3.1 Pre-test phase

Participants completed a TOJ task during the pre-test phase. Each trial started with the display of a fixation cross (0.1–0.2 s, uniform distribution), followed by a blank screen (0.4–0.6 s, uniform distribution). Then, an auditory and a visual stimulus (0.033 s) were presented with a variable SOA. There were a total of 15 possible test SOAs (from -0.5 to 0.5 s in steps of 0.05 s), with positive values representing visual lead and negative values representing auditory lead. Following stimulus presentation there was another blank screen (0.4–0.6 s, uniform distribution), and then a response probe appeared on the screen. Participants indicated by button press whether the auditory stimulus occurred before the visual stimulus, occurred after, or the two were simultaneous. There was no time limit for the response, and response feedback was not provided. The inter-trial interval (ITI) was 0.2–0.4 s. Each test SOA was presented 20 times in pseudo-randomized order, resulting in 300 trials in total, divided into five blocks. Participants usually took around 15 minutes to finish the pre-test phase.

4.3.2 Exposure phase

Participants completed an oddball-detection task during the exposure phase. In each trial, participants were presented with an audiovisual stimulus pair with a fixed SOA (adapter SOA). In 10% of trials, the intensity of either the visual or the auditory component (or both) was greater than in the other trials. Participants were instructed to press a button as soon as possible when there was an auditory oddball, a visual oddball, or both stimuli were oddballs. The task timing was almost identical to the TOJ task, except that there was a response time limit of 1.4 s. The visual and auditory oddball stimuli were presented to participants prior to the exposure phase and they practiced as much as they needed to familiarize themselves with the task. There were a total of 250 trials, divided into five blocks. At the end of each block, we presented a performance summary with the hit and false alarm rates for each modality. Participants usually took 15 minutes to complete the exposure phase.

4.3.3 Post-test phase

Participants completed the TOJ task as well as the oddball-detection task during the post-test phase. Specifically, each temporal-order judgment was preceded by three top-up (oddball-detection) trials. The adapter SOA in the top-up trials was the same as that in the exposure phase to prevent dissipation of temporal recalibration (Machulla et al., 2012). To facilitate task switching, the ITI between the last top-up trial and the following TOJ trial was longer (with the additional time jittered around 1 s). Additionally, the fixation cross became red to signal the start of a TOJ trial. As in the pre-test phase, there were 300 TOJ trials (15 test SOAs × 20 repetitions) with the addition of 900 top-up trials, grouped into six blocks. At the end of each block, we provided a summary of the oddball-detection performance. Participants usually took around 1 hour to complete the post-test phase.

4.3.4 Intensity-discrimination task

This task was conducted to estimate the just-noticeable difference (JND) in intensity for a standard visual stimulus with a luminance of 36.1 cd/m², and a standard auditory stimulus with a volume of 40 dB SPL. The task was two-interval, forced choice. Each trial started with a fixation cross (0.1–0.2 s) and a blank screen (0.4–0.6 s). Participants were presented with the standard stimulus in one randomly selected interval (0.4–0.6 s) and a comparison stimulus in the other interval (0.4–0.6 s), temporally separated by an inter-stimulus interval (0.6–0.8 s). They indicated which interval contained the brighter/louder stimulus, without time constraint. Seven test stimulus levels (luminance range: 5%–195%; volume range: 50%–150% of the standard) were each repeated 20 times, resulting in 140 trials for each modality. We fit a Gaussian cumulative distribution function to these data and defined the JND as the intensity difference for which the test stimulus was chosen 90% of the time as more intense than the standard. An oddball was defined as an auditory or visual stimulus with an intensity 1 JND above the standard intensity.
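
A minimal sketch of this fit (Python/scipy; the response proportions below are hypothetical, for illustration only):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Comparison intensity relative to the standard (1.0), and the proportion of
# trials on which the comparison was judged more intense (hypothetical data)
levels = np.array([0.50, 0.67, 0.83, 1.00, 1.17, 1.33, 1.50])
p_more = np.array([0.04, 0.13, 0.31, 0.52, 0.72, 0.89, 0.97])

(mu, sigma), _ = curve_fit(lambda x, mu, sigma: norm.cdf(x, mu, sigma),
                           levels, p_more, p0=[1.0, 0.2])

# JND: intensity difference (relative to the standard) at which the comparison
# is chosen as more intense 90% of the time
jnd = norm.ppf(0.90, loc=mu, scale=sigma) - 1.0
print(f"JND = {jnd:.3f} (relative units)")
```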

4.4 Modeling

In this section, we use the best-fitting model that combines causal inference with modality-specific uncertainty as a template, and then describe how the alternative models differ from this. We start by describing how the arrival latencies of auditory and visual stimuli lead to noisy internal measurements of audiovisual SOA, followed by how we modeled the process of audiovisual temporal recalibration. Then, we provide a formalization of the TOJ task administered in the pre- and the post-test phases, data from which were used to constrain the model parameters. Finally, we describe how the models were fit to the data.

4.4.1 Measurements of audiovisual stimulus-onset asynchrony

When an audiovisual stimulus pair with a stimulus-onset asynchrony, s = tA − tV, is presented, it triggers auditory and visual signals that are registered with different latencies in the region of cortex where audiovisual comparisons are made. This leads to two internal measurements in an observer’s brain. As in previous work (García-Pérez & Alcalá-Quintana, 2012), we model the probability distribution of the latency of auditory and visual signals across repetitions, relative to the physical onset of each stimulus, as shifted exponential distributions (Figure 3A). These distributions may be shifted relative to the physical stimulus onset due to internal signal delays (denoted βA and βV). The arrival latency of the auditory signal relative to onset tA is the sum of the fixed delay, βA, and an additional delay that is exponentially distributed with time constant τA, and similarly for the visual arrival latency (with delay βV and time constant τV).

The measured SOA of the audiovisual stimulus pair is modeled as the difference in the arrival latency of both stimuli. Thus, the measured audiovisual SOA m includes the physical SOA s, the fixed latency (i.e., the difference between the auditory and visual fixed latency) β = βA − βV, and a stochastic component. Given that both latency distributions are shifted exponential distributions, a noisy sensory measurement of SOA m given a physical SOA s has a probability density that is an asymmetric double-exponential (Figure 6A):

p(m | s) = 1/(τA + τV) exp(−(m − s − β)/τA), if m ≥ s + β,
p(m | s) = 1/(τA + τV) exp((m − s − β)/τV), if m < s + β. (1)

Figure 6: Causal-inference model of the temporal-order-judgment task. (A) An example measurement distribution for a SOA of zero. The measurement distribution peaks at the audiovisual bias β, whose left and right slopes reflect the visual and auditory uncertainty, respectively. The point of subjective simultaneity is marked by the dashed line. (B) Simulated estimate distribution for a SOA of zero. The dashed lines represent the criteria placed symmetrically around zero, forming a temporal window of SOA estimates treated as simultaneous. The areas under the estimate distribution partitioned by the criteria indicate the probabilities of the three possible responses for a stimulus pair with a SOA of zero. (C) Simulated psychometric function computed by repeatedly calculating the probability of each possible response for all SOAs.

The mode of this measurement distribution is the physical SOA plus the fixed latency s + β. A negative value of β indicates faster auditory processing. The left and right spread of this measurement distribution depends on the uncertainty of the visual latency τV and auditory latency τA, respectively.
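
In code, this measurement density can be transcribed directly (a Python sketch of Eq. 1):

```python
import numpy as np

def soa_measurement_pdf(m, s, beta, tau_a, tau_v):
    """Asymmetric double-exponential density of the measured SOA m (Eq. 1).

    Peaks at s + beta; the right tail is governed by tau_a (auditory
    uncertainty) and the left tail by tau_v (visual uncertainty)."""
    x = np.asarray(m, dtype=float) - s - beta
    return np.exp(np.where(x >= 0, -x / tau_a, x / tau_v)) / (tau_a + tau_v)
```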

4.4.2 The perceptual inference process

To infer the stimulus SOA s from the measurement m, the ideal observer computes the posterior distribution of the SOA s by multiplying the likelihood function with the prior, separately for each of two causal scenarios: the auditory and visual stimuli can arise from a single cause (C = 1) or from two independent causes (C = 2).

The ideal observer holds two prior distributions of audiovisual SOA, one for each causal scenario. In the case of a common cause (C = 1), the prior distribution of the SOA between sound and light is a narrow Gaussian distribution (McGovern et al., 2016),

p(s | C = 1) = N(s; 0, σ²C=1).

When there are two separate causes (C = 2), the prior distribution of the audiovisual SOA is a broad Gaussian distribution (McGovern et al., 2016), assigning almost equal probability to each audiovisual SOA,

p(s | C = 2) = N(s; 0, σ²C=2), with σC=2 ≫ σC=1.

The observer obtains intermediate estimates of the stimulus SOA, ŝC=1 and ŝC=2, by combining the measured SOA with the prior over SOA corresponding to each of the two causal scenarios. In this model, we assume that the observer does not have access to, or chooses to ignore, the current temporal bias β.

The likelihood functions under the two causal scenarios are identical:

p(m | s, C = 1) = p(m | s, C = 2) = 1/(τA + τV) exp(−(m − s)/τA), if m ≥ s,
p(m | s, C = 1) = p(m | s, C = 2) = 1/(τA + τV) exp((m − s)/τV), if m < s,

where the left and right spreads depend on the visual and auditory uncertainties of the arrival latency, respectively. Because the likelihood function is non-Gaussian, there is no closed form for the intermediate estimates. We computed the posterior numerically and used a maximum-a-posteriori (MAP) estimator, i.e., ŝ was the mode of the posterior over the stimulus SOA in each scenario.

The final estimate of the stimulus SOA ŝ depends on the posterior probability of each causal scenario. By Bayes rule, the posterior probability that an audiovisual stimulus pair with the measured SOA shares a common cause is

P(C = 1 | m) = p(m | C = 1) P(C = 1) / [p(m | C = 1) P(C = 1) + p(m | C = 2) (1 − P(C = 1))].

The likelihood of a common source/separate sources for a fixed SOA measurement is calculated by numerically integrating the protoposterior (i.e., the unnormalized posterior),

p(m | C) = ∫ p(m | s, C) p(s | C) ds.

The posterior probability of a common cause additionally depends on the observer’s prior belief of a common cause for the auditory and visual stimuli, P (C = 1) = pcommon.

The final estimate of SOA is derived by model averaging, so that the final estimate is the average of the scenario-specific SOA estimates above weighted by the posterior probability of the corresponding causal scenario,

ŝ = P(C = 1 | m) ŝC=1 + [1 − P(C = 1 | m)] ŝC=2.
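
Numerically, the full inference step for a single measurement can be sketched as follows (Python; the grid range and resolution are arbitrary implementation choices):

```python
import numpy as np

def likelihood(m, s, tau_a, tau_v):
    # p(m | s): asymmetric double-exponential centered on s (bias ignored)
    x = m - s
    return np.exp(np.where(x >= 0, -x / tau_a, x / tau_v)) / (tau_a + tau_v)

def gauss(s, sigma):
    return np.exp(-0.5 * (s / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def causal_inference_estimate(m, tau_a, tau_v, sigma_c1, sigma_c2, p_common,
                              s_grid=np.linspace(-2.0, 2.0, 4001)):
    lik = likelihood(m, s_grid, tau_a, tau_v)
    proto_c1 = lik * gauss(s_grid, sigma_c1)   # protoposterior, common cause
    proto_c2 = lik * gauss(s_grid, sigma_c2)   # protoposterior, separate causes
    ds = s_grid[1] - s_grid[0]
    like_c1 = proto_c1.sum() * ds              # p(m | C = 1), numerical integral
    like_c2 = proto_c2.sum() * ds              # p(m | C = 2)
    # Posterior probability of a common cause (Bayes rule)
    p_c1 = like_c1 * p_common / (like_c1 * p_common + like_c2 * (1 - p_common))
    # MAP estimate under each scenario, then model averaging
    s_c1 = s_grid[np.argmax(proto_c1)]
    s_c2 = s_grid[np.argmax(proto_c2)]
    return p_c1 * s_c1 + (1 - p_c1) * s_c2
```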

4.4.3 Formalization of recalibration in the exposure phase

We model the recalibration process as a shift of the audiovisual fixed latency β, the audiovisual temporal bias, after each encounter with an audiovisual stimulus pair (Figure 3B). The internal value of β determines the observed point of subjective simultaneity (PSS), which is the stimulus SOA s that leads to a median measurement of zero, i.e., the value of s such that P(m < 0 | s = PSS, β) = 0.5. A simple derivation yields

PSS = −β − τA ln[2τA/(τA + τV)], if τA ≥ τV,
PSS = −β + τV ln[2τV/(τA + τV)], if τA < τV.

The shift of the audiovisual bias β also moves the measurement distribution. We assume the exponential time constants (τA, τV) remain unchanged across phases and sessions.

At the end of every exposure trial i, a discrepancy between the measured SOA, mi, and the final estimate of the stimulus SOA, ŝi, signals the need for recalibration. In each session, we assume the participant arrives with a default bias β. We define Δβ,i as the cumulative shift of the audiovisual bias after exposure trial i,

Δβ,i = Δβ,i−1 + α (ŝi − mi), with Δβ,0 = 0,

where α is the learning rate. At the end of the exposure phase, the predicted audiovisual bias is thus shifted by the accumulated shifts across the exposure phase, that is, βpost = β + Δβ,250.
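
The exposure phase can be simulated by iterating this update, reusing causal_inference_estimate from the sketch above (Python; the trial count matches the experiment, other details are simplified):

```python
import numpy as np

def simulate_exposure(adapter_soa, beta, alpha, n_trials=250, rng=None, **ci_params):
    """Cumulative bias shift across the exposure phase (sketch)."""
    rng = rng or np.random.default_rng()
    tau_a, tau_v = ci_params["tau_a"], ci_params["tau_v"]
    delta = 0.0
    for _ in range(n_trials):
        # Noisy measurement under the current, already-recalibrated bias
        m = (adapter_soa + beta + delta
             + rng.exponential(tau_a) - rng.exponential(tau_v))
        s_hat = causal_inference_estimate(m, **ci_params)
        delta += alpha * (s_hat - m)   # shift the bias toward the percept
    return delta                        # cumulative shift after 250 trials
```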

4.4.4 Formalization of the temporal-order-judgment task

In the TOJ task administered in the pre- and post-test phases, the observer makes a perceptual judgment by comparing the final estimate of the stimulus SOA ŝ to two internal criteria (Cary et al., 2024; García-Pérez & Alcalá-Quintana, 2012). We assume that the observer has a symmetric pair of criteria, ±c, centered on the stimulus SOA corresponding to perceptual simultaneity (ŝ = 0). In addition, the observer may lapse or make an error when responding. The probabilities of reporting visual lead, ΨV, auditory lead, ΨA, or that the two stimuli were simultaneous, ΨS, are thus

ΨV(s) = (1 − λ) P(ŝ > c | s) + λ/3,
ΨA(s) = (1 − λ) P(ŝ < −c | s) + λ/3,
ΨS(s) = (1 − λ) P(−c ≤ ŝ ≤ c | s) + λ/3, (11)

where λ is the lapse rate. Figure 6C shows an example of the resulting psychometric functions.

The probability distribution of causal-inference-based stimulus SOA estimates P (ŝ|s) has no closed form and can only be simulated. For each simulation we sampled 10,000 SOA measurements from the corresponding double-exponential probability distribution (Figure 6A). For each sampled measurement, we simulated the process by which the observer carries out causal inference and produced an estimate of the stimulus SOA (fixing the values of a few additional causal-inference model parameters). This process resulted in a Monte-Carlo approximation of the probability distribution of the causal-inference-based stimulus SOA estimates (Figure 6B).
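
The Monte-Carlo approximation can be sketched as follows, again reusing causal_inference_estimate from above (Python; treating lapses as uniform guesses over the three responses follows the equations above):

```python
import numpy as np

def toj_response_probs(s, beta, tau_a, tau_v, criterion, lapse,
                       n_sim=10_000, rng=None, **ci_params):
    """Monte-Carlo approximation of (ΨV, ΨA, ΨS) for one stimulus SOA s."""
    rng = rng or np.random.default_rng()
    m = s + beta + rng.exponential(tau_a, n_sim) - rng.exponential(tau_v, n_sim)
    s_hat = np.array([causal_inference_estimate(mi, tau_a, tau_v, **ci_params)
                      for mi in m])
    p_v = np.mean(s_hat > criterion)    # "visual first"
    p_a = np.mean(s_hat < -criterion)   # "auditory first"
    p_s = 1.0 - p_v - p_a               # "simultaneous"
    # Lapses treated as uniform guesses over the three responses
    return tuple((1 - lapse) * p + lapse / 3 for p in (p_v, p_a, p_s))
```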

4.4.5 Alternative models

In the fixed-update model, observers measure the audiovisual SOA s by comparing the arrival latency of the auditory and visual signals (Eq. 1). They do not perform causal inference to estimate the SOA. Instead, the measured SOA is shifted toward zero by recalibrating the audiovisual bias β. Hence, the update of the audiovisual bias in trial i is defined by

Δβ,i = Δβ,i−1 + α (0 − mi) = Δβ,i−1 − α mi.

The update of audiovisual bias is accumulated across the exposure phase, βpost = β + Δβ,250. In TOJ tasks, observers make the temporal-order decision by applying the criteria to the measurement of SOA m (see psychometric functions in Supplement Eq. S1).

In models with modality-independent uncertainty, τA = τV, resulting in a symmetric measurement distribution (Eq. 1).

4.4.6 Model fitting

Model log-likelihood

The model was fitted by maximizing likelihood. We fit the model to the TOJ data collected during the pre- and post-test phases of all sessions together. We did not collect temporal-order judgments in the exposure phase. However, to model the post-test data, we needed to estimate the distribution of shifts of audiovisual bias resulting from the exposure phase (Δβ,250). We did this using Monte Carlo simulation of the 250 exposure trials to estimate the probability distribution of the cumulative shifts.

The set of model parameters Θ is listed in Table 1. There are I sessions, each including K trials in the pre-test phase and K trials in the post-test phase. We denote the full dataset of pre-test data as Xpre and the post-test data as Xpost. On a given trial, the observer responds either auditory-first (A), visual-first (V) or simultaneous (S). We denote a single response using indicator variables that are equal to 1 if that was the response in that trial and 0 otherwise. These variables for trial k in session i are 1A,ik,pre, 1V,ik,pre and 1S,ik,pre for the pre-test trials, and 1A,ik,post, etc., for the post-test trials. The log-likelihood of all pre-test responses Xpre given the model parameters Θ is

ℓ(Xpre; Θ) = Σi Σk [1A,ik,pre ln ΨA,pre(sik) + 1V,ik,pre ln ΨV,pre(sik) + 1S,ik,pre ln ΨS,pre(sik)],

where sik denotes the test SOA of trial k in session i.

Table 1: Model parameters. Check marks signify that the parameter is used for determining the likelihood of the data from the temporal-order judgment task in the pre- and post-test phase and/or for the Monte Carlo simulation of recalibration in the exposure phase.

The psychometric functions for the pre-test (e.g., ΨA,pre) are defined in Eq. 11, and are the same across all sessions as we assumed that the audiovisual delay β was the same before recalibration in every session.

The log-likelihood of responses in the post-test depends on the audiovisual bias after recalibration, βpost = β + Δβ,250,i for session i. To determine the log-likelihood of the post-test data requires us to integrate out the unknown value of the cumulative shift Δβ,250,i. We approximated this integral in two steps based on our previous work (Hong et al., 2021). First, we simulated the 250 exposure-phase trials 1,000 times for a given set of parameters Θ and session i. This resulted in 1,000 values of Δβ,250,i. The distribution of these values was well fit by a Gaussian whose parameters were determined by the empirical mean and standard deviation of the sample distribution, resulting in the distribution N(μΔ,i, σ²Δ,i). Second, we approximated the integral of the log-likelihood of the data over possible values of Δβ,250,i by numerical integration. We discretized the approximated distribution into 100 equally spaced bins centered on values Δβ,250,i(n) (n = 1, …, 100). The range of the bins was triple the range of the values from the Monte Carlo sample, so that the lower bound was Δmin − (Δmax − Δmin) and the upper bound was Δmax + (Δmax − Δmin), where Δmin and Δmax denote the minimum and maximum of the sampled values.

The log-likelihood of the post-test data is approximated as

ℓ(Xpost; Θ) ≈ Σi ln Σn P(Δβ,250,i(n)) L(Xpost,i | Δβ,250,i(n)),

where P(Δβ,250,i(n)) is the probability of bin n under the approximating Gaussian and

L(Xpost,i | Δβ,250,i(n)) = exp(Σk [1A,ik,post ln ΨA,post,in(sik) + 1V,ik,post ln ΨV,post,in(sik) + 1S,ik,post ln ΨS,post,in(sik)]).

The psychometric functions in the post-test (e.g., ΨA,post,in) differ across sessions and bins because the simulated bias after recalibration βi,post depends on the adapter SOA fixed in session i and the simulation bin n.
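
In code, this two-step approximation might look as follows (Python; loglike_given_delta is a hypothetical helper that returns the log-likelihood of the session's post-test data for a given cumulative shift):

```python
import numpy as np
from scipy.stats import norm

def post_test_loglike(delta_samples, loglike_given_delta, n_bins=100):
    """Integrate the post-test likelihood over the cumulative shift (sketch)."""
    mu, sd = np.mean(delta_samples), np.std(delta_samples)  # Gaussian approximation
    lo, hi = np.min(delta_samples), np.max(delta_samples)
    span = hi - lo
    grid = np.linspace(lo - span, hi + span, n_bins)        # triple the sample range
    w = norm.pdf(grid, mu, sd)
    w /= w.sum()                                            # discretized bin weights
    ll = np.array([loglike_given_delta(d) for d in grid])
    c = ll.max()                                            # log-sum-exp for stability
    return c + np.log(np.sum(w * np.exp(ll - c)))
```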

Parameter estimation

We used the BADS toolbox (Acerbi & Ma, 2017) in MATLAB to optimize the set of parameters for each model, because it outperforms fmincon as the number of parameters increases. We repeated each search 80 times with a different random starting point to reduce the risk of reporting a local minimum, and chose the parameter estimates with the maximum likelihood across the repeated searches.
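
In outline, the multi-start strategy looks like this (a Python sketch with scipy's L-BFGS-B standing in for the MATLAB BADS optimizer; neg_log_like is the negative of the model log-likelihood described above):

```python
import numpy as np
from scipy.optimize import minimize

def fit_model(neg_log_like, lb, ub, n_starts=80, rng=None):
    """Multi-start optimization sketch; returns the best of n_starts runs."""
    rng = rng or np.random.default_rng()
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    best = None
    for _ in range(n_starts):
        x0 = rng.uniform(lb, ub)                 # random starting point
        res = minimize(neg_log_like, x0, method="L-BFGS-B",
                       bounds=list(zip(lb, ub)))
        if best is None or res.fun < best.fun:
            best = res                           # keep the best local optimum
    return best
```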