Abstract
Cross-modal temporal recalibration is crucial for maintaining coherent perception in a multimodal environment. The classic view suggests that cross-modal temporal recalibration aligns the perceived timing of sensory signals from different modalities, such as sound and light, to compensate for physical and neural latency differences. However, this view cannot fully explain the nonlinearity and asymmetry observed in audiovisual recalibration effects: the amount of recalibration plateaus with increasing audiovisual asynchrony and varies depending on the leading modality of the asynchrony during exposure. To address these discrepancies, our study examines the mechanism of audiovisual temporal recalibration through the lens of causal inference, considering the brain’s capacity to determine whether multimodal signals come from a common source and should be integrated, or else kept separate. In a three-phase recalibration paradigm, we manipulated the adapter stimulus-onset asynchrony in the exposure phase across nine sessions, introducing asynchronies up to 0.7 s of either auditory or visual lead. Before and after the exposure phase in each session, we measured participants’ perception of audiovisual relative timing using a temporal-order-judgment task. We compared models that assumed observers recalibrate to approach either the physical synchrony or the causal-inference-based percept, with uncertainties specific to each modality or comparable across them. Modeling results revealed that a causal-inference model incorporating modality-specific uncertainty captures both the nonlinearity and asymmetry of audiovisual temporal recalibration. Our results indicate that human observers employ causal-inference-based percepts to recalibrate cross-modal temporal perception.
1 Introduction
Perception is flexible and should change in response to perceptual error (Bedford, 1999). In a multimodal environment, systematic discrepancies between modalities indicate perceptual errors and thus the need for recalibration. Signals in different modalities from the same event can arrive with different physical and neural delays in the relevant brain areas (Fain, 2019; Pöppel, 1988; Spence & Squire, 2003). Cross-modal temporal recalibration has been considered a compensatory mechanism that attempts to realign the perceived timing between modalities to maintain perceptual synchrony across changes in the perceptual systems and the environment (reviewed by King, 2005; Vroomen and Keetels, 2010). This phenomenon is exemplified by audiovisual temporal recalibration, where consistent exposure to audiovisual asynchrony shifts the audiovisual temporal bias (i.e., point of subjective simultaneity) between auditory and visual stimuli in the direction of the asynchrony to which one has been exposed (Di Luca et al., 2009; Fujisaki et al., 2004; Hanson et al., 2008; Harrar & Harris, 2008; Heron et al., 2007; Keetels & Vroomen, 2007; Navarra et al., 2005; Roach et al., 2011; Tanaka et al., 2011; Vatakis et al., 2007, 2008; Vroomen & Keetels, 2010; Vroomen et al., 2004; Yamamoto et al., 2012).
The classic compensatory view is that recalibration serves to offset the physical and neural latency differences between modalities (Fujisaki et al., 2004), aiming for external accuracy, the agreement between perception and the environment (Zaidel et al., 2011). When external feedback is not available, recalibration may instead target internal consistency (Burge et al., 2010). In the context of the recalibration of relative timing, both accounts predict the same behavior: the perceptual system will attempt to recalibrate in response to any amount of asynchrony so as to restore physical or perceived synchrony. Here, we formalize this hypothesis as the fixed-update model.
The fixed-update model predicts that the amount of temporal recalibration increases linearly with the asynchrony one is exposed to. However, empirical observations do not fully align with this model. The amount of audiovisual temporal recalibration as a function of adapted stimulus-onset asynchrony (SOA) exhibits two crucial characteristics: nonlinearity and asymmetry. The amount of recalibration is not proportional to the adapted SOA, but instead plateaus at SOAs of approximately 100–300 ms (Fujisaki et al., 2004; Vroomen et al., 2004). The amount of recalibration can also be asymmetrical: the magnitude of recalibration differs when the visual stimulus leads during the exposure phase compared to when the auditory stimulus leads (Fujisaki et al., 2004; O’Donohue et al., 2022; Van der Burg et al., 2013). These observations suggest that while the fixed-update model might capture the general purpose behind temporal recalibration, it falls short of fully capturing the nuanced ways in which the brain adjusts to varying SOAs. This prompts consideration of additional mechanisms to account for the previously observed nonlinearity and asymmetry.
Notably, the fixed-update model overlooks the causal relationship between multimodal stimuli by implicitly assuming that they originate from a single source. However, this assumption does not always hold in a multimodal environment. For instance, in a dubbed movie with a noticeable delay between the video and the audio, this delay can indicate that the sound is provided by a different actor, not the character on screen. To address this challenge, the brain must perform causal inference to determine whether multimodal signals come from a common source and should be integrated, or else kept separate. Indeed, numerous studies support the hypothesis that humans consider the causal structure of cross-modal stimuli when making perceptual decisions (Aller & Noppeney, 2019; Cao et al., 2019; Dokka et al., 2019; Körding et al., 2007; Locke & Landy, 2017; McGovern et al., 2016; Rohe & Noppeney, 2015; Samad et al., 2015; Sato et al., 2007; Wei & Körding, 2009; Wozny et al., 2010). Drawing on this framework, we formulate a causal-inference model of temporal recalibration based on previous models that have successfully predicted visual-auditory (Hong et al., 2021; Sato et al., 2007) and visual-tactile (Badde, Navarro, & Landy, 2020) spatial recalibration.
Although models incorporating causal inference are promising in capturing the observed nonlinearities, they predict an identical amount of temporal recalibration for audiovisual stimulus pairs that have the same SOA but with opposite sign (i.e., lead vs. lag). This suggests that additional factors are required to explain the observed asymmetry. In previous studies, the asymmetry has been attributed to different factors, such as the physical and neural latency differences between sensory signals (O’Donohue et al., 2022; Van der Burg et al., 2013) or more frequent exposure to visual-lead events in natural environments (Fujisaki et al., 2004; Van der Burg et al., 2013). These factors can explain the audiovisual temporal bias most humans developed through early sensory experience (Badde, Ley, et al., 2020). Yet, this bias would again equally affect the amount of recalibration resulting from the same SOAs on either side of the observer’s bias. In contrast to bias, sensory uncertainty has been shown to affect the degree of cross-modal recalibration in a complex fashion (Badde, Navarro, & Landy, 2020; Hong et al., 2021; van Beers et al., 2002). We hypothesize that different degrees of auditory and visual uncertainty play a critical role in the asymmetry of cross-modal temporal recalibration.
To examine the mechanism underlying cross-modal temporal recalibration, we used a classic three-phase recalibration paradigm, in which participants completed a pre-test, exposure, and post-test. We manipulated the adapter SOA (i.e., the audiovisual asynchrony presented in the exposure phase) across sessions, introducing SOAs up to 0.7 s of either auditory or visual lead. Before and after the exposure phase in each session, we measured participants’ perception of audiovisual relative timing using a temporal-order-judgment (TOJ) task. To preview the empirical results, we confirmed the nonlinearity as well as idiosyncratic asymmetry of the recalibration effect. To scrutinize the factors that might drive these two main characteristics, we fitted four models to the data, using either causal inference or a fixed update. Although previous empirical findings challenge the fixed-update model, it should not be discounted without a statistical comparison to alternative models. The causal-inference and the fixed-update models were combined with either modality-specific or modality-independent uncertainty.
Model comparison revealed that causal inference combined with modality-specific uncertainty is essential to accurately capture the nonlinearity and idiosyncratic asymmetry of temporal recalibration. Our results indicate that human observers employ causal-inference-based percepts to recalibrate cross-modal temporal perception. This finding suggests that cross-modal temporal recalibration, typically considered an early-stage, low-level perceptual process, involves higher cognitive functions in the adjustment of perception.
2 Results
2.1 Behavioral results
We adopted a classical three-phase recalibration paradigm in which participants completed a pre-test, an exposure phase, and a post-test in each session. In pre- and post-tests, we measured participants’ perception of audiovisual relative timing using a TOJ task: participants reported the perceived order (“visual first,” “auditory first,” or “simultaneous”) of audiovisual stimulus pairs with varying SOAs (range: from -0.5 to 0.5 s with 15 levels; Figure 1A). In the exposure phase, participants were exposed to a series of audiovisual stimuli with a consistent SOA (250 trials; Figure 1B). To ensure that participants were attentive to the stimuli, they performed an oddball-detection task. Specifically, we inserted oddball stimuli with slightly greater intensity in either one or both modalities (5% of total trials independently sampled for each modality). Participants were instructed to respond whenever they detected such stimuli. The high d′ of oddball-detection performance (auditory d′ = 3.34 ± 0.54, visual d′ = 2.44 ± 0.72) showed that participants paid attention to both modalities (Figure S2). The post-test was almost identical to the pre-test, except that before every temporal-order judgment, there were three top-up exposure trials to maintain the recalibration effect. In total, participants completed nine sessions on separate days. The adapter SOA (range: -0.7 to 0.7 s) varied across but not within sessions.
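Oddball-detection sensitivity follows the standard signal-detection definition; a minimal sketch (in Python), with made-up hit and false-alarm rates for illustration:

```python
from scipy.stats import norm

def dprime(hit_rate, false_alarm_rate):
    """Oddball-detection sensitivity: d' = z(hit rate) - z(false-alarm rate)."""
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# Hypothetical per-modality rates from the exposure-phase oddball task
print(f"auditory d' = {dprime(0.97, 0.02):.2f}")  # ~3.9
print(f"visual   d' = {dprime(0.90, 0.05):.2f}")  # ~2.9
```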
We compared the temporal-order judgments between the pre- and post-tests to examine the amount of audiovisual temporal recalibration induced by the SOA of audiovisual stimuli during the exposure phase. Specifically, we fitted the data from the pre- and post-tests jointly assuming different points of subjective simultaneity (PSS) between the two tests while assuming fixed arrival-latency distributions and fixed response criteria (Figure 2A; see Supplement S1 for an alternative model assuming a shift in the response criteria due to recalibration). The amount of audiovisual temporal recalibration was defined as the difference between the two PSSs. At the group level, we observed a nonlinear pattern of recalibration as a function of the adapter SOA: the amount of recalibration in the direction of the adapter SOA first increased but then plateaued with increasing magnitude of the adapter SOA presented during the exposure phase (Figure 2B). Additionally, we observed an asymmetry between auditory-lead and visual-lead adapter SOAs in the magnitude of recalibration at the group level, with auditory-lead adapter SOAs inducing a greater amount of recalibration (Figure 2B; see Figure S6 for individual participants’ data). To quantify this asymmetry for each participant, we calculated an asymmetry index, defined as the sum of the recalibration effects across all sessions (zero: no evidence for asymmetry; positive values: greater recalibration given visual-lead adapters; negative: greater recalibration given auditory-lead adapters). For each participant, we bootstrapped the temporal-order judgments to obtain a 95% confidence interval for the asymmetry index. All participants’ confidence intervals excluded zero, suggesting that all of them showed audiovisual asymmetry in temporal recalibration (Figure S3).
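A minimal sketch of the asymmetry-index computation (in Python). Here we assume precomputed bootstrap samples of the nine session-wise recalibration effects, whereas the actual analysis resampled the temporal-order judgments themselves and refit the PSSs; all numbers below are invented for illustration:

```python
import numpy as np

def asymmetry_index(recal_effects):
    """Sum of recalibration effects across all adapter SOAs: zero indicates
    symmetry; positive, greater recalibration for visual-lead adapters;
    negative, greater recalibration for auditory-lead adapters."""
    return np.sum(recal_effects)

# Hypothetical bootstrap samples: one recalibration estimate (ms) per
# session (columns) for each resample of the TOJ data (rows).
rng = np.random.default_rng(0)
recal_boot = rng.normal(loc=[-45, -40, -35, -25, 0, 15, 25, 30, 32],
                        scale=8, size=(1000, 9))

ai = np.array([asymmetry_index(row) for row in recal_boot])
lo, hi = np.percentile(ai, [2.5, 97.5])   # bootstrapped 95% CI
print(f"95% CI of the asymmetry index: [{lo:.1f}, {hi:.1f}] ms")
```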
2.2 Modeling results
In the following sections, we describe our models for cross-modal temporal recalibration by first laying out the general assumptions of these models, and then explaining the differences between them. Then, we provide a comparison in terms of model performance and illustrate how well the models capture the observed data by generating model predictions.
2.2.1 General model assumptions
We formulated four models of cross-modal temporal recalibration (Figure 3). These models share several common assumptions. First, when an auditory and a visual signal are presented, the corresponding neural signals arrive in the relevant brain areas with a variable latency due to internal and external noise. The probability distribution of arrival latency is modeled as a shifted exponential distribution (García-Pérez & Alcalá-Quintana, 2012) (Figure 3A). A simple derivation shows that the resulting measurements of SOA follow a double-exponential distribution (Figure 3B). The mode reflects the physical SOA plus the participant’s audiovisual temporal bias. The slopes of the distribution reflect the uncertainties of the arrival latency; the steeper the slope, the less variable the latency, and the less uncertainty a Bayesian observer would have in a single trial. Second, these models define temporal recalibration as accumulating updates of the audiovisual bias after each encounter with an SOA. The accumulated update of the audiovisual bias at the end of the exposure phase is then carried over to the post-test and persists throughout that phase. Lastly, the bias is assumed to be reset to the same initial value in the pre-test across all nine sessions, reflecting the stability of the audiovisual temporal bias across time (Badde, Ley, et al., 2020; Grabot & van Wassenhove, 2017).
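These shared assumptions are straightforward to simulate; a minimal sketch (in Python) of the measurement process, with arbitrary illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_measurements(s, beta, tau_a, tau_v, n=10_000):
    """Measured SOAs m = s + beta + e_A - e_V, where e_A ~ Exp(tau_a) and
    e_V ~ Exp(tau_v) are the extra arrival latencies beyond the fixed delays.
    Positive SOAs denote visual lead (s = t_A - t_V); units are seconds."""
    return s + beta + rng.exponential(tau_a, n) - rng.exponential(tau_v, n)

# The histogram of m is double-exponential with its mode near s + beta.
m = sample_measurements(s=0.0, beta=-0.03, tau_a=0.06, tau_v=0.09)
print(f"median measured SOA: {np.median(m):.3f} s")
```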
2.2.2 Models of cross-modal temporal recalibration
The four models we tested differed in the mechanism governing the updates of the audiovisual bias during the exposure phase as well as the modality specificity of the arrival latency uncertainty.
We formulated a Bayesian causal-inference model (Körding et al., 2007; McGovern et al., 2016; Sato et al., 2007) to describe the recalibration of the relative timing between cross-modal signals. When an observer is presented with an adapter SOA during the exposure phase, they infer the causal relationship between the auditory and visual stimuli. Specifically, the observer computes two intermediate estimates of the SOA, one for each causal scenario (Figure 3C). In the common-cause scenario, the estimated SOA of the stimuli is smaller than the measurement as it is combined with a prior distribution over SOA that reflects synchrony. In the separate-causes scenario, the estimated SOA is approximately equal to the measurement. The two estimates are then averaged, with each one weighted by the inferred probability of the corresponding causal scenario. The audiovisual bias is then updated to reduce the difference between the measurement and the combined estimate of SOA. In other words, causal inference regulates the recalibration process by shifting the measured SOA to more closely match the percept, which in turn is computed based on the inferred causal structure.
We also considered a fixed-update model. The major distinction between the causal-inference and the fixed-update model is that, according to the latter, the measured SOA is shifted toward zero rather than toward the inferred SOA. Essentially, whenever the observer detects an SOA, they recalibrate by shifting the audiovisual bias in the opposite direction so that the measured SOA will be closer to zero.
We additionally varied a second model element: we assumed either modality-specific or modality-independent uncertainty of arrival latency. The auditory system typically has higher temporal precision than the visual system. Hence, the arrival latency of visual signals can be more variable than auditory-signal latency, resulting in an asymmetrical probability density of the measured SOA (m). A Bayesian observer will take this modality-specific sensory uncertainty into account to derive an estimate of the SOA (ŝ). However, differences in temporal precision might not arise at the level of arrival latency: the auditory and visual systems might share a common, modality-independent timing mechanism (Stauffer et al., 2012), predicting modality-independent uncertainty.
2.2.3 Model comparison
We fitted four models to each participant’s data. Each model was constrained jointly by the temporal-order judgments from the pre- and post-tests of all nine sessions. To quantify model performance, we computed the Akaike information criterion (AIC) for each model and each participant (Akaike, 1998). The model with the lowest AIC value was considered the best-fitting model. For all participants, the causal-inference model with modality-specific uncertainty outperformed the other three models. We then computed the AIC values of the other models relative to the best-fitting model, ΔAIC, with higher ΔAIC values indicating stronger evidence for the best-fitting model. The results of model comparison revealed robust evidence for the causal-inference model with modality-specific uncertainty (ΔAIC = 55.68 ± 21.45 for the fixed-update model with modality-specific uncertainty; ΔAIC = 48.71 ± 18.51 for the fixed-update model with modality-independent uncertainty; ΔAIC = 12 ± 5.94 for the causal-inference model with modality-independent uncertainty).
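For reference, a minimal sketch of this comparison (in Python); the negative log-likelihoods and parameter counts below are invented for illustration, not the fitted values:

```python
def aic(neg_log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k + 2 NLL (lower is better)."""
    return 2 * n_params + 2 * neg_log_likelihood

# Hypothetical fits for one participant: (NLL, number of parameters)
fits = {
    "causal inference, modality-specific": (1500.0, 9),
    "causal inference, modality-independent": (1508.0, 8),
    "fixed update, modality-specific": (1526.0, 8),
    "fixed update, modality-independent": (1525.0, 7),
}
aics = {name: aic(*fit) for name, fit in fits.items()}
best = min(aics.values())
for name, a in aics.items():
    print(f"{name}: dAIC = {a - best:.1f}")
```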
2.2.4 Model prediction
We predicted the recalibration effect per adapter SOA using the estimated parameters based on each of the four models. The nonlinearity in audiovisual temporal recalibration was only captured by models that rely on causal inference during the exposure phase (Figure 4A; see Figure S5 for other variants of the causal-inference model; see Figures S6 and S7 for model predictions for individual participants’ recalibration effects and TOJ responses). In contrast, the models that assume a fixed update based on the measured SOA were unable to capture the data, as they predict a linear increase of recalibration with greater adapter SOA. We derived the asymmetry index (i.e., the recalibration effect summed across sessions) for the predictions of each model and compared these indices with those computed directly from the data. To capture participants’ idiosyncratic asymmetry in temporal recalibration, the model requires not only modality-specific uncertainty of arrival latency but also causal inference during the exposure phase (Figure 4B).
2.2.5 Model simulation
Simulations with the best-fitting model revealed key factors that determine the degree of nonlinearity and asymmetry of cross-modal temporal recalibration to different adapter SOAs. The prior belief that the auditory and visual stimuli share a common cause plays a crucial role in adjudicating between the two causal scenarios (Figure 5A). When the observer infers that the audiovisual stimuli share the same cause, they recalibrate by a proportion of the perceived asynchrony no matter how large the measured asynchrony is, as in the fixed-update model. In contrast, when the observer infers that the audiovisual stimuli have separate causes, they treat the audiovisual stimuli as independent of each other and do not recalibrate. Estimates of the common-cause prior for all participants fell between these two extremes. Thus, all observers weighted the estimates from these two scenarios based on the scenarios’ probability, resulting in the nonlinear pattern of recalibration (see Table S1 for parameter estimates for individual participants).
Differences in arrival-time uncertainty between audition and vision result in an asymmetry of audiovisual temporal recalibration across adapter SOAs (Figure 5B). The amount of recalibration is attenuated when the modality with less uncertainty lags during the exposure phase. When the lagging stimulus is less uncertain, the perceptual system is more likely to attribute the SOA to separate causes and thus recalibrate less. In addition, the initial audiovisual bias does not affect asymmetry, but shifts the recalibration function horizontally and determines the SOA for which no recalibration occurs (Figure S8).
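Both simulation results can be reproduced qualitatively with a stripped-down version of the model (formalized in Methods); in the Python sketch below, all parameter values are arbitrary choices for illustration, not fitted estimates, and zero-mean Gaussian SOA priors are assumed:

```python
import numpy as np

rng = np.random.default_rng(2)
GRID = np.linspace(-1.5, 1.5, 1501)   # SOA grid (s)
DS = GRID[1] - GRID[0]

def ci_estimate(m, tau_a, tau_v, p_common, sigma1=0.06, sigma2=1.0):
    """Causal-inference SOA estimate for one measurement m."""
    like = np.where(m - GRID >= 0, np.exp(-(m - GRID) / tau_a),
                    np.exp((m - GRID) / tau_v)) / (tau_a + tau_v)
    proto1 = like * np.exp(-GRID**2 / (2 * sigma1**2)) / sigma1  # C = 1
    proto2 = like * np.exp(-GRID**2 / (2 * sigma2**2)) / sigma2  # C = 2
    p1 = proto1.sum() * DS * p_common
    p2 = proto2.sum() * DS * (1 - p_common)
    w = p1 / (p1 + p2)                  # posterior probability of C = 1
    return w * GRID[np.argmax(proto1)] + (1 - w) * GRID[np.argmax(proto2)]

def recalibration(adapter_soa, p_common, tau_a, tau_v,
                  alpha=0.002, n_trials=250):
    """Cumulative bias shift after one simulated exposure phase."""
    delta = 0.0
    for _ in range(n_trials):
        m = adapter_soa + delta + rng.exponential(tau_a) - rng.exponential(tau_v)
        delta += alpha * (ci_estimate(m, tau_a, tau_v, p_common) - m)
    return delta

# Nonlinearity: the common-cause prior interpolates between no recalibration
# (p_common = 0) and fixed-update-like recalibration (p_common = 1).
for pc in (0.0, 0.5, 1.0):
    print(f"p_common = {pc}: shift = {recalibration(0.3, pc, 0.06, 0.09):+.3f} s")

# Asymmetry: compare auditory-lead (-0.3 s) and visual-lead (+0.3 s) adapters
# when auditory latency is less variable (tau_a < tau_v).
print(recalibration(-0.3, 0.5, 0.06, 0.09), recalibration(0.3, 0.5, 0.06, 0.09))
```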
3 Discussion
In this study, we examined audiovisual temporal recalibration by repeatedly exposing participants to various stimulus-onset asynchronies and measuring perceived audiovisual relative timing before and after exposure. To further understand the mechanisms underlying audiovisual temporal recalibration, we assessed the efficacy of different models of the recalibration process in predicting the amount of recalibration as a function of the audiovisual asynchrony to which one is exposed. Our findings indicate that a Bayesian causal-inference model with modality-specific uncertainty best captured the two key features of cross-modal temporal recalibration: the nonlinear increase of recalibration magnitude with increasing adapted audiovisual asynchrony, and the asymmetrical recalibration magnitude between auditory- and visual-lead adapters with the same absolute asynchrony. Our results indicate that human observers employ causal-inference-based percepts to recalibrate cross-modal temporal perception.
In cross-modal recalibration, causal inference effectively serves as a credit-assignment mechanism, evaluating to what extent the source of a discrepancy is external (i.e., the stimuli from the two modalities occurred at different times in the world) or internal (i.e., the measurement of asynchrony resulted from internal miscalibration). The perceptual system should correct for errors if they are due to misalignment between the senses. It should not recalibrate if two independent events, such as a TV screen and the neighbors’ conversation, exhibit audiovisual asynchrony. The same principle extends beyond temporal recalibration: causal inference also influences cross-modal spatial recalibration (Badde, Navarro, & Landy, 2020; Hong et al., 2021; Sato et al., 2007; Wozny & Shams, 2011a, 2011b). Similarly, in sensorimotor adaptation, humans correct for motor errors that are likely due to the motor system, but not for those attributed to the environment (Berniker & Kording, 2008; Wei & Körding, 2009).
Previous investigations into the mechanisms behind audiovisual temporal recalibration have proposed various models. These models describe recalibration as a selective reduction of response gain for the adapted asynchrony in a population code (Roach et al., 2011), a shift of latency or response criteria (Yarrow et al., 2015), changes in temporal discriminability (Roseboom et al., 2015), or an update of the prior and likelihood function (Sato & Aihara, 2011). However, a common feature of these studies is that they examine the recalibration process within a relatively narrow range of adapted audiovisual asynchrony. This approach rests on the assumption that the audiovisual stimuli are perceived as originating from the same source, which holds only for small asynchronies. Our model goes beyond this limitation by incorporating causal inference, which extends the model’s applicability across a wider range of audiovisual asynchrony.
In addition to the nonlinear pattern of temporal recalibration, our results revealed a significant asymmetry in how much participants recalibrated to visual- vs. auditory-lead stimuli. The majority of our participants showed larger recalibration effects in response to auditory- than visual-lead asynchrony, in line with previous studies (O’Donohue et al., 2022). Simulation results supported the idea that this asymmetry could be due to less uncertainty in auditory arrival latency, in line with psychophysical studies (reviewed by Stauffer et al., 2012) that found that audition has better temporal sensitivity than vision. However, our findings also highlighted individual differences: a few participants showed the opposite pattern, as has also been reported previously (Fujisaki et al., 2004). We speculate that, in certain conditions, visual arrival latency might be less variable than auditory latency, for example when the auditory signal is influenced by environmental factors such as echoes. Accordingly, our model explains how temporal uncertainty, determined by both the precision of the perceptual system and the temporal variability of the physical stimulus, can lead to different directions of asymmetry in audiovisual temporal recalibration.
The principle of causal inference in audiovisual temporal recalibration is likely to extend to rapid cross-modal temporal recalibration, which occurs following the exposure to a single and brief audiovisual asynchrony (Van der Burg et al., 2013). However, it is an open question whether modality-specific uncertainty can explain the asymmetry of rapid cross-modal temporal recalibration. The pattern of asymmetry for rapid temporal recalibration differs from that of cumulative recalibration; in rapid recalibration, recalibration magnitude and the range in which recalibration occurs is larger when vision leads than when audition leads (Van der Burg et al., 2013, 2015). Such findings suggest that the mechanisms behind rapid and cumulative temporal recalibration may differ fundamentally. Supporting this, recent neuroimaging research has revealed distinct underlying neurophysiological processes. Cumulative temporal recalibration induces gradual phase shifts of entrained neural oscillations in the auditory cortex (Kösem et al., 2014), whereas rapid recalibration relies on phase-frequency coupling that happens at a faster time scale (Lennert et al., 2021).
In sum, we found that causal inference with modality-specific uncertainty modulates audiovisual temporal recalibration. This finding suggests that cross-modal temporal recalibration is more complex than a compensatory mechanism for maintaining accuracy or consistency. It relies on causal inference that considers both the sensory and causal uncertainty of multisensory inputs. Cross-modal temporal recalibration is typically viewed as an early-stage, low-level perceptual process. Our findings refine this view, suggesting that it is deeply intertwined with higher cognitive functions.
4 Methods
4.1 Participants
Ten students from New York University (three males; age: 24.4 ± 1.77 years; all right-handed) participated in the experiment. They all reported normal or corrected-to-normal vision. All participants provided informed written consent before the experiment and received $15/hr as monetary compensation. The study was conducted in accordance with the guidelines laid down in the Declaration of Helsinki and approved by the New York University institutional review board. The data of one participant were identified as outlying and therefore excluded from further analysis (Figure S4).
4.2 Apparatus and stimuli
Participants completed the experiments in a dark, semi-sound-attenuated room. They were seated 1 m from an acoustically transparent, white screen (1.36 × 1.02 m, 68 × 52° visual angle) and placed their head on a chin rest. An LCD projector (Hitachi CP-X3010N, 1024 × 768 pixels, 60 Hz) was mounted above and behind participants to project visual stimuli onto the screen. The visual stimulus was a high-contrast (36.1 cd/m2) Gaussian blob (SD: 3.6°) on a gray background (10.2 cd/m2). The auditory stimulus was a 500 Hz beep (50 dB SPL) played by a loudspeaker located behind the center of the screen. The visual and auditory stimulus durations were 33.33 ms. We adjusted the timing of audiovisual stimulus presentations and verified the timing using an oscilloscope (PICOSCOPE 2204A).
4.3 Procedure
The experiment consisted of nine sessions, which took place on nine separate days. In each session, participants completed a pre-test, an exposure phase, and a post-test in sequence. The adapter SOA was fixed within a session, but varied across sessions (±700, ±300, ±200, ±100, 0 ms). The intensities of the oddball stimuli were determined prior to the experiment for each participant using an intensity-discrimination task to equate the difficulty of detecting oddball stimuli between participants and across modalities.
4.3.1 Pre-test phase
Participants completed a TOJ task during the pre-test phase. Each trial started with the display of a fixation cross (0.1–0.2 s, uniform distribution), followed by a blank screen (0.4–0.6 s, uniform distribution). Then, an auditory and a visual stimulus (0.033 s) were presented with a variable SOA. There were a total of 15 possible test SOAs (from -0.5 to 0.5 s in steps of 0.05 s), with positive values representing visual lead and negative values representing auditory lead. Following stimulus presentation there was another blank screen (0.4–0.6 s, uniform distribution), and then a response probe appeared on the screen. Participants indicated by button press whether the auditory stimulus occurred before the visual stimulus, occurred after, or the two were simultaneous. There was no time limit for the response, and response feedback was not provided. The inter-trial interval (ITI) was 0.2–0.4 s. Each test SOA was presented 20 times in pseudo-randomized order, resulting in 300 trials in total, divided into five blocks. Participants usually took around 15 minutes to finish the pre-test phase.
4.3.2 Exposure phase
Participants completed an oddball-detection task during the exposure phase. In each trial, participants were presented with an audiovisual stimulus pair with a fixed SOA (adapter SOA). In 10% of trials, the intensity of either the visual or the auditory component (or both) was greater than in the other trials. Participants were instructed to press a button as soon as possible when there was an auditory oddball, a visual oddball, or both stimuli were oddballs. The task timing was almost identical to the TOJ task, except that there was a response time limit of 1.4 s. The visual and auditory oddball stimuli were presented to participants prior to the exposure phase and they practiced as much as they needed to familiarize themselves with the task. There were a total of 250 trials, divided into five blocks. At the end of each block, we presented a performance summary with the hit and false alarm rates for each modality. Participants usually took 15 minutes to complete the exposure phase.
4.3.3 Post-test phase
Participants completed the TOJ task as well as the oddball-detection task during the post-test phase. Specifically, each temporal-order judgment was preceded by three top-up (oddball-detection) trials. The adapter SOA in the top-up trials was the same as that in the exposure phase to prevent dissipation of temporal recalibration (Machulla et al., 2012). To facilitate task switching, the ITI between the last top-up trial and the following TOJ trial was longer (with the additional time jittered around 1 s). Additionally, the fixation cross became red to signal the start of a TOJ trial. As in the pre-test phase, there were 300 TOJ trials (15 test SOAs × 20 repetitions) with the addition of 900 top-up trials, grouped into six blocks. At the end of each block, we provided a summary of the oddball-detection performance. Participants usually took around 1 hour to complete the post-test phase.
4.3.4 Intensity-discrimination task
This task was conducted to estimate the just-noticeable-difference (JND) in intensity for a standard visual stimulus with a luminance of 36.1 cd/m2, and a standard auditory stimulus with a volume of 40 dB SPL. The task was two-interval, forced-choice. The trial started with a fixation (0.1–0.2 s) and a blank screen (0.4–0.6 s). Participants were presented with a standard stimulus in one randomly selected interval (0.4–0.6 s) and a comparison stimulus in the other interval (0.4–0.6 s), temporally separated by an inter-stimulus interval (0.6–0.8 s). They indicated which interval contained the brighter/louder stimulus without time constraint. Seven test stimulus levels (luminance range: 5%–195%; volume range: 50%–150% of the standard) were repeated 20 times, resulting in 140 trials for each task. We fit a Gaussian cumulative distribution function to these data and defined the JND as the intensity difference for which the test stimulus was chosen 90% of the time as more intense than the standard. An oddball was defined as an auditory or visual stimulus with an intensity 1 JND above the standard intensity.
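A minimal sketch of the JND estimation (in Python), with invented response proportions; the standard intensity is normalized to 1.0:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

STANDARD = 1.0  # standard intensity, normalized

# Hypothetical 2IFC data: comparison level (relative to the standard) and
# proportion of trials on which the comparison was judged more intense.
levels = np.array([0.05, 0.35, 0.675, 1.0, 1.325, 1.65, 1.95])
p_comp = np.array([0.02, 0.10, 0.30, 0.52, 0.73, 0.90, 0.98])

def cum_gauss(x, mu, sigma):
    return norm.cdf(x, loc=mu, scale=sigma)

(mu, sigma), _ = curve_fit(cum_gauss, levels, p_comp, p0=(STANDARD, 0.5))

# JND: intensity difference at which the comparison is chosen 90% of the
# time; the oddball intensity is then standard + 1 JND.
jnd = norm.ppf(0.9, loc=mu, scale=sigma) - STANDARD
print(f"JND = {jnd:.2f} x standard intensity")
```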
4.4 Modeling
In this section, we use the best-fitting model that combines causal inference with modality-specific uncertainty as a template, and then describe how the alternative models differ from this. We start by describing how the arrival latencies of auditory and visual stimuli lead to noisy internal measurements of audiovisual SOA, followed by how we modeled the process of audiovisual temporal recalibration. Then, we provide a formalization of the TOJ task administered in the pre- and the post-test phases, data from which were used to constrain the model parameters. Finally, we describe how the models were fit to the data.
4.4.1 Measurements of audiovisual stimulus-onset asynchrony
When an audiovisual stimulus pair with a stimulus-onset asynchrony, s = tA − tV, is presented, it triggers auditory and visual signals that are registered with different latencies in the region of cortex where audiovisual comparisons are made. This leads to two internal measurements in an observer’s brain. As in previous work (García-Pérez & Alcalá-Quintana, 2012), we model the probability of the latency of auditory and visual signals across repetitions, relative to the physical onset of each stimulus, as shifted exponential distributions (Figure 3A). These distributions may be shifted relative to the physical stimulus onset due to internal signal delays (denoted βV and βA). The arrival latency of the auditory signal relative to onset tA is the sum of the fixed delay, βA, and an additional delay that is exponentially distributed with time constant τA, and similarly for the visual arrival latency (with delay βV and time constant τV).
The measured SOA of the audiovisual stimulus pair is modeled as the difference in the arrival latency of both stimuli. Thus, the measured audiovisual SOA m includes the physical SOA s, the fixed latency (i.e., the difference between the auditory and visual fixed latency) β = βA − βV, and a stochastic component. Given that both latency distributions are shifted exponential distributions, a noisy sensory measurement of SOA m given a physical SOA s has a probability density that is an asymmetric double-exponential (Figure 6A):

$$
p(m \mid s) = \frac{1}{\tau_A + \tau_V}
\begin{cases}
\exp\left(-\frac{m - s - \beta}{\tau_A}\right), & m \ge s + \beta\\[4pt]
\exp\left(\frac{m - s - \beta}{\tau_V}\right), & m < s + \beta.
\end{cases}
\tag{1}
$$
The mode of this measurement distribution is the physical SOA plus the fixed latency s + β. A negative value of β indicates faster auditory processing. The left and right spread of this measurement distribution depends on the uncertainty of the visual latency τV and auditory latency τA, respectively.
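For concreteness, a direct implementation of Eq. 1 (in Python), with arbitrary illustrative parameter values:

```python
import numpy as np

def p_measurement(m, s, beta, tau_a, tau_v):
    """Asymmetric double-exponential density of the measured SOA m (Eq. 1):
    mode at s + beta, right tail governed by tau_a, left tail by tau_v."""
    x = np.asarray(m, dtype=float) - s - beta
    return np.where(x >= 0, np.exp(-x / tau_a), np.exp(x / tau_v)) \
        / (tau_a + tau_v)

# The density should integrate to ~1 over a wide enough grid.
grid = np.linspace(-2, 2, 4001)
dens = p_measurement(grid, s=0.1, beta=-0.03, tau_a=0.06, tau_v=0.09)
print(f"integral ~ {dens.sum() * (grid[1] - grid[0]):.3f}")  # ~1.0
```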
4.4.2 The perceptual inference process
To infer the stimulus SOA s from the measurement m, the ideal observer initially computes the posterior distribution of the SOA s by multiplying the likelihood function and the prior for two causal scenarios. The auditory and visual stimuli can arise from a single cause (C = 1) or two independent causes (C = 2).
The ideal observer holds two prior distributions of audiovisual SOA, one for each causal scenario. In the case of a common cause (C = 1), the prior distribution of the SOA between sound and light is a narrow Gaussian distribution centered on synchrony (McGovern et al., 2016),

$$
\pi(s \mid C = 1) = \mathcal{N}(s;\, 0,\, \sigma_{C=1}^2).
$$

When there are two separate causes (C = 2), the prior distribution of the audiovisual SOA is a broad Gaussian distribution (McGovern et al., 2016), assigning almost equal probability to each audiovisual SOA,

$$
\pi(s \mid C = 2) = \mathcal{N}(s;\, 0,\, \sigma_{C=2}^2), \qquad \sigma_{C=2} \gg \sigma_{C=1}.
$$
The observer obtains intermediate estimates of the stimulus SOA by combining the measured SOA with the prior over SOA corresponding to the two causal scenarios, ŝC=1 and ŝC=2. In this model, we assume that the observer does not have access to, or chooses to ignore, the current temporal bias β.
The likelihood functions under the two causal scenarios are identical:

$$
p(m \mid s, C = 1) = p(m \mid s, C = 2) = \frac{1}{\tau_A + \tau_V}
\begin{cases}
\exp\left(-\frac{m - s}{\tau_A}\right), & s \le m\\[4pt]
\exp\left(-\frac{s - m}{\tau_V}\right), & s > m,
\end{cases}
$$
where the left and right spreads depend on auditory and visual uncertainties of the arrival latency. Because the likelihood function is non-Gaussian, there is no closed form for the intermediate estimate. We computed the posterior numerically and used a maximum-a-posteriori (MAP) estimator, i.e., ŝ was the mode of the posterior over stimulus SOA in each scenario.
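A sketch of this numerical computation (in Python); the zero-mean Gaussian priors follow the assumption above, and all parameter values are illustrative:

```python
import numpy as np

def double_exp(x, tau_a, tau_v):
    """Density of the latency-difference noise: right tail tau_a, left tau_v."""
    return np.where(x >= 0, np.exp(-x / tau_a), np.exp(x / tau_v)) \
        / (tau_a + tau_v)

def map_estimates(m, tau_a, tau_v, sigma1, sigma2,
                  grid=np.linspace(-1.5, 1.5, 3001)):
    """Numerical MAP estimates of the SOA under the common-cause (C = 1)
    and separate-causes (C = 2) scenarios, ignoring the bias beta."""
    like = double_exp(m - grid, tau_a, tau_v)           # likelihood over SOA
    proto1 = like * np.exp(-grid**2 / (2 * sigma1**2))  # protoposterior, C = 1
    proto2 = like * np.exp(-grid**2 / (2 * sigma2**2))  # protoposterior, C = 2
    return grid[np.argmax(proto1)], grid[np.argmax(proto2)]

s1, s2 = map_estimates(m=0.2, tau_a=0.06, tau_v=0.09, sigma1=0.06, sigma2=1.0)
print(f"s_hat(C=1) = {s1:.3f} s (shrunk), s_hat(C=2) = {s2:.3f} s (~= m)")
```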
The final estimate of the stimulus SOA ŝ depends on the posterior probability of each causal scenario. By Bayes rule, the posterior probability that an audiovisual stimulus pair with the measured SOA shares a common cause is

$$
P(C = 1 \mid m) = \frac{P(m \mid C = 1)\, P(C = 1)}{P(m \mid C = 1)\, P(C = 1) + P(m \mid C = 2)\, P(C = 2)}.
$$
The likelihood of a common source/separate sources for a fixed SOA measurement is calculated by numerically integrating the protoposterior (i.e., the unnormalized posterior),

$$
P(m \mid C) = \int p(m \mid s, C)\, \pi(s \mid C)\, ds.
$$
The posterior probability of a common cause additionally depends on the observer’s prior belief of a common cause for the auditory and visual stimuli, P (C = 1) = pcommon.
The final estimate of SOA is derived by model averaging: the final estimate is the average of the scenario-specific SOA estimates above, each weighted by the posterior probability of the corresponding causal scenario,

$$
\hat{s} = P(C = 1 \mid m)\, \hat{s}_{C=1} + \big(1 - P(C = 1 \mid m)\big)\, \hat{s}_{C=2}.
$$
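Continuing the sketch above (reusing `double_exp` and `map_estimates`), the scenario posterior and the model-averaged estimate can be computed as:

```python
import numpy as np

GRID = np.linspace(-1.5, 1.5, 3001)
DS = GRID[1] - GRID[0]

def final_estimate(m, tau_a, tau_v, sigma1, sigma2, p_common):
    """Model-averaged SOA estimate: scenario MAPs weighted by the posterior
    probability of the corresponding causal scenario."""
    like = double_exp(m - GRID, tau_a, tau_v)
    # Zero-mean Gaussian priors; the common 1/sqrt(2*pi) factor is dropped
    # because it cancels in the posterior ratio below.
    prior1 = np.exp(-GRID**2 / (2 * sigma1**2)) / sigma1
    prior2 = np.exp(-GRID**2 / (2 * sigma2**2)) / sigma2
    # Scenario likelihoods: numerically integrate each protoposterior over s.
    L1 = np.sum(like * prior1) * DS
    L2 = np.sum(like * prior2) * DS
    post_c1 = L1 * p_common / (L1 * p_common + L2 * (1 - p_common))
    s1, s2 = map_estimates(m, tau_a, tau_v, sigma1, sigma2, grid=GRID)
    return post_c1 * s1 + (1 - post_c1) * s2
```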
4.4.3 Formalization of recalibration in the exposure phase
We model the recalibration process as a shift of the audiovisual fixed latency β (the audiovisual temporal bias) after each encounter with an audiovisual stimulus pair (Figure 3B). The internal value of β reflects the observed point of subjective simultaneity (PSS), which is the stimulus SOA s that leads to a median measurement of audiovisual synchrony equal to m = 0. That is, it is the value of s such that P(m < 0 | s = PSS, β) = 0.5. A simple derivation yields

$$
\text{PSS} =
\begin{cases}
-\beta - \tau_A \ln \frac{2\tau_A}{\tau_A + \tau_V}, & \tau_A \ge \tau_V\\[4pt]
-\beta + \tau_V \ln \frac{2\tau_V}{\tau_A + \tau_V}, & \tau_A < \tau_V.
\end{cases}
$$
The shift of the audiovisual bias β also moves the measurement distribution. We assume the exponential time constants (τA, τV) remain unchanged across phases and sessions.
At the end of every exposure trial i, a discrepancy between the measured SOA, mi, and the final estimate of the stimulus SOA, ŝi, signals the need for recalibration. In each session, we assume the participant arrives with a default bias β. We define Δβ,i as the cumulative shift of the audiovisual bias after exposure trial i,

$$
\Delta_{\beta,i} = \Delta_{\beta,i-1} + \alpha\,(\hat{s}_i - m_i), \qquad \Delta_{\beta,0} = 0,
$$
where α is the learning rate. At the end of the exposure phase, the predicted audiovisual bias is thus shifted by the accumulated shifts across the exposure phase, that is, βpost = β + Δβ,250.
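A sketch of the exposure-phase accumulation (in Python), reusing `final_estimate` from the sketch above and implementing the update rule just given; note that the accumulated shift feeds back into subsequent measurements:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_exposure(adapter_soa, beta, alpha, tau_a, tau_v,
                      n_trials=250, **ci_params):
    """Cumulative shift of the audiovisual bias, Delta_beta,250."""
    delta = 0.0
    for _ in range(n_trials):
        # The accumulated shift moves the measurement distribution;
        # the time constants tau_a and tau_v stay fixed.
        m = adapter_soa + beta + delta \
            + rng.exponential(tau_a) - rng.exponential(tau_v)
        s_hat = final_estimate(m, tau_a, tau_v, **ci_params)
        delta += alpha * (s_hat - m)   # shift the measured SOA toward s_hat
    return delta                        # beta_post = beta + delta
```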
4.4.4 Formalization of the temporal-order-judgment task
In the TOJ task administered in the pre- and post-test phases, the observer makes a perceptual judgment by comparing the final estimate of the stimulus SOA ŝ to two internal criteria (Cary et al., 2024; García-Pérez & Alcalá-Quintana, 2012). We assume that the observer has a symmetric pair of criteria, ±c, centered on the stimulus SOA corresponding to perceptual simultaneity (ŝ = 0). In addition, the observer may lapse or make an error when responding. The probabilities of reporting visual lead, ΨV, auditory lead, ΨA, or that the two stimuli were simultaneous, ΨS, are thus

$$
\begin{aligned}
\Psi_V(s) &= \frac{\lambda}{3} + (1 - \lambda)\, P(\hat{s} \ge c \mid s),\\
\Psi_A(s) &= \frac{\lambda}{3} + (1 - \lambda)\, P(\hat{s} \le -c \mid s),\\
\Psi_S(s) &= \frac{\lambda}{3} + (1 - \lambda)\, P(-c < \hat{s} < c \mid s),
\end{aligned}
\tag{11}
$$
where λ is the lapse rate. Figure 6C shows an example of the resulting psychometric functions.
The probability distribution of causal-inference-based stimulus SOA estimates P (ŝ|s) has no closed form and can only be simulated. For each simulation we sampled 10,000 SOA measurements from the corresponding double-exponential probability distribution (Figure 6A). For each sampled measurement, we simulated the process by which the observer carries out causal inference and produced an estimate of the stimulus SOA (fixing the values of a few additional causal-inference model parameters). This process resulted in a Monte-Carlo approximation of the probability distribution of the causal-inference-based stimulus SOA estimates (Figure 6B).
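A sketch of this Monte-Carlo approximation (in Python), reusing `final_estimate` from above; the lapse term assumes uniform guessing among the three response options, matching Eq. 11:

```python
import numpy as np

rng = np.random.default_rng(3)

def toj_probabilities(s, beta, tau_a, tau_v, criterion, lapse,
                      n=10_000, **ci_params):
    """Monte-Carlo estimates of Psi_V, Psi_A, Psi_S for a test SOA s."""
    m = s + beta + rng.exponential(tau_a, n) - rng.exponential(tau_v, n)
    s_hat = np.array([final_estimate(mi, tau_a, tau_v, **ci_params)
                      for mi in m])
    p_v = np.mean(s_hat >= criterion)     # estimate above +c: "visual first"
    p_a = np.mean(s_hat <= -criterion)    # estimate below -c: "auditory first"
    p_s = 1.0 - p_v - p_a                 # within (-c, c): "simultaneous"
    return tuple(lapse / 3 + (1 - lapse) * p for p in (p_v, p_a, p_s))
```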
4.4.5 Alternative models
In the fixed-update model, observers measure the audiovisual SOA s by comparing the arrival latency of the auditory and visual signals (Eq. 1). They do not perform causal inference to estimate the SOA. Instead, the measured SOA is shifted toward zero by recalibrating the audiovisual bias β. Hence, the update of the audiovisual bias in trial i is defined by

$$
\Delta_{\beta,i} = \Delta_{\beta,i-1} + \alpha\,(0 - m_i) = \Delta_{\beta,i-1} - \alpha\, m_i.
$$
The update of audiovisual bias is accumulated across the exposure phase, βpost = β + Δβ,250. In TOJ tasks, observers make the temporal-order decision by applying the criteria to the measurement of SOA m (see psychometric functions in Supplement Eq. S1).
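A minimal, self-contained sketch of the fixed-update accumulation (in Python), using the same measurement model as above:

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_exposure_fixed(adapter_soa, beta, alpha, tau_a, tau_v,
                            n_trials=250):
    """Fixed-update model: every measured SOA is pushed toward zero,
    Delta_i = Delta_{i-1} - alpha * m_i, with no causal inference."""
    delta = 0.0
    for _ in range(n_trials):
        m = adapter_soa + beta + delta \
            + rng.exponential(tau_a) - rng.exponential(tau_v)
        delta -= alpha * m
    return delta   # grows roughly in proportion to the adapter SOA
```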
In models with modality-independent uncertainty, τA = τV, resulting in a symmetrical measurement distribution (Eq. 1).
4.4.6 Model fitting
Model log-likelihood
The model was fitted by maximizing likelihood. We fit the model to the TOJ data collected during the pre- and post-test phases of all sessions together. We did not collect temporal-order judgments in the exposure phase. However, to model the post-test data, we needed to estimate the distribution of shifts of audiovisual bias resulting from the exposure phase (Δβ,250). We did this using Monte Carlo simulation of the 250 exposure trials to estimate the probability distribution of the cumulative shifts.
The set of model parameters Θ is listed in Table 1. There are I sessions, each including K trials in the pre-test phase and K trials in the post-test phase. We denote the full dataset of pre-test data as Xpre and the post-test data as Xpost. On a given trial, the observer responds either auditory-first (A), visual-first (V), or simultaneous (S). We denote a single response using indicator variables that are equal to 1 if that was the response in that trial and 0 otherwise; for trial k in session i, these are r^A_{i,k}, r^V_{i,k}, and r^S_{i,k} for the pre-test trials, and analogous variables for the post-test trials. The log-likelihood of all pre-test responses Xpre given the model parameters is

$$
\log L(X_{\text{pre}} \mid \Theta) = \sum_{i=1}^{I} \sum_{k=1}^{K} \left[ r^{A}_{i,k} \log \Psi_{A,\text{pre}}(s_{i,k}) + r^{V}_{i,k} \log \Psi_{V,\text{pre}}(s_{i,k}) + r^{S}_{i,k} \log \Psi_{S,\text{pre}}(s_{i,k}) \right].
$$
The psychometric functions for the pre-test (e.g., ΨA,pre) are defined in Eq. 11, and are the same across all sessions as we assumed that the audiovisual delay β was the same before recalibration in every session.
The log-likelihood of responses in the post-test depends on the audiovisual bias after recalibration, βpost = β + Δβ,250,i for session i. To determine the log-likelihood of the post-test data requires us to integrate out the unknown value of the cumulative shift Δβ,250,i. We approximated this integral in two steps based on our previous work (Hong et al., 2021). First, we simulated the 250 exposure-phase trials 1,000 times for a given set of parameters Θ and session i. This resulted in 1,000 values of Δβ,250,i. The distribution of these values was well fit by a Gaussian whose parameters were determined by the empirical mean and standard deviation of the sample distribution, resulting in the approximating distribution N(μi, σi²). Second, we approximated the integral of the likelihood of the data over possible values of Δβ,250,i by numerical integration. We discretized the approximated distribution into 100 equally spaced bins centered on values Δβ,250,i(n) (n = 1, …, 100). The range of the bins was triple the range of the values from the Monte Carlo sample, extending one full sample range below the smallest and above the largest simulated value.
The log-likelihood of the post-test data is approximated as

$$
\log L(X_{\text{post}} \mid \Theta) \approx \sum_{i=1}^{I} \log \sum_{n=1}^{100} P\big(\Delta_{\beta,250,i}(n)\big)\, L_{i,n},
$$

where P(Δβ,250,i(n)) is the Gaussian probability mass assigned to bin n and

$$
L_{i,n} = \prod_{k=1}^{K} \Psi_{A,\text{post},i,n}(s_{i,k})^{\,r^{A,\text{post}}_{i,k}}\, \Psi_{V,\text{post},i,n}(s_{i,k})^{\,r^{V,\text{post}}_{i,k}}\, \Psi_{S,\text{post},i,n}(s_{i,k})^{\,r^{S,\text{post}}_{i,k}}
$$

is the likelihood of the session-i post-test responses given the bias shift in bin n. The psychometric functions in the post-test (e.g., ΨA,post,i,n) differ across sessions and bins because the simulated bias after recalibration βi,post depends on the adapter SOA fixed in session i and the simulation bin n.
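A sketch of this two-step approximation (in Python); `simulate_once` stands for one run of the exposure-phase simulation (e.g., `simulate_exposure` above with fixed parameters), and the bin-range construction follows our reading of the description in the text:

```python
import numpy as np
from scipy.stats import norm

def binned_shift_distribution(simulate_once, n_sim=1000, n_bins=100):
    """Gaussian approximation to the distribution of Delta_beta,250,
    discretized into equally spaced bins for numerical integration."""
    samples = np.array([simulate_once() for _ in range(n_sim)])
    mu, sd = samples.mean(), samples.std()
    r = samples.max() - samples.min()     # total bin range: 3x sample range
    edges = np.linspace(samples.min() - r, samples.max() + r, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    weights = norm.cdf(edges[1:], mu, sd) - norm.cdf(edges[:-1], mu, sd)
    return centers, weights / weights.sum()

# Schematically, for session i:
#   log L_i ~= log sum_n weights[n] * P(post-test data_i | shift = centers[n])
```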
Parameter estimation
We used the BADS toolbox (Acerbi & Ma, 2017) in MATLAB to optimize the model parameters, because it outperforms MATLAB’s fmincon as the number of parameters increases. We repeated each search 80 times with different, random starting points to reduce the risk of settling in a local minimum, and chose the parameter estimates with the maximum likelihood across the repeated searches.
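A minimal sketch of the multi-start strategy (in Python; `scipy.optimize.minimize` stands in for BADS here):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(11)

def fit_with_restarts(neg_log_likelihood, lower, upper, n_starts=80):
    """Repeat the search from random starting points within the bounds and
    keep the best fit, reducing the risk of settling in a local minimum."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    best = None
    for _ in range(n_starts):
        x0 = lower + rng.random(lower.size) * (upper - lower)
        res = minimize(neg_log_likelihood, x0,
                       bounds=list(zip(lower, upper)))
        if best is None or res.fun < best.fun:
            best = res
    return best  # best.x holds the maximum-likelihood parameter estimates
```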
References
- Practical Bayesian optimization for model fitting with Bayesian adaptive direct search. Proceedings of the 31st International Conference on Neural Information Processing Systems: 1834–1844.
- Information theory and an extension of the maximum likelihood principle. Selected Papers of Hirotugu Akaike. New York: Springer: 199–213.
- To integrate or not to integrate: Temporal dynamics of hierarchical Bayesian causal inference. PLoS Biol. 17.
- Sensory experience during early sensitive periods shapes cross-modal temporal biases. eLife 9.
- Modality-specific attention attenuates visual-tactile integration and recalibration effects by reducing prior expectations of a common source for vision and touch. Cognition 197.
- Keeping perception accurate. Trends Cogn. Sci. 3:4–11.
- Estimating the sources of motor errors for adaptation and generalization. Nat. Neurosci. 11:1454–1461.
- Visual-haptic adaptation is determined by relative reliability. J. Neurosci. 30:7714–7721.
- Causal inference in the multisensory brain. Neuron 102:1076–1087.
- Audiovisual simultaneity windows reflect temporal sensory uncertainty. Psychon. Bull. Rev.
- Recalibration of multisensory simultaneity: Cross-modal transfer coincides with a change in perceptual latency. J. Vis. 9:7–16.
- Causal inference accounts for heading perception in the presence of object motion. Proc. Natl. Acad. Sci. U. S. A. 116:9060–9065.
- Sensory transduction. Oxford University Press.
- Recalibration of audiovisual simultaneity. Nat. Neurosci. 7:773–778.
- On the discrepant results in synchrony judgment and temporal-order judgment tasks: A quantitative model. Psychon. Bull. Rev. 19:820–846.
- Time order as psychological bias. Psychol. Sci. 28:670–678.
- Recalibration of perceived time across sensory modalities. Exp. Brain Res. 185:347–352.
- The effect of exposure to asynchronous audio, visual, and tactile stimulus combinations on the perception of simultaneity. Exp. Brain Res. 186:517–524.
- Adaptation minimizes distance-related audiovisual delays. J. Vis. 7:5–8.
- Causal inference regulates audiovisual spatial recalibration via its influence on audiovisual perception. PLoS Comput. Biol. 17.
- No effect of auditory-visual spatial disparity on temporal recalibration. Exp. Brain Res. 182:559–565.
- Multisensory integration: Strategies for synchronization. Curr. Biol. 15:R339–41.
- Causal inference in multisensory perception. PLoS One 2.
- Encoding of event timing in the phase of neural oscillations. NeuroImage 92:274–284.
- Coupled oscillations enable rapid temporal recalibration to audiovisual asynchrony. Commun. Biol. 4.
- Temporal causal inference with stochastic audio-visual sequences. PLoS One 12.
- Multisensory simultaneity recalibration: Storage of the aftereffect in the absence of counterevidence. Exp. Brain Res. 217:89–97.
- Perceptual learning shapes multisensory causal inference via two distinct mechanisms. Sci. Rep. 6.
- Exposure to asynchronous audiovisual speech extends the temporal window for audiovisual integration. Brain Res. Cogn. Brain Res. 25:499–507.
- Musical training refines audiovisual integration but does not influence temporal recalibration. Sci. Rep. 12.
- Mindworks: Time and conscious experience. Harcourt Brace Jovanovich.
- Asynchrony adaptation reveals neural population code for audio-visual timing. Proc. Biol. Sci. 278:1314–1322.
- Asymmetries in visuomotor recalibration of time perception: Does causal binding distort the window of integration? Acta Psychol. 147:127–135.
- Sensory reliability shapes perceptual inference via two mechanisms. J. Vis. 15.
- Sensory adaptation for timing perception. Proc. Biol. Sci. 282.
- Perception of body ownership is driven by Bayesian sensory inference. PLoS One 10.
- A Bayesian model of sensory adaptation. PLoS One 6.
- Bayesian inference explains perception of unity and ventriloquism aftereffect: Identification of common sources of audiovisual stimuli. Neural Comput. 19:3335–3355.
- Multisensory integration: Maintaining the perception of synchrony. Curr. Biol.
- Auditory and visual temporal sensitivity: Evidence for a hierarchical structure of modality-specific and modality-independent levels of temporal information processing. Psychol. Res. 76:20–31.
- The change in perceptual synchrony between auditory and visual speech after exposure to asynchronous speech. Neuroreport 22:684–688.
- Rapid recalibration to audiovisual asynchrony. J. Neurosci. 33:14633–14637.
- Audiovisual temporal recalibration occurs independently at two different time scales. Sci. Rep. 5.
- When feeling is more important than seeing in sensorimotor adaptation. Curr. Biol. 12:834–837.
- Temporal recalibration during asynchronous audiovisual speech perception. Exp. Brain Res. 181:173–181.
- Audiovisual temporal adaptation of speech: Temporal order versus simultaneity judgments. Exp. Brain Res. 185:521–529.
- Perception of intersensory synchrony: A tutorial review. Atten. Percept. Psychophys. 72:871–884.
- Recalibration of temporal order perception by exposure to audio-visual asynchrony. Brain Res. Cogn. Brain Res. 22:32–35.
- Relevance of error: What drives motor adaptation? J. Neurophysiol. 101:655–664.
- Probability matching as a computational strategy used in perception. PLoS Comput. Biol. 6.
- Computational characterization of visually induced auditory spatial adaptation. Front. Integr. Neurosci. 5.
- Recalibration of auditory space following milliseconds of cross-modal discrepancy. J. Neurosci. 31:4607–4612.
- Bayesian calibration of simultaneity in audiovisual temporal order judgments. PLoS One 7.
- Shifts of criteria or neural timing? The assumptions underlying timing perception studies. Conscious. Cogn. 20:1518–1531.
- A model-based comparison of three theories of audiovisual temporal recalibration. Cogn. Psychol. 83:54–76.
- The best fitting of three contemporary observer models reveals how participants’ strategy influences the window of subjective synchrony. J. Exp. Psychol. Hum. Percept. Perform. 49:1534–1563.
- Multisensory calibration is independent of cue reliability. J. Neurosci. 31:13949–13962.
Copyright
© 2024, Li et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.