Abstract
To form a more reliable percept of the environment, the brain needs to estimate its own sensory uncertainty. Current theories of perceptual inference assume that the brain computes sensory uncertainty instantaneously and independently for each stimulus. We evaluated this assumption in four psychophysical experiments, in which human observers localized auditory signals that were presented synchronously with spatially disparate visual signals. Critically, the visual noise changed dynamically over time continuously or with intermittent jumps. Our results show that observers integrate audiovisual inputs weighted by sensory uncertainty estimates that combine information from past and current signals consistent with an optimal Bayesian learner that can be approximated by exponential discounting. Our results challenge leading models of perceptual inference where sensory uncertainty estimates depend only on the current stimulus. They demonstrate that the brain capitalizes on the temporal dynamics of the external world and estimates sensory uncertainty by combining past experiences with new incoming sensory signals.
Introduction
Perception has been described as a process of statistical inference based on noisy sensory inputs (Knill and Pouget, 2004; Knill and Richards, 1996). Key to this perceptual inference is the estimation and/or representation of sensory uncertainty (as measured by variance, i.e. the inverse of reliability/precision). Most prominently, in multisensory perception, a more reliable or ‘Bayes-optimal’ percept is obtained by integrating sensory signals that come from a common source weighted by their relative reliabilities with less weight assigned to less reliable signals. Likewise, sensory uncertainty shapes observers’ causal inference. It influences whether observers infer that signals come from a common cause and should hence be integrated or else be processed independently (Aller and Noppeney, 2019; Körding et al., 2007; Rohe et al., 2019; Rohe and Noppeney, 2015b; Rohe and Noppeney, 2015a; Rohe and Noppeney, 2016; Wozny et al., 2010; Acerbi et al., 2018). Indeed, accumulating evidence suggests that human observers are close to optimal in many perceptual tasks (though see Acerbi et al., 2014; Drugowitsch et al., 2016; Shen and Ma, 2016; Meijer et al., 2019) and weight signals approximately according to their sensory reliabilities (Alais and Burr, 2004; Ernst and Banks, 2002; Jacobs, 1999; Knill and Pouget, 2004; van Beers et al., 1999; Drugowitsch et al., 2014; Hou et al., 2019).
An unresolved question is how human observers compute their sensory uncertainty. Current theories and experimental approaches generally assume that observers access sensory uncertainty near-instantaneously and independently across briefly (≤200 ms) presented stimuli (Ma and Jazayeri, 2014; Zemel et al., 1998). At the neural level, theories of probabilistic population coding have suggested that sensory uncertainty may be represented instantaneously in the gain of the neuronal population response (Ma et al., 2006; Hou et al., 2019). Yet, in our natural environment, sensory noise often evolves at slow timescales. For instance, visual noise slowly varies when walking through a snowstorm. Observers may capitalize on the temporal dynamics of the external world and use the past to inform current estimates of sensory uncertainty. In this alternative account, more reliable estimates of sensory uncertainty would be obtained by combining past estimates with current sensory inputs as predicted by Bayesian learning.
To arbitrate between these two critical hypotheses, we presented observers with audiovisual signals in synchrony but with a small spatial disparity in a sound localization task. Critically, the spatial standard deviation (STD) of the visual signal changed dynamically over time continuously (experiments 1–3) or discontinuously (i.e. with intermittent jumps; experiment 4). First, we investigated whether the influence of the visual signal location on observers’ perceived sound location depended on the noise only of the current visual signal or also of past visual signals. Second, using computational modeling and Bayesian model comparison, we formally assessed whether observers update their visual uncertainty estimates consistent with (i) an instantaneous learner, (ii) an optimal Bayesian learner, or (iii) an exponential learner.
Results
In a spatial localization task, we presented participants with audiovisual signals in a series of four experiments, in which the physical visual noise changed dynamically over time either continuously or discontinuously (Figure 1). Visual (V) signals (clouds of 20 bright dots) were presented every 200 ms for a duration of 32 ms. The cloud’s horizontal STD varied over time at this temporal rate of 5 Hz either continuously (experiments 1–3) or discontinuously with intermittent jumps (experiment 4). The cloud’s location mean was temporally independently resampled from five possible locations (−10°, −5°, 0°, 5°, 10°) on each trial with the intertrial asynchrony jittered between 1.4 and 2.8 s. In synchrony with the change in the cloud’s mean location, the dots changed their color and a sound was presented (AV signal). The location of the sound was sampled from the two possible locations adjacent to the visual cloud’s mean location (i.e. ±5° AV spatial disparity). Participants localized the sound and indicated their response using five response buttons.
The small audiovisual disparity enabled an influence of the visual signal location on the perceived sound location as a function of visual noise (Alais and Burr, 2004; Battaglia et al., 2003; Meijer et al., 2019). As a result, observers’ visual uncertainty estimate could be quantified in terms of the relative weight of the auditory signal on the perceived sound location with a greater auditory weight indicating that observers estimated a greater visual uncertainty.
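As a point of reference, the standard reliability-weighted (forced-fusion) integration rule makes the link between visual variance and auditory weight explicit. The expression below is a sketch that assumes a common cause and independent Gaussian noise; the symbols $\hat{S}_{AV}$, $x_A$, $x_V$, $\sigma_A$ and $\sigma_V$ are introduced here for illustration, and the models fitted in this study additionally include causal inference.

```latex
\hat{S}_{AV} = w_A \, x_A + (1 - w_A) \, x_V, \qquad
w_A = \frac{1/\sigma_A^2}{1/\sigma_A^2 + 1/\sigma_V^2}
    = \frac{\sigma_V^2}{\sigma_A^2 + \sigma_V^2}
```

Under these assumptions, a larger estimated visual variance $\sigma_V^2$ yields a larger auditory weight $w_A$.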
In the first three experiments, we used continuous sequences, where the visual cloud’s STD changed periodically according to a sinusoid (n = 25; period = 30 s), a random walk (RW1; n = 33; period = 120 s) or a smoothed random walk (RW2; n = 19; period = 30 s; Figure 2). In an additional fourth experiment, we inserted abrupt increases or decreases into a sinusoidal evolution of the visual cloud’s STD (n = 18, period = 30 s, Figure 5). We will first describe the results for the three continuous sequences followed by the discontinuous sequence.
We assigned the sound localization responses and the associated physical visual noise (i.e. the cloud’s STD) to 20 (resp. 15 for experiment 4) temporally adjacent bins covering the entire period of each of the three sequences. Each experiment repeated the same 30 s (Sin, RW2) or 120 s (RW1) period throughout the experiment resulting in ~32 periods for the RW1 and ~130 periods for the Sin and RW2 sequences. The trial and hence sound onsets were jittered with respect to this periodic evolution of the visual cloud’s STD resulting in a greater effective sampling rate than expected for an intertrial asynchrony of 1.4–2.8 s. In total, we assigned at least 44–87 trials to each bin (Supplementary file 1-Table 1). We quantified the auditory and visual influence on observers’ perceived auditory location for each bin based on regression models (separately for each of the 20 temporally adjacent bins). For instance, for bin = 1 we computed:
$$\text{R}_{\text{A},\text{trial},\text{bin}=1} = \text{L}_{\text{A},\text{trial},\text{bin}=1} \cdot \text{ß}_{\text{A},\text{bin}=1} + \text{L}_{\text{V},\text{trial},\text{bin}=1} \cdot \text{ß}_{\text{V},\text{bin}=1} + \text{ß}_{\text{const},\text{bin}=1} + \text{e}_{\text{trial},\text{bin}=1} \qquad (1)$$
with $\text{R}_{\text{A},\text{trial},\text{bin}=1}$ = localization response for a given trial in bin 1; $\text{L}_{\text{A},\text{trial},\text{bin}=1}$ or $\text{L}_{\text{V},\text{trial},\text{bin}=1}$ = ‘true’ auditory or visual location for that trial in bin 1; $\text{ß}_{\text{A},\text{bin}=1}$ or $\text{ß}_{\text{V},\text{bin}=1}$ = auditory or visual weight for bin 1; $\text{ß}_{\text{const},\text{bin}=1}$ = constant term; $\text{e}_{\text{trial},\text{bin}=1}$ = error term. For each bin, we thus obtained one auditory and one visual weight estimate. The relative auditory weight for a particular bin was computed as w_{A,bin} = ß_{A,bin} / (ß_{A,bin} + ß_{V,bin}).
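The per-bin regression and the relative auditory weight can be sketched as follows. This is a minimal illustration using ordinary least squares on synthetic data; the function name `relative_auditory_weight`, the simulated responses and the noise level are assumptions made for the example and are not taken from the study.

```python
import numpy as np

def relative_auditory_weight(resp, loc_a, loc_v):
    """Fit resp = b_A*loc_a + b_V*loc_v + b_const + error by least squares
    and return w_A = b_A / (b_A + b_V) for the trials of one bin."""
    X = np.column_stack([loc_a, loc_v, np.ones(len(resp))])
    (b_a, b_v, _), *_ = np.linalg.lstsq(X, resp, rcond=None)
    return b_a / (b_a + b_v)

# Synthetic trials for a single bin (made-up data, not from the study)
rng = np.random.default_rng(0)
loc_v = rng.choice([-10.0, -5.0, 0.0, 5.0, 10.0], size=500)
loc_a = loc_v + rng.choice([-5.0, 5.0], size=500)   # +/-5 deg AV disparity
true_w_a = 0.7                                       # hypothetical ground truth
resp = true_w_a * loc_a + (1 - true_w_a) * loc_v + rng.normal(0.0, 1.0, size=500)

w_a = relative_auditory_weight(resp, loc_a, loc_v)
print(f"estimated relative auditory weight: {w_a:.3f}")
```

Applying the same function to the trials of each bin in turn would give one w_{A,bin} value per bin.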
Figure 2 and Figure 3 show the temporal evolution of the STD of the physical visual noise and observers’ relative auditory weight indices w_{A,bin}. If observers estimate sensory uncertainty instantaneously, observers’ relative auditory weight indices should closely track the visual cloud’s STD (Figure 2). By contrast, we observed systematic biases: while the temporal evolution of the physical visual noise was designed to be symmetrical for each time period, we observed a temporal asymmetry for w_{A} in all three experiments. For the monotonic sinusoidal sequence, w_{A} was smaller for the 1st half of each period, when visual noise increased, than for the 2nd half, when visual noise decreased over time (Figure 3A). For the nonmonotonic RW1 and RW2 sequences, we observed more complex temporal profiles, because the visual noise increased and decreased in each half. Here, w_{A} was larger for increasing visual noise in the 1st as compared to the 2nd half, while w_{A} was smaller for decreasing visual noise in the 1st as compared to the 2nd half (Figure 3B, C). These impressions were confirmed statistically in 2 (1st vs. flipped 2nd half) x 9 (bins) repeated measures ANOVAs (Table 1) showing a significant main effect of the 1st versus flipped 2nd half period for the sinusoidal (F(1, 24)=12.162, p=0.002, partial η^{2} = 0.336) and the RW1 sequence (F(1, 32)=14.129, p<0.001, partial η^{2} = 0.306). For the RW2 sequence, we observed a significant interaction (F(4.6, 82.9)=3.385, p=0.010, partial η^{2} = 0.158), because the visual noise did not change monotonically within each half period. Instead, monotonic increases and decreases in visual noise alternated at nearly double the frequency in RW2 as compared to RW1.
The asymmetry in the auditory weights’ time course across the three experiments suggested that the visual noise in the past influenced observers’ current visual uncertainty estimate resulting in smaller auditory weights for ascending visual noise and greater auditory weights for descending visual noise.
To further investigate the influence of past visual noise on observers’ auditory weights, we estimated a regression model in which the relative auditory weights w_{A} for each of the 20 bins were predicted by the visual STD in the current bin and the difference in STD between the current and the previous bin (see Equation 2). Indeed, both the current visual STD (p<0.001 for all three sequences; Sinusoid: t(24) = 15.767, Cohen’s d = 3.153; RW1: t(32) = 15.907, Cohen’s d = 2.769; RW2: t(18) = 12.978, Cohen’s d = 2.977, two-sided one-sample t-test against zero) and the difference in STD between the current and the previous bin (Sinusoid: t(24) = −3.687, p=0.001, Cohen’s d = −0.737; RW1: t(32) = −2.593, p=0.014, Cohen’s d = −0.451; RW2: t(18) = −2.395, p=0.028, Cohen’s d = −0.549) significantly predicted observers’ relative auditory weights (for complementary results of nested model comparisons see Appendix 1 and Supplementary file 1-Table 5). Collectively, these results suggest that observers’ visual uncertainty estimates (as indexed by the relative auditory weights w_{A}) depend not only on the current sensory signal, but also on the recent history of the sensory noise. These results were also validated in a control analysis that regressed out and thus accounted for potential influences of the previous visual location on observers’ sound localization, suggesting that the effects of past visual uncertainty cannot be explained by effects of past visual location mean (Appendix 1, Figure 2—figure supplement 1, Supplementary file 1-Tables 2–4).
To characterize how human observers use information from the past to estimate current sensory uncertainty, we compared three computational models that differed in how visual uncertainty is learnt over time (Figure 4): Model 1, the instantaneous learner, estimates visual uncertainty independently for each trial as assumed by current standard models. Model 2, the optimal Bayesian learner, estimates visual uncertainty by updating the prior uncertainty estimate obtained from past visual signals with the uncertainty estimate from the current signal. Model 3, the exponential learner, estimates visual uncertainty by exponentially discounting past uncertainty estimates. All three models account for observers’ uncertainty about whether auditory and visual signals were generated by common or independent sources by explicitly modeling the two potential causal structures (Körding et al., 2007) underlying the audiovisual signals (n.b. only the model component pertaining to the ‘common cause’ case is shown in Figure 1B, for the full model see Figure 1—figure supplement 1). Models were fit individually to observers’ data by sampling from the posterior over parameters for each observer (Table 2).
We compared the three models in a fixed and random effects analysis (Penny et al., 2010; Rigoux et al., 2014) using the Watanabe-Akaike information criterion (WAIC) as appropriate for evaluating model samples (Gelman et al., 2014) (i.e. a low WAIC indicates a better model, a difference greater than 10 is considered very strong evidence for a model). In the fixed-effects analysis (see Table 2 for details), the Bayesian learner was substantially better than the instantaneous learner across all three experiments, but outperformed the exponential learner reliably only in the sinusoidal sequence. Likewise, the random-effects analysis based on hierarchical Bayesian model selection (Penny et al., 2010; Rigoux et al., 2014) showed a protected exceedance probability that was substantially greater for the Bayesian learner (Sin, RW2) or the exponential learner (RW1, RW2) than for the instantaneous learner (Figure 4F). However, the direct comparison between the Bayesian and the exponential learner did not provide consistent results across experiments. As shown in Figure 4A and B, both the Bayesian and the exponential learner accurately reproduced the temporal asymmetry for the auditory weights across all three experiments.
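For readers unfamiliar with WAIC, the following sketch computes it from a matrix of pointwise log-likelihoods evaluated at posterior samples, using the deviance-scale definition in Gelman et al. (2014). Whether the authors used exactly this variant is an assumption; the function name `waic` and the toy input are illustrative only.

```python
import numpy as np

def waic(log_lik):
    """WAIC on the deviance scale (lower is better).

    log_lik: array of shape (n_posterior_samples, n_observations) holding the
    log-likelihood of each observation under each posterior sample.
    """
    log_lik = np.asarray(log_lik, dtype=float)
    n_samples = log_lik.shape[0]
    # Log pointwise predictive density via a numerically stable log-mean-exp
    m = log_lik.max(axis=0)
    lppd = np.sum(m + np.log(np.exp(log_lik - m).sum(axis=0) / n_samples))
    # Effective number of parameters: posterior variance of the log-likelihood
    p_waic = np.sum(log_lik.var(axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)

# Toy input: 10 posterior samples, 4 observations, identical log-likelihoods
toy = np.full((10, 4), -1.0)
print(waic(toy))
```

A ΔWAIC between two models is then the difference of their `waic` values computed on the same observations.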
From the optimal Bayesian learner, we inferred observers’ estimated rate of change in visual reliability (i.e. parameter $1/\kappa$). The sinusoidal sequence was estimated to change at a faster pace (median κ = 7.4 across observers, 95% confidence interval (CI) [4.8, 10.8], estimated via bootstrapping) than the RW1 sequence (median κ = 8.1, 95% CI [7.0, 14.9]), but slower than the RW2 sequence (median κ = 6.7, 95% CI [4.4, 11.2]) indicating that the Bayesian learner accurately inferred that visual reliability changed at a different pace across the three continuous sequences (see legend of Figure 2). Likewise, the learning rates $1-\gamma$ of the exponential learner accurately reflect the different rates of change across the sequences (Sinusoid: $\gamma =$ 0.23, 95% CI [0.14, 0.28]; RW1: $\gamma =$ 0.33, 95% CI [0.21, 0.38]; RW2: $\gamma =$ 0.25, 95% CI [0.21, 0.29]). Both the Bayesian and the exponential learner thus estimated a smaller rate of change for the RW1 than for the sinusoidal sequence – although caution needs to be applied when interpreting these results given the extensive confidence intervals. Further, the learning rates of the exponential learner imply that observers gave the visual signals presented 4.1 (Sinusoid), 5.4 (RW1), and 4.3 (RW2) seconds before the current stimulus 5% of the weight they assigned to the current visual signal to estimate the visual reliability.
To further disambiguate between the Bayesian and the exponential learner, we designed a fourth experimental ‘jump sequence’ that introduced abrupt increases or decreases in physical visual noise at three positions into the sinusoidal sequence (Figure 5A). Using the same analysis approach as for experiments 1–3, we replicated the temporal asymmetry for the auditory weights (Figure 5B). For all three ‘jump positions’, w_{A} was significantly smaller for the 1st half of each period, when visual noise increased, than for the 2nd half, when visual noise decreased over time. The 3 (jump positions) x 2 (1st vs. flipped 2nd half) x 7 (bins) repeated measures ANOVA showed a significant main effect of 1st versus flipped 2nd period’s half (F(1,17) = 24.824, p<0.001, partial η^{2} = 0.594), while this factor was not involved in any higher-order interaction (see Table 1). Further, in a regression model the current visual STD (t(17) = 11.655, p<0.001, Cohen’s d = 2.747) and the difference between current and previous STD (t(17) = −4.768, p<0.001, Cohen’s d = −1.124) significantly predicted the relative auditory weights. Thus, we replicated our finding that the visual noise in the past influenced observers’ current visual uncertainty estimate as indexed by the relative auditory weights w_{A}.
Bayesian model comparison using a fixed-effects analysis showed that both the Bayesian learner and the exponential learner substantially outperformed the instantaneous learner (see Table 2). However, consistent with our Bayesian model comparison results for the continuous sequences, the Bayesian learner did not provide a better explanation for observers’ responses than the exponential learner (ΔWAIC = +2, see Table 2, Figure 5C and Figure 5—figure supplement 1A). Likewise, a random-effects analysis based on hierarchical Bayesian model selection showed that the Bayesian and the exponential learners outperformed the instantaneous learner, but again we were not able to adjudicate between the Bayesian and exponential learner (Figure 4F, see also methods and results in Appendix 1, Figure 5—figure supplement 2 and Supplementary file 1-Table 6 for further analyses justifying the choice of continuous learning models in the jump sequence).
In summary, across four experiments that used continuous and discontinuous sequences of visual noise, we have shown that the Bayesian or exponential learners outperform the instantaneous learner. However, across the four experiments we were not able to decide whether observers adapted to changes in visual noise according to a Bayesian or an exponential learner. The key feature that distinguishes between the Bayesian and the exponential learner is that only the Bayesian learner adapts dynamically based on its uncertainty about its visual reliability estimates. As a consequence, the Bayesian learner should adapt faster than the exponential learner to increases in physical visual noise (i.e. spread of the visual cloud) but slower to decreases in visual noise. From the Bayesian learner’s perspective, the faster learning for increases in visual noise emerges because it is unlikely that visual dots form a large spread cloud under the assumption that the true visual spread of the cloud is small. Conversely, the Bayesian learner will adapt more slowly to decreases in visual variance, because under the assumption of a visual cloud with a large spread visual dots may form a small cloud by chance. Indeed, previous research has shown that observers adapt their variance estimates faster for changes from small to large than for changes from large to small variance (Berniker et al., 2010). However, these results have been shown for learning about a hidden variable such as the prior that defines the spatial distribution from which an object’s location is sampled. In our study, we manipulated the variance of the likelihood, that is the variance of the clouds of dots.
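The asymmetry argument can be illustrated with a simple Gaussian likelihood calculation. The sketch below is not the authors' learner: it only quantifies how strongly a cloud of dots contradicts an assumed variance when the observed spread is four times larger versus four times smaller. The function name `surprise` and the factor of four are assumptions chosen for illustration.

```python
import numpy as np

def surprise(ratio, n_dots=20):
    """Log-likelihood deficit of an assumed variance sigma0^2 relative to the
    best-fitting variance, when n_dots zero-mean Gaussian samples have a mean
    square of s^2 = ratio * sigma0^2.

    Derived from ll(sigma^2) = -n/2 * (log(2*pi*sigma^2) + s^2 / sigma^2).
    """
    return 0.5 * n_dots * (ratio - 1.0 - np.log(ratio))

up = surprise(4.0)     # observed variance four times larger than assumed
down = surprise(0.25)  # observed variance four times smaller than assumed
print(f"increase: {up:.2f} nats, decrease: {down:.2f} nats")
```

An unexpectedly large cloud is stronger evidence against the current variance estimate than an equally unexpected small cloud, which is why a learner that tracks its own uncertainty would be expected to update faster after increases in noise.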
Asymmetric differences in adaptation rate between the exponential and the Bayesian learner should thus be amplified if we increase observer’s uncertainty about its visual reliability estimate by reducing the number of dots of the visual cloud from 20 to 5 dots. Based on simulations, we therefore explored whether we could experimentally discriminate between the Bayesian and exponential learner using continuous sinusoidal or discontinuous ‘jump’ sequences with visual clouds of only five dots. For the two sequences, we simulated the sound localization responses of 12 observers based on the Bayesian learner model and fitted the Bayesian and exponential learner models to the responses of each simulated Bayesian observer. Figure 6 shows observers’ auditory weights indexing their estimated visual reliability across time that we obtained from the fitted responses of the Bayesian (blue) and the exponential learner (green). The simulations reveal the characteristic differences in how the Bayesian and the exponential learner adapt their visual uncertainty estimates to increases and decreases in visual noise. As expected, the Bayesian learner adapts its visual uncertainty estimates faster than the exponential learner to increases in visual noise, but slower to decreases in visual noise. Nevertheless, these differences are relatively small, so that the difference in mean log likelihood between the Bayesian and exponential learner is only −1.82 for the sinusoidal sequence and −2.74 for the jump sequence.
Next, we investigated whether our experiments successfully mimicked situations in which observers benefit from integrating past and current information to estimate their sensory uncertainty. We compared the accuracy of the instantaneous, exponential and Bayesian learner’s visual uncertainty estimates in terms of their mean absolute deviation (in percentage) from the true variance. For Gaussian clouds of 20 dots, the instantaneous learner’s error in the visual uncertainty estimates of 21.7% is reduced to 13.7% and 14.9% for the exponential and Bayesian learners, respectively (with best fitted γ = 0.6, in the sinusoidal sequence). For Gaussian clouds composed of only five dots, the exponential and Bayesian learners even cut down the error by half (i.e. 46.8% instantaneous learner, 29.5% exponential learner, 23.9% Bayesian learner, with best fitted γ = 0.7).
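A rough simulation conveys why pooling over time helps. The sketch below compares an instantaneous variance estimate with a leaky (exponentially discounted) average on a sinusoidal sequence. The update rule `gamma * previous + (1 - gamma) * current`, the known-mean variance estimator and all names are assumptions; the sketch is not expected to reproduce the percentages reported above.

```python
import numpy as np

def simulate_errors(n_dots=20, gamma=0.6, n_periods=40, seed=0):
    """Mean absolute deviation (in %) from the true variance for an
    instantaneous and an exponentially discounting estimator."""
    rng = np.random.default_rng(seed)
    steps_per_period = 150                       # 30 s period sampled at 5 Hz
    t = np.arange(n_periods * steps_per_period)
    std = 10.0 + 8.0 * np.sin(2 * np.pi * t / steps_per_period)  # 2 to 18 deg
    true_var = std ** 2

    # Instantaneous learner: variance of the current cloud only (mean assumed known)
    dots = rng.normal(0.0, std[:, None], size=(t.size, n_dots))
    inst = np.mean(dots ** 2, axis=1)

    # Exponential learner (assumed form): leaky average of instantaneous estimates
    expo = np.empty_like(inst)
    expo[0] = inst[0]
    for i in range(1, t.size):
        expo[i] = gamma * expo[i - 1] + (1.0 - gamma) * inst[i]

    def err(est):
        return 100.0 * np.mean(np.abs(est - true_var) / true_var)

    return err(inst), err(expo)

inst_err, expo_err = simulate_errors()
print(f"instantaneous: {inst_err:.1f}%  exponential: {expo_err:.1f}%")
```

Calling `simulate_errors(n_dots=5, gamma=0.7)` gives the corresponding comparison for sparser clouds.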
Collectively, these simulation results suggest that even in situations in which observers benefit from combining past with current sensory inputs to obtain more precise uncertainty estimates, the exponential learner is a good approximation of the Bayesian learner, making it challenging to dissociate the two experimentally based on noisy human behavioral responses.
Discussion
The results from our four experiments challenge classical models of perceptual inference where a perceptual interpretation is obtained using a likelihood that depends solely on the current sensory inputs (Ernst and Banks, 2002). These models implicitly assume that sensory uncertainty (i.e. likelihood variance) is instantaneously and independently accessed from the sensory signals on each trial based on initial calibration of the nervous system (Jacobs and Fine, 1999). Most prominently, in the field of cue combination it is generally assumed that sensory signals are weighted by their uncertainties that are estimated only from the current sensory signals (Alais and Burr, 2004; Ernst and Banks, 2002; Jacobs, 1999) (but see Mikula et al., 2018; Triesch et al., 2002).
By contrast, our results demonstrate that human observers integrate inputs weighted by uncertainties that are estimated jointly from past and current sensory signals. Across the three continuous and the one discontinuous jump sequences, observers’ current visual reliability estimates were influenced by visual inputs that were presented 4–5 s in the past albeit their influence amounted to only 5% of the current visual signals.
Critically, observers adapted their visual uncertainty estimates flexibly according to the rate of change in the visual noise across the experiments. As predicted by both Bayesian and exponential learning models, observers’ visual reliability estimates relied more strongly on past sensory inputs, when the visual noise changed more slowly across time. While observers did not explicitly notice that each of the four experiments was composed of repetitions of temporally symmetric sequence components, we cannot fully exclude that observers may have implicitly learnt this underlying temporal structure. However, implicit or explicit knowledge of this repetitive sequence structure should have given observers the ability to predict and preempt future changes in visual reliability and therefore attenuated the temporal lag of the visual reliability estimates. Put differently, our experimental choice of repeating the same sequence component over and over again in the experiment cannot explain the influence of past signals on observers’ current reliability estimate, but should have reduced or even abolished it.
Importantly, the key feature that distinguishes the Bayesian from the exponential learner is how the two learners adapt to increases versus decreases in visual noise. Only the Bayesian learner represents and accounts for its uncertainty about its visual reliability estimates. As compared to the exponential learner, it should therefore adapt faster to increases but slower to decreases in visual noise (e.g. see Berniker et al., 2010). Our simulation results show this profile qualitatively, when the learner’s uncertainty about its visual reliability estimate is increased by reducing the number of dots (see Figure 6). But even for visual clouds of five dots, the differences in learning curves between the Bayesian and exponential learner are very small making it difficult to adjudicate between them given noisy observations from real observers. Unsurprisingly, therefore, Bayesian model comparison showed consistently across all four experiments that observers’ localization responses can be explained equally well by an optimal Bayesian and an exponential learner. These results converge with a recent study showing that learning about a hidden variable such as observers’ priors can be accounted for by an exponential averaging model (Norton et al., 2019).
Collectively, our experimental and simulation results suggest that under circumstances where observers substantially benefit from combining past and current sensory inputs for estimating sensory uncertainty, optimal Bayesian learning can be approximated well by simpler heuristic strategies of exponential discounting that update sensory weights with a fixed learning rate irrespective of observers’ uncertainty about their visual reliability estimate (Ma and Jazayeri, 2014; Shen and Ma, 2016). Future research will need to assess whether observers adapt their visual uncertainty estimates similarly if visual noise is manipulated via other methods such as stimulus luminance, duration, or blur.
From the perspective of neural coding, our findings suggest that current theories of probabilistic population coding (Beck et al., 2008; Ma et al., 2006; Hou et al., 2019) may need to be extended to accommodate additional influences of past experiences on neural representations of sensory uncertainties. Alternatively, the brain may compute sensory uncertainty using strategies of temporal sampling (Fiser et al., 2010).
In conclusion, our study demonstrates that human observers do not access sensory uncertainty instantaneously from the current sensory signals alone, but learn sensory uncertainty over time by combining past experiences and current sensory inputs as predicted by an optimal Bayesian learner or approximate strategies of exponential discounting. This influence of past signals on current sensory uncertainty estimates is likely to affect learning not only at slower timescales across trials (i.e. as shown in this study), but also at faster timescales of evidence accumulation within a trial (Drugowitsch et al., 2014). While our research unravels the impact of prior sensory inputs on uncertainty estimation in a cue combination context, we expect that our findings reveal fundamental principles of how the human brain computes and encodes sensory uncertainty.
Materials and methods
Participants
Seventy-six healthy volunteers participated in the study after giving written informed consent (40 female, mean age 25.3 years, range 18–52 years). All participants were naïve to the purpose of the study. All participants had normal or corrected-to-normal vision and reported normal hearing. The study was approved by the human research review committee of the University of Tuebingen (approval number 432 2007 BO1) and the research review committee of the University of Birmingham (approval number ERN_11–0470P).
Stimuli
The visual spatial stimulus was a Gaussian cloud of twenty bright gray dots (0.56° diameter, vertical STD 1.5°, luminance 106 cd/m^{2}) presented on a dark gray background (luminance 62 cd/m^{2}, i.e. 71% contrast). The auditory spatial cue was a burst of white noise with a 5 ms on/off ramp. To create a virtual auditory spatial cue, the noise was convolved with spatially specific head-related transfer functions (HRTFs). The HRTFs were pseudo-individualized by matching participants’ head width, height, depth, and circumference to the anthropometry of subjects in the CIPIC database (Algazi et al., 2001). HRTFs from the available locations in the database were interpolated to the desired locations of the auditory cue.
Experimental design and procedure
In a spatial ventriloquist paradigm, participants were presented with audiovisual spatial signals. Participants indicated the location of the sound by pressing one of five spatially corresponding buttons and were instructed to ignore the visual signal. Participants did not receive any feedback on their localization response. The visual signal was a cloud of 20 dots sampled from a Gaussian. The visual clouds were redisplayed with variable horizontal STDs (see below) every 200 ms (i.e. at a rate of 5 Hz; Figure 1A). The cloud’s location mean was temporally independently resampled from five possible locations (−10°, −5°, 0°, 5°, 10°) on each trial with the intertrial asynchrony jittered between 1.4 and 2.8 s in steps of 200 ms. In synchrony with the change in the cloud’s location, the dots changed their color and a concurrent sound was presented. The location of the sound was sampled from ±5° visual angle with respect to the mean of the visual cloud. Observers’ visual uncertainty estimate was quantified in terms of the relative weight of the auditory signal on the perceived sound location. The change in the dot’s color and the emission of the sound occurred in synchrony to enhance audiovisual binding.
Continuous sinusoidal and RW sequences
Critically, to manipulate visual noise over time, the cloud’s STD changed at a rate of 5 Hz according to (i) a sinusoidal sequence, (ii) an RW sequence 1 or (iii) an RW sequence 2 (Figure 2). In all sequences, the horizontal STD of the visual cloud spanned a range from 2 to 18°:
Experiment 1 - Sinusoidal sequence (Sinusoid): A sinusoidal sequence was generated with a period of 30 s. During the ~65 min of the experiment, each participant completed ~130 cycles of the sinusoidal sequence.
Experiment 2 - Random walk sequence 1 (RW1): First, we generated an RW sequence of 60 s duration using a Markov chain with 76 discrete states and transition probabilities of stay (1/3), change to lower (1/3) or upper (1/3) adjacent states. To ensure that the RW sequence segment starts and ends with the same value, this initial 60 s sequence segment was concatenated with its temporally reversed segment resulting in an RW sequence segment of 120 s duration. Each participant was presented with this 120 s RW1 sequence approximately 32 times during the experiment.
Experiment 3 - Random walk sequence 2 (RW2): Likewise, we created a second random-walk sequence of 15 s duration using a Markov chain with only 38 possible states and transition probabilities similar to above. The 15 s sequence was concatenated with its temporally reversed version resulting in a 30 s sequence. The smoothness of this sequence segment was increased by filtering it (without phase shift) with a moving average of 250 ms. Each participant was presented with this sequence segment ~130 times.
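The random-walk construction described for RW1 can be sketched as follows. The text specifies the number of states, the transition probabilities, the 5 Hz update rate and the mirroring; the boundary handling (clipping at the extreme states), the random starting state and the linear mapping of states onto the 2 to 18° range are assumptions made for this example, as are all names.

```python
import numpy as np

def random_walk_sequence(n_states=76, duration_s=60.0, rate_hz=5.0,
                         std_min=2.0, std_max=18.0, seed=0):
    """Return a time-symmetric sequence of visual STDs (in degrees)."""
    rng = np.random.default_rng(seed)
    n_steps = int(round(duration_s * rate_hz))
    state = np.empty(n_steps, dtype=int)
    state[0] = rng.integers(n_states)            # assumed random starting state
    for i in range(1, n_steps):
        step = rng.integers(-1, 2)               # -1, 0 or +1, each with probability 1/3
        state[i] = np.clip(state[i - 1] + step, 0, n_states - 1)  # assumed boundary rule
    # Assumed linear mapping from state index to STD in degrees
    std = std_min + (std_max - std_min) * state / (n_states - 1)
    # Mirror the segment so that the sequence starts and ends with the same value
    return np.concatenate([std, std[::-1]])

rw1 = random_walk_sequence()
print(rw1.shape, rw1[0], rw1[-1])
```

An RW2-like sequence would use `n_states=38` and `duration_s=15.0`, followed by the 250 ms zero-phase moving average, which is not shown here.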
Generally, a session of a sinusoid, RW1, or RW2 sequence included 1676 trials. Because of experimental problems, four sessions included only 1128, 1143, or 1295 trials. Before the experimental trials, participants practiced the auditory localization task in 25 unimodal auditory trials, 25 audiovisual congruent trials with a single dot as visual spatial cue and 75 trials with stimuli as in the main experiment.
Experiment 4 - Sinusoidal sequence with intermittent changes in visual noise (sinusoidal jump sequence)
To dissociate the Bayesian learner from approximate exponential discounting, we designed a sinusoidal sequence (period = 30 s) with intermittent increases/decreases in visual variance (Figure 5). As shown in Figure 5A, we inserted increases by 8° in visual STD at three levels of visual STD: 7.2°, 8.6°, 9.6° STD. Conversely, we inserted decreases by 8° in visual STD at 15.3°, 16.7°, 17.7° STD. We inserted jumps selectively in the period sections of high visual variance to make the jumps less apparent and maximize the chances that observers treated the series as a continuous sequence. As a result, the up-jumps occurred when the increases in visual variance were fastest (i.e. steeper slope), while the down-jumps occurred after sections in which the visual variance was relatively constant (i.e. shallow slope). We factorially combined these 3 (increases) × 3 (decreases) such that each sine-wave cycle included exactly one sudden increase and one sudden decrease in visual STD (i.e. nine jump types). Otherwise, the experimental paradigm and stimuli were identical to the continuous sinusoidal sequence described above. During the ~80 min of this experiment, each participant completed ~154 cycles of the sinusoidal sequence including 16–18 cycles for each of the nine jump types. This sinusoidal jump sequence was expected to maximize differences in adaptation rate for the Bayesian and exponential learner. If participants continuously update their estimates of the visual reliability, as opposed to using a change point model (Adams and Mackay, 2007; Heilbron and Meyniel, 2019), the exponential learner will weight past and present uncertainty estimates throughout the entire sequence according to the same exponential function. By contrast, the Bayesian learner will take into account its uncertainty about the visual reliability and therefore adapt its visual reliability estimate for jumps from high to low visual variance (i.e. from low to high visual reliability, see Figure 6) more slowly than the exponential learner (see Appendix 1).
Subject numbers and inclusion criteria
Of the 76 subjects, 30 participated in the sinusoidal and the RW1 sequence session. Eight additional subjects participated only in the RW1 sequence session. Eighteen additional subjects participated in the RW2 sequence session. One participant completed all three continuous sequences. Twenty subjects participated in the sinusoidal sequence with intermittent changes in visual uncertainty. In total, we collected data from 30 participants for the sinusoidal, 38 participants for the RW1, 19 participants for the RW2, and 20 participants for the sinusoidal jump sequence. The sample sizes of 20–38 participants were based on a pilot experiment, which showed individually significant effects of past visual noise on the weighting of audiovisual spatial signals in 6/6 pilot participants. From these samples, we excluded participants if their perceived sound location did not depend on the current visual reliability (i.e. inclusion criterion p<0.05 in the linear regression; please note that this inclusion criterion is orthogonal to the question of whether participants’ visual uncertainty estimate depends on visual signals prior to the current trial). Thus, we excluded five participants of the sinusoidal and RW1 sequence and two participants from the sinusoidal jump sequence. Finally, we analyzed data from 25 participants for the sinusoidal, 33 participants for the RW1, 19 participants for the RW2, and 18 participants for the sinusoidal jump sequence.
Experimental setup
Audiovisual stimuli were presented using Psychtoolbox 3.09 (Brainard, 1997; Kleiner et al., 2007) (http://www.psychtoolbox.org) running under Matlab R2010b (MathWorks) on a Windows machine (Microsoft XP 2002 SP2). Auditory stimuli were presented at ~75 dB SPL using headphones (Sennheiser HD 555). As visual stimuli required a large field of view, they were presented on a 30″ LCD display (Dell UltraSharp 3007WFP). Participants were seated at a desk in front of the screen in a darkened booth, resting their head on an adjustable chin rest. The viewing distance was 27.5 cm. This setup resulted in a visual field of approximately 100°. Participants responded via a standard QWERTY keyboard, using the buttons [i, 9, 0, -, =] with their right hand for localization responses.
Data analysis
Continuous sinusoidal and RW sequences
At trial onset the visual cloud’s location mean was independently resampled from five possible locations (−10°, −5°, 0°, 5°, 10°). Concurrently, the cloud’s dots changed their color and a sound was presented sampled from ±5° visual angle with respect to the mean of the visual cloud. The intertrial asynchrony was jittered between 1.4 and 2.8 s in steps of 200 ms. Therefore, across the experiment the trial onsets occurred at different times relative to the period of the changing visual cloud’s STD, resulting in a greater effective sampling rate than if the intertrial asynchrony had been fixed.
For each period of the three continuous sinusoidal and RW sequences, we sorted the trials (i.e. trial-specific visual cloud’s STD, visual location, auditory location, and observers’ sound localization responses) into 20 temporally adjacent bins that covered one complete period of the changing visual STD. This resulted in about 1676 trials in total/20 bins = approximately 80 trials on average per bin in each subject (more specifically: a range of 52–96 (Sin), 52–92 (RW1), or 71–93 (RW2) trials; for details see Supplementary file 1-Table 1).
We quantified the influence of the auditory and visual locations on observers’ perceived auditory location for each bin by estimating a regression model separately for each bin (i.e. one regression model per bin). For instance, for bin = 1 we computed:
$\text{R}_{\text{A},\text{trial},\text{bin}=1} = \text{L}_{\text{A},\text{trial},\text{bin}=1}\,\text{ß}_{\text{A},\text{bin}=1} + \text{L}_{\text{V},\text{trial},\text{bin}=1}\,\text{ß}_{\text{V},\text{bin}=1} + \text{ß}_{\text{const},\text{bin}=1} + \text{e}_{\text{trial},\text{bin}=1}$ (1)
with $\text{R}_{\text{A},\text{trial},\text{bin}=1}$ = localization response for trial t and bin 1; $\text{L}_{\text{A},\text{trial},\text{bin}=1}$ or $\text{L}_{\text{V},\text{trial},\text{bin}=1}$ = ‘true’ auditory or visual location for trial t and bin 1; $\text{ß}_{\text{A},\text{bin}=1}$ or $\text{ß}_{\text{V},\text{bin}=1}$ = auditory or visual weight for bin 1; $\text{ß}_{\text{const},\text{bin}=1}$ = constant term; $\text{e}_{\text{trial},\text{bin}=1}$ = error term for trial t and bin 1. For each bin b, we thus obtained one auditory and one visual weight estimate. The relative auditory weight for a particular bin was computed as w_{A,bin} = ß_{A,bin} / (ß_{A,bin} + ß_{V,bin}) (Figure 2A–C).
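The per-bin regression and the relative auditory weight can be sketched as follows. This is an illustrative ordinary least squares fit via the normal equations, not the original analysis code; the function names `solve` and `relative_auditory_weight` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def relative_auditory_weight(loc_a, loc_v, resp):
    """Fit resp = loc_a * b_A + loc_v * b_V + b_const for the trials of one bin
    and return w_A = b_A / (b_A + b_V)."""
    rows = [(a, v, 1.0) for a, v in zip(loc_a, loc_v)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, resp)) for i in range(3)]
    beta_a, beta_v, _ = solve(xtx, xty)
    return beta_a / (beta_a + beta_v)

# Noise-free toy bin: responses built with b_A = 0.7 and b_V = 0.3
loc_v = [-10.0, -5.0, 0.0, 5.0, 10.0, -10.0, -5.0, 0.0, 5.0, 10.0]
loc_a = [v + d for v, d in zip(loc_v, [5.0, -5.0] * 5)]
resp = [0.7 * a + 0.3 * v + 1.0 for a, v in zip(loc_a, loc_v)]
w_a = relative_auditory_weight(loc_a, loc_v, resp)
```

In this toy bin the sound is offset by ±5° from the visual location, so the auditory and visual regressors are not collinear and both weights are identifiable.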
By design, the temporal evolution of the physical visual variance (i.e. STD of the visual cloud) is symmetric for each period in the sinusoidal, RW1 and RW2 sequences. In other words, for physical visual noise, the 1st half and the flipped 2nd half within a period are identical (Figure 3E). Given this symmetry constraint, we evaluated the influence of past visual noise on participants’ auditory weight w_{A,bin} by comparing the w_{A} for the bins in the 1st half and the flipped 2nd half in a repeated measures ANOVA. If human observers estimate visual uncertainty by combining prior with current visual uncertainty estimates as expected for a Bayesian learner, w_{A} should differ between the 1st half and the mirror-symmetric flipped 2nd half of the sequence. More specifically, w_{A} should be smaller for the 1st half in which visual variance increased than for the mirror-symmetric time points of the 2nd half in which visual variance decreased. To test this prediction, we entered the subject-specific w_{A,bin} into 2 (1st vs. flipped 2nd half) × 9 (bins, i.e. removing the bins at maximal and minimal visual noise values) repeated measures ANOVAs separately for the sinusoidal, RW1 and RW2 experiments (Table 1). For the sinusoidal sequence, we expected a main effect of ‘half’ because the sequence increased/decreased monotonically within each half period. For the RW1 and RW2 sequences, an influence of prior visual noise might also be reflected in an interaction effect of ‘half × bin’ because these sequences increased/decreased non-monotonically within each half period.
To further test whether the noise of past visual signals influenced observers’ current visual uncertainty estimate, we employed a regression model in which the relative auditory weights w_{A,bin} were predicted by the visual STD in the current bin and the difference in STD between the current and the previous bin:
w_{A,bin} = σ_{V,bin} * ß_{σV} + (σ_{V,bin} − σ_{V,bin−1}) * ß_{ΔσV} + ß_{const} + e_{bin} (2)
with w_{A,bin} = relative auditory weight in bin b; σ_{V,bin} = mean visual STD in current bin b or previous bin b−1; ß_{const} = constant term; e_{bin} = error term. To allow for generalization to the population level, the parameter estimates (ß_{σV}, ß_{ΔσV}) for each participant were entered into two-sided one-sample t-tests at the between-subject random-effects level.
Sinusoidal sequence with intermittent changes in visual uncertainty
For each period of the sinusoidal sequence with intermittent changes, we sorted the values for the physical visual cloud’s variance (i.e. the cloud’s STD) and sound localization responses into 15 temporally adjacent bins which were positioned to capture the jumps in visual noise. For analysis of these sequences, we recombined the first and second halves of the 3 (increases at low, middle, high) × 3 (decreases at low, middle, high) sine-wave cycles into three types of sine-wave cycles such that both jumps were at low (=outer jump), middle (=middle jump), or high (=inner jump) visual noise. This recombination makes the simplifying assumption that the jump position of the first half will have negligible effects on participants’ uncertainty estimates of the second half. As a result of this recombination, each bin comprised at least 44–51 trials across participants (Supplementary file 1-Table 1). As for the continuous sequences, we quantified the auditory and visual influence on the perceived auditory location for each bin based on separate regression models for the 15 temporally adjacent bins (see Equation 1). Next, we independently computed the relative auditory weight w_{A,bin} = ß_{A,bin} / (ß_{A,bin} + ß_{V,bin}) for each of the 15 temporally adjacent bins. We statistically evaluated the influence of past visual noise on participants’ auditory weight w_{A} in terms of the difference between the 1st half and the flipped 2nd half using a 2 (1st vs. flipped 2nd half) × 7 (bins) × 3 (jump: inner, middle, outer) repeated measures ANOVA (Table 1).
Computational models (for continuous and discontinuous sequences)
To further characterize whether and how human observers use their uncertainty about previous visual signals to estimate their uncertainty of the current visual signal, we defined and compared three models in which visual reliability ($\lambda_{\text{V}}$) was (1) estimated instantaneously for each trial (i.e. instantaneous learner), or was updated via (2) Bayesian learning or (3) exponential discounting (i.e. exponential learner) (Figure 1—figure supplement 1).
In the following, we will first describe the generative model that accounts for the fact that (1) visual uncertainty usually changes slowly across trials (i.e. time-dependent uncertainty changes) and (2) auditory and visual signals can be generated by one common or two independent sources (i.e. causal structure). Using this generative model as a departure point, we then describe how the instantaneous learner, the Bayesian learner and the exponential learner perform inference. Finally, we will explain how we account for participants’ internal noise and predict participants’ responses from each model (i.e. the experimenter’s uncertainty).
Generative model
On each trial t, the subject is presented with an auditory signal A_{t} from a source S_{A,t} (see Figure 1—figure supplement 1) together with a visual cloud of dots at time t arising from a source S_{V,t}, drawn from a Normal distribution S_{V,t} ~ N(0, $1/\lambda_{\text{S}}$) with the spatial reliability (i.e. inverse of the spatial variance) $\lambda_{\text{S}} = 1/\sigma_{S}^{2}$. Critically, S_{A,t} and S_{V,t} can either be two independent sources (C = 2) or one common source (C = 1): S_{A,t} = S_{V,t} = S_{t} (Körding et al., 2007).
We assume that the auditory signal is corrupted by noise, so that the internal signal is A_{t} ~ N(S_{A,t}, $1/\lambda_{\text{A}}$). By contrast, the individual visual dots (presented at high visual contrast) are assumed to be uncorrupted by noise, but presented dispersed around the location S_{V,t} according to V_{i,t} ~ N(U_{t}, $1/\lambda_{\text{V},t}$), where U_{t} ~ N(S_{V,t}, $1/\lambda_{\text{V},t}$). The dispersion of the individual dots, $1/\lambda_{\text{V},t}$, is assumed to be identical to the uncertainty about the visual mean, allowing subjects to use the dispersion as an estimate of the uncertainty about the visual mean.
The visual reliability of the visual cloud, $\lambda_{\text{V},t} = 1/\sigma_{\text{V},t}^{2}$, varies slowly at the redisplay rate of 5 Hz according to a log RW: $\log \lambda_{\text{V},t} \sim N\left(\log \lambda_{\text{V},t-1}, 1/\kappa\right)$, with $1/\kappa$ being the variability of $\lambda_{\text{V},t}$ in log space. We also use this log RW model to approximate learning in the fourth (jump) sequence (see Behrens et al., 2007).
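As an illustration of the generative model described above, the following sketch draws one step of the process. It is a simplified reading rather than the authors' implementation: the function name is hypothetical, the second argument of N(·, ·) is treated as a variance, and for C = 2 the auditory source is assumed to be drawn from the same spatial prior as the visual source.

```python
import math
import random

def simulate_step(log_lambda_prev, kappa, sigma_s, sigma_a, n_dots, common, rng):
    """Draw one step of the assumed generative model; returns (log lambda_V, A, dots)."""
    # log random walk on visual reliability; 1/kappa is treated as a variance
    log_lambda_v = rng.gauss(log_lambda_prev, math.sqrt(1.0 / kappa))
    sigma_v = math.exp(-0.5 * log_lambda_v)        # sigma_V = lambda_V ** -0.5
    s_v = rng.gauss(0.0, sigma_s)                  # visual source from spatial prior
    # assumption: an independent auditory source follows the same spatial prior
    s_a = s_v if common else rng.gauss(0.0, sigma_s)
    a = rng.gauss(s_a, sigma_a)                    # noisy internal auditory signal
    u = rng.gauss(s_v, sigma_v)                    # hidden cloud centre U_t
    dots = [rng.gauss(u, sigma_v) for _ in range(n_dots)]
    return log_lambda_v, a, dots

rng = random.Random(1)
start = math.log(1.0 / 10.0 ** 2)                  # reliability for a 10 deg STD
log_lam, a_t, dots = simulate_step(start, kappa=15.0, sigma_s=12.0,
                                   sigma_a=6.0, n_dots=20, common=True, rng=rng)
```

The parameter values in the usage lines are placeholders chosen for illustration only.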
The generative models of the instantaneous, Bayesian, and exponential learners all account for the causal uncertainty by explicitly modeling the two potential causal structures. Yet, they differ in how they estimate the visual uncertainty on each trial, which we will describe in greater detail below.
Observer inference
The instantaneous, Bayesian, and exponential learners invert this (or slightly modified, see below) generative model during perceptual inference to compute the posterior probability of the auditory location, S_{A,t}, given the observed A_{t} and V_{i,t}. The observer selects a response based on the posterior using a subjective utility function which we assume to be the minimization of the squared error (S_{A,t} − S_{true})^{2}. For all models, the estimate for the location of the auditory source is obtained by averaging the auditory estimates under the assumption of common and independent sources by their respective posterior probabilities (i.e. model averaging, see Figure 1—figure supplement 1):
$\hat{S}_{A,t} = P(C=1 \mid A_{t}, V_{1:n,t})\,\hat{S}_{A,C=1,t} + \left(1 - P(C=1 \mid A_{t}, V_{1:n,t})\right)\hat{S}_{A,C=2,t}$ (3)
where $\hat{S}_{A,C=1,t}$ and $\hat{S}_{A,C=2,t}$ depend on the model (see below), and $P(C=1 \mid A_{t}, V_{1:n,t})$ is the posterior probability that the audio and visual stimuli originated from the same source according to Bayesian causal inference (Körding et al., 2007):
$P(C=1 \mid A_{t}, V_{1:n,t}) = \frac{P(A_{t}, V_{1:n,t} \mid C=1)\,P(C=1)}{P(A_{t}, V_{1:n,t} \mid C=1)\,P(C=1) + P(A_{t}, V_{1:n,t} \mid C=2)\,\left(1 - P(C=1)\right)}$ (4)
Finally, for all models, we assume that the observer pushes the button associated with the position closest to ${\widehat{S}}_{A,t}$. In the following, we describe the generative and inference models for the instantaneous, Bayesian, and exponential learners. For the Bayesian learner, we focus selectively on the model component that assumes a common cause, C = 1 (for full derivation including both model components, see Appendix 2).
Model 1: Instantaneous learner
The instantaneous learning model ignores that the visual reliability (i.e. the inverse of visual uncertainty) of the current trial depends on the reliability of the previous trial. Instead, it estimates the visual reliability independently for each trial from the spread of the cloud of visual dots:
with $Z_{1}, Z_{2}$ as normalization constants.
Apart from $P(C=1 \mid A_{t}, V_{t})$, these terms are all normal distributions, while we assume in this model that $P(\lambda_{\text{V},t})$ is uninformative. Hence, visual reliability is computed from the variance: $\hat{\lambda}_{V,t} = 1/\left(\sigma_{V,t}^{2} + \frac{\sigma_{V,t}^{2}}{n}\right)$, where $\sigma_{V,t}^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(V_{i,t} - \overline{V}_{t}\right)^{2}$ is the sample variance (and $\overline{V}_{t} = \frac{1}{n}\sum_{i=1}^{n} V_{i,t}$ is the sample mean). The causal component estimates are given by:
These two components are then combined based on the posterior probabilities of common and independent cause models (see Equation 3). This model is functionally equivalent to a Bayesian causal inference model as described in Körding et al., 2007, but with visual reliability computed directly from the sample variance rather than a fixed unknown parameter (which the experimenter estimates during model fitting).
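A minimal sketch of such an instantaneous learner is given below. The reliability estimate follows the sample-variance expression stated above; the component estimates and likelihoods use the standard closed-form Bayesian causal inference expressions of Körding et al., 2007 with a zero-mean spatial prior, treating the cloud mean as a single Gaussian visual cue. That treatment, and the function name, are assumptions for illustration and may differ in detail from the model as fitted.

```python
import math

def instantaneous_estimate(a, dots, sigma_a, sigma_0, p_common):
    """Return (model-averaged auditory estimate, posterior P(C=1)) for one trial."""
    n = len(dots)
    v_bar = sum(dots) / n                                  # sample mean of the cloud
    s2 = sum((d - v_bar) ** 2 for d in dots) / (n - 1)     # sample variance
    var_v = s2 + s2 / n                                    # 1 / lambda_hat_V
    var_a, var_0 = sigma_a ** 2, sigma_0 ** 2

    # Component estimates (assumed reliability-weighted forms, prior mean 0)
    s_c1 = (a / var_a + v_bar / var_v) / (1 / var_a + 1 / var_v + 1 / var_0)
    s_c2 = a / (1 + var_a / var_0)

    # Likelihoods of (a, v_bar) under a common and under independent causes
    d1 = var_a * var_v + var_a * var_0 + var_v * var_0
    q1 = ((a - v_bar) ** 2 * var_0 + a ** 2 * var_v + v_bar ** 2 * var_a) / d1
    like_c1 = math.exp(-0.5 * q1) / (2 * math.pi * math.sqrt(d1))
    d2 = (var_a + var_0) * (var_v + var_0)
    q2 = a ** 2 / (var_a + var_0) + v_bar ** 2 / (var_v + var_0)
    like_c2 = math.exp(-0.5 * q2) / (2 * math.pi * math.sqrt(d2))

    # Posterior over causal structure, then model averaging
    post_c1 = like_c1 * p_common / (like_c1 * p_common + like_c2 * (1 - p_common))
    return post_c1 * s_c1 + (1 - post_c1) * s_c2, post_c1

estimate, post = instantaneous_estimate(5.0, [-2.0, 0.0, 2.0, 4.0],
                                        sigma_a=6.0, sigma_0=12.0, p_common=0.7)
```

With `p_common = 0` the estimate reduces to the independent-cause expression A_{t}/(1 + σ_{A}^{2}/σ_{0}^{2}), which gives a quick sanity check of the averaging step.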
Model 2: Bayesian learner
The Bayesian learner capitalizes on the slow changes in visual reliability across trials and combines past and current inputs to provide a more reliable estimate of visual reliability and hence auditory location. It computes the posterior probability based on all auditory and visual signals presented until time t (here only shown for C = 1, see Appendix 2).
According to Bayes rule, the joint probability of all variables until time t can be written based on the generative model as:
As above, the visual likelihood is given by the product of individual Normal distributions for each dot i: $P(V_{1:n,t} \mid U_{t}, \lambda_{\text{V},t}) = \prod_{i=1}^{n} N(V_{i,t} \mid U_{t}, 1/\lambda_{\text{V},t})$, and $P(U_{t} \mid S_{t}, \lambda_{\text{V},t}) = N(U_{t} \mid S_{t}, 1/\lambda_{\text{V},t})$. The prior $P(S_{t})$ is a Normal distribution $N(S_{t} \mid 0, 1/\lambda_{\text{S}})$ and the auditory likelihood $P(A_{t} \mid S_{t})$ is a Normal distribution $N(A_{t} \mid S_{t}, 1/\lambda_{A})$. As described in the generative model, $P(\lambda_{\text{V},k} \mid \lambda_{\text{V},k-1})$ is given by $\log \lambda_{\text{V},t} \sim N\left(\log \lambda_{\text{V},t-1}, 1/\kappa\right)$.
Importantly, only the visual reliability, $\lambda_{\text{V},t}$, is directly dependent on the previous trial ($P(\lambda_{\text{V},k}, \lambda_{\text{V},k-1}) = P(\lambda_{\text{V},k} \mid \lambda_{\text{V},k-1})\,P(\lambda_{\text{V},k-1}) \ne P(\lambda_{\text{V},k})\,P(\lambda_{\text{V},k-1})$). Because of the Markov property (i.e. $\lambda_{\text{V},t}$ depends only on $\lambda_{\text{V},t-1}$), the joint distribution for time t can be written as
Hence, the joint posterior probability over location and visual reliability given a stream of auditory and visual inputs can be rewritten as:
As this equation cannot be solved analytically, we obtain an approximate solution by factorizing the posterior in terms of the unknown variables ($S_{t}, U_{t}, \lambda_{\text{V},t}$) according to the method of variational Bayes (Bishop, 2006). In this approximate method (for details see Appendix 2), the posterior is factorized into three terms, each a normal distribution:
In order to estimate the set of parameters (mean and variance) of $q(S_{t})$, $q(U_{t})$ and $q(\lambda_{\text{V},t})$, the Free Energy is minimized iteratively (and thereby the Kullback–Leibler divergence between the true and approximate distribution), until a convergence criterion is reached (here, the change in each fitted parameter is less than 0.0001 between iterations).
This is done separately for the common cause model component (C = 1) and the independent cause model component (C = 2). For the common cause model, the auditory location estimate $\hat{S}_{A,C=1,t}$ is taken from the approximate posterior over location, $q_{1}(S_{t}) = N(\hat{S}_{A,C=1,t}, \sigma_{1,t})$. The auditory location for the independent cause model is simply computed as $\hat{S}_{A,C=2,t} = A_{t}/\left(1 + \sigma_{A}^{2}/\sigma_{0}^{2}\right)$, because it is independent of the visual signal.
The marginal model evidence is estimated based on the minimized Free Energy for each model component, $P(A_{t}, V_{1:n,t} \mid C=1)$ and $P(A_{t}, V_{1:n,t} \mid C=2)$ respectively, to form the posterior probability $P(C=1 \mid A_{t}, V_{1:n,t})$, as described above in Equation 4. These values can then be used to compute the predicted responses for a particular participant according to Equation 3.
Model 3: Exponential learner
Finally, the observer may approximate the full Bayesian inference of the Bayesian learner by a simpler heuristic strategy of exponential discounting. In the exponential discounting model, the observer learns the visual reliability by exponentially discounting past visual reliability estimates:
where $\sigma_{V,t}^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(V_{i,t} - \overline{V}_{t}\right)^{2}$ is the sample variance and $\overline{V}_{t} = \frac{1}{n}\sum_{i=1}^{n} V_{i,t}$ is the sample mean.
Similar to the optimal Bayesian learner (above), this observer model uses the past to compute the current reliability, but it does so based on a fixed learning rate 1 − γ. Computation is otherwise performed in accordance with models 1 and 2, Equations 3–4 and 6–7.
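The idea of exponential discounting with a fixed learning rate 1 − γ can be sketched as a running average. Which quantity is averaged (variance, STD, or reliability) and how the estimate is initialized are assumptions here: the sketch averages the sample variance and starts from the first observation, and the function name is hypothetical.

```python
def exponential_variance_estimates(sample_vars, gamma):
    """Running variance estimate: est_t = gamma * est_{t-1} + (1 - gamma) * s2_t."""
    estimate = sample_vars[0]          # assumption: initialize at the first sample
    estimates = []
    for s2 in sample_vars:
        # past estimate discounted by gamma, new evidence weighted by 1 - gamma
        estimate = gamma * estimate + (1.0 - gamma) * s2
        estimates.append(estimate)
    return estimates

# A sudden jump in sample variance is tracked with a lag set by gamma
tracked = exponential_variance_estimates([4.0, 4.0, 100.0, 100.0], gamma=0.5)
```

Because γ is constant, the same weighting of past and present is applied throughout the sequence, in contrast to the Bayesian learner, whose effective learning rate depends on its current uncertainty.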
Assumptions of the computational models: motivation and caveats
Computational models inherently make simplifying assumptions about the generation of the sensory inputs and observers’ inference.
First, we modeled that visual signals (i.e. the cloud’s mean) were sampled from a Gaussian, while they were sampled from a uniform discrete distribution (i.e. [−10°, −5°, 0°, 5°, 10°]) in the experiment. Gaussian assumptions about the stimuli locations have nearly exclusively been made in the recent series of studies focusing on Bayesian Causal Inference in multisensory perception (Körding et al., 2007; Rohe and Noppeney, 2015b; Rohe and Noppeney, 2015a). Because visual signals have been sampled from a wide range of visual angle (i.e. 20°) and are corrupted by physical (i.e. cloud of dots) and internal neural noise, we used the simplifying assumption of a Gaussian spatial prior consistent with previous research.
Second, we assumed that the auditory signal location is sampled from a Gaussian, while the experiments presented sounds ±5° from the visual location. These Gaussian assumptions about sound location can be justified by the fact that observers are known to be limited in their sound localization ability, particularly when generic HRTFs were used to generate spatial sounds. Moreover, because sounds are presented together with visual signals, it is even harder for observers to obtain an accurate estimate of the sound’s location.
Third, in the experiment we generated the cloud of dots directly from a Gaussian distribution centred on S_{t}. By contrast, in the model we introduced a hidden variable U_{t} that is sampled from a Gaussian centred on S_{t}. The visual cloud of dots is then centred on this hidden variable U_{t}. We introduced this additional hidden variable U_{t} to account for observers’ additional causal uncertainty in natural environments, in which even signals from a common source may not fully coincide in space. Critically, the dispersion of the cloud of dots is set to be equal to the STD of the distribution from which U_{t} is sampled, so that the cloud’s STD informs observers about the variance of the hidden variable U_{t}.
Inference by the experimenter
From the observer’s viewpoint, this completes the inference process. However, from the experimenter’s viewpoint, the internal variable for the auditory stimulus, A_{t}, is unknown and not directly under the experimenter’s control. To integrate out this unknown variable, we generated 1000 samples of the internal auditory value for each trial from the generative process A_{t} ~ N(S_{A,t,true}, σ_{A}^{2}), where S_{A,t,true} was the true location the auditory stimulus came from. For each value of A_{t}, we obtained a single estimate $\hat{S}_{A,t}$ (as described above). To link these estimates with observers’ button response data, we assumed that subjects push the button associated with the position closest to $\hat{S}_{A,t}$. In this way, we obtained a histogram of responses for each subject and trial, which provides the likelihood of the model parameters given a subject’s responses: $P(resp_{t} \mid \kappa, \sigma_{A}, P_{common}, S_{A,t,true}, S_{V,t,true})$.
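The sampling step described above can be sketched as follows. The `estimator` argument stands in for whichever observer model maps an internal auditory value to a location estimate, and the five button positions are assumed to lie at −10°, −5°, 0°, 5° and 10°; both are illustrative assumptions rather than details taken from the original code.

```python
import random
from collections import Counter

BUTTONS = (-10.0, -5.0, 0.0, 5.0, 10.0)   # assumed button positions in degrees

def response_probabilities(s_a_true, sigma_a, estimator, n_samples=1000, seed=0):
    """Monte Carlo histogram of button responses for one trial."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        a = rng.gauss(s_a_true, sigma_a)          # sampled internal auditory value
        s_hat = estimator(a)                      # observer's location estimate
        button = min(BUTTONS, key=lambda b: abs(b - s_hat))   # closest button
        counts[button] += 1
    return {b: counts[b] / n_samples for b in BUTTONS}

# Toy observer that simply reports a shrunken version of its internal signal
probs = response_probabilities(5.0, 6.0, estimator=lambda a: 0.8 * a)
```

The probability assigned to the button the participant actually pressed is then that trial's contribution to the likelihood.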
Model estimation and comparison
Parameters for each model (for all models: σ_{A}, P_{common} = P(C = 1), σ_{0}; Bayesian learner: $\kappa$; exponential learner: γ) were fit for each individual subject by sampling using a symmetric-proposal Metropolis-Hastings (MH) algorithm (with $A_{t}$ integrated out via sampling, see above). The MH algorithm iteratively draws samples set_{n} from a probability distribution through a variant of rejection sampling: if the likelihood of the new parameter set is larger than that of the previous set, the new set is accepted; otherwise it is accepted with probability L(resp|set_{n})/L(resp|set_{n−1}), where L(resp|set_{n}) = $\prod_{t} P(resp_{t} \mid \kappa, \sigma_{A}, P_{common}, S_{A,t,true}, S_{V,t,true})$ (for the Bayesian learner). We sampled 4000 steps from four sampling chains with thinning (only using every fourth sample to avoid correlations in samples), giving a total of 4000 samples per subject data set. Convergence was assessed through scale reduction (using criterion R < 1.1 [Gelman et al., 2013]). Sampling does not just provide a single parameter estimate for a data set (as when fitting maximum likelihood), but can also be used to assess the uncertainty in estimation for the data set. The model code was implemented in Matlab (Mathworks, MA) and ran on two dual Xeon workstations. Each sample step, per subject data set, took 30 s on a single core (~42 hr per sampling chain).
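The acceptance rule described above corresponds to a random-walk Metropolis sampler, sketched below in log space. The Gaussian proposal, its step size, and the function name are assumptions for illustration; the toy target at the end is a standard normal log-likelihood, not one of the observer models.

```python
import math
import random

def metropolis_hastings(log_like, start, step_sd, n_steps, thin=4, seed=0):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current = list(start)
    current_ll = log_like(current)
    samples = []
    for _ in range(n_steps):
        proposal = [p + rng.gauss(0.0, step_sd) for p in current]
        proposal_ll = log_like(proposal)
        # accept if the likelihood is larger, else with probability L(new) / L(old)
        if proposal_ll >= current_ll or rng.random() < math.exp(proposal_ll - current_ll):
            current, current_ll = proposal, proposal_ll
        samples.append(list(current))
    return samples[::thin]             # thinning: keep every `thin`-th sample

# Toy target: log-likelihood of a standard normal in one parameter
chain = metropolis_hastings(lambda p: -0.5 * p[0] ** 2, [0.0], 1.0, 20000)
```

Working with log-likelihoods avoids numerical underflow when the likelihood is a product over many trials.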
Quantitative Bayesian model comparison of the three candidate models was based on the Watanabe-Akaike Information Criterion (WAIC) as an approximation to the out-of-sample expectation (Gelman et al., 2013). At the fixed-effects level, Bayesian model comparison was performed by summing the WAIC over all participants within each experiment. For a random-effects analysis, we transformed the WAIC into log-likelihoods by dividing them by minus 2. We then computed the protected exceedance probability that one model is better than the other model beyond chance using hierarchical Bayesian model selection (Penny et al., 2010; Rigoux et al., 2014).
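For reference, WAIC can be computed from the pointwise log-likelihoods of the posterior samples as in the sketch below, which follows the definition in Gelman et al., 2013 (log pointwise predictive density minus the summed posterior variance of the log-likelihood, on the deviance scale). The function name and input layout (`log_lik[s][t]` for sample s and trial t) are assumptions.

```python
import math
from statistics import variance

def waic(log_lik):
    """WAIC = -2 * (lppd - p_waic) from log_lik[s][t] (sample s, trial t)."""
    n_samples, n_trials = len(log_lik), len(log_lik[0])
    lppd, p_waic = 0.0, 0.0
    for t in range(n_trials):
        column = [log_lik[s][t] for s in range(n_samples)]
        m = max(column)
        # log of the posterior-mean likelihood, computed stably (log-sum-exp)
        lppd += m + math.log(sum(math.exp(x - m) for x in column) / n_samples)
        # effective number of parameters: posterior variance of the log-likelihood
        p_waic += variance(column)
    return -2.0 * (lppd - p_waic)

# Three identical posterior samples over two trials: the penalty term is zero
toy = [[math.log(0.5), math.log(0.25)]] * 3
toy_waic = waic(toy)
```

Dividing the result by minus 2 returns it to the log-likelihood scale, as done for the random-effects analysis.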
To qualitatively compare the localization responses given by the participants and the responses predicted by the instantaneous, Bayesian and exponential learner, we computed the auditory weight w_{A} from the predicted responses of the three models exactly as in the analysis for the behavioral data. For illustration, we show and compare the model’s w_{A} from the 1st and the flipped 2nd half of the periods for each of the four experiments (Figure 3, Figure 4, Figure 5B/C and Figure 5—figure supplement 1).
Parameter recovery
To test the validity of the models, we performed parameter recovery and were able to recover the generating values with a bias smaller than 10% for all parameters (for full details of bias and variance across parameters, see Appendix 1 and Supplementary file 1-Table 7).
Simulated localization responses
To further compare the Bayesian and exponential learner and assess whether they can be discriminated experimentally, we simulated the choices of 12 subjects for the continuous sinusoidal and sinusoidal jump sequence using the Bayesian learner model (parameters: σ_{A} = 6°, κ = 15, P_{common} = 0.7 and σ_{0} = 12°). To increase observers’ uncertainty about their visual reliability estimates, we reduced the number of dots in the visual clouds from 20 to 5 dots, while ensuring that the mean and variance of the five dots corresponded to the experimentally defined visual mean and variance. We then fitted the Bayesian learner and exponential learner models to each simulated data set (using the BADS toolbox for likelihood maximization [Acerbi and Ma, 2017]). The fitted parameters for the Bayesian model, set_{Bayes}, were very close to the parameters used to generate observers’ simulated responses (sinusoidal sequence, fitted parameters: σ_{A} = 6.11°, κ = 17.5, P_{common} = 0.72 and σ_{0} = 12.4°; sinusoidal jump sequence, fitted parameters: σ_{A} = 6.08°, κ = 17.3, P_{common} = 0.71 and σ_{0} = 12.2°), thereby providing a simple version of parameter recovery. The parameters of the exponential model, set_{Exp} (fitted to observers’ responses generated from the Bayesian model), were very similar to those of the Bayesian learner (sinusoidal sequence: σ_{A} = 5.99°, γ = 0.70, P_{common} = 0.61 and σ_{0} = 12.0°; sinusoidal jump sequence: σ_{A} = 6.06°, γ = 0.70, P_{common} = 0.65 and σ_{0} = 12.0°). Moreover, the fits to the simulated observers’ responses were very close for the two models (Figure 6), with mean log likelihood difference (log(L(resp|set_{Bayes})) − log(L(resp|set_{Exp}))) = 1.82 for the sinusoidal and 2.74 for the sinusoidal jump sequence (implying a slightly better fit for the Bayesian learner). Figure 6C and D show the time courses of observers’ visual uncertainty (STD) as estimated by the Bayesian and exponential learners.
Appendix 1
Additional methods and results
Influence of the visual location of the previous trial on observers’ sound localization responses
We performed a control regression analysis to assess the influence of the visual location of the previous trial on observers’ sound localization response. This is important because 200 ms prior to trial onset and sound presentation, observers were presented with a visual cloud whose mean was the same as for the previous trial and whose standard deviation (STD) varied according to a continuous or discontinuous sequence (see main paper). To quantify the influence of the previous visual location, we expanded the regression model used in the main paper by another regressor modeling the visual cloud’s location on the previous trial. For instance, for bin = 1, we computed:
R_{A,trial,bin=1} = L_{A,trial,bin=1} * ß_{A,bin=1} + L_{V,trial,bin=1} * ß_{V,bin=1} + L_{V,trial−1,bin=1} * ß_{Vprevious,bin=1} + ß_{const,bin=1} + e_{trial,bin=1}
with R_{A,trial,bin=1} = localization response for the current trial that is assigned to bin 1; L_{A,trial,bin=1} or L_{V,trial,bin=1} = ‘true’ auditory or visual location for the current trial that is assigned to bin 1; L_{V,trial−1,bin=1} = ‘true’ visual location for the corresponding previous trial (for explanatory purposes, we assign here the bin of the current trial; the previous trial actually falls into a different bin); ß_{A,bin=1} or ß_{V,bin=1} = influence of the auditory or visual location of the current trial on the perceived sound location of the current trial for bin 1; ß_{Vprevious,bin=1} = influence of the visual location of the previous trial on the perceived sound location of the current trial for bin 1; ß_{const,bin=1} = constant term; e_{trial,bin=1} = error term. For each bin, we thus obtained another visual weight estimate ß_{Vprevious,bin} for the previous location.
First, we averaged ß_{Vprevious,bin} across bins and entered these participant-specific bin-averaged ß_{Vprevious} into two-sided one-sample t-tests at the between-subject random-effects level. Results: As shown in Supplementary file 1-Table 2A, this analysis demonstrated that the visual location of the previous trial significantly influenced observers’ perceived sound location on the current trial in the Sinusoidal, RW1 and (marginally) the Sinusoidal jump sequence.
Second, we computed the correlation of ß_{Vprevious,bin} with the visual noise in the current trials averaged in a given bin (r(ß_{Vprevious,bin}, σ_{Vcurrent,bin})) or with the visual noise (i.e. visual cloud’s STD) in the previous trial averaged within a given bin (r(ß_{Vprevious,bin}, σ_{Vprevious,bin})). The correlations were computed over bins within each participant. We entered the participant-specific Fisher z-transformed correlation coefficients into two-sided one-sample t-tests at the between-subject random-effects level (see Supplementary file 1-Table 2, section B and C). Results: These analyses demonstrated that the influence of the visual location on the previous trial was not correlated with the visual cloud’s STD on the current or previous trial (apart from one significant p-value for the sinusoidal jump sequence, which is no longer statistically significant after Bonferroni correction for the eight additional statistical comparisons). This analysis already suggests that the previous visual location is unlikely to be responsible for the effects of the previous STD on observers’ perceived sound location.
Third and most importantly, this regression model provides us with weights for the auditory (ß_{A,bin}) and visual (ß_{V,bin}) locations in the current trial from a regression model that regressed out the influence of the previous visual location. We used those auditory (ß_{A,bin}) and visual (ß_{V,bin}) weights as in the main paper to compute bin-specific w_{A,bin} = ß_{A,bin} / (ß_{A,bin} + ß_{V,bin}). Following exactly the same procedures as in the main paper, we then assessed in a repeated-measures ANOVA whether these w_{A,bin} differed between first and second half (see Supplementary file 1-Table 3). Moreover, we repeated a second regression model analysis to assess whether w_{A,bin} was predicted not only by the visual cloud’s STD of the current, but also of the previous bin using the following regression model (i.e. Equation 2 in the main text): w_{A,bin} = σ_{V,bin} * ß_{σV} + (σ_{V,bin} – σ_{V,bin−1}) * ß_{ΔσV} + ß_{const} + e_{bin} with w_{A,bin} = relative auditory weight in bin b; σ_{V,bin} or σ_{V,bin−1} = mean visual STD in current bin b or previous bin b−1; ß_{const} = constant term; e_{bin} = error term. To allow for generalization to the population level, the parameter estimates (ß_{σV}, ß_{ΔσV}) for each subject were entered into two-sided one-sample t-tests at the between-subject random-effects level (see Supplementary file 1-Table 4). Results: These control analyses (see Figure 2—figure supplement 1, Supplementary file 1-Table 4) replicate our initial analyses reported in the main manuscript. Collectively, they provide further evidence that the effect of previous visual location on observers’ perceived sound location cannot explain the effects of prior visual reliability that are the key focus of our paper.
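As an illustration of the control regression described above, the following minimal sketch fits the extended model for a single bin by ordinary least squares and derives the relative auditory weight. The function name, the synthetic locations and the generating coefficients are hypothetical and serve only to show the computation; they are not the experimental data or the authors' analysis code.

```python
import numpy as np

def auditory_weight(resp, loc_a, loc_v, loc_v_prev):
    """Regress localization responses on the current auditory and visual
    locations plus the previous trial's visual location (one bin)."""
    # Design matrix columns: auditory, visual, previous visual, constant
    X = np.column_stack([loc_a, loc_v, loc_v_prev, np.ones_like(loc_a)])
    beta, *_ = np.linalg.lstsq(X, resp, rcond=None)
    beta_a, beta_v, beta_v_prev, _ = beta
    w_a = beta_a / (beta_a + beta_v)  # relative auditory weight w_A
    return w_a, beta_v_prev

# Synthetic trials for one bin (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
n_trials = 500
locations = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
loc_a = rng.choice(locations, size=n_trials)
loc_v = rng.choice(locations, size=n_trials)
loc_v_prev = rng.choice(locations, size=n_trials)
resp = 0.6 * loc_a + 0.4 * loc_v + 0.05 * loc_v_prev + rng.normal(0.0, 1.0, n_trials)

w_a, beta_prev = auditory_weight(resp, loc_a, loc_v, loc_v_prev)
```

Because the previous visual location enters as its own regressor, the auditory and visual coefficients used for w_A are estimated with its influence partialled out.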
Nested model comparison to assess the effect of past visual noise on observers’ auditory weights
To assess the effect of past visual noise on auditory weights (Equation 2 in the main text), we also formally compared two nested linear mixed-effects models to predict observers’ relative auditory weights w_{A,bin} separately for the Sin, RW1, and RW2 sequences. The reduced model included only the STD of the current bin as fixed effect. The full model included both the STD of the current bin and the difference in STD between the current and the previous bin as fixed effects. Both the reduced and the full model included participants as random effects. After fitting the two models using maximum likelihood estimation, we compared them using log-likelihood ratio tests and the Bayesian Information Criterion as an approximation to the model evidence (see Supplementary file 1-Table 5). The model comparison demonstrated that the full model including the difference in STD provided a better explanation of observers’ relative auditory weights w_{A,bin} across all four sequences. This corroborates that observers estimate sensory uncertainty by combining information from past and current sensory inputs.
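The nested comparison can be sketched as follows, assuming the two maximized log-likelihoods are already available from a mixed-effects fit. The log-likelihoods, parameter counts and number of observations below are hypothetical placeholders, and the p-value formula is restricted to the case of one additional parameter, which matches the single added fixed effect here.

```python
import math

def compare_nested(loglik_reduced, loglik_full, k_reduced, k_full, n_obs):
    """Likelihood ratio test and BIC for two nested models fitted by ML."""
    lr_stat = 2.0 * (loglik_full - loglik_reduced)
    if k_full - k_reduced != 1:
        raise ValueError("this sketch only handles one extra parameter")
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(max(lr_stat, 0.0) / 2.0))
    bic_reduced = k_reduced * math.log(n_obs) - 2.0 * loglik_reduced
    bic_full = k_full * math.log(n_obs) - 2.0 * loglik_full
    return lr_stat, p_value, bic_reduced, bic_full

# Hypothetical log-likelihoods, parameter counts and sample size
lr_stat, p_value, bic_reduced, bic_full = compare_nested(
    loglik_reduced=-100.0, loglik_full=-95.0, k_reduced=4, k_full=5, n_obs=200
)
```

A lower BIC indicates the preferred model; the full model wins only if its gain in log-likelihood outweighs the penalty for the extra parameter.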
Characterization of observer’s behavior and the model predictions before and after the jumps in visual reliability
One critical question in the discontinuous sinusoidal jump sequence is whether observers continue combining past with current inputs to adapt their visual uncertainty estimates. One may argue that observers detect the discontinuity, ‘reset’ the estimation of visual uncertainty after the jump and therefore do not integrate information from before the jump. In that case, Bayesian learning models would not be ideal for modeling observers’ behavior in the jump sequence, which would be better accommodated by a Bayesian changepoint detection model (Adams and Mackay, 2007; Heilbron and Meyniel, 2019).
First, we inserted the jumps selectively in the period sections in which the visual variance was greater, so that changes in visual variance were more difficult to detect. This experimental choice minimized the chances that observers ‘reset’ their estimation of visual uncertainty.
Second, we assessed observers’ estimation strategy at a greater temporal resolution before and after the jumps. To estimate the relative auditory weight w_{A} at a greater resolution, we applied separate regression models to individual sampling points of the visual cloud of dots presented every 200 ms (i.e. no binning). Thus, w_{A} was computed at 5 Hz resolution before and after the jumps (i.e. at time points [−1.9:0.2:1.9] s; Figure 5—figure supplement 2). Because the number of trials was very low on individual sampling points, we pooled the trials across the three up- and down-jumps before computing the regression models. Nevertheless, the small number of trials on individual sampling points (range 7–28 trials across participants and bins) rendered the estimation of the relative auditory weights very unreliable. Thus, individual w_{A} values that were smaller or larger than three times the scaled median absolute deviation were excluded from the analyses in Figure 5—figure supplement 2 and Supplementary file 1-Table 6. To assess statistically whether participants adjusted w_{A} after the jump, we computed a paired t test on w_{A} specifically from the time point before versus after the jump (i.e. −0.1 vs. 0.1 s; Supplementary file 1-Table 6).
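The exclusion rule can be sketched as below. Two details are assumptions rather than statements from the text: deviations are measured from the median, and the scaling constant 1.4826 (which makes the MAD a consistent estimator of the STD for Normal data) is used. The example weights are hypothetical.

```python
import numpy as np

def exclude_outliers(w_a, n_mads=3.0):
    """Drop values further than n_mads scaled MADs from the median."""
    w_a = np.asarray(w_a, dtype=float)
    median = np.median(w_a)
    # Scaled MAD: 1.4826 * median absolute deviation (assumed convention)
    scaled_mad = 1.4826 * np.median(np.abs(w_a - median))
    keep = np.abs(w_a - median) <= n_mads * scaled_mad
    return w_a[keep], keep

# Hypothetical relative auditory weights with one implausible estimate
weights = [0.40, 0.50, 0.45, 0.55, 0.50, 3.00]
kept, keep_mask = exclude_outliers(weights)
```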
Third, we assessed the model fit of our learning models before and after the jumps. If the jumps violate the assumptions of the learning models, we would expect that observers’ behavior deviates from the model’s predictions more strongly after the jump. We computed the root mean squared error of the models’ w_{A} (i.e. (w_{A,behavior} – w_{A,model})^{2}) before and after the change point (Figure 5—figure supplement 2B) and entered those into a paired t test (i.e. −0.1 vs. 0.1 s; Supplementary file 1-Table 6).
Results: Our data showed that participants and models rapidly and significantly adjusted their weights after the jumps. Critically, the model fits did not significantly differ for the time points just before or after the jumps in visual variance (i.e. if anything, they significantly decreased after the jump; Figure 5—figure supplement 2B and Supplementary file 1-Table 6). Collectively, these control analyses suggest that our Bayesian and exponential learning models adequately modeled observers’ visual uncertainty adaptation both before and after change points (Norton et al., 2019).
Parameter recovery
To test the validity of the Bayesian, exponential, and instantaneous models, we performed parameter recovery by assessing the bias and variability of the parameters fitted to simulated data sets with respect to the true parameters used to generate the data.
For each model, we selected four different parameter sets (within a realistic range of values for parameters σ_{Α} = [6:12]°, P_{common} = [0.7:0.9], σ_{0} = [6:20]°, κ = [5:20], γ = [0.3:0.7]) and generated data sets of simulated observers for the RW2 sequence. We repeated this process six times (with different initial random seeds), creating a total of 24 simulated data sets for each model. We then fitted the Bayesian, exponential, and instantaneous learner models to each simulated data set (using exactly the same fitting procedures as for observers’ data in the experiments) resulting in 24 sets of best fitting parameters for each model.
In order to assess how well the fitting procedure recovers the generating parameters, we compared the fitted parameters to the ‘true’ parameters used to generate the data. Specifically, we assessed the parameter recovery in terms of bias and variability of the fitted parameters as follows: The recovered parameters’ bias was computed as the signed deviation from the true generating value in percentage. As an example, if a data set was generated with a model parameter of 5, but the fitted (i.e. recovered) parameter was 4, we would compute a −20% deviation. As a measure of the variability for the recovered parameters, we calculated the absolute (i.e. unsigned) deviation from the true generating values in percentage. As an example, a fitted value of 4, relative to a generating value of 5 would be a 20% absolute deviation. We report the median (and first and third quartile) across simulated data sets as a robust measure for this bias and variability.
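The bias and variability measures described above can be sketched as follows. The first call reproduces the worked example from the text (generating value 5, fitted value 4); the three-element data set is hypothetical.

```python
import numpy as np

def percent_deviation(true_value, fitted_value):
    """Signed deviation of the fitted from the generating value, in percent."""
    return 100.0 * (fitted_value - true_value) / true_value

def recovery_summary(true_values, fitted_values):
    """Median and quartiles of signed (bias) and absolute (variability) deviations."""
    signed = percent_deviation(np.asarray(true_values, dtype=float),
                               np.asarray(fitted_values, dtype=float))

    def quartiles(x):
        return np.percentile(x, [25, 50, 75])

    return {"bias": quartiles(signed), "variability": quartiles(np.abs(signed))}

example = percent_deviation(5.0, 4.0)  # worked example from the text
summary = recovery_summary([5.0, 5.0, 5.0], [4.0, 5.0, 6.0])  # hypothetical fits
```

Symmetric over- and underestimation cancels in the bias measure but not in the variability measure, which is why both are reported.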
Appendix 2
This document describes a Variational Bayes approximation to inference on a generative model that allows for two possible ways in which the stimulus data were generated (thus allowing subjects to perform causal inference).
Section (1) describes the full generative model for both a single and two sources, section (2) explains how an optimal observer can perform inference within either submodel, through a variational Bayes approximation to the posteriors and section (3) shows how to calculate the model likelihood for either submodel, as necessary for combining the two submodels.
Section (4) finally describes how the results for each causal model are combined into a single posterior.
1 Generative model
The model presented here is an extension of the Causal Inference model of Körding et al., 2007, with the reliability of the visual signal assumed to be changing smoothly over trials according to a random walk (RW). In the case where the visual reliability is constant the model approximates the original Causal Inference model.
In this model (figure Appendix 2—figure 1), at each stimulus presentation, t, subjects assume that the visual dots at positions ${V}_{i,t}$ and auditory stimulus at ${A}_{t}$, are generated through either of two causal models (${C}_{t}=1$ or ${C}_{t}=2$) with fixed prior probabilities:
If ${C}_{t}=1$, (single source, ${S}_{t}$, leading to forced fusion)
If ${C}_{t}=2$ (independent sources, ${S}_{A,t}$ and ${S}_{V,t}$ )
The intermediate variable ${U}_{t}$ means that the mean of the visual dots is not located at the true source (${S}_{t}$ or ${S}_{V,t}$), but normally distributed around it. Note that for ${C}_{t}=1$ we can explicitly write ${S}_{A,t}={S}_{V,t}={S}_{t}$.
For simplicity, we assume that ${\mu}_{0}=0$, that is, the prior mean is located at the horizontal center. The auditory STD ${\sigma}_{A}$, the prior probability of a single cause, ${P}_{common}$ and the prior STD, ${\sigma}_{0}$ , are fixed individually for each subject (see main text for fitting procedure).
In the following, we will simplify the notation by referring to $P(*|{C}_{t}=1)$ by ${P}_{1}(*)$ and $P(*|{C}_{t}=2)$ by ${P}_{2}(*)$.
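The generative process of this section can be sketched as a forward simulation of one trial. Several details are assumptions inferred from later parts of this appendix rather than stated here: the random walk acts on the log precision $\theta =\mathrm{log}{\lambda}_{V}$ with precision $\kappa$, and ${U}_{t}$ is scattered around the source with the same precision ${\lambda}_{V,t}$ as the dots around ${U}_{t}$. The function name is hypothetical; the default parameter values are the ones used for the simulations in the main text.

```python
import numpy as np

def simulate_trial(theta_prev, rng, p_common=0.7, sigma_a=6.0,
                   sigma_0=12.0, kappa=15.0, n_dots=20):
    """Draw one trial from the assumed causal-inference generative model."""
    # Random walk on the visual log precision (assumed: precision kappa)
    theta = rng.normal(theta_prev, 1.0 / np.sqrt(kappa))
    sd_v = np.exp(-theta / 2.0)  # visual STD = 1 / sqrt(lambda_V)
    # Causal structure: common source with prior probability p_common
    common = bool(rng.random() < p_common)
    s_a = rng.normal(0.0, sigma_0)  # prior mean mu_0 = 0
    s_v = s_a if common else rng.normal(0.0, sigma_0)
    a = rng.normal(s_a, sigma_a)             # auditory signal
    u = rng.normal(s_v, sd_v)                # cloud mean around the source
    v = rng.normal(u, sd_v, size=n_dots)     # visual dots around the cloud mean
    return {"theta": theta, "common": common, "s_a": s_a, "s_v": s_v,
            "a": a, "u": u, "v": v}

rng = np.random.default_rng(0)
trial = simulate_trial(theta_prev=np.log(1.0 / 25.0), rng=rng)  # visual STD near 5 deg
```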
2 Posterior
The full posterior over the latent variables in the model up until time t is
Recursively we can write
where $P({S}_{A,t},{S}_{V,t})$ obviously depends on ${C}_{t}$ through the generative model (Appendix 2—figure 1).
If we marginalize over the latent ${C}_{t}$:
At this point it should be clear that the posterior is a mixture of the forced fusion and independent solutions, with the mixture determined by the posterior probability of either model generating the data:
To evaluate this, we need to calculate the marginal model evidence, $P({A}_{t},{V}_{1:N,t}|{C}_{t})$, for either model (see section 3).
2.1 Posterior for C = 1
The full posterior over the latent variables in the single source submodel is
Recursively we can write
As we will see later it is convenient to use a change of parameters
allowing us to rewrite
where
We will assume that
can be approximated by a Normal distribution (see below), thus allowing us to write
where $1/{\kappa}_{t}^{\prime}=1/\kappa +1/{\tau}_{\theta ,t-1}$ (due to properties of convolution of two Normal distributions).
The log-posterior (to be used for a variational approximation) is now
2.2 Variational Bayes approximation for C = 1
We will now approximate the log-posterior with variational Bayes by factorization:
2.2.1 ${q}_{1}({S}_{t})$
For ${q}_{1}({S}_{t})$
where ${E}_{Y}(X)$ signifies the expectation of X over the distribution of Y: ${E}_{Y}(X)=\int P(Y)X\mathit{d}Y$.
Using ${E}_{U}(({U}_{t}-{S}_{t})^{2})={E}_{U}({U}_{t}^{2}+{S}_{t}^{2}-2{U}_{t}{S}_{t})={E}_{U}({U}_{t}^{2})+{S}_{t}^{2}-2{S}_{t}{E}_{U}({U}_{t})$ = $({S}_{t}^{2}-2{S}_{t}{E}_{U}({U}_{t})+{E}_{U}({U}_{t})^{2})-{E}_{U}({U}_{t})^{2}+{E}_{U}({U}_{t}^{2})=({S}_{t}-{E}_{U}({U}_{t}))^{2}-{E}_{U}({U}_{t})^{2}+{E}_{U}({U}_{t}^{2})$, where the last two terms do not depend on ${S}_{t}$ (and thus can be discarded), we can rewrite the last term:
2.2.2 ${q}_{1}({U}_{t})$
For ${q}_{1}({U}_{t})$
Here, we use the same trick
2.2.3 ${q}_{1}(\theta )$
For ${q}_{1}(\theta )$
2.2.4 Simplifying ${q}_{1}({S}_{t})$ and $\mathrm{log}{q}_{1}({U}_{t})$
Inspecting $\mathrm{log}{q}_{1}({S}_{t})$ and $\mathrm{log}{q}_{1}({U}_{t})$ we can see that both ${q}_{1}({S}_{t})$ and ${q}_{1}({U}_{t})$ are products of Normal distributions, and thus themselves Normal distributed
and
where
Note that ${E}_{S}({S}_{t})\approx \int {q}_{1}({S}_{t}){S}_{t}\mathit{d}{S}_{t}={\mu}_{S,t}$ and ${E}_{U}({U}_{t})\approx \int {q}_{1}({U}_{t}){U}_{t}\mathit{d}{U}_{t}={\mu}_{U,t}$
2.2.5 Simplifying ${q}_{1}({\theta}_{t})$
Regarding ${q}_{1}({\theta}_{t})$ we can expand a little using that
(using that $E({X}^{2})={\mu}^{2}+1/\tau $ for a normal distribution $\mathcal{N}(X;\mu ,1/\tau )$) and
Which together gives:
2.3 Approximating $q(\theta )$
We will approximate ${q}_{1}(\theta )$ with a Normal distribution.
To do this, we use a Laplace approximation around the max of ${q}_{1}(\theta )$ (see figure Appendix 2—figure 2): $\mathrm{arg}\mathrm{max}({q}_{1}({\theta}_{t}))={\mu}_{\theta ,t}$ and with second derivative ${\tau}_{\theta ,t}$
This gives
2.3.1 First derivative
However, in order to find $\mathrm{arg}\mathrm{max}({q}_{1}({\theta}_{t}))$, we differentiate $\mathrm{log}{q}_{1}({\theta}_{t})$ and set equal to 0:
At this point there is no analytical solution.
2.3.2 Taylor expansion of first derivative
Although we could use a numerical approximation for speed of implementation, we use Taylor expansion. We need to solve for $\theta $
For simplicity we refer to $SV=({E}_{U,S}(({U}_{t}-{S}_{t})^{2})+{E}_{U}({\sum}_{i}^{N}({U}_{t}-{V}_{i,t})^{2}))$
We can solve this by using the third-order Taylor expansion of the exponential:
We can set ${\theta}_{*}$ as ${\theta}_{t-1}$, as we expect that ${\theta}_{t}$ will be close to ${\theta}_{*}$. For big changes in the variance of ${V}_{i}$ this can be off; however, this was not a problem in this stimulus set which relied on slow gradual changes.
In order to find $\mathrm{arg}\mathrm{max}({q}_{1}({\theta}_{t}))$ we therefore have to solve
which can be rewritten as a third order polynomial
where
the solution to which, ${\theta}_{t}^{optim}$, can be numerically found using Matlab’s $nthroot$ function.
As we assume that the log-variance only changes slightly between trials, the solution closest to the previous value ${\theta}_{t-1}$ is automatically chosen, $\mathrm{arg}\mathrm{max}({q}_{1}({\theta}_{t}))={\mu}_{{\theta}_{t}}={\theta}_{t}^{optim}$.
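The root selection described above can be sketched as follows. This sketch swaps in `numpy.roots` as a generic polynomial solver in place of the Matlab routine mentioned in the text, and the cubic coefficients are hypothetical stand-ins for the coefficients of the third-order polynomial.

```python
import numpy as np

def closest_real_root(coeffs, theta_prev, tol=1e-9):
    """Return the real root of a polynomial closest to the previous estimate.

    coeffs are ordered from the highest power to the constant term.
    """
    roots = np.roots(coeffs)
    real_roots = roots[np.abs(roots.imag) < tol].real
    if real_roots.size == 0:
        raise ValueError("no real root found within tolerance")
    return float(real_roots[np.argmin(np.abs(real_roots - theta_prev))])

# Hypothetical cubic (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6
theta_opt = closest_real_root([1.0, -6.0, 11.0, -6.0], theta_prev=1.8)
```

Choosing the root nearest to the previous value encodes the assumption that the log precision changes only slightly from one trial to the next.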
2.3.3 Second derivative
The second derivative is
We evaluate this at $argmax({q}_{1}({\theta}_{t}))$, so we insert ${\theta}_{t}={\mu}_{\theta ,t}$
Hence we can finally write
where
and where $SV={({\mu}_{U}-{\mu}_{S})}^{2}+1/{\tau}_{S}+(N+1)/{\tau}_{U}+{\sum}_{i}^{N}{({V}_{i,t}-{\mu}_{U})}^{2}$
With ${q}_{1}({\theta}_{t})$ a Normal distribution, that makes ${q}_{1}({\lambda}_{V,t})$ a lognormal distribution with ${\mu}_{{\lambda}_{V,t}}=E({\lambda}_{V,t})=E(\mathrm{exp}({\theta}_{t}))=\mathrm{exp}({\mu}_{{\theta}_{t}}+1/(2*{\tau}_{{\theta}_{t}}))$ (general property of lognormal distribution).
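The lognormal mean used above can be checked numerically. The sketch below compares the closed-form expression with a Monte Carlo average; the values of the mean and precision of $\theta$ are hypothetical.

```python
import numpy as np

def expected_precision(mu_theta, tau_theta):
    """Mean of lambda = exp(theta) when theta ~ Normal(mu_theta, 1 / tau_theta)."""
    return np.exp(mu_theta + 1.0 / (2.0 * tau_theta))

# Hypothetical posterior over the log precision
mu_theta, tau_theta = -3.0, 4.0
analytic = expected_precision(mu_theta, tau_theta)

# Monte Carlo check: sample theta, exponentiate, average
rng = np.random.default_rng(1)
theta_samples = rng.normal(mu_theta, 1.0 / np.sqrt(tau_theta), size=200_000)
monte_carlo = np.exp(theta_samples).mean()
```

Note that the mean of the precision exceeds $\mathrm{exp}({\mu}_{{\theta}_{t}})$, and the gap grows as the posterior over $\theta$ becomes less certain (smaller $\tau$).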
2.4 Final algorithm for C = 1
We can now create an iterative algorithm that for each time step t represents the model posterior. Variables ${\sigma}_{A}^{2}=1/{\lambda}_{A}$, ${\sigma}_{0}^{2}=1/{\lambda}_{0}$ and $\kappa $ have to be set beforehand, together with the input data ${A}_{1:t}$ and ${V}_{1:N,1:t}$. For time step t:
1. initially set
2. set ${\mu}_{S},{\tau}_{S,t}$
3. set ${\mu}_{U,t},{\tau}_{U,t}$
where $\overline{{V}_{t}}=1/N{\sum}_{i}^{N}{V}_{i,t}$
4. find ${\mu}_{{\theta}_{t}}$ by solving third order polynomial, Equation 51,
then set ${\tau}_{{\theta}_{t}}$
where ${\kappa}_{t}^{\prime}=1/(1/\kappa +1/{\tau}_{\theta ,t-1})$
5. Repeat steps 2–4 until the change in each parameter is small (<0.0001)
This is then repeated for each time step t, providing us with the approximation to the posterior ${P}_{1}\left({S}_{t},{\theta}_{t},{U}_{t}|{A}_{t},{V}_{1:N,t}\right)\approx {q}_{1}\left({S}_{t},{U}_{t},{\theta}_{t}\right)={q}_{1}\left({S}_{t}\right)\ast {q}_{1}\left({U}_{t}\right)\ast {q}_{1}\left({\theta}_{t}\right)$.
2.5 Posterior for C = 2
Due to the independent structure this posterior can be written as
where
which is simple enough given the Normal distribution of both $P({S}_{A,t})$ and $P({A}_{t}|{S}_{A,t})$
where ${\sigma}_{A0}^{2}=1/(1/{\sigma}_{A}^{2}+1/{\sigma}_{0}^{2})$.
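For the independent-sources case, the auditory posterior follows from the standard product of a Normal prior and a Normal likelihood. A minimal sketch, in which the function name is hypothetical and the posterior mean formula is the textbook conjugate result rather than an equation quoted from this appendix:

```python
def auditory_posterior(a_t, sigma_a, sigma_0, mu_0=0.0):
    """Posterior mean and variance of S_A given A_t under a Normal prior."""
    # Posterior variance: sigma_A0^2 = 1 / (1 / sigma_A^2 + 1 / sigma_0^2)
    var_post = 1.0 / (1.0 / sigma_a**2 + 1.0 / sigma_0**2)
    # Posterior mean: precision-weighted average of the signal and prior mean
    mean_post = var_post * (a_t / sigma_a**2 + mu_0 / sigma_0**2)
    return mean_post, var_post

# Parameter values taken from the simulations in the main text
mean_post, var_post = auditory_posterior(a_t=10.0, sigma_a=6.0, sigma_0=12.0)
```

With ${\sigma}_{A}=6$° and ${\sigma}_{0}=12$°, a sound heard at 10° is pulled toward the central prior, giving a posterior mean of 8° and a posterior variance of 28.8.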
Note that for the subject response the posterior ${P}_{2}({S}_{A,t}|{A}_{t})$ is all that is needed, but for the calculation of the prior $P({\lambda}_{V,t})$ for subsequent trial $t+1$ we need to compute the full posterior.
We again use the transformation of parameters
Proceeding with just the posterior over ${S}_{V,t}$, ${U}_{t}$ and ${\theta}_{t}$
where
We will assume that $P({\theta}_{V,1:t-1}|{A}_{1:t-1},{V}_{1:N,1:t-1})$ can be approximated by a Normal distribution (see ${q}_{1}(\theta )$ below), thus allowing us to use properties of Normal distributions.
Hence,
where $1/{\kappa}_{t}^{\prime}=1/\kappa +1/{\tau}_{\theta ,t-1}$ (due to the convolution of $P({\theta}_{t}|{\theta}_{t-1})$ with $P({\theta}_{t-1}|{A}_{1:t-1},{V}_{1:N,1:t-1})$, both Normal distributed).
While any estimate of ${\theta}_{t}$ will depend on ${A}_{1:t-1}$ and ${V}_{1:N,1:t-1}$, for ease of notation we will omit those in the following.
The log-posterior is now
2.6 Variational Bayes approximation for C = 2
We will now approximate the log-posterior with variational Bayes by factorization: ${P}_{2}({S}_{t},{\theta}_{t},{U}_{t}|{A}_{t},{V}_{1:N,t})\approx {q}_{2}\left({S}_{t},{U}_{t},{\theta}_{t}\right)={q}_{2}({S}_{t})\ast {q}_{2}({U}_{t})\ast {q}_{2}({\theta}_{t})$. This proceeds similarly to the combined ($C=1$) model, but with ${S}_{V,t}$ instead of ${S}_{t}$, and with no influence from ${A}_{t}$. For completeness the calculations are included here:
2.6.1 ${q}_{2}({S}_{t})$
For ${q}_{2}({S}_{t})$
where ${E}_{Y}(X)$ signifies the expectation of X over the distribution of Y: ${E}_{Y}(X)=\int P(Y)X\mathit{d}Y$.
Using ${E}_{U}(({U}_{t}-{S}_{V,t})^{2})={E}_{U}({U}_{t}^{2}+{S}_{V,t}^{2}-2{U}_{t}{S}_{V,t})={E}_{U}({U}_{t}^{2})+{S}_{V,t}^{2}-2{S}_{V,t}{E}_{U}({U}_{t})$ = $({S}_{V,t}^{2}-2{S}_{V,t}{E}_{U}({U}_{t})+{E}_{U}({U}_{t})^{2})-{E}_{U}({U}_{t})^{2}+{E}_{U}({U}_{t}^{2})=({S}_{V,t}-{E}_{U}({U}_{t}))^{2}-{E}_{U}({U}_{t})^{2}+{E}_{U}({U}_{t}^{2})$, where the last two terms do not depend on ${S}_{V,t}$ (and thus can be discarded), we can rewrite the last term:
2.6.2 ${q}_{2}({U}_{t})$
For ${q}_{2}({U}_{t})$
Here, we use the same trick
2.6.3 ${q}_{2}({\theta}_{t})$
For ${q}_{2}(\theta )$
2.6.4 Simplifying ${q}_{2}({S}_{t})$ and $\mathrm{log}{q}_{2}({U}_{t})$
Inspecting $\mathrm{log}{q}_{2}({S}_{V,t})$ and $\mathrm{log}{q}_{2}({U}_{t})$ we can see that both ${q}_{2}({S}_{V,t})$ and ${q}_{2}({U}_{t})$ are products of Normal distributions, and thus themselves Normal distributed
and
where
Note that ${E}_{S}({S}_{V,t})\approx \int {q}_{2}({S}_{V,t}){S}_{V,t}\mathit{d}{S}_{V,t}={\mu}_{{S}_{V},t}$ and ${E}_{U}({U}_{t})\approx \int {q}_{2}({U}_{t}){U}_{t}\mathit{d}{U}_{t}={\mu}_{U,t}$
We can approximate ${q}_{2}(\theta )$ with a Normal distribution, in exactly the same way as for C = 1. As equations are identical (see above) they will not be repeated here.
2.7 Final algorithm for C = 2
We can now create an iterative algorithm that for each time step t represents the variational Bayes approximation of the model posterior over ${S}_{V,t},{U}_{t}$ and ${\lambda}_{V,t}$ (or rather ${\theta}_{t}$), $P({S}_{V,t},{U}_{t},{\theta}_{t}|{V}_{1:N,t})$. Variable $\kappa $ has to be set beforehand, together with the input data ${A}_{1:t}$ and ${V}_{1:N,1:t}$. For time step t:
1. initially set ${\mu}_{\theta ,t}={\mu}_{\theta ,t-1}$, ${\mu}_{U,t}=1/N\sum {V}_{i,t}$, and ${\tau}_{\theta ,t}=1$
2. set ${\mu}_{{S}_{V}},{\tau}_{{S}_{V},t}$
3. set ${\mu}_{U},{\tau}_{U,t}$
where $\overline{{V}_{t}}=1/N{\sum}_{i}^{N}{V}_{i,t}$
4. find ${\mu}_{\theta ,t}$ by numerically solving polynomial, Equation 51,
then set ${\tau}_{\theta ,t}$
where ${\kappa}_{t}^{\prime}=1/(1/\kappa +1/{\tau}_{\theta ,t-1})$
5. Repeat steps 2–4 until convergence, that is, until the change in each parameter is small (<0.0001)
This is then repeated for each time step t, providing us with the approximation to the posterior ${P}_{2}({S}_{t},{\theta}_{t},{U}_{t}|{A}_{t},{V}_{1:N,t})\approx {q}_{2}\left({S}_{t},{U}_{t},{\theta}_{t}\right)={q}_{2}({S}_{t})\ast {q}_{2}({U}_{t})\ast {q}_{2}({\theta}_{t})$.
See figure Appendix 2—figure 3 below for an example of the learned inference of the visual variance ${\sigma}_{V,t}^{2}=1/{\lambda}_{V,t}\sim 1/\mathrm{log}{q}_{2}({\theta}_{t})$, compared with a simple instantaneous learner model that assumes ${\sigma}_{V,t}^{2}=1/N{\sum}_{i}{({V}_{i,t}-{\overline{V}}_{t})}^{2}$.
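For comparison, the instantaneous learner's estimate depends only on the dots of the current cloud. A minimal sketch, with hypothetical dot positions:

```python
import numpy as np

def instantaneous_variance(dots):
    """Visual variance from the current cloud only: mean squared
    deviation of the dots from their own mean (normalized by N)."""
    dots = np.asarray(dots, dtype=float)
    return float(np.mean((dots - dots.mean()) ** 2))

# Hypothetical dot positions (degrees) for one visual cloud
sigma2_v = instantaneous_variance([-2.0, 0.0, 2.0])
```

Because this estimate carries no memory of earlier clouds, it fluctuates more from trial to trial than an estimate that also draws on the prior from previous trials, especially when the cloud contains few dots.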
3 Marginal model evidence
Recall that the posterior is a mixture of the forced fusion and independent solutions, with the mixture determined by the posterior probability of either model generating the data:
To evaluate this, we need to calculate the marginal model evidence, $P({A}_{t},{V}_{1:N,t}|{C}_{t})$, for either model.
One way to do so is by a sampling approximation, but here we utilise the variational results we have already found.
3.1 Model likelihood for C = 2, two sources ${S}_{V,t},{S}_{A,t}$
We need to evaluate the model likelihood for both $C=1$ and $C=2$. The case for $C=2$ is slightly simpler, hence we start with this:
The first integral is easy as it is just the integral of the product of two Normal distributions.
It is however more convenient to operate in log-space
The second integral we approximate through the Free Energy that we already maximise iteratively in the variational Bayes algorithm.
(this approximation becomes exact if the variational approximation is exact, that is, if the Kullback-Leibler divergence between the posterior ${P}_{2}({U}_{t},{\theta}_{t},{S}_{V,t}|{V}_{1:N})$ and the approximation ${q}_{2}({U}_{t},{\theta}_{t},{S}_{V,t})$ becomes zero.)
This can be interpreted as taking the expectation with regard to the posterior approximation, and due to the properties of the logarithm this can be separated into a sum of expectations:
where (due to Equation 43)
and (since $E({X}^{2})={\mu}_{X}^{2}+{\sigma}_{X}^{2}$)
and
and (due to Equation 75)
and
and
and
In total we now have
Although lengthy, this is trivial and fast to compute numerically (e.g. in Matlab). Note that all estimates come from the variational Bayes approximation ${q}_{2}({S}_{t},{U}_{t},{\theta}_{t})$.
3.2 Model likelihood for C = 1, one source ${S}_{t}={S}_{V,t}={S}_{A,t}$
We now need to do the same for the one source model.
Note that for simplicity in notation we will use P_{1} to indicate the probability within the model given ${C}_{t}=1$
We again approximate through the Free Energy that we already maximised iteratively in the variational Bayes algorithm.
(this approximation becomes exact if the variational approximation is exact, i.e. if the Kullback-Leibler divergence between the posterior ${P}_{1}(V,{\theta}_{t},{S}_{V,t}|{V}_{1:N,t})$ and the approximation ${q}_{1}({U}_{t},{\theta}_{t},{S}_{V,t})$ becomes zero.)
This can be interpreted as taking the expectation with regard to the posterior approximation, and due to the properties of the logarithm this can be separated into a sum of expectations:
where (since $E({X}^{2})={\mu}_{X}^{2}+{\sigma}_{X}^{2}$)
and
and
and (due to Equation 75)
and
and
and
In total we now have
(where all first and second order moments ($\mu ,\tau $) have been derived from q_{1}).
This is identical to the result from $C=2$ except for the first line.
In total this provides us with an approximation to the model evidence for each model, $P({A}_{t},{V}_{1:N,t}|{C}_{t}=1)$ and $P({A}_{t},{V}_{1:N,t}|{C}_{t}=2)$, which can be used to calculate the posterior probability of either model given data, $P({C}_{t}|{A}_{1:t},{V}_{1:N,1:t})$.
4 Putting it all together
For either submodel, the factorization (due to assumptions and variational Bayes approximation) allows us to write out the equations for the variable for subject choice:
This is now a mixture of two Gaussian distributions (due to the Variational Bayes approximation), with mixture weights given by the model evidence (partly approximated by the Free Energy).
We will assume subjects report the mean of the distribution that is,
where ${\widehat{S}}_{C=1,t}={\mu}_{S,t}$ for $C=1$ and ${\widehat{S}}_{C=2,t}={\mu}_{S,A,t}$ for $C=2$.
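The mixture mean that subjects are assumed to report can be sketched as follows, taking the (approximate) log model evidences of the two submodels as given. The function name and the numerical inputs are hypothetical, and the prior ${P}_{common}$ must lie strictly between 0 and 1 for the logarithms to be finite.

```python
import numpy as np

def model_averaged_estimate(log_ev_c1, log_ev_c2, p_common, s_hat_c1, s_hat_c2):
    """Posterior probability of a common cause and the mixture-mean estimate."""
    # Unnormalized log posterior over the causal structure
    log_post = np.array([log_ev_c1 + np.log(p_common),
                         log_ev_c2 + np.log1p(-p_common)])
    log_post -= log_post.max()  # subtract the maximum for numerical stability
    post = np.exp(log_post)
    post /= post.sum()
    s_hat = post[0] * s_hat_c1 + post[1] * s_hat_c2
    return float(s_hat), float(post[0])

# Hypothetical inputs: equal evidence, so the posterior equals the prior
s_hat, p_c1 = model_averaged_estimate(
    log_ev_c1=-50.0, log_ev_c2=-50.0, p_common=0.7, s_hat_c1=4.0, s_hat_c2=10.0
)
```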
We also need a prior over the visual log-reliability for the following trial
While this is a mixture of two Gaussians, we need the prior to be a single Gaussian in order for our approximation scheme above to work. We will approximate this mixture with a single Gaussian (essentially fitting a Gaussian to the mixture of two Gaussians).
where (due to the first- and second-order moments of mixture distributions)
While fitting a Gaussian to the sum of two Gaussians could be a very inexact approximation, in practice the two individual distributions are close enough for this not to be a problem (as any contribution from ${A}_{t}$ to the posterior of ${\theta}_{t}$ is very small).
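Collapsing the two-component mixture into a single Gaussian by matching its first two moments can be sketched as below. The sketch works with variances rather than precisions, uses the standard mixture-moment identities, and the numerical inputs are hypothetical.

```python
def collapse_mixture(w1, mu1, var1, mu2, var2):
    """Moment-match a two-component Gaussian mixture with a single Gaussian."""
    w2 = 1.0 - w1
    mean = w1 * mu1 + w2 * mu2
    # Second moment of the mixture minus the squared mean
    var = w1 * (var1 + mu1**2) + w2 * (var2 + mu2**2) - mean**2
    return mean, var

# Well-separated components: the collapsed variance exceeds either component
mean_far, var_far = collapse_mixture(0.5, -1.0, 1.0, 1.0, 1.0)
# Nearly identical components: the collapsed Gaussian is close to both
mean_near, var_near = collapse_mixture(0.7, -3.00, 0.25, -3.01, 0.25)
```

The second call illustrates the regime described in the text: when the two components nearly coincide, the single Gaussian loses very little.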
In conclusion, subjects report $\widehat{{S}_{t}}$ (through a button response, see Equation 121) and they propagate the posterior $P({\theta}_{t}|{A}_{1:t},{V}_{1:N,1:t})$ (see Equation 123) as prior for the next trial.
Data availability
The human behavioral raw data and computational model predictions as well as the code for computational modelling and analyses scripts are available in an OSF repository: https://osf.io/gt4jb/.

Open Science Framework. Using the past to estimate sensory uncertainty. https://doi.org/10.17605/OSF.IO/GT4JB
References

On the origins of suboptimality in human probabilistic inference. PLOS Computational Biology 10:e1003661. https://doi.org/10.1371/journal.pcbi.1003661

Bayesian comparison of explicit and implicit causal inference strategies in multisensory heading perception. PLOS Computational Biology 14:e1006110. https://doi.org/10.1371/journal.pcbi.1006110

Conference: Practical Bayesian optimization for model fitting with Bayesian adaptive direct search. Advances in Neural Information Processing Systems. pp. 1836–1846.

Conference: The CIPIC HRTF database. IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, 2001. https://doi.org/10.1109/ASPAA.2001.969552

Bayesian integration of visual and auditory signals for spatial localization. Journal of the Optical Society of America A 20:1391–1397. https://doi.org/10.1364/JOSAA.20.001391

Learning the value of information in an uncertain world. Nature Neuroscience 10:1214–1221. https://doi.org/10.1038/nn1954

Statistically optimal perception and learning: from behavior to neural representations. Trends in Cognitive Sciences 14:119–130. https://doi.org/10.1016/j.tics.2010.01.003

Understanding predictive information criteria for Bayesian models. Statistics and Computing 24:997–1016. https://doi.org/10.1007/s11222-013-9416-2

Confidence resets reveal hierarchical adaptive learning in humans. PLOS Computational Biology 15:e1006972. https://doi.org/10.1371/journal.pcbi.1006972

Optimal integration of texture and motion cues to depth. Vision Research 39:3621–3629. https://doi.org/10.1016/S0042-6989(99)00088-7

The Bayesian brain: the role of uncertainty in neural coding and computation. Trends in Neurosciences 27:712–719. https://doi.org/10.1016/j.tins.2004.10.007

Book: Perception as Bayesian Inference. Cambridge University Press. https://doi.org/10.1017/CBO9780511984037

Bayesian inference with probabilistic population codes. Nature Neuroscience 9:1432–1438. https://doi.org/10.1038/nn1790

Neural coding of uncertainty and probability. Annual Review of Neuroscience 37:205–220. https://doi.org/10.1146/annurev-neuro-071013-014017

Learned rather than online relative weighting of visual-proprioceptive sensory cues. Journal of Neurophysiology 119:1981–1992. https://doi.org/10.1152/jn.00338.2017

Human online adaptation to changes in prior probability. PLOS Computational Biology 15:e1006681. https://doi.org/10.1371/journal.pcbi.1006681

Comparing families of dynamic causal models. PLOS Computational Biology 6:e1000709. https://doi.org/10.1371/journal.pcbi.1000709

A detailed comparison of optimality and simplicity in perceptual decision making. Psychological Review 123:452–480. https://doi.org/10.1037/rev0000028

Fast temporal dynamics of visual cue integration. Perception 31:421–434. https://doi.org/10.1068/p3314

Integration of proprioceptive and visual position-information: an experimentally supported model. Journal of Neurophysiology 81:1355–1364. https://doi.org/10.1152/jn.1999.81.3.1355

Probability matching as a computational strategy used in perception. PLOS Computational Biology 6:e1000871. https://doi.org/10.1371/journal.pcbi.1000871

Probabilistic interpretation of population codes. Neural Computation 10:403–430. https://doi.org/10.1162/089976698300017818
Decision letter

Tobias Reichenbach, Reviewing Editor; Imperial College London, United Kingdom

Andrew J King, Senior Editor; University of Oxford, United Kingdom

Tobias Reichenbach, Reviewer; Imperial College London, United Kingdom

Luigi Acerbi, Reviewer
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
Our perception is notoriously inaccurate, and estimating the uncertainty of a percept is an important task of our brain. The present paper shows for the first time that our brain uses past history in the estimation of the uncertainty of a current percept. It opens up further research questions including those related to the role of learning in perception.
Decision letter after peer review:
Thank you for submitting your article "Using the past to estimate sensory uncertainty" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Tobias Reichenbach as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Andrew King as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Luigi Acerbi (Reviewer #3).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
Beierholm et al. present a well-executed psychophysical study in which participants judged the location of a conflicting visual-auditory stimulus under varying degrees of visual noise. Unlike other studies in this field, the visual noise was varied slowly, enabling participants to take advantage of this fact by incorporating estimates of uncertainty from the recent past into their current cue-weighting strategy. The authors then compare the data to three computational models: a simple learner that does not use past information, a Bayesian learner, and an exponential learner that accounts for past information at a fixed learning rate. The conclusion of the paper is that subjects' estimate of visual variability is influenced by the past history of visual noise.
While we find these results intriguing, and the experiment and analysis thorough, we have several major comments that we would like the authors to address.
Essential revisions:
1) The current analysis does not consider the effect that past visual locations, in addition to the uncertainty in the visual signal, may have on the estimate of the current location. In particular, if the location at time t−1 is the same as or close to the location at time t, it might be that past estimates of location get integrated (in fact, this becomes another causal inference problem). This effect might be further related to the question investigated here, because, if present, it might be modulated by the visual noise in the previous trial. Have you investigated whether this effect is present, and if so, how did you account for it in your analysis?
2) The authors currently refer to a preprint of this manuscript on bioRxiv for a full derivation of both model components for the Bayesian learner. Instead, please provide the full model derivation in the supplementary information in a self-contained form. Please also make the code for the modelling and the analysis as well as the data publicly available.
3) In the sinusoidal condition the bins had a duration of only 1.5 s, but the trials were 1.4 to 2.8 s apart. It therefore appears as if there were either 1 or 0 (or very rarely 2) responses in each bin. How did you handle the zero-response bins? And how can weights – presumed to vary smoothly between 0 and 1 – be reliably estimated from a single behavioural response?
4) The computational model makes certain assumptions that appear to differ from the experiment. We would like the authors to comment on these discrepancies. First, the computational model assumes that the auditory signal follows a normal distribution around a particular mean. However, in the experiment, the location of the sound was either +5 degrees or −5 degrees away from the mean of the visual signal. Second, regarding the computational model, the authors write that "the dispersion of the individual dots is assumed to be identical to the uncertainty about the visual mean, allowing subjects to use the dispersions as an estimate of the uncertainty about the visual mean". But in the experiment there is no notion of an uncertainty (noise) in the visual mean. Third, the authors write that all probabilities, except for one, are Gaussian. As for the first point raised above, in the experiment, this only seems true for the distribution of the dots around the mean, but not for the other distributions. In particular, the mean of the visual signal is sampled from a discrete uniform distribution that encompasses only five different locations. Fourth, each dot location V_{i,t} is drawn from a normal distribution with mean U_{t}, but U_{t} is drawn from another distribution with mean S_{V,t} – are the variances of these two distributions the same? Wouldn't U_{t} simply be the location (10, 5, etc) on that trial, and wouldn't this mean instead that the dot positions are doubly stochastic? If so, why? The actual dispersion (not to mention the observers' estimates thereof) would be very noisy if dot locations were simply resampled at 5 Hz from a fixed distribution for a given trial. Doesn't resampling the SD at 5 Hz just complicate the modeling even more than it already is? Please also explain the purpose of the log random walk.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Thank you for resubmitting your article "Using the past to estimate sensory uncertainty" for consideration by eLife. Your revised article has been reviewed by three peer reviewers, including Tobias Reichenbach as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Andrew King as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Luigi Acerbi (Reviewer #3).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, we are asking editors to accept without delay manuscripts, like yours, that they judge can stand as eLife papers without additional data, even if they feel that they would make the manuscript stronger. Thus the revisions requested below only address clarity and presentation.
Summary:
The authors have addressed our previous comments well in their extensively revised version of the manuscript. We only have a few remaining queries.
Revisions:
The difference in STD between the current and previous bin predicts the auditory weight. Would that be expected given the autocorrelation of the STD sequence? Our intuition is that the null hypothesis (no impact of previous bin) would only be valid if STDs were (temporally) conditionally independent. In other words, if it were only the current STD that affected the weight, you might still see an apparent influence of the previous STD simply because the STD of bin N is highly correlated with the STD of bin N−1 (for the sinusoids at least).
This logic may only apply if the regression was on the absolute STDs, not the difference between current and previous STD, which is what the authors did. So perhaps it's not an issue. But if it is, we think one could perform a nested model comparison to test whether adding the previous time bin significantly improves the fit enough to justify the extra parameter. (It could also be that the current analysis is effectively doing this.)
Alternatively, one could perform this analysis separately for the first half vs. second half and see whether you observe a change in the regression coefficient for the δSTD. If the authors' interpretation is correct, the coefficient should systematically change (sign flip?) when STD is increasing vs decreasing, whereas if the autocorrelation were driving its significance, it should not depend on increasing vs decreasing.
https://doi.org/10.7554/eLife.54172.sa1
Author response
Essential revisions:
1) The current analysis does not consider the effect that past visual locations, in addition to the uncertainty in the visual signal, may have on the estimate of the current location. In particular, if the location at time t−1 is the same as or close to the location at time t, it might be that past estimates of location get integrated (in fact, this becomes another causal inference problem). This effect might be further related to the question investigated here, because, if present, it might be modulated by the visual noise in the previous trial. Have you investigated whether this effect is present, and if so, how did you account for it in your analysis?
We thank the reviewer for this suggestion. To quantify the influence of the previous visual location, we expanded our regression model by another regressor modelling the visual cloud’s location on the previous trial. For instance, for bin = 1 we computed:
R_{A,trial,bin=1} = L_{A,trial,bin=1} * β_{A,bin=1} + L_{V,trial,bin=1} * β_{V,bin=1} + L_{V,trial−1,bin=1} * β_{Vprevious,bin=1} + β_{const,bin=1} + e_{trial,bin=1}
with R_{A,trial,bin=1} = localization response for the current trial that is assigned to bin 1; L_{A,trial,bin=1} or L_{V,trial,bin=1} = ‘true’ auditory or visual location for the current trial that is assigned to bin 1; L_{V,trial−1,bin=1} = ‘true’ visual location for the corresponding previous trial (for explanatory purposes, we assign the bin of the current trial; the previous trial actually falls into a different bin); β_{A,bin=1} or β_{V,bin=1} = auditory or visual weight for bin 1; β_{Vprevious,bin=1} = influence of the visual location of the previous trial on the perceived sound location of the current trial for bin 1; β_{const,bin=1} = constant term; e_{trial,bin=1} = error term.
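As a minimal sketch of how such a regression could be fitted for one bin, the following Python snippet estimates the four coefficients by ordinary least squares on simulated trials. The function name `fit_bin_regression`, the five visual locations, the simulated weights and the noise level are all illustrative assumptions, not the authors' analysis code or data. The relative auditory weight is computed here as β_A / (β_A + β_V), which is also an assumption of this sketch.

```python
import numpy as np

def fit_bin_regression(resp, loc_a, loc_v, loc_v_prev):
    """Least-squares fit of R_A = L_A*b_A + L_V*b_V + L_Vprev*b_Vprev + b_const + e."""
    loc_a = np.asarray(loc_a, dtype=float)
    # Design matrix: auditory, visual, previous visual location, constant.
    X = np.column_stack([loc_a, loc_v, loc_v_prev, np.ones_like(loc_a)])
    beta, *_ = np.linalg.lstsq(X, np.asarray(resp, dtype=float), rcond=None)
    return dict(zip(["b_A", "b_V", "b_Vprev", "b_const"], beta))

# Simulated trials for a single bin (all values hypothetical).
rng = np.random.default_rng(0)
n = 5000
locations = [-10.0, -5.0, 0.0, 5.0, 10.0]            # assumed visual locations
loc_v = rng.choice(locations, size=n)
loc_a = loc_v + rng.choice([-5.0, 5.0], size=n)      # sound 5 degrees left or right
loc_v_prev = rng.choice(locations, size=n)
# Simulated observer with a small repulsive effect of the previous visual location.
resp = 0.6 * loc_a + 0.4 * loc_v - 0.05 * loc_v_prev + rng.normal(0.0, 2.0, n)

fit = fit_bin_regression(resp, loc_a, loc_v, loc_v_prev)
w_a = fit["b_A"] / (fit["b_A"] + fit["b_V"])         # relative auditory weight
print(fit, w_a)
```

A negative `b_Vprev` in this sketch corresponds to a repulsive influence of the previous visual location on the reported sound location.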
This analysis indeed reveals that the location of the visual cloud on the previous trial influences observers’ perceived sound location (Supplementary file 1—table 2). Surprisingly, it has a repellent effect, i.e. observers’ perceived sound location shifts away from the true visual location of the previous trial. Importantly, having regressed out the influence of the previous visual location on observers’ perceived sound location, we repeated our main analyses, i.e. the repeated-measures ANOVA assessing whether w_{A,bin} differed for the bins in the first vs. second half (see Supplementary file 1—table 3). Moreover, we repeated the regression model analysis to assess whether w_{A,bin} was predicted not only by the cloud’s STD of the current bin, but also by that of the previous bin (Supplementary file 1—table 4). Both analyses replicated our initial findings.
In addition, we demonstrated that the regression weight quantifying the influence of the previous visual location did not correlate with the visual noise in the current trial (r(β_{Vprevious,bin}, σ_{Vcurrent,bin})) or in the previous trial (r(β_{Vprevious,bin}, σ_{Vprevious,bin})) (see Supplementary file 1—table 2).
We have now included additional methods in Appendix 1, report those results in Supplementary file 1—tables 2–4 and Figure 2—figure supplement 1, and refer to the control analyses in the main text.
2) The authors currently refer to a preprint of this manuscript on bioRxiv for a full derivation of both model components for the Bayesian learner. Instead, please provide the full model derivation in the supplementary information in a self-contained form. Please also make the code for the modelling and the analysis as well as the data publicly available.
We have now added the full model derivation to Appendix 2. Further, we uploaded the modelling and analysis scripts along with the behavioral data and model predictions to an OSF repository: https://osf.io/gt4jb/
We refer to this website in the main text.
3) In the sinusoidal condition the bins had a duration of only 1.5 s, but the trials were 1.4 to 2.8 s apart. It therefore appears as if there were either 1 or 0 (or very rarely 2) responses in each bin. How did you handle the zero-response bins? And how can weights – presumed to vary smoothly between 0 and 1 – be reliably estimated from a single behavioural response?
We are sorry for this confusion. The reviewer is absolutely right. The bins in the four sequences had durations (Sin and RW1 = 1.5 s, RW2 = 6 s, Sin Jump = 2 s) that were in part shorter than the ITI of 1.4–2.8 s, so that during a single presentation of a sequence each bin received only 0, 1 or, rarely, 2 responses. However, the experiment looped through the sequences multiple times (Sin, RW1 and Sin Jump ~130x, RW2 ~32x). As a result of the jittered trial onset asynchrony, trials sampled different bins over replications/cycles of the same sequence throughout the experiment. In fact, each time bin was informed by at least 4487 trials (see Supplementary file 1—table 1). Thus, the auditory weights could be estimated quite reliably.
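The sampling argument above can be illustrated with a small simulation. The sketch below loops a sequence of 1.5 s bins many times, draws jittered inter-trial intervals between 1.4 and 2.8 s, and counts how many trials land in each bin. The number of bins per cycle (20), the uniform jitter and the single continuous trial stream are assumptions made for illustration, so the counts are not expected to reproduce the figures reported in Supplementary file 1—table 1.

```python
import random

def count_trials_per_bin(n_bins, bin_dur, n_loops, iti_range, seed=0):
    """Count trials per bin when a looped sequence is sampled with jittered ITIs."""
    rng = random.Random(seed)
    cycle = n_bins * bin_dur            # duration of one pass through the sequence
    total = cycle * n_loops             # total duration of the simulated session
    counts = [0] * n_bins
    t = rng.uniform(*iti_range)         # onset of the first trial
    while t < total:
        # Position within the current cycle determines the bin index.
        idx = min(int((t % cycle) // bin_dur), n_bins - 1)
        counts[idx] += 1
        t += rng.uniform(*iti_range)    # jittered interval to the next trial
    return counts

# Hypothetical settings: 20 bins of 1.5 s, looped 130 times, ITI of 1.4-2.8 s.
counts = count_trials_per_bin(n_bins=20, bin_dur=1.5, n_loops=130, iti_range=(1.4, 2.8))
print(min(counts), max(counts), sum(counts))
```

Although a single pass leaves many bins with zero or one trial, the jitter shifts trial onsets relative to the sequence on every loop, so every bin accumulates many trials over the session.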
We have now described the experimental design and analysis strategy in greater detail in the Results and Materials and methods sections. We have also introduced a new notation for the equations and parameters for clarification. Moreover, we have included the following table in Supplementary file 1.
4) The computational model makes certain assumptions that appear to differ from the experiment. We would like the authors to comment on these discrepancies.
Thank you for giving us the opportunity to motivate the assumptions of our model. We have now clarified the model’s assumptions in the Materials and methods section.
First, the computational model assumes that the auditory signal follows a normal distribution around a particular mean. However, in the experiment, the location of the sound was either +5 degrees or −5 degrees away from the mean of the visual signal.
Indeed, the reviewer is absolutely right that the sounds are ± 5° from the visual location. However, observers are known to be limited in their sound localization ability, particularly if sounds do not come from natural sound sources but are generated with generic head-related transfer functions as in our study. Given observers’ substantial spatial uncertainty when locating sounds, we feel the model’s normal assumptions about sound location can be justified.
Second, regarding the computational model, the authors write that "the dispersion of the individual dots is assumed to be identical to the uncertainty about the visual mean, allowing subjects to use the dispersions as an estimate of the uncertainty about the visual mean". But in the experiment there is no notion of an uncertainty (noise) in the visual mean.
We have introduced the additional hidden variable U_{t} to account for the fact that even when auditory and visual signals come from a common source, they do not necessarily fully coincide in space in our natural environment. This introduces additional spatial uncertainty, so that observers cannot fully rely on the visual cloud of dots to locate the sound even in the common source situation. Critically, as cited by the reviewer, because the dispersion of the dots and the uncertainty about the mean were set to be equal, observers could estimate this visual uncertainty from the spread of the dots.
Third, the authors write that all probabilities, except for one, are Gaussian. As for the first point raised above, in the experiment, this only seems true for the distribution of the dots around the mean, but not for the other distributions. In particular, the mean of the visual signal is sampled from a discrete uniform distribution that encompasses only five different locations.
Again, given the uncertainty about visual location this seems like a justifiable assumption. In fact, this assumption has been made by a growing number of studies that fitted the Bayesian Causal Inference model to observers’ localization responses, even though in all of those previous studies (Körding et al., 2007; Rohe and Noppeney, 2015a; Rohe and Noppeney, 2015b), the means of the visual and auditory signals were sampled from a discrete uniform distribution.
Fourth, each dot location V_{i,t} is drawn from a normal distribution with mean U_{t}, but U_{t} is drawn from another distribution with mean S_{V,t} – are the variances of these two distributions the same?
Yes – as explained in our response to the second point above, otherwise they would not be informative.
Wouldn't U_{t} simply be the location (10, 5, etc) on that trial, and wouldn't this mean instead that the dot positions are doubly stochastic? If so, why?
No, U_{t} is not identical to the location on that trial but is modelled as sampled from a Gaussian centred on this location. In this sense, the model makes the inference doubly stochastic. Importantly, however, the two standard deviations are the same, so one is informative about the other.
The actual dispersion (not to mention the observers' estimates thereof) would be very noisy if dot locations were simply resampled at 5 Hz from a fixed distribution for a given trial. Doesn't resampling the SD at 5 Hz just complicate the modeling even more than it already is?
It is true that the SD was resampled at 5 Hz in order to provide the observer with the impression of a continuous stimulus. But for modelling, we focused selectively on the SD of the visual cloud at the trial onset times.
Please also explain the purpose of the log random walk.
We performed a random walk on the logarithm of the visual reliability ${\lambda}_{V,t}$ as a convenience for the modeling. Previous research in the reward learning domain has compared a log random walk with a change point model and found very similar results (Behrens et al., 2007). Moreover, as even the exponential learner provided a reasonable fit to the data, we suspect that this type of assumption is unlikely to have a relevant effect.
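To make the two modelling ideas in this answer concrete, the sketch below first generates a visual reliability that follows a Gaussian random walk on its logarithm, and then tracks the resulting dot variance with a simple exponentially discounting learner. The step size, learning rate, number of dots and the exact update rule are hypothetical choices for illustration only; the full derivation of the models is given in Appendix 2.

```python
import math
import random

def log_random_walk_reliability(n_steps, lam0=1.0, step_sd=0.1, seed=0):
    """Reliability (1/variance) whose logarithm follows a Gaussian random walk."""
    rng = random.Random(seed)
    log_lam = math.log(lam0)
    lams = []
    for _ in range(n_steps):
        log_lam += rng.gauss(0.0, step_sd)   # random step in log space
        lams.append(math.exp(log_lam))       # exponentiating keeps reliability positive
    return lams

def exponential_learner(sample_vars, rate=0.3):
    """Exponentially discounted running estimate of the visual variance."""
    est = sample_vars[0]
    estimates = []
    for s2 in sample_vars:
        est = (1.0 - rate) * est + rate * s2  # fixed learning rate (assumed form)
        estimates.append(est)
    return estimates

# Simulate dot clouds (assumed 20 dots per trial) and track their variance.
lams = log_random_walk_reliability(200)
rng = random.Random(1)
n_dots = 20
sample_vars = []
for lam in lams:
    sd = 1.0 / math.sqrt(lam)
    dots = [rng.gauss(0.0, sd) for _ in range(n_dots)]
    mean = sum(dots) / n_dots
    sample_vars.append(sum((d - mean) ** 2 for d in dots) / (n_dots - 1))

estimates = exponential_learner(sample_vars)
print(estimates[:5])
```

Working in log space guarantees that the simulated reliability stays positive, which is one practical convenience of this parameterization.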
As mentioned above, we included a critical discussion of our modelling assumptions in the Materials and methods section in which we discuss the aforementioned points.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Revisions:
The difference in STD between the current and previous bin predicts the auditory weight. Would that be expected given the autocorrelation of the STD sequence? Our intuition is that the null hypothesis (no impact of previous bin) would only be valid if STDs were (temporally) conditionally independent. In other words, if it were only the current STD that affected the weight, you might still see an apparent influence of the previous STD simply because the STD of bin N is highly correlated with the STD of bin N−1 (for the sinusoids at least).
This logic may only apply if the regression was on the absolute STDs, not the difference between current and previous STD, which is what the authors did. So perhaps it's not an issue. But if it is, we think one could perform a nested model comparison to test whether adding the previous time bin significantly improves the fit enough to justify the extra parameter. (It could also be that the current analysis is effectively doing this.)
Alternatively, one could perform this analysis separately for the first half vs. second half and see whether you observe a change in the regression coefficient for the δSTD. If the authors' interpretation is correct, the coefficient should systematically change (sign flip?) when STD is increasing vs decreasing, whereas if the autocorrelation were driving its significance, it should not depend on increasing vs decreasing.
Thanks for raising this point. Indeed, as the reviewer notes, the difference in STD between the current and previous bin is only weakly correlated with the current STD (r ~ 0.3 across the sequences). Further, because we entered the difference in STD and the STD into the same regression model, each parameter estimate reflects only the unique variance that cannot be explained by any other regressor in the model. Hence, testing for the significance of one parameter estimate is equivalent to comparing two nested models that do or do not include this regressor. However, in the two-stage summary-statistic approach this relationship is less transparent.
Following the reviewer’s suggestions, we have now implemented a nested model comparison. We fitted two linear mixed effects models to the relative auditory weights: a full model with the current STD and the difference in STD, and a reduced model with only the current STD. The model comparison shows greater model evidence for the full model.
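The logic of such a nested comparison can be sketched as follows. Note that this example swaps in ordinary least squares and the Bayesian information criterion (BIC) for the linear mixed effects models and model evidence used in the actual analysis, and it runs on simulated weights. The sinusoidal STD sequence, the coefficients and the noise level are hypothetical.

```python
import numpy as np

def ols_bic(X, y):
    """BIC of a Gaussian linear model fitted by ordinary least squares (up to a constant)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + k * np.log(n)

rng = np.random.default_rng(1)
n_bins = 200
t = np.arange(n_bins)
std = 10.0 + 8.0 * np.sin(2.0 * np.pi * t / 20.0)   # hypothetical sinusoidal STD sequence
d_std = std - np.roll(std, 1)                       # current minus previous bin (sequence loops)

# Simulated auditory weights that depend on both the current STD and its change.
w_a = 0.2 + 0.02 * std - 0.03 * d_std + rng.normal(0.0, 0.02, n_bins)

ones = np.ones(n_bins)
bic_reduced = ols_bic(np.column_stack([ones, std]), w_a)       # current STD only
bic_full = ols_bic(np.column_stack([ones, std, d_std]), w_a)   # adds the STD difference
print(bic_reduced, bic_full)
```

A lower BIC for the full model indicates that the extra regressor improves the fit by more than the penalty for the additional parameter, which is the same question the mixed-model comparison addresses.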
We mention the control analysis in the Results section, have added a supplementary table (Supplementary file 1—table 5) and describe the analysis in more detail in Appendix 1.
https://doi.org/10.7554/eLife.54172.sa2
Article and author information
Author details
Funding
European Research Council (ERCmultsens,309349)
 Uta Noppeney
Max Planck Society
 Tim Rohe
 Uta Noppeney
Deutsche Forschungsgemeinschaft (DFG RO 5587/1-1)
 Tim Rohe
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This study was funded by the ERC (ERCmultsens, 309349), the Max Planck Society and the Deutsche Forschungsgemeinschaft (DFG; grant number RO 5587/1–1). We thank Peter Dayan for his valuable contributions and very helpful comments on a previous version of the manuscript.
Ethics
Human subjects: All volunteers participated in the study after giving written informed consent. The study was approved by the human research review committee of the University of Tuebingen (approval number 432 2007 BO1) and the research review committee of the University of Birmingham (approval number ERN_151458AP1).
Senior Editor
 Andrew J King, University of Oxford, United Kingdom
Reviewing Editor
 Tobias Reichenbach, Imperial College London, United Kingdom
Reviewers
 Tobias Reichenbach, Imperial College London, United Kingdom
 Luigi Acerbi
Publication history
 Received: December 4, 2019
 Accepted: December 13, 2020
 Accepted Manuscript published: December 15, 2020 (version 1)
 Version of Record published: January 13, 2021 (version 2)
Copyright
© 2020, Beierholm et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.