Using the past to estimate sensory uncertainty

  1. Ulrik Beierholm  Is a corresponding author
  2. Tim Rohe
  3. Ambra Ferrari
  4. Oliver Stegle
  5. Uta Noppeney
  1. Psychology Department, Durham University, United Kingdom
  2. Department of Psychiatry and Psychotherapy, University of Tübingen, Germany
  3. Department of Psychology, Friedrich-Alexander University Erlangen-Nuernberg, Germany
  4. Centre for Computational Neuroscience and Cognitive Robotics, University of Birmingham, United Kingdom
  5. Max Planck Institute for Intelligent Systems, Germany
  6. European Molecular Biology Laboratory, Genome Biology Unit, Germany
  7. Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany, Germany
  8. Donders Institute for Brain, Cognition and Behaviour, Radboud University, Netherlands

Abstract

To form a more reliable percept of the environment, the brain needs to estimate its own sensory uncertainty. Current theories of perceptual inference assume that the brain computes sensory uncertainty instantaneously and independently for each stimulus. We evaluated this assumption in four psychophysical experiments, in which human observers localized auditory signals that were presented synchronously with spatially disparate visual signals. Critically, the visual noise changed dynamically over time continuously or with intermittent jumps. Our results show that observers integrate audiovisual inputs weighted by sensory uncertainty estimates that combine information from past and current signals consistent with an optimal Bayesian learner that can be approximated by exponential discounting. Our results challenge leading models of perceptual inference where sensory uncertainty estimates depend only on the current stimulus. They demonstrate that the brain capitalizes on the temporal dynamics of the external world and estimates sensory uncertainty by combining past experiences with new incoming sensory signals.

Introduction

Perception has been described as a process of statistical inference based on noisy sensory inputs (Knill and Pouget, 2004; Knill and Richards, 1996). Key to this perceptual inference is the estimation and/or representation of sensory uncertainty (as measured by variance, i.e. the inverse of reliability/precision). Most prominently, in multisensory perception, a more reliable or ‘Bayes-optimal’ percept is obtained by integrating sensory signals that come from a common source weighted by their relative reliabilities with less weight assigned to less reliable signals. Likewise, sensory uncertainty shapes observers’ causal inference. It influences whether observers infer that signals come from a common cause and should hence be integrated or else be processed independently (Aller and Noppeney, 2019; Körding et al., 2007; Rohe et al., 2019; Rohe and Noppeney, 2015b; Rohe and Noppeney, 2015a; Rohe and Noppeney, 2016; Wozny et al., 2010; Acerbi et al., 2018). Indeed, accumulating evidence suggests that human observers are close to optimal in many perceptual tasks (though see Acerbi et al., 2014; Drugowitsch et al., 2016; Shen and Ma, 2016; Meijer et al., 2019) and weight signals approximately according to their sensory reliabilities (Alais and Burr, 2004; Ernst and Banks, 2002; Jacobs, 1999; Knill and Pouget, 2004; van Beers et al., 1999; Drugowitsch et al., 2014; Hou et al., 2019).

An unresolved question is how human observers compute their sensory uncertainty. Current theories and experimental approaches generally assume that observers access sensory uncertainty near-instantaneously and independently across briefly (≤200 ms) presented stimuli (Ma and Jazayeri, 2014; Zemel et al., 1998). At the neural level, theories of probabilistic population coding have suggested that sensory uncertainty may be represented instantaneously in the gain of the neuronal population response (Ma et al., 2006; Hou et al., 2019). Yet, in our natural environment, sensory noise often evolves at slow timescales. For instance, visual noise slowly varies when walking through a snow storm. Observers may capitalize on the temporal dynamics of the external world and use the past to inform current estimates of sensory uncertainty. In this alternative account, more reliable estimates of sensory uncertainty would be obtained by combining past estimates with current sensory inputs as predicted by Bayesian learning.

To arbitrate between these two critical hypotheses, we presented observers with audiovisual signals in synchrony but with a small spatial disparity in a sound localization task. Critically, the spatial standard deviation (STD) of the visual signal changed dynamically over time continuously (experiments 1–3) or discontinuously (i.e. with intermittent jumps; experiment 4). First, we investigated whether the influence of the visual signal location on observers’ perceived sound location depended on the noise only of the current visual signal or also of past visual signals. Second, using computational modeling and Bayesian model comparison, we formally assessed whether observers update their visual uncertainty estimates consistent with (i) an instantaneous learner, (ii) an optimal Bayesian learner, or (iii) an exponential learner.

Results

In a spatial localization task, we presented participants with audiovisual signals in a series of four experiments, in which the physical visual noise changed dynamically over time either continuously or discontinuously (Figure 1). Visual (V) signals (clouds of 20 bright dots) were presented every 200 ms for a duration of 32 ms. The cloud’s horizontal STD varied over time at this temporal rate of 5 Hz either continuously (experiments 1–3) or discontinuously with intermittent jumps (experiment 4). The cloud’s location mean was temporally independently resampled from five possible locations (−10°, −5°, 0°, 5°, 10°) on each trial with the inter-trial asynchrony jittered between 1.4 and 2.8 s. In synchrony with the change in the cloud’s mean location, the dots changed their color and a sound was presented (AV signal). The location of the sound was sampled from the two possible locations adjacent to the visual cloud’s mean location (i.e. ±5° AV spatial disparity). Participants localized the sound and indicated their response using five response buttons.

Figure 1 with 1 supplement see all
Audiovisual localization paradigm and Bayesian causal inference model for learning visual reliability.

(A) Visual (V) signals (cloud of 20 bright dots) were presented every 200 ms for 32 ms. The cloud’s location mean was temporally independently resampled from five possible locations (−10°, −5°, 0°, 5°, 10°) with an inter-trial asynchrony jittered between 1.4 and 2.8 s. In synchrony with the change in the cloud’s mean location, the dots changed their color and a sound was presented (AV signal) which the participants localized using five response buttons. The location of the sound was sampled from the two possible locations adjacent to the visual cloud’s mean location (i.e. ±5° AV spatial). (B) The generative model for the Bayesian learner explicitly modeled the potential causal structures, that is whether visual (Vi) signals and an auditory (A) signal were generated by one common audiovisual source St, that is C = 1, or by two independent sources SVt and SAt, that is C = 2 (n.b. only the model component for the common source case is shown to illustrate the temporal updating, for complete generative model, see Figure 1—figure supplement 1). Importantly, the reliability (i.e. 1/variance) of the visual signal at time t (λt) depends on the reliability of the previous visual signal (λt-1) for both model components (i.e. common and independent sources).

The small audiovisual disparity enabled an influence of the visual signal location on the perceived sound location as a function of visual noise (Alais and Burr, 2004; Battaglia et al., 2003; Meijer et al., 2019). As a result, observers’ visual uncertainty estimate could be quantified in terms of the relative weight of the auditory signal on the perceived sound location with a greater auditory weight indicating that observers estimated a greater visual uncertainty.

In the first three experiments, we used continuous sequences, where the visual cloud’s STD changed periodically according to a sinusoid (n = 25; period = 30 s), a random walk (RW1; n = 33; period = 120 s) or a smoothed random walk (RW2; n = 19; period = 30 s; Figure 2). In an additional fourth experiment, we inserted abrupt increases or decreases into a sinusoidal evolution of the visual cloud’s STD (n = 18, period = 30 s, Figure 5). We will first describe the results for the three continuous sequences followed by the discontinuous sequence.

Figure 2 with 1 supplement see all
Time course of visual noise and relative auditory weights for continuous sequences of visual noise.

The visual noise (i.e. STD of the cloud of dots, right ordinate) and the relative auditory weights (mean across participants ± SEM, left ordinate) are displayed as a function of time. The STD of the visual cloud was manipulated as (A) a sinusoidal (period 30 s, N = 25), (B) a random walk (RW1, period 120 s, N = 33) and (C) a smoothed random walk (RW2, period 30 s, N = 19). The overall dynamics as quantified by the power spectrum is faster for RW2 than RW1 (peak in frequency range [0 0.2] Hz: Sinusoid: 0.033 Hz, RW1: 0.025 Hz, RW2: 0.066 Hz). The RW1 and RW2 sequences were mirror-symmetric around the half-time (i.e. the second half was the reversed first half). The visual clouds were re-displayed every 200 ms (i.e. at 5 Hz). The trial onsets, that is audiovisual (AV) signals (color change with sound presentation, black dots), were interspersed with an inter-trial asynchrony jittered between 1.4 and 2.8 s. On each trial observers located the sound. The relative auditory weights were computed based on regression models for the sound localization responses separately for each of the 20 temporally adjacent bins that cover the entire period within each participant. The relative auditory weights vary between one (i.e. pure auditory influence on the localization responses) and zero (i.e. pure visual influence). For illustration purposes, the cloud of dots for the lowest (i.e. V signal STD = 2°) and the highest (i.e. V signal STD = 18°) visual variance are shown in (A).

We assigned the sound localization responses and the associated physical visual noise (i.e. the cloud’s STD) to 20 (resp. 15 for experiment 4) temporally adjacent bins covering the entire period of each of the three sequences. Each experiment repeated the same 30 s (Sin, RW2) or 120 s (RW1) period throughout the experiment resulting in ~32 periods for the RW1 and ~130 periods for the Sin and RW2 sequences. The trial and hence sound onsets were jittered with respect to this periodic evolution of the visual cloud’s STD resulting in a greater effective sampling rate than expected for an inter-trial asynchrony of 1.4–2.8 s. In total, we assigned at least 44–87 trials to each bin (Supplementary file 1-Table 1). We quantified the auditory and visual influence on observers’ perceived auditory location for each bin based on regression models (separately for each of the 20 temporally adjacent bins). For instance, for bin = 1 we computed:

RA,trial,bin=1=LA,trial,bin=1ß A,bin=1 +LV,trial,bin=1ß V,bin=1 +ß const,bin=1 +etrial,bin=1

with RA,trial, bin=1 = Localization response for trial t and bin 1; LA,trial,bin=1 or LV,trial,bin=1  = ‘true’ auditory or visual location for trial t and bin 1; ßA,bin=1 or ßV,bin=1 = auditory or visual weight for bin 1; ßconst,bin=1 = constant term; etrial,bin=1 = error term. For each bin b, we thus obtained one auditory and one visual weight estimate. The relative auditory weight for a particular bin was computed as wA,bin = ßA,bin / (ßA,bin + ßV,bin).

Figure 2 and Figure 3 show the temporal evolution of the STD of the physical visual noise and observers’ relative auditory weight indices wA,bin. If observers estimate sensory uncertainty instantaneously, observer’s relative auditory weight indices should closely track the visual cloud’s STD (Figure 2). By contrast, we observed systematic biases: while the temporal evolution of the physical visual noise was designed to be symmetrical for each time period, we observed a temporal asymmetry for wA in all of the three experiments. For the monotonic sinusoidal sequence, wA was smaller for the 1st half of each period, when visual noise increased, than the 2nd half, when visual noise decreased over time (Figure 3A). For the non-monotonic RW1 and RW2 sequences, we observed more complex temporal profiles, because the visual noise increased and decreased in each half. WA was larger for increasing visual noise in the 1st as compared to the 2nd half, while wA was smaller for decreasing visual noise in the 1st as compared to the 2nd half (Figure 3B, C). These impressions were confirmed statistically in 2 (1st vs. flipped 2nd half) x 9 (bins) repeated measures ANOVAs (Table 1) showing a significant main effect of the 1st versus flipped 2nd half period for the sinusoidal (F(1, 24)=12.162, p=0.002, partial η2 = 0.336) and the RW1 sequence (F(1, 32)=14.129, p<0.001, partial η2 = 0.306). For the RW2 sequence, we observed a significant interaction (F(4.6, 82.9)=3.385, p=0.010, partial η2 = 0.158), because the visual noise did not change monotonically within each half period. Instead, monotonic increases and decreases in visual noise alternated at nearly the double frequency in RW2 as compared to RW1. The asymmetry in the auditory weights’ time course across the three experiments suggested that the visual noise in the past influenced observers’ current visual uncertainty estimate resulting in smaller auditory weights for ascending visual noise and greater auditory weights for descending visual noise.

Observers’ relative auditory weights for continuous sequences of visual noise.

Relative auditory weights wA of the 1st (solid) and the flipped 2nd half (dashed) of a period (binned into 20 bins) plotted as a function of the normalized time in the sinusoidal (red), the RW1 (blue), and the RW2 (green) sequences. Relative auditory weights were computed from auditory localization responses of human observers.

Table 1
Analyses of the temporal asymmetry of the relative auditory weights across the four sequences of visual noise using repeated measures ANOVAs with the factors sequence part (1st vs. flipped 2nd half), bin and jump position (only for the sinusoidal sequences with intermittent jumps).
EffectFdf1df2pPartial η2
SinusoidPart12.1621240.0020.336
Bin92.0073.10874.584<0.0010.793
PartXBin2.1672.94270.6170.1010.083
RW1Part14.1291320.0010.306
Bin76.0554.911157.151<0.0010.704
PartXBin1.2254.874155.9710.3000.037
RW2Part2.8841180.1070.138
Bin60.1423.30459.467<0.0010.770
PartXBin3.3854.60382.8490.0100.158
Sinusoid with intermittent jumpsJump28.306234<0.0010.625
Part24.824117<0.0010.594
Bin76.4761.87331.839<0.0010.818
JumpXPart0.3002340.7430.017
JumpXBin8.3833.30956.247<0.0010.330
PartXBin1.6413.24855.2220.1870.088
JumpXPartXBin0.6405.71697.1750.6900.036
  1. Note: The factor bin comprised nine levels in the first three and seven levels in the fourth sequence. In this sequence, the factor Jump comprised three levels. If Mauchly tests indicated significant deviations from sphericity (p<0.05), we report Greenhouse-Geisser corrected degrees of freedom and p values.

To further investigate the influence of past visual noise on observers’ auditory weights, we estimated a regression model in which the relative auditory weights wA for each of the 20 bins were predicted by the visual STD in the current bin and the difference in STD between the current and the previous bin (see Equation 2). Indeed, both the current visual STD (p<0.001 for all three sequences; Sinusoid: t(24)=15.767, Cohen’s d = 3.153; RW1: t(32) = 15.907, Cohen’s d = 2.769; RW2: t(18) = 12.978, Cohen’s d = 2.977, two sided one-sample t test against zero) and the difference in STD between the current and the previous bin (i.e. Sinusoid t(24) = −3.687, p=0.001, Cohen’s d = −0.737; RW1 t(32) = −2.593, p=0.014, Cohen’s d = −0.451; RW2 t(18) = -2.395, p=0.028, Cohen’s d = −0.549) significantly predicted observers’ relative auditory weights (for complementary results of nested model comparisons see Appendix 1 and Supplementary file 1-Table 5). Collectively, these results suggest that observers’ visual uncertainty estimates (as indexed by the relative auditory weights wA) depend not only on the current sensory signal, but also on the recent history of the sensory noise. These results were also validated in a control analysis that regressed out and thus accounted for potential influences of the previous visual location on observers’ sound localization, suggesting that the effects of past visual uncertainty cannot be explained by effects of past visual location mean (Appendix 1, Figure 2—figure supplement 1, Supplementary file 1-tables 2-4).

To characterize how human observers use information from the past to estimate current sensory uncertainty, we compared three computational models that differed in how visual uncertainty is learnt over time (Figure 4): Model 1, the instantaneous learner, estimates visual uncertainty independently for each trial as assumed by current standard models. Model 2, the optimal Bayesian learner, estimates visual uncertainty by updating the prior uncertainty estimate obtained from past visual signals with the uncertainty estimate from the current signal. Model 3, the exponential learner, estimates visual uncertainty by exponentially discounting past uncertainty estimates. All three models account for observers’ uncertainty about whether auditory and visual signals were generated by common or independent sources by explicitly modeling the two potential causal structures (Körding et al., 2007) underlying the audiovisual signals (n.b. only the model component pertaining to the ‘common cause’ case is shown in Figure 1B, for the full model see Figure 1—figure supplement 1). Models were fit individually to observers’ data by sampling from the posterior over parameters for each observer (Table 2).

Table 2
Model parameters (median), absolute WAIC and relative.

ΔWAIC values for the three candidate models in the four sequences of visual noise.

SequenceModelσAPcommonσ0κ or γWAICΔWAIC
SinusoidInstantaneous learner5.560.638.95-81931.2109.9
Bayesian learner5.640.659.03κ: 7.3781821.30
Exponential discounting5.620.649.02γ: 0.2381866.945.6
RW1Instantaneous learner6.300.698.46-110051.289.0
Bayesian learner6.290.728.68κ: 8.06109962.20
Exponential discounting6.260.708.75γ: 0.33109929.9−32.3
RW2Instantaneous learner6.360.7210.79-62576.4201.3
Bayesian learner6.490.7810.9κ: 6.762375.20
Exponential discounting6.460.7311.0γ: 0.2562421.546.3
Sinusoid with intermittent jumpsInstantaneous learner6.380.658.19-83891.494.9
Bayesian learner6.450.688.26κ: 6.1383796.50
Exponential discounting6.430.678.20γ: 0.2483798.11.64
  1. Note: WAIC values were computed for each participant and summed across participants. A low WAIC indicates a better model. ΔWAIC is relative to the WAIC of the Bayesian learner.

Observed and predicted relative auditory weights for continuous sequences of visual noise.

Relative auditory weights wA of the 1st (solid) and the flipped 2nd half (dashed) of a period (binned into 20 bins) plotted as a function of the normalized time in the sinusoidal (red), the RW1 (blue) and the RW2 (green) sequences. Relative auditory weights were computed from auditory localization responses of human observers (A), Bayesian (B), exponential (C), or instantaneous (D) learning models. For comparison, the standard deviation of the visual signal is shown in (E). Please note that all models were fitted to observers’ auditory localization responses (i.e. not the auditory weight wA). (F) Bayesian model comparison – Random effects analysis: The matrix shows the protected exceedance probability (color coded and indicated by the numbers) for pairwise comparisons of the Instantaneous (Inst), Bayesian (Bayes) and Exponential (Exp) learners separately for each of the four experiments. Across all experiments we observed that the Bayesian or the Exponential learner outperformed the Instantaneous learner (i.e. a protected exceedance probability >0.94) indicating that observers used the past to estimate sensory uncertainty. However, it was not possible to arbitrate reliably between the Exponential and the Bayesian learner across all experiments (protected exceedance probability in bottom row).

We compared the three models in a fixed and random effects analysis (Penny et al., 2010; Rigoux et al., 2014) using the Watanabe-Akaike information criterion (WAIC) as appropriate for evaluating model samples (Gelman et al., 2014) (i.e. a low WAIC indicates a better model, a difference greater than 10 is considered very strong evidence for a model). In the fixed-effects analysis (see Table 2 for details), the Bayesian learner was substantially better than the instantaneous learner across all three experiments, but outperformed the exponential learner reliably only in the sinusoidal sequence. Likewise, the random-effects analysis based on hierarchical Bayesian model selection (Penny et al., 2010; Rigoux et al., 2014) showed a protected exceedance probability that was substantially greater for the Bayesian learner (Sin, RW2) or the exponential learner (RW1, RW2) than for the instantaneous learner (Figure 4F). However, the direct comparison between the Bayesian and the exponential learner did not provide consistent results across experiments. As shown in Figure 4A and B, both the Bayesian and the exponential learner accurately reproduced the temporal asymmetry for the auditory weights across all three experiments.

From the optimal Bayesian learner, we inferred observers’ estimated rate of change in visual reliability (i.e. parameter 1κ). The sinusoidal sequence was estimated to change at a faster pace (median κ = 7.4 across observers, 95% confidence interval, 95% CI [4.8, 10.8] estimated via bootstrapping) than the RW1 sequence (median κ = 8.1, 95% CI [7.0,14.9]), but slower than the RW2 sequence (median κ = 6.7, 95% CI [4.4,11.2]) indicating that the Bayesian learner accurately inferred that visual reliability changed at different pace across the three continuous sequences (see legend of Figure 2). Likewise, the learning rates 1-γ of the exponential learner accurately reflect the different rates of change across the sequences (Sinusoid γ= 0.23, 95% CI [0.14, 0.28]; RW1: γ= 0.33, 95% CI [0.21, 0.38]; RW2: γ= 0.25, 95% CI [0.21, 0.29]). Both the Bayesian and the exponential learner thus estimated a smaller rate of change for the RW1 than for the sinusoidal sequence – although caution needs to be applied when interpreting these results given the extensive confidence intervals. Further, the learning rates of the exponential learner imply that observers gave the visual signals presented 4.1 (Sinusoid), 5.4 (RW1), and 4.3 (RW2) seconds before the current stimulus 5% of the weight they assigned to the current visual signal to estimate the visual reliability.

To further disambiguate between the Bayesian and the exponential learner, we designed a fourth experimental ‘jump sequence’ that introduced abrupt increases or decreases in physical visual noise at three positions into the sinusoidal sequence (Figure 5A). Using the same analysis approach as for experiments 1–3, we replicated the temporal asymmetry for the auditory weights (Figure 5B). For all three ‘jump positions’, wA was significantly smaller for the 1st half of each period, when visual noise increased, than the 2nd half, when visual noise decreased over time. The 3 (jump positions) x 2 (1st vs. flipped 2nd half) x 7 (bins) repeated measures ANOVA showed a significant main effect of 1st versus flipped 2nd period’s half (F(1,17) = 24.824, p<0.001, partial η2 = 0.594), while this factor was not involved in any higher-order interaction (see Table 1). Further, in a regression model the current visual STD (t(17) = 11.655, p<0.001, Cohen’s d = 2.747) and the difference between current and previous STD (t(17) = −4.768, p<0.001, Cohen’s d = −1.124) significantly predicted the relative auditory weights. Thus, we replicated our finding that the visual noise in the past influenced observers’ current visual uncertainty estimate as indexed by the relative auditory weights wA.

Figure 5 with 2 supplements see all
Time course of visual noise and relative auditory weights for sinusoidal sequence with intermittent jumps in visual noise (N = 18).

(A) The visual noise (i.e. STD of the cloud of dots, right ordinate) is displayed as a function of time. Each cycle included one abrupt increase and decrease in visual noise. The sequence of visual clouds was presented every 200 ms (i.e. at 5 Hz) while audiovisual (AV) signals (black dots) were interspersed with an inter-trial asynchrony jittered between 1.4 and 2.8 s. (B, C) Relative auditory weights wA of the 1st (solid) and the flipped 2nd half (dashed) of a period (binned into 15 bins) plotted as a function of the time in the sinusoidal sequence with intermitted inner (light gray), middle (gray), and outer (dark gray) jumps. Relative auditory weights were computed from auditory localization responses of human observers (B) and the Bayesian learning model (C). Please note that all models were fitted to observers’ auditory localization responses (i.e. not the auditory weight wA).

Bayesian model comparison using a fixed-effects analysis showed that both the Bayesian learner and the exponential learner substantially outperformed the instantaneous learner (see Table 2). However, consistent with our Bayesian model comparison results for the continuous sequences, the Bayesian learner did not provide a better explanation for observers’ responses than the exponential learner (ΔWAIC = +2, see Table 2, Figure 5C and Figure 5—figure supplement 1A). Likewise, a random-effects analysis based on hierarchical Bayesian model selection showed that the Bayesian and the exponential learners outperformed the instantaneous learner, but again we were not able to adjudicate between the Bayesian and exponential learner (Figure 4F, see also methods and results in Appendix 1, Figure 5—figure supplement 2 and Supplementary file 1-Table 6 for further analyses justifying the choice of continuous learning models in the jump sequence).

In summary, across four experiments that used continuous and discontinuous sequences of visual noise, we have shown that the Bayesian or exponential learners outperform the instantaneous learner. However, across the four experiments we were not able to decide whether observers adapted to changes in visual noise according to a Bayesian or an exponential learner. The key feature that distinguishes between the Bayesian and the exponential learner is that only the Bayesian learner adapts dynamically based on its uncertainty about its visual reliability estimates. As a consequence, the Bayesian learner should adapt faster than the exponential learner to increases in physical visual noise (i.e. spread of the visual cloud) but slower to decreases in visual noise. From the Bayesian learner’s perspective, the faster learning for increases in visual noise emerges because it is unlikely that visual dots form a large spread cloud under the assumption that the true visual spread of the cloud is small. Conversely, the Bayesian learner will adapt more slowly to decreases in visual variance, because under the assumption of a visual cloud with a large spread visual dots may form a small cloud by chance. Indeed, previous research has shown that observers adapt their variance estimates faster for changes from small to large than for changes from large to small variance (Berniker et al., 2010). However, these results have been shown for learning about a hidden variable such as the prior that defines the spatial distribution from which an object’s location is sampled. In our study, we manipulated the variance of the likelihood, that is the variance of the clouds of dots.

Asymmetric differences in adaptation rate between the exponential and the Bayesian learner should thus be amplified if we increase observer’s uncertainty about its visual reliability estimate by reducing the number of dots of the visual cloud from 20 to 5 dots. Based on simulations, we therefore explored whether we could experimentally discriminate between the Bayesian and exponential learner using continuous sinusoidal or discontinuous ‘jump’ sequences with visual clouds of only five dots. For the two sequences, we simulated the sound localization responses of 12 observers based on the Bayesian learner model and fitted the Bayesian and exponential learner models to the responses of each simulated Bayesian observer. Figure 6 shows observers’ auditory weights indexing their estimated visual reliability across time that we obtained from the fitted responses of the Bayesian (blue) and the exponential learner (green). The simulations reveal the characteristic differences in how the Bayesian and the exponential learner adapt their visual uncertainty estimates to increases and decreases in visual noise. As expected, the Bayesian learner adapts its visual uncertainty estimates faster than the exponential learner to increases in visual noise, but slower to decreases in visual noise. Nevertheless, these differences are relatively small, so that the difference in mean log likelihood between the Bayesian and exponential learner is only −1.82 for the sinusoidal sequence and −2.74 for the jump sequence.

Time course of the relative auditory weights, the standard deviation (STD) of the visual cloud and the STD of the visual uncertainty estimates.

(A) Relative auditory weights wA of the 1st (solid) and the flipped 2nd half (dashed) of a period (binned into 15 bins) plotted as a function of the time in the sinusoidal sequence. Relative auditory weights were computed from the predicted auditory localization responses of the Bayesian (blue) or exponential (green) learning models fitted to the simulated localization responses of a Bayesian learner based on visual clouds of 5 dots. (B) Relative auditory weights wA computed as in (A) for the sinusoidal sequence with intermitted jumps. Only the outer-most jump (dark brown in Figure 5B/C and Figure 5—figure supplement 1) is shown. (C, D) STD of the visual cloud of 5 dots (gray) and the STD of observers’ visual uncertainty as estimated by the Bayesian (blue) and exponential (green) learners (that were fitted to the simulated localization responses of a Bayesian learner) as a function of time for the sinusoidal sequence (C) and in the sinusoidal sequence with intermitted jumps (D). Note that only an exemplary time course from 600 to 670 s after the experiment start is shown.

Next, we investigated whether our experiments successfully mimicked situations in which observers benefit from integrating past and current information to estimate their sensory uncertainty. We compared the accuracy of the instantaneous, exponential and Bayesian learner’s visual uncertainty estimates in terms of their mean absolute deviation (in percentage) from the true variance. For Gaussian clouds of 20 dots, the instantaneous learner’s error in the visual uncertainty estimates of 21.7% is reduced to 13.7% and 14.9% for the exponential and Bayesian learners, respectively (with best fitted γ = 0.6, in the sinusoidal sequence). For Gaussian clouds composed of only five dots, the exponential and Bayesian learners even cut down the error by half (i.e. 46.8% instantaneous learner, 29.5% exponential learner, 23.9% Bayesian learner, with best fitted γ = 0.7).

Collectively, these simulation results suggest that even in situations in which observers benefit from combining past with current sensory inputs to obtain more precise uncertainty estimates, the exponential learner is a good approximation of the Bayesian learner, making it challenging to dissociate the two experimentally based on noisy human behavioral responses.

Discussion

The results from our four experiments challenge classical models of perceptual inference where a perceptual interpretation is obtained using a likelihood that depends solely on the current sensory inputs (Ernst and Banks, 2002). These models implicitly assume that sensory uncertainty (i.e. likelihood variance) is instantaneously and independently accessed from the sensory signals on each trial based on initial calibration of the nervous system (Jacobs and Fine, 1999). Most prominently, in the field of cue combination it is generally assumed that sensory signals are weighted by their uncertainties that are estimated only from the current sensory signals (Alais and Burr, 2004; Ernst and Banks, 2002; Jacobs, 1999) (but see Mikula et al., 2018; Triesch et al., 2002).

By contrast, our results demonstrate that human observers integrate inputs weighted by uncertainties that are estimated jointly from past and current sensory signals. Across the three continuous and the one discontinuous jump sequences, observers’ current visual reliability estimates were influenced by visual inputs that were presented 4–5 s in the past albeit their influence amounted to only 5% of the current visual signals.

Critically, observers adapted their visual uncertainty estimates flexibly according to the rate of change in the visual noise across the experiments. As predicted by both Bayesian and exponential learning models, observers’ visual reliability estimates relied more strongly on past sensory inputs, when the visual noise changed more slowly across time. While observers did not explicitly notice that each of the four experiments was composed of repetitions of temporally symmetric sequence components, we cannot fully exclude that observers may have implicitly learnt this underlying temporal structure. However, implicit or explicit knowledge of this repetitive sequence structure should have given observers the ability to predict and preempt future changes in visual reliability and therefore attenuated the temporal lag of the visual reliability estimates. Put differently, our experimental choice of repeating the same sequence component over and over again in the experiment cannot explain the influence of past signals on observers’ current reliability estimate, but should have reduced or even abolished it.

Importantly, the key feature that distinguishes the Bayesian from the exponential learner is how the two learners adapt to increases versus decreases in visual noise. Only the Bayesian learner represents and accounts for its uncertainty about its visual reliability estimates. As compared to the exponential learner, it should therefore adapt faster to increases but slower to decreases in visual noise (e.g. see Berniker et al., 2010). Our simulation results show this profile qualitatively, when the learner’s uncertainty about its visual reliability estimate is increased by reducing the number of dots (see Figure 6). But even for visual clouds of five dots, the differences in learning curves between the Bayesian and exponential learner are very small making it difficult to adjudicate between them given noisy observations from real observers. Unsurprisingly, therefore, Bayesian model comparison showed consistently across all four experiments that observers’ localization responses can be explained equally well by an optimal Bayesian and an exponential learner. These results converge with a recent study showing that learning about a hidden variable such as observers’ priors can be accounted for by an exponential averaging model (Norton et al., 2019).

Collectively, our experimental and simulation results suggest that under circumstances where observers substantially benefit from combining past and current sensory inputs for estimating sensory uncertainty, optimal Bayesian learning can be approximated well by more simple heuristic strategies of exponential discounting that update sensory weights with a fixed learning rate irrespective of observers’ uncertainty about their visual reliability estimate (Ma and Jazayeri, 2014; Shen and Ma, 2016). Future research will need to assess whether observers adapt their visual uncertainty estimates similarly if visual noise is manipulated via other methods such as stimulus luminance, duration, or blur.

From the perspective of neural coding, our findings suggest that current theories of probabilistic population coding (Beck et al., 2008; Ma et al., 2006; Hou et al., 2019) may need to be extended to accommodate additional influences of past experiences on neural representations of sensory uncertainties. Alternatively, the brain may compute sensory uncertainty using strategies of temporal sampling (Fiser et al., 2010).

In conclusion, our study demonstrates that human observers do not access sensory uncertainty instantaneously from the current sensory signals alone, but learn sensory uncertainty over time by combining past experiences and current sensory inputs as predicted by an optimal Bayesian learner or approximate strategies of exponential discounting. This influence of past signals on current sensory uncertainty estimates is likely to affect learning not only at slower timescales across trials (i.e. as shown in this study), but also at faster timescales of evidence accumulation within a trial (Drugowitsch et al., 2014). While our research unravels the impact of prior sensory inputs on uncertainty estimation in a cue combination context, we expect that they reveal fundamental principles of how the human brain computes and encodes sensory uncertainty.

Materials and methods

Participants

Seventy-six healthy volunteers participated in the study after giving written informed consent (40 female, mean age 25.3 years, range 18–52 years). All participants were naïive to the purpose of the study. All participants had normal or corrected-to normal vision and reported normal hearing. The study was approved by the human research review committee of the University of Tuebingen (approval number 432 2007 BO1) and the research review committee of the University of Birmingham (approval number ERN_11–0470P).

Stimuli

Request a detailed protocol

The visual spatial stimulus was a Gaussian cloud of twenty bright gray dots (0.56° diameter, vertical STD 1.5°, luminance 106 cd/m2) presented on a dark gray background (luminance 62 cd/m2, i.e. 71% contrast). The auditory spatial cue was a burst of white noise with a 5 ms on/off ramp. To create a virtual auditory spatial cue, the noise was convolved with spatially specific head-related transfer functions (HRTFs). The HRTFs were pseudo-individualized by matching participants’ head width, heights, depth, and circumference to the anthropometry of subjects in the CIPIC database (Algazi et al., 2001). HRTFs from the available locations in the database were interpolated to the desired locations of the auditory cue.

Experimental design and procedure

In a spatial ventriloquist paradigm, participants were presented with audiovisual spatial signals. Participants indicated the location of the sound by pressing one of five spatially corresponding buttons and were instructed to ignore the visual signal. Participants did not receive any feedback on their localization response. The visual signal was a cloud of 20 dots sampled from a Gaussian. The visual clouds were re-displayed with variable horizontal STDs (see below) every 200 ms (i.e. at a rate of 5 Hz; Figure 1A). The cloud’s location mean was temporally independently resampled from five possible locations (−10°, −5°, 0°, 5°, 10°) on each trial with the inter-trial asynchrony jittered between 1.4 and 2.8 s in steps of 200 ms. In synchrony with the change in the cloud’s location, the dots changed their color and a concurrent sound was presented. The location of the sound was sampled from ±5° visual angle with respect to the mean of the visual cloud. Observers’ visual uncertainty estimate was quantified in terms of the relative weight of the auditory signal on the perceived sound location. The change in the dot’s color and the emission of the sound occurred in synchrony to enhance audiovisual binding.

Continuous sinusoidal and RW sequences

Request a detailed protocol

Critically, to manipulate visual noise over time, the cloud’s STD changed at a rate of 5 Hz according to (i) a sinusoidal sequence, (ii) an RW sequence 1 or (iii) an RW sequence 2 (Figure 2). In all sequences, the horizontal STD of the visual cloud spanned a range from 2 to 18°:

  1. Experiment1 - Sinusoidal sequence (Sinusoid): A sinusoidal sequence was generated with a period of 30 s. During the ~65 min of the experiment, each participant completed ~130 cycles of the sinusoidal sequence.

  2. Experiment2 - Random walk sequence 1 (RW1): First, we generated an RW sequence of 60 s duration using a Markov chain with 76 discrete states and transition probabilities of stay (1/3), change to lower (1/3) or upper (1/3) adjacent states. To ensure that the RW sequence segment starts and ends with the same value, this initial 60-s sequence segment was concatenated with its temporally reversed segment resulting in an RW sequence segment of 120 s duration. Each participant was presented with this 120 s RW1 sequence approximately 32 times during the experiment.

  3. Experiment3 - Random walk sequence 2 (RW2): Likewise, we created a second random-walk sequence of 15 s duration using a Markov chain with only 38 possible states and transition probabilities similar to above. The 15-s sequence was concatenated with its temporally reversed version resulting in a 30-s sequence. The smoothness of this sequence segment was increased by filtering it (without phase shift) with a moving average of 250 ms. Each participant was presented with this sequence segment ~130 times.

Generally, a session of a sinusoid, RW1, or RW2 sequence included 1676 trials. Because of experimental problems, four sessions included only 1128, 1143, or 1295 trials. Before the experimental trials, participants practiced the auditory localization task in 25 unimodal auditory trials, 25 audiovisual congruent trials with a single dot as visual spatial cue and 75 trials with stimuli as in the main experiment.

Experiment 4 - Sinusoidal sequence with intermittent changes in visual noise (sinusoidal jump sequence)

Request a detailed protocol

To dissociate the Bayesian learner from approximate exponential discounting, we designed a sinusoidal sequence (period = 30 s) with intermittent increases/decreases in visual variance (Figure 5). As shown in Figure 5A, we inserted increases by 8° in visual STD at three levels of visual STD: 7.2°, 8.6°, 9.6° STD. Conversely, we inserted decreases by 8° in visual STD at 15.3°, 16.7°, 17.7° STD. We inserted jumps selectively in the period sections of high visual variance to make the jumps less apparent and maximize the chances that observers treated the series as a continuous sequence. As a result, the up-jumps occurred when the increases in visual variance were fastest (i.e. steeper slope), while the down-jumps occurred after sections in which the visual variance was relatively constant (i.e. shallow slope). We factorially combined these 3 (increases) x 3 (decreases) such that each sinewave cycle included exactly one sudden increase and decrease in visual STD (i.e. nine jump types). Otherwise, the experimental paradigm and stimuli were identical to the continuous sinusoidal sequence described above. During the ~80 min of this experiment, each participant completed ~154 cycles of the sinusoidal sequence including 16–18 cycles for each of the nine jump types. This sinusoidal jump sequence was expected to maximize differences in adaptation rate for the Bayesian and exponential learner. If participants continuously update their estimates of the visual reliability, as opposed to using a change point model (Adams and Mackay, 2007; Heilbron and Meyniel, 2019), the exponential learner will weight past and present uncertainty estimates throughout the entire sequence according to the same exponential function. By contrast, the Bayesian learner will take into account its uncertainty about the visual reliability and therefore adapt its visual reliability estimate for jumps from high to low visual variance (resp. low to high visual reliability, see Figure 6) more slowly than the exponential learner (see Appendix 1).

Subject numbers and inclusion criteria

Request a detailed protocol

Of the 76 subjects, 30 participated in the sinusoidal and the RW1 sequence session. Eight additional subjects participated only in the RW1 sequence session. Eighteen additional subjects participated in the RW2 sequence session. One participant completed all three continuous sequences. Twenty subjects participated in the sinusoidal sequence with intermittent changes in visual uncertainty. In total, we collected data from 30 participants for the sinusoidal, 38 participants for the RW1, 19 participants for the RW2, and 20 participants for the sinusoidal jump sequence. The sample sizes of 20–38 participants were based on a pilot experiment, which showed individually significant effects of past visual noise on the weighting of audiovisual spatial signals in 6/6 pilot participants. From these samples, we excluded participants if their perceived sound location did not depend on the current visual reliability (i.e. inclusion criterion p<0.05 in the linear regression; please note that this inclusion criterion is orthogonal to the question of whether participants’ visual uncertainty estimate depends on visual signals prior to the current trial). Thus, we excluded five participants of the sinusoidal and RW1 sequence and two participants from the sinusoidal jump sequence. Finally, we analyzed data from 25 participants for the sinusoidal, 33 participants for the RW1, 19 participants for the RW2, and 18 participants for the sinusoidal jump sequence.

Experimental setup

Request a detailed protocol

Audiovisual stimuli were presented using Psychtoolbox 3.09 (Brainard, 1997; Kleiner et al., 2007) (http://www.psychtoolbox.org) running under Matlab R2010b (MathWorks) on a Windows machine (Microsoft XP 2002 SP2). Auditory stimuli were presented at ~75 dB SPL using headphones (Sennheiser HD 555). As visual stimuli required a large field of view, they were presented on a 30″ LCD display (Dell UltraSharp 3007WFP). Participants were seated at a desk in front of the screen in a darkened booth, resting their head on an adjustable chin rest. The viewing distance was 27.5 cm. This setup resulted in a visual field of approximately 100°. Participants responded via a standard QWERTY keyboard. Participants used the buttons [i, 9, 0, -, = ] with their right hand for localization responses.

Data analysis

Continuous sinusoidal and RW sequences

Request a detailed protocol

At trial onset the visual cloud’s location mean was independently resampled from five possible locations (−10°, −5°, 0°, 5°, 10°). Concurrently, the cloud’s dots changed their color and a sound was presented sampled from ±5° visual angle with respect to the mean of the visual cloud. The inter-trial asynchrony was jittered between 1.4 and 2.8 s in steps of 200 ms. Therefore, across the experiment the trial onsets occurred at different times relative to the period of the changing visual cloud’s STD resulting in a greater effective sampling rate than provided if the inter-trial asynchrony had been fixed.

For each period of the three continuous sinusoidal and RW sequences, we sorted the trials (i.e. trial-specific visual cloud’s STD, visual location, auditory location, and observers’ sound localization responses) into 20 temporally adjacent bins that covered one complete period of the changing visual STD. This resulted in about 1676 trials in total/20 bins = approximately 80 trials on average per bin in each subject (more specifically: a range of 52–96 (Sin), 52–92 (RW 1), or 71–93 (RW2) trials, for details see Supplementary file 1-Table 1).

We quantified the influence of the auditory and visual locations on observers’ perceived auditory location for each bin by estimating a regression model separately for each bin (i.e. one regression model per bin). For instance, for bin = 1 we computed:

(1) RA,trial,bin=1=LA,trial,bin=1ßA,bin=1 +LV,trial,bin=1ßV,bin=1 +ßconst,bin=1 +etrial,bin=1

with RA,trial, bin=1 = Localization response for trial t and bin 1; LA,trial,bin=1 or LV,trial,bin=1 = ‘true’ auditory or visual location for trial t and bin 1; ßA,bin=1 or ßV,bin=1 = auditory or visual weight for bin 1; ßconst,bin=1 = constant term; etrial,bin=1 = error term for trial t and bin 1. For each bin b, we thus obtained one auditory and one visual weight estimate. The relative auditory weight for a particular bin was computed as wA,bin = ßA,bin / (ßA,bin + ßV,bin) (Figure 2A–C).

By design, the temporal evolution of the physical visual variance (i.e. STD of the visual cloud) is symmetric for each period in the sinusoidal, RW1 and RW2 sequences. In other words, for physical visual noise, the 1st half and the flipped 2nd half within a period are identical (Figure 3E). Given this symmetry constraint, we evaluated the influence of past visual noise on participants’ auditory weight wA,bin by comparing the wA for the bins in the 1st half and the flipped 2nd half in a repeated measures ANOVA. If human observers estimate visual uncertainty by combining prior with current visual uncertainty estimates as expected for a Bayesian learner, wA should differ between the 1st half and the mirror-symmetric flipped 2nd half of the sequence. More specifically, wA should be smaller for the 1st half in which visual variance increased than for the mirror-symmetric time points of the 2nd half in which visual variance decreased. To test this prediction, we entered the subject-specific wA,bin into 2 (1st vs. flipped 2nd half) x 9 (bins, i.e. removing the bins at maximal and minimal visual noise values) repeated measures ANOVAs separately for the sinusoidal, RW1 and RW2 experiments (Table 1). For the sinusoidal sequence, we expected a main effect of ‘half’ because the sequence increased/decreased monotonically within each half period. For the RW1 and RW2 sequences, an influence of prior visual noise might also be reflected in an interaction effect of ‘half x bin’ because these sequences increased/decreased non-monotonically within each half period.

To further test whether the noise of past visual signals influenced observers’ current visual uncertainty estimate, we employed a regression model in which the relative auditory weights wA,bin were predicted by the visual STD in the current bin and the difference in STD between the current and the previous bin:

(2) wA,bin=σV,bin ß σV + (σV,bin σV,bin1) ß ΔσV + ß const +ebin

with wA,bin = relative auditory weight in bin b; σV,bin = mean visual STD in current bin b or previous bin b-1; ßconst = constant term; ebin = error term. To allow for generalization to the population level, the parameter estimates (ßσV, ßΔσV) for each participant were entered into two-sided one-sample t-tests at the between-subject random-effects level.

Sinusoidal sequence with intermittent changes in visual uncertainty

Request a detailed protocol

For each period of the sinusoidal sequence with intermittent changes, we sorted the values for the physical visual cloud’s variance (i.e. the cloud’s STD) and sound localization responses into 15 temporally adjacent bins which were positioned to capture the jumps in visual noise. For analysis of these sequences, we recombined the first and second halves of the 3 (increases at low, middle, high) x 3 (decreases at low, middle, high) sinewave cycles into three types of sinewave cycles such that both jumps were at low (=outer jump), middle (=middle jump), or high (=inner jump) visual noise. This recombination makes the simplifying assumption that the jump position of the first half will have negligible effects on participants’ uncertainty estimates of the second half. As a result of this recombination, each bin comprised at least 44–51 trials across participants (Supplementary file 1-Table 1). As for the continuous sequences, we quantified the auditory and visual influence on the perceived auditory location for each bin based on separate regression models for the 15 temporally adjacent bins (see Equation 1). Next, we independently computed the relative auditory weight wA,bin = ßA,bin / (ßA,bin + ßV,bin) for each of the 15 temporally adjacent bins. We statistically evaluated the influence of past visual noise on participants’ auditory weight on the wA in terms of the difference between 1st half and flipped 2nd half using a 2 (1st vs. flipped 2nd half) x 7 (bins) x 3 (jump: inner, middle, outer) repeated measures ANOVAs (Table 1).

Computational models (for continuous and discontinuous sequences)

To further characterize whether and how human observers use their uncertainty about previous visual signals to estimate their uncertainty of the current visual signal, we defined and compared three models in which visual reliability (λV) was (1) estimated instantaneously for each trial (i.e. instantaneous learner), was updated via (2) Bayesian learning or (3) exponential discounting (i.e. exponential learner) (Figure 1—figure supplement 1).

In the following, we will first describe the generative model that accounts for the fact that (1) visual uncertainty usually changes slowly across trials (i.e. time-dependent uncertainty changes) and (2) auditory and visual signals can be generated by one common or two independent sources (i.e. causal structure). Using this generative model as a departure point, we then describe how the instantaneous learner, the Bayesian learner and the exponential learner perform inference. Finally, we will explain how we account for participants’ internal noise and predict participants’ responses from each model (i.e. the experimenter’s uncertainty).

Generative model

Request a detailed protocol

On each trial t, the subject is presented with an auditory signal At, from a source SA,t, (see Figure 1—figure supplement 1) together with a visual cloud of dots at time t arising from a source, SV,t, drawn from a Normal distribution SV,t ~ N(0, 1/λS) with the spatial reliability (i.e. inverse of the spatial variance): λS=1/σS2. Critically, SA,t and SV,t, can either be two independent sources (C = 2) or one common source (C = 1): SA,t = SV,t = St (Körding et al., 2007).

We assume that the auditory signal is corrupted by noise, so that the internal signal is At ~ N(SA,t,  1/λA). By contrast, the individual visual dots (presented at high visual contrast) are assumed to be uncorrupted by noise, but presented dispersed around the location SV,t according to Vi,t ~ N(Ut,  1/λV,t), where Ut ~ N(SV,t,  1/λV,t). The dispersion of the individual dots,  1/λV,t, is assumed to be identical to the uncertainty about the visual mean, allowing subjects to use the dispersion as an estimate of the uncertainty about the visual mean.

The visual reliability of the visual cloud, λV,t=1/σV,t2, varies slowly at the re-display rate of 5 Hz according to a log RW: logλV,tN(logλV,t1,1/κ) with 1κ being the variability of λV,t in log space. We also use this log RW model to approximate learning in the four jump sequence (see Behrens et al., 2007).

The generative models of the instantaneous, Bayesian, and exponential learners all account for the causal uncertainty by explicitly modeling the two potential causal structures. Yet, they differ in how they estimate the visual uncertainty on each trial, which we will describe in greater detail below.

Observer inference

Request a detailed protocol

The instantaneous, Bayesian, and exponential learners invert this (or slightly modified, see below) generative model during perceptual inference to compute the posterior probability of the auditory location, SA,t, given the observed At and Vi,t. The observer selects a response based on the posterior using a subjective utility function which we assume to be the minimization of the squared error (SA,t - Strue)2. For all models, the estimate for the location of the auditory source is obtained by averaging the auditory estimates under the assumption of common and independent sources by their respective posterior probabilities (i.e. model averaging, see Figure 1—figure supplement 1):

(3) S^A,t =S^A,C=1,t P(Ct =1|At ,V1:n,t )+S^A,C=2,t(1P(Ct =1|At ,V1:n,t ))

where S^A,C=1,t and S^A,C=2,t depend on the model (see below), and P(C =1|At ,V1:n,t ) is the posterior probability that the audio and visual stimuli originated from the same source according to Bayesian causal inference (Körding et al., 2007).

(4) P(Ct =1|At ,V1:n,t )=(P(At ,V1:n,t |C=1)P(Ct=1))(P(At ,V1:n,t |Ct=1)P(Ct=1))+(P(At ,V1:n,t |Ct=2)(1P(Ct=1))

Finally, for all models, we assume that the observer pushes the button associated with the position closest to S^A,t. In the following, we describe the generative and inference models for the instantaneous, Bayesian, and exponential learners. For the Bayesian learner, we focus selectively on the model component that assumes a common cause, C = 1 (for full derivation including both model components, see Appendix 2).

Model 1: Instantaneous learner

Request a detailed protocol

The instantaneous learning model ignores that the visual reliability (i.e. the inverse of visual uncertainty) of the current trial depends on the reliability of the previous trial. Instead, it estimates the visual reliability independently for each trial from the spread of the cloud of visual dots:

(5) P(SA,t,Ut,λV,t| A1:t, V1:n,1:t)=P(SA,t,Ut,λV,t| At, V1:n,t)=P(C=1|At ,V1:n,t )PC=1(St,Ut,λV,t| At, V1:n,t)+P(C=2|At ,V1:n,t )PC=2(SA,t,Ut,λV,t| At, V1:n,t)=P(C=1|At ,V1:n,t )P(St)P(At|St)PC=1(Ut|St,λV,t)P(V1:n,t|Ut,λV,t)P(λV,t)Z1+(1P(C=1|At ,V1:n,t )) P(SA,t)P(At|SA,t)/Z2.

with Z1, Z2 as normalization constants.

Apart from P(C=1|At,Vt), these terms are all normal distributions, while we assume in this model that P(λV,t) is uninformative. Hence, visual reliability is computed from the variance: λVt^=1/(σVt2+σVt2n) where σVt2=1/(n1)i=1n(Vi,tV¯i,t)2 is the sample variance (and  V¯t=1/ni=1nVi,t is the sample mean). The causal component estimates are given by:

(6) S^A,C=1,t=λ^V,tVt¯ +λAAtλV,t+λA+λS
(7) S^A,C=2,t=λAAtλA+λS

These two components are then combined based on the posterior probabilities of common and independent cause models (see Equation 3). This model is functionally equivalent to a Bayesian causal inference model as described in Körding et al., 2007, but with visual reliability computed directly from the sample variance rather than a fixed unknown parameter (which the experimenter estimates during model fitting).

Model 2: Bayesian learner

Request a detailed protocol

The Bayesian learner capitalizes on the slow changes in visual reliability across trials and combines past and current inputs to provide a more reliable estimate of visual reliability and hence auditory location. It computes the posterior probability based on all auditory and visual signals presented until time t (here only shown for C = 1, see Appendix 2).

According to Bayes rule, the joint probability of all variables until time t can be written based on the generative model as:

(8) P(λV,1:t, A1:t, U1:t, V1:n,1:t,S1:t)=P(A1|S1)P(V1:n,1|U1,λV,1)P(U1|S1,λV,1)P(S1)P(λV,1)k=2tP(Ak|Sk)P(V1:n,k|Uk,λV,k)P(Uk|Sk,λV,k)P(λV,k|λV,k1)P(Sk)

As above, the visual likelihood is given by the product of individual Normal distributions for each dot i: P(V1:n,t|Ut,λV,t)=i=1nN(Vi,t|Ut,1/λV,t), and P(Ut|St,λV,t)=N(Ut|St,1/λV,t).

The prior P(St) is a Normal distribution N(St|0,1/λS)  and the auditory likelihood.

P(At,|St) is a Normal distribution N(At|St,1/λA). As described in the generative model, P(λV,k|λV,k1)  is given by logλV,tN(logλV,t1,1/κ).

Importantly, only the visual reliability, λV,t, is directly dependent on the previous trial (P(λV,k,λV,k1)=P(λV,k|λV,k1)P(λV,k1)P(λV,k)P(λV,k1)). Because of the Markov property (i.e. λV,t  depends only on λV,t1), the joint distribution for time t can be written as

(9) P(λV,t, λV,t1, At, Ut, V1:n,t,St)=P(At|St)P(Ut|St,λV,t)P(V1:n|Ut,λV,t)P(λV,t|λV,t1)P(λV,t1|V1:n,t1,At1)P(St).

Hence, the joint posterior probability over location and visual reliability given a stream of auditory and visual inputs can be rewritten as:

(10) P(St,Ut,λV,t| A1:t, V1:n,1:t)=P(St)P(At|St)P(Ut|St,λV,t)P(V1:n|Ut,λV,t)P(λV,t|λV,t1)P(λV,t1|V1:n,t1,At1)dλV,t1/Z.

As this equation cannot be solved analytically, we obtain an approximate solution by factorizing the posterior in terms of the unknown variables (St,Ut,λV,t) according to the method of variational Bayes (Bishop, 2006). In this approximate method (for details see Appendix 2), the posterior is factorized into three terms, each a normal distribution:

P(St,Ut,λV,tAt,V1:n,t)q(St,Ut,λV,tt)=q(St)q(Ut)q(λV,tt).

In order to estimate the set of parameters (mean and variance) of q(St), q(Ut) and q(λV,tt), the Free Energy is minimized iteratively (and thereby the Kullback–Leibler divergence between the true and approximate distribution), until a convergence criterion is reached (here, the change in each fitted parameter is less than 0.0001 between iterations).

This is done separately for the common cause model component (C = 1) and the independent cause model component (C = 2). The auditory location, for the common cause model is based on the approximation over the posterior location of S^A,C=1,t from, q1(St)=N(S^A,C=1,t,σ1,t). The auditory location for the independent cause model is simply computed as S^A,C=2,t=At/(1+σA2/σ02), because it is independent of the visual signal.

The marginal model evidence is estimated based on the minimized Free Energy for each mode component, P(At ,V1:n,t|C=1 ), respectively P(At ,V1:n,t|C=2 ) to form the posterior probability P(C=1|At ,V1:n,t ), as described above in Equation 4. These values can then be used to compute the predicted responses for a particular participant according to Equation 3.

Model 3: Exponential learner

Request a detailed protocol

Finally, the observer may approximate the full Bayesian inference of the Bayesian learner by a more simple heuristic strategy of exponential discounting. In the exponential discounting model, the observer learns the visual reliability by exponentially discounting past visual reliability estimates:

(11) λ^V,t1=1/σVt2 (1γ)+ λ^V,t1γ

where σVt2=1/(n1)i=1n(Vi,tV¯i,t)2 is the sample variance and V¯t=1/ni=1nVi,t is the sample mean.

Similar to the optimal Bayesian learner (above), this observer model uses the past to compute the current reliability, but it does so based on a fixed learning rate 1 - γ. Computation is otherwise performed in accordance with models 1 and 2, Equations 3-4 and 6-7.

Assumptions of the computational models: motivation and caveats

Request a detailed protocol

Computational models inherently make simplifying assumptions about the generation of the sensory inputs and observers’ inference.

First, we modeled that visual signals (i.e. the cloud’s mean) were sampled from a Gaussian, while they were sampled from a uniform discrete distribution (i.e. [−10°, −5°, 0°, 5°, 10°]) in the experiment. Gaussian assumptions about the stimuli locations have nearly exclusively been made in the recent series of studies focusing on Bayesian Causal Inference in multisensory perception (Körding et al., 2007; Rohe and Noppeney, 2015b; Rohe and Noppeney, 2015a). Because visual signals have been sampled from a wide range of visual angle (i.e. 20°) and are corrupted by physical (i.e. cloud of dots) and internal neural noise, we used the simplifying assumption of a Gaussian spatial prior consistent with previous research.

Second, we assumed that the auditory signal location is sampled from a Gaussian, while the experiments presented sounds ±5° from the visual location. These Gaussian assumptions about sound location can be justified by the fact that observers are known to be limited in their sound localization ability, particularly when generic HRTFs were used to generate spatial sounds. Moreover, because sounds are presented together with visual signals, it is even harder for observers to obtain an accurate estimate of the sound’s location.

Third, in the experiment we generated the cloud of dots directly from a Gaussian distribution centred on St. By contrast, in the model we introduced a hidden variable Ut that is sampled from a Gaussian centred on St. The visual cloud of dots is then centred on this hidden variable Ut. We introduced this additional hidden variable Ut to account for observers’ additional causal uncertainty in natural environments, in which even signals from a common source may not fully coincide in space. Critically, the dispersion of the cloud of dots is set to be equal to the STD of the distribution from which Ut is sampled, so that the cloud’s STD informs observers about the variance of the hidden variable Ut.

Inference by the experimenter

Request a detailed protocol

From the observer’s viewpoint, this completes the inference process. However, from the experimenter’s viewpoint, the internal variable for the auditory stimulus, At, is unknown and not directly under the experimenter’s control. To integrate out this unknown variable, we generated 1000 samples of the internal auditory value for each trial from the generative process At ~ N(SA,t,true, σA2), where SA,t,true was the true location the auditory stimulus came from. For each value of At, we obtained a single estimate S^A,t (as described above). To link these estimates with observers’ button response data, we assumed that subjects push the button associated with the position closest to S^A,t. In this way, we obtained a histogram of responses for each subject and trial which provide the likelihood of the model parameters given a subject’s responses: P(respt|κ,σA,Pcommon,SA,t,true,SV,t,true).

Model estimation and comparison

Request a detailed protocol

Parameters for each model (for all models: σA, Pcommon = P(C = 1), σ0, Bayesian learner: κ, exponential learner: γ) were fit for each individual subject by sampling using a symmetric proposal Metropolis-Hasting (MH) algorithm (with At integrated out via sampling, see above). The MH algorithm iteratively draws samples setn from a probability distribution through a variant of rejection sampling: if the likelihood of the parameter set is larger than the previous set, the new set is accepted, otherwise it is accepted with probability L(model|setn)/L(model|setn-1), where L(resp|setn) = tP(respt|κ,σA,Pcommon,SA,t,true,SV,t,true) (for Bayesian learner). We sampled 4000 steps from four sampling chains with thinning (only using every fourth sampling to avoid correlations in samples), giving a total of 4000 samples per subject data sets. Convergence was assessed through scale reduction (using criterion R < 1.1 [Gelman et al., 2013]). Using sampling does not just provide a single parameter estimate for a data set (as when fitting maximum likelihood), but can instead be used to assess the uncertainty in estimation for the data set. The model code was implemented in Matlab (Mathworks, MA) and ran on two dual Xeon workstations. Each sample step, per subject data set, took 30 s on a single core (~42 hr per sampling chain).

Quantitative Bayesian model comparison of the three candidate models was based on the Watanabe-Akaike Information Criterion (WAIC) as an approximation to the out of sample expectation (Gelman et al., 2013). At the fixed-effects level, Bayesian model comparison was performed by summing the WAIC over all participants within each experiment. For a random-effects analysis, we transformed the WAIC into log-likelihoods by dividing them by minus 2. We then computed the protected exceedance probability that one model is better than the other model beyond chance using hierarchical Bayesian model selection (Penny et al., 2010; Rigoux et al., 2014).

To qualitatively compare the localization responses given by the participants and the responses predicted by the instantaneous, Bayesian and exponential learner, we computed the auditory weight wA from the predicted responses of the three models exactly as in the analysis for the behavioral data. For illustration, we show and compare the model’s wA from the 1st and the flipped 2nd half of the periods for each of the four experiments (Figure 3, Figure 4, Figure 5B/C and Figure 5—figure supplement 1).

Parameter recovery

Request a detailed protocol

To test the validity of the models, we performed parameter recovery and were able to recover the generating values with a bias of all parameters smaller than 10% (for full details of bias and variance across parameters, see Appendix 1 and Supplementary file 1-Table 7).

Simulated localization responses

Request a detailed protocol

To further compare the Bayesian and exponential learner and assess whether they can be discriminated experimentally, we simulated the choices of 12 subjects for the continuous sinusoidal and sinusoidal jump sequence using the Bayesian learner model (parameters: σA = 6°, κ = 15, Pcommon = 0.7 and σ0 = 12°). To increase observers’ uncertainty about their visual reliability estimates, we reduced the number of dots in the visual clouds from 20 to 5 dots where we ensured that the mean and variance of the five dots corresponded to the experimentally defined visual mean and variance. We then fitted the Bayesian learner and exponential learner models to each simulated data set (using the BADS toolbox for likelihood maximization [Acerbi and Ma, 2017]). The fitted parameters for the Bayesian model, setBayes were very close to the parameters used to generate observers’ simulated responses (sinusoidal sequence, fitted parameters: σA = 6.11°, κ = 17.5, Pcommon = 0.72 and σ0 = 12.4°; sinusoidal jump sequence, fitted parameters: σA = 6.08°, κ = 17.3, Pcommon = 0.71 and σ0 = 12.2°) – thereby providing a simple version of parameter recovery. The parameters of the exponential model, setExp (fitted to observers’ responses generated from the Bayesian model) were very similar to those of the Bayesian learner (sinusoidal sequence: σA = 5.99°, γ = 0.70, Pcommon = 0.61 and σ0 = 12.0°, sinusoidal jump sequence: σA = 6.06°, γ = 0.70, Pcommon = 0.65 and σ0 = 12.0°). Moreover, the fits to the simulated observers’ responses were very close for the two models (Figure 6), with mean log likelihood difference (log(L(resp|setBayes)) – log(L(resp|setExp))) = 1.82 for the sinusoidal and 2.74 for the sinusoidal jump sequence (implying a slightly better fit for the Bayesian learner). Figure 6C and D show the timecourses of observers’ visual uncertainty (STD) as estimated by the Bayesian and exponential learners.

Appendix 1

Additional methods and results

Influence of the visual location of the previous trial on observers’ sound localization responses

We have performed a control regression analysis to assess the influence of the visual location of the previous trial on observers’ sound localization response. This is important because 200 ms prior to trial onset and sound presentation, observers were presented with a visual cloud whose mean was the same as for the previous trial and the cloud’s standard deviation (STD) varied according to a continuous or discontinuous sequence (see main paper). To quantify the influence of the previous visual location, we expanded our regression model that we used in the main paper by another regressor modeling the visual cloud’s location on the previous trial. For instance, for bin = 1, we computed:

RA,trial,bin=1=LA,trial,bin=1 ßA,bin=1 +LV,trial,bin=1 ßV,bin=1 +LV,trial1,bin=1 ßVprevious,bin=1 + ßconst,bin=1 +etrial,bin=1

with RA,trial, bin=1 = localization response for current trial that is assigned to bin 1; LA,trial,bin=1 or LV,trial,bin=1= ‘true’ auditory or visual location for current trial that is assigned to bin 1; LV,trial-1,bin=1 ‘true’ visual location for corresponding previous trial (for explanatory purposes, we assign here the bin of the current trial; the previous trial actually falls into a different bin); ßA,bin=1 or ßV,bin=1 = quantified the influence of the auditory and visual location of the current trial on the perceived sound location of the current trial for bin 1; ßVprevious,bin=1 quantified the influence of the visual location of the previous trial on the perceived sound location of the current trial for bin 1. ßconst,bin=1 = constant term; etrial,bin=1 = error term. For each bin, we thus obtained another visual weight estimate ßVprevious,bin for the previous location.

First, we averaged ßVprevious,bin across bins and entered these participant-specific bin-averaged ßVprevious into two-sided one-sample t-tests at the between-subject random effects level. Results: As shown in Supplementary file 1-Table 2A, this analysis demonstrated that the visual location of the previous trial significantly influenced observers’ perceived sound location on the current trial in the Sinusoidal, RW1 and (marginally) the Sinusoidal jump sequence.

Second, we computed the correlation of ßVprevious,bin with the visual noise in the current trials averaged in a given bin (r(ßVprevious,bin, σVcurrent,bin)) or with the visual noise (i.e. visual cloud’s STD) in the previous trial averaged within a given bin (r(ßVprevious,bin, σVprevious,bin)). The correlations were computed over bins within each participant. We entered the participant-specific Fisher z-transformed correlation coefficients into two-sided one-sample t-tests at the between-subject random-effects level (see Supplementary file 1-Table 2, section B and C). Results: These analyses demonstrated that the influence of the visual location on the previous trial was not correlated with the visual cloud’s STD on the current or previous trial (apart from one significant p-value for the sinusoidal jump sequence, but after Bonferroni correction for the eight additional statistical comparisons this p-value is no longer statistically significant either). This analysis already suggests that the previous visual location is unlikely responsible for the effects of the previous STD on observers’ perceived sound location.

Third and most importantly, this regression model provides us with weights for the auditory (ßA,bin) and visual (ßV,bin) locations in the current trial from a regression model that regressed out the influence of the previous visual location. We used those auditory (ßA,bin) and visual (ßV,bin) weights as in the main paper to compute bin-specific wA,bin = ßA,bin / (ßA,bin + ßV,bin). Following exactly the same procedures as in the main paper, we then assessed in a repeated-measures ANOVA whether these wA,bin differed between first and second half (see Supplementary file 1-Table 3). Moreover, we repeated a second regression model analysis to assess whether wA,bin was predicted not only by the visual cloud’s STD of the current, but also of the previous bin using the following regression model (i.e. Equation 2 in the main text): wA,bin = σV,bin * ßσV + (σV,bin – σV,bin-1)* ßΔσV + ßconst + ebin with wA,bin = relative auditory weight in bin b; σV,bin = mean visual STD in current bin b or previous bin b-1; ßconst = constant term; ebin = error term. To allow for generalization to the population level, the parameter estimates (ßσV, ßΔσV) for each subject were entered into two-sided one-sample t-tests at the between-subject random-effects level (see Supplementary file 1-Table 4). Results: These control analyses (see Figure 2—figure supplement 1, Supplementary file 1-Table 4) replicate our initial analyses reported in the main manuscript. Collectively, they provide further evidence that the effect of previous visual location on observers’ perceived sound location cannot explain the effect of prior visual reliability that are the key focus of our paper.

Nested model comparison to assess the effect of past visual noise on observers’ auditory weights

To assess the effect of past visual noise on auditory weights (Equation 2 in the main text), we also formally compared two nested linear mixed-effects models to predicts observers’ relative auditory weights wA,bin separately for the Sin, RW1, and RW2 sequences. The reduced model included only the STD of the current bin as fixed effects. The full model included both the STD of the current bin and the difference in STD between the current and the previous bin as fixed effects. Both the reduced and the full model included participants as random effects. After fitting the two models using maximum likelihood estimation, we compared them using loglikelihood ratio tests and the Bayesian Inference Criterion as an approximation to the model evidence (see Supplementary file 1-Table 5). The model comparison demonstrated that the full model including the difference in STD provided a better explanation of observers relative auditory weights wA,bin across all four sequences. This corroborates that observers estimate sensory uncertainty by combining information from past and current sensory inputs.

Characterization of observer’s behavior and the model predictions before and after the jumps in visual reliability

One critical question in the discontinuous sinusoidal jump sequence is whether observers continue combining past with current inputs to adapt their visual uncertainty estimates. One may argue that observers detect the discontinuity, ‘reset’ the estimation of visual uncertainty after the jump and therefore do not integrate information from before the jump. In that case Bayesian learning models would not be ideal for modeling observers’ behavior in the jump sequence and better accommodated by Bayesian change-point detection model (Adams and Mackay, 2007; Heilbron and Meyniel, 2019).

First, we inserted the jumps selectively in the period sections in which the visual variance was greater, so that changes in visual variance were more difficult to detect. This experimental choice minimized the chances that observers ‘reset’ their estimation of visual uncertainty.

Second, we assessed observers’ estimation strategy at a greater temporal resolution before and after the jumps. To estimate the relative auditory weight wA at a greater resolution, we applied separate regression models to individual sampling points of the visual cloud of dots presented every 200 ms (i.e. no binning). Thus, wA was computed at 5 Hz resolution before and after the jumps (i.e. at time points [−1.9:0.2:1.9] s; Figure 5—figure supplement 2). Because the number of trials was very low on individual sampling points, we pooled the trials across the three up- and down-jumps before computing the regression models. Nevertheless, the small number of trials on individual sampling points (range 7–28 trials across participants and bins) rendered the estimation of the relative auditory weights very unreliable. Thus, individual wA values that were smaller or larger than three times the scaled median absolute deviation were excluded from the analyses in Figure 5—figure supplement 2 and Supplementary file 1-Table 6. To assess statistically whether participants’ adjusted wA after the jump, we computed a paired t test on wA specifically from the time point before versus after the jump (i.e. −0.1 vs. 0.1 s; Supplementary file 1-Table 6).

Third, we assessed the model fit of our learning models before and after the jumps. If the jumps violate the assumptions of the learning models, we would expect that observers’ behavior deviates from the model’s predictions more strongly after the jump. We computed the root mean squared error of the models’ wA (i.e. (wA,behavior – wA,model)2) before and after the change point (Figure 5—figure supplement 2B) and entered those into a paired t test (i.e. −0.1 vs. 0.1 s; Supplementary file 1-Table 6).

Results: Our data showed that participants and models rapidly and significantly adjusted their weights after the jumps. Critically, the model fits did not significantly differ for the time points just before or after the jumps in visual variance (i.e. if anything, they significantly decreased after the jump; Figure 5—figure supplement 2B and Supplementary file 1-Table 6). Collectively, these control analyses suggest that our Bayesian and exponential learning models adequately modeled observers’ visual uncertainty adaptation both before and after change points (Norton et al., 2019).

Parameter recovery

To test the validity of the Bayesian, exponential, and instantaneous models, we performed parameter recovery by assessing the bias and variability of the parameters fitted to simulated data sets with respect to the true parameters used to generate the data.

For each model, we selected four different parameter sets (within a realistic range of values for parameters σΑ = [6:12]°, Pcommon = [0.7:0.9], σ0 = [6:20]°, κ = [5:20], γ = [0.3:0.7]) and generated data sets of simulated observers for the RW2 sequence. We repeated this process six times (with different initial random seeds), creating a total of 24 simulated data sets for each model. We then fitted the Bayesian, exponential, and instantaneous learner models to each simulated data set (using exactly the same fitting procedures as for observers’ data in the experiments) resulting in 24 sets of best fitting parameters for each model.

In order to assess how well the fitting procedure recovers the generating parameters, we compared the fitted parameters to the ‘true’ parameters used to generate the data. Specifically, we assessed the parameter recovery in terms of bias and variability of the fitted parameters as follows: The recovered parameters’ bias was computed as the signed deviation from the true generating value in percentage. As an example, if a data set was generated with a model parameter of 5, but the fitted (i.e. recovered) parameter was 4, we would compute a −20% deviation. As a measure of the variability for the recovered parameters, we calculated the absolute (i.e. unsigned) deviation from the true generating values in percentage. As an example, a fitted value of 4, relative to a generating value of 5 would be a 20% absolute deviation. We report the median (and first and third quartile) across simulated data sets as a robust measure for this bias and variability.

Appendix 2

This document describes a Variational Bayes approximation to inference on a generative model that allows for two possible ways that stimuli data was generated (thus allowing subjects to perform causal inference).

Section (1) describes the full generative model for both a single and two sources, section (2) explains how an optimal observer can perform inference within either sub-model, through a variational Bayes approximation to the posteriors and section (3) shows how to calculate the model likelihood for either sub-model, as necessary for combining the two sub-models.

Section (4) finally describes how the results for each causal model are combined into a single posterior.

1 Generative model

The model presented here is an extension of the Causal Inference model of Körding et al., 2007, with the reliability of the visual signal assumed to be changing smoothly over trials according to a random walk (RW). In the case where the visual reliability is constant the model approximates the original Causal Inference model.

In this model (figure Appendix 2—figure 1), at each stimulus presentation, t, subjects assume that the visual dots at positions Vi,t and auditory stimulus at At, are generated through either of two causal models (Ct=1 or Ct=2) with fixed prior probabilities:

(1) CtBernoulli(pcommon)

If Ct=1, (single source, St, leading to forced fusion)

(2)St𝒩(St;μ0,σ02)(3)At𝒩(At;St,σA2)(4)Ut𝒩(Ut;St,1/λV,t)(5)Vi,t𝒩(Vi,t;Ut,1/λV,t)(6)logλV,t𝒩(log(λV,t);log(λV,t1),k)

If Ct=2 (independent sources, SA,t and SV,t )

(7)SA,t𝒩(SA,t;μ0,σ02)(8)At𝒩(At;SA,t,σA2)(9)SV,t𝒩(SV,t;μ0,σ02)(10)Ut𝒩(Ut;SV,t,1/λV,t)(11)Vi,t𝒩(Vi,t;Ut,1/λV,t)(12)logλV,t𝒩(log(λV,t);log(λV,t1),k)

The intermediate variable Ut means that the mean of the visual dots is not located at the true source (St or SV,t), but normally distributed around it. Note that for Ct=1 we can explicitly write SA,t=SV,t=St.

For simplicity, we assume that μ0=0, that is, the prior mean is located at the horizontal center. The auditory STD σA, the prior probability of a single cause, Pcommon and the prior STD, σ0 , are fixed individually for each subject (see main text for fitting procedure).

In the following, we will simplify the notation by referring to P(*|Ct=1) by P1(*) and P(*|Ct=2) by P2(*).

Appendix 2—figure 1
Generative model, for one (C = 1) or two sources (C = 2).

2 Posterior

The full posterior over the latent variables in the model up until time t is

(13) P(SV,1:t,SA,1:t,,λ1:t,U1:t,C1:t|A1:t,V1:N,1:t)=P(SV,1,SA,1,λV,1,U1,C1|A1,V1:N,1)j=1tP(SV,j,SV,j,λV,j,Uj,Cj|A1:j,V1:N,1:j,λV,j-1)

Recursively we can write

(14) P(SV,t,SA,t,λV,t,Ut,Ct|A1:t,V1:N,1:t)=P(SV,t,SA,t,λV,t,Ut,Ct|A1:t,V1:N,1:t,λV,t-1)P(SV,t-1,SA,t-1,Ut-1,Ct,λV,1:t-1|A1:t-1,V1:N,1:t-1)dSV,t-1dSA,tdUt-1dλV,1:t-1
(15) P(SV,t,SA,t,λV,t,Ut,Ct|A1:t,V1:N,1:t)P(Ct)P(SA,t,SV,t)P(At|SA,t)P(V1:N,t|Ut,λV,t)P(Ut|SV,t,λV,t)P(λV,t|λV,t1)P(λV,1:t1|A1:t1,V1:N,1:t1)dλV,1:t1/Z

where P(SA,t,SV,t) obviously depends on Ct through the generative model (Appendix 2—figure 1).

If we marginalize over the latent Ct:

(16) P(SV,t,SA,t,λV,t,Ut|A1:t,V1:N,1:t)=P(SV,t,SA,t,λV,t,Ut|A1:t,V1:N,1:t,Ct=1)P(Ct=1|A1:t,V1:N,1:t)+P(SV,t,SA,t,λV,t,Ut|A1:t,V1:N,1:t,Ct=2)P(Ct=2|A1:t,V1:N,1:t)

At this point it should be clear that the posterior is a mixture of the forced fusion and independent solutions, with the mixture determined by the posterior probability of either model generating the data:

(17) P(Ct=1|A1:t,V1:N,1:t)=P(At,V1:N,t|Ct=1)P(Ct=1)P(At,V1:N,t|Ct=1)P(Ct=1)+P(At,V1:N,t|Ct=2)P(Ct=2)

To evaluate this, we need to calculate the marginal model evidence, P(At,V1:N,t|Ct), for either model, see the later section.

2.1 Posterior for C = 1

The full posterior over the latent variables in the single source sub-model is

(18) P1(S1:t,λ1:t,U1:t|A1:t,V1:N,1:t)=P(S1,λV,1,U1|A1,V1:N,1:t)i=1tP(Si,λV,i,Ui|Ai,Vi,λV,i-1)

Recursively we can write

(19) P1(St,λV,t,Ut|A1:t,V1:N,1:t)=P1(St,λV,t,Ut|A1:t,V1:N,1:t,λV,t-1)P1(St-1,λV,1:t-1,Ut-1|A1:t-1,V1:N,1:t-1)dSt-1dUt-1dλV,1:t-1
(20) P1(St,λV,t,Ut|A1:t,V1:N,1:t)=P1(St,λV,t,Ut|A1:t,V1:N,1:t,λV,t-1)P(λV,1:t-1|A1:t,V1:N,1:t-1)dλV,1:t-1
(21) P1(St,λV,t,Ut|A1:t,V1:N,1:t)P(St)P(At|St)P(Ut|St,λV,t)P(V1:N,t|Ut,λV,t)P(λV,t|λV,t-1)P(λV,1:t-1|A1:t-1,V1:N,1:t-1)𝑑λV,1:t-1/Z

As we will see later it is convenient to use a change of parameters

(22) θt=log(λV,t)

allowing us to rewrite

(23) P1(St,θt,Ut|A1:t,V1:N,1:t)P(St)P(At|St)P(Ut|St,θt)P(V1:N,t|Ut,θt)P(θt|θV,t-1)P(θV,1:t-1|A1:t-1,V1:N,1:t-1)𝑑θV,1:t-1/Z

where

(24) P(Ut|St,θt)=𝒩(Ut;St,1/exp(θt))
(25) P(V1:N,t|Ut,θt)=n𝒩(Ut;Vn,t,1/exp(θt))
(26) P(θt|θV,t-1)=𝒩(θt;θV,t-1,1/κ)

We will assume that

(27) P(θV,1:t-1|A1:t-1,V1:N,1:t-1)

can be approximated by a Normal distribution (see below), thus allowing us to write

(28) P(θt|θV,t-1)P(θV,1:t-1|A1:t-1,V1:N,1:t-1)𝑑θV,1:t-1=𝒩(θt|θV,t-1,1/κt)

where 1/κt=1/κ+1/τθ,t-1 (due to properties of convolution of two Normal distributions).

The log-posterior (to be used for a variational approximation) is now

(29) logP1(St,λV,t,Ut|A1:t,V1:N,1:t)-λSSt2/2-λA(At-St)2/2-λV,t(Ut-St)2/2+logλV,t/2-λV,tiN(Ut-Vi,t)2/2+(N/2)logλV,t-κ(θt-θV,t-1)2/2

2.2 Variational Bayes approximation for C = 1

We will now approximate the log-posterior with variational Bayes by factorization:

(30) P1(St,θt,Ut|At,V1:N,t)q1(St,Ut,θt)=q1(St)q1(Ut)q1(θt)
2.2.1 q1(St)

For q1(St)

(31) logq1(St)-λSSt2/2-λA(At-St)2/2-Eθ(exp(θt))EU((Ut-St)2)/2

where EY(X) signifies the expectation of X over the distribution of Y: EY(X)=P(Y)X𝑑Y.

Using EU((UtSt)2)=EU(Ut2+St22UtSt)=EU(Ut2)+St22StEU(Ut) = (St2StEU(Ut)2+EU(Ut)2)EU(Ut)2+EU(Ut2)=(StEU(Ut))2EU(Ut)2+EU(Ut2), where the last two terms do not depend on St (and thus can be discarded) we can rewrite the last term:

(32) logq1(St)-λSSt2/2-λA(At-St)2/2-Eθ(exp(θt))(St-EU(Ut))2/2
2.2.2 q1(Ut)

For q1(Ut)

(33) logq1(Ut)-Eθ(exp(θt))/2i(Ut-Vi,t)2-Eθ(exp(θt))ES((Ut-St)2)/2

Here, we use the same trick

(34) logq1(Ut)Eθt(exp(θt))/2i(UtVi,t)2Eθ(exp(θt))(UtES(St))2)/2
2.2.3 q1(θ)

For q1(θ)

(35) logq1(θ)=-expθtEU,S((Ut-St)2)/2-expθtEU(iN(Ut-Vi,t)2)/2+(N+1)log(exp(θ))/2-κ(θt-θt-1)2/2
2.2.4 Simplifying q1(St) and logq1(Ut)

Inspecting logq1(St) and logq1(Ut) we can see that both q1(St) and q1(Ut) are products of Normal distributions, and thus themselves Normal distributed

(36) q1(St)𝒩(St;μS,t,1/τS,t)

and

(37) q1(Ut)𝒩(Ut;μU,t,1/τU,t)

where

(38) μS,t=(λS*0+λAAt+E(expθt)μU,t)/τS,t
(39) τS,t=λS+λA+E(exp(θt))
(40) μU,t=(Eθ(exp(θt))iNVi,t+Eθ(exp(θt))μS,t)/τU,t
(41) τU,t=(N+1)*Eθ(exp(θt))

Note that ES(St)q1(St)St𝑑St=μS,t and EU(Ut)q1(Ut)Ut𝑑Ut=μU,t

2.2.5 Simplifying q1(θt)

Regarding q1(θt) we can expand a little using that

(42) EU,S((Ut-St)2)=EU,S((Ut2+St2-2StUt))=EU,S(Ut2)+E(St2)-2E(St)E(Ut)=μU2+1/τU+μS2+1/τS-2μSμU=(μU-μS)2+1/τU+1/τS

(using that E(X2)=μ2+1/τ for a normal distribution 𝒩(X;μ,1/τ)) and

(43) EU(iN(Ut-Vi,t)2)=EU(iN(Ut2+Vi,t2-2UtVi,t))=EU(iN(Ut2)+iN(Vi,t2)-2iN(UtVi,t)=EU(N*(Ut2)+iN(Vi,t2)-2UtiNVi,t)=N*(μU2+1/τU)+iN(Vi,t2)-2μUiNVi,t)=N/τU+iN(Vi,t-μU)2

Which together gives:

(44) EU,S((Ut-St)2)+EU(iN(Ut-Vi,t)2)=(μU-μS)2+1/τU+1/τS+(N+1)/τU+iN(Vi,t-μU)2

2.3 Approximating q(θ)

We will approximate q1(θ) with a Normal distribution.

To do this, we use a Laplace approximation around the max of q1(θ) (see figure Appendix 2—figure 2): argmax(q1(θt))=μθ,t and with second derivative -τθ,t

This gives

(45) q1(θt)𝒩(θt|μθ,t,1/τθ,t)
2.3.1 First derivative

However, in order to find argmax(q1(θt)), we differentiate logq1(θt) and set equal to 0:

(46) dlogq1(θt)/dθt=expθtEU,S((UtSt)2)/2expθtEU(iN(UtVi,t)2)/2+(N+1)/2κ(θtθt1)=0
(47) expθt(EU,S((UtSt)2)/2+EU(iN(UtVi,t)2)/2)=(N+1)/2κ(θtθt1)

At this point there is no analytical solution.

2.3.2 Taylor expansion of first derivative

Although we could use a numerical approximation for speed of implementation, we use Taylor expansion. We need to solve for θ

(48) expθt(EU,S((Ut-St)2)/2+EU(iN(Ut-Vi,t)2)/2)-(N+1)/2+κ(θt-θt-1)=0

For simplicity we refer to SV=(EU,S((Ut-St)2)+EU(iN(Ut-Vi,t)2))

We can solve this by using the third-order Taylor expansion of the exponential:

(49) expθte(θ)+e(θ)(θtθ)+e(θ)(θtθ)2/2+e(θ)(θtθ)3/6=e(θ)(1θ+θ2/2θ3/6+(12/2θ+3/6θ2)θt+(1/23/6θ)θt2+θt3/6)=e(θ)(1θ+θ2/2θ3/6+(1θ+θ2/2)θt+(1/2θ/2)θt2+θt3/6)

We can set θ* as θt-1, as we expect that θt will be close to θ*. For big changes in the variance of Vi this can be off; however, this was not a problem in this stimulus set which relied on slow gradual changes.

In order to find argmax(q1(θt)) we therefore have to solve

(50) SV/2*e(θ*)(1-θ*+θ*2/2-θ*3/6+(1-θ*+θ*2/2)θt+(1/2-θ*/2)θt2+θt3/6)-(N+1)/2+κ(θt-θt-1)=0

which can be rewritten as a third order polynomial

(51) θt3+c1θt2+c2θt+c3=0

where

(52) c1=1/2-θ*/2c2=1-θ*+θ*2+κ/(SV/2*e(θ*))c3=1-θ*+θ*2/2+θ*3/6-(N/2+1/2+κθt-1)/(SV/2*e(θ*))

the solution to which, θtoptim, can be numerically found using Matlab’s nthroot function.

As we assume that the log-variance only changes slightly between trials the solution closest to the previous value θt-1 is automatically chosen, argmax(q1(θt))=μθt=θtoptim.

2.3.3 Second derivative

The second derivative is

(53) d2logq1(θt)/dθt2=-expθt*SV/2-κ
Appendix 2—figure 2
Approximation of theta using Laplace approximation.

We evaluate this at argmax(q1(θt)), so we insert θt=μθ,t

Hence we can finally write

(54) q1(θt)𝒩(θt|μθ,t,1/τtheta,t)

where

(55) μθ,t=θtoptim
(56) τθ,t=exp(μθ,t)*(SV)/2+κ

and where SV=(μU-μS)2+1/τS+(N+1)/τU+iN(Vi,t-μU)2

With q1(θt) a Normal distribution, that makes q1(λV,t) a log-normal distribution with μλV,t=E(λV,t)=E(exp(θt))=exp(μθt+1/(2*τθt)) (general property of log-normal distribution).

2.4 Final algorithm for C = 1

We can now create an iterative algorithm that for each time step t represents the model posterior. Variables σA2=1/λA, σ02=1/λ0 and κ have to be set before hand, together with the input data A1:t and V1:N,1:t. For time step t:

1. initially set

(57)μθ,t=μθ,t1,(58)μS,t=0,(59)μU,t=1/NVi,t,(60)τθ,t=1

2. set μS,τS,t

(61) μS,t=(λAAt+exp(μθt+1/(2*τθt))μU,t)/τS,t
(62) τS,t=λS+λA+exp(μθt+1/(2*τθt))

3. set μU,t,τU,t

(63) μU,t=(N/(N+1))Vt¯+(1/(N+1))μS,t
(64) τU,t=(N+1)*exp(μθt+1/(2*τθt))

where Vt¯=1/NiNVi,t

4. find μθt by solving third order polynomial, Equation 51,

(65) μθ,t=θoptim

then set τθt

(66) τθ,t=k+exp(μθ,t)*((μU,t-μS,t)2+1/τS,t+(N+1)/τU,t+iN(Vi,t-μU,t)2)/2

where κt=1/(1/κ+1/τθ,t-1)

5. Repeat steps 2–4 until the change in each parameter is small (<0.0001)

This is then repeated for each time step t, providing us with the approximation to the posterior P1(St,θt,Ut|At,V1:N,t)q1(St,Ut,θt))=q1(St)q1(Ut)q1(θt).

2.5 Posterior for C = 2

Due to the independent structure this posterior can be written as

(67) P2(SA,t,SV,t,λV,t,Ut|At,V1:N,t)=P(SA,t|At)P(SV,t,λV,t,Ut|V1:N,t)

where

(68) P2(SA,t|At)=P(SA,t)P(At|SA,t)/Z

which is simple enough given the Normal distribution of both P(SA,t) and P(At|SA,t)

(69) P2(SA,t|At)=𝒩(SA,t;AtσA02/σA2,σA02)

where σA02=1/(1/σA2+1/σ02).

Note that for the subject response the posterior P2(SA,t|At) is all that is needed, but for the calculation of the prior P(λV,t) for subsequent trial t+1 we need to compute the full posterior.

We again use the transformation of parameters

(70) θt=log(λV,t)

Proceeding with just the posterior over SV,t, Ut and θt

(71) PC=2(SV,t,θt,Ut|A1:t,V1:N,1:t)P(SV,t)P(Ut|SV,t,θt)P(V1:N,t|Ut,θt)P(θt|θV,t-1)P(θV,1:t-1|A1:t-1,V1:N,1:t-1)𝑑θV,1:t-1/Z

where

(72) P(Ut|SV,t,θt)= 𝒩(Ut;SV,t,1/exp(θt))
(73) P(V1:N,t|Ut,θt)=n𝒩(Ut;Vn,t,1/exp(θt))
(74) P(θt|θt1)= 𝒩(θt;θt1,1/κ)

We will assume that P(θV,1:t-1|A1:t-1,V1:N,1:t-1) can be approximated by a Normal distribution (see q1(θ) below), thus allowing us to use properties of Normal distributions.

Hence,

(75) P(θt|θt1)P(θV,1:t1|A1:t1,V1:N,1:t1)dθV,1:t1= 𝒩(θt;θt1,1/κt)

where 1/κt=1/κ+1/τθ,t-1 (due to the convolution of P(θt|θt-1) with P(θt-1|A1:t-1,V1:N,1:t-1), both Normal distributed).

While any estimate of θt will depend on A1:t-1 and V1:N,1:t1) for ease of notation, we will omit those in the following.

The log-posterior is now

(76) logP2(SV,t,θt,Ut|A1:t,V1:N,1:t)-λ0SV,t2/2-expθt(Ut-SV,t)2/2+θt/2-expθtiN(Ut-Vi,t)2/2+Nθt/2-κ(θt-θV,t-1)2/2

2.6 Variational Bayes approximation for C = 2

We will now approximate the log-posterior with variational Bayes by factorization: P2(St,θt,Ut|At,V1:N,t)q2(St,Ut,θt))=q2(St)q2(Ut)q2(θt) This proceeds similarly to the combined (C=1) model, but with SV,t instead of St, and with no influence from At. For completeness the calculations are included here:

2.6.1 q2(St)

For q2(St)

(77) logq2(SV,t)-λ0SV,t2/2-Eθ(exp(θt))EU((Ut-SV,t)2)/2

where EY(X) signifies the expectation of X over the distribution of Y: EY(X)=P(Y)X𝑑Y.

Using EU((UtV,t)2)=EU(Ut2+SV,t22UtSV,t)=EU(Ut2)+SV,t22SV,tEU(Ut) = (SV,t2SV,tEU(Ut)2+EU(Ut)2)EU(Ut)2+EU(Ut2)=(SV,tEU(Ut))2EU(Ut)2+EU(Ut2), where the last two terms do not depend on SV,t (and thus can be discarded) we can rewrite the last term:

(78) logq2(SV,t)-λ0SV,t2/2-Eθ(exp(θt))(SV,t-EU(Ut))2/2
2.6.2 q2(Ut)

For q2(Ut)

(79) logq2(Ut)-Eθ(exp(θt))/2i(Ut-Vi,t)2-Eθ(exp(θt))ES((Ut-SV,t)2)/2

Here, we use the same trick

(80) logq2(Ut)Eθt(exp(θt))/2i(UtVi,t)2Eθ(exp(θt))(UtES(SV,t))2)/2
2.6.3 q2(θt)

For q2(θ)

(81) logq2(θ)=-expθtEU,S((Ut-SV,t)2)/2-expθtEU(iN(Ut-Vi,t)2)/2+(N+1)log(exp(θ))/2-κ(θt-θt-1)2/2
2.6.4 Simplifying q2(St) and logq2(Ut)

Inspecting logq2(SV,t) and logq2(Ut) we can see that both q2(SV,t) and q2(Ut) are products of Normal distributions, and thus themselves Normal distributed

(82) q2(SV,t) 𝒩(SV,t|μSV,t,1/τSV,t)

and

(83) q2(Ut)𝒩(Ut|μU,t,1/τU,t)

where

(84) μSV,t=(λ0*0+E(exp(θt))μU,t)/τSV,t
(85) τSV,t=λ0+E(exp(θt))
(86) μU,t=(Eθ(exp(θt))iNVi,t+Eθ(exp(θt))μSV,t)/τU,t
(87) τU,t=(N+1)*Eθ(exp(θt))

Note that ES(SV,t)q2(SV,t)SV,t𝑑SV,t=μSV,t and EU(Ut)q2(Ut)Ut𝑑Ut=μU,t

We can approximate q2(θ) with a Normal distribution, in exactly the same way as for C = 1. As equations are identical (see above) they will not be repeated here.

2.7 Final algorithm for C = 2

We can now create an iterative algorithm that for each time step t represents the variational Bayes approximation of the model posterior over SV,t,Ut and λV,t (or rather θt):, P(SV,t,Ut,θt|V1:N,t). Variable κ has to be set before hand, together with the input data A1:t and V1:N,1:t. For time step t:

1. initially set μθ,t=μθ,t-1, μU,t=1/NVi,t, and τθ,t=1

2. set μSV,τSV,t

(88) μSV,t=(exp(μθt+1/(2*τθ,t))μU,t)/τSV,t
(89) τSV,t=λ0+exp(μθt+1/(2*τθ,t))

3. set μU,τU,t

(90) μU,t=(NVt¯+μSV,t)/(N+1)
(91) τU,t=(N+1)*exp(μθt+1/(2*τθ,t))

where Vt¯=1/NiNVi,t

4. find μθ,t by numerically solving polynomial, Equation 51,

(92) μθ,t=θoptim

then set τθ,t

(93) τθ,t=k+exp(μθ,t)*((μU,t-μSV,t)2+1/τSV,t+(N+1)/τU,t+iN(Vi,t-μU,t)2)/2

where κt=1/(1/κ+1/τθ,t-1)

5. Repeat steps 2–4 until convergence, that is, until the change in each parameter is small (<0.0001)

This is then repeated for each time step t, providing us with the approximation to the posterior P2(St,θt,Ut|At,V1:N,t)q2(St,Ut,θt))=q2(St)q2(Ut)q2(θt).

See figure Appendix 2—figure 3 below for an example of the learned inference of the visual variance σV,t2=1/λV,t1/logq2(θt), compared with a simple instantaneous learner model that assumes σV,t2=1/Ni(Vi,t-V¯t)2.

Appendix 2—figure 3
Comparing variational Bayes approximation with a numerical discretised grid approximation.

Top row: Example visual stimuli over eight subsequent trials. Middle row: The distribution of estimated sample variance, with no learning over trials. Bottom row: The distribution of _V;t for the Bayesian model that incorporates the learning across trials. Red line is the numerical comparison when using a discretised grid to estimate variance, as opposed to the variational Bayes (green line).

3 Marginal model evidence

Recall that the posterior is a mixture of the forced fusion and independent solutions, with the mixture determined by the posterior probability of either model generating the data:

(94) P(Ct=1|At,V1:N,t)=P(At,V1:N,t|Ct=1)P(Ct=1)P(At,V1:N,t|Ct=1)P(Ct=1)+P(At,V1:N,t|Ct=2)P(Ct=2)

To evaluate this, we need to calculate the marginal model evidence, P(At,V1:N,t|Ct), for either model.

One way to do so is by a sampling approximation, but here we utilise the variational results we have already found.

3.1 Model likelihood for C = 2, two sources SV,t,SA,t

We need to evaluate the model likelihood for both C=1 and C=2. The case for C=2 is slightly simpler, hence we start with this:

(95) P(At,V1:N,t|Ct=2)=P(At|Ct)P(V1:N,t|Ct=2)=P(At|SA,t,Ct=2)P(SA,t|Ct=2)dSA,t*P(V1:N,t|Ut,λt,Ct=2)P(Ut,λt,SV,t|Ct=2)dVdλdSV,t

The first integral is easy as it is just the integral of the product of two Normal distributions.

(96) P2(At|SA,t)P2(SA,t)dS=1(2π(σA2+1/τ0)exp((Aμ0)22(σA2+1/τ0))

It is however more convenient to operate in log-space

(97) logP2(At)=-log(2π(σA2+1/τ0))-(A-μ0)2/(2(σA2+1/τ0))

The second integral we approximate through the Free Energy that we already maximise iteratively in the variational Bayes algorithm.

(98) logP(V1:N,t|Ct=2)=logP(V1:N,t|Ut,θt,SV,t,Ct=2)P(Ut,θt,SV,t|Ct=2)dUtdθtdSV,tL2(q)=q2(Ut,θt,SV,t)logP2(V1:N,t|Ut,θt,Ct)P2(Ut|,SV,t,θt,Ct)P2(SV,t|Ct)P2(θt|Ct)q2(Ut,θt,SV,t)dUtdθtdSV,t

(this approximation becomes exact if the variational approximation is exact, that is, if the Kulback-Leibler difference between the posterior P2(Ut,θt,SV,t|V1:N) and the approximation q2(Ut,θt,SV,t) becomes zero.)

This can be interpreted as taking the expectation with regard to the posterior approximation, and due to the properties of the logarithm this can be separated into a sum of expectations:

(99) L2(q)=E(logP2(V1:N,t|Ut,θt))+E(logP2(Ut|SV,t,θt)+E(logP2(SV,t))+E(logP2(θt))E(logq2(Ut))E(logq2(θt))E(logq2(SV,t))

where (due to Equation 43)

(100) E(logP2(V1:N,t|Ut,θt))=E(logiP2(Vi,t|Ut,θ))=E(logiexp(θt)2πexp(-(Vi,t-Ut)2exp(θt)/2)=E(ilog(exp(θt)2π)-(Vi,t-Ut)2exp(θt)/2)=N/2(Eθt(θ)-log(2π))+EUt(i-(Vi,t-Ut)2))E(exp(θt))/2=N/2(μθ-log(2π))-(N/τU,t+i(Vi,t-μU)2)exp(μθ,t+1/(2τθ,t))/2

and (since E(X2)=μX2+σX2)

(101) E(logP2(Ut)|SV,t,θt))=E(log 𝒩(Ut;SV,t,1/exp(θt)))=E(logexpθt2πexp((UtSV,t)2)exp(θt)/2))=(μθ,tlog(2π))/2[(μUμSV,t)2+1/τU,t+1/τSV,t]exp(μθ,t+1/(2τθ,t))/2

and

(102) E(logP2(SV,t))=E(log 𝒩(SV,t;μ0,σ02))=E(log1σ022πexp((SV,tμ0)2)/(2σ02))=log(σ022π)/2[(μSV,tμ0)2+1/τSV,t]/(2σ02)

and (due to Equation 75)

(103) E(logP2(θt))=E( 𝒩(θt;θt1,1/κ))=E(logκ2πexp((θtμθ,t1)2κ/2))=(logκlog(2π))/2[(μθ,tμθ,t1)2+1/τθ,t]κ/2

and

(104) E(logq2(Ut))=E(log 𝒩(Ut;μU,t,1/τU,t))=E(log(τU,t2πexp((UtμU,t)2τU,t/2)))=logτU,t2πE((UtμU,t)2τU,t/2)=(logτU,tlog(2π))/2(E(Ut2)+μU,t22μU,tE(Ut))τU,t/2=(logτU,tlog(2π))/2(μU,t2+1/τU,t2+μU,t22μU,tμU,t)τU,t/2=(logτU,tlog(2π)1)/2

and

(105) E(logq2(St))=E(log 𝒩(SV,t;μSV,t,1/τSV,t))=E(log12πσ02exp((SV,tμSV,t)2τSV,t/2))=(logτSV,tlog(2π)1)/2

and

(106) E(logq2(θt))=E(log 𝒩(θt;μθ,t,1/τθ,t))=(logτθ,tlog(2π)1)/2

In total we now have

(107) logP2(At,V1:N,t|Ct)logP2(At)+L2=(log(2π)+log(σA2+1/τ0))/2(Aμ0)2/(2(σA2+1/τ0))+(μθ,tlog(2π))N/2[N/τU,t+iN(Vi,tμU)2]exp(μθ,t+1/(2τθ,t))/2+((μθ,tlog(2π))/2[(μUμSV,t)2+1/τU,t+1/τSV,t]exp(μθ,t+1/(2τθ,t))/2log(σ022π)/2[(μSV,tμ0)2+1/τSV,t]/(2σ02)+(logκlog(2π))/2[(μθ,tμθ,t1)2+1/τθ,t]κ/2(logτU,tlog(2π)1)/2(logτSV,tlog(2π)1)/2(logτθ,tlog(2π)1)/2

Although lengthy, this is trivial and fast to compute numerically in Matlab (e.g.). Note that all estimates come from the variational Bayes approximation q2(St,Ut,θt).

3.2 Model likelihood for C = 1, one source St=SV,t=SA,t

We now need to do the same for the one source model.

(108) P(At,V1:N,t|Ct=1)=P1(At,V1:N,t)=P1(At|St,Ct=1)P1(V1:N,t|Ut,λt,Ct=1)P1(Ut,λt,St|Ct=1)dVdλdSt

Note that for simplicity in notation we will use P1 to indicate the probability within the model given Ct=1

We again approximate through the Free Energy that we already maximised iteratively in the variational Bayes algorithm.

(109) logP1(At,V1:N,t)=logP1(V1:N,t|Ut,θt,St)P1(At|St)P1(Ut,θt,St)𝑑Ut𝑑θt𝑑StLCt=1(q1)=q1(Ut,θt,SV,t)logP1(V1:N,t|Ut,θt)P1(At|St)P1(Ut|,St,θt)P1(St)P1(θt)q1(Ut,θt,St)dUtdθtdSt

(this approximation becomes exact if the variational approximation is exact, ie if the Kulback-Leibler difference between the posterior P1(V,θt,SV,t|V1:N,t) and the approximation q1(Ut,θt,SV,t) becomes zero.)

This can be interpreted as taking the expectation with regard to the posterior approximation, and due to the properties of the logarithm this can be separated into a sum of expectations:

(110) L1(q)=E(logP1(At|St))+E(logP1(V1:N,t|Ut,λt))+E(logP1(Ut))+E(logP1(θt))+E(logP1(SV,t))-E(logq1(Ut))-E(logq1(θt))-E(logq1(SV,t))

where (since E(X2)=μX2+σX2)

(111) E(logP1(At|St))=E(log12πσA2exp((AtSt)2/(2σA2))=log(σA22π)/2[(AtμS)2+1/τS,t]/(2σA2)
(112) E(logP1(V1:N,t|Ut,θt))=E(logiP1(Vi,t|Ut,θ))=E(logiexp(θt)2πexp(-(Vi,t-Ut)2exp(θt)/2)=E(ilog(exp(θt)2π)-(Vi,t-Ut)2exp(θt)/2)=n/2(Eθt(θ)-log(2π))+E(exp(θt))/2EUt(i-(Vi,t-Ut)2))=n/2(μθ,t-log(2π))+exp(μθ,t-1/(2τθ,t))/2(N/τU,t+i(Vi,t-μU)2)

and

(113) E(logP1(Ut)|St,θt))=E(log 𝒩(Ut;St,1/exp(θt)))=E(logexpθt2πexp((UtSt)2)exp(θt)/2))=(μθ,tlog(2π))/2[(μUμS,t)2+1/τU,t+1/τS,t]exp(μθ,t+1/(2τθ,t))/2

and

(114) E(logP1(St))=E(log 𝒩(St;μ0,σ02))=E(log1σ022πexp((Stμ0)2)/(2σ02))=log(σ022π)/2[(μS,tμ0)2+1/τS,t]/(2σ02)

and (due to Equation 75)

(115) E(logP1(θt))=E(𝒩(θt;θt1,1/κ))=E(logκ2πexp((θtμθ,t1)2κ/2))=(logκlog(2π))/2[(μθ,tμθ,t1)2+1/τθ,t]κ/2

and

(116) E(logq1(Ut))=E(log 𝒩(Ut;μU,t,1/τU,t))=E(logτU,t2πexp((UtμU,t)2τU,t/2))=logτU,t2πE((UtμU,t)2τU,t/2)=(logτU,tlog(2π))/2(E(Ut2)+μU,t22μU,tE(Ut))τU,t/2=(logτU,tlog(2π)1)/2

and

(117) E(logq1(St))=E(log 𝒩(St;μS,t,1/τS,t))=E(log12πσ02exp((StμS,t)2τS,t/2))=(logτS,tlog(2π)1)/2

and

(118) E(logq1(θt))=E(log 𝒩(θt;μθ,t,1/τθ,t))=(logτθ,tlog(2π)1)/2

In total we now have

(119) logP1(At,V1:N,t)L1(q1)=log(σA22π)/2[(AtμS)2+1/τS,t]/(2σA2)+N/2(μθ,tlog(2π))(N/τU,t+iN(Vi,tμU)2)exp(μθ,t+1/(2τθ,t))/2+((μθ,tlog(2π))/2[(μUμSV,t)2+1/τU,t+1/τSV,t]exp(μθ,t+1/(2τθ,t))/2log(σ022π)/2[(μSV,tμ0)2+1/τSV,t]/(2σ02)+(logκlog(2π))/2[(μθ,tμθ,t1)2+1/τθ,t]κ/2(logτU,tlog(2π)1)/2(logτS,tlog(2π)1)/2(logτθ,tlog(2π)1)/2

(where all first and second order moments (μ,τ) have been derived from q1).

This is identical to the result from C=2 except for the first line.

In total this provides us with an approximation to the model evidence for each model, P(At,V1:N,t|Ct=1) and P(At,V1:N,t|Ct=2), which can be used to calculate the posterior probability of either model given data, P(Ct|A1:t,V1:N,1:t).

4 Putting it all together

For either sub-model, the factorization (due to assumptions and variational Bayes approximation) allows us to write out the equations for the variable for subject choice:

(120) P(St|A1:t,V1:N,1:t)=P(St|A1:t,V1:N,1:t,Ct=1)P(Ct=1|A1:t,V1:N,1:t)+P(SA,t|A1:t,V1:N,1:t,Ct=2)P(Ct=2|A1:t,V1:N,1:t)qC=1,t(St)P(Ct=1|A1:t,V1:N,1:t)+P(SA,t|A1:t,Ct=2)P(Ct=2|A1:t,V1:N,1:t)

This is now a mixture of two Gaussian distributions (due to the Variational Bayes approximation), with mixture weights given by the model evidence (partly approximated by the Free Energy).

We will assume subjects report the mean of the distribution that is,

(121) St^=S^C=1,tP(Ct=1|A1:t,V1:N,1:t)+S^C=2,tP(Ct=2|A1:t,V1:N,1:t)

where S^C=1,t=μS,t for C=1 and S^C=2,t=μS,A,t for C=2.

We also need a prior over the visual log-reliability for the following trial

(122) P(θt|A1:t,V1:N,1:t)=P(θt|A1:t,V1:N,1:t,C=1)P(C=1|A1:t,V1:N,1:t)+P(θt|A1:t,V1:N,1:t,C=2)P(C=2|A1:t,V1:N,1:t)

While this is a mixture of two Gaussians, we need the prior to be a single Gaussian in order for our approximation scheme above to work. We will approximate this mixture with a single Gaussian (essentially fitting a Gaussian to the mixture of two Gaussians).

(123) P(θt|A1:t,V1:N,1:t)𝒩(θt|μθ,t,1/τθ,t)

where (due to the first- and second-order moments of mixture distributions)

(124) μθ,t=μθ,t,C=1P(Ct=1|A1:t,V1:N,1:t)+μθ,t,C=2P(Ct=2|A1:t,V1:N,1:t)
(125) 1/τθ,t=(1/τθ,t,C=1+μθ,t,C=12)P(Ct=1|A1:t,V1:N,1:t)+(1/τθ,t,C=2+μθ,t,C=22)P(Ct=2|A1:t,V1:N,1:t)-μθ,t2

While fitting a Gaussian to the sum of two Gaussians could be a very in-exact approximation, in practice the two individual distributions are close enough for this not to be a problem (as any contribution from At to the posterior of θt is very small).

In conclusion, subjects report St^ (through a button response, see Equation 121) and they propagate the posterior P(θt|A1:t,V1:N,1:t) (see Equation 123) as prior for the next trial.

Data availability

The human behavioral raw data and computational model predictions as well as the code for computational modelling and analyses scripts are available in an OSF repository: https://osf.io/gt4jb/.

The following data sets were generated
    1. Beierholm U
    2. Rohe T
    3. Noppeney U
    (2020) Open Science Framework
    Using the past to estimate sensory uncertainty.
    https://doi.org/10.17605/OSF.IO/GT4JB

References

  1. Conference
    1. Acerbi L
    2. Ma WJ
    (2017)
    Practical bayesian optimization for model fitting with bayesian adaptive direct search
    Advances in Neural Information Processing Systems. pp. 1836–1846.
  2. Conference
    1. Algazi VR
    2. Duda RO
    3. Thompson DM
    4. Avendano C
    (2001) The cipic hrtf database
    IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, 2001.
    https://doi.org/10.1109/ASPAA.2001.969552
  3. Book
    1. Bishop CM
    (2006)
    Pattern Recognition and Machine Learning
    New York: Springer.
  4. Book
    1. Gelman A
    2. Stern HS
    3. Carlin JB
    4. Dunson DB
    5. Vehtari A
    6. Rubin DB
    (2013)
    Bayesian data analysis
    Chapman and Hall/CRC.

Article and author information

Author details

  1. Ulrik Beierholm

    Psychology Department, Durham University, Durham, United Kingdom
    Contribution
    Conceptualization, Resources, Software, Formal analysis, Investigation, Visualization, Methodology, Writing - review and editing
    Contributed equally with
    Tim Rohe
    For correspondence
    ulrik.beierholm@durham.ac.uk
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-7296-7996
  2. Tim Rohe

    1. Department of Psychiatry and Psychotherapy, University of Tübingen, Tübingen, Germany
    2. Department of Psychology, Friedrich-Alexander University Erlangen-Nuernberg, Erlangen, Germany
    Contribution
    Conceptualization, Resources, Data curation, Formal analysis, Funding acquisition, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    Contributed equally with
    Ulrik Beierholm
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-9713-3712
  3. Ambra Ferrari

    Centre for Computational Neuroscience and Cognitive Robotics, University of Birmingham, Birmingham, United Kingdom
    Contribution
    Data curation, Project administration
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-1946-3884
  4. Oliver Stegle

    1. Max Planck Institute for Intelligent Systems, Tübingen, Germany
    2. European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
    3. Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany, Heidelberg, Germany
    Contribution
    Conceptualization, Methodology
    Competing interests
    No competing interests declared
  5. Uta Noppeney

    1. Centre for Computational Neuroscience and Cognitive Robotics, University of Birmingham, Birmingham, United Kingdom
    2. Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, Netherlands
    Contribution
    Conceptualization, Formal analysis, Supervision, Funding acquisition, Investigation, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared

Funding

European Research Council (ERC-multsens,309349)

  • Uta Noppeney

Max Planck Society

  • Tim Rohe
  • Uta Noppeney

Deutsche Forschungsgemeinschaft (DFG RO 5587/1-1)

  • Tim Rohe

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This study was funded by the ERC (ERC-multsens, 309349), the Max Planck Society and the Deutsche Forschungsgemeinschaft (DFG; grant number RO 5587/1–1). We thank Peter Dayan for his valuable contributions and very helpful comments on a previous version of the manuscript.

Ethics

Human subjects: All volunteers participated in the study after giving written informed consent. The study was approved by the human research review committee of the University of Tuebingen (approval number 432 2007 BO1) and the research review committee of the University of Birmingham (approval number ERN_15-1458AP1).

Version history

  1. Received: December 4, 2019
  2. Accepted: December 13, 2020
  3. Accepted Manuscript published: December 15, 2020 (version 1)
  4. Version of Record published: January 13, 2021 (version 2)

Copyright

© 2020, Beierholm et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 2,096
    Page views
  • 353
    Downloads
  • 11
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Ulrik Beierholm
  2. Tim Rohe
  3. Ambra Ferrari
  4. Oliver Stegle
  5. Uta Noppeney
(2020)
Using the past to estimate sensory uncertainty
eLife 9:e54172.
https://doi.org/10.7554/eLife.54172

Share this article

https://doi.org/10.7554/eLife.54172

Further reading

    1. Neuroscience
    Songyao Zhang, Tuo Zhang ... Tianming Liu
    Research Article

    Cortical folding is an important feature of primate brains that plays a crucial role in various cognitive and behavioral processes. Extensive research has revealed both similarities and differences in folding morphology and brain function among primates including macaque and human. The folding morphology is the basis of brain function, making cross-species studies on folding morphology important for understanding brain function and species evolution. However, prior studies on cross-species folding morphology mainly focused on partial regions of the cortex instead of the entire brain. Previously, our research defined a whole-brain landmark based on folding morphology: the gyral peak. It was found to exist stably across individuals and ages in both human and macaque brains. Shared and unique gyral peaks in human and macaque are identified in this study, and their similarities and differences in spatial distribution, anatomical morphology, and functional connectivity were also dicussed.

    1. Neuroscience
    Avani Koparkar, Timothy L Warren ... Lena Veit
    Research Article

    Complex skills like speech and dance are composed of ordered sequences of simpler elements, but the neuronal basis for the syntactic ordering of actions is poorly understood. Birdsong is a learned vocal behavior composed of syntactically ordered syllables, controlled in part by the songbird premotor nucleus HVC (proper name). Here, we test whether one of HVC’s recurrent inputs, mMAN (medial magnocellular nucleus of the anterior nidopallium), contributes to sequencing in adult male Bengalese finches (Lonchura striata domestica). Bengalese finch song includes several patterns: (1) chunks, comprising stereotyped syllable sequences; (2) branch points, where a given syllable can be followed probabilistically by multiple syllables; and (3) repeat phrases, where individual syllables are repeated variable numbers of times. We found that following bilateral lesions of mMAN, acoustic structure of syllables remained largely intact, but sequencing became more variable, as evidenced by ‘breaks’ in previously stereotyped chunks, increased uncertainty at branch points, and increased variability in repeat numbers. Our results show that mMAN contributes to the variable sequencing of vocal elements in Bengalese finch song and demonstrate the influence of recurrent projections to HVC. Furthermore, they highlight the utility of species with complex syntax in investigating neuronal control of ordered sequences.