Figures and data
![](https://prod--epp.elifesciences.org/iiif/2/105385%2Fv1%2Fcontent%2F620874v2_fig1.tif/full/max/0/default.jpg)
Two competing models of the world for explaining prediction errors in a changepoint paradigm.
A). The first causal structure assumes that the latest observation stems from the same source as the preceding observations. B). The second causal structure instead assumes that the latest observation originates from a different source. Note that the true causes that give rise to the sensory signals are unknown to observers (i.e., latent). This is illustrated on the right side by a brick wall that obscures the true source location(s) of the balls that are thrown over it (i.e., noisy observations). The inferred cause of the latest observation is indicated in the thought clouds.
![](https://prod--epp.elifesciences.org/iiif/2/105385%2Fv1%2Fcontent%2F620874v2_fig2.tif/full/max/0/default.jpg)
The reduced Bayesian observer model puts the causality question centre stage after every stimulus: Did the latest observation originate from the same generative mean location as the prior (with added noise), or has there been a changepoint (cp)?
A). Example trial in which an observer is presented with ten audiovisual stimuli. The generative mean changes twice after the start of the sequence: at t=4 and at t=6. Five consecutive stimuli are presented after the last cp (SAC 5), after which the participant responds with a prediction for the location of the upcoming stimulus. Ideally, the response location approximates the mean of the stimuli since the last cp. The sequential inference process to estimate this mean location is illustrated for three stimuli. Upon experiencing a large precision-weighted prediction error, a cp is inferred, and the prior becomes irrelevant for the current generative mean. So, the posterior (and next prior) is based on the likelihood only (t=4). In contrast, small prediction errors indicate a low probability of a cp, so the likelihood and prior are integrated to improve the precision of the mean estimate (t=5). But what to do when there is (causal) uncertainty about the occurrence of a cp (t=6)? The moderately sized prediction error suggests a cp, but the overlap between prior and likelihood indicates that they could also have originated from a common generative mean, i.e., μ6 = μ5. A fully Bayesian observer computes a posterior as a weighted mixture of two posterior components (dashed lines), each conditional on a causal hypothesis (cp or not). Alternatively, the reduced Bayesian observer summarizes that mixture distribution by its mean and variance, and thus simplifies the posterior (and next prior) to a single normal distribution.
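To make the update rule concrete, here is a minimal Python sketch of one such per-stimulus step. The hazard rate `h`, sensory noise variance `sigma2`, and the ±45 stimulus range are illustrative placeholders, not the paper's fitted values.

```python
import numpy as np
from scipy.stats import norm

def reduced_bayes_update(mu, var, x, sigma2, h, space=(-45.0, 45.0)):
    """One stimulus update of a reduced Bayesian observer (sketch).

    mu, var : prior mean and variance for the generative mean
    x       : latest observation; sigma2 : sensory noise variance
    h       : changepoint hazard rate (assumed known to the observer)
    """
    # Predictive density of x under each causal hypothesis
    p_same = (1 - h) * norm.pdf(x, mu, np.sqrt(var + sigma2))
    p_cp = h / (space[1] - space[0])        # new mean drawn uniformly over space
    pi = p_same / (p_same + p_cp)           # prior relevance: P(no cp | x)

    # Posterior components conditional on each causal hypothesis
    k = var / (var + sigma2)                # precision weighting of the error
    mu_same, var_same = mu + k * (x - mu), k * sigma2
    mu_cp, var_cp = x, sigma2               # prior irrelevant after a cp

    # Moment-match the two-component mixture to a single normal
    mu_post = pi * mu_same + (1 - pi) * mu_cp
    var_post = (pi * (var_same + mu_same**2)
                + (1 - pi) * (var_cp + mu_cp**2)) - mu_post**2
    return mu_post, var_post, pi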
B-D). The prior relevance measure (Π) is key to the inference process of the reduced Bayesian observer. Panel B depicts its logit transformation (Q, i.e., the posterior log-odds of no changepoint) as a function of the absolute prediction error.
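Expanding the Gaussian term in the sketch above gives Q directly as a function of the absolute prediction error, with its characteristic quadratic decrease; parameter names are the same illustrative placeholders.

```python
import numpy as np

def log_odds_no_cp(abs_pe, var, sigma2, h, width=90.0):
    """Q = logit(prior relevance): posterior log-odds of 'no changepoint',
    as a function of the absolute prediction error abs_pe (sketch)."""
    var_pred = var + sigma2                       # predictive variance if no cp
    return (np.log((1 - h) / h)                   # prior log-odds of no cp
            + np.log(width)                       # vs. uniform density under a cp
            - 0.5 * np.log(2 * np.pi * var_pred)
            - abs_pe**2 / (2 * var_pred))         # quadratic penalty on the error
```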
![](https://prod--epp.elifesciences.org/iiif/2/105385%2Fv1%2Fcontent%2F620874v2_fig3.tif/full/max/0/default.jpg)
Participants’ prediction responses are reasonably accurate, but the reduced Bayesian observer model with default parameters performs better.
A). On average, participants’ responses correlate well with the omniscient observer model for the two experimental noise conditions (left and right panels). The fully Bayesian observer biases its predictions towards the centre of space to accommodate potentially upcoming changepoints. This bias is not present for participants, and it is therefore not modelled for the other observers (naïve, omniscient, and reduced Bayesian, with default parameters or with parameters that were fit to the participants’ data; see section 2.3).
B-C). The response error relative to the true generative mean (unknown to participants) decreases with larger SAC levels due to integration of consecutive stimuli, but participants’ absolute error remains larger than the naïve observer’s because of additional response noise (panel B). When the errors are normalized with respect to the naïve observer’s responses (at 1), the random response noise averages out and the relative accuracy improvement becomes visible (panel C).
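One plausible reading of this normalization, sketched below, divides each trial's absolute error by the naïve observer's absolute error on the same trial, so that a value of 1 marks naïve performance; the paper's exact procedure is described in its methods.

```python
import numpy as np

def normalized_error(responses, naive_responses, gen_mean):
    """Trial-wise error relative to the naive observer (assumption: the
    normalization may instead be applied to summary statistics)."""
    err = np.abs(np.asarray(responses) - gen_mean)
    naive_err = np.abs(np.asarray(naive_responses) - gen_mean)
    return err / naive_err                  # 1.0 marks naive-observer accuracy
```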
The traces in all panels depict the group-level median of the individuals’ median response, per SAC bin (in B-C), or using a rolling kernel method (in A). Note that the rolling median results in small edge artifacts (panel A), with an apparent central bias for peripheral locations even for the naïve observer (in addition to the integration-based bias of the other observers). The blue shaded region depicts the range between the group-level 25th (Q1) and 75th (Q3) percentile participants.
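The rolling kernel traces could be produced along the following lines (a sliding-window median; the window width here is an arbitrary choice). The one-sided windows at the extremes are what produce the edge artifacts noted above.

```python
import numpy as np

def rolling_median(x, y, window=5.0):
    """Median of y within a window sliding over x (sketch)."""
    x, y = np.asarray(x), np.asarray(y)
    order = np.argsort(x)
    x, y = x[order], y[order]
    # Windows near the edges are one-sided, hence the edge artifacts
    return x, np.array([np.median(y[np.abs(x - xi) <= window / 2]) for xi in x])
```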
![](https://prod--epp.elifesciences.org/iiif/2/105385%2Fv1%2Fcontent%2F620874v2_fig4.tif/full/max/0/default.jpg)
Normalized bias towards the prior (at 1) for SAC 1 trials (panel A) and signed bias towards the prior (positive sign) for SAC 2 and SAC 3 trials (panel B), as a function of the experienced prediction error. The prior’s location was approximated by means of the omniscient observer.
The traces depict the group-level median of the individuals’ local median response, as computed via a rolling kernel method. For that procedure to be appropriate, the prediction errors were non-linearly scaled to approximate a constant density of trials over the x-axis (a sketch of this scaling follows the caption). Note that the rolling median results in small edge artifacts, which cause the signed bias to appear larger than zero for the smallest prediction errors.
The blue shaded region depicts the range between the group-level 25th (Q1) and 75th (Q3) percentile participants. For the modelled observers we only depict the group-level medians (naïve, omniscient, and reduced Bayesian, with default parameters or with parameters that were fit to the participants’ data; see section 2.3).
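The density-equalizing scaling can be approximated by a rank (empirical-quantile) transform of the prediction errors before applying the rolling-median sketch above; this is an assumption about the procedure, not the paper's exact transform.

```python
import numpy as np

def equalize_density(pe):
    """Map prediction errors to empirical quantiles so that trials are spread
    approximately uniformly over the new x-axis (sketch)."""
    pe = np.asarray(pe)
    ranks = np.argsort(np.argsort(pe))      # 0 .. n-1 in order of magnitude
    return (ranks + 0.5) / len(pe)          # quantiles in (0, 1)
```

For example, `rolling_median(equalize_density(pe), bias)` would then compute the local median bias over a constant-density x-axis.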
![](https://prod--epp.elifesciences.org/iiif/2/105385%2Fv1%2Fcontent%2F620874v2_fig5.tif/full/max/0/default.jpg)
Fitted parameter values of the reduced Bayesian observer model.
Panel A depicts the fitted changepoint hazard rate.
Panels B and C show the same fitted hazard rates and experimental noise estimates, but now as a comparison of the noise conditions.
Individuals’ colour coding depends on the model comparison results (section 2.4 and Figure 7B): dark red for participants who are better fit by models with a larger memory capacity, light red for participants who are better fit by the reduced Bayesian observer model with limited memory capacity, and grey otherwise.
![](https://prod--epp.elifesciences.org/iiif/2/105385%2Fv1%2Fcontent%2F620874v2_fig6.tif/full/max/0/default.jpg)
Illustration of the modelling framework with four factors: memory capacity (A), late truncation simplification (B), decision strategy (C), and pruning function (D).
A). The three panels depict a sequence of three stimuli (top to bottom) that illustrate the inference difference between a reduced Bayesian observer (M = 1, posterior depicted as dashed red line) and a similar Bayesian model with larger memory capacity (M ≥ 3, solid orange line with shaded area). Posterior distributions of models with extended memory consist of a weighted mixture of multiple nodes (dashed orange lines with a, b, and c letter indicators). Each node’s weight indicates its relative posterior relevance for the inferred location of the generative mean. In this sequence, the second stimulus leads to high causal uncertainty (nearly equal weight for both nodes: a. latest cp at t=1, b. latest cp at t=2), but the third stimulus increases the inferred relevance of the node that codes for the hypothesis that no changepoint took place (a): i.e., it is most probable that all three stimuli stem from the same generative mean, and it is least probable that a changepoint took place at t=3 (c). The reduced Bayesian observer cannot retrospectively reassign weights to nodes and is therefore slightly less accurate (here: a small bias towards the penultimate stimulus) and less precise (larger spatial uncertainty) than a near-Bayesian observer with larger memory capacity.
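The extended-memory inference can be sketched as a small filter over "latest changepoint at time t" hypotheses; the retrospective weight reassignment described above falls out of the per-node likelihoods. As before, `h`, `sigma2`, and the space width are illustrative placeholders.

```python
import numpy as np
from scipy.stats import norm

def update_nodes(nodes, x, sigma2, h, width=90.0):
    """One stimulus update for a multi-node observer (sketch).

    nodes : oldest-first list of dicts {'w': weight, 'mu': mean, 'var': variance},
            one per 'latest changepoint at time t' hypothesis;
            incoming weights are assumed to sum to 1.
    """
    new_nodes = []
    for n in nodes:                         # hypothesis: no changepoint at this step
        w = n['w'] * (1 - h) * norm.pdf(x, n['mu'], np.sqrt(n['var'] + sigma2))
        k = n['var'] / (n['var'] + sigma2)
        new_nodes.append({'w': w, 'mu': n['mu'] + k * (x - n['mu']), 'var': k * sigma2})
    # hypothesis: a changepoint at the current step (new mean uniform over space)
    new_nodes.append({'w': h / width, 'mu': x, 'var': sigma2})
    total = sum(n['w'] for n in new_nodes)
    for n in new_nodes:
        n['w'] /= total                     # renormalize to posterior weights
    return new_nodes
```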
B). The late truncation simplification implies that the space boundaries are ignored until a response has to be given. At that point, the observer’s best prediction is moved into the generative mean range only if necessary. Without the simplification, observers compute the expectation (i.e., mean) of the truncated normal distribution as their best prediction, and this biases their response towards the centre of space. Here, the difference is illustrated for a response at t=1 (from panel A). With the simplification, the observer responds at the mean of the (non-truncated) normal distribution; moving the intended response location into the generative mean range is unnecessary in this example.
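The two response rules differ only at response time, as the following sketch illustrates (the ±45 generative mean range is again a placeholder).

```python
import numpy as np
from scipy.stats import truncnorm

def respond_late_truncation(mu, sigma, lo=-45.0, hi=45.0):
    """With the simplification: respond at the untruncated mean,
    moved into the generative mean range only if necessary."""
    return np.clip(mu, lo, hi)

def respond_truncated_mean(mu, sigma, lo=-45.0, hi=45.0):
    """Without the simplification: respond at the expectation of the
    truncated normal, which is biased towards the centre of space."""
    a, b = (lo - mu) / sigma, (hi - mu) / sigma
    return truncnorm.mean(a, b, loc=mu, scale=sigma)
```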
C). Whenever the posterior distribution consists of more than one node, the observer needs to decide how to make a response (here depicted for t=2). The model averaging decision function computes the expectation of the weighted mixture distribution as its best prediction, thus essentially averaging two models of the world (i.e., nodes a and b). Instead, the model selection decision function deterministically selects the node with the highest weight (i.e., inferred relevance) as the basis for its prediction response.
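In the node representation used in the sketch above, the two decision functions reduce to a few lines:

```python
def model_averaging(nodes):
    """Respond at the expectation of the weighted mixture over nodes."""
    return sum(n['w'] * n['mu'] for n in nodes)

def model_selection(nodes):
    """Respond at the mean of the single highest-weighted node."""
    return max(nodes, key=lambda n: n['w'])['mu']
```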
D). The node pruning function determines how an observer satisfies the memory capacity limit M. Whenever the number of posterior nodes exceeds the memory capacity, the pruning function ensures that the prior for t+1 consists of no more than M nodes. Here, this is depicted for the case at t=3, with a memory capacity of M=2. The keep_WAVG pruning function merges the oldest two nodes (a and b) by computing a weighted average node (while the newest node, c, for a cp at t=3, remains in memory as well, despite its small weight). The keep_MAXP pruning function keeps the node with the highest weight of the oldest two nodes and discards the other (i.e., node a is kept, b is discarded), whereas the keep_RCNT pruning function always keeps the most recent of the two oldest nodes (i.e., node b is kept, a is discarded). The node that is kept in memory also receives the weight of the discarded node, such that it now represents the hypothesis of a cp at t ≤ 2.
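A sketch of the three pruning functions, assuming the oldest-first node list from the earlier sketches; each would be applied whenever the node count exceeds M.

```python
def keep_wavg(nodes):
    """Merge the two oldest nodes into one moment-matched weighted average."""
    a, b = nodes[0], nodes[1]
    w = a['w'] + b['w']
    mu = (a['w'] * a['mu'] + b['w'] * b['mu']) / w
    var = (a['w'] * (a['var'] + a['mu']**2)
           + b['w'] * (b['var'] + b['mu']**2)) / w - mu**2
    return [{'w': w, 'mu': mu, 'var': var}] + nodes[2:]

def keep_maxp(nodes):
    """Keep whichever of the two oldest nodes has the higher weight;
    the kept node inherits the discarded node's weight."""
    a, b = nodes[0], nodes[1]
    kept = dict(a if a['w'] >= b['w'] else b)
    kept['w'] = a['w'] + b['w']
    return [kept] + nodes[2:]

def keep_rcnt(nodes):
    """Always keep the more recent of the two oldest nodes;
    it inherits the discarded node's weight."""
    kept = dict(nodes[1])
    kept['w'] = nodes[0]['w'] + nodes[1]['w']
    return [kept] + nodes[2:]
```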
![](https://prod--epp.elifesciences.org/iiif/2/105385%2Fv1%2Fcontent%2F620874v2_fig7.tif/full/max/0/default.jpg)
Model Comparison Results.
Panel A shows the fixed-effects model comparison for the 42 models. The y-axis denotes the log model evidence (lme) of each model, summed across participants, relative to the reduced Bayesian observer model (memory capacity M = 1, keep_WAVG pruning function, model averaging decision strategy, and with late truncation simplification). The lme difference is dominated by the pruning function (colour coding) at low memory capacity (x-axis), while it is dominated by the decision strategy (light vs. dark) at larger memory capacities. Models with the model averaging decision strategy fit participants’ data better with the late truncation simplification, while models with the model selection decision strategy generally fit better without the simplification. Panel B zooms in on the individual results of the memory capacity factor (x-axis), for models with the keep_WAVG pruning function, model averaging decision strategy, and with late truncation simplification. Each line depicts the lme differences for one participant. The colour coding separates participants who were better fit by models with larger memory capacity (dark red), and participants who responded more like reduced Bayesian observers (light red), from participants without a clear preference (grey).
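The fixed-effects comparison in panel A amounts to summing log model evidence over participants and differencing against the reference model; a minimal sketch:

```python
import numpy as np

def lme_differences(lme, ref):
    """Fixed-effects model comparison (sketch).

    lme : array of shape (n_participants, n_models)
    ref : column index of the reference model
    """
    summed = np.asarray(lme).sum(axis=0)    # sum lme across participants
    return summed - summed[ref]             # difference vs. the reference model
```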
![](https://prod--epp.elifesciences.org/iiif/2/105385%2Fv1%2Fcontent%2F620874v2_fig8.tif/full/max/0/default.jpg)
The generative model describes the process that gives rise to sensory signals.
For every stimulus, a random draw from a Bernoulli distribution determines whether or not a changepoint occurs. If a changepoint occurs (right panel), the generative mean is drawn at random from a uniform distribution; otherwise it remains the same as before (left panel). Finally, in both cases, the stimulus location is drawn at random from a normal distribution that is centred on the generative mean. See the text in section 4.2 for details and the generative parameter settings.
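A minimal sampler for this generative process; the hazard rate, noise level, and spatial range below are placeholders for the section 4.2 settings.

```python
import numpy as np

def generate_sequence(n, hazard=0.15, noise_sd=5.0, space=(-45.0, 45.0), rng=None):
    """Sample one stimulus sequence from the generative model (sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    mu = rng.uniform(*space)                      # initial generative mean
    means, stimuli = [], []
    for _ in range(n):
        if rng.random() < hazard:                 # Bernoulli changepoint draw
            mu = rng.uniform(*space)              # new mean, uniform over space
        means.append(mu)
        stimuli.append(rng.normal(mu, noise_sd))  # stimulus centred on the mean
    return np.array(means), np.array(stimuli)
```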