Formalizing Occam’s razor as Bayesian model selection to understand simplicity preferences in human decision-making.
a: Occam’s razor prescribes an aversion to complex explanations (models). In Bayesian model selection, model complexity quantifies the flexibility of a model, or its capacity to account for a broad range of empirical observations. In this example, we observe an apple falling from a tree (left) and compare two possible explanations: 1) classical mechanics, and 2) the intervention of a ghost. b: Schematic comparison of the evidence of the two models in a. Classical mechanics (pink) explains a narrower range of observations than the ghost (green), which is a valid explanation for essentially any conceivable phenomenon (e.g., both a falling and a spinning-upward trajectory, as in the insets). Absent further evidence and given equal prior probabilities, Occam’s razor posits that the simpler model (classical mechanics) is preferred, because its hypothesis space is more concentrated around the sparse, noisy data and thus avoids “overfitting” to noise. c: A geometrical view of the model-selection problem. Two alternative models are represented as geometrical manifolds, and the maximum-likelihood point for each model is represented as the projection of the data (red star) onto the manifolds. d: Systematic expansion of the log evidence of a model ℳ (see previous work by Balasubramanian (1997) and Methods section A.2). ϑ̂ is the maximum-likelihood point on model ℳ for data X, N is the number of observations, d is the number of parameters of the model, ∇ℓ is the likelihood gradient evaluated at ϑ̂, h is the observed Fisher information matrix, and g is the expected Fisher information matrix (see Methods). g(ϑ) captures how distinguishable elements of ℳ are in the neighborhood of ϑ (see Methods section A.2 and previous work by Balasubramanian (1997)). When ℳ is the true source of the data X, h(X; ϑ) can be seen as a noisy version of g(ϑ), estimated from limited data (Balasubramanian, 1997). ĥ−1 is a shorthand for h(X; ϑ̂)−1, and ‖∇ℓ‖ is the length of ∇ℓ measured in the metric defined by ĥ−1. The ellipsis collects terms that decrease as N grows. Each term of the expansion represents a distinct geometrical feature of the model (Balasubramanian, 1997): dimensionality penalizes models with many parameters; boundary (a novel contribution of this work) penalizes models for which ϑ̂ lies on the boundary of ℳ; volume counts the number of distinguishable probability distributions contained in ℳ; and robustness captures the shape (curvature) of ℳ near ϑ̂ (see Methods section A.2 and previous work by Balasubramanian (1997)). e: Psychophysical task with variants designed to probe each geometrical feature in d. For each trial, a random location on one model was selected (gray star), and data (red dots) were sampled from a Gaussian centered around that point (gray shading). The red star represents the empirical centroid of the data, by analogy with c. The maximum-likelihood point can be found by projecting the empirical centroid onto one of the models. Participants saw only the models (black lines) and the data (red dots) and were required to choose which model best explained the data. Insets: task performance for the given task variant, for a set of 100 simulated ideal Bayesian observers (orange) versus a set of 100 simulated maximum-likelihood observers (i.e., choosing based only on whichever model was closest to the empirical centroid of the data on a given trial; cyan).
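Written as a penalty (i.e., minus the log evidence), the expansion in d has the schematic form below. This is a sketch following Balasubramanian (1997); the boundary contribution is written abstractly as B(·), since its exact form (a novel result of this work) is given in Methods section A.2:

$$
-\log P(X \mid \mathcal{M}) \;\approx\;
\underbrace{-\log P(X \mid \hat{\vartheta})}_{\text{goodness of fit}}
\;+\; \underbrace{\frac{d}{2}\log\frac{N}{2\pi}}_{\text{dimensionality}}
\;+\; \underbrace{B\!\left(\hat{\vartheta}, \nabla\ell, \hat{h}\right)}_{\text{boundary}}
\;+\; \underbrace{\log \int_{\mathcal{M}} d\vartheta \,\sqrt{\det g(\vartheta)}}_{\text{volume}}
\;+\; \underbrace{\frac{1}{2}\log\frac{\det h(X;\hat{\vartheta})}{\det g(\hat{\vartheta})}}_{\text{robustness}}
\;+\; \dots
$$

where B vanishes when ϑ̂ lies in the interior of ℳ and otherwise depends on the length of ∇ℓ measured in the metric defined by ĥ−1.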
Integration over latent causes leads to Occam’s razor.
a: Schematic of a simple decision-making scenario. A single datapoint (star) is sampled from one of two models (pink dot, green bar). One of the models (ℳ1) is a Gaussian with known variance, centered at the location of the pink dot. The other model (ℳ2) is a parametric family of Gaussians with known, fixed variance and center located at a latent location along the green bar. Cyan line: boundary indicating locations in data space that are equidistant from ℳ1 and ℳ2. b-d: Potential components of a decision-making observer for this scenario, which we call the Noise-Integration-Noise observer (see Methods section A.1 and Supplementary Information section B.7 for further details). b: Sensory noise: the observer does not have access to the true data (location of the star), only a noisy version of it corrupted by Gaussian noise with variance ρ. c: Integration over latent causes: the observer can consider possible positions of the center of the Gaussian in model ℳ2. d: Choice noise: after forming an internal estimate of the relative likelihood of ℳ1 and ℳ2, the observer can choose a model based on a deterministic process (for instance, always picking the most likely one) or a stochastic one, in which the probability of sampling a model is related to its likelihood. e-h: Behavior of the observer as a function of the location of the datapoint, within the zoomed-in region highlighted in a, and of the presence of the mechanisms illustrated in b-d. e: Probability that the observer reports ℳ2 as a function of the location of the datapoint, when sensory and choice noise are low and in the absence of integration over latent causes. f: Same as e, but in the presence of integration over latent causes. The decision boundary of the observer (white area) is shifted toward the more complex model (ℳ2) compared to e. This shift means that, when the data are equidistant from ℳ1 and ℳ2, the observer prefers the simpler model (ℳ1). g: Same as e, but with strong sensory noise. The decision boundary of the observer is shifted in the direction opposite to that in f. h: Same as e, but with strong choice noise. Choice noise has no effect on the location of the decision boundary.
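A minimal simulation sketch of this scenario is given below. The parameter names and the way the integration parameter b interpolates between maximum-likelihood and fully integrated evaluations of ℳ2 are illustrative simplifications, not the exact implementation described in Methods section A.1:

```python
import numpy as np

rng = np.random.default_rng(0)

def nin_choice(x, mu1, bar_start, bar_end, sigma=1.0, rho=0.25, b=1.0, T=0.5):
    """Sketch of a Noise-Integration-Noise (NIN) observer for the scenario in panel a.

    x        : 2D datapoint (true location of the star)
    mu1      : center of the point model M1
    bar_*    : endpoints of the segment of candidate centers for M2
    sigma    : known standard deviation of the generative Gaussians
    rho      : sensory-noise variance (panel b)
    b        : 0 = no integration over latent causes, 1 = full integration (panel c)
    T        : choice-noise temperature (panel d)
    Returns 1 if the observer reports M2, else 0.
    """
    # b: sensory noise corrupts the observed datapoint (variance rho)
    x_obs = x + rng.normal(scale=np.sqrt(rho), size=2)

    # log-likelihood under M1 (single Gaussian centered at mu1)
    ll1 = -np.sum((x_obs - mu1) ** 2) / (2 * sigma**2)

    # c: integration over latent causes -- average the likelihood over candidate
    # centers along the bar (numerical integration); with b = 0, only the
    # maximum-likelihood center is used.
    ts = np.linspace(0.0, 1.0, 200)
    centers = bar_start[None, :] + ts[:, None] * (bar_end - bar_start)[None, :]
    lls = -np.sum((x_obs[None, :] - centers) ** 2, axis=1) / (2 * sigma**2)
    ll2_integrated = np.log(np.mean(np.exp(lls - lls.max()))) + lls.max()
    ll2 = b * ll2_integrated + (1 - b) * lls.max()

    # d: choice noise -- softmax with temperature T on the log-likelihood ratio
    p_choose_m2 = 1.0 / (1.0 + np.exp(-(ll2 - ll1) / T))
    return int(rng.random() < p_choose_m2)

# Example: a datapoint roughly equidistant from the dot and the bar
choice = nin_choice(x=np.array([0.0, 0.0]),
                    mu1=np.array([0.0, 1.0]),
                    bar_start=np.array([-2.0, -1.0]),
                    bar_end=np.array([2.0, -1.0]))
```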
a: Summary of human behavior. Hue (pink/green): k-nearest-neighbor interpolation of the model choice, as a function of the empirical centroid of the data. Color gradient (light/dark): marginal density of empirical data centroids for the given model pair, showing the region of space where data were more likely to fall. Cyan solid line: decision boundary for an observer that always chooses the model with the highest maximum likelihood. Orange dashed line: decision boundary for an ideal Bayesian observer. The participants’ choices tended to reflect a preference for the simpler model, particularly near the center of the screen, where the evidence for the alternatives was weak. For instance, in the left panel there is a region where data were closer to the line than to the dot, but participants chose the dot (the simpler, lower-dimensional “model”) more often than the line. b: Participant sensitivity to each geometrical feature characterizing model complexity was estimated via hierarchical logistic regression (see Methods section A.6 and Supplementary Information section B.2), using as predictors a constant to account for an up/down choice bias, the difference in likelihoods for the two models (L2 − L1), and the difference in each FIA term for the two models (D2 − D1, etc.). Following a hierarchical regression scheme, the participant-level sensitivities were in turn modeled as samples from a population-level distribution, whose mean is our population-level estimate of the sensitivity. c: Overall accuracy versus estimated relative FIA sensitivity for each task condition, as indicated. Points are data from individual participants. Each fitted FIA coefficient was normalized to the likelihood coefficient and thus can be interpreted as a relative sensitivity to the associated FIA term. For each term, an ideal Bayesian observer would have a relative sensitivity of one (dashed orange lines), whereas an observer relying only on maximum-likelihood estimation (i.e., choosing “up” or “down” based only on which model was closest to the data) would have a relative sensitivity of zero (solid cyan lines). Top, gray: population-level estimates (posterior distribution of the population-level relative sensitivity given the experimental observations). Bottom: each gray dot represents the task accuracy of one participant (y axis) versus the posterior mean estimate of the relative sensitivity for that participant (x axis). Intuitively, the population-level posterior can be interpreted as an estimate of the location of the center of the cloud of dots representing individual participants in the panel below. See Methods section A.6 for further details on statistical inference and the relationship between population-level and participant-level estimates. Purple: relative sensitivity of an ideal observer that samples from the exact Bayesian posterior (not the approximation provided by the FIA). Shading: posterior mean ± 1 or 2 standard deviations, estimated by simulating 50 such observers.
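In schematic form, and omitting the lapse rate and the priors of the hierarchical model (see Methods section A.6; the symbols B, V, and R for the boundary, volume, and robustness terms are used here only for illustration), the regression in b models the choice of participant s on trial t as

$$
P(\text{choose model 2} \mid t, s) \;=\; \Lambda\big( \beta_{0,s} + \beta_{L,s}(L_2 - L_1) + \beta_{D,s}(D_2 - D_1) + \beta_{B,s}(B_2 - B_1) + \beta_{V,s}(V_2 - V_1) + \beta_{R,s}(R_2 - R_1) \big),
\qquad \Lambda(z) = \frac{1}{1+e^{-z}},
$$

with the participant-level coefficients drawn from population-level distributions and the relative sensitivities defined as ratios such as β_{D,s}/β_{L,s}.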
a: A novel deep neural-network architecture for statistical model selection. The network (see text and Methods for details) takes as input two images, each representing a model, and a set of 2D coordinates, each representing a datapoint. The output is a softmax-encoded choice between the two models. b: Each network was trained on multiple variants of the task, including systematically varied model length or curvature, and then tested using the same configurations as in the human studies. c: Summary of network behavior, in the same format as Figure 3a. Hue (pink/green): k-nearest-neighbor interpolation of the model choice, as a function of the empirical centroid of the data. Color gradient (light/dark): marginal density of empirical data centroids for the given model pair, showing the region of space where data were more likely to fall. Cyan solid line: decision boundary for an observer that always chooses the model with the highest maximum likelihood. Orange dashed line: decision boundary for an ideal Bayesian observer. d: Estimated relative sensitivity to the geometrical features characterizing model complexity. As for the human participants, each fitted FIA coefficient was normalized to the likelihood coefficient and thus can be interpreted as a relative sensitivity to the associated FIA term. For each term, an ideal Bayesian observer would have a relative sensitivity of one (dashed orange lines), whereas an observer relying only on maximum-likelihood estimation (i.e., choosing “up” or “down” based only on which model was closest to the data) would have a relative sensitivity of zero (solid cyan lines). Top: population-level estimate (posterior distribution of the population-level relative sensitivity given the experimental observations; see Methods section A.6 for details). Bottom: each gray dot represents the task accuracy of one of 50 trained networks (y axis) versus the posterior mean estimate of the relative sensitivity for that network (x axis). Intuitively, the population-level posterior can be interpreted as an estimate of the location of the center of the cloud of dots representing individual networks in the panel below. See Methods section A.6 for further details on statistical inference and the relationship between population-level and individual-level estimates. Purple: relative sensitivity of an ideal observer that samples from the exact Bayesian posterior (not the approximation provided by the FIA). Shading: posterior mean ± 1 or 2 standard deviations, estimated by simulating 50 such observers.
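A minimal sketch of the kind of architecture described in panel a is shown below. The layer sizes, the convolutional image encoder, and the permutation-invariant pooling over datapoints are illustrative assumptions; the actual architecture is specified in the Methods:

```python
import torch
import torch.nn as nn

class ModelSelectionNet(nn.Module):
    """Sketch: two model images + a set of 2D datapoints -> softmax-encoded choice."""

    def __init__(self, embed_dim=64):
        super().__init__()
        # Shared convolutional encoder, applied to each model image separately
        self.image_encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Pointwise encoder for the 2D datapoints, pooled over the set
        self.point_encoder = nn.Sequential(
            nn.Linear(2, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Head producing a two-way choice (logits; apply softmax / cross-entropy outside)
        self.head = nn.Sequential(
            nn.Linear(3 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, 2),
        )

    def forward(self, img1, img2, points):
        # img1, img2: (batch, 1, H, W); points: (batch, n_points, 2)
        z1 = self.image_encoder(img1)
        z2 = self.image_encoder(img2)
        zp = self.point_encoder(points).mean(dim=1)  # permutation-invariant pooling
        return self.head(torch.cat([z1, z2, zp], dim=-1))
```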
Humans, but not artificial networks, exhibit simplicity preferences even when they are suboptimal.
a: Relative sensitivity of human participants to the geometric complexity terms (population-level estimates, as in Figure 3c, top) for two task conditions: 1) the original, “generative” task, in which participants were implicitly instructed to solve a model-selection problem (same data as in Figure 3c, top; orange); and 2) a “maximum-likelihood” task variant, in which participants were instructed to report which of the two models had the higher likelihood (shortest distance from the data; cyan). The two task variants were tested on distinct participant pools of roughly the same size (202 participants for the generative task and 201 for the maximum-likelihood task, in both cases divided into four groups of roughly 50 participants each). Solid cyan lines: relative sensitivity of a maximum-likelihood observer. Orange dashed lines: relative sensitivity of an ideal Bayesian observer. b: Same comparison and format (note the different x-axis scaling), but for two distinct populations of 50 deep neural networks trained on the two variants of the task (orange is the same data as in Figure 4d, top).
Humans and artificial neural networks have different patterns of accuracy reflecting their different use of simplicity preferences.
Each panel shows accuracy with respect to maximum-likelihood solutions (i.e., the model closest to the centroid of the data; ordinate) versus accuracy with respect to generative solutions (i.e., the model that generated the data; abscissa). The gray line is the identity. Columns correspond to the four task variants associated with the four geometric complexity terms, as indicated. a: Data from individual human participants (points), instructed to find the generative (orange) or maximum-likelihood (cyan) solution. Human performance was higher when evaluated against maximum-likelihood solutions than when evaluated against generative solutions, for all groups of participants (two-tailed paired t-test; generative-task participants: dimensionality, t-statistic 2.21, p-value 0.03; boundary, 6.21, 1e-7; volume, 9.57, 8e-13; robustness, 10.6, 2e-14; maximum-likelihood-task participants: dimensionality, 5.75, 5e-7; boundary, 4.79, 2e-6; volume, 10.8, 2e-14; robustness, 12.2, 2e-16). b: Data from individual ANNs (points), trained on the generative (orange) or maximum-likelihood (cyan) task. Network performance was likewise higher when evaluated against maximum-likelihood solutions than against generative solutions (all dots lie above the identity line).
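A sketch of the paired comparison reported in panel a, using synthetic per-participant accuracies purely for illustration (the variable names and values are not from the study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-participant accuracies, for illustration only
acc_gen = rng.uniform(0.6, 0.9, size=50)                          # scored against generative solutions
acc_ml = np.clip(acc_gen + rng.normal(0.05, 0.02, size=50), 0, 1) # scored against ML solutions

t_stat, p_value = stats.ttest_rel(acc_ml, acc_gen)  # two-tailed paired t-test, as in panel a
```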
R̂ statistic and effective sample size (ESS) for 12 Markov chain Monte Carlo (MCMC) traces run as described in the text, for the fit to human data for the generative task.
See Gelman et al. (2014, sections 11.4–11.5) and Vehtari et al. (2020) for an in-depth discussion of chain-quality diagnostics. Briefly, R̂ depends on the relationship between the variance of the draws estimated within and between contiguous draw sequences; R̂ is close to 1 when the chains have successfully converged. The effective sample size estimates how many independent samples one would need to extract the same amount of information as that contained in the (correlated) MCMC draws.
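These diagnostics can be computed, for example, with ArviZ (Kumar et al., 2019); a minimal sketch, assuming the MCMC draws are available as an arviz.InferenceData object named idata:

```python
import arviz as az

# idata: an arviz.InferenceData object holding the 12 MCMC chains (assumed available)
rhat = az.rhat(idata)  # close to 1 for converged chains
ess = az.ess(idata)    # effective number of independent draws per parameter

print(az.summary(idata, kind="diagnostics"))  # tabulates R-hat and ESS per parameter
```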
Comparison of the full Bayesian and FIA computation of the log posterior ratio (LPR) for the model pairs used in our psychophysical tasks (N = 10).
Each row corresponds to one task variant (from top to bottom, “dimensionality”, “boundary”, “volume”, “robustness”). First column from the left: full Bayesian LPR, computed by numerical integration. Second column: LPR computed with the FIA. Third column: difference between FIA and exact LPR. Fourth column: relative difference (difference divided by the absolute value of the FIA LPR). Adapted from Piasini et al. (2021a).
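A sketch of the kind of numerical integration used for the full Bayesian LPR, for a toy point-versus-segment model pair with a uniform (Jeffreys) prior on the segment's arc-length parameter. The specific geometry, noise level, and synthetic data below are illustrative, not the task stimuli:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import multivariate_normal

sigma = 1.0                                                  # known noise standard deviation (assumed)
rng = np.random.default_rng(1)
X = rng.normal(loc=[0.5, 0.5], scale=sigma, size=(10, 2))    # N = 10 synthetic datapoints

mu1 = np.array([0.0, 0.0])                                   # M1: a single Gaussian (zero-parameter model)
a, b = np.array([-1.0, 1.0]), np.array([1.0, 1.0])           # M2: Gaussian centers along a segment
L = np.linalg.norm(b - a)

def log_lik(center):
    return multivariate_normal(mean=center, cov=sigma**2 * np.eye(2)).logpdf(X).sum()

# Exact log evidence of M2: integrate the likelihood over the segment,
# with a uniform prior density 1/L on the arc-length parameter.
def integrand(s):
    center = a + (s / L) * (b - a)
    return np.exp(log_lik(center)) / L

log_evidence_m2 = np.log(quad(integrand, 0.0, L)[0])
log_evidence_m1 = log_lik(mu1)
exact_lpr = log_evidence_m2 - log_evidence_m1   # to be compared against the FIA approximation
```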
Posterior predictive check for human performance on the generative task.
We drew 240 samples from the posterior over model parameters by thinning the MCMC chains used for model inference. For each of these samples, we ran a simulation of the experiment using the actual stimuli shown to the participants and recorded the resulting performance of all 202 simulated participants. This procedure yielded 240 samples of the joint posterior-predictive distribution of task performance over all participants. To visualize this distribution, for each participant we plotted a cloud of 240 dots, where the y coordinate of each dot is the simulated performance of that participant in one of the simulations, and the x coordinate is the true performance of that participant in the experiment plus a small random jitter (for ease of visualization). The gray line is the identity, showing that our inference procedure captures well the behavioral patterns in the experimental data. Colors indicate different task types, as indicated.
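A sketch of this posterior predictive check (the function and variable names, including simulate_participant, are illustrative placeholders, not the actual implementation):

```python
import numpy as np

def posterior_predictive_check(posterior_samples, stimuli, simulate_participant, n_draws=240):
    """For each thinned posterior draw, re-simulate every participant on the
    actual stimuli and record the resulting task performance."""
    rng = np.random.default_rng(0)
    thinned = posterior_samples[:: max(1, len(posterior_samples) // n_draws)][:n_draws]
    # performance[i, s]: simulated accuracy of participant s under posterior draw i
    performance = np.array([
        [simulate_participant(params, stimuli[s], rng) for s in range(len(stimuli))]
        for params in thinned
    ])
    return performance
```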
Participant-level relative sensitivities to the geometric features that determine model complexity.
Dots with error bars: posterior mean ± standard deviation of the relative sensitivity (the dots are the same as in Figure 3c). For ease of visualization, participants are ranked based on their posterior mean.
Posterior mean ± standard deviation for population-level parameters.
See Equation A106 to Equation A114 for the precise definition of each parameter and its role in the hierarchical model of behavior.
WAIC comparison of the full model and the likelihood-only model for human performance on the generative task, reported in the standard format used by McElreath (2016, section 6.4.2).
WAIC is the value of the criterion (log-score scale, where higher is better), pWAIC is the estimated effective number of parameters, dWAIC is the difference between the WAIC of the given model and the highest-ranked one, SE is the standard error of the WAIC estimate, and dSE is the standard error of the difference in WAIC. These estimates were produced with the compare function provided by ArviZ (Kumar et al., 2019), using 12 MCMC chains with 10,000 samples each for each model (in total, 120,000 samples for each model).
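A sketch of this comparison with ArviZ's compare function (the object names are illustrative; each InferenceData object must include pointwise log-likelihood values):

```python
import arviz as az

# idata_full, idata_likelihood_only: InferenceData objects for the two models,
# each with 12 chains of 10,000 draws and pointwise log-likelihoods (assumed available)
comparison = az.compare(
    {"full": idata_full, "likelihood_only": idata_likelihood_only},
    ic="waic",
)
print(comparison)  # columns report WAIC, pWAIC, dWAIC, SE, and dSE (names vary across ArviZ versions)
```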
Same as Table B.6, for the maximum-likelihood task, where participants were asked to report the model that was closest to the data.
HDI vs ROPE comparison and Probability of Direction (PD) for the population-level parameters in the human experiments.
Lapse rate versus relative sensitivity to complexity across participants.
Each dot gives the posterior mean estimate of the relative sensitivity to one of the features that determine model complexity (abscissa) and the posterior mean estimate of the lapse rate (ordinate), as defined in Section A.6.1.
Model-free estimate of simplicity bias in the Noise-Integration-Noise (NIN) observer, as a function of the observer’s parameters, for each of the four task types (Dimensionality, Boundary, Volume, and Robustness; different columns show results for different task types).
The example in Figure 2b in the main text corresponds to the Dimensionality task type. Top: simplicity bias as a function of sensory noise ρ and choice noise T, when the integration parameter b is set to 0, meaning that the observer does not integrate over latent causes (see section A.1). Note that the grid of choice noise values tested in the figure is not equally spaced; the values of T shown here are [0.10, 0.50, 0.56, 0.63, 0.71, 0.83, 1.00, 2.00], which correspond to the following values for the inverse temperature 1/T: [10, 2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.5]. Bottom: simplicity bias as a function of the integration parameter b, with sensory noise and choice noise fixed to the values indicated in the top panels with the colored crosses. Note how integration is the only parameter that is associated with consistent changes in the simplicity bias for all task types (increasing integration increases the simplicity bias). Sensory noise has inconsistent effects across task types, and choice noise does not affect the simplicity bias.
Sensitivity to model likelihood and to the geometric features that characterize model complexity, for the Noise-Integration-Noise observer, as a function of the parameters of the observer.
The parameter values tested are the same as in Figure B.9. Top: dependence of the sensitivities on the sensory noise ρ and the choice noise T, when the integration parameter b is fixed to zero (meaning that the observer does not integrate over latent causes). Middle: dependence of the sensitivities on the integration parameter b, when sensory and choice noise are fixed to the values indicated by the colored crosses in the top panels. Bottom: same as middle, but for the normalized sensitivities, obtained by dividing the raw sensitivities by the likelihood sensitivity. Note that, reflecting the results in Figure B.9, the parameter controlling integration (x axis of each individual subplot) is the only one that has a consistent effect on the sensitivity to all features, generally increasing it. Note also the qualitative match between the bottom panels here and those in Figure B.9. The agreement between the results presented here and those in Figure B.9 further confirms that our theory-driven approach captures the intuitive notion of simplicity bias in this task.
Analysis of human behavior on the generative task, using the Noise-Integration-Noise (NIN) model.
a: Sensory noise (ρ) estimates for all participants, broken down by task type (colors). Arrow: standard error of the location of the centroid of the dot cloud that, on any trial, represented the data X shown to the participants (σ/√N, using the notation of sections A.3 and A.4.2). b: Estimates of the integration parameter b. c: Estimates of the inverse temperature of the choice noise, β = 1/T. Inset: detail of the inverse-temperature histogram for β ∈ [0, 1]. Arrow: numerical value of the population estimate of likelihood sensitivity from the FIA model. d: Simple participant-level model comparison (Akaike Information Criterion) between the NIN model and the behavioral model based on the FIA (Equation A106). Lower is better; the dashed diagonal line is the identity. Inset: histogram of the NIN−FIA difference, excluding outliers with large positive values, which are overwhelmingly better described by the FIA model. The AIC is lower (better) for the FIA model than for the NIN model for 182 out of 201 subjects.
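The participant-level comparison in panel d uses the standard Akaike Information Criterion; a minimal sketch of the computation, with hypothetical maximized log-likelihoods and parameter counts for illustration (the NIN parameter count assumes only ρ, b, and T are fitted; the FIA count is likewise an assumption):

```python
def aic(max_log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: lower values indicate a better model."""
    return 2 * n_params - 2 * max_log_likelihood

# Hypothetical per-participant maximized log-likelihoods for the two models
aic_nin = aic(max_log_likelihood=-120.3, n_params=3)  # NIN model: rho, b, T (assumed)
aic_fia = aic(max_log_likelihood=-112.8, n_params=6)  # FIA regression coefficients (assumed)
delta = aic_nin - aic_fia  # positive values favor the FIA model
```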
Analysis of human behavior on the maximum-likelihood task, using the Noise-Integration-Noise (NIN) model.
Same as Figure B.11, but for the behavioral data of the subjects that performed the maximum-likelihood task. In panel d, the AIC is lower (better) for the FIA than for the NIN model for 144 out of 201 subjects.