How Occam’s razor guides human decision-making

  1. International School for Advanced Studies (SISSA), Trieste, Italy
  2. University of Pennsylvania, Philadelphia, United States
  3. PhD Program in Neuroscience, Harvard University, Boston, United States
  4. Santa Fe Institute, Santa Fe, United States
  5. Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Oxford, United Kingdom

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Timothy Hanks
    University of California, Davis, Davis, United States of America
  • Senior Editor
    Michael Frank
    Brown University, Providence, United States of America

Reviewer #1 (Public review):

I have to preface my evaluation with a disclosure that I lack the mathematical expertise to fully assess what seems to be the authors' main theoretical contribution. I am providing this assessment to the best of my ability, but I cannot substitute for a reviewer with more advanced mathematical/physical training.

Summary:

This paper describes a new theoretical framework for measuring parsimony preferences in human judgments. The authors derive four metrics that they associate with parsimony (dimensionality, boundary, volume, and robustness) and measure whether human adults are sensitive to these metrics. In two tasks, adults had to choose which of two flower beds a statistical sample was generated from, with or without explicit instruction to choose the flower bed perceptually closest to the sample. The authors conduct extensive statistical analyses showing that humans are sensitive to most of the derived quantities, even when the instructions encouraged participants to choose based only on perceptual distance. The authors complement their study with a computational neural network model that learns to make judgments about the same stimuli with feedback. They show that the computational model is sensitive to the tasks communicated by feedback and uses the parsimony-associated metrics only when feedback trains it to do so.

Strengths:

(1) The paper derives and applies new mathematical quantities associated with parsimony. The mathematical rigor is very impressive and is much more extensive than in most other work in the field, where studies often adopt only one metric (such as the number of causes or parameters). These formal metrics can be very useful for the field.

(2) The studies are preregistered, and the statistical analyses are strong.

(3) The computational model complements the behavioral findings, showing that the derived quantities are not simply equivalent to maximum-likelihood inference in the task.

(4) The speculations in the discussion section (e.g., the idea that human sensitivity is driven by the computational demands each metric requires) are intriguing and could usefully guide future work.

Weaknesses:

(1) The paper is very hard to understand. Many of the key details of the derived metrics are in the appendix, with very little accessible explanation in the main text. The figures helped me understand the metrics somewhat, although I am still not sure how some of them (such as boundary or robustness as measured here) are linked to parsimony. I understand that this is addressed by the derivations in the appendix, but as a computational cognitive scientist, I would have benefited from more accessible explanations. Important aspects of the human studies are also missing from the main text, such as the sample size for Experiment 2.

(2) It is not fully clear whether the sensitivity of human participants to some of the quantities convincingly reported here actually means that participants preferred shapes according to the corresponding aspect of parsimony. The title and framing suggest that parsimony "guides" human decision-making, which may lead readers to conclude that humans prefer more parsimonious shapes. I am not sure the sensitivity findings alone support this framing, but it might just be my misunderstanding of the analyses.

(3) The stimulus set included only four combinations of shapes, each designed to diagnostically target one of the theoretical quantities. It is unclear whether the results are robust or specific to these particular four stimuli.

(4) The study is framed as measuring "decision-making," but the task resembles statistical inference (e.g., which shape generated the data) or perceptual judgment. This is a minor point since "decision-making" is not well defined in the literature, yet the current framing in the title gave me the initial impression that humans would be making preference choices and learning about them over time with feedback.

Reviewer #2 (Public review):

This manuscript presents a sophisticated investigation into the computational mechanisms underlying human decision-making, and it presents evidence for a preference for simpler explanations (Occam's razor). The authors dissect the simplicity bias into four different components, and they design experiments to target each of them by presenting choices whose underlying models differ only in one of these components. In the learning tasks, participants must infer a "law" (a logical rule) from observed data in a way that operationalizes the process of scientific reasoning in a controlled laboratory setting. The tasks are complex enough to be engaging but simple enough to allow for precise computational modeling.

As a further novel feature, the authors derive an additional term in the expansion of the log-evidence, which arises from boundary effects. This is combined with a choice model, which is the one tested in the experiments. Experiments are run both with humans and with artificial intelligence agents, showing that humans have a stronger preference for simplicity than artificial neural networks.

Overall, the work is well written, interesting, and timely, bridging concepts in statistical inference and human decision making. Although technical details are rather elaborate, my understanding is that they represent the state of the art.

I have only one main comment that I think deserves further discussion. Computing the complexity penalty of models may be hard, and it is unlikely that humans can perform such a calculation on the fly. As the authors discuss in the final section, while the dimensionality term may be easier to compute, others (e.g., the volume term, which requires an integral) may be considerably harder (it is true that they only need to be computed once for each task, but still...). I wonder how different the sensitivity of human decision-making is to the different terms, and in particular whether it aligns with computational simplicity, or with the possibility of approximating each term by simple heuristics. Indeed, the sensitivity to the volume term is significantly and systematically lower than that to the other terms. I wonder whether this relation could be made more quantitative using neural networks, with the number of samples needed to reach a given error level in learning each term serving as a proxy for computational hardness.

Reviewer #3 (Public review):

Summary:

This is a very interesting paper that documents how humans use a variety of factors that penalize model complexity and integrate over the possible set of parameters within each model. By comparison, trained neural networks also use these biases, but only on tasks where model selection was part of the reward structure. When training instead emphasized maximum-likelihood decisions, neural networks, but not humans, were able to adapt their decision-making: humans continued to use the simplicity biases associated with model integration.

Strengths:

This study used a pre-registered plan for analyzing the human data, which exceeds the standards of other current studies.

The results are technically correct.

Weaknesses:

The presentation of the results could be improved.

Author response:

Reviewer #1 (Public review)

I have to preface my evaluation with a disclosure that I lack the mathematical expertise to fully assess what seems to be the authors' main theoretical contribution. I am providing this assessment to the best of my ability, but I cannot substitute for a reviewer with more advanced mathematical/physical training.

Summary:

This paper describes a new theoretical framework for measuring parsimony preferences in human judgments. The authors derive four metrics that they associate with parsimony (dimensionality, boundary, volume, and robustness) and measure whether human adults are sensitive to these metrics. In two tasks, adults had to choose which of two flower beds a statistical sample was generated from, with or without explicit instruction to choose the flower bed perceptually closest to the sample. The authors conduct extensive statistical analyses showing that humans are sensitive to most of the derived quantities, even when the instructions encouraged participants to choose based only on perceptual distance. The authors complement their study with a computational neural network model that learns to make judgments about the same stimuli with feedback. They show that the computational model is sensitive to the tasks communicated by feedback and uses the parsimony-associated metrics only when feedback trains it to do so.

Strengths:

(1) The paper derives and applies new mathematical quantities associated with parsimony. The mathematical rigor is very impressive and is much more extensive than in most other work in the field, where studies often adopt only one metric (such as the number of causes or parameters). These formal metrics can be very useful for the field.

(2) The studies are preregistered, and the statistical analyses are strong.

(3) The computational model complements the behavioral findings, showing that the derived quantities are not simply equivalent to maximum-likelihood inference in the task.

(4) The speculations in the discussion section (e.g., the idea that human sensitivity is driven by the computational demands each metric requires) are intriguing and could usefully guide future work.

Weaknesses:

(1) The paper is very hard to understand. Many of the key details of the derived metrics are in the appendix, with very little accessible explanation in the main text. The figures helped me understand the metrics somewhat, although I am still not sure how some of them (such as boundary or robustness as measured here) are linked to parsimony. I understand that this is addressed by the derivations in the appendix, but as a computational cognitive scientist, I would have benefited from more accessible explanations. Important aspects of the human studies are also missing from the main text, such as the sample size for Experiment 2.

(2) It is not fully clear whether the sensitivity of human participants to some of the quantities convincingly reported here actually means that participants preferred shapes according to the corresponding aspect of parsimony. The title and framing suggest that parsimony "guides" human decision-making, which may lead readers to conclude that humans prefer more parsimonious shapes. I am not sure the sensitivity findings alone support this framing, but it might just be my misunderstanding of the analyses.

(3) The stimulus set included only four combinations of shapes, each designed to diagnostically target one of the theoretical quantities. It is unclear whether the results are robust or specific to these particular four stimuli.

(4) The study is framed as measuring "decision-making," but the task resembles statistical inference (e.g., which shape generated the data) or perceptual judgment. This is a minor point since "decision-making" is not well defined in the literature, yet the current framing in the title gave me the initial impression that humans would be making preference choices and learning about them over time with feedback.

We are grateful for the supportive comments highlighting the rigor of our experimental design and data analysis. The Reviewer lists four points under “weaknesses”, to which we reply below.

(1) The paper is very hard to understand

In the revised version of the paper, we will expand the main text to include a more detailed and intuitive description of the terms of the Fisher Information Approximation, in particular clarifying how the robustness and boundary terms relate to parsimony. We will also include details that are currently given only in the Methods, such as the sample size for the second experiment.
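
For reference, the standard form of this approximation reads as follows (in schematic notation that may differ from the paper's, and omitting the boundary term derived in this work):

-\log P(D \mid M) \;\approx\; -\log P(D \mid \hat\theta) \;+\; \frac{d}{2}\log\frac{N}{2\pi} \;+\; \log \int d\theta\, \sqrt{\det g(\theta)} \;+\; \frac{1}{2}\log\frac{\det h(\hat\theta)}{\det g(\hat\theta)}

Here the first term measures goodness of fit at the maximum-likelihood point \hat\theta, the term proportional to the number of parameters d is the dimensionality penalty, the integral of the Fisher metric g over the parameter space is the volume penalty, and the final term, which compares the observed information h at \hat\theta with the Fisher information g, is the robustness penalty.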

(2) Sensitivity of human participants

We do argue, and believe, that our data show that people tend to prefer simpler shapes. However, giving a well-posed definition of "preference" in this context turns out to be nontrivial.

At the very least, any statement such as "people prefer shape A over B" should be qualified with something like “when the distance of the data from both shapes is the same.” In other words, one should control for goodness of fit.

Even before making any reference to our behavioral model, this phenomenon (a preference for the simpler model when goodness of fit is matched between models) is visible in Figure 3a, where the effective decision boundary used by human participants is closer to the more complex model than the cyan line representing the locus of points with equal goodness of fit under the two models (or, equivalently, with the same Euclidean distance from the two shapes).

The goal of our theory and our behavioral model is precisely to systematize this sort of control, extending it beyond goodness of fit alone and allowing us to control simultaneously for multiple features of model complexity that may affect human behavior in different ways. In other words, it allows us not only to ask whether people prefer shape A over B after controlling for the distance of the data to the shapes, but also to understand to what extent this preference is driven by important geometrical features such as dimensionality, volume, curvature, and boundaries of the shapes. More specifically, and importantly, our theory makes it possible to measure the strength of the preference, rather than merely asserting its existence. In our modeling framework, the existence of a preference for simpler shapes is captured by the fact that the estimated sensitivities to the complexity penalties are positive (and, although they differ in magnitude, all are statistically reliable).
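
Schematically (in illustrative notation, not the exact specification used in the paper), the choice model takes the form of a logistic function of the difference in goodness of fit plus weighted differences in the complexity penalties:

P(\text{choose } A) \;=\; \sigma\!\left( \beta_{\mathrm{fit}}\,\Delta\mathrm{fit} \;+\; \beta_{\dim}\,\Delta\Psi_{\dim} \;+\; \beta_{\mathrm{bnd}}\,\Delta\Psi_{\mathrm{bnd}} \;+\; \beta_{\mathrm{vol}}\,\Delta\Psi_{\mathrm{vol}} \;+\; \beta_{\mathrm{rob}}\,\Delta\Psi_{\mathrm{rob}} \right)

where \sigma is the logistic function, \Delta\mathrm{fit} is the relative goodness of fit of shape A, each \Delta\Psi is the complexity penalty of shape B minus that of shape A, and the fitted sensitivities \beta quantify the strength of the preference: positive values for the complexity terms indicate a bias toward the simpler shape once goodness of fit is controlled for.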

(3) Generalization to different shapes

Thank you for bringing up this important topic. First, note that while dimensionality and volume are global properties of the models and take only two possible values in our human tasks, the boundary and robustness penalties depend on both the model and the data, and therefore take a continuum of values across the tasks (note also that the boundary penalty is relevant for all task types, not just the one designed specifically to study it, because all models except the zero-dimensional dot have boundaries). Our experimental setting is therefore less restrictive than it may seem, because it explores a range of possible values for two of the four model features. However, we agree that it would be interesting to repeat our experiment with a broader range of models, perhaps allowing their dimensionality and volume to vary more. In the same spirit, it would be interesting to study the dependence of human behavior on the amount of available data. We believe that these are all excellent ideas for further study that exceed the scope of the present paper. We will include these important points in a revised Discussion.

(4) Usage of “decision making” vs “perceptual judgment”

Thank you. We will clarify in the text that our usage of “decision making” overlaps with the idea of a perceptual judgment and that our experiments do not tackle the sequential aspects of repeated decisions.

Reviewer #2 (Public review):

This manuscript presents a sophisticated investigation into the computational mechanisms underlying human decision-making, and it presents evidence for a preference for simpler explanations (Occam's razor). The authors dissect the simplicity bias into four different components, and they design experiments to target each of them by presenting choices whose underlying models differ only in one of these components. In the learning tasks, participants must infer a "law" (a logical rule) from observed data in a way that operationalizes the process of scientific reasoning in a controlled laboratory setting. The tasks are complex enough to be engaging but simple enough to allow for precise computational modeling.

As a further novel feature, the authors derive an additional term in the expansion of the log-evidence, which arises from boundary effects. This is combined with a choice model, which is the one tested in the experiments. Experiments are run both with humans and with artificial intelligence agents, showing that humans have a stronger preference for simplicity than artificial neural networks.

Overall, the work is well written, interesting, and timely, bridging concepts in statistical inference and human decision making. Although technical details are rather elaborate, my understanding is that they represent the state of the art.

I have only one main comment that I think deserves further discussion. Computing the complexity penalty of models may be hard, and it is unlikely that humans can perform such a calculation on the fly. As the authors discuss in the final section, while the dimensionality term may be easier to compute, others (e.g., the volume term, which requires an integral) may be considerably harder (it is true that they only need to be computed once for each task, but still...). I wonder how different the sensitivity of human decision-making is to the different terms, and in particular whether it aligns with computational simplicity, or with the possibility of approximating each term by simple heuristics. Indeed, the sensitivity to the volume term is significantly and systematically lower than that to the other terms. I wonder whether this relation could be made more quantitative using neural networks, with the number of samples needed to reach a given error level in learning each term serving as a proxy for computational hardness.

Thank you. The computational complexity associated with calculating the different terms and its potential connection to human sensitivity to the terms is an intriguing topic. As we hinted at in the discussion, we agree with the reviewer that this is a natural candidate for further research, which likely deserves its own study and exceeds the scope of the present paper.

As a minor aside, at least for the present task the volume term may not be that hard to compute, because it can be expressed in terms of the number of distinguishable probability distributions contained in the model (Balasubramanian 1996). Given the nature of our task, where the noise is Gaussian, isotropic, and of known variance, the geometry of the model is simply the Euclidean geometry of the plane, and the volume term is the log of the length of the line representing the one-dimensional models, measured in units of the standard deviation of the noise.
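
Spelled out under these assumptions (isotropic Gaussian noise of known standard deviation \sigma in the plane, and a one-dimensional model of length L, in our own shorthand rather than the paper's notation), the Fisher metric along the line is 1/\sigma^2, so the volume term reduces to

\log \int d\theta\, \sqrt{\det g(\theta)} \;=\; \log \int_0^{L} \frac{d\theta}{\sigma} \;=\; \log\frac{L}{\sigma},

i.e., the log of the line's length in units of the noise standard deviation.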

Reviewer #3 (Public review):

Summary:

This is a very interesting paper that documents how humans use a variety of factors that penalize model complexity and integrate over the possible set of parameters within each model. By comparison, trained neural networks also use these biases, but only on tasks where model selection was part of the reward structure. When training instead emphasized maximum-likelihood decisions, neural networks, but not humans, were able to adapt their decision-making: humans continued to use the simplicity biases associated with model integration.

Strengths:

This study used a pre-registered plan for analyzing the human data, which exceeds the standards of other current studies.

The results are technically correct.

Weaknesses:

The presentation of the results could be improved.

We thank the reviewer for their appreciation of our experimental design and methodology, and for pointing out (in the separate "recommendations to authors") a few passages of the paper where the presentation could be improved. We will clarify these passages in the revision.
