Resource-rational account of sequential effects in human prediction
Abstract
An abundant literature reports on ‘sequential effects’ observed when humans make predictions on the basis of stochastic sequences of stimuli. Such sequential effects represent departures from an optimal, Bayesian process. A prominent explanation posits that humans are adapted to changing environments, and erroneously assume non-stationarity of the environment, even if the latter is static. As a result, their predictions fluctuate over time. We propose a different explanation in which suboptimal and fluctuating predictions result from cognitive constraints (or costs), under which humans nonetheless behave rationally. We devise a framework of costly inference, in which we develop two classes of models that differ by the nature of the constraints at play: in one case the precision of beliefs comes at a cost, resulting in an exponential forgetting of past observations, while in the other beliefs with high predictive power are favored. To compare model predictions to human behavior, we carry out a prediction task that uses binary random stimuli, with probabilities ranging from 0.05 to 0.95. Although in this task the environment is static and the Bayesian belief converges, subjects’ predictions fluctuate and are biased toward the recent stimulus history. Both classes of models capture this ‘attractive effect’, but they depart in their characterization of higher-order effects. Only the precision-cost model reproduces a ‘repulsive effect’, observed in the data, in which predictions are biased away from stimuli presented in more distant trials. Our experimental results reveal systematic modulations in sequential effects, which our theoretical approach accounts for in terms of rationality under cognitive constraints.
Editor's evaluation
This valuable work addresses a longstanding empirical puzzle from a new computational perspective. The authors provide convincing evidence that attractive and repulsive sequential effects in perceptual decisions may emerge from rational choices under cognitive resource constraints rather than adjustments to changing environments. It is relevant to understanding how people represent uncertain events in the world around them and make decisions, with broad applications to economic behavior.
https://doi.org/10.7554/eLife.81256.sa0

Introduction
In many situations of uncertainty, some outcomes are more probable than others. Knowing the probability distributions of the possible outcomes provides an edge that can be leveraged to improve and speed up decision making and perception (Summerfield and de Lange, 2014). In the case of choice reaction-time tasks, it was noted in the early 1950s that human reactions were faster when responding to a stimulus whose probability was higher (Hick, 1952; Hyman, 1953). In addition, faster responses were obtained after a repetition of a stimulus (i.e., when the same stimulus was presented twice in a row), even in the case of serially independent stimuli (i.e., when the preceding stimulus carried no information on subsequent ones; Hyman, 1953; Bertelson, 1965). The observation of this seemingly suboptimal behavior has motivated in the following decades a profuse literature on ‘sequential effects’, i.e., on the dependence of reaction times on the recent history of presented stimuli (Kornblum, 1967; Soetens et al., 1985; Cho et al., 2002; Yu and Cohen, 2008; Wilder et al., 2009; Jones et al., 2013; Zhang et al., 2014; Meyniel et al., 2016). These studies consistently report a recency effect whereby the more often a simple pattern of stimuli (e.g. a repetition) is observed in recent stimulus history, the faster subjects respond to it. In tasks in which subjects are asked to make predictions about sequences of random binary events, sequential effects are also observed and they have given rise since the 1950s to a rich literature (Jarvik, 1951; Edwards, 1961; McClelland and Hackenberg, 1978; Matthews and Sanders, 1984; Gilovich et al., 1985; Ayton and Fischer, 2004; Burns and Corpus, 2004; Croson and Sundali, 2005; Bar-Eli et al., 2006; Oskarsson et al., 2009; Plonsky et al., 2015; Plonsky and Erev, 2017; Gökaydin and Ejova, 2017).
Sequential effects are intriguing: why do subjects change their behavior as a function of the recent past observations when those are in fact irrelevant to the current decision? A common theoretical account is that humans infer the statistics of the stimuli presented to them, but because they usually live in environments that change over time, they may believe that the process generating the stimuli is subject to random changes even when it is in fact constant (Yu and Cohen, 2008; Wilder et al., 2009; Zhang et al., 2014; Meyniel et al., 2016). Consequently, they may rely excessively on the most recent stimuli to predict the next ones. In several studies, this was heuristically modeled as a ‘leaky integration’ of the stimuli, that is, an exponential discounting of past observations (Cho et al., 2002; Yu and Cohen, 2008; Wilder et al., 2009; Jones et al., 2013; Meyniel et al., 2016). Here, instead of positing that subjects hold an incorrect belief on the dynamics of the environment and do not learn that it is stationary, we propose a different account, whereby a cognitive constraint is hindering the inference process and preventing it from converging to the correct, constant belief about the unchanging statistics of the environment. This proposal calls for the investigation of the kinds of choice patterns and sequential effects that would result from different cognitive constraints at play during inference.
We derive a framework of constrained inference, in which a cost hinders the representation of belief distributions (posteriors). This approach is in line with a rich literature that views several perceptual and cognitive processes as resulting from a constrained optimization: the brain is assumed to operate optimally, but within some posited limits on its resources or abilities. The ‘efficient coding’ hypothesis in neuroscience (Ganguli and Simoncelli, 2016; Wei and Stocker, 2015; Wei and Stocker, 2017; Prat-Carrabin and Woodford, 2021c) and the ‘rational inattention’ models in economics (Sims, 2003; Woodford, 2009; Caplin et al., 2019; Gabaix, 2017; Azeredo da Silveira and Woodford, 2019; Azeredo da Silveira et al., 2020) are examples of this approach, which has been called ‘resource-rational analysis’ (Griffiths et al., 2015; Lieder and Griffiths, 2019). Here, we investigate the proposal that human inference is resource-rational, i.e., optimal under a cost. As for the nature of this cost, we consider two natural hypotheses: first, that a higher precision in belief is harder for subjects to achieve, and thus that more precise posteriors come with higher costs; and second, that unpredictable environments are difficult for subjects to represent, and thus that they entail higher costs. Under the first hypothesis, the cost is a function of the belief held, while under the second hypothesis the cost is a function of the inferred environment. We show that the precision cost predicts ‘leaky integration’: in the resulting inference process, remote observations are discarded. Crucially, beliefs do not converge but fluctuate instead with the recent stimulus history. By contrast, under the unpredictability cost, the inference process does converge, although not to the correct (Bayesian) posterior, but rather to a posterior that implies a biased belief on the temporal structure of the stimuli. In both cases, sequential effects emerge as the result of a constrained inference process.
We examine experimentally the degree to which the models derived from our framework account for human behavior, with a task in which we repeatedly ask subjects to predict the upcoming stimulus in sequences of Bernoulli-distributed stimuli. Most studies on sequential effects only consider the equiprobable case, in which the two stimuli have the same probability. However, the models we consider here are more general than this singular case and they apply to the entire range of stimulus probability. We thus manipulate in separate blocks of trials the stimulus generative probability (i.e., the Bernoulli probability that parameterizes the stimulus) to span the range from 0.05 to 0.95 by increments of 0.05. This enables us to examine in detail the behavior of subjects in a large gamut of environments, from the singular case of an equiprobable, maximally uncertain environment (with a probability of 0.5 for both stimuli) to the strongly biased, almost-certain environment in which one stimulus occurs with probability 0.95.
To anticipate our results, the predictions of subjects depend on the stimulus generative probability, but also on the history of stimuli. We examine whether the occurrence of a stimulus in past trials increases the proportion of predictions identical to this stimulus (‘attractive effect’), or whether it decreases this proportion (‘repulsive effect’). The two costs presented above reproduce qualitatively the main patterns in subjects’ data, but they make distinct predictions as to the modulations of the recency effect as a function of the history of stimuli, beyond the last stimulus. We show that the responses of subjects exhibit an elaborate, and at times counterintuitive, pattern of attractive and repulsive effects, and we compare these to the predictions of our models. Our results suggest that the brain infers a stimulus generative probability, but under a constraint on the precision of its internal representations; the inferred generative process may be more general than the actual one, and include higher-order statistics (e.g. transition probabilities), in contrast with the Bernoulli-distributed stimulus used in the experiment.
We present the behavioral task and we examine the predictions of subjects — in particular, how they vary with the stimulus generative probability, and how they depend, at each trial, on the preceding stimulus. We then introduce our framework of inference under constraint, and the two costs we consider, from which we derive two families of models. We examine the behavior of these models and the extent to which they capture the behavioral patterns of subjects. The models make different qualitative predictions about the sequential effects of past observations, which we test against the subjects’ data. We find that the predictions of subjects are qualitatively consistent with a model of inference of conditional probabilities, in which precise posteriors are costly.
Results
Subjects’ predictions of a stimulus increase with the stimulus probability
In a computer-based task, subjects are asked to predict which of two rods the lightning will strike. On each trial, the subject first selects by a key press the left- or right-hand-side rod presented on screen. A lightning symbol (which is here the stimulus) then randomly strikes either of the two rods. The trial is a success if the lightning strikes the rod selected by the subject (Figure 1a). The location of the lightning strike (left or right) is a Bernoulli random variable whose parameter $p$ (the stimulus generative probability) we manipulate across blocks of 200 trials: in each block, $p$ is a multiple of 0.05 chosen between 0.05 and 0.95. Changes of block are explicitly signaled to the subjects: each block is presented as a different town exposed to lightning strikes. The subjects are not told that the locations of the strikes are Bernoulli-distributed (in fact no information is given to them regarding how the locations are determined). Moreover, in order to capture the ‘stationary’ behavior of subjects, which presumably prevails after ample exposure to the stimulus, each block is preceded by 200 passive trials in which the stimuli (sampled with the probability chosen for the block) are successively shown with no action from the subject (Figure 1b); this is presented as a ‘useful track record’ of lightning strikes in the current town. (To verify the stationarity of subjects’ behavior, we compare their responses in the first and second halves of the 200 trials in which they are asked to make predictions. In most cases we find no significant differences. See Appendix.) We provide further details on the task in Methods.
The behavior of subjects varies with the stimulus generative probability, $p$. In our analyses, we are interested in how the subjects’ predictions of an event (left or right strike) vary with the probability of this event, regardless of its nature (left or right). Thus, for instance, we would like to pool together the trials in which a subject makes a rightward prediction when the probability of a rightward strike is 0.7, and the trials in which a subject makes a leftward prediction when the probability of a leftward strike is also 0.7. Therefore, throughout the paper, we do not discuss whether subjects predict ‘right’ or ‘left’, and instead we discuss whether they predict the event ‘A’ or the complementary event ‘B’: in different blocks of trials, A (and similarly B) may refer to different locations; but importantly, B always corresponds to the location opposite to A, and $p$ denotes the probability of A (thus B has probability $1-p$). This allows us, given a probability $p$, to pool together the responses obtained in blocks of trials in which one of the two locations has probability $p$. One advantage of this pooling is that it reduces the noise in the data. Looking at the unpooled data, however, does not change our conclusions; see Appendix.
Turning to the behavior of subjects, we denote by $\overline{p}(A)$ the proportion of trials in which a subject predicts the event A. In the equiprobable condition ($p=0.5$), the subjects predict either side on about half the trials ($\overline{p}(A)=0.496$, subjects pooled; standard error of the mean (sem): 0.008; p-value of t-test of equality with 0.5: 0.59). In the non-equiprobable conditions, the optimal behavior is to predict A on none of the trials ($\overline{p}(A)=0$) if $p<0.5$, or on all trials ($\overline{p}(A)=1$) if $p>0.5$. The proportion of predictions A adopted by the subjects also increases as a function of the stimulus generative probability (Pearson correlation coefficient between $p$ and $\overline{p}(A)$, subjects pooled: 0.97; p-value: 3.3e-6; correlation between the ‘logits’, $\ln\frac{p}{1-p}$: 0.994, p-value: 5.7e-9), but not as steeply: it lies between the stimulus generative probability $p$, and the optimal response 0 (if $p<0.5$) or 1 (if $p>0.5$; Figure 2a).
First-order sequential effects: attractive influence of the most recent stimulus on subjects’ predictions
The sequences presented to subjects correspond to independent, Bernoulli-distributed random events. Having shown that the subjects’ predictions follow (in a non-optimal fashion) the stimulus generative probability, we now test whether they also reflect the independence of consecutive trials featured by the Bernoulli process. Under this hypothesis and in the stationary regime, the proportion of predictions A conditional on the preceding stimulus being A, $\overline{p}(A\mid A)$, should be no different from the proportion of predictions A conditional on the preceding stimulus being B, $\overline{p}(A\mid B)$. (Here and below, $\overline{p}(X\mid Y)$ denotes the proportion of predictions X conditional on the preceding observation being Y, and not on the preceding response being Y. For the possibility that subjects’ responses depend on the preceding response, see Methods.)
In other words, conditioning on the preceding stimulus should have no effect. In subjects’ responses, however, these two conditional proportions are markedly different for all stimulus generative probabilities (Fisher exact test, subjects pooled: all p-values < 1e-10; Figure 2a). Both quantities increase as a function of the stimulus generative probability, but the proportions of predictions A conditional on an A are consistently greater than the proportions of predictions A conditional on a B, i.e., $\overline{p}(A\mid A)-\overline{p}(A\mid B)>0$ (Figure 2b). (We note that because the stimulus is either A or B, it follows that, symmetrically, the proportions of predictions B conditional on a B are consistently greater than the proportions of predictions B conditional on an A.) In other words, the preceding stimulus has an ‘attractive’ sequential effect. In addition, this attractive sequential effect seems stronger for values of the stimulus generative probability closer to the equiprobable case ($p=0.5$), and to decrease for more extreme values ($p$ closer to 0 or to 1; Figure 2b). The results in Figure 2 are obtained by pooling together the responses of the subjects. Results derived from an across-subjects analysis are very similar; see Appendix.
A framework of costly inference
The attractive effect of the preceding stimulus on subjects’ responses suggests that the subjects have not correctly inferred the Bernoulli statistics of the process generating the stimuli. We investigate the hypothesis that their ability to infer the underlying statistics of the stimuli is hampered by cognitive constraints. We assume that these constraints can be understood as a cost, bearing on the representation, by the brain, of the subject’s beliefs about the statistics. Specifically, we derive an array of models from a framework of inference under costly posteriors (Prat-Carrabin et al., 2021a), which we now present. We consider a model subject who is presented on each trial $t$ with a stimulus ${x}_{t}\in \{0,1\}$ (where 0 and 1 encode for B and A, respectively) and who uses the sequence of stimuli ${x}_{1:t}=({x}_{1},\dots ,{x}_{t})$ to infer the stimulus statistics, over which she holds the belief distribution ${\hat{P}}_{t}$. A Bayesian observer equipped with this belief ${\hat{P}}_{t}$ and receiving a new observation ${x}_{t+1}$ would obtain its updated belief ${P}_{t+1}$ through Bayes’ rule. However, a cognitive cost $C(P)$ hinders our model subject’s ability to represent probability distributions $P$. Thus, she approximates the posterior ${P}_{t+1}$ through another distribution ${\hat{P}}_{t+1}$ that minimizes a loss function $L$ defined as

$$L({\hat{P}}_{t+1}) = D({\hat{P}}_{t+1};{P}_{t+1}) + \lambda\, C({\hat{P}}_{t+1}), \tag{1}$$
where $D$ is a measure of distance between two probability distributions, and $\lambda \ge 0$ is a coefficient specifying the relative weight of the cost. (We are not proposing that subjects actively minimize this quantity, but rather that the brain’s inference process is an effective solution to this optimization problem.) Below, we use the Kullback-Leibler divergence for the distance (i.e. $D({\hat{P}}_{t+1};{P}_{t+1})={D}_{KL}({\hat{P}}_{t+1}\parallel {P}_{t+1})$). If $\lambda =0$, the solution to this minimization problem is the Bayesian posterior; if $\lambda \ne 0$, the cost distorts the Bayesian solution in ways that depend on the form of the cost borne by the subject (we detail further below the two kinds of costs we investigate).
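To make this optimization concrete, the loss of Equation 1 can be evaluated numerically on a discretized belief grid. The sketch below is our own toy illustration (the grid, the example posterior, and the value $\lambda=0.5$ are arbitrary choices, not the authors' code); it uses the negative-entropy (precision-type) cost discussed in the following sections, and checks that a 'flattened' posterior achieves a lower loss than the Bayesian posterior itself:

```python
import numpy as np

def loss(p_hat, p_bayes, lam, cost):
    """Discretized loss L(P_hat) = D_KL(P_hat || P_bayes) + lam * C(P_hat)."""
    kl = np.sum(p_hat * np.log(p_hat / p_bayes))
    return kl + lam * cost(p_hat)

def neg_entropy(p_hat):
    # Precision-type cost: larger for more concentrated (precise) beliefs.
    return np.sum(p_hat * np.log(p_hat))

grid = np.linspace(0.01, 0.99, 99)         # discretized values of q
p_bayes = grid**7 * (1 - grid)**3           # toy Bayesian posterior (7 A's, 3 B's)
p_bayes /= p_bayes.sum()

lam = 0.5
p_flat = p_bayes**(1 / (1 + lam))           # a 'flattened' candidate posterior
p_flat /= p_flat.sum()

# Flattening trades a small KL divergence for a larger reduction in the
# cost, so the flattened posterior attains a strictly lower loss:
assert loss(p_flat, p_bayes, lam, neg_entropy) < loss(p_bayes, p_bayes, lam, neg_entropy)
```

On a discrete grid this flattened distribution is in fact the exact minimizer for the negative-entropy cost, which is why the comparison always comes out in its favor.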
In our framework, the subject assumes that the $m$ preceding stimuli (${x}_{t-m+1:t}$, with $m\ge 0$) and a vector of parameters $q$ jointly determine the distribution of the stimulus at trial $t+1$, $p({x}_{t+1}\mid {x}_{t-m+1:t},q)$. Although in our task the stimuli are Bernoulli-distributed (thus they do not depend on preceding stimuli) and a single parameter determines the probability of the outcomes (the stimulus generative probability), the subject may admit the possibility that more complex mechanisms govern the statistics of the stimuli, for example transition probabilities between consecutive stimuli. Therefore, the vector $q$ may contain more than one parameter, and the number $m$ of preceding stimuli assumed to influence the probability of the following stimulus, which we call the ‘Markov order’, may be greater than 0.
Below, we call ‘Bernoulli observer’ any model subject who assumes that the stimuli are Bernoulli-distributed ($m=0$); in this case the vector $q$ consists of a single parameter that determines the probability of observing A, which we also denote by $q$ for the sake of concision. The bias and variability in the inference of the Bernoulli observer are studied in Prat-Carrabin et al., 2021a. We call ‘Markov observer’ any model subject who posits that the probability of the stimulus depends on the preceding stimuli ($m>0$). In this case, the vector $q$ contains the ${2}^{m}$ conditional probabilities of observing A after observing each possible sequence of $m$ stimuli. For instance, with $m=1$ the vector $q$ is the pair of parameters $({q}_{A},{q}_{B})$ denoting the probabilities of observing a stimulus A after observing, respectively, a stimulus A and a stimulus B. In the absence of a cost, the belief over the parameter(s) eventually converges towards the parameter vector that is consistent with the generative Bernoulli statistics governing the stimulus (except if the prior precludes this parameter vector). Below, we assume a uniform prior.
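For concreteness, the sufficient statistics that an (unconstrained) Markov observer accumulates can be sketched as counts of each stimulus following each length-$m$ context. This is our own minimal illustration; the function name and the 0/1 coding of B/A follow the conventions above:

```python
from collections import defaultdict

def markov_counts(stimuli, m=1):
    """Counts of B (0) and A (1) following each length-m context of stimuli;
    with m=1 these estimate the pair (q_A, q_B) of transition probabilities."""
    after = defaultdict(lambda: [0, 0])   # context -> [count of B, count of A]
    for t in range(m, len(stimuli)):
        ctx = tuple(stimuli[t - m:t])
        after[ctx][stimuli[t]] += 1
    return dict(after)

counts = markov_counts([1, 1, 0, 1, 0, 0, 1, 1], m=1)
# e.g. an estimate of q_A is counts[(1,)][1] / sum(counts[(1,)])
```

With $m=2$ or $m=3$ the dictionary simply grows to the ${2}^{m}$ possible contexts, matching the dimension of the parameter vector $q$.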
To understand how the costs contort the inference process, it is useful to have in mind the solution to the ‘unconstrained’ inference problem (with $\lambda =0$), i.e., the Bayesian posterior, which we denote by ${P}_{t}^{\ast}(q)$. In the case of a Bernoulli observer ($m=0$), after $t$ trials, the Bayesian posterior is a Beta distribution,

$${P}_{t}^{\ast}(q) \propto {q}^{{n}_{t}^{A}}{(1-q)}^{{n}_{t}^{B}}, \tag{2}$$
where ${n}_{t}^{X}$ is the number of stimuli $X$ observed up to trial $t$, that is, ${n}_{t}^{A}=\sum_{i=1}^{t}{x}_{i}$, and ${n}_{t}^{B}=\sum_{i=1}^{t}(1-{x}_{i})$. As more evidence is accumulated, the Bayesian posterior gradually narrows and converges towards the value of the stimulus generative probability (Figure 3c and d, grey lines).
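On a discretized grid of values of $q$, this posterior is straightforward to compute. The sketch below is our own toy code (the grid, block length, and random seed are illustrative assumptions); it shows that the posterior peaks at the empirical proportion of A's:

```python
import numpy as np

def bayesian_posterior(stimuli, grid):
    """Bayesian posterior over q for a Bernoulli observer with a uniform
    prior: proportional to q**n_A * (1-q)**n_B (Equation 2), normalized."""
    n_a = int(np.sum(stimuli))            # number of A's (coded as 1)
    n_b = len(stimuli) - n_a              # number of B's (coded as 0)
    post = grid**n_a * (1.0 - grid)**n_b
    return post / post.sum()

grid = np.linspace(0.001, 0.999, 999)     # discretized values of q
rng = np.random.default_rng(0)
stimuli = (rng.random(200) < 0.7).astype(int)   # one block with p = 0.7
post = bayesian_posterior(stimuli, grid)
# The posterior peaks at the empirical proportion of A's (near 0.7 here)
# and narrows as more stimuli are observed.
```

Running the same computation with longer stimulus sequences shows the narrowing described in the text: the mass concentrates ever more tightly around $p$.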
The ways in which the Bayesian posterior is distorted, in our models, depend on the nature of the cost that weighs on the inference process. Although many assumptions could be made on the kind of constraint that hinders human inference, and on the cost it would entail in our framework, here we examine two costs that stem from two possible principles: that the cost is a function of the beliefs held by the subject, or that it is a function of the environment that the subject is inferring. We detail, below, these two costs.
Precision cost
A first hypothesis about the inference process of subjects is that the brain mobilizes resources to represent probability distributions, and that more ‘precise’ distributions require more resources. We write the cost associated with a distribution, $\hat{P}(q)$, as the negative of its entropy,

$$C(\hat{P}) = -H(\hat{P}) = \int \hat{P}(q)\ln \hat{P}(q)\,\mathrm{d}q, \tag{3}$$
which is a measure of the amount of certainty in the distribution. Wider (less concentrated) distributions provide less information about the probability parameter and are thus less costly than narrower (more concentrated) distributions (Figure 3b). As an extreme case, the uniform distribution is the least costly.
With this cost, the loss function (Equation 1) is minimized by the distribution equal to the product of the prior and the likelihood, raised to the exponent $1/(\lambda +1)$, and normalized, i.e.,

$${\hat{P}}_{t+1}(q) \propto {\left[{\hat{P}}_{t}(q)\,p({x}_{t+1}\mid {x}_{t-m+1:t},q)\right]}^{1/(\lambda +1)}. \tag{4}$$
When $\lambda$ is strictly positive, the exponent is positive and lower than 1. As a result, the solution ‘flattens’ the Bayesian posterior, and in the extreme case of an unbounded cost ($\lambda \to \mathrm{\infty}$) the posterior is the uniform distribution.
Furthermore, in the expression of our model subject’s posterior, the likelihood $p({x}_{t+1}\mid {x}_{t-m+1:t},q)$ is raised, after $k$ additional trials, to the exponent $1/(\lambda +1)^{k+1}$; it thus decays to zero as the number $k$ of new stimuli increases. One can interpret this effect as a gradual forgetting of past observations. Specifically, we recover the predictions of leaky-integration models, in which remote patterns in the sequence of stimuli are discounted through an exponential filter (Yu and Cohen, 2008; Meyniel et al., 2016); here, we do not posit the gradual forgetting of remote observations, but instead we derive it as an optimal solution to a problem of constrained inference. We illustrate leaky integration in the case of a Bernoulli observer ($m=0$): in this case, the posterior after $t$ trials, ${\hat{P}}_{t}(q)$, is a Beta distribution,

$${\hat{P}}_{t}(q) \propto {q}^{{\tilde{n}}_{t}^{A}}{(1-q)}^{{\tilde{n}}_{t}^{B}}, \tag{5}$$
where ${\tilde{n}}_{t}^{A}$ and ${\tilde{n}}_{t}^{B}$ are exponentially filtered counts of the number of stimuli A and B observed up to trial $t$, i.e.,

$${\tilde{n}}_{t}^{A}=\sum_{i=1}^{t}{\left(\frac{1}{\lambda +1}\right)}^{t-i+1}{x}_{i}, \qquad {\tilde{n}}_{t}^{B}=\sum_{i=1}^{t}{\left(\frac{1}{\lambda +1}\right)}^{t-i+1}(1-{x}_{i}). \tag{6}$$
In other words, the solution to the constrained inference problem, with the precision cost, is similar to the Bayesian posterior (Equation 2), but with counts of the two stimuli that gradually ‘forget’ remote observations (in the absence of a cost, that is, $\lambda =0$, we have ${\tilde{n}}_{t}^{A}={n}_{t}^{A}$ and ${\tilde{n}}_{t}^{B}={n}_{t}^{B}$, and thus we recover the Bayesian posterior). As a result, these counts fluctuate with the recent history of the stimuli. Consequently, the posterior ${\hat{P}}_{t}(q)$ is dominated by the recent stimuli: it does not converge, but instead fluctuates with the recent stimulus history (Figure 3c and d, purple lines; compare with the green and gray lines). Hence, this model implies predictions about subsequent stimuli that depend on the stimulus history, i.e., it predicts sequential effects.
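The exponentially filtered counts admit a simple recursive update: at each trial, the running counts are scaled by $1/(\lambda+1)$ before the new stimulus is added. The sketch below is our own illustration (variable names, the seed, and $\lambda=0.2$ are arbitrary); it also makes visible why the belief never sharpens indefinitely, since the total evidence is bounded by $1/\lambda$:

```python
import numpy as np

def leaky_counts(stimuli, lam):
    """Recursive form of the exponentially filtered counts: each past
    stimulus x_{t-k} ends up weighted by (1/(lam+1))**(k+1)."""
    gamma = 1.0 / (lam + 1.0)
    n_a = n_b = 0.0
    for x in stimuli:                     # x in {0, 1}; 1 codes a stimulus A
        n_a = gamma * (n_a + x)
        n_b = gamma * (n_b + (1 - x))
    return n_a, n_b

rng = np.random.default_rng(1)
stimuli = (rng.random(400) < 0.7).astype(int)
n_a, n_b = leaky_counts(stimuli, lam=0.2)
# The total evidence n_a + n_b stays below 1/lam (here, 5), so the belief
# keeps tracking the recent stimuli instead of converging.
```

With $\lambda=0$ the decay factor is 1 and the plain Bayesian counts are recovered, consistent with the limiting case discussed above.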
Unpredictability cost
A different hypothesis is that the subjects favor, in their inference, parameter vectors $q$ that correspond to more predictable outcomes. We quantify the outcome unpredictability by the Shannon entropy (Shannon, 1948) of the outcome implied by the vector of parameters $q$, which we denote by $H(X;q)$. (In the Bernoulli-observer case, $H(X;q)=-q\ln q-(1-q)\ln(1-q)$; for the Markov-observer cases, see Methods.) The cost associated with the distribution $\hat{P}(q)$ is the expectation of this quantity averaged over beliefs, i.e.,

$$C(\hat{P}) = \int \hat{P}(q)\,H(X;q)\,\mathrm{d}q, \tag{7}$$
which we call the ‘unpredictability cost’. For a Bernoulli observer, a posterior concentrated on extreme values of the Bernoulli parameter (toward 0 or 1), thus representing more predictable environments, comes with a lower cost than a posterior concentrated on values of the Bernoulli parameter close to 0.5, which correspond to the most unpredictable environments (Figure 3a).
After $t$ trials, the loss function (Equation 1) under this cost is minimized by the posterior

$${\hat{P}}_{t}(q) \propto {P}_{t}^{\ast}(q)\,{e}^{-\lambda t H(X;q)}, \tag{8}$$
i.e., the product of the Bayesian posterior, which narrows with $t$ around the stimulus generative probability, and of a function that is larger for values of $q$ that imply less entropic (i.e. more predictable) environments (see Methods). In short, with the unpredictability cost the model subject’s posterior is ‘pushed’ towards less entropic values of $q$.
In the Bernoulli case ($m=0$), the posterior after $t$ stimuli has a global maximum, ${q}^{\ast}({n}_{t}^{A}/t)$, that depends on the proportion ${n}_{t}^{A}/t$ of stimuli A observed up to trial $t$. As the number of presented stimuli $t$ grows, the posterior ${\hat{P}}_{t}$ becomes concentrated around this maximum. The proportion ${n}_{t}^{A}/t$ naturally converges to the stimulus generative probability, $p$, thus our subject’s inference converges towards the value ${q}^{\ast}(p)$, which is different from the true value $p$ in the non-equiprobable case ($p\ne 0.5$). The equiprobable case ($p=0.5$) is singular, in that with a weak cost ($\lambda <1$) the inferred probability is unbiased (${q}^{\ast}(p)=0.5$), while with a strong cost ($\lambda >1$) the inferred probability does not converge but instead alternates between two values above and below 0.5; see Prat-Carrabin et al., 2021a. In other words, except in the equiprobable case, the inference converges but it is biased, i.e., the posterior peaks at an incorrect value of the stimulus generative probability (Figure 3c and d, green lines). This value is closer to the extremes (0 and 1) than the stimulus generative probability, that is, it implies an environment more predictable than the actual one (Figure 3d).
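The asymptotic bias can be illustrated numerically. The sketch below is our own toy code, under the assumption that the asymptotic posterior peaks where the average log-likelihood, penalized by $\lambda$ times the outcome entropy $H(X;q)$, is maximal (consistent with the convergence described above); the grid resolution and $\lambda$ values are arbitrary:

```python
import numpy as np

def q_star(p, lam):
    """Asymptotic inferred parameter under the unpredictability cost,
    computed on a grid as the maximizer of the average log-likelihood
    minus lam * H(X; q) (our numerical sketch of the model's fixed point)."""
    grid = np.linspace(0.001, 0.999, 9999)
    entropy = -grid * np.log(grid) - (1 - grid) * np.log(1 - grid)
    objective = p * np.log(grid) + (1 - p) * np.log(1 - grid) - lam * entropy
    return grid[np.argmax(objective)]

# With p = 0.7 the inferred value overshoots toward 1, implying an
# environment more predictable than the actual one; with p = 0.5 and a
# weak cost (lam < 1), the inferred value stays unbiased at 0.5.
q_star(0.7, lam=0.5)   # > 0.7
q_star(0.5, lam=0.5)   # close to 0.5
```

By symmetry, for $p<0.5$ the same computation undershoots toward 0, again in the direction of a more predictable environment.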
In the case of a Markov observer ($m>0$), the posterior also converges to a vector of parameters $q$ which implies not only a bias but also that the conditional probabilities of a stimulus A (conditioned on different stimulus histories) are not equal. The prediction of the next stimulus being A on a given trial depends on whether the preceding stimulus was A or B: this model therefore predicts sequential effects. We further examine below the behavior of this model in the cases of a Bernoulli observer and of different Markov observers. We refer the reader interested in more details on the Markov models, including their mathematical derivations, to the Methods section.
In short, with the unpredictability-cost models, when $p\ne 0.5$, the inference process converges to an asymptotic posterior peaked at ${q}^{\ast}(p)$, which does not itself depend on the history of the stimuli, but which is biased (Figure 3c, d, green lines). In particular, for Markov observers ($m>0$), the asymptotic posterior corresponds to an erroneous belief about the dependency of the stimulus on the recent stimulus history, which results in sequential effects in behavior.
Overview of the inference models
Although the two families of models derived from the two costs both potentially generate sequential effects, they do so by giving rise to qualitatively different inference processes. Under the unpredictability cost, the inference converges to a posterior that, in the Bernoulli case ($m=0$), implies a biased estimate of the stimulus generative probability (Figure 3d, green lines), while in the Markov case ($m>0$) it implies the belief that there are serial dependencies in the stimuli: predictions therefore depend on the recent stimulus history. By contrast, the precision cost prevents beliefs from converging (Figure 3c, purple lines). As a result, the subject’s predictions vary with the recent stimulus history (Figure 3d). This inference process amounts to an exponential discount of remote observations, or equivalently, to the overweighting of recent observations (Equation 6).
To investigate in more detail the sequential effects that these two costs produce, we implement two families of inference models derived from the two costs. Each model is characterized by the type of cost (unpredictability cost or precision cost), and by the assumed Markov order ($m$): we examine the case of a Bernoulli observer ($m=0$) and three cases of Markov observers (with $m=$ 1, 2, and 3). We thus obtain $2\times 4=8$ models of inference. Each of these models has one parameter $\lambda$ controlling the weight of the cost. (We also examine a ‘hybrid’ model that combines the two costs; see below.)
Response-selection strategy
We assume that the subject’s response on a given trial depends on the inferred posterior according to a generalization of ‘probability matching’ implemented in other studies (Battaglia et al., 2011; Yu and Huang, 2014; Prat-Carrabin et al., 2021b). In this response-selection strategy, the subject predicts A with the probability ${\overline{p}}_{t}^{\kappa}/({\overline{p}}_{t}^{\kappa}+(1-{\overline{p}}_{t}{)}^{\kappa})$, where ${\overline{p}}_{t}$ is the expected probability of a stimulus A derived from the posterior, i.e., ${\overline{p}}_{t}\equiv \int p({x}_{t+1}=1\mid {x}_{t-m+1:t},q){\hat{P}}_{t}(q)\,\mathrm{d}q$. The single parameter $\kappa$ controls the randomness of the response: with $\kappa =0$ the subject predicts A and B with equal probability; with $\kappa =1$ the response-selection strategy corresponds to probability matching, that is, the subject predicts A with probability ${\overline{p}}_{t}$; and as $\kappa$ increases toward infinity the choices become optimal, that is, the subject predicts A if the expected probability of observing a stimulus A, ${\overline{p}}_{t}$, is greater than 0.5, and predicts B if it is lower than 0.5 (if ${\overline{p}}_{t}=0.5$ the subject chooses A or B with equal probability). In our investigations, we also implement several other response-selection strategies, including one in which subjects have a propensity to repeat their preceding response, or conversely, to alternate; these analyses do not change our conclusions (see Methods).
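The response-selection rule itself is a one-line function of ${\overline{p}}_{t}$ and $\kappa$. The sketch below (our own illustration; the function name is ours) shows the three regimes described above:

```python
def p_predict_a(p_bar, kappa):
    """Generalized probability matching: probability of predicting A,
    given the expected probability p_bar of the next stimulus being A."""
    return p_bar**kappa / (p_bar**kappa + (1 - p_bar)**kappa)

p_predict_a(0.7, kappa=0)   # 0.5: fully random responses
p_predict_a(0.7, kappa=1)   # 0.7: probability matching
p_predict_a(0.7, kappa=9)   # close to 1: near-deterministic, optimal choice
```

Note that the rule is symmetric around ${\overline{p}}_{t}=0.5$: for any $\kappa$, a belief of exactly 0.5 yields equal probabilities for the two predictions.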
Model fitting favors Markov-observer models
Each of our eight models has two parameters: the factor weighting the cost, $\lambda$, and the exponent of the generalized probability matching, $\kappa$. We fit the parameters of each model to the responses of each subject, by maximizing their likelihoods. We find that 60% of subjects are best fitted by one of the unpredictability-cost models, while 40% are best fitted by one of the precision-cost models. When pooling the two types of cost, 65% of subjects are best fitted by a Markov-observer model. We implement a ‘Bayesian model selection’ procedure (Stephan et al., 2009), which takes into account, for each subject, the likelihoods of all the models (and not only the maximum among them) in order to obtain a Bayesian posterior over the distribution of models in the general population (see Methods). The derived expected probability of unpredictability-cost models is 57% (and 43% for precision-cost models), with an exceedance probability (i.e., the probability that unpredictability-cost models are more frequent in the general population) of 78%. The expected probability of Markov-observer models, regardless of the cost used in the model, is 70% (and 30% for Bernoulli-observer models), with an exceedance probability (i.e., the probability that Markov-observer models are more frequent in the general population) of 98%. These results indicate that the responses of subjects are generally consistent with a Markov-observer model, although the stimuli used in the experiment are Bernoulli-distributed. As for the unpredictability-cost and the precision-cost families of models, Bayesian model selection does not provide decisive evidence in favor of either, indicating that both capture some aspects of the responses of the subjects. Below, we examine more closely the behaviors of the models, and point to qualitative differences between the predictions resulting from each model family.
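A minimal sketch of the per-subject fitting step described above (maximum likelihood, here by grid search) might look as follows; `pbar_given_lambda` is a hypothetical stand-in for either inference model, mapping a cost weight to the per-trial expected probabilities of A:

```python
import numpy as np

def fit_subject(responses, pbar_given_lambda, lambdas, kappas):
    """Grid-search maximum-likelihood estimate of (lambda, kappa).

    responses: 0/1 array, 1 when the subject predicted A.
    pbar_given_lambda: function lambda -> per-trial expected prob. of A.
    Returns the best (lambda, kappa) pair and its log-likelihood.
    """
    responses = np.asarray(responses, dtype=float)
    best = (None, None, -np.inf)
    for lam in lambdas:
        p_bar = np.asarray(pbar_given_lambda(lam), dtype=float)
        for kap in kappas:
            num = p_bar ** kap
            p_A = num / (num + (1.0 - p_bar) ** kap)  # response-selection rule
            p_A = np.clip(p_A, 1e-12, 1.0 - 1e-12)
            ll = np.sum(responses * np.log(p_A)
                        + (1.0 - responses) * np.log(1.0 - p_A))
            if ll > best[2]:
                best = (lam, kap, ll)
    return best
```

In practice one would use a continuous optimizer rather than a grid, and compare models through their maximized likelihoods.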
Before turning to these results, we validate the robustness of our model-fitting procedure with several additional analyses. First, we estimate a confusion matrix to examine the possibility that the model-fitting procedure could misidentify the models which generated test sets of responses. We find that the best-fitting model corresponds to the true model in at least 70% of simulations (the chance level is 12.5%, i.e., 1/8 models), and actually more than 90% for the majority of models (see Appendix).
Second, we seek to verify whether the best-fitting cost factor, $\lambda$, that we obtain for each subject is consistent across the range of probabilities tested. Specifically, we fit the models separately to the responses obtained in the blocks of trials whose stimulus generative probability was ‘medium’ (between 0.3 and 0.7, inclusive) on the one hand, and to the responses obtained when the probability was ‘extreme’ (below 0.3, and above 0.7) on the other hand; and we compare the values of the best-fitting cost factors $\lambda$ in these two cases. More precisely, for the precision-cost family, we look at the inverse of the decay time, $\mathrm{ln}(1+\lambda )$, i.e., the inverse of the characteristic time over which the model subject ‘forgets’ past observations. With both families of models, we find that on a logarithmic scale the parameters in the medium- and extreme-probabilities cases are significantly correlated across subjects (Pearson’s $r$, precision-cost models: $r=0.75$, p-value: 1e-4; unpredictability-cost models: $r=0.47$, p-value: .036). In other words, if a subject is best fitted by a large cost factor in medium-probabilities trials, he or she is likely to be also best fitted by a large cost factor in extreme-probabilities trials. This indicates that our models capture idiosyncratic features of subjects that generalize across conditions instead of varying with the stimulus probability (see Appendix).
Third, as mentioned above, we examine a variant of the response-selection strategy in which the subject sometimes repeats the preceding response, or conversely alternates and chooses the other response, instead of responding based on the inferred probability of the next stimulus. This propensity to repeat or alternate does not change the best-fitting inference model of most subjects, and the best-fitting values of the parameters $\lambda$ and $\kappa$ are very stable whether or not this propensity is allowed. This analysis supports the results we present here, and speaks to the robustness of the model-fitting procedure (see Methods).
Finally, as the unpredictability-cost family and the precision-cost family of models both seem to capture the responses of a sizable share of the subjects, one might assume that the behavior of most subjects actually falls ‘somewhere in between’, and would be best accounted for by a hybrid model combining the two costs. In our investigations, we have implemented such a model, whereby the subject’s approximate posterior ${\hat{P}}_{t}$ results from the minimization of a loss function that includes both a precision cost, with weight ${\lambda}_{p}$, and an unpredictability cost, with weight ${\lambda}_{u}$ (and the response-selection strategy is the generalized probability matching, with parameter $\kappa$). We do not find that most subjects’ responses are better fitted (as measured by the Bayesian Information Criterion; Schwarz, 1978) by a combination of the two costs: instead, for more than two thirds of subjects, the best-fitting model features just one cost (see Methods). In other words, the two costs seem to capture different aspects of the behavior that are predominant in different subpopulations. Below, we examine the behavioral patterns resulting from each cost type, in comparison with the behavior of the subjects.
Models of costly inference reproduce the attractive effect of the most recent stimulus
We now examine the behavioral patterns resulting from the models. All the models we consider predict that the proportion of predictions A, $\overline{p}(A)$, is a smooth, increasing function of the stimulus generative probability (when $\lambda <\mathrm{\infty}$ and $0<\kappa <\mathrm{\infty}$; Figure 4a–d, grey lines); thus we focus, here, on the ability of the models to reproduce the subjects’ sequential effects. With the unpredictability-cost model of a Bernoulli observer ($m=0$), the belief of the model subject, as mentioned above, asymptotically converges in non-equiprobable cases to an erroneous value of the stimulus generative probability (Figure 3d, green lines). After a large number of observations (such as the 200 ‘passive’ trials, in our task), the sensitivity of the belief to new observations becomes almost imperceptible; as a result, this model predicts practically no sequential effects (Figure 4b), that is, $\overline{p}(A|A)\simeq \overline{p}(A|B)$. With the unpredictability-cost model of a Markov observer (e.g. $m=1$), the belief of the model subject also converges, but to a vector of parameters $q$ that implies a sequential dependency in the stimulus, that is, ${q}_{A}\ne {q}_{B}$, resulting in sequential effects in predictions, that is, $\overline{p}(A|A)\ne \overline{p}(A|B)$. The parameter vector $q$ yields a more predictable (less entropic) environment if the probability conditional on the more frequent outcome (say, A) is less entropic than the probability conditional on the less frequent outcome (B). This is the case if the former is greater than the latter, resulting in the inequality $\overline{p}(A|A)>\overline{p}(A|B)$, that is, in sequential effects of the attractive kind (Figure 4d). (The case in which B is the more frequent outcome results in the inequality $\overline{p}(B|B)>\overline{p}(B|A)$, i.e., $1-\overline{p}(A|B)>1-\overline{p}(A|A)$, i.e., the same, attractive sequential effects.)
Turning to the precision-cost models, we have noted that in these models the posterior fluctuates with the recent history of the stimuli (Figure 3c): as a result, sequential effects are obtained, even with a Bernoulli observer ($m=0$; Figure 4a). The most recent stimulus has the largest weight in the exponentially filtered counts that determine the posterior (Equation 6); thus the model subject’s prediction is biased towards the last stimulus, that is, the sequential effect is attractive ($\overline{p}(A|A)>\overline{p}(A|B)$). With the traditional probability-matching response-selection strategy (i.e. $\kappa =1$), the strength of the attractive effect is the same across all stimulus generative probabilities (i.e. the difference $\overline{p}(A|A)-\overline{p}(A|B)$ is constant; Figure 4a, dotted lines and light-red dots). With the generalized probability-matching response-selection strategy, if $\kappa >1$, proportions below and above 0.5 are brought closer to the extremes (0 and 1, respectively), resulting in larger sequential effects for values of the stimulus generative probability closer to 0.5 (Figure 4a, solid lines and red dots; the model is simulated with $\kappa =2.8$, a value representative of the subjects’ best-fitting values for this parameter). We also find stronger sequential effects closer to the equiprobable case in subjects’ data (Figure 2b).
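The exponential forgetting at play here can be sketched as follows, assuming a uniform Beta(1,1) prior and a per-trial decay factor $1/(1+\lambda )$ (consistent with the inverse decay time $\mathrm{ln}(1+\lambda )$ quoted earlier); the function name is ours:

```python
def leaky_bernoulli_pbar(stimuli, lam):
    """Expected probability of A for a precision-cost Bernoulli observer.

    Sketch assuming a uniform Beta(1,1) prior: past observations are
    counted with decay factor 1/(1+lam) per trial, and p_bar is the mean
    of the implied Beta posterior after each trial.
    """
    gamma = 1.0 / (1.0 + lam)  # per-trial decay of past counts
    n_A = n_B = 0.0
    p_bar = []
    for x in stimuli:          # x is 1 for stimulus A, 0 for B
        n_A = gamma * n_A + x
        n_B = gamma * n_B + (1 - x)
        p_bar.append((n_A + 1.0) / (n_A + n_B + 2.0))  # Beta posterior mean
    return p_bar
```

With $\lambda =0$ the counts are the exact Bayesian counts and the influence of any single stimulus vanishes asymptotically; with $\lambda >0$ the last stimulus keeps a fixed weight, producing the attractive effect.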
The precision-cost model of a Markov observer ($m=1$) also predicts attractive sequential effects (Figure 4c). While the behavior of the Bernoulli observer (with a precision cost) is determined by two exponentially filtered counts of the two possible stimuli (Equation 6), that of the Markov observer with $m=1$ depends on four exponentially filtered counts of the four possible pairs of stimuli. After observing a stimulus B, the belief that the following stimulus should be A or B is determined by the exponentially filtered counts of the pairs BA and BB. If $p$ is large, i.e., if the stimulus B is infrequent, then the BA and BB pairs are also infrequent and the corresponding counts are close to zero: the model subject thus behaves as if only very little evidence had been observed about the transitions B to A and B to B in this case, resulting in a proportion of predictions A conditional on a preceding B, $\overline{p}(A|B)$, close to 0.5 (Figure 4c, orange line). Consequently, the sequential effects are stronger for values of the stimulus generative probability closer to the extremes (Figure 4c, red dots).
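The same leaky counting extends to the Markov observer with $m=1$; the sketch below (our own illustrative implementation, again assuming a uniform prior on the conditional probabilities) keeps four decaying pair counts and shows why predictions after a rare stimulus B stay close to 0.5:

```python
def leaky_markov_pbar(stimuli, lam):
    """p_bar for a precision-cost Markov observer (m=1), as a sketch.

    Keeps four leaky counts of the pairs AA, AB, BA, BB (first symbol =
    previous stimulus); the prediction after a stimulus uses only the two
    counts conditioned on it, with a uniform prior on the conditional
    probability of A.
    """
    gamma = 1.0 / (1.0 + lam)
    counts = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    p_bar = []
    prev = None
    for x in stimuli:  # 1 codes stimulus A, 0 codes B
        if prev is not None:
            for pair in counts:
                counts[pair] *= gamma      # decay all pair counts
            counts[(prev, x)] += 1.0       # count the observed pair
            n_A, n_B = counts[(x, 1)], counts[(x, 0)]
            p_bar.append((n_A + 1.0) / (n_A + n_B + 2.0))
        prev = x
    return p_bar
```

In a sequence where B is rare, the BA and BB counts stay near zero, so predictions made right after a B hover near 0.5 while predictions after an A do not.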
Both families of costs are thus able to produce attractive sequential effects, albeit with some qualitative differences. (In Figure 4a–d we show the behaviors resulting from the two costs for a Bernoulli observer and a Markov observer of order $m=1$; the Markov observers of higher order exhibit qualitatively similar behaviors; see Methods.) As the model fitting indicates that different groups of subjects are best fitted by models belonging to the two families, we examine separately the behaviors of the subjects whose responses are best fitted by each of the two costs (Figure 4e and f), in comparison with the behaviors of the corresponding best-fitting models (Figure 4g and h). This provides a finer understanding of the behavior of subjects than the group average shown in Figure 2. For the subjects best fitted by precision-cost models, the proportion of predictions A, $\overline{p}(A)$, when the stimulus generative probability is close to 0.5, is a less steep function of this probability than for the subjects best fitted by unpredictability-cost models (Figure 4e and f, grey lines); furthermore, their sequential effects are larger (as measured by the difference $\overline{p}(A|A)-\overline{p}(A|B)$), and do not depend much on the stimulus generative probability (Figure 4e and f, red dots). The corresponding models reproduce the behavioral patterns of the subjects that they best fit (Figure 4g and h). Each family of models seems to capture specific behaviors exhibited by the subjects: when fitting the unpredictability-cost models to the responses of the subjects that are best fitted by precision-cost models, and conversely when fitting the precision-cost models to the responses of the subjects that are best fitted by unpredictability-cost models, the models do not reproduce well the subjects’ behavioral patterns (Figure 4i and j).
The precision-cost models, however, seem slightly better than the unpredictability-cost models at capturing the behavior of the subjects that they do not best fit (Figure 4, compare panel j to panel f, and panel i to panel e). Substantiating this observation, the examination of the distributions of the models’ BICs across subjects shows that when fitting the models onto the subjects that they do not best fit, the precision-cost models fare better than the unpredictability-cost models (see Appendix).
Beyond the most recent stimulus: patterns of higher-order sequential effects
Notwithstanding the quantitative differences just presented, both families of models yield qualitatively similar attractive sequential effects: the model subjects’ predictions are biased towards the preceding stimulus. Does this pattern also apply to the longer history of the stimulus, i.e., do more distant trials also influence the model subjects’ predictions? To investigate this hypothesis, we examine the difference between the proportion of predictions A after observing a sequence of length $n$ that starts with A, minus the proportion of predictions A after the same sequence, but starting with B, i.e., $\overline{p}(A|Ax)-\overline{p}(A|Bx)$, where $x$ is a sequence of length $n-1$, and $Ax$ and $Bx$ denote the same sequence preceded by A and by B. This quantity enables us to isolate the influence of the $n$-to-last stimulus on the current prediction. If the difference is positive, the effect is ‘attractive’; if it is negative, the effect is ‘repulsive’ (in this latter case, the presentation of an A decreases the probability that the subject predicts A in a later trial, as compared to the presentation of a B); and if the difference is zero there is no sequential effect stemming from the $n$-to-last stimulus. The case $n=1$ corresponds to the immediately preceding stimulus, whose effect we have shown to be attractive, i.e., $\overline{p}(A|A)-\overline{p}(A|B)>0$, in the responses both of the best-fitting models and of the subjects (Figures 2b, 4g and h).
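Concretely, this statistic can be estimated from a sequence of stimuli and responses as follows (a sketch; names are ours, with 1 coding A and 0 coding B, and `predictions[t]` the prediction made before stimulus `t` is revealed):

```python
def sequential_effect(stimuli, predictions, x):
    """Estimate p(predict A | A,x) - p(predict A | B,x).

    x is the tuple of the n-1 stimuli separating the probed n-to-last
    stimulus from the current trial (empty tuple for n=1). A positive
    value indicates an attractive effect, a negative one a repulsive
    effect.
    """
    n = len(x) + 1
    tallies = {1: [0, 0], 0: [0, 0]}  # probed stimulus -> [A-predictions, occurrences]
    for t in range(n, len(stimuli)):
        history = tuple(stimuli[t - n:t])
        if history[1:] == tuple(x):
            tallies[history[0]][0] += predictions[t]
            tallies[history[0]][1] += 1
    if tallies[1][1] == 0 or tallies[0][1] == 0:
        return None  # one of the two sequences never occurred
    return tallies[1][0] / tallies[1][1] - tallies[0][0] / tallies[0][1]
```

For a toy subject who always repeats the last stimulus, the statistic equals 1 for $n=1$ (a maximal attractive effect of the last stimulus) and 0 for any longer history, since earlier stimuli carry no influence.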
We investigate the effect of the $n$-to-last stimulus on the behavior of the two families of models, with $n=1$, $2$, and $3$. We present here the main results of this investigation; we refer the reader to Methods for a more detailed analysis. With unpredictability-cost models of Markov order $m$, there are non-vanishing sequential effects stemming from the $n$-to-last stimulus only if the Markov order is greater than or equal to the distance from this stimulus to the current trial, i.e., if $m\ge n$. In this case, the sequential effects are attractive (Figure 5).
With precision-cost models, the $n$-to-last stimuli yield non-vanishing sequential effects regardless of the Markov order, $m$. With $n=1$, the effect is attractive, i.e., $\overline{p}(A|A)-\overline{p}(A|B)>0$. With $n=2$ (second-to-last stimulus), the effect is also attractive, i.e., in the case of the pair of sequences AA and BA, $\overline{p}(A|AA)-\overline{p}(A|BA)>0$ (Figure 5a). By symmetry, the difference is also positive for the other pair of relevant sequences, AB and BB (e.g. we note that $\overline{p}(A|AB)=1-\overline{p}(B|AB)$, and that $\overline{p}(B|AB)$ when the probability of A is $p$ is equal to $\overline{p}(A|BA)$ when the probability of A is $1-p$. We detail in Methods such relations between the proportions of predictions A or B in different situations. These relations result in the symmetries of Figure 2, for the sequential effect of the last stimulus, while for higher-order sequential effects they imply that we do not need to show, in Figure 5, the effects following all possible past sequences of two or three stimuli, as the ones we do not show are readily derived from the ones we do.)
As for the third-to-last stimulus ($n=3$), it can be followed by four different sequences of length two, but we only need to examine two of these four, for the reasons just presented. We find that for the precision-cost models, with all the Markov orders we examine (from 0 to 3), the probability of predicting A after observing the sequence AAA is greater than that after observing the sequence BAA, i.e., $\overline{p}(A|AAA)-\overline{p}(A|BAA)>0$; that is, there is an attractive sequential effect of the third-to-last stimulus if the sequence following it is AA (and, by symmetry, if it is BB; Figure 5b). So far, thus, we have found only attractive effects. However, the results are less straightforward when the third-to-last stimulus is followed by the sequence BA. In this case, for a Bernoulli observer ($m=0$), the effect is also attractive: $\overline{p}(A|ABA)-\overline{p}(A|BBA)>0$ (Figure 5c, white circles). With Markov observers ($m\ge 1$), over a range of stimulus generative probabilities $p$, the effect is repulsive: $\overline{p}(A|ABA)-\overline{p}(A|BBA)<0$, that is, the presentation of an A decreases the probability that the model subject predicts A three trials later, as compared to the presentation of a B (Figure 5c, filled circles). The occurrence of the repulsive effect in this particular case is a distinctive trait of the precision-cost models of Markov observers ($m\ge 1$); we do not obtain any repulsive effect with any of the unpredictability-cost models, nor with the precision-cost model of a Bernoulli observer ($m=0$).
Subjects’ predictions exhibit higher-order repulsive effects
We now examine the sequential effects in subjects’ responses, beyond the attractive effect of the preceding stimulus ($n=1$; discussed above). With $n=2$ (second-to-last stimulus), for the majority of the 19 stimulus generative probabilities $p$, we find attractive sequential effects: the difference $\overline{p}(A|AA)-\overline{p}(A|BA)$ is significantly positive (Figure 6a; p-values <0.01 for 11 stimulus generative probabilities, <0.05 for 13 probabilities; subjects pooled). With $n=3$ (third-to-last stimulus), we also find significant attractive sequential effects in subjects’ responses for some of the stimulus generative probabilities, when the third-to-last stimulus is followed by the sequence AA (Figure 6b; p-values <0.01 for four probabilities, <0.05 for seven probabilities). When it is instead followed by the sequence BA, we find that for eight stimulus generative probabilities, all between 0.25 and 0.75, there is a significant repulsive sequential effect: $\overline{p}(A|ABA)-\overline{p}(A|BBA)<0$ (p-values <0.01 for six probabilities, <0.05 for eight probabilities; subjects pooled). Thus, in these cases, the occurrence of A as the third-to-last stimulus increases (in comparison with the occurrence of a B) the proportion of the opposite prediction, B. For the remaining stimulus generative probabilities, this difference is in most cases also negative, although not significantly different from zero (Figure 6c). (An across-subjects analysis yields similar results; see Supplementary Materials.) Figure 6d summarizes subjects’ sequential effects, and exhibits the attractive and repulsive sequential effects in their responses (compare solid and dotted lines). (In this tree-like representation, we show averages across the stimulus generative probabilities; a figure with the individual ‘trees’ for each probability is provided in the Appendix.)
The repulsive sequential effect of the third-to-last stimulus in subjects’ predictions only occurs when the third-to-last stimulus is A followed by the sequence BA. It is also only in this case that the repulsive effect appears with the precision-cost models of a Markov observer (while it never appears with the unpredictability-cost models). This qualitative difference suggests that the precision-cost models offer a better account of sequential effects in subjects. However, model-fitting onto the overall behavior presented above showed that a fraction of the subjects is better fitted by the unpredictability-cost models. We investigate, thus, the presence of a repulsive effect in the predictions of the subjects best fitted by the precision-cost models, and of those best fitted by the unpredictability-cost models. For the subjects best fitted by the precision-cost models, we find (expectedly) that there is a significant repulsive sequential effect of the third-to-last stimulus ($\overline{p}(A|ABA)-\overline{p}(A|BBA)<0$; p-values <0.01 for two probabilities, <0.05 for four probabilities; subjects pooled; Figure 6e, left panel). For the subjects best fitted by the unpredictability-cost models (a family of models that does not predict any repulsive sequential effects), we also find, perhaps surprisingly, a significant repulsive effect of the third-to-last stimulus (p-values <0.01 for three probabilities, <0.05 for five probabilities; subjects pooled), which demonstrates the robustness of this effect (Figure 6e, right panel). Thus, in spite of the results of the model-selection procedure, some sequential effects in subjects’ predictions support only one of the two families of models.
Regardless of the model that best fits their overall predictions, the behavior of the subjects is consistent only with the precision-cost family of models with Markov order equal to or greater than 1, that is, with a model of inference of conditional probabilities hampered by a cognitive cost weighing on the precision of belief distributions.
Discussion
We investigated the hypothesis that sequential effects in human predictions result from cognitive constraints hindering the inference process carried out by the brain. We devised a framework of constrained inference, in which the model subject bears a cognitive cost when updating its belief distribution upon the arrival of new evidence: the larger the cost, the more the subject’s posterior differs from the Bayesian posterior. The models we derive from this framework make specific predictions. First, the proportion of forced-choice predictions for a given stimulus should increase with the stimulus generative probability. Second, most of those models predict sequential effects: predictions also depend on the recent stimulus history. Models with different types of cognitive cost result in different patterns of attractive and repulsive effects of the past few stimuli on predictions. To compare the predictions of constrained inference with human behavior, we asked subjects to predict each next outcome in sequences of binary stimuli. We manipulated the stimulus generative probability in blocks of trials, exploring exhaustively the probability range from 0.05 to 0.95 by increments of 0.05. We found that subjects’ predictions depend on both the stimulus generative probability and the recent stimulus history. Sequential effects exhibited both attractive and repulsive components, which were modulated by the stimulus generative probability. This behavior was qualitatively accounted for by a model of constrained inference in which the subject infers the transition probabilities underlying the sequences of stimuli and bears a cost that increases with the precision of the posterior distributions. Our study proposes a novel theoretical account of sequential effects in terms of optimal inference under cognitive constraints, and it uncovers the richness of human behavior over a wide range of stimulus generative probabilities.
The notion that human decisions can be understood as resulting from a constrained optimization has gained traction across several fields, including neuroscience, cognitive science, and economics. In neuroscience, a voluminous literature that started with Attneave, 1954 and Barlow, 1961 investigates the idea that perception maximizes the transmission of information, under the constraint of costly and limited neural resources (Laughlin, 1981; Laughlin et al., 1998; Simoncelli and Olshausen, 2001); related theories of ‘efficient coding’ account for the bias and the variability of perception (Ganguli and Simoncelli, 2016; Wei and Stocker, 2015; Wei and Stocker, 2017; Prat-Carrabin and Woodford, 2021c). In cognitive science and economics, ‘bounded rationality’ is a precursory concept introduced in the 1950s by Herbert Simon, who defines it as “rational choice that takes into account the cognitive limitations of the decision maker — limitations of both knowledge and computational capacity” (Simon, 1997). For Gigerenzer, these limitations promote the use of heuristics, which are ‘fast and frugal’ ways of reasoning, leading to biases and errors in humans and other animals (Gigerenzer and Goldstein, 1996; Gigerenzer and Selten, 2002). A range of more recent approaches can be understood as attempts to specify formally the limitations in question, and the resulting tradeoff. The ‘resource-rational analysis’ paradigm aims at a unified theoretical account that reconciles principles of rationality with realistic constraints on the resources available to the brain when it is carrying out computations (Griffiths et al., 2015). In this approach, biases result from the constraints on resources, rather than from ‘simple heuristics’ (see Lieder and Griffiths, 2019 for an extensive review).
For instance, in economics, theories of ‘rational inattention’ propose that economic agents optimally allocate resources (a limited amount of attention) to make decisions, thereby proposing new accounts of empirical findings in the economic literature (Sims, 2003; Woodford, 2009; Caplin et al., 2019; Gabaix, 2017; Azeredo da Silveira and Woodford, 2019; Azeredo da Silveira et al., 2020).
Our study puts forward a ‘resource-rational’ account of sequential effects. Traditional accounts since the 1960s attribute these effects to a belief in sequential dependencies between successive outcomes (Edwards, 1961; Matthews and Sanders, 1984) (potentially ‘acquired through life experience’; Ayton and Fischer, 2004), and more generally to the incorrect models that people assume about the processes generating sequences of events (see Oskarsson et al., 2009 for a review; similar rationales have been proposed to account for suboptimal behavior in other contexts, for example in exploration-exploitation tasks; Navarro et al., 2016). This traditional account was formalized, in particular, by models in which subjects carry out a statistical inference about the sequence of stimuli presented to them, and this inference assumes that the parameters underlying the generating process are subject to changes (Yu and Cohen, 2008; Wilder et al., 2009; Zhang et al., 2014; Meyniel et al., 2016). In these models, sequential effects are thus understood as resulting from a rational adaptation to a changing world. Human subjects indeed dynamically adapt their learning rate when the environment changes (Payzan-LeNestour et al., 2013; Meyniel and Dehaene, 2017; Nassar et al., 2010), and they can even adapt their inference to the statistics of these changes (Behrens et al., 2007; Prat-Carrabin et al., 2021b). However, in our task and in many previous studies in which sequential effects have been reported, the underlying statistics are in fact not changing across trials. The models just mentioned thus leave unexplained why subjects’ behavior, in these tasks, is not rationally adapted to the unchanging statistics of the stimulus.
What underpins our main hypothesis is a different kind of rational adaptation: one, instead, to the ‘cognitive limitations of the decision maker’, which we assume hinder the inference carried out by the brain. We show that rational models of inference under a cost yield rich patterns of sequential effects. When the cost varies with the precision of the posterior (measured here by the negative of its entropy, Equation 3), the resulting optimal posterior is proportional to the product of the prior and the likelihood, each raised to an exponent $1/(\lambda +1)$ (Equation 4). Many previous studies on biased belief updating have proposed models that adopt the same form, except with different exponents applied to the prior and to the likelihood (Grether, 1980; Matsumori et al., 2018; Benjamin, 2019). Here, with the precision cost, both quantities are raised to the same exponent, and we note that in this case the inference of the subject amounts to an exponentially decaying count of the patterns observed in the sequence of stimuli, which is sometimes called ‘leaky integration’ in the literature (Yu and Cohen, 2008; Wilder et al., 2009; Jones et al., 2013; Meyniel et al., 2016). The models mentioned above, which posit a belief in changing statistics, are indeed well approximated by models of leaky integration (Yu and Cohen, 2008; Meyniel et al., 2016), which shows that the exponential discount can have different origins. Meyniel et al., 2016 show that the precision-cost, Markov-observer model with $m=1$ (named ‘local transition probability model’ in that study) accounts for a range of other findings, in addition to sequential effects, such as biases in the perception of randomness and patterns in the surprise signals recorded through EEG and fMRI. Here we reinterpret these effects as resulting from an optimal inference subject to a cost, rather than from a suboptimal, erroneous belief in the dynamics of the stimulus’ statistics.
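The equivalence between the precision-cost update and leaky integration can be made explicit with a short derivation (a sketch, using the notation of the text, assuming a uniform prior and omitting normalization constants). Iterating the update rule quoted above,

```latex
\hat{P}_t(q) \;\propto\; \left[\hat{P}_{t-1}(q)\,p(x_t \mid q)\right]^{\frac{1}{\lambda+1}}
\quad\Longrightarrow\quad
\hat{P}_t(q) \;\propto\; \prod_{s=1}^{t} p(x_s \mid q)^{\,(\lambda+1)^{-(t-s+1)}} ,
```

so each additional trial multiplies the weight of every past observation’s log-likelihood by $1/(\lambda +1)$, i.e., an exponential decay with inverse characteristic time $\mathrm{ln}(1+\lambda )$: the exponentially decaying count of past patterns just described.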
In our modeling approach, the minimization of a loss function (Equation 1) formalizes a tradeoff between the distance to optimality of the inference, and the cognitive constraints under which it is carried out. We stress that our proposal is not that the brain actively solves this optimization problem online, but instead that it is endowed with an inference algorithm (whose origin remains to be elucidated) which is effectively a solution to the constrained optimization problem.
By grounding the sequential effects in the optimal solution to a problem of constrained optimization, our approach opens avenues for exploring the origins of sequential effects, in the form of hypotheses about the nature of the constraint that hinders the inference carried out by the brain. With the precision cost, more precise posterior distributions are assumed to take a larger cognitive toll. The intuitive assumption that it is costly to be precise finds a more concrete realization in neural models of inference with probabilistic population codes: in these models, the precision of the posterior is proportional to the average activity of the population of neurons and to the number of neurons (Ma et al., 2006; Seung and Sompolinsky, 1993). More neural activity and more neurons arguably come with a metabolic cost, and thus more precise posteriors are more costly in these models. Imprecision in computations, moreover, has been shown to successfully account for decision variability and adaptive behavior in volatile environments (Findling et al., 2019; Findling et al., 2021).
The unpredictability cost, which we introduce, yields models that also exhibit sequential effects (for Markov observers), and that fit several subjects better than the precision-cost models. The unpredictability cost relies on a different hypothesis: that the cost of representing a distribution over different possible states of the world (here, different possible values of $q$) resides in the difficulty of representing these states. This could be the case, for instance, under the hypothesis that the brain runs stochastic simulations of the implied environments, as proposed in models of ‘intuitive physics’ (Battaglia et al., 2013) and in Kahneman and Tversky’s ‘simulation heuristic’ (Kahneman et al., 1982). More entropic environments imply more possible scenarios to simulate, giving rise, under this assumption, to higher costs. A different literature explores the hypothesis that the brain carries out a mental compression of sequences (Simon, 1972; Chekaf et al., 2016; Planton et al., 2021); entropy in this context is a measure of the degree of compressibility of a sequence (Planton et al., 2021), and thus, presumably, of its implied cost. As a result, the brain may prefer predictable environments over unpredictable ones. Human subjects indeed exhibit a preference for predictive information (Ogawa and Watanabe, 2011; Trapp et al., 2015), while unpredictable stimuli have been shown not only to increase anxiety-like behavior (Herry et al., 2007), but also to induce more neural activity (Herry et al., 2007; den Ouden et al., 2009; Alink et al., 2010) — a presumably costly increase, which may result from the encoding of larger prediction errors (Herry et al., 2007; Schultz and Dickinson, 2000).
We note that both costs (precision and unpredictability) can predict sequential effects, even though neither carries ex ante an explicit assumption that presupposes the existence of sequential effects. They both reproduce the attractive recency effect of the last stimulus exhibited by the subjects. They make quantitatively different predictions (Figure 4); we also find this diversity of behaviors in subjects.
The precision cost, as mentioned above, yields leaky-integration models which can be summarized by a simple algorithm in which the observed patterns are counted with an exponential decay. The psychology and neuroscience literature proposes many similar ‘leaky integrator’ or ‘leaky accumulator’ models (Smith, 1995; Roe et al., 2001; Usher and McClelland, 2001; Cook and Maunsell, 2002; Wang, 2002; Sugrue et al., 2004; Bogacz et al., 2006; Kiani et al., 2008; Yu and Cohen, 2008; Gao et al., 2011; Tsetsos et al., 2012; Ossmy et al., 2013; Meyniel et al., 2016). In connectionist models of decision-making, for instance, decision units in abstract network models have activity levels that accumulate evidence received from input units, and which decay to zero in the absence of input (Roe et al., 2001; Usher and McClelland, 2001; Wang, 2002; Bogacz et al., 2006; Tsetsos et al., 2012). In other instances, perceptual evidence (Kiani et al., 2008; Gao et al., 2011; Ossmy et al., 2013) or counts of events (Sugrue et al., 2004; Yu and Cohen, 2008; Meyniel et al., 2016) are accumulated through an exponential temporal filter. In our approach, leaky integration is not an assumption about the mechanisms underpinning some cognitive process: instead, we find that it is an optimal strategy in the face of a cognitive cost weighing on the precision of beliefs. Although it is less clear whether the unpredictability-cost models lend themselves to a similar algorithmic simplification, they amount to a distortion of Bayesian inference, for which various neural-network models have been proposed (Deneve et al., 2001; Ma et al., 2008; Ganguli and Simoncelli, 2014; Echeveste et al., 2020).
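To make the algorithmic summary concrete, here is a minimal Python sketch of such a leaky integrator for a Bernoulli observer. The decay factor `gamma` and the Laplace-smoothed posterior mean are illustrative simplifications, not the fitted model.

```python
# Minimal sketch of leaky integration (illustrative, not the fitted model):
# the precision-cost Bernoulli observer reduces to counts of A's and B's
# that decay exponentially, here with a hypothetical decay factor gamma.

def leaky_counts(stimuli, gamma=0.9):
    """Exponentially decaying counts of A and B in a sequence of 'A'/'B'."""
    n_a = n_b = 0.0
    for x in stimuli:
        n_a = gamma * (n_a + (x == 'A'))   # decay, then count the new stimulus
        n_b = gamma * (n_b + (x == 'B'))
    return n_a, n_b

def predicted_prob_a(stimuli, gamma=0.9):
    """Expected probability of A under the resulting (Beta-like) belief."""
    n_a, n_b = leaky_counts(stimuli, gamma)
    return (n_a + 1.0) / (n_a + n_b + 2.0)
```

Because older stimuli decay more, `predicted_prob_a('BA')` exceeds `predicted_prob_a('AB')`: the last observation weighs most, which is the attractive recency effect in its simplest form.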
Turning to the experimental results, we note that in spite of the rich literature on sequential effects, the majority of studies have focused on equiprobable Bernoulli environments, in which the two possible stimuli both have a probability equal to 0.5, as in tosses of a fair coin (Soetens et al., 1985; Cho et al., 2002; Yu and Cohen, 2008; Wilder et al., 2009; Jones et al., 2013; Zhang et al., 2014; Ayton and Fischer, 2004; Gökaydin and Ejova, 2017). In environments of this kind, the two stimuli play symmetric roles and all sequences of a given length are equally probable. In contrast, in biased environments one of the two possible stimuli is more probable than the other. Although much less studied, this situation breaks the regularities of equiprobable environments and is arguably very frequent in real life. In our experiment, we explore stimulus generative probabilities from 0.05 to 0.95, allowing us to investigate the behavior of subjects in a wide spectrum of Bernoulli environments: from those with ‘extreme’ probabilities (e.g. p = 0.95), to those only slightly different from the equiprobable case (e.g. p = 0.55), to the equiprobable case itself (p = 0.5). The subjects are sensitive to the imbalance of the non-equiprobable cases: while they predict A in half the trials of the equiprobable case, a probability of just p = 0.55 suffices to prompt the subjects to predict A in about 60% of trials, a significant difference ($\overline{p}(A)=0.602$; sem: 0.008; p-value of t-test against the null hypothesis that $\overline{p}(A)=0.5$: 1.7e-11; subjects pooled).
The well-known ‘probability matching’ hypothesis (Herrnstein, 1961; Vulkan, 2000; Gaissmaier and Schooler, 2008) suggests that the proportion of predictions A matches the stimulus generative probability: $\overline{p}(A)=p$. This hypothesis is not supported by our data. We find that in the non-equiprobable conditions these two quantities are significantly different (all p-values <1e-11, when $p\ne 0.5$). More precisely, we find that the proportion of predictions A is more extreme than the stimulus generative probability (i.e. $\overline{p}(A)>p$ when $p>0.5$, and $\overline{p}(A)<p$ when $p<0.5$; Figure 2a). This result is consistent with the observations made by Edwards, 1961; Edwards, 1956 and with the conclusions of a more recent review (Vulkan, 2000).
In addition to varying with the stimulus generative probability, the subjects’ predictions depend on the recent history of stimuli. Recency effects are common in the psychology literature; they have been reported in domains ranging from memory (Ebbinghaus et al., 1913) to causal learning (Collins and Shanks, 2002) to inference (Shanteau, 1972; Hogarth and Einhorn, 1992; Benjamin, 2019). Recency effects, in many studies, are obtained in the context of reaction tasks, in which subjects must identify a stimulus and quickly provide a response (Hyman, 1953; Bertelson, 1965; Kornblum, 1967; Soetens et al., 1985; Cho et al., 2002; Yu and Cohen, 2008; Wilder et al., 2009; Jones et al., 2013; Zhang et al., 2014). Although our task is of a different kind (subjects must predict the next stimulus), we find some evidence of recency effects in the response times of subjects: after observing the less frequent of the two stimuli (when $p\ne 0.5$), subjects seem slower at providing a response (see Appendix). In prediction tasks (like ours), both attractive recency effects, also called the ‘hot-hand fallacy’, and repulsive recency effects, also called the ‘gambler’s fallacy’, have been reported (Jarvik, 1951; Edwards, 1961; Ayton and Fischer, 2004; Burns and Corpus, 2004; Croson and Sundali, 2005; Oskarsson et al., 2009). The observation of both effects within the same experiment has been reported in a visual identification task (Chopin and Mamassian, 2012) and in risky choices (the ‘wavy recency effect’; Plonsky et al., 2015; Plonsky and Erev, 2017). As to the heterogeneity of these results, several explanations have been proposed; two important factors seem to be the perceived degree of randomness of the predicted variable and whether it relates to human performance (Ayton and Fischer, 2004; Burns and Corpus, 2004; Croson and Sundali, 2005; Oskarsson et al., 2009).
In any event, most studies focus exclusively on the influence of ‘runs’ of identical outcomes on the upcoming prediction, for example, in our task, on whether three As in a row increase the proportion of predictions A. With this analysis, Edwards (Edwards, 1961), in a task similar to ours, concluded in favor of an attractive recency effect (which he called ‘probability following’). Although our results are consistent with this observation (in our data three As in a row do increase the proportion of predictions A), we provide a more detailed picture of the influence of each stimulus preceding the prediction, whether it is in a ‘run’ of identical stimuli or not, which allows us to exhibit the nontrivial finer structure of the recency effects that is often overlooked.
Up to two stimuli in the past, the recency effect is attractive: observing A at trial $t-2$ or at trial $t-1$ induces, all else being equal, a higher proportion of predictions A at trial $t$ (in comparison to observing B; Figures 2 and 6a). The influence of the third-to-last stimulus is more intricate: it can yield either an attractive or a repulsive effect, depending on the second-to-last and the last stimuli. For a majority of probability parameters, $p$, while an A followed by the sequence AA has an attractive effect (i.e. $p(AAAA)>p(ABAA)$), an A followed by the sequence BA has a repulsive effect (i.e. $p(AABA)<p(ABBA)$; Figure 6b and c). How can this reversal be intuited? Only one of our models, the precision-cost model with Markov order 1 ($m=1$), reproduces this behavior; we show how it provides an interpretation for this result. From the update equation of this model (Equation 4), it is straightforward to show that the posterior of the model subject (a Dirichlet distribution of order 4) is determined by four quantities, which are exponentially decaying counts of the four two-long patterns observed in the sequence of stimuli: BB, BA, AB, and AA. The higher the count of a pattern, the more likely the model subject deems this pattern to happen again. In the equiprobable case ($p=0.5$), after observing the sequence AAA, the count of AA is higher than after observing BAA, thus the model subject believes that AA is more probable, and accordingly predicts A more frequently, i.e., $p(AAAA)>p(ABAA)$. As for the sequences ABA and BBA, both result in the same count of AA, but the former results in a higher count of AB (in other words, the short sequence ABA suggests that A is usually followed by B, but the sequence BBA does not), and thus the model subject predicts B more frequently, i.e., A less frequently ($p(AABA)<p(ABBA)$).
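The intuition above can be checked with a short Python sketch. This is an illustrative simplification, not the fitted model: the decay factor `gamma` and the Laplace-smoothed ratio stand in for the precision-cost Markov observer’s ($m=1$) posterior mean.

```python
# Sketch of the intuition above (not the fitted model): the precision-cost
# Markov observer (m = 1) amounts to exponentially decaying counts of the
# four two-long patterns; gamma and the smoothing terms are illustrative.

def pattern_counts(stimuli, gamma=0.8):
    """Exponentially decaying counts of the patterns AA, AB, BA, BB."""
    counts = dict.fromkeys(('AA', 'AB', 'BA', 'BB'), 0.0)
    for prev, cur in zip(stimuli, stimuli[1:]):
        for pat in counts:
            counts[pat] *= gamma         # older transitions fade away
        counts[prev + cur] += 1.0        # count the latest transition
    return counts

def prob_next_a(stimuli, gamma=0.8):
    """Expected probability that A follows, given the last stimulus."""
    c = pattern_counts(stimuli, gamma)
    last = stimuli[-1]
    return (c[last + 'A'] + 1.0) / (c[last + 'A'] + c[last + 'B'] + 2.0)

# Attractive effect of a third-to-last A followed by AA:
assert prob_next_a('AAA') > prob_next_a('BAA')
# Repulsive effect of a third-to-last A followed by BA: ABA suggests that
# A tends to be followed by B, so the model predicts A less often.
assert prob_next_a('ABA') < prob_next_a('BBA')
```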
In short, the ability of the precision-cost model of a Markov observer to capture the repulsive effect found in behavioral data suggests that human subjects extrapolate the local statistical properties of the presented sequence of stimuli in order to make predictions, and that they pay attention not only to the ‘base rate’ (the marginal probability of observing A, unconditional on the recent history), as a Bernoulli observer would do, but also to the statistics of more complex patterns, including the repetitions and the alternations, thus capturing the transition probabilities between consecutive observations. Wilder et al., 2009, Jones et al., 2013, and Meyniel et al., 2016 similarly argue that sequential effects result from an imperfect inference of the base rate and of the frequency of repetitions and alternations. Dehaene et al., 2015 argue that the knowledge of transition probabilities is a central mechanism in the brain’s processing of sequences (e.g. in language comprehension), and infants as young as 5 months were shown to be able to track both base rates and transition probabilities (see Saffran and Kirkham, 2018 for a review). Learning of transition probabilities has also been observed in rhesus monkeys (Meyer and Olson, 2011).
The deviations from perfect inference, in the precision-cost model, originate in the constraints faced by the brain when performing computations with probability distributions. In spite of the success of the Bayesian framework, we note that human performance in various inference tasks is often suboptimal (Nassar et al., 2010; Hu et al., 2013; Acerbi et al., 2014; Prat-Carrabin et al., 2021b; Prat-Carrabin and Woodford, 2022). Our approach suggests that the deviations from optimality in these tasks may be explained by the cognitive constraints at play in the inference carried out by humans.
Other studies have considered the hypothesis that suboptimal behavior in inference tasks results from cognitive constraints. Kominers et al., 2016 consider a model in which Bayesian inference comes with a fixed cost; the observer can choose to forgo updating her belief, so as to avoid the cost. In some cases, the model predicts ‘permanently cycling beliefs’ that do not converge; but in general the model predicts that subjects will choose not to react to new evidence that is unsurprising under the current belief. The significant sequential effects we find in our subjects’ responses, however, seem to indicate that they are sensitive to both unsurprising (e.g. outcome A when p>0.5) and surprising (outcome B when p>0.5) observations, at least across the values of the stimulus generative probability that we test (Figure 2). Graeber, 2020 considers costly information processing as an account of subjects’ neglect of confounding variables in an inference task, but concludes instead that the suboptimal behavior of subjects results from their misunderstanding of the information structure in the task. A model close to ours is the one proposed in Azeredo da Silveira and Woodford, 2019 and Azeredo da Silveira et al., 2020, in which an information-theoretic cost limits the memory of an otherwise optimal and Bayesian decision-maker, resulting, here also, in beliefs that fluctuate and do not converge, and in an overweighting, in decisions, of the recent evidence.
Taking a different approach, Dasgupta et al., 2020 implement a neural network that learns to approximate Bayesian posteriors. Possible approximate posteriors are constrained not only by the structure of the network, but also by the fact that the same network is used to address a series of different inference problems. Thus the network’s parameters must be ‘shared’ across problems, which is meant to capture the brain’s limited computational resources. Although this constraint differs from the ones we consider, we note that in this study the distance function (which the approximation aims to minimize) is the same as in our models, namely, the Kullback-Leibler divergence from the optimal posterior to the approximate posterior, ${D}_{KL}(\hat{P} \| P)$. Minimizing this divergence (under a cost) allows the model subject to obtain a posterior as close as possible (at least by this measure) to the optimal posterior given the most recent stimulus and the subject’s belief prior to observing the stimulus, which in turn enables the subject to perform reasonably well in the task.
In principle, rewarding subjects with a higher payoff when they make a correct prediction would change the optimal tradeoff (between the distance to the optimal posterior and the cognitive costs) formalized in Equation 1, resulting in ‘better’ posteriors (closer to the Bayesian posterior), and thus in higher performance in the task. At the same time, incentivization is known to influence, also in the direction of higher performance, the extent to which choice behavior is close to probability matching (Vulkan, 2000). The interesting question of the respective sensitivities of the subjects’ inference process and of their response-selection strategy to different levels of incentives is beyond the scope of this study, in which we have focused on the sensitivity of behavior to different stimulus generative probabilities.
In any case, the approach of minimizing the Kullback-Leibler divergence from the optimal posterior to the approximate posterior is widely used in the machine-learning literature, and forms the basis of the ‘variational’ family of approximate-inference techniques (Bishop, 2006). These techniques have inspired various cognitive models (Sanborn, 2017; Gallistel and Latham, 2022; Aridor and Woodford, 2023); alternatively, a bound on the divergence, known as the ‘evidence lower bound’, or, in neuroscience, as the negative of the ‘free energy’, is maximized (Moustafa, 2017; Friston et al., 2006; Friston, 2009). (We note that the ‘opposite’ divergence, ${D}_{KL}(P \| \hat{P})$, is minimized in a different machine-learning technique, ‘expectation propagation’ (Bishop, 2006), and in the cognitive model of causal reasoning of Icard and Goodman, 2015.) In these techniques, the approximate posterior is chosen within a convenient family of tractable, parameterized distributions; other distributions are precluded. This can be understood, in our framework, as positing a cost $C(\hat{P})$ that is infinite for most distributions, but zero for the distributions that belong to some arbitrary family (Prat-Carrabin et al., 2021a). The precision cost and the unpredictability cost, in comparison, are ‘smooth’, and allow for any distribution, but they penalize, respectively, more precise belief distributions, and belief distributions that imply more unpredictable environments. Our study shows that inference, when subject to either of these costs, yields an attractive sequential effect of the most recent observation; and with a precision cost weighing on the inference of transition probabilities (i.e., $m=1$), the model predicts the subtle pattern of attractive and repulsive sequential effects that we find in subjects’ responses.
Methods
Task and subjects
The computer-based task was programmed using the Python library PsychoPy (Peirce, 2008). The experiment comprised ten blocks of trials, which differed by the stimulus generative probability, p, used in all the trials of each block. The probability p was chosen randomly among the ten values ranging from 0.50 to 0.95 by increments of 0.05, excluding the values chosen for previous blocks; and with probability 1/2 the stimulus generative probability $1-p$ was used instead. Each block started with 200 passive trials, in which the subject was only asked to look at the 200 stimuli sampled with the block’s probability and successively presented. No action from the subject was required for these passive trials. The subject was then asked to predict, in each of 200 trials, the next location of the stimulus. Subjects provided their responses by a keypress. The task was presented as a game to the subjects: the stimulus was a lightning symbol, and predicting correctly whether the lightning would strike the left or the right rod resulted in the electrical energy of the lightning being collected in a battery (Figure 1). A gauge below the battery indicated the amount of energy accumulated in the current block of trials (Figure 1a). Twenty subjects (7 women, 13 men; age: 18–41, mean 25.5, standard deviation 6.2) participated in the experiment. All subjects completed the ten blocks of trials, except one subject who did not finish the experiment and was excluded from the analysis. The study was approved by the ethics committee Île de France VII (CPP 08–021). Participants gave their written consent prior to participating. The number of blocks of trials and the number of trials per block were chosen as a tradeoff between maximizing the statistical power of the study, scanning the values of the generative probability parameter from 0.05 to 0.95 with a satisfying resolution, and maintaining the duration of the experiment under a reasonable length of time.
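The assignment of probabilities to blocks can be sketched as follows (an illustrative Python sketch, not the original PsychoPy experiment code):

```python
import random

# Sketch of the block-probability assignment described above (illustrative,
# not the original experiment code).
def draw_block_probabilities(rng=random):
    """Ten block probabilities: p drawn without replacement from
    {0.50, 0.55, ..., 0.95}, then flipped to 1 - p with probability 1/2."""
    base = [round(0.50 + 0.05 * k, 2) for k in range(10)]
    rng.shuffle(base)                     # random order, without replacement
    return [p if rng.random() < 0.5 else round(1.0 - p, 2) for p in base]
```

Each of the ten base values thus appears exactly once, either as $p$ or as $1-p$, so the resulting probabilities span 0.05 to 0.95.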
The number of subjects was chosen consistently with similar studies and so as to capture individual variability. Throughout the study, we conduct Student’s t-tests when comparing the subjects’ proportion of predictions A to a given value (e.g. 0.5). When comparing two proportions of predictions A obtained under different conditions (e.g. depending on whether the preceding stimulus is A or B), we conduct Fisher exact tests. The trials in which subjects failed to respond within the limit of 1 s were not included in the analysis. They represented 1.27% of the trials, on average (across subjects); and for 95% of the subjects these trials represented less than 2.5% of the trials.
Sequential effects of the models
We run simulations of the eight models and look at the predictions they yield. To reproduce the conditions faced by the subjects, which included 200 passive trials, we start each simulation by presenting 200 randomly sampled stimuli to the model subject (without collecting predictions at this stage). We then show an additional 200 samples, and obtain a prediction from the model subject after each sample. The sequential effects of the most recent stimulus, with the different models, are shown in Figure 7. With the precision-cost models, the posterior distribution of the model subject does not converge, but fluctuates instead with the recent history of the stimuli. This results in attractive sequential effects (Figure 7a), including for the Bernoulli observer, who assumes that the probability of A does not depend on the most recent stimulus. With the unpredictability-cost models, the posterior of the model subject does converge. With Markov observers, it converges toward a parameter vector $q$ that implies that the probability of observing A depends on the most recent stimulus, resulting in the presence of sequential effects of the most recent stimulus (Figure 7b, second to fourth row). With a Bernoulli observer, the posterior of the model subject converges toward a value of the stimulus generative probability that does not depend on the stimulus history. As more evidence is accumulated, the posterior narrows around this value, but not without some fluctuations that depend on the sequence of stimuli presented. In consequence, the model subject’s estimate of the stimulus generative probability is also subject to fluctuations, and depends on the history of stimuli (including the most recent stimulus), although the width of the fluctuations tends to zero as more stimuli are observed.
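The simulation protocol can be sketched as follows; the `observe`/`predict` interface is a hypothetical stand-in for the eight models’ belief-update and response-selection steps.

```python
import random

# Schematic simulation protocol: 200 passive trials (belief update only),
# then 200 active trials in which a prediction is collected before each
# stimulus. The model's `observe`/`predict` methods are hypothetical.
def simulate_block(model, p, n_passive=200, n_active=200, rng=random):
    """Simulate one block with stimulus generative probability p."""
    for _ in range(n_passive):
        model.observe('A' if rng.random() < p else 'B')
    predictions = []
    for _ in range(n_active):
        predictions.append(model.predict())  # predict before the stimulus
        model.observe('A' if rng.random() < p else 'B')
    return predictions
```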
After the 200 stimuli of the passive trials, the sequential effects of the most recent stimulus resulting from this transient regime appear small in comparison to the sequential effects obtained with the other models (Figure 7b, first row). Figure 7 also shows the behaviors of the models when augmented with a propensity to repeat the preceding response: we comment on these in the section dedicated to these models, below.
Turning to higher-order sequential effects, we look at the influence on predictions of the second- and third-to-last stimuli (Figure 8). As mentioned, only precision-cost models of Markov observers yield repulsive sequential effects, and these occur only when the third-to-last stimulus is followed by BA. They do not occur with the second-to-last stimulus, nor with the third-to-last stimulus when it is followed by AA (Figure 8a); and they do not occur in any case with the unpredictability-cost models (Figure 8b).
Derivation of the approximate posteriors
We derive the solution to the constrained optimization problem, in the general case of a ‘hybrid’ model subject who bears both a precision cost, with weight $\lambda_p$, and an unpredictability cost, with weight $\lambda_u$. Denoting by $H(q)$ the Shannon entropy of the environment specified by the parameter vector $q$, the subject minimizes the loss function

$$L[\hat{P}_{t+1}] = \int \hat{P}_{t+1}(q)\ln\frac{\hat{P}_{t+1}(q)}{P_{t+1}(q)}\,dq + \lambda_p \int \hat{P}_{t+1}(q)\ln\hat{P}_{t+1}(q)\,dq + \lambda_u \int \hat{P}_{t+1}(q)H(q)\,dq + \mu\left(\int \hat{P}_{t+1}(q)\,dq - 1\right),$$

in which we have included a Lagrange multiplier, $\mu$, corresponding to the normalization constraint, $\int \hat{P}_{t+1}(q)\,dq = 1$. Taking the functional derivative of $L$ and setting it to zero, we obtain

$$\ln\frac{\hat{P}_{t+1}(q)}{P_{t+1}(q)} + 1 + \lambda_p\left(\ln\hat{P}_{t+1}(q) + 1\right) + \lambda_u H(q) + \mu = 0,$$

and thus we write the approximate posterior as

$$\hat{P}_{t+1}(q) \propto P_{t+1}(q)^{\frac{1}{1+\lambda_p}}\exp\left(-\frac{\lambda_u}{1+\lambda_p}H(q)\right),$$

where ${P}_{t+1}(q)$ is the Bayesian update of the preceding belief, ${\hat{P}}_{t}(q)$, i.e.,

$$P_{t+1}(q) \propto P(x_{t+1}|q)\,\hat{P}_t(q).$$

Setting the weight of the unpredictability cost to zero (i.e., ${\lambda}_{u}=0$), we obtain the posterior in the presence of the precision cost only, as

$$\hat{P}^{prec}_{t+1}(q) \propto \left[P(x_{t+1}|q)\,\hat{P}^{prec}_t(q)\right]^{\frac{1}{1+\lambda_p}}.$$

The main text provides more details about the posterior in this case (Equation 4), in particular with a Bernoulli observer ($m=0$; Equation 5, Equation 6).

For the hybrid model (in which both $\lambda_u$ and $\lambda_p$ are potentially different from zero), iterating the update above yields

$$\hat{P}_t(q) \propto \hat{P}^{prec}_t(q)\,\exp\left(-\lambda_u H(q)\sum_{s=1}^{t}\frac{1}{(1+\lambda_p)^s}\right).$$

With ${\lambda}_{p}=0$, the sum in the exponential is equal to $t$, and the precision-cost posterior, ${\hat{P}}_{t}^{prec}(q)$, is the Bayesian posterior, ${P}_{t}^{\ast}(q)$, and thus we obtain the posterior in the presence of the unpredictability cost only (see Equation 8).
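As a numerical sanity check of the precision-cost case, one can verify on a discretized grid that iterating the update (raising the Bayesian update to the power $1/(1+\lambda_p)$ at each trial) yields the same posterior mean as a Beta distribution whose parameters are exponentially filtered counts. Grid resolution, cost weight, and stimulus sequence below are arbitrary choices for illustration.

```python
# Numerical check (illustrative parameters): for a Bernoulli observer
# (m = 0), the iterated precision-cost update on a grid over q matches the
# Beta posterior with leaky counts decaying by 1/(1 + lambda_p) per trial.

N = 1000
grid = [(i + 0.5) / N for i in range(N)]   # midpoints of a grid over (0, 1)
lam_p = 0.5
gamma = 1.0 / (1.0 + lam_p)                # per-trial decay factor

post = [1.0] * N                           # flat prior over q
n_a = n_b = 0.0                            # leaky counts of A's and B's
for x in [1, 1, 0, 1, 0, 0, 1, 1]:         # arbitrary stimulus sequence
    lik = [q if x == 1 else 1.0 - q for q in grid]
    post = [(l * p) ** gamma for l, p in zip(lik, post)]
    z = sum(post)
    post = [p / z for p in post]           # renormalize on the grid
    n_a = gamma * (n_a + (x == 1))
    n_b = gamma * (n_b + (x == 0))

mean_grid = sum(q * w for q, w in zip(grid, post))
mean_beta = (n_a + 1.0) / (n_a + n_b + 2.0)
assert abs(mean_grid - mean_beta) < 1e-3
```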
Hybrid models
The hybrid model, described above, features both a precision cost and an unpredictability cost, with respective weights $\lambda_p$ and $\lambda_u$. As with the models that include only one type of cost, we consider a Bernoulli observer ($m=0$), and three Markov observers ($m=1, 2,$ and 3). As for the response-selection strategy, we use, here also, the generalized probability-matching strategy parameterized by $\kappa$. We thus obtain four new models; each one has three parameters ($\lambda_p$, $\lambda_u$, and $\kappa$), while the non-hybrid models (featuring only one type of cost) have only two parameters.
We fit these models to the responses of subjects. For 68% of subjects, the BIC of the best-fitting hybrid model is larger than the BIC of the best-fitting non-hybrid model, indicating a worse fit, by this measure. This suggests that for these subjects, allowing for a second type of cost results in a modest improvement of the fit that does not justify the additional parameter. For the remaining 32% of subjects, the hybrid models yield a better fit (a lower BIC) than the non-hybrid models, although for half of these, the difference in BICs is lower than 6, which is only weak evidence in favor of the hybrid models.
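For reference, the comparison relies on the standard BIC formula; the sketch below uses illustrative numbers (2000 is the number of active trials per subject, i.e., 10 blocks of 200 predictions) to show the log-likelihood improvement a hybrid model needs in order to offset its extra parameter.

```python
import math

# Standard Bayesian Information Criterion (lower is better); the numbers
# below are illustrative, not the fitted values.
def bic(max_log_likelihood, n_params, n_obs):
    """BIC = k * ln(n) - 2 * ln(L_hat)."""
    return n_params * math.log(n_obs) - 2.0 * max_log_likelihood

n_obs = 2000                               # 10 blocks of 200 predictions
# A hybrid model (3 parameters) must beat a non-hybrid model (2 parameters)
# by at least 0.5 * ln(n_obs) in log-likelihood to obtain a lower BIC.
threshold = 0.5 * math.log(n_obs)
```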
Moreover, we compute the exceedance probability, defined below in the section ‘Bayesian Model Selection’, of the hybrid models (together with the complementary probability of the non-hybrid models). We find that the exceedance probability of the hybrid models is 8.1% while that of the non-hybrid models is 91.9%, suggesting that subjects best-fitted by non-hybrid models are more prevalent.
In summary, we find that for more than two thirds of subjects, allowing for a second cost type does not much improve the fit to the behavioral data (the BIC is higher with the best-fitting hybrid model). These subjects are best-fitted by non-hybrid models, that is, by models featuring only one type of cost, instead of ‘falling in between’ the two cost types. This suggests that for most subjects, only one of the two costs, either the precision cost or the unpredictability cost, dominates the inference process.
Alternative response-selection strategy, and repetition or alternation propensity
In addition to the generalized probability-matching response-selection strategy presented in the main text, in our investigations we also implement several other response-selection strategies. First, a strategy based on a ‘softmax’ function that smooths the optimal decision rule; it does not yield, however, a behavior substantially different from that of the generalized probability-matching response-selection strategy. Second, we examine a strategy in which the model subject chooses the optimal response with a probability that is fixed across conditions, which we fit onto subjects’ choices. No subject is best-fitted by this strategy. Third, another possible strategy, proposed in the game-theory literature (Nowak and Sigmund, 1993), is ‘win-stay, lose-shift’: it prescribes repeating the same response as long as it proves correct, and changing it otherwise. In the context of our binary-choice prediction task, it is indistinguishable from a strategy in which the model subject chooses a prediction equal to the outcome that last occurred. This strategy is a special case of our Bernoulli observer hampered by a precision cost with a large weight $\lambda$, combined with the optimal response-selection strategy ($\kappa \to \infty$). Since the generalized probability-matching strategy parameterized by the exponent $\kappa$ appears either more general than, better than, or indistinguishable from these other response-selection strategies, we selected it to obtain the results presented in the main text.
Furthermore, we consider the possibility that subjects may have a tendency to repeat their preceding response, or, conversely, to alternate and choose the other response, independently of their inference of the stimulus statistics. Specifically, we examine a generalization of the response-selection strategy, in which a parameter $\eta$, with $-1<\eta<1$, modulates the probability of a repetition or of an alternation. With probability $1-|\eta|$, the model subject chooses a response with the generalized probability-matching response-selection strategy, with parameter $\kappa$. With probability $|\eta|$, the model subject repeats the preceding response, if $\eta$ is positive; or chooses the opposite of the preceding response, if $\eta$ is negative. With $\eta=0$, there is no propensity for repetition nor alternation, and the response-selection strategy is the same as the one we have considered in the main text. We have allowed for alternations ($\eta<0$) in this model for the sake of generality, but for all the subjects the best-fitting value of $\eta$ is non-negative, thus henceforth we only consider the possibility of repetitions, i.e., non-negative values of the parameter ($\eta \ge 0$).
We note that with a repetition probability $\eta$, such that $0\le \eta <1$, the unconditional probability of a prediction A, which we denote by ${\overline{p}}_{\eta}(A)$, is not different from the unconditional probability of a prediction A in the absence of a repetition probability $\eta$, $\overline{p}(A)$, as in the event of a repetition, the response that is repeated is itself A with probability $\overline{p}(A)$; formally, ${\overline{p}}_{\eta}(A)=(1-\eta)\overline{p}(A)+\eta {\overline{p}}_{\eta}(A)$, which implies the equality ${\overline{p}}_{\eta}(A)=\overline{p}(A)$.
Now turning to sequential effects, we note that with a repetition probability $\eta$, the probability of a prediction A conditional on an observation A is

$${\overline{p}}_{\eta}(A|A)=(1-\eta)\,\overline{p}(A|A)+\eta\,\overline{p}(A).$$
In other words, when introducing the repetition probability $\eta$, the resulting probability of a prediction A conditional on observing A is a weighted mean of the unconditional probability of a prediction A and of the conditional probability of a prediction A in the absence of a repetition probability. Figure 7 (dotted lines) illustrates this for the eight models, with $\eta=0.2$. Consequently, the sequential effects with this response-selection strategy are more modest (Figure 7, light-red dots).
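Both relations (the unchanged unconditional probability and the weighted-mean conditional probability) can be checked by simulation. In this sketch, the dictionary `p_cond` is a hypothetical stand-in for the base strategy’s conditional probabilities of predicting A given the last stimulus; all parameter values are illustrative.

```python
import random

# Monte Carlo check of the two relations above (illustrative parameters).
def simulate(eta, p_cond, p_stim, n=200_000, seed=0):
    """Return (unconditional, conditional-on-last-stimulus-A) frequencies
    of predictions A under a repetition probability eta."""
    rng = random.Random(seed)
    prev_resp, last_stim = 'A', 'A'
    n_a = n_after_a = n_a_after_a = 0
    for _ in range(n):
        if rng.random() < eta:
            resp = prev_resp                 # repeat the preceding response
        else:
            resp = 'A' if rng.random() < p_cond[last_stim] else 'B'
        n_a += (resp == 'A')
        if last_stim == 'A':
            n_after_a += 1
            n_a_after_a += (resp == 'A')
        prev_resp = resp
        last_stim = 'A' if rng.random() < p_stim else 'B'
    return n_a / n, n_a_after_a / n_after_a
```

With, say, `p_cond = {'A': 0.7, 'B': 0.4}` and `p_stim = 0.5`, the unconditional frequency is unchanged by `eta`, while the conditional frequency shrinks toward it, as the weighted-mean expression prescribes.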
We fit (by maximizing their likelihoods) our eight models now equipped with a propensity for repetition (or alternation) parameterized by $\eta$. The average best-fitting value of $\eta$, across subjects, is 0.21 (standard deviation: 0.19; median: 0.18); as mentioned, no subject has a negative best-fitting value of $\eta$. In order to assess the degree to which the models with repetition propensity are able to capture subjects’ data, in comparison with the models without such propensity, we use the Bayesian Information Criterion (BIC) (Schwarz, 1978), which penalizes the number of parameters, as a comparative metric (a lower BIC is better). For 26% of subjects, the BIC with this response-selection strategy (allowing for $\eta \ne 0$) is higher than with the original response-selection strategy (which sets $\eta=0$), suggesting that the responses of these subjects do not warrant the introduction of a repetition (or alternation) propensity. In addition, for these subjects the best-fitting inference model, characterized by a cost type and a Markov order, is the same when the response-selection strategy allows for repetition or alternation ($\eta \ne 0$) and when it does not ($\eta=0$). For 47% of subjects, the BIC is lower when including the parameter $\eta$ (suggesting that allowing for $\eta \ne 0$ results in a better fit to the data), and importantly, here also the best-fitting inference model (cost type and Markov order) is the same with $\eta \ne 0$ and with $\eta=0$. For 11% of subjects, a better fit (lower BIC) is obtained with $\eta \ne 0$, and the best-fitting inference models, with $\eta \ne 0$ and with $\eta=0$, belong to the same family of models, that is, they have the same cost type (precision cost or unpredictability cost), and only their Markov orders differ. Finally, only for the remaining 16% does the cost type change when allowing for $\eta \ne 0$.
In other words, for 84% of subjects the best-fitting cost type is the same whether or not $\eta$ is allowed to differ from 0.
Furthermore, the best-fitting parameters $\lambda$ and $\kappa$ are also stable across these two cases. Among the 73% of subjects whose best-fitting inference model (including both cost type and Markov order) remains the same regardless of the presence of a repetition propensity, we find that the best-fitting values of $\kappa$, with $\eta \ne 0$ and with $\eta=0$, differ by less than 10% for 93% of subjects, and the best-fitting values of $\lambda$ differ by less than 10% for 71% of subjects. For these two parameters, the correlation coefficient (between the best-fitting value with $\eta=0$ and the best-fitting value with $\eta \ne 0$) is above 0.99 (with p-values lower than 1e-19).
The responses of a majority of subjects are thus better reproduced by a response-selection strategy that includes a probability of repeating the preceding response. The impact of this repetition propensity on sequential effects is relatively small in comparison to the magnitude of these effects (Figure 7). For most subjects, moreover, the best-fitting inference model, characterized by its cost type and its Markov order, is the same with or without repetition propensity, and the best-fitting parameters $\lambda$ and $\kappa$ are very close in the two cases. Therefore, this analysis supports the results of the model-fitting and model-selection procedure, and validates its robustness. We conclude that the models of costly inference are essential in reproducing the behavioral data, notwithstanding a positive repetition propensity in a fraction of subjects.
Computation of the models’ likelihoods
Model fitting is conducted by maximizing, for each model, the likelihood of the subject's choices. With the precision-cost models, the likelihood can be derived analytically and thus easily computed: the model's posterior is a Dirichlet distribution of order $2^{m+1}$, whose parameters are exponentially filtered counts of the observed sequences of length $m+1$. With a Bernoulli observer, i.e., $m=0$, this is the Beta distribution presented in Equation 5. The expected probability of a stimulus A, conditional on the sequence of $m$ stimuli most recently observed, is a simple ratio involving the exponentially filtered counts, for example $(\tilde{n}_t^A + 1)/(\tilde{n}_t^A + \tilde{n}_t^B + 2)$ in the case of a Bernoulli observer. This probability is then raised to the power $\kappa$ and normalized (as prescribed by the generalized probability-matching response-selection strategy) in order to obtain the probability of a prediction A.
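As an illustration, the Bernoulli-observer ($m=0$) computation can be sketched as follows. The exponential forgetting induced by the precision cost is summarized here by a hypothetical decay factor `gamma` (in the paper its value follows from the cost weight $\lambda$; the mapping is not reproduced here), and `kappa` is the exponent of the generalized probability-matching rule.

```python
import numpy as np

def choice_probabilities(stimuli, gamma, kappa):
    """Sketch of the Bernoulli-observer (m = 0) precision-cost likelihood.

    `gamma` is a hypothetical decay factor standing in for the exponential
    forgetting induced by the precision cost; `kappa` is the exponent of the
    generalized probability-matching rule. Returns, for each trial, the
    probability of predicting A before that trial's stimulus is observed.
    """
    n_A, n_B = 0.0, 0.0          # exponentially filtered counts
    probs = []
    for s in stimuli:            # s is 'A' or 'B'
        p_A = (n_A + 1.0) / (n_A + n_B + 2.0)   # expected probability of A
        # generalized probability matching: raise to kappa and normalize
        a, b = p_A ** kappa, (1.0 - p_A) ** kappa
        probs.append(a / (a + b))
        # decay the past counts, then add the new observation
        n_A, n_B = gamma * n_A, gamma * n_B
        if s == 'A':
            n_A += 1.0
        else:
            n_B += 1.0
    return np.array(probs)
```

With `gamma = 1` (no forgetting) and `kappa = 1`, this reduces to the standard Bayesian mean of a Beta posterior with probability matching.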
As for the unpredictability-cost models, the posterior is given in Equation 8 up to a normalization constant. Unfortunately, the expected probability of a stimulus A implied by this posterior does not come in a closed-form expression. Thus we compute the (unnormalized) posterior on a discretized grid of values of the vector $q$. The dimension of the vector $q$ is $2^m$, and each element of $q$ lies in the segment $[0,1]$. If we discretize each dimension into $n$ bins, we obtain $n^{2^m}$ different possible values of the vector $q$; for each of these, at each trial, we compute the unnormalized value of the posterior (as given by Equation 8). As $m$ increases, this becomes computationally prohibitive: for instance, with $n=100$ bins and $m=3$, the multidimensional grid of values of $q$ contains $10^{16}$ numbers (with a typical computer, this would represent 80,000 terabytes). In order to keep the needed computational resources within reasonable limits, we choose a lower resolution of the grid for larger values of $m$. Specifically, for $m=0$ we choose a grid (over $[0,1]$) with increments of 0.01; for $m=1$, increments of 0.02 (in each dimension); for $m=2$, increments of 0.05; and for $m=3$, increments of 0.1. We then compute the mean of the discretized posterior and pass it through the generalized probability-matching response-selection model to obtain the choice probability.
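A minimal sketch of this grid-based computation, for the Bernoulli observer ($m=0$). For illustration only, we assume a posterior proportional to the likelihood weighted by $e^{-\lambda H(q)}$, with $H$ the Bernoulli entropy; the exact posterior is the one given in Equation 8 of the paper.

```python
import numpy as np

def mean_posterior_m0(n_A, n_B, lam, n_bins=101):
    """Mean of a discretized, unnormalized posterior over q = P(A), for a
    Bernoulli observer. Illustrative assumption: the posterior is the
    likelihood q^n_A (1-q)^n_B weighted by exp(-lam * H(q)), with H the
    Bernoulli entropy; the paper's Equation 8 gives the exact expression.
    """
    q = np.linspace(0.0, 1.0, n_bins)[1:-1]    # drop endpoints where log diverges
    H = -q * np.log2(q) - (1 - q) * np.log2(1 - q)
    post = q ** n_A * (1 - q) ** n_B * np.exp(-lam * H)
    post /= post.sum()                          # normalize on the grid
    return float(np.sum(q * post))
```

With `lam = 0` the grid mean approaches the Bayesian mean $(n_A+1)/(n_A+n_B+2)$; a positive cost weight shifts the mean toward less entropic (more extreme) values of $q$.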
To find the best-fitting parameters $\lambda$ and $\kappa$, the likelihood was maximized with the L-BFGS-B algorithm (Byrd et al., 1995; Zhu et al., 1997). These computations were run in Python using the libraries NumPy and SciPy (Harris et al., 2020; Virtanen et al., 2020).
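The maximization step can be sketched as below; the starting point and bounds are illustrative assumptions, not the values used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def fit_model(neg_log_lik, x0=(1.0, 1.0), bounds=((1e-6, 100.0), (1e-6, 10.0))):
    """Maximize a model's likelihood with L-BFGS-B.

    `neg_log_lik(params)` should return the negative log-likelihood for
    params = (lambda, kappa); `x0` and `bounds` are illustrative.
    Returns the best-fitting parameters and the maximal log-likelihood.
    """
    res = minimize(neg_log_lik, x0=np.array(x0), method="L-BFGS-B", bounds=bounds)
    return res.x, -res.fun
```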
Symmetries and relations between conditional probabilities
Throughout the paper, we leverage the symmetry inherent to the Bernoulli prediction task to present results in a condensed manner. Specifically, in our analysis, the proportion of predictions A when the probability of A (the stimulus generative probability) is $p$, which we denote here by $\overline{p}(A \mid p)$, is equal to the proportion of predictions B when the probability of A is $1-p$, which we denote by $\overline{p}(B \mid 1-p)$; i.e., $\overline{p}(A \mid p) = \overline{p}(B \mid 1-p)$. More generally, the predictions conditional on a given sequence when the probability of A is $p$ are equal to the predictions conditional on the ‘mirror’ sequence (in which A and B have been swapped) when the probability of A is $1-p$; for example, extending our notation, $\overline{p}(A \mid AAB, p) = \overline{p}(B \mid BBA, 1-p)$. Here, we show how this results in the symmetries in Figure 2, and in the fact that in Figures 5 and 6 it suffices to plot the sequential effects obtained with only a fraction of all the possible sequences of two or three stimuli.
First, we note that
$$\overline{p}(A \mid p) = \overline{p}(B \mid 1-p) = 1 - \overline{p}(A \mid 1-p),$$
which implies the symmetry of $\overline{p}(A)$ in Figure 2a (grey line). Turning to conditional probabilities (and thus sequential effects), we have
$$\overline{p}(A \mid A, p) = \overline{p}(B \mid B, 1-p) = 1 - \overline{p}(A \mid B, 1-p).$$
As a result, the lines representing $\overline{p}(A \mid A)$ (blue) and $\overline{p}(A \mid B)$ (orange) in Figure 2a are reflections of each other. In addition, these equations result in the equality
$$\overline{p}(A \mid A, p) - \overline{p}(A \mid B, p) = \overline{p}(A \mid A, 1-p) - \overline{p}(A \mid B, 1-p),$$
which implies the symmetry in Figure 2b.
As for the sequential effect of the second-to-last stimulus, we show in Figures 5a and 6a the difference in the proportions of predictions A conditional on two past sequences of two stimuli, AA and BA, i.e., $\overline{p}(A \mid AA) - \overline{p}(A \mid BA)$. There are two other possible sequences of two stimuli: AB and BB. The difference in the proportions conditional on these two sequences is implied by the former difference, as:
$$\overline{p}(A \mid AB, p) - \overline{p}(A \mid BB, p) = \overline{p}(A \mid AA, 1-p) - \overline{p}(A \mid BA, 1-p).$$
As for the sequential effect of the third-to-last stimulus, we show in Figures 5b and 6b the difference in the proportions conditional on the sequences AAA and BAA, and in Figures 5c and 6c the difference in the proportions conditional on the sequences ABA and BBA. The differences in the proportions conditional on the sequences AAB and BAB, and conditional on the sequences ABB and BBB, are recovered as a function of the former two, as
$$\overline{p}(A \mid AAB, p) - \overline{p}(A \mid BAB, p) = \overline{p}(A \mid ABA, 1-p) - \overline{p}(A \mid BBA, 1-p),$$
$$\overline{p}(A \mid ABB, p) - \overline{p}(A \mid BBB, p) = \overline{p}(A \mid AAA, 1-p) - \overline{p}(A \mid BAA, 1-p).$$
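The pooling by symmetry used above amounts to pairing each sequence with its mirror; a one-line helper makes the mapping explicit:

```python
def mirror(seq):
    """Mirror a sequence of stimuli by swapping A and B (used to pair
    responses across blocks with generative probabilities p and 1 - p)."""
    return seq.translate(str.maketrans("AB", "BA"))
```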
Bayesian model selection
We implement the Bayesian model selection (BMS) procedure described in Stephan et al., 2009. Given $M$ models, this procedure aims at deriving a probabilistic belief about the distribution of these models in the general population. This unknown distribution is a categorical distribution, parameterized by the probabilities of the $M$ models, denoted by $r = (r_1, \dots, r_M)$, with $\sum_m r_m = 1$. With a finite sample of data, one cannot determine with infinite precision the values of the probabilities $r_m$. BMS thus computes an approximation of the Bayesian posterior over the vector $r$, as a Dirichlet distribution parameterized by the vector $\alpha = (\alpha_1, \dots, \alpha_M)$, i.e., the posterior distribution
Computing the parameters $\alpha_k$ of this posterior makes use of the log-evidence of each model for each subject, i.e., the logarithm of the joint probability, $p(y \mid m)$, of a given subject's responses, $y$, under the assumption that a given model, $m$, generated the responses. We use the model's maximum likelihood to obtain an approximation of the model's log-evidence, as (Balasubramanian, 1997)
$$\ln p(y \mid m) \approx \max_\theta \ln p(y \mid m, \theta) - \frac{d}{2} \ln N,$$
where $\theta$ denotes the parameters of the model, $p(y \mid m, \theta)$ is the likelihood of the model when parameterized with $\theta$, $d$ is the dimension of $\theta$, and $N$ is the size of the data, that is, the number of responses. (The well-known Bayesian Information Criterion (BIC; Schwarz, 1978) is equal to this approximation of the model's log-evidence, multiplied by $-2$.)
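The approximation above is a one-line computation; the function below is a direct transcription (the numerical arguments in the test are arbitrary):

```python
import numpy as np

def log_evidence_approx(max_log_lik, d, N):
    """Approximate log model evidence from the maximum log-likelihood,
    the number of parameters d, and the number of responses N
    (Balasubramanian, 1997). The BIC equals -2 times this quantity."""
    return max_log_lik - 0.5 * d * np.log(N)
```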
In our case, there are $M = 8$ models, each with $d = 2$ parameters: $\theta = (\lambda, \kappa)$. The posterior distribution over the parameters of the categorical distribution of models in the general population, $p(r \mid \alpha)$, allows for the derivation of several quantities of interest; following Stephan et al., 2009, we derive two types of quantities. First, given a family of models, that is, a set $\mathcal{M} = \{m_i\}$ of different models (for instance, the precision-cost models, or the Bernoulli-observer models), the expected probability of this family of models, that is, the expected probability that the behavior of a subject randomly chosen in the general population follows a model belonging to this family, is the ratio $\sum_{m \in \mathcal{M}} \alpha_m / \sum_{k=1}^{M} \alpha_k$.
We compute the expected probability of the precision-cost models (and the complementary probability of the unpredictability-cost models), and the expected probability of the Bernoulli-observer models (and the complementary probability of the Markov-observer models; see Results).
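In terms of the Dirichlet parameters, this expected probability is a simple ratio of sums:

```python
import numpy as np

def expected_family_probability(alpha, family_indices):
    """Expected probability that a random subject's behavior follows a model
    of the given family, under a Dirichlet posterior with parameters alpha."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha[list(family_indices)].sum() / alpha.sum()
```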
Second, we estimate, for each family of models $\mathcal{M}$, the probability that it is the most likely, i.e., the probability of the inequality
$$\sum_{m \in \mathcal{M}} r_m > \sum_{m \notin \mathcal{M}} r_m,$$
which is called the ‘exceedance probability’. We compute an estimate of this probability by sampling one million times the Dirichlet belief distribution (Equation 21), and counting the number of samples in which the inequality is verified. We estimate in this way the exceedance probability of the precision-cost models (and the complementary probability of the unpredictability-cost models), and the exceedance probability of the Bernoulli-observer models (and the complementary probability of the Markov-observer models; see Results).
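The sampling estimate of the exceedance probability can be sketched with NumPy's Dirichlet sampler:

```python
import numpy as np

def exceedance_probability(alpha, family_indices, n_samples=1_000_000, seed=0):
    """Estimate the probability that the given family of models is the most
    likely in the population, by sampling the Dirichlet posterior over the
    model frequencies r and counting how often the family's total frequency
    exceeds that of the complementary family."""
    rng = np.random.default_rng(seed)
    r = rng.dirichlet(alpha, size=n_samples)      # samples of model frequencies
    in_family = np.zeros(len(alpha), dtype=bool)
    in_family[list(family_indices)] = True
    return float(np.mean(r[:, in_family].sum(axis=1) > r[:, ~in_family].sum(axis=1)))
```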
Unpredictability cost for Markov observers
Here we derive the expression of the unpredictability cost for Markov observers as a function of the elements of the parameter vector $q$. For an observer of Markov order 1 ($m=1$), the vector $q$ has two elements, which are the probability of observing A at a given trial conditional on the preceding outcome being A, and the probability of observing A at a given trial conditional on the preceding outcome being B, which we denote by $q_A$ and $q_B$, respectively. The Shannon entropy, $H(X;q)$, implied by the vector $q$, is the average of the conditional entropies implied by each conditional probability, i.e.,
$$H(X; q) = p_A H(X; q_A) + p_B H(X; q_B),$$
where $p_A$ and $p_B$ are the unconditional probabilities of observing A and B, respectively (see below), and
$$H(X; q_X) = -q_X \log q_X - (1 - q_X) \log(1 - q_X),$$
where $X$ is A or B.
The unconditional probabilities $p_A$ and $p_B$ are functions of the conditional probabilities $q_A$ and $q_B$. Indeed, at trial $t+1$, the marginal probability of the event $x_{t+1} = A$, $P(x_{t+1} = A)$, is a weighted average of the probabilities of this event conditional on the preceding stimulus, $x_t$, as given by the law of total probability:
$$P(x_{t+1} = A) = P(x_{t+1} = A \mid x_t = A)\, P(x_t = A) + P(x_{t+1} = A \mid x_t = B)\, P(x_t = B),$$
i.e.,
$$p_A = q_A\, p_A + q_B\, (1 - p_A).$$
Solving for $p_A$, we find:
$$p_A = \frac{q_B}{1 - q_A + q_B}.$$
The entropy $H(X;q)$ implied by the vector $q$ is obtained by substituting these quantities in Equation 25.
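Putting the pieces together for $m=1$, the entropy computation is a few lines (base-2 logarithms are an assumption here; the choice of base only rescales the cost):

```python
import numpy as np

def markov1_entropy(q_A, q_B):
    """Entropy H(X; q) for the m = 1 Markov observer: the conditional
    entropies implied by q_A and q_B, averaged with the stationary
    probabilities p_A = q_B / (1 - q_A + q_B) and p_B = 1 - p_A."""
    def h(q):                       # binary entropy, in bits
        return 0.0 if q in (0.0, 1.0) else -q * np.log2(q) - (1 - q) * np.log2(1 - q)
    p_A = q_B / (1.0 - q_A + q_B)
    return p_A * h(q_A) + (1.0 - p_A) * h(q_B)
```

When $q_A = q_B = q$ the chain is i.i.d. and the expression reduces to the Bernoulli entropy of $q$, as expected.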
Similarly, for $m=2$ and 3, the $2^m$ elements of the vector $q$ are the parameters $q_{ij}$ and $q_{ijk}$, respectively, where $i,j,k \in \{A,B\}$, and where $q_{ij}$ is the probability of observing A at a given trial conditional on the two preceding outcomes being the sequence ‘$ij$’, and $q_{ijk}$ is the probability of observing A at a given trial conditional on the three preceding outcomes being the sequence ‘$ijk$’. The Shannon entropy, $H(X;q)$, implied by the vector $q$, is here also the average of the conditional entropies implied by each conditional probability, as
$$H(X; q) = \sum_{ij} p_{ij} H(X; q_{ij}) \quad \text{and} \quad H(X; q) = \sum_{ijk} p_{ijk} H(X; q_{ijk}),$$
where $p_{ij}$ and $p_{ijk}$ are the unconditional probabilities of observing the sequence ‘$ij$’, and of observing the sequence ‘$ijk$’, respectively. These unconditional probabilities verify a system of linear equations whose coefficients are given by the conditional probabilities. For instance, for $m=2$, we have the relation
$$P(x_t = A, x_{t+1} = A) = \sum_{i \in \{A,B\}} P(x_{t+1} = A \mid x_{t-1} = i, x_t = A)\, P(x_{t-1} = i, x_t = A),$$
i.e.,
$$p_{AA} = q_{AA}\, p_{AA} + q_{BA}\, p_{BA}.$$
The system of linear equations can be written as
$$\begin{pmatrix} p_{AA} \\ p_{AB} \\ p_{BA} \\ p_{BB} \end{pmatrix} = \begin{pmatrix} q_{AA} & 0 & q_{BA} & 0 \\ 1-q_{AA} & 0 & 1-q_{BA} & 0 \\ 0 & q_{AB} & 0 & q_{BB} \\ 0 & 1-q_{AB} & 0 & 1-q_{BB} \end{pmatrix} \begin{pmatrix} p_{AA} \\ p_{AB} \\ p_{BA} \\ p_{BB} \end{pmatrix}.$$
The solution is the eigenvector corresponding to the eigenvalue equal to 1 of the matrix in the equation above, with the additional constraint that the unconditional probabilities must sum to 1, i.e., $\sum _{ij}{p}_{ij}=1$. We find:
For $m=3$, we find the relations:
Together with the normalization constraint $\sum_{ijk} p_{ijk} = 1$, these relations allow us to determine the eight unconditional probabilities $p_{ijk}$, and thus the expression of the Shannon entropy.
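The eigenvector computation described above can be carried out numerically; the sketch below does it for $m=2$, with `q` mapping each pair 'ij' to the conditional probability of A:

```python
import numpy as np

def stationary_pairs(q):
    """Stationary probabilities p_ij of the four two-stimulus sequences for an
    m = 2 Markov observer. `q` maps each pair 'ij' to P(A | preceding = ij).
    The solution is the eigenvector of the transition matrix associated with
    the eigenvalue 1, normalized so that the probabilities sum to 1."""
    pairs = ["AA", "AB", "BA", "BB"]
    T = np.zeros((4, 4))
    for j, ij in enumerate(pairs):            # column: current pair 'ij'
        for s, prob in (("A", q[ij]), ("B", 1 - q[ij])):
            next_pair = ij[1] + s             # next pair keeps the last stimulus
            T[pairs.index(next_pair), j] = prob
    w, v = np.linalg.eig(T)
    p = np.real(v[:, np.argmin(np.abs(w - 1))])
    p /= p.sum()
    return dict(zip(pairs, p))
```

In the i.i.d. case ($q_{ij} = p_0$ for all pairs), the stationary probabilities factorize as $p_{ij} = P(i)P(j)$, which provides a quick sanity check.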
Appendix 1
Stability of subjects’ behavior throughout the experiment
To validate the assumption that we capture, in our experiment, the ‘stationary’ behavior of subjects, we compare their responses in the first half of the task (first 100 trials) to their responses in the second half (last 100 trials). We find that the unconditional proportions of predictions A in these two cases are not significantly different, for most values of the stimulus generative probability. The sign of the difference (regardless of its statistical significance) indicates that the proportions of predictions A in the second half of the experiment are slightly closer to 1 when the probability of the stimulus A is greater than 0.5, i.e., that the responses of subjects are slightly closer to optimality in the second half of the experiment (Appendix 1—figure 1a, grey lines). Regarding the sequential effects, we also obtain very similar behaviors in the first and second halves of the experiment (Appendix 1—figure 1). We conclude that for our analysis it is reasonable to assume that the behavior of subjects is stationary throughout the task.
Robustness of the model fitting
To evaluate the ability of the model-fitting procedure to correctly identify the model that generated a given set of responses, we compute a confusion matrix of the eight models. For each model, we simulate 200 runs of the task (each with 200 passive trials followed by 200 trials in which a prediction is obtained), with values of $\lambda$ and $\kappa$ close to those typically obtained when fitting the subjects' responses (for precision-cost models, $\lambda \in \{0.03, 0.7, 2, 15\}$; for unpredictability-cost models, $\lambda \in \{0.7, 2\}$; and $\kappa \in \{0.7, 1.5, 2\}$ for both families of models). We then fit each of the eight models to each of these simulated datasets, and count how many times each model best fits each dataset (Appendix 1—figure 2a). To further test the robustness of the model-fitting procedure, we randomly introduce errors in the simulated responses: for 10% of the responses, randomly chosen in each dataset, we substitute the response with its opposite (i.e., B for A, and A for B), and compute a confusion matrix using these new responses (Appendix 1—figure 2b). In both cases, the model-fitting procedure identifies the correct model a majority of times (i.e., the best-fitting model is the model that generated the data; Appendix 1—figure 2).
Finally, to examine the robustness of the inferred weight of the cost, $\lambda$, we consider for each subject the best-fitting model in each family (the precision-cost family and the unpredictability-cost family), and we fit each model separately to the subject's responses obtained in trials in which the stimulus generative probability was medium ($p \in \{0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7\}$) and in trials in which it was extreme ($p \in \{0.05, 0.1, 0.15, 0.2, 0.25, 0.75, 0.8, 0.85, 0.9, 0.95\}$). Appendix 1—figure 3 shows the correlation between the best-fitting parameters obtained in these two cases.
Distribution of subjects’ BICs
Subjects’ sequential effects — tree representation
Subjects’ sequential effects — unpooled data
As mentioned in the main text, we pool together the predictions that correspond, in different blocks of trials, to either event (left or right), as long as these events have the same probability. Appendix 1—figure 6, below, is the same as Figure 2, but without such pooling. Given a stimulus generative probability, $p$, each subject experiences one (and only one) block of trials in which either the event ‘right’ or the event ‘left’ has probability $p$: for one group of subjects the ‘right’ event has probability $p$, and for the remaining subjects it is the ‘left’ event that has probability $p$. The responses of these two groups are not pooled together in Appendix 1—figure 6, while they were in Figure 2. The same applies to any other stimulus generative probability, $p'$; note, however, that the two groups of subjects for whom $p'$ was the probability of a ‘right’ or a ‘left’ event are not the same as the two groups just defined for the probability $p$. As a result, the underlying group of subjects changes from one proportion shown in Appendix 1—figure 6 to another, whereas in Figure 2 each proportion is computed with the responses of all the subjects. This illustrates another advantage of the pooling that we use in the main text.
Subjects’ response times
Acrosssubjects results
Data availability
The behavioral data for this study and the computer code used for data analysis are freely and publicly available through the Open Science Framework repository at https://doi.org/10.17605/OSF.IO/BS5CY.
References

On the origins of suboptimality in human probabilistic inferencePLOS Computational Biology 10:e1003661.https://doi.org/10.1371/journal.pcbi.1003661

Stimulus predictability reduces responses in primary visual cortexThe Journal of Neuroscience 30:2960–2966.https://doi.org/10.1523/JNEUROSCI.373010.2010

Some informational aspects of visual perceptionPsychological Review 61:183–193.https://doi.org/10.1037/h0054663

The hot hand fallacy and the gambler’s fallacy: two faces of subjective randomness?Memory & Cognition 32:1369–1378.https://doi.org/10.3758/bf03206327

Noisy memory and overreaction to newsAEA Papers and Proceedings 109:557–561.https://doi.org/10.1257/pandp.20191049

BookOptimally Imprecise Memory and Biased ForecastsNational Bureau of Economic Research.https://doi.org/10.2139/ssrn.3731244

Statistical inference, Occam's razor, and statistical mechanicsNeural Computation 368:349–368.https://doi.org/10.1162/neco.1997.9.2.349

Twenty years of “hot hand” research: Review and critiquePsychology of Sport and Exercise 7:525–553.https://doi.org/10.1016/j.psychsport.2006.03.001

BookPossible principles underlying the transformations of sensory messagesIn: Rosenblith Walter A, editors. Sensory Communication, Chapter 13. Cambridge, MA: The MIT Press. pp. 217–234.

How haptic size sensations improve distance perceptionPLOS Computational Biology 7:e1002080.https://doi.org/10.1371/journal.pcbi.1002080

Learning the value of information in an uncertain worldNature Neuroscience 10:1214–1221.https://doi.org/10.1038/nn1954

BookErrors in probabilistic reasoning and judgment biasesIn: Benjamin DJ, editors. Handbook of Behavioral Economics. Elsevier B.V. pp. 69–186.

Randomness and inductions from streaks: “gambler’s fallacy” versus “hot hand.”Psychonomic Bulletin & Review 11:179–184.https://doi.org/10.3758/bf03206480

A limited memory algorithm for bound constrained optimizationSIAM Journal on Scientific Computing 16:1190–1208.https://doi.org/10.1137/0916069

Rational inattention, optimal consideration sets, and stochastic choiceThe Review of Economic Studies 86:1061–1094.https://doi.org/10.1093/restud/rdy037

Mechanisms underlying dependencies of performance on stimulus history in a two-alternative forced-choice taskCognitive, Affective & Behavioral Neuroscience 2:283–299.https://doi.org/10.3758/cabn.2.4.283

Predictive properties of visual adaptationCurrent Biology 22:622–626.https://doi.org/10.1016/j.cub.2012.02.021

Momentary and integrative response strategies in causal judgmentMemory & Cognition 30:1138–1147.https://doi.org/10.3758/bf03194331

Dynamics of neuronal responses in macaque MT and VIP during motion detectionNature Neuroscience 5:985–994.https://doi.org/10.1038/nn924

The Gambler’s fallacy and the hot hand: empirical data from casinosJournal of Risk and Uncertainty 30:195–209.https://doi.org/10.1007/s1116600511532

Efficient computation and cue integration with noisy population codesNature Neuroscience 4:826–831.https://doi.org/10.1038/90541

A dual role for prediction error in associative learningCerebral Cortex 19:1175–1185.https://doi.org/10.1093/cercor/bhn161

BookMemory: A Contribution to Experimental PsychologyTeachers College Press.https://doi.org/10.1037/10011000

Reward probability, amount, and information as determiners of sequential twoalternative decisionsJournal of Experimental Psychology 52:177–188.https://doi.org/10.1037/h0047727

Probability learning in 1000 trialsJournal of Experimental Psychology 62:385–394.https://doi.org/10.1037/h0041970

A free energy principle for the brainJournal of Physiology, Paris 100:70–87.https://doi.org/10.1016/j.jphysparis.2006.10.001

The free-energy principle: a rough guide to the brain?Trends in Cognitive Sciences 13:293–301.https://doi.org/10.1016/j.tics.2009.04.005

Efficient sensory encoding and bayesian inference with heterogeneous neural populationsNeural Computation 26:2103–2134.https://doi.org/10.1162/NECO_a_00638

Reasoning the fast and frugal way: Models of bounded rationalityPsychological Review 103:650–669.https://doi.org/10.1037/0033295X.103.4.650

BookBounded Rationality: The Adaptive ToolboxMIT Press.https://doi.org/10.7551/mitpress/1654.001.0001

The hot hand in basketball: On the misperception of random sequencesCognitive Psychology 17:295–314.https://doi.org/10.1016/00100285(85)900106

ConferenceSequential effects in predictionProceedings of the Annual Conference of the Cognitive Science Society. pp. 397–402.

Bayes rule as a descriptive model: the representativeness heuristicThe Quarterly Journal of Economics 95:537.https://doi.org/10.2307/1885092

Rational use of cognitive resources: levels of analysis between the computational and the algorithmicTopics in Cognitive Science 7:217–229.https://doi.org/10.1111/tops.12142

Relative and absolute strength of response as a function of frequency of reinforcementJournal of the Experimental Analysis of Behavior 4:267–272.https://doi.org/10.1901/jeab.1961.4267

Processing of temporal unpredictability in human and animal amygdalaThe Journal of Neuroscience 27:5958–5966.https://doi.org/10.1523/JNEUROSCI.521806.2007

On the rate of gain of informationQuarterly Journal of Experimental Psychology 4:11–26.https://doi.org/10.1080/17470215208416600

Order effects in belief updating: The beliefadjustment modelCognitive Psychology 24:1–55.https://doi.org/10.1016/00100285(92)90002J

Nonparametric learning rules from bandit experiments: The eyes have it!Games and Economic Behavior 81:215–231.https://doi.org/10.1016/j.geb.2013.05.003

Stimulus information as a determinant of reaction timeJournal of Experimental Psychology 45:188–196.https://doi.org/10.1037/h0056940

ConferenceA ResourceRational Approach to the Causal Frame ProblemProceedings of the 37th Annual Meeting of the Cognitive Science Society.

Probability learning and a negative recency effect in the serial anticipation of alternative symbolsJournal of Experimental Psychology 41:291–297.https://doi.org/10.1037/h0056878

Sequential effects in response time reveal learning mechanisms and event representationsPsychological Review 120:628–666.https://doi.org/10.1037/a0033180

A simple coding procedure enhances a neuron’s information capacityZeitschrift Für Naturforschung C 36:910–912.https://doi.org/10.1515/znc198191040

Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resourcesThe Behavioral and Brain Sciences 43:e1.https://doi.org/10.1017/S0140525X1900061X

Bayesian inference with probabilistic population codesNature Neuroscience 9:1432–1438.https://doi.org/10.1038/nn1790

Spiking networks for Bayesian inference and choiceCurrent Opinion in Neurobiology 18:217–222.https://doi.org/10.1016/j.conb.2008.07.004

A biased Bayesian inference for decision-making and cognitive controlFrontiers in Neuroscience 12:734.https://doi.org/10.3389/fnins.2018.00734

Effects of causal and noncausal sequences of information on subjective predictionPsychological Reports 54:211–215.https://doi.org/10.2466/pr0.1984.54.1.211

Subjective probabilities for sex of next child: U.S. College students and Philippine villagersJournal of Population Behavioral, Social, and Environmental Issues 1:132–147.https://doi.org/10.1007/BF01277598

Human inferences about sequences: a minimal transition probability modelPLOS Computational Biology 12:e1005260.https://doi.org/10.1371/journal.pcbi.1005260

An approximately Bayesian deltarule model explains the dynamics of belief updating in a changing environmentThe Journal of Neuroscience 30:12366–12378.https://doi.org/10.1523/JNEUROSCI.082210.2010

Implicit learning increases preference for predictive visual displayAttention, Perception & Psychophysics 73:1815–1822.https://doi.org/10.3758/s1341401000412

What’s next? Judging sequences of binary eventsPsychological Bulletin 135:262–285.https://doi.org/10.1037/a0014821

Generating stimuli for neuroscience using psychoPyFrontiers in Neuroinformatics 2:10.https://doi.org/10.3389/neuro.11.010.2008

A theory of memory for binary sequences: Evidence for A mental compression algorithm in humansPLOS Computational Biology 17:e1008598.https://doi.org/10.1371/journal.pcbi.1008598

Reliance on small samples, the wavy recency effect, and similaritybased learningPsychological Review 122:621–647.https://doi.org/10.1037/a0039413

Human inference in changing environments with temporal structurePsychological Review 128:879–912.https://doi.org/10.1037/rev0000276

ConferenceBias and variance of the Bayesianmean decoderAdvances in Neural Information Processing Systems 34 (NeurIPS 2021). pp. 23793–23805.

Infant statistical learningAnnual Review of Psychology 69:181–203.https://doi.org/10.1146/annurevpsych122216011805

Types of approximation for probabilistic cognition: sampling and variationalBrain and Cognition 112:98–101.https://doi.org/10.1016/j.bandc.2015.06.008

Neuronal coding of prediction errorsAnnual Review of Neuroscience 23:473–500.https://doi.org/10.1146/annurev.neuro.23.1.473

Estimating the Dimension of a ModelThe Annals of Statistics 6:461–464.https://doi.org/10.1214/aos/1176344136

Descriptive versus normative models of sequential inference judgmentJournal of Experimental Psychology 93:63–68.https://doi.org/10.1037/h0032509

Complexity and the representation of patterned sequences of symbolsPsychological Review 79:369–382.https://doi.org/10.1037/h0033118

BookBounded rationalityIn: Simon HA, editors. Models of Bounded Rationality: Empirically Grounded Economic Reason. The MIT Press. pp. 291–294.https://doi.org/10.7551/mitpress/4711.001.0001

Natural image statistics and neural representationAnnual Review of Neuroscience 24:1193–1216.https://doi.org/10.1146/annurev.neuro.24.1.1193

Implications of rational inattentionJournal of Monetary Economics 50:665–690.https://doi.org/10.1016/S03043932(03)000291

Psychophysically principled models of visual simple reaction timePsychological Review 102:567–593.https://doi.org/10.1037/0033295X.102.3.567

Expectancy or automatic facilitation? Separating sequential effects in two-choice reaction timeJournal of Experimental Psychology 11:598–616.https://doi.org/10.1037/00961523.11.5.598

Bayesian model selection for group studiesNeuroImage 46:1004–1017.https://doi.org/10.1016/j.neuroimage.2009.03.025

Expectation in perceptual decision making: neural and computational mechanismsNature Reviews. Neuroscience 15:745–756.https://doi.org/10.1038/nrn3838

Human preferences are biased towards associative informationCognition & Emotion 29:1054–1068.https://doi.org/10.1080/02699931.2014.966064

The time course of perceptual choice: the leaky, competing accumulator modelPsychological Review 108:550–592.https://doi.org/10.1037/0033295x.108.3.550

An economist’s perspective on probability matchingJournal of Economic Surveys 14:101–118.https://doi.org/10.1111/14676419.00106

A Bayesian observer model constrained by efficient coding can explain “anti-Bayesian” perceptsNature Neuroscience 18:1509–1517.https://doi.org/10.1038/nn.4105

ConferenceSequential effects reflect parallel learning of multiple environmental regularitiesAdvances in Neural Information Processing Systems 22  Proceedings of the 2009 Conference. pp. 2053–2061.

Informationconstrained statedependent pricingJournal of Monetary Economics 56:S100–S124.https://doi.org/10.1016/j.jmoneco.2009.06.014

Sequential effects: Superstition or rational behavior?Advances in Neural Information Processing Systems 21:1873–1880.

ConferenceSequential effects: A Bayesian analysis of prior bias on reaction time and behavioral choiceProceedings of the 36th Annual Conference of the Cognitive Science Society. pp. 1844–1849.

Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimizationACM Transactions on Mathematical Software. Association for Computing Machinery 23:550–560.https://doi.org/10.1145/279232.279236
Article and author information
Author details
Funding
Alfred P. Sloan Foundation (Grant G-2020-12680)
 Rava Azeredo da Silveira
CNRS (UMR8023)
 Rava Azeredo da Silveira
Fondation Pierre-Gilles de Gennes pour la recherche (Ph.D. Fellowship)
 Arthur PratCarrabin
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Doron Cohen and Michael Woodford for inspiring discussions. This work was supported by the Alfred P. Sloan Foundation through grant G2020–12680 and the CNRS through UMR8023. A.P.C. was supported by a Ph.D. fellowship of the Fondation Pierre-Gilles de Gennes pour la Recherche. We acknowledge computing resources from Columbia University's Shared Research Computing Facility project, which is supported by NIH Research Facility Improvement Grant 1G20RR03089301, and associated funds from the New York State Empire State Development, Division of Science Technology and Innovation (NYSTAR) Contract C090171, both awarded April 15, 2010.
Ethics
The study was approved by the ethics committee Île de France VII (CPP 08021). Participants gave their written consent prior to participating.
Version history
 Received: June 21, 2022
 Preprint posted: June 22, 2022 (view preprint)
 Accepted: December 11, 2023
 Version of Record published: January 15, 2024 (version 1)
Copyright
© 2024, PratCarrabin et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.