Resource-rational account of sequential effects in human prediction
Abstract
An abundant literature reports on ‘sequential effects’ observed when humans make predictions on the basis of stochastic sequences of stimuli. Such sequential effects represent departures from an optimal, Bayesian process. A prominent explanation posits that humans are adapted to changing environments, and erroneously assume nonstationarity of the environment, even if the latter is static. As a result, their predictions fluctuate over time. We propose a different explanation in which suboptimal and fluctuating predictions result from cognitive constraints (or costs), under which humans nonetheless behave rationally. We devise a framework of costly inference, in which we develop two classes of models that differ by the nature of the constraints at play: in one case the precision of beliefs comes at a cost, resulting in an exponential forgetting of past observations, while in the other beliefs with high predictive power are favored. To compare model predictions to human behavior, we carry out a prediction task that uses binary random stimuli, with probabilities ranging from 0.05 to 0.95. Although in this task the environment is static and the Bayesian belief converges, subjects’ predictions fluctuate and are biased toward the recent stimulus history. Both classes of models capture this ‘attractive effect’, but they depart in their characterization of higher-order effects. Only the precision-cost model reproduces a ‘repulsive effect’, observed in the data, in which predictions are biased away from stimuli presented in more distant trials. Our experimental results reveal systematic modulations in sequential effects, which our theoretical approach accounts for in terms of rationality under cognitive constraints.
Editor's evaluation
This valuable work addresses a longstanding empirical puzzle from a new computational perspective. The authors provide convincing evidence that attractive and repulsive sequential effects in perceptual decisions may emerge from rational choices under cognitive resource constraints rather than adjustments to changing environments. It is relevant to understanding how people represent uncertain events in the world around them and make decisions, with broad applications to economic behavior.
https://doi.org/10.7554/eLife.81256.sa0

Introduction
In many situations of uncertainty, some outcomes are more probable than others. Knowing the probability distributions of the possible outcomes provides an edge that can be leveraged to improve and speed up decision making and perception (Summerfield and de Lange, 2014). In the case of choice reaction-time tasks, it was noted in the early 1950s that human reactions were faster when responding to a stimulus whose probability was higher (Hick, 1952; Hyman, 1953). In addition, faster responses were obtained after a repetition of a stimulus (i.e., when the same stimulus was presented twice in a row), even in the case of serially-independent stimuli (i.e., when the preceding stimulus carried no information on subsequent ones; Hyman, 1953; Bertelson, 1965). The observation of this seemingly suboptimal behavior has motivated in the following decades a profuse literature on ‘sequential effects’, i.e., on the dependence of reaction times on the recent history of presented stimuli (Kornblum, 1967; Soetens et al., 1985; Cho et al., 2002; Yu and Cohen, 2008; Wilder et al., 2009; Jones et al., 2013; Zhang et al., 2014; Meyniel et al., 2016). These studies consistently report a recency effect whereby the more often a simple pattern of stimuli (e.g. a repetition) is observed in recent stimulus history, the faster subjects respond to it. In tasks in which subjects are asked to make predictions about sequences of random binary events, sequential effects are also observed and they have given rise since the 1950s to a rich literature (Jarvik, 1951; Edwards, 1961; McClelland and Hackenberg, 1978; Matthews and Sanders, 1984; Gilovich et al., 1985; Ayton and Fischer, 2004; Burns and Corpus, 2004; Croson and Sundali, 2005; Bar-Eli et al., 2006; Oskarsson et al., 2009; Plonsky et al., 2015; Plonsky and Erev, 2017; Gökaydin and Ejova, 2017).
Sequential effects are intriguing: why do subjects change their behavior as a function of the recent past observations when those are in fact irrelevant to the current decision? A common theoretical account is that humans infer the statistics of the stimuli presented to them, but because they usually live in environments that change over time, they may believe that the process generating the stimuli is subject to random changes even when it is in fact constant (Yu and Cohen, 2008; Wilder et al., 2009; Zhang et al., 2014; Meyniel et al., 2016). Consequently, they may rely excessively on the most recent stimuli to predict the next ones. In several studies, this was heuristically modeled as a ‘leaky integration’ of the stimuli, that is, an exponential discounting of past observations (Cho et al., 2002; Yu and Cohen, 2008; Wilder et al., 2009; Jones et al., 2013; Meyniel et al., 2016). Here, instead of positing that subjects hold an incorrect belief on the dynamics of the environment and do not learn that it is stationary, we propose a different account, whereby a cognitive constraint is hindering the inference process and preventing it from converging to the correct, constant belief about the unchanging statistics of the environment. This proposal calls for the investigation of the kinds of choice patterns and sequential effects that would result from different cognitive constraints at play during inference.
We derive a framework of constrained inference, in which a cost hinders the representation of belief distributions (posteriors). This approach is in line with a rich literature that views several perceptual and cognitive processes as resulting from a constrained optimization: the brain is assumed to operate optimally, but within some posited limits on its resources or abilities. The ‘efficient coding’ hypothesis in neuroscience (Ganguli and Simoncelli, 2016; Wei and Stocker, 2015; Wei and Stocker, 2017; Prat-Carrabin and Woodford, 2021c) and the ‘rational inattention’ models in economics (Sims, 2003; Woodford, 2009; Caplin et al., 2019; Gabaix, 2017; Azeredo da Silveira and Woodford, 2019; Azeredo da Silveira et al., 2020) are examples of this approach, which has been called ‘resource-rational analysis’ (Griffiths et al., 2015; Lieder and Griffiths, 2019). Here, we investigate the proposal that human inference is resource-rational, i.e., optimal under a cost. As for the nature of this cost, we consider two natural hypotheses: first, that a higher precision in belief is harder for subjects to achieve, and thus that more precise posteriors come with higher costs; and second, that unpredictable environments are difficult for subjects to represent, and thus that they entail higher costs. Under the first hypothesis, the cost is a function of the belief held, while under the second hypothesis the cost is a function of the inferred environment. We show that the precision cost predicts ‘leaky integration’: in the resulting inference process, remote observations are discarded. Crucially, beliefs do not converge but fluctuate instead with the recent stimulus history. By contrast, under the unpredictability cost, the inference process does converge, although not to the correct (Bayesian) posterior, but rather to a posterior that implies a biased belief on the temporal structure of the stimuli.
In both cases, sequential effects emerge as the result of a constrained inference process.
We examine experimentally the degree to which the models derived from our framework account for human behavior, with a task in which we repeatedly ask subjects to predict the upcoming stimulus in sequences of Bernoulli-distributed stimuli. Most studies on sequential effects only consider the equiprobable case, in which the two stimuli have the same probability. However, the models we consider here are more general than this singular case and they apply to the entire range of stimulus probability. We thus manipulate in separate blocks of trials the stimulus generative probability (i.e., the Bernoulli probability that parameterizes the stimulus) to span the range from 0.05 to 0.95 by increments of 0.05. This enables us to examine in detail the behavior of subjects in a large gamut of environments, from the singular case of an equiprobable, maximally-uncertain environment (with a probability of 0.5 for both stimuli) to the strongly-biased, almost-certain environment in which one stimulus occurs with probability 0.95.
To anticipate our results: the predictions of subjects depend on the stimulus generative probability, but also on the history of stimuli. We examine whether the occurrence of a stimulus, in past trials, increases the proportion of predictions identical to this stimulus (‘attractive effect’), or whether it decreases this proportion (‘repulsive effect’). The two costs presented above reproduce qualitatively the main patterns in subjects’ data, but they make distinct predictions as to the modulations of the recency effect as a function of the history of stimuli, beyond the last stimulus. We show that the responses of subjects exhibit an elaborate, and at times counterintuitive, pattern of attractive and repulsive effects, and we compare these to the predictions of our models. Our results suggest that the brain infers a stimulus generative probability, but under a constraint on the precision of its internal representations; the inferred generative process may be more general than the actual one, and include higher-order statistics (e.g. transition probabilities), in contrast with the Bernoulli-distributed stimulus used in the experiment.
We present the behavioral task and we examine the predictions of subjects — in particular, how they vary with the stimulus generative probability, and how they depend, at each trial, on the preceding stimulus. We then introduce our framework of inference under constraint, and the two costs we consider, from which we derive two families of models. We examine the behavior of these models and the extent to which they capture the behavioral patterns of subjects. The models make different qualitative predictions about the sequential effects of past observations, which we compare with subjects’ data. We find that the predictions of subjects are qualitatively consistent with a model of inference of conditional probabilities, in which precise posteriors are costly.
Results
Subjects’ predictions of a stimulus increase with the stimulus probability
In a computer-based task, subjects are asked to predict which of two rods the lightning will strike. On each trial, the subject first selects, by a key press, the left- or right-hand-side rod presented on screen. A lightning symbol (which is here the stimulus) then randomly strikes either of the two rods. The trial is a success if the lightning strikes the rod selected by the subject (Figure 1a). The location of the lightning strike (left or right) is a Bernoulli random variable whose parameter $p$ (the stimulus generative probability) we manipulate across blocks of 200 trials: in each block, $p$ is a multiple of 0.05 chosen between 0.05 and 0.95. Changes of block are explicitly signaled to the subjects: each block is presented as a different town exposed to lightning strikes. The subjects are not told that the locations of the strikes are Bernoulli-distributed (in fact no information is given to them regarding how the locations are determined). Moreover, in order to capture the ‘stationary’ behavior of subjects, which presumably prevails after ample exposure to the stimulus, each block is preceded by 200 passive trials in which the stimuli (sampled with the probability chosen for the block) are successively shown with no action from the subject (Figure 1b); this is presented as a ‘useful track record’ of lightning strikes in the current town. (To verify the stationarity of subjects’ behavior, we compare their responses in the first and second halves of the 200 trials in which they are asked to make predictions. In most cases we find no significant differences. See Appendix.) We provide further details on the task in Methods.
The behavior of subjects varies with the stimulus generative probability, $p$. In our analyses, we are interested in how the subjects’ predictions of an event (left or right strike) vary with the probability of this event, regardless of its nature (left or right). Thus, for instance, we would like to pool together the trials in which a subject makes a rightward prediction when the probability of a rightward strike is 0.7, and the trials in which a subject makes a leftward prediction when the probability of a leftward strike is also 0.7. Therefore, throughout the paper, we do not discuss whether subjects predict ‘right’ or ‘left’, and instead we discuss whether they predict the event ‘A’ or the complementary event ‘B’: in different blocks of trials, A (and similarly B) may refer to different locations; but importantly, B always corresponds to the location opposite to A, and $p$ denotes the probability of A (thus B has probability $1-p$). This allows us, given a probability $p$, to pool together the responses obtained in blocks of trials in which one of the two locations has probability $p$. One advantage of this pooling is that it reduces the noise in data. Looking at the unpooled data, however, does not change our conclusions; see Appendix.
Turning to the behavior of subjects, we denote by $\overline{p}(A)$ the proportion of trials in which a subject predicts the event A. In the equiprobable condition ($p=0.5$), the subjects predict either side on about half the trials ($\overline{p}(A)=0.496$, subjects pooled; standard error of the mean (sem): 0.008; p-value of t-test of equality with 0.5: 0.59). In the non-equiprobable conditions, the optimal behavior is to predict A on none of the trials ($\overline{p}(A)=0$) if $p<0.5$, or on all trials ($\overline{p}(A)=1$) if $p>0.5$. The proportion of predictions A adopted by the subjects also increases as a function of the stimulus generative probability (Pearson correlation coefficient between $p$ and $\overline{p}(A)$, subjects pooled: 0.97; p-value: 3.3e-6; correlation between the ‘logits’, $\ln\frac{p}{1-p}$: 0.994; p-value: 5.7e-9), but not as steeply: it lies between the stimulus generative probability $p$, and the optimal response 0 (if $p<0.5$) or 1 (if $p>0.5$; Figure 2a).
First-order sequential effects: attractive influence of the most recent stimulus on subjects’ predictions
The sequences presented to subjects correspond to independent, Bernoulli-distributed random events. Having shown that the subjects’ predictions follow (in a non-optimal fashion) the stimulus generative probability, we now test whether they also exhibit the independence across consecutive trials featured by the Bernoulli process. Under this hypothesis and in the stationary regime, the proportion of predictions A conditional on the preceding stimulus being A, $\overline{p}(A|A)$, should be no different from the proportion of predictions A conditional on the preceding stimulus being B, $\overline{p}(A|B)$. (Here and below, $\overline{p}(X|Y)$ denotes the proportion of predictions X conditional on the preceding observation being Y, and not on the preceding response being Y. For the possibility that subjects’ responses depend on the preceding response, see Methods.)
In other words, conditioning on the preceding stimulus should have no effect. In subjects’ responses, however, these two conditional proportions are markedly different for all stimulus generative probabilities (Fisher exact test, subjects pooled: all p-values < 1e-10; Figure 2a). Both quantities increase as a function of the stimulus generative probability, but the proportions of predictions A conditional on an A are consistently greater than the proportions of predictions A conditional on a B, i.e., $\overline{p}(A|A)-\overline{p}(A|B)>0$ (Figure 2b). (We note that because the stimulus is either A or B, it follows that, symmetrically, the proportions of predictions B conditional on a B are consistently greater than the proportions of predictions B conditional on an A.) In other words, the preceding stimulus has an ‘attractive’ sequential effect. In addition, this attractive sequential effect seems stronger for values of the stimulus generative probability closer to the equiprobable case ($p=0.5$), and to decrease for more extreme values ($p$ closer to 0 or to 1; Figure 2b). The results in Figure 2 are obtained by pooling together the responses of the subjects. Results derived from an across-subjects analysis are very similar; see Appendix.
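These conditional proportions are straightforward to measure from a subject's trial sequence. The sketch below is our own illustration (the 0/1 encoding and function name are not from the study's analysis code): it pools predictions according to the identity of the preceding stimulus.

```python
def conditional_proportions(stimuli, predictions):
    """Proportion of predictions A conditioned on the preceding stimulus.

    stimuli and predictions are aligned lists of 0/1 (1 encodes A),
    one entry per trial. Returns (p(A|A), p(A|B)): the fraction of
    trials on which the subject predicted A, among trials whose
    *preceding* stimulus was A, respectively B.
    """
    pairs = list(zip(stimuli[:-1], predictions[1:]))
    after_A = [pred for prev, pred in pairs if prev == 1]
    after_B = [pred for prev, pred in pairs if prev == 0]
    return sum(after_A) / len(after_A), sum(after_B) / len(after_B)
```

An attractive effect of the preceding stimulus corresponds to the first returned proportion exceeding the second.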
A framework of costly inference
The attractive effect of the preceding stimulus on subjects’ responses suggests that the subjects have not correctly inferred the Bernoulli statistics of the process generating the stimuli. We investigate the hypothesis that their ability to infer the underlying statistics of the stimuli is hampered by cognitive constraints. We assume that these constraints can be understood as a cost, bearing on the representation, by the brain, of the subject’s beliefs about the statistics. Specifically, we derive an array of models from a framework of inference under costly posteriors (Prat-Carrabin et al., 2021a), which we now present. We consider a model subject who is presented on each trial $t$ with a stimulus $x_t\in\{0,1\}$ (where 0 and 1 encode for B and A, respectively) and who uses the sequence of stimuli $x_{1:t}=(x_1,\dots,x_t)$ to infer the stimulus statistics, over which she holds the belief distribution $\hat{P}_t$. A Bayesian observer equipped with this belief $\hat{P}_t$ and observing a new observation $x_{t+1}$ would obtain its updated belief $P_{t+1}$ through Bayes’ rule. However, a cognitive cost $C(P)$ hinders our model subject’s ability to represent probability distributions $P$. Thus, she approximates the posterior $P_{t+1}$ through another distribution $\hat{P}_{t+1}$ that minimizes a loss function $L$ defined as

$$L(\hat{P}_{t+1}) = D(\hat{P}_{t+1};P_{t+1}) + \lambda\,C(\hat{P}_{t+1}),\tag{1}$$
where $D$ is a measure of distance between two probability distributions, and $\lambda \ge 0$ is a coefficient specifying the relative weight of the cost. (We are not proposing that subjects actively minimize this quantity, but rather that the brain’s inference process is an effective solution to this optimization problem.) Below, we use the Kullback-Leibler divergence for the distance (i.e. $D(\hat{P}_{t+1};P_{t+1})=D_{KL}(\hat{P}_{t+1}\,\|\,P_{t+1})$). If $\lambda=0$, the solution to this minimization problem is the Bayesian posterior; if $\lambda\ne 0$, the cost distorts the Bayesian solution in ways that depend on the form of the cost borne by the subject (we detail further below the two kinds of costs we investigate).
In our framework, the subject assumes that the $m$ preceding stimuli ($x_{t-m+1:t}$ with $m\ge 0$) and a vector of parameters $q$ jointly determine the distribution of the stimulus at trial $t+1$, $p(x_{t+1}\mid x_{t-m+1:t},q)$. Although in our task the stimuli are Bernoulli-distributed (thus they do not depend on preceding stimuli) and a single parameter determines the probability of the outcomes (the stimulus generative probability), the subject may admit the possibility that more complex mechanisms govern the statistics of the stimuli, for example transition probabilities between consecutive stimuli. Therefore, the vector $q$ may contain more than one parameter, and the number $m$ of preceding stimuli assumed to influence the probability of the following stimulus, which we call the ‘Markov order’, may be greater than 0.
Below, we call ‘Bernoulli observer’ any model subject who assumes that the stimuli are Bernoulli-distributed ($m=0$); in this case the vector $q$ consists of a single parameter that determines the probability of observing A, which we also denote by $q$ for the sake of concision. The bias and variability in the inference of the Bernoulli observer are studied in Prat-Carrabin et al., 2021a. We call ‘Markov observer’ any model subject who posits that the probability of the stimulus depends on the preceding stimuli ($m>0$). In this case, the vector $q$ contains the $2^m$ conditional probabilities of observing A after observing each possible sequence of $m$ stimuli. For instance, with $m=1$ the vector $q$ is the pair of parameters $(q_A, q_B)$ denoting the probabilities of observing a stimulus A after observing, respectively, a stimulus A and a stimulus B. In the absence of a cost, the belief over the parameter(s) eventually converges towards the parameter vector that is consistent with the generative Bernoulli statistics governing the stimulus (except if the prior precludes this parameter vector). Below, we assume a uniform prior.
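To make the parameterization concrete: an order-$m$ Markov observer holds one conditional probability per length-$m$ stimulus history, i.e., a parameter vector of size $2^m$. A minimal sketch of this bookkeeping (the function name and encoding are ours, purely illustrative):

```python
from itertools import product

def markov_contexts(m):
    """All 2**m length-m stimulus histories (tuples of 0s and 1s, with
    1 encoding A) that an order-m Markov observer conditions on. The
    parameter vector q assigns to each history one probability that the
    next stimulus is A. For m=0 the single empty context () is returned,
    recovering the Bernoulli observer with its lone parameter."""
    return list(product([0, 1], repeat=m))
```

For $m=1$ this returns `[(0,), (1,)]`, i.e., the two contexts parameterized by the pair $(q_B, q_A)$.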
To understand how the costs distort the inference process, it is useful to have in mind the solution to the ‘unconstrained’ inference problem (with $\lambda=0$), i.e., the Bayesian posterior, which we denote by $P_t^{\ast}(q)$. In the case of a Bernoulli observer ($m=0$), after $t$ trials, the Bayesian posterior is a Beta distribution,

$$P_t^{\ast}(q) \propto q^{\,n_t^A}\,(1-q)^{\,n_t^B},\tag{2}$$
where $n_t^X$ is the number of stimuli $X$ observed up to trial $t$, that is, $n_t^A=\sum_{i=1}^{t}x_i$ and $n_t^B=\sum_{i=1}^{t}(1-x_i)$. As more evidence is accumulated, the Bayesian posterior gradually narrows and converges towards the value of the stimulus generative probability (Figure 3c and d, grey lines).
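The unconstrained Bernoulli-observer posterior can be evaluated numerically; the grid discretization below is an illustrative choice of ours, not part of the models.

```python
import math

def bernoulli_bayes_posterior(x, n_grid=999):
    """Bayesian posterior over the Bernoulli parameter q on a discrete
    grid, with a uniform prior: proportional to q^{n_A} (1-q)^{n_B}.

    x: list of 0/1 stimuli (1 encodes A).
    Returns (grid, posterior weights summing to 1).
    """
    grid = [(i + 1) / (n_grid + 1) for i in range(n_grid)]  # 0.001 .. 0.999
    n_A = sum(x)
    n_B = len(x) - n_A
    log_post = [n_A * math.log(q) + n_B * math.log(1 - q) for q in grid]
    m = max(log_post)
    w = [math.exp(lp - m) for lp in log_post]  # subtract max to avoid underflow
    total = sum(w)
    return grid, [wi / total for wi in w]
```

With 7 A's and 3 B's, the posterior peaks at $q=0.7$ and has mean $8/12$ (the Beta(8, 4) mean), illustrating the gradual narrowing around the empirical frequency.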
The ways in which the Bayesian posterior is distorted, in our models, depend on the nature of the cost that weighs on the inference process. Although many assumptions could be made on the kind of constraint that hinders human inference, and on the cost it would entail in our framework, here we examine two costs that stem from two possible principles: that the cost is a function of the beliefs held by the subject, or that it is a function of the environment that the subject is inferring. We detail, below, these two costs.
Precision cost
A first hypothesis about the inference process of subjects is that the brain mobilizes resources to represent probability distributions, and that more ‘precise’ distributions require more resources. We write the cost associated with a distribution, $\hat{P}(q)$, as the negative of its entropy,

$$C(\hat{P}) = -H(\hat{P}) = \int \hat{P}(q)\,\ln\hat{P}(q)\,\mathrm{d}q,\tag{3}$$
which is a measure of the amount of certainty in the distribution. Wider (less concentrated) distributions provide less information about the probability parameter and are thus less costly than narrower (more concentrated) distributions (Figure 3b). As an extreme case, the uniform distribution is the least costly.
With this cost, the loss function (Equation 1) is minimized by the distribution equal to the product of the prior and the likelihood, raised to the exponent $1/(\lambda+1)$, and normalized, i.e.,

$$\hat{P}_{t+1}(q) \propto \left[\hat{P}_t(q)\,p(x_{t+1}\mid x_{t-m+1:t},q)\right]^{1/(\lambda+1)}.\tag{4}$$
When $\lambda$ is strictly positive, the exponent is positive and lower than 1. As a result, the solution ‘flattens’ the Bayesian posterior, and in the extreme case of an unbounded cost ($\lambda\to\infty$) the posterior is the uniform distribution.
Furthermore, in the expression of our model subject’s posterior, the likelihood $p(x_{t+1}\mid x_{t-m+1:t},q)$ is raised, after $k$ additional trials, to the exponent $1/(\lambda+1)^{k+1}$; it thus decays to zero as the number $k$ of new stimuli increases. One can interpret this effect as gradually forgetting past observations. Specifically, we recover the predictions of leaky-integration models, in which remote patterns in the sequence of stimuli are discounted through an exponential filter (Yu and Cohen, 2008; Meyniel et al., 2016); here, we do not posit the gradual forgetting of remote observations, but instead we derive it as an optimal solution to a problem of constrained inference. We illustrate leaky integration in the case of a Bernoulli observer ($m=0$): in this case, the posterior after $t$ trials, $\hat{P}_t(q)$, is a Beta distribution,

$$\hat{P}_t(q) \propto q^{\,\tilde{n}_t^A}\,(1-q)^{\,\tilde{n}_t^B},\tag{5}$$
where $\tilde{n}_t^A$ and $\tilde{n}_t^B$ are exponentially-filtered counts of the number of stimuli A and B observed up to trial $t$, i.e.,

$$\tilde{n}_t^A=\sum_{i=1}^{t}\left(\frac{1}{\lambda+1}\right)^{t-i+1}x_i, \qquad \tilde{n}_t^B=\sum_{i=1}^{t}\left(\frac{1}{\lambda+1}\right)^{t-i+1}(1-x_i).\tag{6}$$
In other words, the solution to the constrained inference problem, with the precision cost, is similar to the Bayesian posterior (Equation 2), but with counts of the two stimuli that gradually ‘forget’ remote observations (in the absence of a cost, that is, $\lambda=0$, we have $\tilde{n}_t^A=n_t^A$ and $\tilde{n}_t^B=n_t^B$, and thus we recover the Bayesian posterior). As a result, these counts fluctuate with the recent history of the stimuli. Consequently, the posterior $\hat{P}_t(q)$ is dominated by the recent stimuli: it does not converge, but instead fluctuates with the recent stimulus history (Figure 3c and d, purple lines; compare with the green and gray lines). Hence, this model implies predictions about subsequent stimuli that depend on the stimulus history, i.e., it predicts sequential effects.
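The filtered counts above admit a simple recursion: each new observation is added to the running count, and both counts are then shrunk by the factor $1/(\lambda+1)$. A sketch, assuming (as described in the text) a uniform prior and a per-update decay of $1/(\lambda+1)$; the function name is ours:

```python
def leaky_counts(x, lam):
    """Exponentially filtered stimulus counts under the precision cost.

    x: list of 0/1 stimuli (1 encodes A); lam: cost weight (lambda >= 0).
    At each trial the counts are incremented by the new observation and
    then multiplied by 1/(1+lam), so an observation i trials old is
    weighted by (1/(1+lam))**(t-i+1). With lam = 0 the plain Bayesian
    counts n_A, n_B are recovered.
    """
    gamma = 1.0 / (1.0 + lam)
    n_A, n_B = 0.0, 0.0
    for x_t in x:
        n_A = gamma * (n_A + x_t)
        n_B = gamma * (n_B + (1 - x_t))
    return n_A, n_B
```

Because recent observations carry geometrically larger weights, the counts (and hence the posterior built from them) keep fluctuating with the recent stimulus history instead of converging.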
Unpredictability cost
A different hypothesis is that the subjects favor, in their inference, parameter vectors $q$ that correspond to more predictable outcomes. We quantify the outcome unpredictability by the Shannon entropy (Shannon, 1948) of the outcome implied by the vector of parameters $q$, which we denote by $H(X;q)$. (In the Bernoulli-observer case, $H(X;q)=-q\ln q-(1-q)\ln(1-q)$; for the Markov-observer cases, see Methods.) The cost associated with the distribution $\hat{P}(q)$ is the expectation of this quantity under the belief distribution, i.e.,

$$C(\hat{P}) = \int H(X;q)\,\hat{P}(q)\,\mathrm{d}q,\tag{7}$$
which we call the ‘unpredictability cost’. For a Bernoulli observer, a posterior concentrated on extreme values of the Bernoulli parameter (toward 0 or 1), thus representing more predictable environments, comes with a lower cost than a posterior concentrated on values of the Bernoulli parameter close to 0.5, which correspond to the most unpredictable environments (Figure 3a).
After $t$ trials, the loss function (Equation 1) under this cost is minimized by the posterior

$$\hat{P}_t(q) \propto P_t^{\ast}(q)\,e^{-\lambda t H(X;q)},\tag{8}$$
i.e., the product of the Bayesian posterior, which narrows with $t$ around the stimulus generative probability, and of a function that is larger for values of $q$ that imply less entropic (i.e. more predictable) environments (see Methods). In short, with the unpredictability cost the model subject’s posterior is ‘pushed’ towards less entropic values of $q$.
In the Bernoulli case ($m=0$), the posterior after $t$ stimuli has a global maximum, $q^{\ast}(n_t^A/t)$, that depends on the proportion $n_t^A/t$ of stimuli A observed up to trial $t$. As the number of presented stimuli $t$ grows, the posterior $\hat{P}_t$ becomes concentrated around this maximum. The proportion $n_t^A/t$ naturally converges to the stimulus generative probability, $p$, thus our subject’s inference converges towards the value $q^{\ast}(p)$, which is different from the true value $p$ in the non-equiprobable case ($p\ne 0.5$). The equiprobable case ($p=0.5$) is singular, in that with a weak cost ($\lambda<1$) the inferred probability is unbiased ($q^{\ast}(p)=0.5$), while with a strong cost ($\lambda>1$) the inferred probability does not converge but instead alternates between two values above and below 0.5; see Prat-Carrabin et al., 2021a. In other words, except in the equiprobable case, the inference converges but it is biased, i.e., the posterior peaks at an incorrect value of the stimulus generative probability (Figure 3c and d, green lines). This value is closer to the extremes (0 and 1) than the stimulus generative probability, that is, it implies an environment more predictable than the actual one (Figure 3d).
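The asymptotic bias can be illustrated numerically. The sketch below assumes, following the discussion above, that the posterior concentrates on the value of $q$ maximizing the average log-likelihood penalized by the entropy, $p\ln q+(1-p)\ln(1-q)-\lambda H(X;q)$; the grid search is an illustrative numerical choice of ours:

```python
import math

def q_star(p, lam, n_grid=9999):
    """Asymptotic inferred Bernoulli parameter under the unpredictability
    cost: the q maximizing p*ln(q) + (1-p)*ln(1-q) - lam*H(q), where
    H(q) = -q*ln(q) - (1-q)*ln(1-q), found by grid search on (0, 1)."""
    best_q, best_val = None, -math.inf
    for i in range(n_grid):
        q = (i + 1) / (n_grid + 1)  # 0.0001 .. 0.9999
        H = -q * math.log(q) - (1 - q) * math.log(1 - q)
        val = p * math.log(q) + (1 - p) * math.log(1 - q) - lam * H
        if val > best_val:
            best_q, best_val = q, val
    return best_q
```

With no cost, the maximizer is the true probability; with a positive cost and $p\ne 0.5$, it is pushed toward the nearest extreme, i.e., toward a more predictable environment.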
In the case of a Markov observer ($m>0$), the posterior also converges to a vector of parameters $q$ which implies not only a bias but also that the conditional probabilities of a stimulus A (conditioned on different stimulus histories) are not equal. The prediction of the next stimulus being A on a given trial depends on whether the preceding stimulus was A or B: this model therefore predicts sequential effects. We further examine below the behavior of this model in the cases of a Bernoulli observer and of different Markov observers. We refer the reader interested in more details on the Markov models, including their mathematical derivations, to the Methods section.
In short, with the unpredictability-cost models, when $p\ne 0.5$, the inference process converges to an asymptotic posterior peaked at $q^{\ast}(p)$, a value that does not itself depend on the stimulus history but that is biased (Figure 3c, d, green lines). In particular, for Markov observers ($m>0$), the asymptotic posterior corresponds to an erroneous belief about the dependency of the stimulus on the recent stimulus history, which results in sequential effects in behavior.
Overview of the inference models
Although the two families of models derived from the two costs both potentially generate sequential effects, they do so by giving rise to qualitatively different inference processes. Under the unpredictability cost, the inference converges to a posterior that, in the Bernoulli case ($m=0$), implies a biased estimate of the stimulus generative probability (Figure 3d, green lines), while in the Markov case ($m>0$) it implies the belief that there are serial dependencies in the stimuli: predictions therefore depend on the recent stimulus history. By contrast, the precision cost prevents beliefs from converging (Figure 3c, purple lines). As a result, the subject’s predictions vary with the recent stimulus history (Figure 3d). This inference process amounts to an exponential discount of remote observations, or equivalently, to the overweighting of recent observations (Equation 6).
To investigate in more detail the sequential effects that these two costs produce, we implement two families of inference models derived from the two costs. Each model is characterized by the type of cost (unpredictability cost or precision cost), and by the assumed Markov order ($m$): we examine the case of a Bernoulli observer ($m=0$) and three cases of Markov observers (with $m=$ 1, 2, and 3). We thus obtain $2\times 4=8$ models of inference. Each of these models has one parameter $\lambda$ controlling the weight of the cost. (We also examine a ‘hybrid’ model that combines the two costs; see below.)
Response-selection strategy
We assume that the subject’s response on a given trial depends on the inferred posterior according to a generalization of ‘probability matching’ implemented in other studies (Battaglia et al., 2011; Yu and Huang, 2014; Prat-Carrabin et al., 2021b). In this response-selection strategy, the subject predicts A with the probability $\overline{p}_t^{\,\kappa}/(\overline{p}_t^{\,\kappa}+(1-\overline{p}_t)^{\kappa})$, where $\overline{p}_t$ is the expected probability of a stimulus A derived from the posterior, i.e., $\overline{p}_t\equiv\int p(x_{t+1}=1\mid x_{t-m+1:t},q)\,\hat{P}_t(q)\,\mathrm{d}q$. The single parameter $\kappa$ controls the randomness of the response: with $\kappa=0$ the subject predicts A and B with equal probability; with $\kappa=1$ the response-selection strategy corresponds to probability matching, that is, the subject predicts A with probability $\overline{p}_t$; and as $\kappa$ increases toward infinity the choices become optimal, that is, the subject predicts A if the expected probability of observing a stimulus A, $\overline{p}_t$, is greater than 0.5, and predicts B if it is lower than 0.5 (if $\overline{p}_t=0.5$ the subject chooses A or B with equal probability). In our investigations, we also implement several other response-selection strategies, including one in which subjects have a propensity to repeat their preceding response, or conversely, to alternate; these analyses do not change our conclusions (see Methods).
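The generalized probability-matching rule is a one-line computation; a sketch (the function name is ours):

```python
def response_probability(p_bar, kappa):
    """Probability of predicting A under generalized probability matching:
    p_bar**kappa / (p_bar**kappa + (1 - p_bar)**kappa).

    kappa = 0 gives fully random responses (0.5), kappa = 1 gives
    probability matching, and large kappa approaches the optimal
    deterministic choice of the more likely stimulus.
    """
    return p_bar ** kappa / (p_bar ** kappa + (1 - p_bar) ** kappa)
```

For instance, with an expected probability of 0.7 for stimulus A, the rule yields 0.5, 0.7, and nearly 1 for $\kappa=0$, $1$, and large $\kappa$, respectively.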
Model fitting favors Markov-observer models
Each of our eight models has two parameters: the factor weighting the cost, $\lambda$, and the exponent of the generalized probability matching, $\kappa$. We fit the parameters of each model to the responses of each subject, by maximizing their likelihood. We find that 60% of subjects are best fitted by one of the unpredictability-cost models, while 40% are best fitted by one of the precision-cost models. When pooling the two types of cost, 65% of subjects are best fitted by a Markov-observer model. We implement a ‘Bayesian model selection’ procedure (Stephan et al., 2009), which takes into account, for each subject, the likelihoods of all the models (and not only the maximum among them) in order to obtain a Bayesian posterior over the distribution of models in the general population (see Methods). The derived expected probability of unpredictability-cost models is 57% (and 43% for precision-cost models), with an exceedance probability (i.e. the probability that unpredictability-cost models are more frequent in the general population) of 78%. The expected probability of Markov-observer models, regardless of the cost used in the model, is 70% (and 30% for Bernoulli-observer models), with an exceedance probability (i.e. the probability that Markov-observer models are more frequent in the general population) of 98%. These results indicate that the responses of subjects are generally consistent with a Markov-observer model, although the stimuli used in the experiment are Bernoulli-distributed. As for the unpredictability-cost and the precision-cost families of models, Bayesian model selection does not provide decisive evidence in favor of either, indicating that they both capture some aspects of the responses of the subjects. Below, we examine more closely the behaviors of the models, and point to qualitative differences between the predictions resulting from each model family.
Before turning to these results, we validate the robustness of our model-fitting procedure with several additional analyses. First, we estimate a confusion matrix to examine the possibility that the model-fitting procedure could misidentify the models which generated test sets of responses. We find that the best-fitting model corresponds to the true model in at least 70% of simulations (the chance level is 1/8 = 12.5%), and actually more than 90% for the majority of models (see Appendix).
Second, we seek to verify whether the best-fitting cost factor, $\lambda$, that we obtain for each subject is consistent across the range of probabilities tested. Specifically, we fit the models separately to the responses obtained in the blocks of trials whose stimulus generative probability was ‘medium’ (between 0.3 and 0.7, included), on the one hand, and to the responses obtained when the probability was ‘extreme’ (below 0.3, and above 0.7), on the other hand; and we compare the values of the best-fitting cost factors $\lambda$ in these two cases. More precisely, for the precision-cost family, we look at the decay rate, $\ln(1+\lambda)$, which is the inverse of the characteristic time over which the model subject ‘forgets’ past observations. With both families of models, we find that on a logarithmic scale the parameters in the medium- and extreme-probabilities cases are significantly correlated across subjects (Pearson’s $r$, precision-cost models: $r=0.75$, p-value: $10^{-4}$; unpredictability-cost models: $r=0.47$, p-value: 0.036). In other words, if a subject is best fitted by a large cost factor in medium-probabilities trials, he or she is likely to be also best fitted by a large cost factor in extreme-probabilities trials. This indicates that our models capture idiosyncratic features of subjects that generalize across conditions instead of varying with the stimulus probability (see Appendix).
Third, as mentioned above, we examine a variant of the response-selection strategy in which the subject sometimes repeats the preceding response, or conversely alternates and chooses the other response, instead of responding based on the inferred probability of the next stimulus. This propensity to repeat or alternate does not change the best-fitting inference model of most subjects, and the best-fitting values of the parameters $\lambda$ and $\kappa$ are very stable whether or not we allow for this propensity. This analysis supports the results we present here, and speaks to the robustness of the model-fitting procedure (see Methods).
Finally, as the unpredictability-cost family and the precision-cost family of models both seem to capture the responses of a sizable share of the subjects, one might assume that the behavior of most subjects actually falls ‘somewhere in between’, and would be best accounted for by a hybrid model combining the two costs. In our investigations, we have implemented such a model, whereby the subject’s approximate posterior ${\hat{P}}_{t}$ results from the minimization of a loss function that includes both a precision cost, with weight ${\lambda}_{p}$, and an unpredictability cost, with weight ${\lambda}_{u}$ (and the response-selection strategy is the generalized probability matching, with parameter $\kappa$). We do not find that most subjects’ responses are better fitted (as measured by the Bayesian Information Criterion; Schwarz, 1978) by a combination of the two costs: instead, for more than two thirds of subjects, the best-fitting model features just one cost (see Methods). In other words, the two costs seem to capture different aspects of the behavior that are predominant in different subpopulations. Below, we examine the behavioral patterns resulting from each cost type, in comparison with the behavior of the subjects.
Models of costly inference reproduce the attractive effect of the most recent stimulus
We now examine the behavioral patterns resulting from the models. All the models we consider predict that the proportion of predictions A, $\overline{p}(A)$, is a smooth, increasing function of the stimulus generative probability (when $\lambda <\infty$ and $0<\kappa <\infty$; Figure 4a–d, grey lines); thus we focus, here, on the ability of the models to reproduce the subjects’ sequential effects. With the unpredictability-cost model of a Bernoulli observer ($m=0$), the belief of the model subject, as mentioned above, asymptotically converges in non-equiprobable cases to an erroneous value of the stimulus generative probability (Figure 3d, green lines). After a large number of observations (such as the 200 ‘passive’ trials, in our task), the sensitivity of the belief to new observations becomes almost imperceptible; as a result, this model predicts practically no sequential effects (Figure 4b), that is, $\overline{p}(A\mid A)\simeq \overline{p}(A\mid B)$. With the unpredictability-cost model of a Markov observer (e.g. $m=1$), the belief of the model subject also converges, but to a vector of parameters $q$ that implies a sequential dependency in the stimulus, that is, ${q}_{A}\ne {q}_{B}$, resulting in sequential effects in predictions, that is, $\overline{p}(A\mid A)\ne \overline{p}(A\mid B)$. The parameter vector $q$ yields a more predictable (less entropic) environment if the probability conditional on the more frequent outcome (say, A) is less entropic than the probability conditional on the less frequent outcome (B). This is the case if the former is greater than the latter, resulting in the inequality $\overline{p}(A\mid A)>\overline{p}(A\mid B)$, that is, in sequential effects of the attractive kind (Figure 4d). (The case in which B is the more frequent outcome results in the inequality $\overline{p}(B\mid B)>\overline{p}(B\mid A)$, i.e., $1-\overline{p}(A\mid B)>1-\overline{p}(A\mid A)$, i.e., the same, attractive sequential effects.)
Turning to the precision-cost models, we have noted that in these models the posterior fluctuates with the recent history of the stimuli (Figure 3c): as a result, sequential effects are obtained, even with a Bernoulli observer ($m=0$; Figure 4a). The most recent stimulus has the largest weight in the exponentially filtered counts that determine the posterior (Equation 6); thus the model subject’s prediction is biased towards the last stimulus, that is, the sequential effect is attractive ($\overline{p}(A\mid A)>\overline{p}(A\mid B)$). With the traditional probability-matching response-selection strategy (i.e. $\kappa =1$), the strength of the attractive effect is the same across all stimulus generative probabilities (i.e. the difference $\overline{p}(A\mid A)-\overline{p}(A\mid B)$ is constant; Figure 4a, dotted lines and light-red dots). With the generalized probability-matching response-selection strategy, if $\kappa >1$, proportions below and above 0.5 are brought closer to the extremes (0 and 1, respectively), resulting in larger sequential effects for values of the stimulus generative probability closer to 0.5 (Figure 4a, solid lines and red dots; the model is simulated with $\kappa =2.8$, a value representative of the subjects’ best-fitting values for this parameter). We also find stronger sequential effects closer to the equiprobable case in subjects’ data (Figure 2b).
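A minimal sketch of this leaky-counting behavior (our own illustration, not the authors’ code: we discount past counts by $1/(1+\lambda)$ per trial, consistent with the decay rate $\ln(1+\lambda)$, and take the posterior mean under a flat prior; the exact normalization of Equation 6 may differ):

```python
def filtered_counts(stimuli, lam):
    """Exponentially filtered counts of A and B: past observations are
    discounted by gamma = 1/(1+lam) at each trial (decay rate ln(1+lam))."""
    gamma = 1.0 / (1.0 + lam)
    n_A = n_B = 0.0
    for x in stimuli:  # stimuli: string of 'A'/'B'
        n_A = gamma * n_A + (1.0 if x == 'A' else 0.0)
        n_B = gamma * n_B + (1.0 if x == 'B' else 0.0)
    return n_A, n_B

def expected_p_A(stimuli, lam):
    """Expected probability of A: mean of a Beta posterior over the counts."""
    n_A, n_B = filtered_counts(stimuli, lam)
    return (n_A + 1.0) / (n_A + n_B + 2.0)

# The most recent stimulus carries the largest weight, so the prediction
# is attracted towards it (same earlier history, different last stimulus):
history = 'AABAB'
print(expected_p_A(history + 'A', 1.0) > expected_p_A(history + 'B', 1.0))  # True
```

With $\lambda=0$ (no discount), the counts reduce to the ordinary Bayesian sufficient statistics of a Bernoulli inference.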
The precision-cost model of a Markov observer ($m=1$) also predicts attractive sequential effects (Figure 4c). While the behavior of the Bernoulli observer (with a precision cost) is determined by two exponentially filtered counts of the two possible stimuli (Equation 6), that of the Markov observer with $m=1$ depends on four exponentially filtered counts of the four possible pairs of stimuli. After observing a stimulus B, the belief that the following stimulus should be A or B is determined by the exponentially filtered counts of the pairs BA and BB. If $p$ is large, i.e., if the stimulus B is infrequent, then the BA and BB pairs are also infrequent and the corresponding counts are close to zero: the model subject thus behaves as if only very little evidence had been observed about the transitions B to A and B to B, resulting in a proportion of predictions A conditional on a preceding B, $\overline{p}(A\mid B)$, close to 0.5 (Figure 4c, orange line). Consequently, the sequential effects are stronger for values of the stimulus generative probability closer to the extremes (Figure 4c, red dots).
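The same logic extends to the Markov observer, sketched below with four leaky pair counts (again our own illustration, with a flat prior on each transition; function names are hypothetical):

```python
def pair_counts(stimuli, lam):
    """Exponentially filtered counts of the four pairs AA, AB, BA, BB."""
    gamma = 1.0 / (1.0 + lam)
    counts = {pair: 0.0 for pair in ('AA', 'AB', 'BA', 'BB')}
    for prev, nxt in zip(stimuli, stimuli[1:]):
        for pair in counts:          # discount all pair counts each trial
            counts[pair] *= gamma
        counts[prev + nxt] += 1.0    # increment the observed transition
    return counts

def p_A_given(prev, counts):
    """Expected P(next = A | previous = prev), flat prior on the transition."""
    n_A, n_B = counts[prev + 'A'], counts[prev + 'B']
    return (n_A + 1.0) / (n_A + n_B + 2.0)

# When B is rare, the BA and BB counts stay near zero, so the prediction
# conditional on a preceding B stays near 0.5:
counts = pair_counts('AAAAAAAA', 1.0)
print(p_A_given('B', counts))        # 0.5: no evidence on transitions from B
print(p_A_given('A', counts) > 0.5)  # True: attracted towards A
```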
Both families of costs are thus able to produce attractive sequential effects, albeit with some qualitative differences. (In Figure 4a–d we show the behaviors resulting from the two costs for a Bernoulli observer and a Markov observer of order $m=1$; the Markov observers of higher order exhibit qualitatively similar behaviors; see Methods.) As the model fitting indicates that different groups of subjects are best fitted by models belonging to the two families, we examine separately the behaviors of the subjects whose responses are best fitted by each of the two costs (Figure 4e and f), in comparison with the behaviors of the corresponding best-fitting models (Figure 4g and h). This provides a finer understanding of the behavior of subjects than the group average shown in Figure 2. For the subjects best fitted by precision-cost models, the proportion of predictions A, $\overline{p}(A)$, when the stimulus generative probability is close to 0.5, is a less steep function of this probability than for the subjects best fitted by unpredictability-cost models (Figure 4e and f, grey lines); furthermore, their sequential effects are larger (as measured by the difference $\overline{p}(A\mid A)-\overline{p}(A\mid B)$), and do not depend much on the stimulus generative probability (Figure 4e and f, red dots). The corresponding models reproduce the behavioral patterns of the subjects that they best fit (Figure 4g and h). Each family of models seems to capture specific behaviors exhibited by the subjects: when fitting the unpredictability-cost models to the responses of the subjects that are best fitted by precision-cost models, and conversely when fitting the precision-cost models to the responses of the subjects that are best fitted by unpredictability-cost models, the models do not reproduce well the subjects’ behavioral patterns (Figure 4i and j).
The precision-cost models, however, seem slightly better than the unpredictability-cost models at capturing the behavior of the subjects that they do not best fit (Figure 4, compare panel j to panel f, and panel i to panel e). Substantiating this observation, the examination of the distributions of the models’ BICs across subjects shows that when fitting the models to the subjects that they do not best fit, the precision-cost models fare better than the unpredictability-cost models (see Appendix).
Beyond the most recent stimulus: patterns of higher-order sequential effects
Notwithstanding the quantitative differences just presented, both families of models yield qualitatively similar attractive sequential effects: the model subjects’ predictions are biased towards the preceding stimulus. Does this pattern also apply to the longer history of the stimulus, i.e., do more distant trials also influence the model subjects’ predictions? To investigate this hypothesis, we examine the difference between the proportion of predictions A after observing a sequence of length $n$ that starts with A, and the proportion of predictions A after the same sequence, but starting with B, i.e., $\overline{p}(A\mid Ax)-\overline{p}(A\mid Bx)$, where $x$ is a sequence of length $n-1$, and $Ax$ and $Bx$ denote the same sequence preceded by A and by B. This quantity enables us to isolate the influence of the $n$-to-last stimulus on the current prediction. If the difference is positive, the effect is ‘attractive’; if it is negative, the effect is ‘repulsive’ (in this latter case, the presentation of an A decreases the probability that the subject predicts A in a later trial, as compared to the presentation of a B); and if the difference is zero there is no sequential effect stemming from the $n$-to-last stimulus. The case $n=1$ corresponds to the immediately preceding stimulus, whose effect we have shown to be attractive, i.e., $\overline{p}(A\mid A)-\overline{p}(A\mid B)>0$, in the responses both of the best-fitting models and of the subjects (Figures 2b, 4g and h).
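This statistic can be estimated directly from sequences of stimuli and predictions, e.g. (our own sketch; the function name is illustrative):

```python
def sequential_effect(stimuli, predictions, x):
    """Estimate p(A | A.x) - p(A | B.x): the influence of the n-to-last
    stimulus, where x is the intervening sequence (x = '' gives n = 1)."""
    n = len(x) + 1
    preds = {'A': [], 'B': []}
    for t in range(n, len(stimuli)):
        context = stimuli[t - n:t]   # the n stimuli preceding trial t
        if context[1:] == x:
            preds[context[0]].append(1.0 if predictions[t] == 'A' else 0.0)
    mean = lambda v: sum(v) / len(v)
    return mean(preds['A']) - mean(preds['B'])

# A subject who always repeats the preceding stimulus shows a maximal
# attractive effect of the last stimulus (n = 1):
stimuli = 'ABBAB'
predictions = '?' + stimuli[:-1]  # prediction at trial t = stimulus at t-1
print(sequential_effect(stimuli, predictions, ''))  # 1.0
```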
We investigate the effect of the $n$-to-last stimulus on the behavior of the two families of models, with $n=1$, $2$, and $3$. We present here the main results of this investigation; we refer the reader to Methods for a more detailed analysis. With unpredictability-cost models of Markov order $m$, there are non-vanishing sequential effects stemming from the $n$-to-last stimulus only if the Markov order is greater than or equal to the distance from this stimulus to the current trial, i.e., if $m\ge n$. In this case, the sequential effects are attractive (Figure 5).
With precision-cost models, the $n$-to-last stimuli yield non-vanishing sequential effects regardless of the Markov order, $m$. With $n=1$, the effect is attractive, i.e., $\overline{p}(A\mid A)-\overline{p}(A\mid B)>0$. With $n=2$ (second-to-last stimulus), the effect is also attractive, i.e., in the case of the pair of sequences AA and BA, $\overline{p}(A\mid AA)-\overline{p}(A\mid BA)>0$ (Figure 5a). By symmetry, the difference is also positive for the other pair of relevant sequences, AB and BB. (For instance, we note that $\overline{p}(A\mid AB)=1-\overline{p}(B\mid AB)$, and that $\overline{p}(B\mid AB)$ when the probability of A is $p$ is equal to $\overline{p}(A\mid BA)$ when the probability of A is $1-p$. We detail in Methods such relations between the proportions of predictions A or B in different situations. These relations result in the symmetries of Figure 2, for the sequential effect of the last stimulus, while for higher-order sequential effects they imply that we do not need to show, in Figure 5, the effects following all possible past sequences of two or three stimuli, as the ones we do not show are readily derived from the ones we do.)
As for the third-to-last stimulus ($n=3$), it can be followed by four different sequences of length two, but we only need to examine two of these four, for the reasons just presented. We find that for the precision-cost models, with all the Markov orders we examine (from 0 to 3), the probability of predicting A after observing the sequence AAA is greater than that after observing the sequence BAA, i.e., $\overline{p}(A\mid AAA)-\overline{p}(A\mid BAA)>0$; that is, there is an attractive sequential effect of the third-to-last stimulus if the sequence following it is AA (and, by symmetry, if it is BB; Figure 5b). So far, thus, we have found only attractive effects. However, the results are less straightforward when the third-to-last stimulus is followed by the sequence BA. In this case, for a Bernoulli observer ($m=0$), the effect is also attractive: $\overline{p}(A\mid ABA)-\overline{p}(A\mid BBA)>0$ (Figure 5c, white circles). With Markov observers ($m\ge 1$), over a range of stimulus generative probabilities $p$, the effect is repulsive: $\overline{p}(A\mid ABA)-\overline{p}(A\mid BBA)<0$, that is, the presentation of an A decreases the probability that the model subject predicts A three trials later, as compared to the presentation of a B (Figure 5c, filled circles). The occurrence of the repulsive effect in this particular case is a distinctive trait of the precision-cost models of Markov observers ($m\ge 1$); we do not obtain any repulsive effect with any of the unpredictability-cost models, nor with the precision-cost model of a Bernoulli observer ($m=0$).
Subjects’ predictions exhibit higher-order repulsive effects
We now examine the sequential effects in subjects’ responses, beyond the attractive effect of the preceding stimulus ($n=1$; discussed above). With $n=2$ (second-to-last stimulus), for the majority of the 19 stimulus generative probabilities $p$, we find attractive sequential effects: the difference $\overline{p}(A\mid AA)-\overline{p}(A\mid BA)$ is significantly positive (Figure 6a; p-values <0.01 for 11 stimulus generative probabilities, <0.05 for 13 probabilities; subjects pooled). With $n=3$ (third-to-last stimulus), we also find significant attractive sequential effects in subjects’ responses for some of the stimulus generative probabilities, when the third-to-last stimulus is followed by the sequence AA (Figure 6b; p-values <0.01 for four probabilities, <0.05 for seven probabilities). When it is instead followed by the sequence BA, we find that for eight stimulus generative probabilities, all between 0.25 and 0.75, there is a significant repulsive sequential effect: $\overline{p}(A\mid ABA)-\overline{p}(A\mid BBA)<0$ (p-values <0.01 for six probabilities, <0.05 for eight probabilities; subjects pooled). Thus, in these cases, the occurrence of A as the third-to-last stimulus increases (in comparison with the occurrence of a B) the proportion of the opposite prediction, B. For the remaining stimulus generative probabilities, this difference is in most cases also negative, although not significantly different from zero (Figure 6c). (An across-subjects analysis yields similar results; see Supplementary Materials.) Figure 6d summarizes subjects’ sequential effects, and exhibits the attractive and repulsive sequential effects in their responses (compare solid and dotted lines). (In this tree-like representation, we show averages across the stimulus generative probabilities; a figure with the individual ‘trees’ for each probability is provided in the Appendix.)
The repulsive sequential effect of the third-to-last stimulus in subjects’ predictions only occurs when the third-to-last stimulus is A followed by the sequence BA. It is also only in this case that the repulsive effect appears with the precision-cost models of a Markov observer (while it never appears with the unpredictability-cost models). This qualitative difference suggests that the precision-cost models offer a better account of sequential effects in subjects. However, model fitting onto the overall behavior presented above showed that a fraction of the subjects is better fitted by the unpredictability-cost models. We investigate, thus, the presence of a repulsive effect in the predictions of the subjects best fitted by the precision-cost models, and of those best fitted by the unpredictability-cost models. For the subjects best fitted by the precision-cost models, we find (expectedly) that there is a significant repulsive sequential effect of the third-to-last stimulus ($\overline{p}(A\mid ABA)-\overline{p}(A\mid BBA)<0$; p-values <0.01 for two probabilities, <0.05 for four probabilities; subjects pooled; Figure 6e, left panel). For the subjects best fitted by the unpredictability-cost models (a family of models that does not predict any repulsive sequential effects), we also find, perhaps surprisingly, a significant repulsive effect of the third-to-last stimulus (p-values <0.01 for three probabilities, <0.05 for five probabilities; subjects pooled), which demonstrates the robustness of this effect (Figure 6e, right panel). Thus, in spite of the results of the model-selection procedure, some sequential effects in subjects’ predictions support only one of the two families of models.
Regardless of the model that best fits their overall predictions, the behavior of the subjects is consistent only with the precision-cost family of models with Markov order equal to or greater than 1, that is, with a model of inference of conditional probabilities hampered by a cognitive cost weighing on the precision of belief distributions.
Discussion
We investigated the hypothesis that sequential effects in human predictions result from cognitive constraints hindering the inference process carried out by the brain. We devised a framework of constrained inference, in which the model subject bears a cognitive cost when updating its belief distribution upon the arrival of new evidence: the larger the cost, the more the subject’s posterior differs from the Bayesian posterior. The models we derive from this framework make specific predictions. First, the proportion of forced-choice predictions for a given stimulus should increase with the stimulus generative probability. Second, most of those models predict sequential effects: predictions also depend on the recent stimulus history. Models with different types of cognitive cost result in different patterns of attractive and repulsive effects of the past few stimuli on predictions. To compare the predictions of constrained inference with human behavior, we asked subjects to predict each next outcome in sequences of binary stimuli. We manipulated the stimulus generative probability in blocks of trials, exploring exhaustively the probability range from 0.05 to 0.95 by increments of 0.05. We found that subjects’ predictions depend on both the stimulus generative probability and the recent stimulus history. Sequential effects exhibited both attractive and repulsive components, which were modulated by the stimulus generative probability. This behavior was qualitatively accounted for by a model of constrained inference in which the subject infers the transition probabilities underlying the sequences of stimuli and bears a cost that increases with the precision of the posterior distributions. Our study proposes a novel theoretical account of sequential effects in terms of optimal inference under cognitive constraints, and it uncovers the richness of human behavior over a wide range of stimulus generative probabilities.
The notion that human decisions can be understood as resulting from a constrained optimization has gained traction across several fields, including neuroscience, cognitive science, and economics. In neuroscience, a voluminous literature that started with Attneave, 1954 and Barlow, 1961 investigates the idea that perception maximizes the transmission of information, under the constraint of costly and limited neural resources (Laughlin, 1981; Laughlin et al., 1998; Simoncelli and Olshausen, 2001); related theories of ‘efficient coding’ account for the bias and the variability of perception (Ganguli and Simoncelli, 2016; Wei and Stocker, 2015; Wei and Stocker, 2017; Prat-Carrabin and Woodford, 2021c). In cognitive science and economics, ‘bounded rationality’ is a precursory concept introduced in the 1950s by Herbert Simon, who defines it as “rational choice that takes into account the cognitive limitations of the decision maker — limitations of both knowledge and computational capacity” (Simon, 1997). For Gigerenzer, these limitations promote the use of heuristics, which are ‘fast and frugal’ ways of reasoning, leading to biases and errors in humans and other animals (Gigerenzer and Goldstein, 1996; Gigerenzer and Selten, 2002). A range of more recent approaches can be understood as attempts to specify formally the limitations in question, and the resulting tradeoff. The ‘resource-rational analysis’ paradigm aims at a unified theoretical account that reconciles principles of rationality with realistic constraints about the resources available to the brain when it is carrying out computations (Griffiths et al., 2015). In this approach, biases result from the constraints on resources, rather than from ‘simple heuristics’ (see Lieder and Griffiths, 2019 for an extensive review).
For instance, in economics, theories of ‘rational inattention’ propose that economic agents optimally allocate resources (a limited amount of attention) to make decisions, thereby proposing new accounts of empirical findings in the economic literature (Sims, 2003; Woodford, 2009; Caplin et al., 2019; Gabaix, 2017; Azeredo da Silveira and Woodford, 2019; Azeredo da Silveira et al., 2020).
Our study puts forward a ‘resource-rational’ account of sequential effects. Traditional accounts since the 1960s attribute these effects to a belief in sequential dependencies between successive outcomes (Edwards, 1961; Matthews and Sanders, 1984) (potentially ‘acquired through life experience’; Ayton and Fischer, 2004), and more generally to the incorrect models that people assume about the processes generating sequences of events (see Oskarsson et al., 2009 for a review; similar rationales have been proposed to account for suboptimal behavior in other contexts, for example in exploration-exploitation tasks; Navarro et al., 2016). This traditional account was formalized, in particular, by models in which subjects carry out a statistical inference about the sequence of stimuli presented to them, and this inference assumes that the parameters underlying the generating process are subject to changes (Yu and Cohen, 2008; Wilder et al., 2009; Zhang et al., 2014; Meyniel et al., 2016). In these models, sequential effects are thus understood as resulting from a rational adaptation to a changing world. Human subjects indeed dynamically adapt their learning rate when the environment changes (Payzan-LeNestour et al., 2013; Meyniel and Dehaene, 2017; Nassar et al., 2010), and they can even adapt their inference to the statistics of these changes (Behrens et al., 2007; Prat-Carrabin et al., 2021b). However, in our task and in many previous studies in which sequential effects have been reported, the underlying statistics are in fact not changing across trials. The models just mentioned thus leave unexplained why subjects’ behavior, in these tasks, is not rationally adapted to the unchanging statistics of the stimulus.
What underpins our main hypothesis is a different kind of rational adaptation: one, instead, to the ‘cognitive limitations of the decision maker’, which we assume hinder the inference carried out by the brain. We show that rational models of inference under a cost yield rich patterns of sequential effects. When the cost varies with the precision of the posterior (measured here by the negative of its entropy, Equation 3), the resulting optimal posterior is proportional to the product of the prior and the likelihood, each raised to an exponent $1/(\lambda +1)$ (Equation 4). Many previous studies on biased belief updating have proposed models that adopt the same form, except with different exponents applied to the prior and to the likelihood (Grether, 1980; Matsumori et al., 2018; Benjamin, 2019). Here, with the precision cost, both quantities are raised to the same exponent, and we note that in this case the inference of the subject amounts to an exponentially decaying count of the patterns observed in the sequence of stimuli, which is sometimes called ‘leaky integration’ in the literature (Yu and Cohen, 2008; Wilder et al., 2009; Jones et al., 2013; Meyniel et al., 2016). The models mentioned above, which posit a belief in changing statistics, are indeed well approximated by models of leaky integration (Yu and Cohen, 2008; Meyniel et al., 2016), which shows that the exponential discount can have different origins. Meyniel et al., 2016 show that the precision-cost, Markov-observer model with $m=1$ (named the ‘local transition probability model’ in that study) accounts for a range of other findings, in addition to sequential effects, such as biases in the perception of randomness and patterns in the surprise signals recorded through EEG and fMRI. Here we reinterpret these effects as resulting from an optimal inference subject to a cost, rather than from a suboptimal, erroneous belief in the dynamics of the stimulus’ statistics.
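The equivalence with exponential discounting can be seen by unrolling the update (a sketch, assuming the optimal posterior of Equation 4 is applied recursively at each trial, and writing $\gamma \equiv 1/(\lambda+1)$):

```latex
\hat{P}_t(q) \propto \left[ \hat{P}_{t-1}(q)\, p(x_t \mid q) \right]^{\gamma}
\quad\Longrightarrow\quad
\hat{P}_t(q) \propto P_0(q)^{\gamma^t} \prod_{s=1}^{t} p(x_s \mid q)^{\gamma^{\,t-s+1}} .
```

For a Bernoulli observer each likelihood term is $q$ or $1-q$, so the log-posterior collects counts of A and B weighted by $\gamma^{t-s+1}$: exponentially filtered counts, i.e., leaky integration with per-trial discount $1/(1+\lambda)$, or equivalently a decay rate of $\ln(1+\lambda)$.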
In our modeling approach, the minimization of a loss function (Equation 1) formalizes a tradeoff between the distance to optimality of the inference, and the cognitive constraints under which it is carried out. We stress that our proposal is not that the brain actively solves this optimization problem online, but instead that it is endowed with an inference algorithm (whose origin remains to be elucidated) which is effectively a solution to the constrained optimization problem.
By grounding the sequential effects in the optimal solution to a problem of constrained optimization, our approach opens avenues for exploring the origins of sequential effects, in the form of hypotheses about the nature of the constraint that hinders the inference carried out by the brain. With the precision cost, more precise posterior distributions are assumed to take a larger cognitive toll. The intuitive assumption that it is costly to be precise finds a more concrete realization in neural models of inference with probabilistic population codes: in these models, the precision of the posterior is proportional to the average activity of the population of neurons and to the number of neurons (Ma et al., 2006; Seung and Sompolinsky, 1993). More neural activity and more neurons arguably come with a metabolic cost, and thus more precise posteriors are more costly in these models. Imprecision in computations, moreover, was shown to successfully account for decision variability and adaptive behavior in volatile environments (Findling et al., 2019; Findling et al., 2021).
The unpredictability cost, which we introduce, yields models that also exhibit sequential effects (for Markov observers), and that fit several subjects better than the precision-cost models. The unpredictability cost relies on a different hypothesis: that the cost of representing a distribution over different possible states of the world (here, different possible values of $q$) resides in the difficulty of representing these states. This could be the case, for instance, under the hypothesis that the brain runs stochastic simulations of the implied environments, as proposed in models of ‘intuitive physics’ (Battaglia et al., 2013) and in Kahneman and Tversky’s ‘simulation heuristics’ (Kahneman et al., 1982). More entropic environments imply more possible scenarios to simulate, giving rise, under this assumption, to higher costs. A different literature explores the hypothesis that the brain carries out a mental compression of sequences (Simon, 1972; Chekaf et al., 2016; Planton et al., 2021); entropy in this context is a measure of the degree of compressibility of a sequence (Planton et al., 2021), and thus, presumably, of its implied cost. As a result, the brain may prefer predictable environments over unpredictable ones. Human subjects indeed exhibit a preference for predictive information (Ogawa and Watanabe, 2011; Trapp et al., 2015), while unpredictable stimuli have been shown not only to increase anxiety-like behavior (Herry et al., 2007), but also to induce more neural activity (Herry et al., 2007; den Ouden et al., 2009; Alink et al., 2010) — a presumably costly increase, which may result from the encoding of larger prediction errors (Herry et al., 2007; Schultz and Dickinson, 2000).
We note that both costs (precision and unpredictability) can predict sequential effects, even though neither carries ex ante an explicit assumption that presupposes the existence of sequential effects. They both reproduce the attractive recency effect of the last stimulus exhibited by the subjects. They make quantitatively different predictions (Figure 4); we also find this diversity of behaviors in subjects.
The precision cost, as mentioned above, yields leaky-integration models, which can be summarized by a simple algorithm in which the observed patterns are counted with an exponential decay. The psychology and neuroscience literature proposes many similar ‘leaky-integrator’ or ‘leaky-accumulator’ models (Smith, 1995; Roe et al., 2001; Usher and McClelland, 2001; Cook and Maunsell, 2002; Wang, 2002; Sugrue et al., 2004; Bogacz et al., 2006; Kiani et al., 2008; Yu and Cohen, 2008; Gao et al., 2011; Tsetsos et al., 2012; Ossmy et al., 2013; Meyniel et al., 2016). In connectionist models of decision-making, for instance, decision units in abstract network models have activity levels that accumulate evidence received from input units, and which decay to zero in the absence of input (Roe et al., 2001; Usher and McClelland, 2001; Wang, 2002; Bogacz et al., 2006; Tsetsos et al., 2012). In other instances, perceptual evidence (Kiani et al., 2008; Gao et al., 2011; Ossmy et al., 2013) or counts of events (Sugrue et al., 2004; Yu and Cohen, 2008; Meyniel et al., 2016) are accumulated through an exponential temporal filter. In our approach, leaky integration is not an assumption about the mechanisms underpinning some cognitive process: instead, we find that it is an optimal strategy in the face of a cognitive cost weighing on the precision of beliefs. Although it is less clear whether the unpredictability-cost models lend themselves to a similar algorithmic simplification, they consist in a distortion of Bayesian inference, for which various neural-network models have been proposed (Deneve et al., 2001; Ma et al., 2008; Ganguli and Simoncelli, 2014; Echeveste et al., 2020).
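As an illustration of this algorithmic reading, the leaky integration implied by the precision cost can be sketched in a few lines of code. This is a minimal sketch, not the fitted model: the function names, the decay parameterization, and the Laplace-smoothed posterior mean are our own illustrative choices.

```python
def leaky_counts(stimuli, decay):
    """Exponentially decaying counts of the two outcomes (A=1, B=0).
    After each observation both counts are multiplied by `decay`
    (0 < decay <= 1) and the observed outcome's count is incremented."""
    n_a = n_b = 0.0
    for s in stimuli:
        n_a *= decay
        n_b *= decay
        if s == 1:
            n_a += 1.0
        else:
            n_b += 1.0
    return n_a, n_b

def predicted_prob_a(stimuli, decay):
    """Mean of the implied Beta posterior (a Laplace-smoothed ratio)."""
    n_a, n_b = leaky_counts(stimuli, decay)
    return (n_a + 1.0) / (n_a + n_b + 2.0)
```

With `decay = 1` the counts reduce to plain Bayesian counting and the belief converges; with `decay < 1` past observations are exponentially forgotten and the belief keeps fluctuating with the recent stimulus history.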
Turning to the experimental results, we note that in spite of the rich literature on sequential effects, the majority of studies have focused on equiprobable Bernoulli environments, in which the two possible stimuli both had a probability equal to 0.5, as in tosses of a fair coin (Soetens et al., 1985; Cho et al., 2002; Yu and Cohen, 2008; Wilder et al., 2009; Jones et al., 2013; Zhang et al., 2014; Ayton and Fischer, 2004; Gökaydin and Ejova, 2017). In environments of this kind, the two stimuli play symmetric roles and all sequences of a given length are equally probable. In contrast, in biased environments one of the two possible stimuli is more probable than the other. Although much less studied, this situation breaks the regularities of equiprobable environments and is arguably very frequent in real life. In our experiment, we explore stimulus generative probabilities from 0.05 to 0.95, thus allowing us to investigate the behavior of subjects in a wide spectrum of Bernoulli environments: from those with ‘extreme’ probabilities (e.g. p = 0.95) to those only slightly different from the equiprobable case (e.g. p = 0.55) to the equiprobable case itself (p = 0.5). The subjects are sensitive to the imbalance of the non-equiprobable cases: while they predict A in half the trials of the equiprobable case, a probability of just p = 0.55 suffices to prompt the subjects to predict A in about 60% of trials, a significant difference ($\overline{p}(A)=0.602$; sem: 0.008; p-value of t-test against the null hypothesis that $\overline{p}(A)=0.5$: 1.7e-11; subjects pooled).
The well-known ‘probability matching’ hypothesis (Herrnstein, 1961; Vulkan, 2000; Gaissmaier and Schooler, 2008) suggests that the proportion of predictions A matches the stimulus generative probability: $\overline{p}(A)=p$. This hypothesis is not supported by our data. We find that in the non-equiprobable conditions these two quantities are significantly different (all p-values < 1e-11, when $p\ne 0.5$). More precisely, we find that the proportion of predictions A is more extreme than the stimulus generative probability (i.e. $\overline{p}(A)>p$ when $p>0.5$, and $\overline{p}(A)<p$ when $p<0.5$; Figure 2a). This result is consistent with the observations made by Edwards, 1961; Edwards, 1956, and with the conclusions of a more recent review (Vulkan, 2000).
In addition to varying with the stimulus generative probability, the subjects’ predictions depend on the recent history of stimuli. Recency effects are common in the psychology literature; they have been reported in domains ranging from memory (Ebbinghaus et al., 1913) to causal learning (Collins and Shanks, 2002) and inference (Shanteau, 1972; Hogarth and Einhorn, 1992; Benjamin, 2019). Recency effects, in many studies, are obtained in the context of reaction tasks, in which subjects must identify a stimulus and quickly provide a response (Hyman, 1953; Bertelson, 1965; Kornblum, 1967; Soetens et al., 1985; Cho et al., 2002; Yu and Cohen, 2008; Wilder et al., 2009; Jones et al., 2013; Zhang et al., 2014). Although our task is of a different kind (subjects must predict the next stimulus), we find some evidence of recency effects in the response times of subjects: after observing the less frequent of the two stimuli (when $p\ne 0.5$), subjects seem slower at providing a response (see Appendix). In prediction tasks (like ours), both attractive recency effects, also called the ‘hot-hand fallacy’, and repulsive recency effects, also called the ‘gambler’s fallacy’, have been reported (Jarvik, 1951; Edwards, 1961; Ayton and Fischer, 2004; Burns and Corpus, 2004; Croson and Sundali, 2005; Oskarsson et al., 2009). The observation of both effects within the same experiment has been reported in a visual identification task (Chopin and Mamassian, 2012) and in risky choices (the ‘wavy recency effect’; Plonsky et al., 2015; Plonsky and Erev, 2017). As to the heterogeneity of these results, several explanations have been proposed; two important factors seem to be the perceived degree of randomness of the predicted variable and whether it relates to human performance (Ayton and Fischer, 2004; Burns and Corpus, 2004; Croson and Sundali, 2005; Oskarsson et al., 2009).
In any event, most studies focus exclusively on the influence of ‘runs’ of identical outcomes on the upcoming prediction, for example, in our task, on whether three As in a row increase the proportion of predictions A. With this analysis, Edwards (Edwards, 1961), in a task similar to ours, concluded that there was an attractive recency effect (which he called ‘probability following’). Although our results are consistent with this observation (in our data three As in a row do increase the proportion of predictions A), we provide a more detailed picture of the influence of each stimulus preceding the prediction, whether it is in a ‘run’ of identical stimuli or not, which allows us to exhibit the non-trivial finer structure of the recency effects that is often overlooked.
Up to two stimuli in the past, the recency effect is attractive: observing A at trial $t-2$ or at trial $t-1$ induces, all else being equal, a higher proportion of predictions A at trial $t$ (in comparison to observing B; Figures 2 and 6a). The influence of the third-to-last stimulus is more intricate: it can yield either an attractive or a repulsive effect, depending on the second-to-last and the last stimuli. For a majority of probability parameters, $p$, while an A followed by the sequence AA has an attractive effect (i.e. $p(A \mid AAA)>p(A \mid BAA)$), an A followed by the sequence BA has a repulsive effect (i.e. $p(A \mid ABA)<p(A \mid BBA)$; Figure 6b and c). How can this reversal be intuited? Only one of our models, the precision-cost model with a Markov order 1 ($m=1$), reproduces this behavior; we show how it provides an interpretation for this result. From the update equation of this model (Equation 4), it is straightforward to show that the posterior of the model subject (a Dirichlet distribution of order 4) is determined by four quantities, which are exponentially decaying counts of the four two-long patterns observed in the sequence of stimuli: BB, BA, AB, and AA. The higher the count of a pattern, the more likely the model subject deems this pattern to happen again. In the equiprobable case ($p=0.5$), after observing the sequence AAA, the count of AA is higher than after observing BAA, thus the model subject believes that AA is more probable, and accordingly predicts A more frequently, i.e., $p(A \mid AAA)>p(A \mid BAA)$. As for the sequences ABA and BBA, both result in the same count of AA, but the former results in a higher count of AB — in other words, the short sequence ABA suggests that A is usually followed by B, but the sequence BBA does not — and thus the model subject predicts B more frequently, i.e., A less frequently ($p(A \mid ABA)<p(A \mid BBA)$).
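The intuition above can be checked numerically. The sketch below is our own illustrative code, not the fitted model: it keeps exponentially decaying counts of the four length-two patterns and derives the implied probability that A follows, given the last stimulus (the decay value and the Laplace-smoothed ratio are illustrative choices).

```python
def pattern_counts(seq, decay=0.8):
    """Exponentially decaying counts of the four length-two patterns."""
    counts = {"BB": 0.0, "BA": 0.0, "AB": 0.0, "AA": 0.0}
    for prev, cur in zip(seq, seq[1:]):
        for k in counts:
            counts[k] *= decay
        counts[prev + cur] += 1.0
    return counts

def prob_next_a(seq, decay=0.8):
    """Probability that A follows, given the last stimulus:
    a Laplace-smoothed ratio of the two relevant pattern counts."""
    c = pattern_counts(seq, decay)
    n_a, n_b = c[seq[-1] + "A"], c[seq[-1] + "B"]
    return (n_a + 1.0) / (n_a + n_b + 2.0)
```

With these decaying counts, `prob_next_a("AAA") > prob_next_a("BAA")` (attraction) while `prob_next_a("ABA") < prob_next_a("BBA")` (repulsion), mirroring the reversal discussed above.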
In short, the ability of the precision-cost model of a Markov observer to capture the repulsive effect found in behavioral data suggests that human subjects extrapolate the local statistical properties of the presented sequence of stimuli in order to make predictions, and that they pay attention not only to the ‘base rate’ — the marginal probability of observing A, unconditional on the recent history — as a Bernoulli observer would do, but also to the statistics of more complex patterns, including the repetitions and the alternations, thus capturing the transition probabilities between consecutive observations. Wilder et al., 2009, Jones et al., 2013, and Meyniel et al., 2016 similarly argue that sequential effects result from an imperfect inference of the base rate and of the frequency of repetitions and alternations. Dehaene et al., 2015 argue that the knowledge of transition probabilities is a central mechanism in the brain’s processing of sequences (e.g. in language comprehension), and infants as young as 5 months were shown to be able to track both base rates and transition probabilities (see Saffran and Kirkham, 2018 for a review). Learning of transition probabilities has also been observed in rhesus monkeys (Meyer and Olson, 2011).
The deviations from perfect inference, in the precision-cost model, originate in the constraints faced by the brain when performing computation with probability distributions. In spite of the success of the Bayesian framework, we note that human performance in various inference tasks is often suboptimal (Nassar et al., 2010; Hu et al., 2013; Acerbi et al., 2014; Prat-Carrabin et al., 2021b; Prat-Carrabin and Woodford, 2022). Our approach suggests that the deviations from optimality in these tasks may be explained by the cognitive constraints at play in the inference carried out by humans.
Other studies have considered the hypothesis that suboptimal behavior in inference tasks results from cognitive constraints. Kominers et al., 2016 consider a model in which Bayesian inference comes with a fixed cost; the observer can choose to forgo updating her belief, so as to avoid the cost. In some cases, the model predicts ‘permanently cycling beliefs’ that do not converge; but in general the model predicts that subjects will choose not to react to new evidence that is unsurprising under the current belief. The significant sequential effects we find in our subjects’ responses, however, seem to indicate that they are sensitive to both unsurprising (e.g. outcome A when p>0.5) and surprising (outcome B when p>0.5) observations, at least across the values of the stimulus generative probability that we test (Figure 2). Graeber, 2020 considers costly information processing as an account of subjects’ neglect of confounding variables in an inference task, but concludes instead that the suboptimal behavior of subjects results from their misunderstanding of the information structure in the task. A model close to ours is the one proposed in Azeredo da Silveira and Woodford, 2019 and Azeredo da Silveira et al., 2020, in which an informationtheoretic cost limits the memory of an otherwise optimal and Bayesian decisionmaker, resulting, here also, in beliefs that fluctuate and do not converge, and in an overweighting, in decisions, of the recent evidence.
Taking a different approach, Dasgupta et al., 2020 implement a neural network that learns to approximate Bayesian posteriors. Possible approximate posteriors are constrained not only by the structure of the network, but also by the fact that the same network is used to address a series of different inference problems. Thus the network’s parameters must be ‘shared’ across problems, which is meant to capture the brain’s limited computational resources. Although this constraint differs from the ones we consider, we note that in this study the distance function (which the approximation aims to minimize) is the same as in our models, namely, the Kullback-Leibler divergence from the optimal posterior to the approximate posterior, $D_{KL}(\hat{P} \| P)$. Minimizing this divergence (under a cost) allows the model subject to obtain a posterior as close as possible (at least by this measure) to the optimal posterior given the most recent stimulus and the subject’s belief prior to observing the stimulus, which in turn enables the subject to perform reasonably well in the task.
In principle, rewarding subjects with a higher payoff when they make a correct prediction would change the optimal tradeoff (between the distance to the optimal posterior and the cognitive costs) formalized in Equation 1, resulting in ‘better’ posteriors (closer to the Bayesian posterior), and thus in higher performance in the task. At the same time, incentivization is known to influence, also in the direction of higher performance, the extent to which choice behavior is close to probability matching (Vulkan, 2000). The interesting question of the respective sensitivities of the subjects’ inference process and of their response-selection strategy to different levels of incentives is beyond the scope of this study, in which we have focused on the sensitivity of behavior to different stimulus generative probabilities.
In any case, the approach of minimizing the Kullback-Leibler divergence from the optimal posterior to the approximate posterior is widely used in the machine-learning literature, and forms the basis of the ‘variational’ family of approximate-inference techniques (Bishop, 2006). These techniques have inspired various cognitive models (Sanborn, 2017; Gallistel and Latham, 2022; Aridor and Woodford, 2023); alternatively, a bound on the divergence, known as the ‘evidence bound’, or, in neuroscience, as the negative of the ‘free energy’, is maximized (Moustafa, 2017; Friston et al., 2006; Friston, 2009). (We note that the ‘opposite’ divergence, $D_{KL}(P \| \hat{P})$, is minimized in a different machine-learning technique, ‘expectation propagation’ (Bishop, 2006), and in the cognitive model of causal reasoning of Icard and Goodman, 2015.) In these techniques, the approximate posterior is chosen within a convenient family of tractable, parameterized distributions; other distributions are precluded. This can be understood, in our framework, as positing a cost $C(\hat{P})$ that is infinite for most distributions, but zero for the distributions that belong to some arbitrary family (Prat-Carrabin et al., 2021a). The precision cost and the unpredictability cost, in comparison, are ‘smooth’, and allow for any distribution, but they penalize, respectively, more precise belief distributions, and belief distributions that imply more unpredictable environments. Our study shows that inference, when subject to either of these costs, yields an attractive sequential effect of the most recent observation; and with a precision cost weighing on the inference of transition probabilities (i.e., $m=1$), the model predicts the subtle pattern of attractive and repulsive sequential effects that we find in subjects’ responses.
Methods
Task and subjects
The computer-based task was programmed using the Python library PsychoPy (Peirce, 2008). The experiment comprised ten blocks of trials, which differed by the stimulus generative probability, p, used in all the trials of each block. The probability p was chosen randomly among the ten values ranging from 0.50 to 0.95 by increments of 0.05, excluding the values already chosen; and with probability 1/2 the stimulus generative probability $1-p$ was used instead. Each block started with 200 passive trials, in which the subject was only asked to look at the 200 stimuli sampled with the block’s probability and successively presented. No action from the subject was required for these passive trials. The subject was then asked to predict, in each of 200 trials, the next location of the stimulus. Subjects provided their responses by a keypress. The task was presented as a game to the subjects: the stimulus was a lightning symbol, and predicting correctly whether the lightning would strike the left or the right rod resulted in the electrical energy of the lightning being collected in a battery (Figure 1). A gauge below the battery indicated the amount of energy accumulated in the current block of trials (Figure 1a). Twenty subjects (7 women, 13 men; age: 18–41, mean 25.5, standard deviation 6.2) participated in the experiment. All subjects completed the ten blocks of trials, except one subject who did not finish the experiment and was excluded from the analysis. The study was approved by the ethics committee Île de France VII (CPP 08–021). Participants gave their written consent prior to participating. The number of blocks of trials and the number of trials per block were chosen as a tradeoff between maximizing the statistical power of the study, scanning the values of the generative probability parameter from 0.05 to 0.95 with a satisfying resolution, and maintaining the duration of the experiment under a reasonable length of time.
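For concreteness, the block-probability procedure described above can be sketched as follows (an illustrative sketch only; the function name and the use of Python's `random` module are our own choices, not the experiment's actual code):

```python
import random

def draw_block_probabilities(rng=random):
    """Draw the ten block probabilities: p is drawn without replacement
    from {0.50, 0.55, ..., 0.95}, then replaced by 1 - p with
    probability 1/2, so the realized values span 0.05 to 0.95."""
    base = [0.50 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95
    rng.shuffle(base)
    return [round(1 - p, 2) if rng.random() < 0.5 else p for p in base]
```

Each run returns ten probabilities, one per block; folding each value back with `max(p, 1 - p)` recovers the original set {0.50, ..., 0.95}.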
The number of subjects was chosen consistently with similar studies and so as to capture individual variability. Throughout the study, we conduct Student’s t-tests when comparing the subjects’ proportion of predictions A to a given value (e.g. 0.5). When comparing two proportions of predictions A obtained under different conditions (e.g. depending on whether the preceding stimulus is A or B), we accordingly conduct Fisher exact tests. The trials in which subjects failed to respond within the limit of 1 s were not included in the analysis. They represented 1.27% of the trials, on average (across subjects); and for 95% of the subjects these trials represented less than 2.5% of the trials.
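As an illustration, the one-sample Student's t statistic used in such comparisons can be computed with the textbook formula below (this is not the authors' analysis code, and the example values are hypothetical per-subject proportions):

```python
import math

def one_sample_t(values, mu0):
    """One-sample Student's t statistic for the mean against mu0."""
    n = len(values)
    mean = sum(values) / n
    # unbiased sample variance
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return (mean - mu0) / math.sqrt(var / n)

# hypothetical per-subject proportions of predictions A, tested against 0.5
t_stat = one_sample_t([0.6, 0.62, 0.58, 0.61, 0.59], 0.5)
```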
Sequential effects of the models
We run simulations of the eight models and look at the predictions they yield. To reproduce the conditions faced by the subjects, which included 200 passive trials, we start each simulation by showing to the model subject 200 randomly sampled stimuli (without collecting predictions at this stage). We then show an additional 200 samples, and obtain a prediction from the model subject after each sample. The sequential effects of the most recent stimulus, with the different models, are shown in Figure 7. With the precision-cost models, the posterior distribution of the model subject does not converge, but fluctuates instead with the recent history of the stimuli. This results in attractive sequential effects (Figure 7a), including for the Bernoulli observer, who assumes that the probability of A does not depend on the most recent stimulus. With the unpredictability-cost models, the posterior of the model subject does converge. With Markov observers, it converges toward a parameter vector $q$ that implies that the probability of observing A depends on the most recent stimulus, resulting in the presence of sequential effects of the most recent stimulus (Figure 7b, second to fourth rows). With a Bernoulli observer, the posterior of the model subject converges toward a value of the stimulus generative probability that does not depend on the stimulus history. As more evidence is accumulated, the posterior narrows around this value, but not without some fluctuations that depend on the sequence of stimuli presented. In consequence, the model subject’s estimate of the stimulus generative probability is also subject to fluctuations, and depends on the history of stimuli (including the most recent stimulus), although the width of the fluctuations tends to zero as more stimuli are observed.
After the 200 stimuli of the passive trials, the sequential effects of the most recent stimulus resulting from this transient regime appear small in comparison to the sequential effects obtained with the other models (Figure 7b, first row). Figure 7 also shows the behaviors of the models when augmented with a propensity to repeat the preceding response; we comment on these in the section dedicated to these models, below.
Turning to higher-order sequential effects, we look at the influence on predictions of the second- and third-to-last stimuli (Figure 8). As mentioned, only precision-cost models of Markov observers yield repulsive sequential effects, and these occur only when the third-to-last stimulus is followed by BA. They do not occur with the second-to-last stimulus, nor with the third-to-last stimulus when it is followed by AA (Figure 8a); and they do not occur in any case with the unpredictability-cost models (Figure 8b).
Derivation of the approximate posteriors
We derive the solution to the constrained optimization problem, in the general case of a ‘hybrid’ model subject who bears both a precision cost, with weight $\lambda_p$, and an unpredictability cost, with weight $\lambda_u$. Thus the subject minimizes the loss function
in which we have included a Lagrange multiplier, μ, corresponding to the normalization constraint, $\int {\hat{P}}_{t+1}(q)\,dq=1$. Taking the functional derivative of $L$ and setting it to zero, we obtain
and thus we write the approximate posterior as
where ${P}_{t+1}(q)$ is the Bayesian update of the preceding belief, ${\hat{P}}_{t}(q)$, i.e.,
Setting the weight of the unpredictability cost to zero (i.e., ${\lambda}_{u}=0$), we obtain the posterior in the presence of the precision cost only, as
The main text provides more details about the posterior in this case (Equation 4), in particular with a Bernoulli observer ($m=0$; Equation 5, Equation 6).
For the hybrid model (in which both $\lambda_u$ and $\lambda_p$ are potentially different from zero), we obtain
With $\lambda_p=0$, the sum in the exponential is equal to $t$, and the precision-cost posterior, ${\hat{P}}_{t}^{prec}(q)$, is the Bayesian posterior, ${P}_{t}^{\ast}(q)$; we thus obtain the posterior in the presence of the unpredictability cost only (see Equation 8).
Hybrid models
The hybrid model, described above, features both a precision cost and an unpredictability cost, with respective weights $\lambda_p$ and $\lambda_u$. As with the models that include only one type of cost, we consider a Bernoulli observer ($m=0$), and three Markov observers ($m=1,2,$ and 3). As for the response-selection strategy, we use, here also, the generalized probability-matching strategy parameterized by $\kappa$. We thus obtain four new models; each one has three parameters ($\lambda_p$, $\lambda_u$, and $\kappa$), while the non-hybrid models (featuring only one type of cost) have only two parameters.
We fit these models to the responses of subjects. For 68% of subjects, the BIC of the best-fitting hybrid model is larger than the BIC of the best-fitting non-hybrid model, indicating a worse fit, by this measure. This suggests that for these subjects, allowing for a second type of cost results in a modest improvement of the fit that does not justify the additional parameter. For the remaining 32% of subjects, the hybrid models yield a better fit (a lower BIC) than the non-hybrid models, although for half of these, the difference in BICs is lower than 6, which is only weak evidence in favor of the hybrid models.
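The penalty at work in this comparison is the standard BIC formula (a textbook formula, not the authors' code). With roughly 2000 predictions per subject, a third parameter must raise the maximum log-likelihood by about $\ln(n)/2 \approx 3.8$ for the hybrid model to achieve a lower BIC:

```python
import math

def bic(max_log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; a lower value indicates a better fit."""
    return n_params * math.log(n_obs) - 2.0 * max_log_likelihood

# a hybrid model (3 parameters) beats a non-hybrid one (2 parameters)
# only if it improves the log-likelihood by more than log(n)/2
n = 2000
threshold = math.log(n) / 2  # about 3.8 log-likelihood units
```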
Moreover, we compute the exceedance probability, defined below in the section ‘Bayesian Model Selection’, of the hybrid models (together with the complementary probability of the non-hybrid models). We find that the exceedance probability of the hybrid models is 8.1% while that of the non-hybrid models is 91.9%, suggesting that subjects best-fitted by non-hybrid models are more prevalent.
In summary, we find that for more than two thirds of subjects, allowing for a second cost type does not much improve the fit to the behavioral data (the BIC is higher with the best-fitting hybrid model). These subjects are best-fitted by non-hybrid models, that is, by models featuring only one type of cost, instead of ‘falling in between’ the two cost types. This suggests that for most subjects, only one of the two costs, either the precision cost or the unpredictability cost, dominates the inference process.
Alternative response-selection strategy, and repetition or alternation propensity
In addition to the generalized probability-matching response-selection strategy presented in the main text, in our investigations we also implement several other response-selection strategies. First, a strategy based on a ‘softmax’ function that smooths the optimal decision rule; it does not yield, however, a behavior substantially different from that of the generalized probability-matching response-selection strategy. Second, we examine a strategy in which the model subject chooses the optimal response with a probability that is fixed across conditions, which we fit onto subjects’ choices. No subject is best-fitted by this strategy. Third, another possible strategy proposed in the game-theory literature (Nowak and Sigmund, 1993) is ‘win-stay, lose-shift’: it prescribes repeating the same response as long as it proves correct, and changing otherwise. In the context of our binary-choice prediction task, it is indistinguishable from a strategy in which the model subject chooses a prediction equal to the outcome that last occurred. This strategy is a special case of our Bernoulli observer hampered by a precision cost whose weight $\lambda$ is large, combined with the optimal response-selection strategy ($\kappa \to \infty$). Since the generalized probability-matching strategy parameterized by the exponent $\kappa$ appears either more general than, better than, or indistinguishable from these other response-selection strategies, we selected it to obtain the results presented in the main text.
Furthermore, we consider the possibility that subjects may have a tendency to repeat their preceding response, or, conversely, to alternate and choose the other response, independently of their inference of the stimulus statistics. Specifically, we examine a generalization of the response-selection strategy, in which a parameter $\eta$, with $-1<\eta<1$, modulates the probability of a repetition or of an alternation. With probability $1-\eta$, the model subject chooses a response with the generalized probability-matching response-selection strategy, with parameter $\kappa$. With probability $\eta$, the model subject repeats the preceding response, if $\eta$ is positive; or chooses the opposite of the preceding response, if $\eta$ is negative. With $\eta=0$, there is no propensity for repetition or alternation, and the response-selection strategy is the same as the one we have considered in the main text. We have allowed for alternations ($\eta<0$) in this model for the sake of generality, but for all the subjects the best-fitting value of $\eta$ is non-negative, thus henceforth we only consider the possibility of repetitions, i.e., non-negative values of the parameter ($\eta \ge 0$).
We note that with a repetition probability $\eta$, such that $0\le \eta <1$, the unconditional probability of a prediction A, which we denote by ${\overline{p}}_{\eta}(A)$, is not different from the unconditional probability of a prediction A in the absence of a repetition probability $\eta$, $\overline{p}(A)$, as in the event of a repetition, the response that is repeated is itself A with probability $\overline{p}(A)$; formally, ${\overline{p}}_{\eta}(A)=(1-\eta)\overline{p}(A)+\eta {\overline{p}}_{\eta}(A)$, which implies the equality ${\overline{p}}_{\eta}(A)=\overline{p}(A)$.
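This fixed-point argument can be verified by simulation. The sketch below is a Monte-Carlo check with hypothetical parameter values (not part of the authors' analyses; function name and parameterization are ours):

```python
import random

def simulate_unconditional_rate(p_a, eta, n_trials=200_000, seed=0):
    """Check that a repetition propensity leaves the unconditional
    rate of prediction A unchanged. With probability 1 - eta the
    subject samples a fresh prediction (A with probability p_a);
    with probability eta they repeat their previous response."""
    rng = random.Random(seed)
    prev = rng.random() < p_a  # first response drawn fresh
    count_a = int(prev)
    for _ in range(n_trials - 1):
        if rng.random() < eta:
            resp = prev  # repeat the preceding response
        else:
            resp = rng.random() < p_a
        count_a += resp
        prev = resp
    return count_a / n_trials
```

For any repetition probability `eta` in [0, 1), the simulated unconditional rate stays close to `p_a`, as the fixed-point equation above implies.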
Now turning to sequential effects, we note that with a repetition probability $\eta$, the probability of a prediction A conditional on an observation A is ${\overline{p}}_{\eta}(A \mid A)=(1-\eta)\,\overline{p}(A \mid A)+\eta\,\overline{p}(A)$.
In other words, when introducing the repetition probability $\eta$, the resulting probability of a prediction A conditional on observing A is a weighted mean of the unconditional probability of a prediction A and of the conditional probability of a prediction A in the absence of a repetition probability. Figure 7 (dotted lines) illustrates this for the eight models, with $\eta=0.2$. Consequently, the sequential effects with this response-selection strategy are more modest (Figure 7, light-red dots).
We fit (by maximizing their likelihoods) our eight models now equipped with a propensity for repetition (or alternation) parameterized by $\eta$. The average best-fitting value of $\eta$, across subjects, is 0.21 (standard deviation: 0.19; median: 0.18); as mentioned, no subjects have a negative best-fitting value of $\eta$. In order to assess the degree to which the models with repetition propensity are able to capture subjects’ data, in comparison with the models without such propensity, we use the Bayesian Information Criterion (BIC) (Schwarz, 1978), which penalizes the number of parameters, as a comparative metric (a lower BIC is better). For 26% of subjects, the BIC with this response-selection strategy (allowing for $\eta \ne 0$) is higher than with the original response-selection strategy (which sets $\eta=0$), suggesting that the responses of these subjects do not warrant the introduction of a repetition (or alternation) propensity. In addition, for these subjects the best-fitting inference model, characterized by a cost type and a Markov order, is the same when the response-selection strategy allows for repetition or alternation ($\eta \ne 0$) and when it does not ($\eta=0$). For 47% of subjects, the BIC is lower when including the parameter $\eta$ (suggesting that allowing for $\eta \ne 0$ results in a better fit to the data), and importantly, here also the best-fitting inference model (cost type and Markov order) is the same with $\eta \ne 0$ and with $\eta=0$. For 11% of subjects, a better fit (lower BIC) is obtained with $\eta \ne 0$, and the best-fitting inference models, with $\eta \ne 0$ and with $\eta=0$, belong to the same family of models, that is, they have the same cost type (precision cost or unpredictability cost), and only their Markov orders differ. Finally, only for the remaining 16% does the cost type change when allowing for $\eta \ne 0$.
In other words, for 84% of subjects the best-fitting cost type is the same whether or not $\eta$ is allowed to differ from 0.
Furthermore, the best-fitting parameters $\lambda$ and $\kappa$ are also stable across these two cases. Among the 73% of subjects whose best-fitting inference model (including both cost type and Markov order) remains the same regardless of the presence of a repetition propensity, we find that the best-fitting values of $\kappa$, with $\eta \ne 0$ and with $\eta=0$, differ by less than 10% for 93% of subjects, and the best-fitting values of $\lambda$ differ by less than 10% for 71% of subjects. For these two parameters, the correlation coefficient (between the best-fitting value with $\eta=0$ and the best-fitting value with $\eta \ne 0$) is above 0.99 (with p-values lower than 1e-19).
The responses of a majority of subjects are thus better reproduced by a response-selection strategy that includes a probability of repeating the preceding response. The impact of this repetition propensity on sequential effects is relatively small in comparison to the magnitude of these effects (Figure 7). For most subjects, moreover, the best-fitting inference model, characterized by its cost type and its Markov order, is the same with or without repetition propensity, and the best-fitting parameters $\lambda$ and $\kappa$ are very close in the two cases. Therefore, this analysis supports the results of the model-fitting and model-selection procedure, and validates its robustness. We conclude that the models of costly inference are essential in reproducing the behavioral data, notwithstanding a positive repetition propensity in a fraction of subjects.
Computation of the models’ likelihoods
Model fitting is conducted by maximizing, for each model, the likelihood of the subject's choices. With the precision-cost models, the likelihood can be derived analytically and thus easily computed: the model's posterior is a Dirichlet distribution of order $2^{m+1}$, whose parameters are exponentially filtered counts of the observed sequences of length $m+1$. With a Bernoulli observer, i.e., $m=0$, this is the Beta distribution presented in Equation 5. The expected probability of a stimulus A, conditional on the sequence of $m$ stimuli most recently observed, is a simple ratio involving the exponentially filtered counts, for example $(\tilde{n}_t^A + 1)/(\tilde{n}_t^A + \tilde{n}_t^B + 2)$ in the case of a Bernoulli observer. This probability is then raised to the power $\kappa$ and normalized (as prescribed by the generalized probability-matching response-selection strategy) in order to obtain the probability of a prediction A.
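As an illustration, the computation just described for a precision-cost Bernoulli observer can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name is ours, and one common convention for the exponential filtering of the counts (a decay by $e^{-\lambda}$ at every trial) is assumed.

```python
import numpy as np

def precision_cost_choice_probs(stimuli, lam, kappa):
    """Choice probabilities of a precision-cost Bernoulli observer (m = 0).
    `stimuli`: sequence of 1s (stimulus A) and 0s (stimulus B);
    `lam`: decay rate of the exponential filter (assumed convention);
    `kappa`: exponent of the generalized probability-matching rule."""
    n_A = n_B = 0.0          # exponentially filtered counts of A and B
    decay = np.exp(-lam)
    probs = []
    for x in stimuli:
        # expected probability of A under the Beta posterior
        p_A = (n_A + 1.0) / (n_A + n_B + 2.0)
        # generalized probability matching: raise to kappa, then normalize
        w_A, w_B = p_A ** kappa, (1.0 - p_A) ** kappa
        probs.append(w_A / (w_A + w_B))
        # exponential forgetting of past counts, then count the new stimulus
        n_A = decay * n_A + (x == 1)
        n_B = decay * n_B + (x == 0)
    return np.array(probs)
```

With $\kappa = 1$ this reduces to probability matching on the posterior mean; larger $\kappa$ makes responses more deterministic.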
As for the unpredictability-cost models, the posterior is given in Equation 8 up to a normalization constant. Unfortunately, the expected probability of a stimulus A implied by this posterior has no closed-form expression. We thus compute the (unnormalized) posterior on a discretized grid of values of the vector $q$. The dimension of the vector $q$ is $2^m$, and each element of $q$ lies in the interval $[0,1]$. If we discretize each dimension into $n$ bins, we obtain $n^{2^m}$ different possible values of the vector $q$; for each of these, at each trial, we compute the unnormalized value of the posterior (as given by Equation 8). As $m$ increases, this becomes computationally prohibitive: for instance, with $n=100$ bins and $m=3$, the multidimensional grid of values of $q$ contains $10^{16}$ numbers (with a typical computer, this would represent 80,000 terabytes). In order to keep the needed computational resources within reasonable limits, we choose a lower resolution of the grid for larger values of $m$. Specifically, for $m=0$ we choose a grid (over $[0,1]$) with increments of 0.01; for $m=1$, increments of 0.02 (in each dimension); for $m=2$, increments of 0.05; and for $m=3$, increments of 0.1. We then compute the mean of the discretized posterior and pass it through the generalized probability-matching response-selection model to obtain the choice probability.
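The memory figures above follow directly from the size of the grid; a small illustrative helper (the names are ours) makes the arithmetic explicit.

```python
def grid_points(n_bins, m):
    """Number of points in the discretized grid over the vector q for a
    Markov order m: each of the 2**m dimensions gets n_bins values."""
    return n_bins ** (2 ** m)

def grid_bytes(n_bins, m, bytes_per_value=8):
    """Memory needed to store one float64 per grid point."""
    return grid_points(n_bins, m) * bytes_per_value
```

For $n = 100$ and $m = 3$ this gives $100^8 = 10^{16}$ grid points, i.e. $8 \times 10^{16}$ bytes (80,000 terabytes) at 8 bytes per value, matching the figure quoted above.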
To find the best-fitting parameters $\lambda$ and $\kappa$, we maximize the likelihood with the L-BFGS-B algorithm (Byrd et al., 1995; Zhu et al., 1997). These computations were run in Python using the libraries NumPy and SciPy (Harris et al., 2020; Virtanen et al., 2020).
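The fitting step can be sketched as follows. The helper below is illustrative (its name is ours): the actual objective is the negative log-likelihood of each model's choice probabilities, as described above, and SciPy's `minimize` with the L-BFGS-B method is assumed.

```python
import numpy as np
from scipy.optimize import minimize

def fit_parameters(neg_log_likelihood, x0, bounds):
    """Maximize a model's likelihood by minimizing its negative
    log-likelihood with L-BFGS-B (a generic sketch of the fitting step).
    Returns the best-fitting parameters and the maximized log-likelihood."""
    res = minimize(neg_log_likelihood, x0=np.asarray(x0),
                   method="L-BFGS-B", bounds=bounds)
    return res.x, -res.fun
```

The `bounds` argument lets one constrain $\lambda$ and $\kappa$ to admissible (e.g., non-negative) ranges.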
Symmetries and relations between conditional probabilities
Throughout the paper, we leverage the symmetry inherent to the Bernoulli prediction task to present results in a condensed manner. Specifically, in our analysis, the proportion of predictions A when the probability of A (the stimulus generative probability) is $p$, which we denote here by $\overline{p}(A \mid p)$, is equal to the proportion of predictions B when the probability of A is $1-p$, which we denote by $\overline{p}(B \mid 1-p)$; i.e., $\overline{p}(A \mid p) = \overline{p}(B \mid 1-p)$. More generally, the predictions conditional on a given sequence when the probability of A is $p$ are equal to the predictions conditional on the ‘mirror’ sequence (in which A and B have been swapped) when the probability of A is $1-p$; for example, extending our notation, $\overline{p}(A \mid AAB, p) = \overline{p}(B \mid BBA, 1-p)$. Here, we show how this results in the symmetries in Figure 2, and in the fact that in Figures 5 and 6, it suffices to plot the sequential effects obtained with only a fraction of all the possible sequences of two or three stimuli.
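Concretely, the ‘mirror’ of a sequence is obtained by swapping A and B; a one-line illustrative helper (not from the authors' code):

```python
def mirror(sequence):
    """Mirror of a stimulus sequence: swap every A for a B and vice versa."""
    return sequence.translate(str.maketrans("AB", "BA"))
```

For instance, `mirror("AAB")` yields `"BBA"`, so predictions of A conditional on AAB at probability $p$ correspond to predictions of B conditional on BBA at probability $1-p$.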
First, we note that
$$\overline{p}(A \mid p) = 1 - \overline{p}(B \mid p) = 1 - \overline{p}(A \mid 1-p),$$
which implies the symmetry of $\overline{p}(A)$ in Figure 2a (grey line). Turning to conditional probabilities (and thus sequential effects), we have
$$\overline{p}(A \mid A, p) = 1 - \overline{p}(A \mid B, 1-p) \quad \text{and} \quad \overline{p}(A \mid B, p) = 1 - \overline{p}(A \mid A, 1-p).$$
As a result, the lines representing $\overline{p}(A \mid A)$ (blue) and $\overline{p}(A \mid B)$ (orange) in Figure 2a are reflections of each other. In addition, these equations result in the equality
$$\overline{p}(A \mid A, p) - \overline{p}(A \mid B, p) = \overline{p}(A \mid A, 1-p) - \overline{p}(A \mid B, 1-p),$$
which implies the symmetry in Figure 2b.
As for the sequential effect of the second-to-last stimulus, we show in Figures 5a and 6a the difference in the proportions of predictions A conditional on two past sequences of two stimuli, AA and BA, i.e., $\overline{p}(A \mid AA) - \overline{p}(A \mid BA)$. There are two other possible sequences of two stimuli: AB and BB. The difference in the proportions conditional on these two sequences is implied by the former difference, as:
$$\overline{p}(A \mid AB, p) - \overline{p}(A \mid BB, p) = \overline{p}(A \mid AA, 1-p) - \overline{p}(A \mid BA, 1-p).$$
As for the sequential effect of the third-to-last stimulus, we show in Figures 5b and 6b the difference in the proportions conditional on the sequences AAA and BAA, and in Figures 5c and 6c the difference in the proportions conditional on the sequences ABA and BBA. The differences in the proportions conditional on the sequences AAB and BAB, and conditional on the sequences ABB and BBB, are recovered as a function of the former two, as
$$\overline{p}(A \mid AAB, p) - \overline{p}(A \mid BAB, p) = \overline{p}(A \mid ABA, 1-p) - \overline{p}(A \mid BBA, 1-p),$$
$$\overline{p}(A \mid ABB, p) - \overline{p}(A \mid BBB, p) = \overline{p}(A \mid AAA, 1-p) - \overline{p}(A \mid BAA, 1-p).$$
Bayesian model selection
We implement the Bayesian model selection (BMS) procedure described in Stephan et al., 2009. Given $M$ models, this procedure aims at deriving a probabilistic belief over the distribution of these models among the general population. This unknown distribution is a categorical distribution, parameterized by the probabilities of the $M$ models, denoted by $r = (r_1, \dots, r_M)$, with $\sum_m r_m = 1$. With a finite sample of data, one cannot determine with infinite precision the values of the probabilities $r_m$. The BMS thus computes an approximation of the Bayesian posterior over the vector $r$, as a Dirichlet distribution parameterized by the vector $\alpha = (\alpha_1, \dots, \alpha_M)$, i.e., the posterior distribution
$$p(r \mid \alpha) = \mathrm{Dir}(r; \alpha_1, \dots, \alpha_M) \propto \prod_{m=1}^{M} r_m^{\alpha_m - 1}.$$
Computing the parameters $\alpha_k$ of this posterior makes use of the log-evidence of each model for each subject, i.e., the logarithm of the joint probability, $p(y \mid m)$, of a given subject’s responses, $y$, under the assumption that a given model, $m$, generated the responses. We use the model’s maximum likelihood to obtain an approximation of the model’s log-evidence, as (Balasubramanian, 1997)
$$\ln p(y \mid m) \approx \max_{\theta} \ln p(y \mid m, \theta) - \frac{d}{2} \ln N,$$
where $\theta$ denotes the parameters of the model, $p(y \mid m, \theta)$ is the likelihood of the model when parameterized with $\theta$, $d$ is the dimension of $\theta$, and $N$ is the size of the data, that is, the number of responses. (The well-known Bayesian Information Criterion (Schwarz, 1978) is equal to this approximation of the model’s log-evidence multiplied by $-2$.)
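For concreteness, this approximation and its relation to the standard BIC definition, $\mathrm{BIC} = -2 \ln \hat{L} + d \ln N$ (Schwarz, 1978), can be written as a small sketch (function names are ours):

```python
import numpy as np

def approx_log_evidence(max_log_likelihood, d, N):
    """ln p(y|m) ~ max_theta ln p(y|m,theta) - (d/2) ln N,
    the maximum-likelihood approximation of the log-evidence."""
    return max_log_likelihood - 0.5 * d * np.log(N)

def bic(max_log_likelihood, d, N):
    """Standard BIC: -2 ln L_max + d ln N, i.e. -2 times the
    approximate log-evidence above."""
    return -2.0 * max_log_likelihood + d * np.log(N)
```

In this study $d = 2$ (the parameters $\lambda$ and $\kappa$) and $N = 200$ responses per subject.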
In our case, there are $M=8$ models, each with $d=2$ parameters: $\theta = (\lambda, \kappa)$. The posterior distribution over the parameters of the categorical distribution of models in the general population, $p(r \mid \alpha)$, allows for the derivation of several quantities of interest; following Stephan et al., 2009, we derive two types of quantities. First, given a family of models, that is, a set $\mathcal{M} = \{m_i\}$ of different models (for instance, the precision-cost models, or the Bernoulli-observer models), the expected probability of this family, that is, the expected probability that the behavior of a subject randomly chosen in the general population follows a model belonging to this family, is the ratio
$$\frac{\sum_{m \in \mathcal{M}} \alpha_m}{\sum_{m=1}^{M} \alpha_m}.$$
We compute the expected probability of the precision-cost models (and the complementary probability of the unpredictability-cost models), and the expected probability of the Bernoulli-observer models (and the complementary probability of the Markov-observer models; see Results).
Second, we estimate, for each family of models $\mathcal{M}$, the probability that it is the most likely, i.e., the probability of the inequality
$$\sum_{m \in \mathcal{M}} r_m > \sum_{m \notin \mathcal{M}} r_m,$$
which is called the ‘exceedance probability’. We estimate this probability by drawing one million samples from the Dirichlet belief distribution (Equation 21) and counting the number of samples in which the inequality is verified. We estimate in this way the exceedance probability of the precision-cost models (and the complementary probability of the unpredictability-cost models), and the exceedance probability of the Bernoulli-observer models (and the complementary probability of the Markov-observer models; see Results).
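This sampling procedure can be sketched as follows (an illustrative implementation with names of our choosing; the analysis above uses one million samples):

```python
import numpy as np

def family_exceedance_probability(alpha, family, n_samples=100_000, seed=0):
    """Monte-Carlo estimate of a family's exceedance probability:
    sample the Dirichlet posterior over model frequencies r, and count
    the samples in which the family's summed frequency exceeds that of
    the complementary family."""
    rng = np.random.default_rng(seed)
    r = rng.dirichlet(alpha, size=n_samples)      # samples of the vector r
    in_family = r[:, family].sum(axis=1)          # sum of r_m over the family
    return float(np.mean(in_family > 1.0 - in_family))
```

With complementary families (as here), the inequality reduces to the family's summed frequency exceeding one half.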
Unpredictability cost for Markov observers
Here we derive the expression of the unpredictability cost for Markov observers as a function of the elements of the parameter vector $q$. For an observer of Markov order 1 ($m=1$), the vector $q$ has two elements: the probability of observing A at a given trial conditional on the preceding outcome being A, and the probability of observing A at a given trial conditional on the preceding outcome being B, which we denote by $q_A$ and $q_B$, respectively. The Shannon entropy, $H(X;q)$, implied by the vector $q$, is the average of the conditional entropies implied by each conditional probability, i.e.,
$$H(X;q) = p_A H(q_A) + p_B H(q_B),$$
where $p_A$ and $p_B$ are the unconditional probabilities of observing A and B, respectively (see below), and
$$H(q_X) = -q_X \log_2 q_X - (1 - q_X) \log_2 (1 - q_X),$$
where $X$ is A or B.
The unconditional probabilities $p_A$ and $p_B$ are functions of the conditional probabilities $q_A$ and $q_B$. Indeed, at trial $t+1$, the marginal probability of the event $x_{t+1}=A$, $P(x_{t+1}=A)$, is a weighted average of the probabilities of this event conditional on the preceding stimulus, $x_t$, as given by the law of total probability:
$$P(x_{t+1}=A) = P(x_{t+1}=A \mid x_t=A)\, P(x_t=A) + P(x_{t+1}=A \mid x_t=B)\, P(x_t=B),$$
i.e., at the steady state, where $P(x_{t+1}=A) = P(x_t=A) = p_A$,
$$p_A = q_A\, p_A + q_B\, (1 - p_A).$$
Solving for $p_A$, we find:
$$p_A = \frac{q_B}{1 - q_A + q_B}, \qquad p_B = 1 - p_A = \frac{1 - q_A}{1 - q_A + q_B}.$$
The entropy $H(X;q)$ implied by the vector $q$ is obtained by substituting these quantities in Equation 25.
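As a sketch of this computation (illustrative code, not the authors'; probabilities and entropies follow the derivation above, with entropies in bits):

```python
import numpy as np

def markov1_entropy(q_A, q_B):
    """Entropy implied by the m = 1 parameter vector q = (q_A, q_B):
    average of the conditional entropies, weighted by the stationary
    probabilities p_A and p_B."""
    # stationary probability of A, from p_A = q_A p_A + q_B (1 - p_A)
    p_A = q_B / (1.0 - q_A + q_B)
    p_B = 1.0 - p_A

    def h(q):
        # entropy (in bits) of a Bernoulli variable with parameter q
        if q in (0.0, 1.0):
            return 0.0
        return -q * np.log2(q) - (1.0 - q) * np.log2(1.0 - q)

    return p_A * h(q_A) + p_B * h(q_B)
```

As a sanity check, with $q_A = q_B$ the process is i.i.d. and the entropy reduces to that of a single Bernoulli variable.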
Similarly, for $m=2$ and 3, the $2^m$ elements of the vector $q$ are the parameters $q_{ij}$ and $q_{ijk}$, respectively, where $i,j,k \in \{A,B\}$, and where $q_{ij}$ is the probability of observing A at a given trial conditional on the two preceding outcomes being the sequence ‘$ij$’, and $q_{ijk}$ is the probability of observing A at a given trial conditional on the three preceding outcomes being the sequence ‘$ijk$’. The Shannon entropy, $H(X;q)$, implied by the vector $q$, is here also the average of the conditional entropies implied by each conditional probability, as
$$H(X;q) = \sum_{ij} p_{ij}\, H(q_{ij}) \quad (m=2), \qquad H(X;q) = \sum_{ijk} p_{ijk}\, H(q_{ijk}) \quad (m=3),$$
where $p_{ij}$ and $p_{ijk}$ are the unconditional probabilities of observing the sequence ‘$ij$’, and of observing the sequence ‘$ijk$’, respectively. These unconditional probabilities verify a system of linear equations whose coefficients are given by the conditional probabilities. For instance, for $m=2$, we have the relation
$$P(x_t=j,\, x_{t+1}=A) = \sum_{i \in \{A,B\}} P(x_{t+1}=A \mid x_{t-1}=i,\, x_t=j)\, P(x_{t-1}=i,\, x_t=j),$$
i.e.,
$$p_{jA} = \sum_{i \in \{A,B\}} q_{ij}\, p_{ij}, \qquad p_{jB} = \sum_{i \in \{A,B\}} (1 - q_{ij})\, p_{ij}.$$
The system of linear equations can be written as
$$\begin{pmatrix} p_{AA} \\ p_{AB} \\ p_{BA} \\ p_{BB} \end{pmatrix} = \begin{pmatrix} q_{AA} & 0 & q_{BA} & 0 \\ 1-q_{AA} & 0 & 1-q_{BA} & 0 \\ 0 & q_{AB} & 0 & q_{BB} \\ 0 & 1-q_{AB} & 0 & 1-q_{BB} \end{pmatrix} \begin{pmatrix} p_{AA} \\ p_{AB} \\ p_{BA} \\ p_{BB} \end{pmatrix}.$$
The solution is the eigenvector corresponding to the eigenvalue equal to 1 of the matrix in the equation above, with the additional constraint that the unconditional probabilities must sum to 1, i.e., $\sum_{ij} p_{ij} = 1$. Noting that $p_{AB} = p_{BA}$ (in a stationary sequence, transitions from A to B and from B to A are equally frequent), we find:
$$p_{AB} = p_{BA} = \left( 2 + \frac{q_{BA}}{1 - q_{AA}} + \frac{1 - q_{AB}}{q_{BB}} \right)^{-1}, \qquad p_{AA} = \frac{q_{BA}}{1 - q_{AA}}\, p_{AB}, \qquad p_{BB} = \frac{1 - q_{AB}}{q_{BB}}\, p_{AB}.$$
For $m=3$, we find the relations:
$$p_{jkA} = \sum_{i \in \{A,B\}} q_{ijk}\, p_{ijk}, \qquad p_{jkB} = \sum_{i \in \{A,B\}} (1 - q_{ijk})\, p_{ijk}.$$
Together with the normalization constraint $\sum_{ijk} p_{ijk} = 1$, these relations allow determining the eight unconditional probabilities $p_{ijk}$, and thus the expression of the Shannon entropy.
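The eigenvector computation described above can be sketched as follows for $m=2$ (an illustrative implementation; the pair ordering and names are ours):

```python
import numpy as np

def stationary_pair_probs(q):
    """Stationary probabilities p_ij of the four stimulus pairs for an
    m = 2 Markov observer. `q` maps a pair 'ij' (older stimulus first)
    to the probability of observing A right after it. The result is the
    eigenvector, with eigenvalue 1, of the pair-transition matrix,
    normalized so that the probabilities sum to 1."""
    pairs = ["AA", "AB", "BA", "BB"]
    T = np.zeros((4, 4))
    for old, ij in enumerate(pairs):
        for new, jk in enumerate(pairs):
            # pair 'ij' can only be followed by a pair starting with j
            if jk[0] == ij[1]:
                T[new, old] = q[ij] if jk[1] == "A" else 1.0 - q[ij]
    vals, vecs = np.linalg.eig(T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    v = v / v.sum()
    return dict(zip(pairs, v))
```

As a sanity check, when all elements of $q$ are equal to some value $p$ the sequence is i.i.d., and the pair probabilities factorize (e.g., $p_{AA} = p^2$).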
Appendix 1
Stability of subjects’ behavior throughout the experiment
To validate the assumption that our experiment captures the ‘stationary’ behavior of subjects, we compare their responses in the first half of the task (first 100 trials) to their responses in the second half (last 100 trials). We find that the unconditional proportions of predictions A in these two cases are not significantly different, for most values of the stimulus generative probability. The sign of the difference (regardless of its statistical significance) indicates that the proportions of predictions A in the second half of the experiment are slightly closer to 1 when the probability of the stimulus A is greater than 0.5, which would mean that the responses of subjects are slightly closer to optimality in the second half of the experiment (Appendix 1—figure 1a, grey lines). Regarding the sequential effects, we also obtain very similar behaviors in the first and second halves of the experiment (Appendix 1—figure 1). We conclude that for our analysis it is reasonable to assume that the behavior of subjects is stationary throughout the task.
Robustness of the model fitting
To evaluate the ability of the model-fitting procedure to correctly identify the model that generated a given set of responses, we compute a confusion matrix of the eight models. For each model, we simulate 200 runs of the task (each with 200 passive trials followed by 200 trials in which a prediction is obtained), with values of $\lambda$ and $\kappa$ close to values typically obtained when fitting the subjects’ responses (for precision-cost models, $\lambda \in \{0.03, 0.7, 2, 15\}$; for unpredictability-cost models, $\lambda \in \{0.7, 2\}$; and $\kappa \in \{0.7, 1.5, 2\}$ for both families of models). We then fit each of the eight models to each of these simulated datasets, and count how many times each model best fits each dataset (Appendix 1—figure 2a). To further test the robustness of the model-fitting procedure, we randomly introduce errors in the simulated responses: for 10% of the responses, randomly chosen in each dataset, we substitute the response by its opposite (i.e., B for A, and A for B), and compute a confusion matrix using these new responses (Appendix 1—figure 2b). In both cases, the model-fitting procedure identifies the correct model a majority of times (i.e., the best-fitting model is the model that generated the data; Appendix 1—figure 2).
Finally, to examine the robustness of the weight of the cost, $\lambda$, we consider for each subject the best-fitting model in each family (the precision-cost family and the unpredictability-cost family), and we fit each model separately to the subject’s responses obtained in trials in which the stimulus generative probability was medium ($p \in \{0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7\}$) and in trials in which it was extreme ($p \in \{0.05, 0.1, 0.15, 0.2, 0.25, 0.75, 0.8, 0.85, 0.9, 0.95\}$). Appendix 1—figure 3 shows the correlation between the best-fitting parameters obtained in these two cases.
Distribution of subjects’ BICs
Subjects’ sequential effects — tree representation
Subjects’ sequential effects — unpooled data
As mentioned in the main text, we pool together the predictions that correspond, in different blocks of trials, to either event (left or right), as long as these events have the same probability. Appendix 1—figure 6, below, is the same as Figure 2, but without such pooling. Given a stimulus generative probability, $p$, each subject experiences one (and only one) block of trials in which either the event ‘right’ or the event ‘left’ had probability $p$: for one group of subjects the ‘right’ event has probability $p$, and for the remaining subjects it is the ‘left’ event that has probability $p$. The responses of these two groups are not pooled together in Appendix 1—figure 6, while they were in Figure 2. The same applies to any other stimulus generative probability, $p'$; however, the two groups of subjects for whom $p'$ was the probability of a ‘right’ event or a ‘left’ event are not the same as the two groups just mentioned in the case of the probability $p$. As a result, from one proportion shown in Appendix 1—figure 6 to another, the underlying group of subjects changes, whereas in Figure 2 each proportion is computed with the responses of all the subjects. This illustrates another advantage of the pooling that we use in the main text.
Subjects’ response times
Acrosssubjects results
Data availability
The behavioral data for this study and the computer code used for data analysis are freely and publicly available through the Open Science Framework repository at https://doi.org/10.17605/OSF.IO/BS5CY.

Open Science Framework. Resource-Rational Account of Sequential Effects in Human Prediction: Data & Code. https://doi.org/10.17605/OSF.IO/BS5CY
References

On the origins of suboptimality in human probabilistic inferencePLOS Computational Biology 10:e1003661.https://doi.org/10.1371/journal.pcbi.1003661

Stimulus predictability reduces responses in primary visual cortexThe Journal of Neuroscience 30:2960–2966.https://doi.org/10.1523/JNEUROSCI.373010.2010

Some informational aspects of visual perceptionPsychological Review 61:183–193.https://doi.org/10.1037/h0054663

The hot hand fallacy and the gambler’s fallacy: two faces of subjective randomness?Memory & Cognition 32:1369–1378.https://doi.org/10.3758/bf03206327

Noisy memory and overreaction to newsAEA Papers and Proceedings 109:557–561.https://doi.org/10.1257/pandp.20191049

BookOptimally Imprecise Memory and Biased ForecastsNational Bureau of Economic Research.https://doi.org/10.2139/ssrn.3731244

Statistical inference, Occam’s razor, and statistical mechanicsNeural Computation 9:349–368.https://doi.org/10.1162/neco.1997.9.2.349

Twenty years of “hot hand” research: Review and critiquePsychology of Sport and Exercise 7:525–553.https://doi.org/10.1016/j.psychsport.2006.03.001

BookPossible principles underlying the transformations of sensory messagesIn: Rosenblith Walter A, editors. Sensory Communication, Chapter 13. Cambridge, MA: The MIT Press. pp. 217–234.

How haptic size sensations improve distance perceptionPLOS Computational Biology 7:e1002080.https://doi.org/10.1371/journal.pcbi.1002080

Learning the value of information in an uncertain worldNature Neuroscience 10:1214–1221.https://doi.org/10.1038/nn1954

BookErrors in probabilistic reasoning and judgment biasesIn: Benjamin DJ, editors. Handbook of Behavioral Economics. Elsevier B.V. pp. 69–186.

Randomness and inductions from streaks: “gambler’s fallacy” versus “hot hand.”Psychonomic Bulletin & Review 11:179–184.https://doi.org/10.3758/bf03206480

A limited memory algorithm for bound constrained optimizationSIAM Journal on Scientific Computing 16:1190–1208.https://doi.org/10.1137/0916069

Rational inattention, optimal consideration sets, and stochastic choiceThe Review of Economic Studies 86:1061–1094.https://doi.org/10.1093/restud/rdy037

Mechanisms underlying dependencies of performance on stimulus history in a twoalternative forcedchoice taskCognitive, Affective & Behavioral Neuroscience 2:283–299.https://doi.org/10.3758/cabn.2.4.283

Predictive properties of visual adaptationCurrent Biology 22:622–626.https://doi.org/10.1016/j.cub.2012.02.021

Momentary and integrative response strategies in causal judgmentMemory & Cognition 30:1138–1147.https://doi.org/10.3758/bf03194331

Dynamics of neuronal responses in macaque MT and VIP during motion detectionNature Neuroscience 5:985–994.https://doi.org/10.1038/nn924

The Gambler’s fallacy and the hot hand: empirical data from casinosJournal of Risk and Uncertainty 30:195–209.https://doi.org/10.1007/s1116600511532

Efficient computation and cue integration with noisy population codesNature Neuroscience 4:826–831.https://doi.org/10.1038/90541

A dual role for prediction error in associative learningCerebral Cortex 19:1175–1185.https://doi.org/10.1093/cercor/bhn161

BookMemory: A Contribution to Experimental PsychologyTeachers College Press.https://doi.org/10.1037/10011000

Reward probability, amount, and information as determiners of sequential twoalternative decisionsJournal of Experimental Psychology 52:177–188.https://doi.org/10.1037/h0047727

Probability learning in 1000 trialsJournal of Experimental Psychology 62:385–394.https://doi.org/10.1037/h0041970

A free energy principle for the brainJournal of Physiology, Paris 100:70–87.https://doi.org/10.1016/j.jphysparis.2006.10.001

The freeenergy principle: a rough guide to the brain?Trends in Cognitive Sciences 13:293–301.https://doi.org/10.1016/j.tics.2009.04.005

Efficient sensory encoding and bayesian inference with heterogeneous neural populationsNeural Computation 26:2103–2134.https://doi.org/10.1162/NECO_a_00638

Reasoning the fast and frugal way: Models of bounded rationalityPsychological Review 103:650–669.https://doi.org/10.1037/0033295X.103.4.650

BookBounded Rationality: The Adaptive ToolboxMIT Press.https://doi.org/10.7551/mitpress/1654.001.0001

The hot hand in basketball: On the misperception of random sequencesCognitive Psychology 17:295–314.https://doi.org/10.1016/00100285(85)900106

ConferenceSequential effects in predictionProceedings of the Annual Conference of the Cognitive Science Society. pp. 397–402.

Bayes rule as a descriptive model: the representativeness heuristicThe Quarterly Journal of Economics 95:537.https://doi.org/10.2307/1885092

Rational use of cognitive resources: levels of analysis between the computational and the algorithmicTopics in Cognitive Science 7:217–229.https://doi.org/10.1111/tops.12142

Relative and absolute strength of response as a function of frequency of reinforcementJournal of the Experimental Analysis of Behavior 4:267–272.https://doi.org/10.1901/jeab.1961.4267

Processing of temporal unpredictability in human and animal amygdalaThe Journal of Neuroscience 27:5958–5966.https://doi.org/10.1523/JNEUROSCI.521806.2007

On the rate of gain of informationQuarterly Journal of Experimental Psychology 4:11–26.https://doi.org/10.1080/17470215208416600

Order effects in belief updating: The beliefadjustment modelCognitive Psychology 24:1–55.https://doi.org/10.1016/00100285(92)90002J

Nonparametric learning rules from bandit experiments: The eyes have it!Games and Economic Behavior 81:215–231.https://doi.org/10.1016/j.geb.2013.05.003

Stimulus information as a determinant of reaction timeJournal of Experimental Psychology 45:188–196.https://doi.org/10.1037/h0056940

ConferenceA ResourceRational Approach to the Causal Frame ProblemProceedings of the 37th Annual Meeting of the Cognitive Science Society.

Probability learning and a negative recency effect in the serial anticipation of alternative symbolsJournal of Experimental Psychology 41:291–297.https://doi.org/10.1037/h0056878

Sequential effects in response time reveal learning mechanisms and event representationsPsychological Review 120:628–666.https://doi.org/10.1037/a0033180

A simple coding procedure enhances a neuron’s information capacityZeitschrift Für Naturforschung C 36:910–912.https://doi.org/10.1515/znc198191040

Resourcerational analysis: Understanding human cognition as the optimal use of limited computational resourcesThe Behavioral and Brain Sciences 43:e1.https://doi.org/10.1017/S0140525X1900061X

Bayesian inference with probabilistic population codesNature Neuroscience 9:1432–1438.https://doi.org/10.1038/nn1790

Spiking networks for Bayesian inference and choiceCurrent Opinion in Neurobiology 18:217–222.https://doi.org/10.1016/j.conb.2008.07.004

A biased bayesian inference for decisionmaking and cognitive controlFrontiers in Neuroscience 12:734.https://doi.org/10.3389/fnins.2018.00734

Effects of causal and noncausal sequences of information on subjective predictionPsychological Reports 54:211–215.https://doi.org/10.2466/pr0.1984.54.1.211

Subjective probabilities for sex of next child: U.S. College students and Philippine villagersJournal of Population Behavioral, Social, and Environmental Issues 1:132–147.https://doi.org/10.1007/BF01277598

Human inferences about sequences: a minimal transition probability modelPLOS Computational Biology 12:e1005260.https://doi.org/10.1371/journal.pcbi.1005260

An approximately Bayesian deltarule model explains the dynamics of belief updating in a changing environmentThe Journal of Neuroscience 30:12366–12378.https://doi.org/10.1523/JNEUROSCI.082210.2010

Implicit learning increases preference for predictive visual displayAttention, Perception & Psychophysics 73:1815–1822.https://doi.org/10.3758/s1341401000412

What’s next? Judging sequences of binary eventsPsychological Bulletin 135:262–285.https://doi.org/10.1037/a0014821

Generating stimuli for neuroscience using psychoPyFrontiers in Neuroinformatics 2:10.https://doi.org/10.3389/neuro.11.010.2008

A theory of memory for binary sequences: Evidence for A mental compression algorithm in humansPLOS Computational Biology 17:e1008598.https://doi.org/10.1371/journal.pcbi.1008598

Reliance on small samples, the wavy recency effect, and similaritybased learningPsychological Review 122:621–647.https://doi.org/10.1037/a0039413

Human inference in changing environments with temporal structurePsychological Review 128:879–912.https://doi.org/10.1037/rev0000276

ConferenceBias and variance of the Bayesianmean decoderAdvances in Neural Information Processing Systems 34 (NeurIPS 2021). pp. 23793–23805.

Infant statistical learningAnnual Review of Psychology 69:181–203.https://doi.org/10.1146/annurevpsych122216011805

Types of approximation for probabilistic cognition: sampling and variationalBrain and Cognition 112:98–101.https://doi.org/10.1016/j.bandc.2015.06.008

Neuronal coding of prediction errorsAnnual Review of Neuroscience 23:473–500.https://doi.org/10.1146/annurev.neuro.23.1.473

Estimating the Dimension of a ModelThe Annals of Statistics 6:461–464.https://doi.org/10.1214/aos/1176344136

Descriptive versus normative models of sequential inference judgmentJournal of Experimental Psychology 93:63–68.https://doi.org/10.1037/h0032509

Complexity and the representation of patterned sequences of symbolsPsychological Review 79:369–382.https://doi.org/10.1037/h0033118

BookBounded rationalityIn: Simon HA, editors. Models of Bounded Rationality: Empirically Grounded Economic Reason. The MIT Press. pp. 291–294.https://doi.org/10.7551/mitpress/4711.001.0001

Natural image statistics and neural representationAnnual Review of Neuroscience 24:1193–1216.https://doi.org/10.1146/annurev.neuro.24.1.1193

Implications of rational inattentionJournal of Monetary Economics 50:665–690.https://doi.org/10.1016/S03043932(03)000291

Psychophysically principled models of visual simple reaction timePsychological Review 102:567–593.https://doi.org/10.1037/0033295X.102.3.567

Expectancy or automatic facilitation? Separating sequential effects in twochoice reaction timeJournal of Experimental Psychology 11:598–616.https://doi.org/10.1037/00961523.11.5.598

Bayesian model selection for group studiesNeuroImage 46:1004–1017.https://doi.org/10.1016/j.neuroimage.2009.03.025

Expectation in perceptual decision making: neural and computational mechanismsNature Reviews. Neuroscience 15:745–756.https://doi.org/10.1038/nrn3838

Human preferences are biased towards associative informationCognition & Emotion 29:1054–1068.https://doi.org/10.1080/02699931.2014.966064

The time course of perceptual choice: the leaky, competing accumulator modelPsychological Review 108:550–592.https://doi.org/10.1037/0033295x.108.3.550

An economist’s perspective on probability matchingJournal of Economic Surveys 14:101–118.https://doi.org/10.1111/14676419.00106

A Bayesian observer model constrained by efficient coding can explain “antiBayesian” perceptsNature Neuroscience 18:1509–1517.https://doi.org/10.1038/nn.4105

ConferenceSequential effects reflect parallel learning of multiple environmental regularitiesAdvances in Neural Information Processing Systems 22  Proceedings of the 2009 Conference. pp. 2053–2061.

Informationconstrained statedependent pricingJournal of Monetary Economics 56:S100–S124.https://doi.org/10.1016/j.jmoneco.2009.06.014

Sequential effects: Superstition or rational behavior?Advances in Neural Information Processing Systems 21:1873–1880.

ConferenceSequential effects: A Bayesian analysis of prior bias on reaction time and behavioral choiceProceedings of the 36th Annual Conference of the Cognitive Science Society. pp. 1844–1849.

Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimizationACM Transactions on Mathematical Software 23:550–560.https://doi.org/10.1145/279232.279236
Decision letter

Hang ZhangReviewing Editor; Peking University, China

Floris P de LangeSenior Editor; Donders Institute for Brain, Cognition and Behaviour, Netherlands
Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "ResourceRational Account of Sequential Effects in Human Prediction" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Floris de Lange as the Senior Editor. The reviewers have opted to remain anonymous.
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
1) Including alternative models of sequential effects in the model comparison would be necessary. Please see the comments of Reviewers #2 and #3 for details.
2) Additional statistical tests are required to tease out potentially confounding effects of motor responses (see Reviewer #3's comments). Besides, there should be corrections for multiple comparisons (see Reviewer #2's comments).
3) The costs assumed in the resource-rational models need better theoretical justification. Please see the comments of Reviewers #1 and #3 for details.
4) Testing a hybrid model that combines the precision cost and unpredictability cost is highly recommended, given that the two models seem to explain complementary aspects of the data. Please see Reviewer #1's comments for details.
Reviewer #1 (Recommendations for the authors):
There is a clear inverted-U shape in Figure 2b that the authors don’t comment on. This seems like a salient feature of the data that should be explained or at least commented on. Interestingly, the best-fitting models can account for this (Figure 4b), but it doesn’t seem to be discussed. It would also be helpful to see the predictions separately for precision-cost and unpredictability-cost models.
At a conceptual level, I'm not sure I understand the reasoning behind the unpredictability cost. It's not intuitive to me why more unpredictability should be registered by an observer as a cost of updating their beliefs. The Discussion didn't really clear this up; there's a reference to a preference for predictable environments, but the argument that this is somehow costly to the brain is handwavy. Just because unpredictability increases neural activity doesn't mean that it's something the brain is trying to minimize.
The pattern of higherorder sequential effects (Figure 6) seems to suggest that behavior is consistent with some combination of precision cost and unpredictability cost. Neither model on its own explains the data particularly well (compare with Figure 5). Have the authors considered hybrid models?
There are a few references related to the costly inference that deserve mention:
– Kominers et al. (2016), who develop a model of costly updating.
– Dasgupta et al. (2020), who develop a model of resourcebounded inference.
– Graeber (202), who develops a model of inattentive inference.
The authors cite a number of earlier modeling papers, but it's not clear to me what those previous models would predict about the new data, and whether the new models proposed in this paper predict the earlier data.
Reviewer #2 (Recommendations for the authors):
I would recommend the authors conduct additional modeling analyses including models that express the alternative hypotheses clearly, generate and present individual model predictions, and consider running a (possibly online) larger experiment with incentives in place. The ideal test of the hypothesis would be an experiment with multiple levels of incentives, showing that people adjust their representations as this model predicts as they trade off computational costs and real rewards. If monetary rewards cannot be used, incentives could be implemented by having a longer delay after erroneous predictions.
Reviewer #3 (Recommendations for the authors):
1. I recommend using regression analyses (e.g. a GLM) where regressors for both previous choices and previous stimuli are entered (as e.g. in Urai et al. Nature Communications 2017) to resolve this possible confound. Such a GLM also allows looking further back in time at the impact of longer lags on decisions. This would allow testing for example if the repulsive bias to the stimulus at lag 2 (in sequences 111 or 101) extends to longer lags, both in experimental data and simulations.
2. Authors could test whether the very first prediction on each block already shows a signature of the stimulus generative probability p and whether the prediction is stable within blocks.
I provide here some recommendations on how the clarity of the manuscript could be improved. Thorough work on improving the text, the figures, and the general flow and organization of the manuscript would make a major difference in the impact of this paper on the wider community. What is very frustrating for the reader is to see one statement (e.g. a choice of methods or a result) exposed at some point and then the explanation for that statement much later in the manuscript (see below). Here are my suggestions:
– I believe the models would be better motivated to the reader if the Introduction made a brief mention of the ideas of bounded rationality (and related concepts) and justified focus on these two specific types of cost – all of which are nicely detailed in the Discussion.
– Please try to make the figures more intuitive to understand; for example, using a colour code for the different cost types may help differentiate them. A tree-like representation of history biases (showing the mean accuracy for different types of sequences, e.g. in Meyniel et al. 2016 PLoS CB) may be more intuitive to read and reveal a richer structure in the data and models than the current Figures 5-6 (also given that the authors do not comment much on the impact of the "probability of observation 1", so perhaps these effects could be marginalized out).
– Figure 3 is really helpful in understanding the two types of cost (much more than the equations for most readers). Unfortunately, it is hardly referred to in the main text. I suggest rewriting the presentation of that part of the Results section around these examples.
– Why and how the two types of costs give rise to history effects (beyond the fact that these costs generate suboptimalities) is a central idea in the paper, but it is only exposed in the Discussion section. Integrating these explanations within the Results section would help a lot. Plotting some example simulations for a sequence of trials and/or some cartoon explanations of how the history effects emerge for the different models would also help.
– Placing figures in Methods does not help, in my opinion; please consider moving them to the main text or presenting them as supplementary figures.
https://doi.org/10.7554/eLife.81256.sa1

Author response
Essential revisions:
Reviewer #1 (Recommendations for the authors):
There is a clear inverted-U shape in Figure 2b that the authors don't comment on. This seems like a salient feature of the data that should be explained or at least commented on. Interestingly, the best-fitting models can account for this (Figure 4b), but it doesn't seem to be discussed. It would also be helpful to see the predictions separately for precision-cost and unpredictability-cost models.
We agree with Reviewer #1 that it is notable that there is an inverted-U shape in Figure 2b, i.e., that the sequential effect of the last stimulus is smaller for more extreme values of the stimulus generative probability, in the responses of the subjects. Some models reproduce this pattern: following Reviewer #1's suggestion, we now show in Figure 4 the predictions of the precision-cost and unpredictability-cost models separately.
Panels a, b, c, and d in Figure 4 show the predictions of the precision-cost model of a Bernoulli observer (a) and of a Markov observer with m = 1 (c), and of the unpredictability-cost model of a Bernoulli observer (b) and of a Markov observer with m = 1 (d). In panel (a) we also show the predictions of the precision-cost model of a Bernoulli observer (m = 0) with the "traditional" probability-matching strategy, i.e., with κ = 1 (in this case the probability of predicting A is equal to the inferred probability of the event A). In this case the size of the sequential effect, p(A|A) − p(A|B), is the same for all values of the stimulus generative probability (see dotted lines and light-red dots). But with κ = 2.8 (a value representative of subjects' best-fitting values), the proportions of predictions are brought closer to optimality (i.e., to the extremes), and the sequential effect, p(A|A) − p(A|B), now depends on the stimulus generative probability, resulting in the inverted U-shape of the sequential effects in Figure 4a.
We note, however, that the precision-cost model of a Markov observer (with m = 1) yields a (non-inverted) U shape of the sequential effects. While the behavior of the precision-cost model of a Bernoulli observer is determined by two exponentially filtered counts of the two possible stimuli, the behavior of the precision-cost model of a Markov observer (m = 1) is determined by four exponentially filtered counts of the four possible pairs of stimuli; in particular, p(A|B) is determined by the counts of the pairs BA and BB. But when p is large, the pairs BA and BB are rare: thus it is as if the model subject had little total evidence to inform its decision. The resulting predictions are close to those of an uninformed observer, i.e., p(A|B) ≈ 0.5. By contrast, p(A|A) is more extreme, and this difference yields stronger sequential effects for more extreme values of the stimulus generative probability (i.e., the U shape in Figure 4c, right).
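For intuition, the mechanism behind the inverted U-shape can be sketched in a few lines of code. This is our own minimal simulation, not the authors' fitting code: the decay constant (0.9), the prior pseudo-counts, and the trial count are arbitrary illustrative choices; only κ = 2.8 is taken from the representative best-fitting value quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(p, n_trials=20000, decay=0.9, kappa=2.8):
    """Sketch of a precision-cost Bernoulli observer: two exponentially
    filtered counts of the stimuli, followed by generalized probability
    matching with exponent kappa on the inferred probability."""
    nA = nB = 1.0                  # pseudo-counts from a uniform prior
    prev = None
    pred_given_prev = {0: [], 1: []}
    for _ in range(n_trials):
        q = nA / (nA + nB)         # inferred probability of event A
        p_pred = q**kappa / (q**kappa + (1 - q)**kappa)
        if prev is not None:
            pred_given_prev[prev].append(p_pred)
        x = int(rng.random() < p)  # stimulus: A (1) with probability p
        nA = decay * nA + x        # leaky integration of the counts
        nB = decay * nB + (1 - x)
        prev = x
    # sequential effect of the last stimulus: p(A|A) - p(A|B)
    return np.mean(pred_given_prev[1]) - np.mean(pred_given_prev[0])

effects = {p: simulate(p) for p in (0.5, 0.75, 0.95)}
```

With κ = 1 the response probability equals the inferred probability and the effect is flat in p; with κ = 2.8 the predictions saturate near the extremes, so the effect of the last stimulus shrinks for extreme p, giving the inverted U-shape.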
Among the subjects that are best fitted by a precision-cost model, some are best fitted by a model of a Bernoulli observer (m = 0) while others are best fitted by a model of a Markov observer (m > 0). Overall, the sequential effects for these subjects exhibit a small decrease at more extreme stimulus generative probabilities (Figure 4e, right). By contrast, the subjects best fitted by an unpredictability-cost model show a stronger decrease in the sequential effects at more extreme probabilities (Figure 4f, right). In addition, the latter subjects exhibit weaker sequential effects than the former. This is reproduced by simulations of the corresponding best-fitting models (Figure 4g,h). The models belonging to the 'other' family, i.e., the family that does not best fit each subject (the precision-cost models, for the subjects best fitted by an unpredictability-cost model; and the unpredictability-cost models, for the subjects best fitted by a precision-cost model), do not reproduce well the patterns in subjects' data (Figure 4i,j).
In the revised version of the manuscript, we now point to the inverted U-shape of sequential effects in the group average of subjects' data (l. 182-185), and we have completely reworked the presentation of the sequential effects of the models. In particular, we detail the behavior of each family of models separately, using the updated Figure 4; we explain the origin of the shape of the sequential effects (inverted U-shape in Figure 4a and U-shape in Figure 4c); and we compare the models' behaviors with those of the subjects (l. 422-493).
At a conceptual level, I'm not sure I understand the reasoning behind the unpredictability cost. It's not intuitive to me why more unpredictability should be registered by an observer as a cost of updating their beliefs. The Discussion didn't really clear this up; there's a reference to a preference for predictable environments, but the argument that this is somehow costly to the brain is handwavy. Just because unpredictability increases neural activity doesn't mean that it's something the brain is trying to minimize.
In the revised version of the manuscript, we now provide more details on the rationale for the unpredictability cost. We note, first, that we make a similar argument based on the cost of neural activity for the precision cost. Although the two cases differ by the hypothesized origin of the increase in the neural activity, in both cases we assume that this neural activity comes with a metabolic cost, which is not to be minimized per se, but which enters a tradeoff with the correctness of the represented belief, in comparison to the optimal, Bayesian belief.
As to the rationale subtending the unpredictability cost, it resides in the assumption that the difficulty of representing a belief distribution over the parameters generating the environment (here, q) originates in the difficulty of representing the environments themselves. For instance, in models of 'intuitive physics' (e.g., Battaglia, Hamrick, and Tenenbaum, 2013) or in the 'simulation heuristic' of Kahneman and Tversky (1982), the brain runs simulations of the possible sequences of outcomes in a given environment. Environments that are more entropic result in a greater diversity of sequences of outcomes, and thus in more simulations, which presumably entails higher costs. Furthermore, several cognitive models posit that the brain compresses sequences (Simon, 1972; Planton et al., 2021); but a greater entropy in sequences reduces the compression rate, resulting in longer descriptions of these sequences (here also, because of the greater diversity of potential outcomes), which presumably is more costly.
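The compression argument above can be made concrete: the Shannon entropy of a Bernoulli environment lower-bounds the achievable per-symbol description length of its sequences, so more entropic environments compress less. A minimal illustration (our own sketch, not a model from the manuscript):

```python
from math import log2

def bernoulli_entropy(q):
    """Shannon entropy, in bits per outcome, of a Bernoulli(q) source:
    the lower bound on the per-symbol length of any lossless code for
    its sequences (more entropy means less compressible)."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)

# A maximally unpredictable environment (q = 0.5) admits no compression;
# a more predictable one (q = 0.9) compresses to under half a bit/outcome.
assert bernoulli_entropy(0.5) == 1.0
assert bernoulli_entropy(0.9) < 0.5
```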
We note that for neither the precision cost nor the unpredictability cost do we provide a mechanistic account of the underlying representational system in which the cost naturally emerges. But under the assumption that the cost of representing a distribution over environments resides in the cost of representing the environments themselves, it seems that, for the reasons just presented, a reasonable assumption is that more unpredictable environments are more difficult to represent.
In the revised version of the manuscript, we now provide more details on the rationale for the unpredictability cost in the Discussion (l. 687-700). We have also reworked the short presentation of the costs in the Introduction (l. 71-79).
References:
Peter W. Battaglia, Jessica B. Hamrick, and Joshua B. Tenenbaum. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences of the United States of America, 110(45):18327–18332, 2013.
Daniel Kahneman and Amos Tversky. The simulation heuristic, pages 201–208. Cambridge University Press, 1982.
Herbert A Simon. Complexity and the representation of patterned sequences of symbols. Psychological Review, 79(5):369, 1972.
Samuel Planton, Timo van Kerkoerle, Leïla Abbih, Maxime Maheu, Florent Meyniel, Mariano Sigman, Liping Wang, Santiago Figueira, Sergio Romano, and Stanislas Dehaene. A theory of memory for binary sequences: Evidence for a mental compression algorithm in humans, 2021.
The pattern of higher-order sequential effects (Figure 6) seems to suggest that behavior is consistent with some combination of precision cost and unpredictability cost. Neither model on its own explains the data particularly well (compare with Figure 5). Have the authors considered hybrid models?
Regarding the patterns of sequential effects in subjects' data and resulting from the models, we note that the main objective of Figure 5 was to illustrate the signs of the sequential effects occurring with the models, and in particular that the sequential effects are repulsive only with the precision-cost model of a Markov observer (m = 1). A diversity of behaviors, however, can result from the models, depending on the type of cost (precision or unpredictability), on the Markov order (m = 0 to 3), and on the values of the model's parameters (λ and κ). Figure 8, in Methods, shows the higher-order sequential effects for the two types of costs and the four Markov orders we consider: depending on the model, the sequential effects can be an increasing function of the stimulus generative probability, a decreasing function, or a non-monotonic function; but in all these cases the signs of the sequential effects are consistent with what is shown in Figure 5 and with the message we seek to convey: that in most cases the sequential effects are attractive, except in one case, with the precision-cost model of a Markov observer (m > 0).
Taking into account Reviewer #1's comment, however, and for the benefit of the reader, we have added the behavior of another model to Figure 5 in the revised version of the manuscript: that of the precision-cost model of a Bernoulli observer (m = 0). Not only does it show that this model does not yield repulsive sequential effects (unlike the precision-cost model of a Markov observer, m = 1; Figure 5c), but it also shows that this model yields attractive sequential effects (Figure 5a and 5b) whose behaviors as a function of the stimulus generative probability are qualitatively different from those of the precision-cost model of a Markov observer (m = 1), thus illustrating the diversity of behaviors resulting from the models.
However, we agree with Reviewer #1 that hybrid models are an interesting possibility. Thus, we have investigated a hybrid model in which both the precision cost and the unpredictability cost weigh on the representation of posteriors, each with a different weight parameter (denoted by λ_p and λ_u). We derive the optimal inference procedure under this double cost, and find that it results in a posterior that fluctuates with the recent history of stimuli and that is biased toward values of the generative parameter q that imply less entropic environments; in other words, it combines features of the two costs taken separately. For a given Markov order, m, this model is a generalization of both the precision-cost model and the unpredictability-cost model. It has one more parameter than these models (due to the two weights of the costs). To compare the ability of the models to capture subjects' data parsimoniously, we use as a comparison metric the well-known Bayesian Information Criterion (BIC), which is based on the log-likelihood but also includes a penalty term for the number of parameters. Although one might expect the behavior of most subjects to be best captured (as per the BIC) by this hybrid model, we find that its BIC is in fact larger (indicating a worse fit) than that of the best-fitting non-hybrid model for more than two thirds of subjects; and for half of the remaining subjects (for whom the BIC is lower with the hybrid model), the difference in BIC is lower than 6, which indicates weak evidence in support of the hybrid model. In other words, for a majority of subjects, the improvement in the log-likelihood that results from allowing a second type of cost is too modest to justify the additional parameter. This suggests that the two families of models capture specific behaviors that are prevalent in different subpopulations of subjects. In the revised manuscript, we comment on the hybrid model in the main text (l. 407-419) and we present it in more detail in Methods (pp. 44-46).
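The model-comparison logic can be summarized in a few lines. The log-likelihood values and trial count below are made up for illustration; only the BIC formula and the one-extra-parameter penalty reflect the analysis described above.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion (lower is better):
    each extra parameter costs ln(n_obs) BIC points."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

n_obs = 200  # illustrative number of trials
# A model with one extra parameter must improve the log-likelihood by
# more than 0.5 * ln(n_obs) (about 2.65 here) to lower the BIC.
assert bic(-99.0, 4, n_obs) > bic(-100.0, 3, n_obs)   # gain of 1: worse BIC
assert bic(-95.0, 4, n_obs) < bic(-100.0, 3, n_obs)   # gain of 5: better BIC
```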
There are a few references related to the costly inference that deserve mention:
– Kominers et al. (2016), who develop a model of costly updating.
– Dasgupta et al. (2020), who develop a model of resource-bounded inference.
– Graeber (202), who develops a model of inattentive inference.
We thank Reviewer #1 for pointing to these papers, which also consider the hypothesis of a cost in the inference process of decision-makers. We note that Dasgupta et al. (2020) also use the Kullback-Leibler divergence as a distance metric to be minimized. We comment on these papers in the Discussion of the revised manuscript (l. 804-827).
The authors cite a number of earlier modeling papers, but it's not clear to me what those previous models would predict about the new data, and whether the new models proposed in this paper predict the earlier data.
The main kind of other models that we refer to, in the Introduction and in the Discussion, are 'leaky-integration' models, in which past observations are gradually forgotten (through an exponential discount). We show that the optimal solution to the problem of constrained inference (Equation (1)) with a precision cost (Equation (3)) is precisely one in which remote patterns in the sequence of observed stimuli are discounted, in the posterior, through an exponential filter. In other words, we recover the leaky-integration model (i.e., a model identical, for instance, to the one examined by Meyniel, Maheu and Dehaene, 2016), and thus the predictions of the precision-cost model are exactly those of a leaky-integration model (precision-cost models with different Markov orders differ by the length of the sequences of observations that are counted through an exponential filter). One difference of our study, in comparison with previous works, is that the leaky integration is derived as the optimal solution to a problem of constrained optimization, rather than posited a priori in the definition of the model. We have improved the revised manuscript by clarifying in the Introduction that we recover leaky integration (l. 76-79); by explaining in more detail, in Results, that the precision-cost model results in a leaky-integration model, and by explicitly providing the posterior in the case of a Bernoulli observer (with the exponentially filtered counts of past observations, Equation (6)); and by pointing out, in the Discussion, that we derive the exponential filtering from the constrained-inference problem, rather than assuming leaky integration from the start (l. 706-716).
Reference:
Florent Meyniel, Maxime Maheu, and Stanislas Dehaene. Human Inferences about Sequences: A Minimal Transition Probability Model. PLoS Computational Biology, 12(12):1–26, 2016.
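As a sketch of how the exponential discount acts on the sufficient statistics of a Bernoulli observer, the batch-filtered counts and a trial-by-trial leaky update are the same quantity. This is our own illustration of the general idea, not the manuscript's Equation (6); the decay constant is an arbitrary illustrative value.

```python
def batch_counts(seq, gamma=0.8):
    """Exponentially discounted counts of the two stimuli (1 = A, 0 = B):
    an observation t steps in the past is weighted by gamma**t."""
    T = len(seq)
    nA = sum(gamma ** (T - 1 - t) * x for t, x in enumerate(seq))
    nB = sum(gamma ** (T - 1 - t) * (1 - x) for t, x in enumerate(seq))
    return nA, nB

def leaky_counts(seq, gamma=0.8):
    """Equivalent online form: leaky integration, one update per trial,
    with no need to store the sequence."""
    nA = nB = 0.0
    for x in seq:
        nA = gamma * nA + x
        nB = gamma * nB + (1 - x)
    return nA, nB

seq = [1, 1, 0, 1, 0, 0, 1, 1]
assert all(abs(a - b) < 1e-12
           for a, b in zip(batch_counts(seq), leaky_counts(seq)))
```

The inferred probability of A is then nA / (nA + nB); because older observations are down-weighted geometrically, remote observations are 'forgotten', which is the sense in which the optimal precision-cost posterior coincides with leaky integration.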
Reviewer #2 (Recommendations for the authors):
I would recommend the authors conduct additional modeling analyses including models that express the alternative hypotheses clearly, generate and present individual model predictions, and consider running a (possibly online) larger experiment with incentives in place. The ideal test of the hypothesis would be an experiment with multiple levels of incentives, showing that people adjust their representations as this model predicts as they trade off computational costs and real rewards. If monetary rewards cannot be used, incentives could be implemented by having a longer delay after erroneous predictions.
We thank Reviewer #2 for her/his attention to our paper and for her/his comments.
As for the point on the models: many models in the sequential-effects literature (Refs. [7-12] in the manuscript) are 'leaky-integration' models that interpret sequential effects as resulting from an attempt to learn the statistics of a sequence of stimuli, through exponentially decaying counts of the simple patterns in the sequence (e.g., single stimuli, repetitions, and alternations). In some studies, the 'forgetting' of remote observations that results from the exponential decay is justified by the fact that people live in environments that are usually changing: it is thus natural that they should expect the statistics underlying the task's stimuli to undergo changes (although in most experiments, they do not), and if they expect changes, then they should discard old observations that are no longer relevant. This theoretical justification raises the question as to why subjects do not seem to learn that the generative parameters in these tasks are in fact not changing — all the more as other studies suggest that subjects are able to learn the statistics of changes (and, consistently, are able to adapt their inference) when the environment does undergo changes (Refs. [42,57]).
Our models are derived from a different approach: we derive behavior from the resolution of a problem of constrained optimization of the inference process; it is not a phenomenological model. When the constraint that weighs on the inference process is a cost on the precision of the posterior, as measured by its entropy, we find that the resulting posterior is one in which remote observations are 'forgotten', through an exponential discount, i.e., we recover the predictions of the leaky-integration models, which past studies have empirically found to be reasonably good accounts of sequential effects. (Thus these models are already in our model comparison.) In our framework, the sequential effects do not stem from the subjects' irrevocable belief that the statistics of the stimuli change from time to time, but rather from the difficulty that they have in representing precise beliefs: a rather different theoretical justification.
Furthermore, we show that a large fraction of subjects are not best fitted by precision-cost models (i.e., they are not best fitted by leaky integration), but instead by unpredictability-cost models. These models suggest a different explanation of sequential effects: that they result from the subjects favoring predictable environments in their inference.
In the revised version of the manuscript, we have made clearer that the derivation of the optimal posterior under a precision cost results in the exponential forgetting of remote observations, as in the leaky-integration models. We mention it in the abstract, in the Introduction (l. 76-78), in the Results when presenting the precision-cost models (l. 264-278), and in the Discussion (l. 706-716).
As for the point on incentivization: we agree that it would be very interesting to measure whether and to what extent the performance of subjects increases with the level of incentivization. Here, however, we wanted, first, to establish that subjects' behavior could be understood as resulting from inference under a cost, and second, to examine the sensitivity of their predictions to the underlying generative probability — rather than to manipulate a tradeoff involving this cost (e.g., with financial reward). We note that we do find that subjects are sensitive to the generative probability, which implies that they exhibit some degree of motivation to put some effort into the task (which is the goal of incentivization), in spite of the lack of economic incentives. But it would indeed be interesting to know how a potential sensitivity to reward interacts with the sensitivity to the generative probability. Furthermore, as Reviewer #2 mentions, some studies show that incentives affect probability-matching behavior: it is then unclear whether the introduction of incentives in our task would change the inference of subjects (through a modification of the optimal tradeoff that we model), or whether it would change their probability-matching behavior, as modeled by our generalized probability-matching response-selection strategy, or both. Note that we disentangled both aspects in our modeling and that our conclusions are about the inference, not the response-selection strategy. We deem the incentivization effects very much worth investigating; but they fall outside of the scope of our paper.
We now mention this point in the Discussion of the revised manuscript (l. 828-840).
Reviewer #3 (Recommendations for the authors):
1. I recommend using regression analyses (e.g. a GLM) where regressors for both previous choices and previous stimuli are entered (as e.g. in Urai et al. Nature Communications 2017) to resolve this possible confound. Such a GLM also allows looking further back in time at the impact of longer lags on decisions. This would allow testing for example if the repulsive bias to the stimulus at lag 2 (in sequences 111 or 101) extends to longer lags, both in experimental data and simulations.
We thank Reviewer #3 for pointing out the possibility that subjects may have a tendency to repeat motor responses that is not related to their inference.
We note that in Urai et al., 2017, as in many other sensory 2AFC tasks, successive trials are independent: the stimulus at a given trial is a random event independent of the stimulus at the preceding trial; the response at a given trial should in principle be independent of the stimulus at the preceding trial; and the response at the preceding trial conveys no information about the response that should be given at the current trial (although subjects might exhibit a serial dependency in their responses). By contrast, in our task an event is more likely than not to be followed by the same event (because observing this event suggests that its probability is greater than 0.5); and a prediction at a given trial should be correlated with the stimuli at the preceding trials, and with the predictions at the preceding trials. In a logit model (or any other GLM), this would mean that the predictors exhibit multicollinearity, i.e., that they are strongly correlated. Multicollinearity does not reduce the predictive power of a model, but it makes the identification of parameters extremely unreliable: in other words, we would not be able to confidently attribute to each predictor (e.g., the past observations and the past responses) a reliable weight in the subjects' decisions. Furthermore, our study shows that past stimuli can yield both attractive and repulsive effects, depending on the exact sequence of past observations. To capture this in a (generalized) linear model, we would have to introduce interaction terms for each possible past sequence, resulting in a very high number of parameters to be identified.
However, this does not preclude the possibility that subjects may have a motor propensity to repeat responses. In order to take this hypothesis into account, we examined the behavior, and the ability to capture subjects' data, of models in which the response-selection strategy allows for the possibility of repeating, or alternating, the preceding response. Specifically, we consider models that are identical to those in our study, except for the response-selection strategy, which is an extension of the generalized probability-matching strategy, in which a parameter η, greater than −1 and lower than 1, determines the probability that the model subject repeats its preceding response or, conversely, alternates and chooses the other response. With probability 1 − |η|, the model subject follows the generalized probability-matching response-selection strategy (parameterized by κ). With probability |η|, the model subject repeats the preceding response, if η > 0, or chooses the other response, if η < 0. We included the possibility of an alternation bias (negative η), but we find that no subject is best fitted by a negative η; thus we focus on the repetition bias (positive η). We fit the models by maximizing their likelihoods, and we compared, using the Bayesian Information Criterion (BIC), the quality of their fit to that of the original models that do not include a repetition propensity.
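The extended response-selection strategy can be sketched as follows. This is our own simplified illustration, not the fitted model: the inferred probability q is held fixed rather than updated by inference, and the function names are ours. The check at the end illustrates the stationarity argument for property (i) reported in Methods: the unconditional proportion of predictions A is (approximately) unchanged by the repetition propensity.

```python
import numpy as np

rng = np.random.default_rng(1)

def select_response(q, prev, kappa=2.8, eta=0.2):
    """Extended response selection (sketch): with probability |eta|,
    repeat (eta > 0) or alternate (eta < 0) the previous response;
    otherwise apply generalized probability matching to q = P(A)."""
    if prev is not None and rng.random() < abs(eta):
        return prev if eta > 0 else 1 - prev
    p_match = q ** kappa / (q ** kappa + (1 - q) ** kappa)
    return int(rng.random() < p_match)

def prop_A(eta, q=0.7, n=50000):
    """Unconditional proportion of predictions A over a long run."""
    prev, total = None, 0
    for _ in range(n):
        prev = select_response(q, prev, eta=eta)
        total += prev
    return total / n

p_with = prop_A(0.2)     # with repetition propensity
p_without = prop_A(0.0)  # plain generalized probability matching
```

In the stationary regime, P(A) satisfies P = η·P + (1 − η)·p_match for η > 0, hence P = p_match: repetition inflates the conditional probabilities toward the previous response but leaves the unconditional proportion untouched.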
Taking into account the repetition bias of subjects leaves the assignment of subjects into the two families of inference cost mostly unchanged. We find that for 26% of subjects the introduction of the repetition propensity does not improve the fit (as measured by the BIC) and can therefore be discarded. For 47% of subjects, the fit is better with the repetition propensity (lower BIC), and the best-fitting inference model (i.e., the type of cost, precision or unpredictability, and the Markov order) is the same with or without repetition propensity. Thus for 73% (= 26 + 47) of subjects, allowing for a repetition propensity does not change the inference model. We also find that the best-fitting parameters λ and κ, for these subjects, are very stable when allowing or not for the repetition propensity. For 11% of subjects, the fit is better with the repetition propensity, and the cost type of the inference model is the same (as without the repetition propensity), but the Markov order changes. For the remaining 16%, both the cost type and the Markov order change.
Thus for a majority of subjects, the BIC is improved when a repetition propensity is included, suggesting that there is indeed a tendency to repeat responses, independent of the subjects' inference process and of the generative stimulus probability. In Figure 7, in Methods, we show the behavior of the models without and with repetition propensity, with a parameter η = 0.2 close to the average best-fitting value of η across subjects. We show, in Methods, that (i) the unconditional probability of a prediction A, p(A), is the same with and without repetition propensity, and that (ii) the conditional probabilities p(A|A) and p(A|B) when η ≠ 0 are weighted means of the unconditional probability p(A) and of the conditional probabilities when η = 0 (see pp. 47-49 of the revised manuscript).
In summary, our results suggest that a majority of subjects do exhibit a propensity to repeat their responses. Most subjects, however, are best fitted by the same inference model, with or without repetition propensity, and the parameters λ and κ are stable across these two cases; this speaks to the robustness of our model fitting. We conclude that the models of inference under a cost capture essential aspects of the behavioral data, which does not exclude, and is not confounded by, the existence of a tendency, in subjects, to repeat motor responses.
In the revised manuscript, we present this analysis in Methods (pp. 47-49), and we refer to it in the main text (l. 353-356 and 400-406).
2. Authors could test whether the very first prediction on each block already shows a signature of the stimulus generative probability p and whether the prediction is stable within blocks.
The assumption that subjects reach their asymptotic behavior after being presented with 200 observations in the passive trials should indeed be tested. To that end, we compared the behavior of the subjects in the first 100 active trials with their behavior in the remaining 100 active trials. The results of this analysis are shown in Figure 9.
For most values of the stimulus generative probability, the unconditional proportions of predictions A in the first and the second half (panel a, solid and dashed gray lines) are not significantly different (panel a, white dots), except for two values (p-value < 0.05; panel a, filled dots). Although in most cases the difference between the two halves is not significant, in the second half the proportions of predictions A seem slightly closer to the extremes (0 and 1), i.e., closer to the optimal proportions. As for the sequential effects, they appear very similar in the two halves of trials. We conclude that, for the purpose of our analysis, we can reasonably consider the behavior of the subjects to be stationary throughout the task.
On top of that, I provide here some recommendations on how the clarity of the manuscript could be improved. Thorough work on improving the text, the figures, and the general flow and organization of the manuscript would make a major difference in the impact of this paper on the wider community. What is very frustrating for the reader is to see one statement (e.g. a choice of methods or a result) exposed at some point and then the explanation for that statement much later in the manuscript (see below). Here are my suggestions:
– I believe the models would be better motivated to the reader if the Introduction made a brief mention of the ideas of bounded rationality (and related concepts) and justified focus on these two specific types of cost – all of which are nicely detailed in the Discussion.
– Please try to make the figures more intuitive to understand; for example, using a colour code for the different cost types may help differentiate them. A tree-like representation of history biases (showing the mean accuracy for different types of sequences, e.g. in Meyniel et al. 2016 PLoS CB) may be more intuitive to read and reveal a richer structure in the data and models than the current Figures 5-6 (also given that the authors do not comment much on the impact of the "probability of observation 1", so perhaps these effects could be marginalized out).
– Figure 3 is really helpful in understanding the two types of cost (much more than the equations for most readers). Unfortunately, it is hardly referred to in the main text. I suggest rewriting the presentation of that part of the Results section around these examples.
– Why and how the two types of costs give rise to historical effects (beyond the fact that these costs generate suboptimalities) is a central idea in the paper, but it is only exposed in the Discussion section. Integrating these explanations within the Results section would help a lot. Plotting some example simulations for a sequence of trials and/or some cartoon explanations of how the historical effects emerge for the different models would also help.
– Placing figures in Methods does not help in my opinion – please consider moving to the main text or as supplementary Figures.
We thank Reviewer #3 for these recommendations, which we have followed in the revised manuscript.
– We now explain early in the Introduction how our approach relates to other “resource-rational” accounts of perceptual and cognitive processes, in which behavior is assumed to result from some constrained optimization. We also provide more details on the two costs we consider in the paper (l. 65–79).
– As for the color code for the different costs: the colors used for each cost in Figure 3 are consistent across panels. In the other figures, we have favored a color-coding that emphasizes the sequential effects (allowing one to distinguish, for instance, p(AA) and p(AB)). However, in the revised manuscript, wherever a figure shows a simulation resulting from one of the two costs, we have set the color of the figure’s axes, and of the diagonal (\bar p = p), to the color that corresponds to the cost (consistent with Figure 3), so as to facilitate the identification of the models across figures.
– We have now included a “tree-like” representation of the sequential effects, in Figure 6d of the revised manuscript. The impact of the stimulus generative probability was “marginalized out”, as suggested by Reviewer #3. The non-marginalized tree representations (per stimulus generative probability) can be found in Methods (Figure 13).
– We now refer more often to Figure 3 in the Results section of the main text (p. 11–15), so as to connect what is described in the text (e.g., the non-convergence of the posterior with the precision cost) to its illustration in Figure 3. In addition, we refer to it in the section describing the sequential effects of the model, so as to make clear that with the precision-cost models the sequential effects stem from the non-convergence of the posterior (p. 20).
– In the revised manuscript we give more detailed explanations as to how each cost gives rise to sequential effects (which, we agree, is an important conceptual idea in the paper). In particular, we have revised the passages in the Results section in which we present these costs and the corresponding solutions to the optimization problem (p. 11–15). Furthermore, we have added a new panel to Figure 3 (panel d), which shows, as suggested by the reviewer, the “trajectories” of a subject’s estimates for different sequences of observations (though with the same stimulus generative probability), under the two costs and under no cost (the Bayesian solution). It shows that the Bayesian observer converges to the correct value of the stimulus generative probability; that the observer under an unpredictability cost converges to an erroneous value; and that the observer under a precision cost does not converge, but keeps fluctuating with the history of stimuli. Finally, when describing the sequential effects of the model, we explain in detail how and why each cost gives rise to the sequential effects shown in Figure 4.
– We have reconsidered the figures in Methods and whether we should place them elsewhere. We now place in Supplementary Materials the model-fitting confusion matrix and the figure showing the stability of the cost-weight parameter across medium and extreme values of the stimulus generative probability. We have kept the two figures (Figures 7 and 8) showing the sequential effects of each of the eight models (with each cost type and each Markov order) in Methods, because we think that they provide interesting information and that they are probably expected by the reader, although it is not crucial that they appear in the flow of the main text.
https://doi.org/10.7554/eLife.81256.sa2
Article and author information
Author details
Funding
Alfred P. Sloan Foundation (Grant G-2020-12680)
 Rava Azeredo da Silveira
CNRS (UMR8023)
 Rava Azeredo da Silveira
Fondation Pierre-Gilles de Gennes pour la recherche (Ph.D. Fellowship)
 Arthur PratCarrabin
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Doron Cohen and Michael Woodford for inspiring discussions. This work was supported by the Alfred P. Sloan Foundation through grant G-2020-12680 and the CNRS through UMR8023. A.P.C. was supported by a Ph.D. fellowship of the Fondation Pierre-Gilles de Gennes pour la Recherche. We acknowledge computing resources from Columbia University’s Shared Research Computing Facility project, which is supported by NIH Research Facility Improvement Grant 1G20RR030893-01, and associated funds from the New York State Empire State Development, Division of Science Technology and Innovation (NYSTAR) Contract C090171, both awarded April 15, 2010.
Ethics
The study was approved by the ethics committee Île de France VII (CPP 08021). Participants gave their written consent prior to participating.
Senior Editor
 Floris P de Lange, Donders Institute for Brain, Cognition and Behaviour, Netherlands
Reviewing Editor
 Hang Zhang, Peking University, China
Version history
 Received: June 21, 2022
 Preprint posted: June 22, 2022 (view preprint)
 Accepted: December 11, 2023
 Version of Record published: January 15, 2024 (version 1)
Copyright
© 2024, PratCarrabin et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.