1 Introduction

In naturalistic environments, novelty can be a source of both reward and danger. Despite these duelling aspects, investigations of novelty in reinforcement learning (RL) have mostly focused on neophilia, driven by optimism in the face of uncertainty and hence information-seeking (Duff, 2002a; Dayan and Sejnowski, 1996; Gottlieb et al., 2013; Wilson et al., 2014). Neophobia has attracted fewer computational studies, apart from some interesting evolutionary analyses (Greggor et al., 2015).

Excessive novelty seeking and excessive novelty avoidance can both be maladaptive – they are flip sides of a disturbed balance. Here, we seek to examine potential sources of such disturbances, for instance, in distorted priors about the magnitude or probabilities of rewards (which have been linked to mania; Radulescu and Niv, 2019; Bennett and Niv, 2020; Eldar et al., 2016) or threats (linked to anxiety and depression; Bishop and Gagne, 2018; Paulus and Yu, 2012), or in extreme risk attitudes (Gagne and Dayan, 2022).

To do this, we take advantage of a recent study by Akiti et al. (2022) on the behaviour of mice exploring a familiar open-field arena after the introduction of a novel object near one corner. The mice could move freely and interact with the object at will. Akiti et al. (2022) performed detailed analyses of how individual animals’ trajectories reflected the novel object, including using DeepLabCut (Mathis et al., 2018) to track the orientation of the mice relative to the object and MoSeq (Wiltschko et al., 2020) to extract behavioural ‘syllables’ whose prevalence was affected by it. The animals differed markedly in how they approached the object, and in the temporal pattern of that approach. For the former, Akiti et al. (2022) observed two characteristic positionings of the animals when near the object: ‘tail-behind’ and ‘tail-exposed’, associated respectively with cautious risk-assessment and engagement. For the latter, there was substantial heterogeneity along a spectrum of timidity, with all animals initially performing tail-behind approach, but some taking much longer than others (or failing altogether) to transition to tail-exposed approach.

Akiti et al. (2022) provide a model-free account of their data, focusing on the prediction of threat and its realization in the tail of the striatum. In contrast, we provide a model-based account, focusing on the rich details of the dynamics of approach carefully characterized by Akiti et al. (2022). These include intermittency (i.e., why animals retreat from the object), approach drive (why animals approach in the first place), the significant long-run approach of timid animals despite having reached the “avoid” state, and how the intensity of approach increases when brave animals transition from risk-assessment to engagement and then decreases over the long run of the “engagement” phase. Our model also provides an alternative explanation for why animals learn to avoid the novel object in a completely benign environment. Through modeling these additional statistics and behaviors, we reveal the multidimensional nature of timidity in exploration, which cannot be captured just in terms of time spent at the object.

We model an abstract depiction of the behaviour of individual mice by combining the Bayes-adaptive Markov Decision Process (BAMDP) treatment of rational exploration (Dearden et al., 2013; Duff, 2002a; Guez et al., 2013) with two sources of risk-sensitivity: the prior over the potential hazard associated with the object, and the conditional value at risk (CVaR) probability distortion mechanism (Artzner et al., 1999; Chow et al., 2015; Gagne and Dayan, 2022; Bellemare et al., 2023).

In a BAMDP, the agent maintains a belief about the possible rewards, costs and transitions in the environment, and decides upon optimal actions based on these beliefs. Since the agent can optionally reuse or abandon incompletely known actions based on what it discovers about them, these actions traditionally enjoy an exploration bonus or “value of information”, which generalizes the famous Gittins indices (Gittins, 1979; Weber, 1992). In addition to beliefs about reward, the agent also maintains a belief about potential hazard which is the first source of risk-sensitivity. These beliefs are initialized as prior expectations about the environment; and so are readily subject to individual differences.

In addition to beliefs about hazards which may be specific to a particular environment, we include a second source of trait risk-sensitivity. We consider optimizing the CVaR, in which agents concentrate on the average value within lower (risk-averse) or upper (risk-seeking) quantiles of the distribution of potential outcomes (Rigter et al., 2021). In the context of a BAMDP, this can force agents to pay particular attention to hazards. More extreme quantiles are associated with more extreme risk-sensitivity; and again are a potential locus of individual differences (as examined in regular Markov decision processes in the context of anxiety disorders in Gagne and Dayan, 2022).
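As a concrete illustration of this quantile-based definition (a sketch of the static CVaR over sampled returns, not of the nested optimization actually used in the model; the sample values are made up), the CVaRα of a set of returns can be estimated by averaging the worst α-fraction of samples:

```python
import math

def cvar(returns, alpha):
    """Empirical conditional value at risk: the mean of the worst
    alpha-fraction of sampled returns. alpha = 1 recovers the ordinary,
    risk-neutral expectation; smaller alpha is more risk-averse.
    (A simple tail estimator; continuous treatments interpolate the tail.)"""
    ordered = sorted(returns)                    # ascending: worst outcomes first
    k = max(1, math.ceil(alpha * len(ordered)))  # size of the lower tail
    return sum(ordered[:k]) / k

returns = [-10.0, -1.0, 0.0, 2.0, 5.0, 5.0, 6.0, 8.0, 9.0, 12.0]
print(cvar(returns, 1.0))   # risk-neutral mean: 3.6
print(cvar(returns, 0.2))   # mean of the worst 20% of outcomes: -5.5
```

With α = 0.2, the rare bad outcomes (in the model, detection and expiration) dominate the evaluation, even though they barely move the risk-neutral mean.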

Here, we present a behavioral model of risk-sensitive exploration. Our agent computes optimal actions using the BAMDP framework under the CVaR objective. This model provides a normative explanation of individual variability – the agent makes decisions by trading off potential reward and threat in a principled way. Different priors and risk sensitivities lead to different exploratory schedules, from timid (indicative of neophobia) to brave. The model captures differences in duration, frequency, and type of approach (risk-assessment versus engagement) across animals, and through time. We report features of the different behavioural trajectories the model is able to capture, providing mechanistic insight into how the trade-off between potential reward and threat leads to rational exploratory schedules. Behavioral phenotypes emerge from the interaction of the separate computational mechanisms elucidated by our model-based treatment. This paves the way for future experimental investigations of these mechanisms, including the unexpected non-identifiability of our two sources of risk-sensitivity: hazard priors and CVaR.

2 Results

2.1 Behavior Phases and Animal Groups

Our goal is to provide a computational account of the exploratory behavior of individual mice under the assumption that they have different prior expectations and risk sensitivities. We start from Akiti et al. (2022)’s observation that the animal approaches and remains within a threshold distance (determined by them to be 7cm) of the object in “bouts”, which can be characterized as “cautious” or tail-behind (if the animal’s nose lies between the object and tail) or otherwise as “confident” or tail-exposed. We sought to capture both these qualitative differences (cautious versus confident) and aspects of the quantitative changes in bout durations and frequencies as the animal learns about its environment.

In order to focus narrowly on interaction with the object, we abstracted away the details of the spatial interaction with the object, instead fitting boxcar functions to the percentages of time gcau(t) and gcon(t) that the animal spends in cautious and confident bouts around time t in the apparatus. We can then encompass the behaviour of most animals via four coarse phases of behaviour that arise from two binary factors: whether the animal is mainly performing cautious or confident approaches, and whether bouts happen frequently, at a peak rate, or at a lower, steady-state rate. The time an animal spends near the object in one of these phases reflects the product of how frequently it visits the object and how long it stays per visit. We average these two factors within each phase.
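A boxcar fit of this kind can be sketched as an exhaustive change-point search minimizing squared error. This is an illustrative reconstruction under our own simplifying choices (single change point, exact search, made-up data), not necessarily the exact fitting procedure used:

```python
def fit_boxcar(y):
    """Fit a one-step boxcar (piecewise-constant with a single change point)
    by exhaustive search over the change point, minimizing squared error.
    Returns (t1, level_before, level_after). A two-change-point fit
    (for t1 and t2) can be obtained by nesting this search."""
    best = None
    for t in range(1, len(y)):
        a = sum(y[:t]) / t                       # mean level before the change
        b = sum(y[t:]) / (len(y) - t)            # mean level after the change
        sse = sum((v - a) ** 2 for v in y[:t]) + sum((v - b) ** 2 for v in y[t:])
        if best is None or sse < best[0]:
            best = (sse, t, a, b)
    return best[1], best[2], best[3]

# Noiseless toy data: high early cautious-approach time, then a drop.
g_cau = [40, 42, 41, 39, 10, 9, 11, 10]
print(fit_boxcar(g_cau))   # (4, 40.5, 10.0)
```

The recovered change point and the two levels correspond to t1 and the phase-averaged approach percentages on either side of it.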

Consider the behaviour of the animal in Fig 1a. Here, gcau(t) (top graph) makes a transition from an initial level (during the “cautious” phase) to a final steady-state level, a change which we simplify as being abrupt, at a transition point t = t1. At the same timepoint, gcon(t) (second row) makes a transition from 0 to a peak level of confident approach (defining the “peak confident” phase). Finally, there is another transition at time t2 from peak to a steady-state confident approach time (in the “steady-state confident” phase). The lower two rows of Fig 1a show the duration of the bouts in the relevant phases, and the frequency per unit time of such bouts. The upper panel of Fig 1b shows the same data in a more convenient manner. The colours in the top row indicate the type of approach (green is cautious; blue is confident). The second and third rows indicate the duration and frequency of approach. Darker colours represent higher values.

a.) Detailed visualization of minute-to-minute statistics of animal 25 (in the sessions after the introduction of the novel object). From top to bottom, the plots show % time within (Akiti et al., 2022)’s 7cm threshold of the object with (cautious) and without (confident) tail behind, the length of a bout at the object and the number of bouts per minute. Orange lines are the box-car functions fitted to segment phases and illustrate the change in time, duration, and frequency statistics across phases. The transition points t1 and t2, as well as the initial cautious, final cautious, peak confident and steady-state confident approach percentage times, are shown. The right plots show examples of minute-to-minute and phase-averaged approach time, duration, and frequency for (b.) brave, (c.) intermediate, and (d.) timid animals. Note that animals are ordered by the group-timidity animal index (see main text Section 2.2.6). Green indicates cautious and blue indicates confident approach. Darker colors indicate higher values. Averaging statistics over phases ignores idiosyncrasies of behavior to provide a high-level summary of learning dynamics.

The orange coloured lines in Fig 1a and the lower panel in Fig 1b render the abstracted behaviour of this animal in an integrated form, showing how we generate “phase-level” statistics from minute-to-minute statistics. Averaging statistics over phases ignores idiosyncrasies of behavior and allows us to fit the high-level statistics of behavior: phase-transition times, and phase-averaged durations and frequencies. We consider animal 25 to be a “brave” animal because of its transition to peak and then steady-state confident approach. There were 12 brave mice out of the 26 in total.

Fig 1c shows an example of another characteristic “intermediate” animal. This animal makes a transition from cautious to confident approach (where both duration and frequency of visits can change), but the approach time during the confident phase does not decrease. Hence, intermediate animals do not have a transition from peak to steady-state confident phase. There were 5 such intermediate mice.

Fig 1d shows the behaviour of an example of the last class of “timid” animals. This animal never makes a transition to confident approach. Hence, for it, gcon(t) = 0. However, the cautious approach time makes a transition to a non-zero steady state, often via a change in frequency, defining the fourth phase (“steady-state cautious”). There were 9 such timid mice.

Fig 2 summarizes our categorization of the animals into the three groups: brave, intermediate, and timid based on the phases identified in the animal’s exploratory trajectories. Timid animals spend no time in confident approach. Brave animals differ from intermediate animals in that their approach time during the first ten minutes of the confident phase is greater than the last ten minutes (steady-state phase).

Separating the three animal groups. The x-axis shows the ratio of total time spent in confident versus cautious bouts. The y-axis shows the ratio of bout time in the first 10 minutes of confident approach and the last 10 minutes of confident approach (set to 0 for timid animals that do not have a confident phase). The horizontal line indicates y = 1.0. All 9 timid animals are close to the origin. We separate brave and intermediate animals according to the y = 1 line.

2.2 A Bayes-adaptive Model-based Model for Exploration and Timidity

2.2.1 State description

We use a model-based Bayes-adaptive reinforcement learning model (BAMDP) to provide a mechanistic account of the behavior of the mice under threat of predation. This extends the model-free description of threat in Akiti et al. (2022) by constructing various mechanisms to explain additional facets of the dynamics of the behavior.

Underlying the BAMDP is a standard multi-step decision-making problem of the sort that is the focus of a huge wealth of studies (Russell and Norvig, 2016). We cartoon the problem with the four real and three counterfactual states shown in Fig 3. The nest is a place of safety (modelling all places in the environment away from the object, and ignoring, for instance, the change to thigmotactic behaviour that the mice exhibit when the object is introduced). The animal can choose to stay at the nest (possibly for multiple steps) or to make a cautious or confident approach.

Markov decision process underlying the BAMDP model. Four real (nest, cautious object, confident object, retreat) and three imagined (cautious detect, confident detect, dead) states. Agent actions are italicized. Blue arrows indicate (possibly stochastic) transitions caused by agent actions. Green arrows indicate (possibly stochastic) forced transitions. Cautious approach provides less informational reward r2 < r1 but has a smaller chance of death p2 < p1 compared to confident approach. Travel and dying costs are not shown.

At an approach state, the modelled agent can either stay, or return to the nest via the retreat state; the latter happens anyhow after four steps. The animal also imagines the (in reality, counter-factual) possibility of being detected by a potential predator. It can then either manage to escape back to the nest, or alternatively expire. We parameterize costs associated with the various movements; and also the probability of unsuccessful escape starting from confident (p1) or cautious (p2 < p1) approach.

We describe the dilemma between cautious and confident approach as a calculation of the risk and reward trade-off between the two types of approach. Cautious approach (the “cautious object” state) has a lower (informational) reward (e.g. because in the cautious state the animal spends more cognitive effort monitoring for lurking predators rather than exploring the object). However, cautious approach leads to a lower probability of expiring if detected than does confident approach (the “confident object” state) (e.g. because in the cautious state the animal is better poised to escape). Risk aversion modulates the agent’s choice of approach type.

The next sections describe the components of the BAMDP model: a characterization of the time-dependent risk of predation, an informational reward for exploration, and a method for handling risk sensitivity. Finally, we will discuss the way we fitted individual mice, and present a full analysis of their behaviour. We report on recovery simulations in the supplement.

2.2.2 Modeling Threat with a Bayesian, Generalizing Hazard Function

Whilst exploring the novel object in the “object” state, the decision problem allows for the possibility of detection, and then attack, by a predator whose appearance is governed by a temporal hazard function (see Fig 4).

Hazard function learning for (a.) brave and (b.) timid animals. Brave animals start with a flexible hazard prior with a low mean for h2. This leads to longer bouts (first length 2, then 3 and 4), which imply that the hazard posterior quickly approaches zero (here, after 10 bouts). Timid animals start with an inflexible hazard prior with a higher mean h2, and are limited to length 2 bouts. The hazard posterior only changes slightly after 10 bouts.

Formally, the probability of detection given either cautious or confident approach is modelled using the hazard function hτ, where τ is the number of steps the animal has so far spent at the object in the current bout. In a key simplification, this probability resets back to baseline upon a return to the nest. We treat the hazard function as being learned in a Bayesian manner, from the experience (in this case, of not being detected). We assume that the animal has the inductive bias that the hazard function is increasing over time, reflecting a potential predator’s evidence accumulation process about the prey. Therefore, we derive it from a succession of independent Beta-distributed random variables θ1 = 0; θτ ∼ Beta(μτ, στ), τ > 1 as:

hτ = 1 − (1 − θ1)(1 − θ2) · · · (1 − θτ),    (1)

so that

hτ+1 = hτ + (1 − hτ)θτ+1 ≥ hτ,    (2)

rather as in what is known as a stick-breaking process. Note that, for convenience, we parameterize the Beta distribution in terms of its mean μ and standard deviation σ rather than its pseudocounts, as is perhaps more common.

Eq 2 shows that the hazard function is always increasing. As we will see, the duration of bouts at the object depends on the (discrete) slope of the hazard function, with steep hazard functions leading to short bouts. In our model, the agent can stay at the object 2, 3 or 4 turns (we take θ1 = 0 as a way of coding actual approach). [We therefore sometimes refer to cautious−k or confident−k bouts in which the model animal spends k = {2, 3, 4} steps at the object.] Hence the collection of random variables, hτ, is derived from six parameters (the mean μτ and the standard deviation στ of the Beta distribution for each turn τ ∈ {2, 3, 4}). These start at initial prior values, and are subject to an update from experience. Here, that experience is exclusively negative, since there is no actual predator; this implies that the update has a simple, closed form (see Methods). The animals’ initial ignorance, which is mitigated by learning, makes the problem a BAMDP, whose solution is a risk-averse itinerant policy.

A particular characteristic of the noisy-or hazard function of Eq 1 is that the derived bout duration increases progressively. This is because not being detected at τ = 2, say, provides information that θ2 is small, and so reduces the hazard function for longer bouts τ > 2.
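This construction, and the update from purely negative evidence, can be sketched as follows. This is a minimal illustration assuming the standard conjugate Beta-Bernoulli update (with the mean/standard-deviation parameterization converted to pseudocounts for convenience); all numerical values are made up:

```python
def hazard(theta):
    """Noisy-or hazard: h_tau = 1 - prod_{tau' <= tau} (1 - theta_tau'),
    with theta_1 = 0, so theta[i] is the extra detection probability
    contributed by the (i+2)-th step of a bout."""
    h, survive = [], 1.0
    for th in theta:
        survive *= 1.0 - th          # probability of remaining undetected
        h.append(1.0 - survive)
    return h

def beta_pseudocounts(mean, sd):
    """Convert a Beta distribution's (mean, sd) to pseudocounts (a, b)."""
    nu = mean * (1.0 - mean) / sd**2 - 1.0
    return mean * nu, (1.0 - mean) * nu

a, b = beta_pseudocounts(0.3, 0.2)   # an illustrative prior on theta_2

# Each completed length-2 bout in which the animal was NOT detected is a
# Bernoulli "failure" for theta_2, so b grows by one per safe bout.
n_safe_bouts = 10
posterior_mean = a / (a + b + n_safe_bouts)

print(hazard([0.3, 0.3, 0.3]))   # prior hazard at tau = 2, 3, 4 (increasing)
print(round(posterior_mean, 3))  # pessimism shrinks after safe experience
```

The monotone increase of the printed hazard values reflects Eq 2, and the shrinking posterior mean is the closed-form update from exclusively negative experience mentioned above.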

Fig 4 shows the fitted priors of a brave (top) and timid (bottom) animal, as well as the posteriors after ten exploratory bouts. The brave animal starts with a high variance prior. This flexibility allows it to transition from short, cautious bouts (duration τ = 2) to longer confident bouts (duration τ = 3, 4), reducing the hazard function to near zero. The timid animal has a low variance prior, and does not stay long enough at the object to build sufficient confidence (only performing duration τ = 2 bouts). As a result, its posterior hazard function remains similar to its prior.

2.2.3 Modeling the Motivation to Approach

We model the mouse’s drive to approach the object as stemming from its belief that the object might be rewarding. In a fully Bayesian treatment, the agent would maintain a posterior over the possibility of rewards and would enjoy a conventional, informational, Bayes-adaptive exploration bonus encouraging it to approach the object. However, this would add substantial computational complexity. Thus, instead, we use a simple, heuristic, exploration bonus G(t) (Kakade and Dayan, 2002). The model mouse moves from the “nest” state to the “object” state when this exploration bonus exceeds the costs implied by the risk of being attacked.

We characterize the exploration bonus as coming from an initial ‘pool’ G0 that becomes depleted when the animal is at the object, as it experiences a lack of reward, but is replenished at a steady rate f when the animal is at the nest, through forgetting or potential change. We model the animal as harvesting this exploration bonus pool more quickly under confident than cautious approaches, for instance since it can pay more attention to the object (an issue captured in more explicit detail in the context of foraging by Lloyd and Dayan (2018)). This underpins the transition between the two types of approach for non-timid animals. In simulations, when G(t) is high, the agent has a high motivation to explore the object. In other words, the depletion from G0 substantially influences the time point at which approach makes a transition from peak to steady-state; the steady-state time then depends on the dynamics of depletion (when at the object) and replenishment (when at the nest).
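Since the text does not specify functional forms for the depletion and replenishment, the following is only a sketch under simple linear assumptions; the per-step harvest rates and the relaxation of G toward its initial pool G0 are our own illustrative choices:

```python
def step_pool(G, at_object, confident, G0=10.0, f=0.1,
              harvest_confident=1.0, harvest_cautious=0.5):
    """One step of a heuristic exploration-bonus pool G(t): depleted while
    at the object (faster under confident approach), and replenished toward
    G0 at forgetting rate f while at the nest."""
    if at_object:
        G -= harvest_confident if confident else harvest_cautious
    else:
        G += f * (G0 - G)   # forgetting / potential change restores the bonus
    return max(G, 0.0)

G = 10.0
for _ in range(4):          # a four-step confident bout depletes the pool
    G = step_pool(G, at_object=True, confident=True)
print(G)                    # 6.0
for _ in range(3):          # turns at the nest partially replenish it
    G = step_pool(G, at_object=False, confident=False)
print(round(G, 3))          # 7.084
```

Under these dynamics a larger G0 sustains high-frequency approach for longer (delaying the peak to steady-state transition), while a larger f shortens the stretches spent at the nest, matching the roles the two parameters play in the model.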

Finally, the animal is also motivated to approach by informational reward from the hazard function (which can be exploited to collect more future reward) – according to a standard Bayes-adaptive bonus mechanism (Duff, 2002a).

2.2.4 Conditional Value at Risk Sensitivity

Along with varying degrees of pessimism in their prior over the hazard function, the mice could have different degrees of risk sensitivity in the aspect of the return that they seek to optimize. There are various ways in which the mice might be risk sensitive. Following Gagne and Dayan (2022), we consider a form called nested conditional value at risk (nCVaR). In general, CVaRα, for risk sensitivity 0 ≤ α ≤ 1, measures the expected value in the lower α quantile of returns – thus over-weighting the worse outcomes. The lower α, the more extreme the risk-aversion; with α = 1 being associated with the conventional, risk-neutral, expected value of the return. Section 4.2 details the optimization procedure concerned – it operates by upweighting the probabilities of outcomes with low returns – which come here from detection and expiration. Thus, when α is low, confident and longer bouts are costly, inducing shorter, cautious ones. nCVaRα affects behavior in a similar manner to pessimistic hazard priors, except that nCVaRα acts on both the aleatoric uncertainty of expiring and epistemic uncertainty of detection, while priors only affect the latter. As we will see, despite this difference, we were not able to differentiate pessimistic priors from risk sensitivity using the data in (Akiti et al., 2022).

2.2.5 Model Fitting

The output of each simulation is a sequence of states which we use to derive summary statistics that can be compared directly with our abstraction of the behavior of a mouse (as in figure 1). This requires us to model transition points in this behavior, and the times involved in each state.

In the model, the transition point from cautious to confident approach happens when the agent first ventures a confident approach; this switch is rarely reversed. Peak to steady-state transition points occur when the model mouse decreases its frequency of bouts, which tends to happen abruptly in the model. We fit the transition points in mouse data by mapping the length of a step in the model to wall-clock time. As in the abstraction of the experimental data, we average the duration (number of turns at the object) and frequency statistics in each phase. We characterize the relative frequencies of the bouts across phase transitions. Frequency mainly governs the total time at or away from the object and is formally defined as the inverse of the number of steps the model spends at the object and the nest.

We use a form of Approximate Bayesian Computation Sequential Monte Carlo (ABCSMC; Toni et al., 2009) to fit the elements of our abstraction of the approach behaviour of the mice (section 2.1), namely change points, peak and steady-state durations, as well as relative frequencies of bouts. See the Methods section 4.5 for details of the fitted statistics. At the core of ABCSMC is the ability to simulate the behaviour of model mice for given parameters. We do this by solving the underlying BAMDP problem approximately using receding-horizon tree search with a maximum depth of 5 steps (which covers the longest allowable bout, defined as a subsequence of states in which the model mouse goes from the nest to the object and back to the nest).
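The accept/reject core of this approach can be conveyed by a single-round rejection sampler (the full scheme of Toni et al. (2009) additionally anneals the tolerance over rounds and reweights particles); here a toy simulator stands in for the BAMDP tree search, and all names and numbers are illustrative:

```python
import random

def abc_rejection(simulate, observed, sample_prior, distance,
                  n_particles=50, epsilon=0.3, max_tries=100_000):
    """Minimal ABC rejection sampler: keep a parameter draw whenever its
    simulated summary statistics land within epsilon of the observed ones."""
    accepted = []
    for _ in range(max_tries):
        theta = sample_prior()
        if distance(simulate(theta), observed) < epsilon:
            accepted.append(theta)
            if len(accepted) == n_particles:
                break
    return accepted

random.seed(0)
# Toy stand-in for the BAMDP simulator: mean of 20 exponential "bout lengths".
simulate = lambda mu: sum(random.expovariate(1.0 / mu) for _ in range(20)) / 20
particles = abc_rejection(simulate, observed=3.0,
                          sample_prior=lambda: random.uniform(0.5, 6.0),
                          distance=lambda a, b: abs(a - b))
print(len(particles), sum(particles) / len(particles))
```

The accepted particles approximate a posterior over the simulator's parameter; in the model fits, the summary statistics are the change points, phase-averaged durations, and relative bout frequencies listed above.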

The full set of parameters includes 6 for the prior over the hazard function (given that we limit to four the number of time steps the model mouse can stay at the object), the risk sensitivity parameter α for CVaRα, the initial reward pool G0 and the forgetting rate f.

2.2.6 A Spectrum of Risk-Sensitive Exploration Trajectories

Fig 5 shows model fits on the 26 mice from Akiti et al. (2022). The animal ranking is sorted first by animal group, and second by total time spent near the object. We call this ranking the group-timidity animal index – it slightly differs from the timidity index used in Akiti et al. (2022) which is only based on total time spent near the object. The model captures many details of the data across the entire spectrum of courage to timidity, explaining the behavior mechanistically. Differing schedules of exploration emerge because of the battle between learning about threat and reward.

Summary of model fit. Left panels: minute-to-minute time the animals spend within 7cm of the novel object (top), duration (middle), and frequency (bottom). There are 26 animals (one per row) sorted by the group-timidity animal index (see main text Section 2.2.6). Central panels: the same values averaged over behavioral phases. Right panels: time, duration and frequency of bouts generated as sample trajectories from the individual fits of the BAMDP model. Legend: green/blue distinguishes cautious and confident bouts. The intensity of colors indicates higher values, and gray indicates zeros.

All animals initially assess risk with cautious approach, since potential predation significantly outweighs potential rewards. Brave animals assess risk with either short (length 2) or medium (length 3) bouts, depending on their hazard priors (Fig 6 a. and b. versus c. and d.). If E[h3] is high, then the animal performs cautious length 2 bouts; otherwise, it performs cautious length 3 bouts. With more bout experience, the posterior hazard function becomes more optimistic (since there is no actual predator to observe; Fig 4), empowering the animal to take on more risk by staying even longer at the object and performing confident approach. Animals with low E[h4] perform the longest, confident, length 4 bouts instead of length 3 bouts (Fig 6 a. and c. versus b. and d.). How long brave animals spend assessing risk depends on their hazard priors and their risk sensitivity nCVaRα.

The bout durations of brave animals depend on the hazard prior. a.) Brave animals that initially perform cautious-2 bouts, then confident-3 bouts. The prior mean μ3 for τ = 3 is higher than in (c.) because there is some hazard to overcome before the animal does a duration-3 bout. Blue indicates individual animals and black indicates the mean. The y-axis, E[μτ], shows μτ averaged over the ABCSMC posterior particles for each animal. b.) Cautious-2 then confident-4 animals. Since the mean μ4 prior is low, once the animal overcomes the τ = 2 hazard, it quickly transitions from duration 2 to 4. c.) Cautious-3, then confident-3 animals. These animals are fitted with a low μ3 prior and high μ4 prior because they never perform duration-4 bouts. d.) Cautious-3 then confident-4 animals. Since the μ3 prior is lower than in (b.), these animals begin with duration-3 bouts.

Fig 7 shows that the fitted hazard priors and nCVaRα relate to the group-timidity animal index. Brave animals are fitted with higher nCVaRα and a low slope and high variance (flexibility) hazard prior. In other words, the model brave mouse believes that the hazard probability for long bouts is low in its environment. Timid animals are fitted by lower nCVaRα and a higher slope, inflexible hazard prior. The parameters for intermediate animals lie between those for brave and timid animals.

a.) nCVaR versus the group-timidity animal index ranking defined in Section 2.2.6. Color indicates the animal group. More timid animals are generally fitted by a lower nCVaRα. Prior hazard parameter for t=2 (b.), t=3 (c.), and t=4 (d.) versus timidity ranking. Dots indicate the mean; the probability density is represented by color where darker means higher density regions. The t=2 prior mean is similar across all animals (timid = 0.28 ± 0.02, intermediate = 0.26 ± 0.04, brave = 0.22 ± 0.08) explaining the short, cautious bouts all animals initially use to assess risk. However, timid animals are best fit with lower variance (inflexible) and higher t=3 and t=4 prior means. This leads to shorter, cautious bouts in the long run. Brave animals are fitted by a low slope (indicated by lower mean for t=3 and t=4) and high variance (flexible) hazard prior. This allows them to perform longer bouts over time. t=4 mean is low (panel d) for brave animals that perform length 4 bouts. Like brave animals, most intermediate animals have flexible, gradual hazards up to t=3.

G0 determines how much time brave animals spend in the peak-confident exploration phase, or the peak to steady-state change point. Animals with larger G0 tend to have high bout frequencies for a longer period (see Fig 8). Finally, how often brave animals revisit the object, which is related to the relative steady-state frequency, is determined by the forgetting rate.

a.) The relationship between G0 and the peak to steady-state change point for brave animals. The best fit line is shown in black. Higher G0 means the agent explores longer, hence postponing the change point. b.) G0 versus the peak to steady-state change point for timid animals. c.) Forgetting rate versus steady-state turns at the nest state for brave animals. A higher forgetting rate leads to quicker replenishment of the exploration pool and hence fewer turns at the nest before approaching the object. d.) Forgetting rate versus turns at the nest for timid animals. All correlations are significant with p < 0.002.

Timid animals have short bouts and continue to assess risk with cautious approach in the steady-state. Fig 7 shows that their hazard priors are inflexible (low variance), with a high slope, and that they have low nCVaRα. The priors are slow to update and risk sensitivity causes timid agents to overestimate the probability of bad outcomes, leading to their prolonged cautious behavior. Hence, the reward exploration pool is depleted (i.e. the agent transitions to the steady-state phase) before the agent overcomes its priors. This particular dynamic of approach-drive and hazard function updating leads to self-censoring and neophobia. In the steady-state phase, the agent stays long periods at the nest (how long depends again on the forgetting rate). As a result, the animal (at least during the course of the experiment) never accumulates sufficient evidence to learn the safety of the object, or whether the object yields rewards. Akiti et al. (2022)’s experiment did not last long enough to answer the question of whether all animals, even the most timid ones, eventually perform confident approach. Our model predicts that they will, since the agent only accumulates negative evidence for the hazard function. However, with sufficiently low nCVaRα or pessimistic priors, this may take a very long time.

Intermediate animals, like brave animals, eventually switch to confident approach to maximize information gained about potential rewards. Similar to brave animals, the cautious to confident transition tends to be later with lower nCVaRα and steeper, less flexible priors. Intermediate animals perform both cautious and confident bouts with medium duration. This is captured by a hazard prior with smaller E[h3] and larger E[h4]. The percentage of time spent at the object is relatively constant throughout the experiment for intermediate animals. This can be explained by either large G0 or a high forgetting rate. In other words, the animal is either slow to update its belief about the potential reward at the object, or it expects the reward probability to change quickly.

Fig 5 also illustrates several limitations of the model. In particular, the duration of bouts can only increase, whereas a few animals exhibit decreasing bout duration between confident-peak and confident-steady-state phases. Furthermore, the model has trouble capturing abrupt changes in duration (from 2 turns to 4) coinciding with an animal’s transition from cautious to confident approach.

2.2.7 Risk Sensitivity versus Prior Belief Pessimism

We found that risk sensitivity and prior pessimism could not be teased apart in our model fits. This is illustrated in Fig 9. In the ABCSMC posterior distributions, nCVaRα is correlated with the mean μ2 for timid and intermediate animals, μ3 for cautious-2/confident-4 and cautious-2/confident-3 animals, and μ4 for cautious-2/confident-4 and cautious-3/confident-4 animals. In other words, lower nCVaRα (higher risk-sensitivity) can be traded off against lower (more optimistic) priors to explain the observed risk-aversion in animals.

Non-identifiability of nCVaRα against the hazard prior. Animals are labeled using the group-timidity animal index. a.) The scatter plot shows the t=2 prior mean (μ2) versus nCVaRα for ABCSMC particles of timid animal 1. The ellipse indicates one standard deviation in a Gaussian density model. Animal 1 (and timid animals generally) can be fit either with a higher nCVaRα and a higher μ2, or with a lower nCVaRα and a lower μ2. The box-and-whisker plot illustrates the correlation between μ2 and nCVaRα across all timid animals. b.) The scatter plot shows an example intermediate animal 10; the box-and-whisker plot shows μ2 versus nCVaRα for the intermediate population. c.) The scatter plot shows an example animal 11 from the group containing cautious-2/confident-4 and cautious-2/confident-3 animals. This group of animals starts with duration = 2 bouts and hence must overcome the prior μ3. The box-and-whisker plot shows μ3 versus nCVaRα for the population. d.) The scatter plot shows an example animal 25 from the group containing cautious-2/confident-4 and cautious-3/confident-4 animals. This group of animals eventually performs duration = 4 bouts and hence must overcome the prior μ4. The box-and-whisker plot shows μ4 versus nCVaRα for the population. nCVaRα and μ are correlated in the ABCSMC posterior for all animals and hence non-identifiable. p < 0.05 for all correlations.

In ablation studies (not shown), we found that it is possible to fit the full range of behavior equally well with a risk-neutral nCVaR1.0 objective, varying only the hazard priors. The only advantage of fitting both nCVaRα and hazard priors to each animal is greater diversity in the particles discovered by ABCSMC. While the model with nCVaR1.0 is simpler, one might suspect, on general grounds, that both risk sensitivity and belief pessimism affect mouse behavior – and that they would be distinguishable under other conditions.

By contrast, we found that nCVaRα alone, with the same hazard prior for all animals, is incapable of fitting the full range of animal behavior (results not shown). This can be explained by the fact that nCVaRα cannot model the different slopes of the hazard function. For example, a cautious-2/confident-3 animal must be modeled using a high value of μ4. Starting with the parameters for a cautious-2/confident-4 animal and decreasing nCVaRα will not create a cautious-2/confident-3 animal. Instead, decreasing nCVaRα will delay the cautious-to-confident transition of the cautious-2/confident-4 animal and eventually create a cautious-2 timid animal. Therefore, in our task, structured prior beliefs are required to model the detailed behavior of the animals. It is not clear, in general, in which environments one can expect nCVaRα and priors to be identifiable, given the complex interaction of these two sources of risk-sensitivity.

2.2.8 Familiar Object Novel Context

As a contrast with their main experiment, in which mice were exposed to an unfamiliar object in a novel context (UONC), Akiti et al. (2022) also looked at the consequences of exposing animals to a familiar object in a novel context (FONC), where the animals still habituate to the arena over two days but the combination of the object and arena is novel. We fit the behavior of the 9 FONC animals and, as the closest match, compared this with that of the 11 brave animals in the UONC condition. Figure 10 shows that there are 1 intermediate and 8 brave FONC animals, with the latter having exploration schedules similar to those of the bravest UONC animals. The 8 brave FONC animals have confident-peak and confident-steady-state phases, meaning their approach decreases in the steady-state, suggesting that they are reinvestigating the familiar object for reward.

Comparing the behavior of FONC and UONC conditions. There are 9 FONC and 11 UONC brave animals (one per row). Left panels: minute-to-minute time the animals spend within 7cm of the novel object (top), duration (middle), and frequency (bottom). Animals are again sorted by group-timidity animal index but split by experiment condition (UONC then FONC). Central panels: the same values averaged over behavioral phases. Right panels: time, duration and frequency of bouts generated as sample trajectories from the individual fits of the BAMDP model.

Figure 11 compares the posteriors of the ABCSMC fits of brave UONC and FONC animals. The x-axis shows the group-timidity animal index, split by experiment condition (UONC then FONC). Compared to brave UONC animals, FONC animals are fitted with higher nCVaRα and lower hazard priors (average posterior parameters across animals are significantly different according to the Kolmogorov-Smirnov test, p < 0.05). Both the hazard prior means and variances are lower for the FONC animals, indicating that these animals are more certain of the safety of the object than the UONC animals. For 3 animals, the hazard prior means are nearly zero, indicating a belief in almost certain safety. This is similar to the hazard function of a brave UONC animal at the end of the experiment. For the other 6 FONC animals, the hazard prior is high enough to warrant initial cautious bouts, suggesting that the novelty of the context has increased their beliefs about the threat level of the familiar object. However, even these animals transition to confident approach faster than the brave UONC animals, as can be seen in Figure 10. Figure 11b shows that FONC animals also have, on average, lower (p < 0.05) exploration pools than brave UONC animals. Taken together, these results show that pre-exposure to the object decreases both the animals' beliefs about potential hazards and their motivation to explore the object for reward.

ABCSMC parameter fits of the 9 FONC and 11 UONC animals (with the latter replotted from figure 7 for convenience). The x-axis shows group-timidity animal index but UONC and FONC animals are separated. a.) Average nCVaRα over posterior particles of each animal. Color indicates the animal group. Dashed lines indicate the average (across animals) values of each condition (UONC brave or FONC brave). p-values for the Kolmogorov-Smirnov test of condition differences are shown. p < 0.05 and therefore the nCVaRα values of brave FONC animals are significantly higher than those of brave UONC animals. b.) Exploration bonus pool, which is also significantly different between FONC and UONC animals. c.) Forgetting rate, which is not significantly different between the two conditions. Prior hazard parameter for t=2 (d.), t=3 (e.), and t=4 (f.). The probability density is represented by color where darker means higher density regions. Dots indicate the mean. Dashed lines indicate the average of mean values across animals while dotted lines indicate the average of standard deviation values across animals. p-values testing the difference between the two conditions’ means and standard deviations are shown on the right-hand-side and left-hand-side of the plots respectively. Brave FONC animals have both significantly lower hazard prior mean and standard deviation than brave UONC animals.

3 Discussion

We combined a Bayes adaptive Markov decision process framework with beliefs about hazards, and a conditional value at risk objective to capture many facets of an abstraction of the substantially different risk-sensitive exploration of individual animals reported by Akiti et al. (2022). In the model, behaviour reflects a battle between learning about potential threat and potential reward (neither of which actually exists). The substantial individual variability in the schedules of exploratory approach was explained by different risk sensitivities, forgetting rates, exploration bonuses and prior beliefs about an assumed hazard associated with a novel object. Neophilia arises from a form of optimism in the face of uncertainty, and neophobia from the hazard. Critically, the hazard function is generalizing (reducing the t = 2 hazard reduces the t = 4 hazard) and monotonic. The former property induces an increasing approach duration over time (Arsenian, 1943). Furthermore, the exploration bonus associated with the object regenerates, as if the subjects consider its affordance to be non-stationary (Dayan et al., 2000). This encourages even the most timid animals to continue revisiting it.

A main source of persistent timidity is a sort of path-dependent self-censoring (Dayan et al., 2020). That is, the agents could be so pessimistic about the object that they never visit it for long enough to overturn their negative beliefs. This can, in principle, arise from either excessive risk-sensitivity or overly pessimistic priors. We found that it was not possible to use the model to disentangle the extent to which these two were responsible for the behavior of the mice, since they turn out to have very similar behavioural phenotypes in this task. One key difference is that risk aversion continues to affect behaviour at the asymptote of learning; something that might be revealed by a suitable choice of a series of environments. Certainly, according to the model, forced exposure (Huys et al., 2022) would hasten convergence to the true hazard function and the transition to confident approach.

Due to the complexity of the dataset, we made several rather substantial simplifying assumptions. First, the model employs a particular set of state abstractions, for instance representing thigmotaxis as a notional "nest" (Simon et al., 1994). Second, the model only allows the frequency of approach, and not its duration, to decrease during the steady-state phase; some animals are better fit by decreasing duration. This limitation could be remedied in future models with, for example, a mechanism for boredom causing the animal to retreat when little potential reward remains at the object. Third, the probability of being detected was the same for cautious and confident approaches, which may not be true in general. Note that the agent decides the type of approach before a bout, and is incapable of switching from cautious to confident mid-bout or vice versa. This is consistent with the behavior reported in Akiti et al. (2022). Fourth, we restricted ourselves to a monotonic hazard function for the predator. It would be interesting to experiment with a non-monotonic hazard function instead, as would arise, for instance, if the agent believed that if the predator has not shown up after a long time, then there actually is no predator. Of course, a sophisticated predator would exploit the agent's inductive bias about the hazard function – by waiting until the agent's posterior distribution has settled. In more general terms, the hazard function is a first-order approximation to a complex game-theoretic battle between prey and predator, which could be modeled, for instance, using an interactive POMDP (I-POMDP; Gmytrasiewicz and Doshi, 2005). How the predator's belief about the whereabouts of the prey diminishes could also be modeled game-theoretically, leading to partial hazard resetting rather than the simplified complete resetting in our model.

Our account is model-based, with the mice assumed to be learning the statistics of the environment and engaging in prospective planning (Mobbs et al., 2020). By contrast, Akiti et al. (2022) provide a model-free account of the same data. They suggest that the mice learn the values of threat using an analogue of temporal difference learning (Sutton, 1988), and explain individual variability as differences in value initialization (Akiti et al., 2022). The initial values are generalizations from previous experiences with similar objects, and are implemented by activity of dopamine in the tail of the striatum (TS) responding to stimuli salience (Akiti et al., 2022). By contrast, our model encompasses extra features of behavior such as bout duration, frequency, and type of approach – ultimately arriving at a different mechanistic explanation of neophobia. In the context of our model, TS dopamine could still respond to the physical salience of the novel object but might then affect choices by determining the potential cost of the encountered threat (a parameter we did not explore here) or perhaps the prior on the hazard function. An analogous mechanism may set the exploration pool or the prior belief about reward - perhaps involving projections from other dopamine neurons, which have been implicated in novelty in the context of exploration bonuses (Kakade and Dayan, 2002) and information-seeking for reward (Ogasawara et al., 2022; Bromberg-Martin and Hikosaka, 2009).

As reported in Akiti et al. (2022), animals in the FONC condition, in which the object is familiar (though the context is less so), transition quickly to tail-exposed approach and therefore spend more time near the object compared to animals in the UONC condition. Akiti et al. (2022) model the FONC animals using a low initial mean threat and high initial threat uncertainty. We directly compared the behavior of FONC animals against that of the 11 brave UONC animals, showing that FONC animals make choices that are comparable to those of the bravest UONC animals. FONC behavior is fit by significantly higher nCVaRα than brave UONC behavior. It is also characterized by both lower hazard prior means and standard deviations, implying greater certainty about the object's safety. Furthermore, FONC behavior is fitted with lower exploration pools than brave UONC behavior. Taken together, we can understand the FONC animals as having lower uncertainty about both hazard and reward compared to the brave UONC animals at the start of the experiment. However, the hazard and reward uncertainties are higher than what we might expect of UONC animals at the end of the experiment, suggesting that the novel context modulates both of these uncertainties. Heterogeneity between FONC individuals in terms of nCVaRα, hazard priors, and exploration pool allows another possibility: that both hazard and reward uncertainty are restored by forgetting during the time that passed between pre-exposure and the experiment.

Our model-based account recovers several behavioral phenotypes in addition to those considered in Akiti et al. (2022). First, intermittency in our model emerges from the fact that the (possibly CVaR perturbed) hazard function increases with time spent at the object. Therefore, it is rational for the model mice to retreat to the nest when the probability of detection becomes too high and wait until (they believe) the “predator has forgotten about them”, before venturing to the object again.

Second, we offer an alternative explanation for why animals avoid after risk-assessment in a benign environment. In the model of Akiti et al. (2022), timid animals perform risk-assessment because of the delay in model-free value updating from the initial threat at the object (at timestep t = 10 in their account) to the time of decision (t = 8). In our model, avoidance arises from a rational trade-off between potential risk and reward: timid animals perform risk-assessment because of the potential reward at the object and, having found none, cease to approach because, although the potential threat is lower than at the outset, it still outweighs the even further-reduced potential reward. The same exhaustion of the exploration bonus explains why the brave animals decrease their approach during the steady-state of engagement. If the potential reward is low, there is no reason to return to the object at the initial, high rate of engagement.

Third, the temporally evolving battle between reward and threat also explains why brave animals increase their duration of approach when transitioning from risk-assessment to engagement. During confident approach, the animals harvest the exploration pool faster, at the cost of an increased probability of expiring. For brave animals, the hazard posterior decreases faster than the depletion of the exploration pool, and hence brave animals decide to save on travel costs by exploring the object longer in each bout.

Fourth, timid animals return to the object in the steady-state of “avoidance”, albeit at a lower rate than during risk-assessment. This was not considered in Akiti et al. (2022)’s account. In our model, timid animals’ steady-state approach is explained by the regenerating exploration pool. Such regeneration is natural if the animals assume that the environment is non-stationary, allowing reward structures to change and thus potentially repaying occasional returns to the object if the potential threat has become sufficiently low. Similarly, the animal may believe that threat is non-stationary. Threat forgetting may act on longer time-scales than reward forgetting in our studied environment, and is one possible explanation for the initial non-zero hazard functions of some brave animals in the FONC condition.

Finally, our model shows the multi-faceted nature of timidity during exploration. Not only do animals differ in time spent near the object but also in how quickly they transition from cautious to confident approach, and their duration and frequency of approach along their exploration schedules. These proxies for timidity are imperfectly correlated. Indeed, an animal could believe that short bouts (τ = 2) are very safe while long bouts (τ = 4) certainly lead to expiration.

Of course, agents do not need to be fully model-free or model-based. They can truncate model-based planning using model-free values at leaf nodes (Keramati et al., 2016). Furthermore, replay-like prioritized model-based updates can update a model-free policy when environmental contingencies change (Antonov and Dayan, 2023). Finally, while online BAMDP planning can be computationally expensive, a model-based agent may simply amortize planning into a model-free policy which it can reuse in similar environments or even precompile model-based strategies into an efficient model-free policy using meta-learning (Wang et al., 2017). Agents may have faced many different exploration environments with differing reward and threat trade-offs through their life-times and even over evolution that they have used to create fast, instinctive model-free policies that resemble prospective, model-based behavior (Rusu et al., 2016; Mattar and Daw, 2018). In turn, TS dopamine might reflect aspects of MF values or prediction errors that had been trained by a MB system following the precepts we outlined.

In Akiti et al. (2022), ablating TS-projecting dopamine neurons made mice "braver": they spent more time near the object, performed more tail-exposed approach, and transitioned faster to tail-exposed approach compared to controls. In Menegas et al. (2018), TS ablation affected the learning dynamics for actual, rather than predicted, threat. Both ablated and control animals initially demonstrated retreat responses to airpuffs, but only control mice maintained this response (Menegas et al., 2018). After airpuff punishment, ablated individuals surprisingly did not decrease their choices of water ports associated with airpuffs (while controls did). One possibility is that this additional exposure could have caused acclimatization to the airpuffs in the same way that brave animals in our study acclimatize to the novel object by approaching more, while timid animals fail to acclimatize because of self-censoring. Indeed, future experiments might investigate why punishment-avoidance does not occur in ablated animals and whether the same holds in risk-sensitive exploration settings (Menegas et al., 2018). In other words, would mice decrease approach after reaching the "detected" state, as expected by our model, or would they maladaptively continue the same rate of approach? Finally, while our study has focused on threat, Menegas et al. (2017) showed that TS also responds to novelty and salience in the context of rewards and neutral stimuli. That TS-ablated animals spend more, rather than less, time near the novel object suggests that the link from novelty to neophilia and exploration bonuses might not be mediated by this structure.

The behaviour of the mice in Akiti et al. (2022) somewhat resembles attachment behaviour in toddlers (Ainsworth, 1964; Bowlby, 1955), albeit with the care-giver’s trusty leg (a secure base from which to explore) replaced by thigmotaxis (or, in our case, the notional ‘nest’). Characteristic to this behaviour is an intermittent exploration strategy, with babies venturing away from the leg for a period before retreating back to its safety. Through the time course of exposure to a novel environment, toddlers progressively venture out longer and farther away, spending more time actively playing with the toys rather than passively observing them in hesitation (Arsenian, 1943). This is another example of a dynamic exploratory strategy, putatively arising again from differential updates to beliefs about threats and the rewards in the environment (Arsenian, 1943; Ainsworth, 1964).

Variability in timidity during exploration has been reported in other animal species and can be caused by differences in both prior experience and genotype. Fish from predator-dense environments tend to make more inspection approaches but stay further away, avoid dangerous areas (attack-cone avoidance), and approach in larger shoals compared to fish from predator-sparse environments (Magurran and Seghers, 1990; Dugatkin, 1988; Magurran, 1986). Dugatkin (1988) and Magurran (1986) report significant within-population differences in the inspection behavior of guppies and minnows respectively. Brown and Dreier (2002) directly manipulated the predator experience of glowlight tetras, leading to changes in inspection behavior. Similar inter- and intra-population differences in timidity have been reported in mammals. In Coss and Biardi (1997), the squirrel population sympatric with the tested predators stayed further away and spent less time facing the predator compared to the allopatric population. Furthermore, the number of inspection bouts differed between litters, between individuals within the same litter, and even between the same individuals at different times during development (Coss and Biardi, 1997). In Kemp and Kaplan (2011), marmosets differed in risk-aversion when inspecting a potential (taxidermic) predator, but risk-aversion was not stable across contexts for some individuals. FitzGibbon (1994) reports age differences in inspection behavior: adolescent gazelles inspected cheetahs more than adults or half-growns. Finally, Mazza et al. (2019) and Eccard et al. (2020) report substantial individual differences in the foraging behavior of voles in risky environments, and Lloyd and Dayan (2018) provide a somewhat general model of foraging under risk.

In conclusion, our model shows that risk-sensitive, normative, reinforcement learning can account for individual variability in exploratory schedules of animals, providing a crisp account of the competition between neophilia and neophobia that characterizes many interactions with an incompletely known world.

4 Materials and methods

4.1 BAMDP Hyperstate

A Bayes-Adaptive Markov Decision Process (BAMDP; Duff, 2002b; Guez et al., 2013) is an extension of a model-based MDP and a special case of a Partially Observable Markov Decision Process (POMDP; Kaelbling et al., 1998), in which the agent models its uncertainty about the (unchanging) transition dynamics. In a BAMDP, the agent extends its state representation into a hyperstate consisting of the original MDP state s and the belief over the transition dynamics b(T).

In our model, s is the conjunction of the "physical state" (the location of the agent, as shown in Fig 3) and the number of turns τ the agent has spent at the object so far. In the general case, T is a |S| × |A| × |S| tensor in which each element is p(s′ | s, a), where |S| and |A| are the numbers of states and actions respectively. Therefore, b(T) is a probability distribution over (possibly infinitely many) transition tensors. In our model, all transition probabilities are assumed fixed except for the hazard function probabilities. Therefore, a belief over transition tensors b(T) is a belief over hazard functions b(h). We use a noisy-or hazard function parameterized by a vector of Beta distribution parameters (Section 4.3). In totality, the belief over transition tensors b(T) is a belief over such parameter vectors.
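As an illustration, the hyperstate described above might be represented as follows. This is a minimal sketch with hypothetical field names, not the code used for the paper's simulations; the nCVaRα preference and exploration-bonus parameters (Section 4.4) are omitted:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class Hyperstate:
    """Sketch of a BAMDP hyperstate: physical state + belief over hazards."""
    location: str                             # physical state, e.g. "nest" or "object"
    tau: int                                  # turns spent at the object so far
    # belief over the hazard function: Beta pseudocounts (a_j, b_j) for j = 2, 3, 4
    beta_params: Tuple[Tuple[float, float], ...]

    def hazard_belief(self) -> Dict[int, float]:
        """Posterior-mean detection probability E[theta_j] per duration j."""
        return {j + 2: a / (a + b) for j, (a, b) in enumerate(self.beta_params)}

s = Hyperstate(location="object", tau=2,
               beta_params=((1.0, 3.0), (1.0, 2.0), (1.0, 1.0)))
print(s.hazard_belief())
```

Because the belief is carried inside the state, Bellman updates can treat the pair (s, b) just like an ordinary MDP state.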

However, to maintain generality in the next section, we derive the Bellman updates using the notation b(T).

Our hyperstate additionally contains the nCVaR static risk preference α, and the parameters of the heuristic exploration bonus (see Section 4.4).

4.2 Bellman Updates for BAMDP nCVaR

As for a conventional MDP, the nCVaR objective for a BAMDP can be solved using Bellman updates. We use Eq 4, which assumes a deterministic, state-dependent reward.

Here, s′ is the next state, b′(T) is the posterior belief over transition dynamics after observing the transition (s, a, s′), and 𝔼b(T)[T(s, a, s′)] is the expected transition probability.

Proof of Eq 4.

where ξ denotes a perturbation within the risk envelope for CVaR (Chow et al., 2015). However, ξ is only non-zero for next states s′ whose transition probability T(s, a, s′) is non-zero.

Hence we can drop the independent integration over T, and only integrate over s′.

Epistemic uncertainty about the transitions only generates risk in as much as it affects the probabilities of realizable transitions in the environment.
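To make the objective concrete, the static CVaRα of a discrete return distribution can be computed by averaging over the worst α-fraction of outcomes. This is an illustrative sketch, not the paper's planner, and the example returns are invented:

```python
import numpy as np

def cvar(values, probs, alpha):
    """CVaR_alpha of a discrete distribution: mean of the worst alpha-tail."""
    order = np.argsort(values)                     # worst outcomes first
    v = np.asarray(values, float)[order]
    p = np.asarray(probs, float)[order]
    cum = np.cumsum(p)
    # probability mass each outcome contributes inside the alpha-tail
    w = np.clip(np.minimum(cum, alpha) - np.concatenate(([0.0], cum[:-1])),
                0.0, None)
    return float(np.dot(w, v) / alpha)

returns = [-10.0, 0.0, 5.0]    # e.g. detected, nothing, reward (made-up values)
probs = [0.1, 0.6, 0.3]
print(cvar(returns, probs, alpha=1.0))   # ≈ 0.5, the risk-neutral mean
print(cvar(returns, probs, alpha=0.1))   # ≈ -10.0, the worst 10% of outcomes
```

Lowering α interpolates from the risk-neutral expectation toward the worst-case return, which is the sense in which low nCVaRα agents overweight bad outcomes.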

4.3 Noisy-Or Hazard Function

In our model, the hazard function defines a binary detection event Xτ for each number of turns the agent spends at the object τ = 2, 3, 4. The predator detects the agent when Xτ = 1. We use a noisy-or hazard function which defines Xτ as the union of Bernoulli random variables Zj ∼ Bernoulli(θj) (Eq 6) with priors θj ∼ Beta(μj, σj) for j = 2, 3, 4. Fig 12 shows the relationships between the random variables in plate notation.

Bayes-net showing the relationships between the random variables in the noisy-or model. Only Xτ is shown; Xτ+1 depends on Z1:τ+1, and so on.

Posterior inference for the noisy-or model is intractable in the general case (Jaakkola and Jordan, 1999). However, there is a closed-form solution for the posterior when the agent only makes negative observations, meaning xτ = 0 for every observation (as holds in our case, since there is no actual predator). For example, given a single observation xτ = 0,

Here we switch back to the pseudocount parameterization of the Beta distribution Beta(θ; a, b) to exploit its conjugacy.

Hence the posterior update simply increments the Beta pseudocounts for the '0' outcomes. The hazard probability is the posterior predictive distribution h(τ) = p(xτ = 1 | D), where D is a set of observations of X1, X2, …, Xτ.

Where μj = 𝔼[θj] is the expected value of the posterior on θj.

Proof of Eq 8.

where bj are the pseudocounts of negative observations after updating the Beta prior with D using Eq 7. It can be shown that h(τ) satisfies a recursion.

This recursion has two implications. First, the hazard function is monotonically increasing, since (1 − h(τ − 1)) > 0 and μτ > 0. Second, the hazard function generalizes: from Eq 9, it is clear that if h(τ − 1) increases, then h(τ) increases. It is this generalization that allows the agent progressively to spend more turns at the object.
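The posterior update and the recursive hazard computation can be sketched as follows (a minimal illustration, not the paper's code, assuming the indexing j = 2, …, τ and uniform Beta(1, 1) priors):

```python
def hazard(pseudo, tau):
    """h(tau) = 1 - prod_{j<=tau} (1 - mu_j), computed via the recursion
    h(tau) = h(tau-1) + (1 - h(tau-1)) * mu_tau."""
    h = 0.0
    for j in range(2, tau + 1):
        a, b = pseudo[j]
        mu = a / (a + b)            # posterior mean E[theta_j]
        h = h + (1.0 - h) * mu      # monotonic: h can only grow with tau
    return h

def update_negative(pseudo, tau):
    """Observing x_tau = 0 implies z_j = 0 for all j <= tau: increment b_j."""
    return {j: (a, b + 1.0) if j <= tau else (a, b)
            for j, (a, b) in pseudo.items()}

pseudo = {2: (1.0, 1.0), 3: (1.0, 1.0), 4: (1.0, 1.0)}   # uniform priors
print(hazard(pseudo, 4))             # 1 - 0.5**3 = 0.875
pseudo = update_negative(pseudo, 4)  # one safe bout of length 4
print(hazard(pseudo, 4))             # 1 - (2/3)**3 ≈ 0.704
```

The single safe bout lowers the hazard at every duration simultaneously, which is the generalization property discussed above.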

4.3.1 Transforming μ, σ to Pseudocount Parameterization of Beta Distribution

We use the mean μ and variance v = σ2 parameterization of the Beta distribution to obtain a more uniform sampling of the prior parameter space for ABCSMC fitting. We sample μ and σ from uniform distributions. However, it is more convenient to work with pseudocounts when computing the hazard posterior. Therefore, we transform μ and σ into pseudocounts a, b using the identities below. Note that v must be less than μ − μ2 to avoid negative values of a and b.
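The conversion can be written directly (a small sketch; the function name is ours):

```python
def beta_moments_to_pseudocounts(mu, v):
    """Convert the mean/variance of a Beta distribution to pseudocounts (a, b).
    Requires v < mu - mu**2, otherwise a or b would be negative."""
    assert 0.0 < mu < 1.0 and 0.0 < v < mu - mu**2
    nu = mu * (1.0 - mu) / v - 1.0   # total concentration a + b
    return mu * nu, (1.0 - mu) * nu

a, b = beta_moments_to_pseudocounts(0.5, 0.05)
print(a, b)   # Beta(2, 2) has mean 0.5 and variance 0.05
```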

4.4 Heuristic Exploration Bonus Pool

The heuristic reward function approximates the sort of exploration bonus (Gittins, 1979) that would arise from uncertainty about potential exploitable benefits of the object. It incentivizes approach and engagement. In the experiment, there is no actual reward, so the motivation is purely intrinsic (Oudeyer and Kaplan, 2007). The exploration bonus depletes as the agent learns about the object, but regenerates if the agent believes that the object can change over time (or, equivalently, if the agent forgets what it has learnt). This regenerating uncertainty can be modeled normatively using POMDPs, but is only approximated here. Since we imagine the agent as finding out more about the object through confident than through cautious approach, the former generates a greater bonus per step, but also depletes the pool more quickly.

We model the exploration-based reward as an exponentially decreasing resource. G(t) is the “exploration bonus pool” and can be interpreted as the agent’s remaining motivation to explore in the future. We fit the size of the initial exploration pool G(0) = G0 to the behavior of each animal. During planning, the agent imagines receiving rewards at the cautious and confident object states proportional to G(t).

On every turn at the cautious or confident object state, the agent extracts reward from its budget G, depleting G at rate ωcautious or ωconfident respectively. This leads to an exponential decrease of G(t) with turns spent at the object, which is clear from Eq 17. For example, at the cautious object state, the update to G(t) is,

However, a secondary factor affects the update to G(t). G linearly regenerates back to G0 at the forgetting rate f which we also fit for each animal. The full update to the reward pool for spending one turn at the cautious object state is,

Note that G(t) regenerates by f in all states, not only at the object states. We use linear forgetting for its simplicity, although other mechanisms, such as exponential forgetting, are possible.
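A minimal sketch of these pool dynamics (the parameter values here are invented for illustration; ω, f and G0 are fitted per animal):

```python
def step_pool(G, G0, f, omega=0.0):
    """One turn: deplete G at rate omega (non-zero only at an object state),
    then linearly regenerate toward G0 at forgetting rate f."""
    G = G * (1.0 - omega)          # exponential depletion while at the object
    return min(G0, G + f)          # linear regeneration, capped at G0

G0, f = 10.0, 0.05
G = G0
for _ in range(5):                 # five turns at the cautious object state
    G = step_pool(G, G0, f, omega=0.2)
print(round(G, 3))                 # ≈ 3.445
for _ in range(3):                 # three turns at the nest: regeneration only
    G = step_pool(G, G0, f)
print(round(G, 3))                 # ≈ 3.595
```

With a high forgetting rate f, the pool recovers quickly at the nest, which is how the model produces continued occasional visits even at steady-state.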

Finally, for completeness in other environments, the reward the agent imagines receiving also depends on the actual reward it has received in the past. Let n1 and n0 be the numbers of times the agent has received one or zero reward at the object state, analogous to the pseudocounts of a Beta posterior in a fully Bayesian treatment of reward; their values at t = 0 are fitted. The agent imagines receiving reward

after spending one turn in the cautious object state. A similar equation applies to the confident object state.

We define the depletion rates in terms of two constants, R = 1.1 and K = 0.89 < 1.0, with ωcautious = Kωconfident. These values were fitted to capture the full range of behavior of the 26 animals.

4.5 Data Fitting

Data fitting aims to elucidate individual differences and population patterns in behavior by searching for the model parameters that best describe the behavior of each animal. We map the behavior of model and animals to a shared abstract space using a common set of statistics and then fit the model to data using ABCSMC.

4.5.1 Animal Statistics

To extract animal statistics, we first coarse-grain behavior into phases and subsequently classify the animals into three groups: brave, intermediate, and timid (as described in the main text). This allows us to maintain the temporal dynamics of the behavior while reducing the dimension of the data. We average the approach type, duration, and frequency over each phase and fit a subset of statistics that capture the high-level temporal dynamics of behavior of animals in each group.

The behavior of brave animals comes in three phases: cautious, confident-peak, and confident-steady-state. We fit five statistics: the transition time from the cautious to the confident-peak phase tcautious-to-confident, the transition time from the confident-peak to the confident-steady-state phase tpeak-to-steady, the average durations during the cautious and confident-peak phases dcautious and dpeak-confident, and the ratio of the confident-peak and confident-steady-state phases' frequencies.

Intermediate animals exhibit only two phases: cautious and confident. We fit four statistics: the transition time from the cautious to the confident phase tcautious-to-confident, the durations of the two phases dcautious and dconfident, and the ratio of the frequencies of the cautious and confident phases. However, one limitation of the model is that frequency can only decrease, not increase, because of the dynamics of depletion and replenishment of the exploration bonus pool. Hence, we instead fit the inverse of this ratio.

Timid animals also exhibit only two phases, albeit different ones from the intermediate animals: cautious-peak and cautious-steady-state. We fit four statistics: the transition time from the cautious-peak to the cautious-steady-state phase tpeak-to-steady, the durations of the two phases dcautious-peak and dcautious-steady, and the ratio of the frequencies of the two phases.
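For reference, the per-group statistic sets above can be collected in a single structure (the string names are our shorthand, not identifiers from the paper):

```python
# The fitted statistics per animal group, as listed above, gathered into one
# structure. The string names are our shorthand, not the paper's identifiers.

FITTED_STATISTICS = {
    "brave": [
        "t_cautious_to_confident",    # cautious -> confident-peak transition
        "t_peak_to_steady",           # confident-peak -> confident-steady-state
        "d_cautious",                 # mean duration, cautious phase
        "d_peak_confident",           # mean duration, confident-peak phase
        "freq_ratio_peak_vs_steady",  # frequency ratio of the confident phases
    ],
    "intermediate": [
        "t_cautious_to_confident",    # cautious -> confident transition
        "d_cautious",
        "d_confident",
        "freq_ratio_cautious_vs_confident",
    ],
    "timid": [
        "t_peak_to_steady",           # cautious-peak -> cautious-steady-state
        "d_cautious_peak",
        "d_cautious_steady",
        "freq_ratio_peak_vs_steady",
    ],
}

print({group: len(stats) for group, stats in FITTED_STATISTICS.items()})
# {'brave': 5, 'intermediate': 4, 'timid': 4}
```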

4.5.2 Model Statistics

By design, our BAMDP agent also enjoys a notion of bouts and behavioral phases. We map the behavior of the agent to the same abstract space of duration, frequency, and transition time statistics as the animals to allow the fitting.

We consider the agent as performing a bout when it leaves the nest, stays at the object state for some turns, and finally returns to the nest. We parse bouts and behavioral phases from the overall state trajectory of the agent which, like the animals, has what we can describe as contiguous periods of cautious or confident approach and low or high approach frequency.

The transition from the cautious to the confident phase (measured in turns) occurs when the model begins visiting the confident-object state rather than the cautious-object state (for some parameter settings, this transition never happens). The transition from the peak to the steady-state phase occurs when the model starts spending more than one consecutive turn at the nest (to regenerate G), which happens when G reaches its steady-state value determined by the forgetting rate. We linearly map the agent's transition times (in units of turns) to the space of the animals' transition times (in units of minutes) using the relationship 2 turns = 1 minute. The agent is therefore simulated for 200 turns, corresponding to the 100 minutes of the experiment.
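The peak-to-steady criterion (the first time the agent spends more than one consecutive turn at the nest) can be sketched as follows, assuming the trajectory is encoded as one state label per turn; names and encoding are our illustrative assumptions:

```python
# Sketch of the peak-to-steady transition criterion, assuming the agent's
# trajectory is a list with one state label per turn ("nest" vs. anything
# else). Names and the encoding are our illustrative assumptions.

def peak_to_steady_transition(trajectory):
    """Return the transition time in minutes (2 turns = 1 minute): the start
    of the first nest stay lasting more than one consecutive turn."""
    run = 0
    for t, state in enumerate(trajectory):
        if state == "nest":
            run += 1
            if run > 1:
                return (t - 1) / 2.0  # turn index where the long stay began
        else:
            run = 0
    return None  # the agent never settled into a steady state

traj = ["nest", "object", "nest", "object", "nest", "nest", "object"]
print(peak_to_steady_transition(traj))  # 2.0 (the 2-turn nest stay starts at turn 4)
```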

Bout duration is naturally defined as the number of consecutive turns the agent spends at the object. Because the agent lives in discrete time, we map its durations (in units of turns) to the space of animal durations (in units of seconds) using a linear formula.

Hence, the agent can exhibit durations from 0.75 to 3.75 seconds, which captures a large part of the range of the animals' phase-averaged durations.

We define the momentary frequency with which the agent visits the object as the inverse of the period, that is, the number of turns between two consecutive bouts (the sum of the turns spent at the nest and object states). Frequency ratios are computed by dividing the average periods of two phases (in units of turns) and are therefore unitless, so no mapping between agent and animal frequency ratios is necessary.
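Assuming the trajectory is encoded as one state label per turn, bout durations and periods can be parsed as in this sketch (function names are ours):

```python
# Sketch of bout parsing, assuming one state label per turn; a bout is a
# maximal run of non-nest turns. Function names are ours.

def parse_bouts(trajectory):
    """Return (durations, starts): turns spent at the object in each bout,
    and the turn index at which each bout begins."""
    durations, starts = [], []
    in_bout = False
    for t, state in enumerate(trajectory):
        at_object = state != "nest"
        if at_object and not in_bout:
            starts.append(t)      # a new bout begins here
            durations.append(0)
        if at_object:
            durations[-1] += 1
        in_bout = at_object
    return durations, starts

def mean_period(starts):
    """Average number of turns between consecutive bout onsets."""
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps)

traj = ["nest", "obj", "obj", "nest", "obj", "nest", "nest", "obj"]
durations, starts = parse_bouts(traj)
print(durations, starts, mean_period(starts))  # [2, 1, 1] [1, 4, 7] 3.0
```

The momentary frequency is then the inverse of the period, and a frequency ratio between two phases is the ratio of their mean periods.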

4.5.3 Approximate Bayesian Computation

We fit each of the 26 animals from Akiti et al. (2022) separately using an Approximate Bayesian Computation Sequential Monte Carlo (ABCSMC) algorithm (Toni et al., 2009). We use an adaptive acceptance-threshold schedule that sets ϵt to the 30th percentile of the distances d(x, x0) in the previous population. We use a Gaussian transition kernel Kt(θ|θ*) = 𝒩(θ*, Σ), where the bandwidth of Σ is set using the Silverman heuristic. We use uniform priors and run ABCSMC for T = 30 populations for each animal, although most animals converge earlier. Table 1 lists the ABCSMC parameters.
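As a concrete illustration, the loop structure of such a scheme can be sketched as follows (a schematic in the style of Toni et al., 2009, not the fitting code used here; importance weights are omitted for brevity, and `simulate` and `distance` stand in for the BAMDP forward model and the normalized L1 distance):

```python
# Schematic ABC-SMC, not the fitting code used here. Importance weights are
# omitted for brevity; uniform priors on a box [prior_lo, prior_hi].

import numpy as np

def abc_smc(simulate, distance, x_obs, prior_lo, prior_hi,
            n_particles=1000, n_populations=30, quantile=0.30, seed=0):
    rng = np.random.default_rng(seed)
    dim = len(prior_lo)
    # Population 0: sample from the prior and accept everything.
    particles = rng.uniform(prior_lo, prior_hi, size=(n_particles, dim))
    dists = np.array([distance(simulate(th), x_obs) for th in particles])
    for _ in range(1, n_populations):
        # Adaptive threshold: the 30th percentile of the previous distances.
        eps = np.quantile(dists, quantile)
        # Gaussian perturbation kernel; a Silverman-style bandwidth could be
        # substituted for this simple empirical covariance.
        cov = np.atleast_2d(np.cov(particles.T)) + 1e-8 * np.eye(dim)
        new_particles, new_dists = [], []
        while len(new_particles) < n_particles:
            theta = particles[rng.integers(n_particles)]   # resample
            prop = rng.multivariate_normal(theta, cov)     # perturb
            if np.any(prop < prior_lo) or np.any(prop > prior_hi):
                continue                                   # zero prior density
            d = distance(simulate(prop), x_obs)
            if d <= eps:                                   # accept
                new_particles.append(prop)
                new_dists.append(d)
        particles, dists = np.array(new_particles), np.array(new_dists)
    return particles, dists
```

Each population tightens the acceptance threshold, so the surviving particles concentrate on parameters whose simulated statistics match the animal's.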

Table 1. ABCSMC parameters.

Given agent statistics x and animal statistics x0 in the joint space, we compute the ABC distance d(x, x0) using a normalized L1 distance function

d(x, x0) = Σi |xi − x0i| / Ci(xi)

where i indexes the statistics. Ci(xi) is a normalization constant that depends on the statistic and possibly the value xi. Normalization is necessary because the statistics have different units and value ranges.

We normalize durations using a constant Ci(xi) = 4.0 seconds. We normalize the transition times using a piece-wise linear function to prevent extremely small or large values from dominating the distance.

We also normalize the frequency ratio using a piece-wise linear function.
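A sketch of this distance computation follows. The 4.0-second duration constant is from the text, but the piecewise-linear knots below are illustrative placeholders, since the exact breakpoints are not specified here:

```python
# Sketch of the normalized L1 distance. The 4.0 s duration constant is from
# the text; the piecewise-linear knots below are illustrative placeholders,
# since the exact breakpoints are not specified here.

def piecewise_linear_norm(x, knots=((0.0, 1.0), (10.0, 10.0), (100.0, 20.0))):
    """Interpolate a normalization constant C(x) between (value, C) knots,
    clamping outside the range so extreme values cannot dominate."""
    if x <= knots[0][0]:
        return knots[0][1]
    for (xa, ca), (xb, cb) in zip(knots, knots[1:]):
        if x <= xb:
            return ca + (x - xa) / (xb - xa) * (cb - ca)
    return knots[-1][1]

def abc_distance(x, x0, norm_fns):
    """d(x, x0) = sum_i |x_i - x0_i| / C_i(x_i)."""
    return sum(abs(a - b) / C(a) for a, b, C in zip(x, x0, norm_fns))

# One duration statistic, normalized by the constant 4.0 s:
print(abc_distance([3.0], [1.0], [lambda v: 4.0]))  # |3 - 1| / 4 = 0.5
```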

Acknowledgements

We are grateful to Chris Gagne, Vikki Neville, Mike Mendl, Elizabeth S. Paul, Richard Gao, and particularly Mitsuko Watabe-Uchida for their helpful discussion and feedback. Funding was from the Max Planck Society and the Humboldt Foundation. Open access funding provided by the Max Planck Society. PD is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1, Project number 39072764, and of the Else Kröner Medical Scientist Kolleg “ClinbrAIn: Artificial Intelligence for Clinical Brain Research”. We thank the IT team of the Max Planck Institute for Biological Cybernetics for technical support.

Appendix 1

A Recovery Analysis

We performed a recovery analysis on our ABCSMC fits. The recovery targets were the best-fitting particles for each of the 26 mice. We ran ABCSMC a second time, using the same hyperparameters, to check that these targets could be recovered.

Fig. 1 compares the recovery targets against the closest particles in the posterior of the (recovery) ABCSMC fit. Each subplot shows one of the nine fitted parameters: nCVaRα, G0, the forgetting rate f, the three hazard-prior means, and the three hazard-prior deviations. In general, the ABCSMC fitting algorithm recovers the recovery targets reasonably well for all animals, with a minimum R2 value of 0.72.

Fig. 1: Recovery targets versus the closest particles in the ABCSMC posterior. Each subplot plots one of the nine fitted parameters for all 26 animals. The colors of the points indicate the animal group. The gray y = x line represents perfect recovery of the recovery targets. Most points lie close to the y = x line, suggesting that our ABCSMC fitting algorithm has good recoverability.

Fig. 2 compares the recovery targets against the (marginal) means of the ABCSMC posterior. The exploration pool G0 and forgetting rate f are well recovered. However, there is poor recoverability for nCVaRα and the prior parameters due to non-identifiability. This is further illustrated in Fig. 3 for a single brave animal. Fig. 3 plots the univariate and bivariate marginals of the ABCSMC posterior. As expected, the recovery targets lie within a narrow range of the posterior distributions for G0 and f. For nCVaRα and the prior parameters, the recovery targets are farther from the means of the posterior but still lie within a region of the posterior with support.
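The recovery quality in these comparisons can be quantified by the coefficient of determination between recovery targets and recovered values across animals; a minimal sketch (the numbers below are illustrative, not real fits):

```python
# Sketch of the recovery R^2 used in these comparisons: the coefficient of
# determination between recovery targets and recovered values across animals,
# computed per parameter. The numbers below are illustrative, not real fits.

def r_squared(targets, recovered):
    """1 - SS_res / SS_tot; negative when recovery is worse than predicting
    the mean target, as reported for nCVaR-alpha and theta2-mean."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - r) ** 2 for t, r in zip(targets, recovered))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot

targets   = [0.2, 0.5, 0.9, 1.4]      # illustrative recovery targets
recovered = [0.25, 0.45, 0.95, 1.35]  # illustrative posterior means
print(round(r_squared(targets, recovered), 3))  # 0.988
```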

Fig. 2: Identical to Fig. 1, but the recovery targets are plotted against the (marginal) means of the ABCSMC posterior. We chose the final ABCSMC population for the posterior (population 15). R2 is high for G0 and f, suggesting that these parameters are identifiable. R2 is low for nCVaRα and the hazard priors due to the non-identifiability discussed in the main text. In particular, R2 is less than 0.0 for nCVaRα and θ2-mean, suggesting that these parameters are the most confounded. However, R2 is high for θ2-deviation, suggesting that nCVaRα does not confound the flexibility of the hazard function. Finally, the R2 for θ3 is nearly zero. This is expected because timid and some intermediate animals do not exhibit duration-3 approach, and for these animals θ3 can take on arbitrarily large values.

Fig. 3: The ABCSMC posterior for animal 24. Univariate and bivariate marginals are shown on the diagonal and off-diagonal, respectively. Recovery targets are shown as green vertical lines in the univariate plots and as green points in the bivariate plots. Marginal means are shown in orange. Recovery targets and means are close for G0 and f due to their identifiability. nCVaRα and the hazard-prior parameters are non-identifiable; hence, the recovery targets are farther from the means but still lie in a region of the posterior with support.