Abstract
The safety-efficiency dilemma describes the problem of maintaining safety during efficient exploration and is a special case of the exploration-exploitation dilemma in the face of potential dangers. Conventional exploration-exploitation solutions collapse punishment and reward into a single feedback signal, whereby early losses can be overcome by later gains. However, the brain has a separate system for Pavlovian fear learning, suggesting a possible computational advantage to maintaining a specific fear memory during exploratory decision-making. In a series of simulations, we show that this promotes safe but efficient learning and is optimised by arbitrating the Pavlovian avoidance influence on instrumental decision-making according to uncertainty. We provide a basic test of this model in a simple human approach-withdrawal experiment, and show that this flexible avoidance model captures choices and reaction times. These results show that the Pavlovian fear system has a more sophisticated role in decision-making than previously thought, by shaping flexible exploratory behaviour in a computationally precise manner.
Introduction
Humans and animals inhabit a complex and dynamic world where they need to find essential rewards such as food, water and shelter, whilst avoiding a multitude of threats and dangers which can cause injury, disability or even death. This illustrates a tension at the heart of learning and decision-making systems: on the one hand, one wants to minimise the environmental interactions required to learn to acquire rewards (i.e. be sample efficient), but on the other hand, it is important not to accrue excessive damage in the process – which is particularly important if you only get one chance at life. This safety-efficiency dilemma is related to the exploration-exploitation dilemma, in which the long-term benefits of information acquisition are balanced against the short-term costs of forgoing otherwise valuable options. Most solutions to the exploration-exploitation dilemma consider things only from the point of view of a single currency of reward, and hence early losses can be overcome by later gains. Thus many engineering solutions involve transitioning from exploratory strategies to more exploitative strategies over time as an agent becomes more familiar with the environment. However, such solutions can be insufficient if some outcomes are incommensurable with others; for instance, damage may accrue to the point that it cannot be overcome or, worse still, lead to system failure (‘death’) before you ever get the chance to benefit through exploitation, emphasizing the need for safe (early) exploration. Safe learning [García and Fernández, 2015] is an emerging topic in artificial intelligence and robotics, with the advent of adaptive autonomous control systems that learn primarily from experience: for example, robots intended to explore the world without damaging or destroying themselves (or others) - the same concern animals and humans have.
A biological solution to this problem may be to have distinct systems for learning, for instance having Pavlovian reward and punishment systems in addition to an instrumental system, which can then be integrated together to make a decision [Elfwing and Seymour, 2017, Bach and Dayan, 2017]. A dissociable punishment system could then allow, for example, setting a lower bound on losses which must not be crossed during early learning. The brain seems likely to adopt a strategy like this, since we know that Pavlovian fear processes influence instrumental reward acquisition processes (e.g. in paradigms such as conditioned suppression [Kamin et al., 1963] and Pavlovian-instrumental transfer [Talmi et al., 2008, Prévost et al., 2012]). However, it is not clear whether this exists as a static system, with a constant Pavlovian influence over instrumental decisions, or a flexible system in which the Pavlovian influence is gated by information or experience. Computationally, it implies a multi-attribute architecture involving modular systems that separately learn different components of feedback (rewards, punishments), with the responses or actions to each then combined.
In this paper we ask two central questions: i) whether it is computationally (normatively) adaptive to have a flexible system that titrates the influence of ‘fear’ based on uncertainty, i.e. reduces the impact of fear after exploration, and ii) whether there is any evidence that humans use this sort of flexible meta-control strategy. We first describe a computational model of how Pavlovian (state-based) responses shape instrumental (action-based) control processes and show how this translates to a multi-attribute reinforcement learning (RL) framework at an algorithmic level. We propose that Pavlovian-instrumental transfer may be flexibly guided by an estimate of outcome uncertainty [Bach and Dolan, 2012, Dorfman and Gershman, 2019], which effectively acts as a measure of uncontrollability. We use Pearce-Hall associability [Krugel et al., 2009], which is an implementationally simple and direct measure of uncertainty that has been shown to correlate well with both fear behaviour (skin conductance) and brain (amygdala) activity in fear learning studies [Li et al., 2011, Zhang et al., 2016, 2018]. Below, we demonstrate the safety-efficiency trade-off in a range of simulation environments and show how it can be solved with a flexible Pavlovian fear bias. We then test basic experimental predictions of the model in a virtual reality-based approach-withdrawal task involving pain.
Results
Pavlovian avoidance learning (PAL) model
Our model consists of a Pavlovian punishment (fear) learning system and an integrated instrumental learning system (Fig. 1). The Pavlovian system learns punishment expectations for each stimulus/state, with the corresponding Pavlovian responses manifest as action propensities. For simplicity, we don’t include a Pavlovian reward system, or other types of Pavlovian fear response [Bolles, 1970]. The instrumental system learns combined reward and punishment expectations for each stimulus-action or state-action pair and also converts these into action propensities. Both systems learn using a basic Rescorla-Wagner updating rule. The ultimate decision that the system reaches is based on integrating these two action propensities, according to a linear weight, ω. Below we consider fixed and flexible implementations of this parameter, and test whether a flexible ω confers an advantage. The Pearce-Hall associability maintains a running average of absolute temporal difference errors (δ) as per equation 14 and the flexible ω is implemented using associability (see equation 15 in Methods). For simulations, we use standard grid-world-like environments, which provide a didactic tool for understanding Pavlovian-Instrumental interactions [Dayan et al., 2006]. Since Pavlovian biases influence not only choices but also reaction times, we extend our model to reinforcement learning diffusion decision-making (RLDDM) models [Pedersen et al., 2017, Fontanesi et al., 2019, Fengler et al., 2022].
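To make the architecture concrete, a minimal sketch of a single PAL update-and-decision step is given below (illustrative Python, simplified relative to the full equations in Methods; the array layout, parameter values and the exact placement of the ω update are expository assumptions rather than a definitive implementation).

```python
import numpy as np

def pal_step(Q, Vp, Omega, s, a, r, p, s_next, pav_mask,
             alpha=0.1, alpha_omega=0.6, kappa=6.5, gamma=0.99, tau=1.0):
    """One simplified PAL step: learn from (s, a, r, p, s_next), then
    return the arbitration weight and the softmax policy for s_next."""
    # Instrumental Q-learning update on the combined feedback R = r - p
    delta = (r - p) + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * delta

    # Pavlovian fear system: punishment-only state value (TD update)
    Vp[s] += alpha * (p + gamma * Vp[s_next] - Vp[s])

    # Pearce-Hall associability: running average of |TD error| per state
    Omega[s] += alpha_omega * alpha * (abs(delta) - Omega[s])
    omega = min(kappa * Omega[s], 1.0)  # flexible arbitration weight, clipped at 1

    # Combine propensities: instrumental values plus a Pavlovian withdrawal
    # impetus (Vp) added only to the Pavlovian action subset A_p
    rho = (1 - omega) * Q[s_next] + omega * Vp[s_next] * pav_mask[s_next]
    policy = np.exp(rho / tau) / np.sum(np.exp(rho / tau))
    return omega, policy
```

Here `Q` is a state-by-action table, `Vp` and `Omega` are per-state vectors, and `pav_mask` is a binary state-by-action mask marking the Pavlovian action subset.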
Experiment 1: A simulated flexible fear-commissioning model balances safety and efficiency
We consider a simple, fully observable grid-world environment with stochastic state transitions, a fixed starting state and a fixed rewarding goal state. Fig. 2A illustrates how the misalignment of the Pavlovian bias and the instrumental action can lead to a safety-efficiency dilemma. The Pavlovian action is assumed to be an evolutionarily acquired simple withdrawal response (Fig. 2B), and the Pavlovian state value is learned during the episode and shapes instrumental policy and value acquisition (Fig. 2C). Fig. 2C shows value plots for the instrumental policy with and without a Pavlovian bias. All plots show values and policies at the end of 1000 episodes of learning. The heatmaps denote value, i.e. the expectation of cumulative long-term rewards R (including any punishments), and the arrows show the policy, i.e. the actions that maximize this value. Additionally, the learned punishment value Vp of the Pavlovian bias is shown along with the PAL policy. The PAL value function and policy shown in Fig. 2C use the flexible ω scheme described below.
Fig. 2D plots cumulative pain accrued over multiple episodes during learning and is our measure of safety. Fig. 2E plots cumulative steps or environment interactions over episodes and is our measure of sample efficiency. Here, sampling efficiency is represented by the total number of environment interactions or samples required to reach the rewarding goal which terminates the episode. Simply, if an agent requires more samples to reinforce and acquire the rewarding goal, it is less efficient.
The simulation results with a fixed Pavlovian influence (Fig. 2D, 2E) show that adding a Pavlovian fear system to the instrumental system makes it safer, in the sense that it achieves the goal of solving the environment while accruing less cumulative pain over episodes. However, we observe that as the influence of the Pavlovian fear system increases with ω, it achieves safety at the expense of sample efficiency (within reasonable bounds, e.g. up to ω = 0.5). Under very high Pavlovian fear influence (ω = 0.9), however, the agent loses sight of the rewarding goal and performs poorly in terms of both safety and efficiency, as the episode does not terminate until it finds the goal.
However, the flexible ω policy (with αΩ = 0.6 and κ = 6.5) achieves safety almost comparable to ω = 0.5 (the safest fixed ω policy amongst ω = 0.1, 0.5, 0.9) at a much higher efficiency than ω = 0.5 and ω = 0.9, thus improving the safety-efficiency trade-off (Fig. 2F). In this way, the PAL model encourages cautious exploration early on, when uncertainty is higher, and reduces the Pavlovian bias as the uncertainty is resolved (Fig. 2F). The flexible ω value at convergence depends on the environment statistics: transition probabilities and reward/punishment magnitudes. We use a simple linear scaling of associability, clipped at 1, to arrive at the arbitrator ω (equation 15), rather than an alternative such as a sigmoid, in order to avoid tuning additional unnecessary meta-parameters (e.g. a bias shift). In this environment, the value at convergence is ω = 0.42, due to irreducible uncertainty in the state transitions (10% chance of an incorrect transition). The differences in learned instrumental value functions between PAL and a purely instrumental agent are visible in Fig. 2C, showing how the Pavlovian bias sculpts instrumental value acquisition.
In the supplementary methods, we provide additional simulations that show the robustness of these results with respect to metaparameters αΩ and κ (Appendix A.1), environments in which the reward locations vary (Appendix A.2), and other grid-world environments (Appendix A.3).
Experiment 2: Constant Pavlovian bias introduces sampling asymmetry and affects instrumental value propagation
Observing the differences in the on-policy value functions with and without the Pavlovian influence (Fig. 2C) prompted us to further tease apart the effect of a constant Pavlovian bias on sampling asymmetry, and the consequent differences in instrumental value discovery and value propagation through the states. We investigated how different fixed values of ω can lead to sampling asymmetry, which refers to exploration in which certain states are visited or sampled unevenly compared to others. We tested agents with different fixed ω in two simulated environments: (1) a T-maze and (2) a three-route task. The T-maze environment (Fig. 3A) has asymmetric rewards (R = +0.1 on the left, whereas R = +1 on the right). However, the agent has to walk through a patch of painful states to reach the larger goal on the right; even the safest path incurs a penalty of at least R = −0.5. Taking discounting into account, the goal on the right is marginally better than the one on the left, and the instrumental agent achieves both goals a nearly equal number of times (Fig. 3B). Comparing the instrumental agent with the other agents in Fig. 3C shows diminished positive (reward) value propagation towards the R = 1 goal on the right as the constant Pavlovian bias increases, showing how such sampling asymmetry can prevent value discovery of states leading to the R = 1 goal. The safety-efficiency trade-off can also be observed in Fig. 3B. This illustrates one of the main tenets of our model - that having a Pavlovian fear system ensures a separate ‘un-erasable’ fear/punishment memory which makes the agent more avoidant of punishments. This is helpful in softly ensuring an upper bound on losses, by (conservatively) forgoing decisions that result in immediate loss but are followed by much larger rewards. This is where the safety-efficiency trade-off marks a clear distinction from the exploration-exploitation trade-off, in which earlier losses can be overcome by gains later on.
The three-route task includes three routes with varying degrees of punishment (Fig. 3D), inspired by previous manipulandum tasks [Meulders et al., 2016, van Vliet et al., 2020, 2021, Glogan et al., 2021]. We observe that increasing the constant Pavlovian bias up to ω = 0.7 leads to increased safety (Fig. 3E). Beyond ω = 0.7, a high fixed Pavlovian bias may incur unnecessarily high cumulative pain and steps, as its reward value propagation is diminished (Fig. 3G) and the agent attempts to restrict itself to pain-free states (Fig. 3F) whilst searching for reward (despite stochastic transitions, which may lead to slightly more painful encounters in the long run). Comparing the cumulative state-visit plots (Fig. 3F) of the instrumental agent and an agent with a high constant Pavlovian bias (ω = 0.9), we observe that the latter shows increased sampling of the states on the longest route with no punishments. Comparing the value function plots (Fig. 3G), we observe that a high constant Pavlovian bias impairs (diffuses) the value propagation of the rewarding goal in comparison to the instrumental agent. Such high levels of constant Pavlovian bias can be a model of maladaptive anxious behaviour.
In conclusion, the simulations with this environment show that the Pavlovian fear system can assist in avoidance acquisition; however, a constant Pavlovian bias leads, depending on its magnitude, to sampling asymmetry and impaired value propagation.
Appendix A.3 includes the performance comparison of agents with a suitable flexible ω and with fixed ω values on the three-route task. Appendix A.4 shows the results of a human experiment with subjects navigating a 3-route virtual reality maze similar to the simulated one.
Experiment 3: Human approach-withdrawal conditioning is modulated by outcome uncertainty
Our first experiment showed the benefit of an outcome uncertainty-based flexible ω arbitration scheme in balancing safety and efficiency in a series of grid worlds. In this next experiment, we aimed to find basic evidence that humans employ such a flexible fear commissioning scheme. This is not intended as an exhaustive test of all predictions of the model, but to show in principle that there are situations in which a flexible, rather than fixed, Pavlovian influence provides a good fit to real behavioural data. In line with our grid-world simulations, we expected a Pavlovian bias in choices, but in addition we also expected a Pavlovian bias in reaction times.
Fig. 4 describes the trial protocol (Fig. 4A), block protocol (Fig. 4B) and experimental setup (Fig. 4C). We conducted a VR-based approach-avoidance task (28 healthy subjects; 14 female; mean age 27.96 years) inspired by previous Go-No Go task studies for isolating Pavlovian bias, especially its contributions to misbehaviour [Guitart-Masip et al., 2012, Cavanagh et al., 2013, Mkrtchian et al., 2017a,b, Dorfman and Gershman, 2019, Gershman et al., 2021]. The subjects’ goal was to make a correct approach or withdrawal decision to avoid pain, with four different cues associated with different probabilities of neutral or painful outcomes. We expected the Pavlovian misbehaviour to cause incorrect withdrawal choices for cues where the correct response would be to approach. In terms of reaction times, we expected the bias to slow down correct approach responses and speed up correct withdrawal responses. We explicitly attempted to change the outcome uncertainty or controllability, in a similar way to previous demonstrations [Dorfman and Gershman, 2019], but with controllability changing within the task. To do this, we set up two of the four cues to be uncontrollable in the first half (i.e. the outcome is painful 50% of the time regardless of the choice), but controllable in the second half (i.e. the correct choice avoids the pain 80% of the time). We anticipated that the Pavlovian bias in choices and reaction times would be modulated along with the change in uncontrollability. The virtual reality environment improves ecological validity [Parsons, 2015] and introduces gamification, which is known to also improve the reliability of studies [Sailer et al., 2017, Kucina et al., 2023, Zorowitz et al., 2023], which is important in attempts to uncover potentially subtle biases.
We observe that all subjects learn to solve the task well and better than chance (i.e. fewer than 120 shocks in 240 trials). Out of 240 trials, they receive 88.96 shocks on average (std. deviation = 12.62). We first attempted to test our hypotheses using behavioural metrics of Pavlovian withdrawal bias in choices and reaction times. However, our behavioural choice-based metrics cannot distinguish a random exploratory action from Pavlovian misbehaviour. Further, they cannot account for the effects of a non-Pavlovian baseline bias b. Thus, we did not find any statistically significant results with these noisy behavioural metrics; results and further information are provided in Appendix A.5.
We next aimed to test our hypotheses by model comparison of RL models (Fig. 5A) and RLDDM models (Fig. 5E), which guides our results below. We used hierarchical Bayesian estimation of model parameters to increase reliability across tasks. We found that the baseline action bias b, instrumental learning and the Pavlovian withdrawal bias competed for behavioural control, as observed in previous studies [Guitart-Masip et al., 2012, Cavanagh et al., 2013] (parameter distribution plots in Appendix A.6). However, unlike previous studies that have treated the Pavlovian bias as fixed, we found that a flexible Pavlovian bias better explained the behavioural data (see Fig. 5B and F).
Similar to Guitart-Masip et al. [2012], the simple Rescorla-Wagner learning (RW) model represented the base model. RW+bias includes a baseline bias b that can take any positive or negative value, a positive value denoting a baseline bias for approach and a negative value denoting a baseline bias for withdrawal. From the group-level and subject-level parameter distribution plots (Appendix A.6), we observe that this baseline bias favours approach for most subjects. This is in line with previous studies [Guitart-Masip et al., 2012, Cavanagh et al., 2013] and with our data showing a significant baseline difference in the number of approach and withdrawal actions across all subjects and trials (Appendix A.5, Fig. 10A). Note that this baseline bias is not learned, as it is with a Pavlovian bias. The RW+bias+Pavlovian(fixed) model includes a fixed Pavlovian bias and is most similar to the models of Guitart-Masip et al. [2012] and Cavanagh et al. [2013], which also used reward and punishment sensitivities for instrumental learning but did not scale the instrumental values by (1 − ω) as done in our model; our models do not have reward and punishment sensitivities. From the group-level and subject-level parameter distribution plots (Appendix A.6), we observe that the distribution of fixed ω is significantly positive and non-zero. The RW+bias+Pavlovian(flexible) model includes a flexible Pavlovian bias as per our proposed associability-based arbitration scheme. We found that the flexible ω model fits significantly better than the fixed ω model (Fig. 5B), i.e. the flexible ω model has the lowest leave-one-out information criterion (LOOIC) score amongst the models compared. By comparing incremental improvements in LOOIC, we observe that adding the baseline bias term leads to the largest improvement in model fit, followed by changing the fixed ω to a flexible ω scheme. Here, we plot LOOIC for model comparison, but Appendix A.7 includes both LOOIC and WAIC scores, showing the same result. Further, it can be seen that ω tracks associability, which decreases over the trials (Fig. 5C) (resembling Fig. 2E). Fig. 5D compares the number of approaches (normalized to 1), aggregated over all subjects and trials by cue type, between the data and the best-fitting model predictions. We observe qualitatively that the subjects learn to perform the correct actions for each cue and that the model predictions reproduce the data.
We then extend the model-fitting to also incorporate reaction times, using an RLDDM (reinforcement learning drift diffusion model) [Pedersen et al., 2017, Fontanesi et al., 2019, Desch et al., 2022]. The propensities calculated using the RL model are now used as drift rates in the DDM, and the reaction times are modelled using a Wiener distribution for a diffusion-to-bound process. Thus the drift rate is proportional to the difference in propensities between the approach and withdrawal actions. Since the Pavlovian bias is also dependent on punishment value, similar to the instrumental values, we included the Pavlovian bias and the baseline bias in the drift rate. Thus, the action propensities that drive choice selection in the RL models also drive the drift rate in our RLDDM models. We found that the RLDDM replicates the model-fitting results (Fig. 5E) and the flexible ω result (Fig. 5F). Fig. 5H shows the distribution of approach and withdrawal reaction times (RTs) aggregated over all subjects and all trials, in the data and the model predictions. The data show that withdrawal RTs are slightly faster than approach RTs (Fig. 5H), and the best-fitting model captures the withdrawal RT distribution well, but could be improved in the future to capture the approach RT distribution better.
Appendix A.5 includes behavioural results for the experiment data. Appendix A.6 includes group-level and subject-level (hierarchically fitted) model parameter distributions. Appendix A.7 includes tables of model parameters with LOOIC and WAIC values for all RL models and RLDDM models.
Discussion
In summary, this paper shows that the addition of a fear-learning system, implemented as a Pavlovian controller in a multi-attribute RL architecture, improves safe exploratory decision-making with little cost to sample efficiency. Employing a flexible arbitration scheme in which Pavlovian responses are gated by outcome uncertainty [Bach and Dolan, 2012] provides a neurally plausible approach to solving the safety-efficiency dilemma. Our experimental results support the hypothesis of such a flexible fear commissioning scheme and suggest that an inflexible Pavlovian bias can explain certain aspects of maladaptive ‘anxious’ behaviour (see Appendix A.8). This can be helpful in making novel predictions in clinical conditions, including maladaptive persistent avoidance in chronic pain, in which it may be difficult to ‘unlearn’ an injury.
Broadly, our model sits within the landscape of safe reinforcement learning (RL) [García and Fernández, 2015]. In principle, it can be viewed through the lens of constrained Markov decision processes [Altman, 1999], where the Pavlovian fear system is dedicated to keeping constraint violations at a minimum. In the realm of safe learning, there exists a dichotomy: one can either apply computer science-driven approaches to model human and animal behaviour, as seen in optimizing worst-case scenarios [Heger, 1994] and employing beta-pessimistic Q-learning [Gaskett, 2003] for modelling anxious behaviour [Zorowitz et al., 2020], or opt for neuro-inspired algorithms and demonstrate their utility in safe learning. Our model falls into the latter category: it draws inspiration from the extensive literature on Pavlovian-instrumental interactions [Mowrer, 1951, 1960, Kamin et al., 1963, Brown and Jenkins, 1968, Mackintosh, 1983, Talmi et al., 2008, Maia, 2010, Huys et al., 2012, Prévost et al., 2012], fear conditioning [LaBar et al., 1998] and punishment-specific prediction errors [Pessiglione et al., 2006, Seymour et al., 2007, 2012, Roy et al., 2014, Berg et al., 2014, Elfwing and Seymour, 2017, Watabe-Uchida and Uchida, 2018], and elucidates a safety-efficiency trade-off. Classical theories of avoidance such as two-factor theory [Mowrer, 1951], and indeed actor-critic models [Maia, 2010], intrinsically invoke Pavlovian mechanisms in control, although primarily as a teaching signal for instrumental control, as opposed to directly biasing action propensities as in our case or in Dayan et al. [2006]. Recent studies in computer science, particularly those employing policy optimization (gradient-based) reinforcement learning under CMDPs [Altman, 1999], have also observed a similar safety-efficiency trade-off [Moskovitz et al., 2023]. Additionally, the fundamental trade-off demonstrated by Fei et al. [2020] between risk-sensitivity (with exponential utility) and sample-efficiency in positive rewards aligns with our perspective on the safety-efficiency trade-off, especially when broadening our definition of safety beyond cumulative pain to include risk considerations. Safety-efficiency trade-offs may also have a close relation to maladaptive avoidance [Ball and Gunaydin, 2022], often measured in clinical anxiety, and our work provides insight into maladaptive avoidance via the heightened threat-appraisal pathway.
We illustrate that a high Pavlovian impetus is characterized by reduced sample efficiency in learning, weakened (instrumental) value propagation, rigidity in the policy, and misbehaviour due to misalignment of the bias with the instrumental action. In this way it also promotes smaller, safer short-term rewards as opposed to larger long-term rewards, echoing the idea of Pavlovian pruning of decision trees [Huys et al., 2012]. The idea that alignment between Pavlovian and instrumental actions leads to harm-avoiding safe behaviours, and that misalignment is the root of maladaptive behaviours, was proposed by Mkrtchian et al. [2017b] through a Go-No Go task with human subjects in which the threat of shock drove the Pavlovian-instrumental transfer. Recently, Yamamori et al. [2023] developed a restless bandit-based approach-avoidance task to capture anxiety-related avoidance, using the ratio of reward and punishment sensitivities as a computational measure of approach-avoidance conflict. We show in our simulations that misalignment can also lead to safe behaviours, but at the cost of efficiency. However, flexible fear commissioning alleviates the majority of Pavlovian misbehaviour while keeping the agent cautious in the face of uncertainty and catastrophe, contrasting with the ‘optimism bias’ observed in humans [Sharot, 2011]. A limitation of our work is that we do not model the endogenous modulation of pain and stress-induced analgesia, which may have the opposite effect to the proposed uncertainty-based fear commissioning scheme. A limitation of our VR experiment is that we only consider an uncertainty decrease from the first half to the second half. This was motivated by the desire to make it similar to the grid-world simulations and to help with the behavioural tests (Appendix A.5), as this keeps all of the reducible and irreducible uncertainty in the first half and none in the second half. However, a stringent test would also require a balanced case, in which the outcomes of cues 3 and 4 are more certain in the first half and more uncertain in the second half, or would differentiate uncertainty and volatility.
While our flexible ω scheme, rooted in associability, shares with Dorfman and Gershman [2019] the motivation to track uncontrollability, our approach differs. Unlike Dorfman and Gershman [2019], who employ a Bayesian arbitrator emphasizing the most useful predictor (Pavlovian or instrumental), our Pearce-Hall associability-based measure provides a direct and separate controllability assessment. This distinction allows our measure to scale effectively to complex tasks, including grid-world environments, maintaining stability throughout experiments. In contrast, the measure of Dorfman and Gershman [2019] exhibits notable variability, even when the controllability of the cue-outcome pair remains constant throughout the task. Previous fMRI studies have associated associability signals with the amygdala [Zhang et al., 2016] and pgACC [Zhang et al., 2018]. Additionally, outcome uncertainty computation could possibly be performed within the basal ganglia using scaled prediction errors [Mikhael and Bogacz, 2016, Moeller et al., 2022], and is encoded in the firing rates of orbitofrontal cortex neurons and possibly in slow ramping activity in dopaminergic midbrain neurons [Fiorillo et al., 2003, O’Neill and Schultz, 2010, Bach and Dolan, 2012]. Associability as a measure of outcome uncertainty, though very practical and useful at an implementational level, cannot distinguish between various kinds of uncertainty. Future work can help differentiate between controllability and predictability; Ligneul et al. [2022] suggest that controllability, and not predictability, may arbitrate the dominance of Pavlovian versus instrumental control. Additionally, Cavanagh et al. [2013] demonstrated that theta-band oscillatory power in the frontal cortex tracks and overrides Pavlovian biases, later suggested to be connected to inferred controllability [Gershman et al., 2021]. Notably, Kim et al. [2023] revealed that upregulation of the dorsolateral prefrontal cortex (dlPFC) through anodal transcranial direct current stimulation (tDCS) induces behavioural suppression, or changes in Pavlovian bias, in the punishment domain, implying a causal role of the dlPFC in Pavlovian-instrumental arbitration.
A natural clinical application [Fullana et al., 2020] of this model is towards mechanistic models of anxiety and chronic pain. Quite simply, both have been considered to reflect excessive Pavlovian punishment learning. In the case of anxiety disorder, this equates a strong influence of Pavlovian control with subjectively experienced anxiety symptomatology, leading to excessively defensive behaviour and avoidance of anxiogenic environments [Norton and Paulus, 2017]. In the case of chronic pain, the idea is that failure to overcome a Pavlovian incentive to avoid moving results in failure to discover that pain escape is possible (the fear-avoidance model) [Vlaeyen and Linton, 2000]. In both cases, the pathological state can be considered a failure to turn down the Pavlovian system when the environment becomes more predictable (i.e. less uncertain). This illustrates a subtle distinction between existing theories, which simply propose a constant excess of Pavlovian influence, and the possibility that these conditions might result from a deficit in the flexible commissioning of Pavlovian control. This distinction can therefore be experimentally tested in clinical studies. Furthermore, accruing evidence also indicates a role of excessive Pavlovian punishment learning in models of depression [Nord et al., 2018, Huys et al., 2016], suggesting that this may be a common mechanistic factor in comorbidity between chronic pain, anxiety and depression. Recent experiments and perspectives also suggest a psychological mechanism by which avoidance in humans can lead to the growth of anxiety (an increased belief in threats) [Urcelay, 2024]. A key distinctive prediction of our model for an intervention is that we should help patient groups not by training them to reduce the Pavlovian bias per se, but rather by attempting to make the arbitration more flexible. This could potentially be done via some sort of controllability discrimination paradigm, i.e. helping to distinguish between what is controllable and what is not - something also found in stoicism-based approaches to cognitive behavioural therapy (CBT) [Turk and Rudy, 1992, Thorn and Dixon, 2007].
In conclusion, we outline how the Pavlovian fear system provides an important and computationally precise mechanism to shape or sculpt instrumental decision-making. This role for the Pavlovian fear system extends its utility far beyond merely representing the evolutionary vestiges of a primitive defence system, as it is sometimes portrayed. This opens avenues for future research in the basic science of safe, self-preserving behaviour (including in artificial systems), and for clinical applications in mechanistic models of anxiety and chronic pain.
Materials and methods
Instrumental learning and Pavlovian fear learning
We consider a standard reinforcement learning setting in an environment containing reward and punishments (pain). In each time step t, the agent observes a state s and selects an action a according to its stochastic policy πt(s, a) (i.e., the probability of selecting action at = a in state st = s). The environment then makes a state transition from the current state s to the next state s′ and the agent receives a scalar reward R ∈ (−∞, +∞).
In the Instrumental system, we define the value of taking action a in state s under a policy π, denoted as the action-value function Qπ (s, a), as the expected return starting from s, taking the action a, and thereafter following policy π:
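In standard notation, with discount factor γ ∈ [0, 1),

$$Q^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k}\, R_{t+k+1} \;\middle|\; s_t = s,\; a_t = a\right].$$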
And the optimal action-value function is defined as Q∗(s, a) = maxπ Qπ (s, a).
In addition to the instrumental system, we define a Pavlovian fear (i.e. punishment/pain) system over and above the instrumental system, which makes it safer. The Pavlovian fear system aims to increase the impetus of pain-avoidance actions. For that we split the standard reward R into pain (or punishment) p ≥ 0:
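which, consistent with the reward-side analogue max(R, 0) mentioned below, we take to be

$$p \;=\; \max(-R,\, 0) \;\ge\; 0.$$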
We could similarly define a Pavlovian reward system trained on max(R, 0); however, that is not relevant to the questions of this study, so we focus only on the arbitration between the instrumental (state-action based) model-free system and the Pavlovian (state-based) fear system. We define the pain state-value Vp(s) of the Pavlovian fear system as follows:
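namely the expected discounted cumulative pain from state s under policy π,

$$V_p^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k}\, p_{t+k+1} \;\middle|\; s_t = s\right].$$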
The subset of actions with the Pavlovian bias, Ap, is arrived at using pretraining in the same environment with only punishments and random starting points. Vp(s) then biases this pretrained subset of actions Ap according to equation 13.
The Pavlovian fear state-value functions are updated as follows:
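using a TD(0) rule of the form

$$V_p(s) \;\leftarrow\; V_p(s) + \alpha\,\big(p + \gamma\, V_p(s') - V_p(s)\big).$$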
The instrumental value function for qualitative value plots is updated in an on-policy manner as follows (but is not used in the PAL algorithm):
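i.e. the standard on-policy TD(0) update

$$V(s) \;\leftarrow\; V(s) + \alpha\,\big(R + \gamma\, V(s') - V(s)\big).$$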
And the Instrumental action-value functions are updated as follows:
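in the standard form

$$Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\,\delta,$$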
where α is the learning rate and while using off-policy Q-learning (sarsamax) algorithm, the TD-errors are calculated as follows:
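that is,

$$\delta \;=\; R + \gamma \max_{a'} Q(s', a') \;-\; Q(s,a).$$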
The equations above are valid for a general case and are used in grid-world simulations. For model-fitting purposes for the VR Approach-Withdrawal task, there is no next state s′, thus the equations reduce to a simpler form of the Rescorla-Wagner learning rule.
Action selection
Let A be the action set. In the purely instrumental case, propensities ρ(s, a) of actions a ∈ A in state s are the advantages of taking action a in state s:
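i.e., with V(s) a state-value baseline (e.g. $\max_a Q(s,a)$; any state-dependent baseline cancels in the softmax below),

$$\rho(s,a) \;=\; Q(s,a) - V(s).$$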
Thus, using softmax action selection with a Boltzmann distribution, the stochastic policy π(a|s) (the probability of taking action a in state s) is given as follows:
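in the standard Boltzmann form,

$$\pi(a \mid s) \;=\; \frac{\exp\!\big(\rho(s,a)/\tau\big)}{\sum_{b \in A} \exp\!\big(\rho(s,b)/\tau\big)},$$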
where τ is the temperature that controls the trade-off between exploration and exploitation. For gridworld simulations, we use hyperbolic annealing of the temperature, where the temperature decreases after every episode i:
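following a hyperbolic schedule of the form

$$\tau(i) \;=\; \frac{\tau_0}{1 + \tau_k\, i}.$$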
Here τ0 is the initial temperature and τk controls the rate of annealing. This is to ensure the policy converges in large state spaces like a grid world, and follows previous studies [Elfwing and Seymour, 2017, Wang et al., 2018]. For model-fitting of the VR Approach-Withdrawal task we do not anneal the temperature but keep it as a free parameter (inverse temperature β = 1/τ) fitted to each participant; this is consistent with previous literature [Guitart-Masip et al., 2012, Cavanagh et al., 2013, Dorfman and Gershman, 2019, Gershman et al., 2021] and several other works modelling Go-No Go tasks.
In the case of a Pavlovian pain/fear system, let Ap be the subset of actions in state s which have Pavlovian pain urges or impetus associated with them. These are usually a small set of species-specific defensive reactions (SSDRs). In the VR Approach-Withdrawal task we assume (or rather propose) that it is the bias to withdraw from potentially harmful stimuli (in our case, jellyfish). For the purpose of the grid-world simulations, these can be either a hard-coded geographical controller moving away from harmful states or a pretrained value-based controller [Dayan et al., 2006]. This work does not delve into the evolutionary acquisition of these biases, but one could derive the action subset Ap from evolutionarily acquired value initializations, which may also help avoid novel stimuli and is a direction for future work.
Thus after adding a Pavlovian fear system over and above the instrumental system, the propensities for actions are modified as follows:
The same can be compactly written as mentioned in the illustration (Fig. 1):
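in a form such as

$$\rho'(s,a) \;=\; (1-\omega)\,\rho(s,a) \;+\; \omega\, V_p(s)\,\mathbb{I}\big[a \in A_p\big],$$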
where ω is the parameter responsible for Pavlovian-instrumental transfer. These equations are constructed following the preceding framework of Dayan et al. [2006], which laid out the foundation for the interplay between the Pavlovian reward system and the instrumental system. 𝕀[a] = 1 ∀ a ∈ Ap and 𝕀[a] = 0 ∀ a ∈ An = A \ Ap, following the succinct vectorised notation of Dorfman and Gershman [2019].
We refer to this algorithm as the Pavlovian Avoidance Learning (PAL) algorithm in this paper. The equations above assume only a Pavlovian fear system in addition to an instrumental system, and would vary if we added a Pavlovian reward system too. After this modification, the action selection probabilities are calculated in a similar fashion as described in equation 9.
Uncertainty based modulation of ω
We further modulate the parameter ω which is responsible for Pavlovian-instrumental transfer using perceived uncertainty in rewards. We use Pearce-Hall associability for this uncertainty estimation based on unsigned prediction errors [Krugel et al., 2009, Zhang et al., 2016, 2018]. We maintain a running average of absolute TD-errors δ (equation 7) at each state using the following update rule:
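of the form

$$\Omega(s) \;\leftarrow\; \Omega(s) + \alpha_{\Omega}\,\alpha\,\big(\lvert\delta\rvert - \Omega(s)\big),$$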
where Ω is the absolute TD-error estimator, α is the learning rate for Q(s, a) and V (s) values as mentioned earlier and αΩ ∈ [0, 1] is the scalar multiplier for the learning rate used for running average of TD-error. To obtain parameter ω ∈ [0, 1] from this absolute TD-error estimator Ω ∈ [0, ∞), we scale it linearly using scalar κ and clip it between [0, 1] as follows:
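that is,

$$\omega \;=\; \min\big(\kappa\,\Omega(s),\; 1\big).$$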
We note that the range of values Ω takes largely depends on the underlying reward function in the environment and on αΩ. Thus we choose a suitable value of κ for a given αΩ using grid search in each grid-world environment simulation, to ensure that the Pavlovian system dominates in cases of high uncertainty and that the instrumental system starts to take control as uncertainty reduces. We aim to show that this flexible ω scheme is a viable candidate for arbitration between the two systems and addresses the safety-efficiency dilemma wherever it arises. The initial associability Ω0 is set to 0 in the grid-world simulations, as there is no principled way to set it. In the case of model-fitting for the VR Approach-Withdrawal task, Ω0, κ and αΩ are set as free parameters fitted to each participant and, instead of the TD-errors, we use the Rescorla-Wagner equivalent - punishment prediction errors (PPE) without any next state s′.
Gridworld simulation details
We consider a series of painful grid-world based navigational tasks including moderate sources of pain (more variations, with catastrophic and dynamic sources of pain, are in the supplementary materials). In the grid worlds, the goal is to navigate from the starting position (in blue) to the goal position (in green) while avoiding the static moderately painful states (in red). The agent receives a positive reward of 1 for reaching the goal and pain of 0.1 for moderately painful states (red). The pain is encoded as a negative reward of −0.1 in the case of standard RL. Four actions move the agent one step north, south, east, or west (or the agent can choose not to move, allowed only in certain environments). If the agent hits a wall, it remains in its current state. All simulation environments have the following stochastic transition probabilities: 0.9 probability of the correct (desired) state transition, and 0.05 probability each of transitioning to the position perpendicular (right or left) to the action taken. We test the PAL algorithm for ω = 0.1, 0.5, 0.9 and for uncertainty-based modulation of flexible ω, and compare the performance with a standard instrumental policy (Q-learning). The following meta-parameters are fixed for all our tabular grid-world simulations: learning rate α = 0.1, discount factor γ = 0.99, and temperature annealing meta-parameters τ0 = 1, τk = 0.025. The meta-parameters αΩ and κ are tuned using grid search on the safety-efficiency trade-off metric for each environment. This is necessary, as different environments have different underlying reward distributions, leading to different distributions of TD-errors, so the running average needs to be appropriately scaled to map it to ω ∈ [0, 1]. Because of this meta-parameter tuning, the claim in the simulation experiments is a modest one: there exist values of αΩ and κ that mitigate the trade-off, as opposed to the trade-off being mitigated by every possible combination of αΩ and κ. This resembles the model-fitting procedure in the experimental task, where αΩ and κ are fit in a hierarchical Bayesian manner, suggesting that humans perform this tuning to varying degrees to the best of their ability. The Q-tables and Vp-tables are initialized with zeros. Plots are averaged over 10 runs with different seed values.
We quantify safety using the cumulative pain accrued by the agent, and sample efficiency using the cumulative steps (environment interactions or samples) taken by the agent across all the episodes in the learning process. The less cumulative pain accrued over episodes, the safer the learning; and the fewer the cumulative steps (or environment interactions), the more efficient the learning in terms of reward seeking and task completion in each episode. Furthermore, we also construct a trade-off metric to measure how well the safety-efficiency trade-off is improved. We define the safety-efficiency trade-off metric as follows; it is maximised when both cumulative pain and cumulative steps are independently minimised:
where CPn and CSn are cumulative pain and cumulative steps, normalized by dividing by the maximum cumulative pain and steps achieved (usually by fixed ω = 0 or ω = 0.9) in that run. We acknowledge that this normalization can make the metric favour improvements in either safety or efficiency unequally to an extent, as it weighs improvements in safety or efficiency relative to the worst performance in each of them. This metric could be further weight-adjusted to give more priority to either CP or CS as required, but we do not do that here. Thus this metric should only be used as a didactic tool and not as an absolute metric of performance, and one should instead draw conclusions by observing the cumulative pain accrued and steps taken over multiple episodes.
Approach-Withdrawal conditioning task: experimental design
Participant recruitment and process
30 adults participated in the experiment (15 females, 15 males; age: min = 18, max = 60, mean = 30.5, standard deviation = 12.44). Participants aged 18-60 were allowed to participate in the study. All subjects provided written informed consent for the experiment, which was approved by the local ethics board (CUREC2 R58778/RE002). One participant withdrew and did not complete the study and one participant turned out to be a fibromyalgia patient upon arrival; both were excluded. The data of the remaining 28 healthy subjects (14 female, mean age 27.96 years) were used for the analysis.
Participants filled in a short demographic form upon arrival, followed by a pain-tolerance calibration procedure, the fitting of all of the sensors, and a re-calibration of their pain tolerance before starting the practice session and the main experiment. All of this was usually completed within 2 hours, and participants were paid £30 for their participation (and were adequately compensated for any unexpected overtime and reasonable travel costs). They were free to withdraw from the experiment at any time.
Trial protocol
We used a trial-based approach-withdrawal task; however, the subjects had complete control over when to start the next trial. Each trial consisted of four events: a choice to initiate the trial, a coloured jellyfish cue, an approach or withdrawal motor response and a probabilistic outcome. The timeline is displayed in Fig. 4A. In each trial, subjects initiated the trial by bringing their hand inside a hovering bubble in front of them. A jellyfish then emerged and faded in (gradually decreasing transparency) over the next 0.5 seconds and then stayed in front of the subject for another 1 second, making the total fixation segment 1.5 seconds long. Throughout the fixation segment, the jellyfish colour remained greyish-black. After this fixation segment terminated, the jellyfish took one of the four colours with associated pain outcome contingencies. This was the stimulus phase, and the subject was required to perform either an approach or a withdrawal response within the next two seconds. The approach response involved reaching out their hand and touching the jellyfish, whereas the withdrawal response involved withdrawing the hand away from the jellyfish and towards oneself. The subjects practised these two actions in the practice session before the main experiment and were instructed to perform either of these two actions. The stimulus ended as soon as an action was successfully completed and was followed by the probabilistic outcome phase. In the rare case that the two-second time window elapsed before the subject could successfully perform either of these two actions, the action for the purpose of the probabilistic outcome segment was decided based on the hand distance from the jellyfish (i.e. whether it was closer to an approach or a withdrawal action). The possible outcomes were either a painful electric shock (along with shock animation visualisations around the jellyfish) or a neutral outcome (along with bubble animations from the jellyfish). The outcomes were presented depending on the action taken and the contingencies for each cue, as shown in Fig. 4. After the outcome segment, which lasted for 1.5 seconds, the jellyfish faded out (gradually became more transparent) for the next 0.75 seconds, and then the subject could start the next trial by again bringing their hand within the bubble in front of them.
Subjects were instructed to try to keep their hand inside the bubble during this fixation segment and only move the hand after the jellyfish changed colour. The bubble was placed halfway between the subject and the jellyfish, slightly to the right for right-handed and slightly to the left for left-handed subjects. The subjects performed the task with their dominant hand.
Block protocol
Prior to the main task and the practice session, we calibrated the intensity of the pain stimulation used for the experiment according to each individual’s pain tolerance. To do this, we started with the minimum stimulation value and gradually increased the value using a ‘staircase’ procedure. We recorded a “threshold” value (typically rated 3/10 on a Likert scale), identified when the participant first reported a pain sensation. We recorded a second “maximum” value, which the participant reported as the maximum pain sensation they would be comfortable tolerating for the complete experiment (typically rated 8/10 on the Likert scale). We then used 80% of that maximum value for stimulation throughout the experiment.
Before the main task, the subjects went through a short practice session to get acquainted with the approach and withdrawal motions and the speed requirements. Subjects had one attempt at each of the two actions with no painful outcomes and no timeouts, followed by a short practice session with two jellyfish (5 trials each, randomised) with 80% painful outcome contingencies for approach and withdrawal respectively. They were informed which of these two jellyfish liked to be touched and which did not during the practice session, but not for the main experiment. The colours of the jellyfish for the practice session were different from those used for the main experiment. The four colours of the jellyfish cues for the main experiment were chosen to be colourblind-friendly. The main experiment had a total of 240 trials, 60 trials for each of the four jellyfish, balanced across each quarter of the block, i.e. 15 trials per jellyfish per quarter block. Jellyfish 1 was the approach-to-avoid type and jellyfish 2 was the withdraw-to-avoid type throughout the 240 trials. Jellyfish 3 was uncontrollable for the first half of the block (first 120 trials) and then was the approach-to-avoid type for the rest of the block. Jellyfish 4 was uncontrollable for the first half of the block (first 120 trials) and then was the withdraw-to-avoid type for the rest of the block. Approach-to-avoid means that the outcome was neutral 80% of the time (and a shock 20% of the time) if the ‘correct’ approach action was performed, and a shock 80% of the time (and neutral 20% of the time) if the ‘incorrect’ withdrawal action was performed. Withdraw-to-avoid means the same with the correct and incorrect actions reversed. Uncontrollable means that the outcome was a shock or neutral with 50% probability each, regardless of the action performed. After each quarter of the block, the subjects were informed of their progress through the block, with a 10-second rest.
Analysis
The choices and reaction times were extracted and used for model-fitting. The EEG, EMG and skin conductance data was acquired but found to be too corrupted by movement artefacts and noise to allow reliable analysis.
Hierarchical Bayesian model-fitting of choices and reaction times
For both RL model-fitting to choices and RLDDM model-fitting to choices and reaction times, we built 4 models each: RW (i.e. Rescorla-Wagner learning rule model), RW+bias (i.e. RW model with a baseline bias), RW+bias+Pavlovian(fixed) (i.e. RW+bias model with a fixed Pavlovian withdrawal bias) and RW+bias+Pavlovian(flexible) (i.e. RW+bias model with a flexible Pavlovian withdrawal bias), similar to Guitart-Masip et al. [2012].
The action selection for RL models was performed using a softmax as per equation 9 with free parameter β = 1/τ and β > 0. The learning rule for RW models was:
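a delta-rule update over cue-action values Q(c, a), of the form

$$Q_t(c, a_t) \;=\; Q_{t-1}(c, a_t) + \alpha\,\big(R_t - Q_{t-1}(c, a_t)\big),$$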
where R = −1 in the case of an electric shock and R = 0 in the case of a neutral outcome. Punishment p is defined as per equation 2, and thus p = 1 in the case of an electric shock and p = 0 otherwise; the Pavlovian punishment value is calculated as per:
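an analogous delta rule on the punishment p,

$$V_{p,t}(c) \;=\; V_{p,t-1}(c) + \alpha\,\big(p_t - V_{p,t-1}(c)\big).$$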
α > 0 is the learning rate, fitted as a free parameter; note that here Vp is always positive. For the RW+bias model,
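the action propensities take a form such as

$$\rho_t(c, \mathrm{approach}) \;=\; Q_t(c, \mathrm{approach}) + b, \qquad \rho_t(c, \mathrm{withdraw}) \;=\; Q_t(c, \mathrm{withdraw}).$$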
Here b ∈ (−∞, +∞) is the baseline bias, which, if positive, represents a baseline approach bias and, if negative, a baseline withdrawal bias; it is not Pavlovian in nature.
For RW+bias+Pavlovian(fixed) and RW+bias+Pavlovian(flexible) models,
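the propensities additionally carry the Pavlovian withdrawal bias, taking a form such as

$$\rho_t(c, \mathrm{approach}) \;=\; (1-\omega)\,Q_t(c, \mathrm{approach}) + b, \qquad \rho_t(c, \mathrm{withdraw}) \;=\; (1-\omega)\,Q_t(c, \mathrm{withdraw}) + \omega\, V_{p,t}(c).$$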
Here ω ∈ [0, 1] is a free parameter for the RW+bias+Pavlovian(fixed) model. ω is not a free parameter for the RW+bias+Pavlovian(flexible) model, but computed as per equations 14 and 15 with free parameters Ω0 (initial associability), κ (scaling factor for ω) and αΩ (learning rate multiplier for associability).
For the RLDDM models, it is assumed that within a trial the evidence is accumulated using a drift-diffusion process with parameters drift rate (v), non-decision time (ndt), threshold and starting point. Non-decision time and threshold were kept as free parameters, and the starting point was kept constant and equal to half the threshold (making the starting point equidistant from the approach and withdrawal boundaries). The drift rate v was set according to the difference in action propensities between the choices, as follows.
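That is, up to a scaling factor,

$$v_t \;\propto\; \rho_t(c, \mathrm{approach}) - \rho_t(c, \mathrm{withdraw}).$$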
Thus, the baseline bias and the Pavlovian biases were also included in the drift rate.
For model-fitting we used a hierarchical Bayesian modelling approach, and all models were fit using Stan. They were fit using both custom code in PyStan and the hBayesDM package [Ahn et al., 2017], and the final plots of group-level and subject-level parameter distributions were generated using the plotting functions in hBayesDM. Four parallel chains were run for all models. To assess the predictive accuracy of the models, we computed the leave-one-out information criterion (LOOIC) and the Watanabe-Akaike information criterion (WAIC) [Vehtari et al., 2017].
Software and hardware setup
We used the HTC Vive Pro Eye for the virtual reality (VR) with an Alienware PC setup, and the experiment was designed in Unity (game engine). The pain stimulator was a DS5 with WASP electrodes for the VR approach-withdrawal task and silver-silver chloride (Ag/AgCl) cup electrodes for the VR maze task. We also collected galvanic skin response (GSR), heart rate (HR) and electromyography (EMG) signals, wireless EEG using a Brain Products LiveAmp, Vive tracker movement signals, and eye-tracking inside the VR headset.
The pain stimulator electrodes were attached on the ring finger, between the ring and the middle finger. The SCP sensors were attached on the middle and the index fingers and the EMG sensors were attached on the brachioradialis muscle of the active hand used in the task with the ground electrode on the elbow. Heart rate sensor was attached to the index finger of the opposite hand.
Acknowledgements
PM would like to thank Michael Browning, Rafal Bogacz, Suyi Zhang, Charlie Yan, Maryna Alves Rosa Reges, Danielle Hewitt, Katja Wiech, the anonymous COSYNE 2022 reviewers, CCN 2023 reviewers and Science Advances reviewers for their feedback on earlier draft(s) of the subsections/extended abstracts of the manuscript. PM would like to thank Simon Desch for feedback on RLDDM fitting and Danielle Hewitt for suggestions and guidance on EEG data analysis. The work was funded by Wellcome Trust (214251/Z/18/Z, 203139/Z/16/Z and 203139/A/16/Z), IITP (MSIT 2019-0-01371) and JSPS (22H04998). This research was also partly supported by the NIHR Oxford Health Biomedical Research Centre (NIHR203316). The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. For the purpose of open access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
Code and data availability
Code and data: https://github.com/PranavMahajan25/Safety-Efficiency-Trade-off.
APPENDIX
(A.1) Robustness of the associability-based ω in gridworld simulations
(A.2) Flexible ω agent better adapts to reward relocation than a fixed ω agent
(A.3) Solving the safety-efficiency trade-off in a range of grid world environments
(A.4) Human three-route virtual reality maze results
(A.5) Behavioural results from Approach-Withdrawal VR task
We observe a baseline approach bias through significant asymmetry in average number of cumulative approaches and withdrawals (Fig. 10A) and we consider the few inactive approaches and inactive withdrawals due to timeout as approaches and withdrawals respectively (Fig. 10B).
We consider a couple of model-free metrics of Pavlovian withdrawal bias prior to model fitting. The withdrawal bias metric on choices for two cues (say, cue X and cue Y) is calculated as follows:
Choice bias metric(cue X, cue Y) = % withdrawal choices on ‘cue X’ - % approach choices on ‘cue Y’ (24) and the metric for withdrawal bias in reaction times is simply the subtraction of average withdrawal times from average approach times in a half (60 trials) or the quarter block (30 trials) under consideration. This choice metric is an extension of the metric used by Dorfman and Gershman [2019] to punishment bias, and it’s logic is as follows. Consider Choice bias metric(cue2,cue1) - As Pavlovian withdrawals will increase %withdrawal choices i.e.(correct choice) for cue2 and decrease %approach choices i.e. (correct choice) for cue1. Similarly it’ll also make sense for metric(cue4,cue3) in the second half, albeit the bias would be lesser as they will be exploiting the optimal actions. It makes less sense for (cue4,cue3) in the first half as there is no optimal action however, helps act as a control and quantify a baseline approach bias. Unfortunately, this metric cannot differentiate an action due to random exploration from an action due to Pavlovian misbehaviour, leading to noisy estimates. Further, it cannot capture baseline approach bias b at all, because the model by Dorfman and Gershman [2019] does not consider this parameter, unlike Guitart-Masip et al. [2012], Cavanagh et al. [2013]. However we show that including baseline bias contributes the most to an incremental improvement in model fit.
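As a concrete illustration, here is a minimal sketch of how these two model-free metrics could be computed from trial-wise data; the array-based data format and variable names are assumptions, not the analysis code used for the paper.

```python
import numpy as np

def choice_bias_metric(choices, cues, cue_x, cue_y):
    """Eq. 24: % withdrawal choices on cue X minus % approach choices on cue Y.

    `choices` holds the strings "approach"/"withdraw" per trial and `cues`
    the cue shown on each trial, both restricted to the half- or
    quarter-block under consideration (assumed data format).
    """
    pct_withdraw_x = 100.0 * np.mean(choices[cues == cue_x] == "withdraw")
    pct_approach_y = 100.0 * np.mean(choices[cues == cue_y] == "approach")
    return pct_withdraw_x - pct_approach_y

def rt_bias_metric(rts, choices):
    """Reaction-time bias: mean approach RT minus mean withdrawal RT."""
    return np.mean(rts[choices == "approach"]) - np.mean(rts[choices == "withdraw"])
```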
We expect this bias to be largest in the first half with the uncontrollable cues, and especially in the second quarter rather than the first, by which time enough punishment value will have accrued for each cue and random exploration will have dropped substantially. The Pavlovian withdrawal bias in choices is measured using the controllable cues 1 and 2, so computing the same quantity for the uncontrollable cues 3 and 4 acts as a control (Fig. 10C). Likewise, we hypothesized a Pavlovian bias in reaction times that speeds up all withdrawals and slows down all approaches regardless of the cue (Fig. 10E). For our second hypothesis, the Pavlovian bias should decrease as outcome uncertainty decreases, i.e. it should be higher in the second quarter than in the fourth quarter (Fig. 10D & F). We compare these quarters rather than the first and second halves to minimise the noise from random exploration in the first quarter. However, the differences we observe are not statistically significant.
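One way to test such quarter-wise differences across participants is a paired non-parametric comparison of the per-subject bias metrics; the sketch below is illustrative only, and the specific test and data layout are assumptions rather than necessarily the analysis reported here.

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_quarters(bias_q2, bias_q4):
    """Paired comparison of per-subject bias metrics (assumed approach).

    `bias_q2` and `bias_q4` hold one bias value per participant, computed
    in the second and fourth quarter blocks respectively.
    """
    stat, p = wilcoxon(np.asarray(bias_q2), np.asarray(bias_q4))
    return stat, p
```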
In addition to these results from the behavioural metrics on choices and reaction times, we also observe certain change-of-mind-like patterns in the motor responses. It is unclear whether these reflect a Pavlovian bias or other factors; this can be investigated in future studies.
(A.6) Group and subject level parameter distributions of RL and RLDDM models
(A.7) RL and RLDDM model parameters and model comparison tables
Please refer to Table 1 and Table 2.
(A.8) Model predictions: Adapting fear responses in a chronic pain gridworld
(A.9) Neurobiology of Pavlovian contributions to bias avoidance behaviour
References
- Revealing neurocomputational mechanisms of reinforcement learning and decision-making with the hBayesDM package. Computational Psychiatry (Cambridge, Mass.) 1
- Constrained Markov decision processes. CRC Press
- Algorithms for survival: a comparative perspective on emotions. Nature Reviews Neuroscience 18:311–319
- Knowing how much you don't know: a neural organization of uncertainty estimates. Nature Reviews Neuroscience 13:572–586
- Measuring maladaptive avoidance: from animal models to clinical anxiety. Neuropsychopharmacology:1–9
- The dorsal raphe nucleus is integral to negative prediction errors in Pavlovian fear. European Journal of Neuroscience 40:3096–3101
- Species-specific defense reactions and avoidance learning. Psychological Review 77
- Auto-shaping of the pigeon's key-peck. Journal of the Experimental Analysis of Behavior 11:1–8
- Frontal theta overrides Pavlovian learning biases. Journal of Neuroscience 33:8541–8548
- Fear-avoidance model of chronic pain: the next generation. The Clinical Journal of Pain 28:475–483
- The misbehavior of value and the discipline of the will. Neural Networks 19:1153–1160
- Endogenous modulation of pain relief: evidence for dopaminergic but not opioidergic involvement. bioRxiv
- Controllability governs the balance between Pavlovian and instrumental action selection. Nature Communications 10:1–8
- Parallel reward and punishment control in humans and robots: Safe reinforcement learning using the MaxPain algorithm. In 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), IEEE:140–147
- Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. Advances in Neural Information Processing Systems 33:22384–22395
- Beyond drift diffusion models: Fitting a broad class of decision and RL models with HDDM. bioRxiv
- Discrete coding of reward probability and uncertainty by dopamine neurons. Science 299:1898–1902
- A reinforcement learning diffusion decision model for value-based decisions. Psychonomic Bulletin & Review 26:1099–1121
- Human fear conditioning: From neuroscience to the clinic. Behaviour Research and Therapy 124
- A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16:1437–1480
- Reinforcement learning under circumstances beyond its control
- Smart exploration in reinforcement learning using absolute temporal difference errors. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems:1037–1044
- Neural signatures of arbitration between Pavlovian and instrumental action selection. PLoS Computational Biology 17
- When do we not face our fears? Investigating the boundary conditions of costly pain-related avoidance generalization. The Journal of Pain
- Go and no-go learning in reward and punishment: interactions between affect and effect. NeuroImage 62:154–166
- Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994, Elsevier:105–111
- Bonsai trees in your head: how the Pavlovian system sculpts goal-directed choices by pruning decision trees. PLoS Computational Biology 8
- The specificity of Pavlovian regulation is associated with recovery from depression. Psychological Medicine 46:1027–1035
- Conditioned suppression as a monitor of fear of the CS in the course of avoidance training. Journal of Comparative and Physiological Psychology 56
- Causal role of the dorsolateral prefrontal cortex in modulating the balance between Pavlovian and instrumental systems in the punishment domain. PLoS ONE 18
- Genetic variation in dopaminergic neuromodulation influences the ability to rapidly and flexibly adapt decisions. Proceedings of the National Academy of Sciences 106:17951–17956
- Calibration of cognitive tests to address the reliability paradox for decision-conflict tasks. Nature Communications 14
- Human amygdala activation during conditioned fear acquisition and extinction: a mixed-trial fMRI study. Neuron 20:937–945
- Reducing shock imminence eliminates poor avoidance in rats. Learning & Memory 27:270–274
- Differential roles of human striatum and amygdala in associative learning. Nature Neuroscience 14:1250–1252
- Stress-sensitive inference of task controllability. Nature Human Behaviour 6:812–822
- Conditioning and associative learning. Clarendon Press, Oxford
- Two-factor theory, the actor-critic model, and conditioned avoidance. Learning & Behavior 38:50–67
- Acquisition and extinction of operant pain-related avoidance behavior using a 3 degrees-of-freedom robotic arm. Pain 157:1094–1104
- The acquisition of fear of movement-related pain and associative learning: a novel pain-relevant human fear conditioning paradigm. Pain 152:2460–2469
- Learning reward uncertainty in the basal ganglia. PLoS Computational Biology 12
- Modeling avoidance in mood and anxiety disorders using reinforcement learning. Biological Psychiatry 82:532–539
- Threat of shock and aversive inhibition: Induced anxiety modulates Pavlovian-instrumental interactions. Journal of Experimental Psychology: General 146
- Uncertainty-guided learning with scaled prediction errors in the basal ganglia. PLoS Computational Biology 18
- ReLOAD: Reinforcement learning with optimistic ascent-descent for last-iterate convergence in constrained MDPs. In International Conference on Machine Learning, PMLR:25303–25336
- Learning theory and behavior
- Two-factor learning theory: summary and comment. Psychological Review 58
- Depression is associated with enhanced aversive Pavlovian control over instrumental behaviour. Scientific Reports 8:1–10
- Transdiagnostic models of anxiety disorder: Theoretical and empirical underpinnings. Clinical Psychology Review 56:122–137
- Coding of reward risk by orbitofrontal neurons is mostly distinct from coding of reward value. Neuron 68:789–800
- Virtual reality for enhanced ecological validity and experimental control in the clinical, affective and social neurosciences. Frontiers in Human Neuroscience 9
- The drift diffusion model as the choice rule in reinforcement learning. Psychonomic Bulletin & Review 24:1234–1251
- Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature 442:1042–1045
- Neural correlates of specific and general Pavlovian-to-instrumental transfer within human amygdalar subregions: a high-resolution fMRI study. Journal of Neuroscience 32:8383–8390
- Representation of aversive prediction errors in the human periaqueductal gray. Nature Neuroscience 17:1607–1612
- How gamification motivates: An experimental study of the effects of specific game design elements on psychological need satisfaction. Computers in Human Behavior 69:371–380
- Differential encoding of losses and gains in the human striatum. Journal of Neuroscience 27:4826–4831
- Serotonin selectively modulates reward value in human decision-making. Journal of Neuroscience 32:5833–5842
- The optimism bias. Current Biology 21:R941–R945
- Human Pavlovian–instrumental transfer. Journal of Neuroscience 28:360–368
- Coping with chronic pain: A stress-appraisal coping model. In Coping with Chronic Illness and Disability: Theoretical, Empirical, and Clinical Aspects, Springer:313–335
- Cognitive factors and persistent pain: A glimpse into Pandora's box. Cognitive Therapy and Research 16:99–122
- A psychological mechanism for the growth of anxiety
- Changes in pain-related fear and pain when avoidance behavior is no longer effective. The Journal of Pain 21:494–505
- Avoidance behaviour performed in the context of a novel, ambiguous movement increases threat and pain-related fear. Pain 162:875–885
- Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27:1413–1432
- Fear-avoidance and its consequences in chronic musculoskeletal pain: a state of the art. Pain 85:317–332
- Deep reinforcement learning by parallelizing reward and punishment using the MaxPain architecture. In 2018 Joint IEEE 8th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), IEEE:175–180
- Multiple dopamine systems: weal and woe of dopamine. In Cold Spring Harbor Symposia on Quantitative Biology, Cold Spring Harbor Laboratory Press:83–95
- Approach-avoidance reinforcement learning as a translational and computational model of anxiety-related avoidance. bioRxiv:2023–4
- Dissociable learning processes underlie human pain conditioning. Current Biology 26:52–58
- The control of tonic pain by active relief learning. eLife 7
- Improving the reliability of the Pavlovian go/no-go task
- Anxiety, avoidance, and sequential evaluation. Computational Psychiatry 4:1–17
Copyright
© 2024, Mahajan et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.