Balancing safety and efficiency in human decision-making

  1. Pranav Mahajan (corresponding author)
  2. Shuangyi Tong
  3. Sang Wan Lee
  4. Ben Seymour (corresponding author)
  1. Wellcome Centre for Integrative Neuroimaging, FMRIB, Nuffield Department of Clinical Neurosciences, University of Oxford, United Kingdom
  2. Institute of Biomedical Engineering, University of Oxford, United Kingdom
  3. Department of Brain and Cognitive Sciences, Korea Advanced Institute of Science and Technology (KAIST), Republic of Korea
  4. Kim Jaechul Graduate School of AI, KAIST, Republic of Korea
  5. KAIST Center for Neuroscience-inspired Artificial Intelligence, Republic of Korea
16 figures, 2 tables and 1 additional file

Figures

An illustration of the Pavlovian avoidance learning (PAL) model.

Pavlovian and instrumental valuations are combined to arrive at action propensities used for (softmax) action selection. The Pavlovian bias influences protective behaviours through safer (Boltzmann) exploration, and the arbitration between the Pavlovian and instrumental systems is performed using the parameter ω. Here, R denotes the feedback signal, which can take both positive values (in the case of rewards) and negative values (in the case of punishments). Please see Materials and methods for technical details; notation for the illustration follows Dorfman and Gershman, 2019.
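For concreteness, a minimal sketch of this arbitration step is given below. It assumes a simple additive combination in which the Pavlovian punishment value Vp(s) boosts the propensities of the protective (withdrawal) action subset Ap, weighted by ω; the variable names and exact functional form are illustrative, not the authors' implementation (see Materials and methods for that).

```python
import numpy as np

def action_propensities(q_inst, v_pav, withdrawal_mask, omega):
    """Combine instrumental and Pavlovian valuations into action propensities.

    q_inst          : instrumental Q-values for each action in the current state
    v_pav           : Pavlovian punishment value Vp(s) of the current state (<= 0)
    withdrawal_mask : boolean array marking the protective action subset Ap
    omega           : arbitration weight between the two systems, in [0, 1]
    """
    # The Pavlovian system adds its (negative) state value as a bias favouring
    # withdrawal: the more feared the state, the larger the protective boost.
    return (1.0 - omega) * q_inst + omega * (-v_pav) * withdrawal_mask

def boltzmann_policy(propensities, beta):
    """Softmax (Boltzmann) action selection with inverse temperature beta."""
    z = beta * (propensities - np.max(propensities))
    p = np.exp(z)
    return p / p.sum()
```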

Demonstration of safety-efficiency trade-off and the flexible arbitration scheme in a grid world environment.

(A) Grid world environment with the starting state in the top-right corner, the rewarding goal state (R=+1) in the bottom-left corner, and painful red states (R=−0.1). The grid world layout follows Gehring and Precup, 2013. The inset provides a didactic example of misalignment between the Pavlovian bias and the instrumental action. (B) Stochastic policy of the pre-trained withdrawal action subset Ap, which is biased with Pavlovian punishment values in the Pavlovian avoidance learning (PAL) agent. (C) Instrumental values and Pavlovian fear bias Vp (heatmaps) and policies (arrows) learned by the instrumental and flexible ω agents by the end of the learning duration. The value functions plotted are computed in an on-policy manner. (D) Cumulative pain accrued by fixed and flexible ω agents whilst learning over 1000 episodes, as a measure of safety, averaged over 10 runs. (E) Cumulative steps required to reach the fixed goal by fixed and flexible ω agents whilst learning over 1000 episodes, as a measure of sample efficiency, averaged over 10 runs. (F) Plot of the flexibly modulated ω arbitration parameter over the learning duration, averaged over 10 runs. This shows a transition from a higher Pavlovian bias to a more instrumental agent over episodes as learning about the environment reduces uncertainty. (G) Comparison of different agents using a trade-off metric, intended only for didactic purposes (Equation 16; see Materials and methods for details).
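The flexible arbitration in panel F is driven by an uncertainty (associability) signal governed by the meta-parameters κ and αΩ (see Appendix 1—figure 1). As a hedged sketch, one plausible Pearce-Hall-style update is shown below; the clipping to [0, 1] and the exact mapping from associability to ω are assumptions for illustration, not the paper's precise equations.

```python
def update_flexible_omega(assoc, pav_prediction_error, alpha_omega, kappa):
    """Track punishment-related uncertainty and map it onto the arbitration weight.

    assoc                : running associability estimate (Omega)
    pav_prediction_error : Pavlovian prediction error on the current step
    alpha_omega          : learning rate of the associability update
    kappa                : scaling of associability onto omega
    """
    # Pearce-Hall-style update: associability tracks the magnitude of recent errors.
    assoc = (1.0 - alpha_omega) * assoc + alpha_omega * abs(pav_prediction_error)
    # Higher uncertainty -> stronger Pavlovian influence; capped at 1.
    omega = min(1.0, kappa * assoc)
    return assoc, omega
```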

Demonstration of sampling asymmetry due to constant Pavlovian bias.

(A) T-maze grid world environment with annotated rewards and punishments. (B) Proportion of trials in which the rewarding goal was chosen by the agent. (C) Value function plots for ω=0, 0.1, 0.5, 0.9 show diminished value propagation from the reward on the right. (D) Grid world environment with three routes of varying pain. (E) Cumulative steps required to reach the goal vs cumulative pain accrued by fixed ω agents ranging from ω=0 to ω=0.9. (F) State visit count plots for ω=0, 0.5, 0.9, i.e., the instrumental and constant Pavlovian bias agents. (G) Value function plots for ω=0, 0.5, 0.9.

An illustration of the VR Approach-Withdrawal task and trial and block protocols.

(A) Trial protocol: the participant is expected to take either an approach action (touch the jellyfish) or a withdrawal action (withdraw the hand towards oneself) within 2.5 s of the jellyfish changing colour. The participant was then requested to bring their hand to the centre of a bubble located halfway between the participant and the jellyfish to initiate the next trial, in which a new jellyfish would emerge (video). (B) Block protocol: the first half of the trials had two uncontrollable cues and two controllable cues, and the second half had all controllable cues with the aforementioned contingencies. The 240 main-experiment trials were preceded by 10 practice trials, which do not count towards the results. (C) Illustration of the experimental setup. VR: virtual reality; WASP: surface electrode for electrodermal stimulation; DS-5: constant current stimulator; GSR: galvanic skin response sensors; HR: heart rate sensor; EMG: electromyography sensors; EEG: electroencephalogram electrodes; LiveAmp: wireless amplifier for mobile EEG.
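To summarise the protocol compactly, the key task parameters stated in the caption can be collected as below; the dictionary keys are purely descriptive and are not taken from the experiment code.

```python
# Task structure as described in the caption; labels are illustrative.
APPROACH_WITHDRAWAL_TASK = {
    "response_window_s": 2.5,        # time to approach or withdraw after the colour change
    "actions": ("approach", "withdraw"),
    "practice_trials": 10,           # excluded from the results
    "main_trials": 240,
    "first_half_cues": "2 uncontrollable + 2 controllable",
    "second_half_cues": "all controllable",
}
```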

RL and RLDDM model fitting results on VR Approach-Withdrawal task.

The left panels show model fit results for choices using reinforcement learning (RL) models. The right panels show model fit results for choices and reaction times (RTs) using reinforcement learning diffusion decision-making (RLDDM) models. (A) Simplified RL model from Figure 1 for the approach-withdrawal task. (B) Model comparison shows that the model with flexible Pavlovian bias fits choices best in terms of the leave-one-out information criterion (LOOIC). (C) Flexible ω from the RL model over 240 trials for 28 participants. (D) Number of approaches aggregated over all subjects and all trials in the data and in predictions by the RL model with flexible ω (normalised to 1). (E) Simplified illustration of the RLDDM for the approach-withdrawal task, where the baseline bias b and the Pavlovian bias ωVp(s) are also included in the drift rate. (The base figure is reproduced from Figure 8 of Desch et al., 2022, with modifications.) (F) Model comparison shows that the model with flexible Pavlovian bias fits choices and RTs best in terms of LOOIC. (G) Flexible ω from the RLDDM over 240 trials for 28 participants. (H) Distribution of approach and withdrawal RTs aggregated over all subjects and trials in the data and in predictions by the RLDDM with flexible ω. The bump in RTs at 2.5 s is due to timeouts (inactive approaches and withdrawals; please see Appendix 1—figure 5).
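Panel E describes the drift rate as carrying both the baseline bias b and the Pavlovian term ωVp(s) alongside the instrumental value difference. A hedged sketch of one such additive drift rate is below; the value-difference scaling and sign conventions are assumptions, not the fitted model's exact parameterisation.

```python
def rlddm_drift_rate(q_approach, q_withdraw, v_pav, omega, b, value_scaling):
    """Trial-wise drift rate toward the 'approach' boundary of the diffusion model.

    q_approach, q_withdraw : instrumental values of the two actions for this cue
    v_pav                  : Pavlovian punishment value Vp(s) of the cue (<= 0)
    omega                  : Pavlovian-instrumental arbitration weight
    b                      : baseline (approach) bias in the drift
    value_scaling          : scaling of the instrumental value difference
    """
    # A more negative Pavlovian value pulls the drift toward the withdrawal boundary.
    return value_scaling * (q_approach - q_withdraw) + b + omega * v_pav
```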

Appendix 1—figure 1
Robustness of PAL in gridworld simulation.

This figure shows the robustness of the grid search for tuning the meta-parameters of the associability-based ω in grid world simulations. We show that the results hold for a range of values close to the chosen meta-parameters. (A–C) Grid search results for the environment in Figure 2 for varying κ and αΩ. (D–G) Results for another set of meta-parameters.
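The grid search itself is conceptually simple; a sketch is given below, where `run_pal_gridworld` is a hypothetical helper standing in for a full PAL training run, and the grids of κ and αΩ values are placeholders rather than the exact grids searched in the appendix.

```python
import itertools

# Placeholder meta-parameter grids; the exact values searched are in Appendix 1.
kappas = (1.0, 2.0, 3.0, 5.0)
alpha_omegas = (0.001, 0.005, 0.01, 0.05)

results = {}
for kappa, alpha_omega in itertools.product(kappas, alpha_omegas):
    # Hypothetical helper: trains a flexible-omega PAL agent for 1000 episodes
    # and returns (cumulative pain, cumulative steps) as safety/efficiency measures.
    pain, steps = run_pal_gridworld(kappa=kappa, alpha_omega=alpha_omega)
    results[(kappa, alpha_omega)] = (pain, steps)
```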

Appendix 1—figure 2
Additional reward relocation experiments with PAL.

This figure shows cumulative state visit plots and value function plots of the flexible ω and fixed ω agents at the end of 1000 episodes when the reward goal is relocated from the bottom-left corner (Figure 2) to the bottom-right corner on episode 500. Comparing the state visit plots (A, B) and the value function plots (C, D), we observe that persistent Pavlovian influence leads to persistent rigidity, whereas the flexible fear commissioning scheme is able to locate the relocated goal efficiently. We also observe that, unlike flexible ω, constant ω=0.5 leads to diminished propagation of the rewarding value (C, D).

Appendix 1—figure 3
Flexible arbitration solves safety-efficiency trade-off in a range of gridworlds.

In this figure, we show the performance of fixed ω=0.1, 0.5, 0.9 and flexible ω agents on a range of grid world environments, namely (A) the three-route environment from Figure 3, (D) an environment with a moving predator on a routine path, and (G) the wall maze grid world from Elfwing and Seymour, 2017. Colliding with the predator results in a negative reward of −1 and catastrophic death (the episode terminates). Colliding with the walls, in contrast, results in moderate pain of 0.1 and leaves the agent’s state unchanged. The latter two are completely deterministic environments, unlike the previous environments in the main paper. (B, C, E, F, H and I) show that the safety-efficiency trade-off arises in these three environments as well, and there is a separate optimal fixed ω for each environment. In each case, however, there exists a flexible ω scheme that can solve the trade-off, suggesting that the brain may be calibrating ω flexibly.

Appendix 1—figure 4
Additional results on human three-route VR maze task.

(A) Top view of the virtual reality (VR) maze with painful regions annotated by highlighted borders. (B) Cumulative steps required to reach the goal vs cumulative pain acquired by participants over 20 episodes in the VR maze task. In this figure, we show the results of a VR maze replicating the three-route grid world environment from the simulation results; however, it had fewer states, and the participants were instructed to reach the goal, which was visible to them as a black cube with ‘GOAL’ written on it. In order to move inside the maze, participants had to physically rotate in the direction they wanted to move and then press a button on the joystick to move forward in the virtual space. Thus, the participant did not actually walk in the physical space but did rotate up to 360 degrees in physical space. The painful regions were not known to the participants, but they were aware that some regions of the maze might give them painful shocks with some unknown probability. Walking over the painful states in the VR maze, demarcated by grid borders (see A), delivered a shock with 75% probability, while ensuring a 2 s pain-free interval between two consecutive shocks. Shocks were not delivered with 100% probability, as that would have been too painful for participants due to the temporal summation effects of pain. The participants completed 20 episodes, were aware of this before starting the task, and were free to withdraw from the experiment at any point. 16 participants (11 female, average age 30.25 years) were recruited and were compensated adequately for their time. Pain tolerance was acquired in the same way as in the approach-withdrawal task. For (B), all participant trajectories inside the maze were discretised into an 8×9 (horizontal × vertical) grid. Entering a 1×1 grid section incremented the cumulative steps (CS) count, and each shock received incremented the cumulative pain (CP) count. CP and CS over the 20 episodes were plotted against each other to observe the trade-off. A limitation of this experiment is that it reflects the constraints of the grid world, and future experiments are necessary to show the trade-off in a range of environments.
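For panel B, the caption describes discretising each trajectory into an 8×9 grid, incrementing cumulative steps (CS) on each cell entry and cumulative pain (CP) on each shock. A minimal sketch of that bookkeeping is below; the array names and bin edges are illustrative assumptions, not the published analysis code.

```python
import numpy as np

def cumulative_counts(xy, shock_count, x_edges, y_edges):
    """Discretise a continuous VR trajectory and count steps and pain.

    xy          : (T, 2) array of participant positions over one episode
    shock_count : number of shocks delivered in that episode
    x_edges     : bin edges defining the 8 horizontal cells
    y_edges     : bin edges defining the 9 vertical cells
    """
    cols = np.digitize(xy[:, 0], x_edges)
    rows = np.digitize(xy[:, 1], y_edges)
    cells = list(zip(rows, cols))
    # Cumulative steps: every entry into a new 1x1 grid cell counts once.
    cs = sum(prev != cur for prev, cur in zip(cells[:-1], cells[1:]))
    # Cumulative pain: one increment per shock received.
    cp = shock_count
    return cs, cp
```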

Appendix 1—figure 5
Behavioural results from Approach-Withdrawal VR task.

(A) Asymmetry in average approach and withdrawal responses over all subjects, showing a baseline approach bias (Mann-Whitney U test: statistic = 597.0, p = 0.0004). (B) Incomplete approaches and incomplete withdrawals that were counted as approaches and withdrawals, respectively. (C) Withdrawal bias in choices in the first half with uncontrollable cues (Mann-Whitney U test: statistic = 492.5, p = 0.0504). (D) Decrease in withdrawal bias in choices with the decrease in uncontrollability (Mann-Whitney U test: statistic = 350.5, p = 0.7611). (E) Withdrawal bias in reaction times in the first half with uncontrollable cues (Mann-Whitney U test: statistic = 475.0, p = 0.0882). (F) Decrease in withdrawal bias in reaction times with the decrease in uncontrollability (Mann-Whitney U test: statistic = 456.0, p = 0.1490). (G) Change-of-mind trials observed in motor data.
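The pairwise comparisons above are plain Mann-Whitney U tests; for reference, the sketch below shows how such a test is typically run in SciPy, using placeholder data in place of the actual per-participant response rates.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Placeholder per-participant rates purely for illustration; the reported
# statistics come from the real task data of the 28 participants.
approach_rates = rng.uniform(0.4, 0.8, size=28)
withdrawal_rates = rng.uniform(0.2, 0.6, size=28)

stat, p = mannwhitneyu(approach_rates, withdrawal_rates, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.4f}")
```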

Appendix 1—figure 6
Group-level parameter distributions from RL models.

Group-level parameter distributions from (A) the reinforcement learning (RL) model (M3) with fixed ω and (B) the RL model (M4) with flexible ω. Shaded red regions denote 95% confidence intervals.

Appendix 1—figure 7
Group-level parameter distributions from RLDDM models.

Group-level parameter distributions from (A) the reinforcement learning diffusion decision-making (RLDDM) model (M3) with fixed ω and (B) the RLDDM model (M4) with flexible ω. Shaded red regions denote 95% confidence intervals.

Appendix 1—figure 8
Subject-level parameter distributions from RL models.

Subject-level parameter distributions from (A) the reinforcement learning (RL) model (M3) with fixed ω and (B) the RL model (M4) with flexible ω. Shaded red regions denote 95% confidence intervals.

Appendix 1—figure 9
Subject-level parameter distributions from RLDDM models.

Subject-level parameter distributions from (A) the reinforcement learning diffusion decision-making (RLDDM) model (M3) with fixed ω and (B) the RLDDM model (M4) with flexible ω. Shaded red regions denote 95% confidence intervals.

Appendix 1—figure 10
Model predictions: Adapting fear responses in a chronic pain gridworld.

Pavlovian-instrumental interactions are invoked in a popular model of chronic pain, in which excessive Pavlovian fear of movement is self-punitive in a context in which active avoidance would reduce pain (Crombez et al., 2012; Meulders et al., 2011). (A) A grid world with the start at the centre (blue) and the goal at the left end (green) operationalises this. We augment the action set with an additional ‘immobilise’ action, which results in no state change and repeated rewards. An upper bound of 100 steps per episode is set; exceeding it leads to a painless death and an episode restart. (B) Cumulative failures to reach the goal as a measure of efficiency. With a constant Pavlovian fear influence, the agent struggles to complete episodes, resembling effects seen in rodent models of anxiety (Ligneul et al., 2022). (C) Cumulative pain accrued as a measure of safety. In clinical terms, the agent remains stuck in a painful state, contrasting with an instrumental system that can seek and consume rewards despite pain. A flexible parameter ω (κ=3, αΩ=0.01) allows the agent to overcome fear and complete episodes efficiently, demonstrating a safety-efficiency dilemma. The flexible ω policy outperforms the fixed variants, emphasising the benefits of adapting fear responses for task completion. (D) Results from Laughlin et al., 2020 show that 25% of the (anxious) rats fail the signalled active avoidance task due to freezing. GIFs for the different configurations (pure instrumental agent, adaptively safe agent with flexible ω, and maladaptively safe agent with constant ω) can be found here.
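As a sketch of the environment logic described in (A)–(C), one possible step function is given below; the helper methods and reward values on the hypothetical `env` object are illustrative and not the paper's exact specification.

```python
def chronic_pain_step(env, state, action):
    """One transition in the chronic-pain grid world described above.

    'immobilise' leaves the state unchanged; exceeding 100 steps per episode
    ends the episode without pain ('painless death') and the episode restarts.
    """
    env.steps += 1
    if env.steps > 100:                       # upper bound on episode length
        return state, 0.0, True               # painless death, episode restarts
    if action == "immobilise":
        return state, env.immobilise_reward, False   # no state change
    next_state = env.move(state, action)      # hypothetical transition helper
    reward = env.reward(next_state)           # pain enters as negative reward
    done = (next_state == env.goal)
    return next_state, reward, done
```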

Appendix 1—figure 11
An overview of neurobiological substrates for the proposed Pavlovian avoidance learning (PAL) model based on relevant prior literature.

Tables

Appendix 1—table 1
Model comparison results for reinforcement learning (RL) models.
Model  Free parameters       LOOIC    WAIC
M1     α, β                  8201.53  8182.15
M2     α, β, b               7960.00  7926.73
M3     α, β, b, ω            7947.84  7918.20
M4     α, β, b, αΩ, κ, Ω0    7863.79  7830.18
Appendix 1—table 2
Model comparison results for reinforcement learning diffusion decision-making (RLDDM) models.
Model  Free parameters                     LOOIC     WAIC
M1     ndt, threshold, α, β                12539.14  12495.44
M2     ndt, threshold, α, β, b             12303.00  12247.63
M3     ndt, threshold, α, β, b, ω          12296.38  12247.22
M4     ndt, threshold, α, β, b, αΩ, κ, Ω0  12205.52  12164.14
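Both tables report LOOIC and WAIC, where lower values indicate better out-of-sample predictive fit; M4 (flexible ω) is preferred in both cases. As a hedged sketch, these quantities can be obtained from a fitted model's pointwise log-likelihoods with ArviZ, assuming the posterior has been converted to an InferenceData object; the deviance-scale conversion is the standard −2 × elpd.

```python
import arviz as az

def information_criteria(idata):
    """Return (LOOIC, WAIC) on the deviance scale for one fitted model.

    `idata` is assumed to be an ArviZ InferenceData object containing a
    log_likelihood group (e.g. exported from the Stan fits of M1-M4).
    """
    loo = az.loo(idata)        # PSIS leave-one-out cross-validation
    waic = az.waic(idata)
    return -2.0 * loo.elpd_loo, -2.0 * waic.elpd_waic
```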

Cite this article

Pranav Mahajan, Shuangyi Tong, Sang Wan Lee, Ben Seymour (2025) Balancing safety and efficiency in human decision-making. eLife 13:RP101371. https://doi.org/10.7554/eLife.101371.3