Figures and data

Illustration of policy-gradient RL.
A) Outline of the policy-gradient algorithm. B) Reward landscape for a task with a 2-dimensional action space, and actions generated according to an initial policy, in this case a 2-dimensional Gaussian. The green ellipse indicates the covariance of the Gaussian policy (90% confidence region), while dots represent individual samples. C) Illustration of the policy update for a positive reward prediction error. In this trial, the randomly selected action leads to a better-than-average reward (positive reward prediction error). Consequently, the mean is updated towards the sampled action. D) Policy update for a negative reward prediction error. Here, the randomly selected action results in a worse-than-average reward (negative reward prediction error). Consequently, the mean is updated away from the sampled action. E) After a few thousand samples, the policy converges on the region of greatest reward.
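As a concrete illustration of the update rule sketched in panels C and D, a minimal implementation might look as follows. This is a sketch only: the reward surface, learning rate, noise level, and baseline update are illustrative assumptions, not the model's fitted parameters.

```python
import numpy as np

def reward(action):
    # Hypothetical smooth reward landscape peaked at (1.0, -0.5).
    peak = np.array([1.0, -0.5])
    return np.exp(-np.sum((action - peak) ** 2))

rng = np.random.default_rng(0)
mu = np.zeros(2)          # mean of the 2-D Gaussian policy
sigma = 0.5               # fixed isotropic exploration noise
alpha = 0.1               # learning rate
baseline = 0.0            # running-average reward

for trial in range(5000):
    a = rng.normal(mu, sigma)                 # sample an action from the policy
    r = reward(a)
    rpe = r - baseline                        # reward prediction error
    mu += alpha * rpe * (a - mu) / sigma**2   # move mean toward (or away from) the sample
    baseline += 0.1 * rpe                     # update the average-reward baseline

print(mu)  # converges near the reward peak
```

A positive reward prediction error moves the mean toward the sampled action, a negative one moves it away, exactly as illustrated in panels C and D.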

Policy-gradient RL model of a throwing task.
A) In Sternad and colleagues’ virtual skittles task17,18, participants swivel a paddle and press a button to release a ball, aiming to topple a skittle while avoiding a central post. The ball’s movement is guided by an elastic force acting towards the central post, and is fully determined by the initial release angle and velocity. The task error is taken as the minimum distance of the ball’s trajectory to the center of the skittle. B) Example human performance in this task: heatmap shows task error given release angle and velocity, adapted from18. Yellow dots show example data from a human participant on their first day performing the task. Green dots show data from the same participant after 11 practice sessions (∼2,000 trials). C) Example performance of the policy-gradient algorithm trained on the same task. Ellipses show 90% confidence regions for the policy at different stages of learning. D) Illustration of the TNC-Cost decomposition used to compare human and model behavior. Based on actions taken over a block of 60 trials, the Tolerance Cost (T-Cost) estimates potential performance gains achievable over the current policy by translating the policy in action space. The Noise-Amplitude Cost (N-Cost) estimates potential performance gains achievable by uniformly scaling the noise. The Noise-Covariance Cost (C-Cost) estimates potential performance gains achievable by optimizing the policy covariance. E) TNC-Cost decomposition19 for 9 participants, adapted from18. F) TNC-Cost decomposition for simulated learning of the policy-gradient RL model.
© 2014 Springer Nature. This figure is reproduced from Sternad et al. 2014, with permission from Springer Nature. It is not covered by the CC-BY 4.0 licence and further reproduction of this figure would need permission from the copyright holder.
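For intuition about how the task error in panel A depends on the release parameters, a simulation of the throw might look like the sketch below. The post position, spring constant, paddle length, and skittle location used here are assumptions for illustration, not the published task parameters.

```python
import numpy as np

def task_error(release_angle, release_speed,
               skittle=np.array([0.8, 0.9]), omega=1.0, paddle_len=0.4):
    # Release position and velocity determined by the paddle angle and speed.
    pos0 = paddle_len * np.array([np.cos(release_angle), np.sin(release_angle)])
    vel0 = release_speed * np.array([-np.sin(release_angle), np.cos(release_angle)])
    # Under an elastic force toward the central post at the origin, each
    # coordinate follows simple harmonic motion: x(t) = x0*cos(wt) + (vx0/w)*sin(wt).
    t = np.linspace(0.0, 2 * np.pi / omega, 500)
    traj = (pos0[:, None] * np.cos(omega * t)
            + (vel0[:, None] / omega) * np.sin(omega * t))
    # Task error: minimum distance of the trajectory to the skittle center.
    return np.min(np.linalg.norm(traj - skittle[:, None], axis=0))

print(task_error(release_angle=1.2, release_speed=1.5))
```

Because the error is fully determined by release angle and velocity, this two-dimensional action space is the one over which the Gaussian policy in panel C is defined.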

Policy-gradient RL model of learning a cursor-control task.
A) In the bimanual cursor-control task, participants must learn to control a cursor using a non-intuitive mapping from the positions of their left and right hands to an on-screen cursor. Forward-backward movement of the left hand leads to left-right movement of the cursor, and left-right movement of the right hand leads to forward-backward movement of the cursor. B) Learning of a policy-gradient reinforcement learning model for this task. Large circles represent different target locations, to be reached by the cursor from a central starting location. Each dot represents a trial endpoint, with color indicating the associated target for that trial. The three panels illustrate the policy at the outset of training, after 1,000 training trials, and after 2,500 training trials. C) Human performance in this task. Each curve shows average absolute directional error across blocks of 60 trials, averaged across N=13 participants. Shaded region indicates +/- standard error of the mean across participants. D) Average performance of the policy-gradient RL models with policies initialized to match the initial behavior of human participants. E) Bias-variance decomposition for human performance over trials, averaged across participants. Most practice-based improvement is attributable to reductions in variability. F) Averaged bias-variance decomposition for policy-gradient RL models with policies initialized to match the initial behavior of human participants. Shaded regions indicate +/- sem across participants (E) or across models initialized to different participants (F).
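One simple form of the bias-variance decomposition in panels E and F, applied to reach endpoints for a single target, is sketched below; the exact decomposition used for the figure may differ, and the endpoint data here are simulated for illustration.

```python
import numpy as np

def bias_variance(endpoints, target):
    endpoints = np.asarray(endpoints)                          # shape (n_trials, 2)
    mean_endpoint = endpoints.mean(axis=0)
    bias_sq = np.sum((mean_endpoint - target) ** 2)            # squared bias
    variance = np.mean(np.sum((endpoints - mean_endpoint) ** 2, axis=1))
    mse = np.mean(np.sum((endpoints - target) ** 2, axis=1))   # total squared error
    return bias_sq, variance, mse   # mse == bias_sq + variance (up to rounding)

rng = np.random.default_rng(1)
target = np.array([1.0, 0.0])
# Simulated endpoints with a small systematic offset (bias) and trial-to-trial noise.
pts = target + np.array([0.2, -0.1]) + 0.3 * rng.standard_normal((60, 2))
print(bias_variance(pts, target))
```

In this decomposition, a reduction in the variance term with no change in bias corresponds to the pattern described in panel E, where most practice-based improvement reflects reduced variability.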

Policy-gradient model of learning to generate precise movements.
A) Precision execution task, adapted from16. Participants used movements of their wrist to rapidly guide a cursor through an arc-shaped channel from left to right, with the goal of not leaving the channel. Initially, participants’ paths were highly variable and most trials were unsuccessful. After 8 sessions of practice (∼1,000 trials), participants’ movement variability was reduced and the rate of successful movements increased. B) Fraction of trials successfully remaining within the channel across 8 blocks of practice (120 trials per block), averaged across human participants. Panels A) and B) are reproduced from16. C) Policy-gradient RL model of motor acuity learning. The initial policy was set to approximately match initial human performance. After 1,000 training trials, performance of the model was substantially improved. Red dots indicate intermediate goal locations used to model generation of the movement trajectories. The parameters of the policy were the 2-D locations of the 6 goal locations, along with the variance of their locations across trials. D) Fraction of trials staying within the channel, averaged across 100 simulated experiments.
© 2012 American Physiological Society. This figure is reproduced from Figure 6 from Shmuelof et al. 2012, with permission from American Physiological Society. It is not covered by the CC-BY 4.0 licence and further reproduction of this figure would need permission from the copyright holder.
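To make the via-point policy parameterization of panel C concrete, a trial of the model could be simulated roughly as follows. The channel geometry, number of interpolation points, and noise level are illustrative assumptions, not the fitted model parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

center = np.array([0.0, 0.0])
r_in, r_out = 0.9, 1.1                       # arc-shaped channel bounds
angles = np.linspace(np.pi, 0.0, 6)          # nominal angles of the 6 via-points
goal_means = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # policy means
goal_sigma = 0.05                            # trial-to-trial via-point noise

def simulate_trial():
    # Sample the 6 intermediate goal locations for this trial, then
    # interpolate a movement path through them.
    vias = goal_means + goal_sigma * rng.standard_normal(goal_means.shape)
    t = np.linspace(0.0, 1.0, 200)
    seg = np.linspace(0.0, 1.0, len(vias))
    path = np.stack([np.interp(t, seg, vias[:, d]) for d in range(2)], axis=1)
    radii = np.linalg.norm(path - center, axis=1)
    return np.all((radii > r_in) & (radii < r_out))   # stayed within the channel?

print(np.mean([simulate_trial() for _ in range(100)]))  # fraction of successful trials
```

In this sketch, learning would correspond to policy-gradient updates of the via-point means and of the noise magnitude, with success (staying in the channel) serving as the reward.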