Implementation of the value-RNN in the cortico-basal ganglia-DA circuits.

Simulation of a Pavlovian cue-reward association task. (A) Simulated task with variable inter-trial intervals. (B) Black line: Estimated true values of the states, defined by the relative timing from the cue, obtained through simulations according to the definition of state value, i.e., the expected cumulative discounted future rewards. Red line: TD-RPEs calculated from the estimated true state values. (C-G) State values (black lines) and TD-RPEs (red lines) at the 1000th trial, averaged across 100 simulations (error bars indicate ± SEM across simulations), in different types of agent: (C) TD-RL agent having punctate/CSC state representation and state values without continuation between trials (i.e., the value of the last state in a trial was not updated by TD-RPE upon entering the next trial); (D) TD-RL agent having punctate/CSC state representation and state values continuously updated across trials; (E) Value-RNN with backprop (VRNNbp). The number of RNN units was 7 (the same applies to (F,G)); (F) Value-RNN with fixed random feedback (VRNNrf); (G) Agent with an untrained RNN. (H) State values at the 1000th trial in individual simulations of VRNNbp (top), VRNNrf (middle), and the untrained RNN (bottom). (I) Histograms of the value of the pre-reward state (i.e., the state one time-step before the reward state) at the 1000th trial in individual simulations of the three models. The vertical black dashed lines indicate the true value of the pre-reward state (estimated through simulations). (J) Learning performance of VRNNbp (red line), VRNNrf (blue line), and the untrained RNN (gray line) when the number of RNN units was varied from 5 to 40 (horizontal axis). Learning performance was measured by the sum of squared differences between the state values developed at the 1000th trial by each of these three types of agent and the estimated true state values between cue and reward (vertical axis), averaged over 100 simulations (error bars indicate ± SEM across simulations).
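For concreteness, the value-RNN computations referenced above can be sketched as follows. This is an illustrative sketch only, not the authors' code: the state update x(t) = f(Ax(t−1), Bo(t−1)), the value v(t) = w·x(t), and the TD-RPE δ(t) = r(t) + γv(t) − v(t−1) follow the definitions used in the figure, while the learning rate, the tanh nonlinearity, and the exact form of the A/B updates (with the backprop signal w replaced by the fixed random vector c in VRNNrf) are assumptions.

```python
# Illustrative sketch of one value-RNN time-step (assumed forms, not the
# authors' exact implementation).
import numpy as np

rng = np.random.default_rng(0)
n_units, gamma, alpha = 7, 0.9, 0.1                  # 7 units as in (C-G); gamma, alpha assumed

A = 0.1 * rng.standard_normal((n_units, n_units))    # recurrent connections
B = 0.1 * rng.standard_normal((n_units, 1))          # feed-forward (observation) connections
w = np.zeros(n_units)                                # value weights
c = rng.standard_normal(n_units)                     # fixed random feedback (used by VRNNrf)
f = np.tanh                                          # assumed element-wise nonlinearity


def one_step(x_prev, o_prev, r, use_backprop=True):
    """Advance the RNN one time-step and apply TD learning.
    use_backprop=True corresponds to VRNNbp, False to VRNNrf."""
    global A, B, w
    x_now = f(A @ x_prev + B @ o_prev)               # x(t) = f(Ax(t-1), Bo(t-1))
    v_prev, v_now = w @ x_prev, w @ x_now            # state values
    delta = r + gamma * v_now - v_prev               # TD-RPE
    w = w + alpha * delta * x_prev                   # value-weight update (assumed TD(0) form)
    fb = w if use_backprop else c                    # backprop signal vs. fixed random feedback
    g = fb * (1.0 - x_now ** 2)                      # assumed tanh-derivative factor
    A = A + alpha * delta * np.outer(g, x_prev)      # assumed update of recurrent weights
    B = B + alpha * delta * np.outer(g, o_prev)      # assumed update of feed-forward weights
    return x_now, delta
```

In this sketch, the untrained-RNN control in (G) would correspond to skipping the A and B updates (only w is learned), though the exact control procedure used in the figure may differ.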

Occurrence of feedback alignment and an intuitive understanding of its mechanism. (A) Over-trial changes in the angle between the value-weight vector w and the fixed random feedback vector c in the simulations of VRNNrf (7 RNN units). The solid line and the dashed lines indicate the mean and ± SD across 100 simulations, respectively. (B) Relation between the angle between w and c (horizontal axis) and the value of the pre-reward state (vertical axis) at the 1000th trial. The dots indicate the results of individual simulations, and the line indicates the regression line. (C) Angle, across time-steps, between the fixed random feedback vector c and the hypothetical change in x(t) = f(Ax(t−1), Bo(t−1)) that would occur if A and B were replaced with their updated values, multiplied by the sign of the TD-RPE (sign(δ(t))). The black thick line and the gray lines indicate the mean and ± SD across 100 simulations, respectively (the same applies to (D)). (D) Product of TD-RPEs in successive trials at individual states (top: cue; fourth from the top: reward). A positive or negative value indicates that the TD-RPEs in successive trials had the same or opposite signs, respectively. (E) Left: RNN trajectories mapped onto the first and second principal components (horizontal and vertical axes, respectively) in three successive trials (red, blue, and green lines (heavily overlapped)) at different phases of an example simulation (10th-12th, 300th-302nd, 600th-602nd, and 900th-902nd trials from top to bottom). The crosses and circles indicate the cue and reward states, respectively. Right: State values (black lines) and TD-RPEs (red lines) at the 11th, 301st, 601st, and 901st trials.
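The alignment measures in (A-C) can be computed as in the following sketch. The helper names (angle_deg, hypothetical_dx) are hypothetical; the vector-angle definition is standard, and the "hypothetical change in x(t)" is obtained here by re-running the state update with the updated A and B, as described for panel (C), using the same assumed form of f as in the earlier sketch.

```python
import numpy as np

def angle_deg(u, v):
    """Angle (in degrees) between two vectors, e.g., between w and c (A,B)
    or between a hypothetical change in x(t) and c (C)."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def hypothetical_dx(f, A, B, A_new, B_new, x_prev, o_prev, delta):
    """Change in x(t) = f(Ax(t-1), Bo(t-1)) if A and B were replaced with
    their updated values, multiplied by sign(delta(t)) (panel (C))."""
    dx = f(A_new @ x_prev + B_new @ o_prev) - f(A @ x_prev + B @ o_prev)
    return np.sign(delta) * dx

# e.g., angle_deg(w, c) tracks feedback alignment over trials (panel (A)),
# and angle_deg(hypothetical_dx(...), c) gives the time-step-wise angle in (C).
```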

Simulation of two tasks with probabilistic structures, qualitatively similar to the two tasks examined in experiments [1] and modeled by the value-RNN [2]. (A) The two simulated tasks, in which reward was given at the early or the late timing with equal probability in all trials (task 1) or in 60% of trials (task 2). (B) (a) Top: Trial types. There were two trial types (with early reward and with late reward) in task 1 and three trial types (with early reward, with late reward, and without reward) in task 2. Bottom: Value (expected cumulative discounted future rewards) at each timing in each trial type. (b) The agent's probabilistic belief about the current trial type, in the case where the agent was in fact in the trial with early reward (top row), the trial with late reward (second row), or the trial without reward (third row; task 2 only). (c) Top: States defined by considering the probabilistic beliefs at each timing from the cue. Bottom: State values (expected cumulative discounted future rewards, estimated through simulations), which should theoretically match the values of each trial type (shown in (a), bottom) integrated (i.e., multiplied and summed) with the probabilistic beliefs (shown in (b)). (C) Expected TD-RPE calculated from the estimated true state values for task 1 (left) and task 2 (right). Red lines: cases where reward was given at the early timing; blue lines: cases where reward was given at the late timing. (D-H) TD-RPEs at the latest trial within 1000 trials in which reward was given at the early timing (red lines) or the late timing (blue lines), averaged across 100 simulations (error bars indicate ± SEM across simulations), in the different types of agent: (D,E) TD-RL agent having punctate/CSC state representation and state values without (D) or with (E) continuation between trials; (F) VRNNbp. The number of RNN units was 12 (the same applies to (G,H)); (G) VRNNrf; (H) Untrained RNN.
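Panel (B) describes values defined over belief states: the value at a given timing is the value of each trial type at that timing, weighted by the agent's belief about which trial type it is currently in. The following is a minimal numerical sketch for task 2 with assumed numbers (30% early-reward, 30% late-reward, and 40% no-reward trials, i.e., 60% rewarded overall as in the caption; the discount factor and the chosen timing are hypothetical).

```python
import numpy as np

gamma = 0.9                                   # assumed discount factor
# Per-trial-type values at a timing after the early-reward time has passed
# without reward, assumed here to be two steps before the late reward:
#   early-reward type: no remaining reward; late-reward type: reward two
#   steps ahead; no-reward type: nothing.
v_type = np.array([0.0, gamma ** 2, 0.0])

prior = np.array([0.3, 0.3, 0.4])             # task-2 trial-type probabilities
belief = prior * np.array([0.0, 1.0, 1.0])    # early-reward type ruled out
belief /= belief.sum()                        # renormalized belief (panel (B)(b))

state_value = belief @ v_type                 # belief-weighted value (panel (B)(c))
print(round(state_value, 3))                  # ~0.347 with these assumed numbers
```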

Modified value-RNN models with further biological constraints. (A) RevVRNNbp: VRNNbp (value-RNN with backprop) was modified so that the activities of neurons in the RNN (x) and the value weights (w) became non-negative. (B) BioVRNNrf: VRNNrf (value-RNN with fixed random feedback) was modified so that x and w, as well as the fixed random feedback (c), became non-negative, and the dependence of the update rules for the recurrent/feed-forward connections (A and B) on post-synaptic activity was made monotonically increasing with saturation.
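A sketch of how these constraints might enter the update rules is given below; the rectification of x, w, and c follows the caption, while the particular saturating function of post-synaptic activity (and the constant theta) is an illustrative assumption that may differ from the paper's exact form.

```python
import numpy as np

def rectify(v):
    """Non-negativity constraint applied to x and w (and to c in bioVRNNrf)."""
    return np.maximum(v, 0.0)

def post_factor(x_post, theta=1.0):
    """Assumed monotonically increasing, saturating dependence of the A/B
    updates on post-synaptic activity (replacing the derivative term used in
    the unconstrained models); theta is a hypothetical saturation constant."""
    return x_post / (x_post + theta)

# Sketch of a bioVRNNrf-style update of the recurrent weights A:
#   A += alpha * delta * np.outer(rectify(c) * post_factor(x_now), x_prev)
# with x_now rectified after the state update and w kept non-negative, e.g.
#   w = rectify(w + alpha * delta * x_prev)
```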

Performance of the modified value-RNN models in the cue-reward association task, in comparison with an untrained RNN that also had the non-negative constraint. (A-D) State values (black lines) and TD-RPEs (red lines) at the 1500th trial in revVRNNbp (A), bioVRNNrf (B), an untrained RNN with x and w constrained to be non-negative (C), and an untrained RNN with non-negative x and w whose connections were shuffled from those learnt in bioVRNNrf (D). The number of RNN units was 12 in all cases. Lines indicate the mean and error bars ± SEM across 100 simulations (the same applies to (E,F)). The right histograms show the across-simulation distribution of the value of the pre-reward state in each model. The vertical black dashed lines in the histograms indicate the true value of the pre-reward state (estimated through simulations). (E) Learning performance of revVRNNbp (red line), bioVRNNrf (blue line), the untrained RNN (gray solid line; partly out of view), and the untrained RNN with connections shuffled from those learnt in bioVRNNrf (gray dotted line) when the number of RNN units was varied from 5 to 40 (horizontal axis). Learning performance was measured by the sum of squared differences between the state values developed at the 1500th trial by each of these four types of agent and the estimated true state values between cue and reward (vertical axis). (F) Mean of the elements of the recurrent and feed-forward connections (at the 1500th trial) of revVRNNbp (red line), bioVRNNrf (blue line), and the untrained RNN (gray solid line).
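The learning-performance measure in (E) and the shuffle control in (D) could be computed along the following lines. The helper names are hypothetical, and the exact shuffling scheme used for the control is an assumption.

```python
import numpy as np

def learning_error(v_learned, v_true):
    """Sum of squared differences between the developed state values and the
    estimated true state values over the cue-to-reward states (panel (E))."""
    v_learned, v_true = np.asarray(v_learned), np.asarray(v_true)
    return float(np.sum((v_learned - v_true) ** 2))

def shuffle_connections(W, rng):
    """Shuffle control (panel (D)): permute the learnt connection weights
    (assumed here to be an element-wise permutation)."""
    flat = W.flatten()
    rng.shuffle(flat)
    return flat.reshape(W.shape)
```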

Loose alignment of the value weights (w) and the random feedback (c) in bioVRNNrf (with 12 RNN units), and its relation to the developed state values. (A) Over-trial changes in the angle between the value weights w and the fixed random feedback c. The solid line and the dashed lines indicate the mean and ± SD across 100 simulations, respectively. (B) Relation between the w-c angle (horizontal axis) and the value of the pre-reward state (vertical axis) at the 1500th trial. The dots indicate the results of individual simulations. (C) Correlation between the w-c angle at the k-th trial (horizontal axis) and the value of the cue, post-cue, pre-reward, or reward state (top to bottom panels) at the 500th trial across 1000 simulations. The solid lines indicate the correlation coefficient, and the short vertical bars at the top of each panel indicate the cases in which the p-value was less than 0.05.
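The across-simulation correlations in (C) are ordinary Pearson correlations between the w-c angle and a given state value. A minimal sketch, assuming SciPy for the p-value (helper names are hypothetical):

```python
import numpy as np
from scipy import stats

def wc_angle(w, c):
    """Angle (degrees) between the value weights w and the random feedback c."""
    cos = np.dot(w, c) / (np.linalg.norm(w) * np.linalg.norm(c))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def angle_value_correlation(angles, values):
    """Pearson correlation (and p-value) between the w-c angle and a state
    value across simulations, as plotted in panel (C)."""
    return stats.pearsonr(angles, values)
```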

Performance of the modified value-RNN models in the two tasks with probabilistic structures, in comparison with an untrained RNN having the non-negative constraint. TD-RPEs at the latest trial within 2000 trials in which reward was given at the early timing (red lines) or the late timing (blue lines) in task 1 (left) and task 2 (right), averaged across 100 simulations (error bars indicate ± SEM across simulations), are shown for the four types of agent: (A) revVRNNbp; (B) bioVRNNrf; (C) untrained RNN with non-negative x and w; (D) untrained RNN with non-negative x and w and with connections shuffled from those learnt in bioVRNNrf. The number of RNN units was 20 in all cases.