Figures and data

Implementation of the online value-RNN in the cortico-basal ganglia-DA circuits.

Simulation of a Pavlovian cue-reward association task.
(A) Simulated task with variable inter-trial intervals. (B) Black line: True values of states/timings estimated through simulations according to the definition of state value, i.e., the expected cumulative discounted future reward, taking into account the probabilistic inter-trial interval (ITI). Red line: TD-RPEs calculated from the estimated true state/timing values. (C-G) State values (black lines) and TD-RPEs (red lines) at the 1000th trial, averaged across 100 simulations (error bars indicate mean ± SEM across simulations; the same applies hereafter unless otherwise noted), in different types of agent: (C) TD-RL agent with punctate state representation and state values without continuation between trials (i.e., the value of the last state in a trial was not updated by TD-RPE upon entering the next trial); (D) TD-RL agent with punctate state representation and state values continuously updated across trials; (E) online value-RNN with backprop (oVRNNbp); the number of RNN units was 7 (the same applies to (F,G)); (F) online value-RNN with fixed random feedback (oVRNNrf); (G) agent with untrained RNN. (H) State values at the 1000th trial in individual simulations of oVRNNbp (top), oVRNNrf (middle), and the untrained RNN (bottom). (I) Histograms of the value of the pre-reward state (i.e., the state one time-step before the reward state) at the 1000th trial in individual simulations of the three models. The vertical black dashed lines indicate the true value of the pre-reward state (estimated through simulations). (J) Left: Mean of the squared differences between the state values developed by each agent and the estimated true state values between cue and reward (referred to as the mean squared value-error) at the 1000th trial in oVRNNbp (red line), oVRNNrf (blue line), and the model with untrained RNN (gray line) when the number of RNN units (n) was varied from 5 to 40. The learning rate for the value weights was normalized by dividing it by n/7 (the same applies hereafter unless otherwise noted). Right: Mean squared value-error in oVRNNrf (blue line: same data as in the left panel) and oVRNN with uniform feedback (green line). (K) Log of the contribution ratios of the principal components of the time series (over 1000 trials) of RNN activities in each model with 20 RNN units. (L) Mean squared value-error in each model with 20 RNN units across trials. (M) Mean squared value-error in each model at the 3000th trial when the cue-reward delay was 3, 4, 5, or 6 time-steps (top to bottom panels).
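To make the computations referred to in this legend concrete (state value as a weighted readout of RNN activity, TD-RPE, and online updates of the value and RNN weights), the following is a minimal Python sketch. The sizes, learning rates, sigmoid activation, task timings, and the exact form of the A/B updates are illustrative assumptions; the paper's Methods define the actual rules. Using the fixed vector c as feedback corresponds to the oVRNNrf-style update, while using w in its place corresponds to the backprop-based oVRNNbp.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_obs = 7, 2                              # RNN units; observation elements (cue, reward) -- assumed
gamma, alpha_w, alpha_AB = 0.9, 0.1, 0.01    # discount factor and learning rates (hypothetical values)

A = 0.1 * rng.standard_normal((n, n))        # recurrent connections
B = 0.1 * rng.standard_normal((n, n_obs))    # feed-forward connections from observations
w = np.zeros(n)                              # value weights (readout of the value)
c = rng.random(n)                            # fixed random feedback vector used by oVRNNrf

f = lambda u: 1.0 / (1.0 + np.exp(-u))       # sigmoid activation (assumed)

# One trial of the cue-reward association task: cue at t = 3, reward 3 time-steps later (illustrative).
T = 12
obs = np.zeros((T, n_obs)); obs[3, 0] = 1.0; obs[6, 1] = 1.0
rewards = np.zeros(T); rewards[6] = 1.0

x_prev, o_prev, v_prev = np.zeros(n), np.zeros(n_obs), 0.0
for t in range(T):
    x = f(A @ x_prev + B @ o_prev)           # x(t) = f(A x(t-1), B o(t-1)), as written in the legend
    v = w @ x                                # state value read out from RNN activity
    delta = rewards[t] + gamma * v - v_prev  # TD-RPE (the putative dopamine signal)
    w = w + alpha_w * delta * x_prev         # online update of the value weights
    g = c * x * (1.0 - x)                    # feedback vector (c, not w) gated by the sigmoid derivative
    A = A + alpha_AB * delta * np.outer(g, x_prev)  # schematic TD-RPE-modulated plasticity of A
    B = B + alpha_AB * delta * np.outer(g, o_prev)  # ... and of B
    x_prev, o_prev, v_prev = x, obs[t], v
```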

Occurrence of feedback alignment and an intuitive understanding of its mechanism.
(A) Over-trial changes in the angle between the value-weight vector w and the fixed random feedback vector c in the simulations of oVRNNrf (7 RNN units). The solid line and the dashed lines indicate the mean and ± SD across 100 simulations, respectively. (B) Negative correlation (r = −0.288, p = 0.00362) between the angle between w and c (horizontal axis) and the value of the pre-reward state (vertical axis) at the 1000th trial. The dots indicate the results of individual simulations, and the line indicates the regression line. (C) Angle, across time-steps, between the fixed random feedback vector c and the hypothetical change in x(t) = f(Ax(t−1), Bo(t−1)) that would occur if A and B were replaced with their updated values, multiplied by the sign of the TD-RPE (sign(δ(t))). The thick black line and the gray lines indicate the mean and ± SD across 100 simulations, respectively (the same applies to (D)). (D) Product of TD-RPEs in successive trials at individual states (top: cue, fourth from the top: reward). A positive or negative value indicates that the TD-RPEs in successive trials had the same or opposite signs, respectively. (E) Left: RNN trajectories projected onto the first and second principal components (horizontal and vertical axes, respectively) in three successive trials (red, blue, and green lines, heavily overlapping) at different phases of an example simulation (10th-12th, 300th-302nd, 600th-602nd, and 900th-902nd trials from top to bottom). The crosses and circles indicate the cue and reward states, respectively. Right: State values (black lines) and TD-RPEs (red lines) at the 11th, 301st, 601st, and 901st trials.
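The alignment measure plotted in (A-C) is the ordinary angle between two vectors; a small helper like the following (an illustrative sketch, not the authors' code) suffices to reproduce it from the simulated w and c.

```python
import numpy as np

def angle_deg(u, v):
    """Angle (in degrees) between two vectors, e.g., the value-weight vector w and
    the fixed random feedback vector c, as tracked across trials in panel (A)."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards against rounding error
```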

Simulation of two tasks with probabilistic structures, qualitatively similar to the two tasks examined experimentally [54] and modeled by the original value-RNN with BPTT [26].
(A) The two simulated tasks, in which reward was given at the early or the late timing with equal probability in all trials (task 1) or in 60% of trials (task 2). (B) (a) Top: Trial types: two trial types (early reward and late reward) in task 1 and three trial types (early reward, late reward, and no reward) in task 2. Bottom: Value of each timing in each trial type, estimated through simulations. (b) The agent's probabilistic belief about the current trial type, in the case where the agent was in fact in a trial with early reward (top row), a trial with late reward (second row), or a trial without reward (third row, task 2 only). (c) Top: States defined by considering the probabilistic beliefs at each timing from the cue. Bottom: True state/timing values calculated by taking the (mathematical) expectation of the estimated value of each timing across trial types. (C) Expected TD-RPE calculated from the estimated true values of the states/timings for task 1 (left) and task 2 (right). Red lines: reward given at the early timing; blue lines: reward given at the late timing. TD-RPE at early reward is expected to be larger than TD-RPE at late reward in task 1, whereas the opposite holds in task 2, as indicated by the inequality signs. (D-H) TD-RPEs at the latest trial, within 1000 trials, in which reward was given at the early timing (red lines) or the late timing (blue lines), averaged across 100 simulations (error bars indicate ± SEM across simulations), in different types of agent: (D,E) TD-RL agent with punctate state representation and state values without (D) or with (E) continuation between trials; (F) oVRNNbp; the number of RNN units was 12 (the same applies to (G,H)); (G) oVRNNrf; (H) agent with untrained RNN. The p values are from paired t-tests between TD-RPE at early reward and TD-RPE at late reward (100 pairs, two-tailed), and the d values are Cohen's d computed with the average variance, with signs taken relative to the expected patterns shown in (C) (the same applies to Fig. 8 and Fig. 9C).
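For the statistics reported in (D-H), the following sketch shows one way to compute the paired two-tailed t-test and a Cohen's d based on the average of the two sample variances (our reading of "using the average variance"; an assumption), given per-simulation TD-RPEs at early and late reward.

```python
import numpy as np
from scipy import stats

def paired_rpe_stats(rpe_early, rpe_late):
    """Paired two-tailed t-test and Cohen's d for TD-RPE at early vs. late reward.

    rpe_early, rpe_late: arrays with one value per simulation (100 pairs in the figure).
    """
    t, p = stats.ttest_rel(rpe_early, rpe_late)                  # paired, two-tailed by default
    sd = np.sqrt((np.var(rpe_early, ddof=1) + np.var(rpe_late, ddof=1)) / 2.0)
    d = (np.mean(rpe_early) - np.mean(rpe_late)) / sd            # sign corresponds to early minus late
    return p, d
```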

Revised online value-RNN models with further biological constraints.
(A) oVRNNbp-rev: oVRNNbp (online value-RNN with backprop) was modified so that the activities of the RNN neurons (x) and the value weights (w) became non-negative. (B) oVRNNrf-bio: oVRNNrf (online value-RNN with fixed random feedback) was modified so that x and w, as well as the fixed random feedback (c), became non-negative and the dependence of the update rules for the recurrent/feed-forward connections (A and B) on post-synaptic activity became monotonic and saturating.
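As an illustration of the two constraints named in (B), a minimal sketch follows: a rectification that keeps x and w (and c) non-negative, and a monotonic, saturating function of post-synaptic activity entering the A/B updates. The specific saturating form and threshold are assumptions for illustration; the Methods give the exact definitions.

```python
import numpy as np

def rectify(v):
    """Non-negativity constraint applied to RNN activities x and weights w (and c in oVRNNrf-bio)."""
    return np.maximum(v, 0.0)

def post_synaptic_factor(x, theta=1.0):
    """Monotonic, saturating dependence of the A/B updates on post-synaptic activity:
    grows with x and levels off at theta (illustrative piecewise-linear form)."""
    return np.minimum(x, theta)
```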

Performance of the revised online value-RNN models in the cue-reward association task, compared with models with untrained RNN that also had the non-negative constraint.
(A-D) State values (black lines) and TD-RPEs (red lines) at the 1500th trial in oVRNNbp-rev (A), oVRNNrf-bio (B), an agent with a naive untrained RNN (i.e., randomly initialized RNN) with x and w constrained to be non-negative (C), and an agent with an untrained RNN whose connections were shuffled from those learnt in oVRNNrf-bio, also with non-negative x and w (D). The number of RNN units was 12 in all cases. Error bars indicate mean ± SEM across 100 simulations; the same applies hereafter unless otherwise noted. The histograms on the right show the across-simulation distribution of the value of the pre-reward state in each model. The vertical black dashed lines in the histograms indicate the true value of the pre-reward state (estimated through simulations). (E) Left: Mean squared value-error at the 1500th trial in oVRNNbp-rev (red line), oVRNNrf-bio (blue line), the agent with naive untrained RNN (gray solid line: partly out of view), and the agent with shuffled untrained RNN (gray dotted line) when the number of RNN units (n) was varied from 5 to 40. The learning rate for the value weights was normalized by dividing it by n/12 (the same applies hereafter). Right: Mean squared value-error in oVRNNrf-bio (blue line: same data as in the left panel), oVRNN-bio with random-magnitude uniform feedback (green line), oVRNN-bio with fixed-magnitude (0.5) uniform feedback (light blue line), and oVRNNrf-rev, in which the update rule of oVRNNrf-bio was changed back to the original one (blue dotted line). (F) Left: Mean of the elements of the recurrent and feed-forward connections (at the 1500th trial) of oVRNNbp-rev (red line), oVRNNrf-bio (blue line), and the naive untrained RNN (gray solid line). Right: Mean of the elements of the recurrent and feed-forward connections of oVRNNrf-bio (blue line: same data as in the left panel), oVRNN-bio with random-magnitude uniform feedback (green line), oVRNN-bio with fixed-magnitude (0.5) uniform feedback (light blue line), and oVRNNrf-rev (blue dotted line). (G) Learned state values (left panel) and TD-RPEs (right panel) in oVRNNbp-rev (red lines) and oVRNNrf-bio (blue lines) with 40 RNN units, compared to the estimated true values (black lines). (H) Log of the contribution ratios of the principal components of the time series (over 1500 trials) of RNN activities in each model with 20 RNN units. (I) Mean squared value-error in each model with 20 RNN units across trials. (J) Mean squared value-error at the 3000th trial in each model when the cue-reward delay was 3, 4, 5, or 6 time-steps (top to bottom panels). Left and right panels show the results with the default learning rates and with halved learning rates, respectively.
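Two quantities recur throughout this legend: the mean squared value-error and the "shuffled untrained RNN" control. A hedged sketch of both is given below; in particular, the element-wise permutation used for shuffling is our assumption of the procedure, which the Methods specify precisely.

```python
import numpy as np

def mean_squared_value_error(v_learned, v_true):
    """Mean of the squared differences between the learned state values and the
    estimated true state values for the states between cue and reward."""
    v_learned, v_true = np.asarray(v_learned), np.asarray(v_true)
    return float(np.mean((v_learned - v_true) ** 2))

def shuffle_connections(M, rng):
    """'Shuffled untrained RNN' control: randomly permute the elements of a learned
    connection matrix (element-wise shuffle assumed here), preserving its statistics
    while destroying its learned structure."""
    flat = M.flatten()
    rng.shuffle(flat)
    return flat.reshape(M.shape)
```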

Performance of the revised online value-RNN models with further biological constraints in the two tasks with probabilistic structures, compared with models with untrained RNN.
TD-RPEs at the latest trial, within 2000 trials, in which reward was given at the early timing (red lines) or the late timing (blue lines) in task 1 (left) and task 2 (right), averaged across 100 simulations (error bars indicate ± SEM across simulations), are shown for the four types of agent: (A) oVRNNbp-rev; (B) oVRNNrf-bio; (C) agent with naive untrained RNN; (D) agent with untrained RNN whose connections were shuffled from those learnt in oVRNNrf-bio. The number of RNN units was 20 in all cases.
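The traces plotted here are taken from the latest trial of each reward-timing type within the first 2000 trials of each simulation; a small selection helper like the following (with an assumed data layout) illustrates that step.

```python
def latest_trial_rpe(rpe_per_trial, reward_timing, target, n_trials=2000):
    """Return the per-time-step TD-RPE trace of the latest trial, within the first
    n_trials, whose reward timing matches `target` ('early' or 'late').

    rpe_per_trial: list of TD-RPE arrays, one per trial (assumed layout).
    reward_timing: list of 'early' / 'late' / 'none' labels, one per trial.
    """
    matches = [i for i in range(min(n_trials, len(rpe_per_trial))) if reward_timing[i] == target]
    return rpe_per_trial[matches[-1]]   # latest matching trial
```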

Loose alignment of the value weights (w) and the random feedback (c) in oVRNNrf-bio (with 12 RNN units).
(A) Over-trial changes in the angle between the value weights w and the fixed random feedback c. The solid line and the dashed lines indicate the mean and ± SD across 100 simulations, respectively. (B) No correlation between the w-c angle (horizontal axis) and the value of the pre-reward state (vertical axis) at the 1500th trial (r = 0.0117, p = 0.908). The dots indicate the results of individual simulations. (C) Correlation between the w-c angle at the k-th trial (horizontal axis) and the value of the cue, post-cue, pre-reward, or reward state (top to bottom panels) at the 500th trial, across 1000 simulations. The solid lines indicate the correlation coefficients, and the short vertical bars at the top of each panel indicate the cases in which the p-value was less than 0.05. (D) Distribution of the angle between two 12-dimensional vectors when the elements of both vectors were drawn from uniform pseudo-random numbers on [0, 1] (a), or when one of the vectors was replaced with [1 0 0 … 0] (i.e., on the edge of the non-negative quadrant) (b) or [1 1 0 … 0] (i.e., on the boundary of the non-negative quadrant) (c). (E) Across-simulation histograms of the elements of w (ordered from largest to smallest) in oVRNNrf-bio with 12 RNN units after 1500 trials, with no value-weight decay (a) or with value-weight decay at a rate (per time-step) of 0.001 (b) or 0.002 (c). The error bars indicate the mean ± SEM across 100 simulations. (F) Over-trial changes in the angle between the value weights w and the fixed random feedback c when there was value-weight decay at a rate (per time-step) of 0.001 (top panel) or 0.002 (bottom panel). Notations are the same as those in (A). (G) Mean squared value-error at the 1500th trial in oVRNNrf-bio with 12 RNN units when the rate of value-weight decay was varied (horizontal axis). The error bars indicate the mean ± SEM across 100 simulations.
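Panels (D-G) involve two simple operations: the angle distribution between random non-negative vectors and a per-time-step value-weight decay. A minimal sketch of both follows (the decay form, multiplicative shrinkage by the stated rate, is our assumption of how "decay rate per time-step" is applied).

```python
import numpy as np

rng = np.random.default_rng(0)

def angle_deg(u, v):
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# (D)(a): angles between pairs of random 12-dimensional vectors with elements drawn uniformly from [0, 1];
# because both vectors lie in the non-negative quadrant, the angles concentrate well below 90 degrees.
angles = [angle_deg(rng.random(12), rng.random(12)) for _ in range(10000)]

# (E-G): value-weight decay applied at every time-step (rates 0.001 or 0.002 in the figure).
def decay_value_weights(w, rate=0.001):
    return (1.0 - rate) * w
```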

oVRNNbp-rev-ei and oVRNNrf-bio-ei models incorporating excitatory E-units and inhibitory I-units.
(A) Schematic illustration of the models’ architecture. For ease of viewing, only a subset of the units and connections is drawn. (B) Mean squared value-error at the 1500th trial in the cue-reward association task in oVRNNbp-rev-ei (red line), oVRNNrf-bio-ei (blue line), and the E-/I-unit-incorporating models with a naive untrained RNN (i.e., randomly initialized RNN) (gray solid line) or an untrained RNN with connections shuffled from those learnt in oVRNNrf-bio-ei (gray dotted line). (C) Patterns of TD-RPE generated by the four models with E-/I-units in the tasks with probabilistic structures. Simulation conditions and notations are the same as those in Fig. 7.
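A generic way to impose the E-/I-unit distinction is to fix the sign of each unit's outgoing connections (excitatory columns non-negative, inhibitory columns non-positive) and to re-impose those signs after every update. The sketch below is only such a generic sign constraint with assumed unit counts; the actual connectivity between E- and I-units in oVRNNbp-rev-ei and oVRNNrf-bio-ei is as illustrated in (A) and defined in the Methods.

```python
import numpy as np

rng = np.random.default_rng(0)
n_e, n_i = 16, 4                     # hypothetical numbers of excitatory and inhibitory units
n = n_e + n_i

# Initialize a recurrent matrix whose columns carry each presynaptic unit's sign.
W = np.abs(0.1 * rng.standard_normal((n, n)))
W[:, n_e:] *= -1.0                   # outgoing weights of I-units are non-positive

def enforce_ei_signs(W):
    """Re-impose the sign constraints after a plasticity update."""
    W = W.copy()
    W[:, :n_e] = np.maximum(W[:, :n_e], 0.0)   # E-unit outputs stay excitatory
    W[:, n_e:] = np.minimum(W[:, n_e:], 0.0)   # I-unit outputs stay inhibitory
    return W
```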

Cue-reward association task with distractor cue.
(A) Modification of oVRNNbp-rev and oVRNNrf-bio to incorporate the possible presence of a distractor cue. The observation units o had an additional element (leftmost circle, labeled 'Dist'), which was 1 at time-steps where the distractor cue was present and 0 otherwise. (B-E) Results for the cases where the probability of the distractor cue being present at each time-step was 0 (B), 0.1 (C), 0.2 (D), or 0.3 (E). Left panels: Examples of the occurrence of the distractor ('D'), the reward-associated cue ('C'), and reward ('R') over 100 time-steps. Middle panels: Mean squared value-error at the 1500th trial in oVRNNbp-rev (red line), oVRNNrf-bio (blue line), and the models with naive or shuffled untrained RNN (gray solid or dotted line). Right panels: Results for the models with E- and I-units (oVRNNbp-rev-ei: red line, oVRNNrf-bio-ei: blue line, models with naive or shuffled untrained RNN: gray solid or dotted line), modified to incorporate the possible presence of the distractor cue in the same manner as in (A).
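The distractor manipulation only changes how the observation vector is generated: one extra element that is 1 with a fixed probability at every time-step, independent of the task events. A sketch under an assumed element ordering (distractor, cue, reward) and illustrative cue/reward timings:

```python
import numpy as np

rng = np.random.default_rng(0)
p_dist, T = 0.2, 100                 # distractor probability per time-step (0, 0.1, 0.2, or 0.3 in (B-E))

obs = np.zeros((T, 3))               # columns: distractor ('Dist'), reward-associated cue, reward (assumed order)
obs[:, 0] = (rng.random(T) < p_dist).astype(float)   # distractor present independently at each time-step
obs[10, 1] = 1.0                     # reward-associated cue (illustrative timing)
obs[13, 2] = 1.0                     # reward, a fixed delay after the cue (illustrative timing)
```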

Incorporation of action selection.
(A) Schematic illustration of the models incorporating an actor-critic architecture. (B) Two-alternative choice task. (a) Task diagram. (b,c) Proportion of Action 1 selections in the 2901st-3000th trials in oVRNNbp-rev-as (red line), oVRNNrf-bio-as (blue line), and the models with a naive untrained RNN (i.e., randomly initialized RNN) (gray solid line) or an untrained RNN with connections shuffled from those learnt in oVRNNrf-bio-as (gray dotted line). Error bars indicate the mean ± SEM over 100 simulations. The inverse temperature was set to 1 (b) or 2 (c). (C) Inter-temporal choice task. The notations are the same as those in (B).
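In (b,c), the "inverse temperature" refers to the softmax rule used for action selection in the actor part of the models. The sketch below shows that selection rule with hypothetical action preferences (in the models these preferences are read out from the RNN activity via action weights, as defined in the Methods).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_choice(preferences, beta, rng):
    """Softmax action selection with inverse temperature beta (1 or 2 in panels (b,c))."""
    z = beta * np.asarray(preferences, dtype=float)
    p = np.exp(z - np.max(z))                # subtract the max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(p), p=p)), p   # chosen action index and choice probabilities

# e.g., two actions with hypothetical preferences 0.3 and 0.1:
action, probs = softmax_choice([0.3, 0.1], beta=2.0, rng=rng)
```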