Introduction

Multiple lines of evidence suggest that Temporal-Difference Reinforcement Learning (TDRL) is implemented in the cortico-basal ganglia-dopamine (DA) circuits in such a way that DA represents the TD reward-prediction-error (RPE) 3-7 and DA-dependent plasticity of cortico-striatal synapses implements the TD-RPE-dependent update of state/action values 8-10. Traditionally, TDRL in the cortico-basal ganglia-DA circuits was considered to serve only relatively simple behavior. However, subsequent studies suggested that more sophisticated, apparently goal-directed/model-based behavior can also be achieved by TDRL if states are appropriately represented 11-13, and that DA signals indeed reflect model-based predictions 14, 15. Conversely, issues related to state representation could potentially cause behavioral or mental-health problems 16-20. Early modeling studies treated state representations appropriate to the situation/task as given ('handcrafted' by the authors), but representation itself should be learnt in the brain 21-26. Recently it was shown that appropriate state representation can be learnt through RL in a recurrent neural network (RNN) by minimization of the squared value-error without an explicit teacher/target 2, 13, while the state value can simultaneously be learnt downstream of the RNN.

However, whether such a learning method, named the value-RNN 2, can be implemented in the brain remains unclear, because it poses problems of biological plausibility. A major problem, among others, is that the update rule proposed in the previous work for the connections onto the 'neurons' in the RNN 2, derived from the gradient-descent error-'backpropagation' (hereafter referred to as backprop) method 27, 28, involves the weights of the connections from these RNN units onto the downstream value-encoding unit. Given that the state-representing RNN and the value-encoding unit are implemented by the intra-cortical circuit and the striatal neurons, respectively, as generally suggested 3, 29, 30, this means that the update (plasticity) rule for intra-cortical connections involves the downstream cortico-striatal synaptic strengths, which cannot be accessed from within the cortex. Indeed, this is an example of the long-standing difficulty in biological implementation of backprop 31, 32, namely that the update of upstream connections requires biologically unavailable downstream connection strengths.

Recently, a potential solution for this difficulty has been proposed 33 (see also 34-41 for other potential solutions). Specifically, in supervised learning of feed-forward networks, it was shown that when the downstream connection strengths used for updating upstream connections in backprop were replaced with fixed random strengths, comparable learning performance was still achieved 33. This was suggested to occur because information about the introduced fixed random strengths transferred, through learning, to the upstream connections and then to the downstream feed-forward connections, so that these feed-forward connections became aligned to the random feedback strengths; in turn, the random feedback could then play the same role as the downstream connection strengths play in backprop. This mechanism was named 'feedback alignment' 33, and was subsequently shown to work also in supervised learning of RNNs 42 and proposed to be neurally implemented 43 (in a different way from the present study, as we discuss in the Discussion).

The value-RNN 2, 13, i.e., the above-introduced simultaneous RL of state values and state representation through minimization of the squared value-error, differs from the supervised learning considered in these previous feedback-alignment studies in two ways: i) it is TD learning, i.e., it approximates the true error by the TD-RPE because the true error, or true state value, is unknown, and ii) it uses a scalar error (TD-RPE) rather than a vector error. Therefore, it was nontrivial whether the feedback alignment mechanism could also work for the value-RNN. In the present work, we first examined this, demonstrating that it does work and providing a mechanistic insight into how it works.

After that, we further addressed other biological-plausibility problems. Specifically, we imposed biological constraints that the downstream (cortico-striatal) weights and the fixed random feedback, as well as the activities of neurons in the RNN, were all non-negative. Moreover, we modified the update rule for the RNN connection strengths so that its dependence on post-synaptic neural activity became monotonic rather than non-monotonic. We then found, unexpectedly, that the non-negative constraint appeared to aid, rather than degrade, the learning, by ensuring that the downstream weights and the fixed random feedback were loosely aligned even without operation of the feedback alignment mechanism. These results suggest how learning of state representation and value can be neurally implemented, more specifically, through synaptic plasticity in the cortex and the striatum that depends on DA representing TD-RPE.

Results

Consideration of the value-RNN with fixed random feedback

We considered an implementation of the value-RNN in the cortico-basal ganglia circuits (Fig. 1). A cortical region/population is supposed to represent information of sensory observation (o) and send it to another cortical region/population, which has rich recurrent connections and therefore can be approximated by an RNN. Activities of neurons in the RNN (x) are supposed to learn to represent states, through updates of the strengths of recurrent connections A and feed-forward connections B. The activity of a population of striatal neurons that receive inputs from the RNN is supposed to learn to represent the state values (v), by learning the weights of the cortico-striatal connections (from the RNN to the striatal neurons) (w), which serve as the value weights. DA neurons in the ventral tegmental area (VTA) receive (direct and indirect) inputs from the striatum and from other structures conveying information of obtained reward (r), and thereby the activity of the DA neurons, as well as released DA, represents TD-RPE (δ). TD-RPE-representing DA is released in the striatum and also in the cortical RNN through mesocorticolimbic projections, and is used for modifying the strengths of the cortical recurrent and feed-forward connections (A and B) and the cortico-striatal connections (w).
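The following minimal sketch (in Python; not the authors' code) illustrates this mapping of circuit elements onto model variables and a single forward step. The specific function and variable names, and the indexing convention for the reward in the TD-RPE, are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_obs, gamma = 7, 2, 0.8             # RNN units, observation dimensions (cue, reward), discount factor

A = rng.standard_normal((n, n))         # intra-cortical recurrent connections
B = rng.standard_normal((n, n_obs))     # feed-forward connections from the observation-coding population
w = np.zeros(n)                         # cortico-striatal connections (value weights)
x = rng.standard_normal(n)              # activities of the RNN ('cortical') neurons

def f(u):
    """Sigmoidal input-output function bounded between -0.5 and 0.5."""
    return 1.0 / (1.0 + np.exp(-u)) - 0.5

def forward(x, o):
    """One time step: cortical state update and striatal value readout."""
    x_next = f(A @ x + B @ o)           # x(t+1) = f(A x(t) + B o(t))
    v_next = w @ x_next                 # v(t+1) = w^T x(t+1): activity of the striatal population
    return x_next, v_next

# DA activity as TD-RPE (assumed convention: delta(t) = r(t+1) + gamma*v(t+1) - v(t))
o_cue, r_next = np.array([1.0, 0.0]), 0.0
v = w @ x
x, v_next = forward(x, o_cue)
delta = r_next + gamma * v_next - v
```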

Implementation of the value-RNN in the cortico-basal ganglia-DA circuits.

In the original value-RNN 2, 13, the update rule for the connections onto the RNN units (A and B) requires the (gradually changing) value weights (w), but this is biologically implausible because the cortico-striatal synaptic strengths are not available in the cortex, as discussed above. Therefore, we considered a modified value-RNN in which the cortico-striatal weights used in the updates of the intra-cortical connections were replaced with fixed random strengths (c). Moreover, the original value-RNN adopted a learning rule called backpropagation through time (BPTT) 44, in which the error in the output needs to be accumulated incrementally in temporally backward order; such acausality is also biologically implausible, as previously pointed out 42. Therefore, we instead used an online learning rule, which considers only the influence of the recurrent weights at the previous time step (see the Methods for details and equations).

Simulation of a Pavlovian cue-reward association task with variable inter-trial intervals

We compared the learning of the modified value-RNN with fixed random feedback (referred to as VRNNrf) and the value-RNN with backprop (referred to as VRNNbp), both of which adopted the online learning rule rather than BPTT, as well as an untrained RNN. The number of RNN units was set to 7 in all cases. Traditional TD-RL agents with punctate state representation (called the complete serial compound, CSC 3, 45) were also compared. We simulated a Pavlovian cue-reward association task, in which a cue was followed by a reward three time-steps later, and the inter-trial interval (i.e., reward to next cue) was randomly chosen from 4, 5, 6, or 7 time-steps (Fig. 2A). In this task, states can be defined by relative timings from the cue, and we estimated the true state values through simulations according to the definition of state value, i.e., expected cumulative discounted future rewards 46 (Fig. 2B, black line). Expected TD-RPE calculated from these estimated true values (Fig. 2B, red line) was almost 0 at every state, as expected. The agent having punctate/CSC state representation and state values without continuation between trials (i.e., the value of the last state in a trial was not updated by TD-RPE upon entering the next trial) developed positive values between cue and reward, and an abrupt TD-RPE upon cue (Fig. 2C). The agent having punctate/CSC state representation and continuously updated state values across trials developed positive values also for states in the inter-trial interval (Fig. 2D). VRNNbp developed state values between cue and reward, and to some extent in the inter-trial interval, and showed an abrupt TD-RPE upon cue and a smaller TD-RPE upon reward (Fig. 2E). This indicates that this agent largely learned the task structure, confirming the previously proposed effectiveness of the value-RNN in this task, which differed from those of the original study.
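As a reference for the punctate/CSC agents, the sketch below (our own illustrative code, with assumed indexing conventions) shows TD(0) learning on this task with a one-hot timing representation and without continuation between trials; it develops positive values between cue and reward and near-zero values in the inter-trial interval, as in Fig. 2C.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, n_trials = 0.8, 0.1, 1000
V = np.zeros(10)                      # punctate/CSC: one value per within-trial timing (1..10)

for _ in range(n_trials):
    T = rng.choice([7, 8, 9, 10])     # trial length (reward-to-next-cue interval of 4-7 steps)
    reward = np.zeros(T)
    reward[3] = 1.0                   # cue at timing index 0, reward three steps later
    # TD(0) updates within the trial; the last state's value is not updated
    # upon entering the next trial ('no continuation between trials')
    for t in range(T - 1):
        delta = reward[t + 1] + gamma * V[t + 1] - V[t]
        V[t] += alpha * delta

print(np.round(V, 2))  # values are positive from the cue state to the pre-reward state and ~0 elsewhere
```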

Simulation of a Pavlovian cue-reward association task. (A) Simulated task with variable inter-trial intervals. (B) Black line: Estimated true values of states, defined by relative timings from the cue, through simulations according to the definition of state value, i.e., expected cumulative discounted future rewards. Red line: TD-RPEs calculated from the estimated true state values. (C-G) State values (black lines) and TD-RPEs (red lines) at 1000-th trial, averaged across 100 simulations (error-bars indicating ± SEM across simulations), in different types of agent: (C) TD-RL agent having punctate/CSC state representation and state values without continuation between trials (i.e., the value of the last state in a trial was not updated by TD-RPE upon entering the next trial); (D) TD-RL agent having punctate/CSC state representation and continuously updated state values across trials; (E) Value-RNN with backprop (VRNNbp). The number of RNN units was 7 (same applied to (F,G)); (F) Value-RNN with fixed random feedback (VRNNrf); (G) Agent with untrained RNN. (H) State values at 1000-th trial in individual simulations of VRNNbp (top), VRNNrf (middle), and untrained RNN (bottom). (I) Histograms of the value of the pre-reward state (i.e., the state one time-step before the reward state) at 1000-th trial in individual simulations of the three models. The vertical black dashed lines indicate the true value of the pre-reward state (estimated through simulations). (J) Learning performance of VRNNbp (red line), VRNNrf (blue line), and the untrained RNN (gray line) when the number of RNN units was varied from 5 to 40 (horizontal axis). Learning performance was measured by the sum of squares of differences between the state values developed at 1000-th trial by each of these three types of agent and the estimated true state values between cue and reward (vertical axis), averaged over 100 simulations (error-bars indicating ± SEM across simulations).

VRNNrf, having fixed random feedback instead of backprop-based feedback, developed state values that were largely similar to, although smaller (on average across simulations) than, those developed by VRNNbp (Fig. 2F black line). VRNNrf generated abrupt TD-RPEs upon cue and reward, again similarly to VRNNbp, although the relative size of the reward response was (on average) larger (Fig. 2F red line). As a comparison, the agent with untrained RNN developed (on average) even smaller state values and an even larger relative size of TD-RPE upon reward (Fig. 2G). These results indicate that the value-RNN could be trained by fixed random feedback at least to a certain extent, although somewhat less effectively (as might be expected) than by backprop-based feedback. Figure 2H shows state values developed in individual simulations of VRNNbp (top), VRNNrf (middle), and untrained RNN (bottom), and Figure 2I shows the histograms of the value of the pre-reward state (i.e., one time-step before the state where reward was obtained) developed in individual simulations of these three models. These figures indicate that VRNNrf did not simply develop moderately smaller state values than VRNNbp in every simulation. Rather, state values developed in VRNNrf were largely comparable to those developed in VRNNbp once they were successfully learned, but the success rate was lower than that of VRNNbp, while still higher than that of the untrained RNN.

So far, we examined the cases where the number of RNN units was 7. We then compared the learning performance of VRNNbp, VRNNrf, and untrained RNN when the number of RNN units was varied from 5 to 40. Learning performance was measured by the sum of squares of differences between the state values developed by each of these three types of agents and the estimated true state values (Fig. 2B) between cue and reward. As shown in Fig. 2J, on average across simulations, VRNNbp generally achieved the highest performance, but VRNNrf also exhibited largely comparable performance and always outperformed the untrained RNN. As the number of RNN units increased from 5 to 15, all three agents improved their performance, while additional increases to 20 or 25 units resulted in smaller changes. Further increase of RNN units caused a decrease in the mean performance of all three agents, and when the number of RNN units was increased to 45, there were occasions where learning appeared to diverge. We will discuss these points in the Discussion.

Occurrence of feedback alignment and an intuitive understanding of its mechanism

We asked whether feedback alignment underlay the learnability of VRNNrf. Returning to the case with 7 RNN units, we examined whether the value weight vector w became aligned to the random feedback vector c in VRNNrf, by looking at the changes in the angle between these two vectors across trials. As shown in Fig. 3A, this angle, averaged across simulations, decreased over trials, indicating that the value weight w indeed tended to become aligned to the random feedback c. We then examined whether better alignment of w to c related to better development of state value, by looking at the relation between the w-c angle and the value of the pre-reward state at the 1000-th trial. As shown in Fig. 3B, there was a negative correlation such that the smaller the angle was (i.e., the more aligned), the larger the state value tended to be (r = −0.288, p = 0.00362), in line with our expectation. These results indicate that the feedback alignment mechanism, previously shown to work for supervised learning, also worked for TD learning of value weights and recurrent/feed-forward connections.
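For concreteness, the alignment can be quantified as follows (a trivial helper of our own, not taken from the authors' code):

```python
import numpy as np

def angle_deg(w, c):
    """Angle (in degrees) between the value-weight vector w and the fixed feedback vector c;
    90 degrees corresponds to chance-level (orthogonal) alignment."""
    cos = np.dot(w, c) / (np.linalg.norm(w) * np.linalg.norm(c))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```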

Occurrence of feedback alignment and an intuitive understanding of its mechanism. (A) Over-trial changes in the angle between the value-weight vector w and the fixed random feedback vector c in the simulations of VRNNrf (7 RNN units). The solid line and the dashed lines indicate the mean and ± SD across 100 simulations, respectively. (B) The relation between the angle between w and c (horizontal axis) and the value of the pre-reward state (vertical axis) at the 1000-th trial. The dots indicate the results of individual simulations, and the line indicates the regression line. (C) Angle between the fixed random feedback vector c and the hypothetical change in x(t) = f(Ax(t−1), Bo(t−1)) that would occur if A and B were replaced with their updated versions, multiplied by the sign of TD-RPE (sign(δ(t))), across time-steps. The black thick line and the gray lines indicate the mean and ± SD across 100 simulations, respectively (same applied to (D)). (D) Multiplication of TD-RPEs in successive trials at individual states (top: cue, 4th from the top: reward). A positive or negative value indicates that TD-RPEs in successive trials had the same or different signs, respectively. (E) Left: RNN trajectories mapped onto the primary and secondary principal components (horizontal and vertical axes, respectively) in three successive trials (red, blue, and green lines (heavily overlapped)) at different phases in an example simulation (10th-12th, 300th-302nd, 600th-602nd, and 900th-902nd trials from top to bottom). The crosses and circles indicate the cue and reward states, respectively. Right: State values (black lines) and TD-RPEs (red lines) at the 11th, 301st, 601st, and 901st trial.

How did the feedback alignment mechanistically occur? We attempted to obtain an intuitive understanding. Assume that a positive TD-RPE (δ(t) > 0) is generated at a state, S (= x(t)), in a task trial. Because of the update rule for w (w ← w + aδ(t)x(t)), w is updated in the direction of x(t). Next, what is the effect of the updates of the recurrent/feed-forward connections (A and B) on x? For simplicity, here we consider the case where observation is null (o = 0) and so x(t) = f(Ax(t−1)) holds (a similar argument can be made when observation is not null). If A is replaced with its updated version, it can be calculated that the i-th element of Ax(t−1) will hypothetically change by ci × (a positive value) (technical note: the value is aδ(t){Σj xj(t−1)²}(0.5 + xi(t))(0.5 − xi(t)), which is positive unless x(t−1) = 0), and therefore the vector Ax(t−1) as a whole will hypothetically change by a vector that is in a relatively close angle with c (in the sense that, for example, [c1 c2 c3]T and [0.5c1 1.2c2 0.8c3]T are in a relatively close angle, in the same quadrant). Then, because f is a monotonically increasing sigmoidal function, x(t) = f(Ax(t−1)) will also hypothetically change by a vector that is in a relatively close angle with c. This was indeed the case in our simulations, as shown in Fig. 3C.
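This claim can be checked numerically, as in the sketch below (our own code; the update rule for A is the VRNNrf rule given in the Methods, and the values of a and δ(t) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, a, delta = 7, 0.1, 0.5                      # RNN size, learning rate, an assumed positive TD-RPE
f = lambda u: 1.0 / (1.0 + np.exp(-u)) - 0.5   # sigmoid bounded in (-0.5, 0.5)

A = rng.standard_normal((n, n))
c = rng.standard_normal(n)                     # fixed random feedback
x_prev = rng.uniform(-0.5, 0.5, n)             # x(t-1)
x = f(A @ x_prev)                              # x(t) = f(A x(t-1)), null observation

# VRNNrf update of A: dA_ij = a * delta * c_i * (0.5 + x_i)(0.5 - x_i) * x_prev_j
dA = a * delta * np.outer(c * (0.5 + x) * (0.5 - x), x_prev)

# Hypothetical change in A x(t-1) if A were replaced with its updated version
d = (A + dA) @ x_prev - A @ x_prev
print(d / c)   # each element equals a*delta*(0.5+x_i)(0.5-x_i)*sum_j x_prev_j^2 > 0

cos = d @ c / (np.linalg.norm(d) * np.linalg.norm(c))
print(np.degrees(np.arccos(cos)))   # well below 90 degrees: the change is roughly aligned with c
```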

In this way, at state S where TD-RPE is positive, w is updated in the direction of x(t), and x(t) will hypothetically change by a vector that is in a relatively close angle with c if A is replaced with its updated one. Then, if the update of w and the hypothetical change in x(t) due to the update of A could be integrated, w would become aligned to c (if TD-RPE is instead negative, w is updated in the opposite direction of x(t), and x(t) will hypothetically change by a vector that is in a relatively close angle with −c, and so the same story holds in the end).

There is, however, a caveat regarding how the update of w and the hypothetical change in x(t) can be integrated. Although technical, we briefly describe it here, together with a possible resolution. The updates of w and A use TD-RPE, which is calculated based on v(t) = wT x(t) and v(t+1) = wT x(t+1), and so x(t) and x(t+1) should already be determined beforehand. Therefore, the hypothetical change in x(t) due to the update of A, described above, does not actually occur (this is why we called it 'hypothetical') and thus cannot be integrated with the update of w. Nevertheless, integration could still occur across successive trials, at least to a certain extent. Specifically, although TD-RPEs at S in successive trials would generally differ from each other, they would still tend to have the same sign, as was indeed the case in our simulations (Fig. 3D). Also, although the trajectories of RNN activity (x) in successive trials would differ, a certain level of similarity could be expected because the RNN is entrained by the observation-representing inputs, again as was indeed the case in our example simulation (Fig. 3E). Then, the hypothetical change in x(t) due to the update of A, considered above, could become a reality in the next trial, to a certain extent, and could thus be integrated into the update of w, explaining the occurrence of feedback alignment.

Simulation of tasks with probabilistic structures of reward timing/existence

We also simulated two tasks (Fig. 4A) that were qualitatively similar to (though simpler than) the two tasks examined in previous experiments 1 and modeled by the original value-RNN with backprop 2. In our task 1, a cue was always followed by a reward either two or four time-steps later with equal probabilities. Task 2 was the same as task 1 except that reward was omitted with 40% probability. In task 1, if reward was not given at the early timing (i.e., two steps after the cue), the agent could predict that reward would be given at the late timing (i.e., four steps after the cue), and thus TD-RPE upon reward at the late timing is expected to be smaller than TD-RPE upon reward at the early timing (if the agent perfectly learned the task structure, TD-RPE upon reward at the late timing should be 0). By contrast, in task 2, if reward was not given at the early timing, this might indicate that reward would be given at the late timing, but might instead indicate that reward was omitted in that trial; thus TD-RPE upon reward at the late timing is expected to exist and can even be larger than TD-RPE upon reward at the early timing.

Simulation of two tasks having probabilistic structures, which were qualitatively similar to the two tasks examined in experiments 1 and modeled by value-RNN 2. (A) The two simulated tasks, in which reward was given at the early or the late timing with equal probabilities in all the trials (task 1) or 60% of trials (task 2). (B) (a) Top: Trial types. Two trial types (with early reward and with late reward) in task 1 and three trial types (with early reward, with late reward, and without reward) in task 2. Bottom: Value (expected discounted cumulative future rewards) of each timing in each trial type. (b) Agent's probabilistic belief about the current trial type, in the case where the agent was in fact in the trial with early reward (top row), the trial with late reward (second row), or the trial without reward (third row in task 2). (c) Top: States defined by considering the probabilistic beliefs at each timing from cue. Bottom: State values (expected discounted cumulative future rewards, estimated through simulations), which should theoretically match an integration (multiplication) of the values of each trial type (shown in (a)-bottom) with the probabilistic beliefs (shown in (b)). (C) Expected TD-RPE calculated from the estimated true values of the states for task 1 (left) and task 2 (right). Red lines: case where reward was given at the early timing, blue lines: case where reward was given at the late timing. (D-H) TD-RPEs at the latest trial within 1000 trials in which reward was given at the early timing (red lines) or the late timing (blue lines), averaged across 100 simulations (error-bars indicating ± SEM across simulations), in the different types of agent: (D,E) TD-RL agent having punctate/CSC state representation and state values without (D) or with (E) continuation between trials; (F) VRNNbp. The number of RNN units was 12 (same applied to (G,H)); (G) VRNNrf; (H) Untrained RNN.

In these tasks, states can be defined in the following way. There were two types of trials, with early or late reward, in task 1, and additionally one more type of trial, without reward, in task 2 (Fig. 4Ba, top). For each timing in each of these trial types, its value, i.e., expected discounted cumulative future rewards, can be estimated through simulations (Fig. 4Ba, bottom). The agent could not know the current trial type until receiving reward at the early timing or the late timing, or receiving no reward at both timings. Until these timings, the agent could have a probabilistic belief about the current trial type, e.g., 50% in the trial with early reward and 50% in the trial with late reward (in task 1), or 30% in the trial with early reward, 30% in the trial with late reward, and 40% in the trial without reward (in task 2) (Fig. 4Bb). States can be defined by incorporating these probabilistic beliefs at each timing (Fig. 4Bc, top), and state values (Fig. 4Bc, bottom: expected discounted cumulative future rewards, estimated through simulations) should theoretically match an integration (multiplication) of the values of each trial type (Fig. 4Ba, bottom) with the probabilistic beliefs (Fig. 4Bb). Expected TD-RPE calculated from these estimated state values (Fig. 4C) exhibited features that matched the conjecture mentioned above: in task 1, TD-RPE upon reception of late reward, which was actually 0, was smaller than TD-RPE upon reception of early reward, whereas in task 2, TD-RPE upon reception of late reward was larger than TD-RPE upon reception of early reward.
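The qualitative pattern can be reproduced with a simplified, within-trial belief-state calculation (our own sketch; unlike the values in Fig. 4B, which were estimated by simulation and include continuation across trials, rewards in subsequent trials are ignored here, and the TD-RPE indexing follows the convention assumed in the Methods):

```python
gamma = 0.8   # time discount factor (Methods); reward size is 1

def rpes(p_early, p_late):
    """Within-trial TD-RPEs upon early vs. late reward under belief-state values."""
    p_none = 1.0 - p_early - p_late
    # value of the state one step before the early-reward timing
    v_pre_early = p_early * 1.0 + p_late * gamma**2
    rpe_early = 1.0 - v_pre_early                      # reward delivered; post-reward value within the trial taken as 0
    # no reward at the early timing: update the belief, then wait for the late timing
    p_late_given_no_early = p_late / (p_late + p_none)
    v_pre_late = p_late_given_no_early * 1.0           # one step before the late timing
    rpe_late = 1.0 - v_pre_late                        # reward delivered at the late timing
    return round(rpe_early, 3), round(rpe_late, 3)

print(rpes(0.5, 0.5))   # task 1: (0.18, 0.0)    -> late-reward RPE is zero, smaller than early
print(rpes(0.3, 0.3))   # task 2: (0.508, 0.571) -> late-reward RPE exceeds early-reward RPE
```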

The previous experimental work 1 showed that VTA DA neurons exhibited activity patterns similar to the abovementioned TD-RPE patterns, and the theoretical work 2 showed that the original value-RNN with backprop could reproduce such TD-RPE patterns. We examined what TD-RPE patterns were generated by the agents with punctate/CSC representation, VRNNbp, VRNNrf, and untrained RNN (12 RNN units in all cases) in our two simulated tasks. VRNNbp developed similar TD-RPE patterns (smaller TD-RPE upon reward at the late than at the early timing in task 1 but the opposite pattern in task 2) (Fig. 4F), qualitatively reproducing the result of the previous work 2. Crucially, VRNNrf also developed similar TD-RPE patterns (Fig. 4G), indicating that this agent with random feedback could also learn the distinct structures of the two tasks. By contrast, the agents with punctate/CSC state representation without or with continuous value update across trials (Fig. 4D,E), as well as the agent with untrained RNN (Fig. 4H), could not develop such patterns well.

Value-RNN with further biological constraints

So far, the activities of neurons in the RNN (x) were initialized to pseudo standard normal random numbers, and thereafter took values in the range between −0.5 and 0.5, the range of the sigmoidal input-output function. The value weights (w) could also take both positive and negative values since no constraint was imposed. The fixed random feedback in VRNNrf (c) was generated by pseudo standard normal random numbers, and so could also be positive or negative. Negative neural activities and value weights could potentially be interpreted as inhibition or as smaller-than-baseline quantities. However, because neuronal firing rates are non-negative and cortico-striatal projections are excitatory, it would be biologically more plausible to assume that the activities of neurons in the RNN and the value weights are non-negative. As for the fixed random feedback, if it is negative, the update rule becomes anti-Hebbian under positive TD-RPE, and so assuming non-negativity would be plausible since a Hebbian property has been suggested for rapid plasticity of cortical synapses 47. There was another issue in the update rule for the recurrent and feed-forward connections, derived from the gradient descent: its dependence on the post-synaptic activity was non-monotonic, maximized at the middle of the activity range. It would be more biologically plausible to assume a monotonic dependence.

In order to address these issues, we considered revised models. We first considered a revised VRNNbp, referred to as revVRNNbp, in which the RNN activities and the value weights were constrained to be non-negative, while the non-monotonic dependence of the update rule on the post-synaptic activity remained unchanged (Fig. 5A). We then considered a revised VRNNrf, referred to as bioVRNNrf, in which the fixed random feedback, as well as the RNN activities and the value weights, were constrained to be non-negative, and also the update rule was modified so that the dependence on the post-synaptic activity became monotonic (with saturation) (Fig. 5B).

Modified value-RNN models with further biological constraints. (A) RevVRNNbp: VRNNbp (value-RNN with backprop) was modified so that the activities of neurons in the RNN (x) and the value weights (w) became non-negative. (B) BioVRNNrf: VRNNrf (value-RNN with fixed random feedback) was modified so that x and w, as well as the fixed random feedback (c), became non-negative, and also so that the dependence of the update rules for the recurrent/feed-forward connections (A and B) on post-synaptic activity became monotonic with saturation.

We examined how these revised models, in comparison with an untrained RNN that also had the non-negative constraint for x and w, performed in the Pavlovian cue-reward association task examined above (the numbers of RNN units and trials were set to 12 and 1500, respectively). RevVRNNbp developed state values that increased toward reward (Fig. 6A). BioVRNNrf also developed state values to a largely comparable extent (Fig. 6B). By contrast, the untrained RNN could not develop such a pattern of state values (Fig. 6C). This, however, could be because the initially set recurrent/feed-forward connections were far from those learned in the value-RNNs. Therefore, as a stricter control, we conducted simulations of an untrained RNN with non-negative x and w, in which in each simulation the recurrent/feed-forward connections were set to a shuffled version of the connections learnt in a simulation of bioVRNNrf. The untrained RNN with this setting performed somewhat better than the original untrained RNN (Fig. 6D), but still worse than revVRNNbp and bioVRNNrf. We varied the number of RNN units, and compared the performance (sum of squared errors from the true state values) of revVRNNbp and bioVRNNrf with that of the untrained RNN (both the naive one and the one with shuffled learnt connections from bioVRNNrf). As shown in Fig. 6E, regardless of the number of RNN units, the performance of bioVRNNrf was largely comparable to that of revVRNNbp, and better than the performance of both kinds of untrained RNN. Figure 6F shows the mean of the elements of the recurrent and feed-forward connections at the 1500-th trial in the different models. As shown in the figure, these connections (initialized to pseudo standard normal random numbers) came, through learning, to be negative on average, in revVRNNbp and more prominently in bioVRNNrf. This learnt negative (inhibition) dominance could possibly be related, e.g., through prevention of excessive activity, to the good performance of bioVRNNrf, and also to the better performance of the untrained RNN with connections shuffled from bioVRNNrf compared with the naive untrained RNN.

Performances of the modified value-RNN models in the cue-reward association task, in comparison with untrained RNN that also had the non-negative constraint. (A-D) State values (black lines) and TD-RPEs (red lines) at 1500-th trial in revVRNNbp (A), bioVRNNrf (B), untrained RNN with x and w constrained to be non-negative (C), and untrained RNN with non-negative x and w and having connections shuffled from those learnt in bioVRNNrf (D). The number of RNN units was 12 in all the cases. Error-bars indicate mean ± SEM across 100 simulations (same applied to (E,F)). The right histograms show the across-simulation distribution of the value of the pre-reward state in each model. The vertical black dashed lines in the histograms indicate the true value of the pre-reward state (estimated through simulations). (E) Learning performance of revVRNNbp (red line), bioVRNNrf (blue line), untrained RNN (gray solid line: partly out of view), and untrained RNN with connections shuffled from those learnt in bioVRNNrf (gray dotted line) when the number of RNN units was varied from 5 to 40 (horizontal axis). Learning performance was measured by the sum of squares of differences between the state values developed at 1500-th trial by each of these four types of agent and the estimated true state values between cue and reward (vertical axis). (F) Mean of the elements of the recurrent and feed-forward connections (at 1500-th trial) of revVRNNbp (red line), bioVRNNrf (blue line), and untrained RNN (gray solid line).

We examined how the angle between the value weights (w) and the random feedback (c) changed across trials in bioVRNNrf. As shown in Fig. 7A, the angle was on average smaller than the chance-level angle (90°) from the beginning, while there was no further alignment over trials. This could be understood as follows. Because both the value weights (w) and the random feedback (c) were now constrained to be non-negative, these two vectors were ensured to be in a relatively close angle (i.e., in the same quadrant) from the beginning. By virtue of this loose alignment, the random feedback could act similarly to backprop-derived proper feedback, even without further alignment. We examined whether the angle between w and c at the 1500-th trial was associated with the developed value of the pre-reward state across simulations, but found no association (r = 0.0117, p = 0.908) (Fig. 7B). We then examined whether the w-c angle at earlier trials (2nd - 500-th trials) was associated with the developed values at the 500-th trial, with the number of simulations increased to 1000 so that a small correlation could be detected. We found that the w-c angle at initial trials (2nd - around 10-th trials) was negatively correlated with the developed values of the reward state and the preceding states at the 500-th trial (Fig. 7C). For the reward state, a negative correlation at around the 100-th - 300-th trial was also observed. These results suggest that better alignment of w and c at initial and early timings was associated with better development of state values, in line with the conjecture that the loose alignment of w and c coming from the non-negative constraint supported learning. It should be noted, however, that there were cases where a positive (although small) correlation was observed. The exact reason for this is unclear, but it could be related to the fact that large developed values, or fast value development, do not necessarily indicate good learning.
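The chance-level and constrained angles can be illustrated numerically (our own sketch; the sampling distributions are assumptions chosen to match the initializations described in the Methods):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_pairs = 12, 10000

def mean_angle(sampler):
    """Mean angle (degrees) between pairs of randomly sampled n-dimensional vectors."""
    a, b = sampler((n_pairs, n)), sampler((n_pairs, n))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()

print(mean_angle(rng.standard_normal))              # ~90 deg: unconstrained vectors (chance level)
print(mean_angle(lambda s: rng.uniform(0, 1, s)))   # ~40 deg: non-negative vectors start loosely aligned
```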

Loose alignment of the value weights (w) and the random feedback (c) in bioVRNNrf (with 12 RNN units), and its relation to the developed state values. (A) Over-trial changes in the angle between the value weights w and the fixed random feedback c. The solid line and the dashed lines indicate the mean and ± SD across 100 simulations, respectively. (B) Relation between the w-c angle (horizontal axis) and the value of the pre-reward state (vertical axis) at 1500-th trial. The dots indicate the results of individual simulations. (C) Correlation between the w-c angle at the k-th trial (horizontal axis) and the value of the cue, post-cue, pre-reward, or reward state (top-bottom panels) at 500-th trial across 1000 simulations. The solid lines indicate the correlation coefficient, and the short vertical bars at the top of each panel indicate the cases in which the p-value was less than 0.05.

We further examined how the revised value-RNN models performed in the two tasks with probabilistic structures examined above. Since the revised value-RNN models with 12 RNN units appeared unable to produce the different patterns of TD-RPEs in the two tasks (TD-RPE at early reward > TD-RPE at late reward in task 1 and the opposite pattern in task 2), we increased the number of RNN units to 20. Then, both revVRNNbp and bioVRNNrf produced such TD-RPE patterns (Fig. 8A,B), whereas untrained RNNs of both kinds (naive, and with connections shuffled from bioVRNNrf) could not (Fig. 8C,D). This indicates that the value-RNN with random feedback and further biological constraints could learn the differential characteristics of the tasks.

Performances of the modified value-RNN models in the two tasks having probabilistic structures, in comparison with untrained RNN having the non-negative constraint. TD-RPEs at the latest trial within 2000 trials in which reward was given at the early timing (red lines) or the late timing (blue lines) in task 1 (left) and task 2 (right), averaged across 100 simulations (error-bars indicating ± SEM across simulations), are shown for the four types of agent: (A) revVRNNbp; (B) bioVRNNrf; (C) untrained RNN with non-negative x and w; (D) untrained RNN with non-negative x and w and having connections shuffled from those learnt in bioVRNNrf. The number of RNN units was 20 for all the cases.

Discussion

We have shown that state representation and value can be learned in the RNN and its downstream by using random feedback instead of the backprop-derived, biologically unavailable downstream weights. In the model without the non-negative constraint, feedback alignment, previously shown for supervised learning, occurred, and we have presented an intuitive understanding of its mechanism. In the model with the non-negative constraint, loose alignment existed from the beginning because of the constraint, and it appeared to support learning. Below we discuss the implementation of the value-RNN with random feedback, pointing to a crucial role of DA outside of the striatum, and also the heterogeneity of DA signals. We further discuss limitations, relations to other proposals and suggestions, and future perspectives.

Implementation of the value-RNN with random feedback, featuring a role of DA outside of the striatum

DA neurons in the midbrain project not only to the striatum but also to the cortex, including the prefrontal cortex 48 and the hippocampus 49. Previous studies demonstrated a crucial role of prefrontal DA in working memory 50, 51, presumably through effects on synaptic/ionic conductances 52, 53. Roles of prefrontal DA in behavioral flexibility or decision making have also been suggested 54. Moreover, a role of hippocampal DA in the modulation of aversive memory formation has been demonstrated 55. However, although i) there has been increasing evidence that DA represents TD-RPE 56, ii) human fMRI experiments found TD-RPE correlates in cortical regions 57, iii) DAergic modulation or initiation of plasticity in the prefrontal cortex 58 or the hippocampus 59 has been demonstrated, and iv) lesion or inactivation of prefrontal or hippocampal regions was found to disrupt DA's encoding of RPE reflecting appropriate state representation 60-62, what computational role in RL is played by TD-RPE-representing DA in the cortex remains to be clarified. This lags behind the case of the striatum, where it has been widely considered that DAergic modulation of cortico-striatal synaptic weights implements TD-RPE-based updates of state/action values 8, 63.

The value-RNN with fixed random feedback and biological constraints considered in the present work suggests the possibility that TD-RPE-representing DA modulates plasticity of the RNN in the cortex so that state representation can be learnt. Different from the original value-RNN with backprop 2, 13, the update of intra-cortical connections does not require the downstream cortico-striatal weights but requires only a non-negative fixed random feedback strength specific to each post-synaptic neuron. The non-negativity was assumed so that the update rule became Hebbian under positive TD-RPE, since a Hebbian property has been suggested for rapid plasticity of cortical synapses 47. The fixed randomness would naturally be achieved by intrinsic heterogeneity of neurons. The successful learning performance of our model thus indicates that DA-dependent modulation of Hebbian plasticity of cortical excitatory connections can serve for learning of state representations that capture task structure.

VTA DA neurons also project to regions other than the striatum and cortex, including the basolateral amygdala (BLA) 64, and DA has been suggested to regulate plasticity also in the BLA 65. Recent work 66 demonstrated that VTA→BLA DA exhibited properties of TD-RPE, although it increased rather than decreased upon aversive events, and was not itself reinforcing but was necessary and sufficient for the formation of an environmental model. The BLA has recurrent connections 67, projects to the striatum 68, 69, and engages in abstract context representation together with the prefrontal cortex 70. Thus, given that environmental relationships needed for goal-directed behavior could be embedded in state representation 11-13, it seems possible that a mechanism partly akin to the learning of state representation (but not value) in the RNN of our model takes place in the BLA. It remains open, however, whether and how such sophisticated representation can be learned. It might require multidimensional errors 71 beyond TD-RPE, and/or multi-compartment units 26, both of which we further discuss below.

DA’s encoding of TD-RPE and other variables

Many results have suggested heterogeneity of DA signals. Recent work 72 suggested that different origins of this heterogeneity (co)exist: (i) heterogeneity of the learning target (reward or other), (ii) heterogeneity of state features, and (iii) others, such as ramping patterns. (i) is typically observed in DA neurons projecting to different regions, which can represent prediction errors of things other than reward. In contrast, (ii) applies to DA neurons projecting to the same region: even though individual DA neurons show heterogeneous responses, the resulting merged DA signal still represents a scalar error such as TD-RPE.

Referring to a result 73 of type (i) and to the fact that DA neurons receive inputs from the cerebellum 74, 75, which supposedly implements supervised learning 76, a recent modeling work 43 proposed that DA neurons convey vector-valued error signals, which are used for supervised learning of actions in continuous space. That work showed that learning occurred without adjustment of DA projection strengths because the feedback alignment mechanism worked. In contrast, in the present work, we assumed a scalar TD-RPE, which can be consistent with type (ii) heterogeneity of DA signals. We have shown that the feedback alignment mechanism works also for RL, and moreover, that learning could also occur by virtue of the loose alignment coming from the biological constraints, even without the operation of feedback alignment. Notably, the previous model 43 and our model can coexist, given that different DA neuronal populations may encode vector-valued error and TD-RPE, or even the same single DA neuron might represent both errors depending on the context, reflecting which inputs are active.

Limitations and possible reasons

We have shown that state representation and value could be learned in the value-RNN with fixed random feedback, with a relatively small number of simple RNN units and observation inputs, in simple simulated tasks. These simplicities enabled us to derive an intuitive understanding of how the feedback alignment could occur. However, in our models without the non-negativity constraint, the performance initially improved as the number of RNN units increased, but then degraded when the number of units increased beyond around 25. In contrast, in the original value-RNN with backprop 2, 13, the ability to develop belief-state-like representation was reported to improve as the number of RNN units increased to 100 or 50.

There are several possible reasons for this difference. First, as a performance measure we used the sum of squared errors between the values developed by the value-RNN and the values estimated according to the definition (expected discounted cumulative future rewards) between cue and reward, whereas the previous studies focused on the similarity between the representation developed by the value-RNN and handcrafted belief states. Second, there was a difference in the weight-update method. Specifically, as mentioned in the Results, the previous studies used BPTT 44, which considers the recursive influence of the recurrent weights in a way that lacks causality, whereas our models used an online learning rule, which considers only the influence of the recurrent weights at the previous time step.

Last but not least, there was a difference in the RNN unit. Specifically, we used a simple sigmoidal function, whereas the previous studies used the 'Gated Recurrent Unit (GRU) cell' 77. RNNs with simple nonlinear units are known to suffer from the 'vanishing gradient problem' 78: through repetitive learning, the gradient of the loss function becomes so small that the update becomes negligible. This issue can be alleviated by using an RNN unit having a memory, such as the Long Short-Term Memory (LSTM) unit 79. The GRU cell was suggested to have a memory function similar to the LSTM unit 77. We focused on resolving the biological implausibility of backprop, and stuck to the simple sigmoidal unit. However, gated units similar to the LSTM unit have actually been proposed to be implemented in cortical microcircuits 80, and incorporating features of real neurons into the value-RNN could enhance its computational power, as we discuss below.

Biological details and future perspectives

Our RNN unit did not incorporate neuronal spiking and its effects on plasticity 38, 81, 82, or neuronal morphology with nonlinear dendritic computations 41, 83, 84. Importantly, recent studies suggest that dendritic mechanisms 34, 35, potentially together with burst-dependent plasticity 38, 39, can realize credit assignment without backprop in supervised learning, and also in unsupervised learning 85, 86. Dendritic mechanisms have their own specific features, or constraints, so incorporating them is different from simply increasing the number of layers of a neural network, and it has been argued 41 that adding such biological constraints enables learning in deep neural networks. Moreover, a recent model of the hippocampus 26 has shown that a network of multi-compartment units could learn complex representations. Given these, it would be interesting to explore whether incorporating biological details into the RNN unit can improve the performance of the value-RNN.

A different alternative to backprop is the Associative Reward-Penalty (AR-P) algorithm 87-89, in which the hidden units behave stochastically, and thereby the gradient can be estimated, in effect, through stochastic sampling without explicit information about the downstream weights. More recent work 90 demonstrated that noise-induced learning of back projections could achieve better alignment and performance compared with fixed random feedback in a feed-forward network. These mechanisms could be biologically implemented, because neurons and neural networks can exhibit noisy or chaotic behavior 91-93, and could potentially improve the performance of the value-RNN.

Regarding connectivity, in our models the recurrent/feed-forward connections could take both positive and negative values. This could be justified because there are both excitatory and inhibitory connections in the cortex, and the net connection sign between two units can be positive or negative depending on whether excitation or inhibition dominates. However, recent studies have shown that feed-forward and recurrent neural networks conforming to Dale's law can perform well depending on the architecture, initialization, and update rules 94, 95. Integration of these models and ours, also with other connectivity features 96, may be a fruitful direction.

More specific to the cortico-basal ganglia circuit, the existence of D1/D2 DA receptors and of the D1-direct and D2-indirect basal ganglia pathways 97-100, as well as distinct cortical areas and cell types 101-104, was also not incorporated. Furthermore, the circuit/synaptic mechanisms of how TD-RPE is calculated in DA neurons (c.f., 105, 106) and/or how it can be learned (c.f., 107) were left unspecified. Future studies are expected to incorporate these factors.

Methods

Value-RNN with backprop (VRNNbp)

We constructed a value-RNN model based on the previous proposals 2, 13 but with several differences. We assumed that the activities of neurons in the RNN at time t+1 were determined by the activities of these neurons and neurons representing observation (cue, reward, or nothing) at time t:

x(t+1) = f(Ax(t) + Bo(t)),

where f was an element-wise sigmoidal function, f(u) = 1/(1 + exp(−u)) − 0.5, whose outputs ranged between −0.5 and 0.5. The state value was estimated as a weighted sum of the RNN activities:

v(t) = wT x(t) = Σj wj xj(t),

where

w = (w1, ..., wn)T

were the value weights. The error between this estimated value and the true value, vtrue(t), was defined as:

ε(t) = vtrue(t) − v(t).

Parameters wj, Aij, and Bik that minimize the squared error ε(t)² could be found by a gradient descent / error-backpropagation (backprop) method, i.e., by updating them in the directions of −∂(ε(t)²)/∂wj, −∂(ε(t)²)/∂Aij, and −∂(ε(t)²)/∂Bik. −∂(ε(t)²)/∂wj was calculated as follows:

−∂(ε(t)²)/∂wj = 2ε(t) ∂v(t)/∂wj
= 2ε(t) xj(t)
≈ 2δ(t) xj(t)

In the last line, since ε(t) was unavailable as vtrue(t) was unknown, it was approximated by the TD-RPE:

δ(t) = r(t+1) + γv(t+1) − v(t),

where r(t+1) was the reward obtained at the next time step and γ was the time discount factor.

−∂(ε(t)²)/∂Aij was calculated as follows, considering (as an online approximation) only the influence of Aij at the previous time step:

−∂(ε(t)²)/∂Aij = 2ε(t) ∂v(t)/∂Aij
= 2ε(t) wi ∂xi(t)/∂Aij
= 2ε(t) wi (0.5 + xi(t))(0.5 − xi(t)) xj(t−1)
≈ 2δ(t) wi (0.5 + xi(t))(0.5 − xi(t)) xj(t−1)

Similarly, −∂(ε(t)²)/∂Bik was calculated as follows:

−∂(ε(t)²)/∂Bik ≈ 2δ(t) wi (0.5 + xi(t))(0.5 − xi(t)) ok(t−1)

According to these, the update rule for the value-RNN was determined as follows:

wj ← wj + aδ(t) xj(t)
Aij ← Aij + aδ(t) wi (0.5 + xi(t))(0.5 − xi(t)) xj(t−1)
Bik ← Bik + aδ(t) wi (0.5 + xi(t))(0.5 − xi(t)) ok(t−1)

where a was the learning rate. In each simulation, the elements of A and B, as well as the elements of x, were initialized to pseudo standard normal random numbers, and the elements of w were initialized to 0.
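For illustration, the following sketch (in Python; not the authors' MATLAB code, and using the reward-indexing convention assumed above) implements one online update step of VRNNbp:

```python
import numpy as np

def f(u):
    """Element-wise sigmoid bounded between -0.5 and 0.5."""
    return 1.0 / (1.0 + np.exp(-u)) - 0.5

def vrnnbp_step(A, B, w, x_prev, o_prev, o, r_next, a=0.1, gamma=0.8):
    """One online VRNNbp update. x_prev and o_prev are x(t-1) and o(t-1); o and r_next are
    the observation and reward at the next time step. Returns updated parameters, x(t), and delta(t)."""
    x = f(A @ x_prev + B @ o_prev)                # x(t)
    x_next = f(A @ x + B @ o)                     # x(t+1), needed for the TD-RPE
    delta = r_next + gamma * (w @ x_next) - (w @ x)
    g = (0.5 + x) * (0.5 - x)                     # sigmoid derivative evaluated at x(t)
    A = A + a * delta * np.outer(w * g, x_prev)   # backprop feedback: uses the value weights w
    B = B + a * delta * np.outer(w * g, o_prev)
    w = w + a * delta * x                         # cortico-striatal (value-weight) update
    return A, B, w, x, delta
```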

Value-RNN with fixed random feedback (VRNNrf)

We considered an implementation of the value-RNN described above in the cortico-basal ganglia-DA system (Fig. 1):

x : activities of neurons in a cortical region with rich recurrent connections

A : recurrent connection strengths among x

o : activities of neurons in a cortical region processing sensory inputs

B : feed-forward connection strengths from o to x

f : sigmoidal relationship from the input to the output of the cortical neurons

w : connection strengths from cortical neurons x to a group of striatal neurons

v : activity of the group of striatal neurons

δ : activity of a group of DA neurons / released DA

The update rule for w,

wj ← wj + aδ(t) xj(t),

could be naturally implemented as cortico-striatal synaptic plasticity, which depends on DA (δ(t)) and pre-synaptic (cortical) neuronal activity (xj(t)). However, an issue emerged in implementation of the update rules for A and B:

Aij ← Aij + aδ(t) wi (0.5 + xi(t))(0.5 − xi(t)) xj(t−1)
Bik ← Bik + aδ(t) wi (0.5 + xi(t))(0.5 − xi(t)) ok(t−1)

Specifically, wi, which appears in these update rules (for the strengths of the cortico-cortical synapses Aij and Bik), is the connection strength from cortical neuron xi to the striatal neurons, i.e., the strength of cortico-striatal synapses (located within the striatum), which is considered to be unavailable at the cortico-cortical synapses (located within the cortex).

As mentioned in the Introduction, this is an example of the long-standing difficulty in biological implementation of backprop, and recently a potential solution for this difficulty, i.e., replacement of the downstream connection strengths in the update rule for upstream connections with fixed random strengths, has been demonstrated in supervised learning of feed-forward and recurrent networks 33, 42, 43. The value-RNN, which we considered here, differed from supervised learning considered in these previous studies in two ways: i) it was TD learning, apparent in the approximation of the true error ε(t) by the TD-RPE δ(t) in the derivation described above, and ii) it used a scalar error (TD-RPE) rather than a vector error. But we expected that the feedback alignment mechanism could still work at least to some extent, and explored it in this study. Specifically, we examined a modified value-RNN with fixed random feedback (VRNNrf), in which the update rules for A and B were modified as follows:

Aij ← Aij + aδ(t) ci (0.5 + xi(t))(0.5 − xi(t)) xj(t−1)
Bik ← Bik + aδ(t) ci (0.5 + xi(t))(0.5 − xi(t)) ok(t−1)

where wi in the update rules of the value-RNN with backprop (VRNNbp) was replaced with a fixed random parameter ci. Notably, these modified update rules for the cortico-cortical connections A and B required only pre-synaptic activities (xj(t−1), ok(t−1)), post-synaptic activities (xi(t)), TD-RPE-representing DA (δ(t)), and fixed random strengths (ci), which would all be available at the cortico-cortical synapses given that VTA DA neurons project not only to the striatum but also to the cortex and random ci could be provided by intrinsic heterogeneity. In each simulation, the elements of c were initialized to pseudo standard normal random numbers.
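Continuing the sketch given for VRNNbp above (our own illustrative code), the only change in VRNNrf is the feedback term in the updates of A and B:

```python
import numpy as np

def vrnnrf_update_AB(A, B, c, x_prev, o_prev, x, delta, a=0.1):
    """VRNNrf sketch: same as the VRNNbp updates of A and B, but the downstream value
    weights w are replaced by the fixed random feedback vector c (one element per RNN unit)."""
    g = (0.5 + x) * (0.5 - x)
    A = A + a * delta * np.outer(c * g, x_prev)
    B = B + a * delta * np.outer(c * g, o_prev)
    return A, B
```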

Revised value-RNN models with further biological constraints

In the later part of this study, we examined revised value-RNN models with further biological constraints. Specifically, we considered models in which the value weights and the activities of neurons in the RNN were constrained to be non-negative. In order to do so, the update rule for w was modified to:

wj ← max(wj + aδ(t) xj(t), 0),

where max(q1, q2) returned the maximum of q1 and q2. Also, the sigmoidal input-output function was replaced with a non-negative sigmoid (with outputs ranging between 0 and 1), and the elements of x were initialized to pseudo uniform [0 1] random numbers. The backprop-based update rules for A and B in VRNNbp were replaced with the corresponding gradient-based rules, in which the derivative of the new sigmoid took the place of the factor (0.5 + xi(t))(0.5 − xi(t)). We referred to the model with these modifications to VRNNbp as revVRNNbp.

As a revised value-RNN with fixed random feedback (VRNNrf), in addition to the abovementioned modifications of the update of w, the sigmoidal input-output function, and the initialization of x, the fixed random feedback c was assumed to be non-negative. Specifically, the elements of c were set to pseudo uniform [0 1] random numbers. Moreover, the update rules for A and B were replaced with modified versions, with ci in place of wi and with an altered dependence on post-synaptic activity, so that the originally non-monotonic dependence on xi(t) (post-synaptic activity) became monotonic with saturation (Fig. 5B). These update rules with non-negative ci could be said to be Hebbian with additional modulation by TD-RPE (Hebbian under positive TD-RPE). We referred to the model with these modifications to VRNNrf as bioVRNNrf.
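The following sketch illustrates the constrained updates in revVRNNbp and bioVRNNrf (our own code and our own assumptions: the non-negative sigmoid is taken to be the standard logistic, and the monotonic-saturating dependence on post-synaptic activity is a hypothetical placeholder, since the exact functions used in the model are not reproduced here):

```python
import numpy as np

def f_nonneg(u):
    """Assumed non-negative sigmoid with outputs between 0 and 1 (standard logistic)."""
    return 1.0 / (1.0 + np.exp(-u))

def revvrnnbp_updates(A, B, w, x_prev, o_prev, x, delta, a=0.1):
    """revVRNNbp sketch: backprop feedback (w), non-negative x via the sigmoid above,
    non-negative w via rectification; the post-synaptic factor x*(1-x) stays non-monotonic."""
    g = x * (1.0 - x)                              # derivative of the assumed logistic
    A = A + a * delta * np.outer(w * g, x_prev)
    B = B + a * delta * np.outer(w * g, o_prev)
    w = np.maximum(w + a * delta * x, 0.0)         # w_j <- max(w_j + a*delta*x_j, 0)
    return A, B, w

def biovrnnrf_updates(A, B, w, c, x_prev, o_prev, x, delta, a=0.1):
    """bioVRNNrf sketch: fixed non-negative random feedback c, and a monotonic-saturating
    dependence on post-synaptic activity (here a placeholder, min(x, 0.5))."""
    sat = np.minimum(x, 0.5)                       # hypothetical monotonic + saturating factor
    A = A + a * delta * np.outer(c * sat, x_prev)  # c >= 0: Hebbian under positive TD-RPE
    B = B + a * delta * np.outer(c * sat, o_prev)
    w = np.maximum(w + a * delta * x, 0.0)
    return A, B, w
```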

Simulation of the tasks

In the Pavlovian cue-reward association task, at time 1 of each trial, the cue observation was received by the RNN, and at time 4, the reward observation was received. Each trial ended pseudo-randomly at time 7, 8, 9, or 10, and the next trial started from the next time-step. Reward size was r = 1. The tasks with probabilistic structures (task 1 and task 2) were implemented in the same way except that the reward timing was not time 4 but time 3 or 5, with probabilities of 50% and 50% in task 1 and 30% and 30% in task 2, and there was no reward in the remaining 40% of trials in task 2.

The cue or reward state/timing, mentioned in the text and marked in the figures, was defined to be the timing when the RNN received the cue or reward observation, respectively. Specifically, if o(t) = (1 0)T or o(t) = (0 1)T at time t, t + 1 was defined to be a cue or reward timing, respectively. For the agents with punctate (CSC) representation, each timing in the tasks was represented by a 10-dimensional one-hot vector, starting from (1 0 0 … 0)T for the cue state, with the next state (0 1 0 … 0)T and so on.

Unless otherwise mentioned, parameters were set to the following values. Learning rate (a): 0.1 (normalization by the squared norm of feature vector was not implemented). Time discount factor (γ): 0.8.

Estimation of true state values

As for the Pavlovian cue-reward association task, we defined states by relative timings from the cue, and estimated their (true) state values by simulations according to the definition of state value. Specifically, we generated a sequence of cues and rewards corresponding to 1000 trials, and calculated cumulative discounted future rewards within the sequence:

Σt_rew γ^(t_rew − 1),

where t_rew denotes the time-step of each reward counted from the starting state, which was taken to be -2, -1, …, or +6 time steps from a cue. We repeated this 1000 times, generating 1000 sequences (i.e., 1000 simulations of 1000 trials), with different sets of pseudo-random numbers, and calculated an average over these 1000 sequences so as to estimate the expected cumulative discounted future rewards, i.e., the state value (by definition) for each state (−2, -1, …, and +6 time steps from cue). Using these estimated true state values, we calculated TD-RPE at each state (−2, -1, …, and +5 time steps from cue).
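A sketch of this Monte Carlo estimation (our own code; the discounting exponent, the exclusion of a reward delivered exactly at the starting state, and the choice of which trial's cue serves as the reference are conventions we assume here):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n_seq, n_trials = 0.8, 1000, 1000
offsets = range(-2, 7)                        # states: -2 .. +6 time steps relative to a cue

def cue_reward_times(n_trials, rng):
    """Cue and reward times for one sequence: cue-to-reward = 3 steps, reward-to-next-cue = 4-7 steps."""
    t, cues, rewards = 0, [], []
    for _ in range(n_trials):
        cues.append(t)
        rewards.append(t + 3)
        t += 3 + rng.integers(4, 8)           # start of the next trial's cue
    return np.array(cues), np.array(rewards)

values = np.zeros(len(offsets))
for _ in range(n_seq):
    cues, rewards = cue_reward_times(n_trials, rng)
    for i, k in enumerate(offsets):
        t0 = cues[10] + k                     # a state in an early (but not the first) trial
        future = rewards[rewards > t0]        # rewards strictly after the starting state
        values[i] += np.sum(gamma ** (future - t0 - 1))
values /= n_seq
print(dict(zip(offsets, np.round(values, 3))))   # estimated true values of the -2..+6 states
```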

In a similar manner, we defined states and estimated true state values, and also calculated TD-RPE, for tasks 1 and 2, which had probabilistic structures. As for task 1, we defined the following states: -2, -1, …, and +2 time steps from cue (i.e., states visited (entered) before knowing whether reward was given at the early timing (= +2 time step from cue)), +3, 4, 5, and 6 time steps from cue after reception of reward at the early timing, and +3, 4, 5, and 6 time steps from cue after no reception of reward at the early timing (in total 5 + 4 + 4 = 13 states) (Fig. 4Bc, left-top). We generated 10000 sequences of cues and rewards corresponding to 1000 trials (i.e., 10000 simulations of 1000 trials), and for each state, calculated cumulative discounted future rewards within the sequence for each of the 10000 simulations and took an average to obtain the expected cumulative discounted future rewards (i.e., estimation of state value) (Fig. 4Bc, left-bottom). Using the estimated state values, we calculated TD-RPE (Fig. 4C, left).

As for task 2, we defined the following states: -2, -1, …, and +2 time steps from cue (i.e., states visited (entered) before knowing whether reward was given at the early timing), +3, 4, 5, and 6 time steps from cue after reception of reward at the early timing, +3 and 4 time steps from cue after no reception of reward at the early timing (states visited (entered) before knowing whether reward was given at the late timing (= +4 time step from cue)), +5 and 6 time steps from cue after reception of reward at the late timing, and +5 and 6 time steps from cue after no reception of reward at both early and late timings (in total 5 + 4 + 2 + 2 + 2 = 15 states) (Fig. 4Bc, right-top). We estimated the state values of these states (Fig. 4Bc, right-bottom), and also calculated TD-RPE (Fig. 4C, right), in a similar manner to the above.

Analyses, software, and code availability

SEM (standard error of the mean) was approximated by SD (standard deviation)/√N (N: number of samples). Linear regression and principal component analysis (PCA) were conducted by using R (functions lm and prcomp). Simulations were conducted by using MATLAB, and pseudo-random numbers were generated by using the rand, randn, and randperm functions. All the codes will be made available at GitHub upon publication of this work in a journal.

Author contributions

Conceptualization: KM; Formal analysis: KM, TT; Investigation: KM, TT, AyK; Writing – original draft: KM; Writing – review & editing: KM, TT, AyK, ArK

Acknowledgements

The authors thank Dr. Kenji Doya for valuable suggestions. KM was supported by Grants-in-Aid for Scientific Research 23H03295 and 23K27985 from Japan Society for the Promotion of Science (JSPS) and the Naito Foundation. AyK was supported by JSPS Overseas Research Fellowships. ArK was partially funded by Digital Futures (KTH) grant and StratNeuro SRA.